Skip to content

Conversation

rabumaabraham
Copy link

@rabumaabraham rabumaabraham commented Oct 5, 2025

  • Ensure h2 headings with meaningful content are not removed even if parent div contains edit/talk links
  • Updated _cleanHeaders, _cleanConditionally, and _prepArticle to handle this scenario
  • Added test case in test/wikipedia-h2.js to verify fix
  • Existing functionality and other tests remain unaffected

Fixes #986

- Ensure <h2> headings with meaningful content are not removed even if parent div contains edit/talk links
- Updated _cleanHeaders, _cleanConditionally, and _prepArticle to handle this scenario
- Added test case in test/wikipedia-h2.js to verify fix
- Existing functionality and other tests remain unaffected

Fixes mozilla#986
@rabumaabraham
Copy link
Author

Hi! This PR fixes issue #986 by preserving Wikipedia h2 headings even if the parent div contains edit/talk links.

  • Updated _cleanHeaders, _cleanConditionally, and _prepArticle
  • Added test wikipedia-h2.js, which passes locally

Note: The CI failure is in an unrelated test (nytimes-2). All tests for this change pass locally and linting has been applied.

Copy link
Contributor

@gijsk gijsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Hi! This PR fixes issue #986 by preserving Wikipedia h2 headings even if the parent div contains edit/talk links.

* Updated _cleanHeaders, _cleanConditionally, and _prepArticle

Can you explain why/if we really need to update all 3 of these, or is updating just 1 or 2 of them sufficient? If it really needs to be all 3, can we share more code between these, instead of copy-pasting the exception for these h2s?

Or, alternatively, can we come up with a different way to avoid the wikipedia headings being affected here? Perhaps like the "convert divs to Ps" logic, we could use similar logic to replace the parent div with just the inner h2?

* Added test wikipedia-h2.js, which passes locally

Please can you instead use an actual wikipedia page's markup using generate-testcase to generate a similar testcase to all the other testcases we already have?

Note: The CI failure is in an unrelated test (nytimes-2). All tests for this change pass locally and linting has been applied.

Well, you're changing readability and this test shows that as a result the output for other pages is also changing, which is causing the test failures. In that sense they are not really "unrelated" - the failures are a consequence of your changes.

If you run the generate-testcase script after your changes for the 2 nytimes testcases, it will change the relevant expected.html files to reflect this differing output. Can you update the PR with this so we can see how the output changes for those inputs?

Readability.js Outdated
Comment on lines 828 to 829
var h2Text = this._getInnerText(h2Elements[i], false).trim();
if (h2Text.length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to check if the h2 is non-empty?

…editsection spans

Remove Wikipedia edit section spans (mw-editsection class) early in _prepDocument()
to prevent them from adding negative class weight that would cause h2 headings to
be incorrectly removed.

This is a minimal, focused fix that only modifies _prepDocument() rather than
adding exceptions in multiple cleaning functions.

- Updated test cases: wikipedia, wikipedia-3, wikipedia-4, nytimes-1, nytimes-2
- All 1984 tests passing

Fixes mozilla#986
@rabumaabraham
Copy link
Author

rabumaabraham commented Oct 11, 2025

Hi @gijsk 👋

Thanks a lot for your detailed feedback!

I’ve gone through your review carefully and refactored the implementation to fully address all your points:

✅ Simplified the fix: Only _prepDocument() is modified now (no longer touching _cleanHeaders, _cleanConditionally, or _prepArticle).
✅ Root cause fix: The problematic Wikipedia edit section spans (mw-editsection) are now removed early in the pipeline, similar to how tags are handled, preventing them from affecting h2 headings downstream.
✅ Cleaner code: The solution is just an 8-line addition, with no duplicate logic.
✅ Proper tests: Removed the custom test file and regenerated wikipedia, wikipedia-3, wikipedia-4, nytimes-1, and nytimes-2 test cases using the official generate-testcase script.
✅ Simplified logic: Removed the unnecessary if (h2Text.length) check, since empty headings are already covered by _isElementWithoutContent().
✅ Verified everything: All 1984 tests pass locally (npm test), and both ESLint and Prettier checks are clean.

This approach fixes the root cause rather than working around it, keeping the codebase consistent and easier to maintain.

Ready for re-review whenever you get a chance.
Thanks again for your guidance!

Copy link
Contributor

@gijsk gijsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the slow response. The test changes look good to me. I do still have a question about the implementation.

]);

for (var i = 0; i < embeds.length; i++) {
for (var k = 0; k < embeds.length; k++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems unrelated, can you revert it?

Comment on lines +681 to +688
// Fix for issue #986: Remove Wikipedia edit section links that cause headings
// to be improperly removed. These spans contain "edit" links and add negative
// weight to otherwise good headings.
this._forEachNode(this._getAllNodesWithTag(doc, ["span"]), function (span) {
if (span.className && span.className.includes("mw-editsection")) {
span.remove();
}
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, at this point I have 2 concerns:

  1. this is going to be really slow and accomplish almost nothing for most documents, as they will have lots of spans that don't fit this criterium anyway. So from a performance PoV it'd be better to filter for header elements first and then check if they include these spans. For that, it seems like it'd be cleaner to just update _cleanHeaders to account for this? That already looks at h1/h2 headers, and is presumably where this is being removed - is only updating that method insufficient?
  2. This seems hyper-focused on Wikipedia (or mediawiki) itself. Is there any way to make this more generic? Perhaps the technique in WIP permalink anchor stripping #982 could be generalized?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wikipedia's h2s aren't included.

2 participants