-
Notifications
You must be signed in to change notification settings - Fork 688
Fix #986: Preserve Wikipedia <h2> headings #987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix #986: Preserve Wikipedia <h2> headings #987
Conversation
- Ensure <h2> headings with meaningful content are not removed even if parent div contains edit/talk links - Updated _cleanHeaders, _cleanConditionally, and _prepArticle to handle this scenario - Added test case in test/wikipedia-h2.js to verify fix - Existing functionality and other tests remain unaffected Fixes mozilla#986
|
Hi! This PR fixes issue #986 by preserving Wikipedia h2 headings even if the parent div contains edit/talk links.
Note: The CI failure is in an unrelated test ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
Hi! This PR fixes issue #986 by preserving Wikipedia h2 headings even if the parent div contains edit/talk links.
* Updated _cleanHeaders, _cleanConditionally, and _prepArticle
Can you explain why/if we really need to update all 3 of these, or is updating just 1 or 2 of them sufficient? If it really needs to be all 3, can we share more code between these, instead of copy-pasting the exception for these h2s?
Or, alternatively, can we come up with a different way to avoid the wikipedia headings being affected here? Perhaps like the "convert divs to Ps" logic, we could use similar logic to replace the parent div with just the inner h2?
* Added test wikipedia-h2.js, which passes locally
Please can you instead use an actual wikipedia page's markup using generate-testcase to generate a similar testcase to all the other testcases we already have?
Note: The CI failure is in an unrelated test (
nytimes-2). All tests for this change pass locally and linting has been applied.
Well, you're changing readability and this test shows that as a result the output for other pages is also changing, which is causing the test failures. In that sense they are not really "unrelated" - the failures are a consequence of your changes.
If you run the generate-testcase script after your changes for the 2 nytimes testcases, it will change the relevant expected.html files to reflect this differing output. Can you update the PR with this so we can see how the output changes for those inputs?
Readability.js
Outdated
| var h2Text = this._getInnerText(h2Elements[i], false).trim(); | ||
| if (h2Text.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to check if the h2 is non-empty?
…editsection spans Remove Wikipedia edit section spans (mw-editsection class) early in _prepDocument() to prevent them from adding negative class weight that would cause h2 headings to be incorrectly removed. This is a minimal, focused fix that only modifies _prepDocument() rather than adding exceptions in multiple cleaning functions. - Updated test cases: wikipedia, wikipedia-3, wikipedia-4, nytimes-1, nytimes-2 - All 1984 tests passing Fixes mozilla#986
|
Hi @gijsk 👋 Thanks a lot for your detailed feedback! I’ve gone through your review carefully and refactored the implementation to fully address all your points: ✅ Simplified the fix: Only _prepDocument() is modified now (no longer touching _cleanHeaders, _cleanConditionally, or _prepArticle). This approach fixes the root cause rather than working around it, keeping the codebase consistent and easier to maintain. Ready for re-review whenever you get a chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the slow response. The test changes look good to me. I do still have a question about the implementation.
| ]); | ||
|
|
||
| for (var i = 0; i < embeds.length; i++) { | ||
| for (var k = 0; k < embeds.length; k++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems unrelated, can you revert it?
| // Fix for issue #986: Remove Wikipedia edit section links that cause headings | ||
| // to be improperly removed. These spans contain "edit" links and add negative | ||
| // weight to otherwise good headings. | ||
| this._forEachNode(this._getAllNodesWithTag(doc, ["span"]), function (span) { | ||
| if (span.className && span.className.includes("mw-editsection")) { | ||
| span.remove(); | ||
| } | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, at this point I have 2 concerns:
- this is going to be really slow and accomplish almost nothing for most documents, as they will have lots of spans that don't fit this criterium anyway. So from a performance PoV it'd be better to filter for header elements first and then check if they include these spans. For that, it seems like it'd be cleaner to just update
_cleanHeadersto account for this? That already looks at h1/h2 headers, and is presumably where this is being removed - is only updating that method insufficient? - This seems hyper-focused on Wikipedia (or mediawiki) itself. Is there any way to make this more generic? Perhaps the technique in WIP permalink anchor stripping #982 could be generalized?
Fixes #986