tuyafeng

Some pages have long quoted sections but only a few lines of commentary or introduction. Right now, isProbablyReaderable ignores blockquote elements, so with the default options the example below scores 0, whereas including blockquote would give it a score of 9.

Example from W3Schools:

<!DOCTYPE html>
<html>
<body>

<h1>The blockquote element</h1>

<p>Here is a quote from WWF's website:</p>

<blockquote cite="http://www.worldwildlife.org/who/index.html">
For 50 years, WWF has been protecting the future of nature. The world's leading conservation organization, WWF works in 100 countries and is supported by 1.2 million members in the United States and close to 5 million globally.
</blockquote>

</body>
</html>

I think it might make sense to factor blockquote into the scoring. If not, I'd love to hear the reasoning. Thanks!
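
To make the two scores above concrete, here is a minimal sketch of the kind of scoring involved. It is a simplified model, not the library's actual code: it assumes the defaults are `minContentLength = 140` and `minScore = 20`, that the default tag list is `p`, `pre`, `article`, and that each qualifying node contributes the square root of its text length beyond the minimum. The `roughScore` helper and the flat `{ tag, text }` records are hypothetical stand-ins for DOM nodes. Visibility checks, class/id filtering, and the early exit once the threshold is crossed are all left out.

```javascript
// Simplified model of a "readerable" scoring pass; NOT the library's implementation.
// Assumed defaults (verify against the library source): minContentLength = 140, minScore = 20.
const DEFAULTS = { minContentLength: 140, minScore: 20 };

// `nodes` is a hypothetical flat list of { tag, text } records standing in for DOM elements.
// `tags` is the list of tag names that are allowed to contribute to the score.
function roughScore(nodes, tags, options = DEFAULTS) {
  const wanted = new Set(tags);
  let score = 0;
  for (const { tag, text } of nodes) {
    if (!wanted.has(tag)) continue; // tag is not scored at all
    const length = text.trim().length;
    if (length < options.minContentLength) continue; // too short to count
    score += Math.sqrt(length - options.minContentLength);
  }
  return score;
}

// Text content of the W3Schools example, one record per element.
const page = [
  { tag: "p", text: "Here is a quote from WWF's website:" },
  {
    tag: "blockquote",
    text:
      "For 50 years, WWF has been protecting the future of nature. " +
      "The world's leading conservation organization, WWF works in 100 countries " +
      "and is supported by 1.2 million members in the United States and close to " +
      "5 million globally.",
  },
];

const currentTags = ["p", "pre", "article"]; // assumed current default selector
const proposedTags = [...currentTags, "blockquote"]; // the change discussed here

const scoreNow = roughScore(page, currentTags);
const scoreProposed = roughScore(page, proposedTags);
console.log(scoreNow, scoreProposed);
```

Under these assumptions the first value is 0, because the only `p` is shorter than the minimum length, and the second is a little above 9, which lines up with the figures quoted above.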

gijsk (Contributor) commented Sep 24, 2025

Apologies for the slow response - I was out on PTO and then heads down on some other work for a while.

@tuyafeng Can you provide a "real life" example (rather than w3schools) of pages that currently don't get detected as readerable as a result of this omission?

For context, the only real tradeoff here is performance: the more tags included, the longer the "readerable" determination takes, and it's designed to be faster (and more of a "rough guess") than running a full readability pass, if that makes sense.
