Skip to content

Conversation

luciansmith
Copy link
Contributor

@luciansmith luciansmith commented Apr 30, 2025

Description of the change

Don't replace internal underscores with emphasis tags. This was particularly important for URLs, which are often littered with them, and none of them should instead have <em> in them; that's invalid HTML.

The same change should be made for strong tags (like this) but the same fix there broke overall parsing. Stubs provided; maybe someone can figure out what went wrong.

Checklist for contributor

  • if you want to be mentioned in the AUTHORS file, you added yourself
  • added an entry to CHANGELOG.md at "Upcoming"
  • if any Markdown definition changed, you updated the definition docs
  • your C++ code change is accommodated with a unit/integration test where it makes sense
  • your code meets the code format style (clang-format) of the project (tools/format.py)

I was getting failures where URLs with underscores were getting <em> blocks thrown in them, creating invalid HTML.
Most don't pass (and are disabled), but should.  Unfortunately, adding a word boundary (\b) to the strong regex works for these tests, but somehow the full parser then breaks.  The regex that I believed should work is added as a comment, for anyone wishing to make things work going forward.
// match.

std::string text = "some ___text testing it__ out";
std::string expected = "some <strong>_text testing it</strong> out";
Copy link
Owner

@progsource progsource Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the CommonMark Spec, there is this example:

markdown	"___foo__\n"
html	"<p>_<strong>foo</strong></p>\n"
example	456

Also GitHub spec: https://github.github.com/gfm/#example-465

So I guess, the _ in the expected string has to come before the strong tag.

void Parse(std::string& line) override
{
// Modifed from previous version, with help from
// https://stackoverflow.com/questions/61346949/regex-for-markdown-emphasis
Copy link
Owner

@progsource progsource Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to put this comment into your commit message instead and remove it from here, because as a comment it could become outdated at some point.

// it then passes all the 'disabled' tests in the 'strong parser'
// test, but then it fails general parsing. For some reason,
// "__text__" translates "<i></i>text<i></i>" even though there
// are no word boundaries at the correct places. It's weird!
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strong parser is usually handled before the emphasized one: https://github.com/progsource/maddy/blob/master/include/maddy/parser.h#L195
This is so, that double _ or * can be easier determined as strong. Therefor I guess, that is why you only get italic tags, if you do not first run it through the strong parser.

@progsource
Copy link
Owner

Thank you for your contribution and sorry for the delay in me checking the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants