muellerj2
Contributor

@muellerj2 muellerj2 commented Oct 10, 2025

Towards #997 and #1528. This PR implements non-recursive non-greedy matching for simple loops. We also match simple loops non-greedily in leftmost-longest mode, for two reasons: non-greedy matching minimizes the amount of storage we need in _Frames (at least for now), and leftmost-longest matching basically means exhaustively trying all possible trajectories anyway. This also keeps memory usage in leftmost-longest mode roughly the same before and after this PR.

Non-greedy matching means that after each repetition at or beyond the minimum number of reps (or before the first repetition if the minimum number of reps is zero), we first try to match the remainder of the pattern and if it doesn't match (or matching is in leftmost-longest mode), we try another rep if permitted.
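To make the observable behavior concrete, here is a small sketch that uses only the public std::regex interface with the default ECMAScript grammar. The helper name first_match is made up for illustration and is not part of this PR.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text of the first match of `pattern`
// in `text`, or an empty string if there is no match.
std::string first_match(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern); // ECMAScript grammar by default
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string{};
}
```

With this helper, first_match("a{1,3}", "aaa") yields "aaa" (greedy), first_match("a{1,3}?", "aaa") yields "a" (the tail is tried right after the minimum number of reps), and first_match("a{1,3}?b", "aaab") yields "aaab" (further reps are only taken because the tail fails to match).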

Concretely: once the minimum number of repetitions has been reached, after each repetition (or when encountering the _N_rep node) we first set up tail matching by assigning the _Next member of the _N_end_rep node to the _Next variable. If the maximum number of repetitions hasn't been reached yet, we also push a frame so that one more repetition can potentially be matched later.

Because of the properties of simple loops (see #5762), we don't have to reset _Sav._Loop_index during stack unwinding; we only have to increment it. (We also avoid potential UB due to signed integer overflow if _REGEX_MAX_COMPLEXITY_COUNT is set to 0.)
During stack unwinding, we also have to reset the next position to match in the input and the set of valid capture groups (but not the limits of capture groups, because the loop is not contained in any loop that is repeated more than once).
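The control flow described above can be sketched with a toy matcher. This is not the code in this PR: the Frame type, the function name, and the restriction to a single repeated character followed by a literal tail are all assumptions made for illustration, and capture groups are left out entirely.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical frame: only the input position is saved. As in the PR
// description, the loop index is not restored on unwinding, only incremented.
struct Frame {
    std::size_t pos;
};

// Toy model: matches  ch{min_reps,max_reps}?  followed by the literal `tail`
// against the whole of `input`, without recursion. Returns the number of
// repetitions used by the first (non-greedy) successful match.
std::optional<int> match_lazy_loop(std::string_view input, char ch,
                                   int min_reps, int max_reps,
                                   std::string_view tail) {
    std::size_t pos = 0;
    int reps = 0;

    // Mandatory repetitions: no frames are needed for these.
    while (reps < min_reps) {
        if (pos >= input.size() || input[pos] != ch) {
            return std::nullopt;
        }
        ++pos;
        ++reps;
    }

    std::vector<Frame> frames; // explicit stack instead of recursion
    for (;;) {
        // Push a frame for one more repetition if the maximum permits it.
        if (reps < max_reps) {
            frames.push_back(Frame{pos});
        }
        // Non-greedy: try the remainder of the pattern first.
        if (input.substr(pos) == tail) {
            return reps;
        }
        // Tail failed: unwind. Restore the input position, then attempt
        // one more repetition and increment the loop index.
        if (frames.empty()) {
            return std::nullopt;
        }
        pos = frames.back().pos;
        frames.pop_back();
        if (pos >= input.size() || input[pos] != ch) {
            return std::nullopt;
        }
        ++pos;
        ++reps;
    }
}
```

In this toy the stack never holds more than one frame, because nothing else in the pattern pushes frames. In a real matcher the same stack would be shared with the other nodes of the pattern, and a failed repetition would continue unwinding into those frames rather than returning immediately.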

Again, I do not try to imitate _Do_rep0()'s increment of the stack usage count by one while evaluating the simple loop because it adds a lot of complexity for basically no gain.

I added a few more simple tests because there was basically no test coverage for non-greedy matching of simple loops that have an upper bound on the number of repetitions.

@muellerj2 muellerj2 requested a review from a team as a code owner October 10, 2025 17:57
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Oct 10, 2025
@StephanTLavavej StephanTLavavej added the enhancement (Something can be improved) and regex (meow is a substring of homeowner) labels Oct 10, 2025
@StephanTLavavej StephanTLavavej self-assigned this Oct 10, 2025
@StephanTLavavej StephanTLavavej removed their assignment Oct 16, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Oct 16, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Oct 16, 2025
