`<regex>`: Process non-greedy and longest-mode simple loops non-recursively #5774
+81 −36
Towards #997 and #1528. This PR implements non-recursive non-greedy matching for simple loops. Since non-greedy matching minimizes the amount of storage we will need in `_Frames` (at least for now) and leftmost-longest matching basically means exhaustively trying all possible trajectories, we also do non-greedy matching of simple loops in leftmost-longest mode. This also keeps memory usage before and after this PR roughly the same in leftmost-longest mode.

Non-greedy matching means that after each repetition at or beyond the minimum number of reps (or before the first repetition if the minimum number of reps is zero), we first try to match the remainder of the pattern and if it doesn't match (or matching is in leftmost-longest mode), we try another rep if permitted.
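To illustrate the observable behavior described above, here is a small sketch using the public `std::regex` interface with the default ECMAScript grammar. The helper `first_match` is a hypothetical name introduced only for this example and is not part of the STL or of this PR.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for illustration: returns the text of the first match
// of `pattern` in `input` (default ECMAScript grammar), or an empty string
// if there is no match.
std::string first_match(const std::string& input, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(input, m, re) ? m.str() : std::string();
}
```

With this helper, `first_match("aaab", "a{1,3}")` yields `"aaa"` (greedy), while `first_match("aaab", "a{1,3}?")` yields `"a"` because the empty remainder of the pattern already matches after the minimum number of reps. When the remainder does not match right away, more reps are tried: `first_match("aaab", "a{1,3}?b")` yields `"aaab"`.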
This means: After reaching the minimum number of repetitions, we first set up tail-matching (by assigning the `_Next` member of the `_N_end_rep` node to the `_Next` variable) after each repetition or when encountering the `_N_rep` node, and push a frame to potentially match one more repetition if the maximum number of repetitions hasn't been reached yet.

Because of the properties of simple loops (see #5762), we don't have to worry about resetting `_Sav._Loop_index` during stack unwinding, but have to increment it. (We also avoid potential UB due to signed integer overflow if `_REGEX_MAX_COMPLEXITY_COUNT` is set to `0`.)

During stack unwinding, we also have to reset the next position to match in the input and the set of valid capture groups (but not the limits of capture groups because the loop is not contained in any loops that are repeated more than once).
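The control flow described above can be sketched in a heavily simplified, standalone model. This is not the STL's matcher: `RepFrame` and `match_lazy_loop` are hypothetical names, the loop body is assumed to be a single character, the remainder of the pattern is assumed to be a literal string, matching is anchored at the start of the input, and capture groups and leftmost-longest mode are not modeled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical frame: what has to be restored to try one more repetition.
struct RepFrame {
    std::size_t pos; // input position at which the next repetition would start
};

// Matches `rep_char{min_reps,max_reps}?` followed by the literal `tail`,
// anchored at input[0]. Returns the match length or std::string::npos.
std::size_t match_lazy_loop(const std::string& input, char rep_char,
                            std::size_t min_reps, std::size_t max_reps,
                            const std::string& tail) {
    std::size_t pos = 0;
    std::size_t loop_index = 0;

    // Mandatory repetitions: no frames are needed for these.
    while (loop_index < min_reps) {
        if (pos >= input.size() || input[pos] != rep_char) {
            return std::string::npos;
        }
        ++pos;
        ++loop_index;
    }

    std::vector<RepFrame> frames; // explicit stack instead of recursion
    for (;;) {
        // Remember that one more repetition may be tried later, if permitted.
        if (loop_index < max_reps) {
            frames.push_back(RepFrame{pos});
        }

        // Non-greedy: try the remainder of the pattern first.
        if (input.compare(pos, tail.size(), tail) == 0) {
            return pos + tail.size();
        }

        // Tail failed: unwind. At most one frame is ever pending in this model.
        if (frames.empty()) {
            return std::string::npos;
        }
        pos = frames.back().pos; // reset the next position to match
        frames.pop_back();

        // Match one more repetition; the loop index is incremented, not reset.
        if (pos >= input.size() || input[pos] != rep_char) {
            return std::string::npos;
        }
        ++pos;
        ++loop_index;
    }
}
```

For example, `match_lazy_loop("aaab", 'a', 1, 3, "b")` returns `4`, and `match_lazy_loop("aaaab", 'a', 1, 3, "b")` returns `std::string::npos` because the upper bound is exhausted before the tail can match. Note that the frame stack in this sketch never holds more than one entry for the loop, which is the storage-minimizing property of non-greedy matching mentioned earlier.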
Again, I do not try to imitate `_Do_rep0()`'s increment of the stack usage count by one while evaluating the simple loop because it adds a lot of complexity for basically no gain.

I added a few more simple tests because there was basically no test coverage for non-greedy matching of simple loops that have an upper bound on the number of repetitions.