Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize symbolic regex Antimirov mode to avoid so much allocation #60918

Closed
stephentoub opened this issue Oct 4, 2021 · 4 comments · Fixed by #65637
Closed

Optimize symbolic regex Antimirov mode to avoid so much allocation #60918

stephentoub opened this issue Oct 4, 2021 · 4 comments · Fixed by #65637
Assignees
Milestone

Comments

@stephentoub
Copy link
Member

When in Antimirov mode, we logically maintain a list of states we're currently in and, for each step, generate a new list of states based on looking at each current state and determining all the states it could lead us to. However, rather than something like:

HashSet<Node> nextStates = _nextStates;
HashSet<Node> currentStates = _currentStates;
foreach (Node current in currentStates)
{
    foreach (Node next in Transition(current))
    {
        nextStates.Add(next);
    }
}
currentStates.Clear();
_nextStates = currentStates;
_currentStates = nextStates;

it's implemented using an immutable SymbolicRegexNode, where every additional state found involves unioning in the new state into the next states node, which means generating one or more new nodes for each additional state we union in.

@stephentoub stephentoub transferred this issue from dotnet/runtimelab Oct 27, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner labels Oct 27, 2021
@ghost
Copy link

ghost commented Oct 27, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

When in Antimirov mode, we logically maintain a list of states we're currently in and, for each step, generate a new list of states based on looking at each current state and determining all the states it could lead us to. However, rather than something like:

HashSet<Node> nextStates = _nextStates;
HashSet<Node> currentStates = _currentStates;
foreach (Node current in currentStates)
{
    foreach (Node next in Transition(current))
    {
        nextStates.Add(next);
    }
}
currentStates.Clear();
_nextStates = currentStates;
_currentStates = nextStates;

it's implemented using an immutable SymbolicRegexNode, where every additional state found involves unioning in the new state into the next states node, which means generating one or more new nodes for each additional state we union in.

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, untriaged

Milestone: -

@stephentoub stephentoub added this to the 7.0.0 milestone Oct 27, 2021
@stephentoub stephentoub added tenet-performance Performance related issue and removed untriaged New issue has not been triaged by the area owner labels Oct 27, 2021
@stephentoub
Copy link
Member Author

@veanes, I believe you're working on this or a very related optimization, right?

@veanes
Copy link
Contributor

veanes commented Dec 10, 2021

Yes, started but got sidetracked a bit, will try to get this done soon (within a few days), essentially improving the overhead in the Antimirov mode, regarding state-set representation (using a more lightweight def, initially just HashSet<int>, Olli has then in mind a specialized very efficient "small integer set" data structure (similar to what is in use for NFA simulation ie Antimirov-like mode, in other tools like re2) to optimize further and can swap that out then.)

@veanes
Copy link
Contributor

veanes commented Feb 21, 2022

The PR #65637 should fix this issue, once this has been validated.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 25, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 26, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants