-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Another poor performance about expression parsing #994
Comments
Your should use two-stage strategy as decribed in pointed issue: #192 (comment) |
Two stage parsing does resolve the issue when the input is correct, but error recovery results in a fallback to the second stage where it goes slow again. The solution to this is making every precedence rule transition in the prediction algorithm follow a precedence-dependent edge, as opposed to the current behavior where precedence is only considered for the DFA start state. It would be a very complicated change to the way the DFA cache works, but obviously be a big win in some real scenarios. |
Anyway, for this grammar performance is good with two-stage parsing even for input file with errors (Maybe EOF or some input market missing in this grammar). |
A JUnit test with a similar grammar shows the following behavior for adding "AND" clauses in a loop and parsing ever growing AND expressions: @Test
public void testNestedAndCode() throws Exception {
// RuleLanguageParser.ONE_STAGE=false;
String andClause="VALUE=0";
for (int i=1;i<=12;i++) {
long start = System.currentTimeMillis();
andClause+="\n AND VALUE"+i+"=0\n";
doTestRule("if\n" + andClause+
"\nendif");
long stop = System.currentTimeMillis();
System.out.println(String.format("%3d %4d msecs",i,stop - start));
}
} With One Stage Parsing: 1 418 msecs 2 33 msecs 3 39 msecs 4 75 msecs 5 189 msecs 6 424 msecs 7 943 msecs 8 869 msecs 9 2714 msecs 10 6346 msecs 11 14929 msecs 12 39246 msecs With Two Stage Parsing: 1 362 msecs 2 16 msecs 3 37 msecs 4 63 msecs 5 151 msecs 6 396 msecs 7 623 msecs 8 1039 msecs 9 3154 msecs 10 6924 msecs 11 18117 msecs 12 47501 msecs In both cases the parsing time roughly doubles with every additional "AND". Since this is obviously inacceptable it would be good to have a solution or workaround. I have now added a JUnit test to my environment that fails if the average performance ratio for adding an additional "AND" is higher than 1.2. I'd rather see a linear behavior, though. @Test
public void testNestedAndCode() throws Exception {
// RuleLanguageParser.ONE_STAGE=false;
String andClause="VALUE=0";
long prevduration=0;
double ratiosum=0;
int ratiocount=0;
for (int i=1;i<=6;i++) {
long start = System.currentTimeMillis();
andClause+="\n AND VALUE"+i+"=0\n";
doTestRule("if\n" + andClause+
"\nendif");
long stop = System.currentTimeMillis();
long duration=stop-start;
System.out.println(String.format("%3d %5d msecs",i,duration));
if (i>2) {
double ratio=duration/prevduration;
ratiosum+=ratio;
ratiocount++;
}
prevduration=duration;
}
double averageRatio=ratiosum/ratiocount;
System.out.println(String.format("ratio: %3.2f",averageRatio));
assertTrue(averageRatio<1.2);
} |
Here is a Junit Test for the grammar originally specified for this issue: import static org.junit.Assert.*;
import java.io.IOException;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.junit.Test;
/**
* Test the Issue 994 performance
* @author wf
*
*/
public class TestIssue994 {
boolean debug=false;
/**
* see https://github.com/antlr/antlr4/issues/994
* @throws Exception
*/
@Test
public void testIssue994() throws Exception {
String andClause="not X0";
long prevduration=0;
double ratiosum=0;
int ratiocount=0;
for (int i=1;i<=20;i++) {
long start = System.currentTimeMillis();
if (i%4==1) {
andClause+=" or X"+i;
} else {
andClause+=" and not X"+i;
}
doTestExpr(andClause);
long stop = System.currentTimeMillis();
long duration=stop-start;
if (i>=2) {
double ratio=duration*1.0/(prevduration*1.0);
System.out.println(String.format("%3d %5d msecs %5.1f",i,duration,ratio));
if (duration>25) {
ratiosum+=ratio;
ratiocount++;
}
}
prevduration=duration;
}
double averageRatio=ratiosum/ratiocount;
System.out.println(String.format("ratio: %3.2f",averageRatio));
assertTrue("Performance issue https://github.com/antlr/antlr4/issues/994",averageRatio<1.2);
}
/**
* do parse the given expression
* @param andClause
* @throws IOException
*/
private void doTestExpr(String andClause) throws IOException {
if (debug)
System.out.println(andClause);
ANTLRInputStream in = RuleLanguageParser.streamForText(andClause);
exprLexer lexer = new exprLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
exprParser parser = new exprParser(tokens);
parser.expr();
}
} it has the following time behaviour: len time ratio 2 12 msecs 0,1 3 4 msecs 0,3 4 8 msecs 2,0 5 8 msecs 1,0 6 9 msecs 1,1 7 11 msecs 1,2 8 20 msecs 1,8 9 31 msecs 1,6 10 47 msecs 1,5 11 73 msecs 1,6 12 135 msecs 1,8 13 230 msecs 1,7 14 189 msecs 0,8 15 203 msecs 1,1 16 246 msecs 1,2 17 452 msecs 1,8 18 934 msecs 2,1 19 1723 msecs 1,8 20 2702 msecs 1,6 ratio: 1,55 |
Performance tests that measure just a single iteration of anything are I don't dispute your thoughts but you may find that the performance Jim On Mon, Apr 4, 2016 at 5:03 PM, Wolfgang Fahl notifications@github.com
|
@jimidle - thank you for your comment. Please note that I stopped the loop at 20 on purpose. If you keep almost doubling the timing from there you will end up with 20 minutes parsing time after some 10 more loops and some 300 hours parsing time after additional 20 loops. Keep adding and you'll basically wait for ever. This is called exponential growth and should not happen. Running this a few thousand iterations is prohibitive since I do not need a more precise result I need a different time behaviour see https://en.wikipedia.org/wiki/Big_O_notation In the ANTLR techreport http://www.antlr.org/papers/allstar-techreport.pdf ALL() is O(n_*4) in theory but consistently per- forms linearly on grammars used in practice, outperforming general strategies such as GLL and GLR by orders of magnitude._ The basis of this statement seems to be that the the two stage parsing approach as outlined in paragraph 3.2 Two-stage ALL(*) parsing is used. This means the SLL and LL strategies have to be used one after another. It also means that you might have to adapt your grammar. The grammar of this Issue e.g. is fit for the two stage strategy. Unfortunately a bigger grammar I am currently analyzing is not and I still have to find out how to make it SLL - this is the reason why I posted my results here in the first place. One way to get my grammar more SLL like was to change the order of grammar rules putting possibly "longer" matches before shorter matches. E.g.
is better than
|
Youre rule can be just: |
Fixes antlr#1398 Fixes antlr#994
Fixes antlr#1398 Fixes antlr#994
Fixes antlr#1398 Fixes antlr#994
…ery large expression input phrases; builds off of @sharwell solution that explicitly checks for key return states in expr rules
Fixes #994; poor left-recursive rule performance
Ok, @sharwell came up with an optimization that gives like 10,000x performance on big expression input such as this. Just merged a slight variation on Sam's implementation. @sufail @danielsun1106 @WolfgangFahl @KvanTTT can you try out the latest antlr/antlr4 master? |
@danielsun1106 I've merged this into my optimized branch (which you are using now) as well in sharwell/antlr4@f5776dd4, so if you have the ability to build from source you could try that already. Otherwise I'll be pushing a patch release soon (next couple of days) and you can try that. |
@sharwell I'd like to have a try in a day or two via building from source. As you know, groovy-parser project contains about 13000 test cases. We should see massive performance improvement after using the recently optimized version. That would be really great :) |
Thanx - this looks good. My checking will have to wait until end of december. I am definitely curious to see the effects and see that the other testcases in my project are not effected. |
@sufail |
i’ll push a snapshot today.
|
@danielsun1106 @sufail Can you try mvn for 4.6-SNAPSHOT? I did a deploy which looks like it put stuff here: https://oss.sonatype.org/content/repositories/snapshots/org/antlr/antlr4/4.6-SNAPSHOT/ etc... |
@parrt @danielsun1106 @sharwell Write now I am generating code for the grammars using antlr-maven-4.5 plugin for a runtime of antlr-runtime-4.6-SNAPSHOT version. Hope this okay , I am getting a warning message for this though which is coming from the RuntimeMetaData class. The performance is really good with the new jar in place :) . So no more need to worry about the left recursive rules span over all expression rules? Still I need to break the left recursive rules to more primitive rules for files with insane String manipulations? |
@parrt @danielsun1106 @sharwell The new builds are only targetted for java >=7 ?? Cause it's not getting compiled in java 6. The tool and antlr-4 -runtime sources have something like this It's saying Diamond operator not supported in java 6. I had to fill in the appropriate types there to get this compiled. |
@sufail I have moved the project to java 7 now. |
@sufail The reference release of ANTLR now targets Java 7+, since it makes things easier for everyone trying to work on this repository. My fork has more enterprise users that care more about compatibility, so it will continue to support Java 6 throughout the life of ANTLR 4. |
## What changes were proposed in this pull request? This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump. The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse: ```sql SELECT * FROM RANGE(1000) WHERE TRUE AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' ``` This is caused by a know bug in ANTLR (antlr/antlr4#994), which was fixed in version 4.6. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes apache#19042 from hvanhovell/SPARK-21830.
There is now an open source project with some helper function that also includes the Unit Test for this issue see |
## What changes were proposed in this pull request? This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump. The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse: ```sql SELECT * FROM RANGE(1000) WHERE TRUE AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' ``` This is caused by a know bug in ANTLR (antlr/antlr4#994), which was fixed in version 4.6. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes apache#19042 from hvanhovell/SPARK-21830. (cherry picked from commit 05af2de) (cherry picked from commit 3d4892f1d4fe8556bdd010056df645f2f1aafe56)
I'm using ANTLR-4.4 with a grammar file modified from sqlite.g4,and find some complex expressions cost too much time to parse.And then I found issue#192 .
That grammar file works fast with the antlr version I'm using now.But when I add some rules I used in my grammar file,it get slow down.
To make the issue simple,the grammar I use is just like this:
and the sample file is just same like
I tried this with the antlr plugin in IDEA,just 3 lines of the expressions will make the parsing time unacceptable.
The text was updated successfully, but these errors were encountered: