-
Notifications
You must be signed in to change notification settings - Fork 328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix JDK regex support #888
Fix JDK regex support #888
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some context. Still thinking about the "trailing newline" cases, but didn't find any solution yet.
This PR resolves #872.
Suggested commit message:
Fix JDK regex support
Summary of changes:
- Fix the test resources introduced by #783 by moving the `regex`
fields, such that the test framework does not skip them with a "Not a
valid test case" message.
- Revert the changes introduced by #815, as those are simply incorrect.
- Extend the test coverage introduced by #815 by (a) updating the test
regexes to match their intended semantics and (b) include a few
negative test cases.
- Partially revert the change introduced by #783: the use of
`Matcher#find()` is correct, but the `hasStartAnchor` and
`hasEndAnchor` logic introduces more bugs than the issue it aims to
solve.
- Extend the test coverage introduced by #783, by introducing regexes
that are not covered by the `hasStartAnchor`/`hasEndAnchor` logic.
- Update the Joni regular expression integration such that it passes
more of the test cases.
- Disable the "trailing newline" test cases, as these are currently not
handled correctly by either regex implementation.
Resolves #872.
} | ||
|
||
@Override | ||
public boolean matches(String value) { | ||
Matcher matcher = this.pattern.matcher(value); | ||
return matcher.find() && (!this.hasStartAnchor || 0 == matcher.start()) && (!this.hasEndAnchor || matcher.end() == value.length()); | ||
return this.pattern.matcher(value).find(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -21,7 +21,7 @@ class JoniRegularExpression implements RegularExpression { | |||
.replace("\\S", "[^ \\f\\n\\r\\t\\v\\u00a0\\u1680\\u2000-\\u200a\\u2028\\u2029\\u202f\\u205f\\u3000\\ufeff]"); | |||
|
|||
byte[] bytes = s.getBytes(StandardCharsets.UTF_8); | |||
this.pattern = new Regex(bytes, 0, bytes.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript); | |||
this.pattern = new Regex(bytes, 0, bytes.length, Option.SINGLELINE, UTF8Encoding.INSTANCE, Syntax.ECMAScript); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not find the Joni documentation, but this change makes more test cases pass; please carefully review.
@@ -8,40 +8,50 @@ class Issue814Test { | |||
|
|||
@Test | |||
void jdkTypePattern() { | |||
JDKRegularExpression ex = new JDKRegularExpression("^list|date|time|string|enum|int|double|long|boolean|number$"); | |||
JDKRegularExpression ex = new JDKRegularExpression("^(list|date|time|string|enum|int|double|long|boolean|number)$"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
@Test | ||
void jdkOptionsPattern() { | ||
JDKRegularExpression ex = new JDKRegularExpression("^\\d*|[a-zA-Z_]+$"); | ||
JDKRegularExpression ex = new JDKRegularExpression("^\\d+|[a-zA-Z_]+$"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I changed *
to +
: because ^\d*
is vacuously true: it matches the empty string at the start of any input. (Without this the internal9
test case below would also yield true
.)
@@ -1,61 +1,105 @@ | |||
[ | |||
{ | |||
"description": "issue495 using ECMA-262", | |||
"regex": "ecma-262", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this fields, because regex
is defined in the TestSpec
class, not inside TestCase
. Without this the tests were skipped...
"pattern": "^[a-z]{1,10}$", | ||
"patternProperties": { | ||
"^[a-z]{1,10}$": true, | ||
"(^1$)": true | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern
keyword didn't seem to work, so I switched to patternProperties
; let me know if something else was intended. Also added an extra test case to show why inspecting the regex will not work in general.
{ | ||
"description": "trailing newline", | ||
"regex": "jdk", | ||
"data": { "aaa\n": 3 }, | ||
"valid": false | ||
"valid": false, | ||
"disabled": true, | ||
"comment": "Test fails" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only test case on which this PR regresses. Both regex implementation suffer from the same issue:
jshell> Pattern.compile("^a$").matcher("a\n").find()
$1 ==> true
jshell> Pattern.compile("(^a$)").matcher("a\n").find()
$2 ==> true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I found why this happens. The documentation states:
The regular expression
$
matches at the end of the entire input sequence, but also matches just before the last line terminator if this is not followed by any other input character. Other line terminators are ignored, including the last one if it is followed by other input characters.
So in theory we could avoid this by transforming all $
anchors by prepending a negative lookahead, like so:
jshell> Pattern.compile("^a(?!\n|\r\n|\r|\\u0085|\\u2028|\\u2029)$").matcher("a\n").find()
$1 ==> false
But that will require careful parsing of the input regex, because the $
could be escaped, or inside a \Q
...\E
section, or... Quite frankly, that way lies madness; would like to keep it out of this PR.
JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$"); | ||
assertTrue(ex.matches("0external")); | ||
assertTrue(ex.matches("internal")); | ||
assertTrue(ex.matches("external")); | ||
assertTrue(ex.matches("external_gte")); | ||
assertTrue(ex.matches("force")); | ||
assertFalse(ex.matches("internal9")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Stephan202 Actually the ^\\d+|[a-zA-Z_]+$
is the same as ^\\d|[a-zA-Z_]$
Explanation:
you should read the ^\\d+|[a-zA-Z_]+$
like: matches to ^\\d+
or to [a-zA-Z_]+$
.
In 'human' words: starts with digit or ends with alpha-or-underscore.
Actually it is identical to: matches to ^\\d
or to [a-zA-Z_]$
or ^\\d|[a-zA-Z_]$
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right you are!
} | ||
|
||
@Test | ||
void joniOptionsPattern() { | ||
JoniRegularExpression ex = new JoniRegularExpression("^\\d*|[a-zA-Z_]+$"); | ||
JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$"); | ||
assertTrue(ex.matches("0external")); | ||
assertTrue(ex.matches("internal")); | ||
assertTrue(ex.matches("external")); | ||
assertTrue(ex.matches("external_gte")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that similar examples can be removed, and the next can be added:
(matches left part: starts with digit:)
assertTrue - "5"
assertTrue - "55"
assertTrue - "5%"
(matches right part: ends with alpha-or-undescore:)
assertTrue - "a"
assertTrue - "aa"
assertTrue - "%a"
assertTrue - "%_"
(matches both parts)
assertTrue - "55aa"
assertTrue - "5%%a"
assertFalse: ""
assertFalse: "%"
assertFalse: "a5"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, lemme replace them.
Summary of changes: - Fix the test resources introduced by #783 by moving the `regex` fields, such that the test framework does not skip them with a "Not a valid test case" message. - Revert the changes introduced by #815, as those are simply incorrect. - Extend the test coverage introduced by #815 by (a) updating the test regexes to match their intended semantics and (b) include a few negative test cases. - Partially revert the change introduced by #783: the use of `Matcher#find()` is correct, but the `hasStartAnchor` and `hasEndAnchor` logic introduces more bugs than the issue it aims to solve. - Extend the test coverage introduced by #783, by introducing regexes that are not covered by the `hasStartAnchor`/`hasEndAnchor` logic. - Update the Joni regular expression integration such that it passes more of the test cases. - Disable the "trailing newline" test cases, as these are currently not handled correctly by either regex implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look @kool79!
I squashed the following changes into the commit:
diff --git a/src/test/java/com/networknt/schema/regex/Issue814Test.java b/src/test/java/com/networknt/schema/regex/Issue814Test.java
index 3e7d42d..15ab31c 100644
--- a/src/test/java/com/networknt/schema/regex/Issue814Test.java
+++ b/src/test/java/com/networknt/schema/regex/Issue814Test.java
@@ -21,13 +21,19 @@ class Issue814Test {
@Test
void jdkOptionsPattern() {
- JDKRegularExpression ex = new JDKRegularExpression("^\\d+|[a-zA-Z_]+$");
- assertTrue(ex.matches("0external"));
- assertTrue(ex.matches("external"));
- assertTrue(ex.matches("external_gte"));
- assertTrue(ex.matches("force"));
- assertTrue(ex.matches("internal"));
- assertFalse(ex.matches("internal9"));
+ JDKRegularExpression ex = new JDKRegularExpression("^\\d|[a-zA-Z_]$");
+ assertTrue(ex.matches("5"));
+ assertTrue(ex.matches("55"));
+ assertTrue(ex.matches("5%"));
+ assertTrue(ex.matches("a"));
+ assertTrue(ex.matches("aa"));
+ assertTrue(ex.matches("%a"));
+ assertTrue(ex.matches("%_"));
+ assertTrue(ex.matches("55aa"));
+ assertTrue(ex.matches("5%%a"));
+ assertFalse(ex.matches(""));
+ assertFalse(ex.matches("%"));
+ assertFalse(ex.matches("a5"));
}
@Test
@@ -45,13 +51,19 @@ class Issue814Test {
@Test
void joniOptionsPattern() {
- JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
- assertTrue(ex.matches("0external"));
- assertTrue(ex.matches("internal"));
- assertTrue(ex.matches("external"));
- assertTrue(ex.matches("external_gte"));
- assertTrue(ex.matches("force"));
- assertFalse(ex.matches("internal9"));
+ JoniRegularExpression ex = new JoniRegularExpression("^\\d|[a-zA-Z_]$");
+ assertTrue(ex.matches("5"));
+ assertTrue(ex.matches("55"));
+ assertTrue(ex.matches("5%"));
+ assertTrue(ex.matches("a"));
+ assertTrue(ex.matches("aa"));
+ assertTrue(ex.matches("%a"));
+ assertTrue(ex.matches("%_"));
+ assertTrue(ex.matches("55aa"));
+ assertTrue(ex.matches("5%%a"));
+ assertFalse(ex.matches(""));
+ assertFalse(ex.matches("%"));
+ assertFalse(ex.matches("a5"));
}
}
(I squashed, because looking at the master
branch, it appears that the maintainer of this repository uses GitHub's squash-and-rebase feature without cleaning up the commit message auto-generated by GitHub, which creates a rather messy result if there's more than one commit.)
JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$"); | ||
assertTrue(ex.matches("0external")); | ||
assertTrue(ex.matches("internal")); | ||
assertTrue(ex.matches("external")); | ||
assertTrue(ex.matches("external_gte")); | ||
assertTrue(ex.matches("force")); | ||
assertFalse(ex.matches("internal9")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right you are!
} | ||
|
||
@Test | ||
void joniOptionsPattern() { | ||
JoniRegularExpression ex = new JoniRegularExpression("^\\d*|[a-zA-Z_]+$"); | ||
JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$"); | ||
assertTrue(ex.matches("0external")); | ||
assertTrue(ex.matches("internal")); | ||
assertTrue(ex.matches("external")); | ||
assertTrue(ex.matches("external_gte")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, lemme replace them.
@Stephan202 @kool79 Thanks a lot for the hard work. Do you think we are ready to merge? @fdutton Do you have any concerns? |
@stevehu I think that not supporting the trailing newline case is an acceptable compromise given the larger context; I'm not planning other changes at this time 👍 (We should trigger the build workflow, though, to make sure that the tests also pass on GHA.) |
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #888 +/- ##
============================================
+ Coverage 78.48% 78.52% +0.03%
+ Complexity 1213 1207 -6
============================================
Files 120 120
Lines 3891 3883 -8
Branches 746 740 -6
============================================
- Hits 3054 3049 -5
Misses 546 546
+ Partials 291 288 -3 ☔ View full report in Codecov by Sentry. |
Do you feel comfortable to merge this PR? Anyone has some concerns? Thanks. |
@stevehu yes, I feel comfortable doing so 👍 (But don't have permission to ;).) |
Summary of changes:
regex
fields, such that the test framework does not skip them with a "Not a valid test case" message.Matcher#find()
is correct, but thehasStartAnchor
andhasEndAnchor
logic introduces more bugs than the issue it aims to solve.hasStartAnchor
/hasEndAnchor
logic.