-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix regressions related to cuDF changes in handline of end-of-line/string anchors #7211
Changes from 15 commits
617430c
4085ee1
e4ddc61
b01c681
3ac193c
62c3562
e802c9a
5475b8a
8848020
3f2807f
33be75b
c948419
b37cc1d
94472b8
df46b4a
2e1b5dc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -437,18 +437,23 @@ disable regular expressions on the GPU, set `spark.rapids.sql.regexp.enabled=fal | |
|
||
These are the known edge cases where running on the GPU will produce different results to the CPU: | ||
|
||
- Regular expressions that contain an end of line anchor '$' or end of string anchor '\Z' or '\z' immediately | ||
- Regular expressions that contain an end of line anchor '$' or end of string anchor '\Z' immediately | ||
next to a newline or a repetition that produces zero or more results | ||
([#5610](https://github.com/NVIDIA/spark-rapids/pull/5610))` | ||
- Word and non-word boundaries, `\b` and `\B` | ||
- Line anchor `$` will incorrectly match any of the unicode characters `\u0085`, `\u2028`, or `\u2029` followed by | ||
another line-terminator, such as `\n`. For example, the pattern `TEST$` will match `TEST\u0085\n` on the GPU but | ||
not on the CPU ([#7585](https://github.com/NVIDIA/spark-rapids/issues/7585)). | ||
|
||
The following regular expression patterns are not yet supported on the GPU and will fall back to the CPU. | ||
|
||
- Line anchor `^` is not supported in some contexts, such as when combined with a choice (`^|a`). | ||
- Line anchor `$` is not supported by `regexp_replace`, and in some rare contexts. | ||
- String anchor `\Z` is not supported by `regexp_replace`, and in some rare contexts. | ||
- String anchor `\z` is not supported | ||
- Patterns containing an end of line or string anchor immediately next to a newline or repetition that produces zero | ||
or more results | ||
- Line anchor `$` and string anchors `\z` and `\Z` are not supported in patterns containing `\W` or `\D` | ||
- Line anchor `$` and string anchors `\Z` are not supported in patterns containing `\W` or `\D` | ||
- Line and string anchors are not supported by `string_split` and `str_to_map` | ||
- Lazy quantifiers, such as `a*?` | ||
- Possessive quantifiers, such as `a*+` | ||
|
@@ -457,12 +462,6 @@ The following regular expression patterns are not yet supported on the GPU and w | |
- Empty groups: `()` | ||
- `regexp_replace` does not support back-references | ||
|
||
The following regular expression patterns are known to potentially produce different results on the GPU | ||
vs the CPU. | ||
|
||
- Word and non-word boundaries, `\b` and `\B` | ||
Comment on lines
-460
to
-463
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We had two sections listing known edge cases, so I consolidated them by moving this content. |
||
|
||
|
||
Work is ongoing to increase the range of regular expressions that can run on the GPU. | ||
|
||
## Timestamps | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -628,6 +628,8 @@ class RegexParser(pattern: String) { | |
object RegexParser { | ||
private val regexpChars = Set('\u0000', '\\', '.', '^', '$', '\u0007', '\u001b', '\f') | ||
|
||
def parse(pattern: String): RegexAST = new RegexParser(pattern).parse | ||
|
||
def isRegExpString(s: String): Boolean = { | ||
|
||
def isRegExpString(ast: RegexAST): Boolean = ast match { | ||
|
@@ -842,10 +844,7 @@ class CudfRegexTranspiler(mode: RegexMode) { | |
None | ||
) | ||
} else { | ||
RegexGroup(capture = capture, | ||
RegexChoice( | ||
RegexCharacterClass(negated = false, characters = terminatorChars), | ||
RegexSequence(ListBuffer(RegexChar('\r'), RegexChar('\n')))), None) | ||
RegexGroup(capture = capture, RegexParser.parse("\r|\u0085|\u2028|\u2029|\r\n"), None) | ||
} | ||
} | ||
|
||
|
@@ -1144,8 +1143,8 @@ class CudfRegexTranspiler(mode: RegexMode) { | |
case 'z' if mode == RegexSplitMode => | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We left this split case for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated. |
||
RegexEscaped('Z') | ||
case 'z' => | ||
// cuDF does not support "\z" but supports "$", which is equivalent | ||
RegexChar('$') | ||
// cuDF does not support "\z" | ||
throw new RegexUnsupportedException("\\z is not supported on GPU", regex.position) | ||
case 'Z' => | ||
// \Z is really a synonymn for $. It's used in Java to preserve that behavior when | ||
// using modes that change the meaning of $ (such as MULTILINE or UNIX_LINES) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true now? We still have integration tests for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Updated this line.