Fix regressions related to cuDF changes in handling of end-of-line/string anchors #7211
Signed-off-by: Andy Grove <andygrove@nvidia.com>
build
There was a test failure against Spark 3.1.1. I will investigate:
…d the integration test failure but there are now a few failing java unit tests
build
The following regular expression patterns are known to potentially produce different results on the GPU
vs the CPU.

- Word and non-word boundaries, `\b` and `\B`
We had two sections listing known edge cases, so I consolidated them by moving this content.
@NVnavkumar This is ready for review now
LGTM, just some nits on docs and error messaging
docs/compatibility.md
- Word and non-word boundaries, `\b` and `\B`
- Line anchor `$` will incorrectly match any of the unicode characters `\u0085`, `\u2028`, or `\u2029` followed by
another line-terminator, such as `\n`. For example, the pattern `TEST$` will match `TEST\u0085\n` on the GPU but
not on the CPU ([#7585](https://github.com/NVIDIA/spark-rapids/issues/7585)).
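
For reference, the CPU side of the `TEST$` edge case above can be checked directly. This is a minimal sketch that assumes `java.util.regex` is the CPU reference engine; the object and helper names (`DollarAnchorCpuDemo`, `foundOnCpu`) are hypothetical and are not part of the plugin. It only exercises the CPU behavior and says nothing about what cuDF returns.

```scala
import java.util.regex.Pattern

object DollarAnchorCpuDemo {
  // Hypothetical helper: true when the regex is found anywhere in the input,
  // using java.util.regex (assumed here to be the CPU reference engine).
  def foundOnCpu(regex: String, input: String): Boolean =
    Pattern.compile(regex).matcher(input).find()

  def main(args: Array[String]): Unit = {
    val pattern = "TEST$"
    // Each label describes the characters that follow "TEST" in the input.
    val cases = Seq(
      "TEST + LF" -> "TEST\n",
      "TEST + U+0085" -> "TEST\u0085",
      "TEST + U+0085 + LF" -> "TEST\u0085\n",
      "TEST + U+2028 + LF" -> "TEST\u2028\n",
      "TEST + U+2029 + LF" -> "TEST\u2029\n"
    )
    cases.foreach { case (label, input) =>
      println(s"$label: ${foundOnCpu(pattern, input)}")
    }
  }
}
```

On the CPU, `$` tolerates a single trailing line terminator, so the first two inputs match and the last three (two terminators in a row) do not.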

The following regular expression patterns are not yet supported on the GPU and will fall back to the CPU.

- Line anchor `^` is not supported in some contexts, such as when combined with a choice (`^|a`).
- Line anchor `$` is not supported by `regexp_replace`, and in some rare contexts.
Is this true now? We still have integration tests for it.
Good catch. Updated this line.
@@ -1144,8 +1143,8 @@ class CudfRegexTranspiler(mode: RegexMode) {
case 'z' if mode == RegexSplitMode =>
We left this split case for `\z`, but the error message just claims that `\z` is not supported on GPU. We should either remove this case, or clarify the error messaging.
Updated.
LGTM
Nit: Looks like the compatibility doc wasn't updated for `\z`, but I don't know if it makes sense to call out a feature that only works in `StringSplit`.
build
…-line/string anchors (NVIDIA#7211)" This reverts commit 3398daa. Signed-off-by: Andy Grove <andygrove@nvidia.com>
Closes #7090
Rationale
Now that rapidsai/cudf#11979 is resolved using the fix described in rapidsai/cudf#11979 (comment), the regular expression transpiler code needs to be updated for the new handling of `$`, `\z` and `\Z`.
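
As a reminder of the CPU semantics the transpiler has to reproduce, here is a small sketch of how the three anchors differ in `java.util.regex` (assumed to be the CPU reference engine). The object and helper names (`EndAnchorCpuDemo`, `foundOnCpu`) are hypothetical and only for illustration; nothing here reflects the cuDF side.

```scala
import java.util.regex.Pattern

object EndAnchorCpuDemo {
  // Hypothetical helper: true when the regex is found anywhere in the input,
  // using java.util.regex with default flags (no MULTILINE, no UNIX_LINES).
  def foundOnCpu(regex: String, input: String): Boolean =
    Pattern.compile(regex).matcher(input).find()

  def main(args: Array[String]): Unit = {
    val withNewline = "TEST\n"
    val withoutNewline = "TEST"
    // "$" and "\Z" accept one trailing line terminator; "\z" requires the true end of input.
    val patterns = Seq("TEST$", "TEST\\Z", "TEST\\z")
    patterns.foreach { p =>
      val a = foundOnCpu(p, withNewline)
      val b = foundOnCpu(p, withoutNewline)
      println(s"$p: with trailing LF = $a, without = $b")
    }
  }
}
```

With a trailing `\n`, `TEST$` and `TEST\Z` are found but `TEST\z` is not; without it, all three are found.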
Follow-on issue:
Changes in this PR:
- `$` to reflect recent changes in cuDF
- `\z` because there is no longer an equivalent in cuDF that we can transpile to
- `\z` fallback to CPU
- `\u0085`, `\u2028`, and `\u2029`