Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: regexp_extract returns match in mismatched group #12109

Closed

Conversation

HolyLow
Copy link
Contributor

@HolyLow HolyLow commented Jan 17, 2025

The implementation of Re2Extract has a bug, that it might consider a mismatched group as MATCHED empty string "" rather than MISMATCHED std::nullopt.

For example, in the function calling: regexp_extract("rat cat\nbat dog", "ra(.)|blah(.)(.)", 2).
In this case, for group 2 the result must be std::nullopt because no substring would match pattern blah(.).
But the current implementation would mistake the matching of group 1 ra(.) as a empty match case for group 2, and thus return a empty matching, which is wrong.

See detailed description in #12119.

This PR fix this bug in Re2Extract implementation.

Also note that this bug behavior exists in Re2ExtractAll as well, but this PR doesn't modify in Re2ExtractAll because existing UTs of Re2ExtractAll already rely on this behavior.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 17, 2025
Copy link

netlify bot commented Jan 17, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit e76c4d3
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67904f0338d3ec0008914b0b

@HolyLow
Copy link
Contributor Author

HolyLow commented Jan 17, 2025

@kgpai @mbasmanova Could you kindly help review this PR? Thanks a lot.

Any suggestion is welcome.

@mbasmanova
Copy link
Contributor

Also note that this bug behavior exists in Re2ExtractAll as well, but this PR doesn't modify in Re2ExtractAll because existing UTs of Re2ExtractAll already rely on this behavior.

Sounds like both the implementation is buggy and test expectations are wrong. In this case we need to fix both the implementation and the test.

Would you check Presto Java to see if Velox behavior matches it?

@mbasmanova
Copy link
Contributor

Looks like Presto Java returns NULL:

presto:di> SELECT regexp_extract('rat cat\nbat dog', 'ra(.)|blah(.)(.)', 2);
 _col0
-------
 NULL
(1 row)

@HolyLow Would you create a GitHub issue to describe this problem? Then, reference the issue in the PR description.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix!

CC: @amitkdutta @kevinwilfong @kagamiori

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Jan 17, 2025
@kagamiori kagamiori changed the title bugfix: regexp_extract returns match in mismatched group fix: regexp_extract returns match in mismatched group Jan 17, 2025
@facebook-github-bot
Copy link
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@HolyLow
Copy link
Contributor Author

HolyLow commented Jan 20, 2025

@mbasmanova Bug description issue added.
The fixing for Re2ExractAll will be introduced in next PR as it involves fixing the existing wrong UTs and would take some time.

@mbasmanova
Copy link
Contributor

Thank you, @HolyLow .

@kagamiori
Copy link
Contributor

Hi @HolyLow, there were some build issues last week that caused CI job failures in your PR. They are fixed now. Could you please rebase the PR onto the latest main so that the failed jobs can be re-triggered? Thanks!

@HolyLow HolyLow force-pushed the fix-regexp-extract-null-group branch from b7ed8c7 to e76c4d3 Compare January 22, 2025 01:50
@HolyLow
Copy link
Contributor Author

HolyLow commented Jan 22, 2025

@kagamiori Rebased to latest main branch as requested.

@facebook-github-bot
Copy link
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kagamiori merged this pull request in 58ff4ac.

@zhouyuan
Copy link
Contributor

It looks like the behavior is not align with Spark-sql:

Spark master: local[*], Application Id: local-1737594817496
spark-sql> SELECT regexp_extract('rat cat\nbat dog', 'ra(.)|blah(.)(.)', 2);

Time taken: 2.226 seconds, Fetched 1 row(s)

For Spark we may need to use a separate impl
Cc: @PHILO-HE

zhouyuan added a commit to oap-project/velox that referenced this pull request Jan 23, 2025
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jan 23, 2025
PHILO-HE added a commit to PHILO-HE/velox that referenced this pull request Jan 24, 2025
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jan 24, 2025
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jan 25, 2025
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jan 26, 2025
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jan 27, 2025
zhouyuan added a commit to oap-project/velox that referenced this pull request Jan 28, 2025
marin-ma pushed a commit to oap-project/velox that referenced this pull request Jan 29, 2025
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jan 29, 2025
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jan 30, 2025
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jan 31, 2025
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Feb 1, 2025
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants