Skip to content

Commit

Permalink
fix: regexp_extract_all to not return a match in mismatched group (fa…
Browse files Browse the repository at this point in the history
…cebookincubator#12143)

Summary:
Fix regexp_extract_all to avoid returning a match in mismatched group.

For example,

```
regexp_extract_all("rat cat\nbat dog", "ra(.)|blah(.)(.)", 2)
```

is expected to return {NULL} because the first group `ra(.)` has a match while the second group `blah(.)` has no match. The current buggy implementation returns {""}.

This test case and the expected result is from presto's existing UT, see https://github.com/prestodb/presto/blob/48f2fb1cd0244e8ae1230d27445fcd15d6520597/presto-main/src/test/java/com/facebook/presto/operator/scalar/AbstractTestRegexpFunctions.java#L214.

Fixes facebookincubator#12119.

Pull Request resolved: facebookincubator#12143

Reviewed By: kagamiori

Differential Revision: D68746581

Pulled By: xiaoxmeng

fbshipit-source-id: 10e6c919fde1c4419a404592b7e7c6941fe983a1
  • Loading branch information
HolyLow authored and facebook-github-bot committed Jan 28, 2025
1 parent a837ebd commit 2d24256
Show file tree
Hide file tree
Showing 2 changed files with 21 additions and 2 deletions.
9 changes: 7 additions & 2 deletions velox/functions/lib/Re2Functions.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1070,8 +1070,13 @@ void re2ExtractAll(
const re2::StringPiece fullMatch = groups[0];
const re2::StringPiece subMatch = groups[groupId];

arrayWriter.add_item().setNoCopy(
StringView(subMatch.data(), subMatch.size()));
// Check if the subMatch is null.
if (subMatch.data()) {
arrayWriter.add_item().setNoCopy(
StringView(subMatch.data(), subMatch.size()));
} else {
arrayWriter.add_null();
}
pos = fullMatch.data() + fullMatch.size() - input.data();
if (UNLIKELY(fullMatch.size() == 0)) {
++pos;
Expand Down
14 changes: 14 additions & 0 deletions velox/functions/lib/tests/Re2FunctionsTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1103,6 +1103,20 @@ TEST_F(Re2FunctionsTest, regexExtractAllConstantPatternConstantGroupId) {
testRe2ExtractAll(inputs, constantPattern, bigGroupIds, expectedOutputs);
}

TEST_F(Re2FunctionsTest, regexExtractAllMismatchedGroup) {
auto sourceVector =
makeFlatVector<std::string>({"rat cat\nbat dog"}, VARCHAR());
auto patternVector =
makeFlatVector<std::string>({"ra(.)|blah(.)(.)"}, VARCHAR());
auto groupIdVector = makeFlatVector<int>(std::vector{2}, INTEGER());
auto expectedVector = makeNullableArrayVector<std::string>(
std::vector<std::vector<std::optional<std::string>>>{{std::nullopt}});
auto result = evaluate(
"regexp_extract_all(c0, c1, c2)",
makeRowVector({sourceVector, patternVector, groupIdVector}));
assertEqualVectors(expectedVector, result);
}

TEST_F(Re2FunctionsTest, regexExtractAllConstantPatternVariableGroupId) {
const std::vector<std::optional<std::string>> inputs = {
" 123a 2b 14m ",
Expand Down

0 comments on commit 2d24256

Please sign in to comment.