-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add substrings search path for constant like pattern #10731
Conversation
✅ Deploy Preview for meta-velox canceled.
|
Could you help to have a look at this PR ? Is it reasonable to add a fast-path here? |
91983d0
to
ee87495
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a benchmark to show the performance gain?
velox/common/base/SimdUtil-inl.h
Outdated
size_t n, | ||
const char* needle, | ||
MEMCMP memcmp_fun) { | ||
assert(k > 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VELOX_DCHECK
1ac4f0f
to
60b2015
Compare
@Yuhta update code , could you have a look at this once more ?
I think this is the most interesting thing in this PR:
By this way, load_unaligned is safe when overreading for they are in same memory page.
Howevever, performance seems not best, Doris delete Volnitsky when they copy the StringSearch implement : https://github.com/apache/doris/blob/4e326e5f496109edb253d094869d057573ab7a59/be/src/vec/common/string_searcher.h#L44 And use first-two char compare, however, I test it , about 2x slow than this PR`s implement. |
20d70bf
to
079c6b1
Compare
079c6b1
to
4718337
Compare
velox/functions/lib/Re2Functions.cpp
Outdated
@@ -14,6 +14,7 @@ | |||
* limitations under the License. | |||
*/ | |||
#include "velox/functions/lib/Re2Functions.h" | |||
#include "velox/common/base/SimdUtil.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? Would you point me to where it is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
er...., forget this after removed simdstrstr
fcfd2e4
to
2d91717
Compare
@mbasmanova ping ~ |
@@ -32,6 +32,12 @@ limited to 20 different expressions per instance and thread of execution. | |||
SELECT like('abc', '%b%'); -- true | |||
SELECT like('a_c', '%#_%', '#'); -- true | |||
|
|||
String sequence search: There are some patterns that are equivalent to simple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to embed the new information into the existing flow.
Not all patterns require compilation of regular expressions. Patterns ....
The new pattern needs to added to this list. The separate paragraph then can be about the limitations that apply to these special patterns. I.e. We would clarify separately whether special patterns allow for custom character or not.
Also, please, generate the docs to make sure they look nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if I understand your meaning, add a list for all fast path in velox.
@mbasmanova Could you look at this again ?
b963239
to
97822e5
Compare
@@ -32,6 +32,22 @@ limited to 20 different expressions per instance and thread of execution. | |||
SELECT like('abc', '%b%'); -- true | |||
SELECT like('a_c', '%#_%', '#'); -- true | |||
|
|||
Not all patterns require compilation of regular expressions. For the patterns without custom escape: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is existing text
Not all patterns require
compilation of regular expressions. Patterns 'hello', 'hello%', '_hello__%',....
We need to update it to add the new pattern. Currently, you are adding new text as if existing text doesn't exist or doesn't talk about similar thing.
Current:
Patterns 'hello', 'hello%', '_hello__%',
'%hello', '%__hello_', '%hello%', where
New (proposed):
Patterns 'hello', 'hello%', '_hello__%',
'%hello', '%__hello_', '%hello%', '%hello%velox%' where
@@ -32,6 +32,22 @@ limited to 20 different expressions per instance and thread of execution. | |||
SELECT like('abc', '%b%'); -- true | |||
SELECT like('a_c', '%#_%', '#'); -- true | |||
|
|||
Not all patterns require compilation of regular expressions. For the patterns without custom escape: | |||
|
|||
| patterns | description | fast search mothod | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This table might be a nice alternative to current doc, but we'll need to iterate on columns, formatting and user-friendly descriptions. I suggest to do that in a separate PR.
97822e5
to
383166b
Compare
@mbasmanova |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Is the regression in benchmark caused by the change? If so we probably want to merge this after #10858 |
It`s strange because Q2 and suffix is not supported by this change.
@Yuhta I do not see regression on my machine before 9851a4e after @Yuhta Maybe input data is not same ? what`s 'data' means in the cicd result? Would you help to check if the cicd result not stable, because I could not reproduced on my machine ? |
383166b
to
88e78a4
Compare
88e78a4
to
2ee154c
Compare
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@skadilover It's just flakiness, disappeared after rerun |
I think it`s nothing to do with this change. |
@@ -2154,6 +2201,14 @@ std::shared_ptr<exec::VectorFunction> makeLike( | |||
|
|||
PatternMetadata patternMetadata = PatternMetadata::generic(); | |||
try { | |||
// Fast path for substrings search. | |||
auto substrings = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After picking this PR, Gluten will be a unit test that fails.
Incorrect evaluation: ///__ LIKE %//%/% ESCAPE '/', actual: true, expected: false
It seems that this implementation did not take into account the ESCAPE. In this example, the last slash (/) is the ESCAPE, which causes the last percent (%) to be a regular character, not a wildcard. Therefore, the result should be false, but Velox returns true.
@skadilover @mbasmanova @Yuhta @xiaoxmeng Do you have any input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So sorry for the mistake, I mis-understand the code here:
BaseVector* escape = inputArgs[2].constantValue.get();
if (!escape) {
return std::make_shared<LikeGeneric>();
}
and the ut just skiped by '#'
testLike("a%c", "%#%%", '#', true);
testLike("%cd", "%#%%", '#', true);
fix it here : #11091
…or#10731) Summary: We noticed that some like constant-patterns can be equivalent to substrings search such as tpch queries Q13 and Q16: "%special%requests%" "%Customer%Complaints%" 1. Add substring-search fast path for like with search pattern : "((%+[^%#_\\]+)+%+)" 2. Some OLAP engine supports "\_" for substring-search. In order to obtain the best performance, "\_" is not supported here, '#' is not support for the same reason too. 3. This PR will also affect single substring-search like tpch Q9, if a fast-simd-strstr to speedup string-search will be added as follow-up, changes nothing right now. benchmark result : before: tpchQuery9 23.96ms 41.73 tpchQuery13 156.02ms 6.41 tpchQuery16Part 9.02ms 110.83 tpchQuery16Supplier 192.13ms 5.20 after: tpchQuery9 24.30ms 41.14 tpchQuery13 55.03ms 18.17 tpchQuery16Part 8.92ms 112.07 tpchQuery16Supplier 17.64ms 56.69 Pull Request resolved: facebookincubator#10731 Reviewed By: xiaoxmeng Differential Revision: D63261743 Pulled By: Yuhta fbshipit-source-id: 3822d581790db32286bd7b9756f53401684c1012
We noticed that some like constant-patterns can be equivalent to substrings search such as tpch queries Q13 and Q16: "%special%requests%" "%Customer%Complaints%"
benchmark result :
before:
tpchQuery9 23.96ms 41.73
tpchQuery13 156.02ms 6.41
tpchQuery16Part 9.02ms 110.83
tpchQuery16Supplier 192.13ms 5.20
after:
tpchQuery9 24.30ms 41.14
tpchQuery13 55.03ms 18.17
tpchQuery16Part 8.92ms 112.07
tpchQuery16Supplier 17.64ms 56.69