Add a faster string-search method with simd instructions than std::find #10858

skadilover · 2024-08-27T12:42:38Z

Description

In practice, c_strstr is always faster than std::find in most cases, refer to [1]. However, the c_strstr interface will truncate '\0', which makes it unable to handle unicode characters. In addition, glibc uses two-way- search, see[2], but it does not seem to be exposed as a common interface. It requires some additional code to determine the best path. In some scenarios, the decision-making code may cause performance degradation.

Solution

There are many optimization algorithms for string-search, refer to [7], but they cannot change the essence of the problem, which is a search process of O(n*m) complexity. I think the most common way to improve performance is to optimize the search instructions. For example, doris (the same is true for clickhouse, this part of the code of doris was copied from clickhouse, and the Volnitsky[11] part was removed), see [5].
This PR refers to the implementation of [5], [6], [8] ：
[5] doris : first-two comapre first, and the issue of page-safe was also mentioned
[6] stringzilla: first-mid-lst compare first
[8] simd-strstr implement by WojciechMula, first-lst comapre first.

They all look few chars first, the more chars you choose to look the higher probability of hitting the optimized path, and the more cost of 'optimized path' itself.

In the end, the solution I chose is to first compare first-last-chars and then compare the remaining bytes. This is because in practice, our users seem to be able to get the matching results through first-last-chars.

ut & benchmark
This part of the code refers to folly’s test code, see [9] [10], of course, some adjustments have been made based on the code implementation.

Benchmark : std::find vs simdStrStr:

findSuccessful(opt_50_5) 31.66ms 31.59
findSuccessful(opt_100_10) 36.81ms 27.16
findSuccessful(opt_100_20) 96.47ms 10.37
findSuccessful(opt_1k_10) 63.27ms 15.81
findSuccessful(opt_1k_100) 156.95ms 6.37
findSuccessful(std_50_5) 45.24ms 22.11
findSuccessful(std_100_10) 191.23ms 5.23
findSuccessful(std_100_20) 322.11ms 3.10
findSuccessful(std_1k_10) 122.69ms 8.15
findSuccessful(std_1k_100) 168.10ms 5.95
findUnsuccessful(std_first_char_match) 819.13ms 1.22
findUnsuccessful(opt_first_char_match) 261.46ms 3.82
findUnsuccessful(std_first_char_unmatch) 249.12ms 4.01
findUnsuccessful(opt_first_char_unmatch) 199.02ms 5.02

References

netlify · 2024-08-27T12:42:56Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`1c11800`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/672cae94423b1b0008f0bcb1

skadilover · 2024-08-29T03:23:57Z

@Yuhta
ping~

from #10731

I don't think you can put variable definition in header file, you probably need to move it to SimdUtil.cpp and have only forward declaration here. Also what is the value if we do MADV_HUGEPAGE for some addresses?

I found that the performance of the code will be reduced by 30% when placed in the cpp file. So for the case, I change the code like this :

  static const int kPageSize = sysconf(_SC_PAGESIZE);

====>
FOLLY_ALWAYS_INLINE bool pageSafe(const void* const ptr, size_t length) {
  static const int kPageSize = sysconf(_SC_PAGESIZE);
  return ((kPageSize - 1) & reinterpret_cast<std::uintptr_t>(ptr)) <=
      kPageSize - CharVector::size - length;
}

MADV_HUGEPAGE for some addresses

I haven't found a good way to handle this situation, could you give me some advice?
Could we assume MADV_HUGEPAGE is always times of _SC_PAGESIZE , 2M vs 4k etc. see https://hugekernel.org/doc/Documentation/admin-guide/mm/transhuge.rst

Modern kernels support "multi-size THP" (mTHP), which introduces the
ability to allocate memory in blocks that are bigger than a base page
but smaller than traditional PMD-size (as described above), in
increments of a power-of-2 number of pages. mTHP can back anonymous
memory (for example 16K, 32K, 64K, etc)

RedHat define the valid values:
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-transhuge#s-memory-configure_hugepages

Defines the size of persistent huge pages configured in the kernel at boot time. Valid values are 2 MB and 1 GB. The default value is 2 MB.

Yuhta

Please consider the other suggestions in #10731 (review)

velox/common/base/SimdUtil-inl.h

skadilover · 2024-09-09T03:06:12Z

@Yuhta

Please consider the other suggestions in #10731 (review)

Updates code for the suggestions in #10731:

These checks can be VELOX_DCHECK since their correctness can be ensured quite locally in the same file

done

VELOX_SIMD_STRSTR_CASE

done

aside: Wondering what is the minimal k to make KMP faster than this. This is also related to content of the strings though.

I have extended StringBenchmark to support different algrithm, and kmp is always slower than simdStrStr(maybe my implement is not good so I add std::boyer_moore_searcher for comparison):

findSuccessful(simd_50_to_5)                              344.39ms      2.90
findSuccessful(std_50_to_5)                               802.61ms      1.25
findSuccessful(std_boyer_moore_50_to_5)                      1.42s   703.66m
findSuccessful(kmp_50_to_5)                                  5.23s   191.05m
findSuccessful(simd_100_to_10)                            315.82ms      3.17
findSuccessful(std_100_to_10)                                1.26s   790.97m
findSuccessful(std_boyer_moore_100_to_10)                 984.90ms      1.02
findSuccessful(kmp_100_to_10)                                6.38s   156.70m
findSuccessful(simd_100_to_20)                            352.90ms      2.83
findSuccessful(std_100_to_20)                                2.03s   492.18m
findSuccessful(std_boyer_moore_100_to_20)                 887.87ms      1.13
findSuccessful(kmp_100_to_20)                                6.16s   162.34m
findSuccessful(simd_1000_to_10)                           254.53ms      3.93
findSuccessful(std_1000_to_10)                            698.50ms      1.43
findSuccessful(std_boyer_moore_1000_to_10)                683.83ms      1.46
findSuccessful(kmp_1000_to_10)                               3.33s   300.39m
findSuccessful(simd_1000_to_100)                           65.42ms     15.29
findSuccessful(std_1000_to_100)                           186.01ms      5.38
findSuccessful(std_boyer_moore_1000_to_100)               381.65ms      2.62
findSuccessful(kmp_1000_to_100)                           913.63ms      1.09
findSuccessful(simd_1000_to_200)                          306.99ms      3.26
findSuccessful(std_1000_to_200)                           642.04ms      1.56
findSuccessful(std_boyer_moore_1000_to_200)               683.16ms      1.46
findSuccessful(kmp_1000_to_200)                              9.70s   103.05m
findUnsuccessful(std_first_char_match)                    400.47ms      2.50
findUnsuccessful(opt_first_char_match)                    266.45ms      3.75
findUnsuccessful(std_first_char_unmatch)                  333.54ms      3.00
findUnsuccessful(opt_first_char_unmatch)                  284.63ms      3.51

	simd strstr	std find	std boyer moore	kmp
heystack ：50 needle ：5	344ms	802ms	1.42s	5.23s
heystack : 100 needle ：10	315ms	1.26s	984ms	6.38s
heystack ：100 needle ：20	449ms	2.03s	887ms	1.13s
heystack ：1000 needle ：10	254ms	689ms	693ms	3.3s
heystack ：1000 needle ：100	65ms	186ms	381ms	913ms
heystack ：1000 needle ：200	306ms	642ms	683ms	9.7s

Since this is an almost drop-in replacement for strstr, can you add another test to generate a few thousands of random string pairs with various lengths, and check our result is the same as strstr? Also make sure you cover corner cases like larger needle than haystack, etc.

I update ut refer to folly uts, add random test cases.

Could you have look at this once more?

skadilover · 2024-09-09T15:17:04Z

@Yuhta ping~

Yuhta

The benchmark result looks very good

velox/common/base/SimdUtil-inl.h

skadilover · 2024-09-13T04:25:30Z

@Yuhta update code

pageSize => global variable
fix comments above except this:

Will we miss the opportunity if it is a super long string that is spanning multiple pages?

As we inline kPageSize, pageSafe method is faster than before, I move the check to each batch iteration, we only miss the comparison-span-pages case, maybe better, like clickhouse:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/StringSearcher.h

            while (haystack < haystack_end)
            {
                if (haystack + N <= haystack_end && isPageSafe(haystack))
                {

new benchmark result:

findSuccessful(simd_50_to_5)                              355.40ms      2.81
findSuccessful(std_50_to_5)                               807.73ms      1.24
findSuccessful(std_boyer_moore_50_to_5)                      1.40s   712.05m
findSuccessful(kmp_50_to_5)                                  5.18s   192.87m
findSuccessful(simd_100_to_10)                            319.76ms      3.13
findSuccessful(std_100_to_10)                                1.25s   802.85m
findSuccessful(std_boyer_moore_100_to_10)                 910.54ms      1.10
findSuccessful(kmp_100_to_10)                                6.38s   156.72m
findSuccessful(simd_100_to_20)                            355.43ms      2.81
findSuccessful(std_100_to_20)                                2.07s   483.83m
findSuccessful(std_boyer_moore_100_to_20)                 901.23ms      1.11
findSuccessful(kmp_100_to_20)                                6.20s   161.18m
findSuccessful(simd_1000_to_10)                           245.61ms      4.07
findSuccessful(std_1000_to_10)                            724.69ms      1.38
findSuccessful(std_boyer_moore_1000_to_10)                870.93ms      1.15
findSuccessful(kmp_1000_to_10)                               3.33s   300.34m
findSuccessful(simd_1000_to_100)                           69.67ms     14.35
findSuccessful(std_1000_to_100)                           186.06ms      5.37
findSuccessful(std_boyer_moore_1000_to_100)               286.45ms      3.49
findSuccessful(kmp_1000_to_100)                           913.97ms      1.09
findSuccessful(simd_1000_to_200)                          354.15ms      2.82
findSuccessful(std_1000_to_200)                           642.50ms      1.56
findSuccessful(std_boyer_moore_1000_to_200)               675.70ms      1.48
findSuccessful(kmp_1000_to_200)                              9.78s   102.28m
findUnsuccessful(std_first_char_match)                    920.16ms      1.09
findUnsuccessful(opt_first_char_match)                    383.93ms      2.60
findUnsuccessful(std_first_char_unmatch)                  334.28ms      2.99
findUnsuccessful(opt_first_char_unmatch)                  331.98ms      3.01

skadilover · 2024-09-19T02:34:07Z

@Yuhta ping~

velox/common/base/SimdUtil-inl.h

skadilover · 2024-09-25T05:37:30Z

@Yuhta

Any more comments about this ?

velox/common/base/SimdUtil-inl.h

skadilover · 2024-10-17T05:25:44Z

@Yuhta ping ~

skadilover · 2024-11-01T09:40:44Z

@assignUser @majetideepak @Yuhta
Any feedback for this PR?

assignUser · 2024-11-01T23:42:56Z

@skadilover CMake looks fine, can't comment on the rest :)

add ut / benchmark check the last page

facebook-github-bot · 2024-11-14T16:43:23Z

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-11-14T20:08:10Z

@Yuhta merged this pull request in fe87f0b.

…nd (facebookincubator#10858) Summary: ### Description In practice, c_strstr is always faster than std::find in most cases, refer to [1]. However, the c_strstr interface will truncate '\0', which makes it unable to handle unicode characters. In addition, glibc uses two-way- search, see[2], but it does not seem to be exposed as a common interface. It requires some additional code to determine the best path. In some scenarios, the decision-making code may cause performance degradation. ### Solution There are many optimization algorithms for string-search, refer to [7], but they cannot change the essence of the problem, which is a search process of O(n*m) complexity. I think the most common way to improve performance is to optimize the search instructions. For example, doris (the same is true for clickhouse, this part of the code of doris was copied from clickhouse, and the Volnitsky[11] part was removed), see [5]. This PR refers to the implementation of [5], [6], [8] ： [5] doris : first-two comapre first, and the issue of page-safe was also mentioned [6] stringzilla: first-mid-lst compare first [8] simd-strstr implement by WojciechMula, first-lst comapre first. They all look few chars first, the more chars you choose to look the higher probability of hitting the optimized path, and the more cost of 'optimized path' itself. In the end, the solution I chose is to first compare first-last-chars and then compare the remaining bytes. This is because in practice, our users seem to be able to get the matching results through first-last-chars. **ut & benchmark** This part of the code refers to folly’s test code, see [9] [10], of course, some adjustments have been made based on the code implementation. ### Benchmark : std::find vs simdStrStr: findSuccessful(opt_50_5) 31.66ms 31.59 findSuccessful(opt_100_10) 36.81ms 27.16 findSuccessful(opt_100_20) 96.47ms 10.37 findSuccessful(opt_1k_10) 63.27ms 15.81 findSuccessful(opt_1k_100) 156.95ms 6.37 findSuccessful(std_50_5) 45.24ms 22.11 findSuccessful(std_100_10) 191.23ms 5.23 findSuccessful(std_100_20) 322.11ms 3.10 findSuccessful(std_1k_10) 122.69ms 8.15 findSuccessful(std_1k_100) 168.10ms 5.95 findUnsuccessful(std_first_char_match) 819.13ms 1.22 findUnsuccessful(opt_first_char_match) 261.46ms 3.82 findUnsuccessful(std_first_char_unmatch) 249.12ms 4.01 findUnsuccessful(opt_first_char_unmatch) 199.02ms 5.02 ### References [1] gcc std::search issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66414 [2] glibc strstr implements: https://github.com/lattera/glibc/blob/master/string/strstr.c [3] two way search: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260 [4] std::boyer_moore_searcher https://en.cppreference.com/w/cpp/utility/functional/boyer_moore_searcher https://github.com/mclow/search-library?tab=readme-ov-file [5] doris string-search https://github.com/apache/doris/blob/4e326e5f496109edb253d094869d057573ab7a59/be/src/vec/common/string_searcher.h#L44 [6] stringzilla https://github.com/ashvardanian/StringZilla/blob/main/include/stringzilla/stringzilla.h#L2187 [7] string-search algorithms https://www-igm.univ-mlv.fr/~lecroq/string/ [8] sse4 strst https://github.com/WojciechMula/sse4-strstr [9] folly benchmark https://github.com/facebook/folly/blob/ce5edfb9b08ead9e78cb46879e7b9499861f7cd2/folly/test/FBStringTestBenchmarks.cpp.h [10] folly ut https://github.com/facebook/folly/blob/ce5edfb9b08ead9e78cb46879e7b9499861f7cd2/folly/test/FBStringTest.cpp#L1277 [11] clickhouse Volnitsky https://github.com/ClickHouse/ClickHouse/blob/a51f867ce014a4cb8f97c46e004b90f5feafd80f/src/Common/Volnitsky.h#L482 Pull Request resolved: facebookincubator#10858 Reviewed By: pedroerp Differential Revision: D65949876 Pulled By: Yuhta fbshipit-source-id: e48929a7f61694da93fe3df4ce777909118b9312

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 27, 2024

Yuhta self-requested a review August 27, 2024 15:16

skadilover force-pushed the opt_strstr branch 7 times, most recently from e99e699 to 457860e Compare August 29, 2024 02:30

skadilover changed the title ~~Add a fast simd strstr~~ Add a fast string-search method named simdStrStr Aug 29, 2024

skadilover changed the title ~~Add a fast string-search method named simdStrStr~~ Add a fast string-search method with simd instructions Aug 29, 2024

skadilover changed the title ~~Add a fast string-search method with simd instructions~~ Add a faster string-search method with simd instructions than std::find Aug 29, 2024

skadilover force-pushed the opt_strstr branch 3 times, most recently from 991be1a to 54422e3 Compare August 30, 2024 02:14

Yuhta reviewed Aug 30, 2024

View reviewed changes

velox/common/base/SimdUtil-inl.h Show resolved Hide resolved

velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved

skadilover force-pushed the opt_strstr branch from 54422e3 to 2cf06d7 Compare September 9, 2024 03:00

skadilover force-pushed the opt_strstr branch 4 times, most recently from 58a2c79 to 3ae8be1 Compare September 10, 2024 05:56

skadilover mentioned this pull request Sep 12, 2024

Add substrings search path for constant like pattern #10731

Closed

skadilover requested a review from Yuhta September 12, 2024 03:06

Yuhta reviewed Sep 12, 2024

View reviewed changes

velox/common/base/SimdUtil-inl.h Show resolved Hide resolved

velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved

velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved

velox/common/base/SimdUtil-inl.h Show resolved Hide resolved

skadilover force-pushed the opt_strstr branch from 3ae8be1 to 13220d3 Compare September 13, 2024 03:59

skadilover force-pushed the opt_strstr branch from 13220d3 to 40a8df4 Compare September 13, 2024 04:26

skadilover requested a review from Yuhta September 13, 2024 04:28

skadilover force-pushed the opt_strstr branch 2 times, most recently from 0e2a09d to b6c6706 Compare September 13, 2024 14:00

Yuhta reviewed Sep 20, 2024

View reviewed changes

velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved

skadilover force-pushed the opt_strstr branch 2 times, most recently from 7aa4d26 to 446f09d Compare September 26, 2024 03:04

Yuhta reviewed Oct 2, 2024

View reviewed changes

velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved

skadilover force-pushed the opt_strstr branch from 446f09d to cb6e208 Compare October 8, 2024 02:00

skadilover requested a review from Yuhta October 8, 2024 02:00

skadilover force-pushed the opt_strstr branch from cb6e208 to ce1f33c Compare October 15, 2024 01:29

skadilover requested review from assignUser and majetideepak as code owners October 15, 2024 01:29

skadilover force-pushed the opt_strstr branch from ce1f33c to 480c3ae Compare October 18, 2024 03:33

skadilover force-pushed the opt_strstr branch from 480c3ae to a16bf39 Compare November 1, 2024 09:38

skadilover force-pushed the opt_strstr branch from a16bf39 to 08faa90 Compare November 5, 2024 05:57

add simd strstr

1c11800

add ut / benchmark check the last page

skadilover force-pushed the opt_strstr branch from 08faa90 to 1c11800 Compare November 7, 2024 12:12

Yuhta approved these changes Nov 14, 2024

View reviewed changes

facebook-github-bot closed this in fe87f0b Nov 14, 2024

facebook-github-bot added the Merged label Nov 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a faster string-search method with simd instructions than std::find #10858

Add a faster string-search method with simd instructions than std::find #10858

skadilover commented Aug 27, 2024 •

edited

Loading

netlify bot commented Aug 27, 2024 •

edited

Loading

skadilover commented Aug 29, 2024 •

edited

Loading

Yuhta left a comment

skadilover commented Sep 9, 2024 •

edited

Loading

skadilover commented Sep 9, 2024 •

edited

Loading

Yuhta left a comment

skadilover commented Sep 13, 2024 •

edited

Loading

skadilover commented Sep 19, 2024

skadilover commented Sep 25, 2024

skadilover commented Oct 17, 2024

skadilover commented Nov 1, 2024

assignUser commented Nov 1, 2024

facebook-github-bot commented Nov 14, 2024

facebook-github-bot commented Nov 14, 2024

Add a faster string-search method with simd instructions than std::find #10858

Add a faster string-search method with simd instructions than std::find #10858

Conversation

skadilover commented Aug 27, 2024 • edited Loading

Description

Solution

Benchmark : std::find vs simdStrStr:

References

netlify bot commented Aug 27, 2024 • edited Loading

✅ Deploy Preview for meta-velox canceled.

skadilover commented Aug 29, 2024 • edited Loading

Yuhta left a comment

Choose a reason for hiding this comment

skadilover commented Sep 9, 2024 • edited Loading

skadilover commented Sep 9, 2024 • edited Loading

Yuhta left a comment

Choose a reason for hiding this comment

skadilover commented Sep 13, 2024 • edited Loading

skadilover commented Sep 19, 2024

skadilover commented Sep 25, 2024

skadilover commented Oct 17, 2024

skadilover commented Nov 1, 2024

assignUser commented Nov 1, 2024

facebook-github-bot commented Nov 14, 2024

facebook-github-bot commented Nov 14, 2024

skadilover commented Aug 27, 2024 •

edited

Loading

netlify bot commented Aug 27, 2024 •

edited

Loading

skadilover commented Aug 29, 2024 •

edited

Loading

skadilover commented Sep 9, 2024 •

edited

Loading

skadilover commented Sep 9, 2024 •

edited

Loading

skadilover commented Sep 13, 2024 •

edited

Loading