Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a faster string-search method with simd instructions than std::find #10858

Closed
wants to merge 1 commit into from

Conversation

skadilover
Copy link
Contributor

@skadilover skadilover commented Aug 27, 2024

Description

In practice, c_strstr is always faster than std::find in most cases, refer to [1]. However, the c_strstr interface will truncate '\0', which makes it unable to handle unicode characters. In addition, glibc uses two-way- search, see[2], but it does not seem to be exposed as a common interface. It requires some additional code to determine the best path. In some scenarios, the decision-making code may cause performance degradation.

Solution

There are many optimization algorithms for string-search, refer to [7], but they cannot change the essence of the problem, which is a search process of O(n*m) complexity. I think the most common way to improve performance is to optimize the search instructions. For example, doris (the same is true for clickhouse, this part of the code of doris was copied from clickhouse, and the Volnitsky[11] part was removed), see [5].
This PR refers to the implementation of [5], [6], [8] :
[5] doris : first-two comapre first, and the issue of page-safe was also mentioned
[6] stringzilla: first-mid-lst compare first
[8] simd-strstr implement by WojciechMula, first-lst comapre first.

They all look few chars first, the more chars you choose to look the higher probability of hitting the optimized path, and the more cost of 'optimized path' itself.

In the end, the solution I chose is to first compare first-last-chars and then compare the remaining bytes. This is because in practice, our users seem to be able to get the matching results through first-last-chars.

ut & benchmark
This part of the code refers to folly’s test code, see [9] [10], of course, some adjustments have been made based on the code implementation.

Benchmark : std::find vs simdStrStr:

findSuccessful(opt_50_5) 31.66ms 31.59
findSuccessful(opt_100_10) 36.81ms 27.16
findSuccessful(opt_100_20) 96.47ms 10.37
findSuccessful(opt_1k_10) 63.27ms 15.81
findSuccessful(opt_1k_100) 156.95ms 6.37
findSuccessful(std_50_5) 45.24ms 22.11
findSuccessful(std_100_10) 191.23ms 5.23
findSuccessful(std_100_20) 322.11ms 3.10
findSuccessful(std_1k_10) 122.69ms 8.15
findSuccessful(std_1k_100) 168.10ms 5.95
findUnsuccessful(std_first_char_match) 819.13ms 1.22
findUnsuccessful(opt_first_char_match) 261.46ms 3.82
findUnsuccessful(std_first_char_unmatch) 249.12ms 4.01
findUnsuccessful(opt_first_char_unmatch) 199.02ms 5.02

References

[1] gcc std::search issue:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66414
[2] glibc strstr implements:
https://github.com/lattera/glibc/blob/master/string/strstr.c
[3] two way search:
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
[4] std::boyer_moore_searcher
https://en.cppreference.com/w/cpp/utility/functional/boyer_moore_searcher
https://github.com/mclow/search-library?tab=readme-ov-file
[5] doris string-search
https://github.com/apache/doris/blob/4e326e5f496109edb253d094869d057573ab7a59/be/src/vec/common/string_searcher.h#L44
[6] stringzilla
https://github.com/ashvardanian/StringZilla/blob/main/include/stringzilla/stringzilla.h#L2187
[7] string-search algorithms
https://www-igm.univ-mlv.fr/~lecroq/string/
[8] sse4 strst
https://github.com/WojciechMula/sse4-strstr
[9] folly benchmark
https://github.com/facebook/folly/blob/ce5edfb9b08ead9e78cb46879e7b9499861f7cd2/folly/test/FBStringTestBenchmarks.cpp.h
[10] folly ut
https://github.com/facebook/folly/blob/ce5edfb9b08ead9e78cb46879e7b9499861f7cd2/folly/test/FBStringTest.cpp#L1277
[11] clickhouse Volnitsky
https://github.com/ClickHouse/ClickHouse/blob/a51f867ce014a4cb8f97c46e004b90f5feafd80f/src/Common/Volnitsky.h#L482

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 27, 2024
Copy link

netlify bot commented Aug 27, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 1c11800
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/672cae94423b1b0008f0bcb1

@Yuhta Yuhta self-requested a review August 27, 2024 15:16
@skadilover skadilover force-pushed the opt_strstr branch 7 times, most recently from e99e699 to 457860e Compare August 29, 2024 02:30
@skadilover skadilover changed the title Add a fast simd strstr Add a fast string-search method named simdStrStr Aug 29, 2024
@skadilover skadilover changed the title Add a fast string-search method named simdStrStr Add a fast string-search method with simd instructions Aug 29, 2024
@skadilover
Copy link
Contributor Author

skadilover commented Aug 29, 2024

@Yuhta
ping~

from #10731

I don't think you can put variable definition in header file, you probably need to move it to SimdUtil.cpp and have only forward declaration here. Also what is the value if we do MADV_HUGEPAGE for some addresses?

I found that the performance of the code will be reduced by 30% when placed in the cpp file. So for the case, I change the code like this :

  static const int kPageSize = sysconf(_SC_PAGESIZE);

====>
FOLLY_ALWAYS_INLINE bool pageSafe(const void* const ptr, size_t length) {
  static const int kPageSize = sysconf(_SC_PAGESIZE);
  return ((kPageSize - 1) & reinterpret_cast<std::uintptr_t>(ptr)) <=
      kPageSize - CharVector::size - length;
}

MADV_HUGEPAGE for some addresses

I haven't found a good way to handle this situation, could you give me some advice?
Could we assume MADV_HUGEPAGE is always times of _SC_PAGESIZE , 2M vs 4k etc. see https://hugekernel.org/doc/Documentation/admin-guide/mm/transhuge.rst

Modern kernels support "multi-size THP" (mTHP), which introduces the
ability to allocate memory in blocks that are bigger than a base page
but smaller than traditional PMD-size (as described above), in
increments of a power-of-2 number of pages. mTHP can back anonymous
memory (for example 16K, 32K, 64K, etc)

RedHat define the valid values:
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-transhuge#s-memory-configure_hugepages

Defines the size of persistent huge pages configured in the kernel at boot time. Valid values are 2 MB and 1 GB. The default value is 2 MB.

@skadilover skadilover changed the title Add a fast string-search method with simd instructions Add a faster string-search method with simd instructions than std::find Aug 29, 2024
@skadilover skadilover force-pushed the opt_strstr branch 3 times, most recently from 991be1a to 54422e3 Compare August 30, 2024 02:14
Copy link
Contributor

@Yuhta Yuhta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider the other suggestions in #10731 (review)

velox/common/base/SimdUtil-inl.h Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
@skadilover
Copy link
Contributor Author

skadilover commented Sep 9, 2024

@Yuhta

Please consider the other suggestions in #10731 (review)

Updates code for the suggestions in #10731:

These checks can be VELOX_DCHECK since their correctness can be ensured quite locally in the same file

done

VELOX_SIMD_STRSTR_CASE

done

aside: Wondering what is the minimal k to make KMP faster than this. This is also related to content of the strings though.

I have extended StringBenchmark to support different algrithm, and kmp is always slower than simdStrStr(maybe my implement is not good so I add std::boyer_moore_searcher for comparison):

findSuccessful(simd_50_to_5)                              344.39ms      2.90
findSuccessful(std_50_to_5)                               802.61ms      1.25
findSuccessful(std_boyer_moore_50_to_5)                      1.42s   703.66m
findSuccessful(kmp_50_to_5)                                  5.23s   191.05m
findSuccessful(simd_100_to_10)                            315.82ms      3.17
findSuccessful(std_100_to_10)                                1.26s   790.97m
findSuccessful(std_boyer_moore_100_to_10)                 984.90ms      1.02
findSuccessful(kmp_100_to_10)                                6.38s   156.70m
findSuccessful(simd_100_to_20)                            352.90ms      2.83
findSuccessful(std_100_to_20)                                2.03s   492.18m
findSuccessful(std_boyer_moore_100_to_20)                 887.87ms      1.13
findSuccessful(kmp_100_to_20)                                6.16s   162.34m
findSuccessful(simd_1000_to_10)                           254.53ms      3.93
findSuccessful(std_1000_to_10)                            698.50ms      1.43
findSuccessful(std_boyer_moore_1000_to_10)                683.83ms      1.46
findSuccessful(kmp_1000_to_10)                               3.33s   300.39m
findSuccessful(simd_1000_to_100)                           65.42ms     15.29
findSuccessful(std_1000_to_100)                           186.01ms      5.38
findSuccessful(std_boyer_moore_1000_to_100)               381.65ms      2.62
findSuccessful(kmp_1000_to_100)                           913.63ms      1.09
findSuccessful(simd_1000_to_200)                          306.99ms      3.26
findSuccessful(std_1000_to_200)                           642.04ms      1.56
findSuccessful(std_boyer_moore_1000_to_200)               683.16ms      1.46
findSuccessful(kmp_1000_to_200)                              9.70s   103.05m
findUnsuccessful(std_first_char_match)                    400.47ms      2.50
findUnsuccessful(opt_first_char_match)                    266.45ms      3.75
findUnsuccessful(std_first_char_unmatch)                  333.54ms      3.00
findUnsuccessful(opt_first_char_unmatch)                  284.63ms      3.51
simd strstr std find std boyer moore kmp
heystack :50 needle :5 344ms 802ms 1.42s 5.23s
heystack : 100 needle :10 315ms 1.26s 984ms 6.38s
heystack :100 needle :20 449ms 2.03s 887ms 1.13s
heystack :1000 needle :10 254ms 689ms 693ms 3.3s
heystack :1000 needle :100 65ms 186ms 381ms 913ms
heystack :1000 needle :200 306ms 642ms 683ms 9.7s

Since this is an almost drop-in replacement for strstr, can you add another test to generate a few thousands of random string pairs with various lengths, and check our result is the same as strstr? Also make sure you cover corner cases like larger needle than haystack, etc.

I update ut refer to folly uts, add random test cases.

Could you have look at this once more?

@skadilover
Copy link
Contributor Author

skadilover commented Sep 9, 2024

@Yuhta ping~

Copy link
Contributor

@Yuhta Yuhta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark result looks very good

velox/common/base/SimdUtil-inl.h Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Show resolved Hide resolved
@skadilover
Copy link
Contributor Author

skadilover commented Sep 13, 2024

@Yuhta update code

  1. pageSize => global variable
  2. fix comments above except this:

Will we miss the opportunity if it is a super long string that is spanning multiple pages?

As we inline kPageSize, pageSafe method is faster than before, I move the check to each batch iteration, we only miss the comparison-span-pages case, maybe better, like clickhouse:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/StringSearcher.h

            while (haystack < haystack_end)
            {
                if (haystack + N <= haystack_end && isPageSafe(haystack))
                {

new benchmark result:

findSuccessful(simd_50_to_5)                              355.40ms      2.81
findSuccessful(std_50_to_5)                               807.73ms      1.24
findSuccessful(std_boyer_moore_50_to_5)                      1.40s   712.05m
findSuccessful(kmp_50_to_5)                                  5.18s   192.87m
findSuccessful(simd_100_to_10)                            319.76ms      3.13
findSuccessful(std_100_to_10)                                1.25s   802.85m
findSuccessful(std_boyer_moore_100_to_10)                 910.54ms      1.10
findSuccessful(kmp_100_to_10)                                6.38s   156.72m
findSuccessful(simd_100_to_20)                            355.43ms      2.81
findSuccessful(std_100_to_20)                                2.07s   483.83m
findSuccessful(std_boyer_moore_100_to_20)                 901.23ms      1.11
findSuccessful(kmp_100_to_20)                                6.20s   161.18m
findSuccessful(simd_1000_to_10)                           245.61ms      4.07
findSuccessful(std_1000_to_10)                            724.69ms      1.38
findSuccessful(std_boyer_moore_1000_to_10)                870.93ms      1.15
findSuccessful(kmp_1000_to_10)                               3.33s   300.34m
findSuccessful(simd_1000_to_100)                           69.67ms     14.35
findSuccessful(std_1000_to_100)                           186.06ms      5.37
findSuccessful(std_boyer_moore_1000_to_100)               286.45ms      3.49
findSuccessful(kmp_1000_to_100)                           913.97ms      1.09
findSuccessful(simd_1000_to_200)                          354.15ms      2.82
findSuccessful(std_1000_to_200)                           642.50ms      1.56
findSuccessful(std_boyer_moore_1000_to_200)               675.70ms      1.48
findSuccessful(kmp_1000_to_200)                              9.78s   102.28m
findUnsuccessful(std_first_char_match)                    920.16ms      1.09
findUnsuccessful(opt_first_char_match)                    383.93ms      2.60
findUnsuccessful(std_first_char_unmatch)                  334.28ms      2.99
findUnsuccessful(opt_first_char_unmatch)                  331.98ms      3.01

@skadilover skadilover requested a review from Yuhta September 13, 2024 04:28
@skadilover skadilover force-pushed the opt_strstr branch 2 times, most recently from 0e2a09d to b6c6706 Compare September 13, 2024 14:00
@skadilover
Copy link
Contributor Author

@Yuhta ping~

@skadilover
Copy link
Contributor Author

@Yuhta

Any more comments about this ?

@skadilover skadilover force-pushed the opt_strstr branch 2 times, most recently from 7aa4d26 to 446f09d Compare September 26, 2024 03:04
@skadilover
Copy link
Contributor Author

@Yuhta ping ~

@skadilover
Copy link
Contributor Author

@assignUser @majetideepak @Yuhta
Any feedback for this PR?

@assignUser
Copy link
Collaborator

@skadilover CMake looks fine, can't comment on the rest :)

add ut / benchmark

check the last page
@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@Yuhta merged this pull request in fe87f0b.

athmaja-n pushed a commit to athmaja-n/velox that referenced this pull request Jan 10, 2025
…nd (facebookincubator#10858)

Summary:
### Description
In practice, c_strstr is always faster than std::find in most cases, refer to [1]. However, the c_strstr interface will truncate '\0', which makes it unable to handle unicode characters. In addition, glibc uses two-way- search, see[2], but it does not seem to be exposed as a common interface. It requires some additional code to determine the best path. In some scenarios, the decision-making code may cause performance degradation.

### Solution
There are many optimization algorithms for string-search, refer to [7], but they cannot change the essence of the problem, which is a search process of O(n*m) complexity. I think the most common way to improve performance is to optimize the search instructions. For example, doris (the same is true for clickhouse, this part of the code of doris was copied from clickhouse, and the Volnitsky[11] part was removed), see [5].
This PR refers to the implementation of [5], [6], [8] :
[5] doris : first-two comapre first, and  the issue of page-safe was also mentioned
[6] stringzilla: first-mid-lst compare first
[8] simd-strstr implement by WojciechMula, first-lst comapre first.

They all look few chars first, the more chars you choose to look the higher probability of hitting the optimized path, and the more cost  of 'optimized path' itself.

In the end, the solution I chose is to first compare first-last-chars and then compare the remaining bytes. This is because in practice, our users seem to be able to get the matching results through first-last-chars.

**ut & benchmark**
This part of the code refers to folly’s test code, see [9] [10], of course, some adjustments have been made based on the code implementation.

### Benchmark : std::find vs simdStrStr:

findSuccessful(opt_50_5)                                   31.66ms     31.59
findSuccessful(opt_100_10)                                 36.81ms     27.16
findSuccessful(opt_100_20)                                 96.47ms     10.37
findSuccessful(opt_1k_10)                                  63.27ms     15.81
findSuccessful(opt_1k_100)                                156.95ms      6.37
findSuccessful(std_50_5)                                   45.24ms     22.11
findSuccessful(std_100_10)                                191.23ms      5.23
findSuccessful(std_100_20)                                322.11ms      3.10
findSuccessful(std_1k_10)                                 122.69ms      8.15
findSuccessful(std_1k_100)                                168.10ms      5.95
findUnsuccessful(std_first_char_match)                    819.13ms      1.22
findUnsuccessful(opt_first_char_match)                    261.46ms      3.82
findUnsuccessful(std_first_char_unmatch)                  249.12ms      4.01
findUnsuccessful(opt_first_char_unmatch)                  199.02ms      5.02

### References

[1] gcc std::search issue:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66414
[2] glibc strstr implements:
https://github.com/lattera/glibc/blob/master/string/strstr.c
[3] two way search:
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
[4] std::boyer_moore_searcher
https://en.cppreference.com/w/cpp/utility/functional/boyer_moore_searcher
https://github.com/mclow/search-library?tab=readme-ov-file
[5] doris string-search
https://github.com/apache/doris/blob/4e326e5f496109edb253d094869d057573ab7a59/be/src/vec/common/string_searcher.h#L44
[6] stringzilla
https://github.com/ashvardanian/StringZilla/blob/main/include/stringzilla/stringzilla.h#L2187
[7] string-search algorithms
https://www-igm.univ-mlv.fr/~lecroq/string/
[8] sse4 strst
https://github.com/WojciechMula/sse4-strstr
[9] folly benchmark
https://github.com/facebook/folly/blob/ce5edfb9b08ead9e78cb46879e7b9499861f7cd2/folly/test/FBStringTestBenchmarks.cpp.h
[10] folly ut
https://github.com/facebook/folly/blob/ce5edfb9b08ead9e78cb46879e7b9499861f7cd2/folly/test/FBStringTest.cpp#L1277
[11] clickhouse Volnitsky
https://github.com/ClickHouse/ClickHouse/blob/a51f867ce014a4cb8f97c46e004b90f5feafd80f/src/Common/Volnitsky.h#L482

Pull Request resolved: facebookincubator#10858

Reviewed By: pedroerp

Differential Revision: D65949876

Pulled By: Yuhta

fbshipit-source-id: e48929a7f61694da93fe3df4ce777909118b9312
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants