Faster `h.findBucket(v)` #1662

imorph · 2024-10-27T14:22:02Z

What problem this PR is solving

Hi!
It is part "two" of exploration for optimization possibilities for cockroachdb/cockroach#133306 (part "one" is here). Second largest contributor in original issue was findBucket function. Closer look at that function revealed some excellent suggestion from @beorn7 about smaller-sized arrays and linear search:

	// TODO(beorn7): For small numbers of buckets (<30), a linear search is
	// slightly faster than the binary search. If we really care, we could
	// switch from one search strategy to the other depending on the number
	// of buckets.
	//
	// Microbenchmarks (BenchmarkHistogramNoLabels):
	// 11 buckets: 38.3 ns/op linear - binary 48.7 ns/op
	// 100 buckets: 78.1 ns/op linear - binary 54.9 ns/op
	// 300 buckets: 154 ns/op linear - binary 61.6 ns/op

I decided to explore it in isolation and this is a result: https://github.com/imorph/go-search-research

TLDR:

linear search indeed faster for arrays < 35 elements (regardless position of the needle in haystack, but the closer to beginning the faster search)
linear search always outperforms binary search in case of +Inf bucket (regardless the size)
both search variants are performing the same (under lower bounds) when executed in parallel
sophisticated loop-unrolling implementation of linear search is not worth it

This PR is proposal to use this "adaptive" implementation

Design

use simple linear search for everything under 35 elements
check for +Inf before calling into sort.SearchFloat64s
use stdlib's sort.SearchFloat64s for everything else

Results

as expected short histograms performing better by 50%
in case of +Inf new implementation is better by 90% (no need to execute full binary search just check last element)
(!!!) mid/large/huge tests are faster as well by 12%-16%, i have no other explanation for that other than that early exit check are touching slice which leads to prefetching it into CPU cache and then binary search on it running faster

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Intel(R) Xeon(R) Platinum 8488C
                  │ ../../fbOLD  │             ../../fbNEW             │
                  │    sec/op    │   sec/op     vs base                │
FindBucketShort-8   13.450n ± 0%   6.365n ± 0%  -52.67% (p=0.000 n=10)
FindBucketMid-8      16.40n ± 1%   14.32n ± 0%  -12.68% (p=0.000 n=10)
FindBucketLarge-8    18.95n ± 0%   16.41n ± 1%  -13.43% (p=0.000 n=10)
FindBucketHuge-8     23.91n ± 0%   20.00n ± 0%  -16.35% (p=0.000 n=10)
FindBucketNone-8    23.835n ± 0%   2.012n ± 1%  -91.56% (p=0.000 n=10)
geomean              18.86n        9.036n       -52.08%

@vesari
@ArthurSens
@bwplotka
@kakkoyun

ArthurSens · 2024-10-28T22:30:05Z

That's really cool @imorph, I didn't have the time to look deep into this but just skimmed over the changes. One question, how did you come up with the 35 limit? Was just intuition or backed by evidence? There's no wrong answer, I'm just curious :)

imorph · 2024-10-29T07:04:45Z

@ArthurSens I made bite-size research here -> https://github.com/imorph/go-search-research where benchmarks were executed for many different lengths of slice. On both x86 and M3 slice length where binary search starts to take over was somewhere in between 30 and 40 (it also depends on a position of an element in a slice - the closer to beginning the lengthier slice should be for binary search to win).

kakkoyun · 2024-10-31T14:36:14Z

I plan to review this PR this weekend. Sorry for the tardiness.

imorph · 2024-11-02T16:36:37Z

One more before/after diff on another CPU (+ one test added):

goos: linux
goarch: amd64
pkg: github.com/prometheus/client_golang/prometheus
cpu: INTEL(R) XEON(R) GOLD 5512U
                   │     orig      │              new_35_low              │
                   │    sec/op     │    sec/op     vs base                │
FindBucketShort-56    19.74n ±  1%   11.34n ±  3%  -42.53% (p=0.000 n=10)
FindBucketMid-56      23.69n ±  1%   21.59n ±  1%   -8.85% (p=0.000 n=10)
FindBucketLarge-56    27.18n ± 14%   24.35n ±  0%  -10.41% (p=0.022 n=10)
FindBucketHuge-56     35.07n ±  1%   30.57n ±  1%  -12.84% (p=0.000 n=10)
FindBucketInf-56     34.810n ± 14%   3.140n ± 15%  -90.98% (p=0.000 n=10)
FindBucketLow-56     30.305n ±  0%   2.425n ± 42%  -92.00% (p=0.000 n=10)
geomean               27.88n         10.56n        -62.12%

ArthurSens

Similar to the other PR, let's add a comment explaining that numbers were chosen based on empirical testing while linking to your PR description.

I've added some comments about early exits and then LGTM!

ArthurSens · 2024-11-03T14:08:18Z

prometheus/histogram.go

+	n := len(h.upperBounds)
+
+	// Early exit: if v is less than or equal to the first upper bound, return 0
+	if v <= h.upperBounds[0] {
+		return 0
+	}


Should we swap the order here between the early exit and n := ... just so it truly is an early exit? :o)

Seems like there is cases when len(h.upperBounds) == 0 so we should first check for len otherwise TestNativeHistogram panics at factor 1.1 results in schema 3

ArthurSens · 2024-11-03T14:13:22Z

prometheus/histogram.go

+	// Early exit: if v is greater than the last upper bound, return len(h.upperBounds)
+	if v > h.upperBounds[n-1] {
+		return n
+	}


Could we move this part upwards? Even before the linear search?

ArthurSens · 2024-11-03T14:24:22Z

By the way, awesome research here: https://github.com/imorph/go-search-research

ArthurSens · 2024-11-04T14:06:26Z

@imorph looks like you added some extra commits by accident 🤔

imorph · 2024-11-04T15:47:33Z

@imorph looks like you added some extra commits by accident 🤔

yes, sorry for the this git-mess =(

first I FF-merged main (with fixed test to make CI green again)
then "early exit re-arrangements" were pushed without sign-off (so I followed directions to rewrite DCO'less commit)
then one of the tests start to failing (turns out this n==0 check were needed on top of all other logic) and i reverted back a bit

I can reset to main this branch, and put everything new on top for more clearer picture. WDYT?

ArthurSens · 2024-11-04T15:59:04Z

No worries!

Yeah, ideally we have only the relevant commits in this PR, so rebasing on main sounds like a good idea

imorph · 2024-11-04T19:38:10Z

@ArthurSens this PR closed itself after me trying to start from fresh main.

#1673 is what it should be from the start.

latest version comparison with main on apple M3:

goos: darwin
goarch: arm64
pkg: github.com/prometheus/client_golang/prometheus
cpu: Apple M3 Max
                   │  ../../main  │          ../../branch_alt           │
                   │    sec/op    │   sec/op     vs base                │
FindBucketShort-14    8.394n ± 1%   4.240n ± 1%  -49.49% (p=0.000 n=10)
FindBucketMid-14      10.73n ± 1%   10.92n ± 0%   +1.86% (p=0.000 n=10)
FindBucketLarge-14    13.73n ± 0%   13.95n ± 1%   +1.60% (p=0.000 n=10)
FindBucketHuge-14     19.05n ± 1%   19.38n ± 1%   +1.76% (p=0.000 n=10)
FindBucketInf-14     19.230n ± 1%   1.359n ± 1%  -92.94% (p=0.000 n=10)
FindBucketLow-14     22.010n ± 1%   1.344n ± 2%  -93.90% (p=0.000 n=10)
geomean               14.67n        5.327n       -63.68%

interesting that I don't see any significant improvements for basic cases (that were documented in PR description). May be this is some intel specifics or measurement artefact

imorph force-pushed the faster_find_bucket branch from cce97b4 to e3911b2 Compare October 27, 2024 14:39

imorph changed the title ~~Faster find bucket~~ Faster h.findBucket(v) Oct 27, 2024

ArthurSens mentioned this pull request Oct 31, 2024

Flaky test: TestSummaryDecay #1666

Closed

ArthurSens reviewed Nov 3, 2024

View reviewed changes

imorph force-pushed the faster_find_bucket branch from d6d7c58 to aaa3f32 Compare November 4, 2024 10:00

imorph closed this Nov 4, 2024

imorph force-pushed the faster_find_bucket branch from e3cc4e4 to 0c73c1c Compare November 4, 2024 19:24

imorph mentioned this pull request Nov 4, 2024

Duplicate of #1662 #1673

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster `h.findBucket(v)` #1662

Faster `h.findBucket(v)` #1662

imorph commented Oct 27, 2024

ArthurSens commented Oct 28, 2024 •

edited

Loading

imorph commented Oct 29, 2024

kakkoyun commented Oct 31, 2024

imorph commented Nov 2, 2024

ArthurSens left a comment

ArthurSens Nov 3, 2024

imorph Nov 4, 2024

ArthurSens Nov 3, 2024

ArthurSens commented Nov 3, 2024

ArthurSens commented Nov 4, 2024

imorph commented Nov 4, 2024

ArthurSens commented Nov 4, 2024

imorph commented Nov 4, 2024

Faster h.findBucket(v) #1662

Faster h.findBucket(v) #1662

Conversation

imorph commented Oct 27, 2024

What problem this PR is solving

Design

Results

ArthurSens commented Oct 28, 2024 • edited Loading

imorph commented Oct 29, 2024

kakkoyun commented Oct 31, 2024

imorph commented Nov 2, 2024

ArthurSens left a comment

Choose a reason for hiding this comment

ArthurSens Nov 3, 2024

Choose a reason for hiding this comment

imorph Nov 4, 2024

Choose a reason for hiding this comment

ArthurSens Nov 3, 2024

Choose a reason for hiding this comment

ArthurSens commented Nov 3, 2024

ArthurSens commented Nov 4, 2024

imorph commented Nov 4, 2024

ArthurSens commented Nov 4, 2024

imorph commented Nov 4, 2024

Faster `h.findBucket(v)` #1662

Faster `h.findBucket(v)` #1662

ArthurSens commented Oct 28, 2024 •

edited

Loading