Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<functional>: better hash function #2360

Open
AlexGuteniev opened this issue Nov 26, 2021 · 9 comments
Open

<functional>: better hash function #2360

AlexGuteniev opened this issue Nov 26, 2021 · 9 comments
Labels
performance Must go faster vNext Breaks binary compatibility

Comments

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Nov 26, 2021

@lhecker observes FNV1a is bad for integers.

https://discord.com/channels/737189251069771789/737189251497721887/913600700846596147

vNext note: Resolving this issue will require breaking binary compatibility. We won't be able to accept pull requests for this issue until the vNext branch is available. See #169 for more information.

Also tracked by VSO-371547 / AB#371547 .

@AlexGuteniev AlexGuteniev changed the title <functional>: better has function <functional>: better hash function Nov 26, 2021
@AlexGuteniev
Copy link
Contributor Author

#686 (comment) mentions SipHash-2-4

@lhecker
Copy link
Member

lhecker commented Nov 26, 2021

I did some benchmarks based on https://github.com/rurban/smhasher
The test involves putting numbers 0 to 127 into a std::unordered_map<uint32_t, uint32_t>, choosing 128 random keys and calling find() for at least 100ms for each of those and computing the average time per random key. The table below contains the mean/median/stddev between those 128 randomly chosen keys.

Benchmark Time CPU
FNV1a
mean
median
stddev

1.660 ns
1.650 ns
0.035 ns

1.670 ns
1.560 ns
0.168 ns
Murmur3F finalizer
mean
median
stddev

1.640 ns
1.480 ns
0.274 ns

1.640 ns
1.400 ns
0.314 ns
XXH3
mean
median
stddev

2.340 ns
2.040 ns
0.411 ns

2.340 ns
2.090 ns
0.500 ns
SipHash-2-4
mean
median
stddev

8.810 ns
8.550 ns
0.488 ns

8.820 ns
9.340 ns
1.020 ns
SipHash-1-3
mean
median
stddev

7.030 ns
6.680 ns
0.524 ns

7.030 ns
6.250 ns
0.980 ns

While SipHash seems promising in preventing hash flood attacks, the actual paper describes it as fast in relation to other MACs and not other PRFs. The claim that it's fast for short messages in general appears to be an overstatement by easily 4x. Considering things like VSO-653642 (complaint about 2x std::hash performance regression) it might be difficult to introduce SipHash into a language like C++.

@StephanTLavavej StephanTLavavej added performance Must go faster vNext Breaks binary compatibility labels Dec 1, 2021
@StephanTLavavej
Copy link
Member

Thanks - the SipHash perf data is definitely news to me, so we'll need to consider the available options carefully.

@cbezault
Copy link
Contributor

cbezault commented Dec 1, 2021

I would consider also looking at SipHash13 rust-lang/rust#29754 (comment)

@AlexGuteniev
Copy link
Contributor Author

@BillyONeal also mentioned a security consideration:

the hash values are not the same in successive runs of the program to make unordered_Xxx safe to use with untrusted input out of the box.

But I'm not sure if this can be implemented with insignificant perf impact.

@BillyONeal
Copy link
Member

I would argue that the default, user doesn't know any better, option, should be the not exploitable one, even at a performance cost. Moreover, the resulting strength of the hash may help some programs by reducing collisions vs. other hash functions; although it will not for very small keys in relatively small containers like 127 int32s.

If you want the absolute fastest of juggling razor blades fast, unordered_map is not your port of call in the first place.

@AlexGuteniev
Copy link
Contributor Author

If you want the absolute fastest of juggling razor blades fast, unordered_map is not your port of call in the first place.

std::hash is not neccessarily for unordered_map, a proper hash table may use it either

@lhecker
Copy link
Member

lhecker commented Dec 2, 2021

@cbezault For short inputs the performance difference between SipHash-2-4 and 1-3 is minimal. I've updated the table above.

@BillyONeal Good point. I agree that a better, but slower hash function should be preferred as the default one for the stdlib. However SipHash isn't a stronger hash function than alternatives and it's rate of collisions is the same as for any other one that more or less passes the smhasher suite.

But personally I'm still against using SipHash as the default for this STL so far - at least according to what I know at the moment. I believe the absolute majority of code written with this STL are end-user, desktop applications and not publicly exposed web servers. In fact I'd explicitly argue that the number of web servers written in C++ and running on Windows is quite low. While the safety promises of SipHash are nice, it doesn't excuse the increase in computing power consumption for applications that don't require any such safety at all.

Before using a new hash function I believe there's a large number of alternatives we could first consider.
smhasher has useful reference code, results and benchmarks on lower end hardware, which could be used as a point of reference for us.

@StephanTLavavej
Copy link
Member

I agree that we'll need to explore the increasingly large space of alternatives (when vNext work resumes). On the STL Discord, I observed that C++'s concerns are somewhat different from Python's; notably, we don't use dictionaries as our universal data structure (so the impact of a hash function choice is dramatically smaller), and map is available as an alternative to unordered_map and although map is typically somewhat slower, it also offers absolute immunity against hash-denial-of-service attacks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster vNext Breaks binary compatibility
Projects
None yet
Development

No branches or pull requests

5 participants