`str` and `[u8]` should hash the same #27108
This makes using a `Tendril` as a `HashMap` key a little more palatable. Sadly, `str` and `[u8]` hash differently at present, so we shouldn't implement `Borrow<str>` for `StrTendril`. An alternative would be making the `Hash` implementations for `Tendril<F>` manual and making `Tendril<UTF8>` use the `str` `Hash` implementation and the rest use the `[u8]` one. See also rust-lang/rust#27108, which deals with fixing the underlying problem of the differing `Hash` implementations.
I'm all for this change as long as it's benchmarked to not cause a noticeable regression. (I agree with your assessment that it shouldn't.)
Hashing a string of length n will be n + 1 bytes before the change and n + 8 bytes after. SipHash adds one byte and hashes in 8-byte blocks, so very short strings are bumped up to the next block size, using 2 × 2 + 4 = 8 SipHash rounds vs. the previous 2 × 1 + 4 = 6. I'm sure you can measure the difference. Also, this is a change to what the […]
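A back-of-the-envelope sketch of that round count (my own illustration, not compiler code), assuming SipHash-c-d does c compression rounds per 8-byte block, always consuming one final (possibly partial) block that carries the length byte, plus d finalization rounds:

```rust
// Rough model of SipHash-c-d work: c compression rounds per 8-byte block
// (the final partial block is always hashed), plus d finalization rounds.
fn sip_rounds(c: usize, d: usize, message_bytes: usize) -> usize {
    let blocks = message_bytes / 8 + 1;
    c * blocks + d
}

fn main() {
    let n = 3; // a very short string
    // str today: n bytes plus one 0xff terminator -> one block, 6 rounds
    assert_eq!(sip_rounds(2, 4, n + 1), 6);
    // [u8] today: 8-byte usize length prefix plus n bytes -> two blocks, 8 rounds
    assert_eq!(sip_rounds(2, 4, 8 + n), 8);
}
```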
This would be optimized if rust-lang/rfcs#1666 is accepted.
@arthurprs I don't see how rust-lang/rfcs#1666 would have any bearing on this. All this should need is:

```diff
diff --git a/src/libcore/hash/mod.rs b/src/libcore/hash/mod.rs
index 051eb97..85427d0 100644
--- a/src/libcore/hash/mod.rs
+++ b/src/libcore/hash/mod.rs
@@ -328,8 +328,7 @@ mod impls {
     #[stable(feature = "rust1", since = "1.0.0")]
     impl Hash for str {
         fn hash<H: Hasher>(&self, state: &mut H) {
-            state.write(self.as_bytes());
-            state.write_u8(0xff)
+            self.as_bytes().hash(state)
         }
     }
```
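For context, a quick check of the status quo that this patch would change (a sketch; `hash_of` is my own helper, not std API): on current Rust the two produce different values because `str` appends 0xff while `[u8]` prepends its length.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Helper (mine, for illustration): hash a value with the default hasher.
fn hash_of<T: Hash + ?Sized>(v: &T) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

fn main() {
    // Different byte streams are fed to the hasher ("abc" + 0xff vs.
    // length 3 + "abc"), so the results differ.
    assert_ne!(hash_of("abc"), hash_of(&b"abc"[..]));
}
```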
Yes, but `&[u8].hash()` is internally […]
And that […]
I agree that it would be desirable for […]
We can always make the `&[u8]` hash similar to `&str` (appending 0xFF) rather than the other way around if any performance regression is a concern. It's purely an implementation detail.
My preliminary benchmarking of hashing […] Here's one of my tests:

```rust
#![feature(test)]

extern crate test;

use std::collections::hash_map::DefaultHasher;
use std::hash::Hash;
use test::Bencher;

#[bench]
fn hash_str(b: &mut Bencher) {
    let mut hasher = DefaultHasher::new();
    b.iter(|| {
        include_str!("x.rs").hash(&mut hasher);
    });
}

#[bench]
fn hash_bytes(b: &mut Bencher) {
    let mut hasher = DefaultHasher::new();
    b.iter(|| {
        include_bytes!("x.rs").hash(&mut hasher);
    });
}
```

I had been thinking of specializing the […]
@arthurprs I presume the length thing is to help with types with few possibilities, e.g. zero-sized types, so that a two-element […]
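To illustrate the zero-sized-type point, a sketch (my own example, with `hash_of` as a hypothetical helper): each `()` contributes no bytes to the hasher, so only the slice length prefix keeps collections of different sizes apart.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Helper (mine, for illustration): hash a value with the default hasher.
fn hash_of<T: Hash>(v: &T) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

fn main() {
    // () writes nothing to the hasher, so without the length prefix
    // every Vec<()> would hash identically regardless of its length.
    let two: Vec<()> = vec![(); 2];
    let three: Vec<()> = vec![(); 3];
    assert_ne!(hash_of(&two), hash_of(&three));
}
```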
There's definitely a difference, but you need to use very small sequences, from 0 to 17 bytes. It'll also differ among hashers depending on the internal block size. I have plenty of test code around; I'll compile a test for this in a bit. @chris-morgan Precisely, they use different techniques to achieve the same thing (prefixing the usize len / appending a 0xFF).
FNV and other byte-at-a-time hashers will take a sizeable hit. SipHash has an internal block size of 8 and is more expensive overall, so it's less susceptible to the change.
Another real example of needing the length: tuples. They use no field separator themselves. We need to avoid hash-DoS with pairs of slices too.
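A sketch of the tuple point (my own example, with `hash_of` as a hypothetical helper): tuple fields are hashed back to back with no separator, so the per-slice length prefix is what keeps differently split byte boundaries distinct.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Helper (mine, for illustration): hash a value with the default hasher.
fn hash_of<T: Hash>(v: &T) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

fn main() {
    // Same concatenated bytes ("abc"), different split points.
    let a: (&[u8], &[u8]) = (b"ab", b"c");
    let b: (&[u8], &[u8]) = (b"a", b"bc");
    // The length prefix written for each slice acts as the field separator,
    // so the two tuples feed different byte streams to the hasher.
    assert_ne!(hash_of(&a), hash_of(&b));
}
```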
We could make `[u8]` append a 0xFF like `str`; this would essentially guarantee no perf regressions. In theory this could be a breaking change, though.
I return to what I suggested: specialisation of the implementation on […] I suppose it would also allow others to perform the same specialisation on their own integer newtype arrays if they really cared about it. I will also mention that […]

@arthurprs You'd need to do that for all […]
I may be missing something here. Isn't that already done? Line 559 in b32267f — and the len prefix comes from Line 647 in b32267f.
@arthurprs I'm talking about the implementation of […]
I feel we are talking about the same thing, though. What are you proposing exactly?
I was proposing specializing the actual […]

OK, so […]
Exactly. |
@arthurprs Appending 0xFF to `[u8]` does not protect against manufactured collisions. Example:

```rust
// The 257 (&[u8], &[u8]) tuples here all hash the same way
// if hashing uses 0xFF-termination instead of a length prefix.
let data = [0xFFu8; 256];
let mut map = HashSet::new();
for i in 0..data.len() + 1 {
    map.insert(data.split_at(i));
}
```

The problem with this is that it gives a hash-function-independent way of generating arbitrarily many collisions, so it's the hash-DoS problem, which we want to protect against by default, or at least when using the default hasher.
My understanding is that the len prefix (slices) and the 0xFF suffix (str) are added to prevent stuff like […]
That's the same as my split example. It's crucial that the 0xFF byte never appears in a str's representation. And it doesn't, by the UTF-8 invariant.
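That invariant is easy to check (a sketch): 0xFF is not a legal byte anywhere in well-formed UTF-8, so a trailing 0xff terminator can never collide with actual string contents.

```rust
fn main() {
    // 0xFF can never appear in valid UTF-8, alone or embedded,
    // so a 0xff terminator is unambiguous for str hashing.
    assert!(std::str::from_utf8(&[0xFF]).is_err());
    assert!(std::str::from_utf8(b"hi\xFFthere").is_err());
}
```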
Triage: not aware of any changes here |
`str` and `[u8]` hash differently and have done since the solution to #5257.

This is inconvenient in that it leads to things like the `StrTendril` type (from the `tendril` crate) hashing like a `[u8]` rather than like a `str`; one is thus unable to happily implement `Borrow<str>` on it in order to make `HashMap<StrTendril, _>` palatable.

`[u8]` gets its length prepended, while `str` gets 0xff appended. Sure, one u8 rather than one usize is theoretically cheaper to hash; but marginally so only, marginally so. I see no good reason why they should use different techniques, and so I suggest that `str` should be changed to use the hashing technique of `[u8]`. This will prevent potential nasty surprises and bring us back closer to the blissful land of "str is just [u8] with a UTF-8 guarantee".

Hashes being internal matters, I do not believe this should be considered a breaking change, but it would still probably be a release-notes-worthy change as it could conceivably break eldritch codes.
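The `Borrow<str>` pattern referred to above relies on the borrowed and owned forms hashing identically; a minimal sketch using `String` as a stand-in for `StrTendril` (which isn't reproduced here):

```rust
use std::collections::HashMap;

fn main() {
    // HashMap::get accepts &str for String keys because String: Borrow<str>
    // AND both types hash the same bytes. A Borrow<str> impl on a type
    // whose Hash matches [u8] would violate that contract and make
    // lookups silently miss.
    let mut map: HashMap<String, u32> = HashMap::new();
    map.insert("key".to_string(), 1);
    assert_eq!(map.get("key"), Some(&1));
}
```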