
str and [u8] should hash the same #27108

Open
chris-morgan opened this issue Jul 18, 2015 · 24 comments
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@chris-morgan
Member

str and [u8] hash differently and have done since the solution to #5257.

This is inconvenient in that it leads to things like the StrTendril type (from the tendril crate) hashing like a [u8] rather than like a str; one is thus unable to happily implement Borrow<str> on it in order to make HashMap<StrTendril, _> palatable.

[u8] gets its length prepended, while str gets 0xff appended. Sure, one u8 rather than one usize is theoretically cheaper to hash, but only marginally so. I see no good reason why they should use different techniques, and so I suggest that str should be changed to use the hashing technique of [u8]. This will prevent potential nasty surprises and bring us back closer to the blissful land of “str is just [u8] with a UTF-8 guarantee”.

Hashes being internal matters, I do not believe this should be considered a breaking change, but it would still probably be a release-notes-worthy change as it could conceivably break eldritch codes.
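
One way to see the difference concretely is a toy Hasher that just records the bytes it is fed (relying on the default write_u8/write_usize methods forwarding to write); a rough sketch:

use std::hash::{Hash, Hasher};
use std::mem;

// Toy hasher that records every byte written to it, so the two byte streams
// can be compared directly.
#[derive(Default)]
struct RecordingHasher(Vec<u8>);

impl Hasher for RecordingHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
    fn finish(&self) -> u64 {
        0 // unused; only the recorded bytes matter here
    }
}

fn main() {
    let mut h = RecordingHasher::default();
    "ab".hash(&mut h);
    // str: the UTF-8 bytes, then a single 0xff terminator.
    assert_eq!(h.0, [0x61, 0x62, 0xff]);

    let mut h = RecordingHasher::default();
    let bytes: &[u8] = b"ab";
    bytes.hash(&mut h);
    // [u8]: the usize length first, then the bytes in one contiguous write.
    assert_eq!(h.0.len(), mem::size_of::<usize>() + 2);
}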

@SimonSapin
Contributor

👍

chris-morgan added a commit to chris-morgan/tendril that referenced this issue Jul 18, 2015
This makes using a Tendril as a HashMap key a little more palatable.
Sadly, str and [u8] hash differently at present, so we shouldn’t
implement Borrow<str> for StrTendril. An alternative would be making
the Hash implementations for Tendril<F> manual and making Tendril<UTF8>
use the str Hash implementation and the rest use the [u8] one.
See also rust-lang/rust#27108 which deals with
fixing the underlying problem of the differing Hash implementations.
@Gankra
Contributor

Gankra commented Jul 18, 2015

I'm all for this change as long as it's benchmarked to not cause a noticeable regression. (I agree with your assessment that it shouldn't)

@bluss
Copy link
Member

bluss commented Aug 14, 2015

Hashing a string of length n will feed n + 1 bytes to the hasher before the change and n + 8 bytes after it. SipHash adds one byte and hashes in 8-byte blocks, so very short strings get bumped up to the next block size, using 2 * 2 + 4 = 8 SipHash rounds vs. the previous 2 * 1 + 4 = 6. I'm sure you can measure the difference.

Also, this is a change to what the Hash trait emits for str, so it's a breaking change no matter which hash function is used. I understand it can be treated as a minor change; I think it's fine to change this in a new Rust release.

@arthurprs
Contributor

This would be optimized if rust-lang/rfcs#1666 is accepted.

@chris-morgan
Member Author

@arthurprs I don’t see how rust-lang/rfcs#1666 would have any bearing on this. All this should need is

diff --git a/src/libcore/hash/mod.rs b/src/libcore/hash/mod.rs
index 051eb97..85427d0 100644
--- a/src/libcore/hash/mod.rs
+++ b/src/libcore/hash/mod.rs
@@ -328,8 +328,7 @@ mod impls {
     #[stable(feature = "rust1", since = "1.0.0")]
     impl Hash for str {
         fn hash<H: Hasher>(&self, state: &mut H) {
-            state.write(self.as_bytes());
-            state.write_u8(0xff)
+            self.as_bytes().hash(state)
         }
     }

@arthurprs
Contributor

arthurprs commented Aug 17, 2016

Yes, but &[u8].hash() is internally

    #[stable(feature = "rust1", since = "1.0.0")]
    impl<T: Hash> Hash for [T] {
        fn hash<H: Hasher>(&self, state: &mut H) {
            self.len().hash(state);
            Hash::hash_slice(self, state)
        }
    }

And that self.len().hash(state) would be optimized to a single branch (or nothing at all, if LLVM is that smart) if the RFC is accepted.

@steveklabnik steveklabnik added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed A-libs labels Mar 24, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-feature-request Category: A feature request, i.e: not implemented / a PR. label Jul 22, 2017
@dtolnay
Member

dtolnay commented Nov 16, 2017

I agree that it would be desirable for str and [u8] to hash the same. Does anyone have an estimate of the worst case, how much this would affect performance of realistic usage of hash map with string keys?

@arthurprs
Contributor

We can always make &[u8] hash the way &str does (appending 0xFF) rather than the other way around, if any performance regression is a concern. It's purely an implementation detail.

@chris-morgan
Member Author

My preliminary benchmarking of hashing [u8] and str shows no measurable difference on nightly-x86_64-pc-windows-msvc, regardless of data size (up to a test file of 3.2 megabytes)—it’s always well within ε (though admittedly ε is always at least ⅓ of the total), with one winning sometimes and the other at other times. I therefore think we can make this change without concern.

Here’s one of my tests:

#![feature(test)]

use std::hash::Hash;
use std::collections::hash_map::DefaultHasher;

extern crate test;
use test::Bencher;

#[bench]
fn hash_str(b: &mut Bencher) {
    let mut hasher = DefaultHasher::new();
    b.iter(|| {
        include_str!("x.rs").hash(&mut hasher);
    });
}

#[bench]
fn hash_bytes(b: &mut Bencher) {
    let mut hasher = DefaultHasher::new();
    b.iter(|| {
        include_bytes!("x.rs").hash(&mut hasher);
    });
}

I had been thinking of specializing the [T] implementation on all integer types, to make just one write call with the entire slice rather than one per element (as Hash::hash_slice does), but it looks like that may not be necessary; it may well be a zero-cost abstraction (though I haven’t looked at the ASM produced). (I presume we’re OK with libcore using specialization, seeing as it has #![feature(specialization)] in it.)

@chris-morgan
Member Author

@arthurprs I presume the length thing is to help with types with few possibilities, e.g. zero-sized types, so that a two-element [()] is more likely to hash differently to a three-element [()]. It bears a certain resemblance to techniques to strengthen against length-extension attacks, also. Not sure whether it’s useful in practice.
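
A small sketch of that point, using DefaultHasher (and assuming the current length-prefixed [T] implementation):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let two: &[()] = &[(); 2];
    let three: &[()] = &[(); 3];

    // The length prefix in the [T] impl keeps these distinct.
    assert_ne!(hash_of(two), hash_of(three));

    // Without it there would be nothing to distinguish them: hash_slice
    // writes nothing at all for (), so both hashers end up in the same state.
    let (mut a, mut b) = (DefaultHasher::new(), DefaultHasher::new());
    Hash::hash_slice(two, &mut a);
    Hash::hash_slice(three, &mut b);
    assert_eq!(a.finish(), b.finish());
}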

@arthurprs
Contributor

arthurprs commented Nov 16, 2017

There's definitely a difference, but you need to use very small sequences, from 0 to 17 bytes. It'll also differ among hashers depending on the internal block size. I have plenty of test code around; I'll compile a test for this in a bit.

@chris-morgan Precisely, they use different techniques to achieve the same thing (prefixing the usize len / appending a 0xFF).

@arthurprs
Contributor

FNV and other byte-at-a-time hashers will take a sizeable hit.

SipHash has an internal block size of 8 and is more expensive overall, so it's less susceptible to the change.

➜  hash-rs git:(rfc-extend-hasher) ✗ cargo benchcmp sip13::str_:: sip13::slice:: benches.txt
 name             sip13::str_:: ns/iter  sip13::slice:: ns/iter  diff ns/iter   diff % 
 bytes_000000001  13 (76 MB/s)           15 (66 MB/s)                       2   15.38% 
 bytes_000000002  13 (153 MB/s)          15 (133 MB/s)                      2   15.38% 
 bytes_000000004  13 (307 MB/s)          13 (307 MB/s)                      0    0.00% 
 bytes_000000007  18 (388 MB/s)          16 (437 MB/s)                     -2  -11.11% 
 bytes_000000008  15 (533 MB/s)          16 (500 MB/s)                      1    6.67% 
 bytes_000000009  16 (562 MB/s)          17 (529 MB/s)                      1    6.25% 
 bytes_000000012  15 (800 MB/s)          16 (750 MB/s)                      1    6.67% 
 bytes_000000016  16 (1000 MB/s)         18 (888 MB/s)                      2   12.50% 
 bytes_000000017  17 (1000 MB/s)         19 (894 MB/s)                      2   11.76% 
➜  hash-rs git:(rfc-extend-hasher) ✗ cargo benchcmp fnv::str_:: fnv::slice:: benches.txt
 name             fnv::str_:: ns/iter  fnv::slice:: ns/iter  diff ns/iter   diff % 
 bytes_000000001  1 (1000 MB/s)        4 (250 MB/s)                     3  300.00% 
 bytes_000000002  2 (1000 MB/s)        5 (400 MB/s)                     3  150.00% 
 bytes_000000004  2 (2000 MB/s)        5 (800 MB/s)                     3  150.00% 
 bytes_000000007  4 (1750 MB/s)        7 (1000 MB/s)                    3   75.00% 
 bytes_000000008  3 (2666 MB/s)        8 (1000 MB/s)                    5  166.67% 
 bytes_000000009  4 (2250 MB/s)        8 (1125 MB/s)                    4  100.00% 
 bytes_000000012  5 (2400 MB/s)        10 (1200 MB/s)                   5  100.00% 
 bytes_000000016  7 (2285 MB/s)        13 (1230 MB/s)                   6   85.71% 
 bytes_000000017  9 (1888 MB/s)        13 (1307 MB/s)                   4   44.44%

@bluss
Member

bluss commented Nov 16, 2017

Another real example of needing the length: tuples. They use no field separator themselves. We need to avoid hash-DoS with pairs of slices too.

@arthurprs
Contributor

We could make [u8] append a 0xFF like str; that would essentially guarantee no perf regressions. In theory this could be a breaking change, though.

@chris-morgan
Member Author

I return to what I suggested: specialisation of the implementation on [T] for all integer types? That resolves the perf issue in types like [u8].

I suppose it would also allow others to perform the same specialisation on their own integer newtype arrays if they really cared about it.

I will also mention that str can reasonably be changed to use a length prefix rather than a 0xff suffix quite independently of this, by implementing it the fast way rather than as “hash self.as_bytes()”, which can be the slow way.
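
A rough sketch of what that “fast way” could look like, shown on a hypothetical newtype since the real change would live in libcore’s impl Hash for str:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for str; the real change would be made to the
// libcore implementation itself.
struct LenPrefixedStr<'a>(&'a str);

impl<'a> Hash for LenPrefixedStr<'a> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Length prefix, as [u8] gets today…
        self.0.len().hash(state);
        // …but still a single contiguous write for the data itself.
        state.write(self.0.as_bytes());
    }
}

fn main() {
    let mut h = DefaultHasher::new();
    LenPrefixedStr("hello").hash(&mut h);
    println!("{:x}", h.finish());
}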

@arthurprs You’d need to do that for all [T], not just [u8], because if you use trait objects you lose the specialisation. Not sure if that change would be ideal for arbitrary types or not.

@arthurprs
Contributor

arthurprs commented Nov 17, 2017

I return to what I suggested: specialisation of the implementation on [T] for all integer types? That resolves the perf issue in types like [u8].

I may be missing something here. Isn't that already done?

fn hash_slice<H: Hasher>(data: &[$ty], state: &mut H) {

and the len prefix comes from

self.len().hash(state);

@chris-morgan
Member Author

@arthurprs I’m talking about the implementation of Hash for [T], not the hash_slice method on the implementations of Hash for integer types T.

@arthurprs
Contributor

I feel we are talking about the same thing, though. What are you proposing, exactly?

@chris-morgan
Member Author

chris-morgan commented Nov 18, 2017

I was proposing specializing the actual Hash implementation for [T] (turning impl<T: Hash> Hash for [T] into default impl<T: Hash> Hash for [T] plus the barrage of impl Hash for [u8] et al.); I had neglected to notice that the Hash implementations on those integer types already override the trait’s default method for hash_slice, thus obviating specialization of the [T] implementation (which would be a legitimate approach these days, potentially even preferable to the hash_slice method; as Hash already can’t be a trait object I see no downsides to it at a brief glance).

OK, so [u8] and str do hash all of their data in just one call, leaving the difference as that of hashing a usize versus a u8. Correct?

@arthurprs
Contributor

OK, so [u8] and str do hash all of their data in just one call, leaving the difference as that of hashing a usize versus a u8. Correct?

Exactly.

@bluss
Member

bluss commented Nov 18, 2017

@arthurprs Appending 0xFF to [u8] does not protect against manufactured collisions.

Example:

use std::collections::HashSet;

// The 257 (&[u8], &[u8]) tuples inserted here would all hash the same way
// if hashing used 0xFF-termination instead of a length prefix.
let data = [0xFFu8; 256];
let mut map = HashSet::new();
for i in 0..data.len() + 1 { map.insert(data.split_at(i)); }

The problem with this is that it gives a hash-function-independent way of generating arbitrarily many collisions, so it's the hash-DoS problem, which we want to protect against by default, or at least when using the default hasher.

@arthurprs
Contributor

My understanding is that the len prefix (slices) and 0xFF suffix (str) are added to prevent stuff like ("1", "") hashing the same as ("", "1"). Not sure what the name for that is.
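
A quick sketch of that with DefaultHasher:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // The tuple impl hashes its fields back to back with no separator of its
    // own; it is the per-field 0xFF suffix / length prefix that keeps these
    // two apart.
    assert_ne!(hash_of(("1", "")), hash_of(("", "1")));
}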

@bluss
Member

bluss commented Nov 18, 2017

That's the same as my split example. It's crucial that the 0xFF byte never appears in a str's representation. And it doesn't, by the UTF-8 invariant.

@steveklabnik
Member

Triage: not aware of any changes here
