docs: std::hash::Hash should ensure prefix-free data #89438

Merged
merged 3 commits into from
Oct 10, 2021
10 changes: 10 additions & 0 deletions library/core/src/hash/mod.rs
@@ -153,9 +153,19 @@ mod sip;
/// Thankfully, you won't need to worry about upholding this property when
/// deriving both [`Eq`] and `Hash` with `#[derive(PartialEq, Eq, Hash)]`.
///
/// ## Prefix collisions
Member Author

Thinking more about this... "Collision" isn't the right term, here, is it?

Member

It's not, but I can't think of a better word to use.

///
/// Implementations of `hash` should ensure that the data they
/// pass to the `Hasher` are prefix-free. That is, different concatenations

@tczajka tczajka Oct 1, 2021

This explanation of what "prefix-free" means is incomplete. It should say that unequal values should cause two different byte sequences to be written, and neither of the two sequences should be a prefix of the other.

Note that it's not sufficient to say that concatenations of outputs of multiple values of the same type should result in different outputs. It has to be true when concatenated with outputs for other types as well (think about hashing (A, B)). That's where the prefix-free property comes in: the outputs will be different if all the types involved satisfy the prefix-free property.
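To make the definition concrete, here is a minimal sketch of the pairwise check described above. The helper name `neither_is_prefix` is hypothetical and is not part of the standard library; it only illustrates what "neither sequence is a prefix of the other" means for two byte sequences.

```rust
/// Hypothetical helper (not a std API): returns true when the two byte
/// sequences are different and neither one is a prefix of the other.
fn neither_is_prefix(a: &[u8], b: &[u8]) -> bool {
    // Equal sequences are prefixes of each other, so they also return false.
    !a.starts_with(b) && !b.starts_with(a)
}

fn main() {
    // "a" is a prefix of "ab", so these outputs are not prefix-free.
    assert!(!neither_is_prefix(b"a", b"ab"));
    // With a trailing 0xFF terminator, neither output starts with the other.
    assert!(neither_is_prefix(b"a\xff", b"ab\xff"));
}
```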

Member Author

Thanks @tczajka! I'm not sure I understand the idea of one sequence being a prefix of another. Does it simply mean "starts with", or is it another kind of relation? Is there a way we can rephrase this?

Member Author

Another way to ask the question: in the example of ("ab", "c") and ("a", "bc") where and how would the "prefix" occur, and how does the extra byte prevent it?

A "prefix" is the beginning of a string, so it's the same as "starts with". https://en.wikipedia.org/wiki/Prefix

If strings were hashed without the extra 0xff at the end, hashing ("ab", "c") and ("a", "bc") would write the same byte sequence "abc" to the Hasher. The problem is that "a" is a prefix of "ab", whereas "a\xff" is not a prefix of "ab\xff". So if Hash writes these terminated sequences instead, the problem is solved: "ab\xffc\xff" != "a\xffbc\xff".
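The effect can be observed with a hasher that records the bytes it receives rather than mixing them. The sketch below is illustrative only: `RecordingHasher`, `Raw`, `Terminated`, and `recorded` are hypothetical names invented for this example, and the two wrapper types implement `Hash` by hand instead of relying on the standard `str` implementation.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical hasher that records every byte written to it.
#[derive(Default)]
struct RecordingHasher {
    bytes: Vec<u8>,
}

impl Hasher for RecordingHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes.extend_from_slice(bytes);
    }

    fn finish(&self) -> u64 {
        0 // The hash value is not used in this sketch.
    }
}

/// Hypothetical wrapper that writes only the string bytes (not prefix-free).
struct Raw<'a>(&'a str);

impl<'a> Hash for Raw<'a> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.0.as_bytes());
    }
}

/// Hypothetical wrapper that appends a 0xFF terminator after the bytes.
struct Terminated<'a>(&'a str);

impl<'a> Hash for Terminated<'a> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.0.as_bytes());
        state.write_u8(0xff);
    }
}

/// Returns the exact byte sequence a value passes to the hasher.
fn recorded<T: Hash>(value: &T) -> Vec<u8> {
    let mut hasher = RecordingHasher::default();
    value.hash(&mut hasher);
    hasher.bytes
}

fn main() {
    // Without a terminator, both tuples write the same bytes: "abc".
    assert_eq!(
        recorded(&(Raw("ab"), Raw("c"))),
        recorded(&(Raw("a"), Raw("bc")))
    );
    // With the terminator, the two byte sequences differ.
    assert_ne!(
        recorded(&(Terminated("ab"), Terminated("c"))),
        recorded(&(Terminated("a"), Terminated("bc")))
    );
}
```

Tuples hash their fields in order, so the recorded output of a tuple is the concatenation of what each field writes. That is why the terminator on each field is enough to keep the two tuples apart.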

Member

@cuviper cuviper Oct 1, 2021

Note: \xff is not actually allowed in string literals, since it would be invalid UTF-8 -- which is also what makes it a useful separator here. You could really write those as byte strings though, b"ab\xffc\xff" != b"a\xffbc\xff".
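A short sketch of that point, using only standard library calls: a byte sequence containing 0xFF is rejected as UTF-8, no valid `str` contains that byte, and the two byte strings from the example compare as unequal. The sample string is arbitrary and chosen only for illustration.

```rust
fn main() {
    // 0xFF never appears in valid UTF-8, so this conversion fails.
    assert!(std::str::from_utf8(b"ab\xff").is_err());

    // Any valid `str`, including non-ASCII text, is free of 0xFF bytes.
    assert!("héllo, wörld".bytes().all(|b| b != 0xff));

    // Written as byte strings, the two terminated sequences differ.
    assert_ne!(b"ab\xffc\xff", b"a\xffbc\xff");
}
```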

/// of the same data should not produce the same output.
/// For example, the standard implementation of [`Hash` for `&str`][impl] passes an extra
/// `0xFF` byte to the `Hasher` so that the values `("ab", "c")` and `("a",
/// "bc")` hash differently.
///
/// [`HashMap`]: ../../std/collections/struct.HashMap.html
/// [`HashSet`]: ../../std/collections/struct.HashSet.html
/// [`hash`]: Hash::hash
/// [impl]: ../../std/primitive.str.html#impl-Hash
#[stable(feature = "rust1", since = "1.0.0")]
#[rustc_diagnostic_item = "Hash"]
pub trait Hash {