-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ContentHashable for Hash{Map, Set} is not stable #215
Comments
Close tweag#215 `unordered-containers` structures, such as `HashMap` and `HashSet` uses `Hashable` `hash` function to maintain their internal structure. Hence the order of the items (when converting using `toList`, or `elems`, or just `traverse` or `fold`) depends on the version of `hashable` used. This results on a change of hash of theses structure when `hashable` library is updated. It is not possible to ensure that hash will never change for any type, because it relies on other instances, `Ord/Eq` for `containers` `Map` and `Set`, or even the implementation of `ContentHashable` itself. However `hashable` is known for its frequent changes when on the other hand, `Ord/Eq` instances are most of the time generated by `deriving` and are hence perfectly reproducible. This commit removes the dependency on the implicit order from `hashable` at the cost of an introduce `Ord` constraint on the keys of the `Hash{Map,Set}`. This is a tradeoff: - The user can still use `Hash{Map,Set}` based structure in their type for whatever reason they want (usually performance). - They will need to provide an `Ord` instance for `ContentHashable` - If an `Ord` instance already exists, this change will just bring stability. If the instance does not exist, user will now be aware of the problem. Note: this invalidate previous hash computed on `Hash{Map, Set}`.
Close tweag#215 `unordered-containers` structures, such as `HashMap` and `HashSet` uses `Hashable` `hash` function to maintain their internal structure. Hence the order of the items (when converting using `toList`, or `elems`, or just `traverse` or `fold`) depends on the version of `hashable` used. This results on a change of hash of theses structure when `hashable` library is updated. It is not possible to ensure that hash will never change for any type, because it relies on other instances, `Ord/Eq` for `containers` `Map` and `Set`, or even the implementation of `ContentHashable` itself. However `hashable` is known for its frequent changes when on the other hand, `Ord/Eq` instances are most of the time generated by `deriving` and are hence perfectly reproducible. This commit removes the dependency on the implicit order from `hashable` at the cost of an introduce `Ord` constraint on the keys of the `Hash{Map,Set}`. This is a tradeoff: - The user can still use `Hash{Map,Set}` based structure in their type for whatever reason they want (usually performance). - They will need to provide an `Ord` instance for `ContentHashable` - If an `Ord` instance already exists, this change will just bring stability. If the instance does not exist, user will now be aware of the problem. Note: this invalidate previous hash computed on `Hash{Map, Set}`.
Close tweag#215 `unordered-containers` structures, such as `HashMap` and `HashSet` uses `Hashable` `hash` function to maintain their internal structure. Hence the order of the items (when converting using `toList`, or `elems`, or just `traverse` or `fold`) depends on the version of `hashable` used. This results on a change of hash of theses structure when `hashable` library is updated. It is not possible to ensure that hash will never change for any type, because it relies on other instances, `Ord/Eq` for `containers` `Map` and `Set`, or even the implementation of `ContentHashable` itself. However `hashable` is known for its frequent changes when on the other hand, `Ord/Eq` instances are most of the time generated by `deriving` and are hence perfectly reproducible. This commit removes the dependency on the implicit order from `hashable` at the cost of an introduce `Ord` constraint on the keys of the `Hash{Map,Set}`. This is a tradeoff: - The user can still use `Hash{Map,Set}` based structure in their type for whatever reason they want (usually performance). - They will need to provide an `Ord` instance for `ContentHashable` - If an `Ord` instance already exists, this change will just bring stability. If the instance does not exist, user will now be aware of the problem. Note: this invalidate previous hash computed on `Hash{Map, Set}`.
The branch https://github.com/guibou/funflow/tree/stable_unordered-containers-hash-cas-hashable contains a commit which fix for this issue: it pre-sort the It adds an |
However, removing cas-hashable from our codebase took me less time than fixing this (I'm now using a one line function which compute the sha256 sum of the output of Note that I wrote a custom instance for |
Hi @guibou, thanks for reporting this issue - I imagine that was a tricky one to track down 😬. We will take a closer look and get back to you regarding possible fix you linked. I would also be interested in hearing your use case for |
Hi @dorranh. We are solving virtual population of patient and the effect of treatments on them. So basically, we have a list of tasks (i.e. simulating a patient), some tasks depends on other task (i.e. comparing treatment effect versus control arm need to wait for the result of both task on the same patient. Or generating complete trial statistics needs to wait on all tasks) and we want to run theses tasks on a cluster of machines on AWS and we want to have an efficient and robust cache. Caching is important for us because imagine the same simulation, we can run it with 1 million patient on two protocol arms (i.e. placebo versus treatment), that's 2 millions patients.. At the end of this run, we realize that we want to test another treatment (for example, double dose). With caching, we can avoid recomputing 2 millions patients (placebo / first treatment) and only recompute the new treatment. We use content hash of the patient information / model / treatment. Initially all of that was done with funflow and hence edit: that's why robust hash is important for us (see also #202) because when hash change, caching is invalidated and it may lead to additional computation cost, storage cost and user frustration. |
Describe the bug
ContentHashable
to hash the content of anHashMap
(directly or indirectly)hashable
is released, which change thehash
function. The order ofHashMap
is hence changed, and my nicely content hashed object suddently change their hash.See the instance for
HashMap
in the code, it usestoList
, the order is unspecified.https://github.com/tweag/funflow/blob/master/cas/hashable/src/Data/CAS/ContentHashable.hs#L361-L369
To Reproduce
Run this code with different version of
hashable
(You may have to try different keys in the map, or a bigger map, in order to trigger the effect).Expected behavior
It is impossible to guarantee that the hash won't change, but either:
hashable
Ord, Eq
constraints on the keys of theHashMap
.The text was updated successfully, but these errors were encountered: