API/BUG: hashing of datetimes is based on UTC values #16372
Comments
cc @mrocklin @TomAugspurger @jorisvandenbossche: from a practical perspective I don't think this makes a whole lot of difference, but we should fix it to be correct.
What alternative do you have in mind to hash upon? Hash the timezone separately and combine the hashes?
I think you could do something like this.
IOW, hash the tz as an additional column and combine (which is what we do with a ...). This would break backward compat for tz-aware data. But (and maybe we should document this more) this hashing is version-to-version; it is not (necessarily) designed to be backward compatible.
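To make the "hash the tz separately and combine" idea concrete, here is a minimal stdlib-only sketch. It is not the pandas implementation: `hash_u64`, `hash_values_only` and `hash_with_tz` are hypothetical helpers, blake2b stands in for whatever hasher pandas actually uses, and a fixed-offset `EST` zone stands in for US/Eastern.

```python
import hashlib
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def utc_nanos(dt):
    """Nanoseconds since the epoch for a tz-aware datetime (integer arithmetic)."""
    delta = dt - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000


def hash_u64(data):
    """Stable 64-bit hash of raw bytes (assumption: blake2b as a stand-in hasher)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def hash_values_only(dt):
    # Mirrors the behaviour reported in this issue: only the UTC value is hashed.
    return hash_u64(utc_nanos(dt).to_bytes(8, "little", signed=True))


def hash_with_tz(dt):
    # Hash the tz name as if it were an extra column, then mix the two hashes.
    value_hash = hash_values_only(dt)
    tz_hash = hash_u64(str(dt.tzinfo).encode("utf-8"))
    return hash_u64(value_hash.to_bytes(8, "little") + tz_hash.to_bytes(8, "little"))


utc_view = datetime(2013, 1, 1, 5, 0, tzinfo=timezone.utc)
est_view = utc_view.astimezone(timezone(timedelta(hours=-5), "EST"))

# Same instant, different tz: the value-only hashes are identical.
same_before = hash_values_only(utc_view) == hash_values_only(est_view)
# With the tz mixed in, the two views get different hashes.
same_after = hash_with_tz(utc_view) == hash_with_tz(est_view)
```

The cost is one extra hash per column (not per value) if the tz hash is computed once and broadcast, which is roughly what "an additional column" would amount to.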
I suppose it depends on what people are using the hashing for. Suppose I have hashed values
To me, the most common use case is likely storing hashed values somewhere and wanting to answer "are these new values the same as what I have hashed?", so a stronger form of 1. (is the same instead of may be the same). So I think it's on pandas to either mix the dtype information into the hash somehow, or provide guidance that you should store the original dtype along with the hashed values.
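The second option (store the original dtype next to the hashed values) could look roughly like the sketch below. `StoredHash` and `fingerprint` are hypothetical names used only for illustration, the dtype strings are written out by hand, and blake2b again stands in for the real hasher.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredHash:
    dtype: str   # e.g. "datetime64[ns, UTC]", kept alongside the digest
    digest: str  # hash of the underlying integer values only


def fingerprint(int_values, dtype):
    """Hash the raw int64 values and record the dtype they came from."""
    h = hashlib.blake2b(digest_size=8)
    for v in int_values:
        h.update(int(v).to_bytes(8, "little", signed=True))
    return StoredHash(dtype=dtype, digest=h.hexdigest())


# Arbitrary UTC nanosecond values shared by both "views".
nanos = [1_357_016_400_000_000_000, 1_357_102_800_000_000_000]

stored = fingerprint(nanos, "datetime64[ns, UTC]")
candidate = fingerprint(nanos, "datetime64[ns, US/Eastern]")

digest_matches = stored.digest == candidate.digest  # value hashes collide
is_same = stored == candidate                       # comparing dtype too breaks the tie
```

This answers "is the same" rather than "may be the same" only as long as the caller remembers to compare both fields, which is the burden the comment above is pointing at.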
Hashing an extra column seems wasteful. I'd rather have some kind of stable map of each type and do a bit-shift on each type after hashing.
Building that type map is tricky (impossible?), because of parameterized types, 3rd party extension types...
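A rough sketch of the type-map idea, under my own assumptions: `TYPE_ROTATION` is a hypothetical hand-maintained table, and I use a 64-bit rotation rather than a plain shift so that no bits of the value hash are thrown away.

```python
MASK64 = (1 << 64) - 1

# Hypothetical, hand-maintained map from dtype name to a rotation amount.
# Every tz (a dtype parameter) would need its own entry, which is the hard part.
TYPE_ROTATION = {
    "int64": 0,
    "datetime64[ns]": 1,
    "datetime64[ns, UTC]": 2,
    "datetime64[ns, US/Eastern]": 3,
}


def rotl64(x, r):
    """Rotate a 64-bit unsigned integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def tag_hash(value_hash, dtype):
    """Adjust an already-computed value hash by a per-dtype rotation."""
    return rotl64(value_hash, TYPE_ROTATION[dtype])


value_hash = 0x0123456789ABCDEF  # stand-in for a hash of the UTC values

utc_tagged = tag_hash(value_hash, "datetime64[ns, UTC]")
eastern_tagged = tag_hash(value_hash, "datetime64[ns, US/Eastern]")
```

An unknown dtype raises `KeyError` here, which shows the problem: the table cannot enumerate every timezone or every third-party extension type in advance.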
We should explicitly state that hashing can change between versions. Maintaining that seems like it would be a nightmare.
Yes, I agree with that (that is eg what joblib uses hashing for)
I am not familiar with how hash values are calculated. But would it be possible to somehow combine the hash of the dtype with the hash of the values?
Was a satisfactory solution ever found for this? Looks like this is about hash_pandas_object and DatetimeIndex, but I'm looking at
These are 3 different 'views' of the same time. We DO disambiguate these in mains, so we should do so when hashing as well.
xref #16346