API/BUG: hashing of datetimes is based on UTC values #16372

jreback · 2017-05-16T22:17:20Z

These are 3 different 'views' of the same time. We DO disambiguate these in mains. So we should do so when hashing as well.

xref #16346

In [1]: from pandas.util import hash_pandas_object

In [8]: hash_pandas_object(pd.date_range('20130101', periods=3, tz='UTC').tz_convert('US/Eastern'))
Out[8]: 
2012-12-31 19:00:00-05:00     4326795898974544501
2013-01-01 19:00:00-05:00     2833560015380952180
2013-01-02 19:00:00-05:00    14913883737423839247
Freq: D, dtype: uint64

In [9]: hash_pandas_object(pd.date_range('20130101', periods=3, tz='UTC'))
Out[9]: 
2013-01-01 00:00:00+00:00     4326795898974544501
2013-01-02 00:00:00+00:00     2833560015380952180
2013-01-03 00:00:00+00:00    14913883737423839247
Freq: D, dtype: uint64

In [10]: hash_pandas_object(pd.date_range('20130101', periods=3))
Out[10]: 
2013-01-01     4326795898974544501
2013-01-02     2833560015380952180
2013-01-03    14913883737423839247
Freq: D, dtype: uint64
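A minimal sketch of why the three outputs above coincide, assuming a pandas version that still shows the behaviour reported here. All three indexes store the same integers internally (`asi8`), which is consistent with the hash being computed from those integers alone. The variable names are illustrative.

```python
import pandas as pd
from pandas.util import hash_pandas_object

utc = pd.date_range("20130101", periods=3, tz="UTC")
eastern = utc.tz_convert("US/Eastern")
naive = pd.date_range("20130101", periods=3)

# The time zone is metadata on the dtype; the stored integers are identical.
same_underlying = bool(
    (utc.asi8 == eastern.asi8).all() and (utc.asi8 == naive.asi8).all()
)

# Whether the hashes also match depends on the pandas version in use.
hashes = [hash_pandas_object(idx).to_numpy() for idx in (utc, eastern, naive)]
hashes_match = bool((hashes[0] == hashes[1]).all() and (hashes[0] == hashes[2]).all())

print("same underlying integers:", same_underlying)
print("same hashes:", hashes_match)
```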


jreback · 2017-05-16T22:17:59Z

cc @mrocklin

@TomAugspurger @jorisvandenbossche

From a practical perspective I don't think this makes a whole lot of difference, but we should fix it to be correct.

jorisvandenbossche · 2017-05-17T07:39:08Z

What alternative are you thinking of to hash on? Hash the timezone separately and combine the hashes?

jreback · 2017-05-17T12:08:19Z

I think you could do something like this.

In [1]: df = pd.DataFrame({'tz': pd.date_range('20130101', periods=3, tz='UTC').tz_convert('US/Eastern'),
   ...: 'utc': pd.date_range('20130101', periods=3, tz='UTC'),
   ...: 'naive': pd.date_range('20130101', periods=3)})

In [2]: df
Out[2]: 
       naive                        tz                       utc
0 2013-01-01 2012-12-31 19:00:00-05:00 2013-01-01 00:00:00+00:00
1 2013-01-02 2013-01-01 19:00:00-05:00 2013-01-02 00:00:00+00:00
2 2013-01-03 2013-01-02 19:00:00-05:00 2013-01-03 00:00:00+00:00

In [3]: from pandas.util import hash_pandas_object

In [6]: hash_pandas_object(pd.DataFrame({'tz':df['tz'],'zone':df['tz'].dt.tz}), index=False)
Out[6]: 
0    11960632900184590671
1    17909201100930397932
2      244240496600445005
dtype: uint64

In [7]: hash_pandas_object(pd.DataFrame({'utc':df['utc'],'zone':df['utc'].dt.tz}), index=False)
Out[7]: 
0     557885042773898185
1    1996380570925580138
2    5435501107539799243
dtype: uint64

In [8]: hash_pandas_object(pd.DataFrame({'naive':df['naive']}), index=False)
Out[8]: 
0    14376405836841727586
1     1052390041072582175
2    12596642793234779168
dtype: uint64

IOW, hash the tz as an additional column and combine (which is what we do with a DataFrame with index=False).

This would break backward compat for tz-aware data, but (and maybe we should document this more) the hashing is only meant to be stable within a version; it is not (necessarily) designed to be backward compatible.
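The idea of hashing the tz as an additional column can be wrapped in a small helper. This is only a sketch of the proposal, not something pandas provides: `hash_with_tz` is a hypothetical name, and storing the zone as `str(tz)` is an assumption made here so that naive data (where `tz` is `None`) still gets a hashable value.

```python
import pandas as pd
from pandas.util import hash_pandas_object


def hash_with_tz(values: pd.Series) -> pd.Series:
    """Hash a datetime Series together with its time zone (hypothetical helper)."""
    # Broadcast the zone name as a second column; a DataFrame hash with
    # index=False combines the per-column hashes row by row.
    frame = pd.DataFrame({"values": values, "zone": str(values.dt.tz)})
    return hash_pandas_object(frame, index=False)


utc = pd.Series(pd.date_range("20130101", periods=3, tz="UTC"))
eastern = utc.dt.tz_convert("US/Eastern")
naive = pd.Series(pd.date_range("20130101", periods=3))

h_utc, h_eastern, h_naive = (hash_with_tz(s) for s in (utc, eastern, naive))
print(h_utc, h_eastern, h_naive, sep="\n")
```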

TomAugspurger · 2018-10-10T12:03:25Z

I suppose it depends on what people are using the hashing for. Suppose I have hashed values a and a pandas object x.

1. If hash_pandas_object(x) == a, then x may be the same as the object that originally hashed to a.
2. If hash_pandas_object(x) != a, then x is not the same as the object that originally hashed to a.

To me, the most common use case is likely storing hashed values somewhere and wanting to answer "are these new values the same as what I have hashed?", so a stronger form of 1. (is the same instead of may be the same).

So I think it's on pandas to either mix the dtype information into the hash somehow, or provide guidance that you should store the original dtype along with the hashed values.
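The second option, storing the original dtype along with the hashed values, could look roughly like this. `fingerprint` and `matches` are hypothetical names used only for illustration, and using `str(obj.dtype)` as the stored dtype is an assumption.

```python
import pandas as pd
from pandas.util import hash_pandas_object


def fingerprint(obj):
    """Return (dtype string, raw hash bytes) for a Series or Index (hypothetical helper)."""
    hashes = hash_pandas_object(obj, index=False)
    return str(obj.dtype), hashes.to_numpy().tobytes()


def matches(obj, stored) -> bool:
    """True only if both the dtype and the hashed values agree with what was stored."""
    return fingerprint(obj) == stored


utc = pd.date_range("20130101", periods=3, tz="UTC")
naive = pd.date_range("20130101", periods=3)

stored = fingerprint(utc)
print(matches(utc, stored), matches(naive, stored))
```

Because the dtype string includes the time zone, the comparison no longer relies on the value hashes alone.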

> IOW, hash the tz as an additional column and combine (which is what we do with a DataFrame with index=False).

Hashing an extra column seems wasteful. I'd rather have some kind of stable map of each type and do a bit-shift on each type after hashing.

type_map = {
    int: 0,
    float: 1,
    ...
}

h = hash_array(obj.values, encoding, hash_key,
               categorize).astype('uint64', copy=False)
h >>= type_map[obj.dtype]

Building that type map is tricky (impossible?), because of parameterized types, 3rd party extension types...
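A runnable rendition of the pseudocode above, under stated assumptions: the map is keyed on NumPy's `dtype.kind` rather than on Python types, the codes in `TYPE_CODES` are made up, and `shifted_hash` is a hypothetical name. It is meant only to make the idea concrete.

```python
import numpy as np
from pandas.util import hash_array

# Hypothetical, deliberately incomplete map from NumPy dtype kind to a shift.
TYPE_CODES = {"i": 0, "f": 1, "M": 2}


def shifted_hash(values: np.ndarray) -> np.ndarray:
    """Hash the values, then right-shift by a per-type amount (sketch only)."""
    h = hash_array(values).astype("uint64", copy=False)
    return h >> np.uint64(TYPE_CODES[values.dtype.kind])


ints = np.arange(3, dtype="int64")
dates = np.array(["2013-01-01", "2013-01-02", "2013-01-03"], dtype="datetime64[ns]")

print(shifted_hash(ints))
print(shifted_hash(dates))
```

Two limitations show up even in this small form. A right shift discards low bits, and a shift of 0 leaves the hash unchanged. The map also has no entry that distinguishes time zones, which is the parameterized-type problem described above.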

> version-to-version hashing, it is not (necessarily) designed to be backward compat.

We should explicitly state that hashing can change between versions. Maintaining stable hashes across versions seems like it would be a nightmare.

jorisvandenbossche · 2018-10-11T12:53:25Z

> To me, the most common use case is likely storing hashed values somewhere and wanting to answer "are these new values the same as what I have hashed?", so a stronger form of 1. (is the same instead of may be the same).

Yes, I agree with that (that is e.g. what joblib uses hashing for).

> Building that type map is tricky (impossible?), because of parameterized types, 3rd party extension types...

I am not familiar with how hash values are calculated. But would it be possible to somehow combine the hash of the dtype with the hash of the values?
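One way this question could be answered, sketched under assumptions: hash the dtype's string form with `hash_array` and XOR it into every value hash. `hash_with_dtype` is a hypothetical helper, and XOR is just the simplest mixing step chosen for illustration; it is not what pandas does.

```python
import numpy as np
import pandas as pd
from pandas.util import hash_array, hash_pandas_object


def hash_with_dtype(obj) -> pd.Series:
    """Mix a hash of str(obj.dtype) into each value hash (hypothetical helper)."""
    value_hashes = hash_pandas_object(obj, index=False)
    # Hash the dtype's string form once; it includes the time zone for tz-aware data.
    dtype_hash = hash_array(np.array([str(obj.dtype)], dtype=object))[0]
    mixed = value_hashes.to_numpy() ^ dtype_hash
    return pd.Series(mixed, index=value_hashes.index)


utc = pd.date_range("20130101", periods=3, tz="UTC")
eastern = utc.tz_convert("US/Eastern")
naive = pd.date_range("20130101", periods=3)

mixed_utc, mixed_eastern, mixed_naive = (
    hash_with_dtype(idx) for idx in (utc, eastern, naive)
)
print(mixed_utc, mixed_eastern, mixed_naive, sep="\n")
```

Relying on `str(dtype)` sidesteps the need for a fixed type map, at the cost of depending on how each dtype (including third-party extension dtypes) chooses to print itself.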

jbrockmendel · 2022-03-24T02:52:43Z

Was a satisfactory solution ever found for this? Looks like this is about hash_pandas_object and DatetimeIndex, but I'm looking at Timestamp.__hash__ and it ignores the tz too. Current motivation is to adapt for non-nano.
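For context on the scalar case, here is how the standard library's `datetime` (which `Timestamp` subclasses) behaves. This sketch uses only stdlib objects and a fixed UTC-5 offset as a stand-in for US/Eastern; it does not exercise `Timestamp.__hash__` itself.

```python
from datetime import datetime, timedelta, timezone

utc = datetime(2013, 1, 1, tzinfo=timezone.utc)
# Fixed-offset stand-in for US/Eastern in winter (assumption for illustration).
eastern = utc.astimezone(timezone(timedelta(hours=-5)))
naive = datetime(2013, 1, 1)

# Aware datetimes for the same instant compare equal, so their hashes must agree.
aware_equal = utc == eastern
aware_same_hash = hash(utc) == hash(eastern)

# A naive datetime never compares equal to an aware one.
naive_equal = naive == utc

print(aware_equal, aware_same_hash, naive_equal)
```

Since equal objects must hash equal, any scheme that makes the hash depend on the tz has to be squared with how equality across time zones is defined.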

jreback added the API Design, IO Data, Difficulty Intermediate, Numeric Operations, and Datetime labels May 16, 2017

jreback added this to the 0.21.0 milestone May 16, 2017

jreback mentioned this issue May 16, 2017

PERF: improve MultiIndex get_loc performance #16346

Merged

jorisvandenbossche removed the API Design label May 19, 2017

jreback modified the milestones: 0.21.0, Interesting Issues Sep 23, 2017

jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017

TomAugspurger mentioned this issue Oct 10, 2018

hash_pandas_object on ExtensionArray-backed Series fails with TypeError #23066

Closed

jorisvandenbossche modified the milestones: Contributions Welcome, 0.24.0 Oct 18, 2018

jreback modified the milestones: 0.24.0, Contributions Welcome Nov 6, 2018

jbrockmendel removed Effort Medium labels Oct 21, 2019

jbrockmendel removed the IO Data label Dec 11, 2019

mroeschke added the Bug label Mar 31, 2020

mroeschke added the hashing and Needs Discussion labels and removed the Numeric Operations label Jun 12, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

cryptochecktool mentioned this issue Nov 27, 2024

ENH: Add a safe Option to hash_pandas_object with Default Value Set to True #60428

Open

3 tasks
