Pandas likely 2.0.0 causing modeldata test to fail #324

aaraney · 2023-04-20T21:36:12Z

The dmod.test.test_geopackage_hydrofabric.TestGeoPackageHydrofabric.test_uid_1_a test is currently failing in several PRs. Below is a snippet from an action log showing the failure.

source

===========================================================================
............................F...............
======================================================================
FAIL: test_uid_1_a (dmod.test.test_geopackage_hydrofabric.TestGeoPackageHydrofabric)
Test that the hydrofabric instance for example one has the expected unique id.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/DMOD/DMOD/python/lib/modeldata/dmod/test/test_geopackage_hydrofabric.py", line 309, in test_uid_1_a
    self.assertEqual(hydrofabric.uid, expected_uid)
AssertionError: '7b022f401ea2da1fdce2c1c2e36a8664b2299778' != '8a24b5eeae2596ceaf21058c49a27c8ae6f444ab'
- 7b022f401ea2da1fdce2c1c2e36a8664b2299778
+ 8a24b5eeae2596ceaf21058c49a27c8ae6f444ab

I compared the dependency versions installed when the tests were passing with those installed for the failing tests, and pandas==2.0.0 seems to be the likely culprit. The last known pandas version that works is 1.5.3. I tested this locally with fiona versions 1.9.1 and 1.9.3 and pandas==1.5.3, and the tests passed. However, there is one outlier action with pandas==2.0.0 and fiona==1.9.2 installed that passed. I'm still a little puzzled about that one and I've not been able to reproduce it locally yet (I'll try in the morning; there isn't a fiona wheel for that version for my machine).

Passing with pandas==1.5.3
Failing with pandas==2.0.0
Weird passing test pandas==2.0.0

I went looking through fiona's change log and PRs for release 1.9.3 and it doesn't look like anything is related. I've not gone through the geopandas change log yet, so I need to check there too.


christophertubbs · 2023-04-21T13:32:47Z

I think it's pandas, but I believe there's a default seed in the hash function for the column. If that was changed for 2.0 for whatever reason, it'd change the results of the hash function on the column.

aaraney · 2023-04-21T15:50:33Z

Yeah, the more I look into this, the more I am convinced that it is pandas too. Just so we are all on the same page, the failing test compares hashes derived from a geopackage version of the hydrofabric. Here is the code:

    @property
    def uid(self) -> str:
        # removed docstring for readability
        layer_hashes = [np.apply_along_axis(hash_array, 0, self._dataframes[l].values).sum() for l in self._layer_names]
        return hashlib.sha1(','.join([str(h) for h in layer_hashes]).encode('UTF-8')).hexdigest()

self._dataframes is a dictionary mapping each geopackage layer name to the geopandas DataFrame of that layer.
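To make the last step of uid concrete, here is a minimal sketch of only the combining part: the per-layer integers are joined with commas and the result is passed through SHA-1. The function name combine_layer_hashes and the integer values are hypothetical stand-ins, not taken from the DMOD code. It illustrates why a change in any single layer hash changes the whole uid.

```python
import hashlib


def combine_layer_hashes(layer_hashes):
    """Join per-layer integer hashes with commas and SHA-1 the result."""
    joined = ",".join(str(h) for h in layer_hashes)
    return hashlib.sha1(joined.encode("UTF-8")).hexdigest()


# Hypothetical per-layer sums; real values come from hashing each layer.
baseline = combine_layer_hashes([101, 202, 303])
one_layer_changed = combine_layer_hashes([101, 999, 303])

# A difference in a single layer hash yields a different uid.
print(baseline != one_layer_changed)
```

So if hash_array returns different values for even one column of one layer, the final hex digest will not match the expected uid.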

I wrote up a script to do basically the same thing to more easily compare pandas versions. The script is in the twirl down if you are interested.

Hash each column in geopackage script

import numpy as np
import pandas as pd
import geopandas as gpd
from pandas.util import hash_array
import fiona

p = "<path-to-repo>/data/example_hydrofabric_2/hydrofabric.gpkg"
layers = fiona.listlayers(p)

dataframes = {layer_name: gpd.read_file(p, layer=layer_name) for layer_name in layers}

with open(pd.__version__, "w") as f:
    f.write(f"layers: {str(layers)}\n")
    for l in layers:
        f.write(f"layer: {l}\n")
        f.write(f"columns: {str(dataframes[l].columns.to_list())}\n")
        f.write(f"column type: {list(map(str, dataframes[l].dtypes))}\n")
        
        # computes hash of each element in dataframe (equivalent to pd.DataFrame.applymap)
        hash_of_each_value = np.apply_along_axis(hash_array, 0, dataframes[l].values)
        # sums the element hashes down axis 0, giving one value per column
        summed_hash_per_column = np.apply_along_axis(np.sum, 0, hash_of_each_value)
        f.write(f"{str(summed_hash_per_column.tolist())}\n")

print(pd.__version__)

Raw output

1.5.3

layers: ['divides', 'flowpaths', 'nexus', 'flowpath_edge_list', 'flowpath_attributes', 'crosswalk', 'cfe_noahowp_attributes']
layer: divides
columns    : ['id', 'areasqkm', 'type', 'toid', 'geometry']
column type: ['object', 'float64', 'object', 'object', 'geometry']
[9910909016688245206, 4257016642943720818, 523519807225067410, 4254618872625276163, 9604451431115515325]
layer: flowpaths
columns    : ['id', 'lengthkm', 'main_id', 'member_comid', 'tot_drainage_areasqkm', 'order', 'realized_catchment', 'toid', 'geometry']
column type: ['object', 'float64', 'int64', 'object', 'float64', 'float64', 'object', 'object', 'geometry']
[11397087368252117007, 10440532369965811158, 5750755180915541183, 6610491045185399198, 12480819840933176463, 16223896574922372839, 9910909016688245206, 4254618872625276163, 1289937201443695659]
layer: nexus
columns    : ['id', 'type', 'toid', 'geometry']
column type: ['object', 'object', 'object', 'geometry']
[4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
layer: flowpath_edge_list
columns    : ['id', 'toid', 'geometry']
column type: ['object', 'object', 'geometry']
[11397087368252117007, 4254618872625276163, 18446744073709551609]
layer: flowpath_attributes
columns    : ['id', 'rl_gages', 'rl_NHDWaterbodyComID', 'Qi', 'MusK', 'MusX', 'n', 'So', 'ChSlp', 'BtmWdth', 'time', 'Kchan', 'nCC', 'TopWdthCC', 'TopWdth', 'length_m', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
[11397087368252117007, 18446744073709551609, 18446744073709551609, 18446744073709551612, 6945878055642010754, 10639750839072135192, 14059973711066289446, 6292855941134473512, 15378706233695216860, 10232710820753472246, 18446744073709551612, 18446744073709551612, 3296008595016872355, 16812141029101979377, 7177634671963409277, 5902442301448765624, 18446744073709551609]
layer: crosswalk
columns    : ['id', 'toid', 'NHDPlusV2_COMID', 'NHDPlusV2_COMID_part', 'reconciled_ID', 'mainstem', 'POI_ID', 'POI_TYPE', 'POI_VALUE', 'geometry']
column type: ['object', 'object', 'float64', 'float64', 'float64', 'float64', 'object', 'object', 'object', 'geometry']
[3752687620607738028, 10679333900438930345, 11644050028096106736, 2569989159426054126, 11240671135618503159, 13074351299704411022, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
layer: cfe_noahowp_attributes
columns    : ['id', 'gw_Coeff', 'gw_Zmax', 'gw_Expon', 'ISLTYP', 'IVGTYP', 'bexp_soil_layers_stag=1', 'bexp_soil_layers_stag=2', 'bexp_soil_layers_stag=3', 'bexp_soil_layers_stag=4', 'dksat_soil_layers_stag=1', 'dksat_soil_layers_stag=2', 'dksat_soil_layers_stag=3', 'dksat_soil_layers_stag=4', 'psisat_soil_layers_stag=1', 'psisat_soil_layers_stag=2', 'psisat_soil_layers_stag=3', 'psisat_soil_layers_stag=4', 'cwpvt', 'mfsno', 'mp', 'refkdt', 'slope', 'smcmax_soil_layers_stag=1', 'smcmax_soil_layers_stag=2', 'smcmax_soil_layers_stag=3', 'smcmax_soil_layers_stag=4', 'smcwlt_soil_layers_stag=1', 'smcwlt_soil_layers_stag=2', 'smcwlt_soil_layers_stag=3', 'smcwlt_soil_layers_stag=4', 'vcmx25', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'int64', 'int64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
[9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 5978206864406560404, 7481721767258314475, 4938289557806202459, 4938289557806202459, 4938289557806202459, 4938289557806202459, 10879245026471355412, 10879245026471355412, 10879245026471355412, 10879245026471355412, 3720317251396855529, 3720317251396855529, 3720317251396855529, 3720317251396855529, 10415052716999926891, 7882324864216808000, 6833873961703491437, 12397854334200377101, 11714806517525372952, 14084054139117991909, 14084054139117991909, 14084054139117991909, 14084054139117991909, 13183223717059209830, 13183223717059209830, 13183223717059209830, 13183223717059209830, 17622765900528814828, 18446744073709551609]

2.0.0

layers: ['divides', 'flowpaths', 'nexus', 'flowpath_edge_list', 'flowpath_attributes', 'crosswalk', 'cfe_noahowp_attributes']
layer: divides
columns    : ['id', 'areasqkm', 'type', 'toid', 'geometry']
column type: ['object', 'float64', 'object', 'object', 'geometry']
[9910909016688245206, 7714536644407060282, 523519807225067410, 4254618872625276163, 9604451431115515325]
layer: flowpaths
columns    : ['id', 'lengthkm', 'main_id', 'member_comid', 'tot_drainage_areasqkm', 'order', 'realized_catchment', 'toid', 'geometry']
column type: ['object', 'float64', 'int64', 'object', 'float64', 'float64', 'object', 'object', 'geometry']
[11397087368252117007, 295265702994286425, 3079000369136598424, 6610491045185399198, 16619260224802041063, 10866567253940249541, 9910909016688245206, 4254618872625276163, 1289937201443695659]
layer: nexus
columns    : ['id', 'type', 'toid', 'geometry']
column type: ['object', 'object', 'object', 'geometry']
[4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
layer: flowpath_edge_list
columns    : ['id', 'toid', 'geometry']
column type: ['object', 'object', 'geometry']
[11397087368252117007, 4254618872625276163, 18446744073709551609]
layer: flowpath_attributes
columns    : ['id', 'rl_gages', 'rl_NHDWaterbodyComID', 'Qi', 'MusK', 'MusX', 'n', 'So', 'ChSlp', 'BtmWdth', 'time', 'Kchan', 'nCC', 'TopWdthCC', 'TopWdth', 'length_m', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
[11397087368252117007, 18446744073709551609, 18446744073709551609, 3179149979871189512, 11986439733596641007, 11099151438926169628, 17812221497236259243, 5178549615237787454, 93929622743921236, 11651573430419844666, 3179149979871189512, 3179149979871189512, 6398729633788531660, 1245452818585839232, 12465211508769344715, 6362659139139935070, 18446744073709551609]
layer: crosswalk
columns    : ['id', 'toid', 'NHDPlusV2_COMID', 'NHDPlusV2_COMID_part', 'reconciled_ID', 'mainstem', 'POI_ID', 'POI_TYPE', 'POI_VALUE', 'geometry']
column type: ['object', 'object', 'float64', 'float64', 'float64', 'float64', 'object', 'object', 'object', 'geometry']
[3752687620607738028, 10679333900438930345, 8742293209049677422, 8334557188444525933, 15888965815924940002, 101168048888588003, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
layer: cfe_noahowp_attributes
columns    : ['id', 'gw_Coeff', 'gw_Zmax', 'gw_Expon', 'ISLTYP', 'IVGTYP', 'bexp_soil_layers_stag=1', 'bexp_soil_layers_stag=2', 'bexp_soil_layers_stag=3', 'bexp_soil_layers_stag=4', 'dksat_soil_layers_stag=1', 'dksat_soil_layers_stag=2', 'dksat_soil_layers_stag=3', 'dksat_soil_layers_stag=4', 'psisat_soil_layers_stag=1', 'psisat_soil_layers_stag=2', 'psisat_soil_layers_stag=3', 'psisat_soil_layers_stag=4', 'cwpvt', 'mfsno', 'mp', 'refkdt', 'slope', 'smcmax_soil_layers_stag=1', 'smcmax_soil_layers_stag=2', 'smcmax_soil_layers_stag=3', 'smcmax_soil_layers_stag=4', 'smcwlt_soil_layers_stag=1', 'smcwlt_soil_layers_stag=2', 'smcwlt_soil_layers_stag=3', 'smcwlt_soil_layers_stag=4', 'vcmx25', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'int64', 'int64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
[9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 17218099967287971373, 13656126546251369994, 8348731800770055667, 8348731800770055667, 8348731800770055667, 8348731800770055667, 5343560834163534075, 5343560834163534075, 5343560834163534075, 5343560834163534075, 3269881205696930493, 3269881205696930493, 3269881205696930493, 3269881205696930493, 13084577964608946138, 17632152680294710829, 6280827595974488025, 11366273065597691016, 12540298840484485334, 6453370661115098702, 6453370661115098702, 6453370661115098702, 6453370661115098702, 11911029417517962984, 11911029417517962984, 11911029417517962984, 11911029417517962984, 15250992649697139781, 18446744073709551609]

Looking at the combined output below, it looks like the discrepancies are in the numeric datatypes. This leads me to think there might be differences in how NA / None values are represented and/or hashed between the two versions. Looking into that now.

layers: ['divides', 'flowpaths', 'nexus', 'flowpath_edge_list', 'flowpath_attributes', 'crosswalk', 'cfe_noahowp_attributes']
layer: divides
columns    : ['id', 'areasqkm', 'type', 'toid', 'geometry']
column type: ['object', 'float64', 'object', 'object', 'geometry']
1.5.3:[9910909016688245206, 4257016642943720818, 523519807225067410, 4254618872625276163, 9604451431115515325]
2.0.0 [9910909016688245206, 7714536644407060282, 523519807225067410, 4254618872625276163, 9604451431115515325]
layer: flowpaths
columns    : ['id', 'lengthkm', 'main_id', 'member_comid', 'tot_drainage_areasqkm', 'order', 'realized_catchment', 'toid', 'geometry']
column type: ['object', 'float64', 'int64', 'object', 'float64', 'float64', 'object', 'object', 'geometry']
1.5.3: [11397087368252117007, 10440532369965811158, 5750755180915541183, 6610491045185399198, 12480819840933176463, 16223896574922372839, 9910909016688245206, 4254618872625276163, 1289937201443695659]
2.0.0: [11397087368252117007, 295265702994286425, 3079000369136598424, 6610491045185399198, 16619260224802041063, 10866567253940249541, 9910909016688245206, 4254618872625276163, 1289937201443695659]
layer: nexus
columns    : ['id', 'type', 'toid', 'geometry']
column type: ['object', 'object', 'object', 'geometry']
1.5.3: [4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
2.0.0: [4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
layer: flowpath_edge_list
columns    : ['id', 'toid', 'geometry']
column type: ['object', 'object', 'geometry']
1.5.3: [11397087368252117007, 4254618872625276163, 18446744073709551609]
2.0.0: [11397087368252117007, 4254618872625276163, 18446744073709551609]
layer: flowpath_attributes
columns    : ['id', 'rl_gages', 'rl_NHDWaterbodyComID', 'Qi', 'MusK', 'MusX', 'n', 'So', 'ChSlp', 'BtmWdth', 'time', 'Kchan', 'nCC', 'TopWdthCC', 'TopWdth', 'length_m', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
1.5.3: [11397087368252117007, 18446744073709551609, 18446744073709551609, 18446744073709551612, 6945878055642010754, 10639750839072135192, 14059973711066289446, 6292855941134473512, 15378706233695216860, 10232710820753472246, 18446744073709551612, 18446744073709551612, 3296008595016872355, 16812141029101979377, 7177634671963409277, 5902442301448765624, 18446744073709551609]
2.0.0: [11397087368252117007, 18446744073709551609, 18446744073709551609, 3179149979871189512, 11986439733596641007, 11099151438926169628, 17812221497236259243, 5178549615237787454, 93929622743921236, 11651573430419844666, 3179149979871189512, 3179149979871189512, 6398729633788531660, 1245452818585839232, 12465211508769344715, 6362659139139935070, 18446744073709551609]
layer: crosswalk
columns    : ['id', 'toid', 'NHDPlusV2_COMID', 'NHDPlusV2_COMID_part', 'reconciled_ID', 'mainstem', 'POI_ID', 'POI_TYPE', 'POI_VALUE', 'geometry']
column type: ['object', 'object', 'float64', 'float64', 'float64', 'float64', 'object', 'object', 'object', 'geometry']
1.5.3: [3752687620607738028, 10679333900438930345, 11644050028096106736, 2569989159426054126, 11240671135618503159, 13074351299704411022, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
2.0.0: [3752687620607738028, 10679333900438930345, 8742293209049677422, 8334557188444525933, 15888965815924940002, 101168048888588003, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
layer: cfe_noahowp_attributes
columns    : ['id', 'gw_Coeff', 'gw_Zmax', 'gw_Expon', 'ISLTYP', 'IVGTYP', 'bexp_soil_layers_stag=1', 'bexp_soil_layers_stag=2', 'bexp_soil_layers_stag=3', 'bexp_soil_layers_stag=4', 'dksat_soil_layers_stag=1', 'dksat_soil_layers_stag=2', 'dksat_soil_layers_stag=3', 'dksat_soil_layers_stag=4', 'psisat_soil_layers_stag=1', 'psisat_soil_layers_stag=2', 'psisat_soil_layers_stag=3', 'psisat_soil_layers_stag=4', 'cwpvt', 'mfsno', 'mp', 'refkdt', 'slope', 'smcmax_soil_layers_stag=1', 'smcmax_soil_layers_stag=2', 'smcmax_soil_layers_stag=3', 'smcmax_soil_layers_stag=4', 'smcwlt_soil_layers_stag=1', 'smcwlt_soil_layers_stag=2', 'smcwlt_soil_layers_stag=3', 'smcwlt_soil_layers_stag=4', 'vcmx25', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'int64', 'int64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
1.5.3: [9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 5978206864406560404, 7481721767258314475, 4938289557806202459, 4938289557806202459, 4938289557806202459, 4938289557806202459, 10879245026471355412, 10879245026471355412, 10879245026471355412, 10879245026471355412, 3720317251396855529, 3720317251396855529, 3720317251396855529, 3720317251396855529, 10415052716999926891, 7882324864216808000, 6833873961703491437, 12397854334200377101, 11714806517525372952, 14084054139117991909, 14084054139117991909, 14084054139117991909, 14084054139117991909, 13183223717059209830, 13183223717059209830, 13183223717059209830, 13183223717059209830, 17622765900528814828, 18446744073709551609]
2.0.0: [9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 17218099967287971373, 13656126546251369994, 8348731800770055667, 8348731800770055667, 8348731800770055667, 8348731800770055667, 5343560834163534075, 5343560834163534075, 5343560834163534075, 5343560834163534075, 3269881205696930493, 3269881205696930493, 3269881205696930493, 3269881205696930493, 13084577964608946138, 17632152680294710829, 6280827595974488025, 11366273065597691016, 12540298840484485334, 6453370661115098702, 6453370661115098702, 6453370661115098702, 6453370661115098702, 11911029417517962984, 11911029417517962984, 11911029417517962984, 11911029417517962984, 15250992649697139781, 18446744073709551609]
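The side-by-side comparison can also be done mechanically. The sketch below uses the divides layer values from the output above; the helper name changed_columns is hypothetical. It lists the columns whose summed hash differs between the two pandas versions, together with their dtype.

```python
def changed_columns(columns, dtypes, old_hashes, new_hashes):
    """Return (column, dtype) pairs whose hash differs between two runs."""
    return [
        (col, dtype)
        for col, dtype, old, new in zip(columns, dtypes, old_hashes, new_hashes)
        if old != new
    ]


# Values copied from the "divides" layer output above.
columns = ["id", "areasqkm", "type", "toid", "geometry"]
dtypes = ["object", "float64", "object", "object", "geometry"]
hashes_1_5_3 = [9910909016688245206, 4257016642943720818, 523519807225067410,
                4254618872625276163, 9604451431115515325]
hashes_2_0_0 = [9910909016688245206, 7714536644407060282, 523519807225067410,
                4254618872625276163, 9604451431115515325]

diff = changed_columns(columns, dtypes, hashes_1_5_3, hashes_2_0_0)
print(diff)
```

For this layer only the float64 column areasqkm differs.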

aaraney · 2023-04-21T17:12:10Z

So, I've started to isolate the problem; however, I still don't understand why this is happening. Something seems different about pd.DataFrame.values between 1.5.3 and 2.0.0:

Script

import numpy as np
import pandas as pd
import geopandas as gpd
from pandas.util import hash_array
from pprint import pprint

p = "<path-to-repo>/data/example_hydrofabric_2/hydrofabric.gpkg"
print(pd.__version__)

df = gpd.read_file(p, layer="divides")

subset = df[["id", "areasqkm"]]
subset_loc = df.loc[:, ["id", "areasqkm"]]
square = pd.DataFrame({"id": df["id"], "areasqkm": df["areasqkm"]})
loc = pd.DataFrame({"id": df.loc[:, "id"], "areasqkm": df.loc[:, "areasqkm"]})
values = pd.DataFrame({"id": df["id"].values, "areasqkm": df["areasqkm"].values})
tolist = pd.DataFrame({"id": df["id"].values.tolist(), "areasqkm": df["areasqkm"].values.tolist()})

dfs = [subset, subset_loc, square, loc, values, tolist]

print("apply to each row")
pprint([d.apply(lambda a: hash_array(a.values), axis=0).values.sum() for d in dfs])

print("apply_along_axis to each row")
pprint([np.apply_along_axis(hash_array, 0, d.values).sum() for d in dfs])

print("apply to each column")
pprint([d.apply(lambda a: hash_array(a.values), axis=1).values.sum().sum() for d in dfs])

print("apply_along_axis to each column")
pprint([np.apply_along_axis(hash_array, 1, d.values).sum() for d in dfs])

1.5.3
apply to each row
[14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024]
apply_along_axis to each row
[14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024]
apply to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
apply_along_axis to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]

2.0.0
apply to each row
[14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024]
apply_along_axis to each row
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
apply to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
apply_along_axis to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]

aaraney · 2023-04-21T20:23:44Z

So, I figured it out. Here is the simplest example that illustrates and reproduces the problem:

import numpy as np
from pandas.util import hash_array

a = np.array([1.0], dtype="object")
print(hash_array(a))

# 1.5.3
# [3035652100526550566]

# 2.0.0
# [7736021350537868001]

Having looked through the pandas source, this regression was introduced in pandas-dev/pandas#50001, specifically here (diff below).

diff --git a/pandas/core/util/hashing.py b/pandas/core/util/hashing.py
index 5a5e46e0227aa..e0b18047aa0ec 100644
--- a/pandas/core/util/hashing.py
+++ b/pandas/core/util/hashing.py
@@ -344,9 +344,7 @@ def _hash_ndarray(
             )
 
             codes, categories = factorize(vals, sort=False)
-            cat = Categorical(
-                codes, Index._with_infer(categories), ordered=False, fastpath=True
-            )
+            cat = Categorical(codes, Index(categories), ordered=False, fastpath=True)
             return _hash_categorical(cat, encoding, hash_key)
 
         try:

In short, the array is factorized into a categorical. In 1.5.3 the dtype of the categories was inferred from their values (Index._with_infer) instead of being taken from the dtype specified on the np.ndarray object. In 2.0.0 this seems to have been fixed, so hashed np.ndarrays now respect their dtype.

Tying this back to pd.DataFrame.values: .values must set the dtype of the np.ndarray it returns to a type that every type in the collection can be cast to (e.g. float64, int32, object). In our case, since we have a dataframe of strings, floats, and ints, the dtype has to be object. Consequently, every inner dimension of that ndarray (e.g. each column slice) inherits the object dtype. My guess is that .values actually returns a copy-on-write (CoW) view of the dataframe's inner ndarrays, and that view has to "show" every inner array as having the outermost dtype.
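A small sketch of the dtype point, using a made-up two-column frame rather than the hydrofabric data: a column taken from .values of a mixed frame is an object array, while the same column taken from the Series keeps float64. Whether the two hash_array results agree appears to depend on the pandas version, as the runs in this thread suggest, so no expected hashes are shown.

```python
import numpy as np
import pandas as pd
from pandas.util import hash_array

# Hypothetical stand-in for a layer with string and float columns.
df = pd.DataFrame({"id": ["cat-1", "cat-2"], "areasqkm": [1.5, 2.25]})

mixed = df.values                        # one ndarray for the whole frame
col_from_values = mixed[:, 1]            # column slice of the mixed array
col_from_series = df["areasqkm"].values  # column taken on its own

# The mixed array and its slices are object; the Series column is float64.
print(mixed.dtype, col_from_values.dtype, col_from_series.dtype)

# Same numbers, different dtypes: compare what hash_array returns for each.
print(hash_array(col_from_values))
print(hash_array(col_from_series))
```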

aaraney · 2023-04-24T13:28:25Z

More weirdness

import numpy as np
import hashlib
from pandas.util import hash_array, hash_pandas_object
import geopandas as gpd
import fiona

p = "<path-to-repo>/dmod/refactor-data-service/data/example_hydrofabric_2/hydrofabric.gpkg"

layers = fiona.listlayers(p)
dataframes = {layer_name: gpd.read_file(p, layer=layer_name) for layer_name in layers}

layer_hashes = [np.apply_along_axis(hash_array, 0, dataframes[l].values).sum() for l in layers]
print(layer_hashes)
# 1.5.3
# [10103771696888273306, 4572071176093428412, 15272590391029730009, 15651706240877393163, 15901469198598983537, 17501800407106816969, 756873605291097582]
# 2.0.0
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 15651706240877393163, 12994735377762201353, 12039723046569473286, 18438881045715204344]

layer_hashes = [np.apply_along_axis(lambda h: hash_array(h, categorize=False), 0, dataframes[l].values).sum() for l in layers]
print(layer_hashes)
# 1.5.3
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 563669827856263632, 8015613103103036070, 14264485950225397329, 14213034935086656097]
# 2.0.0
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 563669827856263632, 8015613103103036070, 14264485950225397329, 14213034935086656097]

layer_hashes = [np.sum(hash_pandas_object(dataframes[layer]).values) for layer in layers]
print(layer_hashes)
# 1.5.3
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
# 2.0.0
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]

layer_hashes = [hash_pandas_object(dataframes[layer]).sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [-3515745103180661136, 5391557828027012765, -4959420968363052274, 6147183089241954451, 789248401423909681, -8480411101370528140, -6595055862230412446]
# 2.0.0
# [-3515745103180661136, 5391557828027012765, -4959420968363052274, 6147183089241954451, 789248401423909681, -8480411101370528140, -6595055862230412446]

layer_hashes = [hash_pandas_object(dataframes[layer]).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
# 2.0.0
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]

layer_hashes = [dataframes[layer].apply(hash_pandas_object).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [76075061348514196, 4684754646151721689, 5687276790938548378, 16714125617493265531, 16929656731059435989, 2147256848333013419, 7136568188139294014]
# 2.0.0
# [76075061348514196, 4684754646151721689, 5687276790938548378, 16714125617493265531, 16929656731059435989, 2147256848333013419, 7136568188139294014]

layer_hashes = [dataframes[layer].apply(lambda a: hash_array(a.values), axis=0).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [7219723789373133966, 4289999241617869509, 9598795642696610532, 15651706240877393163, 17758308453597611833, 17501800407106816969, 15369625150764002746]
# 2.0.0
# [7219723789373133966, 4289999241617869509, 9598795642696610532, 15651706240877393163, 17758308453597611833, 17501800407106816969, 15369625150764002746]

aaraney · 2023-04-24T14:56:26Z

Given that hash_pandas_object produces the same result for both versions (if the sum is computed using numpy), I think our best bet is to switch our implementation to use hash_pandas_object. Having talked with @robertbartel about this, the reason hash_array is likely used now is concerns with geopandas, specifically geometry columns in a geopandas dataframe. In brief, geopandas uses shapely objects to represent geometries, and at one point (shapely<2.0.0) shapely geometries were not hashable (see shapely #209 and geopandas #221). However, we now require shapely>=2.0.0, so this should not be an issue.
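A rough sketch of what that switch could look like, assuming the rest of uid stays the same. This is not the change that was merged; uid_from_frames and the two small frames are hypothetical, and plain pandas DataFrames stand in for the geopandas layers (no geometry column is exercised here).

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def uid_from_frames(frames):
    """Hash each layer with hash_pandas_object, then SHA-1 the joined sums."""
    # .values.sum() performs the sum in numpy on the uint64 row hashes.
    layer_hashes = [hash_pandas_object(df).values.sum() for df in frames.values()]
    joined = ",".join(str(h) for h in layer_hashes)
    return hashlib.sha1(joined.encode("UTF-8")).hexdigest()


# Hypothetical stand-ins for two hydrofabric layers.
frames = {
    "divides": pd.DataFrame({"id": ["cat-1", "cat-2"], "areasqkm": [1.5, 2.25]}),
    "nexus": pd.DataFrame({"id": ["nex-1"], "type": ["nexus"]}),
}
uid = uid_from_frames(frames)
print(uid)
```

Summing through .values rather than Series.sum() follows the observation above that the two give differently signed results.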

aaraney · 2023-11-02T16:03:02Z

Reopening this because tests are failing again due to a related failure. The failure started recurring 3 weeks ago. https://github.com/NOAA-OWP/DMOD/actions/runs/6510147982/job/17683206387#step:10:319

Traceback (most recent call last):
  File "/home/runner/work/DMOD/DMOD/python/lib/modeldata/dmod/test/test_geopackage_hydrofabric.py", line 309, in test_uid_1_a
    self.assertEqual(hydrofabric.uid, expected_uid)
AssertionError: '10105591058b39504e73842da89e0c3dcac5ba99' != 'b7367023aadad961315dd05e184359dad68613c3'
- 10105591058b39504e73842da89e0c3dcac5ba99
+ b7367023aadad961315dd05e184359dad68613c3

aaraney · 2023-11-02T16:14:35Z

The same code path is not affected. #468 will track this instead.

aaraney added the bug and CI labels Apr 20, 2023

aaraney assigned christophertubbs, robertbartel and aaraney Apr 20, 2023

aaraney changed the title ~~Pandas 2.0.0 causing modeldata test to fail~~ Pandas likely 2.0.0 causing modeldata test to fail Apr 20, 2023

robertbartel added the maas label Apr 24, 2023

robertbartel mentioned this issue Apr 24, 2023

Address issues with Geopackage hydrofabric unique id tests #326

Merged

robertbartel closed this as completed Apr 26, 2023

aaraney reopened this Nov 2, 2023

aaraney mentioned this issue Nov 2, 2023

TestGeoPackageHydrofabric.test_uid_1_a failure #468

Closed

aaraney closed this as completed Nov 2, 2023
