Pandas likely 2.0.0 causing modeldata test to fail #324

Closed
aaraney opened this issue Apr 20, 2023 · 8 comments
Labels: bug, CI, maas

Comments

@aaraney
Member

aaraney commented Apr 20, 2023

The dmod.test.test_geopackage_hydrofabric.TestGeoPackageHydrofabric.test_uid_1_a test is currently failing in several PRs. Below is a snippet from an action log showing the failure.

source

===========================================================================
............................F...............
======================================================================
FAIL: test_uid_1_a (dmod.test.test_geopackage_hydrofabric.TestGeoPackageHydrofabric)
Test that the hydrofabric instance for example one has the expected unique id.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/DMOD/DMOD/python/lib/modeldata/dmod/test/test_geopackage_hydrofabric.py", line 309, in test_uid_1_a
    self.assertEqual(hydrofabric.uid, expected_uid)
AssertionError: '7b022f401ea2da1fdce2c1c2e36a8664b2299778' != '8a24b5eeae2596ceaf21058c49a27c8ae6f444ab'
- 7b022f401ea2da1fdce2c1c2e36a8664b2299778
+ 8a24b5eeae2596ceaf21058c49a27c8ae6f444ab

I compared the dependency versions installed when the tests were passing against those installed when they fail, and pandas==2.0.0 looks like the likely culprit. The last pandas version known to work is 1.5.3. I tested locally with fiona 1.9.1 and 1.9.3 alongside pandas==1.5.3, and the tests passed. However, there is one outlier action that passed with pandas==2.0.0 and fiona==1.9.2 installed. I'm still a little puzzled by that one and haven't been able to reproduce it locally yet (there isn't a fiona wheel for that version for my machine); I'll try again in the morning.

Passing with pandas==1.5.3
Failing with pandas==2.0.0
Weird passing test pandas==2.0.0

I looked through fiona's change log and the PRs for release 1.9.3, and nothing looks related. I haven't gone through the geopandas change log yet, so I need to check there too.
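For comparing environments between a passing and a failing run, a small helper along these lines can dump the installed versions. This is only a sketch: the package list is an assumption based on the libraries named in this thread, and `installed_versions` is a hypothetical helper, not something in DMOD.

```python
from importlib import metadata

# Assumed list of packages worth comparing; adjust to taste.
PACKAGES = ("pandas", "fiona", "geopandas", "shapely", "numpy")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print in a pip-style format so two runs can be diffed line by line.
for name, version in installed_versions().items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running this in both environments and diffing the output narrows the search to the packages whose versions differ.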

@aaraney aaraney added bug Something isn't working CI Issues related to the continuous integration testing. labels Apr 20, 2023
@aaraney aaraney changed the title Pandas 2.0.0 causing modeldata test to fail Pandas likely 2.0.0 causing modeldata test to fail Apr 20, 2023
@christophertubbs
Contributor

I think it's pandas, but I believe there's a default seed in the hash function for the column. If that was changed for 2.0 for whatever reason, it'd change the results of the hash function on the column.
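To make the hash key idea concrete, here is a minimal sketch of how `hash_key` affects `hash_array` on string data. The key `"0123456789123456"` is, to my knowledge, the default in the pandas source; the sample values and the alternative key are made up for illustration.

```python
import numpy as np
from pandas.util import hash_array

# Toy object-dtype column of strings; not data from the hydrofabric.
ids = np.array(["cat-1", "cat-2", "cat-3"], dtype=object)

default_hashes = hash_array(ids)
# Passing the (assumed) default key explicitly should change nothing.
explicit_default = hash_array(ids, hash_key="0123456789123456")
# A different 16-character key gives different hashes for the same strings.
other_key = hash_array(ids, hash_key="6543219876543210")

print((default_hashes == explicit_default).all())
print((default_hashes == other_key).any())
```

If the default key had changed between releases, string columns would hash differently as well, which gives a quick way to test this theory.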

@aaraney
Member Author

aaraney commented Apr 21, 2023

Yeah, the more I look into this, the more convinced I am that it is pandas. Just so we are all on the same page, the failing test compares hashes derived from a geopackage version of the hydrofabric. Here is the code:

    @property
    def uid(self) -> str:
        # removed docstring for readability
        layer_hashes = [np.apply_along_axis(hash_array, 0, self._dataframes[l].values).sum() for l in self._layer_names]
        return hashlib.sha1(','.join([str(h) for h in layer_hashes]).encode('UTF-8')).hexdigest()

self._dataframes is a dictionary mapping each geopackage layer name to the geopandas GeoDataFrame for that layer.
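For anyone who wants to poke at this without a geopackage, the same computation can be sketched on toy data. The two DataFrames below are hypothetical stand-ins for the layers (plain pandas, no geometry column), and `compute_uid` is an illustrative name, not the DMOD method.

```python
import hashlib

import numpy as np
import pandas as pd
from pandas.util import hash_array

# Hypothetical stand-ins for the geopackage layers.
dataframes = {
    "divides": pd.DataFrame({"id": ["cat-1", "cat-2"], "areasqkm": [1.5, 2.5]}),
    "nexus": pd.DataFrame({"id": ["nex-1", "nex-2"], "toid": ["cat-1", "cat-2"]}),
}
layer_names = list(dataframes)


def compute_uid(frames, names):
    layer_hashes = []
    for name in names:
        # .values is a single 2-D array; mixed column types force dtype=object.
        values = frames[name].values
        # Hash each column (axis 0), then sum every element hash.
        # The uint64 sum wraps around on overflow.
        layer_hashes.append(np.apply_along_axis(hash_array, 0, values).sum())
    joined = ",".join(str(h) for h in layer_hashes)
    return hashlib.sha1(joined.encode("UTF-8")).hexdigest()


uid = compute_uid(dataframes, layer_names)
print(uid)
```

The printed digest depends on the installed pandas version, so no expected value is shown.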

I wrote up a script to do basically the same thing to more easily compare pandas versions. The script is in the twirl down if you are interested.

Hash each column in geopackage script
import numpy as np
import pandas as pd
import geopandas as gpd
from pandas.util import hash_array
import fiona

p = "<path-to-repo>/data/example_hydrofabric_2/hydrofabric.gpkg"
layers = fiona.listlayers(p)

dataframes = {layer_name: gpd.read_file(p, layer=layer_name) for layer_name in layers}

with open(pd.__version__, "w") as f:
    f.write(f"layers: {str(layers)}\n")
    for l in layers:
        f.write(f"layer: {l}\n")
        f.write(f"columns: {str(dataframes[l].columns.to_list())}\n")
        f.write(f"column type: {list(map(str, dataframes[l].dtypes))}\n")
        
        # hash every element, one column at a time (axis 0); roughly pd.DataFrame.applymap
        hash_of_each_value = np.apply_along_axis(hash_array, 0, dataframes[l].values)
        # sum the element hashes within each column, giving one value per column
        summed_hash_per_column = np.apply_along_axis(np.sum, 0, hash_of_each_value)
        f.write(f"{str(summed_hash_per_column.tolist())}\n")

print(pd.__version__)
Raw output

1.5.3

layers: ['divides', 'flowpaths', 'nexus', 'flowpath_edge_list', 'flowpath_attributes', 'crosswalk', 'cfe_noahowp_attributes']
layer: divides
columns    : ['id', 'areasqkm', 'type', 'toid', 'geometry']
column type: ['object', 'float64', 'object', 'object', 'geometry']
[9910909016688245206, 4257016642943720818, 523519807225067410, 4254618872625276163, 9604451431115515325]
layer: flowpaths
columns    : ['id', 'lengthkm', 'main_id', 'member_comid', 'tot_drainage_areasqkm', 'order', 'realized_catchment', 'toid', 'geometry']
column type: ['object', 'float64', 'int64', 'object', 'float64', 'float64', 'object', 'object', 'geometry']
[11397087368252117007, 10440532369965811158, 5750755180915541183, 6610491045185399198, 12480819840933176463, 16223896574922372839, 9910909016688245206, 4254618872625276163, 1289937201443695659]
layer: nexus
columns    : ['id', 'type', 'toid', 'geometry']
column type: ['object', 'object', 'object', 'geometry']
[4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
layer: flowpath_edge_list
columns    : ['id', 'toid', 'geometry']
column type: ['object', 'object', 'geometry']
[11397087368252117007, 4254618872625276163, 18446744073709551609]
layer: flowpath_attributes
columns    : ['id', 'rl_gages', 'rl_NHDWaterbodyComID', 'Qi', 'MusK', 'MusX', 'n', 'So', 'ChSlp', 'BtmWdth', 'time', 'Kchan', 'nCC', 'TopWdthCC', 'TopWdth', 'length_m', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
[11397087368252117007, 18446744073709551609, 18446744073709551609, 18446744073709551612, 6945878055642010754, 10639750839072135192, 14059973711066289446, 6292855941134473512, 15378706233695216860, 10232710820753472246, 18446744073709551612, 18446744073709551612, 3296008595016872355, 16812141029101979377, 7177634671963409277, 5902442301448765624, 18446744073709551609]
layer: crosswalk
columns    : ['id', 'toid', 'NHDPlusV2_COMID', 'NHDPlusV2_COMID_part', 'reconciled_ID', 'mainstem', 'POI_ID', 'POI_TYPE', 'POI_VALUE', 'geometry']
column type: ['object', 'object', 'float64', 'float64', 'float64', 'float64', 'object', 'object', 'object', 'geometry']
[3752687620607738028, 10679333900438930345, 11644050028096106736, 2569989159426054126, 11240671135618503159, 13074351299704411022, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
layer: cfe_noahowp_attributes
columns    : ['id', 'gw_Coeff', 'gw_Zmax', 'gw_Expon', 'ISLTYP', 'IVGTYP', 'bexp_soil_layers_stag=1', 'bexp_soil_layers_stag=2', 'bexp_soil_layers_stag=3', 'bexp_soil_layers_stag=4', 'dksat_soil_layers_stag=1', 'dksat_soil_layers_stag=2', 'dksat_soil_layers_stag=3', 'dksat_soil_layers_stag=4', 'psisat_soil_layers_stag=1', 'psisat_soil_layers_stag=2', 'psisat_soil_layers_stag=3', 'psisat_soil_layers_stag=4', 'cwpvt', 'mfsno', 'mp', 'refkdt', 'slope', 'smcmax_soil_layers_stag=1', 'smcmax_soil_layers_stag=2', 'smcmax_soil_layers_stag=3', 'smcmax_soil_layers_stag=4', 'smcwlt_soil_layers_stag=1', 'smcwlt_soil_layers_stag=2', 'smcwlt_soil_layers_stag=3', 'smcwlt_soil_layers_stag=4', 'vcmx25', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'int64', 'int64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
[9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 5978206864406560404, 7481721767258314475, 4938289557806202459, 4938289557806202459, 4938289557806202459, 4938289557806202459, 10879245026471355412, 10879245026471355412, 10879245026471355412, 10879245026471355412, 3720317251396855529, 3720317251396855529, 3720317251396855529, 3720317251396855529, 10415052716999926891, 7882324864216808000, 6833873961703491437, 12397854334200377101, 11714806517525372952, 14084054139117991909, 14084054139117991909, 14084054139117991909, 14084054139117991909, 13183223717059209830, 13183223717059209830, 13183223717059209830, 13183223717059209830, 17622765900528814828, 18446744073709551609]

2.0.0

layers: ['divides', 'flowpaths', 'nexus', 'flowpath_edge_list', 'flowpath_attributes', 'crosswalk', 'cfe_noahowp_attributes']
layer: divides
columns    : ['id', 'areasqkm', 'type', 'toid', 'geometry']
column type: ['object', 'float64', 'object', 'object', 'geometry']
[9910909016688245206, 7714536644407060282, 523519807225067410, 4254618872625276163, 9604451431115515325]
layer: flowpaths
columns    : ['id', 'lengthkm', 'main_id', 'member_comid', 'tot_drainage_areasqkm', 'order', 'realized_catchment', 'toid', 'geometry']
column type: ['object', 'float64', 'int64', 'object', 'float64', 'float64', 'object', 'object', 'geometry']
[11397087368252117007, 295265702994286425, 3079000369136598424, 6610491045185399198, 16619260224802041063, 10866567253940249541, 9910909016688245206, 4254618872625276163, 1289937201443695659]
layer: nexus
columns    : ['id', 'type', 'toid', 'geometry']
column type: ['object', 'object', 'object', 'geometry']
[4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
layer: flowpath_edge_list
columns    : ['id', 'toid', 'geometry']
column type: ['object', 'object', 'geometry']
[11397087368252117007, 4254618872625276163, 18446744073709551609]
layer: flowpath_attributes
columns    : ['id', 'rl_gages', 'rl_NHDWaterbodyComID', 'Qi', 'MusK', 'MusX', 'n', 'So', 'ChSlp', 'BtmWdth', 'time', 'Kchan', 'nCC', 'TopWdthCC', 'TopWdth', 'length_m', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
[11397087368252117007, 18446744073709551609, 18446744073709551609, 3179149979871189512, 11986439733596641007, 11099151438926169628, 17812221497236259243, 5178549615237787454, 93929622743921236, 11651573430419844666, 3179149979871189512, 3179149979871189512, 6398729633788531660, 1245452818585839232, 12465211508769344715, 6362659139139935070, 18446744073709551609]
layer: crosswalk
columns    : ['id', 'toid', 'NHDPlusV2_COMID', 'NHDPlusV2_COMID_part', 'reconciled_ID', 'mainstem', 'POI_ID', 'POI_TYPE', 'POI_VALUE', 'geometry']
column type: ['object', 'object', 'float64', 'float64', 'float64', 'float64', 'object', 'object', 'object', 'geometry']
[3752687620607738028, 10679333900438930345, 8742293209049677422, 8334557188444525933, 15888965815924940002, 101168048888588003, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
layer: cfe_noahowp_attributes
columns    : ['id', 'gw_Coeff', 'gw_Zmax', 'gw_Expon', 'ISLTYP', 'IVGTYP', 'bexp_soil_layers_stag=1', 'bexp_soil_layers_stag=2', 'bexp_soil_layers_stag=3', 'bexp_soil_layers_stag=4', 'dksat_soil_layers_stag=1', 'dksat_soil_layers_stag=2', 'dksat_soil_layers_stag=3', 'dksat_soil_layers_stag=4', 'psisat_soil_layers_stag=1', 'psisat_soil_layers_stag=2', 'psisat_soil_layers_stag=3', 'psisat_soil_layers_stag=4', 'cwpvt', 'mfsno', 'mp', 'refkdt', 'slope', 'smcmax_soil_layers_stag=1', 'smcmax_soil_layers_stag=2', 'smcmax_soil_layers_stag=3', 'smcmax_soil_layers_stag=4', 'smcwlt_soil_layers_stag=1', 'smcwlt_soil_layers_stag=2', 'smcwlt_soil_layers_stag=3', 'smcwlt_soil_layers_stag=4', 'vcmx25', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'int64', 'int64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
[9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 17218099967287971373, 13656126546251369994, 8348731800770055667, 8348731800770055667, 8348731800770055667, 8348731800770055667, 5343560834163534075, 5343560834163534075, 5343560834163534075, 5343560834163534075, 3269881205696930493, 3269881205696930493, 3269881205696930493, 3269881205696930493, 13084577964608946138, 17632152680294710829, 6280827595974488025, 11366273065597691016, 12540298840484485334, 6453370661115098702, 6453370661115098702, 6453370661115098702, 6453370661115098702, 11911029417517962984, 11911029417517962984, 11911029417517962984, 11911029417517962984, 15250992649697139781, 18446744073709551609]

Looking at the combined output below, the discrepancies appear only in columns with numeric dtypes. This leads me to think there might be a difference in how NA / None values are represented and/or hashed between the two versions. Looking into that now.

layers: ['divides', 'flowpaths', 'nexus', 'flowpath_edge_list', 'flowpath_attributes', 'crosswalk', 'cfe_noahowp_attributes']
layer: divides
columns    : ['id', 'areasqkm', 'type', 'toid', 'geometry']
column type: ['object', 'float64', 'object', 'object', 'geometry']
1.5.3: [9910909016688245206, 4257016642943720818, 523519807225067410, 4254618872625276163, 9604451431115515325]
2.0.0: [9910909016688245206, 7714536644407060282, 523519807225067410, 4254618872625276163, 9604451431115515325]
layer: flowpaths
columns    : ['id', 'lengthkm', 'main_id', 'member_comid', 'tot_drainage_areasqkm', 'order', 'realized_catchment', 'toid', 'geometry']
column type: ['object', 'float64', 'int64', 'object', 'float64', 'float64', 'object', 'object', 'geometry']
1.5.3: [11397087368252117007, 10440532369965811158, 5750755180915541183, 6610491045185399198, 12480819840933176463, 16223896574922372839, 9910909016688245206, 4254618872625276163, 1289937201443695659]
2.0.0: [11397087368252117007, 295265702994286425, 3079000369136598424, 6610491045185399198, 16619260224802041063, 10866567253940249541, 9910909016688245206, 4254618872625276163, 1289937201443695659]
layer: nexus
columns    : ['id', 'type', 'toid', 'geometry']
column type: ['object', 'object', 'object', 'geometry']
1.5.3: [4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
2.0.0: [4254618872625276163, 5245838080655516008, 15939269768229192618, 8279607743229296836]
layer: flowpath_edge_list
columns    : ['id', 'toid', 'geometry']
column type: ['object', 'object', 'geometry']
1.5.3: [11397087368252117007, 4254618872625276163, 18446744073709551609]
2.0.0: [11397087368252117007, 4254618872625276163, 18446744073709551609]
layer: flowpath_attributes
columns    : ['id', 'rl_gages', 'rl_NHDWaterbodyComID', 'Qi', 'MusK', 'MusX', 'n', 'So', 'ChSlp', 'BtmWdth', 'time', 'Kchan', 'nCC', 'TopWdthCC', 'TopWdth', 'length_m', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
1.5.3: [11397087368252117007, 18446744073709551609, 18446744073709551609, 18446744073709551612, 6945878055642010754, 10639750839072135192, 14059973711066289446, 6292855941134473512, 15378706233695216860, 10232710820753472246, 18446744073709551612, 18446744073709551612, 3296008595016872355, 16812141029101979377, 7177634671963409277, 5902442301448765624, 18446744073709551609]
2.0.0: [11397087368252117007, 18446744073709551609, 18446744073709551609, 3179149979871189512, 11986439733596641007, 11099151438926169628, 17812221497236259243, 5178549615237787454, 93929622743921236, 11651573430419844666, 3179149979871189512, 3179149979871189512, 6398729633788531660, 1245452818585839232, 12465211508769344715, 6362659139139935070, 18446744073709551609]
layer: crosswalk
columns    : ['id', 'toid', 'NHDPlusV2_COMID', 'NHDPlusV2_COMID_part', 'reconciled_ID', 'mainstem', 'POI_ID', 'POI_TYPE', 'POI_VALUE', 'geometry']
column type: ['object', 'object', 'float64', 'float64', 'float64', 'float64', 'object', 'object', 'object', 'geometry']
1.5.3: [3752687620607738028, 10679333900438930345, 11644050028096106736, 2569989159426054126, 11240671135618503159, 13074351299704411022, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
2.0.0: [3752687620607738028, 10679333900438930345, 8742293209049677422, 8334557188444525933, 15888965815924940002, 101168048888588003, 18057368295285813311, 14239525810013382383, 6030799452754084336, 18446744073709551603]
layer: cfe_noahowp_attributes
columns    : ['id', 'gw_Coeff', 'gw_Zmax', 'gw_Expon', 'ISLTYP', 'IVGTYP', 'bexp_soil_layers_stag=1', 'bexp_soil_layers_stag=2', 'bexp_soil_layers_stag=3', 'bexp_soil_layers_stag=4', 'dksat_soil_layers_stag=1', 'dksat_soil_layers_stag=2', 'dksat_soil_layers_stag=3', 'dksat_soil_layers_stag=4', 'psisat_soil_layers_stag=1', 'psisat_soil_layers_stag=2', 'psisat_soil_layers_stag=3', 'psisat_soil_layers_stag=4', 'cwpvt', 'mfsno', 'mp', 'refkdt', 'slope', 'smcmax_soil_layers_stag=1', 'smcmax_soil_layers_stag=2', 'smcmax_soil_layers_stag=3', 'smcmax_soil_layers_stag=4', 'smcwlt_soil_layers_stag=1', 'smcwlt_soil_layers_stag=2', 'smcwlt_soil_layers_stag=3', 'smcwlt_soil_layers_stag=4', 'vcmx25', 'geometry']
column type: ['object', 'float64', 'float64', 'float64', 'int64', 'int64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'float64', 'geometry']
1.5.3: [9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 5978206864406560404, 7481721767258314475, 4938289557806202459, 4938289557806202459, 4938289557806202459, 4938289557806202459, 10879245026471355412, 10879245026471355412, 10879245026471355412, 10879245026471355412, 3720317251396855529, 3720317251396855529, 3720317251396855529, 3720317251396855529, 10415052716999926891, 7882324864216808000, 6833873961703491437, 12397854334200377101, 11714806517525372952, 14084054139117991909, 14084054139117991909, 14084054139117991909, 14084054139117991909, 13183223717059209830, 13183223717059209830, 13183223717059209830, 13183223717059209830, 17622765900528814828, 18446744073709551609]
2.0.0: [9910909016688245206, 18446744073709551609, 18446744073709551609, 18446744073709551609, 17218099967287971373, 13656126546251369994, 8348731800770055667, 8348731800770055667, 8348731800770055667, 8348731800770055667, 5343560834163534075, 5343560834163534075, 5343560834163534075, 5343560834163534075, 3269881205696930493, 3269881205696930493, 3269881205696930493, 3269881205696930493, 13084577964608946138, 17632152680294710829, 6280827595974488025, 11366273065597691016, 12540298840484485334, 6453370661115098702, 6453370661115098702, 6453370661115098702, 6453370661115098702, 11911029417517962984, 11911029417517962984, 11911029417517962984, 11911029417517962984, 15250992649697139781, 18446744073709551609]

@aaraney
Member Author

aaraney commented Apr 21, 2023

So, I've started to isolate the problem, though I still don't understand why it is happening. Something seems different about pd.DataFrame.values between 1.5.3 and 2.0.0:

Script
import numpy as np
import pandas as pd
import geopandas as gpd
from pandas.util import hash_array
from pprint import pprint

p = "<path-to-repo>/data/example_hydrofabric_2/hydrofabric.gpkg"
print(pd.__version__)

df = gpd.read_file(p, layer="divides")

subset = df[["id", "areasqkm"]]
subset_loc = df.loc[:, ["id", "areasqkm"]]
square = pd.DataFrame({"id": df["id"], "areasqkm": df["areasqkm"]})
loc = pd.DataFrame({"id": df.loc[:, "id"], "areasqkm": df.loc[:, "areasqkm"]})
values = pd.DataFrame({"id": df["id"].values, "areasqkm": df["areasqkm"].values})
tolist = pd.DataFrame({"id": df["id"].values.tolist(), "areasqkm": df["areasqkm"].values.tolist()})

dfs = [subset, subset_loc, square, loc, values, tolist]

print("apply to each row")
pprint([d.apply(lambda a: hash_array(a.values), axis=0).values.sum() for d in dfs])

print("apply_along_axis to each row")
pprint([np.apply_along_axis(hash_array, 0, d.values).sum() for d in dfs])

print("apply to each column")
pprint([d.apply(lambda a: hash_array(a.values), axis=1).values.sum().sum() for d in dfs])

print("apply_along_axis to each column")
pprint([np.apply_along_axis(hash_array, 1, d.values).sum() for d in dfs])
1.5.3
apply to each row
[14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024]
apply_along_axis to each row
[14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024]
apply to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
apply_along_axis to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
2.0.0
apply to each row
[14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024,
 14167925659631966024]
apply_along_axis to each row
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
apply to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]
apply_along_axis to each column
[17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488,
 17625445661095305488]

@aaraney
Member Author

aaraney commented Apr 21, 2023

So, I figured it out. Here is the simplest example that illustrates and reproduces the problem:

import numpy as np
from pandas.util import hash_array

a = np.array([1.0], dtype="object")
print(hash_array(a))

# 1.5.3
# [3035652100526550566]

# 2.0.0
# [7736021350537868001]

Having looked through the pandas source, this regression was introduced in pandas-dev/pandas#50001, specifically here (diff below).

diff --git a/pandas/core/util/hashing.py b/pandas/core/util/hashing.py
index 5a5e46e0227aa..e0b18047aa0ec 100644
--- a/pandas/core/util/hashing.py
+++ b/pandas/core/util/hashing.py
@@ -344,9 +344,7 @@ def _hash_ndarray(
             )
 
             codes, categories = factorize(vals, sort=False)
-            cat = Categorical(
-                codes, Index._with_infer(categories), ordered=False, fastpath=True
-            )
+            cat = Categorical(codes, Index(categories), ordered=False, fastpath=True)
             return _hash_categorical(cat, encoding, hash_key)
 
         try:

In short, the array is categorized, and in 1.5.3 the dtype of the resulting categories is inferred from the values (via Index._with_infer) instead of being taken from the dtype specified on the np.ndarray. In 2.0.0 this seems to have been fixed, so hashing an np.ndarray now respects its dtype. Tying this back to pd.DataFrame.values: .values must give its returned np.ndarray a single dtype that every value in the frame can be cast to (e.g. float64, int32, object). In our case the dataframe holds strings, floats, and ints, so that dtype has to be object. Any inner dimension of an ndarray view then inherits that dtype. My guess is that .values actually returns a copy-on-write (CoW) view of the dataframe's inner ndarrays, and that view has to "show" every inner array dimension with the outermost dtype.
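The dtype part of this is easy to check on a toy frame. A minimal sketch, assuming a two-column frame shaped like the id / areasqkm subset used earlier (the values are made up):

```python
import numpy as np
import pandas as pd
from pandas.util import hash_array

# Toy frame mixing strings and floats.
df = pd.DataFrame({"id": ["cat-1", "cat-2"], "areasqkm": [1.5, 2.5]})

# The whole-frame array needs one dtype for everything, so it is object.
print(df.values.dtype)
# A single numeric column keeps its own dtype.
print(df["areasqkm"].values.dtype)
# A column sliced out of the 2-D array is still object, even though it holds floats.
print(df.values[:, 1].dtype)

as_float = np.array([1.0], dtype="float64")
as_object = np.array([1.0], dtype="object")
# Whether these two hashes agree is the version-dependent behaviour described
# above, so no expected output is given.
print(hash_array(as_float), hash_array(as_object))
```

This is why hashing through df.values and hashing column by column can disagree: the former always hands hash_array an object array for a mixed-type frame.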

@aaraney
Member Author

aaraney commented Apr 24, 2023

More weirdness

import numpy as np
import hashlib
from pandas.util import hash_array, hash_pandas_object
import geopandas as gpd
import fiona

p = "<path-to-repo>/dmod/refactor-data-service/data/example_hydrofabric_2/hydrofabric.gpkg"

layers = fiona.listlayers(p)
dataframes = {layer_name: gpd.read_file(p, layer=layer_name) for layer_name in layers}

layer_hashes = [np.apply_along_axis(hash_array, 0, dataframes[l].values).sum() for l in layers]
print(layer_hashes)
# 1.5.3
# [10103771696888273306, 4572071176093428412, 15272590391029730009, 15651706240877393163, 15901469198598983537, 17501800407106816969, 756873605291097582]
# 2.0.0
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 15651706240877393163, 12994735377762201353, 12039723046569473286, 18438881045715204344]

layer_hashes = [np.apply_along_axis(lambda h: hash_array(h, categorize=False), 0, dataframes[l].values).sum() for l in layers]
print(layer_hashes)
# 1.5.3
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 563669827856263632, 8015613103103036070, 14264485950225397329, 14213034935086656097]
# 2.0.0
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 563669827856263632, 8015613103103036070, 14264485950225397329, 14213034935086656097]

layer_hashes = [np.sum(hash_pandas_object(dataframes[layer]).values) for layer in layers]
print(layer_hashes)
# 1.5.3
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
# 2.0.0
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]

layer_hashes = [hash_pandas_object(dataframes[layer]).sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [-3515745103180661136, 5391557828027012765, -4959420968363052274, 6147183089241954451, 789248401423909681, -8480411101370528140, -6595055862230412446]
# 2.0.0
# [-3515745103180661136, 5391557828027012765, -4959420968363052274, 6147183089241954451, 789248401423909681, -8480411101370528140, -6595055862230412446]

layer_hashes = [hash_pandas_object(dataframes[layer]).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
# 2.0.0
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]

layer_hashes = [dataframes[layer].apply(hash_pandas_object).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [76075061348514196, 4684754646151721689, 5687276790938548378, 16714125617493265531, 16929656731059435989, 2147256848333013419, 7136568188139294014]
# 2.0.0
# [76075061348514196, 4684754646151721689, 5687276790938548378, 16714125617493265531, 16929656731059435989, 2147256848333013419, 7136568188139294014]

layer_hashes = [dataframes[layer].apply(lambda a: hash_array(a.values), axis=0).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [7219723789373133966, 4289999241617869509, 9598795642696610532, 15651706240877393163, 17758308453597611833, 17501800407106816969, 15369625150764002746]
# 2.0.0
# [7219723789373133966, 4289999241617869509, 9598795642696610532, 15651706240877393163, 17758308453597611833, 17501800407106816969, 15369625150764002746]

@robertbartel robertbartel added the maas MaaS Workstream label Apr 24, 2023
@aaraney
Member Author

aaraney commented Apr 24, 2023

Given that hash_pandas_object produces the same result for both versions (as long as the sum is computed using numpy), I think our best bet is to switch our implementation to hash_pandas_object. Having talked with @robertbartel about this, hash_array is likely used now because of concerns with geopandas, specifically geometry columns in a geopandas dataframe. In brief, geopandas uses shapely objects to represent geometries, and at one point (shapely<2.0.0) shapely geometries were not hashable (see shapely #209 and geopandas #221). However, we now require shapely>=2.0.0, so this should not be an issue.
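A rough sketch of what the switch could look like, using plain pandas DataFrames as hypothetical stand-ins for the layers. `compute_uid` is an illustrative name, and this has not been run against geometry columns here.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

# Hypothetical stand-ins for the geopackage layers (no geometry column).
dataframes = {
    "divides": pd.DataFrame({"id": ["cat-1", "cat-2"], "areasqkm": [1.5, 2.5]}),
    "nexus": pd.DataFrame({"id": ["nex-1", "nex-2"], "toid": ["cat-1", "cat-2"]}),
}
layer_names = list(dataframes)


def compute_uid(frames, names):
    layer_hashes = [
        # hash_pandas_object returns one uint64 hash per row (index included
        # by default); .values.sum() does the sum in numpy, staying in uint64.
        hash_pandas_object(frames[name]).values.sum()
        for name in names
    ]
    joined = ",".join(str(h) for h in layer_hashes)
    return hashlib.sha1(joined.encode("UTF-8")).hexdigest()


uid = compute_uid(dataframes, layer_names)
print(uid)
```

Summing through `.values` rather than calling `.sum()` on the returned Series matters because, in the comparison above, the Series sum came back as signed integers while the numpy sum stayed unsigned.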

@aaraney
Member Author

aaraney commented Nov 2, 2023

Reopening this because tests are failing again due to a related failure. The failure started recurring 3 weeks ago. https://github.com/NOAA-OWP/DMOD/actions/runs/6510147982/job/17683206387#step:10:319

Traceback (most recent call last):
  File "/home/runner/work/DMOD/DMOD/python/lib/modeldata/dmod/test/test_geopackage_hydrofabric.py", line 309, in test_uid_1_a
    self.assertEqual(hydrofabric.uid, expected_uid)
AssertionError: '10105591058b39504e73842da89e0c3dcac5ba99' != 'b7367023aadad961315dd05e184359dad68613c3'
- 10105591058b39504e73842da89e0c3dcac5ba99
+ b7367023aadad961315dd05e184359dad68613c3

@aaraney
Member Author

aaraney commented Nov 2, 2023

The same code path is not affected. #468 will track this instead.

@aaraney aaraney closed this as completed Nov 2, 2023