Pandas 2.0.0 likely causing modeldata test to fail #324
Comments
I think it's pandas, but I believe there's a default seed in the hash function for the column. If that was changed for 2.0 for whatever reason, it'd change the results of the hash function on the column.
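(For reference, an editorial addition: `pandas.util.hash_array` does take an explicit `hash_key` argument with a fixed default, so a changed default would shift every column hash exactly as described. The default key turns out not to have changed between 1.5.3 and 2.0.0, as the rest of the thread bears out, but here is a minimal way to check that hypothesis:)

```python
import numpy as np
from pandas.util import hash_array

a = np.array(["x", "y"], dtype="object")

# hash_key defaults to pandas' built-in 16-byte key; overriding it changes
# every hash, which is what a changed default seed would look like
print(hash_array(a))
print(hash_array(a, hash_key="0000000000000000"))
```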
Yeah, the more I look into this, the more I am also convinced that it is this `uid` property:

```python
@property
def uid(self) -> str:
    # removed docstring for readability
    # hash each column of each layer's .values array, then sum the
    # per-column hashes into a single number per layer
    layer_hashes = [np.apply_along_axis(hash_array, 0, self._dataframes[l].values).sum() for l in self._layer_names]
    # SHA-1 the comma-joined layer hashes into one stable hex id
    return hashlib.sha1(','.join([str(h) for h in layer_hashes]).encode('UTF-8')).hexdigest()
```
I wrote up a script that does basically the same thing, to more easily compare the two versions.

Hash each column in geopackage script:

```python
import numpy as np
import pandas as pd
import geopandas as gpd
from pandas.util import hash_array
import fiona

p = "<path-to-repo>/data/example_hydrofabric_2/hydrofabric.gpkg"

layers = fiona.listlayers(p)
dataframes = {layer_name: gpd.read_file(p, layer=layer_name) for layer_name in layers}

# use the running pandas version as the output filename, so results from a
# 1.5.3 environment and a 2.0.0 environment can be compared side by side
with open(pd.__version__, "w") as f:
    f.write(f"layers: {str(layers)}\n")
    for l in layers:
        f.write(f"layer: {l}\n")
        f.write(f"columns: {str(dataframes[l].columns.to_list())}\n")
        f.write(f"column type: {list(map(str, dataframes[l].dtypes))}\n")
        # computes hash of each element in dataframe (equivalent to pd.DataFrame.applymap)
        hash_of_each_value = np.apply_along_axis(hash_array, 0, dataframes[l].values)
        summed_hash_on_each_row = np.apply_along_axis(np.sum, 0, hash_of_each_value)
        f.write(f"{str(summed_hash_on_each_row.tolist())}\n")

print(pd.__version__)
```

Raw output:

```
1.5.3
2.0.0
```
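(An editorial addition: since the script names its output file after the running pandas version, the two runs can be compared directly, for example:)

```python
# minimal sketch: line-by-line comparison of the "1.5.3" and "2.0.0" files
# written by the script above, assuming it was run once under each version
with open("1.5.3") as old, open("2.0.0") as new:
    for i, (a, b) in enumerate(zip(old, new), start=1):
        if a != b:
            print(f"line {i}:\n  1.5.3: {a.rstrip()}\n  2.0.0: {b.rstrip()}")
```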
Looking at the combined output below, it looks like the discrepancies are in the numeric datatypes. This leads me to think there might be discrepancies in how numeric values are being hashed between the two versions.
So, I've started to isolate the problem, however I still don't understand why this is happening. Something seems different depending on how the values are pulled out of the dataframe.

Script:

```python
import numpy as np
import pandas as pd
import geopandas as gpd
from pandas.util import hash_array
from pprint import pprint

p = "<path-to-repo>/data/example_hydrofabric_2/hydrofabric.gpkg"
print(pd.__version__)

df = gpd.read_file(p, layer="divides")

# the same two columns pulled out of the GeoDataFrame six different ways
subset = df[["id", "areasqkm"]]
subset_loc = df.loc[:, ["id", "areasqkm"]]
square = pd.DataFrame({"id": df["id"], "areasqkm": df["areasqkm"]})
loc = pd.DataFrame({"id": df.loc[:, "id"], "areasqkm": df.loc[:, "areasqkm"]})
values = pd.DataFrame({"id": df["id"].values, "areasqkm": df["areasqkm"].values})
tolist = pd.DataFrame({"id": df["id"].values.tolist(), "areasqkm": df["areasqkm"].values.tolist()})

dfs = [subset, subset_loc, square, loc, values, tolist]

# axis=0 hands the function one column at a time
print("apply to each column")
pprint([d.apply(lambda a: hash_array(a.values), axis=0).values.sum() for d in dfs])
print("apply_along_axis to each column")
pprint([np.apply_along_axis(hash_array, 0, d.values).sum() for d in dfs])
# axis=1 hands the function one row at a time
print("apply to each row")
pprint([d.apply(lambda a: hash_array(a.values), axis=1).values.sum().sum() for d in dfs])
print("apply_along_axis to each row")
pprint([np.apply_along_axis(hash_array, 1, d.values).sum() for d in dfs])
```
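(One detail worth spelling out, added here editorially because it foreshadows the conclusion below: the interesting split in that script is `.values` versus `.values.tolist()`. Pulling mixed-dtype columns through `.values` flattens everything into a single object-dtype array, while rebuilding from `tolist()` lets pandas re-infer per-column dtypes. A quick illustration with made-up values:)

```python
import pandas as pd

# a frame with one string column and one float column, mirroring "id"/"areasqkm"
df = pd.DataFrame({"id": ["cat-1"], "areasqkm": [1.0]})

print(df.values.dtype)                          # object: mixed columns are upcast
print(pd.DataFrame(df.values.tolist()).dtypes)  # second column re-inferred as float64
```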
So, I figured it out. Here is the simplest example that illustrates and reproduces the problem:

```python
import numpy as np
from pandas.util import hash_array

# a float wrapped in an object-dtype array hashes differently across versions
a = np.array([1.0], dtype="object")
print(hash_array(a))
# 1.5.3
# [3035652100526550566]
# 2.0.0
# [7736021350537868001]
```

Having looked through the pandas source, this regression was introduced in pandas-dev/pandas#50001, specifically here (diff below).

```diff
diff --git a/pandas/core/util/hashing.py b/pandas/core/util/hashing.py
index 5a5e46e0227aa..e0b18047aa0ec 100644
--- a/pandas/core/util/hashing.py
+++ b/pandas/core/util/hashing.py
@@ -344,9 +344,7 @@ def _hash_ndarray(
         )
         codes, categories = factorize(vals, sort=False)
-        cat = Categorical(
-            codes, Index._with_infer(categories), ordered=False, fastpath=True
-        )
+        cat = Categorical(codes, Index(categories), ordered=False, fastpath=True)
         return _hash_categorical(cat, encoding, hash_key)

     try:
```

In short, the object-dtype array is categorized before hashing, and in 1.5.3 the categories were wrapped with `Index._with_infer`, which infers a concrete dtype (here `float64`), whereas in 2.0.0 the plain `Index(categories)` constructor keeps the `object` dtype, so the categories, and therefore the hashes, come out different.
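(To see that inference difference directly, here is a small sketch of mine; note that `Index._with_infer` is private, and its exact 1.5.3 behavior is inferred from the diff above rather than tested here:)

```python
import numpy as np
import pandas as pd

categories = np.array([1.0], dtype="object")

# under pandas 2.0.0 the public constructor preserves object dtype, whereas
# 1.5.3's Index._with_infer(categories) inferred a float64 index, so the
# categorical's categories (and thus their hashes) differ between versions
print(pd.Index(categories).dtype)
```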
More weirdness:

```python
import numpy as np
import hashlib
from pandas.util import hash_array, hash_pandas_object
import geopandas as gpd
import fiona

p = "<path-to-repo>/dmod/refactor-data-service/data/example_hydrofabric_2/hydrofabric.gpkg"
layers = fiona.listlayers(p)
dataframes = {layer_name: gpd.read_file(p, layer=layer_name) for layer_name in layers}

# 1. the current uid approach: hash_array over each column of .values
layer_hashes = [np.apply_along_axis(hash_array, 0, dataframes[l].values).sum() for l in layers]
print(layer_hashes)
# 1.5.3
# [10103771696888273306, 4572071176093428412, 15272590391029730009, 15651706240877393163, 15901469198598983537, 17501800407106816969, 756873605291097582]
# 2.0.0
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 15651706240877393163, 12994735377762201353, 12039723046569473286, 18438881045715204344]

# 2. same, but with the categorize step disabled
layer_hashes = [np.apply_along_axis(lambda h: hash_array(h, categorize=False), 0, dataframes[l].values).sum() for l in layers]
print(layer_hashes)
# 1.5.3
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 563669827856263632, 8015613103103036070, 14264485950225397329, 14213034935086656097]
# 2.0.0
# [13561291698351612770, 8982904833939253838, 15272590391029730009, 563669827856263632, 8015613103103036070, 14264485950225397329, 14213034935086656097]

# 3. hash_pandas_object per layer, summed with np.sum
layer_hashes = [np.sum(hash_pandas_object(dataframes[layer]).values) for layer in layers]
print(layer_hashes)
# 1.5.3
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
# 2.0.0
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]

# 4. same, but summed via Series.sum()
layer_hashes = [hash_pandas_object(dataframes[layer]).sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [-3515745103180661136, 5391557828027012765, -4959420968363052274, 6147183089241954451, 789248401423909681, -8480411101370528140, -6595055862230412446]
# 2.0.0
# [-3515745103180661136, 5391557828027012765, -4959420968363052274, 6147183089241954451, 789248401423909681, -8480411101370528140, -6595055862230412446]

# 5. same, but summed via .values.sum()
layer_hashes = [hash_pandas_object(dataframes[layer]).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]
# 2.0.0
# [14930998970528890480, 5391557828027012765, 13487323105346499342, 6147183089241954451, 789248401423909681, 9966332972339023476, 11851688211479139170]

# 6. hash_pandas_object applied column-by-column
layer_hashes = [dataframes[layer].apply(hash_pandas_object).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [76075061348514196, 4684754646151721689, 5687276790938548378, 16714125617493265531, 16929656731059435989, 2147256848333013419, 7136568188139294014]
# 2.0.0
# [76075061348514196, 4684754646151721689, 5687276790938548378, 16714125617493265531, 16929656731059435989, 2147256848333013419, 7136568188139294014]

# 7. hash_array applied to each column through DataFrame.apply
layer_hashes = [dataframes[layer].apply(lambda a: hash_array(a.values), axis=0).values.sum() for layer in layers]
print(layer_hashes)
# 1.5.3
# [7219723789373133966, 4289999241617869509, 9598795642696610532, 15651706240877393163, 17758308453597611833, 17501800407106816969, 15369625150764002746]
# 2.0.0
# [7219723789373133966, 4289999241617869509, 9598795642696610532, 15651706240877393163, 17758308453597611833, 17501800407106816969, 15369625150764002746]
```
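(One editorial note on the output above: the negative `Series.sum()` results in variant 4 are exactly the two's-complement reinterpretation of the unsigned `.values.sum()` results in variants 3 and 5; for example, `(-3515745103180661136) % 2**64 == 14930998970528890480`. So those variants agree bit-for-bit and differ only in signedness, presumably because one reduction lands in `int64` and the other stays in `uint64`.)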
Given that `hash_pandas_object` and `hash_array(..., categorize=False)` produce identical results under both versions, one of those looks like the way to make the uid stable.
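(A sketch of what that could look like, purely as an illustration; the function name and signature here are mine, not DMOD's:)

```python
import hashlib
from pandas.util import hash_pandas_object

def stable_uid(dataframes: dict, layer_names: list) -> str:
    """Hypothetical version-stable variant of the uid property quoted earlier."""
    # hash_pandas_object produced identical per-layer sums under pandas 1.5.3
    # and 2.0.0 in the experiments above; .values.sum() keeps the reduction
    # in unsigned uint64 space
    layer_hashes = [hash_pandas_object(dataframes[l]).values.sum() for l in layer_names]
    return hashlib.sha1(",".join(str(h) for h in layer_hashes).encode("UTF-8")).hexdigest()
```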
Reopening this because tests are failing again because of a related failure. This failure started recurring 3 weeks ago. https://github.com/NOAA-OWP/DMOD/actions/runs/6510147982/job/17683206387#step:10:319
The same code path is not affected. #468 will track this instead.
The `dmod.test.test_geopackage_hydrofabric.TestGeoPackageHydrofabric.test_uid_1_a` test is currently failing in several PRs. Below is a snippet from an action log showing the failure (source).
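(For anyone trying to reproduce this locally, the single test should be runnable with the standard unittest runner, e.g. `python -m unittest dmod.test.test_geopackage_hydrofabric.TestGeoPackageHydrofabric.test_uid_1_a`; this invocation is an assumption on my part, and the DMOD test setup may require a different one.)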
I compared the dependency versions installed when the tests were passing against those in the failing runs, and it seems that `pandas==2.0.0` is the likely culprit. The last known pandas version that works is `1.5.3`. I tested this locally with `fiona` versions `1.9.1` and `1.9.3` and `pandas==1.5.3`, and the tests passed. However, there is one outlier action with `pandas==2.0.0` and `fiona==1.9.2` installed that passed? I'm still a little puzzled about that one, and I've not been able to reproduce it locally (yet; I'll do that in the morning, as there isn't a `fiona` wheel for that version for my machine).

- Passing with `pandas==1.5.3`
- Failing with `pandas==2.0.0`
- Weird passing test with `pandas==2.0.0`

I went looking through fiona's changelog and the PRs for release `1.9.3`, and it doesn't look like anything there is related. I've not gone through the `geopandas` changelog yet, so I need to check there too.