
TestGeoPackageHydrofabric.test_uid_1_a failure #468

Closed
aaraney opened this issue Nov 2, 2023 · 13 comments · Fixed by #478
Labels
bug Something isn't working

Comments

@aaraney
Member

aaraney commented Nov 2, 2023

Possibly related to #324.

Failure: https://github.com/NOAA-OWP/DMOD/actions/runs/6510147982/job/17683206387#step:10:319

Affected code:

TestGeoPackageHydrofabric.test_uid_1_a is failing.

@aaraney
Member Author

aaraney commented Jan 22, 2024

So, a bit of background: we are using pandas.util.hash_pandas_object to generate a unique identifier for a geopackage hydrofabric. test_uid_1_a is failing because the old hash does not match. This appears to be a regression in pandas.util.hash_pandas_object. This is the second time we've encountered a regression in this function (see #324).
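For context, a hypothetical sketch of how a uid could be derived with hash_pandas_object (this is an illustration of the general approach, not DMOD's actual implementation; the toy frame below stands in for a real hydrofabric layer):

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

# toy stand-in for a hydrofabric layer; real layers have many more columns
df = pd.DataFrame({"id": ["wb-10"], "length_m": [7503.9047990657755]})

# hash_pandas_object returns one uint64 hash per row; digesting those
# bytes collapses them into a single identifier for the whole frame
uid = hashlib.sha1(hash_pandas_object(df).values.tobytes()).hexdigest()
```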

In doing some initial investigating, I came across pandas documentation that notes that the pandas.util package is considered private.

The pandas.core, pandas.compat, and pandas.util top-level modules are PRIVATE. Stable functionality in such modules is not guaranteed.

With that in mind, we need to find a stable alternative for computing a unique identifier that does not use pandas.util.
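One possible replacement, sketched here purely as an illustration (not what DMOD settled on), is to serialize the frame deterministically and hash the bytes with the stdlib hashlib, avoiding pandas.util entirely:

```python
import hashlib

import pandas as pd


def frame_uid(df: pd.DataFrame) -> str:
    # sort columns so column ordering cannot change the digest, then
    # serialize with stable CSV formatting and hash the resulting bytes
    payload = df.sort_index(axis=1).to_csv(index=False).encode()
    return hashlib.sha256(payload).hexdigest()


# toy stand-in for a hydrofabric layer
df = pd.DataFrame({"id": ["wb-10"], "length_m": [7503.9047990657755]})
```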

@aaraney
Member Author

aaraney commented Jan 22, 2024

The issue is with fiona, not with pandas. I tested all pandas releases >= 2.0.0 and they all failed.

The issue is present in fiona==1.9.5 but not in fiona==1.9.4.post1. My current thinking is that the call to fiona.listlayers is now returning the layer names in a different order.

@aaraney
Member Author

aaraney commented Jan 22, 2024

Well, it's never that easy. Looks like it is something else in fiona...

Either way, we should sort that layers list before hashing to create the unique identifier (#TODO @aaraney).
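A minimal sketch of that fix, assuming a hypothetical helper that already holds a digest per layer, so that only the layer ordering is at issue:

```python
import hashlib
from typing import Dict


def hydrofabric_uid(layer_digests: Dict[str, bytes]) -> str:
    # sort by layer name so the ordering returned by fiona.listlayers
    # (or any other source) cannot change the combined identifier
    combined = hashlib.sha1()
    for name in sorted(layer_digests):
        combined.update(name.encode())
        combined.update(layer_digests[name])
    return combined.hexdigest()
```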

@robertbartel
Contributor

I've confirmed in a quick test on two different machines (an Intel Mac and an M2 Mac) that sorting the layers before processing produces consistent results when generating the unique id. I'll leave it to you for the moment, @aaraney, but let me know if you want me to go ahead and open a PR.

@aaraney
Member Author

aaraney commented Jan 22, 2024

Thanks for verifying that, @robertbartel. Yeah, if you want to put in a quick PR that would be great!

robertbartel added a commit to robertbartel/DMOD that referenced this issue Jan 22, 2024
Fixing to ensure deterministic ordering of layers.  Should address NOAA-OWP#468.
@aaraney
Member Author

aaraney commented Jan 22, 2024

Update: I'm still tracking down the change that introduced this bug, but I've found the source.

So, in short: gpd.read_file(..., layer="some_layer") uses fiona by default, which uses GDAL to open the file and deserialize its contents. In this case fiona returns an iterable Features object. geopandas then iterates over that feature collection to build up a DataFrame (as we would expect). During that iteration, geopandas checks whether a special __geo_interface__ property exists on each feature; if it does, it uses that as the feature.

for feature in features_lst:
    # load geometry: prefer the feature's __geo_interface__ if present
    if hasattr(feature, "__geo_interface__"):
        feature = feature.__geo_interface__
    ...

A feature looks like:

fiona<=1.9.4.post1

{'geometry': None, 'id': '1', 'properties': {'id': 'wb-10', 'length_m': 7503.9047990657755}, 'type': 'Feature'}

fiona==1.9.5

{'geometry': None, 'id': '1', 'properties': {'id': 'wb-10', 'rl_gages': None, 'rl_NHDWaterbodyComID': None, 'Qi': None, 'MusK': None, 'MusX': None, 'n': None, 'So': None, 'ChSlp': None, 'BtmWdth': None, 'time': None, 'Kchan': None, 'nCC': None, 'TopWdthCC': None, 'TopWdth': None, 'length_m': 7503.9047990657755}, 'type': 'Feature'}

Note the absence of the None-valued properties in fiona<=1.9.4.post1. When a pandas.DataFrame is constructed and given a column name without any values, its data type defaults to float64 and its values are NaN. Since fiona==1.9.5 does return the None values, that DataFrame's dtype will instead be object and hold the Python sentinel None. This difference in deserialization explains why the test is failing.
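The dtype difference can be reproduced with plain pandas (a toy illustration of the behavior described above, not the actual hydrofabric schema):

```python
import math

import pandas as pd

# fiona<=1.9.4.post1: empty properties are simply absent, so pandas
# fills the missing column with NaN and infers float64
df_old = pd.DataFrame([{"id": "wb-10", "length_m": 7503.9}]).reindex(
    columns=["id", "rl_gages", "length_m"]
)

# fiona==1.9.5: empty properties come back as explicit None, so pandas
# keeps the Python sentinel and infers object dtype
df_new = pd.DataFrame([{"id": "wb-10", "rl_gages": None, "length_m": 7503.9}])
```

Two frames holding "the same" data therefore hash differently, because one column is float64/NaN and the other is object/None.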

@aaraney
Member Author

aaraney commented Jan 22, 2024

The hardcoded hash that we are testing against in the test is wrong. The correct hash is the one produced by the call to uid. The hardcoded hash is wrong because string-typed columns without values were being assigned the value NaN when they should have been None.

@aaraney
Member Author

aaraney commented Jan 22, 2024

So, what are our options? One option is to pin fiona>=1.9.5; however, given how ubiquitous fiona is in the Python geospatial world, I am not the biggest fan of that approach. Alternatively, we could bring in the pyogrio library and use it as the geopandas engine (e.g. gpd.read_file(..., engine="pyogrio")). I've tested this locally and it reproduces the expected hash well back into its version history. pyogrio is a more performant library that mainly deals with OGR (vector) data. It works with pretty much any vector geospatial format and is quickly overtaking the stronghold fiona has held for over a decade in the Python vector geospatial space.

@robertbartel
Contributor

An alternative would probably be to retrieve a more restricted subset of low-level data from the dataframe ourselves and construct a hash from that; i.e., manually build the hash from only the contents of the pandas object relevant to our purposes, rather than using hash_pandas_object as a blunt instrument on the entire thing. Certainly not ideal, though.

I'm fine with going with pyogrio as long as it is relatively plug-and-play. If dependencies get complicated, we should seriously look at what the manual solution will take.

@robertbartel
Contributor

There is actually one more option, though I don't know how feasible it is: lobby for suitable unique identifier(s) to be included in the hydrofabric data itself. We could probably get by with just a unique hydrofabric version string and the already-included set of catchment ids.

@aaraney
Member Author

aaraney commented Jan 22, 2024

Yeah, I really like that idea. I think it's at least worth chewing on for a bit. We might bring it up to @program-- and get his thoughts.

@program--

Yeah, I really like that idea. I think it's at least worth chewing on for a bit. We might bring it up to @program-- and get his thoughts.

I think you guys have it mostly figured out. If you want a (somewhat) persistent hash for a given subset, I would take all the IDs (divides.divide_id, nexus.id, and flowpaths.id) and create an aggregate hash from them (probably sorting before hashing). That, theoretically, should be persistent for each subset covering the same area of interest, unless there is a topology change in that AOI. However, it's still not perfectly fault-tolerant, so you might still see CI failures later down the road. IMO, aggregate hashing for data like the hydrofabric is difficult, since persistent IDs for geospatial data tend to be difficult.
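That suggestion might look roughly like the following sketch (the function name and parameter names are hypothetical; the id columns come from the comment above, and the hash algorithm choice is illustrative):

```python
import hashlib
from typing import Iterable


def aggregate_id_hash(
    divide_ids: Iterable[str],
    nexus_ids: Iterable[str],
    flowpath_ids: Iterable[str],
) -> str:
    # sort each id collection so row order in the geopackage is irrelevant
    digest = hashlib.sha256()
    for ids in (divide_ids, nexus_ids, flowpath_ids):
        for identifier in sorted(ids):
            digest.update(identifier.encode())
    return digest.hexdigest()
```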

On the other point, pyogrio is much better than fiona; highly recommend using pyogrio for the geopandas engine instead!

@robertbartel
Contributor

That, theoretically, should be persistent, for each subset covering the same area of interest, unless there is a topology change in that AOI.

That is really the key. We don't want a persistent hash across multiple releases of the hydrofabric. We want something that will reflect if the underlying data changes (or, more practically, if we are dealing with two different hydrofabrics in two parts of a larger operation). We just need it to be deterministic and consistent for the same hydrofabric release/data.
