
TestGeoPackageHydrofabric.test_uid_1_a failure #468

Closed
aaraney opened this issue Nov 2, 2023 · 13 comments · Fixed by #478
Labels
bug Something isn't working

Comments

@aaraney
Member

aaraney commented Nov 2, 2023

Possibly related to #324.

Failure: https://github.com/NOAA-OWP/DMOD/actions/runs/6510147982/job/17683206387#step:10:319

Affected code:

TestGeoPackageHydrofabric.test_uid_1_a is failing.

@aaraney
Member Author

aaraney commented Jan 22, 2024

So, a bit of background: we are using pandas.util.hash_pandas_object to generate a unique identifier for a geopackage hydrofabric. test_uid_1_a is failing because the old hash does not match. This appears to be a regression in pandas.util.hash_pandas_object. This is the second time we've encountered a regression in this function (see #324).
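For context, a hypothetical sketch of how a uid could be derived with hash_pandas_object (this is an illustration of the general approach, not DMOD's actual implementation; the toy frame below stands in for a real hydrofabric layer):

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object

# toy stand-in for a hydrofabric layer; real layers have many more columns
df = pd.DataFrame({"id": ["wb-10"], "length_m": [7503.9047990657755]})

# hash_pandas_object returns one uint64 hash per row; digesting those
# bytes collapses them into a single identifier for the whole frame
uid = hashlib.sha1(hash_pandas_object(df).values.tobytes()).hexdigest()
```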

In doing some initial investigating, I came across pandas documentation that notes that the pandas.util package is considered private.

The pandas.core, pandas.compat, and pandas.util top-level modules are PRIVATE. Stable functionality in such modules is not guaranteed.

With that in mind, we need to find a stable alternative for computing a unique identifier that does not use pandas.util.
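One possible replacement, sketched here purely as an illustration (not what DMOD settled on), is to serialize the frame deterministically and hash the bytes with the stdlib hashlib, avoiding pandas.util entirely:

```python
import hashlib

import pandas as pd


def frame_uid(df: pd.DataFrame) -> str:
    # sort columns so column ordering cannot change the digest, then
    # serialize with stable CSV formatting and hash the resulting bytes
    payload = df.sort_index(axis=1).to_csv(index=False).encode()
    return hashlib.sha256(payload).hexdigest()


# toy stand-in for a hydrofabric layer
df = pd.DataFrame({"id": ["wb-10"], "length_m": [7503.9047990657755]})
```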

@aaraney
Member Author

aaraney commented Jan 22, 2024

The issue is with fiona, not with pandas. I tested all pandas releases >= 2.0.0 and they all failed.

The issue is present in fiona==1.9.5 but not in fiona==1.9.4.post1. My current thinking is that the call to fiona.listlayers is now returning the layer names in a different order.

@aaraney
Member Author

aaraney commented Jan 22, 2024

Well, it's never that easy. Looks like it is something else in fiona...

Either way, we should sort that layers list before hashing to create the unique identifier (#TODO @aaraney).
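A minimal sketch of that fix, assuming a hypothetical helper that already holds a digest per layer, so that only the layer ordering is at issue:

```python
import hashlib
from typing import Dict


def hydrofabric_uid(layer_digests: Dict[str, bytes]) -> str:
    # sort by layer name so the ordering returned by fiona.listlayers
    # (or any other source) cannot change the combined identifier
    combined = hashlib.sha1()
    for name in sorted(layer_digests):
        combined.update(name.encode())
        combined.update(layer_digests[name])
    return combined.hexdigest()
```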

@robertbartel
Contributor

I've confirmed in a quick test on two different machines (an Intel Mac and an M2 Mac) that sorting the layers before processing produces consistent results when generating the unique id. I'll leave it to you for the moment, @aaraney, but let me know if you want me to go ahead and open a PR.

@aaraney
Member Author

aaraney commented Jan 22, 2024

Thanks for verifying that, @robertbartel. Yeah, if you want to put in a quick PR that would be great!

robertbartel added a commit to robertbartel/DMOD that referenced this issue Jan 22, 2024
Fixing to ensure deterministic ordering of layers.  Should address NOAA-OWP#468.
@aaraney
Member Author

aaraney commented Jan 22, 2024

Update: I'm still tracking down the change that introduced this bug, but I've found the source.

So, in short: gpd.read_file(..., layer="some_layer") uses fiona by default, which uses GDAL to open the file and deserialize its contents. In this case fiona returns an iterable Features object. geopandas then iterates over that feature collection to build up a DataFrame (as we would expect). During that iteration, geopandas checks whether a special __geo_interface__ property exists on each feature; if it does, it uses that as the feature.

for feature in features_lst:
    # load geometry: prefer the feature's __geo_interface__ if present
    if hasattr(feature, "__geo_interface__"):
        feature = feature.__geo_interface__
    ...

A feature looks like:

fiona<=1.9.4.post1

{'geometry': None, 'id': '1', 'properties': {'id': 'wb-10', 'length_m': 7503.9047990657755}, 'type': 'Feature'}

fiona==1.9.5

{'geometry': None, 'id': '1', 'properties': {'id': 'wb-10', 'rl_gages': None, 'rl_NHDWaterbodyComID': None, 'Qi': None, 'MusK': None, 'MusX': None, 'n': None, 'So': None, 'ChSlp': None, 'BtmWdth': None, 'time': None, 'Kchan': None, 'nCC': None, 'TopWdthCC': None, 'TopWdth': None, 'length_m': 7503.9047990657755}, 'type': 'Feature'}

Note the absence of the None-valued properties in fiona<=1.9.4.post1. When a pandas.DataFrame is constructed and given a column name without any values, its data type defaults to float64 and its values are NaN. Since fiona==1.9.5 does return the None values, that DataFrame's dtype will instead be object and hold the Python sentinel None. This difference in deserialization explains why the test is failing.
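The dtype difference can be reproduced with plain pandas (a toy illustration of the behavior described above, not the actual hydrofabric schema):

```python
import math

import pandas as pd

# fiona<=1.9.4.post1: empty properties are simply absent, so pandas
# fills the missing column with NaN and infers float64
df_old = pd.DataFrame([{"id": "wb-10", "length_m": 7503.9}]).reindex(
    columns=["id", "rl_gages", "length_m"]
)

# fiona==1.9.5: empty properties come back as explicit None, so pandas
# keeps the Python sentinel and infers object dtype
df_new = pd.DataFrame([{"id": "wb-10", "rl_gages": None, "length_m": 7503.9}])
```

Two frames holding "the same" data therefore hash differently, because one column is float64/NaN and the other is object/None.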

@aaraney
Member Author

aaraney commented Jan 22, 2024

The hardcoded hash that we are testing against in the test is wrong. The correct hash is the one produced by the call to uid. The hardcoded hash is wrong because string-typed columns without values were being assigned the value NaN when they should have been None.

@aaraney
Member Author

aaraney commented Jan 22, 2024

So, what are our options? One option is to pin fiona>=1.9.5; however, given how ubiquitous fiona is in the Python geospatial world, I am not the biggest fan of that approach. Alternatively, we could bring in the pyogrio library and use it as the geopandas engine (e.g. gpd.read_file(..., engine="pyogrio")). I've tested this locally and it reproduces the expected hash well back into its version history. pyogrio is a more performant library that mainly deals with OGR (vector) data. It works with pretty much any vector geospatial format and is quickly overtaking the stronghold fiona has held for over a decade in the Python vector geospatial space.

@robertbartel
Contributor

An alternative would probably be to retrieve a more restricted subset of low-level data from the dataframe ourselves and construct a hash from that; i.e., manually build the hash from only the contents of the pandas object relevant to our purposes, rather than using hash_pandas_object as a blunt instrument on the entire thing. Certainly not ideal, though.

I'm fine with going with pyogrio as long as it is relatively plug-and-play. If dependencies get complicated, we should seriously look at what the manual solution will take.

@robertbartel
Contributor

There is actually one more option, though I don't know how feasible it is: lobby for suitable unique identifier(s) to be included in the hydrofabric data itself. We could probably get by with just a unique hydrofabric version string and the already-included set of catchment ids.

@aaraney
Member Author

aaraney commented Jan 22, 2024

Yeah, I really like that idea. I think it's at least worth chewing on for a bit. We might bring it up to @program-- and get his thoughts.

@program--

Yeah, I really like that idea. I think it's at least worth chewing on for a bit. We might bring it up to @program-- and get his thoughts.

I think you guys have it mostly figured out. If you want a (somewhat) persistent hash for a given subset, I would take all the IDs (divides.divide_id, nexus.id, and flowpaths.id) and create an aggregate hash from them (probably sorting before hashing). That, theoretically, should be persistent for each subset covering the same area of interest, unless there is a topology change in that AOI. However, it's still not perfectly fault-tolerant, so you might still see CI failures later down the road. IMO, aggregate hashing for data like the hydrofabric is difficult, since persistent IDs for geospatial data tend to be difficult.
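That suggestion might look roughly like the following sketch (the function name and parameter names are hypothetical; the id columns come from the comment above, and the hash algorithm choice is illustrative):

```python
import hashlib
from typing import Iterable


def aggregate_id_hash(
    divide_ids: Iterable[str],
    nexus_ids: Iterable[str],
    flowpath_ids: Iterable[str],
) -> str:
    # sort each id collection so row order in the geopackage is irrelevant
    digest = hashlib.sha256()
    for ids in (divide_ids, nexus_ids, flowpath_ids):
        for identifier in sorted(ids):
            digest.update(identifier.encode())
    return digest.hexdigest()
```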

On the other point, pyogrio is much better than fiona; highly recommend using pyogrio for the geopandas engine instead!

@robertbartel
Contributor

That, theoretically, should be persistent, for each subset covering the same area of interest, unless there is a topology change in that AOI.

That is really the key. We don't want a persistent hash across multiple releases of the hydrofabric. We want something that will reflect if the underlying data changes (or, more practically, if we are dealing with two different hydrofabrics in two parts of a larger operation). We just need it to be deterministic and consistent for the same hydrofabric release/data.
