Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: When appending to file with newer schema, the older schema is not deleted #1098

Open
rly opened this issue Apr 16, 2024 · 11 comments
Open
Assignees
Labels
category: bug errors in the code or code behavior category: enhancement improvements of code or code behavior priority: low alternative solution already working and/or relevant to only specific user(s)
Milestone

Comments

@rly
Copy link
Contributor

rly commented Apr 16, 2024

What happened?

See flatironinstitute/neurosift#157

Currently, a file may contain multiple versions of a schema, e.g., "core" 1.5.0 and "core" 1.6.0 after a user appends to a file that has an older version of a namespace cached than the one currently loaded by pynwb/hdmf. I don't see the value of that, given that individual neurodata type objects are attached to a namespace and not a namespace & version. It adds confusion.

I think, when writing the file and the schema is cached, the older version of a namespace should be deleted and the newer one added.

This raises the use case where the existing file has a cached schema and the user chooses not to cache the new schema., but the data are written with the new schema. We should not allow that. Is there a use case where we don't want to cache the schema on write? Should we just disallow that to prevent edge cases during append/modification like this? I think so.

Related: NeurodataWithoutBorders/pynwb#967 which mentions possible performance slowdowns, but they seem minor, I think the value of caching the schema trumps that, and we can optimize the writing of the schema separately - it is just a group with a few string datasets.

Steps to Reproduce

Create a file with an older version of a schema (cache=True by default)
Append to the file using the API that uses a newer version of the schema (cache=True by default)
Inspect the file

Traceback

No response

Operating System

macOS

Python Executable

Conda

Python Version

3.12

Package Versions

No response

@rly rly added category: bug errors in the code or code behavior category: enhancement improvements of code or code behavior priority: low alternative solution already working and/or relevant to only specific user(s) labels Apr 16, 2024
@rly rly added this to the Future milestone Apr 16, 2024
@rly
Copy link
Contributor Author

rly commented Apr 16, 2024

@stephprince I assigned you this because you'll be working with namespaces and validation soon, and this is related.

@oruebel
Copy link
Contributor

oruebel commented Apr 16, 2024

I don't see the value of that, given that individual neurodata type objects are attached to a namespace and not a namespace & version.

This assumes that the schema (not just the API are compatible). E.g., let's say you stored an OpticalSeries with version 2.1 in a file and then modified the definition of OpticalSeries in version 2.?. Erasing and replacing the namespace now implies that the older OpricalSeries follows 2.? when it actually is 2.1. In particular for extensions, I think that is problematic behavior. The actual solution would be to also store the version of the namespace. HDMF/PyNWB does not support multiple versions of the same namespace, which is probably why this was omitted, but I think it would be worth considering supporting namespace <version> values as part of the namespace attribute.

@rly
Copy link
Contributor Author

rly commented Apr 16, 2024

Under that example, different objects of the NWB file are valid only under different schema. The file could fail validation (which validates all objects under one schema) under both the new and old schema. Allowing this would open a big can of edge case worms. I would rather be more restrictive and say that saving an NWB file using the newest schema is allowed only if all objects are valid in the newest schema. The new schema may not be compatible with the old schema, though I can't think of a great example of that right now. We could validate before write to check this, which is something we were thinking of doing anyway.

@oruebel
Copy link
Contributor

oruebel commented Apr 16, 2024

I would rather be more restrictive and say that saving an NWB file using the newest schema is allowed only if all objects are valid in the newest schema.

So that means validating before updating a file to the newer schema?

@rly
Copy link
Contributor Author

rly commented Apr 16, 2024

If the file with an older schema is opened in pynwb/hdmf that has loaded a newer schema, probably we should validate the file against the newer schema in case there are weird discrepancies that are not accounted for. During the write process, we should validate the builders against the newer schema and not allow the user to proceed if something is invalid (this usually means the API is buggy because it allowed the user to create non-conforming builders).

@oruebel
Copy link
Contributor

oruebel commented Apr 17, 2024

@rly do If I understand this correctly, the proposal is effectively to create a migration path to upgrade a file a new schema

@rly
Copy link
Contributor Author

rly commented Apr 17, 2024

@oruebel Effectively, yes.

Here is a concrete use case: In NWB 2.1.0, NWBFile datasets "experimenter" and "related_publications" changed from scalar to 1-D arrays. Now, PyNWB in the object mapper reads a scalar "experimenter" value as a 1-D array in the NWBFile container. If a user opens a pre-2.1.0 file in current PyNWB in append mode and adds a TimeSeries, currently, I believe only the TimeSeries is written. NWBFile.experimenter is not rewritten as a 1-D array. The current 2.7.0 schema would be appended to the file alongside the pre-2.1.0 schema. Technically, this data is invalid against the 2.7.0 schema unless we add an exception to the validator (I think we actually don't have that right now but it hasn't come up...)

Before we can delete the pre-2.1.0 schema from the cache, we should migrate the no-longer-valid scalar "experimenter" value to a valid 1D array in memory and write that change to the file on write.

@rly
Copy link
Contributor Author

rly commented Apr 17, 2024

We can use the modified flags on containers and builders to flag that a data type has been migrated, but I'm not sure that would replace a dataset. I also think we should not rewrite all datasets in the NWBFile. We'll need to test that out...

@oruebel
Copy link
Contributor

oruebel commented Apr 17, 2024

Before we can delete the pre-2.1.0 schema from the cache, we should migrate the no-longer-valid scalar "experimenter" value to a valid 1D array in memory and write that change to the file on write.

This may not be trivial. Changing the data type, shape etc. of datasets and attributes is not possible in HDF5. I think one would need to first remove the existing dataset (or attribute) in that case and create a new one with the same name.

@sneakers-the-rat
Copy link

related to : #1018 (comment)

why not separate the actions of appending and migrating? It seems like the real issue is that it's possible (and in fact impossible not to as far as i can tell) to use the wrong version of a schema with a file. if, when opening a file for writing, the schema version that's embedded with the file is the one that's loaded by pynwb/hdmf, then the problem disappears?

then if we wanted to do migrations then it would be an explicit, rather than accidental process - a file uses the schema that it explicitly packages with it until the user explicitly acts to change that, eg. via a migration system. implicitly writing data with a newer schema by accident seems like highly undesirable behavior, as does implicit/automatic migration.

@sneakers-the-rat
Copy link

revisited this issue bc the issue of handling multiple schema versions simultaneously came up for me again, and realize the last thing that the discussion was left on.

I think one would need to first remove the existing dataset (or attribute) in that case and create a new one with the same name.

if it's helpful & this is still something that would be desirable to do, i had to do this recently to make truncated sample files for testing, and the hardest part was preserving the references, because when you remove and recreate the dataset all the references to it are destroyed.

this is a relatively cumbersome function since it's intended to be a cli command (once i write the cli module), so sorry for the prints, but as far as I can tell this re-creates all references correctly, and should be easy to strip out the truncation parts: https://github.com/p2p-ld/nwb-linkml/blob/f94a144d75411c3c2d370d7171b03a0af7eb37c4/nwb_linkml/src/nwb_linkml/io/hdf5.py#L489-L632

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: bug errors in the code or code behavior category: enhancement improvements of code or code behavior priority: low alternative solution already working and/or relevant to only specific user(s)
Projects
None yet
Development

No branches or pull requests

4 participants