Versioned arrays #76
Hi @shoyer, can you explain this one a little more? Not fully getting it.
Let me clarify the use case first. In weather and climate science, it's common to have large array datasets that are continuously updated. For example, every hour we might combine data from weather stations, satellites and physical models to estimate air temperature at every location on a grid. This quickly adds up to a large amount of data, but each incremental update is small. Imagine a movie being written frame by frame. If updates were only ever a matter of writing new chunks, we could simply write those chunks to new files and then overwrite the array metadata. But it's common to want to overwrite existing chunks, too, either because of another incremental update (e.g., a separate quality control check) or because we need to use another chunking scheme for efficient access (e.g., to get a weather history for one location, we don't want to use one chunk per hour). Hence, indirection is useful, to enable atomic writes and/or "time travel" (like git) to view the state of the database at some prior point in time.
Thanks @shoyer. So trying to make this concrete, correct me if any of the following is wrong. You have a three-dimensional array of air temperature estimates: two spatial dimensions plus time. When designing the chunk shape for this array, you want to be able to view the state of the whole grid at some specific time, but you also want to be able to view a time series for a specific grid square or some region of the spatial grid. To make this possible, you would naturally chunk all three dimensions, e.g., chunks might be (gridx//100, gridy//100, 1), so each chunk covers 1/10000 of the spatial grid and a single time point. If this is all there was to it, then querying the state of the grid at some point in time could be done by simply indexing the third (time) dimension, as in the sketch below. Have I got this right? cc @benjeffery
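For illustration, a minimal sketch of this layout using the zarr Python API; the grid sizes and variable names here are made up, not from the discussion:

```python
import numpy as np
import zarr

gridx, gridy, ntime = 1000, 1000, 24  # hypothetical grid and time sizes

# Each chunk covers 1/10000 of the spatial grid and a single time point.
temp = zarr.zeros(shape=(gridx, gridy, ntime),
                  chunks=(gridx // 100, gridy // 100, 1),
                  dtype='f4')

# Incremental update: write the latest hour without touching earlier data.
temp[:, :, -1] = np.random.rand(gridx, gridy)

# State of the whole grid at one point in time...
snapshot = temp[:, :, 10]

# ...or a time series for a single grid square.
series = temp[200, 300, :]
```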
Yes, I think you have this right. To be clear, while the ability to view previous versions of an array (like git) is important for some use cases (like reproducibility), it's also essential for atomic updates. In practice, you do not want to use chunking like a single time point per chunk for every access pattern (see the earlier point about rechunking for efficient access), which is why overwriting existing chunks, and hence atomicity, matters. We made use of both these features in Mandoline, an array database written at my former employer.
Ok, thanks. So the requirement for versioning, do you think that lives at the storage layer?
@alimanfoo Yes, I think it would make sense to implement this at the storage level.
One simple way to implement this that immediately comes to mind would be to extend the DirectoryStore class and use git to provide versioning. We could even use GitPython to provide some convenience, like initialising the git repo when the store is first created, and provide a Python API to make commits, check out old versions, etc. (something along the lines of the sketch below).
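A rough sketch of what that could look like, assuming the zarr v2 DirectoryStore API and GitPython; the GitDirectoryStore class and its commit/checkout helpers are hypothetical names, not an existing zarr feature:

```python
import os

import git   # GitPython
import zarr


class GitDirectoryStore(zarr.DirectoryStore):
    """DirectoryStore living inside a git repository (illustrative only)."""

    def __init__(self, path):
        super().__init__(path)
        os.makedirs(path, exist_ok=True)
        try:
            # Reuse an existing repository...
            self.repo = git.Repo(path)
        except git.InvalidGitRepositoryError:
            # ...or initialise one when the store is first created.
            self.repo = git.Repo.init(path)

    def commit(self, message):
        """Record the current state of the store as a new version."""
        self.repo.git.add(all=True)
        self.repo.index.commit(message)

    def checkout(self, rev):
        """Time travel: view the store as of a previous commit/tag/branch."""
        self.repo.git.checkout(rev)


# Hypothetical usage:
# store = GitDirectoryStore('temperature.zarr')
# z = zarr.open(store=store, mode='a', shape=(100, 100), chunks=(10, 10))
# z[:10, :10] = 1.0
# store.commit('hourly update')
# store.checkout('HEAD~1')  # view the previous version
```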
Hi, my organisation works with climate simulations and they get very big (i.e. tera- to peta-scale). The data is all stored in netCDF format at present, and a single dataset is typically a set of netCDF files within a directory carrying a version label. A major issue for our data centres is the re-publication of data that has changed, even though some of the changes are very minor (for example, affecting only a small part of the dataset).
Whatever the size of the change, re-publication is very costly for our community.
@alimanfoo and @shoyer: I am interested in the discussion you have had because I could see Zarr being an excellent solution for resolving this issue, for example by enabling incremental, versioned updates rather than full re-publication.
Do you know if anyone else is doing work on this issue with Zarr? Thanks
@agstephens I think that would likely be possible using zarr; it also depends on whether you want zarr to be aware of versioned data or not. Even in the current state, only changed files/keys will be synchronized by smart tools, but you do need to scan the whole archive. I can imagine a store keeping a transaction log of all writes, allowing efficient queries of what has changed so that only those keys need to be synced (a rough sketch of this idea follows below). Depending on how users access the data, it might be more or less difficult to implement it in a way that keeps the log consistent with the modifications. I will also point out https://github.com/Quansight/versioned-hdf5
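A rough sketch of such a logging store, assuming any MutableMapping-style zarr store can be wrapped; the class and file names are hypothetical:

```python
import json
import time
from collections.abc import MutableMapping


class WriteLoggingStore(MutableMapping):
    """Wrap any store/mapping and append every write or delete to a log."""

    def __init__(self, inner, log_path='changes.log'):
        self.inner = inner
        self.log_path = log_path

    def _log(self, op, key):
        record = {'time': time.time(), 'op': op, 'key': key}
        with open(self.log_path, 'a') as f:
            f.write(json.dumps(record) + '\n')

    def __setitem__(self, key, value):
        self.inner[key] = value
        self._log('set', key)

    def __delitem__(self, key):
        del self.inner[key]
        self._log('del', key)

    def __getitem__(self, key):
        return self.inner[key]

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)


def keys_changed_since(log_path, since):
    """Keys written or deleted after `since`, without scanning the archive."""
    changed = set()
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record['time'] >= since:
                changed.add(record['key'])
    return changed
```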
How do you store the dataset(s): on a filesystem or an object store? I thought that filesystems like ZFS had block deduplication, and that snapshot transfer was smart enough to limit itself to changed blocks? I also seem to remember that https://docs.datproject.org/ was trying to solve some of those issues.
My 2 cents: the work in the database community on temporal databases may be relevant here.
I suppose one way to address this would be to design a store that includes timestamps as part of the paths (perhaps nested underneath the chunk and metadata keys). There should also be the possibility of exploring revision history. This could be done by filtering timestamps based on the revision someone wants to revisit and then selecting the latest amongst those (these steps can be fused for simplicity and performance); the sketch below shows one way this might look.
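A sketch of this idea over a plain key/value mapping; the TimestampedStore class and its key layout are hypothetical, and a real implementation would need to handle zarr metadata keys more carefully:

```python
import time
from collections.abc import MutableMapping


class TimestampedStore(MutableMapping):
    """Store every write under '<key>/<timestamp>'; reads pick the latest
    version not newer than the revision being viewed (illustrative only)."""

    def __init__(self, inner, as_of=None):
        self.inner = inner    # underlying key/value mapping, e.g. a dict
        self.as_of = as_of    # None = latest; otherwise view at that time

    def _versions(self, key):
        prefix = key + '/'
        found = [(float(k[len(prefix):]), k)
                 for k in self.inner if k.startswith(prefix)]
        return sorted(found)

    def __setitem__(self, key, value):
        self.inner['{}/{:.6f}'.format(key, time.time())] = value

    def __getitem__(self, key):
        versions = self._versions(key)
        if self.as_of is not None:
            versions = [(t, k) for t, k in versions if t <= self.as_of]
        if not versions:
            raise KeyError(key)
        return self.inner[versions[-1][1]]

    def __delitem__(self, key):
        for _, k in self._versions(key):
            del self.inner[k]

    def __iter__(self):
        seen = set()
        for k in self.inner:
            base = k.rsplit('/', 1)[0]
            if base not in seen:
                seen.add(base)
                yield base

    def __len__(self):
        return sum(1 for _ in self)


# Hypothetical usage:
# backing = {}
# store = TimestampedStore(backing)
# store['0.0'] = b'first'
# checkpoint = time.time()
# store['0.0'] = b'second'
# TimestampedStore(backing, as_of=checkpoint)['0.0']  # -> b'first'
```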
@Carreau, @DennisHeimbigner, @jakirkham: thanks for your responses regarding the versioning discussion. In response to @Carreau: the data is currently all stored in NetCDF4 on a POSIX file system (although we are experimenting with object storage). I'll do some more reading this week following the leads that you have sent. My instinct is that the approach suggested by @jakirkham is worth exploring, i.e. that the store itself could incorporate version information (such as timestamps) in its paths.
@agstephens, if you do explore this versioning idea with, say, a subclass of DirectoryStore, it would be great to hear how you get on.
Oddly enough, our team started discussing the same requirement Friday, too. The initial use case would be for metadata, but I can certainly also see the benefit of the functionality originally described by @shoyer. In our case, one goal would be to have global public URLs for multiple versions of the same dataset available without needing another tool to ensure API stability of existing URLs. (An example of another tool would be git cloning and then changing the branch.) On a filesystem this could be done fairly trivially without extra storage using symlinks. There's no general solution I know of for cloud storage, though. Following the above discussion, I too could see a way forward via the store layer. Perhaps it's natural to start with an initial implementation and then choose to include that in the spec later, but I expect composition of the stores will be an almost immediate issue (cf. zarr-developers/zarr-python#540).
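For what it's worth, here is a sketch of the "symlink" analogue as store composition; the AliasedStore class and the prefixes are hypothetical, and a real version would also need write support and proper key listing:

```python
from collections.abc import Mapping


class AliasedStore(Mapping):
    """Resolve a stable leading prefix (e.g. 'latest') to a concrete
    versioned prefix in the underlying store (read-only, illustrative)."""

    def __init__(self, inner, aliases):
        self.inner = inner        # e.g. a cloud-object-store-backed mapping
        self.aliases = aliases    # e.g. {'latest': 'dataset/v20200315'}

    def _resolve(self, key):
        head, _, rest = key.partition('/')
        target = self.aliases.get(head, head)
        return target + '/' + rest if rest else target

    def __getitem__(self, key):
        return self.inner[self._resolve(key)]

    def __iter__(self):
        # Underlying (unaliased) keys; a fuller version would map these back.
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)


# Hypothetical usage: repointing the alias "republishes" the dataset under
# the same public URL without copying or rewriting any chunk data.
# store = AliasedStore(cloud_mapping, {'latest': 'dataset/v20200315'})
# store['latest/temperature/.zarray']  # fetches 'dataset/v20200315/temperature/.zarray'
```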
From @shoyer: What about indirection for chunks? The use case here is incrementally updated and/or versioned arrays (e.g., each day you update the last chunk with the latest data, without altering the rest of the data). In the typical way this is done, you hash each chunk and use hashes as keys in another store. The map from chunk indices to hashes is also something that you might want to store differently (e.g., in memory). I guess this would also be pretty easy to implement on top of the Storage abstraction.
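For concreteness, a minimal sketch of the indirection scheme described above, using a content hash per chunk and a separately stored manifest mapping chunk keys to hashes; class and method names are illustrative only:

```python
import hashlib
import json
from collections.abc import MutableMapping


class ContentAddressedStore(MutableMapping):
    """Chunks live under their content hash; a separate manifest maps chunk
    keys to hashes, providing the indirection (illustrative only)."""

    def __init__(self, objects, manifest=None):
        self.objects = objects             # hash -> chunk bytes (append-only)
        self.manifest = manifest if manifest is not None else {}

    def __setitem__(self, key, value):
        data = bytes(value)
        digest = hashlib.sha256(data).hexdigest()
        self.objects[digest] = data        # idempotent: same data, same key
        self.manifest[key] = digest        # the only mutable piece of state

    def __getitem__(self, key):
        return self.objects[self.manifest[key]]

    def __delitem__(self, key):
        del self.manifest[key]             # old data stays for old versions

    def __iter__(self):
        return iter(self.manifest)

    def __len__(self):
        return len(self.manifest)

    def snapshot(self):
        """Serialise the manifest; each retained snapshot is a point-in-time
        view of the array ("time travel"), and swapping manifests is the
        single, near-atomic step of an update."""
        return json.dumps(self.manifest)
```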