
Zarr fragment writers #391

Merged (60 commits) on Aug 29, 2022
Conversation

@rabernat (Contributor) commented on Jul 27, 2022

This implements one of the last few key pieces of #376: writing dataset fragments to Zarr in parallel. Still lots TODO here, namely:

  • Better testing around chunks. We are no longer using locks for writing, so we have to take care to explicitly align writes with chunks (see the sketch just after this list). This PR does not implement rechunking.
  • Unit tests for IndexItems (and maybe move its core functionality to a different module)
  • Improve docstrings for transforms
  • Incorporate feedback on the API. Do we like the names of the functions, PTransforms, and their arguments? All of this is free to change.
  • Parametrize the 💩 out of test_end_to_end.py::test_xarray_zarr
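To make the chunk-alignment concern concrete, here is a minimal sketch (not part of this PR; the store path, variable name, and chunk sizes are made up) of why lock-free parallel writes require each fragment to land exactly on chunk boundaries of the target Zarr store:

```python
import numpy as np
import xarray as xr

# Hypothetical target: 100 time steps, chunked 10 steps per Zarr chunk.
ds = xr.Dataset({"temp": (("time",), np.zeros(100))}).chunk({"time": 10})
ds.to_zarr("target.zarr", compute=False)  # write metadata only, no data

# A fragment covering time steps 20-39 falls exactly on chunks 2 and 3 of the
# target, so it can be written concurrently with other fragments without any
# lock: no two writers ever touch the same chunk.
fragment = ds.isel(time=slice(20, 40)).load()
fragment.to_zarr("target.zarr", region={"time": slice(20, 40)})
```

A fragment that straddled a chunk boundary (say, time steps 15 to 24) would instead need a lock or a rechunking step, which is why rechunking is called out above as not yet implemented.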

I don't know why there are so many commits in this PR. Something about how I merged my branch with upstream. The actual diff is not that huge.

Comment on lines 32 to 44
```python
with pipeline as p:
    inputs = p | beam.Create(pattern.items())
    datasets = inputs | OpenWithXarray(file_type=pattern.file_type)
    # TODO determine this dynamically
    combine_dims = [DimKey("time", operation=CombineOp.CONCAT)]
    schemas = datasets | DatasetToSchema()
    schema = schemas | DetermineSchema(combine_dims=combine_dims)
    indexed_datasets = datasets | IndexItems(schema=schema)
    target = schema | PrepareZarrTarget(target_url=tmp_target_url)
    _ = indexed_datasets | StoreToZarr(target_store=target)

ds = xr.open_dataset(tmp_target_url, engine="zarr").load()
xr.testing.assert_equal(ds, daily_xarray_dataset)
```
@rabernat (Contributor, Author) commented on Jul 27, 2022
This block is really the highlight of this PR. It is the closest we have come yet to recreating the original monolithic XarrayZarrRecipe in modular Beam style.

But there is still a lot of room to refine this. Perhaps some of the intermediate steps can be merged into a single PTransform? I'm not sure the user needs to be passing schemas around.

I'm really interested in getting feedback on the API and flow here, @cisaacstern and @alxmrs.

Contributor commented:

I'll review this PR in more depth later today or tomorrow. It's exciting to see all this progress!

Comment on lines 33 to 41
```python
inputs = p | beam.Create(pattern.items())
datasets = inputs | OpenWithXarray(file_type=pattern.file_type)
# TODO determine this dynamically
combine_dims = [DimKey("time", operation=CombineOp.CONCAT)]
schemas = datasets | DatasetToSchema()
schema = schemas | DetermineSchema(combine_dims=combine_dims)
indexed_datasets = datasets | IndexItems(schema=schema)
target = schema | PrepareZarrTarget(target_url=tmp_target_url)
_ = indexed_datasets | StoreToZarr(target_store=target)
```
@alxmrs (Contributor) commented on Jul 27, 2022
Here's a translation (really, for myself) to more idiomatic Beam:

Suggested change (replacing the block quoted above with):

```python
combine_dims = [DimKey("time", operation=CombineOp.CONCAT)]
datasets = (
    p
    | beam.Create(pattern.items())
    | OpenWithXarray(file_type=pattern.file_type)
)
schema = (
    datasets
    | DatasetToSchema()
    | DetermineSchema(combine_dims=combine_dims)
)
target = schema | PrepareZarrTarget(target_url=tmp_target_url)
_ = (
    datasets
    | IndexItems(schema=schema)
    | StoreToZarr(target_store=target)
)
```

My high-level feedback is that this seems a bit too imperative. This is my bias, but I could see the pipeline fitting closer to the XArray Beam abstractions, something like:

```python
(
    p
    | beam.PatternToChunks(pattern)
    # other processing...
    | beam.ChunksToZarr()
)
```

Though, I could see how each of these steps would be composed out of the components you have here. Exposing the right level of granularity is hard.
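One way to expose that coarser granularity, sketched here as an assumption rather than anything in this PR, is a composite PTransform whose expand() wires up the schema, target, and store steps internally. The class name and constructor are hypothetical; the wrapped transforms are the ones used in the test above, assumed importable from the PR's transforms module:

```python
import apache_beam as beam

# Assumption: the PR's transforms live in pangeo_forge_recipes.transforms.
from pangeo_forge_recipes.transforms import (
    DatasetToSchema,
    DetermineSchema,
    IndexItems,
    PrepareZarrTarget,
    StoreToZarr,
)


class DatasetsToZarr(beam.PTransform):
    """Fold the schema/index/store steps into a single user-facing step."""

    def __init__(self, target_url, combine_dims):
        super().__init__()
        self.target_url = target_url
        self.combine_dims = combine_dims

    def expand(self, datasets):
        # Same wiring as the test above, just hidden from the user.
        schema = (
            datasets
            | DatasetToSchema()
            | DetermineSchema(combine_dims=self.combine_dims)
        )
        target = schema | PrepareZarrTarget(target_url=self.target_url)
        return (
            datasets
            | IndexItems(schema=schema)
            | StoreToZarr(target_store=target)
        )
```

The pipeline body would then read `p | beam.Create(pattern.items()) | OpenWithXarray(...) | DatasetsToZarr(...)`, much closer to the xarray-beam shape sketched above.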

Contributor commented:

From your comment below:

> Perhaps some of the intermediate steps can be merged into a single PTransform? I'm not sure the user needs to be passing schemas around.

I totally agree with this sentiment!

@vietnguyengit commented on Jul 29, 2022

Hi @rabernat, maybe my question is not directly related to this PR, but the keywords zarr and pangeo-forge brought me here. I am building a recipe to convert a large number of NetCDF files (around 100k+) to Zarr on S3 using pangeo-forge-recipes in a private environment (i.e. I won't use Pangeo Forge Cloud), and the recipe is built to run with Prefect 2.0. Will pangeo-forge-recipes and Prefect's DaskTaskRunner perform tasks such as xarray.Dataset.to_zarr() in parallel sensibly? Currently, Prefect's DaskTaskRunner won't pick up xarray function calls automatically; work needs to be wrapped in a function with the @task decorator to be added to the Dask graph, e.g. the snippet below. Perhaps I'm not advanced enough to figure out how to get DaskTaskRunner to pick up xarray functions and fill the CPU threads efficiently; I'm not sure whether that is even doable.

```python
@task
def xr_to_zarr(dataset):
    dataset.to_zarr(...)
```

But written this way, the whole xr_to_zarr block runs on a single thread. Without Prefect, calling dataset.to_zarr(...) directly lets Dask fill up the threads with tasks, something like this:

[screenshot: Dask dashboard showing tasks filling all worker threads]

I much appreciate your time reading this. Please advise. Thank you.
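For reference, a minimal sketch of the second behaviour described in the question above (plain xarray + Dask, no Prefect); the path, variable name, and sizes are made up:

```python
import numpy as np
import xarray as xr

# ~100 chunks, so to_zarr produces ~100 independent chunk-write tasks that the
# default Dask threaded scheduler spreads across all available threads.
ds = xr.Dataset(
    {"temp": (("time",), np.random.rand(1_000_000))}
).chunk({"time": 10_000})
ds.to_zarr("out.zarr", mode="w")
```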

@rabernat (Contributor, Author) commented:
Hi @vietnguyengit! Thanks for your interest in Pangeo Forge. We would love to help you, but please open a new issue with your question from above. As you said, your question is not related to this pull request.

@rabernat mentioned this pull request on Aug 7, 2022
@rabernat (Contributor, Author) commented:

@martindurant - there is a new fsspec serialization error showing up in the tests. The errors look like:

```
E   AttributeError: Can't pickle local object 'OpenFile.open.<locals>.close' [while running 'OpenWithXarray/Open with Xarray']
```

These errors did not get raised earlier (see green tests on #375) nor in my local environment. However, if I update to 2022.7.1, I can reproduce the error. Downgrading back to 2022.7.0 makes the error go away. So it must be a fairly recent change.

@martindurant (Contributor) commented:

I am looking into it. However, the following snippet is informative:

```python
import pickle, io

b = io.BytesIO(b"data")
s = io.TextIOWrapper(b)
pickle.dumps(s)  # TypeError
```

i.e., even simple file-like types are often not pickleable. I have been playing with weakrefs to augment OpenFile.open's logic, but I cannot get around something fundamental like the above without writing an explicitly pickleable wrapper class, such as LocalFileOpener currently is. That way we could, with a little effort, more rigorously guarantee that the output of OpenFile.open() (or the object yielded inside with) can be pickled.

Conversely, we could backtrack and declare that file-like objects should not, in general, be pickled and we don't guarantee that they can; this means only pickling outside contexts. Indeed, it is somewhat rare for context manager yielded objects to be pickleable, since they explicitly contain state and at least some closure.
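For illustration, here is a minimal sketch (an assumption about the general pattern, not fsspec's actual LocalFileOpener code) of what an "explicitly pickleable wrapper class" looks like: instead of pickling the live handle, pickle only the information needed to reopen it.

```python
class PicklableFileWrapper:
    """Wrap a local file so the wrapper, not the raw handle, gets pickled."""

    def __init__(self, path, mode="rb"):
        self.path = path
        self.mode = mode
        self._f = open(path, mode)

    def read(self, *args):
        return self._f.read(*args)

    def __getstate__(self):
        # Drop the unpicklable handle; keep only what is needed to reopen it.
        return {"path": self.path, "mode": self.mode, "pos": self._f.tell()}

    def __setstate__(self, state):
        self.path, self.mode = state["path"], state["mode"]
        self._f = open(self.path, self.mode)
        self._f.seek(state["pos"])
```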

@rabernat (Contributor, Author) commented:

I understand that pickling these objects is complicated (remember fsspec/filesystem_spec#579 (comment) 🙃 ). The problem here is API stability. I spent literally many full days trying to figure out a solution for serializing fsspec open file objects inside Xarray datasets that worked with beam. I found one. It's coded up and tested in Pangeo Forge. The fsspec 2022.7.1 release made some change that broke whatever fragile solution I had found, which triggered the tests to fail. (Unit testing ftw!) So there are two options:

  • The solution I found was actually unsupported behavior. It was an accident that it worked. I have to go back to the drawing board.
  • There was a regression in fsspec in the 2022.7.1 release which needs to be fixed in fsspec.

Do you have a sense of which of the two options above is more appropriate here?

@martindurant (Contributor) commented:

I'll have a look to see what changed, and whether it can easily be overcome. I'd say that indeed, this was not explicitly supported. I'll also look into my suggestion, above, of how it could be supported and see whether that's tractable or not.

@rabernat marked this pull request as ready for review on August 29, 2022, 16:49
@rabernat (Contributor, Author) commented:

Upstream dev passes thanks to fsspec/filesystem_spec#1024. Gonna move on.

@rabernat merged commit c5bc983 into pangeo-forge:beam-refactor on Aug 29, 2022
@cisaacstern (Member) commented:

🚀

@martindurant (Contributor) commented:

Sorry, I'm going to have to revert that PR. It breaks too many assumptions in other places :|

In short: an OpenFile should either

  • be used as a context manager, or
  • have .close() called on the OpenFile to ensure all contained file-like objects are cleaned up.

Since all AbstractBufferedFile descendants (s3fs, gcsfs, abfs) are pickleable, and now so are LocalFileOpeners, you should be OK to pickle inside the context, if you want, either the OpenFile or the file-like it generates.
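For readers following along, a small sketch of the two supported usage patterns described above (the local path is made up; fsspec.open, OpenFile.open, and OpenFile.close are the calls being discussed):

```python
import fsspec

of = fsspec.open("example.txt", mode="wt")

# Pattern 1: use the OpenFile as a context manager; cleanup is automatic.
with of as f:
    f.write("hello")

# Pattern 2: open explicitly, then call .close() on the OpenFile so that
# every file-like object it created gets cleaned up.
f = of.open()
f.write("hello again")
of.close()
```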
