Rework grib2 #198

martindurant · 2022-07-19T13:13:22Z

With ideas from cogrib and gribscan:

in-memory temporaries at scan and load time
using eccodes directly on load, no need for cfgrib
produce one output per message, to then be passed through combine.

Seems to still need the physical file

martindurant · 2022-07-20T18:53:43Z

@rsignell-usgs, this will need the HRRR notebook example to be remade, because scan_grib no longer merges by itself; it creates a list of dataset refs to be merged afterwards. See the new kerchunk.grib2.example_combine.
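The two-step workflow described above (scan each file into per-message reference sets, then merge them explicitly) might look roughly like the sketch below. It assumes a kerchunk release that includes this PR. The helper name `grib_to_combined_refs` is hypothetical, and the right combine options (for example `concat_dims` and `identical_dims`) depend on the dataset, so they are simply forwarded rather than guessed.

```python
def grib_to_combined_refs(urls, storage_options=None, grib_filter=None, **combine_kwargs):
    """Scan GRIB2 files message by message, then merge the reference sets.

    Hypothetical helper for illustration; it is not part of kerchunk.
    """
    # Imported lazily so this module can be imported without kerchunk installed.
    from kerchunk.combine import MultiZarrToZarr
    from kerchunk.grib2 import scan_grib

    per_message = []
    for url in urls:
        # scan_grib returns a list with one reference dict per matching message
        per_message.extend(
            scan_grib(
                url,
                storage_options=storage_options or {},
                filter=grib_filter or {},
            )
        )

    # The merge is now an explicit second step; combine_kwargs carries options
    # such as concat_dims / identical_dims chosen by the caller.
    return MultiZarrToZarr(per_message, **combine_kwargs).translate()
```

A call would pass the file URLs plus whatever combine options suit the data; the values used in the HRRR notebook are not reproduced here.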

emfdavid

I can review in depth later.
Do you intend the new codecs to be compatible with old output?
It seems like it should be. Is there already a test that asserts the correct values read using the codec? That would make it easier to show the behavior has not changed.

martindurant · 2022-07-21T15:38:46Z

There was no test before, but now there is one which shows that kerchunk/zarr and xarray/cfgrib get the same set of array values. Yes, I believe the codec is backward compatible, but of course that is a tricky thing in general. I don't know how numcodecs intends to deal with codec versions.

martindurant · 2022-07-21T16:59:21Z

@jakirkham, @joshmoore, I wonder if you have an opinion on a version= keyword that codecs can optionally take, to allow for major updates? Or maybe some other way to allow for both API-breaking improvements and backwards compatibility?
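To make the version= idea concrete, here is a toy sketch of a codec whose version travels with its stored configuration. It only mimics the get_config/from_config round trip that numcodecs codecs use; the class, its `codec_id`, and the byte-reversing "version 2" behaviour are all invented for illustration and say nothing about what numcodecs or the Zarr spec will adopt.

```python
class ToyVersionedCodec:
    """Illustrative stand-in for a codec; not part of numcodecs or kerchunk."""

    codec_id = "toy_versioned"  # hypothetical identifier
    SUPPORTED_VERSIONS = (1, 2)

    def __init__(self, version=2):
        if version not in self.SUPPORTED_VERSIONS:
            raise ValueError(f"unsupported codec version: {version}")
        self.version = version

    def get_config(self):
        # The version is written into the stored metadata next to the codec id.
        return {"id": self.codec_id, "version": self.version}

    @classmethod
    def from_config(cls, config):
        # Metadata written before the keyword existed has no "version" entry,
        # so it is read with the original (version 1) behaviour.
        return cls(version=config.get("version", 1))

    def decode(self, buf):
        data = bytes(buf)
        if self.version == 1:
            return data  # version 1: payload stored as-is
        return data[::-1]  # version 2: payload stored reversed (arbitrary toy change)
```

With this shape, old references keep decoding as before, while newly written ones can opt in to changed behaviour.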

jakirkham · 2022-07-21T17:27:25Z

Filed a Zarr spec issue (zarr-developers/zarr-specs#148) to discuss this. Thanks for surfacing this, Martin! 🙏

emfdavid · 2022-07-22T01:06:00Z

@martindurant Got a better solution for the tmp/index files from the ecmwf folks!
ecmwf/cfgrib#306 (comment)
Shall we close my PR on Kerchunk as well and then see if and when to pass the additional indexpath:'' kwarg to open_dataset?

martindurant · 2022-07-22T18:00:14Z

Thanks @emfdavid. We no longer call cfgrib.dataset at all here, but work with the Message directly, so there was no need for that workaround. I did add it, though, for the test case where a .grib2 file is loaded directly with xarray, which would otherwise have left an index file in the source directory.
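For readers who have not seen the workaround from ecmwf/cfgrib#306, a minimal sketch follows. It assumes xarray and cfgrib are installed; the helper name and the constant are hypothetical, while passing an empty `indexpath` through `backend_kwargs` is the documented cfgrib way to skip writing the index file.

```python
# Keyword arguments for xarray.open_dataset that tell cfgrib not to write
# an index file next to the source GRIB file.
NO_INDEX_KWARGS = {"engine": "cfgrib", "backend_kwargs": {"indexpath": ""}}


def open_grib_without_index(path):
    """Hypothetical helper: open a local GRIB2 file without leaving a .idx file."""
    import xarray as xr  # lazy import: requires xarray and cfgrib at call time

    return xr.open_dataset(path, **NO_INDEX_KWARGS)
```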

emfdavid

Thank you @martindurant
Sorry I am late to review. Last week was crazy.
One question about the logging setup call.
Otherwise it looks like a big improvement though I don't fully grok the details yet.

emfdavid · 2022-07-25T14:28:22Z

kerchunk/grib2.py


 logger = logging.getLogger("grib2-to-zarr")
+fsspec.utils.setup_logging(logger=logger)


Is this appropriate to do in a library?
I don't think you want to add your own handlers here.

martindurant · 2022-07-25

Ah sorry, this does seem to be from my own debugging. Generally, people probably do want this logging, but indeed it should be up to the caller to decide.
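The convention being discussed can be sketched with the standard library alone: the library creates its logger and attaches only a NullHandler, and the application opts in to output. The logger name matches the diff above; `enable_debug_logging` is a hypothetical application-side helper, not something in kerchunk.

```python
import logging

# Library side: create the logger but do not decide where records go.
logger = logging.getLogger("grib2-to-zarr")
logger.addHandler(logging.NullHandler())


def enable_debug_logging(name="grib2-to-zarr"):
    """Application side: opt in to seeing the library's log output."""
    log = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    log.addHandler(handler)
    log.setLevel(logging.DEBUG)
    return handler
```

A caller who prefers the fsspec helper from the diff can make that call in their own script instead; the point is only that the library does not make it on import.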

emfdavid · 2022-07-25

Will you release 0.0.7 with these changes?

martindurant · 2022-07-25

It's on my list :)

emfdavid · 2022-07-27

Released - thank you!

emfdavid · 2022-07-27T23:28:55Z

kerchunk/grib2.py

+            if "typeOfLevel" in m and "level" in m:
+                name = m["typeOfLevel"]
+                data = np.array([m["level"]])
+                attrs = cfgrib.dataset.COORD_ATTRS[name]


@martindurant After updating to release 0.0.7, I now get the following error when using scan_grib with HRRR (gcs://high-resolution-rapid-refresh/hrrr.20220720/conus/hrrr.t00z.wrfsfcf21.grib2):

        zarr_meta = scan_grib(input_path, **SCAN_SURFACE_GRIB)
      File "/Users/dstuebe/.cache/bazel/_bazel_dstuebe/76bbf57da584b86027104686797623fa/execroot/ritta/bazel-out/k8-dbg/bin/ingestion/noaa_nwp/run_local.runfiles/common_deps_kerchunk/kerchunk/grib2.py", line 160, in scan_grib
        attrs = cfgrib.dataset.COORD_ATTRS[name]
    KeyError: 'surface'

where

    SCAN_SURFACE_GRIB = dict(
        common=["time", "step", "latitude", "longitude", "valid_time"],
        filter=dict(stepType="instant", typeOfLevel="surface"),
        storage_options=dict(token=None),
    )

I am afraid I am a bit lost on how to start debugging this, and I wish I had tested last week. I tried shuffling around my filter keywords a bit, but I can't grok how this changed from the previous iteration.

martindurant · 2022-07-28

I had meant to merge #204 before release... Can you try with that? I believe it runs, but note that the grib scanner no longer does any merging by itself; it creates a list of output dicts that you need to pass to MultiZarrToZarr. With that PR, I think you will still get a coordinate for the typeOfLevel parameter, which you might argue would be better as a dataset attribute. Also, I think I see a bug (which won't stop you, but the attributes of that one-value coordinate variable will be wrong), so maybe it's just as well I didn't merge yet.
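The KeyError comes from indexing the attribute table with a typeOfLevel that has no entry. A tolerant lookup could look like the sketch below; the `COORD_ATTRS` dict here is a small made-up stand-in for `cfgrib.dataset.COORD_ATTRS`, and this is not necessarily how #204 resolves the problem.

```python
# Stand-in for cfgrib.dataset.COORD_ATTRS with made-up entries; the real
# table lives in cfgrib and is larger.
COORD_ATTRS = {
    "isobaricInhPa": {"units": "hPa", "long_name": "pressure"},
    "heightAboveGround": {"units": "m", "long_name": "height above the surface"},
}


def level_coord_attrs(type_of_level):
    """Return attributes for a level coordinate, or {} when none are defined."""
    # COORD_ATTRS[type_of_level] raises KeyError for levels such as "surface";
    # .get() falls back to an empty attribute dict instead.
    return dict(COORD_ATTRS.get(type_of_level, {}))
```

With this lookup, a "surface" message yields a coordinate with empty attributes rather than aborting the scan.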

emfdavid · 2022-07-28

This works - thank you.
Looking forward to release 0.0.8 when you have time.

Stop point

ede72b3

martindurant mentioned this pull request Jul 19, 2022

Cleanup temporary files #199

Closed

martindurant added 5 commits July 19, 2022 13:31

fix dtype back-compat

b4e0927

Seems to still need the physical file

This now works

d3fc6ee

add a test

07dc1a1

Move codecs, update in-module grib example

b7e8237

remove unused codecs

5e81a32

martindurant marked this pull request as ready for review July 20, 2022 18:23

Remove .idx file

80f8a25

martindurant mentioned this pull request Jul 20, 2022

Work around for CFGRIB memory leak? #194

Closed

darothen mentioned this pull request Jul 21, 2022

add GRIB2ReferenceRecipe pangeo-forge/pangeo-forge-recipes#387

Closed

martindurant mentioned this pull request Jul 21, 2022

new version in kerchunk gribscan/gribscan#16

Open

emfdavid reviewed Jul 21, 2022


jakirkham mentioned this pull request Jul 21, 2022

Versioning codecs zarr-developers/zarr-specs#148

Closed

Don't make .idx when testing

9933dc6

martindurant merged commit 7e4bee0 into fsspec:main Jul 22, 2022

martindurant deleted the new_grib branch July 22, 2022 18:07

darothen mentioned this pull request Jul 25, 2022

Modifies coord attrs lookup for non-coord levels on GRIB2 files #204

Merged

emfdavid reviewed Jul 25, 2022


emfdavid reviewed Jul 27, 2022


observingClouds mentioned this pull request Sep 2, 2022

Why does Dask load every chunk into memory when using sel on a xarray Dataset? dask/dask#9451

Closed

guidocioni mentioned this pull request Sep 2, 2022

cfgrib loads all chunks into memory when indexing ecmwf/cfgrib#311

Open



