Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework grib2 #198

Merged
merged 8 commits into from
Jul 22, 2022
Merged

Rework grib2 #198

merged 8 commits into from
Jul 22, 2022

Conversation

martindurant
Copy link
Member

With ideas from cogrib and gribscan:

  • in-memory temporaries at scan and load time
  • using ecccodes directly on load, no need for cfgrib
  • produce one output per message, to then be passed through combine.

@martindurant martindurant marked this pull request as ready for review July 20, 2022 18:23
@martindurant
Copy link
Member Author

@rsignell-usgs , this will need the HRRR notebook example to be remade, because scan_grib doesn't merge by itself, but creates a list of dataset refs to be merged afterwards. See the new kerchunk.grib2.example_combine .

Copy link
Contributor

@emfdavid emfdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can review in depth later.
Do you intend the new codecs to be compatible with old output?
It seems like it should be. Is there already a test that asserts the correct values read using the codec? That would make it easier to show the behavior has not changed.

@martindurant
Copy link
Member Author

There was no test before, but now there is one which shows that kerchunk/zarr and xarray/cfgrib get the same set of array values. Yes, I believe the codec is backward compatible, but of course that is a generally tricky thing in general. I don't know how numcodecs intends, in general, to deal with codec versions.

@martindurant
Copy link
Member Author

@jakirkham, @joshmoore , I wonder if you have an opinion on a version= keyword that codecs can optionally take, to allow for major updates? Or maybe some other way to allow for both API-braking improvements and also backwards compatibility?

@jakirkham
Copy link

Filed a Zarr spec issue ( zarr-developers/zarr-specs#148 ) to discuss this. Thanks for surfacing Martin! 🙏

@emfdavid
Copy link
Contributor

@martindurant Got a better solution for the tmp/index files from the ecmwf folks!
ecmwf/cfgrib#306 (comment)
Shall we close my PR on Kerchunk as well and then see if and when to pass the additional indexpath:'' kwarg to open_dataset?

@martindurant
Copy link
Member Author

Thanks @emfdavid . We no longer call cfgrib.dataset at all here, but work with the Message directly, so there was no need for that workaround; but I did add it for the case of the test where a .grib2 is loaded directly with xarray and would have left an index file in the source directory.

Copy link
Contributor

@emfdavid emfdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @martindurant
Sorry I am late to review. Last week was crazy.
One question about the logging setup call.
Otherwise it looks like a big improvement though I don't fully grok the details yet.


logger = logging.getLogger("grib2-to-zarr")
fsspec.utils.setup_logging(logger=logger)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this appropriate to do in a library?
I don't think you want to add your own handlers here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, this does seem to be from my own debugging. Generally, people probably do want this logging, but indeed it should be up to the caller to decide.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you release 0.0.7 with these changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's on my list :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Released - thank you!

if "typeOfLevel" in m and "level" in m:
name = m["typeOfLevel"]
data = np.array([m["level"]])
attrs = cfgrib.dataset.COORD_ATTRS[name]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant Updating after release 0.0.7 I am now getting the following error using scan_grib with HRRR: gcs://high-resolution-rapid-refresh/hrrr.20220720/conus/hrrr.t00z.wrfsfcf21.grib2

zarr_meta = scan_grib(input_path, **SCAN_SURFACE_GRIB)
  File "/Users/dstuebe/.cache/bazel/_bazel_dstuebe/76bbf57da584b86027104686797623fa/execroot/ritta/bazel-out/k8-dbg/bin/ingestion/noaa_nwp/run_local.runfiles/common_deps_kerchunk/kerchunk/grib2.py", line 160, in scan_grib
    attrs = cfgrib.dataset.COORD_ATTRS[name]
KeyError: 'surface'

where

SCAN_SURFACE_GRIB = dict(
    common=["time", "step", "latitude", "longitude", "valid_time"],
    filter=dict(stepType="instant", typeOfLevel="surface"),
    storage_options=dict(token=None),
)

I am afraid I am a bit lost on how to start debugging this and wishing I tested last week. I tried shuffling around my filter keywords a bit but I can't grok how this changed from the previous iteration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had meant to merge #204 before release... Can you try with that? I believe it runs, but note that the grib scanner no longer does any merging by itself, but creates a list of output dicts that you need to pass to MultiZarrToZarr. With that PR, I think you will still get a coordinate for the typeOfLevel parameter, which you might argue would be better as a dataset attribute. Also, I think I see a bug (which won't stop you, but the attributes of that one-value coordinate variable will be wrong), so maybe it's just as well I didn't merge yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works - thank you.
Will look forward to the release 0.0.8 when you have time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants