-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for daily output frequency #9
Conversation
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear @duncanwp, thank you very much for this very helpful contribution. I think the changes are all good as is.
I think it would be a good idea to add some simple unit tests for get_timesteps_simple
to ensure we get consistent results for a set of given inputs.
pangeo_forge_esgf/dynamic_kwargs.py
Outdated
@@ -13,21 +13,35 @@ | |||
# TODO: I might not want to allow this in the ocean and ice fields. Lets see | |||
) | |||
|
|||
daily_divisors = sorted( | |||
[1, 30, 360, 365, 3653] | |||
# TODO: It might be more efficient to default to the underlying file chunks? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think so. Dealing with the underlying chunks, and outputting them in a way that makes sense to the user is one of the core tasks pangeo-forge aims to solve. These divisons should rather reflect scientifically useful units to e.g. avoid splitting e.g. a yearly chunk with a few days left, as this would make certain operations (annual average) much slower to compute.
These look good for now, but I was wondering about different calendars and how we could maybe influence which divisors to use depending on the calendar of the data.
Just thinking out loud here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense. As far as I know there isn't any ESGF metadata to describe the calendar though and we'd have to open the files to inspect the time coordinate metadata directly which would be pain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think this is a problem to be pushed to later work.
allowed_divisors = { | ||
"Omon": monthly_divisors, | ||
"SImon": monthly_divisors, | ||
"Amon": monthly_divisors, | ||
"day": daily_divisors, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would these dividers also work generally for other daily data (see here: https://github.com/WCRP-CMIP/CMIP6_CVs/blob/master/CMIP6_table_id.json)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly have no idea, I didn't realise there were different day
table_id and can't find any documentation on what they represent. They might be the same but it seems a bit redundant. Happy to include them or not depending on how much you want to protect against potential future issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to check this PR out in the next few days, and maybe run some tests with the 'other' daily frequencies.
elif "day" in table_id: | ||
timesteps = [ | ||
m * 30 for m in months | ||
] # This is close enough without worrying about calendars etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above. I guess dealing with calendars is a whole can of worms, but I am thinking it might be worth to put some of these thoughts into a dedicated issue to keep track of it? Again for now this seems fine for me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that sounds like a plan
Great, happy to add a test or two - but there doesn't seem to be a framework. How do you want to set it up? Would it make sense as a separate PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can merge this soon. I hope to get some time to play around with this in the next days. @duncanwp could you provide me with the instance ids of the data you are using for climatebench right now?
pangeo_forge_esgf/dynamic_kwargs.py
Outdated
@@ -13,21 +13,35 @@ | |||
# TODO: I might not want to allow this in the ocean and ice fields. Lets see | |||
) | |||
|
|||
daily_divisors = sorted( | |||
[1, 30, 360, 365, 3653] | |||
# TODO: It might be more efficient to default to the underlying file chunks? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think this is a problem to be pushed to later work.
allowed_divisors = { | ||
"Omon": monthly_divisors, | ||
"SImon": monthly_divisors, | ||
"Amon": monthly_divisors, | ||
"day": daily_divisors, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to check this PR out in the next few days, and maybe run some tests with the 'other' daily frequencies.
Co-authored-by: Julius Busecke <julius@ldeo.columbia.edu>
for more information, see https://pre-commit.ci
I don't have an explicit list sorry but I extract all
See the original (hacky!) script here: https://github.com/duncanwp/ClimateBench/blob/main/prepare_data.py |
bump |
1 similar comment
bump |
Hi @duncanwp, sorry for the radio silence here. I am just now returning to this work in my context of transforming CMIP data from ESGF. I think that some work I have started upstream in pangeo-forge-recipes could entirely eliminate the need for dynamic kwargs generation in |
No worries, sounds good. I'd still be keen to look at daily ClimateBench variables as a potential use-case so would appreciate a pointer on where to get started with that! |
I am also still very curious about that use case @duncanwp, and I truly appreciate your patience. I think we can try to support that from the LEAP data-ingestion side (or at least focus discussions there). FYI here is some more info on LEAP and how we are trying to handle data ingestion (even though your use case is more of a derived data product than a canonical ingestion): https://leap-stc.github.io/leap-pangeo/jupyterhub.html#i-have-a-dataset-and-want-to-work-with-it-on-the-hub-how-do-i-upload-it Either way having an outline of your plan there would be very helpful to keep things focused. |
Sure, sounds good! I just created an issue over there: leap-stc/data-management#43 |
Fantastic. Thank you so much. |
I needed to support the daily
table_id
for my climatebench-feedstock so have made a first pass at fixing #3 here.Happy for any feedback / other approaches.