[Feature]: Add functionality to accurately set time bounds #412
Comments
I am glad @pochedls brought this subject to the team's attention (thanks, @pochedls!). In the PMP workflow I have had to deal with this as well. The discussion below could be somewhat related (thanks @taylor13 for pointing this to us).
I would note that even for latitude/longitude grids, creating bounds half-way between coordinate values is not always correct. For Gaussian grids and other lat x lon grids that are not equally spaced, you would get the wrong values, in general, for the bounds. I would only auto-generate bounds after determining whether a coordinate represents equal intervals (of whatever it measures, be it latitude, time, vertical position, or whatever).
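To make the Gaussian-grid point concrete, here is a rough standalone numpy check (not xCDAT code): for Gaussian latitudes, bounds placed half-way between coordinate values imply cell weights that differ from the true Gaussian quadrature weights.

```python
# Standalone numpy sketch: midpoint bounds vs. true Gaussian quadrature weights.
import numpy as np

n = 8
nodes, weights = np.polynomial.legendre.leggauss(n)  # sin(lat) nodes and exact weights
lats = np.degrees(np.arcsin(nodes))                   # Gaussian latitudes (unequally spaced)

# Midpoint bounds, with the end bounds clamped to the poles.
mids = (lats[:-1] + lats[1:]) / 2
lower = np.concatenate(([-90.0], mids))
upper = np.concatenate((mids, [90.0]))

# Area weight implied by the midpoint bounds vs. the exact quadrature weight.
implied = np.sin(np.radians(upper)) - np.sin(np.radians(lower))
print(np.max(np.abs(implied - weights)))  # nonzero -> midpoint bounds are only approximate
```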
Thanks for reviewing this @taylor13. If a model doesn't provide bounds and a user needs to do an operation that requires bounds, I think most people would still want to use the generated bounds even if they were imperfect. Do you think a warning would be appropriate in this situation? The user could then decide whether they want to skip the dataset (or email the modeling center to ask for the bounds?).
Yes, I would certainly include a warning, and for most cases one might get away with inaccurate bounds, so generating them is probably what most users will want.
I appreciate this discussion. Personally, for the time dimension, I always thought the time bounds are more reliable than the time points, because different data centers store their data at different points of the interval (beginning, middle, or end). When the time bounds are missing, one needs to know exactly how the data were recorded in order to generate accurate bounds. I agree that we should provide an option for these users (by providing a function similar to setTimeBoundsFREQ). However, I was not aware of how to properly use it, especially when looping over multiple models. If data users don't know how the time is recorded, I think providing an educated guess with a warning is still valuable.
@chengzhuzhang Regarding
Hi @lee1043, thank you for pointing to the code! I guess I was confused by the documentation, as follows...
The native E3SM output records time at the end of each time interval; luckily it provides time bounds, and xcdat can open the datasets and re-center the time coordinate.
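For readers following along, a minimal sketch of that workflow; the center_times keyword reflects my recollection of the xcdat.open_dataset API, and the file name is hypothetical.

```python
# If the dataset already carries time bounds (as E3SM output does), xcdat can
# recompute the time coordinate as the midpoint of each bounds pair.
import xcdat

ds = xcdat.open_dataset("e3sm_monthly_output.nc", center_times=True)  # hypothetical file
print(ds.time.values[:3])  # times now fall at the middle of each interval
```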
Thanks @pochedls for generating this feature request! The user should definitely get warnings when missing bounds are automatically generated at the midpoint. The midpoint option is probably a better-than-nothing, poor man's option, and I'm not sure options other than the midpoint would make sense. If somebody is going to rewrite/improve some functions, you may want to get rid of the width parameter. I have tried to follow the source links in xcdat.bounds.BoundsAccessor, but I was quickly lost. I did find a reference to iris and a guess_bounds(bound_position=0.5) function. As a parameter name, bound_position is probably clearer than width, but I doubt many users would pass anything other than 0.5, so you might as well get rid of this parameter. Would it be possible to have a
Regarding: "width (float, optional) – Width of the bounds relative to the position of the nearest points, by default 0.5." I agree with @jypeter that rarely would one specify anything but 0.5, but since that's the default, I don't see any harm retaining it. I think the documentation should make clear how the bounds are generated when data are unequally spaced. I don't know what the algorithm currently does, but when width=0.5, I would place the cell bound half-way between a cell's coordinate value and the next cell's coordinate value. For the first and last cells (where there is no "next cell"), I would place the bound such that the coordinate value lies half-way between its bounds (i.e., the bound would be located at the same distance from the coordinate value as the other bound). If that approach is adopted, then maybe the documentation would be clearer saying: "width (float, optional) indicates the location of a grid cell bound expressed as a fraction of the distance from a cell's coordinate value to the coordinate value of the neighboring cell. Only when width=0.5 will the cells be contiguously located. The bound for a cell with no neighbor (i.e., the first and last cell of a dimension) will be assumed to be the same distance from the cell's coordinate value as its other bound." |
Regarding the start/end of the time period: in thinking about how to implement this, I think by default we would just take the start/end of the month that the time point is in (e.g., 1979-01-01 00:00:00 and 1979-02-01 00:00:00 for the ERA5 example above). I think the case where you would want to use
@pochedls, I was addressing the behavior and documentation of Dataset.bounds.add_bounds(axis, width=0.5), not the proposed new time-bounds functionality.
I have tried to figure out how the width parameter actually behaves. Good luck if somebody wants to explain how bounds are actually created in a docstring! xcdat is still in a 0.x version, so it is not too late to simplify this. Maybe somebody could want to use 0 or 1, if they wanted the upper and lower bounds to have the same values as the coordinates, but then I assume they would really know what they are doing, and I'd rather have them create their own custom bounds explicitly. I'd rather have clear (i.e., not confusing) documentation saying that if there are no existing bounds, a warning will be raised and the bounds will be created at the midpoint (no other option). The documentation can possibly add that the created bounds of the
I think we should probably remove the width parameter unless someone wants to figure out when someone would set this to something other than 0.5.
I agree. I misinterpreted what "width" meant. Looking at the code, it calculates a fractional distance of "width" for one bound and "1 - width" for the other bound. This ensures that all cells are contiguous. As J-Y @jypeter says, I'd rather have users create their own custom bounds.
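A rough numpy sketch of that width/(1 - width) rule as I read the description above (not a copy of the xCDAT source), showing that consecutive cells stay contiguous for any width value:

```python
import numpy as np

def fractional_bounds(coords, width=0.5):
    c = np.asarray(coords, dtype=float)
    diffs = np.diff(c)
    diffs = np.concatenate(([diffs[0]], diffs, [diffs[-1]]))  # pad the ends
    lower = c - width * diffs[:-1]        # "width" of the way back toward the previous point
    upper = c + (1 - width) * diffs[1:]   # "1 - width" of the way toward the next point
    return np.column_stack((lower, upper))

b = fractional_bounds([0.0, 1.0, 3.0], width=0.25)
print(b)
print(np.allclose(b[:-1, 1], b[1:, 0]))  # True: cells remain contiguous
```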
I linked to this discussion in response to a question about time bounds on the xarray forum in case you would like to add anything (I couldn't find how to tag you directly): pydata/xarray#7580
We appreciate the xCDAT plug @arfriedman! You brought up a great idea for us to stay active in the xarray discussions forum to be more involved in the community and also help xCDAT grow.
@arfriedman – just an FYI that there is a PR under review to address the contents of this feature request. This matters for temporal averaging if your dataset does not have time bounds and you need to create them (if the dataset has time bounds, they will be used for temporal averaging operations).
Is your feature request related to a problem?
Following Discussion #411 – xCDAT should modify how it handles bounds in general and time bounds in particular. xCDAT currently uses the bounds within the dataset, but if they are not present it will generate bounds for each axis (by placing the bounds at the midpoint between adjacent axis points). This is problematic for time coordinates because a) time points are frequently not saved out at the midpoint value and b) months are not equal in length, so the halfway point between two monthly time points does not yield the correct bound.
This is illustrated in Slide 4 of climate_specific_tools.pdf.
For example, ERA5 monthly mean data is saved at midnight on the first day of the month:
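As an illustration (the file name is hypothetical), the time stamps look like this:

```python
import xarray as xr

ds = xr.open_dataset("era5_monthly_t2m.nc")
print(ds.time.values[:3])
# 1979-01-01T00:00, 1979-02-01T00:00, 1979-03-01T00:00 (start-of-month stamps)
```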
xCDAT then autogenerates incorrect bounds:
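A hedged sketch of what those auto-generated midpoint bounds look like (hypothetical file name; time_bnds is the default bounds variable name as I understand it):

```python
import xcdat

ds = xcdat.open_dataset("era5_monthly_t2m.nc")  # bounds are added automatically on open
print(ds.time_bnds.isel(time=0).values)
# roughly 1978-12-16T12:00 .. 1979-01-16T12:00, rather than the true month edges
# 1979-01-01 00:00:00 .. 1979-02-01 00:00:00
```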
Note that this issue also applies to vertical coordinates, which are frequently not equally spaced.
Describe the solution you'd like
Addressing this issue would take a few changes:
1. Issue a warning when bounds are auto-generated by xcdat.open_dataset() (in function and documentation).
2. Add functionality similar to CDAT's setAxisTimeBoundsMonthly() that sets the time bounds at the start and end of the period (e.g., for the ERA5 case above, the bounds should be 1979-01-01 00:00:00 and 1979-02-01 00:00:00); see the sketch after this list.
3. Give the user control over automatic bounds generation, either a) via an argument to xcdat.open_dataset or b) via a global option as in CDAT's setAutoBounds() function: allow the user to determine which of t, x, y, z should have bounds creation, e.g., gen_bounds = None | List['X', 'Y', 'T', 'Z'].
4. Possibly add functionality to detect the time frequency of a dataset (e.g., isMonthly()).
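A minimal sketch (not an existing xCDAT API, and ignoring non-standard cftime calendars) of what a setAxisTimeBoundsMonthly-style helper could do: snap each time step's bounds to the first of its month and the first of the following month, matching the ERA5 example above.

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(["1979-01-01", "1979-02-01", "1979-03-01"])
lower = times.to_period("M").to_timestamp()        # start of each month
upper = (times.to_period("M") + 1).to_timestamp()  # start of the following month
time_bnds = np.column_stack((lower, upper))
print(time_bnds[0])  # -> 1979-01-01T00:00 and 1979-02-01T00:00
```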
Describe alternatives you've considered
I think that the minimal solution would be to disable temporal and vertical bounds generation by default on dataset open (users could still call xCDAT's functions to generate bounds at their own risk).
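As a rough illustration of this opt-in workflow, under the assumption that xcdat.open_dataset accepts an add_bounds flag and that the bounds accessor exposes add_bounds(axis) (both from my reading of the current API, and subject to change):

```python
import xcdat

ds = xcdat.open_dataset("model_output.nc", add_bounds=False)  # hypothetical file; no automatic bounds
# A user could still opt in explicitly, accepting that midpoint time bounds may be wrong:
ds = ds.bounds.add_bounds("T")
```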
Additional context
I think the above changes would largely address the issue.
I did think about an edge case that I think would be mishandled by related temporal functions (e.g., xcdat.temporal.group_average()). Consider a dataset with bounds (so xCDAT would not generate bounds) that are weird (but correct), as illustrated in the sketch below. If we were to take an annual average of this data (for 2022), xCDAT would first figure out the weights by taking the difference of the upper and lower bounds (31, 31, 28, ...) and use this vector as the weights in the annual average. But some of the data should not be included in the annual average; according to the bounds, ~half of the January data is supposedly in December (and the weight should be 15 days for the annual average in 2022).
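A hypothetical illustration of the kind of shifted-but-valid bounds described above, and the weights that would fall out of differencing them:

```python
import numpy as np

# Monthly cells that run from mid-month to mid-month: valid CF bounds, but the
# "January" cell has half of its interval in 2021.
time_bnds = np.array(
    [
        ["2021-12-16", "2022-01-16"],  # January point; only ~15 days lie inside 2022
        ["2022-01-16", "2022-02-16"],
        ["2022-02-16", "2022-03-16"],
    ],
    dtype="datetime64[D]",
)
weights = time_bnds[:, 1] - time_bnds[:, 0]
print(weights)  # [31 31 28] days, used as-is when grouping by year
```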
I don't think this would be an issue for the average function, but probably does affect climatology and departures. I think this is likely an edge case and I'm not sure what CDAT does in this situation, but it came to mind in thinking about this issue.