Cast GDP float64 data to float32 as an option (default) #269
Good to go for me for now. A user who would want to keep some variables float64 would have to pass False and do the conversion manually. I think this is acceptable.
I was able to discuss this a bit more with @JeffreyEarly: we should keep double precision for latitude and longitude estimates, but we can convert to single precision for all other variables (even for the uncertainty of longitude and latitude). Since this PR applies to the whole RaggedArray class, I think it is not a good idea in the end. We should keep double precision and manually convert what we need to convert for the GDP dataset. |
Sounds good, we can exclude specific variables from the cast. Do you know why double for velocity? |
That was a typo: latitude and longitude, now corrected. |
to_xarray() docstring extension.
So in conclusion, despite my commit, I don't think we should implement this PR. |
I'm confused. Why not cast variables other than lat and lon? |
If I understand correctly, this PR proposes to convert all variables to float32 (not only in the GDP cases!). Some users might want to retain the double precision for their specific variables. So I am not in favor of this behavior. |
OK, if you prefer manually casting, that's fine. Better than adding code that won't be useful. As an aside, since it came up here: the RaggedArray class is currently only used for GDP. I can't imagine using it for other datasets over simple functions like I did in MOSAiC. Long term I'd like to remove it altogether; nothing to do for now. |
I'm confused by this. So we are doing nothing and the GDP ragged array will be 35 GB? I don't think it makes much sense to have all this precision. |
26 GB, not 35 GB. We are doing the conversion by hand for the GDP, not by a default behavior of |
But it could be done in |
I think that's a great idea @philippemiron |
It sounds like the current direction is:
Please confirm. |
Sounds like a plan. |
@@ -510,6 +510,9 @@ def preprocess(index: int, **kwargs) -> xr.Dataset:
# rename variables
ds = ds.rename_vars({"longitude": "lon", "latitude": "lat"})

# Cast float64 variables to float32 to reduce memory footprint.
ds = gdp.cast_float64_variables_to_float32(ds)
Is this where you should add `variables_to_skip = ["lon", "lat", "time"]`?
That's the default, see the function definition.
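For readers without the function definition at hand, here is a minimal sketch of what a cast with a skip list could look like. It is not the code merged in this PR: only the function name and the `variables_to_skip` default come from the thread, while the body, the use of `data_vars` (coordinates are left untouched), and the example dataset are assumptions made for illustration.

```python
import numpy as np
import xarray as xr


def cast_float64_variables_to_float32(ds, variables_to_skip=("lon", "lat", "time")):
    """Return a copy of ds with float64 data variables cast to float32.

    Hypothetical sketch: variables named in variables_to_skip keep their dtype.
    """
    out = ds.copy()  # shallow copy; the input dataset is not modified
    for name in list(ds.data_vars):
        if name in variables_to_skip:
            continue
        if ds[name].dtype == np.float64:
            out[name] = ds[name].astype(np.float32)
    return out


# Made-up example values: positions stay float64, velocity becomes float32.
example = xr.Dataset(
    {
        "lon": ("obs", np.array([179.123456789, -45.5])),
        "lat": ("obs", np.array([12.3456789, 60.0])),
        "ve": ("obs", np.array([0.25, -0.5])),
    }
)
converted = cast_float64_variables_to_float32(example)
print({name: str(converted[name].dtype) for name in converted.data_vars})
```

Passing `variables_to_skip=()` in this sketch would cast every float64 data variable, which is the behavior the earlier comments argued against for longitude and latitude.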
Once this is done I will be able to get the final parquet file. |
@selipot Can this be merged? |
yes! |
This casts all float64 variables to float32 before returning the Dataset in the RaggedArray.to_xarray() method, but gives an option to not do it if desired. I tested it with a small gdp-2.01 file (10 random drifters) and it works.
Can you confirm that float32 is OK for longitude and latitude variables? We get a precision of about 7 significant digits. For a drifter at a 3-digit longitude, precision is down to O(10 m). Is that sufficient?
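As a rough check on the precision question above, the sketch below measures the gap between adjacent float32 values at a given longitude and converts it to meters. The 111,320 m per degree figure is an approximate equatorial value assumed here, and the helper name is hypothetical. The "7 digits" rule of thumb leaves 4 decimals at a 3-digit longitude, about 1e-4 degrees or 11 m; the binary spacing of float32 is finer than that, on the order of a meter or two.

```python
import numpy as np

# Assumption: approximate length of one degree of longitude at the equator.
METERS_PER_DEGREE = 111_320.0


def float32_spacing_m(lon_deg):
    """Distance in meters between adjacent float32 values at this longitude."""
    spacing_deg = float(np.spacing(np.float32(abs(lon_deg))))
    return spacing_deg * METERS_PER_DEGREE


for lon in (5.0, 50.0, 180.0):
    print(f"lon={lon:6.1f} deg -> float32 spacing ~ {float32_spacing_m(lon):.2f} m")
```

The worst-case rounding error is half the spacing. Whether that is acceptable depends on the accuracy of the position fixes themselves, which is why the thread settles on keeping double precision for longitude and latitude.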