
Cast GDP float64 data to float32 as an option (default) #269

Merged · 3 commits · Sep 21, 2023

Conversation

milancurcic
Member

@milancurcic milancurcic commented Sep 18, 2023

This casts all float64 variables to float32 before returning the Dataset in the RaggedArray.to_xarray() method, but gives an option to skip the cast if desired. I tested it with a small gdp-2.01 file (10 random drifters) and it works.

Can you confirm that float32 is OK for the longitude and latitude variables? float32 gives about 7 significant digits, so for a drifter at a 3-digit longitude the precision is down to O(10 m). Is that sufficient?
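
A quick, illustrative check of the float32 spacing near a 3-digit longitude (a sketch for the discussion, not part of this PR):

import numpy as np

# Gap to the next representable float32 value near 150 degrees of longitude.
ulp = np.spacing(np.float32(150.0))
print(ulp)            # ~1.5e-5 degrees
print(ulp * 111e3)    # a few metres, using ~111 km per degree at the equator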

@milancurcic milancurcic added the arhicved-label-data-adapters Adapters for custom datasets into CloudDrift label Sep 18, 2023
@milancurcic milancurcic requested a review from selipot September 18, 2023 18:34
Member

@selipot selipot left a comment


Good to go for me for now. A user who wants to keep some variables as float64 would have to pass False and do the conversion manually. I think this is acceptable.

Review comment on clouddrift/raggedarray.py (outdated, resolved)
@selipot
Member

selipot commented Sep 18, 2023

I was able to discuss this a bit more with @JeffreyEarly : we should keep double precision for latitude and longitude estimates but we can convert to single precision for all other variables. (Even for uncertainty of longitude and latitude.)

Since this PR applies to the whole RaggedArray class, I think it is not a good idea in the end. We should keep double precision and manually convert what we need for the GDP dataset.

@milancurcic
Member Author

Sounds good, we can exclude specific variables from the cast. Do you know why double for velocity?

@selipot
Member

selipot commented Sep 18, 2023

Sounds good, we can exclude specific variables from the cast. Do you know why double for velocity?

That was a typo: latitude and longitude, now corrected.

to_xarray() docstring extension.
@selipot
Member

selipot commented Sep 18, 2023

So in conclusion, despite my commit, I don't think we should implement this PR.

@milancurcic
Member Author

I'm confused. Why not cast variables other than lat and lon?

@selipot
Member

selipot commented Sep 18, 2023

I'm confused. Why not cast variables other than lat and lon?

If I understand correctly, this PR proposes to convert all variables to float32 (not only in the GDP case!). Some users might want to retain double precision for their specific variables, so I am not in favor of this default behavior.

@milancurcic
Member Author

OK, if you prefer manually casting that's fine. Better than adding code that won't be useful.

As an aside, since it came up here: the RaggedArray class is currently only used for GDP. I can't imagine using it for other datasets instead of simple functions, as I did for MOSAiC. Long term I'd like to remove it altogether; nothing to do for now.

@philippemiron
Contributor

I'm confused by this. So we are doing nothing and the GDP ragged array will be 35 GB? I don't think it makes much sense to have all this precision.

@selipot
Member

selipot commented Sep 19, 2023

26 GB, not 35 GB. We are doing the conversion by hand for the GDP, not as a default behavior of raggedarray.to_netcdf(). The manual conversion brings it down to ~16 GB.
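
For reference, a rough way to verify those numbers on a local copy of the ragged array; the file path is a placeholder and the exact figures are assumptions, not measurements from this thread:

import xarray as xr

ds = xr.open_dataset("gdp_ragged_array.nc")  # hypothetical local copy
print(f"all float64: {ds.nbytes / 1e9:.1f} GB")

# Cast every float64 variable except lon, lat, and time to float32.
for var in list(ds.variables):
    if var not in ("lon", "lat", "time") and ds[var].dtype == "float64":
        ds[var] = ds[var].astype("float32")

print(f"after the cast: {ds.nbytes / 1e9:.1f} GB")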

@philippemiron
Contributor

But it could be done in adapters.gdp1h and adapters.gdp6h so we don't have to do anything manually (and limit possible divergence) when regenerating the GDP ragged array.

@selipot
Member

selipot commented Sep 19, 2023

I think that's a great idea @philippemiron

@milancurcic milancurcic reopened this Sep 19, 2023
@milancurcic
Member Author

It sounds like the current direction is:

  • Move casting to adapters.gdp1h and adapters.gdp6h
  • Exclude time, latitude, and longitude from the cast

Please confirm.
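
A minimal sketch of what that could look like in the adapters, consistent with the helper referenced in the diff further down (the name and the default skip list are illustrative, not necessarily the merged code):

import numpy as np
import xarray as xr

def cast_float64_variables_to_float32(
    ds: xr.Dataset, variables_to_skip: list[str] | None = None
) -> xr.Dataset:
    # Cast all float64 variables to float32, except those explicitly skipped.
    if variables_to_skip is None:
        variables_to_skip = ["lon", "lat", "time"]
    for var in list(ds.variables):
        if var not in variables_to_skip and ds[var].dtype == np.float64:
            ds[var] = ds[var].astype(np.float32)
    return ds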

@selipot
Member

selipot commented Sep 19, 2023

Sounds like a plan.

@@ -510,6 +510,9 @@ def preprocess(index: int, **kwargs) -> xr.Dataset:
# rename variables
ds = ds.rename_vars({"longitude": "lon", "latitude": "lat"})

# Cast float64 variables to float32 to reduce memory footprint.
ds = gdp.cast_float64_variables_to_float32(ds)
Member


Is this where you should add variables_to_skip = ["lon", "lat", "time"]?

Member Author


That's the default, see the function definition.

@selipot
Member

selipot commented Sep 20, 2023

Once this is done I will be able to get the final parquet file.

@milancurcic
Member Author

@selipot Can this be merged?

@selipot
Member

selipot commented Sep 21, 2023

yes!

@milancurcic milancurcic merged commit afbf318 into Cloud-Drift:main Sep 21, 2023
@milancurcic milancurcic deleted the enforce-gdp-float32 branch September 21, 2023 16:51
philippemiron pushed a commit to philippemiron/clouddrift that referenced this pull request Nov 16, 2023

* Cast GDP float64 data to float32 as an option (default)

* Update raggedarray.py

to_xarray() docstring extension.

* Move casting to adapters.gdp

---------

Co-authored-by: Shane Elipot <selipot@miami.edu>