
Cast GDP float64 data to float32 as an option (default) #269

Merged · 3 commits · Sep 21, 2023

Conversation

milancurcic
Member

@milancurcic milancurcic commented Sep 18, 2023

This casts all float64 variables to float32 before returning the Dataset in the RaggedArray.to_xarray() method, but gives an option to skip the cast if desired. I tested it with a small gdp-2.01 file (10 random drifters) and it works.

Can you confirm that float32 is OK for the longitude and latitude variables? float32 gives about 7 significant digits, so for a drifter at a 3-digit longitude the precision is down to O(10 m). Is that sufficient?
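
A quick, illustrative check of the float32 spacing near a 3-digit longitude (a sketch for the discussion, not part of this PR):

import numpy as np

# Gap to the next representable float32 value near 150 degrees of longitude.
ulp = np.spacing(np.float32(150.0))
print(ulp)            # ~1.5e-5 degrees
print(ulp * 111e3)    # a few metres, using ~111 km per degree at the equator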

@milancurcic milancurcic added the arhicved-label-data-adapters Adapters for custom datasets into CloudDrift label Sep 18, 2023
@milancurcic milancurcic requested a review from selipot September 18, 2023 18:34
Member

@selipot selipot left a comment


Good to go for me for now. A user who wants to keep some variables as float64 would have to pass False and do the conversion manually. I think this is acceptable.

Review comment on clouddrift/raggedarray.py (outdated, resolved)
@selipot
Member

selipot commented Sep 18, 2023

I was able to discuss this a bit more with @JeffreyEarly : we should keep double precision for latitude and longitude estimates but we can convert to single precision for all other variables. (Even for uncertainty of longitude and latitude.)

Since this PR applies to the whole RaggedArray class, I think it is not a good idea in the end. We should keep double precision and manually convert what we need for the GDP dataset.

@milancurcic
Member Author

Sounds good, we can exclude specific variables from the cast. Do you know why double for velocity?

@selipot
Member

selipot commented Sep 18, 2023

Sounds good, we can exclude specific variables from the cast. Do you know why double for velocity?

That was a typo: latitude and longitude, now corrected.

to_xarray() docstring extension.
@selipot
Member

selipot commented Sep 18, 2023

So in conclusion, despite my commit, I don't think we should implement this PR.

@milancurcic
Member Author

I'm confused. Why not cast variables other than lat and lon?

@selipot
Member

selipot commented Sep 18, 2023

I'm confused. Why not cast variables other than lat and lon?

If I understand correctly, this PR proposes to convert all variables to float32 (not only in the GDP case!). Some users might want to retain double precision for their specific variables, so I am not in favor of this default behavior.

@milancurcic
Member Author

OK, if you prefer manually casting that's fine. Better than adding code that won't be useful.

As an aside, since it came up here: the RaggedArray class is currently only used for GDP. I can't imagine using it for other datasets instead of simple functions, as I did for MOSAiC. Long term I'd like to remove it altogether; nothing to do for now.

@philippemiron
Contributor

I'm confused by this. So we are doing nothing and the GDP ragged array will be 35 GB? I don't think it makes much sense to have all this precision.

@selipot
Member

selipot commented Sep 19, 2023

26 GB, not 35 GB. We are doing the conversion by hand for the GDP, not as a default behavior of raggedarray.to_netcdf(). The manual conversion brings it down to ~16 GB.
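
For reference, a rough way to verify those numbers on a local copy of the ragged array; the file path is a placeholder and the exact figures are assumptions, not measurements from this thread:

import xarray as xr

ds = xr.open_dataset("gdp_ragged_array.nc")  # hypothetical local copy
print(f"all float64: {ds.nbytes / 1e9:.1f} GB")

# Cast every float64 variable except lon, lat, and time to float32.
for var in list(ds.variables):
    if var not in ("lon", "lat", "time") and ds[var].dtype == "float64":
        ds[var] = ds[var].astype("float32")

print(f"after the cast: {ds.nbytes / 1e9:.1f} GB")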

@philippemiron
Contributor

But it could be done in adapters.gdp1h and adapters.gdp6h so we don't have to do anything manually (and limit possible divergence) when regenerating the GDP ragged array.

@selipot
Member

selipot commented Sep 19, 2023

I think that's a great idea @philippemiron

@milancurcic milancurcic reopened this Sep 19, 2023
@milancurcic
Member Author

It sounds like the current direction is:

  • Move casting to adapters.gdp1h and adapters.gdp6h
  • Exclude time, latitude, and longitude from the cast

Please confirm.
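
A minimal sketch of what that could look like in the adapters, consistent with the helper referenced in the diff further down (the name and the default skip list are illustrative, not necessarily the merged code):

import numpy as np
import xarray as xr

def cast_float64_variables_to_float32(
    ds: xr.Dataset, variables_to_skip: list[str] | None = None
) -> xr.Dataset:
    # Cast all float64 variables to float32, except those explicitly skipped.
    if variables_to_skip is None:
        variables_to_skip = ["lon", "lat", "time"]
    for var in list(ds.variables):
        if var not in variables_to_skip and ds[var].dtype == np.float64:
            ds[var] = ds[var].astype(np.float32)
    return ds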

@selipot
Member

selipot commented Sep 19, 2023

Sounds like a plan.

@@ -510,6 +510,9 @@ def preprocess(index: int, **kwargs) -> xr.Dataset:
# rename variables
ds = ds.rename_vars({"longitude": "lon", "latitude": "lat"})

# Cast float64 variables to float32 to reduce memory footprint.
ds = gdp.cast_float64_variables_to_float32(ds)
Member


Is this where you should add variables_to_skip = ["lon", "lat", "time"]?

Member Author


That's the default, see the function definition.

@selipot
Member

selipot commented Sep 20, 2023

Once this is done I will be able to get the final parquet file.

@milancurcic
Member Author

@selipot Can this be merged?

@selipot
Member

selipot commented Sep 21, 2023

yes!

@milancurcic milancurcic merged commit afbf318 into Cloud-Drift:main Sep 21, 2023
@milancurcic milancurcic deleted the enforce-gdp-float32 branch September 21, 2023 16:51
philippemiron pushed a commit to philippemiron/clouddrift that referenced this pull request Nov 16, 2023

* Cast GDP float64 data to float32 as an option (default)

* Update raggedarray.py

to_xarray() docstring extension.

* Move casting to adapters.gdp

---------

Co-authored-by: Shane Elipot <selipot@miami.edu>