GLAD dataset adapter #61

milancurcic · 2022-12-15T17:34:33Z

Part of #53.

Can be adapted from clouddrift-examples/data/glad.py into clouddrift/adapters.py.

selipot · 2022-12-15T19:47:58Z

So will we also have adapters for gdp, gdp6h, parcels, etc.?

milancurcic · 2022-12-15T19:52:02Z

Yes, and anything else that we want and that our users ask for, assuming it's in scope.

miniufo · 2023-05-08T16:32:54Z

Hi guys, this is an interesting project that I recently came across. I found that the ragged array data structure could also be applied to Lagrangian-type data such as tropical cyclone best-track datasets (see my repo). I once tried to design a data structure (basically a wrapper around pandas.DataFrame) and adapt it to the GDP drifter dataset (the 6-hourly version, not the hourly one; see here). Since your ragged data structure follows the CF conventions, I feel it would be much better to use it to refactor my tropical cyclone repo.

A further thought: is it possible to isolate the Lagrangian data structure as a standalone package, like xarray, so that the GDP datasets, the GLAD dataset, and tropical cyclone datasets (or any Lagrangian dataset in geoscience, including synthetic particles generated by numerical models) can be built on it easily, with some additional effort to parse each dataset into the ragged array (using different adapters)?

How efficient is the ragged array when handling very large datasets? Pandas and xarray have many capabilities for dealing with huge datasets (such as out-of-core computation). Since the docs are still in development, I don't yet know many details of your design.

Just some thoughts on this great package.

philippemiron · 2023-05-08T17:57:39Z

The main class of the package is designed to be used with any dataset. Look at the example notebooks here: https://github.com/Cloud-Drift/clouddrift-examples/tree/main/notebooks. In particular, I think the numerical data example could be adapted to your needs!

Happy to help if you have any questions.

PS: we are changing the name of the class from dataformat to raggedarray as part of #171.

milancurcic · 2023-05-09T15:25:21Z

Thanks @miniufo for your interest and ideas. To clarify, the RaggedArray class is an intermediate data structure used internally to go from custom data formats -> xr.Dataset. It's not intended for use in analysis; instead, we define our Lagrangian analysis functions on the ragged array xr.Dataset. You're correct that TC tracks (and intensity and other vortex properties) are essentially Lagrangian and fit here very well.

You're welcome to use clouddrift's RaggedArray as a dependency in your library to make adapters for HURDAT2, IBTrACS, or others.

Alternatively, we can also implement these adapters directly in clouddrift; we could work on that together if you'd like.
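To illustrate what "ragged array" means in the discussion above, here is a minimal sketch in plain Python of a contiguous ragged layout: all observations are stored in one flat array, and a per-trajectory row size records how many observations belong to each trajectory. This is not clouddrift's implementation; the names `rowsize`, `lon`, and `get_trajectory`, and the sample values, are assumptions chosen for illustration only.

```python
from itertools import accumulate

# Three hypothetical trajectories of different lengths (longitudes, say).
trajectories = {
    "A": [10.0, 10.1, 10.2],
    "B": [20.0],
    "C": [30.0, 30.5],
}

# Ragged layout: one flat observation array plus a per-trajectory row size.
ids = list(trajectories)
rowsize = [len(trajectories[i]) for i in ids]
lon = [x for i in ids for x in trajectories[i]]

# Start offset of each trajectory in the flat array (cumulative row sizes).
starts = [0, *accumulate(rowsize)][:-1]


def get_trajectory(k):
    """Return the observations of the k-th trajectory from the flat array."""
    return lon[starts[k]:starts[k] + rowsize[k]]
```

In this layout no padding is needed for short trajectories, which is why it suits datasets where track lengths vary widely (drifters, TC tracks).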

miniufo · 2023-05-09T20:52:31Z

@philippemiron Thanks for pointing me to the notebooks. I've spent some time trying out the RaggedArray data structure. Now I see that it is an internal thing, as mentioned by @milancurcic, and that the output xr.Dataset is the key data structure users play with.

I'm a little confused about why we need an internal RaggedArray. All the Lagrangian datasets that are stored as txt files could easily be handled by pandas. I think pandas can play a similar role to RaggedArray and help rearrange the data into an xr.Dataset. If this is the case, I may skip the dependency on RaggedArray and rely on pandas to rearrange the raw data into an xr.Dataset as you have designed it here.

Just trying to understand your design. I'd like to help if I can.

philippemiron · 2023-05-10T01:47:31Z

> @philippemiron Thanks for pointing me to the notebooks. I've spent some time trying out the RaggedArray data structure. Now I see that it is an internal thing, as mentioned by @milancurcic, and that the output xr.Dataset is the key data structure users play with.

This is correct. Most of the analysis functions are based on xr.Dataset (some also support pd.Series or np.array).

> I'm a little confused about why we need an internal RaggedArray. All the Lagrangian datasets that are stored as txt files could easily be handled by pandas. I think pandas can play a similar role to RaggedArray and help rearrange the data into an xr.Dataset. If this is the case, I may skip the dependency on RaggedArray and rely on pandas to rearrange the raw data into an xr.Dataset as you have designed it here.

The idea of the RaggedArray class is to simplify this conversion. You can of course generate the ragged array yourself and use the clouddrift analysis functions afterwards.

In your case, if I understand correctly, you can probably just reshape the data and create a RaggedArray object in a few lines. As Milan said, we could help you generate this; it should be easy considering it's a single .txt file.

Once you have this object, there are functions to easily convert to either an xr.Dataset, an Awkward Array, or output to a NetCDF or a parquet file.
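As a rough illustration of the reshaping step (not of the RaggedArray API itself, which is not shown in this thread), the sketch below parses a small track table and regroups it into flat arrays plus a per-track row size. Plain Python with the standard `csv` module stands in for pandas to keep the example dependency-free. The column names, the sample rows, and the `ragged` dictionary are all hypothetical.

```python
import csv
import io

# Hypothetical text standing in for a single track file (rows not grouped).
raw = """id,time,lat,lon
TC1,0,15.0,-60.0
TC1,6,15.5,-61.0
TC2,0,12.0,-45.0
TC1,12,16.1,-62.2
TC2,6,12.4,-46.1
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Sort so that each track's rows are contiguous and ordered in time.
rows.sort(key=lambda r: (r["id"], float(r["time"])))

# Count how many consecutive rows belong to each track.
ids, rowsize = [], []
for r in rows:
    if ids and ids[-1] == r["id"]:
        rowsize[-1] += 1
    else:
        ids.append(r["id"])
        rowsize.append(1)

# Flat per-observation arrays plus per-track metadata.
ragged = {
    "id": ids,
    "rowsize": rowsize,
    "time": [float(r["time"]) for r in rows],
    "lat": [float(r["lat"]) for r in rows],
    "lon": [float(r["lon"]) for r in rows],
}
```

The same grouping could be expressed with a pandas sort and group-by; the point is only that the per-track row sizes and the flat observation arrays are all that the ragged layout needs.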

> Just trying to understand your design. I'd like to help if I can.

milancurcic · 2023-11-09T16:12:04Z

I haven't found a way to download the dataset (https://data.gulfresearchinitiative.org/data/R1.x134.073:0004) from the code. This is because there is no static dataset URL, but instead it's resolved dynamically via JavaScript (and quite likely server calls). We have a few options:

1. Instruct the user to download the dataset to the local file system before running the adapter;
2. Upload a copy of the dataset to S3 or some other static source. I've used GitHub issues as file storage (you attach a file to a blank issue, close the issue, and you get a static URL to the file); however, this only works for files under 25 MB, and GLAD is 150 MB.

Option 2 would allow for a better user experience. Since the dataset has a DOI and is finalized, we could serve a copy from a place we control without worrying that the upstream dataset may change. @selipot do we have an S3 bucket for the project that we could use?
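Whichever source ends up hosting the file, the adapter side could look roughly like the sketch below: use a local copy if one exists, otherwise fetch it once from a static URL. This is only an illustration under stated assumptions; `ensure_local` is a hypothetical helper, not a clouddrift function, and the path and URL passed to it would have to be supplied by the caller.

```python
import os
import urllib.request


def ensure_local(path, url, fetch=urllib.request.urlretrieve):
    """Return `path`, downloading from `url` first if the file is missing.

    `fetch` is injectable so the download step can be replaced in tests.
    """
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        tmp = path + ".part"
        fetch(url, tmp)        # download under a temporary name
        os.replace(tmp, path)  # move into place only once complete
    return path
```

Downloading to a temporary name first means an interrupted transfer does not leave a truncated file that later runs would mistake for a complete copy.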

selipot · 2023-11-09T16:23:11Z

We do not have a bucket, but we could create one. We would need to figure out the cost.

milancurcic · 2023-11-09T16:28:10Z

S3 Standard is $0.023 per GB, so for GLAD that would be $0.00345 per download, or 290 downloads per $1.
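The arithmetic behind those figures can be reproduced as follows. This only re-does the calculation with the numbers quoted in the thread (the $0.023 per GB price and the 150 MB dataset size, taken as 0.150 GB); whether that per-GB price is the right one to apply per download is not checked here.

```python
PRICE_PER_GB = 0.023  # USD, the S3 Standard figure quoted in the thread
GLAD_SIZE_GB = 0.150  # ~150 MB, as stated earlier in the thread

cost_per_download = PRICE_PER_GB * GLAD_SIZE_GB
downloads_per_dollar = 1 / cost_per_download

print(f"${cost_per_download:.5f} per download")
print(f"{downloads_per_dollar:.0f} downloads per $1")
```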

milancurcic · 2023-11-09T17:41:20Z

I now see that @philippemiron had already extracted a static URL from the backend in the GLAD example notebook. I'll check that it still works, and if so we'll just use that.

milancurcic · 2023-11-09T17:43:17Z

It works; all good.

philippemiron · 2023-11-10T04:43:37Z

I think I looked at the Developer tools -> Network tab at the time to find this direct link! Glad to see it still works!

milancurcic · 2023-11-10T14:54:00Z

@philippemiron that's smart, I hadn't thought of that; I only looked in the page source. :)

milancurcic added the enhancement (New feature or request) and arhicved-label-data-adapters (Adapters for custom datasets into CloudDrift) labels Dec 15, 2022

milancurcic self-assigned this Dec 15, 2022

milancurcic added this to Data adapters Dec 15, 2022

milancurcic moved this to Todo in Data adapters Dec 15, 2022

milancurcic mentioned this issue Nov 9, 2023 in GLAD adapter and dataset accessor #318 (merged)

milancurcic closed this as completed in #318 Nov 13, 2023

github-project-automation bot moved this from Todo to Done in Data adapters Nov 13, 2023
