Rework generated ragged array to utilize new conventions for coordinates #374

kevinsantana11 · 2024-02-22T15:10:15Z

fixes #372 and reworks the coordinates to id(traj) and time(obs)
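For readers unfamiliar with the layout, here is a minimal sketch of what "id(traj) and time(obs)" means: per-trajectory variables such as id have one entry per trajectory, while per-observation variables such as time are concatenated along obs. It uses plain Python lists and made-up values; the `rowsize` counts and the helper name `time_for` are illustrative assumptions, not code from this PR.

```python
from itertools import accumulate

# One entry per trajectory (dimension "traj"). Values are made up.
ids = [63123, 101877, 300234060218770]
rowsize = [3, 2, 4]  # assumed: number of observations per trajectory

# One entry per observation (dimension "obs"), all trajectories concatenated.
time = [0, 1, 2, 10, 11, 20, 21, 22, 23]

# Start offset of each trajectory inside the flat "obs" arrays.
offsets = [0, *accumulate(rowsize)]


def time_for(drifter_id: int) -> list[int]:
    """Return the slice of `time` that belongs to one drifter id."""
    i = ids.index(drifter_id)
    return time[offsets[i] : offsets[i + 1]]
```

With this layout, looking up a trajectory by id only needs the per-trajectory arrays and the running sum of `rowsize`.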

selipot · 2024-02-23T19:43:09Z

Some tests. The following works, but I get a surprising warning:

from clouddrift.adapters import gdp1h, gdp6h
from clouddrift.adapters.gdp1h import GDP_DATA_URL, GDP_DATA_URL_EXPERIMENTAL

tmp_path = '/Users/selipot/Data/drifters/raw/2.01/'
tmp_path_exp = '/Users/selipot/Data/drifters/raw/exp/'
tmp_path_6h = '/Users/selipot/Data/drifters/raw/6h/'

ra = gdp1h.to_raggedarray(tmp_path=tmp_path,n_random_id=10)
ds_1 = ra.to_xarray()

https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_78809.nc: 100%|██████████| 727k/727k [00:11<00:00, 65.0kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_300234065704960.nc: 100%|██████████| 808k/808k [00:13<00:00, 63.5kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_96844.nc: 100%|██████████| 1.03M/1.03M [00:13<00:00, 78.2kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_89769.nc: 100%|██████████| 1.21M/1.21M [00:15<00:00, 79.5kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_63123.nc: 100%|██████████| 1.49M/1.49M [00:18<00:00, 85.1kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_101877.nc: 100%|██████████| 2.07M/2.07M [00:18<00:00, 116kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_300234062951460.nc: 100%|██████████| 2.34M/2.34M [00:20<00:00, 119kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_92905.nc: 100%|██████████| 2.70M/2.70M [00:20<00:00, 136kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_300234060218770.nc: 100%|██████████| 4.24M/4.24M [00:22<00:00, 201kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_132563.nc: 100%|██████████| 5.30M/5.30M [00:25<00:00, 217kB/s]
Retrieving the number of obs: 100%|█████████████| 10/10 [00:00<00:00, 14.56it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:371: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(f"Variable {var} requested but not found; skipping.")
Filling the Ragged Array:   0%|                          | 0/10 [00:00<?, ?it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:402: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(
Filling the Ragged Array: 100%|█████████████████| 10/10 [00:00<00:00, 29.36it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:311: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(f"Variable {var} requested but not found; skipping.")
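The output shows the message comes from a `warnings.warn` call that fires when a requested variable name is absent. One way this can happen is a name mismatch, for example asking for `ID` when the data only has `id`; whether that was the cause here is an assumption. A minimal sketch of the pattern, with a hypothetical helper `select_variables` and a made-up dataset:

```python
import warnings


def select_variables(available: dict, requested: list[str]) -> dict:
    """Keep the requested variables that exist; warn about the others."""
    selected = {}
    for var in requested:
        if var in available:
            selected[var] = available[var]
        else:
            warnings.warn(f"Variable {var} requested but not found; skipping.")
    return selected


# "ID" is requested but the (made-up) dataset only has "id".
dataset = {"id": [63123], "time": [0, 1, 2]}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = select_variables(dataset, ["ID", "time"])
```

The call still succeeds, which matches the "works but warns" behaviour reported above: the missing name is skipped rather than raising.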

Same thing with:

ra_exp = gdp1h.to_raggedarray(tmp_path=tmp_path_exp,n_random_id=10)
ds_exp = ra_exp.to_xarray()

Retrieving the number of obs: 100%|████████████| 10/10 [00:00<00:00, 110.78it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:371: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(f"Variable {var} requested but not found; skipping.")
Filling the Ragged Array:   0%|                          | 0/10 [00:00<?, ?it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:402: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(
Filling the Ragged Array: 100%|█████████████████| 10/10 [00:00<00:00, 32.26it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:311: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(f"Variable {var} requested but not found; skipping.")

And this fails for me:

ra_6 = gdp6h.to_raggedarray(tmp_path=tmp_path_6h,n_random_id=10)

Downloading GDP 6-hourly data to /Users/selipot/Data/drifters/raw/6h/...
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_1_5000/drifter_6h_9704356.nc: 100%|██████████| 156k/156k [00:00<00:00, 236kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_10001_15000/drifter_6h_71490.nc: 100%|██████████| 121k/121k [00:00<00:00, 193kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_15001_current/drifter_6h_300234068341240.nc: 100%|██████████| 140k/140k [00:00<00:00, 180kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_15001_current/drifter_6h_101854.nc: 100%|██████████| 153k/153k [00:00<00:00, 195kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_5001_10000/drifter_6h_9817223.nc: 100%|██████████| 178k/178k [00:00<00:00, 220kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_1_5000/drifter_6h_7700519.nc: 100%|██████████| 150k/150k [00:00<00:00, 185kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_15001_current/drifter_6h_300534061287570.nc: 100%|██████████| 150k/150k [00:00<00:00, 167kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_10001_15000/drifter_6h_89820.nc: 100%|██████████| 173k/173k [00:00<00:00, 183kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_15001_current/drifter_6h_300234068246610.nc: 100%|██████████| 273k/273k [00:01<00:00, 248kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_15001_current/drifter_6h_300234065708110.nc: 100%|██████████| 455k/455k [00:01<00:00, 379kB/s]
Retrieving the number of obs: 100%|████████████| 10/10 [00:00<00:00, 155.31it/s]
/opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/conventions.py:436: SerializationWarning: variable 'WMO' has multiple fill values {-999999, '-999999'}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
[...]
ValueError: Variable 'CurrentProgram': Could not convert tuple of form (dims, data[, attrs, encoding]): ('traj', 1425) to Variable.

selipot

See my comment in the PR conversation

kevinsantana11 · 2024-02-29T05:33:00Z

@selipot the warning about the missing fields and the bug in the 6h adapter code have been fixed. I also added unit tests for the 6-hourly dataset.

In my previous commit I overlooked the fact that the integration tests I had added weren't running with the CI command. In my local environment I was running the newly added tests via python -m unittest tests/adapters/*.py, so I didn't realize they weren't running as part of the larger suite. That should be fixed now: I've made sure those tests run in our CI along with the other test cases.
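The thread does not say why the suite skipped those tests. One common cause, offered here only as an assumption, is that `unittest` discovery does not descend into a subdirectory that lacks an `__init__.py`, while naming the files explicitly on the command line still runs them. The sketch below builds two throwaway directory trees and counts what discovery collects; the directory names and the `*_tests.py` pattern are hypothetical, not the project's actual layout.

```python
import tempfile
import unittest
from pathlib import Path

TEST_SOURCE = """
import unittest


class ExampleTest(unittest.TestCase):
    def test_ok(self):
        self.assertTrue(True)
"""


def count_discovered(start_dir: Path) -> int:
    """Count the test cases that unittest discovery finds under start_dir."""
    suite = unittest.TestLoader().discover(str(start_dir), pattern="*_tests.py")
    return suite.countTestCases()


def make_tree(root: Path, subdir: str, with_init: bool) -> None:
    """Create root/subdir/example_tests.py, optionally as a package."""
    pkg = root / subdir
    pkg.mkdir(parents=True)
    (pkg / "example_tests.py").write_text(TEST_SOURCE)
    if with_init:
        (pkg / "__init__.py").write_text("")


with tempfile.TemporaryDirectory() as tmp:
    plain, packaged = Path(tmp) / "plain", Path(tmp) / "packaged"
    make_tree(plain, "adapters_plain", with_init=False)
    make_tree(packaged, "adapters_pkg", with_init=True)
    found_plain = count_discovered(plain)
    found_packaged = count_discovered(packaged)
```

Comparing the discovered count against the number of tests you expect is a quick way to confirm that newly added test files are part of the run that CI executes.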

selipot · 2024-02-29T17:06:44Z

@kevinsantana11 I tested again and it works. However, in each case the coordinate variable id is of type |S17, but we want that variable to be int64 ...
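To illustrate the difference, here is a small NumPy sketch that casts fixed-width byte strings to int64. The array is a made-up stand-in for the id coordinate (the values are borrowed from output later in this thread); where the adapter should actually apply such a cast is not shown in this conversation and is left as an assumption.

```python
import numpy as np

# Made-up stand-in for an id coordinate stored as fixed-width byte strings.
ids_as_bytes = np.array([b"63123", b"101877", b"300234060218770"], dtype="S17")

# NumPy parses numeric byte strings when casting to an integer dtype.
ids_as_int = ids_as_bytes.astype(np.int64)
```

An integer id compares and sorts numerically, whereas the byte-string form compares lexicographically.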

selipot · 2024-02-29T17:14:51Z

I am getting this error when I try adapters.gdp6h with a list of ids:

ra_6 = gdp6h.to_raggedarray(tmp_path=tmp_path_6h,drifter_ids=[37640])

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[24], line 1
----> 1 ra_6 = gdp6h.to_raggedarray(tmp_path=tmp_path_6h,drifter_ids=[37640])

File ~/projects.git/clouddrift/clouddrift/adapters/gdp6h.py:479, in to_raggedarray(drifter_ids, n_random_id, tmp_path)
    423 def to_raggedarray(
    424     drifter_ids: Optional[list[int]] = None,
    425     n_random_id: Optional[int] = None,
    426     tmp_path: str = GDP_TMP_PATH,
    427 ) -> RaggedArray:
    428     """Download and process individual GDP 6-hourly files and return a
    429     RaggedArray instance with the data.
    430
   (...)
    477     >>> arr.to_parquet("gdp6h.parquet")
    478     """
--> 479     ids = download(GDP_DATA_URL, tmp_path, drifter_ids, n_random_id)
    481     ra = RaggedArray.from_files(
    482         indices=ids,
    483         preprocess_func=preprocess,
   (...)
    489         tmp_path=tmp_path,
    490     )
    492     # update dynamic global attributes

File ~/projects.git/clouddrift/clouddrift/adapters/gdp6h.py:103, in download(url, tmp_path, drifter_ids, n_random_id)
     97         rng = np.random.RandomState(42)
     98         drifter_urls = list(rng.choice(drifter_urls, n_random_id, replace=False))
    100 download_with_progress(
    101     [
    102         (url, os.path.join(tmp_path, os.path.basename(url)), None)
--> 103         for url in drifter_urls
    104     ]
    105 )
    107 # Download the metadata so we can order the drifter IDs by end date.
    108 gdp_metadata = gdp.get_gdp_metadata()

UnboundLocalError: cannot access local variable 'drifter_urls' where it is not associated with a value
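This error is what Python raises when a local name is read on a code path where it was never assigned. The sketch below reproduces that shape with hypothetical functions and placeholder file names; it assumes, but the traceback does not fully show, that the adapter only built `drifter_urls` when no ids were passed.

```python
from typing import Optional


def build_urls_buggy(drifter_ids: Optional[list[int]] = None) -> list[str]:
    """Mimics the failure: the list is only created in one branch."""
    if drifter_ids is None:
        drifter_urls = ["drifter_6h_1.nc", "drifter_6h_2.nc"]  # placeholder names
    # When drifter_ids is given, drifter_urls was never assigned.
    return [url for url in drifter_urls]


def build_urls_fixed(drifter_ids: Optional[list[int]] = None) -> list[str]:
    """Assign the list up front so every branch leaves it defined."""
    drifter_urls: list[str] = []
    for did in (drifter_ids if drifter_ids is not None else [1, 2]):
        drifter_urls.append(f"drifter_6h_{did}.nc")
    return drifter_urls


# Calling the buggy version with explicit ids triggers the error.
try:
    build_urls_buggy([37640])
    buggy_error = None
except UnboundLocalError as exc:
    buggy_error = exc
```

Initialising the variable before any branching makes the failure impossible regardless of which arguments are passed.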

philippemiron · 2024-03-01T04:41:25Z

clouddrift/adapters/gdp6h.py

        drifter_urls: list[str] = []
        for dir in directory_list:
            urlpath = urllib.request.urlopen(os.path.join(url, dir))
            string = urlpath.read().decode("utf-8")
            filelist = list(set(re.compile(pattern).findall(string)))
            for f in filelist:
                drifter_urls.append(os.path.join(url, dir, f))
+    else:
+        drifter_urls = [f"{url}/{filename_pattern.format(id=did)}" for did in drifter_ids]


I don't think this can work: you are missing the /dir/ here. I think you might have removed a dictionary we had here before that mapped each id to a directory, since those are not a direct match. So my fix (pasted below) is to get all the files by looping over the directories like you had it, and then keep only the selected drifter_ids, if passed as an argument, when creating drifter_urls.

I was actually going to push a modification that I just worked out, but since you changed other stuff, I will just add it here. I replace this whole block with:

    # create list of drifter urls
    urlpath = urllib.request.urlopen(url)
    string = urlpath.read().decode("utf-8")
    drifter_urls: list[str] = []
    for dir in directory_list:
        urlpath = urllib.request.urlopen(os.path.join(url, dir))
        string = urlpath.read().decode("utf-8")
        filelist = list(set(re.compile(pattern).findall(string)))
        for f in filelist:
            if drifter_ids is None or int(f[:-3].split("_")[2]) in drifter_ids:
                drifter_urls.append(os.path.join(url, dir, f))
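The filtering step in the patch above can be tried offline. In this sketch the directory listing is a hard-coded dict standing in for the `urlopen` calls, and `drifter_id_from_filename` and `select_urls` are hypothetical helper names. It also joins URL parts with "/" rather than `os.path.join`, which is a deliberate swap on my part, since `os.path.join` uses backslashes on Windows.

```python
from typing import Optional


def drifter_id_from_filename(filename: str) -> int:
    """Extract the id from a name like 'drifter_6h_9704356.nc'."""
    return int(filename[:-3].split("_")[2])


def select_urls(
    base_url: str,
    listing: dict[str, list[str]],
    drifter_ids: Optional[list[int]] = None,
) -> list[str]:
    """Build URLs for every file, or only for the requested drifter ids."""
    urls = []
    for directory, filenames in listing.items():
        for f in filenames:
            if drifter_ids is None or drifter_id_from_filename(f) in drifter_ids:
                # Join with "/" explicitly: URLs always use forward slashes.
                urls.append(f"{base_url.rstrip('/')}/{directory}/{f}")
    return urls


# Directory and file names taken from the download log earlier in this thread.
listing = {
    "netcdf_1_5000": ["drifter_6h_9704356.nc", "drifter_6h_7700519.nc"],
    "netcdf_10001_15000": ["drifter_6h_71490.nc"],
}
base = "https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h"
```

Calling `select_urls(base, listing, [71490])` keeps only the one file whose name carries that id, and the directory it lives in comes along for free because the filter runs inside the directory loop.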

selipot · 2024-03-01T13:41:09Z

@kevinsantana11 not yet working for me.

gdp1h.to_raggedarray().to_xarray()

still returns a string id variable, and

ra_6 = gdp6h.to_raggedarray(tmp_path=tmp_path_6h,drifter_ids=[37640])

still returns

UnboundLocalError: cannot access local variable 'drifter_urls' where it is not associated with a value

philippemiron · 2024-03-01T14:46:37Z

Try reinstalling the branch, it is working for me.

In [20]: from clouddrift.adapters import gdp1h, gdp6h

In [21]: gdp1h.to_raggedarray(n_random_id=3).to_xarray().id
Retrieving the number of obs: 100%|██████████████| 3/3 [00:00<00:00, 111.07it/s]
Filling the Ragged Array: 100%|███████████████████| 3/3 [00:00<00:00, 28.14it/s]
Out[21]: 
<xarray.DataArray 'id' (traj: 3)> Size: 24B
array([          63123,          101877, 300234060218770])
Coordinates:
    id       (traj) int64 24B 63123 101877 300234060218770
Dimensions without coordinates: traj
Attributes:
    long_name:  Global Drifter Program Buoy ID
    units:      -

and

In [24]: gdp6h.to_raggedarray(tmp_path="/var/folders/_6/hdhmyzr120jgn1d_45q65zkh0000gn/T/clouddrift/gdp6h",drifter_ids=[37640])
Downloading GDP 6-hourly data to /var/folders/_6/hdhmyzr120jgn1d_45q65zkh0000gn/T/clouddrift/gdp6h...
Retrieving the number of obs: 100%|██████████████| 1/1 [00:00<00:00, 129.44it/s]
/Users/pmiron/micromamba/envs/dev/lib/python3.12/site-packages/xarray/conventions.py:440: SerializationWarning: variable 'WMO' has multiple fill values {'-999999', -999999}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
Filling the Ragged Array: 100%|███████████████████| 1/1 [00:00<00:00, 38.81it/s]
Out[24]: <clouddrift.raggedarray.RaggedArray at 0x284a40710>

selipot · 2024-03-01T15:02:25Z

Yes! It all seems to be working now!

Rework generated ragged array to utilize new conventions for coordina…

0541a66

…tes Cloud-Drift#372

kevinsantana11 requested a review from selipot February 22, 2024 15:10

selipot requested changes Feb 23, 2024

selipot assigned kevinsantana11 Feb 28, 2024

selipot added the enhancement New feature or request label Feb 28, 2024

kevinsantana11 mentioned this pull request Feb 29, 2024

⭐ Hurdat2 adapter #378

Merged

fix unit tests in adapters not running

b9a1aa3

changeset: * add unit tests for gdp6h * fix bugs in gdp6h adapter code

kevinsantana11 added 2 commits February 29, 2024 23:24

fix bug - gdp6h drifter id selection broken

e42ef45

add in unit testing for gdp6h module

Fix bug where dtype for id was a string vs int64 (desired)

6183701

philippemiron reviewed Mar 1, 2024

kevinsantana11 added 10 commits February 29, 2024 23:54

specify netcdf loading engine for windows

ecc46f5

platform logic for rowsize func

49432cd

skip big downloads

ed9df6e

remove platform specific

baf88e9

changeset: update mypy config so new files are picked up fix bug on malformed url in windows

ruff

519ed4c

platform specific paths

0cc6d35

use @philippemiron patch which fixes the download bug when specifying…

0b35eb9

… drifter ids

remove import

713198d

ordering

556796d

remove skipped

1372b2a

selipot approved these changes Mar 1, 2024

kevinsantana11 merged commit a370d68 into Cloud-Drift:main Mar 3, 2024
16 checks passed
