Rework generated ragged array to utilize new conventions for coordinates #374

Merged

Conversation

kevinsantana11
Contributor

fixes #372 and reworks the coordinates to id(traj) and time(obs)

@selipot
Member

selipot commented Feb 23, 2024

Some tests. The following works, but I get a surprising message:

from clouddrift.adapters import gdp1h, gdp6h
from clouddrift.adapters.gdp1h import GDP_DATA_URL, GDP_DATA_URL_EXPERIMENTAL

tmp_path = '/Users/selipot/Data/drifters/raw/2.01/'
tmp_path_exp = '/Users/selipot/Data/drifters/raw/exp/'
tmp_path_6h = '/Users/selipot/Data/drifters/raw/6h/'

ra = gdp1h.to_raggedarray(tmp_path=tmp_path,n_random_id=10)
ds_1 = ra.to_xarray()
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_78809.nc: 100%|██████████| 727k/727k [00:11<00:00, 65.0kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_300234065704960.nc: 100%|██████████| 808k/808k [00:13<00:00, 63.5kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_96844.nc: 100%|██████████| 1.03M/1.03M [00:13<00:00, 78.2kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_89769.nc: 100%|██████████| 1.21M/1.21M [00:15<00:00, 79.5kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_63123.nc: 100%|██████████| 1.49M/1.49M [00:18<00:00, 85.1kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_101877.nc: 100%|██████████| 2.07M/2.07M [00:18<00:00, 116kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_300234062951460.nc: 100%|██████████| 2.34M/2.34M [00:20<00:00, 119kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_92905.nc: 100%|██████████| 2.70M/2.70M [00:20<00:00, 136kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_300234060218770.nc: 100%|██████████| 4.24M/4.24M [00:22<00:00, 201kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/hourly_product/v2.01/drifter_hourly_132563.nc: 100%|██████████| 5.30M/5.30M [00:25<00:00, 217kB/s]
Retrieving the number of obs: 100%|█████████████| 10/10 [00:00<00:00, 14.56it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:371: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(f"Variable {var} requested but not found; skipping.")
Filling the Ragged Array:   0%|                          | 0/10 [00:00<?, ?it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:402: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(
Filling the Ragged Array: 100%|█████████████████| 10/10 [00:00<00:00, 29.36it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:311: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(f"Variable {var} requested but not found; skipping.")

Same thing with:

ra_exp = gdp1h.to_raggedarray(tmp_path=tmp_path_exp,n_random_id=10)
ds_exp = ra_exp.to_xarray()
Retrieving the number of obs: 100%|████████████| 10/10 [00:00<00:00, 110.78it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:371: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(f"Variable {var} requested but not found; skipping.")
Filling the Ragged Array:   0%|                          | 0/10 [00:00<?, ?it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:402: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(
Filling the Ragged Array: 100%|█████████████████| 10/10 [00:00<00:00, 32.26it/s]
/Users/selipot/projects.git/clouddrift/clouddrift/raggedarray.py:311: UserWarning: Variable ID requested but not found; skipping.
  warnings.warn(f"Variable {var} requested but not found; skipping.")

And this fails for me:

ra_6 = gdp6h.to_raggedarray(tmp_path=tmp_path_6h,n_random_id=10)
Downloading GDP 6-hourly data to /Users/selipot/Data/drifters/raw/6h/...
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_1_5000/drifter_6h_9704356.nc: 100%|██████████| 156k/156k [00:00<00:00, 236kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_10001_15000/drifter_6h_71490.nc: 100%|██████████| 121k/121k [00:00<00:00, 193kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_15001_current/drifter_6h_300234068341240.nc: 100%|██████████| 140k/140k [00:00<00:00, 180kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_15001_current/drifter_6h_101854.nc: 100%|██████████| 153k/153k [00:00<00:00, 195kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_5001_10000/drifter_6h_9817223.nc: 100%|██████████| 178k/178k [00:00<00:00, 220kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_1_5000/drifter_6h_7700519.nc: 100%|██████████| 150k/150k [00:00<00:00, 185kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_15001_current/drifter_6h_300534061287570.nc: 100%|██████████| 150k/150k [00:00<00:00, 167kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_10001_15000/drifter_6h_89820.nc: 100%|██████████| 173k/173k [00:00<00:00, 183kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_15001_current/drifter_6h_300234068246610.nc: 100%|██████████| 273k/273k [00:01<00:00, 248kB/s]
https://www.aoml.noaa.gov/ftp/pub/phod/buoydata/6h/netcdf_15001_current/drifter_6h_300234065708110.nc: 100%|██████████| 455k/455k [00:01<00:00, 379kB/s]
Retrieving the number of obs: 100%|████████████| 10/10 [00:00<00:00, 155.31it/s]
/opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.12/site-packages/xarray/conventions.py:436: SerializationWarning: variable 'WMO' has multiple fill values {-999999, '-999999'}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
[...]
ValueError: Variable 'CurrentProgram': Could not convert tuple of form (dims, data[, attrs, encoding]): ('traj', 1425) to Variable.

Member

@selipot selipot left a comment

See my comment in the PR conversation

@selipot selipot added the enhancement New feature or request label Feb 28, 2024
changeset:
* add unit tests for gdp6h
* fix bugs in gdp6h adapter code
@kevinsantana11
Contributor Author

@selipot the warning about the missing fields and the bug in the 6h adapter code have been fixed. I also added unit tests for the 6-hourly dataset.

In my previous commit I overlooked the fact that the integration tests I had added weren't running with the CI command. In my local environment I was running the newly added tests via python -m unittest tests/adapters/*.py, so I didn't realize they weren't running as part of the larger suite. That should be fixed now: I've made sure those tests run in our CI along with the other test cases.
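The thread does not say why the tests were skipped by the CI command. One common cause with `unittest` discovery, shown here purely as an assumption, is a test subdirectory without an `__init__.py`: discovery does not recurse into it, while passing the files explicitly still runs them. A minimal sketch that checks this in a throwaway directory (the directory name `adapters_demo` and the test file are hypothetical):

```python
import importlib
import tempfile
import unittest
from pathlib import Path

# Source of a tiny, hypothetical test module written to disk below.
TEST_SOURCE = (
    "import unittest\n"
    "\n"
    "class SampleTest(unittest.TestCase):\n"
    "    def test_ok(self):\n"
    "        self.assertTrue(True)\n"
)


def count_discovered(start_dir: Path) -> int:
    # A fresh loader per call, so no state leaks between the two discovery runs.
    suite = unittest.TestLoader().discover(str(start_dir), pattern="test_*.py")
    return suite.countTestCases()


with tempfile.TemporaryDirectory() as tmp:
    tests_dir = Path(tmp) / "tests"
    sub_dir = tests_dir / "adapters_demo"  # hypothetical subdirectory name
    sub_dir.mkdir(parents=True)
    (sub_dir / "test_sample.py").write_text(TEST_SOURCE)

    # Without __init__.py the subdirectory is not treated as a package
    # and discovery does not descend into it.
    found_without_init = count_discovered(tests_dir)

    # Adding __init__.py makes the subdirectory a package that discovery visits.
    (sub_dir / "__init__.py").write_text("")
    importlib.invalidate_caches()
    found_with_init = count_discovered(tests_dir)
```

Comparing the two counts is a quick way to confirm that a newly added test directory is actually picked up by the suite-wide command.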

@selipot
Member

selipot commented Feb 29, 2024

@kevinsantana11 I tested again and it works. However, in each case the coordinate variable id is of type |S17|, but we want that variable to be int64 ...
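For illustration only, the difference between the two dtypes can be shown with plain NumPy. This is a sketch under the assumption that the ids arrive as fixed-width byte strings; it is not the change made in the adapter, which the thread does not show. The variable names are hypothetical.

```python
import numpy as np

# Ids stored as fixed-width byte strings, matching the reported dtype "|S17".
ids_as_bytes = np.array([b"63123", b"101877", b"300234060218770"], dtype="S17")

# Decode each entry and parse it as an integer. int64 is wide enough for
# 15-digit ids such as 300234060218770.
ids_as_int = np.array(
    [int(raw.decode("ascii")) for raw in ids_as_bytes], dtype=np.int64
)
```

With integer ids, selections such as `ids_as_int == 63123` compare numbers directly instead of byte strings.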

@selipot
Member

selipot commented Feb 29, 2024

I am getting this error when I try adapters.gdp6h with a list of ids:

ra_6 = gdp6h.to_raggedarray(tmp_path=tmp_path_6h,drifter_ids=[37640])
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[24], line 1
----> 1 ra_6 = gdp6h.to_raggedarray(tmp_path=tmp_path_6h,drifter_ids=[37640])

File ~/projects.git/clouddrift/clouddrift/adapters/gdp6h.py:479, in to_raggedarray(drifter_ids, n_random_id, tmp_path)
    423 def to_raggedarray(
    424     drifter_ids: Optional[list[int]] = None,
    425     n_random_id: Optional[int] = None,
    426     tmp_path: str = GDP_TMP_PATH,
    427 ) -> RaggedArray:
    428     """Download and process individual GDP 6-hourly files and return a
    429     RaggedArray instance with the data.
    430 
   (...)
    477     >>> arr.to_parquet("gdp6h.parquet")
    478     """
--> 479     ids = download(GDP_DATA_URL, tmp_path, drifter_ids, n_random_id)
    481     ra = RaggedArray.from_files(
    482         indices=ids,
    483         preprocess_func=preprocess,
   (...)
    489         tmp_path=tmp_path,
    490     )
    492     # update dynamic global attributes

File ~/projects.git/clouddrift/clouddrift/adapters/gdp6h.py:103, in download(url, tmp_path, drifter_ids, n_random_id)
     97         rng = np.random.RandomState(42)
     98         drifter_urls = list(rng.choice(drifter_urls, n_random_id, replace=False))
    100 download_with_progress(
    101     [
    102         (url, os.path.join(tmp_path, os.path.basename(url)), None)
--> 103         for url in drifter_urls
    104     ]
    105 )
    107 # Download the metadata so we can order the drifter IDs by end date.
    108 gdp_metadata = gdp.get_gdp_metadata()

UnboundLocalError: cannot access local variable 'drifter_urls' where it is not associated with a value
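
This kind of error typically appears when a local variable is assigned in only one branch of a conditional and then read unconditionally afterwards. A minimal sketch of the pattern, unrelated to the real `download` function (the function names and the `example.invalid` URLs are hypothetical):

```python
def build_urls_buggy(drifter_ids=None):
    # `urls` is only bound when no ids are given ...
    if drifter_ids is None:
        urls = ["https://example.invalid/all.nc"]
    # ... so reading it here fails whenever ids ARE given.
    return list(urls)


def build_urls_fixed(drifter_ids=None):
    # Bind the variable on every path before it is read.
    urls = []
    if drifter_ids is None:
        urls.append("https://example.invalid/all.nc")
    else:
        urls.extend(f"https://example.invalid/{i}.nc" for i in drifter_ids)
    return urls


caught = None
try:
    build_urls_buggy(drifter_ids=[37640])
except UnboundLocalError as err:
    # Same exception type as in the traceback above.
    caught = err
```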

    drifter_urls: list[str] = []
    for dir in directory_list:
        urlpath = urllib.request.urlopen(os.path.join(url, dir))
        string = urlpath.read().decode("utf-8")
        filelist = list(set(re.compile(pattern).findall(string)))
        for f in filelist:
            drifter_urls.append(os.path.join(url, dir, f))
else:
    drifter_urls = [f"{url}/{filename_pattern.format(id=did)}" for did in drifter_ids]
Contributor

@philippemiron philippemiron Mar 1, 2024

I don't think this can work: you are missing the /dir/ here. I think you might have removed a dictionary we had here before that mapped each id to a directory, since those are not a direct match. So my fix (pasted below) is to get all the files by looping over the directories like you had it, and then keep only the selected drifter_ids, if passed as an argument, when creating drifter_urls.

Contributor

I was actually going to push a modification that I just worked out, but since you changed other stuff, I will just add it here. I replaced this whole block with:

  # create list of drifter urls
  urlpath = urllib.request.urlopen(url)
  string = urlpath.read().decode("utf-8")
  drifter_urls: list[str] = []
  for dir in directory_list:
      urlpath = urllib.request.urlopen(os.path.join(url, dir))
      string = urlpath.read().decode("utf-8")
      filelist = list(set(re.compile(pattern).findall(string)))
      for f in filelist:
          if drifter_ids is None or int(f[:-3].split("_")[2]) in drifter_ids:
              drifter_urls.append(os.path.join(url, dir, f))
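
The id filter in the block above relies on file names of the form `drifter_6h_<id>.nc`, as seen in the download logs earlier in this thread. A small self-contained sketch of just that parsing and filtering step, without any network access (the helper names `drifter_id_from_filename` and `select_files` are hypothetical and not part of clouddrift):

```python
from typing import Optional


def drifter_id_from_filename(filename: str) -> int:
    # "drifter_6h_37640.nc" -> strip ".nc", split on "_", take the third field.
    return int(filename[:-3].split("_")[2])


def select_files(
    filenames: list[str], drifter_ids: Optional[list[int]] = None
) -> list[str]:
    # Keep every file when no ids are requested, otherwise only the matches.
    if drifter_ids is None:
        return list(filenames)
    wanted = set(drifter_ids)
    return [f for f in filenames if drifter_id_from_filename(f) in wanted]


files = ["drifter_6h_9704356.nc", "drifter_6h_71490.nc", "drifter_6h_101854.nc"]
selected = select_files(files, drifter_ids=[71490])
```

Converting the requested ids to a set keeps the membership test cheap when many ids are passed.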

@selipot
Member

selipot commented Mar 1, 2024

@kevinsantana11 not yet working for me.

gdp1h.to_raggedarray().to_xarray()

still returns a string id variable and

ra_6 = gdp6h.to_raggedarray(tmp_path=tmp_path_6h,drifter_ids=[37640])

still returns

UnboundLocalError: cannot access local variable 'drifter_urls' where it is not associated with a value

@philippemiron
Contributor

Try reinstalling the branch, it is working for me.

In [20]: from clouddrift.adapters import gdp1h, gdp6h

In [21]: gdp1h.to_raggedarray(n_random_id=3).to_xarray().id
Retrieving the number of obs: 100%|██████████████| 3/3 [00:00<00:00, 111.07it/s]
Filling the Ragged Array: 100%|███████████████████| 3/3 [00:00<00:00, 28.14it/s]
Out[21]: 
<xarray.DataArray 'id' (traj: 3)> Size: 24B
array([          63123,          101877, 300234060218770])
Coordinates:
    id       (traj) int64 24B 63123 101877 300234060218770
Dimensions without coordinates: traj
Attributes:
    long_name:  Global Drifter Program Buoy ID
    units:      -

and

In [24]: gdp6h.to_raggedarray(tmp_path="/var/folders/_6/hdhmyzr120jgn1d_45q65zkh0000gn/T/clouddrift/gdp6h",drifter_ids=[37640])
Downloading GDP 6-hourly data to /var/folders/_6/hdhmyzr120jgn1d_45q65zkh0000gn/T/clouddrift/gdp6h...
Retrieving the number of obs: 100%|██████████████| 1/1 [00:00<00:00, 129.44it/s]
/Users/pmiron/micromamba/envs/dev/lib/python3.12/site-packages/xarray/conventions.py:440: SerializationWarning: variable 'WMO' has multiple fill values {'-999999', -999999}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
Filling the Ragged Array: 100%|███████████████████| 1/1 [00:00<00:00, 38.81it/s]
Out[24]: <clouddrift.raggedarray.RaggedArray at 0x284a40710>

@selipot
Member

selipot commented Mar 1, 2024

Yes! It all seems to be working now!

@kevinsantana11 kevinsantana11 merged commit a370d68 into Cloud-Drift:main Mar 3, 2024
16 checks passed
Labels
enhancement New feature or request
Projects
Development

Successfully merging this pull request may close these issues.

change to clouddrift.adapters.gdp1h()
3 participants