Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Begin reworking the Usage page in the docs #213

Merged
merged 7 commits into from
Jul 13, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 17 additions & 8 deletions clouddrift/analysis.py
Original file line number Diff line number Diff line change
Expand Up @@ -696,12 +696,9 @@ def velocity_from_position(

Difference scheme can take one of three values:

1. "forward" (default): finite difference is evaluated as
dx[i] = dx[i+1] - dx[i];
2. "backward": finite difference is evaluated as
dx[i] = dx[i] - dx[i-1];
3. "centered": finite difference is evaluated as
dx[i] = (dx[i+1] - dx[i-1]) / 2.
#. "forward" (default): finite difference is evaluated as ``dx[i] = dx[i+1] - dx[i]``;
#. "backward": finite difference is evaluated as ``dx[i] = dx[i] - dx[i-1]``;
#. "centered": finite difference is evaluated as ``dx[i] = (dx[i+1] - dx[i-1]) / 2``.

Forward and backward schemes are effectively the same except that the
position at which the velocity is evaluated is shifted one element down in
Expand Down Expand Up @@ -977,6 +974,18 @@ def subset(ds: xr.Dataset, criteria: dict) -> xr.Dataset:
ValueError
If one of the variable in a criterion is not found in the Dataset
"""
# Normally we expect the ragged-array dataset to have a "count" variable.
# However, some datasets may have a "rowsize" variable instead, e.g. if they
# have not gotten up to speed with our new convention. We check for both.
if "count" in ds.variables:
count_var = "count"
elif "rowsize" in ds.variables:
count_var = "rowsize"
else:
raise ValueError(
"Ragged-array Dataset ds must have a 'count' or 'rowsize' variable."
)

mask_traj = xr.DataArray(data=np.ones(ds.dims["traj"], dtype="bool"), dims=["traj"])
mask_obs = xr.DataArray(data=np.ones(ds.dims["obs"], dtype="bool"), dims=["obs"])

Expand All @@ -990,7 +999,7 @@ def subset(ds: xr.Dataset, criteria: dict) -> xr.Dataset:
raise ValueError(f"Unknown variable '{key}'.")

# remove data when trajectories are filtered
traj_idx = np.insert(np.cumsum(ds["count"].values), 0, 0)
traj_idx = np.insert(np.cumsum(ds[count_var].values), 0, 0)
for i in np.where(~mask_traj)[0]:
mask_obs[slice(traj_idx[i], traj_idx[i + 1])] = False

Expand All @@ -1006,7 +1015,7 @@ def subset(ds: xr.Dataset, criteria: dict) -> xr.Dataset:
# apply the filtering for both dimensions
ds_sub = ds.isel({"traj": mask_traj, "obs": mask_obs})
# update the count
ds_sub["count"].values = segment(
ds_sub[count_var].values = segment(
ds_sub.ids, 0.5, count=segment(ds_sub.ids, -0.5)
)
return ds_sub
Expand Down
6 changes: 2 additions & 4 deletions docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@
# -- Project information -----------------------------------------------------

project = "CloudDrift"
copyright = "2022, CloudDrift"
copyright = "2022-2023, CloudDrift"
author = "Philippe Miron"

# -- General configuration ---------------------------------------------------
Expand Down Expand Up @@ -49,9 +49,7 @@

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "pydata_sphinx_theme" # alabaster, sphinx_rtd_theme
# html_theme = "sphinx_rtd_theme"
html_theme = "sphinx_book_theme" # alabaster, sphinx_rtd_theme

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
Expand Down
2 changes: 1 addition & 1 deletion docs/requirements.txt
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
sphinx
pydata_sphinx_theme
sphinx-book-theme
248 changes: 237 additions & 11 deletions docs/usage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,240 @@
Usage
=====

CloudDrift provides an easy way to convert Lagrangian datasets into
The CloudDrift library provides functions for:

* Easy access to cloud-ready Lagrangian ragged-array datasets;
* Common Lagrangian analysis tasks on ragged arrays;
* Adapting custom Lagrangian datasets into ragged arrays.

Let's start by importing the library and accessing a ready-to-use ragged-array
dataset.

Accessing ragged-array Lagrangian datasets
------------------------------------------

We recommend to import the ``clouddrift`` using the ``cd`` shorthand, for convenience:

>>> import clouddrift as cd

CloudDrift provides a set of Lagrangian datasets that are ready to use.
They can be accessed via the ``datasets`` submodule.
In this example, we will load the NOAA's Global Drifter Program (GDP) hourly
dataset, which is hosted in a public AWS bucket as a cloud-optimized Zarr
dataset:

>>> ds = cd.datasets.gdp1h()
>>> ds
<xarray.Dataset>
Dimensions: (traj: 17324, obs: 165754333)
Coordinates:
ids (obs) int64 ...
lat (obs) float32 ...
lon (obs) float32 ...
time (obs) datetime64[ns] ...
Dimensions without coordinates: traj, obs
Data variables: (12/55)
BuoyTypeManufacturer (traj) |S20 ...
BuoyTypeSensorArray (traj) |S20 ...
CurrentProgram (traj) float64 ...
DeployingCountry (traj) |S20 ...
DeployingShip (traj) |S20 ...
DeploymentComments (traj) |S20 ...
... ...
sst1 (obs) float64 ...
sst2 (obs) float64 ...
typebuoy (traj) |S10 ...
typedeath (traj) int8 ...
ve (obs) float32 ...
vn (obs) float32 ...
Attributes: (12/16)
Conventions: CF-1.6
acknowledgement: Elipot, Shane; Sykulski, Adam; Lumpkin, Rick; Centurio...
contributor_name: NOAA Global Drifter Program
contributor_role: Data Acquisition Center
date_created: 2022-12-09T06:02:29.684949
doi: 10.25921/x46c-3620
... ...
processing_level: Level 2 QC by GDP drifter DAC
publisher_email: aoml.dftr@noaa.gov
publisher_name: GDP Drifter DAC
publisher_url: https://www.aoml.noaa.gov/phod/gdp
summary: Global Drifter Program hourly data
title: Global Drifter Program hourly drifting buoy collection

The ``gdp1h`` function returns an Xarray ``Dataset`` instance of the ragged-array dataset.
While the dataset is quite large, around a dozen GB, it is not downloaded to your
local machine. Instead, the dataset is accessed directly from the cloud, and only
the data that is needed for the analysis is downloaded. This is possible thanks to
the cloud-optimized Zarr format, which allows for efficient access to the data
stored in the cloud.

Let's look at some variables in this dataset:

>>> ds.lon
<xarray.DataArray 'lon' (obs: 165754333)>
[165754333 values with dtype=float32]
Coordinates:
ids (obs) int64 ...
lat (obs) float32 ...
lon (obs) float32 ...
time (obs) datetime64[ns] ...
Dimensions without coordinates: obs
Attributes:
long_name: Longitude
units: degrees_east

You see that this array is very long--it has 165754333 elements.
This is because in a ragged array, many varying-length arrays are laid out as a
contiguous 1-dimensional array in memory.

Let's look at the dataset dimensions:

>>> ds.dims
Frozen({'traj': 17324, 'obs': 165754333})

The ``traj`` dimension has 17324 elements, which is the number of individual
trajectories in the dataset.
The sum of their lengths equals the length of the ``obs`` dimension.
Internally, these dimensions, their lengths, and the ``count`` (or ``rowsize``)
variables are used internally to make CloudDrift's analysis functions aware of
the bounds of each contiguous array within the ragged-array data structure.

Doing common analysis tasks on ragged arrays
--------------------------------------------

Now that we have a ragged-array dataset loaded as an Xarray ``Dataset`` instance,
let's do some common analysis tasks on it.
Our dataset is on a remote server and fairly large (a dozen GB or so), so let's
first subset it to several trajectories so that we can more easily work with it.
The variable ``ID`` is the unique identifier for each trajectory:

>>> ds.ID[:10].values
array([2578, 2582, 2583, 2592, 2612, 2613, 2622, 2623, 2931, 2932])

>>> from clouddrift.analysis import subset

``subset`` allows you to subset a ragged array by some criterion.
In this case, we will subset it by the ``ID`` variable:

>>> ds_sub = subset(ds, {"ID": list(ds.ID[:5])})
>>> ds_sub
<xarray.Dataset>
Dimensions: (traj: 5, obs: 13612)
Coordinates:
ids (obs) int64 2578 2578 2578 2578 ... 2612 2612 2612
lat (obs) float32 ...
lon (obs) float32 ...
time (obs) datetime64[ns] ...
Dimensions without coordinates: traj, obs
Data variables: (12/55)
BuoyTypeManufacturer (traj) |S20 ...
BuoyTypeSensorArray (traj) |S20 ...
CurrentProgram (traj) float64 ...
DeployingCountry (traj) |S20 ...
DeployingShip (traj) |S20 ...
DeploymentComments (traj) |S20 ...
... ...
sst1 (obs) float64 ...
sst2 (obs) float64 ...
typebuoy (traj) |S10 ...
typedeath (traj) int8 ...
ve (obs) float32 ...
vn (obs) float32 ...
Attributes: (12/16)
Conventions: CF-1.6
acknowledgement: Elipot, Shane; Sykulski, Adam; Lumpkin, Rick; Centurio...
contributor_name: NOAA Global Drifter Program
contributor_role: Data Acquisition Center
date_created: 2022-12-09T06:02:29.684949
doi: 10.25921/x46c-3620
... ...
processing_level: Level 2 QC by GDP drifter DAC
publisher_email: aoml.dftr@noaa.gov
publisher_name: GDP Drifter DAC
publisher_url: https://www.aoml.noaa.gov/phod/gdp
summary: Global Drifter Program hourly data
title: Global Drifter Program hourly drifting buoy collection

You see that we now have a subset of the original dataset, with 5 trajectories
and a total of 13612 observations.
This subset is small enough to quickly and easily work with for demonstration
purposes.
Let's see how we can compute the mean and maximum velocities of each trajectory.
To start, we'll need to obtain the velocities over all trajectory times.
Although the GDP dataset already comes with velocity variables, we won't use
them here so that we can learn how to compute them ourselves from positions.
``clouddrift`` provides the ``velocity_from_position`` function that allows you
to do just that.

>>> from clouddrift.analysis import velocity_from_position

At a minimum ``velocity_from_position`` requires three input parameters:
consecutive x- and y-coordinates and time, so we could do:

>>> u, v = velocity_from_position(ds_sub.lon, ds_sub.lat, ds_sub.time)

``velocity_from_position`` returns two arrays, ``u`` and ``v``, which are the
zonal and meridional velocities, respectively.
By default, it assumes that the coordinates are in degrees, and it handles the
great circle path calculation and longitude wraparound under the hood.
However, recall that ``ds_sub.lon``, ``ds_sub.lat``, and ``ds_sub.time`` are
ragged arrays, so we need a different approach to calculate velocities while
respecting the trajectory boundaries.
For this, we can use the ``ragged_apply`` function, which applies a function
to each trajectory in a ragged array, and returns the concatenated result.

>>> from clouddrift.analysis import apply_ragged
>>> u, v = apply_ragged(velocity_from_position, [ds_sub.lon, ds_sub.lat, ds_sub.time], ds_sub.rowsize)

``u`` and ``v`` here are still ragged arrays, which means that the five
contiguous trajectories are concatenated into 1-dimensional arrays.

Now, let's compute the velocity magnitude in meters per second.
The time in this dataset is loaded in nanoseconds by default:

>>> ds_sub.time.values
array(['2005-04-15T20:00:00.000000000', '2005-04-15T21:00:00.000000000',
'2005-04-15T22:00:00.000000000', ...,
'2005-10-02T03:00:00.000000000', '2005-10-02T04:00:00.000000000',
'2005-10-02T05:00:00.000000000'], dtype='datetime64[ns]')

So, to obtain the velocity magnitude in meters per second, we'll need to
multiply our velocities by ``1e9``.

>>> velocity_magnitude = np.sqrt(u**2 + v**2) * 1e9
>>> velocity_magnitude
array([0.28053388, 0.6164632 , 0.89032112, ..., 0.2790803 , 0.20095603,
0.20095603])

>>> velocity_magnitude.mean(), velocity_magnitude.max()
(0.22115242718877506, 1.6958275672626286)

However, these aren't the results we are looking for! Recall that we have the
velocity magnitude of five different trajectories concatenated into one array.
This means that we need to use ``apply_ragged`` again to compute the mean and
maximum values:

>>> apply_ragged(np.mean, [velocity_magnitude], ds_sub.rowsize)
array([0.32865148, 0.17752435, 0.1220523 , 0.13281067, 0.14041268])
>>> apply_ragged(np.max, [velocity_magnitude], ds_sub.rowsize)
array([1.69582757, 1.36804354, 0.97343434, 0.60353528, 1.05044213])

And there you go! We used ``clouddrift`` to:

#. Load a real-world Lagrangian dataset from the cloud;
#. Subset the dataset by trajectory IDs;
#. Compute the velocity vectors and their magnitudes for each trajectory;
#. Compute the mean and maximum velocity magnitudes for each trajectory.

``clouddrift`` offers many more functions for common Lagrangian analysis tasks.
Please explore the `API <https://cloud-drift.github.io/clouddrift/api.html>`_
to learn about other functions and how to use them.

Adapting custom Lagrangian datasets into ragged arrays
------------------------------------------------------

CloudDrift provides an easy way to convert custom Lagrangian datasets into
`contiguous ragged arrays <https://cfconventions.org/cf-conventions/cf-conventions.html#_contiguous_ragged_array_representation>`_.

.. code-block:: python
Expand All @@ -26,14 +259,7 @@ CloudDrift provides an easy way to convert Lagrangian datasets into

This snippet is specific to the hourly GDP dataset, however, you can use the
``RaggedArray`` class directly to convert other custom datasets into a ragged
array structure that is analysis ready via Xarray or Awkward Array packages.
We provide step-by-step guides to convert the individual trajectories from the
Global Drifter Program (GDP) hourly and 6-hourly datasets, the drifters from the
`CARTHE <http://carthe.org/>`_ experiment, and a typical output from a numerical
Lagrangian experiment in our
`repository of example Jupyter Notebooks <https://github.com/cloud-drift/clouddrift-examples>`_.
array structure that is analysis ready via Xarray or Awkward Array packages.
The functions to do that are defined in the ``clouddrift.adapters`` submodule.
You can use these examples as a reference to ingest your own or other custom
Lagrangian datasets into ``RaggedArray``.

In the future, ``clouddrift`` will be including functions to perform typical
oceanographic Lagrangian analyses.
Lagrangian datasets into ``RaggedArray``.