* update streaming tutorial
* change fsspec from S3 to http protocol
* intersphinx into dandi and fsspec
* Update docs/gallery/advanced_io/streaming.py (Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>)
* Update docs/gallery/advanced_io/streaming.py (Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>)
* Update docs/gallery/advanced_io/streaming.py
* Update docs/gallery/advanced_io/streaming.py (Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>)
* Update docs/gallery/advanced_io/streaming.py (Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>)
* Update docs/gallery/advanced_io/streaming.py (Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>)
* Update docs/gallery/advanced_io/streaming.py (Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>)
* Apply suggestions from code review: undo formatting changes
* Apply suggestions from code review: remove changes to images tutorial
* Update docs/gallery/advanced_io/streaming.py (Co-authored-by: Ryan Ly <rly@lbl.gov>)
* Update docs/gallery/advanced_io/streaming.py (Co-authored-by: Ryan Ly <rly@lbl.gov>)
* Update docs/gallery/advanced_io/streaming.py
* Update docs/gallery/advanced_io/streaming.py (Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>)

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Ryan Ly <rly@lbl.gov>
1 parent 4d59cc7 · commit 92aa13f · 2 changed files: 109 additions, 88 deletions
docs/gallery/advanced_io/streaming.py
'''
.. _streaming:

Streaming NWB files
===================

You can read specific sections within individual data files directly from remote stores such as the
`DANDI Archive <https://dandiarchive.org/>`_. This is especially useful for reading small pieces of data
from a large NWB file stored remotely. First, you will need to get the location of the file. The code below
illustrates how to do this on DANDI using the dandi API library.

Getting the location of the file on DANDI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :py:class:`~dandi.dandiapi.DandiAPIClient` can be used to get the S3 URL of any NWB file stored in the DANDI
Archive. If you have not already, install the latest release of the ``dandi`` package.

.. code-block:: bash

    pip install dandi

Now you can get the URL of a particular NWB file using the dandiset ID and the path of that file within the dandiset.

.. code-block:: python

    from dandi.dandiapi import DandiAPIClient

    dandiset_id = '000006'  # ephys dataset from the Svoboda Lab
    filepath = 'sub-anm372795/sub-anm372795_ses-20170718.nwb'  # 450 kB file
    with DandiAPIClient() as client:
        asset = client.get_dandiset(dandiset_id, 'draft').get_asset_by_path(filepath)
        s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
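
If you do not know the exact path of the file within the dandiset, one option is to list the dandiset's assets
first. The following is a sketch that assumes the current ``dandi`` API, in which ``get_assets()`` yields remote
assets with a ``path`` attribute:

.. code-block:: python

    from dandi.dandiapi import DandiAPIClient

    # print the path of every asset in the dandiset to find the file of interest
    with DandiAPIClient() as client:
        dandiset = client.get_dandiset('000006', 'draft')
        for asset in dandiset.get_assets():
            print(asset.path)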

Streaming Method 1: ROS3
~~~~~~~~~~~~~~~~~~~~~~~~

ROS3 is one of the supported methods for reading data from a remote store. ROS3 stands for "read only S3" and is a
driver created by the HDF5 Group that allows HDF5 to read HDF5 files stored remotely in S3 buckets. Using this method
requires that your HDF5 library is installed with the ROS3 driver enabled. This is not the default configuration,
so you will need to make sure you install the right version of ``h5py`` that has this advanced configuration enabled.
You can install HDF5 with the ROS3 driver from `conda-forge <https://conda-forge.org/>`_ using ``conda``. You may
first need to uninstall a currently installed version of ``h5py``.

.. code-block:: bash

    pip uninstall h5py
    conda install -c conda-forge "h5py>=3.2"
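
To confirm that the installed build actually includes the driver, one quick check is to look for ``"ros3"`` among
the registered drivers (``h5py.registered_drivers()`` is part of the public h5py API):

.. code-block:: python

    import h5py

    # 'ros3' is listed only when the underlying HDF5 library was built
    # with the read-only S3 virtual file driver enabled
    print("ros3" in h5py.registered_drivers())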

Now instantiate a :py:class:`~pynwb.NWBHDF5IO` object with the S3 URL and specify the driver as "ros3". This
will download metadata about the file from the S3 bucket to memory. The values of datasets are accessed lazily,
just like when reading an NWB file stored locally. So, slicing into a dataset will require additional time to
download the sliced data (and only the sliced data) to memory.

.. code-block:: python

    from pynwb import NWBHDF5IO

    with NWBHDF5IO(s3_url, mode='r', load_namespaces=True, driver='ros3') as io:
        nwbfile = io.read()
        print(nwbfile)
        print(nwbfile.acquisition['lick_times'].time_series['lick_left_times'].data[:])
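
Because reads are lazy, it is usually better to slice out only the values you need rather than the full dataset.
A short sketch using the same example file (the slice bounds here are arbitrary):

.. code-block:: python

    with NWBHDF5IO(s3_url, mode='r', load_namespaces=True, driver='ros3') as io:
        nwbfile = io.read()
        lick_ts = nwbfile.acquisition['lick_times'].time_series['lick_left_times']
        # only the requested ten values are downloaded, not the whole dataset
        print(lick_ts.data[:10])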

Streaming Method 2: fsspec
~~~~~~~~~~~~~~~~~~~~~~~~~~

fsspec is another data streaming approach that is quite flexible and has several performance advantages. This library
creates a virtual filesystem for remote stores. With this approach, a virtual file is created for the file and the
virtual filesystem layer takes care of requesting data from the S3 bucket whenever data is read from the virtual
file. Note that this implementation is completely unaware of the internals of the HDF5 format, and thus works for
**any** file, not only for use with h5py and PyNWB.

First install ``fsspec`` and the dependencies of the :py:class:`~fsspec.implementations.http.HTTPFileSystem`:

.. code-block:: bash

    pip install fsspec requests aiohttp

Then in Python:

.. code-block:: python

    import fsspec
    import pynwb
    import h5py
    from fsspec.implementations.cached import CachingFileSystem

    # first, create a virtual filesystem based on the http protocol and use
    # caching to save accessed data to a local folder on disk
    fs = CachingFileSystem(
        fs=fsspec.filesystem("http"),
        cache_storage="nwb-cache",  # Local folder for the cache
    )

    # next, open the file
    with fs.open(s3_url, "rb") as f:
        with h5py.File(f) as file:
            with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
                nwbfile = io.read()
                print(nwbfile.acquisition['lick_times'].time_series['lick_left_times'].data[:])
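
Note that because ``cache_storage`` points at a local folder, the cache persists across Python sessions, so
repeated reads of the same file are served from disk instead of the network. A minimal sketch of clearing it
(the folder name matches the ``cache_storage`` argument above):

.. code-block:: python

    import shutil

    # remove the local cache folder to reclaim disk space; fsspec will
    # simply re-download the data on the next read
    shutil.rmtree("nwb-cache", ignore_errors=True)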

fsspec is a library that can be used to access a variety of different store formats, including (at the time of
writing):

.. code-block:: python

    from fsspec.registry import known_implementations
    known_implementations.keys()

file, memory, dropbox, http, https, zip, tar, gcs, gs, gdrive, sftp, ssh, ftp, hdfs, arrow_hdfs, webhdfs, s3, s3a,
wandb, oci, adl, abfs, az, cached, blockcache, filecache, simplecache, dask, dbfs, github, git, smb, jupyter, jlab,
libarchive, reference

The S3 backend, in particular, may provide additional functionality for accessing data on DANDI. See the
`fsspec documentation on known implementations
<https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=S3#other-known-implementations>`_
for a full, updated list of supported store formats.
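
As one illustration of the S3 backend, a DANDI asset can be opened directly over the ``s3`` protocol. This is a
sketch assuming the optional ``s3fs`` package is installed; the blob URL shown is the ``s3://dandiarchive/`` form
of a resolved DANDI content URL:

.. code-block:: python

    import fsspec
    import h5py
    import pynwb

    # anonymous (unauthenticated) access to the public DANDI bucket
    fs = fsspec.filesystem("s3", anon=True)

    with fs.open("s3://dandiarchive/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35", "rb") as f:
        with h5py.File(f) as file:
            with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
                nwbfile = io.read()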

'''

# sphinx_gallery_thumbnail_path = 'figures/gallery_thumbnails_streaming.png'