-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update streaming tutorial #1526
Conversation
* change fsspec from S3 to http protocol * intersphinx into dandi and fsspec
@yarikoptic can you please take a look at and lmk what you think? |
Codecov Report
@@ Coverage Diff @@
## dev #1526 +/- ##
=======================================
Coverage 90.65% 90.65%
=======================================
Files 25 25
Lines 2494 2494
Branches 466 466
=======================================
Hits 2261 2261
Misses 148 148
Partials 85 85
Flags with carried forward coverage won't be shown. Click here to find out more. 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>
# .. code-block:: python | ||
# | ||
# import s3fs | ||
# import pynwb | ||
# import h5py | ||
# | ||
# fs = s3fs.S3FileSystem(anon=True) | ||
# | ||
# f = fs.open("s3://dandiarchive/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35", 'rb') | ||
# file = h5py.File(f) | ||
# io = pynwb.NWBHDF5IO(file=file, load_namespaces=True) | ||
# | ||
# io.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the s3fs
builds on fsspec
and since the s3fs
code is simpler than using the full fsspec
I think it would be nice to add a subsection to the Streaming Method 2: fsspec
section that shows this code example for s3fs
as well. In this way you can still keep the tutorial to the two main options but still mention thes3fs
option, which may be sufficient and simpler for many users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So s3fs is a specific implementation of the fsspec protocol. I changed over to using the more generic fsspec library with http for three reasons:
- This way we can use the http path as input, which is the same path that the ros3 driver takes, so it is clear how the two methods relate to each other.
- Showing it this way makes it pretty clear how you would extend this to other backends.
- This method also shows how to use caching, which I think is an important feature that a lot of users will want to leverage.
Do you really think this workflow is more complex than the s3fs syntax?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Showing the fsspec
example as the main example makes sense. If you think that showing the full s3fs
example in addition is too confusing, then maybe we can just add a note along the lines of "The s3fs package also builds on fsspec and can be used to access S3 in a similar manner. s3fs provides additional convenience methods for interacting with S3 more specifically while fsspec provides greater flexibility with regard to supported remote file systems. Note, when using s3fs requires the use of s3
URIs instead of http
. "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best would be if someone checks for what s3fs might be providing "on top" or in addition to pure straight fsspec. E.g. it could be some better choices of blocksize etc. I would have at least compared performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it true that s3fs provides additional convenience methods for interacting with S3 more specifically?
s3fs implements in S3FileSystem
which inherits from fsspec's AsyncFileSystem
abstract class. My impression is that it just implements the method of a fsspec.FileSystem
without much else besides maybe credential controls that we don't use. You can switch to an s3 backend by simply changing fs=fsspec.filesystem("http")
to fs=fsspec.filesystem("s3")
, but then you'll have to supply the s3 path, not the http path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to tell from the documentation what the specific differences are, but my impression is that s3fs adds support for async
, improved logging, and S3 specific implementation of the file system functions. Purely to read an NWB file, this may not be big difference. Pending further investigation, maybe what we could do is to add a note to indicate that there are a growing number of libraries build on ffspec that provide functionality specific for different file systems (e.g., S3, Google Cloud, or Azure) and link to https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=S3#other-known-implementations. We could then post alternate examples, such as the code using S3FS, as GitHub Gists online and link to them from the same note. This would keep the docs clean and at the same time provide examples for folks that may want to use the more specific libraries build on fsspec. Given that we don't know how all the various options perform, we could also add a "disclaimer" along those lines.
Thanks @bendichter for adding the |
Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments / recommendations
# .. code-block:: python | ||
# | ||
# import s3fs | ||
# import pynwb | ||
# import h5py | ||
# | ||
# fs = s3fs.S3FileSystem(anon=True) | ||
# | ||
# f = fs.open("s3://dandiarchive/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35", 'rb') | ||
# file = h5py.File(f) | ||
# io = pynwb.NWBHDF5IO(file=file, load_namespaces=True) | ||
# | ||
# io.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best would be if someone checks for what s3fs might be providing "on top" or in addition to pure straight fsspec. E.g. it could be some better choices of blocksize etc. I would have at least compared performance.
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
undo formatting changes
remove changes to images tutorial
Co-authored-by: Ryan Ly <rly@lbl.gov>
This looks good to me. And fsspec works on Windows! |
Co-authored-by: Ryan Ly <rly@lbl.gov>
HDF5 just published the following blog post https://www.hdfgroup.org/2022/08/cloud-storage-options-for-hdf5/ discussing the various cloud storage options with HDF5 and some of the trade-offs to consider. If we need a reference to point to, this may be a good link to point folks to for further reading about available options. |
At some point it may be useful to also look at HSDS. From a PyNWB perspective it may be as simple as handing in a file handle from h5pyd, but I'm not sure. I've never used HSDS so I'm not sure how involved the setup on the server and client side are. @yarikoptic has DANDI ever looked at HSDS. |
fix #1525