
[Feature]: support fsspec / opening from a file object #1525

Closed
3 tasks done
yarikoptic opened this issue Aug 2, 2022 · 8 comments · Fixed by #1526
Labels
category: enhancement (improvements of code or code behavior) · priority: medium (non-critical problem and/or affecting only a small set of NWB users) · topic: docs (issues related to documentation)

Comments

@yarikoptic
Contributor

What would you like to see added to PyNWB?

The ROS3 driver provides a nice way to stream data without downloading the file in full, as shown in this example. But it is S3-specific.

Is your feature request related to a problem?

Somewhat. AFAIK the ros3 driver doesn't provide the ability to do local sparse caching, etc.

What solution would you like?

Although I have not looked into how it is implemented, I bet the ros3 driver works through HTTP Range requests. That method is generic and could work for any http/https URL. It lies at the basis of many solutions, such as https://github.com/fsspec/filesystem_spec, which provides uniform access to a wide range of resources (not just S3) through a generic file interface; see e.g. the examples in the fsspec docs. Another example of such use is https://github.com/datalad/datalad-fuse/, where we provide an fsspec-based FUSE filesystem for any DataLad dataset, with fsspec providing seamless, locally cached, sparse access to any remote file, NWB included.
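
As a quick illustration of that Range mechanism (a minimal sketch; requests is assumed to be installed, and the URL is the public DANDI blob quoted later in this thread), one can fetch just the first 8 bytes of a remote HDF5 file without downloading it in full:

import requests

# fetch only the first 8 bytes -- the HDF5 signature -- instead of the full file
url = ("https://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/"
       "43bf3a81-4a0b-433f-b471-1f10303f9d35")
resp = requests.get(url, headers={"Range": "bytes=0-7"})
print(resp.status_code)  # 206 Partial Content if the server honors Range
print(resp.content)      # b'\x89HDF\r\n\x1a\n' for an HDF5 file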

So, by supporting opening from a file handle, it should be possible to open any fsspec'ed file. NWBHDF5IO seems to take a path, not a file handle. Would it be "easy" to support operating on a pre-opened file handle? If so, it might just be a matter of adding example documentation to demonstrate such a cool feature.

Do you have any interest in helping implement the feature?

No.


@oruebel
Contributor

oruebel commented Aug 2, 2022

Would it be "easy" to support operating with a pre-opened file handle?

#1499 shows what you are describing for s3fs. NWBHDF5IO already supports passing an h5py.File handle:

{'name': 'file', 'type': h5py.File, 'doc': 'a pre-existing h5py.File object', 'default': None},
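
A minimal sketch of what that looks like in practice (the filename example.nwb is illustrative):

import h5py
import pynwb

# pass a pre-opened h5py.File via the `file` keyword instead of a path
with h5py.File("example.nwb", "r") as f:
    with pynwb.NWBHDF5IO(file=f, load_namespaces=True) as io:
        nwbfile = io.read()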

I bet ros3 driver is working through HTTP Range requests. Such method is generic and could work for any http/https URLs.

I honestly don't know whether this would work out of the box, but it would be interesting to try. @yarikoptic, could you test whether this works? If it does, we could update the PyNWB tutorial on streaming accordingly.

@oruebel added the category: enhancement, priority: medium, and topic: docs labels on Aug 2, 2022
@yarikoptic
Contributor Author

yarikoptic commented Aug 4, 2022

Oh nice -- thanks @bendichter for that one -- s3fs is fsspec-based, so we are "all set", and it is indeed just a matter of documentation and an investigation of "efficiency"!

NB: AFAIK this needs not-yet-released versions of pynwb/hdmf, so I had to install them from their git repos.

Here is @bendichter's code snippet, adjusted to use a caching filesystem and to access the data via an https URL directly (it could be any http URL, not necessarily S3), with a versionId (since why not? ;)), by default accessing a URL from dandi:000003 that is a few GB in size:

import sys
from time import time

import fsspec
import h5py
import pynwb

# uncomment to see what is going on!
# import logging
# logging.getLogger("fsspec").setLevel(1)
# logging.basicConfig(level=1)

from fsspec.implementations.cached import CachingFileSystem
fs = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)

if sys.argv[-1].startswith('http'):
    url = sys.argv[-1]
else:
    # an 11GB file to munch on; reading it would fetch only about 80MB of data
    url = "https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ"

print(f"Accessing {url}")
# It is crucial to use context managers so that the cache gets closed and thus reused;
# note that https://docs.h5py.org/en/stable/high/file.html also has a Warning that
# things should be closed in the proper order!
t0 = time()
with fs.open(url, 'rb') as f:
    with h5py.File(f) as file:
        with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
            out = io.read()
            print(f"Read something which gives me {len(str(out))} long str representation in {time()-t0:.3f} sec")

And here is a demonstration of how "well" it works on that file:

$> rm -rf nwb-cache ; for l in 1 2; do python -W ignore cached-fsspec.py  ; done; du -sch nwb-cache             
Accessing https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ
Read something which gives me 2504 long str representation in 146.336 sec
python -W ignore cached-fsspec.py  5.79s user 0.96s system 4% cpu 2:29.09 total
Accessing https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ
Read something which gives me 2504 long str representation in 0.447 sec
83M	nwb-cache
83M	total

which shows that at the cost of 83MB of local cache, we can perform that .read() in 0.5 instead of 146 seconds! IMHO quite nice ;)

On the original file that @bendichter used, it looks like the following, but it is less impressive since the file itself is just 400KB, so it is kind of boring -- fsspec really just fetches the entire file (the default block size is 5MB IIRC):

$> rm -rf nwb-cache ; for l in 1 2; do python -W ignore cached-fsspec.py "http://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35?versionId=C1uOWMhd6uxiKUQG.U8xkxeYb.EySYbS" ; done; du -sch nwb-cache
Accessing http://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35?versionId=C1uOWMhd6uxiKUQG.U8xkxeYb.EySYbS
Read something which gives me 1347 long str representation in 0.659 sec
Accessing http://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35?versionId=C1uOWMhd6uxiKUQG.U8xkxeYb.EySYbS
Read something which gives me 1347 long str representation in 0.092 sec
448K	nwb-cache
448K	total

I have not tried any protocol other than "http", but here is the list of all implementations fsspec claims to support -- it might be quite cool to try some others relevant to typical use cases (e.g. ssh):

$> python -c 'from fsspec.registry import known_implementations as ki; print(" ".join(sorted(list(ki))))'
abfs adl arrow_hdfs az blockcache cached dask dbfs dropbox file filecache ftp gcs gdrive git github gs \
 hdfs http https jlab jupyter libarchive memory oci reference s3 s3a sftp simplecache smb ssh tar wandb webhdfs zip
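
(Aside: fsspec's URL chaining can express the same caching layer inline; here is a hedged, untested sketch using the small URL from above -- the filecache:: prefix with per-protocol kwargs is standard fsspec, but I have not benchmarked this path:)

import fsspec
import h5py
import pynwb

# same idea as the CachingFileSystem above, via fsspec URL chaining; untested here
url = "http://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35?versionId=C1uOWMhd6uxiKUQG.U8xkxeYb.EySYbS"
with fsspec.open(f"filecache::{url}", "rb",
                 filecache={"cache_storage": "nwb-cache"}) as f:
    with h5py.File(f) as file:
        with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
            print(io.read())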
My initial attempts, from before I realized the importance of properly closing the opened files for this use case. They hold other possibly useful insights on `du` etc. ;)
import fsspec
import pynwb
import h5py

from fsspec.implementations.cached import CachingFileSystem
fs = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)

# http "version" with versionId of "s3://dandiarchive/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35"
url = "http://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35?versionId=C1uOWMhd6uxiKUQG.U8xkxeYb.EySYbS"
f = fs.open(url, 'rb')
file = h5py.File(f)
io = pynwb.NWBHDF5IO(file=file, load_namespaces=True)

print(io.read())

Since this file is not particularly "rich", local caching does not shave off much (about 400-500 ms), but it works (it stores a full local copy of the file, since its 400KB size is below fsspec's default block size of ~5MB IIRC). So I went for a more interesting example, dandi://dandi/000003/sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb (from here), which is https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ

Here is the demo script adjusted to use that URL instead:

import fsspec
import pynwb
import h5py

from fsspec.implementations.cached import CachingFileSystem
fs = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)

url = "https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ"
f = fs.open(url, 'rb')
file = h5py.File(f)
io = pynwb.NWBHDF5IO(file=file, load_namespaces=True)

print(io.read())

Since it is quite sizeable (11GB) and my home internet sucks, I was hoping for this to be more demonstrative, but it was not really...

Run times were 2:44.60 and 2:41.85, with a cache size of 83MB.

$> rm -rf nwb-cache; time python cached-fsspec.py; time python cached-fsspec.py; du -scb nwb-cache
/home/yoh/deb/gits/pkg-exppsy/hdmf-upstream/src/hdmf/spec/namespace.py:532: UserWarning: Ignoring cached namespace 'hdmf-common' version 1.3.0 because version 1.5.1 is already loaded.
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
/home/yoh/deb/gits/pkg-exppsy/hdmf-upstream/src/hdmf/spec/namespace.py:532: UserWarning: Ignoring cached namespace 'core' version 2.2.5 because version 2.5.0 is already loaded.
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
root pynwb.file.NWBFile at 0x140396477138352
Fields:
  acquisition: {
    position_sensor0 <class 'pynwb.behavior.SpatialSeries'>,
    position_sensor1 <class 'pynwb.behavior.SpatialSeries'>
  }
  devices: {
    Device <class 'pynwb.device.Device'>,
    Implant <class 'pynwb.device.Device'>
  }
  electrode_groups: {
    shank1 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank2 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank3 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank4 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank5 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank6 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank7 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank8 <class 'pynwb.ecephys.ElectrodeGroup'>
  }
  electrodes: electrodes <class 'hdmf.common.table.DynamicTable'>
  experimenter: ['Yuta Senzai']
  file_create_date: [datetime.datetime(2021, 1, 16, 20, 24, 13, 973543, tzinfo=tzoffset(None, -18000))]
  identifier: 23647bce-e14c-4498-a45d-8b5dcd572360
  institution: NYU
  lab: Buzsaki
  processing: {
    behavior <class 'pynwb.base.ProcessingModule'>,
    ecephys <class 'pynwb.base.ProcessingModule'>
  }
  related_publications: ['DOI:10.1016/j.neuron.2016.12.011']
  session_description: Mouse in open exploration and theta maze.
  session_id: YutaMouse20-140321
  session_start_time: 2014-03-21 00:00:00-04:00
  stimulus: {
    DS1.ch4 <class 'pynwb.misc.AnnotationSeries'>,
    DS2.ch4 <class 'pynwb.misc.AnnotationSeries'>,
    MS_PulseStim <class 'pynwb.misc.AnnotationSeries'>,
    MS_PulseStim_briefLD <class 'pynwb.misc.AnnotationSeries'>,
    MS_PulseStim_briefLaser <class 'pynwb.misc.AnnotationSeries'>,
    PeakStim5V <class 'pynwb.misc.AnnotationSeries'>,
    PulseStim_10ms <class 'pynwb.misc.AnnotationSeries'>,
    PulseStim_1sec <class 'pynwb.misc.AnnotationSeries'>,
    PulseStim_1sec_weak <class 'pynwb.misc.AnnotationSeries'>,
    Ripple <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_5V_End <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_5V_Start <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_End <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_Start <class 'pynwb.misc.AnnotationSeries'>,
    TroughStim5V <class 'pynwb.misc.AnnotationSeries'>
  }
  subject: subject pynwb.file.Subject at 0x140396476974992
Fields:
  age: 207 days, 0:00:00
  genotype: SST-Cre
  species: Mus musculus
  subject_id: YutaMouse20

  timestamps_reference_time: 2014-03-21 00:00:00-04:00
  units: units <class 'pynwb.misc.Units'>

python cached-fsspec.py  5.49s user 1.07s system 3% cpu 2:44.60 total
/home/yoh/deb/gits/pkg-exppsy/hdmf-upstream/src/hdmf/spec/namespace.py:532: UserWarning: Ignoring cached namespace 'hdmf-common' version 1.3.0 because version 1.5.1 is already loaded.
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
/home/yoh/deb/gits/pkg-exppsy/hdmf-upstream/src/hdmf/spec/namespace.py:532: UserWarning: Ignoring cached namespace 'core' version 2.2.5 because version 2.5.0 is already loaded.
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
root pynwb.file.NWBFile at 0x139635971343840
Fields:
  acquisition: {
    position_sensor0 <class 'pynwb.behavior.SpatialSeries'>,
    position_sensor1 <class 'pynwb.behavior.SpatialSeries'>
  }
  devices: {
    Device <class 'pynwb.device.Device'>,
    Implant <class 'pynwb.device.Device'>
  }
  electrode_groups: {
    shank1 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank2 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank3 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank4 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank5 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank6 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank7 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank8 <class 'pynwb.ecephys.ElectrodeGroup'>
  }
  electrodes: electrodes <class 'hdmf.common.table.DynamicTable'>
  experimenter: ['Yuta Senzai']
  file_create_date: [datetime.datetime(2021, 1, 16, 20, 24, 13, 973543, tzinfo=tzoffset(None, -18000))]
  identifier: 23647bce-e14c-4498-a45d-8b5dcd572360
  institution: NYU
  lab: Buzsaki
  processing: {
    behavior <class 'pynwb.base.ProcessingModule'>,
    ecephys <class 'pynwb.base.ProcessingModule'>
  }
  related_publications: ['DOI:10.1016/j.neuron.2016.12.011']
  session_description: Mouse in open exploration and theta maze.
  session_id: YutaMouse20-140321
  session_start_time: 2014-03-21 00:00:00-04:00
  stimulus: {
    DS1.ch4 <class 'pynwb.misc.AnnotationSeries'>,
    DS2.ch4 <class 'pynwb.misc.AnnotationSeries'>,
    MS_PulseStim <class 'pynwb.misc.AnnotationSeries'>,
    MS_PulseStim_briefLD <class 'pynwb.misc.AnnotationSeries'>,
    MS_PulseStim_briefLaser <class 'pynwb.misc.AnnotationSeries'>,
    PeakStim5V <class 'pynwb.misc.AnnotationSeries'>,
    PulseStim_10ms <class 'pynwb.misc.AnnotationSeries'>,
    PulseStim_1sec <class 'pynwb.misc.AnnotationSeries'>,
    PulseStim_1sec_weak <class 'pynwb.misc.AnnotationSeries'>,
    Ripple <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_5V_End <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_5V_Start <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_End <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_Start <class 'pynwb.misc.AnnotationSeries'>,
    TroughStim5V <class 'pynwb.misc.AnnotationSeries'>
  }
  subject: subject pynwb.file.Subject at 0x139635971180192
Fields:
  age: 207 days, 0:00:00
  genotype: SST-Cre
  species: Mus musculus
  subject_id: YutaMouse20

  timestamps_reference_time: 2014-03-21 00:00:00-04:00
  units: units <class 'pynwb.misc.Units'>

python cached-fsspec.py  5.75s user 0.84s system 4% cpu 2:41.85 total
11694714459	nwb-cache
11694714459	total

$> du -scm nwb-cache 
83	nwb-cache
83	total

NB: the above reminded me of the critical difference between du -b and du -m (or any other unit flag like -h) on a sparse file: -b reports the "--apparent-size", i.e. all 11GB, whereas the other unit flags report the actual consumption on the filesystem.
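
To make that concrete, a quick synthetic check (truncate creates a sparse file with no allocated blocks):

$> truncate -s 1G sparse.img   # 1GB apparent size, (almost) nothing allocated
$> du -b sparse.img            # reports the apparent size: 1073741824
$> du -m sparse.img            # reports actual on-disk usage: ~0-1 MB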

So -- we shaved off only 3 sec. The logical test would be to compare against performance on a locally present file altogether (I might do that later on a remote host).

But maybe the best option would be to include such an access method in the sweep of IO benchmarks that @bendichter (I think) did in the past -- where would that be, and do you think this case would fit there well?

@bendichter
Contributor

Oh wow, this would allow us to do lazy reads from gdrive, dropbox, and ssh? That's exciting! fsspec is indeed quite powerful! I could see lots of labs using this for internal stuff.

It seems like the CachingFileSystem would be a good thing to add to the PyNWB tutorial on data streaming.

@bendichter
Contributor

@yarikoptic, since I am new to fsspec, maybe I can write something up and you can take a look to ensure it is following best practices.

@oruebel
Contributor

oruebel commented Aug 4, 2022

Thanks @yarikoptic for investigating fsspec. This looks very useful and having the option to also read from GDrive, Dropbox etc. would be very nice to have indeed.

@yarikoptic
Contributor Author

@yarikoptic, since I am new to fsspec, maybe I can write something up and you can take a look to ensure it is following best practices.

I am not an expert in it either, but I would be happy to take a look and adjust if needed. We could even run it by the fsspec folks! They have typically been quite responsive.

Thanks @yarikoptic for investigating fsspec. This looks very useful and having the option to also read from GDrive, Dropbox etc. would be very nice to have indeed.

;) it might indeed be nice to inspect how well the support for other filesystems would work out -- I haven't tried any others yet... will do now for ssh and will report back.

@yarikoptic
Contributor Author

will do now for ssh and will report back.

To cool things down a bit, the TL;DR: sparse caching does not work for SSH (at the moment, at least -- I do not think anything forbids implementing that support; filed fsspec/filesystem_spec#1012).

Meanwhile, here is the adjusted script, which can be fed ssh:// URLs -- only to crash.
import sys
import fsspec

# uncomment to see what is going on!
#import logging
#logging.getLogger("fsspec").setLevel(1)
#logging.basicConfig(level=1)

import pynwb
import h5py
import urllib

from time import time

if '://' in sys.argv[-1]:
    url = sys.argv[-1]
else:
    # an 11GB file to munch on; reading it would fetch only about 80MB of data
    url = "https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ"

# figure out the filesystem; let's add some scheme mappings
fsmap = {'https': 'http'}

#In [8]: urllib.parse.urlparse('ssh://lkasjdf')
#Out[8]: ParseResult(scheme='ssh', netloc='lkasjdf', path='', params='', query='', fragment='')

url_ = urllib.parse.urlparse(url)
scheme = url_.scheme
scheme = fsmap.get(scheme, scheme)

if scheme == 'http':
    # for http -- everything goes into fs.open
    fspath = url
    fskw = {}
elif scheme == 'ssh':
    fspath = url_.path.lstrip('/')  # treat the path as relative to $HOME for now
    fskw = dict(
        host=url_.netloc.split(':', 1)[0],
        port=int(url_.netloc.split(':', 1)[1]) if ':' in url_.netloc else 22
    )
else:
    raise NotImplementedError(f"Do not know how to handle {scheme}")

from fsspec.implementations.cached import CachingFileSystem
fs = CachingFileSystem(
    fs=fsspec.filesystem(scheme, **fskw),
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)


print(f"Accessing {url} as {fspath} on {fs} ")
# It is crucial to use context managers so that the cache gets closed and thus reused;
# note that https://docs.h5py.org/en/stable/high/file.html also has a Warning that
# things should be closed in the proper order!
t0 = time()
with fs.open(fspath, 'rb') as f:
    with h5py.File(f) as file:
        with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
            out = io.read()
            print(f"Read something which gives me {len(str(out))} long str representation in {time()-t0:.3f} sec")
