[Feature]: support fsspec / opening from a file object #1525
#1499 shows what you are describing for s3fs (see line 213 in 11d5850).
I honestly don't know whether this would work out of the box, but it would be interesting to try. @yarikoptic, could you test whether this works? If it does, we could update the PyNWB tutorial on streaming accordingly.
oh nice -- thanks @bendichter for that one -- s3fs is fsspec-based, so we are "all set" and it is indeed just a matter of documentation and investigation of "efficiency"! NB: AFAIK this needs not-yet-released versions of pynwb/hdmf, so I had to install them from the git repos. Here is @bendichter's code snippet adjusted to use a caching filesystem and to access an https URL directly (could be any http URL, not necessarily S3) with a versionId (since why not? ;)), by default accessing a URL from dandi:000003 a few GBs in size:

```python
import sys
from time import time

import fsspec
import h5py
import pynwb
from fsspec.implementations.cached import CachingFileSystem

# uncomment if you want to see what is going on!
# import logging
# logging.getLogger("fsspec").setLevel(1)
# logging.basicConfig(level=1)

fs = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)

if sys.argv[-1].startswith('http'):
    url = sys.argv[-1]
else:
    # an 11GB file to munch on; would fetch only about 80MB of data
    url = "https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ"
print(f"Accessing {url}")

# It is crucial to use context managers so the cache gets closed (and thus reused);
# https://docs.h5py.org/en/stable/high/file.html also warns that things
# should be closed in the proper order. This seems to be crucial.
t0 = time()
with fs.open(url, 'rb') as f:
    with h5py.File(f) as file:
        with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
            out = io.read()
print(f"Read something which gives me a {len(str(out))}-long str representation in {time()-t0:.3f} sec")
```

and here is a demonstration of how "well" it works on that file:
which shows that, at the cost of 83MB in the local cache, repeated reads become fast. On the original file which @bendichter used, it looks like the following but is less impressive, since the file itself is just 400KB, so it is kinda boring -- fsspec really just fetches the entire file (the default block size is 5MB IIRC):

```
$> rm -rf nwb-cache ; for l in 1 2; do python -W ignore cached-fsspec.py "http://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35?versionId=C1uOWMhd6uxiKUQG.U8xkxeYb.EySYbS" ; done; du -sch nwb-cache
Accessing http://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35?versionId=C1uOWMhd6uxiKUQG.U8xkxeYb.EySYbS
Read something which gives me 1347 long str representation in 0.659 sec
Accessing http://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35?versionId=C1uOWMhd6uxiKUQG.U8xkxeYb.EySYbS
Read something which gives me 1347 long str representation in 0.092 sec
448K	nwb-cache
448K	total
```

I have not tried any protocol other than "http", but here is the list of all implementations it claims to support -- so it might be quite cool to try some others relevant to typical use cases (e.g. ssh):
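That list is also available programmatically from fsspec's registry; a quick sketch (the hard-coded fallback list here is my own subset, used only if fsspec is not installed):

```python
# Sketch: enumerate the protocols fsspec knows about (s3, gcs, ssh/sftp, ...).
try:
    from fsspec.registry import known_implementations
    protocols = sorted(known_implementations)
except ImportError:
    # assumed fallback subset, for illustration only
    protocols = ["file", "ftp", "http", "s3", "sftp", "ssh"]

print("ssh" in protocols and "http" in protocols)
```

On any recent fsspec this prints `True`, confirming that both `http` and `ssh` backends are registered.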
My initial attempts, from before I realized the importance of properly closing those opened files for this use case. They have other possibly useful insights on `du` etc. ;)

```python
import fsspec
import pynwb
import h5py
from fsspec.implementations.cached import CachingFileSystem

fs = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)

# http "version" with versionId of "s3://dandiarchive/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35"
url = "http://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35?versionId=C1uOWMhd6uxiKUQG.U8xkxeYb.EySYbS"
f = fs.open(url, 'rb')
file = h5py.File(f)
io = pynwb.NWBHDF5IO(file=file, load_namespaces=True)
print(io.read())
```

Since this file is not particularly "rich", local caching does not shave off much (about 400-500 ms), but it works (it stores an entire copy of the file locally, since its 400KB size is less than fsspec's default block size of ~5MB IIRC). So I went for a more interesting example, dandi://dandi/000003/sub-YutaMouse20/sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb (from here), which is https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ, and adjusted the demo script to use that URL instead:

```python
import fsspec
import pynwb
import h5py
from fsspec.implementations.cached import CachingFileSystem

fs = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)

url = "https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ"
f = fs.open(url, 'rb')
file = h5py.File(f)
io = pynwb.NWBHDF5IO(file=file, load_namespaces=True)
print(io.read())
```

Since it is quite sizeable (11GB) and my home internet sucks, I was hoping for this to be more demonstrative, but it was not really... Run times were 2:44.60 and 2:41.85 with a cache size of 83MB:

```
$> rm -rf nwb-cache; time python cached-fsspec.py; time python cached-fsspec.py; du -scb nwb-cache
```
```
/home/yoh/deb/gits/pkg-exppsy/hdmf-upstream/src/hdmf/spec/namespace.py:532: UserWarning: Ignoring cached namespace 'hdmf-common' version 1.3.0 because version 1.5.1 is already loaded.
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
/home/yoh/deb/gits/pkg-exppsy/hdmf-upstream/src/hdmf/spec/namespace.py:532: UserWarning: Ignoring cached namespace 'core' version 2.2.5 because version 2.5.0 is already loaded.
  warn("Ignoring cached namespace '%s' version %s because version %s is already loaded."
root pynwb.file.NWBFile at 0x140396477138352
Fields:
  acquisition: {
    position_sensor0 <class 'pynwb.behavior.SpatialSeries'>,
    position_sensor1 <class 'pynwb.behavior.SpatialSeries'>
  }
  devices: {
    Device <class 'pynwb.device.Device'>,
    Implant <class 'pynwb.device.Device'>
  }
  electrode_groups: {
    shank1 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank2 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank3 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank4 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank5 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank6 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank7 <class 'pynwb.ecephys.ElectrodeGroup'>,
    shank8 <class 'pynwb.ecephys.ElectrodeGroup'>
  }
  electrodes: electrodes <class 'hdmf.common.table.DynamicTable'>
  experimenter: ['Yuta Senzai']
  file_create_date: [datetime.datetime(2021, 1, 16, 20, 24, 13, 973543, tzinfo=tzoffset(None, -18000))]
  identifier: 23647bce-e14c-4498-a45d-8b5dcd572360
  institution: NYU
  lab: Buzsaki
  processing: {
    behavior <class 'pynwb.base.ProcessingModule'>,
    ecephys <class 'pynwb.base.ProcessingModule'>
  }
  related_publications: ['DOI:10.1016/j.neuron.2016.12.011']
  session_description: Mouse in open exploration and theta maze.
  session_id: YutaMouse20-140321
  session_start_time: 2014-03-21 00:00:00-04:00
  stimulus: {
    DS1.ch4 <class 'pynwb.misc.AnnotationSeries'>,
    DS2.ch4 <class 'pynwb.misc.AnnotationSeries'>,
    MS_PulseStim <class 'pynwb.misc.AnnotationSeries'>,
    MS_PulseStim_briefLD <class 'pynwb.misc.AnnotationSeries'>,
    MS_PulseStim_briefLaser <class 'pynwb.misc.AnnotationSeries'>,
    PeakStim5V <class 'pynwb.misc.AnnotationSeries'>,
    PulseStim_10ms <class 'pynwb.misc.AnnotationSeries'>,
    PulseStim_1sec <class 'pynwb.misc.AnnotationSeries'>,
    PulseStim_1sec_weak <class 'pynwb.misc.AnnotationSeries'>,
    Ripple <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_5V_End <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_5V_Start <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_End <class 'pynwb.misc.AnnotationSeries'>,
    StimDuringRun_Start <class 'pynwb.misc.AnnotationSeries'>,
    TroughStim5V <class 'pynwb.misc.AnnotationSeries'>
  }
  subject: subject pynwb.file.Subject at 0x140396476974992
Fields:
  age: 207 days, 0:00:00
  genotype: SST-Cre
  species: Mus musculus
  subject_id: YutaMouse20
  timestamps_reference_time: 2014-03-21 00:00:00-04:00
  units: units <class 'pynwb.misc.Units'>

python cached-fsspec.py  5.49s user 1.07s system 3% cpu 2:44.60 total
[second run: same warnings and an identical NWBFile dump]
python cached-fsspec.py  5.75s user 0.84s system 4% cpu 2:41.85 total
11694714459	nwb-cache
11694714459	total
$> du -scm nwb-cache
83	nwb-cache
83	total
```
NB: the above also reminded me about the critical difference between `du -scb` (apparent size) and `du -scm` (actual disk usage) when looking at fsspec's sparse cache file. So -- we shaved off only ~3 sec. The logical test would be to compare against performance on a locally present file altogether (I might do that later on a remote host). But maybe the best would be to include such an access method into the sweep of IO benchmarks which @bendichter (I think) did in the past -- where would that be, and do you think this case would fit there well?
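That apparent-vs-allocated distinction can be demonstrated with a plain sparse file (a stdlib-only sketch; the near-zero allocation assumes a sparse-capable filesystem such as ext4 or tmpfs):

```python
import os
import tempfile

# Create a file with a 1 GiB apparent size but (on sparse-capable filesystems)
# almost no allocated blocks -- the same situation as fsspec's sparse cache file.
fd, path = tempfile.mkstemp()
os.close(fd)
os.truncate(path, 1 << 30)  # extend to 1 GiB without writing any data

st = os.stat(path)
apparent = st.st_size            # what `du -scb` reports
allocated = st.st_blocks * 512   # what `du -scm` is derived from
print(apparent)               # 1073741824
print(allocated < apparent)   # True on sparse-capable filesystems
os.unlink(path)
```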
Backreferencing some of the fsspec issues which might be "useful" if there is a desire to implement a solution based on fsspec:
Oh wow, this would allow us to do lazy reads from gdrive, dropbox, and ssh? That's exciting! fsspec is indeed quite powerful! I could see lots of labs using this for internal stuff.
@yarikoptic, since I am new to fsspec, maybe I can write something up and you can take a look to ensure it is following best practices.
Thanks @yarikoptic for investigating fsspec. This looks very useful, and having the option to also read from GDrive, Dropbox etc. would be very nice to have indeed.
I am no expert in it either, but I would be happy to have a look and adjust if needed. We could even run it by the fsspec folks! They are typically quite responsive.
;) It might indeed be nice to inspect other filesystems' support and see how well they work out -- I haven't tried any other yet... will do now for `ssh`.
To cool things down a bit, TL;DR: sparse caching would not work for SSH (at the moment at least -- I do not think anything forbids implementing that support; filed fsspec/filesystem_spec#1012). Meanwhile, here is the adjusted script, which can be fed ssh:// URLs only to crash:

```python
import sys
import urllib.parse
from time import time

import fsspec
import h5py
import pynwb

# uncomment if you want to see what is going on!
# import logging
# logging.getLogger("fsspec").setLevel(1)
# logging.basicConfig(level=1)

if '://' in sys.argv[-1]:
    url = sys.argv[-1]
else:
    # an 11GB file to munch on; would fetch only about 80MB of data
    url = "https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ"

# figure out the filesystem; let's add some mappings
fsmap = {'https': 'http'}
# In [8]: urllib.parse.urlparse('ssh://lkasjdf')
# Out[8]: ParseResult(scheme='ssh', netloc='lkasjdf', path='', params='', query='', fragment='')
url_ = urllib.parse.urlparse(url)
scheme = fsmap.get(url_.scheme, url_.scheme)
if scheme == 'http':
    # for http -- everything goes into fs.open
    fspath = url
    fskw = {}
elif scheme == 'ssh':
    fspath = url_.path.lstrip('/')  # consider it relative to $HOME for now
    fskw = dict(
        host=url_.netloc.split(':', 1)[0],
        port=int(url_.netloc.split(':', 1)[1]) if ':' in url_.netloc else 22,
    )
else:
    raise NotImplementedError(f"Do not know how to handle {scheme}")

from fsspec.implementations.cached import CachingFileSystem
fs = CachingFileSystem(
    fs=fsspec.filesystem(scheme, **fskw),
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)
print(f"Accessing {url} as {fspath} on {fs}")

# It is crucial to use context managers so the cache gets closed (and thus reused);
# https://docs.h5py.org/en/stable/high/file.html also warns that things
# should be closed in the proper order.
t0 = time()
with fs.open(fspath, 'rb') as f:
    with h5py.File(f) as file:
        with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
            out = io.read()
print(f"Read something which gives me a {len(str(out))}-long str representation in {time()-t0:.3f} sec")
```
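The scheme-dispatch part of the script can be exercised in isolation (a stdlib-only sketch; `parse_ssh_url` is a hypothetical helper name, not part of the script above):

```python
from urllib.parse import urlparse

def parse_ssh_url(url):
    """Split an ssh:// URL into (host, port, path relative to $HOME),
    mirroring the dispatch logic used for the 'ssh' scheme above."""
    u = urlparse(url)
    host, _, port = u.netloc.partition(":")
    return host, int(port) if port else 22, u.path.lstrip("/")

print(parse_ssh_url("ssh://example.com:2222/data/file.nwb"))
# ('example.com', 2222, 'data/file.nwb')
print(parse_ssh_url("ssh://example.com/file.nwb"))
# ('example.com', 22, 'file.nwb')
```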
What would you like to see added to PyNWB?
The ROS3 driver provides a nice way to stream data without downloading the file in full, as shown in this example. But it is S3-specific.
Is your feature request related to a problem?
Somewhat. AFAIK the ros3 driver doesn't provide the ability for local sparse caching etc.
What solution would you like?
Although I have not looked into how it is implemented, I bet the ros3 driver works through HTTP Range requests. Such a method is generic and could work for any http/https URL. It lies at the basis of many solutions, such as https://github.com/fsspec/filesystem_spec, which provides uniform access to a wide range of resources (not just S3), where fsspec provides a generic file interface; see e.g. the examples in the fsspec docs. Another example of such use is https://github.com/datalad/datalad-fuse/, where we provide an fsspec-based FUSE filesystem for any DataLad dataset, with fsspec providing seamless, locally cached, sparse access to any remote file, NWB included. So, by supporting opening from a file handle, it should be possible to open some fsspec'ed file. `NWBHDF5IO` seems to take a path, not a file handle. Would it be "easy" to support operating on a pre-opened file handle? If so, it might be just a matter of example documentation to demonstrate such a cool feature.
Do you have any interest in helping implement the feature?
No.
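The reason a file handle is enough: `h5py.File` accepts any Python file-like object implementing `read`/`seek`/`tell`, which is exactly what fsspec's `open()` returns. A minimal sketch with an in-memory `io.BytesIO` standing in for a remote file (the bytes here are just the 8-byte HDF5 signature plus padding, not a valid HDF5 file):

```python
import io

# A BytesIO stands in for the lazy, remote file object that fs.open() would return.
fake_remote = io.BytesIO(b"\x89HDF\r\n\x1a\n" + b"\x00" * 1024)

# The minimal protocol h5py needs from a file-like object:
assert fake_remote.readable() and fake_remote.seekable()

# e.g., sniff the HDF5 signature by reading only the first 8 bytes --
# with fsspec this would transfer a single small block, not the whole file
fake_remote.seek(0)
signature = fake_remote.read(8)
print(signature == b"\x89HDF\r\n\x1a\n")  # True
```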