SFTPFile(AbstractBufferedFile) for sparse access to remote file over ssh #1012
Comments
As far as I remember, you do indeed get sparse random access in paramiko/sftp, but it is implemented at a lower level, so fsspec doesn't need to. @efiop, we discussed a while ago ditching paramiko (and the older HDFS backend), maybe the time has come?
@martindurant Sounds good. I suppose we only really care about basic operations like open/read/write/etc here for the most part, so potential inconsistencies in
@yarikoptic Hey Yaroslav, long time no see 😉 Could you try https://github.com/fsspec/sshfs just to see if it will work for you?
Hi @efiop! Indeed, long time. I really hope you are doing ok. I ran into ronf/asyncssh#504, so I had to move aside my ~/.ssh and tune up my ugly quick script to ask for a password. The script with which I am "exploring" fsspec on the type of files of interest:

import sys
import fsspec
# if to see what is going on!
#import logging
#logging.getLogger("fsspec").setLevel(1)
#logging.basicConfig(level=1)
import pynwb
import h5py
import urllib.parse
from time import time
if '://' in sys.argv[-1]:
    url = sys.argv[-1]
else:
    # and 11GB file to munch on. Would fetch about 80MB of data
    url = "https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ"
# figure out filesystem, lets add some mappings
fsmap = {'https': 'http'}
#In [8]: urllib.parse.urlparse('ssh://lkasjdf')
#Out[8]: ParseResult(scheme='ssh', netloc='lkasjdf', path='', params='', query='', fragment='')
url_ = urllib.parse.urlparse(url)
scheme = url_.scheme
scheme = fsmap.get(scheme, scheme)
if scheme == 'http':
    # for http -- everything goes into fs.open
    fspath = url
    fskw = {}
elif scheme in ('ssh', 'sshfs'):
    fspath = url_.path.lstrip('/')  # consider it from $HOME for now
    import getpass
    fskw = dict(
        host=url_.netloc.split(':', 1)[0],
        port=int(url_.netloc.split(':', 1)[1]) if ':' in url_.netloc else 22,
        # cannot use keys so will demand password
        password=getpass.getpass("Password:"),
    )
else:
    raise NotImplementedError(f"Do not know how to handle {scheme}")
if scheme == 'sshfs':
    from sshfs import SSHFileSystem
    fs = SSHFileSystem(**fskw)
else:
    fs = fsspec.filesystem(scheme, **fskw)
from fsspec.implementations.cached import CachingFileSystem
fs = CachingFileSystem(
    fs=fs,
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)
print(f"Accessing {url} as {fspath} on {fs} ")
# It is crucial to have proper context manager to cache gets closed so it gets reused
# and https://docs.h5py.org/en/stable/high/file.html also has Warning that things
# should be closed in proper order! Seems to be crucial.
t0 = time()
with fs.open(fspath, 'rb') as f:
    with h5py.File(f) as file:
        with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
            out = io.read()
print(f"Read something which gives me {len(str(out))} long str representation in {time()-t0:.3f} sec")

and running it results in

$> time python -Wignore cached-fsspec.py sshfs://secret.datalad.org:some/c4057c5e-7af5-4370-878f-ccfc971aeba4
Password:
Accessing sshfs://secret.datalad.org:some/c4057c5e-7af5-4370-878f-ccfc971aeba4 as c4057c5e-7af5-4370-878f-ccfc971aeba4 on <fsspec.implementations.cached.CachingFileSystem object at 0x7f6bf6aadfc0>
Traceback (most recent call last):
  File "/home/yoh/proj/dandi/trash/cached-fsspec.py", line 71, in <module>
    with fs.open(fspath, 'rb') as f:
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py", line 406, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/spec.py", line 1034, in open
    f = self._open(
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py", line 406, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py", line 342, in _open
    f.cache = MMapCache(f.blocksize, f._fetch_range, f.size, fn, blocks)
AttributeError: 'SSHFile' object has no attribute '_fetch_range'
^C
python -Wignore cached-fsspec.py 2.53s user 0.18s system 1% cpu 2:57.46 total

so it stalls after the traceback and requires a Ctrl-C. If you need a sample of that file to try exactly that script, it is this one: https://dandiarchive.s3.amazonaws.com/blobs/c40/57c/c4057c5e-7af5-4370-878f-ccfc971aeba4 . The versions of fsspec, sshfs, and asyncssh are AFAIK all "bleeding edge from GitHub".
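For anyone debugging the same thing: going only by the `MMapCache(f.blocksize, f._fetch_range, f.size, fn, blocks)` line in the traceback, the caching layer appears to expect three attributes on the opened file object. A minimal sketch of a check for them follows; `missing_cache_attrs` and the two stand-in classes are hypothetical helpers for illustration, not part of fsspec or sshfs.

```python
# Attribute names are taken from the MMapCache(...) call in the traceback;
# whether other fsspec versions need more is an assumption left unchecked here.
REQUIRED_ATTRS = ("blocksize", "_fetch_range", "size")


def missing_cache_attrs(file_obj):
    """Return the names from REQUIRED_ATTRS that file_obj does not provide."""
    return [name for name in REQUIRED_ATTRS if not hasattr(file_obj, name)]


class BareFile:
    """Hypothetical stand-in for a file object that only knows its size."""
    size = 1024


class BufferedLikeFile(BareFile):
    """Hypothetical stand-in that exposes everything the cache asks for."""
    blocksize = 2 ** 20

    def _fetch_range(self, start, end):
        return b""


print(missing_cache_attrs(BareFile()))
print(missing_cache_attrs(BufferedLikeFile()))
```

Calling `missing_cache_attrs(f)` on the object returned by the underlying filesystem's `open()` (before wrapping it in `CachingFileSystem`) should show which pieces a given backend still lacks.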
BTW -- just to make sure -- having installed (and imported)
@yarikoptic Correct, sshfs is not part of the registry for now. Ok, so it looks like it is still missing some functionality, but it seems closer than the paramiko-based fs. You might be able to get it working with a little bit of effort. I haven't looked too deeply into that though, so I might be wrong.
Ref: #748, which I found while also wondering about the
AttributeError: 'SFTPFile' object has no attribute 'blocksize'
error ;-)
I guess there is no "sparse" cache because "range" request support is not implemented for sftp the way it is for e.g. HTTPFile?
But it seems that sftp itself does allow for range requests, e.g.:
so I guess it should well be possible to provide such support in fsspec... I didn't look in detail anywhere yet, but paramiko does seem to support a seekable BufferedFile
so maybe it is really just a quick patch away? ;)
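To make the "range request on top of a seekable file" idea concrete, here is a minimal sketch of what such a range fetch could look like. It is not fsspec's or paramiko's actual implementation: `fetch_range` is a hypothetical helper, and an `io.BytesIO` stands in for the remote file on the assumption that a paramiko SFTP file opened in binary mode supports `seek()` and `read()` in the same way.

```python
import io


def fetch_range(f, start, end):
    """Return bytes [start, end) from a seekable binary file-like object."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    f.seek(start)
    chunks = []
    remaining = end - start
    # read() may return fewer bytes than requested, so loop until done or EOF
    while remaining > 0:
        chunk = f.read(remaining)
        if not chunk:  # reached end of file before `end`
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Stand-in for a remote file (assumption: the real SFTP file object
# behaves like this with respect to seek() and read()).
remote = io.BytesIO(bytes(range(256)) * 4)
middle = fetch_range(remote, 100, 110)
tail = fetch_range(remote, 1020, 2000)
```

A method with this shape, bound to the file object as `_fetch_range(start, end)`, is roughly what the block caches would call to pull only the byte ranges they need.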