SFTPFile(AbstractBufferedFile) for sparse access to remote file over ssh #1012

Open
yarikoptic opened this issue Aug 4, 2022 · 5 comments

@yarikoptic
Contributor

Ref: #748, which I found while also wondering about the AttributeError: 'SFTPFile' object has no attribute 'blocksize' error ;-)

I guess there is no "sparse" cache because "range" request support is not implemented for sftp the way it is for e.g. HTTPFile?

But it seems that sftp itself does allow for range requests, e.g.:

$> curl --silent --range 0-0,-1 sftp://yoh@secret.datalad.org:2222/home/yoh/c4057c5e-7af5-4370-878f-ccfc971aeba4 | hexdump
0000000 0089                                   
0000001

so I guess it should well be possible to provide such support in fsspec... I didn't look in detail anywhere yet, but paramiko does seem to support a seekable BufferedFile:

$> git grep -p 'def seek'
paramiko/_winapi.py=class MemoryMap(object):
paramiko/_winapi.py:    def seek(self, pos):
paramiko/file.py=class BufferedFile(ClosingContextManager):
paramiko/file.py:    def seekable(self):
paramiko/file.py:    def seek(self, offset, whence=0):
paramiko/sftp_file.py=class SFTPFile(BufferedFile):
paramiko/sftp_file.py:    def seekable(self):
paramiko/sftp_file.py:    def seek(self, offset, whence=0):

so maybe it is really just a quick patch away? ;)
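
For illustration, here is a minimal sketch of what such a "range" read amounts to on any seekable file object. `read_range` is a hypothetical helper, not part of paramiko or fsspec, and `io.BytesIO` stands in for the remote file. The assumption is that paramiko's `SFTPFile` honours `seek()`/`read()` the same way, as the grep above suggests:

```python
import io


def read_range(fobj, start, end):
    """Return bytes start..end (inclusive, like a curl --range) from a seekable file."""
    if not fobj.seekable():
        raise ValueError("file object does not support seek()")
    fobj.seek(start)
    return fobj.read(end - start + 1)


# Stand-in for a remote file; the first byte mimics the 0x89 seen in the hexdump above.
remote = io.BytesIO(b"\x89HDF\r\n\x1a\n" + b"\x00" * 1024)
first_byte = read_range(remote, 0, 0)
signature = read_range(remote, 0, 7)
print(first_byte.hex())  # prints 89
```

With a real connection, the same helper would presumably be handed the object returned by paramiko's SFTP `open()` instead of the `BytesIO`; that part is untested here.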

@martindurant
Member

As far as I remember, you do indeed get sparse random access in paramiko/sftp, but it is implemented at the lower level, so fsspec doesn't need to implement it itself.

@efiop, we discussed ditching paramiko (and the older HDFS backend) a while ago; maybe the time has come?

@efiop
Member

efiop commented Aug 4, 2022

@martindurant Sounds good. I suppose we only really care about basic operations like open/read/write/etc here for the most part, so potential inconsistencies in sshfs shouldn't be a giant problem? If I understand correctly, you propose we just make sshfs the default in the registry, right?

@yarikoptic Hey Yaroslav, long time no see 😉 Could you try https://github.com/fsspec/sshfs just to see if it will work for you?

@yarikoptic
Contributor Author

Hi @efiop! Indeed long time. I really hope you are doing ok.

Ran into ronf/asyncssh#504, so I had to move aside my ~/.ssh and tune up my ugly quick script to ask for a password.

The script with which I am "exploring" fsspec on the type of files of interest:
import sys
import fsspec

# uncomment to see what is going on!
#import logging
#logging.getLogger("fsspec").setLevel(1)
#logging.basicConfig(level=1)

import pynwb
import h5py
import urllib.parse  # urlparse lives in the urllib.parse submodule

from time import time

if '://' in sys.argv[-1]:
    url = sys.argv[-1]
else:
    # an 11GB file to munch on. Would fetch about 80MB of data
    url = "https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ"

# figure out the filesystem; let's add some mappings
fsmap = {'https': 'http'}

#In [8]: urllib.parse.urlparse('ssh://lkasjdf')
#Out[8]: ParseResult(scheme='ssh', netloc='lkasjdf', path='', params='', query='', fragment='')

url_ = urllib.parse.urlparse(url)
scheme = url_.scheme
scheme = fsmap.get(scheme, scheme)

if scheme == 'http':
    # for http -- everything goes into fs.open
    fspath = url
    fskw = {}
elif scheme in ('ssh', 'sshfs'):
    fspath = url_.path.lstrip('/') # consider it from $HOME for now
    import getpass
    fskw = dict(
        host=url_.netloc.split(':', 1)[0],
        port=int(url_.netloc.split(':', 1)[1]) if ':' in url_.netloc else 22,
        # cannot use keys so will demand password
        password=getpass.getpass("Password:"),
    )
else:
    raise NotImplementedError(f"Do not know how to handle {scheme}")

if scheme == 'sshfs':
    from sshfs import SSHFileSystem
    fs = SSHFileSystem(**fskw)
else:
    fs = fsspec.filesystem(scheme, **fskw)

from fsspec.implementations.cached import CachingFileSystem
fs = CachingFileSystem(
    fs=fs,
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)


print(f"Accessing {url} as {fspath} on {fs} ")
# It is crucial to use proper context managers so the cache gets closed and can be reused,
# and https://docs.h5py.org/en/stable/high/file.html also has a Warning that things
# should be closed in the proper order!  Seems to be crucial.
t0 = time()
with fs.open(fspath, 'rb') as f:
    with h5py.File(f) as file:
        with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
            out = io.read()
            print(f"Read something which gives me {len(str(out))} long str representation in {time()-t0:.3f} sec")

and running it results in

$> time python -Wignore cached-fsspec.py sshfs://secret.datalad.org:some/c4057c5e-7af5-4370-878f-ccfc971aeba4
Password:
Accessing sshfs://secret.datalad.org:some/c4057c5e-7af5-4370-878f-ccfc971aeba4 as c4057c5e-7af5-4370-878f-ccfc971aeba4 on <fsspec.implementations.cached.CachingFileSystem object at 0x7f6bf6aadfc0> 
Traceback (most recent call last):
  File "/home/yoh/proj/dandi/trash/cached-fsspec.py", line 71, in <module>
    with fs.open(fspath, 'rb') as f:
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py", line 406, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/spec.py", line 1034, in open
    f = self._open(
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py", line 406, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py", line 342, in _open
    f.cache = MMapCache(f.blocksize, f._fetch_range, f.size, fn, blocks)
AttributeError: 'SSHFile' object has no attribute '_fetch_range'


^C
python -Wignore cached-fsspec.py   2.53s user 0.18s system 1% cpu 2:57.46 total

so it stalls after the traceback and requires Ctrl-C to exit. If you need a sample of that file to try exactly that script, it is this one: https://dandiarchive.s3.amazonaws.com/blobs/c40/57c/c4057c5e-7af5-4370-878f-ccfc971aeba4 . Versions of fsspec, sshfs and asyncssh are, AFAIK, all "bleeding edge from github".
FWIW:

> /home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py(342)_open()
-> f.cache = MMapCache(f.blocksize, f._fetch_range, f.size, fn, blocks)
(Pdb) p f
<sshfs.file.SSHFile object at 0x7fec96775240>
(Pdb) p f.__module__
'sshfs.file'
(Pdb) p dir(f)
['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_close', '_closed', '_file', '_open_file', 'blocksize', 'close', 'closed', 'fileno', 'flush', 'fs', 'fsync', 'isatty', 'kwargs', 'loop', 'max_requests', 'mode', 'path', 'read', 'readable', 'readline', 'readlines', 'seek', 'seekable', 'tell', 'truncate', 'writable', 'write', 'writelines']
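
The traceback shows that CachingFileSystem expects the wrapped file to provide `_fetch_range`, while the `dir(f)` output above lists `seek` and `read` but no `_fetch_range`. A minimal sketch of bridging the two on a generic seekable file follows. `make_fetch_range` is a hypothetical helper, not something fsspec or sshfs provides; the half-open `[start, end)` convention is an assumption modeled on fsspec's buffered files; and `io.BytesIO` stands in for the `SSHFile`:

```python
import io


def make_fetch_range(fobj):
    """Build a fetcher(start, end) that returns bytes [start, end) via seek/read."""

    def fetch_range(start, end):
        fobj.seek(start)
        return fobj.read(end - start)

    return fetch_range


# Stand-in for the remote file object (256 bytes: 0x00..0xff).
f = io.BytesIO(bytes(range(256)))
fetch = make_fetch_range(f)
block = fetch(16, 32)
print(len(block))  # prints 16
```

Whether attaching such a fetcher is enough for `MMapCache` is untested here: the same line of `cached.py` also reads `f.size`, and `size` does not appear in the `dir(f)` listing either.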

@yarikoptic
Contributor Author

BTW -- just to make sure -- having sshfs installed (and imported) does not automagically make it known to fsspec.registry.known_implementations and available via fsspec.filesystem, right? (I would have assumed it would, through some entry point or the like.)

@efiop
Member

efiop commented Aug 4, 2022

@yarikoptic Correct, sshfs is not part of the registry for now.

Ok, so it looks like it is still missing some functionality, but it seems closer than the paramiko-based fs. You might be able to get it working with a little bit of effort. I haven't looked too deeply into that though, so I might be wrong.
