Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"File name too long" with HTTP presigned requests #1504

Open
cgrass opened this issue Jan 16, 2024 Discussed in #1490 · 6 comments
Open

"File name too long" with HTTP presigned requests #1504

cgrass opened this issue Jan 16, 2024 Discussed in #1490 · 6 comments

Comments

@cgrass
Copy link

cgrass commented Jan 16, 2024

Discussed in #1490

Originally posted by cgrass January 4, 2024
Hello,
I'm new to python and fsspec, so hopefully there is an obvious answer to my question. Thanks for the help!

Issue

I need to fetch a file from an s3 bucket using https and a pre-signed URL. The URL produced is quite long, commonly over 1500 characters (mostly query params). If I use the code below, I get an OS exception: File name too long.

def test_this():
    kwargs = {}
    spec = fsspec.filesystem("https", **kwargs)
    frompath = {longurl}
    topath = "/var/"  ### in prod this is a real dest
    response = spec.get(frompath, topath, recursive=False, **kwargs)
    print(response)

I walked through the code and can see that the rpath is typically appended to the lpath for storage in asyn.py:

lpaths = other_paths(
    rpaths,
    lpath,
    exists=exists,
    flatten=not source_is_str,
)

def other_paths(
    paths: list[str],
    path2: str | list[str],
    exists: bool = False,
    flatten: bool = False,
) -> list[str]:
    if isinstance(path2, str):
        path2 = path2.rstrip("/")

        if flatten:
            path2 = ["/".join((path2, p.split("/")[-1])) for p in paths]
        else:
            cp = common_prefix(paths)
            if exists:
                cp = cp.rsplit("/", 1)[0]
            if not cp and all(not s.startswith("/") for s in paths):
                path2 = ["/".join([path2, p]) for p in paths]
            else:
                path2 = [p.replace(cp, path2, 1) for p in paths]
    else:
        assert len(paths) == len(path2)
    return path2

Question

Am I misusing the lib or protocol? Is there a way to configure fsspec to scrub query params (or use some arbitrary string) for the appended destination string?

Setup

MacOS Ventura M1 Pro
fsspec-2023.12.2
Python 3.9, 3.10 (tried both)

Linked bug report

@cgrass
Copy link
Author

cgrass commented Jan 16, 2024

In the original discussion I suggested fixing the bug by breaking up the other_paths use cases into distinct implementations. in that design the local path could be constructed using a hash of the source filename. e.g., get() for single files should be simple:

  1. destination (path2/lpath) is file -> write rpath contents to destination file. because user defines lpath, it's up to them to ensure reasonable length.
  2. destination (path2/lpath) is dir -> hash (md5) filename to construct lpath. this prevents long rpath filenames from being used directly to construct lpath.

i think it would also be helpful to return the constructed lpath from get().

@martindurant
Copy link
Member

We assume that the local filename should match the remote one when copying to inside a directory - this is what any copy operation would guarantee. The question is, what part of the remote URL we consider the "filename". Specifically for HTTP, it's not obvious whether query parameters are or are not part; but maybe the correct information is available in the headers.

@cgrass
Copy link
Author

cgrass commented Jan 18, 2024

I don't think you can guarantee that behavior; the source system might have fundamentally different path limits or requirements than the dest system.

if we assume that the query param is removed entirely and a legal 1000 character filename is part of the url, what can/should fsspec do to copy the file locally?

@martindurant
Copy link
Member

I think that in cases where the local filesystem can't handle a name, the caller should supply explicit names to write to using list inputs, or use get_file() instead of get().

@cgrass
Copy link
Author

cgrass commented Jan 24, 2024

the caller should supply explicit names to write to using list inputs

it seems awkward to force callers to create a list if they are interacting with a single file. especially if that requirement is only valid when rpath is longer than the local filesystem can handle.

or use get_file() instead of get()

I created a couple unit tests today and it works great when rpath and lpath are file locations. thanks for pointing out that alternative method! should it work for all fs implementations? i saw that it's unimplemented in async.py, but it's not clear to me if that will be a problem or not in a live env.

I found that passing in a dir for lpath failed with: IsADirectoryError: [Errno 21] Is a directory: '/var/myloc/'. If lpath must be a file location you might want to update the docs/comments.

here is the test that shows the behavior:

def test_http_output():
    kwargs = {}
    fs = fsspec.implementations.http.HTTPFileSystem(fsspec.filesystem("https", **kwargs))
    expected_output_path = "/var/myloc/"
    rpath = {longS3SignedUrl}
    lpath = expected_output_path
    fs.get_file(rpath, lpath)

@martindurant
Copy link
Member

if that requirement is only valid when rpath is longer than the local filesystem can handle

I don't think there is a general solution to this

should it work for all fs implementations?

Yes!

I found that passing in a dir for lpath failed

Indeed, this copies from a file path to a file path, which is why it gets around the case of auto-generated names

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants