Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create async iterator version of _lsdir() #670

Merged
merged 5 commits into from
Dec 12, 2022
Merged

Conversation

rlamy
Copy link
Contributor

@rlamy rlamy commented Dec 7, 2022

When dealing with large directories, using _ls()/_lsdir() forces us to wait until all the contents of the directory have been received, which can take a significant amount of time. This PR creates an _iterdir() async iterator which allows the results to be handled as soon as they're received from the network.

@martindurant
Copy link
Member

Please add a docstring to _iterdict .

Do you think there should also be a sync version of this, iterdict() ?

@rlamy
Copy link
Contributor Author

rlamy commented Dec 9, 2022

I don't think it makes a lot of sense to wrap _iterdir() with a sync iterator. We'd have to switch between sync and async code on each iteration, which seems inefficient.

@martindurant
Copy link
Member

It may be inefficient, but perhaps still desirable for a case where the caller only wants files in the first few pages out of many.

@efiop
Copy link
Member

efiop commented Dec 9, 2022

Would it be possible to make _lsdir return an iterator so that the user could stop iterating when he had enough results? Introducing iterdir basically does that but with a brand new method, that seems excessive. My understanding was that _iterdir is more of an internal async method and not really a public API.

@martindurant
Copy link
Member

_iterdir is more of an internal async method and not really a public API

As written, maybe (but all async methods are underscored); but actually DVC plans to call it, right? Which suggests that it is useful, and a sync version of it would be too. I don't actually know whether you can simply wrap an async iterator and get a sync iterator as you would expect for a normal method. If yes, do it; if no, then it's probably too much effort.

@efiop
Copy link
Member

efiop commented Dec 9, 2022

For now this will be used in another project related to dvc and for that purpose, we could use it as a hidden method and that was my understanding. Maybe @rlamy could clarify.

Maybe it makes sense to have an official iterdir/_iterdir method, I'm not sure, I'm just trying to understand how this is going to fit into the spec and whether it won't be something undesirable considering that ls/_ls already exists (btw, I got confused at called it _lsdir in my comment before, but I really meant _ls, because _lsdir is not official). Just to clarify, this PR is fine as a hidden method that is used by a hidden _lsdir, I only worry about the official spec, as @martindurant suggested adding a sync iterdir, which would make it official in my view.

@rlamy
Copy link
Contributor Author

rlamy commented Dec 9, 2022

@efiop Indeed, for our (dvc-related) current purposes, it's fine to use a private hidden method.

Having iterdir in the spec could make sense, as there is precedent in the stdlib (pathlib.Path.iterdir). However, wrapping an async iterator with a sync iterator isn't trivial. I guess something like the following could work, but it probably ignores a number of edge cases:

def iterdir(self, ...):
    aiter = self.iterdir(...)
    while True:
        try:
            entry = sync(self.loop, aiter.__anext__)
            yield entry
        except StopAsyncIteration:
            break

@martindurant
Copy link
Member

OK, going to merge this as an internal implementation detail for now, and we can consider more visible changes at another time.

@martindurant martindurant merged commit 5917684 into fsspec:main Dec 12, 2022
@rlamy rlamy deleted the iterdir branch December 13, 2022 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants