Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-119169: Skip reversing sibling directories in os.[f]walk(topdown=False) #119473

Closed
wants to merge 4 commits into from

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented May 23, 2024

In os.walk(topdown=False), don't bother reversing walk_dirs. This means that sibling directories are visited in a different order, but 1) that order is arbitrary and comes from os.scandir(), and 2) unlike in top-down mode, users can't influence which directories are visited or in what order.

This change caused test_walk_bottom_up to fail. I think this test made assertions that were too specific and relied on os.scandir() returning things in a specific order. The test code is pretty hard to understand once you get into the details. I've replaced it with a version of the same test from test_pathlib_abc.py. The new test checks that every directory is yielded before all its ancestors, and after all its descendants, but does not check the order of its yielded siblings.

…wn=False)`

In `os.walk(topdown=False)`, don't bother reversing `walk_dirs`. This means
that sibling directories are visited in a different order, but 1) that
order is arbitrary and comes from `os.scandir()`, and 2) unlike in top-down
mode, users can't influence which directories are visited or in what order.

This change caused `test_walk_bottom_up` to fail. I think this test made
assertions that were too specific and relied on `os.scandir()` returning
things in a specific order, and the test code is pretty hard to understand
once you get into the details. I've replaced it with a version of the same
test from `test_pathlib_abc.py`.
@barneygale barneygale changed the title GH-119169: Skip reversing sibling directories in os.walk(topdown=False) GH-119169: Skip reversing sibling directories in os.[f]walk(topdown=False) May 30, 2024
@barneygale
Copy link
Contributor Author

Hey @serhiy-storchaka, just for context: this change doesn't provide much of a speedup in itself, but it enables more significant speedups seen in GH-119186. I've separated it out here because it's an observable change that breaks a test. But I suspect that doesn't matter as the scandir() result order is arbitrary and the order of sibling visits can't be influenced. What do you think? Thank you!

@serhiy-storchaka
Copy link
Member

If os.walk(topdown=False) reverses the order of visiting directories, this should be fixed.

@barneygale
Copy link
Contributor Author

It presently visits sibling directories in the same order they appear in dirnames.

What I'm proposing here is that we don't need to visit sibling directories in the same order they appear in dirnames in bottom-up mode, because:

  1. The order is arbitrary (it comes from os.scandir())
  2. Users aren't able to influence the order mid-walk by re-ordering dirnames (this is only possible in top-down mode)

Perhaps a worked example would be helpful? I can put one together.

@barneygale
Copy link
Contributor Author

barneygale commented Jul 8, 2024

Here's an example, first with the current main:

>>> os.mkdir('mydir')
>>> os.mkdir('mydir/subdir1')
>>> os.mkdir('mydir/subdir2')
>>> for entry in os.walk('mydir', topdown=False):
...     print(entry)
...     
('mydir/subdir2', [], [])
('mydir/subdir1', [], [])
('mydir', ['subdir2', 'subdir1'], [])

And with this PR:

>>> for entry in os.walk('mydir', topdown=False):
...     print(entry)
...     
('mydir/subdir1', [], [])
('mydir/subdir2', [], [])
('mydir', ['subdir2', 'subdir1'], [])

Note how the entry for subdir1 is now yielded before subdir2, and how this no longer corresponds to the order in dirnames for mydir.

If we think the new behaviour is acceptable (e.g. because we never guaranteed the order of sibling visits) then it unlocks an optimization: we can add paths to the stack while scanning directories rather than in a separate loop afterwards.

@serhiy-storchaka
Copy link
Member

I think that it would be confusing if os.walk(), grep.grep(), or manual recursive use of os.listdir() and os.scandir() produce different results. The order depends on the file system, but at least it is consistent.

The fact that the current code needs to reverse the lists of directories is just an implementation detail. It is not exposed to the user. If the code used deque instead of a list, it would not need to reverse them.

BTW, how much it makes difference? What it in comparison of using deque?

@barneygale
Copy link
Contributor Author

That makes sense, thanks.

BTW, how much it makes difference?

Taking into account the speedups in GH-119186 and GH-121433, it's about 1% for os.walk() and 6% for os.fwalk(). Nothing special.

What it in comparison of using deque?

I'll get back to you.

@barneygale
Copy link
Contributor Author

Hm, the deque would help with a breadth-first implementation, but I don't think it helps here without a lot of code changes. I'll close this PR and its two dependents.

@barneygale barneygale closed this Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants