-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-101362: Optimise pathlib by deferring path normalisation #101560
Conversation
`PurePath` now normalises and splits paths only when necessary, e.g. when `.name` or `.parent` is accessed. The result is cached. This speeds up path object construction by around 4x. `PurePath.__fspath__()` now returns an unnormalised path, which should be transparent to filesystem APIs (else pathlib's normalisation is broken!). This extends the earlier performance improvement to most impure `Path` methods, and also speeds up pickling, `p.joinpath('bar')` and `p / 'bar'`. This also fixes pythonGH-76846 and pythonGH-85281 by unifying path constructors and adding an `__init__()` method.
Constructing path objects is up to 4x faster with one argument: $ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'PurePath("foo/bar")'
1000000 loops, best of 5: 2.01 usec per loop # before
1000000 loops, best of 5: 495 nsec per loop # after More than 2x faster with two arguments: $ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'PurePath("foo", "bar")'
1000000 loops, best of 5: 2.28 usec per loop # before
1000000 loops, best of 5: 1.02 usec per loop # after ~~And ~25% faster when joining arguments:~~ [edit: no longer true! ] $ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath; p = PurePath("foo")' 'p.joinpath("bar")'
1000000 loops, best of 5: 1.66 usec per loop # before
1000000 loops, best of 5: 1.3 usec per loop # after But it's 12% slower when the path needs normalization, as with $ ./python -m timeit -n 1000000 -s 'from pathlib import PurePath' 'str(PurePath("foo/bar"))'
1000000 loops, best of 5: 2.96 usec per loop # before
1000000 loops, best of 5: 3.31 usec per loop # after
[edit: resolved! see comment] $ ./python -m timeit -n 20 -s 'from pathlib import Path' 'list(Path().rglob("*"))'
20 loops, best of 5: 53.4 msec per loop # before
20 loops, best of 5: 66.5 msec per loop # after
[edit: no longer true! this can't be properly fixed until other stuff lands] $ ./python -m timeit -n 100000 -s 'from pathlib import Path' 'Path("README.rst").read_text()'
100000 loops, best of 5: 26.1 usec per loop # before
100000 loops, best of 5: 21.2 usec per loop # after
$ ./python -m timeit -n 100000 -s 'from pathlib import Path' 'Path("README.rst").exists()'
100000 loops, best of 5: 5.45 usec per loop # before
100000 loops, best of 5: 2.97 usec per loop # after |
I've found a couple other small optimizations which are best tackled in other PRs, so I'm marking this PR as a 'draft' for now. |
I've undone the change to
Still a tiny bit slower than pre-PR. The rest of the speedups/slowdowns mentioned in my previous comment are still there. |
The change to I think I need to solve that issue first, so I'm going to mark this PR as a draft (again!) |
This PR has strayed too far from the original implementation. I'm going to abandon it. New PR here: |
PurePath
now normalises and splits paths only when necessary, e.g. when.name
or.parent
is accessed. The result is cached. This speeds up path object construction by around 4x.edit: will fix separately.PurePath.__fspath__()
now returns an unnormalised path, which should be transparent to filesystem APIs (else pathlib's normalisation is broken!). This extends the earlier performance improvement to most impurePath
methods, and also speeds upp.joinpath('bar')
andp / 'bar'
.This also fixes GH-76846 and GH-85281 by unifying path constructors and adding anedit: will fix separately.__init__()
method.