urllib.parse.urljoin is broken in python 3.5 #69589
urllib.parse.urljoin no longer conforms to RFC 1808 when joining relative URLs that contain '..' path components. Examples:
Python 3.4:
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com/..'
Python 3.5:
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com/'
Python 3.4:
>>> urllib.parse.urljoin('a/', '..')
''
Python 3.5:
>>> urllib.parse.urljoin('a/', '..')
'/'
Python 3.4:
>>> urllib.parse.urljoin('a/', '../..')
'..'
Python 3.5:
>>> urllib.parse.urljoin('a/', '../..')
'/'
Python 3.4 conforms to RFC 1808 in these scenarios, but Python 3.5 does not.
This is a change made in 3.5: resolution of relative URLs now conforms to RFC 3986. See https://bugs.python.org/issue22118 for details.
See also this change: changeset 95683:fc0e79387a3a. Patch by Demian Brecht.
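To make the RFC 3986 behaviour mentioned above concrete, here is a small sketch that runs a few of the reference resolution examples from RFC 3986 section 5.4 through urljoin(). The base URL and the expected values are taken from the RFC; the names BASE, RFC_EXAMPLES and check_rfc_examples are only illustrative. On 3.5 and later these should all match, but treat that as something to verify on your interpreter rather than a guarantee.

```python
from urllib.parse import urljoin

# Base URI used throughout RFC 3986 section 5.4.
BASE = "http://a/b/c/d;p?q"

# Relative reference -> result listed in RFC 3986 section 5.4.
# The last entry is one of the RFC's "abnormal" examples: it has more
# ".." segments than the base path has parent directories.
RFC_EXAMPLES = {
    "g": "http://a/b/c/g",
    "../g": "http://a/b/g",
    "../../../g": "http://a/g",
}


def check_rfc_examples():
    """Return (reference, actual, expected) triples for each example."""
    return [(ref, urljoin(BASE, ref), want) for ref, want in RFC_EXAMPLES.items()]


for ref, actual, expected in check_rfc_examples():
    status = "matches RFC" if actual == expected else "differs from RFC"
    print(f"{ref!r}: {actual!r} ({status})")
```

The abnormal example is the kind of case where RFC 1808 and RFC 3986 disagree: RFC 3986 drops the surplus ".." segments instead of leaving them in the result.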
It is true that 3.5 is meant to follow RFC 3986, which obsoletes RFC 1808 and specifies slightly different behaviour for abnormal cases. This change is documented under urljoin(), and also in “What’s New in 3.5”.

Pavel’s first case is one of these differences in the RFCs, and I don’t think it is a bug. According to <https://tools.ietf.org/html/rfc3986.html#section-5.2.4>, “The remove_dot_segments algorithm respects [the base’s] hierarchy by removing extra dot-segments rather than treating them as an error or leaving them to be misinterpreted by dereference implementations.”

For Pavel’s second and third cases, RFC 3986 doesn’t cover them directly because the base URL is relative. The RFC only covers absolute base URLs, which start with a scheme like “http:”. The documentation doesn’t really bless these cases either: ‘Construct a full (“absolute”) URL’. However, there is explicit support in the source code ("" in urllib.parse.uses_relative).

It looks like 3.5 is strict in following the RFC’s Remove Dot Segments algorithm. Step 2C says that for “/../” or “/..”, the parent segment is removed, but the input is always replaced with “/”:
“a/..” → “/”
I would prefer a less strict interpretation of the spirit of the algorithm. Do not introduce a slash in the input if you did not remove one from the output buffer:
“a/..” → empty URL

Python 3.4 and earlier did not behave sensibly if you extend the relative URL:
>>> urljoin("a/", "..")
''
>>> urljoin("a/", "../..")
'..'
>>> urljoin("a/", "../../..")
''
>>> urljoin("a/", "../../../..")
'../'
Pavel, what behaviour would you expect in these cases? My empty URL interpretation, or perhaps a more sensible version of the Python 3.4 behaviour? What is your use case?

One related, more serious (IMO) regression I noticed compared to 3.4, where the path becomes a host name:
>>> urljoin("file:///base", "/dummy/..//host/oops")
'file://host/oops'
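For reference, the Remove Dot Segments steps of RFC 3986 section 5.2.4 discussed in the comment above can be sketched in Python roughly as follows. This is an illustrative transcription of the RFC's pseudocode, not the code used in urllib.parse; the function name remove_dot_segments is borrowed from the RFC, and the letters in the comments refer to the RFC's steps 2A to 2E.

```python
def remove_dot_segments(path):
    """Literal sketch of RFC 3986 section 5.2.4 (not urllib's implementation)."""
    output = []  # each entry is one segment, including its leading "/" if any
    inp = path
    while inp:
        # 2A: drop a leading "../" or "./"
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        # 2B: replace a leading "/./" (or a complete "/.") with "/"
        elif inp.startswith("/./"):
            inp = inp[2:]
        elif inp == "/.":
            inp = "/"
        # 2C: replace a leading "/../" (or a complete "/..") with "/"
        #     and remove the last output segment, if there is one
        elif inp.startswith("/../"):
            inp = inp[3:]
            if output:
                output.pop()
        elif inp == "/..":
            inp = "/"
            if output:
                output.pop()
        # 2D: a lone "." or ".." is removed
        elif inp in (".", ".."):
            inp = ""
        # 2E: move the first segment (with its leading "/", if any) to the output
        else:
            next_slash = inp.find("/", 1)
            if next_slash == -1:
                output.append(inp)
                inp = ""
            else:
                output.append(inp[:next_slash])
                inp = inp[next_slash:]
    return "".join(output)


# The two worked examples given in RFC 3986 section 5.2.4:
print(remove_dot_segments("/a/b/c/./../../g"))    # /a/g
print(remove_dot_segments("mid/content=5/../6"))  # mid/6
# The relative case discussed above: step 2C always leaves a "/" behind.
print(remove_dot_segments("a/.."))                # /
```

The last call shows where the “a/..” → “/” result comes from: step 2C removes the segment "a" from the output buffer but still replaces the input with "/", which then becomes the whole result.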
Trying this out on main (3.14.0 alpha 0), this now behaves as follows:
import urllib.parse
print(urllib.parse.urljoin('http://a.com', '..')) # http://a.com/
print(urllib.parse.urljoin("a/", "..")) # ..
print(urllib.parse.urljoin('a/', '../..')) # ../..
This is aligned with neither the 3.4 nor the 3.5 output listed above. 3.4 and 3.5 are both very much EOL, but I'm not totally sure whether 3.14's behavior aligns with the RFC, so I will leave it open unless someone with the expertise here says otherwise. If it's left open, perhaps the title should be updated to reflect that this applies to more current versions.
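Since the results differ between releases, a small script that prints the interpreter version next to each result makes it easier to compare runs. This is only a convenience sketch: CASES collects inputs quoted in this thread, and run_cases is a hypothetical helper name. No expected outputs are given because they depend on the Python version.

```python
import sys
from urllib.parse import urljoin

# (base, relative reference) pairs quoted in this thread.
CASES = [
    ("http://a.com", ".."),
    ("a/", ".."),
    ("a/", "../.."),
    ("a/", "../../.."),
    ("a/", "../../../.."),
    ("file:///base", "/dummy/..//host/oops"),
]


def run_cases(cases=CASES):
    """Return (base, reference, joined) for every case on this interpreter."""
    return [(base, ref, urljoin(base, ref)) for base, ref in cases]


print(sys.version)
for base, ref, joined in run_cases():
    print(f"urljoin({base!r}, {ref!r}) -> {joined!r}")
```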
RFC 3986 only covers absolute base URIs. For a relative base URI we can implement whatever looks most useful to us.
The bug mentioned in #69589 (comment) has been fixed:
>>> urljoin("file:///base", "/dummy/..//host/oops")
'file:////host/oops'
The results of the other examples are aligned with 3.5 (I think the temporary difference was a bug, #125926).
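To see why the two results differ in meaning, it can help to split each URL and look at where "host" ends up. The sketch below only uses urllib.parse.urlsplit; the helper name describe is illustrative.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return the (netloc, path) pair that urlsplit reports for url."""
    parts = urlsplit(url)
    return parts.netloc, parts.path


for url in ("file://host/oops", "file:////host/oops"):
    netloc, path = describe(url)
    print(f"{url!r}: netloc={netloc!r}, path={path!r}")
```

In the first URL "host" is parsed as the authority (netloc), which is the regression described earlier. In the second the authority is empty and "//host/oops" stays in the path, so the double slash is preserved without turning a path segment into a host name.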
* Preserve double slashes in path.
* Fix the case when the base path is relative and the relative reference path starts with '..'.
I was wrong. The algorithms in RFC 3986 also define behavior for the case when the base path is relative. #126679 should fix this:
>>> import urllib.parse
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com'
>>> urllib.parse.urljoin('a/', '..')
''
>>> urllib.parse.urljoin('a/', '../..')
''
>>> urllib.parse.urljoin('a/', '../../..')
''
>>> urllib.parse.urljoin('a/', '../../../..')
''