Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urllib.parse.urljoin is broken in python 3.5 #69589

Open
PavelIvanov mannequin opened this issue Oct 14, 2015 · 8 comments
Open

urllib.parse.urljoin is broken in python 3.5 #69589

PavelIvanov mannequin opened this issue Oct 14, 2015 · 8 comments
Assignees
Labels
3.14 new features, bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@PavelIvanov
Copy link
Mannequin

PavelIvanov mannequin commented Oct 14, 2015

BPO 25403
Nosy @orsenthil, @ezio-melotti, @berkerpeksag, @vadmium, @kilowu, @tirkarthi, @iritkatriel

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-10-14.11:49:00.323>
labels = ['type-bug', 'library']
title = 'urllib.parse.urljoin is broken in python 3.5'
updated_at = <Date 2021-12-02.11:20:59.906>
user = 'https://bugs.python.org/PavelIvanov'

bugs.python.org fields:

activity = <Date 2021-12-02.11:20:59.906>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2015-10-14.11:49:00.323>
creator = 'Pavel Ivanov'
dependencies = []
files = []
hgrepos = []
issue_num = 25403
keywords = ['3.5regression']
message_count = 5.0
messages = ['252986', '253031', '253032', '253067', '407502']
nosy_count = 8.0
nosy_names = ['orsenthil', 'ezio.melotti', 'berker.peksag', 'martin.panter', 'kilowu', 'Pavel Ivanov', 'xtreak', 'iritkatriel']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue25403'
versions = ['Python 3.5', 'Python 3.6']

Linked PRs

@PavelIvanov
Copy link
Mannequin Author

PavelIvanov mannequin commented Oct 14, 2015

urllib.parse.urljoin does not conform the RFC 1808 in case of joining relative URL’s containing ‘..’ path components anymore.

Examples:

Python 3.4: 
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com/..'
Python 3.5:
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com/'

Python 3.4: 
>>> urllib.parse.urljoin('a/’, '..')
''
Python 3.5:
>>> urllib.parse.urljoin('a/', '..')
'/'

Python 3.4: 
>>> urllib.parse.urljoin('a/’, '../..')
'..'
Python 3.5:
>>> urllib.parse.urljoin('a/', '../..')
'/'

Python 3.4 conforms RFC 1808 in these scenarios, but Python 3.5 does not.

@PavelIvanov PavelIvanov mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Oct 14, 2015
@kilowu
Copy link
Mannequin

kilowu mannequin commented Oct 15, 2015

It's a change made in 3.5 that resolution of relative URLs confirms to the RFC 3986. See https://bugs.python.org/issue22118 for details.

@vstinner
Copy link
Member

See also this change:

changeset: 95683:fc0e79387a3a
user: Berker Peksag <berker.peksag@gmail.com>
date: Thu Apr 16 02:31:14 2015 +0300
files: Lib/test/test_urlparse.py Lib/urllib/parse.py Misc/NEWS
description:
Issue bpo-23703: Fix a regression in urljoin() introduced in 901e4e52b20a.

Patch by Demian Brecht.

@vadmium
Copy link
Member

vadmium commented Oct 16, 2015

It is true that 3.5 is meant to follow RFC 3986, which obsoletes RFC 1808 and specifies slightly different behaviour for abnormal cases. This change is documented under urljoin(), and also in “What’s New in 3.5”. Pavel’s first case is one of these differences in the RFCs, and I don’t think it is a bug. According to <https://tools.ietf.org/html/rfc3986.html#section-5.2.4\>,

“The remove_dot_segments algorithm respects [the base’s] hierarchy by removing extra dot-segments rather than treating them as an error or leaving them to be misinterpreted by dereference implementations.”

For Pavel’s second and third cases, RFC 3986 doesn’t cover them directly because the base URL is relative. The RFC only covers absolute base URLs, which start with a scheme like “http:”. The documentation doesn’t really bless these cases either: ‘Construct a full (“absolute”) URL’. However there is explicit support in the source code ("" in urllib.parse.uses_relative).

It looks like 3.5 is strict in following the RFC’s Remove Dot Segments algorithm. Step 2C says that for “/../” or “/..”, the parent segment is removed, but the input is always replaced with “/”:

“a/..” → “/”
“a/../..” → “/..” → “/”

I would prefer a less strict interpretation of the spirit of the algorithm. Do not introduce a slash in the input if you did not remove one from the output buffer:

“a/..” → empty URL
“a/../..” → “..” → empty URL

Python 3.4 and earlier did not behave sensibly if you extend the relative URL:

>>> urljoin("a/", "..")
''
>>> urljoin("a/", "../..")
'..'
>>> urljoin("a/", "../../..")
''
>>> urljoin("a/", "../../../..")
'../'

Pavel, what behaviour would you expect in these cases? My empty URL interpretation, or perhaps a more sensible version of the Python 3.4 behaviour? What is your use case?

One related more serious (IMO) regression I noticed compared to 3.4, where the path becomes a host name:

>>> urljoin("file:///base", "/dummy/..//host/oops")
'file://host/oops'

@vadmium vadmium removed the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Oct 16, 2015
@iritkatriel
Copy link
Member

iritkatriel commented Dec 1, 2021

See also bpo-37235, bpo-40594.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@savannahostrowski
Copy link
Member

Trying this out on main (3.14.0 alpha 0), this now behaves as follows:

import urllib.parse

print(urllib.parse.urljoin('http://a.com', '..')) # http://a.com/
print(urllib.parse.urljoin("a/", "..")) # ..
print(urllib.parse.urljoin('a/', '../..')) # ../..

This is neither aligned with 3.4 nor 3.5 output, as listed above.

3.4 and 3.5 are both very much EOL, but I'm not totally sure if 3.14's behavior aligns with the RFC...so I will leave it open unless someone with the expertise here says otherwise. If it's left open, perhaps the title should be updated to reflect that this applies to more current versions.

@serhiy-storchaka
Copy link
Member

RFC 3986 only covers absolute base URI. For relative base URI we can implement what looks most useful to us.

The bug mentioned in #69589 (comment) has been fixed:

>>> urljoin("file:///base", "/dummy/..//host/oops")
'file:////host/oops'

The results of other examples is aligned with 3.5 (I think that a temporary difference was a bug #125926).

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Nov 11, 2024
* Preserve double slashes in path.
* Fix the case when the base path is relative and the relative reference
  path starts with '..'.
@serhiy-storchaka
Copy link
Member

I was wrong, The algorithms in RFC 3986 define also behavior for the case when the base path is relative.

#126679 should fix urljoin() for these corner cases.

>>> import urllib.parse
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com'
>>> urllib.parse.urljoin('a/', '..')
''
>>> urllib.parse.urljoin('a/', '../..')
''
>>> urllib.parse.urljoin('a/', '../../..')
''
>>> urllib.parse.urljoin('a/', '../../../..')
''

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Nov 11, 2024
…e.urljoin()

* Preserve double slashes in path.
* Fix the case when the base path is relative and the relative reference
  path starts with '..'.
@serhiy-storchaka serhiy-storchaka added the 3.14 new features, bugs and security fixes label Nov 11, 2024
@serhiy-storchaka serhiy-storchaka self-assigned this Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.14 new features, bugs and security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants