urllib.parse.urljoin is broken in python 3.5 #69589

PavelIvanov · 2015-10-14T11:49:00Z

BPO	25403
Nosy	@orsenthil, @ezio-melotti, @berkerpeksag, @vadmium, @kilowu, @tirkarthi, @iritkatriel

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-10-14.11:49:00.323>
labels = ['type-bug', 'library']
title = 'urllib.parse.urljoin is broken in python 3.5'
updated_at = <Date 2021-12-02.11:20:59.906>
user = 'https://bugs.python.org/PavelIvanov'

bugs.python.org fields:

activity = <Date 2021-12-02.11:20:59.906>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2015-10-14.11:49:00.323>
creator = 'Pavel Ivanov'
dependencies = []
files = []
hgrepos = []
issue_num = 25403
keywords = ['3.5regression']
message_count = 5.0
messages = ['252986', '253031', '253032', '253067', '407502']
nosy_count = 8.0
nosy_names = ['orsenthil', 'ezio.melotti', 'berker.peksag', 'martin.panter', 'kilowu', 'Pavel Ivanov', 'xtreak', 'iritkatriel']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue25403'
versions = ['Python 3.5', 'Python 3.6']

Linked PRs

gh-69589, gh-84774: Fix path normalization in urllib.parse.urljoin() #126679

PavelIvanov · 2015-10-14T11:49:00Z

urllib.parse.urljoin does not conform the RFC 1808 in case of joining relative URL’s containing ‘..’ path components anymore.

Examples:

Python 3.4: 
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com/..'
Python 3.5:
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com/'

Python 3.4: 
>>> urllib.parse.urljoin('a/’, '..')
''
Python 3.5:
>>> urllib.parse.urljoin('a/', '..')
'/'

Python 3.4: 
>>> urllib.parse.urljoin('a/’, '../..')
'..'
Python 3.5:
>>> urllib.parse.urljoin('a/', '../..')
'/'

Python 3.4 conforms RFC 1808 in these scenarios, but Python 3.5 does not.

kilowu · 2015-10-15T07:00:23Z

It's a change made in 3.5 that resolution of relative URLs confirms to the RFC 3986. See https://bugs.python.org/issue22118 for details.

vstinner · 2015-10-15T07:05:48Z

See also this change:

changeset: 95683:fc0e79387a3a
user: Berker Peksag <berker.peksag@gmail.com>
date: Thu Apr 16 02:31:14 2015 +0300
files: Lib/test/test_urlparse.py Lib/urllib/parse.py Misc/NEWS
description:
Issue bpo-23703: Fix a regression in urljoin() introduced in 901e4e52b20a.

Patch by Demian Brecht.

vadmium · 2015-10-16T02:19:29Z

It is true that 3.5 is meant to follow RFC 3986, which obsoletes RFC 1808 and specifies slightly different behaviour for abnormal cases. This change is documented under urljoin(), and also in “What’s New in 3.5”. Pavel’s first case is one of these differences in the RFCs, and I don’t think it is a bug. According to <https://tools.ietf.org/html/rfc3986.html#section-5.2.4\>,

“The remove_dot_segments algorithm respects [the base’s] hierarchy by removing extra dot-segments rather than treating them as an error or leaving them to be misinterpreted by dereference implementations.”

For Pavel’s second and third cases, RFC 3986 doesn’t cover them directly because the base URL is relative. The RFC only covers absolute base URLs, which start with a scheme like “http:”. The documentation doesn’t really bless these cases either: ‘Construct a full (“absolute”) URL’. However there is explicit support in the source code ("" in urllib.parse.uses_relative).

It looks like 3.5 is strict in following the RFC’s Remove Dot Segments algorithm. Step 2C says that for “/../” or “/..”, the parent segment is removed, but the input is always replaced with “/”:

“a/..” → “/”
“a/../..” → “/..” → “/”

I would prefer a less strict interpretation of the spirit of the algorithm. Do not introduce a slash in the input if you did not remove one from the output buffer:

“a/..” → empty URL
“a/../..” → “..” → empty URL

Python 3.4 and earlier did not behave sensibly if you extend the relative URL:

>>> urljoin("a/", "..")
''
>>> urljoin("a/", "../..")
'..'
>>> urljoin("a/", "../../..")
''
>>> urljoin("a/", "../../../..")
'../'

Pavel, what behaviour would you expect in these cases? My empty URL interpretation, or perhaps a more sensible version of the Python 3.4 behaviour? What is your use case?

One related more serious (IMO) regression I noticed compared to 3.4, where the path becomes a host name:

>>> urljoin("file:///base", "/dummy/..//host/oops")
'file://host/oops'

iritkatriel · 2021-12-01T23:03:54Z

See also bpo-37235, bpo-40594.

savannahostrowski · 2024-09-14T22:18:43Z

Trying this out on main (3.14.0 alpha 0), this now behaves as follows:

import urllib.parse

print(urllib.parse.urljoin('http://a.com', '..')) # http://a.com/
print(urllib.parse.urljoin("a/", "..")) # ..
print(urllib.parse.urljoin('a/', '../..')) # ../..

This is neither aligned with 3.4 nor 3.5 output, as listed above.

3.4 and 3.5 are both very much EOL, but I'm not totally sure if 3.14's behavior aligns with the RFC...so I will leave it open unless someone with the expertise here says otherwise. If it's left open, perhaps the title should be updated to reflect that this applies to more current versions.

serhiy-storchaka · 2024-11-07T07:47:50Z

RFC 3986 only covers absolute base URI. For relative base URI we can implement what looks most useful to us.

The bug mentioned in #69589 (comment) has been fixed:

>>> urljoin("file:///base", "/dummy/..//host/oops")
'file:////host/oops'

The results of other examples is aligned with 3.5 (I think that a temporary difference was a bug #125926).

* Preserve double slashes in path. * Fix the case when the base path is relative and the relative reference path starts with '..'.

serhiy-storchaka · 2024-11-11T11:29:54Z

I was wrong, The algorithms in RFC 3986 define also behavior for the case when the base path is relative.

#126679 should fix urljoin() for these corner cases.

>>> import urllib.parse
>>> urllib.parse.urljoin('http://a.com', '..')
'http://a.com'
>>> urllib.parse.urljoin('a/', '..')
''
>>> urllib.parse.urljoin('a/', '../..')
''
>>> urllib.parse.urljoin('a/', '../../..')
''
>>> urllib.parse.urljoin('a/', '../../../..')
''

…e.urljoin() * Preserve double slashes in path. * Fix the case when the base path is relative and the relative reference path starts with '..'.

PavelIvanov mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Oct 14, 2015

vadmium removed the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Oct 16, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022

bedevere-app bot mentioned this issue Nov 11, 2024

gh-69589, gh-84774: Fix path normalization in urllib.parse.urljoin() #126679

Open

serhiy-storchaka mentioned this issue Nov 11, 2024

3.14.0a1 urljoin regression wrt // #125974

Open

serhiy-storchaka added the 3.14 new features, bugs and security fixes label Nov 11, 2024

serhiy-storchaka self-assigned this Nov 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

urllib.parse.urljoin is broken in python 3.5 #69589

urllib.parse.urljoin is broken in python 3.5 #69589

PavelIvanov mannequin commented Oct 14, 2015 •

edited by bedevere-app bot

Loading

PavelIvanov mannequin commented Oct 14, 2015

kilowu mannequin commented Oct 15, 2015

vstinner commented Oct 15, 2015

vadmium commented Oct 16, 2015

iritkatriel commented Dec 1, 2021 •

edited by serhiy-storchaka

Loading

savannahostrowski commented Sep 14, 2024

serhiy-storchaka commented Nov 7, 2024

serhiy-storchaka commented Nov 11, 2024

urllib.parse.urljoin is broken in python 3.5 #69589

urllib.parse.urljoin is broken in python 3.5 #69589

Comments

PavelIvanov mannequin commented Oct 14, 2015 • edited by bedevere-app bot Loading

Linked PRs

PavelIvanov mannequin commented Oct 14, 2015

kilowu mannequin commented Oct 15, 2015

vstinner commented Oct 15, 2015

vadmium commented Oct 16, 2015

iritkatriel commented Dec 1, 2021 • edited by serhiy-storchaka Loading

savannahostrowski commented Sep 14, 2024

serhiy-storchaka commented Nov 7, 2024

serhiy-storchaka commented Nov 11, 2024

PavelIvanov mannequin commented Oct 14, 2015 •

edited by bedevere-app bot

Loading

iritkatriel commented Dec 1, 2021 •

edited by serhiy-storchaka

Loading