Problems on Windows with username or hostname containing non-ASCII characters #3463
Comments
Hello, what is the pip version? |
On my machine I got 8.0.2. The other setup I saw had a fresh Python 2.7.11 installation. Don't remember pip version, but it must have been the one bundled with the Python 2.7.11 Windows installer. |
In the past, a problem with non-ASCII usernames on Windows was raised in issue #1713. In that case the traceback was also pointing to |
Faced this again on a new training course. This time username had no non-ASCII characters, but Windows had Finnish locale and thus used This time our workaround was downloading the package and using Would someone be interested to help me with the fix/PR if I try to debug why this actually happens? What timeline would we have to get the fix into Python 2.7.12? This is a very annoying bug and gives new users a bad first impression about pip and Python. It must be fixed in a version distributed with Python because the bug prevents upgrading pip itself. |
Pure speculation, but that sounds like it might be where pip is trying to put the download cache. Maybe as a workaround you could try using the |
Yes,
Based on this analysis the real bug would be in optparse. Not sure what's the best way to avoid it on pip side. |
Yep, that analysis is exactly right. It's frustrating, because we don't even try to display the default value. It's also only an issue on Python 2 (of course... :-() A bug report against cpython might be worth it. But it'd be nice to work around it in pip, as "upgrade to 2.7.12" seems like a bit of a heavy handed recommendation for this. The only fix I can see within pip is to set the default for |
Just realized that it is likely that there are also other problems related to this. My students got errors when installing a package, not when showing help like I do, and I think their tracebacks were also different than what I get. Unfortunately I cannot reproduce that problem nor did I ask them to send me tracebacks. I can try creating a new Windows virtual machine with the main user having a non-ASCII user name and can also play with locale settings. We also continue the latest training next Monday and I can debug this in my student's machine then. |
Yeah, this isn't just a problem with optparse. I installed new Windows 7 machine with main user Päivi, installed Python 2.7.11 and got the following traceback when running Exception:
Traceback (most recent call last):
File "c:\python27\lib\site-packages\pip\basecommand.py", line 211, in main
status = self.run(options, args)
File "c:\python27\lib\site-packages\pip\commands\install.py", line 294, in run
requirement_set.prepare_files(finder)
File "c:\python27\lib\site-packages\pip\req\req_set.py", line 334, in prepare_files
functools.partial(self._prepare_file, finder))
File "c:\python27\lib\site-packages\pip\req\req_set.py", line 321, in _walk_req_to_install
more_reqs = handler(req_to_install)
File "c:\python27\lib\site-packages\pip\req\req_set.py", line 461, in _prepare_file
req_to_install.populate_link(finder, self.upgrade)
File "c:\python27\lib\site-packages\pip\req\req_install.py", line 250, in populate_link
self.link = finder.find_requirement(self, upgrade)
File "c:\python27\lib\site-packages\pip\index.py", line 486, in find_requirement
all_versions = self._find_all_versions(req.name)
File "c:\python27\lib\site-packages\pip\index.py", line 404, in _find_all_versions
index_locations = self._get_index_urls_locations(project_name)
File "c:\python27\lib\site-packages\pip\index.py", line 378, in _get_index_urls_locations
page = self._get_page(main_index_url)
File "c:\python27\lib\site-packages\pip\index.py", line 818, in _get_page
return HTMLPage.get_page(link, session=self.session)
File "c:\python27\lib\site-packages\pip\index.py", line 928, in get_page
"Cache-Control": "max-age=600",
File "c:\python27\lib\site-packages\pip\_vendor\requests\sessions.py", line 477, in get
return self.request('GET', url, **kwargs)
File "c:\python27\lib\site-packages\pip\download.py", line 373, in request
return super(PipSession, self).request(method, url, *args, **kwargs)
File "c:\python27\lib\site-packages\pip\_vendor\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "c:\python27\lib\site-packages\pip\_vendor\requests\sessions.py", line 605, in send
r.content
File "c:\python27\lib\site-packages\pip\_vendor\requests\models.py", line 750, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "c:\python27\lib\site-packages\pip\_vendor\requests\models.py", line 673, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "c:\python27\lib\site-packages\pip\_vendor\requests\packages\urllib3\response.py", line 307, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "c:\python27\lib\site-packages\pip\_vendor\requests\packages\urllib3\response.py", line 243, in read
data = self._fp.read(amt)
File "c:\python27\lib\site-packages\pip\_vendor\cachecontrol\filewrapper.py", line 54, in read
self.__callback(self.__buf.getvalue())
File "c:\python27\lib\site-packages\pip\_vendor\cachecontrol\controller.py", line 244, in cache_response
self.serializer.dumps(request, response, body=body),
File "c:\python27\lib\site-packages\pip\download.py", line 276, in set
return super(SafeFileCache, self).set(*args, **kwargs)
File "c:\python27\lib\site-packages\pip\_vendor\cachecontrol\caches\file_cache.py", line 99, in set
with self.lock_class(name) as lock:
File "c:\python27\lib\site-packages\pip\_vendor\lockfile\mkdirlockfile.py", line 18, in __init__
LockBase.__init__(self, path, threaded, timeout)
File "c:\python27\lib\site-packages\pip\_vendor\lockfile\__init__.py", line 189, in __init__
hash(self.path)))
File "c:\python27\lib\ntpath.py", line 85, in join
result_path = result_path + p_path
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128) |
Also tested that using |
OK, I suspect this is just symptomatic of the usual "Python 2.7 is rubbish at working with Unicode" situation. It's possible that changing This probably needs someone with Python 2/Unicode experience on Windows to look into it. I'm a Python 3 person myself, so I can't really offer much more. Sorry. At least you have a workaround, which is something I guess... |
A little more debugging. The problem occurs on
self.unique_name = os.path.join(dirname,
                                "%s%s.%s%s" % (self.hostname,
                                               self.tname,
                                               self.pid,
                                               hash(self.path)))
The reason is that
u'c:\\users\\p\xe4ivi\\appdata\\local\\pip\\cache\\http\\3\\f\\b\\1\\f'
'P\xe4ivi-PC'
This would explain why the problem doesn't occur in my main Windows virtual machine, which has only ASCII characters in its host name. Unfortunately the default machine name Windows creates is based on the user name (like 'Päivi-PC' in my case). |
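To make the failure concrete, here is a minimal sketch of what happens when the Unicode cache directory meets the byte-string hostname. It is written for Python 3, where text and bytes never mix implicitly, so the implicit ASCII decode that Python 2 performs is spelled out by hand. The function name `py2_style_concat` is made up for this illustration and is not part of pip or lockfile.

```python
def py2_style_concat(text, raw):
    """Emulate Python 2's `unicode + str`: the bytes are decoded as ASCII first."""
    return text + raw.decode("ascii")


# Values taken from the debugging output above.
dirname = u"c:\\users\\p\xe4ivi\\appdata\\local\\pip\\cache\\http\\3\\f\\b\\1\\f"
hostname = b"P\xe4ivi-PC"  # byte string containing a non-ASCII byte

try:
    py2_style_concat(dirname, hostname)
    failure = None
except UnicodeDecodeError as exc:
    # Keep a reference; Python 3 unbinds `exc` when the except block ends.
    failure = exc

if failure is not None:
    print(failure)
```

The error reports the same byte (0xe4) at the same position (1) as the traceback above, which is consistent with the hostname, not the username in the path, being the value that fails to decode.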
Do you @pfmoore have any Windows/Python2/Unicode gurus in the pip team? You are right that |
Fixing lockfile lib might be difficult as it's deprecated and pip has policy of bundling only released versions. The question is if after fixing lockfile we could have another release of lockfile. From https://pypi.python.org/pypi/lockfile
|
@pekkaklarck On reflection, although this is triggered by Windows usernames with non-ASCII characters, I guess it's a platform-independent issue. So it's really just an issue of Python 2's Unicode model. There seem to be multiple issues here though - the issue with help is because optparse tries to interpolate the default value of |
We should be able to fix the optparse issue by avoiding the default value like @pfmoore said I think yes? The lockfile issue is trickier :( There's a replacement for lockfile but CacheControl can't use it yet so it's likely that fixing it would require first updating CacheControl to use the new lockfile replacement, then possible fixing that lockfile replacement if it has the same problem. |
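A rough sketch of the "avoid the default value" idea, using the standard library's `optparse` directly. This is an assumption about how such a workaround could look, not pip's actual code: the option name `--cache-dir`, the program name, and the helper `parse_cache_dir` are chosen for illustration only. The parser is given `None` as the default so it never handles the non-ASCII path, and the real default is filled in after parsing.

```python
import optparse


def parse_cache_dir(argv, compute_default):
    """Parse argv; resolve the real default only after optparse is done."""
    parser = optparse.OptionParser(prog="pip-demo")
    parser.add_option(
        "--cache-dir",
        dest="cache_dir",
        metavar="dir",
        default=None,  # the parser never sees the real (possibly non-ASCII) path
        help="Store the cache data in <dir>.",
    )
    options, _ = parser.parse_args(argv)
    if options.cache_dir is None:
        options.cache_dir = compute_default()
    return options.cache_dir, parser.format_help()


cache_dir, help_text = parse_cache_dir(
    [], lambda: u"c:\\users\\p\xe4ivi\\appdata\\local\\pip\\cache"
)
```

With this shape, building the help text cannot fail on the path because the path is never stored in the parser's defaults.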
A little more debugging and testing. It seems to me that pip expects paths to be byte strings on Python 2. At least both OSX code and general UNIX code in the
# OSX:
path = os.path.expanduser("~/Library/Caches")
# UNIX:
path = os.getenv("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
I would assume Windows code returning Unicode from the same method is a bug. I already tested that simply encoding the Windows path with the MBCS codec fixes both
|
I could at least as reasonably argue that the other code paths not returning Unicode is a bug (we should be using Unicode internally, not passing round values that may be one or may be another). But I will concede that returning different things depending on the OS is a bug. That same argument does mean that returning different things depending on the Python version is at least questionable, if not just as much of a bug, though. Not having pip break for users is more important than arguing about Unicode purity, though. I just want to make sure we don't end up maintaining a fragile hack (and playing whack-a-mole with Unicode/str bugs throughout the code...) To answer your questions:
A question of my own - is it possible for user names, profile directories or whatever to not be encodable using the MBCS codec? You should probably use |
I think the ideal situation would be for pip to be all unicode internally. |
+1. |
All Unicode internally is definitely an ideal situation, but I'm not sure is that a practical target at the moment:
Due to the above reasons, I believe pip should use "native strings" (i.e. |
AFAIK, the MBCS codec is somehow dynamic and Windows ends up using whatever actual encoding the system uses. If that's the case, using it with data returned by Windows APIs ought to be safe. Both Jython and IronPython nowadays support pip, and at least we recommend using it in Robot Framework installation instructions also on those platforms. Not sure how widely pip actually is used on them, nor do I know what kind of changes could possibly break it. We can try summoning @jimbaker and @jdhardy here to comment. |
Should I create a PR fixing this by encoding Unicode paths returned by Windows APIs to bytes using MBCS codec on Python 2? Should I try also encoding with ASCII if MBCS is not found? Or should ASCII be used first and MBCS only if that fails? If everything fails, should there be an error or should the Unicode string be left through and hope for the best? |
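One possible answer to the ordering questions above, sketched as a small helper. Everything here is hypothetical: the name `encode_path_with_fallback`, the default codec order (MBCS first, then ASCII), and the choice to return the Unicode path unchanged when nothing works are assumptions made for illustration, not what any PR actually does. The sketch first checks that a codec exists, since `mbcs` is only available on Windows.

```python
import codecs


def encode_path_with_fallback(path, encodings=("mbcs", "ascii")):
    """Try each codec in order; return bytes on success, else the original path."""
    for name in encodings:
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # codec not available on this platform (e.g. mbcs off Windows)
        try:
            return path.encode(name)
        except UnicodeEncodeError:
            continue  # this codec cannot represent the path; try the next one
    # Nothing worked: let the Unicode string through and hope for the best.
    return path
```

Returning the original value as a last resort keeps today's behaviour for paths that cannot be encoded, so the helper would not introduce a new failure mode of its own. Raising an error instead would be the stricter alternative.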
Honestly, I'm not sure I'd want to take that approach. And without CI on Windows, having a Windows-specific change like that seems high-risk. Maybe the first thing to do would be to submit a PR containing some tests that demonstrate this bug. That may require getting Appveyor testing set up (or some pretty major monkeypatching of os.platform to exercise the code path on Unix...) That way, we'd have some means of being sure any fix works (and stays working) without needing someone on Windows with a non-ASCII username to check it out. |
I don't think this can really be tested outside Windows. The current functionality uses Windows API or registry, and neither of them work elsewhere. They could be mocked but then we'd be testing the mock. The fix should probably be implemented in a separate helper method converting Unicode paths to bytes. That could be tested directly, but the implementation would use the MBCS codec that's only available on Windows. |
I'm strongly against this. It goes against every piece of advice I have seen (or given!) on how to handle Unicode, which is to use purely Unicode within your application, and convert to and from byte strings at well-defined "boundaries". To respond to your individual points:
So we raise issues against those dependencies, or patch around the problem (i.e. treat the dependency as "outside the boundary" and convert to/from Unicode when interfacing to it).
Without evidence, this is pure speculation. General experience has been that a pure-Unicode model is far less likely to have problems, but we're both just making unsupported statements at this point.
I'll not get into a Python 2 vs Python 3 debate here (suffice it to say that my view is that if you want proper support for non-ASCII data, you should use Python 3). But "converting to Unicode just to convert back" is essentially the definition of the "Unicode internally" strategy, and the benefit is basically "far fewer encoding bugs". So IMO, the benefit of going to Unicode is precisely the fixing of this (and probably a number of other) bugs with encodings. I consider that benefit worth the cost. You're claiming that a "native string internally" strategy can provide the same benefit. My experience says you're wrong - it tends to simply replace current bugs with different ones. But note that simply fixing one function isn't "native string internally" either - it's just patching over the issue for now and hoping there aren't other bugs elsewhere that will be triggered by the change.
This is an approach that has been taken by a lot of projects, and as far as I know has never been an issue. Python 2/3 compatibility code is a fact of life, the fact that some of it is needed to cover conversion to Unicode isn't likely to be that much of an extra burden. But again, this feels like an unsubstantiated claim. The only way to know for sure is to actually try coding it. By the way, I'll also point out that there's a big issue with your "native strings internally" proposal that you've not considered - specifically under Python 2 where a "native string" is a bytestring. Without knowing the encoding, a Python 2 string is meaningless, and strings can come into pip from a variety of sources. For example, the filesystem (os.fsencoding), the registry as here (native Unicode, so choose your encoding), or files such as requirements files (UTF-8). Do you propose normalising everything to a particular encoding (which basically means you're doing a by-hand version of Unicode everywhere) or are you willing to take the risk that 2 strings in different encodings are needed in the same piece of code - that's how we get encoding errors... Anyway, I remain -1 on trying to force things to work by encoding stuff until bugs stop appearing. I'm +1 on long-term going to all-Unicode internally. I think that it's a bug for |
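The "Unicode internally, convert at the boundaries" strategy described above can be sketched in a few lines. This is a Python 3 illustration under stated assumptions: the helper names are invented, `legacy_lock_name` merely stands in for a dependency that only handles byte strings, and the encodings are passed explicitly rather than detected.

```python
def to_text(value, encoding):
    """Input boundary: decode bytes with a known encoding; pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding):
    """Output boundary: encode text for a consumer that needs bytes."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


def legacy_lock_name(raw_path):
    """Stand-in for a bytes-only dependency (hypothetical)."""
    if not isinstance(raw_path, bytes):
        raise TypeError("bytes required")
    return raw_path + b".lock"


def call_bytes_dependency(func, path, encoding):
    """Treat the dependency as outside the boundary: encode in, decode out."""
    return to_text(func(to_bytes(path, encoding)), encoding)


lock = call_bytes_dependency(legacy_lock_name, u"c:\\users\\p\xe4ivi\\cache", "cp1252")
```

The same name arriving as UTF-8 from a requirements file and as Latin-1 from elsewhere decodes to one identical text value, which is the point about byte strings being meaningless without their encoding.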
I'd like to separate the discussion about getting this really severe bug fixed ASAP from the discussion about how pip handles bytes/Unicode. The fact is that pip uses bytes internally on Python 2 and that seems to work pretty well except for this particular issue. I have demonstrated that the problem can be fixed easily, without resorting to hacks, and I'm willing to create a pull request. I'd obviously be happy if someone else is interested in changing pip to use Unicode internally also on Python 2 and fixes this issue along the way. The only thing I really care about is that this issue gets fixed in the next Python 2.7.x release. Leaving those of us in the non-ASCII world to deal with such bugs, even when there is a simple fix available, would go very much against the "practicality beats purity" principle. |
@pekkaklarck Agreed, which is why I suggested a fix for the optparse issue, which was the original subject of this PR, and suggested converting the data before passing it to the lockfile module (which should probably have been raised as a separate issue, but that's a minor point). I'm 100% in favour of getting pip to work with non-ASCII data. But we should be able to do that in a way that doesn't need to be ripped out and reworked when we move to the all-Unicode approach. |
Had a chance to investigate this on my student's machine today. Learnings:
|
@pfmoore: If we add a helper method to encode the Unicode path returned by Windows APIs to bytes, it obviously needs to be removed if pip is later changed to use Unicode internally. That wouldn't be a big task, though. The helper method itself would be pretty simple and the changes to the current Windows code would be something like this:
if PY2 and isinstance(path, unicode):
    path = _windows_unicode_path_to_bytes(path)
Although I only really care about this issue being fixed, I'd like to put my 0,02€ in about Unicode vs bytes in this particular case. I fully agree that programs in general should use Unicode internally for presenting text. In this case there are several reasons I doubt changing how pip handles paths is a good idea:
|
My temporary fix in optparse.py: 🙈 |
Encountered this again in another training. This time the machine had hostname "Kotiläppäri" (home laptop in Finnish). Used the pip version included with Python 2.7.12. |
Two related problems are fixed:

- Previously non-ASCII characters in hostname blew up `pip install` completely. This is rather severe.
- Non-ASCII characters in username crashed printing help text. Not so bad but definitely annoying.

Non-ASCII usernames are pretty common in non-English speaking countries. That also makes non-ASCII hostnames pretty common, because Windows creates hostname based on the username by default.

The reason for these failures was that `user_cache_dir` returned Unicode on Windows and bytes elsewhere, and rest of the codebase was expecting paths to be bytes.

It could be argued that pip should always use Unicode internally, but that would require a lot more changes and fixing also some of the vendored dependencies. We can also argue, like PEP-519 does, that paths should not be considered to be strings at all, and thus the "all Unicode internally" guideline wouldn't apply in this case.

Fixes pypa#3463. See that issue for more discussion and details.
PR #3970 fixes both the more severe problem with |
Hi, I have spent 4 days looking for a solution to a quite simple problem: why doesn't pip work on my PC? "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 5: ordinal not in range(128)" It's incredible that, all over the world, there are so few people with a name including a non-ASCII letter! I did not like Python before, but some software uses Python, so I installed Python :-( The software (platformIO) that I want to use still failed at a later point in a Python mystery
This is REALLY FUNNY: "some Windows users", "may fix"! Sorry for this long post, but it's so frustrating that so much software uses Python, and Python is so messy. Again, thanks to pekkaklarck, you are making this "thing" less buggy. "I have a dream that one day" serious developers use a reliable language. |
@Champal To be fair to Python, this kind of problem shouldn't occur anymore in Python 3. The underlying issue here is mixing bytes and Unicode, but that only causes problems if the username or hostname contains non-ASCII characters. On Python 3 you simply cannot mix bytes and Unicode like that. The good news is that my PR got a positive review and hopefully we get the final issues resolved before the pip 8.2 release. Until then, using |
Your PR is good, but only for pip. With Python 3.5 there is also another problem: "virtualenv fails with Python 3.5 on Windows" pypa/virtualenv#796 Ahhhhhhhhhhhhh! To return to the source of the problem: platformIO (http://platformio.org/). OK, let's go to SCons. http://scons.org/ "a next-generation build tool" Ah ah ah! (with Python 2.7!) So I will wait for the upgrade to Python 3, and perhaps pypa/virtualenv#796 will be fixed before then... The suspense is unbearable... No, I'm joking, I don't care, I have no hope, it's just a distraction... Good luck, I'm going away and will stop disturbing your thread. |
PR #4000 now merged. Can this issue now be closed? There's a lot of discussion here about various Unicode-related problems, and I don't want to unilaterally close this in case there are other problems here that the PR didn't resolve. @pekkaklarck if you're happy that the issue is now fixed, can you close it? |
PR #4000 fixes the original problem and I would say this issue can be closed. Should a milestone or some labels be set before? |
No, AIUI we use milestones for "we need to fix this in x.y" not for "this has been fixed in x.y". So I'll close this and just note here that the fix should appear in the next release (8.2) |
When organizing a Python training recently, one participant failed to use pip after a fresh Python 2.7.11 installation on Windows. Quick investigation showed that the reason for the problem was `ä` in her username. We failed to work around that even by creating a new account and needed to use `python setup.py install` instead.

I now tried to reproduce the problem on my virtual machine. My main account there has only ASCII characters in the username but I created another for testing purposes. Clearly everything is not correct:
Interestingly installation and uninstallation seem to work fine also with this account. I guess the difference with the problem I saw earlier could be that my main user/admin doesn't have non-ASCII characters.
UPDATE: It later turned out that `pip install` is totally broken if the hostname has non-ASCII characters. That explains why creating a new account with just ASCII characters in the username didn't work when I first encountered this, and also why I couldn't reproduce that more severe problem with just an account with a non-ASCII username.

A workaround for both of these problems is using `--no-cache-dir`. Both problems are also fixed by PR #3970 that hopefully gets merged and released at some point.
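For anyone scripting installations on affected machines, the workaround can be applied conditionally. The sketch below is only a heuristic under stated assumptions: the helper names are invented, the package name is a placeholder, and checking the username and hostname for non-ASCII characters covers the two cases described in this report but possibly not every locale-related variant mentioned in the comments.

```python
import sys


def is_ascii(text):
    """Return True if the text contains only ASCII characters."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def build_pip_install_command(package, username, hostname):
    """Build a `python -m pip install` command, adding --no-cache-dir when needed."""
    command = [sys.executable, "-m", "pip", "install"]
    if not (is_ascii(username) and is_ascii(hostname)):
        command.append("--no-cache-dir")  # the workaround described above
    command.append(package)
    return command


print(build_pip_install_command("some-package", u"P\xe4ivi", u"P\xe4ivi-PC"))
```

In real use the username and hostname could come from `getpass.getuser()` and `socket.gethostname()`, and the resulting list could be passed to `subprocess.call`. Always adding `--no-cache-dir` is the simpler and safer option if caching does not matter.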