Character encoding brokenness in Python REPL input (both 2.7 and 3.6) #2656

Closed
rprichard opened this issue Jul 3, 2017 · 3 comments

Comments

@rprichard

For the scenarios below, use an ordinary console window (not mintty, winpty, ConEmu, etc). Set the active code page to 437 and use a TrueType font, like Lucida Console or Consolas. (Selecting a Raster font can prevent the console from showing many characters.)

C:\>chcp 437
Active code page: 437

Scenario 1: Python 3.6.2rc1 in an ordinary console

NB: sys.stdin.encoding is utf-8.

  1. Start Python:

    C:\>C:\msys64\mingw64\bin\python3.6.exe
    Python 3.6.2rc1 (default, Jun 26 2017, 07:26:57)  [GCC 6.3.0 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    
  2. Select an English-layout keyboard, then copy and paste len('ö') into the console. The ö character is dropped:

    >>> len('')
    0
    
  3. Switch the keyboard to German, then try again. There is an error:

    >>> len('ö')
      File "<stdin>", line 0
    
        ^
    SyntaxError: 'utf-8' codec can't decode byte 0x94 in position 5: invalid start byte
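
The error message in step 3 is consistent with the console handing Python the line in code page 437 while Python decodes it as UTF-8. A minimal sketch of that mismatch, run in any Python 3 (the variable names are only for illustration):

```python
# 'ö' (U+00F6) is the single byte 0x94 in code page 437.
raw = "len('ö')".encode("cp437")
print(raw)  # b"len('\x94')"

# 0x94 is a UTF-8 continuation byte, so it cannot start a character.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The reported position (5) is the index of the byte right after `len('`, which matches the traceback above.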
    

Scenario 2: Python 2.7.13 in an ordinary console

NB: sys.stdin.encoding is cp437.

  1. Start Python:

    C:\>C:\msys64\mingw64\bin\python2.7.exe
    Python 2.7.13 (default, Jan 17 2017, 13:56:44)  [GCC 6.3.0 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    
  2. Select an English-layout keyboard, then copy and paste len('ö') into the console. The ö is dropped, as with Python 3.

  3. Select a German-layout keyboard, then copy and paste len('ö') into the console. Something (GNU Readline?) converts the ö into \224. The \224 is treated as a single unit for editing purposes (e.g. one backspace removes the whole thing):

    >>> len('\224')
    1
    
  4. Copy and paste u'ö' ; u'\224' into the console. Both literals are displayed as u'\224', but they evaluate differently: the pasted u'ö' (displayed as u'\224') evaluates to U+00F6, whereas the ASCII u'\224' evaluates to U+0094 (NB: 0o224 == 0x94):

    >>> u'\224' ; u'\224'
    u'\xf6'
    u'\x94'
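
The two results in step 4 can be reproduced from the single byte 0x94. This is a sketch of my reading of the output: the pasted ö arrives as the raw byte 0x94 and is decoded with sys.stdin.encoding (cp437), while the typed escape \224 names the code point directly.

```python
# Octal 224 and hex 94 are the same number.
assert 0o224 == 0x94

# The raw byte 0x94 decoded as cp437 is 'ö' (U+00F6).
as_cp437 = b"\x94".decode("cp437")

# The escape \224 in a unicode literal means code point U+0094,
# which is what decoding the byte as Latin-1 also gives.
as_code_point = b"\x94".decode("latin-1")

print("%04x %04x" % (ord(as_cp437), ord(as_code_point)))  # 00f6 0094
```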
    
  5. Copy and paste ö into the REPL. Nothing appears. (This particular oddity does not affect Python 3.6.2rc1.) Typing an ö into the REPL (use the on-screen keyboard if you must) also produces nothing.

Scenario 3: Python 3.6.2rc1 from mintty

NB: sys.stdin.encoding is cp1252.

"Native" console programs tend not to work with a Cygwin pty. Common advice is to use winpty, which lets the program use console I/O instead.

  1. Start Python:

    rprichard@VBWIN7 MINGW32 ~
    $ /mingw64/bin/python3.6.exe -i
    Python 3.6.2rc1 (default, Jun 26 2017, 07:26:57)  [GCC 6.3.0 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    
  2. Select either an English or German keyboard. I don't think it matters.

  3. Copy and paste x = '…' ; " ".join("%x" % ord(c) for c in x). The result:

    >>> x = '…' ; " ".join("%x" % ord(c) for c in x)
    'e2 20ac a6'
    

    mintty turns [U+2026] into its UTF-8 representation: E2 80 A6. Each of the three bytes is interpreted using Windows-1252, which produces U+00E2 U+20AC U+00A6.
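
The same transformation can be reproduced without mintty. A small sketch, using only the standard codecs:

```python
ellipsis = "\u2026"                        # the pasted character
utf8_bytes = ellipsis.encode("utf-8")      # what mintty writes: E2 80 A6
misdecoded = utf8_bytes.decode("cp1252")   # what Python reads with cp1252

result = " ".join("%x" % ord(c) for c in misdecoded)
print(result)  # e2 20ac a6

# The mapping is reversible, so no information is lost at this point.
repaired = misdecoded.encode("cp1252").decode("utf-8")
```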

I'm mostly just including this scenario for reference. It doesn't work with the official Python releases either, and I don't think it's expected to work. How would Python know that its pipes are going to be encoded as UTF-8, rather than Windows-1252? Maybe it could work, somehow? I saw some discussion somewhere recently about detecting named pipes that are really Cygwin ptys, and that might tell us something about its encoding.
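
As a rough illustration of the named-pipe idea: detection schemes I have seen query the name of the pipe behind the standard handle (on Windows, via GetFileInformationByHandleEx) and match it against the pattern Cygwin/MSYS2 use for pty pipes. The sketch below covers only the name-matching half. The pattern and the helper name looks_like_cygwin_pty are my assumptions, not something taken from Python or from this thread.

```python
import re

# Assumed naming scheme for Cygwin/MSYS2 pty pipes, e.g.
#   \msys-dd50a72ab4668b33-pty1-to-master
_PTY_PIPE_RE = re.compile(
    r"\\(cygwin|msys)-[0-9a-f]{16}-pty\d+-(from|to)-master$"
)

def looks_like_cygwin_pty(pipe_name):
    """Return True if pipe_name matches the assumed pty pipe pattern."""
    return _PTY_PIPE_RE.search(pipe_name) is not None

print(looks_like_cygwin_pty(r"\msys-dd50a72ab4668b33-pty1-to-master"))
print(looks_like_cygwin_pty(r"\some-ordinary-pipe"))
```

Even if the pipe is identified as a pty, the encoding still has to be assumed (UTF-8 is mintty's usual setting), so this would be a heuristic at best.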

Scenario 4: Use sys.stdin.readline() instead:

  1. Start either MinGW Python 2 or 3, with any keyboard.

  2. Copy and paste: import sys ; " ".join("%x" % ord(c) for c in sys.stdin.readline()). On the next line, paste ö…. The result is '94 2e a' on Python 2 and 'f6 2026 a' on Python 3. Both of these are correct, indicating that the issue is with the REPL's line entry, not with general Python stdin reading.
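
For reference, here is how those two strings relate to the pasted text. The Python 3 result is just the code points. The Python 2 result is the cp437 byte for ö, followed by 2e ('.') in place of …, which has no cp437 encoding. That the substitution happens before Python sees the bytes is my inference, not something I verified.

```python
line = "ö…\n"

# Python 3 sees text: one code point per character.
py3_view = " ".join("%x" % ord(c) for c in line)
print(py3_view)  # f6 2026 a

# Python 2 sees cp437 bytes: ö is 0x94.
py2_first = "%x" % "ö".encode("cp437")[0]
print(py2_first)  # 94

# U+2026 cannot be encoded in cp437 at all.
try:
    "…".encode("cp437")
    cp437_has_ellipsis = True
except UnicodeEncodeError:
    cp437_has_ellipsis = False
```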


The first two scenarios work fine with official Python releases of 2.7.11 and 3.6.1. The third and fourth scenarios behave the same way with the official release. FWIW, I'm pretty sure the official Python releases do not use GNU Readline:

  • MinGW Python has Unix-like line editing, whereas official Python has Windows-like line editing.
  • MinGW Python has a readline module I can import from the REPL; the official release doesn't.

See this issue, rprichard/winpty#121. I suspect the problem is really with GNU Readline. I dug into that library a bit. I think it's using ordinary C narrow strings, but if isatty(0) is true, then it uses the MSVCRT _getch function to read input. Based on my testing, that function always returns input in the console's code page, and it also ignores characters that aren't in the current keyboard layout. I'm guessing that a proper MinGW Readline port should use *Console* wide APIs instead and explicit UTF-8 <-> UTF-16 conversions.
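
To make the conversion step concrete, here is a sketch in Python rather than C. It assumes the wide console APIs hand back UTF-16 code units and that the line editor wants UTF-8 bytes internally. The helper names are hypothetical and the console calls themselves are left out.

```python
import struct

def utf16_units_to_utf8(units):
    """Convert a sequence of UTF-16 code units to UTF-8 bytes."""
    raw = struct.pack("<%dH" % len(units), *units)
    return raw.decode("utf-16-le").encode("utf-8")

def utf8_to_utf16_units(data):
    """Convert UTF-8 bytes to a list of UTF-16 code units."""
    raw = data.decode("utf-8").encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

# 'ö' and '…' survive the round trip regardless of code page or keyboard.
units = [0x00F6, 0x2026]
utf8 = utf16_units_to_utf8(units)
print(utf8)                       # b'\xc3\xb6\xe2\x80\xa6'
print(utf8_to_utf16_units(utf8))  # [246, 8230]
```

A real port would also have to buffer a lone high surrogate until its low surrogate arrives, since characters outside the BMP take two code units.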

I don't think MSVCRT has a proper UTF-8 locale setting: https://stackoverflow.com/questions/4324542/what-is-the-windows-equivalent-for-en-us-utf-8-locale.

lazka referenced this issue in lazka/MINGW-packages Jul 7, 2017
…a real Windows console

CPython uses isatty() to detect a terminal and change some settings
like line buffering and interactive mode. Use is_cygpty() to make
this also work under mintty.
See https://github.com/Alexpux/MINGW-packages/issues/2645

This also removes the bash script which forced the interactive mode
when python3 was started without arguments. This is no longer needed as
Python now detects the terminal output and does this automatically.

Also use is_cygpty() to detect when not under mintty and disable the readline
module there, as using it breaks input of certain characters and
leads to errors on shutdown when it tries to save the readline history.
(The readline module is not available in the official Python build)
See https://github.com/Alexpux/MINGW-packages/issues/2656
lazka referenced this issue in lazka/MINGW-packages Aug 14, 2017
…a real Windows console
Alexpux referenced this issue Aug 14, 2017
…a real Windows console (#2675)
@lazka
Member

lazka commented Aug 19, 2017

Python 2 and 3 now disable readline in case no cygwin terminal is detected. Maybe it should be disabled altogether, but I tried to keep the change minimal for now. winpty + python work for me now at least.

@lazka
Member

lazka commented Aug 19, 2017

See #2675 and #2806

@lazka
Member

lazka commented Jan 25, 2018

mingw python no longer uses readline when run in a normal terminal (cmd/winpty). So I assume the core issue here is fixed.

If there is something missing please say so.

@lazka lazka closed this as completed Jan 25, 2018