Python startup fails with a fatal error if a command line argument contains an invalid Unicode character #80064

Closed
Neui mannequin opened this issue Feb 1, 2019 · 18 comments
Labels
3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@Neui
Mannequin

Neui mannequin commented Feb 1, 2019

BPO 35883
Nosy @ncoghlan, @vstinner, @ezio-melotti, @eryksun, @miss-islington, @Neui, @jmberg
PRs
  • bpo-35883: Py_DecodeLocale() escapes invalid Unicode characters #24843
  • [3.9] bpo-35883: Py_DecodeLocale() escapes invalid Unicode characters (GH-24843) #24905
  • [3.8] bpo-35883: Py_DecodeLocale() escapes invalid Unicode characters (GH-24843) #24906
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-10-18.22:15:29.662>
    created_at = <Date 2019-02-01.16:55:38.226>
    labels = ['interpreter-core', '3.8', 'type-crash', '3.10', 'expert-unicode', '3.9']
    title = 'Python startup fails with a fatal error if a command line argument contains an invalid Unicode character'
    updated_at = <Date 2021-10-18.22:15:29.662>
    user = 'https://github.com/Neui'

    bugs.python.org fields:

    activity = <Date 2021-10-18.22:15:29.662>
    actor = 'iritkatriel'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-10-18.22:15:29.662>
    closer = 'iritkatriel'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2019-02-01.16:55:38.226>
    creator = 'Neui'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 35883
    keywords = ['patch']
    message_count = 18.0
    messages = ['334703', '334705', '334707', '334712', '334732', '369811', '369812', '369813', '369814', '369819', '369820', '388612', '388613', '388614', '388616', '388965', '388970', '389732']
    nosy_count = 8.0
    nosy_names = ['ncoghlan', 'vstinner', 'ezio.melotti', 'SilentGhost', 'eryksun', 'miss-islington', 'Neui', 'jberg']
    pr_nums = ['24843', '24905', '24906']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue35883'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    When an invalid Unicode character is passed in argv (command-line arguments), Python abort()s with a fatal error about a character not being in range (ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]).

    I am wondering if this behaviour should change to replace those with U+FFFD REPLACEMENT CHARACTER (like .decode(..., 'replace')) or even with something similar/better (see https://docs.python.org/3/library/codecs.html#error-handlers )
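
To make the trade-off between error handlers concrete, here is a small sketch that is not part of the original report. It assumes the six bytes that bash emits for $'\U7fffbeba' (the byte values are the ones shown elsewhere in this thread) and decodes them with Python's own UTF-8 codec; the variable names are illustrative only.

```python
# Bytes bash produces for $'\U7fffbeba' (a six-byte, pre-RFC 3629 sequence).
raw = b"\xfd\xbf\xbf\xbb\xba\xba"

# 'replace' substitutes U+FFFD for every undecodable byte: the data is lost.
replaced = raw.decode("utf-8", "replace")

# 'surrogateescape' (PEP 383) maps each undecodable byte to U+DC80..U+DCFF,
# so the original bytes can be recovered when encoding again.
escaped = raw.decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")

print(ascii(replaced))
print(ascii(escaped))
print(restored == raw)
```

With 'replace' the script would at least start, but it could no longer tell which command was typed; with 'surrogateescape' it can still get the exact bytes back.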

    The reason for this is that other applications can use the invalid character, since it is just data (GDB, for example, can pass it as an argument to the program being debugged), whereas in Python this becomes a limitation, since the script (if specified) never runs.

    My main motivation is the command-not-found Debian package, which receives the mistyped command as a command-line argument. If that argument contains an invalid Unicode character, the package simply fails rather than saying it couldn't find the command or a similar one. If this doesn't change, the package has to either accept this limitation, pass the command some other way, or be rewritten in a language other than Python.

    # Requires bash 4.2+
    # Specifying a script omits the first two lines
    $ python3.6 $'\U7fffbeba'
    Failed checking if argv[0] is an import path entry
    ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]
    Fatal Python error: no mem for sys.argv
    ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]

    Current thread 0x00007fd212eaf740 (most recent call first):
    Aborted (core dumped)

    $ python3.6 --version
    Python 3.6.7
    
    $ uname -a
    Linux nopea 4.15.0-39-generic #42-Ubuntu SMP Tue Oct 23 15:48:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
    
    $ lsb_release -a
    No LSB modules are available.
    Distributor ID:	Ubuntu
    Description:	Ubuntu 18.04.1 LTS
    Release:	18.04
    Codename:	bionic

    GDB backtrace just before throwing the error: (note that it's argc=2 since first argument is a script)
    #0 find_maxchar_surrogates (begin=begin@entry=0xa847a0 L'\x7fffbeba' <repeats 12 times>, end=end@entry=0xa847d0 L"", maxchar=maxchar@entry=0x7fffffffde94,
    num_surrogates=num_surrogates@entry=0x7fffffffde98) at ../Objects/unicodeobject.c:1626
    #1 0x00000000004cee4b in PyUnicode_FromUnicode (u=u@entry=0xa847a0 L'\x7fffbeba' <repeats 12 times>, size=12) at ../Objects/unicodeobject.c:2017
    #2 0x00000000004db856 in PyUnicode_FromWideChar (w=0xa847a0 L'\x7fffbeba' <repeats 12 times>, size=<optimized out>, size@entry=-1) at ../Objects/unicodeobject.c:2502
    #3 0x000000000043253d in makeargvobject (argc=argc@entry=2, argv=argv@entry=0xa82268) at ../Python/sysmodule.c:2145
    #4 0x0000000000433228 in PySys_SetArgvEx (argc=2, argv=0xa82268, updatepath=1) at ../Python/sysmodule.c:2264
    #5 0x00000000004332c1 in PySys_SetArgv (argc=<optimized out>, argv=<optimized out>) at ../Python/sysmodule.c:2277
    #6 0x000000000043a5bd in Py_Main (argc=argc@entry=3, argv=argv@entry=0xa82260) at ../Modules/main.c:733
    #7 0x0000000000421149 in main (argc=3, argv=0x7fffffffe178) at ../Programs/python.c:69

    Similar issues:
    https://bugs.python.org/issue25631 "Segmentation fault with invalid Unicode command-line arguments in embedded Python" (actually 'fixed' since it now abort()s)
    https://bugs.python.org/issue2128 "sys.argv is wrong for unicode strings"

    @Neui Neui mannequin added type-bug An unexpected behavior, bug, or error interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Feb 1, 2019
    @SilentGhost
    Mannequin

    SilentGhost mannequin commented Feb 1, 2019

    I'm on 4.15.0-44-generic and I cannot reproduce the crash. I get "python3: can't open file '������': [Errno 2] No such file or directory"

    Could you try this on a different machine / installation?

    @SilentGhost SilentGhost mannequin added type-crash A hard crash of the interpreter, possibly with a core dump and removed type-bug An unexpected behavior, bug, or error labels Feb 1, 2019
    @SilentGhost
    Mannequin

    SilentGhost mannequin commented Feb 1, 2019

    Hm, this seems to be due to how the terminal emulator handles those special characters, actually. I can reproduce in another terminal.

    @Neui
    Mannequin Author

    Neui mannequin commented Feb 1, 2019

    I'd say that the terminal is not really relevant here, but rather the locale settings because it uses wide string functions. Prefixing it with LC_ALL=C produces the same output as you had on my Ubuntu machine. I also get that output when running it in Cygwin (and MSYS2), although it seems setting LC_ALL has no effect.

    @SilentGhost SilentGhost mannequin added the 3.7 (EOL) end of life label Feb 1, 2019
    @eryksun
    Contributor

    eryksun commented Feb 1, 2019

    In Unix, Python 3.6 decodes the char * command line arguments via mbstowcs. In Linux, I see the following misbehavior of mbstowcs when decoding an overlong UTF-8 sequence:

        >>> mbstowcs = ctypes.CDLL(None, use_errno=True).mbstowcs
        >>> arg = bytes(x + 128 for x in [1 + 124, 63, 63, 59, 58, 58])
        >>> mbstowcs(None, arg, 0)
        1
        >>> buf = (ctypes.c_int * 2)()
        >>> mbstowcs(buf, arg, 2)
        1
        >>> hex(buf[0])
        '0x7fffbeba'
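
The value 0x7fffbeba can be reproduced by hand. The sketch below is a pure-Python model of the original UTF-8 scheme that predates RFC 3629 (sequences of up to six bytes, no upper bound on the result). It only illustrates the arithmetic and is not glibc's actual code; `decode_legacy_utf8` is a hypothetical helper name.

```python
def decode_legacy_utf8(raw):
    """Decode exactly one sequence with the legacy 1..6 byte scheme.

    No check against U+10FFFF is made, which is the point of the demo.
    """
    first = raw[0]
    if first < 0x80:
        length, cp = 1, first
    else:
        # The number of leading 1 bits in the lead byte gives the length.
        length, mask = 0, 0x80
        while first & mask:
            length += 1
            mask >>= 1
        if not 2 <= length <= 6:
            raise ValueError("invalid lead byte")
        # The bits below the first 0 bit are payload.
        cp = first & (mask - 1)
    if len(raw) != length:
        raise ValueError("wrong number of bytes")
    for byte in raw[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (byte & 0x3F)
    return cp


arg = bytes(x + 128 for x in [1 + 124, 63, 63, 59, 58, 58])
code_point = decode_legacy_utf8(arg)
print(hex(code_point))        # 0x7fffbeba
print(code_point > 0x10FFFF)  # True
```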

    This shouldn't be an issue in 3.7, at least not with the default UTF-8 mode configuration. With this mode, Py_DecodeLocale calls _Py_DecodeUTF8Ex using the surrogateescape error handler [1].

    @jmberg
    Mannequin

    jmberg mannequin commented May 24, 2020

    Pretty sure this is an issue still, I see it on current git master.

    This seems to work around it?

    https://p.sipsolutions.net/603927f1537226b3.txt

    Basically, it seems that mbstowcs() and mbrtowc() on glibc with utf-8 just blindly decode even invalid UTF-8 to a too large wchar_t, rather than failing.

    @jmberg
    Mannequin

    jmberg mannequin commented May 24, 2020

    A simple test case is something like

    ./python -c 'import sys; print(sys.argv[1].encode(sys.getfilesystemencoding(), "surrogateescape"))' "$(echo -ne '\xfa\xbd\x83\x96\x80')"

    Which you'd probably expect to print

    b'\xfa\xbd\x83\x96\x80'

    i.e. the same bytes that were passed in, but currently that fails.

    @jmberg jmberg mannequin added 3.8 (EOL) end of life 3.9 only security fixes 3.10 only security fixes labels May 24, 2020
    @jmberg
    Mannequin

    jmberg mannequin commented May 24, 2020

    In fact that python one-liner works with just about everything else that you can throw at it, just not something that "looks like utf-8 but isn't".

    And of course adding LC_CTYPE=ascii or something like that fixes it, as you'd expect. Then the "surrogateescape" works fine, since mbstowcs() won't try to decode it as utf-8.

    @jmberg
    Mannequin

    jmberg mannequin commented May 24, 2020

    And wrt. _Py_DecodeUTF8Ex() - it doesn't seem to help. But that's probably because I'm not __ANDROID__, nor __APPLE__, and then regardless of current_locale being non-zero or not, we end up in decode_current_locale() where the impedance mismatch happens.

    Setting PYTHONUTF8=1 in the environment works too, in that case we do get into _Py_DecodeUTF8Ex().

    @jmberg
    Mannequin

    jmberg mannequin commented May 24, 2020

    Like I said above, it could be argued that the bug is in glibc, and then

    https://p.sipsolutions.net/6a4e9fce82dbbfa0.txt

    could be used as a simple LD_PRELOAD wrapper to work around this, just to illustrate the problem from that side.

    Arguably, that makes glibc in violation of RFC 3629, since it says:

    3. UTF-8 definition

    [...]

    In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
    accessible range) are encoded using sequences of 1 to 4 octets.

    [...]

    Char. number range  |        UTF-8 octet sequence
       (hexadecimal)    |              (binary)
    --------------------+---------------------------------------------
    0000 0000-0000 007F | 0xxxxxxx
    0000 0080-0000 07FF | 110xxxxx 10xxxxxx
    0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
    0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    [...]

    Implementations of the decoding algorithm above MUST protect against
    decoding invalid sequences.

    [...]
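
As a rough illustration of the table quoted above, the following sketch maps a code point to the number of octets RFC 3629 allows for it and rejects everything else. `utf8_octet_count` is a hypothetical helper written for this example; it is not taken from glibc or CPython.

```python
def utf8_octet_count(code_point):
    """Return how many octets RFC 3629 UTF-8 uses for code_point.

    Raises ValueError for values the RFC does not allow to be encoded.
    """
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates must not be encoded in UTF-8")
    if 0x0000 <= code_point <= 0x007F:
        return 1
    if code_point <= 0x07FF and code_point >= 0x0080:
        return 2
    if code_point <= 0xFFFF and code_point >= 0x0800:
        return 3
    if code_point <= 0x10FFFF and code_point >= 0x10000:
        return 4
    raise ValueError("outside the range [U+0000; U+10FFFF]")


# Cross-check against Python's own encoder for a few valid code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_octet_count(cp), len(chr(cp).encode("utf-8")))
```

Under this reading of the RFC, a value such as 0x7fffbeba has no valid encoding at all, so a conforming decoder should never produce it.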

    Here's a simple test program:

    https://p.sipsolutions.net/ac091b4ea4b7f742.txt

    @jmberg
    Mannequin

    jmberg mannequin commented May 24, 2020

    I've also filed https://sourceware.org/bugzilla/show_bug.cgi?id=26034 for glibc, because that seems to be where the issue really is.

    But perhaps python should be forgiving of glibc errors here.

    @eryksun eryksun added topic-unicode and removed 3.7 (EOL) end of life labels Mar 12, 2021
    @vstinner
    Member

    I wrote PR 24843 to fix this issue. With this fix, os.fsencode(sys.argv[1]) returns the original byte sequence as expected.

    --

    I dislike the replace error handler since it loses information. The PEP-383 surrogateescape error handler exists to prevent losing information.

    The root issue is that Py_DecodeLocale() creates wide characters outside Python's valid Unicode range: [U+0000; U+10ffff].

    On Linux, Py_DecodeLocale() usually calls the C library's mbstowcs(). The problem is that the glibc UTF-8 decoder doesn't respect RFC 3629: it doesn't reject characters outside the [U+0000; U+10ffff] range. The following issue requests changing the glibc UTF-8 codec to respect RFC 3629, but it has been open since 2006:
    https://sourceware.org/bugzilla/show_bug.cgi?id=2373

    Even if glibc changes, Python should behave the same on old glibc versions.

    My PR modifies Py_DecodeLocale() to check whether there are characters outside the [U+0000; U+10ffff] range and to use the surrogateescape error handler in that case.
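
The idea of checking the decoded characters and falling back to surrogateescape can be modelled in a few lines of Python. This is only a sketch under simplifying assumptions, not the C code from the PR: the function names are hypothetical, the libc result is passed in as a list of integers, and the whole argument is escaped as soon as one character is invalid.

```python
MAX_UNICODE = 0x10FFFF


def is_valid_code_point(cp):
    """True if cp is in [U+0000; U+10ffff] and is not a lone surrogate."""
    if 0xD800 <= cp <= 0xDFFF:
        return False
    return 0 <= cp <= MAX_UNICODE


def surrogate_escape(raw):
    """Map bytes 0x80..0xFF to U+DC80..U+DCFF as PEP 383 does; keep ASCII."""
    return "".join(chr(0xDC00 + b) if b >= 0x80 else chr(b) for b in raw)


def decode_checked(raw, wide_chars):
    """Accept the libc-style result only if every character is valid.

    raw        -- the original bytes of the argument
    wide_chars -- the integers a libc decoder produced for those bytes
    """
    if all(is_valid_code_point(cp) for cp in wide_chars):
        return "".join(chr(cp) for cp in wide_chars)
    return surrogate_escape(raw)


raw = b"\xfd\xbf\xbf\xbb\xba\xba"
text = decode_checked(raw, [0x7FFFBEBA])
print(ascii(text))
print(text.encode("utf-8", "surrogateescape") == raw)
```

Because the fallback keeps every byte, encoding the result again with surrogateescape gives back the original argument.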

    @vstinner
    Member

    > https://bugs.python.org/issue25631 "Segmentation fault with invalid Unicode command-line arguments in embedded Python" (actually 'fixed' since it now abort()s)

    That issue is different: it is about the Py_Main() function called explicitly when Python is embedded in an application. Python fails if the command line contains a *wide character* outside the [U+0000; U+10ffff] range.

    This issue is about Python on Linux in which case Py_BytesMain() is used to decode *bytes* from the command line.

    @vstinner vstinner changed the title from "Change invalid unicode characters to replacement characters in argv" to "Python startup fails with a fatal error if a command line argument contains an invalid Unicode character" Mar 13, 2021
    @vstinner
    Member

    > This shouldn't be an issue in 3.7, at least not with the default UTF-8 mode configuration. With this mode, Py_DecodeLocale calls _Py_DecodeUTF8Ex using the surrogateescape error handler [1].

    Right, explicitly enabling the Python UTF-8 Mode works around the issue:
    https://docs.python.org/dev/library/os.html#python-utf-8-mode

    $ python3.10 -c 'import sys; print(ascii(sys.argv))' $'\U7fffbeba'
    Fatal Python error: init_interp_main: failed to update the Python config
    Python runtime state: core initialized
    ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]

    Current thread 0x00007effa1891740 (most recent call first):
    <no Python frame>

    $ python3.10 -X utf8 -c 'import sys; print(ascii(sys.argv))' $'\U7fffbeba'
    ['-c', '\udcfd\udcbf\udcbf\udcbb\udcba\udcba']
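
The same workaround can be exercised from a script. The sketch below assumes a POSIX system (where subprocess accepts a bytes argument) and Python 3.7 or newer for the `-X utf8` option; `roundtrip_argument` is a hypothetical helper name. It starts a child interpreter in UTF-8 Mode and asks it to write its first argument back as bytes.

```python
import subprocess
import sys

# The six bytes bash emits for $'\U7fffbeba'.
raw = b"\xfd\xbf\xbf\xbb\xba\xba"

# Child program: re-encode argv[1] with the filesystem encoding and its
# error handler, then write the bytes to stdout unchanged.
child_code = "import os, sys; sys.stdout.buffer.write(os.fsencode(sys.argv[1]))"


def roundtrip_argument(arg):
    """Run a child interpreter in UTF-8 Mode and return argv[1] as bytes."""
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", child_code, arg],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


echoed = roundtrip_argument(raw)
print(echoed == raw)
```

If the child interpreter aborted during startup, `check=True` would turn that into a CalledProcessError, so the comparison is only reached when the argument was decoded successfully.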

    @vstinner
    Member

    > Right, enabling explicitly the Python UTF-8 Mode works around the issue

    When the Python UTF-8 Mode is used, or on macOS or Android, Python uses its own UTF-8 decoder, which respects RFC 3629: it rejects characters outside [U+0000; U+10ffff].

    Otherwise, Python relies on the libc mbstowcs() decoder, which may or may not create characters outside the [U+0000; U+10ffff] range. I understand that this issue is mostly about the UTF-8 encoding; I don't think that other encodings can produce code points greater than U+10ffff.

    @vstinner
    Member

    New changeset 9976834 by Victor Stinner in branch 'master':
    bpo-35883: Py_DecodeLocale() escapes invalid Unicode characters (GH-24843)
    9976834

    @miss-islington
    Contributor

    New changeset aa967ec by Miss Islington (bot) in branch '3.9':
    bpo-35883: Py_DecodeLocale() escapes invalid Unicode characters (GH-24843)
    aa967ec

    @vstinner
    Member

    New changeset 3b6e61e by Miss Islington (bot) in branch '3.8':
    bpo-35883: Py_DecodeLocale() escapes invalid Unicode characters (GH-24843) (GH-24906)
    3b6e61e

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    vstinner added a commit to vstinner/cpython that referenced this issue May 30, 2023
    Return a classical int, rather than size_t.
    vstinner added a commit that referenced this issue May 30, 2023
    Return a classical int, rather than size_t. The size_t type was
    kept from copied/pasted code related to mbstowcs().
    carljm added a commit to carljm/cpython that referenced this issue May 30, 2023
    * main:
      CI: Temporarily skip paths with spaces to avoid error (python#105110)
      pythongh-105071: add missing versionadded directive (python#105097)
      pythongh-80064: Fix is_valid_wide_char() return type (python#105099)
      Small speedup for dataclass __eq__ and __repr__ (python#104904)
      pythongh-103921: Minor PEP-695 fixes to the `ast` module docs (python#105093)
      pythongh-105091: stable_abi.py: Remove "Unixy" check from --all on other platforms (pythonGH-105092)