Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode decoding issue on Windows #96

Open
jorisroovers opened this issue Jul 8, 2019 · 5 comments
Open

Unicode decoding issue on Windows #96

jorisroovers opened this issue Jul 8, 2019 · 5 comments
Labels
bug User-facing bugs windows Windows related issues

Comments

@jorisroovers
Copy link
Owner

When trying to decode Unicode characters on Windows, gitlint can crash.
This can easily be shown by trying to lint the commit 3ee281e of the gitlint commit history.

(.venv) C:\Users\Administrator\gitlint>gitlint --commits 3ee281eec4ef4325ce90d27f6f368a6b95818cfe
Traceback (most recent call last):
  File "C:\Users\Administrator\gitlint\gitlint\.venv\Scripts\gitlint-script.py", line 11, in <module>
    load_entry_point('gitlint', 'console_scripts', 'gitlint')()
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 1114, in invoke
    return Command.invoke(self, ctx)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\administrator\gitlint\gitlint\cli.py", line 180, in cli
    ctx.invoke(lint)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\administrator\gitlint\gitlint\cli.py", line 211, in lint
    gitcontext = GitContext.from_local_repository(lint_config.target, ctx.obj[2])
  File "c:\users\administrator\gitlint\gitlint\git.py", line 199, in from_local_repository
    raw_commit = _git("log", sha, "-1", long_format, _cwd=repository_path).split("\n")
  File "c:\users\administrator\gitlint\gitlint\git.py", line 28, in _git
    result = sh.git(*command_parts, **git_kwargs)  # pylint: disable=unexpected-keyword-arg
  File "c:\users\administrator\gitlint\gitlint\shell.py", line 45, in git
    return _exec(*args, **kwargs)
  File "c:\users\administrator\gitlint\gitlint\shell.py", line 65, in _exec
    stdout = ustr(result[0])
  File "c:\users\administrator\gitlint\gitlint\utils.py", line 27, in ustr
    return unicode(obj, DEFAULT_ENCODING)  # pragma: no cover # noqa
  File "C:\Users\Administrator\gitlint\gitlint\.venv\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1: character maps to <undefined>
@jorisroovers jorisroovers added bug User-facing bugs windows Windows related issues labels Jul 8, 2019
@jorisroovers
Copy link
Owner Author

jorisroovers commented Aug 21, 2019

I spend some time looking into this. Disclaimer: I'm by no means a unicode expert and what follows is based on limited research. If anyone reading this has more expertise in this area, please chime in!

TLDR: A fix will follow that will takes into account the LC_ALL, LC_CTYPE, LANG env vars on Windows and otherwise fallsback to UTF-8. This matches git's behavior on Windows.

In a nutshell, the problem is here
https://github.com/jorisroovers/gitlint/blob/master/gitlint/utils.py#L8

In particular, on my windows 10 VM, Python 2.7, both CMD and Powershell:

$ python -c "import locale; print locale.getpreferredencoding()"
cp1252

This causes problems with gitlint tries to parse unicode characters encoded in UTF-8 (or any encoding other than cp1252).

I first considered to just remove the call to locale.getpreferredencoding() and always use UTF-8. However, while I believe that works for the majority of users, I don't want to downgrade the experience for power-users. For users that want/need an encoding different from UTF-8, this is probably a hard requirement to be able to use gitlint at all.

I then considered to hard-code UTF-8 only on Windows (and keep the current code on other platforms), but after some more research I discovered that git actually adheres to the LC_ALL environment variable on Windows (see footnote 1) - which allows users to define which charset to use. Note that reading the LC_ALL envvar is somewhat atypical for Windows. I personally think this is a rather useful workaround and given gitlint's dependency on git itself, it makes sense to follow the same behavior.

So in a nutshell, what I'm thinking to implement now is:

  1. On non-Windows machines, everything stays as is
  2. On Windows-machines:
    1. Try to determine preferred encoding from environment variables LC_ALL, LC_CTYPE, LANG (in that order)
    2. Otherwise, fallback on UTF-8
    3. Ignore locale.getpreferredencoding() all-together (like git).

Footnotes

Footnote 1: git itself requires LC_ALL on Windows

As a side-note, git itself requires you to explicitly set LC_ALL on Windows to properly display unicode characters, like so:

# Without 'LC_ALL=C.UTF-8'
$ git log -1 --pretty="%an" 3ee281
<C5><81>ukasz Rogalski
# With 'LC_ALL=C.UTF-8'
$ set LC_ALL=C.UTF-8
$ git log -1 --pretty="%an" 3ee281
Łukasz Rogalski

I believe this is the relevant source-code in git itself: https://github.com/git/git/blob/5fa0f5238b0cd46cfe7f6fa76c3f526ea98148d9/gettext.c#L15-L32

Footnote 2:

From what I could gather, there's a lot discussion about python's unicode character decoding on Windows.
In particular, from diagonally reading a relevant blog post, I eventually stumbled upon PEP 528 which explains how in Python >= 3.6, Python defaults to UTF-8 for Windows console encoding. More indication that Unicode handling in python is messy, particularly so on Windows.

Footnote 3:

Python allows you to specify the error behavior for unicode errors (Python2, Python3). While it's possible to implement a git-like behavior of printing placeholder chars on unicode decoding errors, I'm probably not going to do this for now. I'd prefer gitlint to hard-crash on decoding issues for now so it's more likely that users report the issues they encounter.

jorisroovers added a commit that referenced this issue Aug 25, 2019
This *should* fix #96.

Also did a minor refactoring of utils.py that makes it easier to unit test
some of the constants that are defined in that module.
jorisroovers added a commit that referenced this issue Aug 25, 2019
Update: Needs more testing, doesn't solve the issue (yet)

This *should* fix #96.

Also did a minor refactoring of utils.py that makes it easier to unit test
some of the constants that are defined in that module.
@jorisroovers
Copy link
Owner Author

The plot thickens!

In my previous comment, I really only considered reads of unicode characters from the git command output, i.e. the cause of the crash in the original description. That issue is mostly solved by c939a0d (although I need to amend that commit with a small fix).

However, turns out that writing unicode characters to the Windows console is an entire beast on its own - see issue1602. In a nutshell, properly and consistently printing Unicode characters to the Windows console is very messy in python. From what I can gather, manually working around this is complicated.

The good news is that the Click library's click.echo() function (which we already use in part of the codebase) has all the necessary work-arounds baked in.

So what I'm planning to do now is replace all occurrences where we are writing to stdout/stderr directly with the click.echo() function.

The 2 most relevant places:

  • In the display module
  • Writing a custom LogHandler that uses click.echo() to print (needed to fix unicode characters in log/debug messages).

Hopefully this will work!

jorisroovers added a commit that referenced this issue Nov 2, 2019
Gitlint will now respect the LC_ALL, LC_CTYPE and LANG environment variables
on Windows - in line with git's behavior.

The default fallback is UTF-8, this should lead to improved unicode
parsing for the majority of users.

This is a partial fix for #96.

Also includes a minor refactoring of utils.py that makes it easier to unit test
some of the constants that are defined in that module.
jorisroovers added a commit that referenced this issue Nov 3, 2019
Gitlint will now respect the LC_ALL, LC_CTYPE and LANG environment variables
on Windows - in line with git's behavior.

The default fallback is UTF-8, this should lead to improved unicode
parsing for the majority of users.

This is a partial fix for #96.

Also includes a minor refactoring of utils.py that makes it easier to unit test
some of the constants that are defined in that module.
jorisroovers added a commit that referenced this issue Nov 13, 2019
Gitlint will now respect the LC_ALL, LC_CTYPE and LANG environment variables
on Windows - in line with git's behavior.

The default fallback is UTF-8, this should lead to improved unicode
parsing for the majority of users.

This is a partial fix for #96.

Also includes a minor refactoring of utils.py that makes it easier to unit test
some of the constants that are defined in that module.
@jorisroovers
Copy link
Owner Author

Quick example of something that doesn't work as expected yet:
Try the following on Python 2.7 on Windows (this does work on Python 3.x).

echo WIP: tëst | gitlint

This will crash gitlint on a unicode detection error.

@metya
Copy link

metya commented Sep 16, 2020

Is it here I should write about a bug?

when I install gitlint hook by pre-commit in linux it works normally, but in windows it failed with error

gitlint..................................................................Failed
- hook id: gitlint
- exit code: 1

Traceback (most recent call last):
  File "C:\Users\metya\scoop\apps\miniconda3\4.7.12.1\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\metya\scoop\apps\miniconda3\4.7.12.1\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\Scripts\gitlint.EXE\__main__.py", line 7, in <module>
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 1114, in invoke
    return Command.invoke(self, ctx)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\cli.py", line 204, in cli
    log_system_info()
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\cli.py", line 51, in log_system_info
    LOG.debug("Git version: %s", git_version())
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\git.py", line 57, in git_version
    return _git("--version").replace(u"\n", u"")
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\git.py", line 33, in _git
    result = sh.git(*command_parts, **git_kwargs)  # pylint: disable=unexpected-keyword-arg
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\shell.py", line 45, in git
    return _exec(*args, **kwargs)
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\shell.py", line 65, in _exec
    stdout = ustr(result[0])
  File "c:\users\metya\.cache\pre-commit\repog8u71i9k\py_env-default\lib\site-packages\gitlint\utils.py", line 88, in ustr
    return obj.decode(DEFAULT_ENCODING)
LookupError: unknown encoding: C

Even if the commit message is perfectly valid.

jorisroovers added a commit that referenced this issue Oct 17, 2020
When specifying an unknown encoding on windows via the LC_ALL, LC_CTYPE or LANG
environment variables, gitlint will known fallback to UTF-8 instead of
crashing. This is a common scenario when using the commit-msg hook.

Relates to #96
jorisroovers added a commit that referenced this issue Oct 17, 2020
When specifying an unknown encoding on windows via the LC_ALL, LC_CTYPE or LANG
environment variables, gitlint will known fallback to UTF-8 instead of
crashing. This is a common scenario when using the commit-msg hook.

Relates to #96
@jorisroovers
Copy link
Owner Author

@metya, can you try setting LC_ALL to UTF-8 and let me know if that works for you?

# Regular windows CMD
Set LC_ALL=UTF-8

# git-bash/Cygwin
export LC_ALL=UTF-8

# Now try again

The reason this happens is because git sets LC_CTYPE=C on Windows when invoking /bin/sh, and Python can't find the C encoding. I've added a workaround in #158, adding some logic for gitlint to more smartly fall back to UTF-8. This will go out as part of the 0.14.0 release within the next month.

This doesn't solve all unicode issues on Windows (I've spend more time on it but no silver bullets...yet), but hopefully it should keep gitlint from crashing.

jorisroovers added a commit that referenced this issue Oct 24, 2020
- IMPORTANT: Gitlint 0.14.x will be the last gitlint release to support Python
  2.7 and Python 3.5, as both are EOL which makes it difficult to keep
  supporting them.
- Python 3.9 support
- New Rule: title-min-length enforces a minimum length on titles
  (default: 5 chars) (#138)
- New Rule: body-match-regex allows users to enforce that the commit-msg body
  matches a given regex (#130)
- New Rule: ignore-body-lines allows users to ignore parts of a commit by
  matching a regex against the lines in a commit message body (#126)
- Named Rules allow users to have multiple instances of the same rule active at
  the same time. This is useful when you want to enforce the same rule multiple
  times but with different options (#113, #66)
- User-defined Configuration Rules allow users to dynamically change gitlint's
  configuration and/or the commit before any other rules are applied.
- The commit-msg hook has been re-written in Python (it contained a lot of
  Bash before), fixing a number of platform specific issues. Existing users
  will need to reinstall their hooks
  (gitlint uninstall-hook; gitlint install-hook) to make use of this.
- Most general options can now be set through environment variables (e.g. set
  the general.ignore option via GITLINT_IGNORE=T1,T2). The list of available
  environment variables can be found in the configuration documentation.
- Users can now use self.log.debug("my message") for debugging purposes in
  their user-defined rules. Debug messages will show up when running
  gitlint --debug.
- Breaking: User-defined rule id's can no longer start with 'I', as those are
  reserved for built-in gitlint ignore rules.
- New RegexOption rule option type for use in user-defined rules. By using the
  RegexOption, regular expressions are pre-validated at gitlint startup and
  compiled only once which is much more efficient when linting multiple commits.
- Bugfixes:
 - Improved UTF-8 fallback on Windows (ongoing - #96)
 - Windows users can now use the 'edit' function of the commit-msg hook (#94)
 - Doc update: Users should use --ulimit nofile=1024 when invoking gitlint
   using Docker (#129)
 - The commit-msg hook was broken in Ubuntu's gitlint package due to a
   python/python3 mismatch (#127)
 - Better error message when no git username is set (#149)
 - Options can now actually be set to None (from code) to make them optional.
 - Ignore rules no longer have "None" as default regex, but an empty regex -
   effectively disabling them by default (as intended).
- Contrib Rules:
 - Added 'ci' and 'build' to conventional commit types (#135)
- Under-the-hood: minor performance improvements (removed some unnecessary
  regex matching), test improvements, improved debug logging, CI runs on pull
  requests, PR request template.

Full Release details in CHANGELOG.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug User-facing bugs windows Windows related issues
Projects
None yet
Development

No branches or pull requests

2 participants