Unicode decoding issue on Windows #96
Comments
I spent some time looking into this. Disclaimer: I'm by no means a unicode expert and what follows is based on limited research. If anyone reading this has more expertise in this area, please chime in!
TLDR: A fix will follow that will take into account the
In a nutshell, the problem is here
In particular, on my Windows 10 VM, Python 2.7, both
$ python -c "import locale; print locale.getpreferredencoding()"
cp1252
This causes problems when gitlint tries to parse unicode characters encoded in
I first considered just removing the call to
I then considered hard-coding
So in a nutshell, what I'm thinking to implement now is:
Footnotes
Footnote 1: git itself requires LC_ALL on Windows
As a side-note,
# Without 'LC_ALL=C.UTF-8'
$ git log -1 --pretty="%an" 3ee281
<C5><81>ukasz Rogalski
# With 'LC_ALL=C.UTF-8'
$ set LC_ALL=C.UTF-8
$ git log -1 --pretty="%an" 3ee281
Łukasz Rogalski
I believe this is the relevant source-code in git itself: https://github.com/git/git/blob/5fa0f5238b0cd46cfe7f6fa76c3f526ea98148d9/gettext.c#L15-L32
Footnote 2: From what I could gather, there's a lot of discussion about Python's unicode character decoding on Windows.
Footnote 3: Python allows you to specify the error behavior for unicode errors (Python2, Python3). While it's possible to implement a git-like behavior of printing placeholder chars on unicode decoding errors, I'm probably not going to do this for now. I'd prefer gitlint to hard-crash on decoding issues for now so it's more likely that users report the issues they encounter.
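To make the failure mode and Footnote 3 concrete, here is a minimal Python 3 sketch (not gitlint's actual code) of what happens when the UTF-8 bytes for the author name above are decoded as cp1252. The variable names are made up for illustration. The second byte of "Ł" (0x81) has no mapping in Python's cp1252 codec, so the default strict error handler raises, while the "replace" handler substitutes a placeholder character instead.

```python
# UTF-8 bytes as git would emit them; "Ł" is encoded as C5 81.
raw = "Łukasz Rogalski".encode("utf-8")

# Decoding with the right codec round-trips cleanly.
as_utf8 = raw.decode("utf-8")

# Decoding with cp1252 and the default errors="strict" fails,
# because byte 0x81 is undefined in that code page.
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" never raises; undecodable bytes become U+FFFD.
lenient = raw.decode("cp1252", errors="replace")
```

The strict behavior is what makes decoding problems visible (and reportable), which is the trade-off Footnote 3 describes; the lenient variant hides the problem behind placeholder characters.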
Update: Needs more testing, doesn't solve the issue (yet)
This *should* fix #96. Also did a minor refactoring of utils.py that makes it easier to unit test some of the constants that are defined in that module.
The plot thickens! In my previous comment, I really only considered reads of unicode characters from the git command output, i.e. the cause of the crash in the original description. That issue is mostly solved by c939a0d (although I need to amend that commit with a small fix).
However, it turns out that writing unicode characters to the Windows console is a beast of its own - see issue1602. In a nutshell, properly and consistently printing Unicode characters to the Windows console is very messy in Python. From what I can gather, manually working around this is complicated.
The good news is that the Click library's click.echo() function (which we already use in part of the codebase) has all the necessary work-arounds baked in. So what I'm planning to do now is replace all occurrences where we are writing to stdout/stderr directly with the
The 2 most relevant places:
Hopefully this will work!
Gitlint will now respect the LC_ALL, LC_CTYPE and LANG environment variables on Windows - in line with git's behavior. The default fallback is UTF-8, which should lead to improved unicode parsing for the majority of users. This is a partial fix for #96. Also includes a minor refactoring of utils.py that makes it easier to unit test some of the constants that are defined in that module.
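The lookup described in this commit message could be sketched roughly as follows. This is an illustration only, not gitlint's implementation: the function name encoding_from_env and its parameters are hypothetical, and the parsing of values such as "C.UTF-8" (taking the part after the dot and dropping any "@modifier") is an assumption about how the locale string is interpreted. The order LC_ALL, then LC_CTYPE, then LANG follows the usual POSIX precedence.

```python
import os


def encoding_from_env(env=None, default="UTF-8"):
    """Hypothetical helper: pick an encoding name from locale variables."""
    env = os.environ if env is None else env
    # First variable that is set and non-empty wins.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # "en_US.UTF-8@euro" -> "en_US.UTF-8" -> "UTF-8"
            without_modifier = value.split("@", 1)[0]
            return without_modifier.rsplit(".", 1)[-1]
    # Nothing set: fall back to UTF-8 rather than the Windows code page.
    return default
```

Passing the environment in as a plain dict keeps the helper easy to unit test, e.g. encoding_from_env({"LC_ALL": "C.UTF-8"}).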
Quick example of something that doesn't work as expected yet:
echo WIP: tëst | gitlint
This will crash gitlint on a unicode detection error.
Is this the right place to report a bug? When I install the gitlint hook via pre-commit on Linux it works normally, but on Windows it fails with error
Even if the commit message is perfectly valid.
When specifying an unknown encoding on Windows via the LC_ALL, LC_CTYPE or LANG environment variables, gitlint will now fall back to UTF-8 instead of crashing. This is a common scenario when using the commit-msg hook. Relates to #96
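One way such a fallback could work is to ask Python's codec registry whether it knows the requested name before using it. The sketch below is an assumption about the general technique, not a copy of gitlint's code; usable_encoding is a hypothetical name. codecs.lookup() raises LookupError for names Python has no codec for.

```python
import codecs


def usable_encoding(name, default="UTF-8"):
    """Hypothetical helper: return name if Python has a codec for it."""
    try:
        codecs.lookup(name)
    except LookupError:
        # Unknown encoding name: use the default instead of crashing later
        # when the name is passed to bytes.decode().
        return default
    return name
```

With this in place, a locale value that does not name a real codec degrades to UTF-8 decoding instead of raising at the first decode call.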
@metya, can you try setting
# Regular windows CMD
Set LC_ALL=UTF-8
# git-bash/Cygwin
export LC_ALL=UTF-8
# Now try again
The reason this happens is because git sets
This doesn't solve all unicode issues on Windows (I've spent more time on it but no silver bullets...yet), but hopefully it should keep gitlint from crashing.
- IMPORTANT: Gitlint 0.14.x will be the last gitlint release to support Python 2.7 and Python 3.5, as both are EOL which makes it difficult to keep supporting them.
- Python 3.9 support
- New Rule: title-min-length enforces a minimum length on titles (default: 5 chars) (#138)
- New Rule: body-match-regex allows users to enforce that the commit-msg body matches a given regex (#130)
- New Rule: ignore-body-lines allows users to ignore parts of a commit by matching a regex against the lines in a commit message body (#126)
- Named Rules allow users to have multiple instances of the same rule active at the same time. This is useful when you want to enforce the same rule multiple times but with different options (#113, #66)
- User-defined Configuration Rules allow users to dynamically change gitlint's configuration and/or the commit before any other rules are applied.
- The commit-msg hook has been re-written in Python (it contained a lot of Bash before), fixing a number of platform specific issues. Existing users will need to reinstall their hooks (gitlint uninstall-hook; gitlint install-hook) to make use of this.
- Most general options can now be set through environment variables (e.g. set the general.ignore option via GITLINT_IGNORE=T1,T2). The list of available environment variables can be found in the configuration documentation.
- Users can now use self.log.debug("my message") for debugging purposes in their user-defined rules. Debug messages will show up when running gitlint --debug.
- Breaking: User-defined rule id's can no longer start with 'I', as those are reserved for built-in gitlint ignore rules.
- New RegexOption rule option type for use in user-defined rules. By using the RegexOption, regular expressions are pre-validated at gitlint startup and compiled only once which is much more efficient when linting multiple commits.
- Bugfixes:
  - Improved UTF-8 fallback on Windows (ongoing - #96)
  - Windows users can now use the 'edit' function of the commit-msg hook (#94)
  - Doc update: Users should use --ulimit nofile=1024 when invoking gitlint using Docker (#129)
  - The commit-msg hook was broken in Ubuntu's gitlint package due to a python/python3 mismatch (#127)
  - Better error message when no git username is set (#149)
  - Options can now actually be set to None (from code) to make them optional.
  - Ignore rules no longer have "None" as default regex, but an empty regex - effectively disabling them by default (as intended).
- Contrib Rules:
  - Added 'ci' and 'build' to conventional commit types (#135)
- Under-the-hood: minor performance improvements (removed some unnecessary regex matching), test improvements, improved debug logging, CI runs on pull requests, PR request template.
Full Release details in CHANGELOG.md.
When trying to decode Unicode characters on Windows, gitlint can crash.
This can easily be shown by trying to lint the commit 3ee281e of the gitlint commit history.