-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PR: Use chardet as a fallback for encoding detection #3742
PR: Use chardet as a fallback for encoding detection #3742
Conversation
@@ -111,6 +113,17 @@ def get_coding(text): | |||
# sometimes we find a false encoding that can result in errors | |||
if codec in CODECS: | |||
return codec | |||
|
|||
# Falback using chardet | |||
if is_binary_string(text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will only work for text that's not unicode, so why not remove from CODECS
above all encodings that are not unicode and use chardet for non-unicode stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus typo, Fallback (with and extra L
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ccordoba12 I think that It's better to relay in the information of the file that could be more accurate, and use encoding detection as a fallback, chardet recommend that way (although it make reference to mime types)
@goanpeca yep, duck-typing I'll correct it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Rafa, I think what is in there is good enough (checking the line at the top of the file) in case it is not found then we can use chardet that is heavier to check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are we computing determining the encoding for files in the Editor @rlaverde? Could you point to the exact point where we are doing it? |
@ccordoba12 editor call two functions of encoding module The encoding is guessed in example: https://github.com/spyder-ide/spyder/blob/master/spyder/widgets/editor.py#L1258 |
@rlaverde, please add some files to test this new feature:
Thanks! |
1837c16
to
2bb47cf
Compare
There are failures on Python 2, please fix them. |
@@ -31,6 +32,8 @@ def test_is_text_file(tmpdir): | |||
def test_files_encodings(expected_encoding, text_file): | |||
with open(os.path.join(__location__, text_file), 'rb') as f: | |||
text = f.read() | |||
if PY2: | |||
text = str(text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we assuring this also in the Editor, i.e. that text read from files in Python 2 is converted to bytes before detecting its encoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I was wrong, this wasn't the problem with python2
The problem is trying to convert bytes to string without knowing the encoding, in python3 str(some_bytes_with_no_utf8_encoding)
return an empty string but It fails in python2
18b297a
to
86d7a21
Compare
I think this one is ready, thanks @rlaverde! |
Fixes #3731
I think It's better to leave the logic that already exist to detect text encoding, and just add chardet as a fallback
Using only chardet some utf-8 text will be detect as ascii (because utf-8 is a superset of ascii)