PR: Use chardet as a fallback for encoding detection #3742

Merged
merged 4 commits into spyder-ide:3.x from the correctly-report-encoding branch on Nov 27, 2016

Conversation

rlaverde
Member

@rlaverde rlaverde commented Nov 24, 2016

Fixes #3731


I think it's better to keep the existing logic for detecting text encoding and just add chardet as a fallback.

Using only chardet, UTF-8 text that contains only ASCII characters will be detected as ASCII (because UTF-8 is a superset of ASCII).
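The ambiguity can be shown with the standard library alone. This is a minimal sketch, not chardet's algorithm; `valid_codecs` is a hypothetical helper that simply reports which candidate codecs decode a byte string without error.

```python
def valid_codecs(data, candidates=("ascii", "utf-8")):
    """Return the candidate codecs that decode `data` without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# A file that declares UTF-8 but only contains ASCII characters so far.
ascii_only = b"# -*- coding: utf-8 -*-\nprint('hello')\n"
# The same kind of file once a non-ASCII character appears.
with_accent = "print('café')\n".encode("utf-8")

print(valid_codecs(ascii_only))   # ['ascii', 'utf-8']
print(valid_codecs(with_accent))  # ['utf-8']
```

Because the ASCII-only bytes are valid under both codecs, a detector that looks only at the bytes has no reason to prefer UTF-8, whereas the coding declaration in the file states the intent.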

@@ -111,6 +113,17 @@ def get_coding(text):
                # sometimes we find a false encoding that can result in errors
                if codec in CODECS:
                    return codec

    # Falback using chardet
    if is_binary_string(text):
Member

This will only work for text that's not Unicode, so why not remove all non-Unicode encodings from CODECS above and use chardet for the non-Unicode cases?

Member

Plus a typo: Fallback (with an extra L)

Member Author

@ccordoba12 I think it's better to rely on the information in the file itself, which could be more accurate, and use encoding detection as a fallback. chardet recommends that approach (although it refers to MIME types).

@goanpeca yep, duck-typing. I'll correct it :)

Member

I agree with Rafa. I think what is in there is good enough (checking the coding line at the top of the file); if it is not found, then we can use chardet, which is a heavier check.

Member

Ok, I agree with you guys, thanks for the clarification :-)

@rlaverde, please don't forget to fix the little typo pointed out by @goanpeca on the line above this line :-)
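The order agreed on in this thread (trust a declaration in the file first, run the heavier detector only when none is found) could look roughly like the sketch below. It is not the code from this PR: `guess_coding`, `CODING_RE`, and `KNOWN_CODECS` are hypothetical names, the codec whitelist is made up for illustration, and chardet is treated as optional so the sketch still runs without it.

```python
import re

try:
    import chardet  # third-party detector; optional in this sketch
except ImportError:
    chardet = None

# Matches a declaration such as "# -*- coding: utf-8 -*-" or "coding=latin-1".
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

# Hypothetical whitelist of codec names accepted from a declaration.
KNOWN_CODECS = {"ascii", "utf-8", "latin-1", "iso8859-1", "cp1252"}


def guess_coding(data):
    """Return a codec name for the bytes `data`, or None if nothing is found."""
    # 1. Cheap check: look for a declaration in the first two lines.
    for line in data.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            codec = match.group(1).decode("ascii").lower()
            # Ignore declarations that name a codec we do not recognize.
            if codec in KNOWN_CODECS:
                return codec
    # 2. Heavier fallback: statistical detection on the raw bytes.
    if chardet is not None:
        return chardet.detect(data)["encoding"]
    return None


print(guess_coding(b"# -*- coding: latin-1 -*-\nx = 1\n"))  # latin-1
```

Putting the declaration check first keeps the common case cheap and avoids overriding an explicit statement in the file with a statistical guess.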

@ccordoba12 ccordoba12 changed the title from "Use chardet as a fallback for encoding detection" to "PR: Use chardet as a fallback for encoding detection" Nov 25, 2016
@ccordoba12
Member

Where are we determining the encoding for files in the Editor, @rlaverde? Could you point to the exact place where we are doing it?

@rlaverde
Member Author

@ccordoba12 the editor calls two functions of the encoding module, read() and write(); these call encode() and decode(), which in turn call get_coding() (the one I modified).

The encoding is guessed in encode(), decode(), and get_coding().

example: https://github.com/spyder-ide/spyder/blob/master/spyder/widgets/editor.py#L1258
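A simplified picture of that call chain is sketched below. It only mirrors the layering described above and is not Spyder's encoding module: the function bodies are assumptions, the `get_coding` here is a stand-in that tries UTF-8 and otherwise assumes Latin-1, and in this sketch only the read path consults the detector.

```python
import os
import tempfile


def get_coding(data):
    """Stand-in detector (assumption): UTF-8 if it decodes, else Latin-1."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


def decode(data):
    """Turn raw bytes into (text, encoding) using the guessed codec."""
    encoding = get_coding(data)
    return data.decode(encoding), encoding


def encode(text, encoding):
    """Turn text into (bytes, encoding), using UTF-8 if the codec cannot represent it."""
    try:
        return text.encode(encoding), encoding
    except UnicodeEncodeError:
        return text.encode("utf-8"), "utf-8"


def read(filename):
    """Read a file as bytes and let decode() pick the encoding."""
    with open(filename, "rb") as f:
        return decode(f.read())


def write(text, filename, encoding="utf-8"):
    """Encode text and write it out; return the encoding actually used."""
    data, used = encode(text, encoding)
    with open(filename, "wb") as f:
        f.write(data)
    return used


path = os.path.join(tempfile.mkdtemp(), "sample.txt")
write("café", path, encoding="latin-1")
text, encoding = read(path)
print(text, encoding)  # café latin-1
```

The editor only ever sees `read()` and `write()`, so improving the detector changes the reported encoding without touching the callers.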

@ccordoba12 ccordoba12 added this to the v3.1 milestone Nov 25, 2016
@ccordoba12
Member

ccordoba12 commented Nov 25, 2016

@rlaverde, please add some files to test this new feature:

  1. One with cp1252 (or ANSI) Windows encoding.
  2. Another one with Big5 or another Chinese encoding.
  3. And a final one with a Cyrillic encoding (KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, or windows-1251).

Thanks!
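Fixture files of the kind requested above can be generated with the standard library. The sketch below is only an illustration: the sample strings and file names are made up and are not the files added in this PR. It writes one file per encoding and checks that each one round-trips with its own codec but is not valid UTF-8, which is the situation where detection is needed.

```python
import os
import tempfile

# Illustrative sample text per encoding (Python codec names).
SAMPLES = {
    "cp1252": "café naïve",
    "big5": "中文測試",
    "koi8_r": "Привет, мир",
}


def write_samples(directory):
    """Write one file per encoding and return {path: encoding}."""
    written = {}
    for encoding, text in SAMPLES.items():
        path = os.path.join(directory, encoding + ".txt")
        with open(path, "wb") as f:
            f.write(text.encode(encoding))
        written[path] = encoding
    return written


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


files = write_samples(tempfile.mkdtemp())
for path, encoding in files.items():
    with open(path, "rb") as f:
        raw = f.read()
    # Each sample round-trips with its own codec but fails as UTF-8.
    assert raw.decode(encoding) == SAMPLES[encoding]
    assert not is_valid_utf8(raw)
```

A detection test would then read each file in binary mode and compare the detector's answer with the known encoding, ignoring case.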

@rlaverde rlaverde force-pushed the correctly-report-encoding branch from 1837c16 to 2bb47cf Compare November 25, 2016 17:49
@ccordoba12
Member

There are failures on Python 2, please fix them.

@@ -31,6 +32,8 @@ def test_is_text_file(tmpdir):
def test_files_encodings(expected_encoding, text_file):
    with open(os.path.join(__location__, text_file), 'rb') as f:
        text = f.read()
        if PY2:
            text = str(text)
Member

Are we also ensuring this in the Editor, i.e. that text read from files in Python 2 is converted to bytes before detecting its encoding?

Member Author

@rlaverde rlaverde Nov 25, 2016

No, I was wrong, this wasn't the problem with Python 2.

The problem is trying to convert bytes to a string without knowing the encoding. In Python 3, str(some_bytes_with_no_utf8_encoding) returns a string without raising (the repr of the bytes, not decoded text), but it fails in Python 2.
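For reference, the Python 3 side of this can be checked directly. The snippet below is a small illustration for Python 3 only (Python 2 behaves differently) and uses a made-up Latin-1 byte string: calling `str()` on bytes without an encoding does not decode anything, while an explicit decode either raises or succeeds depending on the codec.

```python
raw = "café".encode("latin-1")  # b'caf\xe9', which is not valid UTF-8

# str() without an encoding does not decode; it returns the repr of the bytes.
as_repr = str(raw)

# Decoding with the wrong codec raises instead of guessing.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the right codec recovers the text.
as_text = str(raw, "latin-1")

print(as_repr)  # b'caf\xe9'
print(utf8_ok)  # False
print(as_text)  # café
```

This is why the bytes need to reach the detector unchanged: any conversion to text before the encoding is known either fails or silently produces something that is no longer the file's content.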

@rlaverde rlaverde force-pushed the correctly-report-encoding branch from 18b297a to 86d7a21 Compare November 25, 2016 21:10
@ccordoba12
Member

I think this one is ready, thanks @rlaverde!

@ccordoba12 ccordoba12 merged commit cea8fce into spyder-ide:3.x Nov 27, 2016
ccordoba12 added a commit that referenced this pull request Nov 27, 2016
@rlaverde rlaverde deleted the correctly-report-encoding branch December 26, 2016 14:48

3 participants