PR: Use chardet as a fallback for encoding detection #3742

rlaverde · 2016-11-24T21:37:33Z

I think it's better to keep the existing logic for detecting text encoding and just add chardet as a fallback.

Using only chardet, some UTF-8 text will be detected as ASCII (because UTF-8 is a superset of ASCII).
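To illustrate the ASCII/UTF-8 point, here is a minimal sketch using only the standard library. The sample strings and the `decodes_as` helper are made up for illustration and are not part of this PR. Bytes in the 0-127 range decode identically under both codecs, so a detector that only sees such bytes has no evidence to prefer UTF-8 and may reasonably report ASCII.

```python
# Pure-ASCII bytes are valid in both encodings and decode to the same text,
# so the bytes alone cannot tell a detector which label the author intended.
ascii_only = b"print('hello')\n"
assert ascii_only.decode("ascii") == ascii_only.decode("utf-8")

# Once a non-ASCII character appears, the two encodings diverge:
# UTF-8 encodes it as multiple bytes, and the ASCII codec rejects them.
with_accent = "café".encode("utf-8")


def decodes_as(data, codec):
    """Hypothetical helper: True if `data` can be decoded with `codec`."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as(ascii_only, "ascii"), decodes_as(ascii_only, "utf-8"))
print(decodes_as(with_accent, "ascii"), decodes_as(with_accent, "utf-8"))
```

This is why a declared encoding, when the file has one, is more informative than detection on files that happen to contain only ASCII so far.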

ccordoba12 · 2016-11-25T01:10:14Z

spyder/utils/encoding.py

@@ -111,6 +113,17 @@ def get_coding(text):
            # sometimes we find a false encoding that can result in errors
            if codec in CODECS:
                return codec
+
+    # Falback using chardet
+    if is_binary_string(text):


This will only work for text that's not Unicode, so why not remove all non-Unicode encodings from CODECS above and use chardet for the non-Unicode cases?

goanpeca · 2016-11-25

Plus a typo: it should be "Fallback" (with an extra L).

rlaverde · 2016-11-25

@ccordoba12 I think it's better to rely on the information in the file, which could be more accurate, and use encoding detection as a fallback. chardet recommends that approach (although it refers to MIME types).

@goanpeca yep, duck-typing, I'll correct it :)

goanpeca · 2016-11-25

I agree with Rafa. I think what is in there is good enough (checking the line at the top of the file); if that is not found, then we can use chardet, which is heavier to run.

ccordoba12 · 2016-11-25

Ok, I agree with you guys, thanks for the clarification :-)

@rlaverde, please don't forget to fix the little typo pointed out by @goanpeca on the line above this line :-)
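The order agreed on in this thread (trust the declaration at the top of the file first, fall back to detection only when it is missing) can be sketched roughly as follows. This is not Spyder's `get_coding()`: the `guess_encoding` name and the regular expression are assumptions for illustration, and the only chardet call used is `chardet.detect()`, which returns a dict with an `encoding` key. chardet is treated as optional here so that the sketch runs without it.

```python
import re

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None

# PEP 263 style declaration, e.g. "# -*- coding: latin-1 -*-"
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def guess_encoding(data):
    """Hypothetical helper: return an encoding name for `data`, or None."""
    # 1. Cheap check: look for a declaration in the first two lines.
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    # 2. Heavier fallback: statistical detection, if chardet is installed.
    if chardet is not None:
        return chardet.detect(data)["encoding"]
    return None


print(guess_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))
```

A real implementation would presumably also validate the declared name against a list of known codecs, as the `if codec in CODECS` check in the diff above does.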

ccordoba12 · 2016-11-25T01:26:12Z

Where are we determining the encoding for files in the Editor, @rlaverde? Could you point to the exact place where we are doing it?

rlaverde · 2016-11-25T14:42:46Z

@ccordoba12 the editor calls two functions of the encoding module, read() and write(). These functions call encode() and decode(), which in turn call get_coding() (the one I modified).

The encoding is guessed in encode(), decode() and get_coding().

Example: https://github.com/spyder-ide/spyder/blob/master/spyder/widgets/editor.py#L1258

ccordoba12 · 2016-11-25T15:49:21Z

@rlaverde, please add some files to test this new feature:

One with cp1252 (or ANSI) Windows encoding.
Another one with Big5 or another Chinese encoding.
And a final one with a Cyrillic encoding (KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, or windows-1251).

Thanks!
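One possible way to produce such test files is sketched below. The sample strings, file names and the `write_samples` helper are assumptions for illustration and are not the fixtures added in this PR. Real fixtures would likely need longer text, since statistical detectors tend to be unreliable on very short inputs.

```python
import os
import tempfile

# Sample text per encoding; the strings and file names are illustrative only.
SAMPLES = {
    "cp1252": "Señor, voilà, naïve",
    "big5": "中文測試",
    "koi8-r": "Привет, мир",
}


def write_samples(directory):
    """Write one file per encoding and return {encoding: path}."""
    paths = {}
    for encoding, text in SAMPLES.items():
        path = os.path.join(directory, encoding + ".txt")
        with open(path, "wb") as f:
            f.write(text.encode(encoding))
        paths[encoding] = path
    return paths


with tempfile.TemporaryDirectory() as tmp:
    for encoding, path in write_samples(tmp).items():
        with open(path, "rb") as f:
            raw = f.read()
        # Round trip: the bytes decode back to the original text.
        assert raw.decode(encoding) == SAMPLES[encoding]
```

Reading the files back in binary mode, as the round trip does, mirrors how a test would hand raw bytes to the detection code.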

ccordoba12 · 2016-11-25T20:14:33Z

There are failures on Python 2, please fix them.

ccordoba12 · 2016-11-25T20:38:05Z

spyder/utils/tests/test_encoding.py

@@ -31,6 +32,8 @@ def test_is_text_file(tmpdir):
 def test_files_encodings(expected_encoding, text_file):
    with open(os.path.join(__location__, text_file), 'rb') as f:
        text = f.read()
+        if PY2:
+            text = str(text)


Are we also ensuring this in the Editor, i.e. that text read from files in Python 2 is converted to bytes before detecting its encoding?

rlaverde · 2016-11-25

No, I was wrong; this wasn't the problem with Python 2.

The problem is trying to convert bytes to a string without knowing the encoding: in Python 3, str(some_bytes_with_no_utf8_encoding) does not raise (it returns the repr of the bytes rather than decoded text), but the conversion fails in Python 2.
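For reference, the Python 3 side of this can be checked with a short sketch (the sample bytes are made up for illustration). Calling `str()` on bytes without an encoding does not decode anything, while an explicit decode with the wrong codec raises, which is why the encoding has to be known before converting.

```python
raw = "Привет".encode("koi8-r")  # bytes that are not valid UTF-8

# Without an encoding, str() does not decode at all: it returns the repr.
as_repr = str(raw)
assert as_repr.startswith("b'")

# Decoding with the wrong codec raises instead of guessing.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Decoding with the right codec recovers the text.
text = raw.decode("koi8-r")
print(decoded_ok, text)
```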

ccordoba12 · 2016-11-27T16:13:34Z

I think this one is ready, thanks @rlaverde!

Fixes #3731

Use chardet as a fallback for encoding detection

27f4968

ccordoba12 reviewed Nov 25, 2016

ccordoba12 changed the title ~~Use chardet as a fallback for encoding detection~~ PR: Use chardet as a fallback for encoding detection Nov 25, 2016

ccordoba12 added this to the v3.1 milestone Nov 25, 2016

rlaverde added 2 commits November 25, 2016 11:00

Fix little typo

dca5fd9

Test text files with different character encodings

2bb47cf

rlaverde force-pushed the correctly-report-encoding branch from 1837c16 to 2bb47cf Compare November 25, 2016 17:49

ccordoba12 reviewed Nov 25, 2016

Fix errors with get_coding and no-uft-8 files

86d7a21

rlaverde force-pushed the correctly-report-encoding branch from 18b297a to 86d7a21 Compare November 25, 2016 21:10

ccordoba12 merged commit cea8fce into spyder-ide:3.x Nov 27, 2016

ccordoba12 added a commit that referenced this pull request Nov 27, 2016

Merge from 3.x: PR #3742

b7182d2

Fixes #3731

ccordoba12 mentioned this pull request Nov 27, 2016

Saving a file is not respecting the encoding detected with chardet #3753

Closed

rlaverde deleted the correctly-report-encoding branch December 26, 2016 14:48

goanpeca assigned rlaverde Jan 13, 2017
