UnicodeDecodeError on Macs #20

erikannotations · 2021-11-26T02:15:41Z

The following error has occurred on at least two Macs running Coleto. It occurs with Coleto's included Loaisel corpus, as well as another user's texts:

`== coleto: running text_wdiff. ==
Looking good: wdiff results have been written to disk.
Traceback (most recent call last):
  File "coleto/run_coleto.py", line 44, in <module>
    main()
  File "coleto/run_coleto.py", line 34, in main
    text_wdiff.main(params)
  File "/Users/USER/Documents/coleto-main/coleto/text_wdiff.py", line 64, in main
    check_results(params["wdiffed_file"])
  File "/Users/USER/Documents/coleto-main/coleto/text_wdiff.py", line 39, in check_results
    wdiffed = infile.read()
  File "/Users/USER/opt/anaconda3/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1904: invalid continuation byte`

The byte value and the position reported in the message vary with the input data.
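For context, Python raises this message when a lead byte such as 0xc3 (the first byte of é, ï and many other accented Latin letters in UTF-8) is not followed by a valid continuation byte. A minimal sketch that reproduces the same message from a hand-made byte string; the sample bytes are made up for illustration and are not taken from the Loaisel corpus:

```python
# "é" is encoded in UTF-8 as the two bytes 0xc3 0xa9.
good = "é".encode("utf-8")
assert good == b"\xc3\xa9"

# Hypothetical damaged data: the lead byte 0xc3 is followed by a space
# instead of its continuation byte.
damaged = b"caf\xc3 noir"

try:
    damaged.decode("utf-8")
except UnicodeDecodeError as err:
    message = str(err)        # same wording as in the traceback above
    bad_position = err.start  # byte offset of the offending byte
    print(message)
```

The exception's `start` attribute gives the byte offset, which corresponds to the "position" number in the traceback.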

erikannotations · 2021-11-26T02:40:38Z

In both the Loaisel texts and the user's texts, the Mac detects the encoding as UTF-8.

I tried reading each .txt file into a Python string, ignoring any encoding errors, and then writing the string out as a new text file:

with open("Loaisel_1779.txt", "r", encoding="utf-8", errors="ignore") as f:
    text = f.read()

But the error persists...
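Because `errors='ignore'` silently drops undecodable bytes, it cannot show whether an input file was valid in the first place. A stricter check, sketched here with a hypothetical helper named `first_invalid_utf8_offset` (not part of Coleto), decodes the raw bytes in strict mode and reports where the first problem is. Note that the traceback above fails while reading the wdiff output file rather than the input texts, so clean inputs would not by themselves rule out the error.

```python
from pathlib import Path


def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        return err.start
    return None
```

Calling `first_invalid_utf8_offset("Loaisel_1779.txt")` would return `None` if the file is valid UTF-8 from start to end.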

erikannotations · 2021-11-26T03:07:52Z

Okay, I found a workaround.

When this error occurs, Coleto still creates a wdiffed.txt file in the Outputs folder. By inspecting the insertions and deletions marked by wdiff, the user can locate encoding errors, which show up as strings that look like code or random symbols. They are often caused by diacritics and punctuation marks in the original texts. After correcting or removing these characters in the original text files, Coleto runs without the error. For example:

é -> e
ï -> i
— -> -
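The manual inspection described above could be partly automated. The following is a rough sketch, not Coleto code: `find_undecodable_lines` is a hypothetical helper, and the path to wdiffed.txt depends on the local setup. It reads the file as bytes, so reading cannot fail, and lists the lines that are not valid UTF-8.

```python
def find_undecodable_lines(path):
    """Return (line_number, preview) pairs for lines that are not valid UTF-8."""
    hits = []
    with open(path, "rb") as infile:  # binary mode: no decoding while reading
        for number, raw in enumerate(infile, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                # errors="replace" shows each bad byte as U+FFFD
                preview = raw.decode("utf-8", errors="replace").rstrip("\n")
                hits.append((number, preview))
    return hits
```

Each reported line can then be compared with the same passage in the source texts to find the character that triggers the problem.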

This may have something to do with how these characters are encoded under different language settings on Macs, but I am not sure.

christofs · 2021-11-26T07:49:23Z

Thanks for testing and reporting, I'll have a look.

christofs · 2021-11-30T19:14:27Z

I've not been able to replicate this on a Linux machine. The files appear to be encoded in UTF-8, as they should be. Additional files in German have not caused this error either. Most posts I have seen on this issue treat it as a case where the real encoding of the files is something else, such as Latin-1 (ISO-8859-1), but that doesn't appear to be the case here. I'll try to get my hands on a Mac to do some more testing.

erikannotations assigned christofs Nov 26, 2021

erikannotations self-assigned this Nov 26, 2021
