UnicodeDecodeError on Macs #20
Comments
In both the Loaisel texts and the user's texts, the Mac detects the encoding as UTF-8. I tried reading each txt file into a Python string, ignoring any encoding errors, and then writing the string to a new text file: `with open("Loaisel_1779.txt", "r", encoding="utf-8", errors="ignore") as f:` But the error persists...
Okay, I found a workaround. When this error occurs, Coleto still creates a wdiffed.txt file in the Outputs folder. By inspecting the insertions and deletions marked by Wdiff, the user can locate encoding errors (they show up as strings that look like code or random symbols), which are often caused by diacritics and punctuation marks in the original texts. After correcting or omitting these characters in the original text files (e.g. é -> e), Coleto runs error-free. This may have something to do with how these characters are encoded under different language settings on Macs, not sure...
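The manual inspection described above could be partly automated. The following is a minimal sketch, not part of Coleto: `find_invalid_utf8` and `show_context` are hypothetical helper names, and the sample bytes are made up to resemble Wdiff's `[-...-]` and `{+...+}` markers. It lists the offset of every byte span that is not valid UTF-8, plus a little surrounding text to help find the spot in the original files.

```python
from typing import List, Tuple


def find_invalid_utf8(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (offset, offending_bytes) for every span that is not valid UTF-8."""
    problems = []
    offset = 0
    while offset < len(data):
        try:
            data[offset:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            start = offset + err.start
            end = offset + err.end
            problems.append((start, data[start:end]))
            offset = end  # resume just after the bad span
    return problems


def show_context(data: bytes, position: int, width: int = 20) -> str:
    """Decode the bytes around `position`; undecodable bytes become U+FFFD.

    The window may cut a multi-byte character at its edges, which can add
    extra U+FFFD characters at the start or end of the returned string.
    """
    lo = max(0, position - width)
    return data[lo:position + width].decode("utf-8", errors="replace")


# Hypothetical sample: a lone 0xc3 lead byte followed by an ASCII marker.
sample = b"caf\xc3[-e-] {+\xc3\xa9+}"
for position, bad in find_invalid_utf8(sample):
    print(position, bad, repr(show_context(sample, position)))
```

To run this on a real file, read it in binary mode first, e.g. `data = open("wdiffed.txt", "rb").read()`, with the path adjusted to wherever the Outputs folder lives.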
Thanks for testing and reporting, I'll have a look.
I've not been able to replicate this on a Linux machine. The files appear to be encoded in UTF-8, as they should be. Additional files in German have not caused this error either. Most posts I have seen on this issue treat it as a case where the real encoding of the files is something else, like latin-1 (ISO-8859-1), but that doesn't appear to be the case here. I'll try to get my hands on a Mac to do some more testing.
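One way to sanity-check the "it is really latin-1" theory is to look at which byte the decoder rejects. In UTF-8, é is the two bytes 0xc3 0xa9, whereas in latin-1 it is the single byte 0xe9. The sketch below (the helper name `first_bad_byte` is made up for illustration) shows that latin-1 data typically fails on a byte like 0xe9, while a 0xc3 failure is what a broken UTF-8 sequence would produce. This is suggestive rather than conclusive, since 0xc3 is also a valid latin-1 character (Ã).

```python
from typing import Optional


def first_bad_byte(data: bytes) -> Optional[int]:
    """Return the first byte that strict UTF-8 decoding rejects, or None."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return data[err.start]


utf8_text = "résumé".encode("utf-8")      # é stored as 0xc3 0xa9
latin1_text = "résumé".encode("latin-1")  # é stored as 0xe9
broken_utf8 = b"r\xc3 sum"                # 0xc3 lead byte with no continuation

for label, data in [("utf-8", utf8_text),
                    ("latin-1", latin1_text),
                    ("broken utf-8", broken_utf8)]:
    bad = first_bad_byte(data)
    print(label, None if bad is None else hex(bad))
```

Note that decoding with latin-1 never raises, because every byte value maps to a character, so "it opens fine as latin-1" does not by itself tell you much.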
The following error has occurred on at least two Macs running Coleto. It occurs with Coleto's included Loaisel corpus, as well as another user's texts:
`== coleto: running text_wdiff. ==
Looking good: wdiff results have been written to disk.
Traceback (most recent call last):
File "coleto/run_coleto.py", line 44, in <module>
main()
File "coleto/run_coleto.py", line 34, in main
text_wdiff.main(params)
File "/Users/USER/Documents/coleto-main/coleto/text_wdiff.py", line 64, in main
check_results(params["wdiffed_file"])
File "/Users/USER/Documents/coleto-main/coleto/text_wdiff.py", line 39, in check_results
wdiffed = infile.read()
File "/Users/USER/opt/anaconda3/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1904: invalid continuation byte`
The byte value and the position reported in the last line vary with the input data.
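The traceback shows the failing read is of the wdiffed file inside `check_results`, not of the input texts, which may explain why re-saving the inputs with `errors='ignore'` did not help. As a stopgap only, and not as Coleto's actual code, a tolerant read could look like the sketch below. `read_tolerant` is a hypothetical helper; it replaces undecodable bytes with U+FFFD and reports how many were replaced, so the problem stays visible instead of crashing the run.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"  # U+FFFD, inserted by errors="replace" for bad bytes


def read_tolerant(path: str) -> Tuple[str, int]:
    """Read a file as UTF-8 without raising on invalid bytes.

    Returns the decoded text and the number of U+FFFD characters in it.
    The count also includes any U+FFFD already present in the file.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as infile:
        text = infile.read()
    return text, text.count(REPLACEMENT)
```

A caller might use it as `text, n_bad = read_tolerant(path)` and print a warning when `n_bad` is greater than zero. This hides the symptom rather than fixing the cause, so the underlying source of the invalid bytes would still need investigating.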