File encoding checks fail with uchardet when the file already is UTF-8 #129

Closed
Zharktas opened this issue Jan 24, 2024 · 1 comment · Fixed by #130

Comments

Zharktas commented Jan 24, 2024

Describe the bug
uchardet reports the file as UTF-8, but datapusher still tries to re-encode it.

To Reproduce

Example file https://www.avoindata.fi/data/dataset/4b64be55-5a69-4f6b-a9d5-d4cbbe5c4382/resource/6409bdec-4a48-46a9-8729-5a727e37cd55/download/data.csv

Upload it to datapusher; the log says the following:

Fetching from: http://localhost/data/fi/dataset/3180686e-5a17-4b83-8d93-8a3cc954fccd/resource/bdcf31b0-3c90-46e7-8b56-94a6ad1f897f/download/test_data.csv...
File format: CSV
Downloading 52.99MB file...
Fetched 52.99MB file in 0.67 seconds.
ANALYZING WITH QSV..
Normalizing/UTF-8 transcoding CSV...
Identified encoding of the file: UTF-8

File is not UTF-8 encoded. Re-encoding from UTF-8
 to UTF-8

iconv: source charset UTF-8
Invalid argument
Job aborted as the file cannot be re-encoded to UTF-8: Command '['iconv', '-f', 'UTF-8\n', '-t', 'UTF-8', '/tmp/tmpr35v1f3a/tmp.CSV', '--output', '/tmp/tmpr35v1f3a/qsv_input_utf_8_encoded.csv']' returned non-zero exit status 1.

It appears to be failing here:

if file_encoding.stdout != "UTF-8":

@Zharktas
Contributor Author

Running uchardet actually leaves a line ending in the variable, so the check fails. In my testing, file_encoding.stdout contains UTF-8\n, which never compares equal to "UTF-8" and is also passed unchanged to iconv as the source charset.
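
A minimal sketch of one way to make the comparison robust, assuming the detector output arrives as a string with a trailing newline as described above. The helper names normalize_encoding, needs_reencoding and detect_encoding are hypothetical and not taken from datapusher; this is not necessarily how the fix in #130 was implemented.

```python
import subprocess


def normalize_encoding(raw: str) -> str:
    """Strip surrounding whitespace (including a trailing newline) and upper-case the name."""
    return raw.strip().upper()


def needs_reencoding(raw: str) -> bool:
    """Return True only when the detected encoding is something other than UTF-8."""
    return normalize_encoding(raw) != "UTF-8"


def detect_encoding(path: str) -> str:
    """Run uchardet on a file and return the cleaned encoding name.

    Hypothetical wrapper: assumes the uchardet binary is on PATH.
    """
    result = subprocess.run(
        ["uchardet", path], capture_output=True, text=True, check=True
    )
    return normalize_encoding(result.stdout)
```

With this, needs_reencoding("UTF-8\n") is False, whereas the raw comparison "UTF-8\n" != "UTF-8" is True. Using the cleaned value would also keep the stray newline out of the iconv -f argument when a re-encode really is needed.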
