Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type guessing in 0.12 is less robust #182

Closed
ThrawnCA opened this issue Mar 30, 2023 · 6 comments
Closed

Type guessing in 0.12 is less robust #182

ThrawnCA opened this issue Mar 30, 2023 · 6 comments

Comments

@ThrawnCA
Copy link
Collaborator

The new type guessing system in 0.12+ encounters errors that the old one didn't. For example,
QOLSVC-1280_current-program-funding-dtis_1_2_2_2_1.csv results in an error due to cells not being recognised as any type, even string:

Error: File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/utils.py", line 221, in type_guess _columns.append(max(guesses_tuples, key=lambda t_n: t_n[1])[0]) ValueError('max() arg is an empty sequence') 

Or when a field is recognised as numeric or timestamp, and then a later row leaves that field blank, as per
QOLSVC-1280_current-program-funding-dtis_1_2_2_2_2.csv:

Error: Validation error writing rows to db: None - {'message': 'The data was invalid: invalid input syntax for type timestamp: ""'} 

The old messytables-based type guessing had no trouble parsing these.

@ThrawnCA
Copy link
Collaborator Author

ThrawnCA commented Apr 17, 2023

It appears that the first problem arises when data is numeric on some rows but free text on other rows. The numeric rows cause str to be removed from the set of guessed types, and the string rows cause Decimal to be removed, leaving no guessed types at all. However, the at_least_one_value flag is still set to True, so the fallback str type isn't applied.

@ThrawnCA
Copy link
Collaborator Author

ThrawnCA commented Apr 17, 2023

The "invalid input syntax for type timestamp" error appears to occur because datetime.datetime accepts an empty string as a valid timestamp value, but the SQL timestamp type does not.

ThrawnCA added a commit to qld-gov-au/ckanext-xloader that referenced this issue Apr 17, 2023
- If all types have been rejected, ensure that the fallback flag is correctly set
ThrawnCA added a commit to qld-gov-au/ckanext-xloader that referenced this issue Apr 17, 2023
- replace empty strings with None if they have types that will choke on empty string
ThrawnCA added a commit to qld-gov-au/ckanext-xloader that referenced this issue Apr 18, 2023
- Columns that used numeric on some rows and free text on others resulted in no type being guessed and an error
ThrawnCA added a commit to qld-gov-au/ckanext-xloader that referenced this issue Apr 18, 2023
…#182

- 'timestamp' and 'numeric' cannot handle empty strings, so convert to None
@ThrawnCA
Copy link
Collaborator Author

ThrawnCA commented May 26, 2023

Found another issue, where specific data combinations will result in the parser ignoring double quotes and just splitting rows on every comma. I'm not sure exactly of the exact root cause yet, continuing to investigate.
blp_dashboard_dataset_31-mar-2023.csv
The attached file contains 2 rows and will trigger the error, but if you upload a file with just one of the rows - either one, it doesn't matter - then it succeeds.

0.11 has no trouble with this file.

@ThrawnCA
Copy link
Collaborator Author

Ah, found the cause. The input file uses double quotes, but it encloses so many single quotes that the csv module assumes single quotes must be the quote character.

I'm not sure whether it's better to override this and always use double quotes, or set it up to try both.

@ThrawnCA
Copy link
Collaborator Author

One option would be to reduce parser.CSV_SAMPLE_LINES to 1, so it only samples the header, not the body, when determining the dialect to use. That way has less risk of weird edge cases in the data. But there may be better approaches.

@ThrawnCA
Copy link
Collaborator Author

It appears that Messytables was sampling 1000 lines to sniff the CSV dialect, rather than 100, which is why it behaves differently on the attached file.

ThrawnCA added a commit to qld-gov-au/ckanext-xloader that referenced this issue May 29, 2023
- Messytables used to use 1000 rows, the Tabulator approach should do the same
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant