Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update direct csv fast load to handle empty ending lines on csv and other edge conditions #206

Closed
duttonw opened this issue Jan 31, 2024 · 5 comments

Comments

@duttonw
Copy link
Collaborator

duttonw commented Jan 31, 2024

Example logs from importing 40mb csv file with 400,000+ rows.

https://www.data.qld.gov.au/dataset/5efaa096-4480-4540-88be-a10ababd9f49/resource/a14317b7-2fca-41b7-8294-9a1f7a085b0f

deleting "a14317b7-2fca-41b7-8294-9a1f7a085b0f" from DataStore.
09:57:34,304 INFO  [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Deleting "a14317b7-2fca-41b7-8294-9a1f7a085b0f" from DataStore.
09:57:34,414 DEBUG [ckanext.datastore.logic.action] Setting datastore_active=False on resource a14317b7-2fca-41b7-8294-9a1f7a085b0f
/mnt/local_data/ckan_venv/src/ckan/ckanext/datastore/logic/action.py:646: SAWarning: TypeDecorator JsonDictType() will not produce a cache key because the ``cache_ok`` attribute is not set to True
.  This can have significant performance implications including some performance degradations in comparison to prior SQLAlchemy versions.  Set this attribute to True if this type object's state is
 safe to use in a cache key, or False to disable this warning. (Background on this error at: https://sqlalche.me/e/14/cprf)
  q.update({'extras': extras}, synchronize_session=False)
Fields: [{'id': 'NOTIFICATION_DATE', 'type': 'text', 'info': {'notes': 'Date Queensland Health were notified of first COVID-19 detection from case.', 'type_override': '', 'label': 'Notification Da
te'}}, {'id': 'AGE_GROUP_5Y', 'type': 'text', 'info': {'notes': 'The five-year age group according to age of case at time of detection.', 'type_override': '', 'label': ''}}]
09:57:34,919 INFO  [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Fields: [{'id': 'NOTIFICATION_DATE', 'type': 'text', 'info': {'notes': 'Date Queensland Health were notified of first COVID-19 detection f
rom case.', 'type_override': '', 'label': 'Notification Date'}}, {'id': 'AGE_GROUP_5Y', 'type': 'text', 'info': {'notes': 'The five-year age group according to age of case at time of detection.',
'type_override': '', 'label': ''}}]
09:57:34,981 DEBUG [ckanext.datastore.logic.action] Setting datastore_active=True on resource a14317b7-2fca-41b7-8294-9a1f7a085b0f
Copying to database...
09:57:35,269 INFO  [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Copying to database...
missing data for column "AGE_GROUP_5Y"
CONTEXT:  COPY a14317b7-2fca-41b7-8294-9a1f7a085b0f, line 1761313: ""

09:57:41,240 WARNI [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] missing data for column "AGE_GROUP_5Y"
CONTEXT:  COPY a14317b7-2fca-41b7-8294-9a1f7a085b0f, line 1761313: ""

Load using COPY failed: Error during the load into PostgreSQL: missing data for column "AGE_GROUP_5Y"
CONTEXT:  COPY a14317b7-2fca-41b7-8294-9a1f7a085b0f, line 1761313: ""

09:57:41,248 WARNI [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Load using COPY failed: Error during the load into PostgreSQL: missing data for column "AGE_GROUP_5Y"
CONTEXT:  COPY a14317b7-2fca-41b7-8294-9a1f7a085b0f, line 1761313: ""

Trying again with tabulator
09:57:41,253 INFO  [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Trying again with tabulator
Determining column names and types
09:57:41,259 INFO  [e0db9009-2f3f-4a5d-b2bd-3f5018581f08] Determining column names and types
@duttonw
Copy link
Collaborator Author

duttonw commented Jan 31, 2024

similar issue
https://www.data.qld.gov.au/dataset/ambient-estuarine-water-quality-monitoring-data-near-real-time-sites-2012-to-present-day/resource/5d1e8368-7ec3-435a-92d0-280ad1e3db0d?truncate=30&inner_span=True

aving chunk 64
10:57:08,871 INFO  [86b1fa53-ffb9-4d52-9042-afb9b66ed3d6] Saving chunk 64
10:57:09,717 ERROR [ckanext.xloader.jobs] xloader error: list index out of range, Traceback (most recent call last):
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/loader.py", line 293, in load_csv
    f)
psycopg2.errors.BadCopyFileFormat: extra data after last expected column
CONTEXT:  COPY 5d1e8368-7ec3-435a-92d0-280ad1e3db0d, line 16306: "2023-05-14 12:10:00,129534,,,,,,,,,,,-20.169,,148.464,,,,0,,,,,,264.4,,,,,,3.187,,,,,,,,,,,,,,,,,,,,..."


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/jobs.py", line 222, in xloader_data_into_datastore_
    direct_load()
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/jobs.py", line 173, in direct_load
    logger=logger)
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/loader.py", line 301, in load_csv
    ' {}'.format(error_str))
ckanext.xloader.job_exceptions.LoaderError: Error during the load into PostgreSQL: extra data after last expected column
CONTEXT:  COPY 5d1e8368-7ec3-435a-92d0-280ad1e3db0d, line 16306: "2023-05-14 12:10:00,129534,,,,,,,,,,,-20.169,,148.464,,,,0,,,,,,264.4,,,,,,3.187,,,,,,,,,,,,,,,,,,,,..."


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/jobs.py", line 80, in xloader_data_into_datastore
    xloader_data_into_datastore_(input, job_dict)
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/jobs.py", line 226, in xloader_data_into_datastore_
    tabulator_load()
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/jobs.py", line 195, in tabulator_load
    logger=logger)
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/loader.py", line 454, in load_table
    for i, records in enumerate(chunky(result, 250)):
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/loader.py", line 513, in chunky
    item = list(itertools.islice(it, n))
  File "/mnt/local_data/ckan_venv/src/ckanext-xloader/ckanext/xloader/loader.py", line 413, in row_iterator
    data_row[headers[index]] = cell
IndexError: list index out of range

@duttonw
Copy link
Collaborator Author

duttonw commented Jan 31, 2024

another blocking fast load:

https://www.data.qld.gov.au/dataset/queensland-covid-19-case-line-list-location-source-of-infection/resource/1dbae506-d73c-4c19-b727-e8654b8be95a

'23505'}}
11:13:22,309 WARNI [2b90e4ed-21ff-4d12-85f8-a78da6e47e09] Load using COPY failed: Validation error when creating the database table: None - {'constraints': ['Cannot insert records or create indexbecause of uniqueness constraint'], 'info': {'orig': 'duplicate key value violates unique constraint "pg_type_typname_nsp_index"\nDETAIL:  Key (typname, typnamespace)=(1dbae506-d73c-4c19-b727-e8654b8be95a__id_seq, 17092) already exists.\n', 'pgcode': '23505'}}
Trying again with tabulator
11:13:22,317 INFO  [2b90e4ed-21ff-4d12-85f8-a78da6e47e09] Trying again with tabulator
Determining column names and types
11:13:22,327 INFO  [2b90e4ed-21ff-4d12-85f8-a78da6e47e09] Determining column names and types
load_table: Decoded encoding: {'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}

@duttonw duttonw changed the title Update direct csv load to handle empty ending lines on csv Update direct csv fast load to handle empty ending lines on csv and other edge conditions Jan 31, 2024
ThrawnCA added a commit that referenced this issue Feb 9, 2024
Add a unit test sample that has an extra empty line. This ideally should be handled gracefully (ie ignore the extra line)
ThrawnCA added a commit that referenced this issue Feb 9, 2024
Extra empty lines at the end of a file should be ignored.
ThrawnCA added a commit that referenced this issue Feb 9, 2024
Skip rows that are completely blank instead of erroring out
@ThrawnCA
Copy link
Collaborator

ThrawnCA commented Feb 9, 2024

@duttonw I'm not sure how feasible it is to handle columns with the wrong number of commas, but completely blank rows are simple enough. Tabulator has built-in functionality to let us skip them.

ThrawnCA added a commit to qld-gov-au/ckanext-xloader that referenced this issue Feb 28, 2024
ThrawnCA added a commit to qld-gov-au/ckanext-xloader that referenced this issue Mar 15, 2024
@duttonw
Copy link
Collaborator Author

duttonw commented Apr 5, 2024

once qld-gov-au#90 reaches /ckan/ckanet-xloader this can be closed.

@duttonw
Copy link
Collaborator Author

duttonw commented Apr 16, 2024

resolved in qld-gov-au#90, it will get to ckan org version in due time.

@duttonw duttonw closed this as completed Apr 16, 2024
peterVorman added a commit to OpenGov-OpenData/ckanext-xloader that referenced this issue Aug 5, 2024
* commit 'a96ce28c589dfe6b1b850d8eeeb14f1e1dfe9759': (80 commits)
  Add note about 2.11 support
  Update images and actions, test 2.11
  feat(tests): added tyoe guess on mixed integers;
  add more ignorable blank lines to test sample, ckan#206
  add more options for maintainers to expedite XLoader runs, GitHub ckan#202
  strip extra space for column name
  In plugin.py, there is an fix of resource format key error
  fix list syntax for combining range and dict
  skip blank rows in source files, ckan#206
  add unit test for handling empty lines, ckan#206
  add sample file with extra blank line at end, ckan#206
  fix(tests): less complicated;
  further cleanup
  fix(tests): finalized test method;
  fix(tests): subrequest params;
  fix(tests): module path;
  feat(tests): added new test;
  fix(syntax): flake8;
  fix(helpers): comments and better syntax;
  fix(templates): set in block;
  ...

# Resolved conflicts:
#	.github/workflows/test.yml
#	ckanext/xloader/controllers.py
#	ckanext/xloader/plugin.py
#	ckanext/xloader/templates-bs2/package/resource_edit_base.html
#	ckanext/xloader/templates/package/resource_edit_base.html
#	ckanext/xloader/utils.py
#	ckanext/xloader/views.py
peterVorman added a commit to OpenGov-OpenData/ckanext-xloader that referenced this issue Aug 5, 2024
* commit 'a96ce28c589dfe6b1b850d8eeeb14f1e1dfe9759': (80 commits)
  Add note about 2.11 support
  Update images and actions, test 2.11
  feat(tests): added tyoe guess on mixed integers;
  add more ignorable blank lines to test sample, ckan#206
  add more options for maintainers to expedite XLoader runs, GitHub ckan#202
  strip extra space for column name
  In plugin.py, there is an fix of resource format key error
  fix list syntax for combining range and dict
  skip blank rows in source files, ckan#206
  add unit test for handling empty lines, ckan#206
  add sample file with extra blank line at end, ckan#206
  fix(tests): less complicated;
  further cleanup
  fix(tests): finalized test method;
  fix(tests): subrequest params;
  fix(tests): module path;
  feat(tests): added new test;
  fix(syntax): flake8;
  fix(helpers): comments and better syntax;
  fix(templates): set in block;
  ...

# Resolved conflicts:
#	.github/workflows/test.yml
#	ckanext/xloader/controllers.py
#	ckanext/xloader/plugin.py
#	ckanext/xloader/templates-bs2/package/resource_edit_base.html
#	ckanext/xloader/templates/package/resource_edit_base.html
#	ckanext/xloader/utils.py
#	ckanext/xloader/views.py
peterVorman added a commit to OpenGov-OpenData/ckanext-xloader that referenced this issue Aug 5, 2024
* commit 'a96ce28c589dfe6b1b850d8eeeb14f1e1dfe9759': (80 commits)
  Add note about 2.11 support
  Update images and actions, test 2.11
  feat(tests): added tyoe guess on mixed integers;
  add more ignorable blank lines to test sample, ckan#206
  add more options for maintainers to expedite XLoader runs, GitHub ckan#202
  strip extra space for column name
  In plugin.py, there is an fix of resource format key error
  fix list syntax for combining range and dict
  skip blank rows in source files, ckan#206
  add unit test for handling empty lines, ckan#206
  add sample file with extra blank line at end, ckan#206
  fix(tests): less complicated;
  further cleanup
  fix(tests): finalized test method;
  fix(tests): subrequest params;
  fix(tests): module path;
  feat(tests): added new test;
  fix(syntax): flake8;
  fix(helpers): comments and better syntax;
  fix(templates): set in block;
  ...

# Resolved conflicts:
#	.github/workflows/test.yml
#	ckanext/xloader/controllers.py
#	ckanext/xloader/plugin.py
#	ckanext/xloader/templates-bs2/package/resource_edit_base.html
#	ckanext/xloader/templates/package/resource_edit_base.html
#	ckanext/xloader/utils.py
#	ckanext/xloader/views.py
peterVorman added a commit to OpenGov-OpenData/ckanext-xloader that referenced this issue Aug 5, 2024
* commit 'a96ce28c589dfe6b1b850d8eeeb14f1e1dfe9759': (80 commits)
  Add note about 2.11 support
  Update images and actions, test 2.11
  feat(tests): added tyoe guess on mixed integers;
  add more ignorable blank lines to test sample, ckan#206
  add more options for maintainers to expedite XLoader runs, GitHub ckan#202
  strip extra space for column name
  In plugin.py, there is an fix of resource format key error
  fix list syntax for combining range and dict
  skip blank rows in source files, ckan#206
  add unit test for handling empty lines, ckan#206
  add sample file with extra blank line at end, ckan#206
  fix(tests): less complicated;
  further cleanup
  fix(tests): finalized test method;
  fix(tests): subrequest params;
  fix(tests): module path;
  feat(tests): added new test;
  fix(syntax): flake8;
  fix(helpers): comments and better syntax;
  fix(templates): set in block;
  ...

# Resolved conflicts:
#	.github/workflows/test.yml
#	ckanext/xloader/controllers.py
#	ckanext/xloader/plugin.py
#	ckanext/xloader/templates-bs2/package/resource_edit_base.html
#	ckanext/xloader/templates/package/resource_edit_base.html
#	ckanext/xloader/utils.py
#	ckanext/xloader/views.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants