Parsing issue of quoted values in large CSV file #1930

Closed
peterdesmet opened this issue Jan 19, 2023 · 7 comments

@peterdesmet
Member

I ran into a CSV parsing issue.

The source file I'm uploading is a CSV file (2.5 million rows) that only uses "quotes" when needed. This is the default readr::write_csv() behaviour, e.g. to escape commas in values. Note that I am not indicating Field Quotes: " in the IPT, as that is reserved for when all values are quoted.

Snippet of source file. Notice quoted "BIG FLOCK, WIDESPREAD FORAGING ON SPRAT" in occurrenceRemarks

eventID,basisOfRecord,occurrenceID,individualCount,sex,lifeStage,behavior,occurrenceStatus,associatedTaxa,occurrenceRemarks,scientificNameID,scientificName,kingdom
110000125_110000125_110003829,HumanObservation,110000125_110000125_110003829_1100024895,200,,,Deep pluging,present,,"BIG FLOCK, WIDESPREAD FORAGING ON SPRAT",urn:lsid:marinespecies.org:taxname:137156,Rissa tridactyla,Animalia

In the generated Darwin Core Archive, the resulting file is the following. Notice how "BIG FLOCK ... is now spread over multiple fields:

id	basisOfRecord	occurrenceID	occurrenceRemarks	individualCount	sex	lifeStage	behavior	occurrenceStatus	associatedTaxa	eventID	scientificNameID	scientificName	kingdom
110000125_110000125_110003829	HumanObservation	110000125_110000125_110003829_1100024895	"BIG FLOCK	200			Deep pluging	present		110000125_110000125_110003829	WIDESPREAD FORAGING ON SPRAT"	urn:lsid:marinespecies.org:taxname:137156	Rissa tridactyla

Any idea what might be causing this? This is the first time I have encountered it, even though I have uploaded many such CSV files (with values quoted only when necessary) to the IPT before without ever running into issues. See e.g. https://www.gbif.org/occurrence/3795234906, where some values contain commas:

UvA-BiTS tag attached by harness to free-ranging animal | Found dead, possibly predated by peregrine falcon.
peterdesmet added a commit to EMODnet/esas2obis that referenced this issue Jan 19, 2023
mike-podolskiy90 self-assigned this Jan 26, 2023
@mike-podolskiy90
Contributor

mike-podolskiy90 commented Jan 26, 2023

@peterdesmet Thank you, Peter, for reporting this, and sorry for the late reply.
It looks like setting Field Quotes resolves this issue, but the description of that field is not correct in that case.
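
For illustration, the effect of the quote setting can be reproduced outside the IPT. The following is a minimal sketch using Python's standard `csv` module, not the IPT's own parser; the row is a shortened version of the one in the report, and the `parse` helper is a hypothetical name introduced here. Parsing with no quote character mimics an empty Field Quotes value.

```python
import csv
import io

# Shortened version of the reported row (fewer columns, same quoted value).
ROW = 'HumanObservation,200,"BIG FLOCK, WIDESPREAD FORAGING ON SPRAT",Rissa tridactyla\n'


def parse(text, quotechar):
    """Parse a single CSV row.

    quotechar=None mimics an empty Field Quotes setting: quote characters
    get no special treatment and every comma splits a field.
    """
    if quotechar is None:
        reader = csv.reader(io.StringIO(text), delimiter=",", quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(io.StringIO(text), delimiter=",", quotechar=quotechar)
    return next(reader)


without_quotes = parse(ROW, None)  # remark is split at the embedded comma
with_quotes = parse(ROW, '"')      # remark stays in one field

print(len(without_quotes), without_quotes)
print(len(with_quotes), with_quotes)
```

Without a quote character the remark ends up in two fields, each keeping one of the literal quote marks, which matches the pattern in the generated archive.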

@peterdesmet
Member Author

Oh great, so Field Quotes = " also works for files that are quoted only where needed. Is there a reason why Field Quotes doesn't use " by default? It is the default in the CSV readers and writers I'm aware of:

@mike-podolskiy90
Contributor

Yes, it works. Also, I've just tried creating a couple of file sources, and they all have Field Quotes = " by default (they were .txt files, though).

@peterdesmet
Member Author

Thanks! I notice now that the other csv or even csv.zip files I uploaded as sources (in other resources) all got Field Quotes = ". So this seems like an isolated case.

In any case, it would indeed be good if the description of the field were updated.

@peterdesmet
Member Author

I did some more tests. It looks like the IPT assesses the source data to decide whether values are quoted:

  • No quoted values found: use Field Quotes = <empty>
  • Quoted values found: use Field Quotes = "

So far so good. The issue seems to be that it only looks at the first 20 lines of a file. Compare:

  • line_10.csv: quoted behavior value in line 10, will set Field Quotes = "
  • line_21.csv: quoted behavior value in line 21, will set Field Quotes = <empty>

This will lead to parsing issues down the line. The user can of course always set Field Quotes manually to resolve the issue, but then they need to remember to do so. So I'd like to know if there are any downsides to always setting Field Quotes = "?
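
The behaviour observed in these tests can be sketched as follows. This is a hypothetical reconstruction based only on the observations above, not the IPT's actual code: the function names, the 20-line limit as a constant, and the generated test files are all assumptions made for illustration.

```python
import io

# Assumption: a 20-line sample, as suggested by the tests above.
SAMPLE_LINES = 20


def detect_field_quotes(stream, sample_lines=SAMPLE_LINES, quotechar='"'):
    """Return quotechar if it occurs in the first sample_lines lines, else ''."""
    for index, line in enumerate(stream):
        if index >= sample_lines:
            break  # lines beyond the sample are never inspected
        if quotechar in line:
            return quotechar
    return ""


def make_file(quoted_line, total_lines=30):
    """Build a CSV whose only quoted value sits on the given 1-based line."""
    rows = ["id,behavior"]  # line 1 is the header
    for n in range(2, total_lines + 1):
        value = '"Diving, shallow"' if n == quoted_line else "Diving"
        rows.append(f"{n},{value}")
    return io.StringIO("\n".join(rows) + "\n")


line_10 = detect_field_quotes(make_file(10))  # quoted value inside the sample
line_21 = detect_field_quotes(make_file(21))  # quoted value just outside it

print(repr(line_10), repr(line_21))
```

Under this sketch, a file whose first quoted value appears on line 21 or later is treated as unquoted, which would explain why a 2.5 million row file with sparse quoting was misdetected.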

@mike-podolskiy90
Contributor

@peterdesmet Thank you for spending your time testing this issue.
I was also thinking about setting Field Quotes to " by default. I don't see any issues with that right now, but I assume some may appear.
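
As a rough way to probe for such issues, the two settings can be compared on sample rows. The sketch below again uses Python's `csv` module as a stand-in, so the results only describe that parser and may differ from the IPT; the sample rows and the `parse` helper are made up for illustration. It checks a row without any quotes and a row with a stray literal quote at the start of a value, which is one case where a default quote character could change the result.

```python
import csv
import io


def parse(text, use_quotes):
    """Parse one row with '"' as the quote character, or with quoting disabled."""
    quoting = csv.QUOTE_MINIMAL if use_quotes else csv.QUOTE_NONE
    return next(csv.reader(io.StringIO(text), delimiter=",", quoting=quoting))


# A row without any quote characters: both settings should agree.
plain = "1,Diving,present\n"
plain_same = parse(plain, True) == parse(plain, False)

# A row where a value starts with a literal quote that is not CSV quoting.
stray = '1,"Big" flock,present\n'
stray_quoted = parse(stray, True)
stray_unquoted = parse(stray, False)

print(plain_same)
print(stray_quoted)
print(stray_unquoted)
```

In this parser, files without quote characters are unaffected by the default, while a value that begins with a literal quote loses its quote marks when quoting is enabled. Whether the IPT handles the second case the same way would need to be tested separately.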

mike-podolskiy90 added this to the 2.7.7 milestone Nov 16, 2023
@mike-podolskiy90
Contributor

Let's just set " as a default parameter #2190
