
CSV ingest improvements #3767

Closed
landreev opened this issue Apr 12, 2017 · 26 comments
Comments

@landreev
Contributor

landreev commented Apr 12, 2017

(This issue is for a clearly defined, short-term goal, not another generic "improve xxx" issue.)

The original CSV parser was purposefully restrictive: strict formatting, one line per observation (no newlines in fields), a fixed number of commas per line, etc. These requirements are no longer relevant. At the same time, Gary specifically requested that the CSV ingest handle full text, with escaped newlines, rich punctuation, etc. Example: the file posts_all.tab (CSV original) in https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QSZMPD.

One way to define the goal would be to say that any Google/Excel spreadsheet exported as CSV should be parseable by our ingest. (I will add more details on how they escape punctuation characters and such.)

A sensible way to achieve this would be to switch to an existing open source parser (Apache Commons seems like a good candidate), rather than maintaining our own.
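
For illustration only - a minimal sketch (not Dataverse code) of how Apache Commons CSV, one candidate library, handles a quoted field containing a comma, escaped quotes, and an embedded newline. The class name and sample input are made up:

import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class QuotedCsvSketch {
    public static void main(String[] args) throws IOException {
        // The second field is quoted and contains a comma, an escaped
        // quote (""), and an embedded newline -- the "full text" case.
        String csv = "id,content\n"
                   + "1,\"He said \"\"hello, world\"\"\nand logged off.\"\n";
        try (CSVParser parser = new CSVParser(new StringReader(csv), CSVFormat.RFC4180)) {
            for (CSVRecord record : parser) {
                // Each record has 2 fields; the newline stays inside the field.
                System.out.println(record.size() + " fields: " + record.get(record.size() - 1));
            }
        }
    }
}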

@djbrooke djbrooke changed the title CSV ingest improvements, 4.6.2 CSV ingest improvements Apr 12, 2017
@djbrooke djbrooke changed the title CSV ingest improvements CSV ingest improvements, 4.6.2 Apr 12, 2017
@djbrooke djbrooke changed the title CSV ingest improvements, 4.6.2 CSV ingest improvements Apr 19, 2017
@djbrooke djbrooke changed the title CSV ingest improvements CSV ingest improvements, 4.6.2 Apr 19, 2017
@djbrooke djbrooke changed the title CSV ingest improvements, 4.6.2 CSV ingest improvements Apr 19, 2017
@djbrooke djbrooke added ready and removed ready labels Apr 19, 2017
@raprasad
Contributor

Quick check with pandas - it parses the file without complaint:

>>> import pandas as pd
>>> df = pd.read_csv('posts_all.csv')
>>> df.columns
Index([u'What.is.the.city.', u'folder', u'file',
       u'What.is.the.name.of.the.organization.making.posts.', u'url',
       u'content', u'site', u'What.is.the.account.name.of.the.person.posting.',
       u'PostDate', u'category', u'textseg'],
      dtype='object')
>>> df['What.is.the.city.'].describe()
count         43757
unique            1
top       Zhanggong
freq          43757
Name: What.is.the.city., dtype: object
>>> df.describe()
         category
count  188.000000
mean     3.936170
std      0.381624
min      3.000000
25%      4.000000
50%      4.000000
75%      4.000000
max      5.000000
>>> df['What.is.the.name.of.the.organization.making.posts.'].describe()
count     43722
unique      247
top         网宣办
freq       9159
Name: What.is.the.name.of.the.organization.making.posts., dtype: object

etc, etc
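
(Worth noting: df.describe() in this version of pandas summarizes only the numeric columns by default, which is why only category appears in the whole-DataFrame summary above.)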

@pdurbin
Member

pdurbin commented Apr 21, 2017

@raprasad heh. @mercecrosas also just mentioned to us that we could use the readxl R package, which just reached 1.0: https://blog.rstudio.org/2017/04/19/readxl-1-0-0/ . Using either Python or R for this sounds more like #2331 to me.

Back at #585 (comment) @bencomp mentioned using https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVParser.html , which I believe is what @landreev was referring to above.

In short, we need to decide what's in scope for this issue. I thought when we estimated this issue as an 8 on Wednesday we were talking about the Java/Apache Commons route, not the Python or R route. Perhaps we should re-estimate this issue and/or estimate #2331.

@pdurbin
Member

pdurbin commented Jun 21, 2017

@oscardssmith does your pull request at #3930 also fix #2626?

@oscardssmith
Contributor

No idea, I'll put it in as a test case. I'd imagine it's fixed, though.

@landreev
Contributor Author

landreev commented Jun 22, 2017

@oscardssmith the file I was talking about is the one mentioned in the initial description, at the top of the issue.

@djbrooke djbrooke assigned sekmiller and unassigned oscardssmith Jul 11, 2017
@sekmiller sekmiller removed their assignment Jul 13, 2017
@landreev
Contributor Author

Reviewing the results of the investigation (great job, btw), I feel we should consider at least one possible improvement/fix to be within the scope of this development iteration: all these cases of files that are NOT actually CSV, but tab-delimited, that still get ingested as CSV files with just 1 column. The current code does this in a less broken way than what we were doing before, but it's still wrong to be doing it.

So let's discuss this: do we want to change the behavior now, and if we do, how? I'm assuming that all these ingests happen because the files get uploaded with file names that have the ".csv" extension. We then assume that it is indeed a CSV, give it a try as such, and accept the result unless the parser explicitly fails. It appears that we need to be more careful there: if the parser works but we only find 1 column, we should do some extra checks to see whether it was a CSV file in the first place. Maybe we should be counting the tabs as we parse? And if we've reached the end of the file without finding any commas, maybe we should just try to parse again, using the tab as the delimiter character? (And what do we do if that works - ingest the file, or reject it with "this is not a CSV file"?)
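
To make that concrete, a hypothetical sketch of such a check; the helper name and the exact heuristic are made up, not the actual ingest code:

import org.apache.commons.csv.CSVFormat;

public class DelimiterGuessSketch {

    // Hypothetical helper (not the actual ingest code): if the first line
    // of a ".csv" upload contains no commas but does contain tabs, fall
    // back to the tab-delimited format - or, alternatively, reject the
    // file as "not a CSV".
    static CSVFormat guessFormat(String firstLine) {
        long commas = firstLine.chars().filter(c -> c == ',').count();
        long tabs = firstLine.chars().filter(c -> c == '\t').count();
        return (commas == 0 && tabs > 0) ? CSVFormat.TDF : CSVFormat.RFC4180;
    }
}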

An alternative, of course, is to declare this out of scope, release as is, and open a new issue for it - as long as it is accounted for and on the list of things that need to be addressed, because what we are doing now really does seem wrong.

@landreev landreev self-assigned this Jul 17, 2017
@oscardssmith
Contributor

I think it makes the most sense to call this out of scope, and get this pushed out the door.

@djbrooke
Contributor

If it's incrementally better, let's get it tested and released! :)

@landreev
Contributor Author

OK, I'll open a new issue for this; and finish the review of all the other cases, shortly.

@djbrooke djbrooke added this to the 4.8 - Large Data Upload Integration milestone Jul 20, 2017
@landreev
Contributor Author

Moving the issue into QA.

@kcondon kcondon self-assigned this Jul 31, 2017
dlmurphy added a commit that referenced this issue Jul 31, 2017
Fixed a couple typos I missed earlier.
@kcondon
Contributor

kcondon commented Aug 1, 2017

OK, tested basic CSV ingest and all tests passed:

  1. Number, string, and date (YYYY-MM-DD) ingest.
  2. An unrecognized date format is treated as a string.
  3. Missing values end up missing in the resulting tab files, except for strings, which become empty quotes. In TwoRavens, missing numeric values show as invalid; dates and strings show as valid, consistent with the prior version. Missing values are expressed in the CSV as no value, without quotes.
  4. Tested a comma inside a string value by enclosing the string in quotes.
  5. Tested a carriage return inside a string value by enclosing it in quotes and examining the resulting tab file.
  6. Tested accented characters in strings.
  7. Tested error handling of mismatched columns and values.

So, in addition to the extensive automated ingest of production CSV files, this seems to pass testing.

One last minor request: perhaps mention some of the above behavior in the docs, especially missing values, quoting to preserve commas and carriage returns, and support for UTF-8 characters.
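
For illustration, a made-up input covering several of the cases above - quoted comma, quoted line break, accented (UTF-8) characters, and a missing value in the last record, expressed as no value without quotes:

name,comment,date
"Müller","a, b, and c",2017-08-01
"Strauß","first line
second line",2017-08-02
,plain value,2017-08-03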

@landreev landreev self-assigned this Aug 1, 2017
@landreev
Contributor Author

landreev commented Aug 1, 2017

OK, I'll quickly add the few things we've discussed with @dlmurphy to the guide; then will pass it to Derek for review.

@landreev
Contributor Author

landreev commented Aug 2, 2017

@kcondon @dlmurphy: I wrote, and rewrote, a few paragraphs in the doc. @dlmurphy, please review for clarity, etc. Anything extra we want to document/discuss should definitely be addressed in the next dev iteration. (There will be another iteration sometime soon, to add direct support for tab-delimited files.) As it is, I've already ended up writing more than I thought was in scope for this issue.
