
CSV ingest improvements #3767

Closed
landreev opened this issue Apr 12, 2017 · 26 comments
Comments

@landreev
Contributor

landreev commented Apr 12, 2017

(This issue is for a clearly defined, short-term goal, not another generic "improve xxx" issue.)

The original CSV parser was purposefully restrictive: strict formatting, one line per observation (no newlines in fields), a fixed number of commas per line, etc. These requirements are no longer relevant. At the same time, Gary specifically requested that the CSV ingest handle full text, with escaped newlines, rich punctuation, etc. Example: the file posts_all.tab (CSV original) in https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QSZMPD.

One way to define the goal would be to say that any Google/Excel spreadsheet exported as CSV should be parseable by our ingest. (I will add more details on how they escape punctuation characters and such.)

A sensible way to achieve this would be to switch to an existing open source parser (Apache Commons seems like a good candidate), rather than maintaining our own.
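
For illustration only - a minimal sketch (not Dataverse code) of how Apache Commons CSV, one candidate library, handles a quoted field containing a comma, escaped quotes, and an embedded newline. The class name and sample input are made up:

import java.io.IOException;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class QuotedCsvSketch {
    public static void main(String[] args) throws IOException {
        // The second field is quoted and contains a comma, an escaped
        // quote (""), and an embedded newline -- the "full text" case.
        String csv = "id,content\n"
                   + "1,\"He said \"\"hello, world\"\"\nand logged off.\"\n";
        try (CSVParser parser = new CSVParser(new StringReader(csv), CSVFormat.RFC4180)) {
            for (CSVRecord record : parser) {
                // Each record has 2 fields; the newline stays inside the field.
                System.out.println(record.size() + " fields: " + record.get(record.size() - 1));
            }
        }
    }
}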

@djbrooke djbrooke changed the title CSV ingest improvements, 4.6.2 CSV ingest improvements Apr 12, 2017
@djbrooke djbrooke changed the title CSV ingest improvements CSV ingest improvements, 4.6.2 Apr 12, 2017
@djbrooke djbrooke changed the title CSV ingest improvements, 4.6.2 CSV ingest improvements Apr 19, 2017
@djbrooke djbrooke changed the title CSV ingest improvements CSV ingest improvements, 4.6.2 Apr 19, 2017
@djbrooke djbrooke changed the title CSV ingest improvements, 4.6.2 CSV ingest improvements Apr 19, 2017
@djbrooke djbrooke added ready and removed ready labels Apr 19, 2017
@raprasad
Contributor

Quick check with pandas - it parses the file without complaint:

>>> import pandas as pd
>>> df = pd.read_csv('posts_all.csv')
>>> df.columns
Index([u'What.is.the.city.', u'folder', u'file',
       u'What.is.the.name.of.the.organization.making.posts.', u'url',
       u'content', u'site', u'What.is.the.account.name.of.the.person.posting.',
       u'PostDate', u'category', u'textseg'],
      dtype='object')
>>> df['What.is.the.city.'].describe()
count         43757
unique            1
top       Zhanggong
freq          43757
Name: What.is.the.city., dtype: object
>>> df.describe()
         category
count  188.000000
mean     3.936170
std      0.381624
min      3.000000
25%      4.000000
50%      4.000000
75%      4.000000
max      5.000000
>>> df['What.is.the.name.of.the.organization.making.posts.'].describe()
count     43722
unique      247
top         网宣办
freq       9159
Name: What.is.the.name.of.the.organization.making.posts., dtype: object

etc, etc
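
(Worth noting: df.describe() in this version of pandas summarizes only the numeric columns by default, which is why only category appears in the whole-DataFrame summary above.)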

@pdurbin
Member

pdurbin commented Apr 21, 2017

@raprasad heh. @mercecrosas also just mentioned to us that we could use the readxl R package, which just reached 1.0: https://blog.rstudio.org/2017/04/19/readxl-1-0-0/ . Using either Python or R for this sounds more like #2331 to me.

Back at #585 (comment) @bencomp mentioned using https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVParser.html , which I believe is what @landreev was referring to above.

In short, we need to decide what's in scope for this issue. I thought when we estimated this issue as an 8 on Wednesday we were talking about the Java/Apache Commons route, not the Python or R route. Perhaps we should re-estimate this issue and/or estimate #2331.

@pdurbin
Member

pdurbin commented Jun 21, 2017

@oscardssmith does your pull request at #3930 also fix #2626?

@oscardssmith
Contributor

No idea, I'll put it in as a test case. I'd imagine it's fixed, though.

@landreev
Contributor Author

landreev commented Jun 22, 2017

@oscardssmith the file I was talking about is the one mentioned in the initial description, at the top of the issue.

@djbrooke djbrooke assigned sekmiller and unassigned oscardssmith Jul 11, 2017
@sekmiller sekmiller removed their assignment Jul 13, 2017
@landreev
Contributor Author

Reviewing the results of the investigation (great job, btw), I feel we should consider at least one possible improvement/fix to be within the scope of this development iteration: all these cases of files that are NOT actually CSV, but tab-delimited, that still get ingested as CSV files with just 1 column. The current code does this in a less broken way than what we were doing before, but it's still wrong to be doing it.

So let's discuss this: do we want to change the behavior now, and if we do, how? I'm assuming that all these ingests happen because the files get uploaded with file names that have the ".csv" extension. We then assume that it is indeed a CSV, give it a try as such, and accept the result unless the parser explicitly fails. It appears that we need to be more careful there: if the parser works but we only find 1 column, we should do some extra checks to see whether it was a CSV file in the first place. Maybe we should be counting the tabs as we parse? And if we've reached the end of the file without finding any commas, maybe we should just try to parse again, using the tab as the delimiter character? (And what do we do if that works - ingest the file, or reject it with "this is not a CSV file"?)
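
To make that concrete, a hypothetical sketch of such a check; the helper name and the exact heuristic are made up, not the actual ingest code:

import org.apache.commons.csv.CSVFormat;

public class DelimiterGuessSketch {

    // Hypothetical helper (not the actual ingest code): if the first line
    // of a ".csv" upload contains no commas but does contain tabs, fall
    // back to the tab-delimited format - or, alternatively, reject the
    // file as "not a CSV".
    static CSVFormat guessFormat(String firstLine) {
        long commas = firstLine.chars().filter(c -> c == ',').count();
        long tabs = firstLine.chars().filter(c -> c == '\t').count();
        return (commas == 0 && tabs > 0) ? CSVFormat.TDF : CSVFormat.RFC4180;
    }
}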

An alternative, of course, is to declare this out of scope, release as is, and open a new issue for it - as long as it is accounted for and on the list of things that need to be addressed, because what we are doing now really does seem wrong.

@landreev landreev self-assigned this Jul 17, 2017
@oscardssmith
Contributor

I think it makes the most sense to call this out of scope, and get this pushed out the door.

@djbrooke
Contributor

If it's incrementally better, let's get it tested and released! :)

@landreev
Contributor Author

OK, I'll open a new issue for this; and finish the review of all the other cases, shortly.

@djbrooke djbrooke added this to the 4.8 - Large Data Upload Integration milestone Jul 20, 2017
@landreev
Contributor Author

Moving the issue into QA.

@kcondon kcondon self-assigned this Jul 31, 2017
dlmurphy added a commit that referenced this issue Jul 31, 2017
Fixed a couple typos I missed earlier.
@kcondon
Contributor

kcondon commented Aug 1, 2017

OK, tested basic CSV ingest and all tests passed:

  1. Number, string, and date (YYYY-MM-DD) ingest.
  2. An unrecognized date format is treated as a string.
  3. Missing values end up missing in the resulting tab files, except for strings, which become empty quotes. In TwoRavens, missing numeric values show as invalid; dates and strings show as valid, consistent with the prior version. Missing values are expressed in the CSV as no value, without quotes.
  4. Tested a comma inside a string value by enclosing the string in quotes.
  5. Tested a carriage return inside a string value by enclosing it in quotes and examining the resulting tab file.
  6. Tested accented characters in strings.
  7. Tested error handling of mismatched columns and values.

So, in addition to the extensive automated ingest of production CSV files, this seems to pass testing.

One last minor request: perhaps mention some of the above behavior in the docs, especially missing values, quoting to preserve commas and carriage returns, and support for UTF-8 characters.
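
For illustration, a made-up input covering several of the cases above - quoted comma, quoted line break, accented (UTF-8) characters, and a missing value in the last record, expressed as no value without quotes:

name,comment,date
"Müller","a, b, and c",2017-08-01
"Strauß","first line
second line",2017-08-02
,plain value,2017-08-03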

@landreev landreev self-assigned this Aug 1, 2017
@landreev
Contributor Author

landreev commented Aug 1, 2017

OK, I'll quickly add the few things we've discussed with @dlmurphy to the guide; then will pass it to Derek for review.

@landreev
Contributor Author

landreev commented Aug 2, 2017

@kcondon @dlmurphy: I wrote, and rewrote, a few paragraphs in the doc. @dlmurphy, please review for clarity, etc. Anything extra we want to document/discuss should definitely be addressed in the next dev iteration. (There will be another iteration sometime soon, to add direct support for tab-delimited files.) As it is, I've already ended up writing more than I thought was in scope for this issue.
