CSV ingest improvements #3767
```python
>>> import pandas as pd
>>> df = pd.read_csv('posts_all.csv')
>>> df.columns
Index([u'What.is.the.city.', u'folder', u'file',
       u'What.is.the.name.of.the.organization.making.posts.', u'url',
       u'content', u'site', u'What.is.the.account.name.of.the.person.posting.',
       u'PostDate', u'category', u'textseg'],
      dtype='object')
>>> df['What.is.the.city.'].describe()
count        43757
unique           1
top      Zhanggong
freq         43757
Name: What.is.the.city., dtype: object
>>> df.describe()
         category
count  188.000000
mean     3.936170
std      0.381624
min      3.000000
25%      4.000000
50%      4.000000
75%      4.000000
max      5.000000
>>> df['What.is.the.name.of.the.organization.making.posts.'].describe()
count     43722
unique      247
top         网宣办
freq       9159
Name: What.is.the.name.of.the.organization.making.posts., dtype: object
```

etc, etc
@raprasad heh. @mercecrosas also just mentioned to us that we could use the … Back at #585 (comment), @bencomp mentioned using https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVParser.html , which I believe is what @landreev was referring to above.

In short, we need to decide what's in scope for this issue. I thought when we estimated this issue as an 8 on Wednesday we were talking about the Java/Apache Commons route, not the Python or R route. Perhaps we should re-estimate this issue and/or estimate #2331.
@oscardssmith does your pull request at #3930 also fix #2626?
No idea, I'll put it in as a test case. I imagine it's fixed, though.
@oscardssmith the file I was talking about is the one mentioned in the initial description, at the top of the issue.
Reviewing the results of the investigation - great job btw - I feel like we should consider at least one possible improvement/fix to be within the scope of this development iteration: all these cases of files that are NOT actually CSV, but tabular, that still get ingested as CSV files with just 1 column. The current code is doing it in a less broken way than what we were doing before... but it's still wrong to be doing this.

So let's discuss this - as in, whether we want to change the behavior now; and if we do - how? I'm assuming that all these ingests happen because the files get uploaded with file names that have the ".csv" extension. We then assume that it is indeed a CSV, and that we should give it a try as such - and accept the result unless the parser explicitly fails. It appears that we need to be more careful there - namely, if the parser works, but we only find 1 column, we should do some extra checks to see whether that was a CSV file in the first place. Maybe we should be counting the tabs as we parse? And, if we've reached the end of the file without finding any commas, maybe we should just try parsing again, using the tab as the delimiter character? (And what do we do if that works? Ingest the file? Or reject it with "this is not a CSV file"?)

An alternative, of course, is to declare this out of scope, release as is, and open a new issue for it - as long as it is accounted for, and on the list of things that need to be addressed. Because it really seems wrong, what we are doing now.
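The tab-counting fallback floated above could look something like this in Python (a hedged sketch only; the actual ingest code is Java, and `guess_delimiter` and `SAMPLE_BYTES` are hypothetical names). The standard library's `csv.Sniffer` already implements this kind of delimiter detection:

```python
import csv

SAMPLE_BYTES = 4096  # hypothetical: how much of the file to examine

def guess_delimiter(text):
    """Guess the delimiter, so a tab-separated file uploaded with a
    .csv extension can be detected instead of ingested as one column."""
    try:
        # csv.Sniffer picks the most consistent delimiter in the sample
        dialect = csv.Sniffer().sniff(text[:SAMPLE_BYTES], delimiters=",\t;")
        return dialect.delimiter
    except csv.Error:
        # Sniffer gave up: fall back to counting candidate characters,
        # i.e. "counting the tabs as we parse"
        counts = {d: text.count(d) for d in (",", "\t", ";")}
        return max(counts, key=counts.get)

sample = "col1\tcol2\tcol3\n1\t2\t3\n4\t5\t6\n"
print(repr(guess_delimiter(sample)))  # '\t'
```

If the guessed delimiter turns out to be a tab rather than a comma, the ingest could either re-parse with that delimiter or reject the file - which is exactly the open question above.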
I think it makes most sense to say out of scope, and get this pushed out the door.
If it's incrementally better, let's get it tested and released! :) |
OK, I'll open a new issue for this; and finish the review of all the other cases, shortly. |
Moving the issue into QA. |
Fixed a couple typos I missed earlier.
OK, tested basic CSV ingest and all tests passed.
So, in addition to the extensive automatic ingesting of production CSV files, this seems to pass testing. One last minor request is to perhaps mention some of the above behavior in the docs, especially missing values, quotes to preserve commas and carriage returns, and support for UTF-8 characters.
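The behaviors requested for the docs (missing values, quotes preserving commas and carriage returns, UTF-8) can be illustrated with a small pandas session; the sample data here is made up purely for illustration:

```python
import io
import pandas as pd

# Made-up CSV: a quoted field containing a comma and a newline,
# an empty (missing) value, and a UTF-8 organization name.
data = 'name,comment,score\n网宣办,"hello, world\nsecond line",\nalice,plain,5\n'
df = pd.read_csv(io.StringIO(data))

print(df.shape)                     # (2, 3): the quoted newline stays in one row
print(df.loc[0, "comment"])         # the comma and the newline survive intact
print(df["score"].isna().iloc[0])   # True: the empty field is read as missing
```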
OK, I'll quickly add the few things we've discussed with @dlmurphy to the guide; then will pass it to Derek for review. |
@kcondon @dlmurphy : I wrote, and re-wrote, a few paragraphs in the doc. @dlmurphy, please review for clarity, etc. Anything extra we want to document/discuss should definitely be addressed in the next dev. iteration. (There will be another iteration sometime soon, for adding direct support for tab-delimited files.) As it is, I've already ended up writing more than I thought was in scope for this issue.
(This issue is for a clearly defined, short-term goal; not another generic "improve xxx" issue.)
The original CSV parser was purposefully restrictive, enforcing strict formatting: one line per observation (no newlines inside fields), a fixed number of commas per line, etc. These requirements are no longer relevant. At the same time, Gary specifically requested that the CSV ingest handle full text - with escaped newlines, rich punctuation, etc. Example: the file posts_all.tab (CSV original) in https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QSZMPD.
One way to define the goal would be to say that any Google/Excel spreadsheet exported as CSV should be parseable by our ingest. (I will add more details on how they escape punctuation characters and such.)
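As a reference point for those escaping conventions (an illustration, not Dataverse code): Excel and Google Sheets wrap any field containing commas, quotes, or newlines in double quotes, and escape an embedded double quote by doubling it. Python's default "excel" csv dialect expects exactly this format:

```python
import csv
import io

# Hypothetical export: the embedded quotes are doubled, and the field
# containing commas and a newline is wrapped in quotes.
exported = 'title,body\nPost 1,"She said ""hi"", then left.\nNew paragraph."\n'
rows = list(csv.reader(io.StringIO(exported)))

print(rows[1][0])   # Post 1
print(rows[1][1])   # the doubled quotes collapse back to single quotes
```

Any parser we adopt (our own or Apache Commons CSV) needs to round-trip this style of quoting to handle the full-text columns described above.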
A sensible way to achieve this would be to switch to an available open source parser (Apache Commons CSV seems like a good candidate), rather than maintaining our own.