incorporate parquet files? #9897
Comments
Yes! 2020 data from the US Census was recently published in Harvard Dataverse in parquet format: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5LAVKV (2010 data was published as well: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1OR2A6 ) This work with the US Census is ongoing and being tracked here: That said, no, Dataverse doesn't have any particular support for parquet files. In the examples above, the parquet files are in a zip file. Here's a preview of the 2020 zip: [screenshot of the zip's file hierarchy]
One note here is that Dataverse does not seem to unzip a compressed parquet collection in a way that respects the file hierarchy. In the example I just made, it says it "failed to unzip the file properly". The file itself is still intact: the user can unzip it themselves after downloading, and Phil's screenshot above shows that the file hierarchy can be viewed as metadata, which may be the best way forward. But just a note. [attached video: parquet-dataverse.mp4]
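For anyone who downloads one of these zips, here is a minimal sketch of inspecting and extracting it locally so the folder hierarchy survives. It uses only Python's standard `zipfile` module. The helper names (`list_hierarchy`, `extract_preserving_hierarchy`, `make_demo_zip`) and the `census/state=AK/...` layout are made up for illustration; the real Census archives may be organized differently, and the demo members are placeholder bytes rather than real parquet data.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def list_hierarchy(zip_source):
    """Return the sorted file paths stored in a zip, without extracting it."""
    with zipfile.ZipFile(zip_source) as zf:
        return sorted(
            info.filename for info in zf.infolist() if not info.is_dir()
        )


def extract_preserving_hierarchy(zip_source, dest_dir):
    """Extract everything under dest_dir, keeping the archive's folder layout.

    Returns the extracted file paths relative to dest_dir.
    """
    with zipfile.ZipFile(zip_source) as zf:
        zf.extractall(dest_dir)
    dest = Path(dest_dir)
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )


def make_demo_zip():
    """Build a small in-memory zip with a hypothetical partitioned layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for state in ("AK", "AL"):
            # Placeholder content only; these are not valid parquet files.
            zf.writestr(f"census/state={state}/part-0.parquet", b"placeholder")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    demo = make_demo_zip()
    for name in list_hierarchy(demo):
        print(name)
    with tempfile.TemporaryDirectory() as tmp:
        print(extract_preserving_hierarchy(demo, tmp))
```

With a real download, a path such as `"census_2020.zip"` (a hypothetical filename) can be passed in place of the in-memory demo archive, and the extracted folder can then be handed to whichever parquet reader you prefer.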
@kuriwaki that's unfortunate! Would you be able to create a dedicated issue for this (about |
Has there been any discussion of using parquet files at some level of Dataverse? (I see it mentioned in only one issue.)
I've used them some, and I love how well they work with R, Python, DuckDB, Spark, and others.
Several R programmers (like @kuriwaki) have advocated for rds files over RData files. From my recent experience with parquet files, they have all the advertised advantages of rds files (e.g., compression, strong typing, and factor levels), plus the appeal of interoperability with other platforms.
I haven't thought much beyond this. But when I read about problems with RData files and the messiness of Rserve described by @landreev, I see parquet as an improvement for many reasons, not least the ability to replace a flaky remote instance with a local parquet library.
cc: @pdurbin