Support .gz file format for fread #717
Comments
You can just do |
:Bump: quite useful (and coming up frequently). |
Yes, this would be a very useful feature to have. Using the command line is a temporary solution at best, since it relies on the underlying system having the tools for decompression. For instance, 'zcat' is not available on Windows unless one installs Cygwin etc. Since fread is by far the best tool in R for reading files, it would be a huge performance gain to read gzipped/bzipped/... files directly. |
I just saw @Arun's bump in my email, and literally a few hours ago I was ingesting 200+ such files. +1 for usefulness |
Would have been useful here as well : |
I agree with @gbonamy, reading directly from zip files would be a fantastic addition!! |
Reading from a connection with unz() would also be quite useful. I have a function that downloads a zip file, and only reads one file then throws it away. So if I could use fread(unz(zipfile, file = file)) it would be a great addition. |
+1 for reading directly from gz files. I would personally use it every day. |
+1 from me as well. |
+1 |
+1 |
+1 |
+1 |
+1 |
I'm curious - are people requesting this mostly working on Windows? I have trouble seeing the desire for this kind of specialization on Linux. I personally mostly use .xz compression, but wouldn't care if |
👍 |
Just put here my tip for OS X users. dt <- fread(input = 'zcat < data.gz') |
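This is probably not the most efficient way but it works for me, you will probably have to change
zread <- function(zf, f, ...) {
  # zf: path to the zip archive; f: name of the file inside the archive
  require(data.table)
  # read every line of the zipped file and pass them to fread as one string
  res <- fread(paste(readLines(tmp <- unz(zf, f)), collapse = "\n"), ...)
  close(tmp)
  res
} |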
|
@dselivanov it gets the job done on small files, never tried it on large ones though... my method probably does a lot of useless memory allocation passing the whole file around as a character vector. |
Just zcat the file, see previous posts.
|
+1 for us hapless Windows users and for portability. There may be reasons why |
+1 My first wish from fread / fwrite |
@webbp, I have the same problem. I cannot use zcat, neat as it is, because /dev/shm is too small on my AWS EC2 instance. I should try to redirect /dev/shm to an EBS disk, but have not figured out how yet. Meanwhile, "zcat file.tsv.gz > file.tsv followed by fread('file.tsv')" is a tedious workaround, but at least it works. An alternative idea would be to use a specific tmp directory. Any ideas? |
+1 |
+1 |
+1 |
Is it possible to make an R package that has command line tools for Windows, Mac, and Linux wrapped in the same interface? Then we can use the An example of this kind of package I realized this kind of package will not be allowed on CRAN if you need to pack a Windows version of gzip in the package. Either host it somewhere else, or ask users to download gzip for Windows themselves. |
Uncompressing the file into a temp file on disk will always work, but it could be slow because of disk access. If we read the file into a raw vector in RAM, then uncompress it with |
I wrote a function that decompresses zip, gz, bzip2, and xz files into a temp file, runs a function on it, then removes the temp file. So we can use The code is pure R, so it should work on all platforms. I feel the The code is inspired by Here are some benchmarks:
|
One thing to note is that the
|
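The temp-file approach described above can be sketched roughly as follows. This is not the commenter's code; it is a minimal illustration that assumes base R connections only. The name `with_decompressed` and its `reader` argument are hypothetical. It relies on `gzfile()` in read mode, which also reads bzip2- and xz-compressed files as well as uncompressed ones; zip archives are not covered here and would need `unz()` instead.

```r
# Hypothetical helper (not part of data.table): decompress `path` into a
# temporary file, run `reader` on that file, then delete the temporary file.
with_decompressed <- function(path, reader = data.table::fread, ...) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)

  # gzfile() opened for reading handles gzip, bzip2, xz and plain files.
  src <- gzfile(path, open = "rb")
  on.exit(close(src), add = TRUE)

  dst <- file(tmp, open = "wb")
  dst_open <- TRUE
  on.exit(if (dst_open) close(dst), add = TRUE)

  # Copy the decompressed bytes in 1 MiB chunks to keep memory use bounded.
  repeat {
    chunk <- readBin(src, what = "raw", n = 1024L * 1024L)
    if (length(chunk) == 0L) break
    writeBin(chunk, dst)
  }
  close(dst)
  dst_open <- FALSE

  reader(tmp, ...)
}
```

With data.table installed, `dt <- with_decompressed("data.gz")` would read the file through fread; any other reader that takes a file path as its first argument can be passed via `reader`. The temporary file still costs disk space and I/O, which is the trade-off raised earlier in the thread.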
I forgot about this issue and tried to fread a gz file, only to get a mysterious error causing me to waste time, again, searching for the solution. 3 years later, still waiting for this elementary fix. |
After further exploration, my error above only occurs when there are spaces in the directory name: fread("zcat < data/directory\ one/test.csv.gz"). But not with underscores: fread("zcat < data/directory_two/test.csv.gz"). And it can be alleviated by escaping the backslash again: fread("zcat < data/directory\\ one/test.csv.gz"). Hope this helps. Otherwise, the |
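A note on the case above: in an R string literal, a backslash followed by a space is not a recognized escape sequence, so R rejects the string before any command runs. Doubling the backslash gives the shell the literal backslash it needs. One way to avoid hand-escaping altogether is base R's `shQuote()`, sketched below. The path is a made-up example, and `type = "sh"` is set explicitly on the assumption that the command runs in a Unix-style shell.

```r
# Hypothetical path containing a space.
path <- "data/directory one/test.csv.gz"

# shQuote() wraps the path in shell quotes, so the space needs no backslash.
cmd <- paste("zcat <", shQuote(path, type = "sh"))
print(cmd)

# The command string can then be handed to fread as in the comments above,
# e.g. dt <- data.table::fread(cmd)
# (recent data.table versions also accept it as fread(cmd = cmd)).
```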
Another example on StackOverflow why this feature is needed: |
How about:
dt = as.data.table(read_csv("myfile.gz"))
|
This is considerably slower.
|
setDT will be faster than as.data.table. What tool does read_csv use for decompressing the .gz?
|
And it is parsed with |
@frenchja How would this work with
Ah got it, it should be this:
|
@swvanderlaan I tend to use
|
Avoids error 'gzip: stdout: No space left on device' (Rdatatable/data.table#717 (comment))
I have several thousand .gz files containing data in csv format, about 60GB in total in terms of .gz files. Decompressing them and loading some pieces via fread turns out to be a huge pain in the first step. I wonder whether it is possible to improve the functionality of fread so that it can read compressed file formats just as read.table does?
Perhaps file connection issues are highly relevant, as mentioned in #341, #543, and #561.
Some other references:
http://stackoverflow.com/questions/5764499/decompress-gz-file-using-r
http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html