Support .gz file format for fread #717

renqian · 2014-07-04T04:08:46Z

I have several thousand .gz files containing data in csv format, about 60GB in total in compressed form. Having to decompress them first in order to load some pieces via fread turns out to be a huge pain. I wonder whether it is possible to improve fread so that it can read compressed file formats just as read.table does?

Perhaps file connection issues are highly relevant, as mentioned in #341, #543, and #561.
Some other reference:

http://stackoverflow.com/questions/5764499/decompress-gz-file-using-r

http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html


eantonya · 2014-07-07T15:18:26Z

You can just do fread('zcat file.gz'), or some loop variation, if you have many files.
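The "loop variation" could look roughly like the sketch below. It assumes a `zcat` binary on the PATH and a data.table version that has the `cmd=` argument of fread (added later, see #3010); `read_gz_dir` is a hypothetical helper name, not part of data.table.

```r
library(data.table)

# Hypothetical helper: read every .gz file in `dir` through zcat and
# stack the results into one data.table.
read_gz_dir <- function(dir, pattern = "\\.gz$", ...) {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  tables <- lapply(files, function(f) {
    # shQuote() protects paths containing spaces; the `zcat < file`
    # form is used because plain `zcat file` behaves differently on OS X
    fread(cmd = paste("zcat <", shQuote(f)), ...)
  })
  rbindlist(tables)
}
```

Something like `read_gz_dir("data")` would then return a single table, provided all files share the same columns.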

arunsrinivasan · 2014-09-28T07:48:25Z

:Bump: quite useful (and coming up frequently).

Here's one SO post.

gbonamy · 2014-09-28T09:40:44Z

Yes, this would be a very useful feature to have. Using the command line is a temporary solution at best, since it relies on the underlying system having the tools for decompression. For instance, 'zcat' is not available on Windows unless one installs cygwin etc.

Since fread is by far the best tool in R for reading files, it would be a huge performance gain to read gzipped/bzipped/... files directly.

rsaporta · 2014-09-30T03:03:15Z

I just saw @Arun's bump in my email, and literally a few hours ago I was ingesting 200+ such files. +1 for usefulness

mattdowle · 2014-10-07T12:25:05Z

Would have been useful here as well :
http://www.magesblog.com/2014/10/visualising-seasonality-of-atlantic.html
Will take a look.

xiaodaigh · 2014-11-20T05:45:33Z

I agree with @gbonamy, reading directly from zip files would be a fantastic addition!!

rmscriven · 2015-03-16T07:46:02Z

Reading from a connection with unz() would also be quite useful. I have a function that downloads a zip file, and only reads one file then throws it away. So if I could use fread(unz(zipfile, file = file)) it would be a great addition.

statquant · 2015-04-17T11:10:08Z

+1 for reading directly from gz files. I would personally use it every day.

mspivakov · 2015-04-26T19:47:41Z

+1 from me as well.

fleimgruber · 2015-06-03T08:04:35Z

+1

zx8754 · 2015-07-01T19:18:55Z

+1

rickdonnelly · 2015-07-15T14:39:36Z

+1

qgeissmann · 2015-07-17T14:49:52Z

+1

jayjacobs · 2015-07-28T13:36:12Z

+1

eantonya · 2015-07-28T13:59:23Z

I'm curious - are people requesting this mostly working on Windows? I have trouble seeing the desire for this kind of specialization on Linux.

I personally mostly use .xz compression, but wouldn't care if fread directly supported it - I very frequently pipe the uncompressed result and do some post-processing before loading it in R (e.g. fread('xzcat file.xz | grep smth | awk blah')) and I like not depending on fread's file-format reading abilities - my shell processes are almost always going to be more advanced than whatever is implemented in fread.

zachmayer · 2015-08-20T22:57:17Z

👍

dselivanov · 2015-09-14T10:17:10Z

Just putting my tip here for OS X users.
The zcat syntax on OS X is a little bit different from that on Linux systems. For reading *.gz files, use the following call:

dt <- fread(input = 'zcat < data.gz')

gdkrmr · 2015-12-08T09:10:50Z

This is probably not the most efficient way but it works for me, you will probably have to change unz for gzfile :

zread <- function(zf,f,...){
  require(data.table)
  res <- fread(paste(readLines(tmp <- unz(zf,f)), collapse = "\n"),...)
  close(tmp)
  res
}

dselivanov · 2015-12-08T09:19:52Z

readLines is incredibly slow...

gdkrmr · 2015-12-08T10:06:20Z

@dselivanov it gets the job done on small files, never tried it on large ones though... my method probably does a lot of useless memory allocation passing the whole file around as a character vector.

statquant · 2015-12-08T13:43:10Z

Just zcat the file, see previous posts

cybaea · 2015-12-21T08:44:26Z

+1 for us hapless Windows users and for portability. There may be reasons why fread cannot accept a connection (as in help("connections", package="base")) but if not that would be a great and portable solution. Would also help with some common encoding issues (eg BOMs in UTF-8 files).

TuSKan · 2016-11-23T16:00:15Z

+1 My first wish from fread / fwrite

borisclemencon · 2016-11-25T18:48:08Z

@webbp, I have the same problem. I cannot use zcat, although it is pretty, because there is too little space in /dev/shm on my AWS EC2 instance. I should try to redirect /dev/shm to an EBS disk, but have not figured out how yet. Meanwhile, "zcat file.tsv.gz > file.tsv followed by fread('file.tsv')" is a painful workaround, but at least it works.

An alternative idea would be to use a specific tmp directory. Any idea?

sznadas · 2017-02-06T15:35:20Z

+1

mGalarnyk · 2017-02-10T21:55:16Z

+1

rargelaguet · 2017-02-15T17:50:04Z

+1

xhdong-umd · 2017-02-23T16:05:46Z

Is it possible to make an R package that wraps command line tools for Windows, Mac and Linux in the same interface? Then we could use the zcat approach with fread when that package is installed.

An example of this kind of package

I realized this kind of package will not be allowed on CRAN if it needs to bundle a Windows version of gzip. One would either have to host it somewhere else, or ask users to download gzip for Windows themselves.

xhdong-umd · 2017-02-23T16:41:57Z

Uncompressing the file into a temp file on disk will always work, but that could be slow because of disk access. If we read the file into a raw vector in RAM, then uncompress it with memDecompress before feeding an uncompressed raw vector to fread, will that work?

xhdong-umd · 2017-03-02T17:38:16Z

I wrote a function that decompresses zip, gz, bzip2 and xz files into a temp file, runs a function on it, then removes the temp file. So we can use temp_unzip(file, fread, ...).

The code is pure R so it should work on all platforms. I feel the zcat method is good enough for Linux/Mac (I do need to quote the file name sometimes), but too complex for Windows.

The code is inspired by R.utils, but I really don't like its default behavior of removing the input file. Also, I think the R.utils author just modified the compressFile code for use in decompressFile. You need to call gzfile and bzfile separately for compression, but for decompression you don't have to call gzfile, bzfile and xzfile separately, because gzfile can read all of these compression formats (except zip, for which I used unzip).
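The temp_unzip code itself is not shown in this thread. A minimal base-R sketch of the same idea, under the assumption that only gz, bzip2 and xz inputs matter (zip would need unzip instead), might look like this. `with_decompressed` and its `chunk_size` parameter are hypothetical names chosen for illustration.

```r
# Hypothetical helper, not the temp_unzip described above: decompress
# `path` into a temp file, call `fun` on that file, then delete it.
with_decompressed <- function(path, fun, ..., chunk_size = 1e7) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)    # always remove the temp file
  # When reading, gzfile() detects gzip, bzip2 and xz input by itself
  src <- gzfile(path, open = "rb")
  on.exit(close(src), add = TRUE)
  dst <- file(tmp, open = "wb")
  tryCatch({
    repeat {
      # Copy in chunks so the whole file never has to sit in memory
      chunk <- readBin(src, what = "raw", n = chunk_size)
      if (length(chunk) == 0L) break
      writeBin(chunk, dst)
    }
  }, finally = close(dst))            # flush before `fun` reads the file
  fun(tmp, ...)
}
```

Usage would be along the lines of `with_decompressed("myfile.csv.gz", data.table::fread)`. Unlike the R.utils default mentioned above, the input file is left in place.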

Here are some benchmarks:

library(microbenchmark)
microbenchmark(
  fread(eg_csv),
  fread(input = paste0("zcat < '", eg_gz, "'")), 
  temp_unzip(eg_bz, fread),
  temp_unzip(eg_zip, fread),
  temp_unzip(eg_gz, fread),
  times = 1)

Unit: seconds
                                          expr      min       lq     mean   median       uq      max neval
                                 fread(eg_csv) 2.117812 2.117812 2.117812 2.117812 2.117812 2.117812     1
 fread(input = paste0("zcat < '", eg_gz, "'")) 1.984009 1.984009 1.984009 1.984009 1.984009 1.984009     1
                      temp_unzip(eg_bz, fread) 6.304849 6.304849 6.304849 6.304849 6.304849 6.304849     1
                     temp_unzip(eg_zip, fread) 2.481650 2.481650 2.481650 2.481650 2.481650 2.481650     1
                      temp_unzip(eg_gz, fread) 2.487811 2.487811 2.487811 2.487811 2.487811 2.487811     1

frenchja · 2017-05-12T22:44:56Z

One thing to note is that the zcat solution appears to work only if the file is in the directory from which R is launched:

Error in fread("zcat < data/directory/test.csv.gz") :
  File is empty: /var/folders/41/asdf_kj80000gn/T//RtmpwtAttt/fileebeb5e124cef

map2085 · 2017-05-17T20:35:42Z

I forgot about this issue and tried to fread a gz file, only to get a mysterious error causing me to waste time, again, searching for the solution.

3 years later, still waiting for this elementary fix.

frenchja · 2017-05-17T21:47:50Z

After further exploration, my error above only occurs when there are spaces in the directory name:

fread("zcat < data/directory\ one/test.csv.gz")

But not with underscores:

fread("zcat < data/directory_two/test.csv.gz")

And can be alleviated by escaping the backslash again:

fread("zcat < data/directory\\ one/test.csv.gz")

Hope this helps. Otherwise, the zcat solution works fine.

jaapwalhout · 2018-02-27T21:10:41Z

Another example on StackOverflow why this feature is needed:
data.table fread error - gzip file - set temporary directory

webbp · 2018-03-01T18:25:57Z

How about:

library(readr)
DT = as.data.table(read_csv("myfile.gz"))

mspivakov · 2018-03-01T18:32:12Z

This is considerably slower.


MichaelChirico · 2018-03-02T00:29:40Z

setDT will be faster than as.data.table. What tool does read_csv use for uncracking the .gz?


malcook · 2018-03-18T08:54:51Z

@frenchja - agree - though you might prefer to escape those spaces with R's shQuote
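As a small illustration of the shQuote suggestion, using the path from the earlier comment (the Bourne-shell quoting style is requested explicitly here, since the default style of shQuote differs on Windows):

```r
# Build the path portably, then let shQuote() handle the space
path <- file.path("data", "directory one", "test.csv.gz")
cmd <- paste("zcat <", shQuote(path, type = "sh"))
print(cmd)
# → "zcat < 'data/directory one/test.csv.gz'"
```

The resulting string can then be passed to fread in place of the hand-escaped version.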

byapparov · 2018-04-17T22:59:09Z

readr reads the data from a connection (see ?gzfile) into memory: https://github.com/tidyverse/readr/blob/6f0bb65296afa55709fd60cdc5d59a4c89623e36/src/connection.cpp

And it is parsed with read_tokens_ : https://github.com/tidyverse/readr/blob/6f0bb65296afa55709fd60cdc5d59a4c89623e36/src/read.cpp

swvanderlaan · 2018-06-28T10:52:01Z

@frenchja How would this work with paste0? I now have the code below, but that throws an error:

SOME_DIR = "/Users/swvanderlaan/some_dir"
data <- fread('zcat < paste0(SOME_DIR,"/somedata.txt.gz")',
              header = TRUE, na.strings = "NA",
              verbose = TRUE, showProgress = TRUE)

Ah got it, it should be this:

data <- fread(paste0("zcat < '", SOME_DIR, "/somedata.txt.gz", "'"),
              header = TRUE, na.strings = "NA",
              verbose = TRUE, showProgress = TRUE)

MichaelChirico · 2018-07-02T03:28:52Z

@swvanderlaan I tend to use sprintf for cases like this; you should also use file.path and shQuote to be platform-robust:

fread(sprintf('zcat %s', shQuote(file.path(SOME_DIR, 'somedata.txt.gz'))))


mattdowle added the High label Jul 4, 2014

arunsrinivasan added the feature request label Jul 5, 2014

eantonya added Low and removed High labels Jul 8, 2014

arunsrinivasan mentioned this issue Sep 12, 2014

[feature request] fread w/ compressed files #807

Closed

arunsrinivasan added High and removed Low labels Sep 28, 2014

arunsrinivasan added the fread label Sep 4, 2015

xhdong-umd mentioned this issue Feb 27, 2017

data.table::fread ctmm-initiative/ctmm#2

Closed

st-pasha mentioned this issue Jul 7, 2017

Master task for fread bugs / proposals #2247

Closed

mattdowle removed this from the Candidate milestone May 10, 2018

PeteHaitch added a commit to hansenlab/bsseq that referenced this issue Jul 4, 2018

Explicitly gunzip to tempfile()

0656078

Avoids error 'gzip: stdout: No space left on device' (Rdatatable/data.table#717 (comment))

sritchie73 mentioned this issue Aug 24, 2018

Add cmd= to fread() #3010

Merged

MichaelChirico mentioned this issue Sep 29, 2018

fread gz #3085

Merged

mattdowle added this to the 1.11.8 milestone Sep 29, 2018

mattdowle closed this as completed in #3085 Sep 29, 2018

rodo-nunez mentioned this issue Nov 6, 2018

fread doen't work for .txt.gz but it does for the same file with .csv.gz extension #3136

Closed

dracodoc mentioned this issue Aug 26, 2021

setcolorder gains before= and after= #4691

Merged

remomomo mentioned this issue Jul 6, 2022

feature request: fread should raise an error instead of a warning when reading a gzipped file that does not fit in temporary storage #5415

Open
