data.table::fread #2

Closed
ghost opened this issue Feb 27, 2017 · 5 comments

@ghost

ghost commented Feb 27, 2017

  • Until fread supports compressed files, or there is a cross-platform way to uncompress them, add a parameter zipfile = FALSE that reads with fread, and fall back to read.csv when a zip file has to be read.

For a 160M CSV file, fread took 2.64 s while read.csv took 21 s.
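
A comparison like the one above can be reproduced with system.time. The sketch below is only an illustration: time_readers is a hypothetical helper name, not something from this package, and the fread timing is skipped when data.table is not installed.

```r
# Hypothetical helper (not part of this package): time both readers on one file.
time_readers <- function(path) {
  # Elapsed wall-clock seconds for the base R reader.
  timings <- c(read.csv = unname(system.time(utils::read.csv(path))["elapsed"]))
  # Only time fread when data.table is available.
  if (requireNamespace("data.table", quietly = TRUE)) {
    timings["fread"] <- system.time(
      data.table::fread(path, data.table = FALSE)
    )["elapsed"]
  }
  timings
}
```

Calling time_readers("some_file.csv") returns a named numeric vector of elapsed seconds; the numbers depend heavily on the machine and the file.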

@xhdong-umd
Contributor

The ideal solution is to use fread for zip files as well. There are several approaches:

  • Wait for native support in fread. It's a popular request in data.table.
  • Uncompress the file to stdout with fread(input = 'zcat < data.gz'). However, Windows doesn't have zcat or gzip installed by default, so it's difficult to create a simple cross-platform solution that doesn't require the user to install software first.
  • Uncompress the file to a temp file, read it, then delete the temp file. The problem here is the complexity of zip files. There are multiple possible compression methods, including files created by tar, which cannot be recognized by R's internal unzip function. R.utils uses R connections to uncompress to files, but you still need to identify the compression method first.

Right now the parameter approach is the simplest and doesn't require much change to the existing code. We can improve on it later, depending on new usage or on developments in related packages.
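
For reference, the temp-file approach could look roughly like the sketch below. It is a minimal illustration, not a proposal for the package: read_via_tempfile is a hypothetical name, and it relies on base R's gzfile connection, which can read gzip, bzip2 and xz files as well as uncompressed ones, but not zip or tar archives. The reader defaults to read.csv so the sketch runs without extra packages.

```r
# Hypothetical helper (not part of this package or data.table):
# decompress to a temp file, read it, then remove the temp file.
read_via_tempfile <- function(path, reader = utils::read.csv, ...) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)      # always clean up the temp file

  # gzfile() detects gzip/bzip2/xz when reading; zip and tar are NOT handled.
  con <- gzfile(path, open = "rt")
  lines <- tryCatch(readLines(con), finally = close(con))

  writeLines(lines, tmp)
  reader(tmp, ...)
}
```

To read with fread instead, pass it as the reader, for example read_via_tempfile("data.csv.gz", reader = data.table::fread, data.table = FALSE). Reading all lines into memory first is wasteful for large files, so this only shows the shape of the approach.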

@chfleming
Contributor

I think the most important thing is that as.telemetry "just works" with default arguments. I put in some code that checks whether the filename looks like a CSV and, if so, attempts fread. If the filename doesn't look like a CSV or fread fails, then the slower read.table is used instead.

  data <- NULL
  # fread doesn't work on compressed files yet
  if(endsWith(tolower(object),".csv"))
  { data <- try(data.table::fread(object,data.table=FALSE,check.names=TRUE,...)) }
  # if fread fails, then fall back on read.table
  if(class(data)!="data.frame")
  { data <- utils::read.csv(object,...) }

We could add in more logic for different compression formats, but I don't know that the command & pipe notation is the same across platforms.

@xhdong-umd
Contributor

@chfleming This is a much better solution than an extra parameter.

I think there is no need to check compression formats since there are many possibilities and platform compatibility problems.

@xhdong-umd
Contributor

@chfleming I think we can actually just fread the first 5 rows and skip the file name check. A CSV file may have a different extension (I have seen .txt before). How about this:

data <- try(data.table::fread(object, data.table = FALSE, check.names = TRUE, nrows = 5), 
            silent = TRUE)
if (class(data) == "data.frame") {
  data <- data.table::fread(object,data.table=FALSE,check.names=TRUE,...)
} else {
  data <- utils::read.csv(object,...)
}

I think the direct read test should be fast enough to be comparable to the file name check, and it will handle all possible cases without complex logic.
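
The same probe-then-read logic can also be wrapped in a small function. This is only a sketch with hypothetical names (read_with_fallback, fast, slow); it uses tryCatch and is.data.frame instead of try and a class comparison, and takes the two readers as arguments so the fallback path can be exercised without a compressed file.

```r
# Hypothetical helper, not the code that ended up in the package.
read_with_fallback <- function(path,
                               fast = function(p, ...) {
                                 data.table::fread(p, data.table = FALSE,
                                                   check.names = TRUE, ...)
                               },
                               slow = utils::read.csv,
                               ...) {
  # Probe: try the fast reader on the first 5 rows only.
  probe <- tryCatch(fast(path, nrows = 5), error = function(e) NULL)

  if (is.data.frame(probe)) {
    fast(path, ...)   # probe succeeded, read the whole file the fast way
  } else {
    slow(path, ...)   # probe failed, fall back to the slower reader
  }
}
```

With the defaults, read_with_fallback("some_file.csv") behaves like the snippet above: fread when the probe works, read.csv otherwise.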

@chfleming
Contributor

That seems to work well. Pushed.
