Closes #2898, fread no longer fails on empty input
Michael Chirico committed May 24, 2018
1 parent b89831f commit 3f4221f
Showing 4 changed files with 57 additions and 30 deletions.
5 changes: 4 additions & 1 deletion NEWS.md
@@ -3,13 +3,16 @@

### Changes in v1.11.3 (in development)

#### NEW FEATURES

1. `fread` no longer errors on empty files; it now issues a warning and returns `NULL`, [#2898](https://github.com/Rdatatable/data.table/issues/2898). This improves compatibility with, e.g., using `fread` + `rbindlist` on a directory in which some files may be empty, since `rbindlist` already skips `NULL` elements.

#### BUG FIXES

1. Empty RHS of `:=` is no longer an error when the `i` clause returns no rows to assign to anyway, [#2829](https://github.com/Rdatatable/data.table/issues/2829). Thanks to @cguill95 for reporting and to @MarkusBonsch for fixing.

2. Fixed runaway memory usage with R-devel (R > 3.5.0), [#2882](https://github.com/Rdatatable/data.table/pull/2882). Thanks to many people but in particular to Paul Bailey for making the breakthrough reproducible example and Luke Tierney for then pinpointing the issue. It was caused by an interaction of two or more data.table threads operating on new compact vectors in the ALTREP framework, such as the sequence `1:n`. This interaction could result in R's garbage collector turning off, and hence the memory explosion. Problems may occur in R 3.5.0 too but we were only able to reproduce in R > 3.5.0. The R code in data.table's implementation benefits from ALTREP (`for` loops in R no longer allocate their range vector input, for example), but ALTREP compact vectors are not so appropriate as data.table columns. Sequences such as `1:n` are common in test data but not very common in real-world datasets. Therefore, there is no need for data.table to support columns which are ALTREP compact sequences. The `data.table()` function already expanded compact vectors (by happy accident) but `setDT()` did not (it now does). If, somehow, a compact vector still reaches the internal parallel regions, a helpful error will now be generated. If this happens, please report it as a bug.


### Changes in v1.11.2 (on CRAN 8 May 2018)

1. `test.data.table()` created/overwrote variable `x` in `.GlobalEnv`, [#2828](https://github.com/Rdatatable/data.table/issues/2828); i.e. a modification of user's workspace which is not allowed. Thanks to @etienne-s for reporting.
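
The `fread` + `rbindlist` use case named in the v1.11.3 entry above can be sketched roughly as follows. This is a minimal illustration, not code from the commit: the directory, file names and contents are invented for the example, and it assumes a data.table build in which `fread` warns on a zero-byte file instead of stopping.

```r
library(data.table)

# Hypothetical demo directory: two small CSV files and one zero-byte file.
dir_path <- tempfile("fread_demo_")
dir.create(dir_path)
writeLines(c("id,value", "1,10"), file.path(dir_path, "a.csv"))
file.create(file.path(dir_path, "b.csv"))            # empty file
writeLines(c("id,value", "2,20"), file.path(dir_path, "c.csv"))

files <- list.files(dir_path, full.names = TRUE)

# The empty file only triggers a warning, which is suppressed here;
# its (empty) result is then skipped by rbindlist().
tables <- suppressWarnings(lapply(files, fread))
combined <- rbindlist(tables)
print(combined)

unlink(dir_path, recursive = TRUE)
```

Suppressing the warning is a choice made for the demo; in a real pipeline it may be preferable to let it through so that empty inputs are noticed.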
72 changes: 44 additions & 28 deletions R/fread.R
@@ -26,44 +26,60 @@ fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=
  stopifnot(nThread>=1L)
  if (!missing(file)) {
  if (!identical(input, "")) stop("You can provide 'input=' or 'file=', not both.")
- if (!file.exists(file)) stop("File '",file,"' does not exist.")
- if (isTRUE(file.info(file)$isdir)) stop("File '",file,"' is a directory. Not yet implemented.") # dir.exists() requires R v3.2+, #989
+ file_info = file.info(file)
+ if (is.na(file_info$size)) stop("File '",file,"' does not exist or is non-readable.")
+ if (isTRUE(file_info$isdir)) stop("File '",file,"' is a directory. Not yet implemented.") # dir.exists() requires R v3.2+, #989
+ if (!file_info$size) {
+ warning("File '", file, "' has size 0. Returning NULL")
+ return(NULL)
+ }
  input = file
  } else {
  if (!is.character(input) || length(input)!=1L) {
  stop("'input' must be a single character string containing a file name, a system command containing at least one space, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or, the input data itself containing at least one \\n or \\r")
  }
  if ( input == "" || length(grep('\\n|\\r', input)) ) {
  # input is data itself containing at least one \n or \r
- } else if (file.exists(input)) {
- if (isTRUE(file.info(input)$isdir)) stop("File '",input,"' is a directory. Not yet implemented.")
  } else {
- if (substring(input,1L,1L)==" ") {
- stop("Input argument is not a file name and contains no \\n or \\r, but starts with a space. Please remove the leading space.")
- }
- # either a download or a system command, both to temp file
- tmpFile = tempfile()
- on.exit(unlink(tmpFile), add=TRUE)
- str6 = substring(input,1L,6L) # avoid grepl() for #2531
- str7 = substring(input,1L,7L)
- str8 = substring(input,1L,8L)
- if (str7=="ftps://" || str8=="https://") {
- if (!requireNamespace("curl", quietly = TRUE))
+ file_info = file.info(input)
+ if (!is.na(file_info$size)) {
+ if (isTRUE(file_info$isdir)) stop("File '",input,"' is a directory. Not yet implemented.")
+ if (!file_info$size) {
+ warning("File '", input, "' has size 0. Returning NULL")
+ return(NULL)
+ }
+ } else {
+ if (substring(input,1L,1L)==" ") {
+ stop("Input argument is not a file name and contains no \\n or \\r, but starts with a space. Please remove the leading space.")
+ }
+ # either a download or a system command, both to temp file
+ tmpFile = tempfile()
+ on.exit(unlink(tmpFile), add=TRUE)
+ str6 = substring(input,1L,6L) # avoid grepl() for #2531
+ str7 = substring(input,1L,7L)
+ str8 = substring(input,1L,8L)
+ if (str7=="ftps://" || str8=="https://") {
+ if (!requireNamespace("curl", quietly = TRUE))
  stop("Input URL requires https:// connection for which fread() requires 'curl' package, but cannot be found. Please install curl using 'install.packages('curl')'.")
- curl::curl_download(input, tmpFile, mode="wb", quiet = !showProgress)
- }
- else if (str6=="ftp://" || str7== "http://" || str7=="file://") {
- method = if (str7=="file://") "internal" else getOption("download.file.method", default="auto")
- # force "auto" when file:// to ensure we don't use an invalid option (e.g. wget), #1668
- download.file(input, tmpFile, method=method, mode="wb", quiet=!showProgress)
- # In text mode on Windows-only, R doubles up \r to make \r\r\n line endings. mode="wb" avoids that. See ?connections:"CRLF"
- }
- else if (length(grep(' ', input))) {
- (if (.Platform$OS.type == "unix") system else shell)(paste0('(', input, ') > ', tmpFile))
+ curl::curl_download(input, tmpFile, mode="wb", quiet = !showProgress)
+ }
+ else if (str6=="ftp://" || str7== "http://" || str7=="file://") {
+ method = if (str7=="file://") "internal" else getOption("download.file.method", default="auto")
+ # force "auto" when file:// to ensure we don't use an invalid option (e.g. wget), #1668
+ download.file(input, tmpFile, method=method, mode="wb", quiet=!showProgress)
+ # In text mode on Windows-only, R doubles up \r to make \r\r\n line endings. mode="wb" avoids that. See ?connections:"CRLF"
+ }
+ else if (length(grep(' ', input))) {
+ (if (.Platform$OS.type == "unix") system else shell)(paste0('(', input, ') > ', tmpFile))
+ }
+ else stop("File '",input,"' does not exist; getwd()=='", getwd(), "'",
+ ". Include correct full path, or one or more spaces to consider the input a system command.")
+ input = tmpFile # the file name
+ if (!file.info(input)$size) {
+ warning("File '", input, "' has size 0. Returning NULL")
+ return(NULL)
+ }
  }
- else stop("File '",input,"' does not exist; getwd()=='", getwd(), "'",
- ". Include correct full path, or one or more spaces to consider the input a system command.")
- input = tmpFile # the file name
  }
  }
  if (!missing(autostart)) warning("'autostart' is now deprecated and ignored. Consider skip='string' or skip=n");
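
The checks added to `fread` above rely on `file.info()` returning `NA` for a path that does not exist and a size of `0` for an empty file, which lets one call replace the earlier `file.exists()` + `file.info()` pair. The sketch below isolates that ordering in base R. `classify_input` is a hypothetical helper written for this note and is not part of data.table.

```r
# Hypothetical helper (not in data.table) mirroring the order of checks in the diff.
classify_input <- function(path) {
  info <- file.info(path)
  if (is.na(info$size)) return("missing")     # nonexistent path: every field is NA
  if (isTRUE(info$isdir)) return("directory") # checked before size, as in the diff
  if (info$size == 0) return("empty")         # the case that now warns and returns NULL
  "readable"
}

empty_file <- tempfile()
file.create(empty_file)
full_file <- tempfile()
writeLines(c("a,b", "1,2"), full_file)

results <- c(
  classify_input(empty_file),
  classify_input(full_file),
  classify_input(tempdir()),
  classify_input(file.path(tempdir(), "no-such-file"))
)
print(results)

unlink(c(empty_file, full_file))
```

Testing `isdir` before `size` matters because the size reported for a directory is platform dependent.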
8 changes: 8 additions & 0 deletions inst/tests/tests.Rraw
@@ -11835,6 +11835,14 @@ gc()
  after = sum(gc()[, 2])
  test(1912.2, after < before + 10)

+ # fread returns NULL with warning on empty file
+ f = tempfile()
+ file.create(f)
+ test(1913.1, fread(f), NULL, warning = 'File.*size 0')
+ test(1913.2, fread(file = f), NULL, warning = 'File.*size 0')
+ # trigger download for last instance of warning
+ test(1913.3, fread(paste0('file://', f)), NULL, warning = 'File.*size 0')
+ unlink(f)

  ###################################
  # Add new tests above this line #
2 changes: 1 addition & 1 deletion man/fread.Rd
@@ -97,7 +97,7 @@ When \code{input} begins with http://, https://, ftp://, ftps://, or file://, \c
  }
  \value{
- A \code{data.table} by default. A \code{data.frame} when argument \code{data.table=FALSE}; e.g. \code{options(datatable.fread.datatable=FALSE)}.
+ A \code{data.table} by default. A \code{data.frame} when argument \code{data.table=FALSE}; e.g. \code{options(datatable.fread.datatable=FALSE)}. \code{NULL} if the target is empty.
  }
  \references{
  Background :\cr
