First version of the fwrite function #580 #1613
Merged
7 commits
d794b3b  First version of the fwrite function #580 (oseiskar)
8696b9f  minor changes in fwrite: renamed param dt to x, removed requirement o… (oseiskar)
73aeff8  Improved fwrite performance by handling number formatting and NAs in … (oseiskar)
ff5ec81  changed fwrite test to allow different floating point notations from … (oseiskar)
1b79ec4  using path.expand on the file.path argument of fwrite (oseiskar)
e8c50d1  increased number of significant digits to 15 in fwrite (oseiskar)
6be2ed1  merged master and fixed conflicts (oseiskar)
@@ -0,0 +1,65 @@
fwrite <- function(x, file.path, append = FALSE, quote = TRUE,
                   sep = ",", eol = "\n", na = "", col.names = TRUE, qmethod = "double",
                   block.size = 10000) {

  # validate arguments
  stopifnot(is.data.frame(x))
  stopifnot(ncol(x) > 0)

  # is.logical()/is.character() rather than class(x) == "...", which is
  # fragile for objects that carry more than one class
  stopifnot(length(quote) == 1 && is.logical(quote))
  stopifnot(length(sep) == 1 && is.character(sep) && nchar(sep) == 1)
  stopifnot(length(eol) == 1 && is.character(eol))
  stopifnot(length(qmethod) == 1 && qmethod %in% c("double", "escape"))
  stopifnot(length(col.names) == 1 && is.logical(col.names))
  stopifnot(length(append) == 1 && is.logical(append))
  stopifnot(length(block.size) == 1 && block.size > 0)

  # handle paths like "~/foo/bar"
  file.path <- path.expand(file.path)

  quoted_cols <- rep(quote, ncol(x))

  # special case: single-column data.frame, doing x[block_begin:block_end,]
  # for such a data frame gives a vector
  if (!is.data.table(x) && ncol(x) == 1) x <- as.data.table(x)

  # write header row separately for correct quoting of column names
  if (col.names && !append) {
    .Call(Cwritefile, as.list(names(x)), file.path, sep, eol, na, quoted_cols, qmethod == "escape", append)
    append <- TRUE
  }

  # handle empty x
  if (nrow(x) == 0) return(invisible(NULL))

  # determine from the column types which ones should be quoted;
  # vapply() yields exactly one logical per column, even for columns that
  # carry several classes (e.g. ordered factors)
  if (quote) {
    quoted_cols <- vapply(x, function(column) is.character(column) || is.factor(column), logical(1))
  }

  # write in blocks of given size to avoid generating full copies
  # of columns in memory
  block_begin <- 1

  repeat {
    block_end <- min(block_begin+(block.size-1), nrow(x))

    dt_block <- x[c(block_begin:block_end),]

    # convert data.frame row block to a list of columns
    col_list <- lapply(dt_block, function(column) {
      # class(column)[1L]: a column may carry several classes (e.g. POSIXct)
      if (!(class(column)[1L] %chin% c('integer', 'numeric', 'character'))) {
        column <- as.character(column)
      }
      column
    })

    .Call(Cwritefile, col_list, file.path, sep, eol, na, quoted_cols, qmethod == "escape", append)

    if (block_end == nrow(x)) break

    append <- TRUE
    block_begin <- block_end+1
  }
}
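
The `repeat` loop in the R function walks the rows in 1-based, inclusive ranges of at most `block.size` rows. As a minimal sketch of just that arithmetic, here is a C helper; `block_range` and its parameters are hypothetical names for illustration and are not part of this PR.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the PR): computes the inclusive,
 * 1-based row range of block `block_index` (0-based) when `nrows` rows
 * are written in blocks of at most `block_size` rows.
 * Returns 1 and fills *begin / *end if the block exists, 0 otherwise. */
int block_range(size_t nrows, size_t block_size, size_t block_index,
                size_t *begin, size_t *end) {
    if (block_size == 0) return 0;                /* mirrors block.size > 0 */
    size_t first = block_index * block_size + 1;  /* 1-based, like R */
    if (first > nrows) return 0;                  /* past the last row */
    size_t last = first + (block_size - 1);
    if (last > nrows) last = nrows;               /* final, shorter block */
    *begin = first;
    *end = last;
    return 1;
}
```

With 25 rows and a block size of 10, the blocks come out as rows 1 to 10, 11 to 20 and 21 to 25, and a fourth block does not exist. An empty table yields no block at all, which corresponds to the early return for `nrow(x) == 0` in the R code.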
@@ -0,0 +1,40 @@
\name{fwrite}
\alias{fwrite}
\title{Fast CSV writer}
\description{
Similar to \code{write.table} but faster and more limited in features.
}
\usage{
fwrite(x, file.path, append = FALSE, quote = TRUE, sep = ",", eol = "\n", na = "",
       col.names = TRUE, qmethod = "double", block.size = 10000)
}
\arguments{
\item{x}{The \code{data.table} or \code{data.frame} to write}
\item{file.path}{Output file name}
\item{append}{If \code{TRUE}, the file is opened in append mode and column names (header row) are not written.}
\item{quote}{If \code{TRUE}, all columns of character and factor types, as well as all column names, will be surrounded by double quotes. If \code{FALSE}, nothing is quoted, even if this would break the CSV (the column contents are not checked for separator characters).}
\item{sep}{The separator between columns}
\item{eol}{Line separator}
\item{na}{The string to use for missing values in the data}
\item{col.names}{A logical value indicating if the column names (header row) should be written}
\item{qmethod}{A character string specifying how to deal with embedded double quote characters when quoting strings. Must be one of "escape", in which case the quote character (as well as the backslash character) is escaped in C style by a backslash, or "double" (default), in which case it is doubled.}
\item{block.size}{The output is written in blocks, each of which contains at most this number of rows. This is to avoid making large copies in memory. Can be used to tweak performance and memory usage.}
}
\details{
The speed-up compared to \code{write.csv} depends on the parameters and column types.
}
\seealso{ \code{\link[utils]{write.csv}} }
\examples{
\dontrun{

fwrite(data.table(first=c(1,2), second=c(NA, 'foo"bar')), "table.csv")

# table.csv contains (numeric columns are not quoted, and NA is
# written as the unquoted na string, which is empty by default):

# "first","second"
# 1,
# 2,"foo""bar"
}
}
\keyword{ data }
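
To make the two `qmethod` settings concrete, the following is a small standalone sketch of the quoting rule described in the argument list. `quote_field` is a hypothetical helper that writes into a buffer instead of a file; it is an illustration under that assumption, not the code added by this PR.

```c
#include <stddef.h>

/* Hypothetical helper: writes a quoted copy of `in` into `out`.
 * escape == 0 -> qmethod = "double": each '"' is doubled
 * escape != 0 -> qmethod = "escape": '"' and '\' get a '\' prefix
 * Returns the number of characters written (excluding the terminating
 * '\0'), or 0 if `out` is too small (its contents are then unspecified). */
size_t quote_field(const char *in, char *out, size_t out_size, int escape) {
    size_t n = 0;
    if (out_size < 3) return 0;  /* room for two quotes and '\0' */
    out[n++] = '"';
    for (const char *ch = in; *ch != '\0'; ++ch) {
        /* worst case: prefix char + the char + closing quote + '\0' */
        if (n + 4 > out_size) return 0;
        if (*ch == '"') out[n++] = escape ? '\\' : '"';
        else if (escape && *ch == '\\') out[n++] = '\\';
        out[n++] = *ch;
    }
    out[n++] = '"';
    out[n] = '\0';
    return n;
}
```

For the input `foo"bar` this produces `"foo""bar"` with `escape == 0` and `"foo\"bar"` with `escape != 0`. A backslash in the input is only touched in escape mode, where it is doubled.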
@@ -0,0 +1,79 @@
#include <R.h>
#include <errno.h>
#include <stdio.h>   /* FILE, fopen, fprintf, fputs, fputc */
#include <string.h>  /* strerror */
#include <Rinternals.h>

/* Invoked through .Call(), so it has to return a SEXP. */
SEXP writefile(SEXP list_of_columns,
               SEXP filename,
               SEXP col_sep_exp,
               SEXP row_sep_exp,
               SEXP na_exp,
               SEXP quote_cols,
               SEXP qmethod_escape_exp,
               SEXP append) {

  int error_number = 0;
  int qmethod_escape = *LOGICAL(qmethod_escape_exp);

  errno = 0; /* clear flag possibly set by previous errors */

  char col_sep = *CHAR(STRING_ELT(col_sep_exp, 0));
  const char *row_sep = CHAR(STRING_ELT(row_sep_exp, 0));
  const char *na_str = CHAR(STRING_ELT(na_exp, 0));
  const char QUOTE_CHAR = '"';
  const char ESCAPE_CHAR = '\\';

  /* open output file in correct mode */
  const char *open_mode = "wb";
  if (*LOGICAL(append)) open_mode = "ab";
  FILE *f = fopen(CHAR(STRING_ELT(filename, 0)), open_mode);
  if (f == NULL) goto end;

  R_xlen_t ncols = LENGTH(list_of_columns);
  R_xlen_t nrows = LENGTH(VECTOR_ELT(list_of_columns, 0));

  for (R_xlen_t row_i = 0; row_i < nrows; ++row_i) {
    for (R_xlen_t col_i = 0; col_i < ncols; ++col_i) {

      if (col_i > 0) fputc(col_sep, f);

      SEXP column = VECTOR_ELT(list_of_columns, col_i);

      switch(TYPEOF(column)) {
      case INTSXP:
        if (INTEGER(column)[row_i] == NA_INTEGER) fputs(na_str, f);
        else fprintf(f, "%d", INTEGER(column)[row_i]);
        break;

      case REALSXP:
        if (ISNA(REAL(column)[row_i])) fputs(na_str, f);
        else fprintf(f, "%.15g", REAL(column)[row_i]);
        break;

      default: /* assuming STRSXP */
        if (STRING_ELT(column, row_i) == NA_STRING) fputs(na_str, f);
        else {
          int quote = LOGICAL(quote_cols)[col_i];
          if (quote) fputc(QUOTE_CHAR, f);
          for (const char *ch = CHAR(STRING_ELT(column, row_i)); *ch != '\0'; ++ch) {
            if (quote) {
              if (*ch == QUOTE_CHAR) {
                if (qmethod_escape) fputc(ESCAPE_CHAR, f);
                else fputc(QUOTE_CHAR, f); /* qmethod = "double" */
              }
              if (qmethod_escape && *ch == ESCAPE_CHAR) fputc(ESCAPE_CHAR, f);
            }
            fputc(*ch, f);
          }
          if (quote) fputc(QUOTE_CHAR, f);
        }
        break;
      }
    }
    if (fputs(row_sep, f) < 0) goto end;
  }

  end:
  error_number = errno; /* save before fclose() can overwrite errno */
  if (f != NULL) fclose(f);
  /* report the saved value; never pass the message as the format string */
  if (error_number) error("%s", strerror(error_number));
  return R_NilValue;
}
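
The `REALSXP` branch formats doubles with `"%.15g"`, matching the commit that raised the number of significant digits to 15. The sketch below isolates that one call so its effect is easy to try out; `format_real` is a hypothetical wrapper around `snprintf`, not a function from this PR.

```c
#include <stdio.h>

/* Hypothetical wrapper mirroring the fprintf(f, "%.15g", ...) call:
 * formats `value` with at most 15 significant digits into `out`.
 * Returns the snprintf result, i.e. the length that was written or
 * would have been written had `out` been large enough. */
int format_real(double value, char *out, size_t out_size) {
    return snprintf(out, out_size, "%.15g", value);
}
```

With 15 significant digits, `0.1` is written as `0.1` and `0.1 + 0.2` as `0.3`, because `%g` drops trailing zeros. The trade-off is that 15 digits do not distinguish every pair of neighbouring doubles (17 are needed for that), so reading the file back may not reproduce the last bits of some values.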
In reply to MichaelChirico's comment asking whether it would be faster to create an extra column and use it like `x[.(block_no)]`: it could be a little faster and would not be difficult to implement in C, but I dislike the idea of modifying the input data table. Is there a convenient way to generate the name of such a column so that it does not conflict with existing column names?

@oseiskar None that I'm aware of. Once #633 is solved it will be easy. You can use a cryptic name (prefix it with a dot) and check that it doesn't already exist in the data.table. I'm not sure what you mean by modifying the input, but adding a column without modifying the input is as simple as `x = shallow(x)[, "col" := new]`: it won't copy the data, and the new column is added only to the locally processed data.

Just to reinforce Jan's point, this sort of thing (restricting acceptable column names) is already done under the hood in several places in the `data.table` code, see e.g. here. As a suggestion, there's `b__`, which is analogous to `f__`, `l__`, `o__`, `zo__`, `jn__`, and `jl__`, all found in `[.data.table`. Agreed, `block_no` is too likely to be in users' tables (in fact I have several cases like that myself).
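
On the question of picking a column name that cannot clash with the user's columns, one generic approach is to start from a cryptic base name and append a counter until the candidate is unused. The sketch below shows that idea in C; `name_in_use` and `unique_name` are hypothetical helpers for illustration, and nothing here claims that data.table does it this way.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` occurs among `existing`. */
static int name_in_use(const char *name, const char *const *existing, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (strcmp(name, existing[i]) == 0) return 1;
    return 0;
}

/* Hypothetical helper: writes into `out` a name derived from `base` that
 * is not among the `n` names in `existing`, trying base, base1, base2, ...
 * Returns 1 on success, 0 if `out` is too small for a candidate. */
int unique_name(const char *base, const char *const *existing, size_t n,
                char *out, size_t out_size) {
    for (size_t suffix = 0; suffix <= n; ++suffix) {
        int len = suffix == 0
            ? snprintf(out, out_size, "%s", base)
            : snprintf(out, out_size, "%s%zu", base, suffix);
        if (len < 0 || (size_t)len >= out_size) return 0;  /* truncated */
        if (!name_in_use(out, existing, n)) return 1;
    }
    return 0;  /* not reached: n + 1 distinct candidates, only n names */
}
```

With existing columns `first`, `b__` and `b__1`, the base `b__` resolves to `b__2`, while `block_no` is returned unchanged because it is free. The loop needs at most `n + 1` attempts, since `n` existing names cannot cover `n + 1` distinct candidates.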