fwrite() UTF-8 csv file (#4785)

shrektan authored Apr 15, 2021
1 parent 306f49c commit 1eadb95
Showing 5 changed files with 41 additions and 7 deletions.
2 changes: 2 additions & 0 deletions NEWS.md
@@ -10,6 +10,8 @@

2. `mean(na.rm=TRUE)` by group is now GForce optimized, [#4849](https://github.com/Rdatatable/data.table/issues/4849). Thanks to the [h2oai/db-benchmark](https://github.com/h2oai/db-benchmark) project for spotting this issue. The 1 billion row example in the issue shows 48s reduced to 14s. The optimization also applies to type `integer64` resulting in a difference to the `bit64::mean.integer64` method: `data.table` returns a `double` result whereas `bit64` rounds the mean to the nearest integer.
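An illustrative sketch of a call that now takes the GForce path (the data and column names here are made up, not from the benchmark):

```r
library(data.table)
set.seed(42)
dt = data.table(g = sample(5L, 1e6L, TRUE),
                v = replace(runif(1e6L), 1:10, NA))
# mean(na.rm=TRUE) by group is now computed by GForce's grouped mean
# instead of calling mean() once per group:
res = dt[, .(m = mean(v, na.rm = TRUE)), by = g]
# add verbose=TRUE to the query to confirm GForce was applied
```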

3. `fwrite()` now writes UTF-8 or native csv files by specifying the `encoding=` argument, [#1770](https://github.com/Rdatatable/data.table/pull/1770). Thanks to @shrektan for the request and the PR.
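A minimal usage sketch of the new argument (the file path and data are illustrative):

```r
library(data.table)
dt = data.table(x = c("fa\u00e7ile", "plain"))
csvfile = tempfile(fileext = ".csv")
# Re-encode strings to UTF-8 on write, whatever the session's native encoding:
fwrite(dt, csvfile, encoding = "UTF-8", bom = TRUE)
# The default encoding = "" writes the raw bytes unchanged:
fwrite(dt, csvfile, encoding = "")
```

Pairing `encoding = "UTF-8"` with `bom = TRUE` (as the commit's tests do) helps spreadsheet software detect the encoding.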

## BUG FIXES

1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
8 changes: 6 additions & 2 deletions R/fwrite.R
@@ -11,8 +11,12 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
compress = c("auto", "none", "gzip"),
yaml = FALSE,
bom = FALSE,
verbose=getOption("datatable.verbose", FALSE)) {
verbose=getOption("datatable.verbose", FALSE),
encoding = "") {
na = as.character(na[1L]) # fix for #1725
if (length(encoding) != 1L || !encoding %chin% c("", "UTF-8", "native")) {
stop("Argument 'encoding' must be '', 'UTF-8' or 'native'.")
}
if (missing(qmethod)) qmethod = qmethod[1L]
if (missing(compress)) compress = compress[1L]
if (missing(dateTimeAs)) { dateTimeAs = dateTimeAs[1L] }
@@ -108,7 +112,7 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
file = enc2native(file) # CfwriteR cannot handle UTF-8 if that is not the native encoding, see #3078.
.Call(CfwriteR, x, file, sep, sep2, eol, na, dec, quote, qmethod=="escape", append,
row.names, col.names, logical01, scipen, dateTimeAs, buffMB, nThread,
showProgress, is_gzip, bom, yaml, verbose)
showProgress, is_gzip, bom, yaml, verbose, encoding)
invisible()
}

17 changes: 17 additions & 0 deletions inst/tests/tests.Rraw
@@ -17302,3 +17302,20 @@ test(2165.5, X[Y, on=.(A), x.B, by=.EACHI], data.table(A=2:1, x.B=c(2L,NA)))
# missing j was caught in groupingsets but not cube, leading to unexpected error message, #4282
DT = data.table(a=1)
test(2166, cube(DT, by='a'), error="Argument 'j' is required")

# fwrite supports encoding "native" and "UTF-8", #1770
latin1 = "fa\xE7ile"
Encoding(latin1) = "latin1"
utf8 = iconv(latin1, "latin1", "UTF-8")
text = c(latin1, utf8, "aaaaaaaa")
dt = data.table(A = text, B = as.factor(text))
dt2 = data.table(A = text, B = text)
csvfile = tempfile(fileext = ".csv")
fwrite(dt, csvfile, encoding = "UTF-8", bom = TRUE)
test(2167.1, fread(csvfile, encoding = "UTF-8"), dt2)
if (identical(text, enc2native(text))) { # ensure native encoding can represent latin1 strings
fwrite(dt, csvfile, encoding = "native")
test(2167.2, fread(csvfile), dt2)
}
test(2167.3, fwrite(dt, csvfile, encoding="nativ"), error="Argument 'encoding' must be")
unlink(csvfile)
4 changes: 3 additions & 1 deletion man/fwrite.Rd
@@ -19,7 +19,8 @@ fwrite(x, file = "", append = FALSE, quote = "auto",
compress = c("auto", "none", "gzip"),
yaml = FALSE,
bom = FALSE,
verbose = getOption("datatable.verbose", FALSE))
verbose = getOption("datatable.verbose", FALSE),
encoding = "")
}
\arguments{
\item{x}{Any \code{list} of same length vectors; e.g. \code{data.frame} and \code{data.table}. If \code{matrix}, it gets internally coerced to \code{data.table} preserving col names but not row names}
@@ -59,6 +60,7 @@ fwrite(x, file = "", append = FALSE, quote = "auto",
\item{yaml}{If \code{TRUE}, \code{fwrite} will output a CSVY file, that is, a CSV file with metadata stored as a YAML header, using \code{\link[yaml]{as.yaml}}. See \code{Details}. }
\item{bom}{If \code{TRUE} a BOM (Byte Order Mark) sequence (EF BB BF) is added at the beginning of the file; format 'UTF-8 with BOM'.}
\item{verbose}{Be chatty and report timings?}
\item{encoding}{The encoding of the strings written to the CSV file. The default \code{""} means the raw bytes are written without re-encoding. The other possible options are \code{"UTF-8"} and \code{"native"}.}
}
\details{
\code{fwrite} began as a community contribution with \href{https://github.com/Rdatatable/data.table/pull/1613}{pull request #1613} by Otto Seiskari. This gave Matt Dowle the impetus to specialize the numeric formatting and to parallelize: \url{https://www.h2o.ai/blog/fast-csv-writing-for-r/}. Final items were tracked in \href{https://github.com/Rdatatable/data.table/issues/1664}{issue #1664} such as automatic quoting, \code{bit64::integer64} support, decimal/scientific formatting exactly matching \code{write.csv} between 2.225074e-308 and 1.797693e+308 to 15 significant figures, \code{row.names}, dates (between 0000-03-01 and 9999-12-31), times and \code{sep2} for \code{list} columns where each cell can itself be a vector.
17 changes: 13 additions & 4 deletions src/fwriteR.c
@@ -5,18 +5,23 @@
#define DATETIMEAS_EPOCH 2
#define DATETIMEAS_WRITECSV 3

static bool utf8=false;
static bool native=false;
#define TO_UTF8(s) (utf8 && NEED2UTF8(s))
#define TO_NATIVE(s) (native && (s)!=NA_STRING && !IS_ASCII(s))
#define ENCODED_CHAR(s) (TO_UTF8(s) ? translateCharUTF8(s) : (TO_NATIVE(s) ? translateChar(s) : CHAR(s)))

static char sep2; // '\0' if there are no list columns. Otherwise, the within-column separator.
static bool logical01=true; // should logicals be written as 0|1 or true|false. Needed by list column writer too in case a cell is a logical vector.
static int dateTimeAs=0; // 0=ISO(yyyy-mm-dd), 1=squash(yyyymmdd), 2=epoch, 3=write.csv
static const char *sep2start, *sep2end;
// sep2 is in main fwrite.c so that writeString can quote other fields if sep2 is present in them
// if there are no list columns, set sep2=='\0'

// Non-agnostic helpers ...

const char *getString(SEXP *col, int64_t row) { // TODO: inline for use in fwrite.c
SEXP x = col[row];
return x==NA_STRING ? NULL : CHAR(x);
return x==NA_STRING ? NULL : ENCODED_CHAR(x);
}

int getStringLen(SEXP *col, int64_t row) {
@@ -45,7 +50,7 @@ int getMaxCategLen(SEXP col) {
const char *getCategString(SEXP col, int64_t row) {
// the only writer that needs to have the header of the SEXP column, to get to the levels
int x = INTEGER(col)[row];
return x==NA_INTEGER ? NULL : CHAR(STRING_ELT(getAttrib(col, R_LevelsSymbol), x-1));
return x==NA_INTEGER ? NULL : ENCODED_CHAR(STRING_ELT(getAttrib(col, R_LevelsSymbol), x-1));
}

writer_fun_t funs[] = {
@@ -164,10 +169,12 @@ SEXP fwriteR(
SEXP is_gzip_Arg,
SEXP bom_Arg,
SEXP yaml_Arg,
SEXP verbose_Arg
SEXP verbose_Arg,
SEXP encoding_Arg
)
{
if (!isNewList(DF)) error(_("fwrite must be passed an object of type list; e.g. data.frame, data.table"));

fwriteMainArgs args = {0}; // {0} to quieten valgrind's uninitialized, #4639
args.is_gzip = LOGICAL(is_gzip_Arg)[0];
args.bom = LOGICAL(bom_Arg)[0];
@@ -224,6 +231,8 @@
dateTimeAs = INTEGER(dateTimeAs_Arg)[0];
logical01 = LOGICAL(logical01_Arg)[0];
args.scipen = INTEGER(scipen_Arg)[0];
utf8 = !strcmp(CHAR(STRING_ELT(encoding_Arg, 0)), "UTF-8");
native = !strcmp(CHAR(STRING_ELT(encoding_Arg, 0)), "native");

int firstListColumn = 0;
for (int j=0; j<args.ncol; j++) {
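The `TO_UTF8`/`TO_NATIVE`/`ENCODED_CHAR` macros above amount to a two-flag dispatch. A standalone C sketch of that logic (the stub `needs_reencode()` replaces R's `NEED2UTF8`/`IS_ASCII` checks, and the integer return codes stand in for the three translation calls; all names in this sketch are illustrative, not part of the patch):

```c
#include <stdbool.h>

// Module-level flags, set once per fwrite call from the encoding= argument,
// mirroring the statics utf8/native in fwriteR.c.
static bool utf8 = false;
static bool native = false;

// Stand-in for R's NEED2UTF8/IS_ASCII checks: assume any byte >= 0x80
// marks a string that would need re-encoding.
static bool needs_reencode(const char *s) {
  for (; *s; s++)
    if ((unsigned char)*s >= 0x80) return true;
  return false;
}

// Which path ENCODED_CHAR would take for a non-NA string:
// 0 = raw bytes (CHAR), 1 = translateCharUTF8, 2 = translateChar.
static int encoded_path(const char *s) {
  if (utf8 && needs_reencode(s)) return 1;   // TO_UTF8
  if (native && needs_reencode(s)) return 2; // TO_NATIVE
  return 0;                                  // pass-through
}
```

ASCII-only strings always take the pass-through path, so with `encoding=""` (both flags false) and for the common all-ASCII case, no translation cost is paid.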
