First version of the fwrite function #580 #1613

oseiskar · 2016-03-27T19:56:28Z

This implementation of fwrite (#580) aims to be faster, or at least as fast as write.csv, but a few things have been left out or simplified:

When quote=TRUE, all column names are quoted
When quote=FALSE, nothing is quoted, even if this would break the CSV
There is no option for row.names. They only make sense for data.frames with named rows. For data.tables, they would just reduce to row numbers.

The speedup compared to write.csv depends on column types and parameters but speedup factors from 2 to 4 are possible.

MichaelChirico · 2016-03-27T20:58:35Z

R/fwrite.R

@@ -0,0 +1,62 @@
+fwrite <- function(dt, file.path, append = FALSE, quote = TRUE,


should definitely use DT or x, given the dt function in the stats package.

consistency with write.csv would have file.path named file and take default value "" (according to ?write.csv, the default is to print to console)

ok, I'll change this

This is semi-intentional. Supporting file="" (stdout) and connections would require passing file handles (instead of just file name strings) from R to C and I don't currently know how to do this. R's C interface is poorly documented.

Gotcha. Not sure how important the consistency is, but maybe just follow write.table to this end:

if (file == "") file <- stdout() else if (is.character(file)) { file <- if (nzchar(fileEncoding)) file(file, ifelse(append, "a", "w"), encoding = fileEncoding) else file(file, ifelse(append, "a", "w")) on.exit(close(file)) } else if (!isOpen(file, "w")) { open(file, "w") on.exit(close(file)) } if (!inherits(file, "connection")) stop("'file' must be a character string or connection")

And here is the C side of the code, to the extent that it helps

…f unique column names, replaced %in% -> %chin%

oseiskar · 2016-03-28T08:27:41Z

R/fwrite.R

+  repeat {
+    block_end <- min(block_begin+(block.size-1), nrow(x))
+
+    dt_block <- x[c(block_begin:block_end),]


As a reply to MichaelChirico's comment if it would be faster to create an extra column and use it like x[.(block_no)]: could be a little bit faster and not difficult to implement in C, but I dislike the idea of modifying the input data table. Is there a convenient way to generate the name of such column so that it would not conflict with existing column names?

@oseiskar none I'm aware of, once #633 solved it will be easy, you can use cryptic name (prefix it with the dot), and check if it doesn't exist in a data.table. Not sure what you are referring by modifying input, but adding column without modifying input is as simple as x = shallow(x)[, "col" := new], it won't copy the data, and it will add new column only to locally processed data.

Just to reinforce Jan's point, this sort of thing (restricting acceptable column names) is done under the hood in several places of data.table code already, see e.g. here. As a suggestion, there's b__, which has an analogue to f__, l__, o__, zo__, jn__, and jl__, all found in [.data.table. Agreed block_no is too likely to be in users' tables (in fact I have several cases like that myself).

…C. Also added na option to fwrite.

…fprintf, e.g., 1e-10 (Linux) = 1e-010 (Windows)

oseiskar · 2016-03-28T15:53:52Z

I moved number and NA formatting to C code, which resulted in less column copying and a significant perfomance boost. See this gist. Now this version seems to be between 2 and 4 times faster than write.csv with character and numeric columns.

@jangorecki thanks for pointing out the shortcomings of Sys.time. Any ideas how it should be replaced in benchmarking?

MichaelChirico · 2016-03-28T15:55:50Z

My preference is to use microbenchmark::get_nanotime.

By the way, thanks for working on this! The data.table community will be overjoyed to see a working version of this, I'm sure.

MichaelChirico · 2016-03-28T18:02:20Z

Two more things:

I get an error when trying to write outside current wd; modifying slightly the example in ?fwrite:

fwrite(data.table(first=c(1,2), second=c(NA, 'foo"bar')), "~/Desktop/table.csv")

Error in fwrite(data.table(first = c(1, 2), second = c(NA, "foo\"bar")), :
No such file or directory

Handling for list columns is probably incorrect. Not sure whether we should support... seems like it will be a good counterpart to sep2 in fread. As of now:

DT <- data.table(a = 1:3, l = list(list(1:6), list(3:5), list("a","b")))

fwrite(DT, "test.csv")

Produces:

"a","l"
1,list(1:6)
2,list(3:5)
3,list("a", "b")

I don't know the proper way to handle this, perhaps surround a list in angle brackets and comma-separate the elements?

"a","l"
1,<<1,2,3,4,5,6>>
2,<<3,4,5>>
3,<<"a">,<"b">>

?

Or perhaps take the cue from format.data.table:

format.item <- function(x) {
  if (is.atomic(x) || is.formula(x)) 
    paste(c(format(head(x, 6), justify = justify, ...), 
            if (length(x) > 6) ""), collapse = ",")
  else paste("<", class(x)[1L], ">", sep = "")
}

if (is.list(col)) 
  col = sapply(col, format.item)
else col = format(char.trunc(col), justify = justify, ...)

Otherwise just error on list columns

oseiskar · 2016-03-28T18:09:34Z

@MichaelChirico The problem is not the working directory but the ~, which is not automatically expanded in C. I'll fix this using path.expand...

MichaelChirico · 2016-03-28T20:13:59Z

inre: lists, just noticed that write.csv pops an error in this case, so it should be fine to just error for now:

Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
unimplemented type list in EncodeElement

MichaelChirico · 2016-03-28T20:20:31Z

inre: numeric columns, for consistency with write.csv more digits should be printed; from ?write.csv:

In almost all cases the conversion of numeric quantities is governed by the option "scipen" (see options), but with the internal equivalent of digits = 15. For finer control, use format to make a character matrix/data frame, and call write.table on that.

jangorecki · 2016-03-28T20:23:26Z

@oseiskar

elapsed.secs <- system.time(ans <- fwrite(...))[[3L]]`

is the easiest way
for precise timing you can use

pt = microbenchmark::get_nanotime()
ans = fwrite(...)
microbenchmark::get_nanotime() - pt

or from the devel version of microbenchmarkCore a drop-in replacement for system.time called system.nanotime

No need to have list columns supported now, this can be easily and flexibly handled within data.table before passing it to fwrite.

Not sure about as.character() for non-(numeric/integer/characters), it might be better to use toString instead (then by=1:nrow(.) probably needed), but it generally depends on how it is processed further in terms of edge cases detection. toString always returns non-NA, length 1L character. Not sure about performance of that.

oseiskar · 2016-03-30T18:26:53Z

As a reply to this earlier comment about supporting write.csv-like file argument that could also be an R connection and would default to writing to the console: not possible, as far as I know. Adding support to writing in STDOUT is possible but it seems that writing to an existing R connection is not.

@MichaelChirico thanks for looking up the C side of write.csv. Unfortunately, it does not seem to be possible do replicate this because the code is calling functions that are not part of the public C API of R. It would seem that all file/connection handling is part of the private / internal R code and thus unavailable.

In particular, one would need to call getConnection to transform a parameter SEXP, representing a connection number, to an Rconnection and trying to replicate the implementation this function would not help here.

dselivanov · 2016-03-31T14:26:57Z

Not sure, if this will help, but leave it here: official C level API for connection handles in R 3.3.0.

oseiskar · 2016-04-02T08:35:10Z

Actually, it is possible to bypass the private API limitation and call getConnection at least on some platforms: see this branch for a proof of concept. However, this brings in a whole new bunch of problems:

It is not guaranteed to work or remain supported (However, as @dselivanov pointed out, official support will be available in R version 3.3.0)
The first attempted implementation absolutely kills performance. The code linked above is about 4 times slower that write.csv with character columns.
For regular files, it is possible to "fix" 2. using lower-level functions like this (speedup drops from about 2x to about 1.5x) but this will not work for all R connections, such as stdout. It is possible to handle both cases separately, which would significantly increase the complexity of the code.

In my opinion, supporting R connection arguments is not worth the trouble at this point.

dselivanov · 2016-04-02T19:01:01Z

Personally I think, connections is "nice to have" feature (which can be implemented in future), but even without connections this is a great PR. Thank you, @oseiskar.

jangorecki · 2016-04-02T21:15:48Z

+1 dselivanov
FYI @mattdowle :)

mattdowle · 2016-04-07T01:57:35Z

Awesome! Looks great @oseiskar :-)

arunsrinivasan · 2016-04-07T16:45:59Z

Great PR, @oseiskar. Using the data.dable from the extensive benchmark Michael did on SO here, I profiled it using Instruments -> time profiler. Here's a snapshot.

I've not looked at the code, but IIUC writefile() takes only about 50% of the time? Also perhaps there are trivial places that could be improved on that you could spot? Hope it's of some use.

jangorecki · 2016-04-11T19:06:00Z

@oseiskar
Could you substitute R_xlen_t with something which is available in R 2.15.0?
While installing package on R 2.15.0 (data.table stated dependency) I get the following error:

fwrite.c: In function ‘writefile’:
fwrite.c:31:3: error: unknown type name ‘R_xlen_t’
   R_xlen_t ncols = LENGTH(list_of_columns);
   ^

You can reproduce it with:

docker run -it docker.io/jangorecki/r-2.15.0 /bin/bash
curl -O https://Rdatatable.github.io/data.table/src/contrib/data.table_1.9.7.tar.gz
R2 CMD INSTALL data.table_1.9.7.tar.gz

Alternatively, if there is no substitution for that or it would decrease performance, then @mattdowle would need to decide about stated dependency upgrade. R 2.15.0 is from March 2012, in my opinion 4 years old isn't yet old enough to deprecate it without strong reason.

arunsrinivasan · 2016-04-11T19:08:51Z

@jangorecki here's the solution: https://github.com/Rdatatable/data.table/blob/9d2d71098d849c99e6eebb0e0b539eb58d723b05/src/rbindlist.c#L13-17

mattdowle · 2016-11-07T19:04:17Z

Work continued here: #1664

First version of the fwrite function Rdatatable#580

d794b3b

oseiskar force-pushed the fwrite branch from 015ad40 to d794b3b Compare March 27, 2016 20:14

MichaelChirico reviewed Mar 27, 2016
View reviewed changes

minor changes in fwrite: renamed param dt to x, removed requirement o…

8696b9f

…f unique column names, replaced %in% -> %chin%

oseiskar reviewed Mar 28, 2016
View reviewed changes

oseiskar added 2 commits March 28, 2016 18:12

Improved fwrite performance by handling number formatting and NAs in …

73aeff8

…C. Also added na option to fwrite.

changed fwrite test to allow different floating point notations from …

ff5ec81

…fprintf, e.g., 1e-10 (Linux) = 1e-010 (Windows)

using path.expand on the file.path argument of fwrite

1b79ec4

increased number of significant digits to 15 in fwrite

e8c50d1

oseiskar force-pushed the fwrite branch from 3ac65c1 to cf83d03 Compare April 3, 2016 07:41

merged master and fixed conflicts

6be2ed1

oseiskar force-pushed the fwrite branch from cf83d03 to 6be2ed1 Compare April 3, 2016 10:59

mattdowle merged commit 6be2ed1 into Rdatatable:master Apr 7, 2016

oseiskar deleted the fwrite branch April 25, 2016 16:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

First version of the fwrite function #580 #1613

First version of the fwrite function #580 #1613

oseiskar commented Mar 27, 2016

MichaelChirico Mar 27, 2016

oseiskar Mar 28, 2016

MichaelChirico Mar 28, 2016

oseiskar Mar 28, 2016

jangorecki Mar 28, 2016

MichaelChirico Mar 28, 2016

oseiskar commented Mar 28, 2016

MichaelChirico commented Mar 28, 2016

MichaelChirico commented Mar 28, 2016

oseiskar commented Mar 28, 2016

MichaelChirico commented Mar 28, 2016

MichaelChirico commented Mar 28, 2016

jangorecki commented Mar 28, 2016

oseiskar commented Mar 30, 2016

dselivanov commented Mar 31, 2016

oseiskar commented Apr 2, 2016

dselivanov commented Apr 2, 2016

jangorecki commented Apr 2, 2016

mattdowle commented Apr 7, 2016

arunsrinivasan commented Apr 7, 2016

jangorecki commented Apr 11, 2016

arunsrinivasan commented Apr 11, 2016

mattdowle commented Nov 7, 2016

		@@ -0,0 +1,62 @@
		fwrite <- function(dt, file.path, append = FALSE, quote = TRUE,

First version of the fwrite function #580 #1613

First version of the fwrite function #580 #1613

Conversation

oseiskar commented Mar 27, 2016

MichaelChirico Mar 27, 2016

Choose a reason for hiding this comment

oseiskar Mar 28, 2016

Choose a reason for hiding this comment

MichaelChirico Mar 28, 2016

Choose a reason for hiding this comment

oseiskar Mar 28, 2016

Choose a reason for hiding this comment

jangorecki Mar 28, 2016

Choose a reason for hiding this comment

MichaelChirico Mar 28, 2016

Choose a reason for hiding this comment

oseiskar commented Mar 28, 2016

MichaelChirico commented Mar 28, 2016

MichaelChirico commented Mar 28, 2016

oseiskar commented Mar 28, 2016

MichaelChirico commented Mar 28, 2016

MichaelChirico commented Mar 28, 2016

jangorecki commented Mar 28, 2016

oseiskar commented Mar 30, 2016

dselivanov commented Mar 31, 2016

oseiskar commented Apr 2, 2016

dselivanov commented Apr 2, 2016

jangorecki commented Apr 2, 2016

mattdowle commented Apr 7, 2016

arunsrinivasan commented Apr 7, 2016

jangorecki commented Apr 11, 2016

arunsrinivasan commented Apr 11, 2016

mattdowle commented Nov 7, 2016