First version of the fwrite function #580 #1613
@@ -0,0 +1,62 @@
fwrite <- function(dt, file.path, append = FALSE, quote = TRUE,
- should definitely use `DT` or `x`, given the `dt` function in the `stats` package.
- consistency with `write.csv` would have `file.path` named `file` and take default value `""` (according to `?write.csv`, the default is to print to console)
- ok, I'll change this
- This is semi-intentional. Supporting `file=""` (stdout) and connections would require passing file handles (instead of just file name strings) from R to C and I don't currently know how to do this. R's C interface is poorly documented.
- Gotcha. Not sure how important the consistency is, but maybe just follow `write.table` to this end:
    if (file == "")
        file <- stdout()
    else if (is.character(file)) {
        file <- if (nzchar(fileEncoding))
            file(file, ifelse(append, "a", "w"), encoding = fileEncoding)
        else file(file, ifelse(append, "a", "w"))
        on.exit(close(file))
    }
    else if (!isOpen(file, "w")) {
        open(file, "w")
        on.exit(close(file))
    }
    if (!inherits(file, "connection"))
        stop("'file' must be a character string or connection")
And here is the C side of the code, to the extent that it helps
…f unique column names, replaced %in% -> %chin%
repeat {
  block_end <- min(block_begin+(block.size-1), nrow(x))

  dt_block <- x[c(block_begin:block_end),]
In reply to MichaelChirico's comment asking whether it would be faster to create an extra column and use it like `x[.(block_no)]`: it could be a little bit faster and would not be difficult to implement in C, but I dislike the idea of modifying the input data table. Is there a convenient way to generate the name of such a column so that it would not conflict with existing column names?
@oseiskar None that I'm aware of; once #633 is solved it will be easy. You can use a cryptic name (prefix it with a dot) and check that it doesn't exist in the data.table. Not sure what you mean by modifying the input, but adding a column without modifying the input is as simple as `x = shallow(x)[, "col" := new]`; it won't copy the data, and it will add the new column only to the locally processed data.
Just to reinforce Jan's point, this sort of thing (restricting acceptable column names) is done under the hood in several places of `data.table` code already, see e.g. here. As a suggestion, there's `b__`, which would be analogous to `f__`, `l__`, `o__`, `zo__`, `jn__`, and `jl__`, all found in `[.data.table`. Agreed, `block_no` is too likely to be in users' tables (in fact I have several cases like that myself).
…C. Also added na option to fwrite.
…fprintf, e.g., 1e-10 (Linux) = 1e-010 (Windows)
I moved number and NA formatting to C code, which resulted in less column copying and a significant performance boost. See this gist. Now this version seems to be between 2 and 4 times faster than @jangorecki thanks for pointing out the shortcomings of
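The commit note above mentions that `fprintf` prints `1e-10` on Linux but `1e-010` on Windows. As a rough illustration of one way to get identical output on both platforms, the sketch below formats the number into a buffer first and then strips surplus leading zeros from the exponent. It is not the code from this PR: the function names `normalize_exponent` and `format_double_portable` and the `%.15g` precision are assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the PR): rewrite the exponent of a number
 * already formatted in scientific notation so that it has no surplus
 * leading zeros, e.g. "1e-010" becomes "1e-10". At least two exponent
 * digits are kept, matching what C99 printf produces. */
void normalize_exponent(char *s)
{
    char *e = strchr(s, 'e');
    if (e == NULL)
        return;                         /* fixed notation, nothing to do */

    char *digits = e + 1;
    if (*digits == '+' || *digits == '-')
        digits++;                       /* keep the sign, inspect the digits */

    size_t len = strlen(digits);
    size_t skip = 0;
    while (len - skip > 2 && digits[skip] == '0')
        skip++;                         /* drop zeros while >2 digits remain */

    if (skip > 0)
        memmove(digits, digits + skip, len - skip + 1);  /* +1 copies '\0' */
}

/* Hypothetical wrapper: format `value` with an assumed precision of 15
 * significant digits, then normalise the exponent. Returns the length of
 * the result, or -1 if the buffer is too small or formatting failed. */
int format_double_portable(char *buf, size_t size, double value)
{
    int n = snprintf(buf, size, "%.15g", value);
    if (n < 0 || (size_t)n >= size)
        return -1;
    normalize_exponent(buf);
    return (int)strlen(buf);
}
```

Formatting into a buffer instead of calling `fprintf` directly costs one extra copy per number, but it keeps the platform difference in a single place.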
My preference is to use By the way, thanks for working on this! The
Two more things:
Produces:
I don't know the proper way to handle this, perhaps surround a
? Or perhaps take the cue from
Otherwise just error on
@MichaelChirico The problem is not the working directory but the
inre:
inre: numeric columns, for consistency with
`elapsed.secs <- system.time(ans <- fwrite(...))[[3L]]` is the easiest way
    pt = microbenchmark::get_nanotime()
    ans = fwrite(...)
    microbenchmark::get_nanotime() - pt
or from the devel version of microbenchmarkCore a drop-in replacement for
No need to have list columns supported now, this can be easily and flexibly handled within data.table before passing it to fwrite. Not sure about
As a reply to this earlier comment about supporting @MichaelChirico thanks for looking up the C side of In particular, one would need to call
Not sure if this will help, but I'll leave it here: official C-level API for connection handles in R 3.3.0.
Actually, it is possible to bypass the private API limitation and call
In my opinion, supporting R connection arguments is not worth the trouble at this point.
Personally, I think connections are a "nice to have" feature (which can be implemented in the future), but even without connections this is a great PR. Thank you, @oseiskar.
+1 dselivanov
Awesome! Looks great @oseiskar :-)
Great PR, @oseiskar. Using the I've not looked at the code, but IIUC
@oseiskar
    fwrite.c: In function ‘writefile’:
    fwrite.c:31:3: error: unknown type name ‘R_xlen_t’
    R_xlen_t ncols = LENGTH(list_of_columns);
    ^
You can reproduce it with:
Alternatively, if there is no substitute for that, or if it would decrease performance, then @mattdowle would need to decide about upgrading the stated dependency. R 2.15.0 is from March 2012; in my opinion, 4 years is not yet old enough to deprecate it without a strong reason.
Work continued here: #1664
This implementation of `fwrite` (#580) aims to be faster, or at least as fast as `write.csv`, but a few things have been left out or simplified:
- `quote=TRUE`, all column names are quoted
- `quote=FALSE`, nothing is quoted, even if this would break the CSV
- `row.names`. They only make sense for `data.frame`s with named rows. For `data.tables`, they would just reduce to row numbers.

The speedup compared to `write.csv` depends on column types and parameters but speedup factors from 2 to 4 are possible.
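To make the `quote=FALSE` caveat concrete, here is a minimal C sketch of what "breaking the CSV" means and how a quoted field can be written safely. It is not taken from this PR's `fwrite.c`. The function names `csv_needs_quotes` and `csv_write_quoted` are hypothetical, and doubling embedded quotes is an assumed convention (a common one for CSV), not something this description confirms.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if writing `field` without quotes would
 * produce an ambiguous CSV record, i.e. it contains the separator, a
 * double quote, or a line break; returns 0 otherwise. */
int csv_needs_quotes(const char *field, char sep)
{
    for (const char *p = field; *p != '\0'; p++) {
        if (*p == sep || *p == '"' || *p == '\n' || *p == '\r')
            return 1;
    }
    return 0;
}

/* Hypothetical helper: writes `field` to `out` wrapped in double quotes,
 * doubling any embedded double quote (assumed convention).
 * Returns 0 on success and -1 on a write error. */
int csv_write_quoted(FILE *out, const char *field)
{
    if (fputc('"', out) == EOF)
        return -1;
    for (const char *p = field; *p != '\0'; p++) {
        if (*p == '"' && fputc('"', out) == EOF)
            return -1;                  /* extra quote before the real one */
        if (fputc(*p, out) == EOF)
            return -1;
    }
    return fputc('"', out) == EOF ? -1 : 0;
}
```

A writer that skips the check and never quotes, as described for `quote=FALSE`, is faster but leaves it to the caller to ensure that no field contains the separator, a quote, or a line break.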