-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
readLines with fread #1518
readLines with fread #1518
Conversation
249b89c
to
cae607e
Compare
Hm, Looks nice, could you please remove irrelevant (spacing) lines from the PR so that the actual changes are easily visible to have a look at? Thanks. |
a50445e
to
23fbe0f
Compare
see updated PR. Do you mind if I will open separate PR for syntax/readability refinements? |
@dselivanov I've not yet had time to look at the PR, sorry. On syntax/readability, I'd suggest starting a new thread/issue so as to develop/write data.table package coding style (has been on my internal to-do for quite some time now :-( ). |
this is cool. i'd been using |
@arunsrinivasan see #1532. |
@dselivanov just had a look. Nice! Could you explain in words about |
Could also be useful to have some benchmarks to compare performance vis-a-vis |
@arunsrinivasan sure, will update PR soon. Another question - in case of |
@dselivanov fread has |
@dselivanov same behaviour as always. Default is On second thought, Also, |
@arunsrinivasan sure, we can check for
Another option is to introduce new function, say, |
|
Oh I see, |
I'm not yet sure about this. Will come back later. |
Could you file an issue for this? I'd like to refer to an issue number rather than a PR number. |
@MichaelChirico feel free to file issue yourself if you really need it. |
Can I do anything else to improve this PR, so it could be merged? |
@dselivanov sorry for the delay. I've given it some thought. I still find To me, |
@arunsrinivasan ok agree with your point about |
Okay. As long as |
See updated PR. Sorry for long update, totally forgot about it... |
man/fread.Rd
Outdated
@@ -15,14 +15,14 @@ skip=0L, select=NULL, drop=NULL, colClasses=NULL, | |||
integer64=getOption("datatable.integer64"), # default: "integer64" | |||
dec=if (sep!=".") "." else ",", col.names, | |||
check.names=FALSE, encoding="unknown", quote="\"", | |||
strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, | |||
strip.white = ifelse(sep == "\n", FALSE, TRUE), fill=FALSE, blank.lines.skip=FALSE, key=NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ifelse
could be more useful for vectorized input, here strip.white = sep!="\n"
or strip.white = !identical(sep,"\n")
should be sufficient, isn't it? the latter one if we allow 0 length sep
argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point!
does it wait for anything? assuming feedback was well addressed we could have this feature as it is pretty useful. @mattdowle @arunsrinivasan |
I only kindly ask to merge it before 1.9.8 CRAN release :-). 21 июля 2016 г. 9:44 PM пользователь "Jan Gorecki" notifications@github.com
|
@dselivanov if you could rebase and resolve the conflicts, I'll take a quick look and merge it in. |
BTW. IMO |
@jangorecki EOF is always an implicit separator. |
Makes more sense, so it is consistent to how |
I will try to resolve conflicts and update PR. |
4965e05
to
d52ea15
Compare
@arunsrinivasan rebased now. |
inst/tests/tests.Rraw
Outdated
@@ -9108,6 +9108,12 @@ test_values <- c(53L, 53L, 52L, 1L, 52L, 1L, 1L, | |||
|
|||
test(1702, isoweek(test_cases), test_values) | |||
|
|||
# | |||
lines <- fread("a,b\n ab,cd,ce\n abcdef\n hjkli \n", sep = "\n", header = T)[[1]] | |||
test(1591.1, lines, c(" ab,cd,ce", " abcdef", " hjkli ") ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test numbers can be adjusted to avoid duplicated id of tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. I have amended PR.
strip.white = !identical(sep,"\n"), | ||
fill = FALSE, blank.lines.skip = FALSE, key = NULL, | ||
showProgress = getOption("datatable.showProgress"), | ||
data.table = getOption("datatable.fread.datatable")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is more readable now 👍
src/fread.c
Outdated
@@ -798,13 +802,16 @@ SEXP readfile(SEXP input, SEXP separg, SEXP nrowsarg, SEXP headerarg, SEXP nastr | |||
// Auto detect separator, number of fields, and location of first row | |||
// ******************************************************************************************** | |||
const char *seps; | |||
if (isNull(separg)) { | |||
if(!isNull(separg)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally, this could have one extra space after if
d52ea15
to
5eae076
Compare
Merge remote-tracking branch 'upstream/master' into fread_lines # Conflicts: # R/fread.R # inst/tests/tests.Rraw
merge upstream to my PR branch. |
Current coverage is 90.66% (diff: 100%)@@ master #1518 diff @@
==========================================
Files 58 58
Lines 10697 10702 +5
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 9695 9703 +8
+ Misses 1002 999 -3
Partials 0 0
|
@arunsrinivasan @jangorecki anything else I can do? |
Sorry, took a break. Seems fine but might be hard for this release as Matt seems to have almost wrapped up checking against dependent packages. Tagging @mattdowle just in case. |
I don't think this can be easily merged - logic of |
It is tricky to use
fread
as replacement forreadLines
for reading non-tabular raw text data. In fact I realised that I actually can do that only after creating this PR.Consider following example where I want to read text without splitting it into
data.frame
/data.table
Fortunately (and I realised that only after browsing
fread.c
source code) we can passsep = "\n"
.Note, that we received
"hjkli"
, not" hjkli "
, becausestrip.white = TRUE
by default. This also can be misleading.And the last - most painful. It is not possible to do trick above on windows, where
EOL = '\r\n'
: