Skip to content

Commit

Permalink
merge master and tweak news item
Browse files Browse the repository at this point in the history
  • Loading branch information
mattdowle committed Aug 30, 2021
2 parents d3de56d + 5c6bee3 commit 50039e9
Show file tree
Hide file tree
Showing 33 changed files with 378 additions and 492 deletions.
2 changes: 2 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -187,6 +187,8 @@ S3method(seq, ITime)
S3method(unique, IDate)
S3method(unique, ITime)
S3method('[<-', IDate)
S3method('min', IDate)
S3method('max', IDate)
S3method(edit, data.table)

# generics to support custom column formatters
Expand Down
68 changes: 65 additions & 3 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,7 @@

8. `melt()` now supports `NA` entries when specifying a list of `measure.vars`, which translate into runs of missing values in the output. Useful for melting wide data with some missing columns, [#4027](https://github.com/Rdatatable/data.table/issues/4027). Thanks to @vspinu for reporting, and @tdhock for implementing.

9. `melt()` now supports multiple output variable columns via the `variable_table` attribute of `measure.vars`, [#3396](https://github.com/Rdatatable/data.table/issues/3396) [#2575](https://github.com/Rdatatable/data.table/issues/2575) [#2551](https://github.com/Rdatatable/data.table/issues/2551), [#4998](https://github.com/Rdatatable/data.table/issues/4998). It should be a `data.table` with one row that describes each element of the `measure.vars` vector(s). These data/columns are copied to the output instead of the usual variable column. This is backwards compatible since the previous behavior (one output variable column) is used when there is no `variable_table`. New functions `measure()` and `measurev()` which use either a separator or a regex to create a `measure.vars` list/vector with `variable_table` attribute; useful for melting data that has several distinct pieces of information encoded in each column name. See new `?measure` and new section in reshape vignette. Thanks to Matthias Gomolka, Ananda Mahto, Hugh Parsonage, Mark Fairbanks for reporting, and to @tdhock for implementing.
9. `melt()` now supports multiple output variable columns via the `variable_table` attribute of `measure.vars`, [#3396](https://github.com/Rdatatable/data.table/issues/3396) [#2575](https://github.com/Rdatatable/data.table/issues/2575) [#2551](https://github.com/Rdatatable/data.table/issues/2551), [#4998](https://github.com/Rdatatable/data.table/issues/4998). It should be a `data.table` with one row that describes each element of the `measure.vars` vector(s). These data/columns are copied to the output instead of the usual variable column. This is backwards compatible since the previous behavior (one output variable column) is used when there is no `variable_table`. New functions `measure()` and `measurev()` which use either a separator or a regex to create a `measure.vars` list/vector with `variable_table` attribute; useful for melting data that has several distinct pieces of information encoded in each column name. See new `?measure` and new section in reshape vignette. Thanks to Matthias Gomolka, Ananda Mahto, Hugh Parsonage, Mark Fairbanks for reporting, and to Toby Dylan Hocking for implementing. Thanks to @keatingw for testing before release, requesting `measure()` accept single groups too [#5065](https://github.com/Rdatatable/data.table/issues/5065), and Toby for implementing.

10. A new interface for _programming on data.table_ has been added, closing [#2655](https://github.com/Rdatatable/data.table/issues/2655) and many other linked issues. It is built using base R's `substitute`-like interface via a new `env` argument to `[.data.table`. For details see the new vignette *programming on data.table*, and the new `?substitute2` manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.

Expand Down Expand Up @@ -109,12 +109,14 @@

21. `melt()` was pseudo generic in that `melt(DT)` would dispatch to the `melt.data.table` method but `melt(not-DT)` would explicitly redirect to `reshape2`. Now `melt()` is standard generic so that methods can be developed in other packages, [#4864](https://github.com/Rdatatable/data.table/pull/4864). Thanks to @odelmarcelle for suggesting and implementing.

22. `DT(i, j, by, ...)` has been added, i.e. functional form of a `data.table` query, [#641](https://github.com/Rdatatable/data.table/issues/641) [#4872](https://github.com/Rdatatable/data.table/issues/4872). Thanks to Yike Lu and Elio Campitelli for filing requests, many others for comments and suggestions, and Matt Dowle for the PR. This enables the `data.table` general form query to be invoked on a `data.frame` without converting it to a `data.table` first. The class of the input object is retained.
22. `DT(i, j, by, ...)` has been added, i.e. functional form of a `data.table` query, [#641](https://github.com/Rdatatable/data.table/issues/641) [#4872](https://github.com/Rdatatable/data.table/issues/4872). Thanks to Yike Lu and Elio Campitelli for filing requests, many others for comments and suggestions, and Matt Dowle for the PR. This enables the `data.table` general form query to be invoked on a `data.frame` without converting it to a `data.table` first. The class of the input object is retained. Thanks to Mark Fairbanks and Boniface Kamgang for testing and reporting problems that have been fixed before release, [#5106](https://github.com/Rdatatable/data.table/issues/5106) [#5107](https://github.com/Rdatatable/data.table/issues/5107).

```R
mtcars |> DT(mpg>20, .(mean_hp=mean(hp)), by=cyl)
```

When `data.table` queries (either `[...]` or `|> DT(...)`) receive a `data.table`, the operations maintain `data.table`'s attributes such as its key and any indices. For example, if a `data.table` is reordered by `data.table`, or a key column has a value changed by `:=` in `data.table`, its key and indices will either be dropped or reordered appropriately. Some `data.table` operations automatically add and store an index on a `data.table` for reuse in future queries, if `options(datatable.auto.index=TRUE)`, which is `TRUE` by default. `data.table`'s are also over-allocated, which means there are spare column pointer slots allocated in advance so that a `data.table` in the `.GlobalEnv` can have a column added to it truly by reference, like an in-memory database with multiple client sessions connecting to one server R process, as a `data.table` video has shown in the past. But because R and other packages don't maintain `data.table`'s attributes or over-allocation (e.g. a subset or reorder by R or another package will create invalid `data.table` attributes) `data.table` cannot use these attributes when it detects that base R or another package has touched the `data.table` in the meantime, even if the attributes may sometimes still be valid. So, please realize that, `DT()` on a `data.table` should realize better speed and memory usage than `DT()` on a `data.frame`. `DT()` on a `data.frame` may still be useful to use `data.table`'s syntax (e.g. sub-queries within group: `|> DT(i, .SD[sub-query], by=grp)`) without needing to convert to a `data.table` first.

23. `DT[i, nomatch=NULL]` where `i` contains row numbers now excludes `NA` and any outside the range [1,nrow], [#3109](https://github.com/Rdatatable/data.table/issues/3109) [#3666](https://github.com/Rdatatable/data.table/issues/3666). Before, `NA` rows were returned always for such values; i.e. `nomatch=0|NULL` was ignored. Thanks Michel Lang and Hadley Wickham for the requests, and Jan Gorecki for the PR. Using `nomatch=0` in this case when `i` is row numbers generates the warning `Please use nomatch=NULL instead of nomatch=0; see news item 5 in v1.12.0 (Jan 2019)`.

```R
Expand All @@ -135,7 +137,9 @@

24. `DT[, head(.SD,n), by=grp]` and `tail` are now optimized when `n>1`, [#5060](https://github.com/Rdatatable/data.table/issues/5060) [#523](https://github.com/Rdatatable/data.table/issues/523#issuecomment-162934391). `n==1` was already optimized. Thanks to Jan Gorecki and Michael Young for requesting, and Benjamin Schwendinger for the PR.

25. `fdroplevels()` as fast alternative to `base::droplevels` for `factor`s is now implemented. Additionally, a method for dropping levels of `data.table` `droplevels.data.table` similar to `base::droplevels.data.frame` has been added. Thanks to Steve Lianoglou for requesting, and Jan Gorecki and Benjamin Schwendinger for the PR.
25 `setcolorder()` gains `before=` and `after=`, [#4385](https://github.com/Rdatatable/data.table/issues/4358). Thanks to Matthias Gomolka for the request, and both Benjamin Schwendinger and Xianghui Dong for implementing.

26. `base::droplevels()` gains a fast method for `data.table`, [#647](https://github.com/Rdatatable/data.table/issues/647). Thanks to Steve Lianoglou for requesting, and Jan Gorecki and Benjamin Schwendinger for the PR. `fdroplevels()` for use on vectors has also been added.

## BUG FIXES

Expand Down Expand Up @@ -251,6 +255,34 @@
# no inconvenient warning
```

On the same basis, `min` and `max` methods for empty `IDate` input now return `NA_integer_` of class `IDate`, rather than `NA_double_` of class `IDate` together with base R's warning `no non-missing arguments to min; returning Inf`, [#2256](https://github.com/Rdatatable/data.table/issues/2256]. The type change and warning would cause an error in grouping, see example below. Since `NA` was returned before it seems clear that still returning `NA` but of the correct type and with no warning is appropriate, backwards compatible, and a bug fix. Thanks to Frank Narf for reporting, and Matt Dowle for fixing.
```R
DT
# d g
# <IDat> <char>
# 1: 2020-01-01 a
# 2: 2020-01-02 a
# 3: 2019-12-31 b
DT[, min(d[d>"2020-01-01"]), by=g]
# was:
# Error in `[.data.table`(DT, , min(d[d > "2020-01-01"]), by = g) :
# Column 1 of result for group 2 is type 'double' but expecting type 'integer'. Column types must be consistent for each group.
# In addition: Warning message:
# In min.default(integer(0), na.rm = FALSE) :
# no non-missing arguments to min; returning Inf
# now :
# g V1
# <char> <IDat>
# 1: a 2020-01-02
# 2: b <NA>
```
36. `DT[, min(int64Col), by=grp]` (and `max`) would return incorrect results for `bit64::integer64` columns, [#4444](https://github.com/Rdatatable/data.table/issues/4444). Thanks to @go-see for reporting, and Michael Chirico for the PR.
37. `fread(dec=',')` was able to guess `sep=','` and return an incorrect result, [#4483](https://github.com/Rdatatable/data.table/issues/4483). Thanks to Michael Chirico for reporting and fixing. It was already an error to provide both `sep=','` and `dec=','` manually.
Expand Down Expand Up @@ -293,6 +325,36 @@
39. `DT[i, sum(b), by=grp]` (and other optimized-by-group aggregates: `mean`, `var`, `sd`, `median`, `prod`, `min`, `max`, `first`, `last`, `head` and `tail`) could segfault if `i` contained row numbers and one or more were NA, [#1994](https://github.com/Rdatatable/data.table/issues/1994). Thanks to Arun Srinivasan for reporting, and Benjamin Schwendinger for the PR.
40. `identical(fread(text="A\n0.8060667366\n")$A, 0.8060667366)` is now TRUE, [#4461](https://github.com/Rdatatable/data.table/issues/4461). This is one of 13 numbers in the set of 100,000 between 0.80606 and 0.80607 in 0.0000000001 increments that were not already identical. In all 13 cases R's parser (same as `read.table`) and `fread` straddled the true value by a very similar small amount. `fread` now uses `/10^n` rather than `*10^-n` to match R identically in all cases. Thanks to Gabe Becker for requesting consistency, and Michael Chirico for the PR.

```R
for (i in 0:99999) {
s = sprintf("0.80606%05d", i)
r = eval(parse(text=s))
f = fread(text=paste0("A\n",s,"\n"))$A
if (!identical(r, f))
cat(s, sprintf("%1.18f", c(r, f, r)), "\n")
}
# input eval & read.table fread before fread now
# 0.8060603509 0.806060350899999944 0.806060350900000055 0.806060350899999944
# 0.8060614740 0.806061473999999945 0.806061474000000056 0.806061473999999945
# 0.8060623757 0.806062375699999945 0.806062375700000056 0.806062375699999945
# 0.8060629084 0.806062908399999944 0.806062908400000055 0.806062908399999944
# 0.8060632774 0.806063277399999945 0.806063277400000056 0.806063277399999945
# 0.8060638101 0.806063810099999944 0.806063810100000055 0.806063810099999944
# 0.8060647118 0.806064711799999944 0.806064711800000055 0.806064711799999944
# 0.8060658349 0.806065834899999945 0.806065834900000056 0.806065834899999945
# 0.8060667366 0.806066736599999945 0.806066736600000056 0.806066736599999945
# 0.8060672693 0.806067269299999944 0.806067269300000055 0.806067269299999944
# 0.8060676383 0.806067638299999945 0.806067638300000056 0.806067638299999945
# 0.8060681710 0.806068170999999944 0.806068171000000055 0.806068170999999944
# 0.8060690727 0.806069072699999944 0.806069072700000055 0.806069072699999944
#
# remaining 99,987 of these 100,000 were already identical
```

41. `dcast(empty-DT)` now returns an empty `data.table` rather than error `Cannot cast an empty data.table`, [#1215](https://github.com/Rdatatable/data.table/issues/1215). Thanks to Damian Betebenner for reporting, and Matt Dowle for fixing.

## NOTES

1. New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.<type>()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example :
Expand Down
7 changes: 6 additions & 1 deletion R/IDateTime.R
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,11 @@ unique.IDate =
x
}

# define min and max to avoid base R's Inf with warning on empty, #2256
min.IDate = max.IDate = function(x, ...) {
as.IDate(if (!length(x)) NA else NextMethod())
}

# fix for #1315
as.list.IDate = function(x, ...) NextMethod()

Expand Down Expand Up @@ -304,7 +309,7 @@ clip_msec = function(secs, action) {
stopf("Valid options for ms are 'truncate', 'nearest', and 'ceil'.")
)
}

###################################################################
# Date - time extraction functions
# Adapted from Hadley Wickham's routines cited below to ensure
Expand Down
Loading

0 comments on commit 50039e9

Please sign in to comment.