Merge pull request #3 from HughParsonage/colClasses-valid-type-2545

Update NEWS and documentation for Rdatatable#2025
HughParsonage · Jan 16, 2018 · 6c22aa8
2 parents e5842bb + ccdf57c
commit 6c22aa8
Showing 2 changed files with 5 additions and 1 deletion.
diff --git a/NEWS.md b/NEWS.md
@@ -22,6 +22,7 @@
     * Detects and ignores trailing ^Z end-of-file control character sometimes created on MS DOS/Windows, [#1612](https://github.com/Rdatatable/data.table/issues/1612). Thanks to Gergely Daróczi for reporting and providing a file.
     * Added option `logical01` to read a column of only `0`s and `1`s as `logical`, default `TRUE` for convenience in most cases. The large sample of rows throughout the file means that `fread` will be confident that the column really does just contain `0`s and `1`s, enabling and encouraging this convenient and efficient choice to save needing conversion afterwards or setting `colClasses` manually. In R, `logical` is `integer` anyway and can be treated as such in calculations. Further, it is no longer allowed to have mixed-case literals within a single column; i.e., a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing different styles together is not.
     * `colClasses` now supports `'complex'`, `'raw'`, `'Date'`, and `'POSIXct'` if specified. Coercion via `methods::as` will be attempted for other `colClasses` as specified. Failure to safely coerce a column to the requested class results in that column reverting to the default class (probably `"character"`), with a warning (never an error). [#1634](https://github.com/Rdatatable/data.table/issues/1634). Thanks to @hughparsonage for the pull request.
+    * `stringsAsFactors` now accepts a `double` in `[0, 1]` as a continuous extension of `c(FALSE, TRUE)`. A value of `1/r` coerces to `factor` any character column with fewer than `nrow/r` unique values.
     * Added ability to recognize and parse hexadecimal floating point numbers, as used for example in Java. Thanks to @scottstanfield [#2316](https://github.com/Rdatatable/data.table/issues/2316) for the report.
     * Now handles floating-point NaN values in a wide variety of formats, including `NaN`, `sNaN`, `1.#QNAN`, `NaN1234`, `#NUM!` and others, [#1800](https://github.com/Rdatatable/data.table/issues/1800). Thanks to Jori Liesenborgs for highlighting and the PR.
     * If negative numbers are passed to `select=` the out-of-range error now suggests `drop=` instead, [#2423](https://github.com/Rdatatable/data.table/issues/2423). Thanks to Michael Chirico for the suggestion.

diff --git a/man/fread.Rd b/man/fread.Rd
@@ -30,7 +30,10 @@ nThread=getDTthreads(), logical01=TRUE
   \item{header}{ Does the first data line contain column names? Defaults according to whether every non-empty field on the first data line is type character. If so, or TRUE is supplied, any empty column names are given a default name. }
   \item{na.strings}{ A character vector of strings which are to be interpreted as \code{NA} values. By default \code{",,"} for columns read as type character is read as a blank string (\code{""}) and \code{",NA,"} is read as \code{NA}. Typical alternatives might be \code{na.strings=NULL} (no coercion to NA at all!) or perhaps \code{na.strings=c("NA","N/A","null")}. }
   \item{file}{ File path, useful when we want to ensure that no shell commands will be executed. File path can also be provided to \code{input} argument. }
-  \item{stringsAsFactors}{ Convert all character columns to factors? }
+  \item{stringsAsFactors}{ Convert all character columns to factors?
+  Also accepts a number in \eqn{[0, 1]}: a value of \eqn{1/r} coerces to factor any character column with fewer than \code{nrow(.) / r} unique values. For example, if the data to be read has
+  100 rows and \code{stringsAsFactors=0.2}, then any character column with fewer than 20 unique values
+  will be coerced to \code{factor}. }
   \item{verbose}{ Be chatty and report timings? }
   \item{autostart}{ Deprecated and ignored with warning. Please use \code{skip} instead. }
   \item{skip}{ If 0 (default) start on the first line and from there finds the first row with a consistent number of columns. This automatically avoids irregular header information before the column names row. \code{skip>0} means ignore the first \code{skip} rows manually. \code{skip="string"} searches for \code{"string"} in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata). }
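
The fractional `stringsAsFactors` rule documented above can be illustrated with a minimal base-R sketch. The helper `factorize_by_fraction` is hypothetical and is not part of `data.table`; it only mirrors the documented threshold (a character column becomes a factor when it has fewer unique values than `nrow * frac`) and makes no claim about how `fread` implements it internally.

```r
# Hypothetical helper, not data.table code: applies the documented
# threshold rule to an ordinary data.frame after it has been read.
factorize_by_fraction <- function(df, frac) {
  stopifnot(is.numeric(frac), length(frac) == 1L, frac >= 0, frac <= 1)
  threshold <- nrow(df) * frac  # same as nrow / r when frac = 1 / r
  for (j in names(df)) {
    col <- df[[j]]
    # Only character columns are candidates; compare unique count to threshold.
    if (is.character(col) && length(unique(col)) < threshold) {
      df[[j]] <- factor(col)
    }
  }
  df
}

# 100 rows: `id` has 100 unique values, `group` has 4.
df <- data.frame(
  id    = sprintf("id%03d", 1:100),
  group = rep(c("a", "b", "c", "d"), 25),
  stringsAsFactors = FALSE
)

# With frac = 0.2 the threshold is 20 unique values.
out <- factorize_by_fraction(df, 0.2)
sapply(out, class)
```

With 100 rows and a fraction of 0.2, `group` (4 unique values) falls under the threshold of 20 and is converted, while `id` (100 unique values) stays character, matching the worked example in the man page text.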