[SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types #15239
Conversation
Let me cc @felixcheung just in case, although I know you were on the JIRA.
Test build #65902 has finished for PR 15239 at commit
Test build #65906 has finished for PR 15239 at commit
Huh, it passes locally and on AppVeyor but not on Jenkins. Maybe it is due to the R version?
Test build #65910 has finished for PR 15239 at commit
I will fix the CRAN check tomorrow.
@@ -743,8 +743,12 @@ setMethod("toJSON",
 #' @note write.json since 1.6.0
 setMethod("write.json",
           signature(x = "SparkDataFrame", path = "character"),
-          function(x, path) {
+          function(x, path, mode = "error", ...) {
does this change the default on the JVM side when mode was previously unset?
The default is SaveMode.ErrorIfExists [1], which is what "error" means [2].
[1] private var mode: SaveMode = SaveMode.ErrorIfExists
[2] case "error" | "default" => SaveMode.ErrorIfExists
for (name in names(pairs)) {
  value <- pairs[[name]]
  if (!(is.logical(value) || is.numeric(value) || is.character(value) || is.null(value))) {
    stop("value[", value, "] in key[", name, "] is not convertable to string.")
This might not be ideal, because the user is not calling this function directly, and value[something] might not mean anything to them (they have never set any "value" thing; furthermore, that might not be the relevant syntax in R).
Any idea on a different way to report this?
Oh, I see. I think I should fix the format and the way the error message is printed.
How about something like:
"Supported types for options are logical, string, boolean, number and null.
The value set in ", name, " is ", typeof(value), "."
I think "... logical, character, numeric and NULL" as these are the R names.
@@ -835,7 +843,7 @@ loadDF <- function(x, ...) {
 #' @note createExternalTable since 1.4.0
 createExternalTable.default <- function(tableName, path = NULL, source = NULL, ...) {
   sparkSession <- getSparkSession()
-  options <- varargsToEnv(...)
Is there any use of varargsToEnv left?
great, thanks.
also, you would need to add …
@felixcheung just to make sure, you mean some tests like the ones in #15231 (comment), so that we can check the error messages/what users face?
Test build #65978 has finished for PR 15239 at commit
Test build #65980 has finished for PR 15239 at commit
I am having some trouble with running …
Test build #65981 has finished for PR 15239 at commit
jmode <- convertToJSaveMode(mode)
write <- callJMethod(write, "mode", jmode)
write <- callJMethod(write, "options", options)
write
Do you think it makes sense to have a generic write method? i.e. include write <- callJMethod(x@sdf, "write") and invisible(callJMethod(write, source, path))
Yes, I think that sounds good. Actually, I was thinking of a single common method for options in both read/write (maybe in utils.R?), and two common methods for reading/writing in read/write. I am wondering if you think it is okay for me to try this in another PR, after this one and #15231 are hopefully merged.
that's fine
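For illustration, a minimal sketch of what such a shared write helper could look like (the name setWriteOptions matches the helper that shows up later in this PR, but this exact body is an assumption, not the merged code):

# Apply path, save mode and remaining named options to a JVM DataFrameWriter.
setWriteOptions <- function(write, path = NULL, mode = "error", ...) {
  options <- varargsToStrEnv(...)
  if (!is.null(path)) {
    options[["path"]] <- path
  }
  jmode <- convertToJSaveMode(mode)
  write <- callJMethod(write, "mode", jmode)
  write <- callJMethod(write, "options", options)
  write
}

Each type-specific writer could then just do write <- callJMethod(x@sdf, "write") followed by write <- setWriteOptions(write, mode = mode, ...), instead of repeating the mode/options plumbing.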
@@ -328,6 +328,7 @@ setMethod("toDF", signature(x = "RDD"),
 #' It goes through the entire dataset once to determine the schema.
 #'
 #' @param path Path of file to read. A vector of multiple paths is allowed.
+#' @param ... additional external data source specific named properties.
This is odd - was it not complaining about this being missing?
Yes, it was not. I am not very sure about this (as I am not used to the CRAN check). My guess is that it combines the arguments. For example, the Parquet API is as below:

#' @param path path of file to read. A vector of multiple paths is allowed.
#' @return SparkDataFrame
#' @rdname read.parquet
...
read.parquet.default <- function(path, ...) {

#' @param ... argument(s) passed to the method.
#' @rdname read.parquet
...
parquetFile.default <- function(...) {

It complained about duplicated @param entries when I tried to add @param ... to read.parquet.default, so I ended up removing it again.
On the other hand, for the JSON APIs:

#' @param path Path of file to read. A vector of multiple paths is allowed.
#' @param ... additional external data source specific named properties.
#' @return SparkDataFrame
#' @rdname read.json
...
read.json.default <- function(path, ...) {

#' @rdname read.json
#' @name jsonFile
...
jsonFile.default <- function(path) {

It seems jsonFile does not describe @param at all, so I think it passed.
If you meant another problem, could you please guide me?
Right - when two functions share the same @rdname, they are documented on the same Rd page, and the CRAN check requirement is to have one and only one @param ... if either or both functions have ... as a parameter.
I haven't checked, but my guess is you need to add @param ... for @rdname read.json, since ... is new.
Thank you for your advice. I will try to deal with this as far as I can!
Hm, do you think it is okay as it is? I tried to make them look more consistent and clean but it kind of failed.
@@ -342,7 +342,8 @@ varargsToStrEnv <- function(...) {
   for (name in names(pairs)) {
     value <- pairs[[name]]
     if (!(is.logical(value) || is.numeric(value) || is.character(value) || is.null(value))) {
-      stop("value[", value, "] in key[", name, "] is not convertable to string.")
+      stop(paste0("Unsupported type for ", name, " : ", class(value),
+                  ". Supported types are logical, numeric, character and null."))
NULL instead of null
re: tests - I mean, for each function we are adding the … and one test with …
Er, don't we already have the tests for them, for ORC, Text, Parquet [1] and JSON [2]? [1] https://github.com/HyukjinKwon/spark/blob/5f9c3bef075870c26eef36757eb1b5572a065015/R/pkg/inst/tests/testthat/test_sparkSQL.R#L1798-L1914
We probably do; it would be good to double-check that we have at least a test for each.
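As a sketch, a test of the kind being discussed might look like this (assuming testthat, as used in test_sparkSQL.R; the data frame and paths are illustrative):

test_that("write.json respects the mode argument", {
  df <- createDataFrame(data.frame(a = 1:3))
  jsonPath <- tempfile(pattern = "jsonPath", fileext = ".json")
  write.json(df, jsonPath)
  # A second write with the default mode = "error" should fail ...
  expect_error(write.json(df, jsonPath))
  # ... while mode = "overwrite" should succeed.
  write.json(df, jsonPath, mode = "overwrite")
  expect_equal(count(read.json(jsonPath)), 3)
})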
@@ -651,23 +651,25 @@ setGeneric("write.jdbc", function(x, url, tableName, mode = "error", ...) {
 #' @rdname write.json
 #' @export
-setGeneric("write.json", function(x, path) { standardGeneric("write.json") })
+setGeneric("write.json", function(x, path, mode = NULL, ...) { standardGeneric("write.json") })
Would it be better, as more generic, to have these generics as setGeneric("write.json", function(x, path, ...)) instead?
Oh, yes, sure.
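For reference, the more generic form discussed above would be:

setGeneric("write.json", function(x, path, ...) { standardGeneric("write.json") })

Here mode would be captured via ... and given its default (e.g. "error") in the method implementation instead of in the generic.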
This looks good; one comment on generics, and this earlier comment #15239 (comment) (capital NULL is the value).
Force-pushed from 5f9c3be to 4126d04.
Thank you so much for your close look. I missed the comment; I have just addressed it.
Test build #66294 has finished for PR 15239 at commit
I think this has a conflict now since SPARK-17658 is merged; could you bring this up to date please?
Sure, I will by tomorrow.
…e.orc read/write.text
Force-pushed from 4126d04 to eeb7db5.
Test build #66450 has finished for PR 15239 at commit
# Allow the user to have a more flexible definition of the text file path
paths <- as.list(suppressWarnings(normalizePath(path)))
read <- callJMethod(sparkSession, "read")
read <- callJMethod(read, "options", options)
Just noticed this in the read.* functions: what if someone calls
read.json(path = "a", path = "b")
Which path will be used?
This might be an existing problem but would go down a different code path in data sources depending on their implementation...
Let me test this first; I will try to show the results all together!
Oh, I see.
In R, it seems we can't set the path twice.
In Scala and Python, reading takes all the paths, both those set in the options and those passed as arguments. For writing, the argument overwrites the path set in the options.
For R, in more detail, it seems we can't specify the same keyword argument twice.
With the data below:
hyukjin.json
{"NAME": "Hyukjin"}
felix.json
{"NAME": "Felix"}

read.json():
Duplicated keywords:
> read.json(path = "hyukjin.json", path = "felix.json")
Error in dispatchFunc("read.json(path)", x, ...) :
  argument "x" is missing, with no default
With a single keyword argument:
> collect(read.json("hyukjin.json", path = "felix.json"))
   NAME
1 Felix

read.df():
Duplicated keywords:
> read.df(path = "hyukjin.json", path = "felix.json", source = "json")
Error in f(x, ...) :
  formal argument "path" matched by multiple actual arguments
With a single keyword argument:
> read.df("hyukjin.json", path = "felix.json", source = "json")
Error: class(schema) == "structType" is not TRUE
In this case, it seems "hyukjin.json" became the third argument, schema.

In the single-keyword-argument case, it seems path becomes "felix.json". For example, as below:
> tmp <- function(path, ...) {
+   print(path)
+ }
>
> tmp("a", path = "b")
[1] "b"

For ... arguments, it seems an exception is thrown when named and unnamed arguments are mixed, as below:
> varargsToStrEnv("a", path="b")
Error in env[[name]] <- value :
  attempt to use zero-length variable name
However, it seems fine if they are all non-keyword arguments or all keyword arguments, as below:
> varargsToStrEnv("a", 1, 2, 3)
<environment: 0x7f815ba34d18>
> varargsToStrEnv(a="a", b=1, c=2, d=3)
<environment: 0x7f815ba6a440>
Ah, thank you for the very detailed analysis and tests.
I think it would generally be great to match the Scala/Python behavior (though not just for the sake of matching it), for read to include all path(s).
> read.json(path = "hyukjin.json", path = "felix.json")
Error in dispatchFunc("read.json(path)", x, ...) :
  argument "x" is missing, with no default
This is because of the parameter hack.
> read.df(path = "hyukjin.json", path = "felix.json", source = "json")
Error in f(x, ...) :
  formal argument "path" matched by multiple actual arguments
I think read.df is somewhat unique in the sense that the first parameter is named path - this is either helpful (if we don't want to support multiple paths like this) or bad (the user can't specify multiple paths).
> varargsToStrEnv("a", 1, 2, 3)
<environment: 0x7f815ba34d18>
This case is somewhat dangerous - I think we end up passing a list of properties without names to the JVM side. It might be a good idea to check for a zero-length variable name - perhaps could you open a JIRA on that?
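A minimal sketch of such a check (the function name and message here are illustrative, not the eventual fix, which the comment above suggests tracking in a JIRA):

varargsToStrEnvChecked <- function(...) {
  pairs <- list(...)
  nms <- names(pairs)
  # Reject unnamed options up front, instead of failing later with
  # "attempt to use zero-length variable name".
  if (length(pairs) > 0 && (is.null(nms) || any(nms == ""))) {
    stop("All options should be named arguments.")
  }
  # ... then proceed as varargsToStrEnv does ...
}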
Yep, let me try to organize the unresolved comments here and in #15231, if there are any! Thank you.
write <- callJMethod(x@sdf, "write")
write <- setWriteOptions(write, mode = mode, ...)
I guess similarly here: what if someone calls write.text(df, path = "a", path = "b")?
…d options in other types
Author: hyukjinkwon <gurwls223@gmail.com>
Closes apache#15239 from HyukjinKwon/SPARK-17665.
What changes were proposed in this pull request?
This PR includes the changes below:
- Support mode/options in the read.parquet, write.parquet, read.orc, write.orc, read.text, write.text, read.json and write.json APIs
- Support other types (logical, numeric and string) as options for write.df, read.df, read.parquet, write.parquet, read.orc, write.orc, read.text, write.text, read.json and write.json
How was this patch tested?
Unit tests in test_sparkSQL.R / utils.R.
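To summarize, a hedged usage sketch of the extended APIs (the paths and data are illustrative, not from the PR):

df <- read.json("people.json")

# mode is now supported in the type-specific writers:
write.json(df, "people_out_json", mode = "overwrite")
write.orc(df, "people_out_orc", mode = "append")

# Options may now be logical or numeric rather than only character,
# e.g. a logical header option for a CSV source:
csv <- read.df("people.csv", source = "csv", header = TRUE)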