
[SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types #15239

Closed
wants to merge 12 commits into apache/spark from HyukjinKwon/SPARK-17665

Conversation

HyukjinKwon
Member

What changes were proposed in this pull request?

This PR includes the changes below:

  • Support mode/options in read.parquet, write.parquet, read.orc, write.orc, read.text, write.text, read.json and write.json APIs
  • Support other types (logical, numeric and string) as options for write.df, read.df, read.parquet, write.parquet, read.orc, write.orc, read.text, write.text, read.json and write.json
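
For example, after this change calls like the following work (a minimal sketch; `df` is a SparkDataFrame, the paths are placeholders, and `compression` is one illustrative option):

```r
df <- read.df(jsonPath, source = "json")
write.json(df, outJsonPath, mode = "overwrite", compression = "gzip")
write.parquet(df, outParquetPath, mode = "append")
```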

How was this patch tested?

Unit tests in test_sparkSQL.R and utils.R.

@HyukjinKwon
Member Author

Let me cc @felixcheung just in case, although I know you are already aware of the JIRA.

@SparkQA

SparkQA commented Sep 26, 2016

Test build #65902 has finished for PR 15239 at commit a9f9df8.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types [SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types[WIP] Sep 26, 2016
@HyukjinKwon HyukjinKwon changed the title [SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types[WIP] [SPARK-17665][SPARKR][WIP] Support options/mode all for read/write APIs and options in other types Sep 26, 2016
@HyukjinKwon HyukjinKwon changed the title [SPARK-17665][SPARKR][WIP] Support options/mode all for read/write APIs and options in other types [SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types Sep 26, 2016
@SparkQA

SparkQA commented Sep 26, 2016

Test build #65906 has finished for PR 15239 at commit 4475767.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Huh, it passes locally and on AppVeyor but not on Jenkins. Maybe it is due to the R version?

@SparkQA

SparkQA commented Sep 26, 2016

Test build #65910 has finished for PR 15239 at commit e599cd8.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

I will fix the CRAN check tomorrow.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types [SPARK-17665][SPARKR][WIP] Support options/mode all for read/write APIs and options in other types Sep 26, 2016
@@ -743,8 +743,12 @@ setMethod("toJSON",
#' @note write.json since 1.6.0
setMethod("write.json",
signature(x = "SparkDataFrame", path = "character"),
function(x, path) {
function(x, path, mode = "error", ...) {
Member

does this change the default on the JVM side when mode was previously unset?

Member Author

@HyukjinKwon HyukjinKwon Sep 27, 2016

The default is SaveMode.ErrorIfExists [1], which is what "error" maps to [2].

[1]

private var mode: SaveMode = SaveMode.ErrorIfExists

[2]
case "error" | "default" => SaveMode.ErrorIfExists

for (name in names(pairs)) {
value <- pairs[[name]]
if (!(is.logical(value) || is.numeric(value) || is.character(value) || is.null(value))) {
stop("value[", value, "] in key[", name, "] is not convertable to string.")
Member

this might not be ideal because the user is not calling this function directly, and value[something] might not mean anything to them (they have never set anything called value; furthermore, that might not be the relevant syntax in R).

Any idea on a different way to report this?

Member Author

Oh, I see. I think I should fix the format and the way the error message is printed.

How about something like:

"Supported types for options are logical, string, boolean, number and null.
The value set in ", name, " is ", typeof(value), "."

Member

I think "... logical, character, numeric and NULL" as these are the R names.

@@ -835,7 +843,7 @@ loadDF <- function(x, ...) {
#' @note createExternalTable since 1.4.0
createExternalTable.default <- function(tableName, path = NULL, source = NULL, ...) {
sparkSession <- getSparkSession()
options <- varargsToEnv(...)
Member

is there any use of varargsToEnv left?

Member Author

Yeap, it seems there is one case left [1].

[1]

cols <- varargsToEnv(...)

@felixcheung
Member

great, thanks.

  • we should consolidate the write.* functions to use a helper to avoid code duplication
  • I'm a bit worried about the function signature changes - could we have some tests for before/after?

@felixcheung
Member

also, you would need to add @param ... doc to pass the CRAN tests

@HyukjinKwon
Member Author

HyukjinKwon commented Sep 27, 2016

I'm a bit worried about the function signature changes - could we have some tests for before/after?

@felixcheung just to make sure, you mean some tests like the ones in #15231 (comment) so that we can check the error messages/what users face?

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65978 has finished for PR 15239 at commit 0b1295b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65980 has finished for PR 15239 at commit 7fffa56.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

HyukjinKwon commented Sep 27, 2016

Something is going wrong when I run cran-check.sh locally. Please bear with me while I try some fixes via Jenkins.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65981 has finished for PR 15239 at commit 5f9c3be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17665][SPARKR][WIP] Support options/mode all for read/write APIs and options in other types [SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types Sep 27, 2016
jmode <- convertToJSaveMode(mode)
write <- callJMethod(write, "mode", jmode)
write <- callJMethod(write, "options", options)
write
Member

do you think it makes sense to have a generic write method? ie. include write <- callJMethod(x@sdf, "write") and invisible(callJMethod(write, source, path))

Member Author

@HyukjinKwon HyukjinKwon Sep 28, 2016

Yes, I think that sounds good. Actually, I was thinking of a single common method for options in both read and write (maybe in utils.R?) plus two common methods for reading and writing. Would it be okay for me to try this in another PR after this one and #15231 are hopefully merged?

Member

that's fine
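
For reference, a sketch of what such a shared helper could look like (the name `writeToFileSource` is hypothetical; `setWriteOptions` is the option helper used elsewhere in this PR):

```r
writeToFileSource <- function(x, source, path, mode = "error", ...) {
  # Grab the JVM DataFrameWriter, apply mode and extra options, then
  # dispatch to the format-specific writer method (e.g. "json", "parquet").
  write <- callJMethod(x@sdf, "write")
  write <- setWriteOptions(write, mode = mode, ...)
  invisible(callJMethod(write, source, path))
}
```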

@@ -328,6 +328,7 @@ setMethod("toDF", signature(x = "RDD"),
#' It goes through the entire dataset once to determine the schema.
#'
#' @param path Path of file to read. A vector of multiple paths is allowed.
#' @param ... additional external data source specific named properties.
Member

this is odd - was it not complaining about this being missing?

Member Author

@HyukjinKwon HyukjinKwon Sep 28, 2016

Yes, it was not. I am not very sure about this (I am not used to this CRAN check yet). My guess is that it combines the arguments across functions? For example, the Parquet API is as below:

 #' @param path path of file to read. A vector of multiple paths is allowed.
 #' @return SparkDataFrame
 #' @rdname read.parquet
...
read.parquet.default <- function(path, ...) {
 #' @param ... argument(s) passed to the method.
 #' @rdname read.parquet
...
 parquetFile.default <- function(...) {

It complained about duplicated @param entries when I tried to add @param ... to read.parquet.default. So, I ended up removing it again.

On the other hand, for JSON APIs,

#' @param path Path of file to read. A vector of multiple paths is allowed.
#' @param ... additional external data source specific named properties.
#' @return SparkDataFrame
#' @rdname read.json
...
read.json.default <- function(path, ...) {
 #' @rdname read.json
 #' @name jsonFile
...
 jsonFile.default <- function(path) {

It seems jsonFile does not document @param at all. So, I think it passed.

If you meant another problem, could you please guide me?

Member

@felixcheung felixcheung Sep 29, 2016

right - when 2 functions share the same @rdname, they are documented on the same Rd page, and the CRAN check requirement is to have one and only one @param ... if either/both functions have ... as a parameter.

I haven't checked, but my guess is you need to add @param ... for @rdname read.json since ... is new.

Member Author

Thank you for your advice. I will try to deal with this as far as I can!

Member Author

@HyukjinKwon HyukjinKwon Sep 30, 2016

Hm, do you think it is okay as it is? I tried to make them look more consistent and clean but it kind of failed.

@@ -342,7 +342,8 @@ varargsToStrEnv <- function(...) {
for (name in names(pairs)) {
value <- pairs[[name]]
if (!(is.logical(value) || is.numeric(value) || is.character(value) || is.null(value))) {
stop("value[", value, "] in key[", name, "] is not convertable to string.")
stop(paste0("Unsupported type for ", name, " : ", class(value),
". Supported types are logical, numeric, character and null."))
Member

NULL instead of null

@felixcheung
Member

re: tests - I mean that for each function where we are adding the ... param, we have a test for calling it without the extra stuff, ie.
one test with

 write.json(df, jsonPath)

and one test with

 write.json(df, jsonPath, compression = T)

@felixcheung
Member

We probably do; it would be good to double-check that we have at least one test for each.
Thanks!
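
A sketch of such a before/after test pair (testthat style; `df`, `jsonPath` and `jsonPath2` are placeholders):

```r
test_that("write.json works with and without extra arguments", {
  # Old signature: no mode/options, should still succeed.
  write.json(df, jsonPath)
  expect_true(file.exists(jsonPath))

  # New signature: an extra named option forwarded via `...`.
  write.json(df, jsonPath2, compression = "gzip")
  expect_true(file.exists(jsonPath2))
})
```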

@@ -651,23 +651,25 @@ setGeneric("write.jdbc", function(x, url, tableName, mode = "error", ...) {

#' @rdname write.json
#' @export
setGeneric("write.json", function(x, path) { standardGeneric("write.json") })
setGeneric("write.json", function(x, path, mode = NULL, ...) { standardGeneric("write.json") })
Member

would it be better / more generic to have these generics as setGeneric("write.json", function(x, path, ...) instead?

Member Author

Oh, yes, sure.
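
That is, the generic becomes:

```r
setGeneric("write.json", function(x, path, ...) { standardGeneric("write.json") })
```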

@felixcheung
Member

This looks good; one comment on the generics, and this earlier comment #15239 (comment) (capital NULL is the value).

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 4, 2016

Thank you so much for your close look. I missed that comment. I have just addressed them.

@SparkQA

SparkQA commented Oct 4, 2016

Test build #66294 has finished for PR 15239 at commit 4126d04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

I think this has a conflict now since SPARK-17658 was merged; could you bring this up to date, please?

@HyukjinKwon
Member Author

Sure, I will by tomorrow.

@SparkQA

SparkQA commented Oct 6, 2016

Test build #66450 has finished for PR 15239 at commit eeb7db5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# Allow the user to have a more flexible definition of the text file path
paths <- as.list(suppressWarnings(normalizePath(path)))
read <- callJMethod(sparkSession, "read")
read <- callJMethod(read, "options", options)
Member

just noticed this in the read.* functions:
what if someone calls

read.json(path = "a", path = "b")

Which path will be used?

Member

This might be an existing problem but would go down a different code path in data sources depending on their implementation...

Member Author

Let me test this first and then I will try to show the results together!

Member Author

@HyukjinKwon HyukjinKwon Oct 7, 2016

Oh, I see.

In R, it seems we can't set the path twice.

In Scala and Python, reading takes all the paths set both in options and as arguments. For writing, the argument overwrites the path set in options.

For R, in more detail, it seems we can't simply specify the same keyword argument twice.

With the data below,

hyukjin.json

{"NAME": "Hyukjin"}

felix.json

{"NAME": "Felix"}
  • read.json()

    Duplicated keywords

    > read.json(path = "hyukjin.json", path = "felix.json")
    Error in dispatchFunc("read.json(path)", x, ...) :
      argument "x" is missing, with no default

    With a single keyword argument

    > collect(read.json("hyukjin.json", path = "felix.json"))
       NAME
    1 Felix
  • read.df()

    Duplicated keywords

    > read.df(path = "hyukjin.json", path = "felix.json", source = "json")
    Error in f(x, ...) :
      formal argument "path" matched by multiple actual arguments

    With a single keyword argument

    > read.df("hyukjin.json", path = "felix.json", source = "json")
    Error: class(schema) == "structType" is not TRUE

In this case, it seems "hyukjin.json" became the third argument, schema.

In the "with a single keyword argument" cases, it seems path becomes felix.json. For example, as below:

> tmp <- function(path, ...) {
+   print(path)
+ }
>
> tmp("a", path = "b")
[1] "b"

For ... arguments, it seems an exception is thrown when positional and keyword arguments are mixed, as below:

> varargsToStrEnv("a", path="b")
Error in env[[name]] <- value :
  attempt to use zero-length variable name

However, it seems fine if they are all positional arguments or all keyword arguments, as below:

> varargsToStrEnv("a", 1, 2, 3)
<environment: 0x7f815ba34d18>
> varargsToStrEnv(a="a", b=1, c=2, d=3)
<environment: 0x7f815ba6a440>

Member

ah, thank you for the very detailed analysis and tests.
I think it would generally be great to match the Scala/Python behavior (though not only for the sake of matching it) so that read includes all path(s).

> read.json(path = "hyukjin.json", path = "felix.json")
Error in dispatchFunc("read.json(path)", x, ...) :
  argument "x" is missing, with no default

This is because of the parameter hack.

> read.df(path = "hyukjin.json", path = "felix.json", source = "json")
Error in f(x, ...) :
  formal argument "path" matched by multiple actual arguments

I think read.df is somewhat unique in the sense that the first parameter is named path - this is either helpful (if we don't want to support multiple paths like this) or bad (the user can't specify multiple paths).

> varargsToStrEnv("a", 1, 2, 3)
<environment: 0x7f815ba34d18>

This case is somewhat dangerous - I think we end up passing a list of properties without names to the JVM side - it might be a good idea to check for zero-length variable names - perhaps you could open a JIRA on that?

Member Author

Yeap, let me try to organise the unresolved comments here and in #15231, if there are any! Thank you.
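
For reference, a sketch of the zero-length-name guard suggested above, at the top of varargsToStrEnv() (the message wording is illustrative):

```r
varargsToStrEnv <- function(...) {
  pairs <- list(...)
  nms <- names(pairs)
  # Reject positional (unnamed) arguments up front instead of failing later
  # with "attempt to use zero-length variable name" inside the loop.
  if (length(pairs) > 0 && (is.null(nms) || any(nms == ""))) {
    stop("All options should be named arguments.")
  }
  # ... existing conversion logic continues here ...
}
```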

write <- callJMethod(x@sdf, "write")
write <- setWriteOptions(write, mode = mode, ...)
Member

I guess similarly here, what if someone calls

write.text(df, path = "a", path = "b")

?

@asfgit asfgit closed this in 9d8ae85 Oct 7, 2016
@HyukjinKwon HyukjinKwon deleted the SPARK-17665 branch October 16, 2016 08:29
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-17665][SPARKR] Support options/mode all for read/write APIs and options in other types

## What changes were proposed in this pull request?

This PR includes the changes below:

  - Support `mode`/`options` in `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json` APIs

  - Support other types (logical, numeric and string) as options for `write.df`, `read.df`, `read.parquet`, `write.parquet`, `read.orc`, `write.orc`, `read.text`, `write.text`, `read.json` and `write.json`

## How was this patch tested?

Unit tests in `test_sparkSQL.R` and `utils.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#15239 from HyukjinKwon/SPARK-17665.