
[SPARK-17764][SQL] Add to_json supporting to convert nested struct column to JSON string #15354

Closed
wants to merge 7 commits into apache:master from HyukjinKwon:SPARK-17764

Conversation

HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to add a to_json function, as a counterpart to from_json, in Scala, Java and Python.

It'd be useful to be able to convert the same column both from and to JSON. Also, some data sources do not support nested types; if we are forced to save a dataframe into one of those data sources, we may be able to work around the limitation with this function.

The usage is as below:

``` scala
val df = Seq(Tuple1(Tuple1(1))).toDF("a")
df.select(to_json($"a").as("json")).show()
```

``` bash
+--------+
|    json|
+--------+
|{"_1":1}|
+--------+
```
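For illustration, here is a round-trip sketch (not part of the PR itself) pairing to_json with the existing from_json. It assumes a spark-shell session with `spark.implicits._` in scope; the `from_json` call and the explicit schema are assumptions about the surrounding API, not code from this patch.

``` scala
import org.apache.spark.sql.functions.{from_json, to_json}
import org.apache.spark.sql.types._

val df = Seq(Tuple1(Tuple1(1))).toDF("a")

// struct -> JSON string
val json = df.select(to_json($"a").as("json"))

// JSON string -> struct, using a schema that matches the original column
val schema = new StructType().add("_1", IntegerType)
json.select(from_json($"json", schema).as("a")).show()
```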

How was this patch tested?

Unit tests in JsonFunctionsSuite and JsonExpressionsSuite.

@HyukjinKwon
Member Author

cc @marmbrus Could you take a look please?

@SparkQA

SparkQA commented Oct 5, 2016

Test build #66356 has finished for PR 15354 at commit eec0cd3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 5, 2016

Test build #66372 has finished for PR 15354 at commit 5f185e3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17764][SQL] Add to_json supporting to convert nested struct column to JSON string [SPARK-17764][SQL][WIP] Add to_json supporting to convert nested struct column to JSON string Oct 5, 2016
@HyukjinKwon
Member Author

Ah, this was a known issue

@HyukjinKwon HyukjinKwon changed the title [SPARK-17764][SQL][WIP] Add to_json supporting to convert nested struct column to JSON string [SPARK-17764][SQL] Add to_json supporting to convert nested struct column to JSON string Oct 5, 2016
@SparkQA

SparkQA commented Oct 5, 2016

Test build #66383 has finished for PR 15354 at commit 26fc01f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@holdenk holdenk left a comment


Thanks for working on this :) I just did a quick look at the Python parts and noted some minor style changes that we might want to consider :)

"""

sc = SparkContext._active_spark_context
jc = sc._jvm.functions.to_json(_to_java_column(col), options)
Contributor


This is super minor, but there is a pretty consistent pattern for all of the other functions here (including from_json). It might be good to follow that same pattern for consistency's sake, since there isn't an obvious reason why it wouldn't work here.

Member Author


@holdenk Thank you for your comment. Could you elaborate a bit? I am not sure what to fix.

Contributor


Actually, never mind my original comment; the more I look at this file, the less consistent the pattern seems, and this same pattern is used elsewhere in the file.

Converts a column containing a [[StructType]] into a JSON string. Returns `null` in the case of an unsupported type.

:param col: struct column
Contributor


Would :param col: name of column containing the struct maybe be more consistent with the other pydocs for the functions? (I only skimmed a few though, so if it's the other way around that's cool.)

Member Author


Sure, let me double-check the other pydocs as well.

@SparkQA

SparkQA commented Oct 7, 2016

Test build #66484 has finished for PR 15354 at commit 5f9fa29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@marmbrus marmbrus left a comment


Overall, looks pretty good. What do you think about failing earlier and more obviously when there are unsupported datatypes?

}

test("to_json - invalid type") {
  val schema = StructType(StructField("a", CalendarIntervalType) :: Nil)
Contributor

@marmbrus marmbrus Oct 7, 2016


Hmm, I realize this is a little different from from_json, but it seems it would be better to eagerly throw an AnalysisException saying the schema contains an unsupported type. We know that ahead of time, and otherwise it's kind of mysterious why all the values come out as null.

Member Author


Sure, that makes sense. Thanks.

Member Author

@HyukjinKwon HyukjinKwon Oct 8, 2016


I would like to leave a note. In the case of CSV we verify the types before actually running tasks, but JSON does not do this. So I made this SparkSQLJsonProcessingException, which is technically a RuntimeException. However, if you want me to fix it here (adding logic to verify the schema ahead of time), I will definitely do that in this PR.

Contributor


Yeah, I think it makes more sense to add a static check for this case. We know all of the types that we are able to handle. For consistency I would also add this to the write.json code path.
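(For context, a minimal sketch of such a static check, assuming a hypothetical verifySchema helper; the set of supported leaf types is an assumption about what the JSON writer handles, not the exact code that landed:)

``` scala
import org.apache.spark.sql.types._

// Hypothetical sketch: walk the schema eagerly and reject any type the
// JSON writer cannot handle, instead of emitting nulls at runtime.
def verifySchema(schema: StructType): Unit = {
  def verify(name: String, dataType: DataType): Unit = dataType match {
    case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType |
         FloatType | DoubleType | StringType | DateType | TimestampType | BinaryType |
         _: DecimalType => // supported leaf types

    case st: StructType => st.foreach(f => verify(f.name, f.dataType))
    case ArrayType(elementType, _) => verify(name, elementType)
    case MapType(_, valueType, _) => verify(name, valueType)
    case udt: UserDefinedType[_] => verify(name, udt.sqlType)

    case _ => throw new UnsupportedOperationException(
      s"Unable to convert column $name of type ${dataType.simpleString} to JSON.")
  }
  schema.foreach(f => verify(f.name, f.dataType))
}
```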

Member Author

@HyukjinKwon HyukjinKwon Oct 14, 2016


@marmbrus Do you mind if I create another JIRA and deal with this problem for the JSON/CSV reading/writing paths in another PR? It seems I should add this logic separately from the JacksonGenerator instance (it is initiated in tasks and is used in Dataset.toJSON, StructToJson and write.json, so it seems I should add a separate test for each).

@SparkQA

SparkQA commented Oct 7, 2016

Test build #66526 has finished for PR 15354 at commit ecdac76.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 7, 2016

Test build #66523 has finished for PR 15354 at commit 56e513c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 7, 2016

Test build #66525 has finished for PR 15354 at commit 58b344d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 8, 2016

Test build #66558 has finished for PR 15354 at commit ecdac76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2016

Test build #67012 has finished for PR 15354 at commit 38d89a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

hi - where are we on this?

@HyukjinKwon
Member Author

@felixcheung Thank you for pinging!

@marmbrus Would you mind if I handle checking the schema ahead of time for CSV/JSON reads/writes in another PR?

@SparkQA

SparkQA commented Oct 22, 2016

Test build #67387 has finished for PR 15354 at commit bbbfaff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

It would be really nice to fail in analysis rather than execution. What if it only fails after hours of computation? As a user I'd be upset. I'm also concerned they will think it's a Spark bug.

@HyukjinKwon
Member Author

@marmbrus Sure (I didn't mean I wasn't going to do this); I just handled the case for to_json in this PR.

BTW, I would like to note that the same problem exists in other JSON-related functionality. For example, I might have to add

override def checkInputDataTypes(): TypeCheckResult = {
  ...
  JacksonUtils.verifySchema(child.dataType.asInstanceOf[StructType])
  ...
}

for from_json as well. Let me open a follow-up for adding this logic and tests for the JSON-related functionality.

@SparkQA

SparkQA commented Oct 23, 2016

Test build #67407 has finished for PR 15354 at commit d74c96d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 23, 2016

Test build #67408 has finished for PR 15354 at commit 8603462.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 23, 2016

Test build #67409 has finished for PR 15354 at commit 518f48d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@marmbrus marmbrus left a comment


Only minor comments. Thanks for working on this!


case _ =>
  throw new UnsupportedOperationException(
    s"JSON conversion does not support to process ${dataType.simpleString} type.")
Contributor


`does not support to process` is a little hard to parse. Maybe: `Unable to convert column ${name} of type ${dataType.simpleString} to JSON.`

s"with the type of $dataType to JSON.")
throw new SparkSQLJsonProcessingException(
s"Failed to convert value $v (class of ${v.getClass}}) " +
s"with the type of $dataType to JSON.")
Contributor


I would avoid this change, since it's now throwing a private exception type to the user.

@HyukjinKwon
Member Author

@marmbrus Thank you so much.

@SparkQA

SparkQA commented Oct 27, 2016

Test build #67622 has finished for PR 15354 at commit 791f802.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s\"Failed to convert value $v (class of $

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 27, 2016

Test build #67630 has finished for PR 15354 at commit 791f802.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s\"Failed to convert value $v (class of $

@HyukjinKwon
Member Author

Let me take a deeper look if the same tests keep failing.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Oct 27, 2016

Test build #67633 has finished for PR 15354 at commit 791f802.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s\"Failed to convert value $v (class of $

@HyukjinKwon
Member Author

retest this please

@HyukjinKwon
Member Author

It seems the test failure is not related to this PR.

@SparkQA

SparkQA commented Oct 27, 2016

Test build #67647 has finished for PR 15354 at commit 791f802.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s\"Failed to convert value $v (class of $

@felixcheung
Member

Looks good. Should we clarify that the output JSON is in JSON Lines format? http://jsonlines.org/

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 28, 2016

Oh, never mind. I left a useless comment and removed it. It seems it'd be better to mention that.

@SparkQA

SparkQA commented Oct 28, 2016

Test build #67694 has finished for PR 15354 at commit 4d69ab2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@marmbrus marmbrus left a comment


I think we are getting close. Just one more large comment and some doc changes.


/**
 * Converts a column containing a [[StructType]] into a JSON string
 * ([[http://jsonlines.org/ JSON Lines text format or newline-delimited JSON]]) with the
Contributor


I don't think that this case really follows "JSON lines". It is a string inside of a larger dataframe. There are no newlines involved.


override def checkInputDataTypes(): TypeCheckResult = {
  if (StructType.acceptsType(child.dataType)) {
    try {
Contributor


Sorry, one final comment as I'm looking at this more closely. I don't think we should use exceptions for control flow in the common case. Specifically, verifySchema should work the same way as acceptsType above and return a boolean.

Member Author

@HyukjinKwon HyukjinKwon Oct 29, 2016


Ah, yes, that makes sense, but if verifySchema returns a boolean, we cannot tell which field and type are problematic.

Maybe I can do one of the following:

  • move the logic in verifySchema into checkInputDataTypes
  • make verifySchema return the unsupported fields and types (see the sketch after this comment)
  • just fix the exception message, dropping the information about unsupported fields and types

If you pick one, I will follow (or please let me know if there is a better way)!
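(A hedged sketch of the second option, with an assumed helper name, collectUnsupported; it gathers every offending field so the error message can name them, rather than failing on the first:)

``` scala
import org.apache.spark.sql.types._

// Hypothetical: return the unsupported fields with their types so the
// caller can build a precise error message (names here are assumptions).
def collectUnsupported(schema: StructType): Seq[(String, DataType)] = {
  def collect(name: String, dt: DataType): Seq[(String, DataType)] = dt match {
    case st: StructType => st.flatMap(f => collect(f.name, f.dataType))
    case ArrayType(elementType, _) => collect(name, elementType)
    case MapType(_, valueType, _) => collect(name, valueType)
    case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType |
         FloatType | DoubleType | StringType | DateType | TimestampType | BinaryType |
         _: DecimalType => Nil
    case unsupported => Seq(name -> unsupported)
  }
  schema.flatMap(f => collect(f.name, f.dataType))
}
```

The caller could then build the message with, e.g., `collectUnsupported(schema).map { case (n, t) => s"$n: ${t.simpleString}" }.mkString(", ")`.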

Contributor


Oh, I see. It is for the better error message. I guess it's probably not worth the time to refactor in that case.

@SparkQA

SparkQA commented Oct 29, 2016

Test build #67736 has finished for PR 15354 at commit b76a08e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 1, 2016

Test build #67893 has finished for PR 15354 at commit 971d1c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Nov 1, 2016

Test build #67902 has finished for PR 15354 at commit 971d1c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

marmbrus commented Nov 1, 2016

Thanks, I'm going to merge this to master.

@asfgit asfgit closed this in 01dd008 Nov 1, 2016
@HyukjinKwon
Member Author

Thank you for merging this!

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…column to JSON string

## What changes were proposed in this pull request?

This PR proposes to add a `to_json` function, as a counterpart to `from_json`, in Scala, Java and Python.

It'd be useful to be able to convert the same column both from and to JSON. Also, some data sources do not support nested types; if we are forced to save a dataframe into one of those data sources, we may be able to work around the limitation with this function.

The usage is as below:

``` scala
val df = Seq(Tuple1(Tuple1(1))).toDF("a")
df.select(to_json($"a").as("json")).show()
```

``` bash
+--------+
|    json|
+--------+
|{"_1":1}|
+--------+
```
## How was this patch tested?

Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#15354 from HyukjinKwon/SPARK-17764.
@HyukjinKwon HyukjinKwon deleted the SPARK-17764 branch January 2, 2018 03:43