
[SPARK-30267][SQL] Avro arrays can be of any List #26907

Closed
wants to merge 1 commit

Conversation

steven-aerts
Contributor

The deserializer assumed that Avro arrays are always of type GenericData$Array, which is not the case.
Assuming only that they implement java.util.List is safer and fixes a ClassCastException in some Avro code.

What changes were proposed in this pull request?

java.util.List provides all the methods the deserializer needs, and GenericData$Array implements it, so casting to the interface is strictly more permissive.
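The idea behind the fix can be sketched in isolation (a hedged illustration, not the actual Spark source — `sumViaListCast` and `ListCastSketch` are made-up names; the Avro classes are only referenced in comments so the snippet stays self-contained):

```scala
// Hedged sketch: casting to the java.util.List interface accepts values
// that a cast to the concrete GenericData$Array class would reject with
// a ClassCastException.
import java.util.{ArrayList => JArrayList, List => JList}

object ListCastSketch {
  // Stand-in for the deserializer's array handling after the fix.
  def sumViaListCast(value: Any): Int = {
    // Before the fix, the analogous code cast to
    // org.apache.avro.generic.GenericData$Array, which throws for a
    // plain ArrayList. The broad interface cast below accepts any List.
    val array = value.asInstanceOf[JList[Integer]]
    var sum = 0
    var i = 0
    while (i < array.size()) { sum += array.get(i); i += 1 }
    sum
  }

  def main(args: Array[String]): Unit = {
    // A plain ArrayList is a java.util.List but not a GenericData$Array.
    val list = new JArrayList[Integer]()
    list.add(1); list.add(2); list.add(3)
    println(sumViaListCast(list))
  }
}
```

Since GenericData$Array itself implements java.util.List, the broader cast loses nothing for the existing code paths.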

Why are the changes needed?

To prevent the following exception when deserializing more complex Avro objects:

```
java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.avro.generic.GenericData$Array
	at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:170)
	at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19$adapted(AvroDeserializer.scala:169)
	at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:314)
	at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:310)
	at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:332)
	at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:329)
	at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:56)
	at org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:70)
```

Does this PR introduce any user-facing change?

No

How was this patch tested?

The current tests already cover this behavior. In essence, this patch just changes a type cast to a more general type, so I expect no functional impact.

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

> Added a test for an array containing structs which was not yet covered.

So this case is when it becomes java.util.ArrayList?

@HyukjinKwon
Member

cc @gengliangwang FYI.

@SparkQA

SparkQA commented Dec 17, 2019

Test build #115453 has finished for PR 26907 at commit fc4693d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 17, 2019

Test build #115458 has finished for PR 26907 at commit 19a821a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -127,6 +127,25 @@ class AvroCatalystDataConversionSuite extends SparkFunSuite
}
}

test(s"array of nested schema with seed") {
Member

@gengliangwang Dec 18, 2019


I tried the test case with GenericData.Array and I can't reproduce the error you mentioned.
Do you know when the array type is not GenericData.Array? You could also create a test case without a random schema.

Contributor Author


@gengliangwang you are right. This test covers an extra scenario, but it was not triggering this issue. (I thought it did, but it did not.)

I added an extra test that does trigger the issue; I am wondering what you think of it.
I hope it does not feel too artificial. It tries to show that a GenericData record can contain different List implementation types.

We hit this issue when mapping an RDD[T] to a DataFrame using the AvroDeserializer, where T is a class generated by the Avro code generator. Such a class extends SpecificData, which extends GenericData, so all of the AvroDeserializer code works except for this cast, which targets a type that is too specific in the type hierarchy.

I hope this is clearer now.

Thanks for your support.

@SparkQA

SparkQA commented Dec 18, 2019

Test build #115511 has finished for PR 26907 at commit 0811ca3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2019

Test build #115513 has finished for PR 26907 at commit 190b5e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 19, 2019

Test build #115555 has finished for PR 26907 at commit 48680a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steven-aerts
Contributor Author

@gengliangwang for me this patch is ready. Let me know if there is anything else I can do.

Member

@gengliangwang left a comment


LGTM except one comment

@SparkQA

SparkQA commented Jan 2, 2020

Test build #116029 has finished for PR 26907 at commit 9bc6de7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val deserializer = new AvroDeserializer(avroSchema, dataType)

def checkDeserialization(data: GenericData.Record): Unit = {
checkResult(
Member


The method checkResult returns a Boolean and won't fail if the result doesn't match the expected answer.

Contributor Author


Good catch. Fixed.
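The pitfall flagged in this thread can be shown with a minimal stand-in (the real `checkResult` in Spark's test suite has a different signature; `CheckResultSketch` and its one-line `checkResult` are hypothetical names for illustration):

```scala
// Hedged sketch: a check helper that merely returns a Boolean is
// silently ignored unless the caller asserts on its result.
object CheckResultSketch {
  // Hypothetical stand-in for a Boolean-returning comparison helper.
  def checkResult(actual: Any, expected: Any): Boolean = actual == expected

  def main(args: Array[String]): Unit = {
    // Silent: the returned Boolean is discarded, so this mismatch
    // never fails the test even though the values differ.
    checkResult(1, 2)

    // Correct: wrap the result in an assertion so a mismatch throws.
    assert(checkResult(1, 1))
    println("ok")
  }
}
```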

```diff
-    var i = 0
-    while (i < len) {
-      val element = array.get(i)
+    for ((element, i) <- array.asScala.zipWithIndex) {
```
Member


Why did we change this? zipWithIndex is discouraged, especially in a performance-sensitive code path (https://github.com/databricks/scala-style-guide#perf-whileloops)

Contributor Author


I kept the old counter.
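The trade-off discussed in this thread can be sketched side by side (a hedged illustration, not the actual Spark code; `TraversalSketch` and both helper names are invented, and `scala.jdk.CollectionConverters` assumes Scala 2.13+, whereas Spark at the time used `scala.collection.JavaConverters`):

```scala
// Hedged sketch of the two traversal styles: zipWithIndex allocates an
// (element, index) tuple per step plus an iterator, while a plain index
// counter does not — the reason the style guide prefers the while loop.
import scala.jdk.CollectionConverters._

object TraversalSketch {
  // Reverted style: concise, but allocates tuples on a hot path.
  def viaZip(array: java.util.List[String]): List[String] =
    array.asScala.zipWithIndex.map { case (e, i) => s"$i:$e" }.toList

  // Kept style: indexed while loop over the backing list.
  def viaWhile(array: java.util.List[String]): List[String] = {
    val buf = scala.collection.mutable.ListBuffer[String]()
    var i = 0
    while (i < array.size()) {
      buf += s"$i:${array.get(i)}"
      i += 1
    }
    buf.toList
  }

  def main(args: Array[String]): Unit = {
    val array = java.util.Arrays.asList("a", "b", "c")
    // Both styles produce the same result; only allocation behavior differs.
    assert(viaZip(array) == viaWhile(array))
    println(viaWhile(array).mkString(","))
  }
}
```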

The Deserializer assumed that avro arrays are always of type
GenericData$Array which is not the case.
Assuming they are from java.util.List is safer and fixes a
ClassCastException.
Contributor Author

@steven-aerts left a comment


I updated the code so it asserts on the checkResult value, and removed the zipWithIndex.

@SparkQA

SparkQA commented Jan 3, 2020

Test build #116086 has finished for PR 26907 at commit e52c1ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

Thanks, merging to master

@steven-aerts deleted the spark-30267 branch January 4, 2020 13:41
gengliangwang added a commit that referenced this pull request Jan 8, 2020
### What changes were proposed in this pull request?

This is a follow-up of #26907
It changes the for loop `for (element <- array.asScala)` to a while loop.

### Why are the changes needed?

As per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, we should use a while loop in performance-sensitive code.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #27127 from gengliangwang/SPARK-30267-FollowUp.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
@stackfun

I'm encountering this error on the 2.4 branch. Can we have this merged to that branch as well?
