[SPARK-30267][SQL] Avro arrays can be of any List #26907
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
67dbb80
to
fc4693d
Compare
ok to test
So this case is when it becomes
cc @gengliangwang FYI.
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala
Test build #115453 has finished for PR 26907 at commit
|
fc4693d
to
19a821a
Compare
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala
Test build #115458 has finished for PR 26907 at commit
|
@@ -127,6 +127,25 @@ class AvroCatalystDataConversionSuite extends SparkFunSuite
    }
  }

  test(s"array of nested schema with seed") {
I tried the test case with `GenericData.Array` and I can't reproduce the error you mentioned.
Do you know when the array type is not `GenericData.Array`? You can also create a test case without random schema.
@gengliangwang you are right. This is an extra case which covers an extra scenario, but it was not triggering this issue. (I thought it did but it did not.)
I added an extra test to trigger the issue. I am wondering what you think of it.
Hope it does not feel too artificial. It tries to show that a GenericData can contain different list implementation types.
We hit this issue where we map an `RDD[T]` to a `DataFrame` using the `AvroDeserializer`, where `T` is a class generated by the Avro code generator. So it extends `SpecificData`, which extends `GenericData`. So all the code of the `AvroDeserializer` works, except for this cast, which casts to a type that is too specific in the type hierarchy.
Hope it is more clear now.
Thanks for your support.
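To make the cast problem concrete, here is a minimal sketch that uses only JDK collections. `java.util.ArrayList` stands in for `GenericData.Array` and `java.util.LinkedList` for "some other list implementation"; both stand-ins, and the names `ListCastSketch`, `sizeViaNarrowCast`, and `sizeViaListCast`, are assumptions made for illustration and are not the actual `AvroDeserializer` code.

```scala
import java.util.{ArrayList => JArrayList, LinkedList => JLinkedList, List => JList}
import scala.util.Try

object ListCastSketch {
  // Narrow cast: succeeds only if the runtime class is ArrayList (or a subclass).
  // This mirrors casting an Avro array value to one concrete implementation.
  def sizeViaNarrowCast(value: Any): Int =
    value.asInstanceOf[JArrayList[Any]].size()

  // Wide cast: succeeds for every java.util.List implementation.
  def sizeViaListCast(value: Any): Int =
    value.asInstanceOf[JList[Any]].size()

  def main(args: Array[String]): Unit = {
    val linked = new JLinkedList[Any]()
    linked.add("a")
    linked.add("b")

    // The narrow cast fails with a ClassCastException for a LinkedList.
    println(Try(sizeViaNarrowCast(linked)).isFailure)
    // The cast to the interface works for any list implementation.
    println(sizeViaListCast(linked))
  }
}
```

The same reasoning should apply to any value whose runtime class implements `java.util.List` without being the one concrete class the cast names.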
19a821a
to
0811ca3
Compare
Test build #115511 has finished for PR 26907 at commit
|
0811ca3
to
190b5e8
Compare
Test build #115513 has finished for PR 26907 at commit
|
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
190b5e8
to
48680a9
Compare
Test build #115555 has finished for PR 26907 at commit
|
@gengliangwang for me this patch is ready. Let me know if I can still do something.
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala
LGTM except one comment
48680a9
to
9bc6de7
Compare
Test build #116029 has finished for PR 26907 at commit
|
val deserializer = new AvroDeserializer(avroSchema, dataType)

def checkDeserialization(data: GenericData.Record): Unit = {
  checkResult(
The method `checkResult` returns a `Boolean` and won't fail if the result doesn't match the expected answer.
Good catch. Fixed.
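The pitfall raised in this thread can be sketched as follows. `checkResultSketch` is a hypothetical stand-in for a helper that reports a mismatch through its `Boolean` return value; it is not Spark's actual `checkResult`, and the wrapper names are assumptions for illustration.

```scala
object CheckResultSketch {
  // Hypothetical helper: reports a mismatch via its return value, never by throwing.
  def checkResultSketch(actual: Any, expected: Any): Boolean =
    actual == expected

  // The Boolean is discarded, so a mismatch goes unnoticed and the test still passes.
  def silentCheck(actual: Any, expected: Any): Unit = {
    checkResultSketch(actual, expected)
    ()
  }

  // Wrapping the call in assert turns a `false` result into a test failure.
  def strictCheck(actual: Any, expected: Any): Unit =
    assert(checkResultSketch(actual, expected), s"expected $expected but got $actual")
}
```

With these definitions, `CheckResultSketch.silentCheck(1, 2)` returns normally, while `CheckResultSketch.strictCheck(1, 2)` throws an `AssertionError`.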
- var i = 0
- while (i < len) {
-   val element = array.get(i)
+ for ((element, i) <- array.asScala.zipWithIndex) {
Why did we change this? `zipWithIndex` is discouraged, especially in a performance-sensitive code path (https://github.com/databricks/scala-style-guide#perf-whileloops)
I kept the old counter.
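For reference, one way to traverse an arbitrary `java.util.List` with a plain counter and no `zipWithIndex` is sketched below. This is an illustration under assumptions, not the code merged in this PR: the object and method names are hypothetical, and the iterator is used so that the loop does not depend on `get(i)` being cheap for every list implementation.

```scala
import java.util.{List => JList}

object IndexedTraversalSketch {
  // Copies the list into an array using a while loop and a manual index,
  // avoiding the intermediate tuples that zipWithIndex would allocate.
  def copyWithCounter(list: JList[Any]): Array[Any] = {
    val len = list.size()
    val result = new Array[Any](len)
    val iterator = list.iterator()
    var i = 0
    while (i < len && iterator.hasNext) {
      result(i) = iterator.next()
      i += 1
    }
    result
  }
}
```

A call such as `IndexedTraversalSketch.copyWithCounter(java.util.Arrays.asList[Any]("a", "b"))` would work the same way for a `LinkedList` or any other `List` implementation.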
The Deserializer assumed that avro arrays are always of type GenericData$Array which is not the case. Assuming they are from java.util.List is safer and fixes a ClassCastException.
9bc6de7
to
e52c1ea
Compare
I updated the code so it validates the checkResult and removed the zipWithIndex.
Test build #116086 has finished for PR 26907 at commit
|
Thanks, merging to master
### What changes were proposed in this pull request?
This is a follow-up of #26907. It changes the for loop `for (element <- array.asScala)` to while loop

### Why are the changes needed?
As per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, we should use while loop for the performance-sensitive code.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #27127 from gengliangwang/SPARK-30267-FollowUp.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
I'm encountering this error on the 2.4 branch. Can we have this merged to that branch as well?
The Deserializer assumed that Avro arrays are always of type `GenericData$Array`, which is not the case. Assuming they are a `java.util.List` is safer and fixes a ClassCastException in some Avro code.
### What changes were proposed in this pull request?
`java.util.List` has all the necessary methods and is an interface that `GenericData$Array` implements.
### Why are the changes needed?
To prevent the following exception in more complex Avro objects:
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
The current tests already test this behavior. In essence this patch just changes a type cast to a more general type, so I expect no functional impact.