
[SPARK-22003][SQL] support array column in vectorized reader with UDF #19230

Closed
wants to merge 4 commits into from

Conversation

liufengdb

What changes were proposed in this pull request?

The UDF needs to deserialize the UnsafeRow. When the column type is Array, the get method of ColumnVector (which is used by the vectorized reader) is called, but that method is not implemented for array types.
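As a rough illustration of the missing piece, here is a sketch of the generic accessor with an ArrayType branch added. The shape follows the Spark 2.2-era ColumnVector API, but this is an approximation of the fix, not the merged diff:

// Sketch only: the generic accessor gains an ArrayType branch instead of
// falling through to "not implemented"; scalar branches abbreviated.
public Object get(int ordinal, DataType dataType) {
  if (dataType instanceof ArrayType) {
    return getArray(ordinal); // previously unimplemented, which broke UDFs over array columns
  } else if (dataType instanceof IntegerType) {
    return getInt(ordinal);
  } else if (dataType instanceof DoubleType) {
    return getDouble(ordinal);
  }
  // ... other supported types elided ...
  throw new UnsupportedOperationException("Datatype not supported " + dataType);
}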

How was this patch tested?

A unit test for ColumnVector was added; see the discussion below.


@viirya
Member

viirya commented Sep 14, 2017

Add a test for it?

} else if (dt instanceof StringType) {
  for (int i = 0; i < length; i++) {
    if (!data.isNullAt(offset + i)) {
      list[i] = getUTF8String(i).toString();
Member

This looks suspicious. Why did we get a String here before? It seems we should get a UTF8String.

Contributor

This looks like a bug.
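If so, the corrected branch would presumably read the internal representation directly, along these lines (a sketch reusing the names from the snippet above, not the merged diff):

} else if (dt instanceof StringType) {
  for (int i = 0; i < length; i++) {
    if (!data.isNullAt(offset + i)) {
      // Keep Spark's internal UTF8String rather than converting to
      // java.lang.String, and read through `data` at the element's offset.
      list[i] = data.getUTF8String(offset + i);
    }
  }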

 for (int i = 0; i < length; i++) {
   if (!data.isNullAt(offset + i)) {
-    list[i] = data.getDouble(offset + i);
+    list[i] = getAtMethod.call(i);
Contributor

Can we just call get(i + offset, dt)? The getAtMethod seems not very useful, as we still need to go through the if-else branches in get every time.

Author

Shouldn't it be get(i, dt)? I updated it anyway.

Contributor

Yea, it should be get(i, dt).
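With that settled, the per-type copy presumably collapses to the generic accessor, along these lines (sketch only, not the merged diff):

for (int i = 0; i < length; i++) {
  if (!data.isNullAt(offset + i)) {
    // Indices on the array wrapper are relative to its own offset, hence
    // get(i, dt) rather than get(i + offset, dt); get dispatches on dt.
    list[i] = get(i, dt);
  }
}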

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81759 has finished for PR 19230 at commit adbaeab.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Since ColumnVector is only used by the vectorized Parquet reader, and the reader currently doesn't support nested types, I can't think of an end-to-end regression test. However, we can still have a unit test for ColumnVector.

@viirya
Member

viirya commented Sep 15, 2017

Yea, we should add a unit test for it.

@liufengdb
Author

@viirya @cloud-fan unit test updated.

@@ -16,6 +16,7 @@
  */
 package org.apache.spark.sql.execution.vectorized;

+import org.apache.spark.api.java.function.Function;
Member

We don't use this now.

Member

@liufengdb I think we don't need this import now?

@SparkQA

SparkQA commented Sep 16, 2017

Test build #81835 has finished for PR 19230 at commit 19502f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Populate it with arrays [0], [1, 2], [], [3, 4, 5]
testVector.putArray(0, 0, 1)
testVector.putArray(1, 1, 2)
testVector.putArray(2, 2, 0)
Member

I think it doesn't affect the result, but it looks like the third array should be testVector.putArray(2, 3, 0)?

Contributor

+1
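For reference, putArray takes (rowId, offset, length), where offset and length index into the shared child data 0, 1, 2, 3, 4, 5. The populate sequence with the suggested fix would read (the fourth call is inferred from the comment above, not quoted from the PR):

testVector.putArray(0, 0, 1) // [0]
testVector.putArray(1, 1, 2) // [1, 2]
testVector.putArray(2, 3, 0) // [] (offset 3 keeps the offsets consistent)
testVector.putArray(3, 3, 3) // [3, 4, 5]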

@viirya
Member

viirya commented Sep 16, 2017

@liufengdb The PR description reads like an end-to-end failure. I'm curious: are you hitting this failure in an end-to-end case?

@@ -158,7 +158,7 @@ private static void appendValue(WritableColumnVector dst, DataType t, Object o)
       dst.getChildColumn(0).appendInt(c.months);
       dst.getChildColumn(1).appendLong(c.microseconds);
     } else if (t instanceof DateType) {
-      dst.appendInt(DateTimeUtils.fromJavaDate((Date)o));
+      dst.appendInt((int) DateTimeUtils.fromJavaDate((Date)o));
Contributor

Is the cast necessary?

@cloud-fan
Contributor

LGTM except for some minor comments.

@viirya
Member

viirya commented Sep 16, 2017

LGTM too.

@SparkQA

SparkQA commented Sep 17, 2017

Test build #81850 has finished for PR 19230 at commit 5cbf978.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -16,6 +16,7 @@
  */
 package org.apache.spark.sql.execution.vectorized;

+import org.apache.spark.api.java.function.Function;
Member

Please revert it.

Author

oops, reverted it.

    assert(array.get(1, schema).asInstanceOf[ColumnarBatch.Row].get(0, IntegerType) === 456)
    assert(array.get(1, schema).asInstanceOf[ColumnarBatch.Row].get(1, DoubleType) === 5.67)
  }
}
Member

Is it better to add a test for map, too?

Author

liufengdb commented Sep 18, 2017

Member

I see.
Does your change expect that this call eventually throws an exception for a Map element in an array?
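If that's the expectation, a minimal check could assert the exception directly, e.g. (a sketch assuming a ScalaTest-style suite and the array value from the snippet above; this is not the PR's actual test):

// Maps are not supported by the vectorized reader, so reading a map
// element out of the array should throw.
intercept[UnsupportedOperationException] {
  array.get(0, MapType(IntegerType, IntegerType))
}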

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81861 has finished for PR 19230 at commit 5ea4e89.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@cloud-fan
Contributor

retest this please

@kiszk
Member

kiszk commented Sep 18, 2017

Can we add test code for a null row in a column for each type?

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81872 has finished for PR 19230 at commit 5ea4e89.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81877 has finished for PR 19230 at commit 5ea4e89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

asfgit closed this in 3b049ab Sep 18, 2017