[SPARK-30633][SQL] Append L to seed when type is LongType #27354

patrickcording · 2020-01-24T10:51:26Z

What changes were proposed in this pull request?

Allow for using longs as seed for xxHash.

Why are the changes needed?

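Codegen fails when the seed passed to xxHash does not fit in a 32-bit int (for example, any seed of 2^31 or larger), because the generated Java code emits the seed as a literal without the L suffix.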

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests pass. Should more be added?

HyukjinKwon · 2020-01-24T11:48:45Z

ok to test

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala

SparkQA · 2020-01-24T15:52:04Z

Test build #117351 has finished for PR 27354 at commit 0a0432f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-24T16:22:02Z

Test build #117353 has finished for PR 27354 at commit 77bfb37.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-01-24T23:12:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala

@@ -282,6 +282,7 @@ abstract class HashExpression[E] extends Expression {
    }

    val hashResultType = CodeGenerator.javaType(dataType)
+    val typedSeed = if (dataType.sameType(LongType)) s"${seed}L" else s"$seed"


Thank you for making a PR, @patrickcording. BTW, this seems to change the hash result, doesn't it?

srowen · 2020-01-25

Would it change the result? It would just keep it from failing.

patrickcording · 2020-01-26

It doesn't change the result. For xxHash, the generated code just becomes long varName = 123L instead of long varName = 123. When the seed fits in a 32-bit int (at most 2^31 - 1), there's no difference between the two statements. When it is 2^31 or larger, compilation of the generated code would fail without the L, so it is now possible to use any 64-bit seed.
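To make the literal rule concrete, here is a minimal Java sketch of the same idea outside Spark. The class and method names (SeedLiteralSketch, typedSeed, fitsIntLiteral) are hypothetical and only mirror the one-line change in the diff; this is not the actual Spark codegen.

```java
public class SeedLiteralSketch {
    // Hypothetical helper mirroring the patched line: append "L" only when the
    // hash result type is long, so the emitted Java literal is a long literal.
    static String typedSeed(long seed, boolean resultIsLong) {
        return resultIsLong ? seed + "L" : Long.toString(seed);
    }

    // A decimal literal without the L suffix compiles only if it fits in an int.
    static boolean fitsIntLiteral(long seed) {
        return seed >= Integer.MIN_VALUE && seed <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long small = 123L;
        long large = 1L << 31; // 2147483648, one past Integer.MAX_VALUE

        // Without the suffix, the declaration emitted for `large` would not compile.
        System.out.println("long seed = " + typedSeed(small, true) + ";");
        System.out.println("long seed = " + typedSeed(large, true) + ";");
        System.out.println(fitsIntLiteral(small));
        System.out.println(fitsIntLiteral(large));
    }
}
```

The range check is only there to show which seeds were affected before the fix; the patch itself does not need it, since appending L is harmless for small seeds.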

srowen · 2020-01-26

Yeah, I think the concern was something like: a 4-byte int seed isn't the same as an 8-byte long seed when it comes to hashing, even if they have the same integer value. But here, this should only affect HashExpression[Long]s like XxHash64, which would already have promoted any int seed value to long in generated code.
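The promotion argument can be checked in plain Java. This is a small sketch with hypothetical names, assuming nothing about Spark internals: an int-valued literal assigned to a long is widened to the same value as the L-suffixed literal, while the 4-byte and 8-byte encodings of that value do differ, which is the concern described above.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SeedWideningSketch {
    // Encode the value as a 4-byte int (big-endian, the ByteBuffer default).
    static byte[] intBytes(int v) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(v).array();
    }

    // Encode the value as an 8-byte long (big-endian, the ByteBuffer default).
    static byte[] longBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    public static void main(String[] args) {
        long withoutSuffix = 123;  // int literal, widened to long on assignment
        long withSuffix = 123L;    // long literal
        System.out.println(withoutSuffix == withSuffix); // true

        // Same numeric value, different widths: 4 bytes versus 8 bytes.
        System.out.println(Arrays.toString(intBytes(123)));
        System.out.println(Arrays.toString(longBytes(123L)));
    }
}
```

Because the variable holding the seed was already declared as long, both declarations hold the same 64-bit value, so the suffix changes only whether the code compiles.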

dongjoon-hyun · 2020-01-25T00:22:50Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala

@@ -684,6 +684,21 @@ class HashExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
    assert(murmur3HashPlan(wideRow).getInt(0) == murmursHashEval)
  }

+  test("SPARK-30633: Use Long seeds for xxHash") {


This is a good test candidate. Can we extend this test coverage to the other hash expressions, Murmur3Hash and HiveHash?

patrickcording · 2020-01-26

@dongjoon-hyun, I'm not sure I understand this request. Do we want a similar test for Murmur3 and Hive, but with 32-bit seeds?

patrickcording · 2020-01-26T17:40:24Z

@srowen, @dongjoon-hyun, I extended the first test to also run using integer seeds and when mixing integer and long seeds. I also extended testHash to explicitly use a long seed for hashing all sorts of inputs.

SparkQA · 2020-01-26T21:36:38Z

Test build #117429 has finished for PR 27354 at commit abe0be5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you, @patrickcording, @HyukjinKwon, @srowen.
Merged to master/2.4.

### What changes were proposed in this pull request?

Allow for using longs as seed for xxHash.

### Why are the changes needed?

Codegen fails when passing a seed to xxHash that is > 2^31.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests pass. Should more be added?

Closes #27354 from patrickcording/fix_xxhash_seed_bug.

Authored-by: Patrick Cording <patrick.cording@datarobot.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c5c580b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

dongjoon-hyun · 2020-01-27T18:36:56Z

Thank you for your first contribution, @patrickcording .
What is your Apache JIRA id? (It seems that Patrick Cording has two accounts.)

patrickcording · 2020-01-27T19:11:54Z

> Thank you for your first contribution, @patrickcording.
> What is your Apache JIRA id? (It seems that Patrick Cording has two accounts.)

They are both mine. I forgot that I had an account and signed up again. The most recent one that I used to create the ticket is Cording.

dongjoon-hyun · 2020-01-27T19:22:09Z

Got it. Now, you are added to the Apache Spark contributor group as Cording and SPARK-30633 is assigned to you.

0a0432f · Append L to seed when type is LongType
HyukjinKwon reviewed Jan 24, 2020 · sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
77bfb37 · Add test
dongjoon-hyun added the SQL label Jan 24, 2020
dongjoon-hyun reviewed Jan 24, 2020 · ...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
dongjoon-hyun reviewed Jan 25, 2020
abe0be5 · Extend tests
srowen approved these changes Jan 27, 2020
dongjoon-hyun approved these changes Jan 27, 2020
dongjoon-hyun closed this in c5c580b Jan 27, 2020
