-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-30633][SQL] Append L to seed when type is LongType #27354
Conversation
ok to test |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
Show resolved
Hide resolved
Test build #117351 has finished for PR 27354 at commit
|
Test build #117353 has finished for PR 27354 at commit
|
@@ -282,6 +282,7 @@ abstract class HashExpression[E] extends Expression { | |||
} | |||
|
|||
val hashResultType = CodeGenerator.javaType(dataType) | |||
val typedSeed = if (dataType.sameType(LongType)) s"${seed}L" else s"$seed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making a PR, @patrickcording . BTW, this seems to change the hash result, doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it change the result? it would just let it not fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't change the result. For xxHash, the generated code just becomes long varName = 123L
instead of long varName = 123
. When the seed is <= 2^31, theres no difference between the two statements. When it is > 2^31, compilation of the generated code would fail without the L
, so it is now possible to use any 64 bit seed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think the concern was something like: a 4-byte int seed isn't the same as an 8-byte long seed when it comes to hashing, even if they have the same integer value. But here, this should only affect HashExpression[Long]
s like XxHash64
, which already would have promoted any int seed value to long in generated code.
...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala
Show resolved
Hide resolved
@@ -684,6 +684,21 @@ class HashExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { | |||
assert(murmur3HashPlan(wideRow).getInt(0) == murmursHashEval) | |||
} | |||
|
|||
test("SPARK-30633: Use Long seeds for xxHash") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good test candidate. Can we extend this test coverage for the other Murmur3Hash
and HiveHash
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun, I'm not sure I understand this request. Do we want to have a similar test for Murmur3 and Hive, but where the seeds are 32 bit?
@srowen, @dongjoon-hyun, I extended the first test to also run using integer seeds and when mixing integer and long seeds. I also extended |
Test build #117429 has finished for PR 27354 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @patrickcording , @HyukjinKwon , @srowen .
Merged to master/2.4.
### What changes were proposed in this pull request? Allow for using longs as seed for xxHash. ### Why are the changes needed? Codegen fails when passing a seed to xxHash that is > 2^31. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests pass. Should more be added? Closes #27354 from patrickcording/fix_xxhash_seed_bug. Authored-by: Patrick Cording <patrick.cording@datarobot.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c5c580b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Thank you for your first contribution, @patrickcording . |
They are both mine. I forgot that I had an account and signed up again. The most recent one that I used to create the ticket is |
Got it. Now, you are added to the Apache Spark contributor group as |
What changes were proposed in this pull request?
Allow for using longs as seed for xxHash.
Why are the changes needed?
Codegen fails when passing a seed to xxHash that is > 2^31.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests pass. Should more be added?