[SPARK-30633][SQL] Append L to seed when type is LongType #27354

Closed

patrickcording
Contributor

What changes were proposed in this pull request?

Allow for using longs as seed for xxHash.

Why are the changes needed?

Codegen fails when passing a seed to xxHash that does not fit in a 32-bit int (for example, 2^31 or larger).

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests pass. Should more be added?

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Jan 24, 2020

Test build #117351 has finished for PR 27354 at commit 0a0432f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2020

Test build #117353 has finished for PR 27354 at commit 77bfb37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -282,6 +282,7 @@ abstract class HashExpression[E] extends Expression {
}

val hashResultType = CodeGenerator.javaType(dataType)
val typedSeed = if (dataType.sameType(LongType)) s"${seed}L" else s"$seed"
Member

Thank you for making a PR, @patrickcording. BTW, this seems to change the hash result, doesn't it?

Member

Would it change the result? It would just keep it from failing.

Contributor Author

It doesn't change the result. For xxHash, the generated code just becomes `long varName = 123L` instead of `long varName = 123`. When the seed fits in a 32-bit int (that is, it is below 2^31), there's no difference between the two statements. When it does not (2^31 or larger, for example), compilation of the generated code would fail without the L, so it is now possible to use any 64-bit seed.
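
A minimal sketch of the literal rendering described in this comment. The object `SeedLiteralSketch` and its helpers `renderSeed` and `fitsInIntLiteral` are hypothetical names for illustration, not part of Spark; `renderSeed` only mirrors the `typedSeed` line shown in the diff.

```scala
object SeedLiteralSketch {
  // Hypothetical helper mirroring the typedSeed line from the diff:
  // append an L suffix only when the hash result type is long.
  def renderSeed(seed: Long, resultIsLong: Boolean): String =
    if (resultIsLong) s"${seed}L" else s"$seed"

  // A bare decimal literal in Java source is an int literal, so the
  // generated statement compiles only when the value fits in 32 bits.
  def fitsInIntLiteral(seed: Long): Boolean =
    seed >= Int.MinValue && seed <= Int.MaxValue

  def main(args: Array[String]): Unit = {
    val big = 1L << 31 // one past Int.MaxValue
    // The statement the code generator would emit with the suffix applied.
    println(s"long varName = ${renderSeed(big, resultIsLong = true)};")
    // Whether the same value could have been written without the suffix.
    println(s"fits as bare literal: ${fitsInIntLiteral(big)}")
  }
}
```

For seeds that do fit in an int, the suffixed and unsuffixed literals denote the same long value after assignment, which is why the hash result is unaffected.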

Member

Yeah I think the concern was something like: a 4-byte int seed isn't the same as an 8-byte long seed when it comes to hashing, even if they have the same integer value. But here, this should only affect HashExpression[Long]s like XxHash64, which already would have promoted any int seed value to long in generated code.
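
To illustrate the distinction drawn in this comment, here is a small sketch that uses only `java.nio.ByteBuffer` and no Spark classes; the object and method names are illustrative assumptions. The same integer value has a different byte representation as an int than as a long, but widening an int to a long preserves the value, so a long-typed seed is the same seed with or without the suffix.

```scala
import java.nio.ByteBuffer

object SeedWidthSketch {
  // Big-endian 4-byte encoding of an int value.
  def intBytes(v: Int): Array[Byte] =
    ByteBuffer.allocate(4).putInt(v).array()

  // Big-endian 8-byte encoding of a long value.
  def longBytes(v: Long): Array[Byte] =
    ByteBuffer.allocate(8).putLong(v).array()

  def main(args: Array[String]): Unit = {
    val asInt: Int = 42
    val widened: Long = asInt // implicit widening, as in `long x = 42;`
    println(intBytes(asInt).length)    // width of the int encoding
    println(longBytes(widened).length) // width of the long encoding
    println(widened == 42L)            // widening keeps the value
  }
}
```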

@@ -684,6 +684,21 @@ class HashExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
assert(murmur3HashPlan(wideRow).getInt(0) == murmursHashEval)
}

test("SPARK-30633: Use Long seeds for xxHash") {
Member

This is a good test candidate. Can we extend this test coverage to the other hash expressions, Murmur3Hash and HiveHash?

Contributor Author

@dongjoon-hyun, I'm not sure I understand this request. Do we want to have a similar test for Murmur3 and Hive, but where the seeds are 32-bit?

@patrickcording
Contributor Author

@srowen, @dongjoon-hyun, I extended the first test to also run using integer seeds and when mixing integer and long seeds. I also extended testHash to explicitly use a long seed for hashing all sorts of inputs.

@SparkQA

SparkQA commented Jan 26, 2020

Test build #117429 has finished for PR 27354 at commit abe0be5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

dongjoon-hyun left a comment

+1, LGTM. Thank you, @patrickcording, @HyukjinKwon, @srowen.
Merged to master/2.4.

dongjoon-hyun pushed a commit that referenced this pull request Jan 27, 2020
### What changes were proposed in this pull request?

Allow for using longs as seed for xxHash.

### Why are the changes needed?

Codegen fails when passing a seed to xxHash that does not fit in a 32-bit int (for example, 2^31 or larger).

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests pass. Should more be added?

Closes #27354 from patrickcording/fix_xxhash_seed_bug.

Authored-by: Patrick Cording <patrick.cording@datarobot.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c5c580b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member

Thank you for your first contribution, @patrickcording.
What is your Apache JIRA id? (It seems that Patrick Cording has two accounts.)

@patrickcording
Contributor Author

> Thank you for your first contribution, @patrickcording.
> What is your Apache JIRA id? (It seems that Patrick Cording has two accounts.)

They are both mine. I forgot that I had an account and signed up again. The most recent one, which I used to create the ticket, is Cording.

@dongjoon-hyun
Member

Got it. You have now been added to the Apache Spark contributor group as Cording, and SPARK-30633 is assigned to you.
