[SPARK-3948][Shuffle]Fix stream corruption bug in sort-based shuffle #2824
// Position will not be increased to the expected length after calling transferTo in
// kernel version 2.6.32, this issue can be seen in
// scalastyle:off
// https://bugs.openjdk.java.net/browse/JDK-7052359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel)
Hmm... I guess this line will trigger a scalastyle checker error.
I'm not sure. I found that some code in KafkaInputDStream also uses this scalastyle:off to turn off the scalastyle check.
I see, I didn't notice that line. I think that should be fine.
You can remove the part of the URL after the '?' here.
Test FAILed.
QA tests have started for PR 2824 at commit
QA tests have finished for PR 2824 at commit
Test PASSed.
The patch works fine for me, on my 2.6.32 cluster. Thanks!
@adrian-wang, Just so I understand - were you seeing the issue before applying this patch, and then the patch made it go away? @jerryshao Could we also have a branch-1.1 version of this that simply logs an error instead of throwing an exception (via
Maybe this is being overly-conservative, but could we add an undocumented configuration option that allows users to bypass the
@pwendell yes, I was hitting this issue when running certain queries with sort-based shuffle, just like the discussion in SPARK-3630. With this patch the issue is gone. Thanks! My cluster is a 4-node Red Hat 6.2 cluster with kernel 2.6.32.
I think if the bug occurs while a job is running, then even if we do not throw an exception here, we will still hit other exceptions on the reduce side, so I use
Hi @JoshRosen, I just added a configuration option that can bypass the NIO way of copying streams. Would you mind taking a look at it?
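For readers following along, here is a minimal sketch of what such a bypass could look like. It is not the code from this PR: the object name `StreamCopy`, the method `copy`, and the `useNio` parameter are all hypothetical stand-ins for the configuration option discussed above. When `useNio` is false, or either stream is not a file stream, it falls back to a plain buffered copy and never calls transferTo.

```scala
import java.io.{FileInputStream, FileOutputStream, InputStream, OutputStream}

object StreamCopy {
  /**
   * Copies everything from `in` to `out` and returns the number of bytes copied.
   * `useNio` is a hypothetical switch standing in for the configuration option.
   */
  def copy(in: InputStream, out: OutputStream, useNio: Boolean): Long = (in, out) match {
    case (fin: FileInputStream, fout: FileOutputStream) if useNio =>
      // NIO path: assumes the input file is copied from offset 0.
      val inChannel = fin.getChannel
      val outChannel = fout.getChannel
      val size = inChannel.size()
      var count = 0L
      // transferTo may move fewer bytes than requested, so loop until done.
      while (count < size) {
        count += inChannel.transferTo(count, size - count, outChannel)
      }
      count
    case _ =>
      // Fallback path: plain buffered copy that never touches transferTo.
      val buf = new Array[Byte](8192)
      var count = 0L
      var n = in.read(buf)
      while (n != -1) {
        out.write(buf, 0, n)
        count += n
        n = in.read(buf)
      }
      count
  }
}
```

With a switch like this, a user on an affected kernel could select the buffered path and avoid transferTo entirely, at the cost of the extra user-space copy.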
QA tests have started for PR 2824 at commit
QA tests have finished for PR 2824 at commit
Test PASSed.
val finalPos = outChannel.position()
assert(finalPos == initialPos + size,
s"""
|Current position $finalPos do not equal to expected position ${initialPos + count}
This assert checks whether finalPos is initialPos + size, but this error message uses initialPos + count; could this lead to confusion?
I suppose that count >= size here, so it's probably fine, but it might be confusing if count was ever greater than size.
I think size would normally be equal to count. I will change it to size for consistency.
QA tests have started for PR 2824 at commit
Hi @JoshRosen , I just set Currently, only If future uses of The reason I didn't take So what is your opinion? Thanks a lot. |
Hi @jerryshao, Changing the default is exactly what I had in mind. This looks good to me! (Going to bed now; I'll merge this tomorrow and backport to
Thanks a lot :).
QA tests have finished for PR 2824 at commit
Test PASSed.
I've merged this into
Thank YOU (and @mridulm) for helping to diagnose this really subtle bug!
Kernel 2.6.32 bug will lead to unexpected behavior of transferTo in copyStream, and this will corrupt the shuffle output file in sort-based shuffle, which will somehow introduce PARSING_ERROR(2), deserialization error or offset out of range. Here fix this by adding append flag, also add some position checking code. Details can be seen in [SPARK-3948](https://issues.apache.org/jira/browse/SPARK-3948).

Author: jerryshao <saisai.shao@intel.com>

Closes #2824 from jerryshao/SPARK-3948 and squashes the following commits:

be0533a [jerryshao] Address the comments
a82b184 [jerryshao] add configuration to control the NIO way of copying stream
e17ada2 [jerryshao] Fix kernel 2.6.32 bug led unexpected behavior of transferTo

(cherry picked from commit c7aeecd)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
core/src/main/scala/org/apache/spark/util/Utils.scala
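The two parts of the fix named in the commit message (the append flag and the position check) can be illustrated with a small sketch. It is written under assumptions and is not the merged patch: the object `AppendCopy` and the method `appendFile` are hypothetical names, and the sketch throws an IOException where the reviewed diff uses an assert. It opens the destination in append mode, copies with transferTo in a loop, and then verifies that the output channel advanced by exactly the input size.

```scala
import java.io.{File, FileInputStream, FileOutputStream, IOException}

object AppendCopy {
  /** Appends the whole of `src` to `dst` and returns the number of bytes appended. */
  def appendFile(src: File, dst: File): Long = {
    val in = new FileInputStream(src)
    // The second argument opens the output in append mode.
    val out = new FileOutputStream(dst, true)
    try {
      val inChannel = in.getChannel
      val outChannel = out.getChannel
      val initialPos = outChannel.position()
      val size = inChannel.size()
      var count = 0L
      // transferTo may move fewer bytes than requested, so loop until done.
      while (count < size) {
        count += inChannel.transferTo(count, size - count, outChannel)
      }
      // Position check: fail loudly if the channel did not advance as expected.
      val finalPos = outChannel.position()
      if (finalPos != initialPos + size) {
        throw new IOException(
          s"Current position $finalPos does not equal expected position ${initialPos + size}")
      }
      size
    } finally {
      in.close()
      out.close()
    }
  }
}
```

Calling `appendFile` once per source file against the same destination concatenates the sources in order. Failing at write time is the point of the check: it surfaces a bad position immediately rather than leaving a corrupt file to be discovered later on the reduce side.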