
[#1755] fix(spark): Avoid task failure of inconsistent record number #1756

Merged 4 commits on May 30, 2024

Conversation

@zuston (Member) commented May 30, 2024

### What changes were proposed in this pull request?

1. When the spill ratio is `1.0`, the calculation of the target spill size is skipped to avoid a potential race condition, since `usedBytes` and `inSendBytes` are not thread safe. This guarantees that all data is flushed to the shuffle server at the end of the task.
2. Add a check on the `bufferManager`'s remaining buffer.

### Why are the changes needed?

Due to #1670, the partial data held by the `bufferManager` is not flushed to the shuffle servers in some corner cases. Thanks to #1558, this makes the task fail fast rather than silently lose data.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.
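The first change can be sketched as a minimal standalone illustration. The class and method names (`SpillSizeSketch`, `targetSpillSize`) are hypothetical, not Uniffle's actual API; only the described behavior (skip the size calculation when the ratio is `1.0`) is taken from the PR:

```java
// Hedged sketch of the first change; names are illustrative, not Uniffle's API.
public class SpillSizeSketch {
    // usedBytes and inSendBytes are read without synchronization in the real
    // buffer manager, so arithmetic on them is only a best-effort snapshot.
    static long targetSpillSize(double bufferSpillRatio, long usedBytes, long inSendBytes) {
        if (Double.compare(bufferSpillRatio, 1.0) < 0) {
            // Partial spill: derive a target from the (racy) snapshot.
            return (long) (bufferSpillRatio * (usedBytes - inSendBytes));
        }
        // Ratio of 1.0: skip the calculation entirely and spill everything,
        // so no data can be left behind at the end of the task.
        return Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(targetSpillSize(0.5, 1000, 200)); // 400
        System.out.println(targetSpillSize(1.0, 1000, 200)); // 9223372036854775807
    }
}
```

Returning `Long.MAX_VALUE` is one way to express "spill all buffered data" without touching the non-thread-safe counters at all.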

zuston requested a review from jerqi on May 30, 2024 at 02:27
@zuston (Member, Author) commented May 30, 2024

cc @rickyma @leslizhang

github-actions bot commented May 30, 2024

Test Results

- 2 433 files ±0, 2 433 suites ±0, 5h 0m 5s ⏱️ −46s
- 934 tests +1: 933 ✅ +1, 1 💤 ±0, 0 ❌ ±0
- 10 828 runs +9: 10 814 ✅ +9, 14 💤 ±0, 0 ❌ ±0

Results for commit 42e6946. ± Comparison against base commit a3a49f0.


@rickyma (Contributor) left a comment

Nice catch. Although we've never encountered this issue in prod.

LGTM. Left a comment.

```java
partitionList.sort(
    Comparator.comparingInt(o -> buffers.get(o) == null ? 0 : buffers.get(o).getMemoryUsed())
        .reversed());
if (bufferSpillRatio != 1.0) {
```

@rickyma commented on the last line:

We don't need this line; we already have `if (Double.compare(bufferSpillRatio, 1.0) < 0) {`.
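The reviewer's point can be illustrated with a small standalone sketch (hypothetical class name, not from the codebase): once the outer `Double.compare(bufferSpillRatio, 1.0) < 0` guard holds, the inner `!= 1.0` test can never be false, so it is redundant.

```java
// Sketch of the reviewer's point: inside a branch guarded by
// Double.compare(ratio, 1.0) < 0, the extra check ratio != 1.0 is
// always true, hence redundant.
public class GuardRedundancy {
    static boolean innerCheckValue(double ratio) {
        if (Double.compare(ratio, 1.0) < 0) { // outer guard already excludes 1.0
            return ratio != 1.0;              // redundant: always true here
        }
        return false; // ratio >= 1.0 never reaches the inner check
    }

    public static void main(String[] args) {
        System.out.println(innerCheckValue(0.5)); // true
        System.out.println(innerCheckValue(1.0)); // false
    }
}
```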

@zuston (Member, Author) commented May 30, 2024

> Nice catch. Although we've never encountered this issue in prod.
>
> LGTM. Left a comment.

More validation mechanisms should be added.
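The "buffer remaining check" from the description could look like the following hypothetical sketch (the class and method names are illustrative, not from the PR): if the buffer manager still reports unflushed bytes when the task finishes, fail fast instead of silently losing records.

```java
// Hypothetical sketch of a buffer-remaining validation at task end.
public class BufferRemainingCheck {
    static void checkBufferDrained(long usedBytes, long inSendBytes) {
        long remaining = usedBytes + inSendBytes;
        if (remaining > 0) {
            // Fail the task fast rather than let partial data be lost silently.
            throw new IllegalStateException(
                "Buffer is not fully flushed at task end, remaining bytes: " + remaining);
        }
    }

    public static void main(String[] args) {
        checkBufferDrained(0, 0); // fully drained: passes
        try {
            checkBufferDrained(128, 0); // leftover data: throws
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```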

@jerqi (Contributor) left a comment

LGTM, merged to master & branch 0.9.

zuston merged commit d182a03 into apache:master on May 30, 2024
41 checks passed
@zuston (Member, Author) commented May 30, 2024

> LGTM, merged to master & branch 0.9.

The cherry-pick to branch 0.9 failed.

@zhengchenyu (Collaborator) commented

@zuston Can we add a new PR to branch 0.9? It is a critical fix.

zuston added a commit to zuston/incubator-uniffle that referenced this pull request on Dec 9, 2024: …umber (apache#1756)
