Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the race between memory release and task reclaim #8086

Closed
wants to merge 1 commit into from

Conversation

xiaoxmeng
Copy link
Contributor

@xiaoxmeng xiaoxmeng commented Dec 18, 2023

There is race between memory release and task reclaim that cause a
unnecessary local OOM in Meta internal shadowing test:
T1. memory arbitration triggers from a query
T2. memory arbitrator find the memory grow request exceeds its the query
memory capacity limit so starts to reclaim from this query
T3. one or more operators from this query releases a large chunk of memory
which has sufficient space to satisfy the memory arbitration request
T4. the memory arbitration wait for the task to complete and then reclaim
from each of its operator
T5. however all the operators have no memory to reclaim (T3 has freed all
the reclaimable memory)
T6. memory arbitrator found nothing has been reclaimed from this query and
fail the memory arbitration request.

However, there is free memory after T3. This PR fixes this issue by shrink the
query memory pool after the task reclaim pauses the task. Verify with table
writer which is the case found in Meta shadowing test.
Also improve the memory arbitration logging a bit.

Copy link

netlify bot commented Dec 18, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 65e9595
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6580b7e36da8fd0007620958

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 18, 2023
@xiaoxmeng xiaoxmeng marked this pull request as ready for review December 18, 2023 08:02
@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@@ -2590,7 +2590,15 @@ uint64_t Task::MemoryReclaimer::reclaim(
if (task->isCancelled()) {
return 0;
}
return memory::MemoryReclaimer::reclaim(pool, targetBytes, maxWaitMs, stats);
// Before reclaiming from its operators, first to check if there is any free
// capacity in the root after stopping this task.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed because memory could have been freed while we waited for the task to pause?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and we didn't check the free memory after that.

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

There is race between memory release and task reclaim that cause a
unnecessary local OOM in Meta internal shadowing test:
T1. memory arbitration triggers from a query
T2. memory arbitrator find the memory grow request exceeds its the query
memory capacity limit so starts to reclaim from this query
T3. one or more operators from this query releases a large chunk of memory
which has sufficient space to satisfy the memory arbitration request
T4. the memory arbitration wait for the task to complete and then reclaim
from each of its operator
T5. however all the operators have no memory to reclaim (T3 has freed all
the reclaimable memory)
T6. memory arbitrator found nothing has been reclaimed from this query and
fail the memory arbitration request.

However, there is free memory after T3. This PR fixes this issue by shrink the
query memory pool after the task reclaim pauses the task. Verify with table
writer which is the case found in Meta shadowing test.
Also improve the memory arbitration logging a bit.
@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@xiaoxmeng merged this pull request in fa6de06.

@xiaoxmeng xiaoxmeng deleted the task branch December 19, 2023 03:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants