-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the race between memory release and task reclaim #8086
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
@@ -2590,7 +2590,15 @@ uint64_t Task::MemoryReclaimer::reclaim( | |||
if (task->isCancelled()) { | |||
return 0; | |||
} | |||
return memory::MemoryReclaimer::reclaim(pool, targetBytes, maxWaitMs, stats); | |||
// Before reclaiming from its operators, first to check if there is any free | |||
// capacity in the root after stopping this task. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed because memory could have been freed while we waited for the task to pause?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and we didn't check the free memory after that.
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There is race between memory release and task reclaim that cause a unnecessary local OOM in Meta internal shadowing test: T1. memory arbitration triggers from a query T2. memory arbitrator find the memory grow request exceeds its the query memory capacity limit so starts to reclaim from this query T3. one or more operators from this query releases a large chunk of memory which has sufficient space to satisfy the memory arbitration request T4. the memory arbitration wait for the task to complete and then reclaim from each of its operator T5. however all the operators have no memory to reclaim (T3 has freed all the reclaimable memory) T6. memory arbitrator found nothing has been reclaimed from this query and fail the memory arbitration request. However, there is free memory after T3. This PR fixes this issue by shrink the query memory pool after the task reclaim pauses the task. Verify with table writer which is the case found in Meta shadowing test. Also improve the memory arbitration logging a bit.
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@xiaoxmeng merged this pull request in fa6de06. |
There is race between memory release and task reclaim that cause a
unnecessary local OOM in Meta internal shadowing test:
T1. memory arbitration triggers from a query
T2. memory arbitrator find the memory grow request exceeds its the query
memory capacity limit so starts to reclaim from this query
T3. one or more operators from this query releases a large chunk of memory
which has sufficient space to satisfy the memory arbitration request
T4. the memory arbitration wait for the task to complete and then reclaim
from each of its operator
T5. however all the operators have no memory to reclaim (T3 has freed all
the reclaimable memory)
T6. memory arbitrator found nothing has been reclaimed from this query and
fail the memory arbitration request.
However, there is free memory after T3. This PR fixes this issue by shrink the
query memory pool after the task reclaim pauses the task. Verify with table
writer which is the case found in Meta shadowing test.
Also improve the memory arbitration logging a bit.