Reduce the likelihood of OOM aborts #285

JLockerman · 2021-10-21T18:22:51Z

By default rust will abort() the process when the allocator returns NULL. Since many systems can't reliably determine when an allocation will cause the process to run out of memory, and just rely on the OOM killer cleaning up afterwards, this is acceptable for many workloads. However, abort()-ing a Postgres will restart the database, and since we often run Postgres on systems which can reliably return NULL on out-of-memory, we would like to take advantage of this to cleanly shut down a single transaction when we fail to allocate. Long-term the solution for this likely involves the oom=panic flag (issue: rust-lang/rust#43596), but at the time of writing the flag is not yet stable.

This PR implements a partial solution for turning out-of-memory into transaction-rollback instead of process-abort using a custom global allocator. It is a thin shim over the System allocator that panic!()s when the System allocator returns NULL. In the event that still have enough remaining memory to serve the panic, this will unwind the stack all the way to transaction-rollback. In the event we don't even have enough memory to handle unwinding this will merely abort the process with a panic-in-panic instead of a memory-allocation-failure. Under the assumption that we're more likely to fail due to a few large allocations rather than a very large number of small allocations, it seems likely that we will have some memory remaining for unwinding, and that this will reduce the likelihood of aborts.

By default rust will `abort()` the process when the allocator returns NULL. Since many systems can't reliably determine when an allocation will cause the process to run out of memory, and just rely on the OOM killer cleaning up afterwards, this is acceptable for many workloads. However, `abort()`-ing a Postgres will restart the database, and since we often run Postgres on systems which _can_ reliably return NULL on out-of-memory, we would like to take advantage of this to cleanly shut down a single transaction when we fail to allocate. Long-term the solution for this likely involves the `oom=panic` flag[1], but at the time of writing the flag is not yet stable. This allocator implements a partial solution for turning out-of-memory into transaction-rollback instead of process-abort. It is a thin shim over the System allocator that `panic!()`s when the System allocator returns `NULL`. In the event that still have enough remaining memory to serve the panic, this will unwind the stack all the way to transaction-rollback. In the event we don't even have enough memory to handle unwinding this will merely abort the process with a panic-in-panic instead of a memory-allocation-failure. Under the assumption that we're more likely to fail due to a few large allocations rather than a very large number of small allocations, it seems likely that we will have some memory remaining for unwinding, and that this will reduce the likelihood of aborts. [1] `oom=panic` issue: rust-lang/rust#43596

cevian

LGTM

cevian · 2021-10-21T20:25:22Z

Can you add a test for this?

JLockerman · 2021-10-22T16:23:50Z

Can you add a test for this?

Hmm, I think we'd need to instrument malloc for that. Once @sgichohi oom-guard is ready we can probably build an image with that and use some of their instrumentation to test it...

JLockerman · 2021-10-25T18:25:06Z

bors r+

bors · 2021-10-25T18:36:18Z

Build succeeded:

JLockerman requested review from cevian and WireBaron October 21, 2021 18:22

cevian approved these changes Oct 21, 2021

View reviewed changes

bors bot merged commit 0530760 into main Oct 25, 2021

bors bot deleted the jl/oom-reduction branch October 25, 2021 18:36

jgallagher mentioned this pull request Nov 1, 2021

Likely undefined behavior in PanickingAllocator #294

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce the likelihood of OOM aborts #285

Reduce the likelihood of OOM aborts #285

JLockerman commented Oct 21, 2021 •

edited

Loading

cevian left a comment

cevian commented Oct 21, 2021

JLockerman commented Oct 22, 2021

JLockerman commented Oct 25, 2021

bors bot commented Oct 25, 2021

Reduce the likelihood of OOM aborts #285

Reduce the likelihood of OOM aborts #285

Conversation

JLockerman commented Oct 21, 2021 • edited Loading

cevian left a comment

Choose a reason for hiding this comment

cevian commented Oct 21, 2021

JLockerman commented Oct 22, 2021

JLockerman commented Oct 25, 2021

bors bot commented Oct 25, 2021

JLockerman commented Oct 21, 2021 •

edited

Loading