Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce the likelihood of OOM aborts #285

Merged
merged 1 commit into from
Oct 25, 2021
Merged

Reduce the likelihood of OOM aborts #285

merged 1 commit into from
Oct 25, 2021

Conversation

JLockerman
Copy link
Contributor

@JLockerman JLockerman commented Oct 21, 2021

By default rust will abort() the process when the allocator returns NULL. Since many systems can't reliably determine when an allocation will cause the process to run out of memory, and just rely on the OOM killer cleaning up afterwards, this is acceptable for many workloads. However, abort()-ing a Postgres will restart the database, and since we often run Postgres on systems which can reliably return NULL on out-of-memory, we would like to take advantage of this to cleanly shut down a single transaction when we fail to allocate. Long-term the solution for this likely involves the oom=panic flag (issue: rust-lang/rust#43596), but at the time of writing the flag is not yet stable.

This PR implements a partial solution for turning out-of-memory into transaction-rollback instead of process-abort using a custom global allocator. It is a thin shim over the System allocator that panic!()s when the System allocator returns NULL. In the event that still have enough remaining memory to serve the panic, this will unwind the stack all the way to transaction-rollback. In the event we don't even have enough memory to handle unwinding this will merely abort the process with a panic-in-panic instead of a memory-allocation-failure. Under the assumption that we're more likely to fail due to a few large allocations rather than a very large number of small allocations, it seems likely that we will have some memory remaining for unwinding, and that this will reduce the likelihood of aborts.

By default rust will `abort()` the process when the allocator returns
NULL. Since many systems can't reliably determine when an allocation
will cause the process to run out of memory, and just rely on the OOM
killer cleaning up afterwards, this is acceptable for many workloads.
However, `abort()`-ing a Postgres will restart the database, and since
we often run Postgres on systems which _can_ reliably return NULL on
out-of-memory, we would like to take advantage of this to cleanly shut
down a single transaction when we fail to allocate. Long-term the
solution for this likely involves the `oom=panic` flag[1], but at the
time of writing the flag is not yet stable.

This allocator implements a partial solution for turning out-of-memory
into transaction-rollback instead of process-abort. It is a thin shim
over the System allocator that `panic!()`s when the System allocator
returns `NULL`. In the event that still have enough remaining memory to
serve the panic, this will unwind the stack all the way to
transaction-rollback. In the event we don't even have enough memory to
handle unwinding this will merely abort the process with a
panic-in-panic instead of a memory-allocation-failure. Under the
assumption that we're more likely to fail due to a few large allocations
rather than a very large number of small allocations, it seems likely
that we will have some memory remaining for unwinding, and that this
will reduce the likelihood of aborts.

[1] `oom=panic` issue: rust-lang/rust#43596
Copy link
Collaborator

@cevian cevian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cevian
Copy link
Collaborator

cevian commented Oct 21, 2021

Can you add a test for this?

@JLockerman
Copy link
Contributor Author

Can you add a test for this?

Hmm, I think we'd need to instrument malloc for that. Once @sgichohi oom-guard is ready we can probably build an image with that and use some of their instrumentation to test it...

@JLockerman
Copy link
Contributor Author

bors r+

@bors
Copy link
Contributor

bors bot commented Oct 25, 2021

Build succeeded:

@bors bors bot merged commit 0530760 into main Oct 25, 2021
@bors bors bot deleted the jl/oom-reduction branch October 25, 2021 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants