Reduce the likelihood of OOM aborts #285
Merged
By default Rust will `abort()` the process when the allocator returns `NULL`. Since many systems can't reliably determine when an allocation will cause the process to run out of memory, and instead just rely on the OOM killer cleaning up afterwards, this is acceptable for many workloads. However, `abort()`-ing a Postgres backend will restart the database, and since we often run Postgres on systems which can reliably return `NULL` on out-of-memory, we would like to take advantage of this to cleanly shut down a single transaction when we fail to allocate. Long-term, the solution for this likely involves the `oom=panic` flag (issue: rust-lang/rust#43596), but at the time of writing that flag is not yet stable.

This PR implements a partial solution for turning out-of-memory into transaction-rollback instead of process-abort using a custom global allocator. It is a thin shim over the System allocator that `panic!()`s when the System allocator returns `NULL`. In the event that we still have enough memory remaining to serve the panic, this will unwind the stack all the way to transaction-rollback. In the event that we don't even have enough memory to handle unwinding, this will merely abort the process with a panic-in-panic instead of a memory-allocation-failure. Under the assumption that we're more likely to fail due to a few large allocations rather than a very large number of small allocations, it seems likely that we will have some memory remaining for unwinding, and that this will reduce the likelihood of aborts.