Improved memory management #161

behlendorf · 2012-08-23T04:27:23Z

This patch stack improves the memory management in the SPL as follows:

Emergency slab objects. These prevent a deadlock in vmalloc(), caused by the kernel not honoring the gfp flags, without requiring a kernel patch.
Revert the use of PF_MEMALLOC which was the previous workaround for the vmalloc() deadlocks. Unfortunately, this fix resulted in side effects such as the depletion of critical memory zones.
Added PF_NOFS flag to automate detection of sites where KM_PUSHPAGE must be used instead of KM_SLEEP.
Added self-recursive mutex detection as additional paranoia.

behlendorf · 2012-08-23T04:46:43Z

@ryao Can you please carefully review and test these changes? They are working well for me in a RHEL 6.2 environment, but they need significantly more testing.

This patch is designed to resolve a deadlock which can occur with __vmalloc() based slabs. The issue is that the Linux kernel does not honor the flags passed to __vmalloc(). This makes it unsafe to use in a writeback context. Unfortunately, this is a use case ZFS depends on for correct operation.

Fixing this issue in the upstream kernel was pursued and patches are available which resolve the issue.

https://bugs.gentoo.org/show_bug.cgi?id=416685

However, these changes were rejected because upstream felt that using __vmalloc() in the context of writeback should never be done. Their solution was for us to rewrite parts of ZFS to accommodate the Linux VM.

While that is probably the right long term solution, and it is something we want to pursue, it is not a trivial task and will likely destabilize the existing code. This work has been planned for the 0.7.0 release but in the meantime we want to improve the SPL slab implementation to accommodate this expected ZFS usage.

This is accomplished by performing the __vmalloc() asynchronously in the context of a work queue. This doesn't prevent the worker thread from deadlocking. However, the caller can now safely block on a wait queue for the slab allocation to complete.

Normally this will occur in a reasonable amount of time and the caller will be woken up when the new slab is available. The objects will then get cached in the per-cpu magazines and everything will proceed as usual.

However, if the __vmalloc() deadlocks for the reasons described above, or is just very slow, then the callers on the wait queues will time out. When this rare situation occurs they will attempt to kmalloc() a single minimally sized object using the GFP_NOIO flags. This allocation will not deadlock because kmalloc() will honor the passed flags and the caller will be able to make forward progress.

As long as forward progress can be maintained then even if the worker thread is deadlocked the critical thread will make progress. This will eventually allow the deadlocked worker thread to complete and normal operation will resume.

These emergency allocations will likely be slow since they require contiguous pages. However, their use should be rare so the impact is expected to be minimal. If that turns out not to be the case in practice, further optimizations are possible.

One additional concern is if these emergency objects are long lived. Right now they are simply tracked on a list which must be walked when an object is freed. If they accumulate on a system and the list grows, freeing objects will become more expensive. This could be handled relatively easily by using a hash instead of a list, but that optimization (if needed) is left for a follow up patch.

Additionally, these emergency objects could be repacked into existing slabs as objects are freed if the kmem_cache_set_move() functionality was implemented. See issue openzfs#26 for full details. This work would also help reduce ZFS's memory fragmentation problems.

The /proc/spl/kmem/slab file has had two new columns added at the end. The 'emerg' column reports the current number of these emergency objects in use for the cache, and the following 'max' column shows the historical worst case. These values should give us a good idea of how often these objects are needed. Based on these values under real use cases we can tune the default behavior.

Lastly, as a side benefit using a single work queue for the slab allocations should reduce cpu contention on the global virtual address space lock. This should manifest itself as reduced cpu usage for the system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
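The emergency-object bookkeeping described in this commit message can be sketched in user space roughly as follows. This is only an illustrative model, not the SPL code: `cache_model`, `emerg_obj`, `emerg_alloc()` and `emerg_free()` are hypothetical names, and `malloc()` stands in for the `kmalloc()` call made with GFP_NOIO after the wait on the slab times out. It shows the list walk on free and how the 'emerg' and 'max' counters would move.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space model; these names are not the SPL's. */
struct emerg_obj {
    struct emerg_obj *next;
    void *data;
};

struct cache_model {
    struct emerg_obj *emerg_list; /* emergency objects currently in use */
    size_t obj_size;              /* size of one object in this cache */
    unsigned long emerg;          /* current count (the 'emerg' column) */
    unsigned long emerg_max;      /* historical worst case (the 'max' column) */
};

/*
 * Fallback path: taken only after the wait for a new slab has timed out.
 * malloc() stands in for kmalloc() with GFP_NOIO.
 */
static void *emerg_alloc(struct cache_model *c)
{
    struct emerg_obj *e = malloc(sizeof(*e));

    if (e == NULL)
        return NULL;
    e->data = malloc(c->obj_size);
    if (e->data == NULL) {
        free(e);
        return NULL;
    }
    /* Track the object on the per-cache list and update the counters. */
    e->next = c->emerg_list;
    c->emerg_list = e;
    if (++c->emerg > c->emerg_max)
        c->emerg_max = c->emerg;
    return e->data;
}

/*
 * Free path: walk the list looking for obj. Returns 1 if obj was an
 * emergency object and has been released, 0 if it was not found and
 * the caller should return it to its slab as usual.
 */
static int emerg_free(struct cache_model *c, void *obj)
{
    struct emerg_obj **pp;

    for (pp = &c->emerg_list; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->data == obj) {
            struct emerg_obj *e = *pp;

            *pp = e->next; /* unlink from the list */
            c->emerg--;
            free(e->data);
            free(e);
            return 1;
        }
    }
    return 0;
}
```

The linear walk in `emerg_free()` is the cost the message refers to: it grows with the number of live emergency objects, which is why a hash is suggested as a possible follow up.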

This reverts commit 372c257. The use of the PF_MEMALLOC flag was always a hack to work around memory reclaim deadlocks. Those issues are believed to be resolved so this workaround can be safely reverted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

This reverts commit 36811b4, which is no longer required because there is now SPL code in place to safely handle the deadlocks the kernel patch was designed to address. Therefore we can unconditionally use vmalloc() and drop all the PF_MEMALLOC code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

This reverts commit b8b6e4c. The use of the PF_MEMALLOC flag was always a hack to work around memory reclaim deadlocks. Those issues are believed to be resolved so this workaround can be safely reverted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

This reverts commit 2092cf6. The use of the PF_MEMALLOC flag was always a hack to work around memory reclaim deadlocks. Those issues are believed to be resolved so this workaround can be safely reverted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

PF_NOFS is a per-process debug flag which is set in current->flags to detect when a process is performing an unsafe allocation. All tasks with PF_NOFS set must strictly use KM_PUSHPAGE for allocations because if they enter direct reclaim and initiate I/O they may deadlock. When debugging is disabled, any incorrect usage will be detected and a call stack with a warning will be printed to the console. The flags will then be automatically corrected to allow for safe execution. If debugging is enabled this will be treated as a fatal condition. To avoid any risk of conflicting with the existing PF_ flags, the PF_NOFS bit shadows the rarely used PF_MUTEX_TESTER bit. Only when CONFIG_RT_MUTEX_TESTER is not set, and we know this bit is unused, will the PF_NOFS bit be valid. Happily, most existing distributions ship a kernel with CONFIG_RT_MUTEX_TESTER disabled. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
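As a rough illustration of the non-debug behavior described above (detect, warn, then correct the flags), here is a minimal user-space sketch. Everything in it is an assumption made for the example: the `MODEL_` constants are placeholder bit values rather than the real PF_ or KM_ definitions, and `task_model` and `km_sanitize()` are hypothetical names. The fatal handling used when debugging is enabled is not modeled.

```c
#include <assert.h>
#include <stdio.h>

/* Placeholder bit values for illustration; not the kernel's or the SPL's. */
#define MODEL_PF_NOFS     0x1u
#define MODEL_KM_SLEEP    0x1u
#define MODEL_KM_PUSHPAGE 0x2u

/* Stand-in for the part of the task structure that holds current->flags. */
struct task_model {
    unsigned int flags;
};

/*
 * If the task is marked PF_NOFS but asks for KM_SLEEP, print a warning
 * and rewrite the request to KM_PUSHPAGE so it cannot start I/O during
 * direct reclaim. *warned is set to 1 when a correction was made.
 */
static unsigned int km_sanitize(const struct task_model *t,
                                unsigned int km_flags, int *warned)
{
    if ((t->flags & MODEL_PF_NOFS) && (km_flags & MODEL_KM_SLEEP)) {
        fprintf(stderr, "warning: KM_SLEEP used with PF_NOFS set\n");
        *warned = 1;
        return (km_flags & ~MODEL_KM_SLEEP) | MODEL_KM_PUSHPAGE;
    }
    return km_flags;
}
```

A task without the flag set passes through unchanged, so only code paths that have explicitly marked themselves as unsafe pay for the check.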

Generate an assertion if we're going to deadlock the system by attempting to acquire a mutex the process is already holding. There are currently no known instances of this under normal operation, but it _might_ be possible when using a ZVOL as a swap device. I want to ensure we catch this immediately if it were to occur. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
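The check amounts to comparing the mutex owner against the current task before trying to take the lock. A simplified, single-threaded sketch of that idea is shown below. It assumes hypothetical names (`kmutex_model`, `task_model`, `mutex_enter_model()`) and returns a status code where the real code would raise an assertion. It also reports a held mutex as busy instead of blocking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel task and the SPL mutex. */
struct task_model {
    int id;
};

struct kmutex_model {
    const struct task_model *owner; /* NULL when the mutex is free */
};

enum enter_result {
    ENTER_OK,            /* mutex acquired */
    ENTER_BUSY,          /* held by another task; a real mutex would block */
    ENTER_SELF_DEADLOCK  /* already held by the caller; real code would assert */
};

/* Try to take the mutex on behalf of curr, detecting self-recursion first. */
static enum enter_result mutex_enter_model(struct kmutex_model *m,
                                           const struct task_model *curr)
{
    if (m->owner == curr)
        return ENTER_SELF_DEADLOCK;
    if (m->owner != NULL)
        return ENTER_BUSY;
    m->owner = curr;
    return ENTER_OK;
}

/* Release the mutex; only the owner may do so. */
static void mutex_exit_model(struct kmutex_model *m,
                             const struct task_model *curr)
{
    assert(m->owner == curr);
    m->owner = NULL;
}
```

Catching the self-recursive case at the point of acquisition turns a silent hang into an immediate, attributable failure.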

Under certain circumstances the following functions may be called in a context where KM_SLEEP is unsafe and can result in a deadlocked system. To avoid this problem the unconditional KM_SLEEPs are converted to KM_PUSHPAGEs. This will prevent them from attempting to initiate any I/O during direct reclaim. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Reduce the object size in the slab overcommit regression test from an order 6 to an order 1 allocation. We still overcommit memory by 4x but the smaller object size reduces the odds of an OOM event due to memory fragmentation. This change was made to prevent this test case from triggering an OOM which kills the buildbot test infrastructure. In addition, move the kmem_cache_free() outside the spin lock. Doing this under the spin lock isn't strictly safe, and if there are a large number of emergency objects allocated it can be very slow. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
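For a sense of scale, an order-n allocation is 2^n contiguous pages. The sketch below works through the arithmetic assuming 4 KiB pages (the real page size depends on the architecture). The helper names are hypothetical, and `overcommit_objects()` is only a guess at how an object count for a 4x overcommit could be derived, not the regression test's actual logic.

```c
#include <assert.h>

/* Assumption: 4 KiB pages. */
#define MODEL_PAGE_SIZE 4096ULL

/* Size in bytes of an order-n allocation: 2^n contiguous pages. */
static unsigned long long order_to_bytes(unsigned int order)
{
    return MODEL_PAGE_SIZE << order;
}

/*
 * Number of order-n objects needed to request `factor` times
 * `mem_bytes` of memory (hypothetical helper for illustration).
 */
static unsigned long long overcommit_objects(unsigned long long mem_bytes,
                                             unsigned int factor,
                                             unsigned int order)
{
    return (mem_bytes * factor) / order_to_bytes(order);
}
```

Under that assumption an order 6 object is 256 KiB and an order 1 object is 8 KiB, so each object needs far fewer contiguous pages and is less exposed to fragmentation.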

behlendorf · 2012-08-31T03:19:33Z

Merged to master.

This was referenced Aug 23, 2012

Emergency slab objects #155

Closed

Make KM_SLEEP an alias of KM_PUSHPAGE #145

Closed

This was referenced Aug 23, 2012

Use Linux SLAB allocator for SPL SLAB allocations #147

Closed

zfs blocking everything, out of memory, and daily lockups openzfs/zfs#860

Closed

Support swap on zvol openzfs/zfs#342

Closed

behlendorf added 9 commits August 27, 2012 12:00

behlendorf closed this Aug 31, 2012
