While debugging some OOM-related issues, I noticed that the OOM killer
sometimes kills two processes instead of one.

The problem can be easily reproduced on a vanilla kernel (allocate is a
simple process which calls malloc() and faults each page in an infinite
loop):

[   25.721494] allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[   25.725658] allocate cpuset=/ mems_allowed=0
[   25.727033] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
<cut>
[   25.768293] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   25.768860] [  121]     0   121    25672      133      50       3        0             0 systemd-journal
[   25.769530] [  156]     0   156    11157      197      22       3        0         -1000 systemd-udevd
[   25.770206] [  206]     0   206    13896       99      29       3        0         -1000 auditd
[   25.770822] [  227]     0   227    11874      124      27       3        0             0 systemd-logind
[   25.771494] [  229]    81   229    11577      146      28       3        0          -900 dbus-daemon
[   25.772126] [  231]   997   231    27502      102      25       3        0             0 chronyd
[   25.772731] [  233]     0   233    61519     5239      85       3        0             0 firewalld
[   25.773345] [  238]     0   238   123495      529      74       4        0             0 NetworkManager
[   25.773988] [  265]     0   265    25117      231      52       3        0         -1000 sshd
[   25.774569] [  271]     0   271     6092      154      17       3        0             0 crond
[   25.775137] [  277]     0   277    11297       93      26       3        0             0 systemd-hostnam
[   25.775766] [  284]     0   284     1716       29       9       3        0             0 agetty
[   25.776342] [  285]     0   285     2030       34       9       4        0             0 agetty
[   25.776919] [  302]   998   302   133102     2578      58       3        0             0 polkitd
[   25.777505] [  394]     0   394    21785     3076      45       3        0             0 dhclient
[   25.778092] [  444]     0   444    36717      312      74       3        0             0 sshd
[   25.778744] [  446]     0   446    15966      223      36       3        0             0 systemd
[   25.779304] [  447]     0   447    23459      384      47       3        0             0 (sd-pam)
[   25.779877] [  451]     0   451    36717      316      72       3        0             0 sshd
[   25.780450] [  452]     0   452     3611      315      11       3        0             0 bash
[   25.781107] [  492]     0   492   513092   473645     934       5        0             0 allocate
[   25.781882] Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
[   25.783874] Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
<cut>
[   25.817589] allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null), order=0, oom_score_adj=0
[   25.818821] allocate cpuset=/ mems_allowed=0
[   25.819259] CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
[   25.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   25.820549] Call Trace:
[   25.820733]  dump_stack+0x63/0x82
[   25.820961]  dump_header+0x97/0x21a
[   25.820961]  ? security_capable_noaudit+0x45/0x60
[   25.820961]  oom_kill_process+0x219/0x3e0
[   25.820961]  out_of_memory+0x11d/0x480
[   25.820961]  pagefault_out_of_memory+0x68/0x80
[   25.820961]  mm_fault_error+0x8f/0x190
[   25.820961]  ? handle_mm_fault+0xf3/0x210
[   25.820961]  __do_page_fault+0x4b2/0x4e0
[   25.820961]  trace_do_page_fault+0x37/0xe0
[   25.820961]  do_async_page_fault+0x19/0x70
[   25.820961]  async_page_fault+0x28/0x30
<cut>
[   25.863078] Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
[   25.863634] Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

After some investigation, I found several issues:

1) Prior to commit 1af8bb4 ("mm, oom: fortify task_will_free_mem()"),
   if a process with a pending SIGKILL called out_of_memory(), it was
   always immediately selected as a victim. After that change, this is
   no longer always the case. If the process has already been reaped at
   that moment, MMF_OOM_SKIP is set, task_will_free_mem() returns
   false, and a new victim selection is started.

   This actually happens when a userspace page fault causes an OOM.
   pagefault_out_of_memory() is called in the context of the faulting
   process after it has been selected as the OOM victim (assuming it
   has) and killed. With some probability (there is a race with the
   oom_reaper thread), this process will be passed to the oom reaper
   again, or an innocent victim will be selected and killed.

2) We clear task->oom_reaper_list before setting the MMF_OOM_SKIP
   flag, so there is a race.

3) We skip the MMF_OOM_SKIP flag check in the case of a
   sysrq-triggered OOM.

To address these issues, the following is proposed:

1) If the task is already an OOM victim, skip the out_of_memory() call
   from pagefault_out_of_memory().

2) Set the MMF_OOM_SKIP bit in wake_oom_reaper() before adding a
   process to the oom_reaper list. If it is already set, do nothing.
   Do not rely on the tsk->oom_reaper_list value.

3) Check MMF_OOM_SKIP even if the OOM is triggered by a sysrq.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
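
The second and third proposed changes can be illustrated with a small userspace model. This is only a sketch and not the kernel patch: struct mm_model, struct reaper_queue_model, MODEL_OOM_SKIP, wake_reaper_model() and victim_is_eligible_model() are hypothetical names invented here, and atomic_fetch_or() stands in for an atomic test-and-set of the MMF_OOM_SKIP bit. The point it tries to show is that setting the flag atomically before queueing makes a repeated hand-off to the reaper a no-op, without consulting any list pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the MMF_OOM_SKIP bit in mm->flags. */
#define MODEL_OOM_SKIP (1UL << 0)

/* Hypothetical, heavily reduced model of an mm. */
struct mm_model {
	atomic_ulong flags;
};

/* Hypothetical model of the reaper queue: only counts hand-offs. */
struct reaper_queue_model {
	size_t queued;
};

/*
 * Model of proposal 2: atomically set the skip bit first and queue the
 * mm only if this caller was the one that set it. Returns true if the
 * mm was queued, false if the bit was already set.
 */
bool wake_reaper_model(struct reaper_queue_model *q, struct mm_model *mm)
{
	unsigned long old;

	assert(q != NULL && mm != NULL);

	old = atomic_fetch_or(&mm->flags, MODEL_OOM_SKIP);
	if (old & MODEL_OOM_SKIP)
		return false;	/* already handed to the reaper once */

	q->queued++;
	return true;
}

/*
 * Model of proposal 3: the skip bit is honoured on every selection
 * path, so the function takes no "triggered by sysrq" argument at all.
 */
bool victim_is_eligible_model(struct mm_model *mm)
{
	assert(mm != NULL);
	return !(atomic_load(&mm->flags) & MODEL_OOM_SKIP);
}
```

In this model, two concurrent callers of wake_reaper_model() on the same mm cannot both queue it, because only one of them observes the bit as previously clear.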
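
The source of the allocate test program is not part of this commit message. A minimal sketch of what such a reproducer might look like, assuming it simply calls malloc() and writes one byte per page so that every page is actually faulted in, is given here. The function names allocate_and_fault() and allocate_forever() and the 64 MiB chunk size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Allocate chunk_bytes with malloc() and write one byte into every
 * page so the kernel has to back each page with real memory.
 * Returns the number of pages touched, or 0 if malloc() failed.
 * The chunk is deliberately never freed.
 */
size_t allocate_and_fault(size_t chunk_bytes, size_t page_size)
{
	volatile unsigned char *chunk;
	size_t off, touched = 0;

	assert(page_size > 0);

	chunk = malloc(chunk_bytes);
	if (chunk == NULL)
		return 0;

	for (off = 0; off < chunk_bytes; off += page_size) {
		chunk[off] = 1;	/* the write triggers the page fault */
		touched++;
	}
	return touched;
}

/*
 * Hypothetical driver: keep allocating and faulting until the OOM
 * killer terminates the process. The chunk size is an arbitrary choice.
 */
void allocate_forever(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t page_size = page > 0 ? (size_t)page : 4096;

	for (;;)
		allocate_and_fault((size_t)64 << 20, page_size);
}
```

Calling allocate_forever() from main() in a disposable virtual machine should eventually invoke the OOM killer. Whether the second, unwanted kill shows up depends on the race with the oom_reaper thread described above, so several runs may be needed.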