khugepaged deadlock #17

Open
liyi-ibm opened this issue Dec 19, 2018 · 1 comment
@liyi-ibm (Owner) commented:

see bug: 169701

The RCU stall is reported on only one CPU, and it looks like a deadlock in khugepaged's __split_huge_pmd():
[Thu Jul 12 09:20:19 2018]      173-...: (1 GPs behind) idle=04e/140000000000001/0 softirq=31330298/31330300 fqs=704364
[Thu Jul 12 09:20:19 2018]      (detected by 170, t=1446404 jiffies, g=12950989, c=12950988, q=16164202)
[Thu Jul 12 09:20:19 2018] Sending NMI from CPU 170 to CPUs 173:
[Thu Jul 12 09:20:19 2018] NMI backtrace for cpu 173
[Thu Jul 12 09:20:19 2018] CPU: 173 PID: 900 Comm: khugepaged Tainted: G      D         4.14.49-1 #1
[Thu Jul 12 09:20:19 2018] task: c000003fe138a800 task.stack: c000003fe1420000
[Thu Jul 12 09:20:19 2018] NIP:  c000000000aebc98 LR: c000000000331f34 CTR: c0000000002e52c0
[Thu Jul 12 09:20:19 2018] REGS: c000003fe1422f60 TRAP: 0e81   Tainted: G      D          (4.14.49-1)
[Thu Jul 12 09:20:19 2018] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 22444944  XER: 00000000
[Thu Jul 12 09:20:19 2018] CFAR: c000000000aebcb0 SOFTE: 1
GPR00: c000000000331f34 c000003fe14231e0 c0000000013d3000 c0002002af580064
GPR04: ffffffffffe00000 00007fff4d000000 0000000000000001 c00a000800bcf800
GPR08: 0000000000000001 000000008000008c 0000000000000000 00000000000001ff
GPR12: 0000000042444944 c000000007d56f00
[Thu Jul 12 09:20:19 2018] NIP [c000000000aebc98] _raw_spin_lock+0x68/0xc0
[Thu Jul 12 09:20:19 2018] LR [c000000000331f34] __split_huge_pmd+0xb4/0x1120
[Thu Jul 12 09:20:19 2018] Call Trace:
[Thu Jul 12 09:20:19 2018] [c000003fe14231e0] [c000003fe14235d0] 0xc000003fe14235d0 (unreliable)
[Thu Jul 12 09:20:19 2018] [c000003fe1423210] [c000000000331f34] __split_huge_pmd+0xb4/0x1120
[Thu Jul 12 09:20:19 2018] [c000003fe14232e0] [c0000000002e5a64] try_to_unmap_one+0x7a4/0x9c0
[Thu Jul 12 09:20:19 2018] [c000003fe14233f0] [c0000000002e3df4] rmap_walk_anon+0x1b4/0x3f0
[Thu Jul 12 09:20:19 2018] [c000003fe1423460] [c0000000002e6f64] try_to_unmap+0xb4/0x1a0
[Thu Jul 12 09:20:19 2018] [c000003fe14234c0] [c000000000335204] split_huge_page_to_list+0x184/0xca0
[Thu Jul 12 09:20:19 2018] [c000003fe14235c0] [c000000000335f60] deferred_split_scan+0x240/0x390
[Thu Jul 12 09:20:19 2018] [c000003fe1423650] [c0000000002976e0] shrink_slab+0x2d0/0x520
[Thu Jul 12 09:20:19 2018] [c000003fe14237a0] [c00000000029d564] shrink_node+0x2c4/0x410
[Thu Jul 12 09:20:19 2018] [c000003fe1423860] [c00000000029db78] do_try_to_free_pages+0x128/0x4b0
[Thu Jul 12 09:20:19 2018] [c000003fe1423900] [c00000000029e02c] try_to_free_pages+0x12c/0x2b0
[Thu Jul 12 09:20:19 2018] [c000003fe1423990] [c0000000002845e4] __alloc_pages_nodemask+0x714/0x1080
[Thu Jul 12 09:20:19 2018] [c000003fe1423b80] [c0000000003382ac] khugepaged_alloc_page+0x8c/0x140
[Thu Jul 12 09:20:19 2018] [c000003fe1423bb0] [c00000000033a7ec] khugepaged+0x9dc/0x2b60
[Thu Jul 12 09:20:19 2018] [c000003fe1423dc0] [c000000000128aa8] kthread+0x168/0x1b0
[Thu Jul 12 09:20:19 2018] [c000003fe1423e30] [c00000000000bdd0] ret_from_kernel_thread+0x5c/0x8c
[Thu Jul 12 09:20:19 2018] Instruction dump:
[Thu Jul 12 09:20:19 2018] 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 40de0018 38210030 e8010010
[Thu Jul 12 09:20:19 2018] ebe1fff8 7c0803a6 4e800020 7c210b78 <e92d0000> 89290009 792affe3 4082003c

The system is still responsive, but we cannot access one disk (/data3) on the system, possibly because khugepaged holds some file system mutex. khugepaged occupies 100% CPU.
@liyi-ibm (Owner, Author) commented:

commit 675d995
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date: Mon Apr 16 16:57:24 2018 +0530

    powerpc/book3s64: Enable split pmd ptlock.

This commit is a possible fix.

liyi-ibm pushed a commit that referenced this issue Dec 28, 2018
commit 360cc79 upstream.

The table field in nft_obj_filter is not an array. In order to check
tablename, we should check if the pointer is set.

Test commands:

   %nft add table ip filter
   %nft add counter ip filter ct1
   %nft reset counters

Splat looks like:

[  306.510504] kasan: CONFIG_KASAN_INLINE enabled
[  306.516184] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  306.524775] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  306.528284] Modules linked in: nft_objref nft_counter nf_tables nfnetlink ip_tables x_tables
[  306.528284] CPU: 0 PID: 1488 Comm: nft Not tainted 4.17.0-rc4+ #17
[  306.528284] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[  306.528284] RIP: 0010:nf_tables_dump_obj+0x52c/0xa70 [nf_tables]
[  306.528284] RSP: 0018:ffff8800b6cb7520 EFLAGS: 00010246
[  306.528284] RAX: 0000000000000000 RBX: ffff8800b6c49820 RCX: 0000000000000000
[  306.528284] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffed0016d96e9a
[  306.528284] RBP: ffff8800b6cb75c0 R08: ffffed00236fce7c R09: ffffed00236fce7b
[  306.528284] R10: ffffffff9f6241e8 R11: ffffed00236fce7c R12: ffff880111365108
[  306.528284] R13: 0000000000000000 R14: ffff8800b6c49860 R15: ffff8800b6c49860
[  306.528284] FS:  00007f838b007700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
[  306.528284] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  306.528284] CR2: 00007ffeafabcf78 CR3: 00000000b6cbe000 CR4: 00000000001006f0
[  306.528284] Call Trace:
[  306.528284]  netlink_dump+0x470/0xa20
[  306.528284]  __netlink_dump_start+0x5ae/0x690
[  306.528284]  ? nf_tables_getobj+0x1b3/0x740 [nf_tables]
[  306.528284]  nf_tables_getobj+0x2f5/0x740 [nf_tables]
[  306.528284]  ? nft_obj_notify+0x100/0x100 [nf_tables]
[  306.528284]  ? nf_tables_getobj+0x740/0x740 [nf_tables]
[  306.528284]  ? nf_tables_dump_flowtable_done+0x70/0x70 [nf_tables]
[  306.528284]  ? nft_obj_notify+0x100/0x100 [nf_tables]
[  306.528284]  nfnetlink_rcv_msg+0x8ff/0x932 [nfnetlink]
[  306.528284]  ? nfnetlink_rcv_msg+0x216/0x932 [nfnetlink]
[  306.528284]  netlink_rcv_skb+0x1c9/0x2f0
[  306.528284]  ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
[  306.528284]  ? debug_check_no_locks_freed+0x270/0x270
[  306.528284]  ? netlink_ack+0x7a0/0x7a0
[  306.528284]  ? ns_capable_common+0x6e/0x110
[ ... ]

Fixes: e46abbc ("netfilter: nf_tables: Allow table names of up to 255 chars")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
liyi-ibm pushed a commit that referenced this issue Dec 28, 2018
commit 67d2f87 upstream.

Commit dec2c92 ("Bluetooth: hci_ldisc:
Use rwlocking to avoid closing proto races") introduced locks in
hci_ldisc that are held while calling the proto functions. These locks
are rwlock's, and hence do not allow sleeping while they are held.
However, the proto functions that hci_bcm registers use mutexes and
hence need to be able to sleep.

In more detail: hci_uart_tty_receive() and hci_uart_dequeue() both
acquire the rwlock, after which they call proto->recv() and
proto->dequeue(), respectively. In the case of hci_bcm these point to
bcm_recv() and bcm_dequeue(). The latter both acquire the
bcm_device_lock, which is a mutex, so doing so results in a call to
might_sleep(). But since we're holding a rwlock in hci_ldisc, that
results in the following BUG (this for the dequeue case - a similar
one for the receive case is omitted for brevity):

  BUG: sleeping function called from invalid context at kernel/locking/mutex.c
  in_atomic(): 1, irqs_disabled(): 0, pid: 7303, name: kworker/7:3
  INFO: lockdep is turned off.
  CPU: 7 PID: 7303 Comm: kworker/7:3 Tainted: G        W  OE   4.13.2+ #17
  Hardware name: Apple Inc. MacBookPro13,3/Mac-A5C67F76ED83108C, BIOS MBP133.8
  Workqueue: events hci_uart_write_work [hci_uart]
  Call Trace:
   dump_stack+0x8e/0xd6
   ___might_sleep+0x164/0x250
   __might_sleep+0x4a/0x80
   __mutex_lock+0x59/0xa00
   ? lock_acquire+0xa3/0x1f0
   ? lock_acquire+0xa3/0x1f0
   ? hci_uart_write_work+0xd3/0x160 [hci_uart]
   mutex_lock_nested+0x1b/0x20
   ? mutex_lock_nested+0x1b/0x20
   bcm_dequeue+0x21/0xc0 [hci_uart]
   hci_uart_write_work+0xe6/0x160 [hci_uart]
   process_one_work+0x253/0x6a0
   worker_thread+0x4d/0x3b0
   kthread+0x133/0x150

We can't replace the mutex in hci_bcm, because there are other calls
there that might sleep. Therefore this replaces the rwlock's in
hci_ldisc with rw_semaphore's (which allow sleeping). This is a safer
approach anyway as it reduces the restrictions on the proto callbacks.
Also, because acquiring the write-lock is very rare compared to acquiring
the read-lock, the percpu variant of rw_semaphore is used.

Lastly, because hci_uart_tx_wakeup() may be called from an IRQ context,
we can't block (sleep) while trying to acquire the read lock there, so we
use the trylock variant.

Signed-off-by: Ronald Tschalär <ronald@innovation.ch>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>