khugepaged deadlock #17

Open
liyi-ibm opened this issue Dec 19, 2018 · 1 comment
@liyi-ibm (Owner) commented:

see bug: 169701

The RCU stall is reported on only one CPU, and it looks like a deadlock in khugepaged's __split_huge_pmd():
[Thu Jul 12 09:20:19 2018]      173-...: (1 GPs behind) idle=04e/140000000000001/0 softirq=31330298/31330300 fqs=704364
[Thu Jul 12 09:20:19 2018]      (detected by 170, t=1446404 jiffies, g=12950989, c=12950988, q=16164202)
[Thu Jul 12 09:20:19 2018] Sending NMI from CPU 170 to CPUs 173:
[Thu Jul 12 09:20:19 2018] NMI backtrace for cpu 173
[Thu Jul 12 09:20:19 2018] CPU: 173 PID: 900 Comm: khugepaged Tainted: G      D         4.14.49-1 #1
[Thu Jul 12 09:20:19 2018] task: c000003fe138a800 task.stack: c000003fe1420000
[Thu Jul 12 09:20:19 2018] NIP:  c000000000aebc98 LR: c000000000331f34 CTR: c0000000002e52c0
[Thu Jul 12 09:20:19 2018] REGS: c000003fe1422f60 TRAP: 0e81   Tainted: G      D          (4.14.49-1)
[Thu Jul 12 09:20:19 2018] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 22444944  XER: 00000000
[Thu Jul 12 09:20:19 2018] CFAR: c000000000aebcb0 SOFTE: 1
GPR00: c000000000331f34 c000003fe14231e0 c0000000013d3000 c0002002af580064
GPR04: ffffffffffe00000 00007fff4d000000 0000000000000001 c00a000800bcf800
GPR08: 0000000000000001 000000008000008c 0000000000000000 00000000000001ff
GPR12: 0000000042444944 c000000007d56f00
[Thu Jul 12 09:20:19 2018] NIP [c000000000aebc98] _raw_spin_lock+0x68/0xc0
[Thu Jul 12 09:20:19 2018] LR [c000000000331f34] __split_huge_pmd+0xb4/0x1120
[Thu Jul 12 09:20:19 2018] Call Trace:
[Thu Jul 12 09:20:19 2018] [c000003fe14231e0] [c000003fe14235d0] 0xc000003fe14235d0 (unreliable)
[Thu Jul 12 09:20:19 2018] [c000003fe1423210] [c000000000331f34] __split_huge_pmd+0xb4/0x1120
[Thu Jul 12 09:20:19 2018] [c000003fe14232e0] [c0000000002e5a64] try_to_unmap_one+0x7a4/0x9c0
[Thu Jul 12 09:20:19 2018] [c000003fe14233f0] [c0000000002e3df4] rmap_walk_anon+0x1b4/0x3f0
[Thu Jul 12 09:20:19 2018] [c000003fe1423460] [c0000000002e6f64] try_to_unmap+0xb4/0x1a0
[Thu Jul 12 09:20:19 2018] [c000003fe14234c0] [c000000000335204] split_huge_page_to_list+0x184/0xca0
[Thu Jul 12 09:20:19 2018] [c000003fe14235c0] [c000000000335f60] deferred_split_scan+0x240/0x390
[Thu Jul 12 09:20:19 2018] [c000003fe1423650] [c0000000002976e0] shrink_slab+0x2d0/0x520
[Thu Jul 12 09:20:19 2018] [c000003fe14237a0] [c00000000029d564] shrink_node+0x2c4/0x410
[Thu Jul 12 09:20:19 2018] [c000003fe1423860] [c00000000029db78] do_try_to_free_pages+0x128/0x4b0
[Thu Jul 12 09:20:19 2018] [c000003fe1423900] [c00000000029e02c] try_to_free_pages+0x12c/0x2b0
[Thu Jul 12 09:20:19 2018] [c000003fe1423990] [c0000000002845e4] __alloc_pages_nodemask+0x714/0x1080
[Thu Jul 12 09:20:19 2018] [c000003fe1423b80] [c0000000003382ac] khugepaged_alloc_page+0x8c/0x140
[Thu Jul 12 09:20:19 2018] [c000003fe1423bb0] [c00000000033a7ec] khugepaged+0x9dc/0x2b60
[Thu Jul 12 09:20:19 2018] [c000003fe1423dc0] [c000000000128aa8] kthread+0x168/0x1b0
[Thu Jul 12 09:20:19 2018] [c000003fe1423e30] [c00000000000bdd0] ret_from_kernel_thread+0x5c/0x8c
[Thu Jul 12 09:20:19 2018] Instruction dump:
[Thu Jul 12 09:20:19 2018] 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 40de0018 38210030 e8010010
[Thu Jul 12 09:20:19 2018] ebe1fff8 7c0803a6 4e800020 7c210b78 <e92d0000> 89290009 792affe3 4082003c

The system is still responsive, but we cannot access one disk (/data3) on the system, possibly because khugepaged holds some file system mutex. khugepaged occupies 100% CPU.
@liyi-ibm (Owner, Author) commented:

commit 675d995
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date: Mon Apr 16 16:57:24 2018 +0530

    powerpc/book3s64: Enable split pmd ptlock.

This commit is a possible fix.

liyi-ibm pushed a commit that referenced this issue Dec 28, 2018
commit 360cc79 upstream.

The table field in nft_obj_filter is not an array. In order to check
tablename, we should check if the pointer is set.

Test commands:

   %nft add table ip filter
   %nft add counter ip filter ct1
   %nft reset counters

Splat looks like:

[  306.510504] kasan: CONFIG_KASAN_INLINE enabled
[  306.516184] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  306.524775] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  306.528284] Modules linked in: nft_objref nft_counter nf_tables nfnetlink ip_tables x_tables
[  306.528284] CPU: 0 PID: 1488 Comm: nft Not tainted 4.17.0-rc4+ #17
[  306.528284] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[  306.528284] RIP: 0010:nf_tables_dump_obj+0x52c/0xa70 [nf_tables]
[  306.528284] RSP: 0018:ffff8800b6cb7520 EFLAGS: 00010246
[  306.528284] RAX: 0000000000000000 RBX: ffff8800b6c49820 RCX: 0000000000000000
[  306.528284] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffed0016d96e9a
[  306.528284] RBP: ffff8800b6cb75c0 R08: ffffed00236fce7c R09: ffffed00236fce7b
[  306.528284] R10: ffffffff9f6241e8 R11: ffffed00236fce7c R12: ffff880111365108
[  306.528284] R13: 0000000000000000 R14: ffff8800b6c49860 R15: ffff8800b6c49860
[  306.528284] FS:  00007f838b007700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
[  306.528284] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  306.528284] CR2: 00007ffeafabcf78 CR3: 00000000b6cbe000 CR4: 00000000001006f0
[  306.528284] Call Trace:
[  306.528284]  netlink_dump+0x470/0xa20
[  306.528284]  __netlink_dump_start+0x5ae/0x690
[  306.528284]  ? nf_tables_getobj+0x1b3/0x740 [nf_tables]
[  306.528284]  nf_tables_getobj+0x2f5/0x740 [nf_tables]
[  306.528284]  ? nft_obj_notify+0x100/0x100 [nf_tables]
[  306.528284]  ? nf_tables_getobj+0x740/0x740 [nf_tables]
[  306.528284]  ? nf_tables_dump_flowtable_done+0x70/0x70 [nf_tables]
[  306.528284]  ? nft_obj_notify+0x100/0x100 [nf_tables]
[  306.528284]  nfnetlink_rcv_msg+0x8ff/0x932 [nfnetlink]
[  306.528284]  ? nfnetlink_rcv_msg+0x216/0x932 [nfnetlink]
[  306.528284]  netlink_rcv_skb+0x1c9/0x2f0
[  306.528284]  ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
[  306.528284]  ? debug_check_no_locks_freed+0x270/0x270
[  306.528284]  ? netlink_ack+0x7a0/0x7a0
[  306.528284]  ? ns_capable_common+0x6e/0x110
[ ... ]

Fixes: e46abbc ("netfilter: nf_tables: Allow table names of up to 255 chars")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
liyi-ibm pushed a commit that referenced this issue Dec 28, 2018
commit 67d2f87 upstream.

Commit dec2c92 ("Bluetooth: hci_ldisc:
Use rwlocking to avoid closing proto races") introduced locks in
hci_ldisc that are held while calling the proto functions. These locks
are rwlock's, and hence do not allow sleeping while they are held.
However, the proto functions that hci_bcm registers use mutexes and
hence need to be able to sleep.

In more detail: hci_uart_tty_receive() and hci_uart_dequeue() both
acquire the rwlock, after which they call proto->recv() and
proto->dequeue(), respectively. In the case of hci_bcm these point to
bcm_recv() and bcm_dequeue(). The latter both acquire the
bcm_device_lock, which is a mutex, so doing so results in a call to
might_sleep(). But since we're holding a rwlock in hci_ldisc, that
results in the following BUG (this for the dequeue case - a similar
one for the receive case is omitted for brevity):

  BUG: sleeping function called from invalid context at kernel/locking/mutex.c
  in_atomic(): 1, irqs_disabled(): 0, pid: 7303, name: kworker/7:3
  INFO: lockdep is turned off.
  CPU: 7 PID: 7303 Comm: kworker/7:3 Tainted: G        W  OE   4.13.2+ #17
  Hardware name: Apple Inc. MacBookPro13,3/Mac-A5C67F76ED83108C, BIOS MBP133.8
  Workqueue: events hci_uart_write_work [hci_uart]
  Call Trace:
   dump_stack+0x8e/0xd6
   ___might_sleep+0x164/0x250
   __might_sleep+0x4a/0x80
   __mutex_lock+0x59/0xa00
   ? lock_acquire+0xa3/0x1f0
   ? lock_acquire+0xa3/0x1f0
   ? hci_uart_write_work+0xd3/0x160 [hci_uart]
   mutex_lock_nested+0x1b/0x20
   ? mutex_lock_nested+0x1b/0x20
   bcm_dequeue+0x21/0xc0 [hci_uart]
   hci_uart_write_work+0xe6/0x160 [hci_uart]
   process_one_work+0x253/0x6a0
   worker_thread+0x4d/0x3b0
   kthread+0x133/0x150

We can't replace the mutex in hci_bcm, because there are other calls
there that might sleep. Therefore this replaces the rwlock's in
hci_ldisc with rw_semaphore's (which allow sleeping). This is a safer
approach anyway as it reduces the restrictions on the proto callbacks.
Also, because acquiring the write-lock is very rare compared to acquiring
the read-lock, the percpu variant of rw_semaphore is used.

Lastly, because hci_uart_tx_wakeup() may be called from an IRQ context,
we can't block (sleep) while trying to acquire the read lock there, so we
use the trylock variant.

Signed-off-by: Ronald Tschalär <ronald@innovation.ch>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>