Power9: Host crash during SMT change with guest emulator thread pinned "Oops: Kernel access of bad area, sig: 11 [#1]" #17

sathnaga · 2017-09-27T07:56:16Z

Host Kernel: 4.13.0-4.rel.git49564cb.el7.centos.ppc64le

Steps to reproduce:

1. Boot a guest (vm1).
2. Pin the emulator thread to the last host CPU:
   virsh emulatorpin vm1 79 --live --config
3. Change host SMT from 4 to 2:
   ppc64_cpu --smt=2
====> Host crashes and becomes unresponsive
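
To see why pinning to CPU 79 matters here, the sketch below lists which logical CPUs go offline when SMT drops from 4 to 2. It assumes 80 logical CPUs (0-79), 4 hardware threads per core numbered contiguously, and that SMT=2 keeps the first two threads of each core online; the function name `offlined_cpus` is hypothetical and only for illustration.

```python
def offlined_cpus(total_cpus, threads_per_core, new_smt):
    """Return the logical CPUs taken offline when the SMT mode is lowered.

    Assumes threads of a core are numbered contiguously and that the
    first `new_smt` threads of every core stay online.
    """
    return [cpu for cpu in range(total_cpus)
            if cpu % threads_per_core >= new_smt]


# Assumed host layout: 80 logical CPUs, SMT4 cores, switching to SMT=2.
offline = offlined_cpus(total_cpus=80, threads_per_core=4, new_smt=2)
print("CPU 79 goes offline:", 79 in offline)
```

Under these assumptions CPU 79 is the fourth thread of the last core, so the emulator thread's only allowed CPU disappears during the SMT change. This agrees with the "no longer affine" messages in the log below, which name only CPUs whose thread index is 2 or 3.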

Part of the guest XML:

<domain type='kvm'>
  <name>vm1</name>
  <uuid>8914b703-4133-4564-bb39-108159f0f2b8</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <emulatorpin cpuset='79'/>
  </cputune>
  <os>
    <type arch='ppc64le' machine='pseries-2.10'>hvm</type>
    <boot dev='hd'/>
  </os>
  <cpu>
    <topology sockets='1' cores='4' threads='1'/>
  </cpu>

The host hangs and becomes unresponsive; it needs an external reboot to recover.

[175192.775110] IRQ 33: no longer affine to CPU2
[175193.513117] IRQ 51: no longer affine to CPU7
[175193.918060] IRQ 36: no longer affine to CPU10
[175194.898718] IRQ 32: no longer affine to CPU15
[175195.497593] IRQ 24: no longer affine to CPU23
[175195.847274] IRQ 59: no longer affine to CPU27
[175196.156829] IRQ 39: no longer affine to CPU31
[175196.514113] IRQ 38: no longer affine to CPU35
[175196.845370] IRQ 52: no longer affine to CPU38
[175197.016417] IRQ 50: no longer affine to CPU39
[175197.935579] irq_migrate_all_off_this_cpu: 1 callbacks suppressed
[175197.935582] IRQ 69: no longer affine to CPU51
[175198.195199] IRQ 56: no longer affine to CPU55
[175198.345390] IRQ 57: no longer affine to CPU62
[175198.506220] IRQ 28: no longer affine to CPU63
[175199.224386] IRQ 66: no longer affine to CPU71
[175199.554113] IRQ 35: no longer affine to CPU75
[175199.694068] IRQ 37: no longer affine to CPU78
[175199.852866] Unable to handle kernel paging request for data at address 0x000008c8
[175199.852938] Faulting instruction address: 0xc0000000001d0184
[175199.852953] Oops: Kernel access of bad area, sig: 11 [#1]
[175199.853004] SMP NR_CPUS=1024 
[175199.853005] NUMA 
[175199.853045] PowerNV
[175199.853098] Modules linked in: target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel opal_prd nfsd auth_rpcgss oid_registry nfs_acl
[175199.853785]  lockd grace kvm_hv sunrpc kvm tg3 ptp pps_core
[175199.853856] CPU: 79 PID: 64710 Comm: kworker/79:2 Not tainted 4.13.0-4.rel.git49564cb.el7.centos.ppc64le #1
[175199.853961] Workqueue: events cpuset_hotplug_workfn
[175199.854014] task: c0000003a2a22600 task.stack: c0000003a2ac8000
[175199.854077] NIP: c0000000001d0184 LR: c0000000001d0170 CTR: c0000000001d0130
[175199.854153] REGS: c0000003a2acb710 TRAP: 0300   Not tainted  (4.13.0-4.rel.git49564cb.el7.centos.ppc64le)
[175199.854241] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[175199.854249]   CR: 448e2022  XER: 20040000
[175199.854349] CFAR: c0000000001c3db0 DAR: 00000000000008c8 DSISR: 40000000 SOFTE: 1 
[175199.854349] GPR00: c0000000001d0170 c0000003a2acb990 c000000001397a00 0000000000000000 
[175199.854349] GPR04: c0000003a2acb9b0 0000000000000000 c0000003a2acbab0 c000000245975678 
[175199.854349] GPR08: c000000245975678 c0000003a2acb948 c0000000015a7a00 0000000000000000 
[175199.854349] GPR12: c0000000001d0130 c00000000fdb1600 c000000000124348 c000000036de4e80 
[175199.854349] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000001 
[175199.854349] GPR20: c000000005be6940 c000000005be6960 0000000000000000 0000000000000000 
[175199.854349] GPR24: c000000001334de0 c0000000015a09e0 c000000001264488 c0000003a2acbab0 
[175199.854349] GPR28: c0000003aae63c00 c0000003a2acbaa0 c0000003a2acba10 0000000000000000 
[175199.855075] NIP [c0000000001d0184] cpuset_can_attach+0x54/0x1a0
[175199.855191] LR [c0000000001d0170] cpuset_can_attach+0x40/0x1a0
[175199.855304] Call Trace:
[175199.855355] [c0000003a2acb990] [c0000000001d0170] cpuset_can_attach+0x40/0x1a0 (unreliable)
[175199.855519] [c0000003a2acb9f0] [c0000000001c4dd4] cgroup_migrate_execute+0xc4/0x4c0
[175199.855657] [c0000003a2acba60] [c0000000001cc3d4] cgroup_transfer_tasks+0x1e4/0x380
[175199.855796] [c0000003a2acbb90] [c0000000001d2810] cpuset_hotplug_workfn+0x6e0/0x900
[175199.855934] [c0000003a2acbc90] [c00000000011bc00] process_one_work+0x1a0/0x490
[175199.856072] [c0000003a2acbd30] [c00000000011bf88] worker_thread+0x98/0x520
[175199.856188] [c0000003a2acbdc0] [c0000000001244a8] kthread+0x168/0x1b0
[175199.856304] [c0000003a2acbe30] [c00000000000bc60] ret_from_kernel_thread+0x5c/0x7c
[175199.856441] Instruction dump:
[175199.856513] fbc1fff0 fbe1fff8 f8010010 f821ffa1 38810020 7c7d1b78 4bff3c6d 60000000 
[175199.856655] 3f42ffed 3d420021 eb610020 3b5aca88 <e92308c8> 7f43d378 e9290000 f92a90c8 
[175199.856800] ---[ end trace 5aa84a7cf504a434 ]---
[175199.868456] 
[175201.708433] process 150492 (vhost-150463) no longer affine to cpu79
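
As a reading aid for the oops above, the following sketch decodes the faulting instruction word marked in the instruction dump (`<e92308c8>`) and recomputes the address it tried to load from. It assumes the word is a Power ISA DS-form instruction (primary opcode 58, which covers `ld`) and takes GPR03 = 0 from the register dump; the helper name `decode_ds_form` is hypothetical.

```python
def decode_ds_form(word):
    """Split a 32-bit Power ISA DS-form instruction word into its fields."""
    disp = word & 0xFFFC            # 14-bit DS field, low two bits cleared
    if disp & 0x8000:               # sign-extend the 16-bit displacement
        disp -= 0x10000
    return {
        "opcode": word >> 26,       # primary opcode; 58 is ld/ldu/lwa
        "rt": (word >> 21) & 0x1F,  # destination register
        "ra": (word >> 16) & 0x1F,  # base address register
        "disp": disp,               # byte displacement added to RA
        "xo": word & 0x3,           # extended opcode; 0 selects ld
    }


fields = decode_ds_form(0xE92308C8)   # word shown as <e92308c8> in the dump
gpr3 = 0x0                            # GPR03 from the register dump
effective_address = gpr3 + fields["disp"]
print(fields, hex(effective_address))
```

Read this way, the instruction is `ld r9,0x8c8(r3)` with r3 equal to zero, which gives the address 0x8c8 reported in DAR. That is consistent with a NULL pointer being dereferenced at offset 0x8c8 inside cpuset_can_attach; which pointer was NULL is not determined by this decode.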

Mirrored with LTC bug #159341


cdeadmin · 2017-09-27T09:15:36Z

------- Comment From bssrikanth@in.ibm.com 2017-09-27 05:08:00 EDT-------
Similar issue noted with Pegas 1.0 testing as well @ bug 159286

cdeadmin · 2017-10-20T18:15:26Z

------- Comment From jamesspo@us.ibm.com 2017-10-20 14:10:40 EDT-------
Moving to Sprint 2, but let's mention it in the announce details.

cdeadmin · 2017-11-21T10:15:37Z

------- Comment From bssrikanth@in.ibm.com 2017-11-21 05:14:33 EDT-------
Have requested Satheesh to test with latest release branch

sathnaga · 2018-01-01T08:03:33Z

Tested on the latest release branch, 4.14.0-1.rel.git68b4afb.el7.centos.ppc64le, and the issue is fixed.

# virsh destroy vm1;virsh start vm1
Domain vm1 destroyed

Domain vm1 started


# virsh emulatorpin vm1
emulator: CPU Affinity
----------------------------------
       *: 63

# virsh vcpupin vm1
VCPU: CPU Affinity
----------------------------------
   0: 0-63
   1: 0-63
   2: 0-63
   3: 0-63

# ppc64_cpu --smt
SMT=4
# ppc64_cpu --smt=2

# ppc64_cpu --smt
SMT=2

cdeadmin · 2018-01-01T08:10:42Z

------- Comment From satheera@in.ibm.com 2017-11-28 05:25:35 EDT-------

Regards,
-Satheesh.


sathnaga changed the title ~~Power9: Host crash during SMT change with guest emulator thread cpupinned ""~~ Power9: Host crash during SMT change with guest emulator thread cpupinned "Oops: Kernel access of bad area, sig: 11 [#1]" Sep 27, 2017

sathnaga changed the title ~~Power9: Host crash during SMT change with guest emulator thread cpupinned "Oops: Kernel access of bad area, sig: 11 [#1]"~~ Power9: Host crash during SMT change with guest emulator thread pinned "Oops: Kernel access of bad area, sig: 11 [#1]" Sep 27, 2017

sathnaga closed this as completed Jan 1, 2018

sathnaga reopened this Jan 1, 2018

sathnaga closed this as completed Jan 1, 2018
