PCI passthrough: Frozen PE / EEH recovery happens in the host if driver is loaded after the guest is shutdown and device is reattached to the host #11

Open
mfoliveira opened this issue Feb 9, 2017 · 15 comments
@mfoliveira

mfoliveira commented Feb 9, 2017

Scenario: PCI passthrough of the SAS3008-based PCIe adapter in the 8001-22C system.

# lspci -nnv -s 1:3:0.0 | head -n2
0001:03:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 [1000:0097] (rev 02)
Subsystem: Super Micro Computer Inc Device [15d9:0808]

Steps to reproduce:

  1. Host: detach the adapter (virsh nodedev-detach pci_0001_03_00_0)
  2. Host: start a guest with PCI passthrough (virsh start --console <guest>)
  3. Guest: load the driver (initializes the adapter, scans for disks, etc) (modprobe mpt3sas)
  4. Guest: shutdown (poweroff)
  5. Host: reattach the adapter (virsh nodedev-reattach pci_0001_03_00_0)
  6. Host: load the driver (starts to init the adapter and hits Frozen PE/EEH recovery) (modprobe mpt3sas)
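The steps above can be sketched as a dry-run plan. This is only a minimal sketch: it builds and prints the command lines from the steps and executes nothing. The function name `passthrough_repro_plan` and the guest name `guest1` are placeholders, not part of the report.

```python
import shlex


def passthrough_repro_plan(nodedev, guest, driver="mpt3sas"):
    """Return (where, argv) pairs mirroring the six steps above."""
    return [
        ("host", ["virsh", "nodedev-detach", nodedev]),
        ("host", ["virsh", "start", "--console", guest]),
        ("guest", ["modprobe", driver]),
        ("guest", ["poweroff"]),
        ("host", ["virsh", "nodedev-reattach", nodedev]),
        # The Frozen PE / EEH recovery is reported during this last step.
        ("host", ["modprobe", driver]),
    ]


if __name__ == "__main__":
    # "guest1" is a hypothetical guest name used only for illustration.
    for where, argv in passthrough_repro_plan("pci_0001_03_00_0", "guest1"):
        print(f"{where}: {shlex.join(argv)}")
```

Keeping the plan as data makes it easy to log which step was running when the EEH showed up.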

During driver initialization the following Frozen PE / EEH recovery is consistently observed.
There is an Oops in the driver code afterward, but that's another problem which I'll be looking at.

Decoding the PEST bits shows this is a DMA write with an invalid page access. The suspicion is that operations or configuration from the guest are still pending, and since the PE was not reset in a way that actually clears them on this adapter, the problem is hit.

If so, this problem is expected to be resolved by the patch series which was applied downstream on PowerKVM [1], and which is now being reworked as a VFIO-based approach by @aik.

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-February/124867.html

[  759.825059] mpt3sas 0001:03:00.0: enabling device (0400 -> 0402)
[  759.825165] mpt3sas 0001:03:00.0: Using 64-bit DMA iommu bypass
[  759.825223] mpt3sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (535679552 kB)
[  759.882919] mpt3sas_cm0: MSI-X vectors supported: 96, no of cores: 16, max_msix_vectors: -1
[  759.883772] mpt3sas0-msix0: PCI-MSI-X enabled: IRQ 706
[  759.883819] mpt3sas0-msix1: PCI-MSI-X enabled: IRQ 707
[  759.883863] mpt3sas0-msix2: PCI-MSI-X enabled: IRQ 708
[  759.883906] mpt3sas0-msix3: PCI-MSI-X enabled: IRQ 709
[  759.883949] mpt3sas0-msix4: PCI-MSI-X enabled: IRQ 710
[  759.883993] mpt3sas0-msix5: PCI-MSI-X enabled: IRQ 711
[  759.884035] mpt3sas0-msix6: PCI-MSI-X enabled: IRQ 712
[  759.884080] mpt3sas0-msix7: PCI-MSI-X enabled: IRQ 713
[  759.884123] mpt3sas0-msix8: PCI-MSI-X enabled: IRQ 714
[  759.884166] mpt3sas0-msix9: PCI-MSI-X enabled: IRQ 715
[  759.884210] mpt3sas0-msix10: PCI-MSI-X enabled: IRQ 716
[  759.884297] mpt3sas0-msix11: PCI-MSI-X enabled: IRQ 717
[  759.884339] mpt3sas0-msix12: PCI-MSI-X enabled: IRQ 718
[  759.884382] mpt3sas0-msix13: PCI-MSI-X enabled: IRQ 719
[  759.884427] mpt3sas0-msix14: PCI-MSI-X enabled: IRQ 720
[  759.884471] mpt3sas0-msix15: PCI-MSI-X enabled: IRQ 721
[  759.884516] mpt3sas_cm0: iomem(0x00003fe080140000), mapped(0xd0000800810a0000), size(65536)
[  759.884582] mpt3sas_cm0: ioport(0x0000000000000000), size(0)
[  759.975501] mpt3sas_cm0: Allocated physical memory: size(8887 kB)
[  759.975563] mpt3sas_cm0: Current Controller Queue Depth(2936),Max Controller Queue Depth(3072)
[  759.975636] mpt3sas_cm0: Scatter Gather Elements per IO(128)
[  760.021015] EEH: Frozen PE#fd on PHB#1 detected
[  760.021106] EEH: PE location: PLX Slot1, PHB location: N/A
[  760.021873] EEH: This PCI device has failed 1 times in the last hour
[  760.021927] EEH: Notify device drivers to shutdown
[  760.021970] mpt3sas_cm0: PCI error: detected callback, state(2)!!
[  760.022317] EEH: Collect temporary log
[  760.022378] EEH: of node=0001:03:00.0
[  760.022414] EEH: PCI device/vendor: 00971000
[  760.022461] EEH: PCI cmd/status register: 00180142
[  760.022503] EEH: PCI-E capabilities and status follow:
[  760.022558] EEH: PCI-E 00: 0002a810 10008025 0000281e 00415083 
[  760.022620] EEH: PCI-E 10: 10830000 00000000 00000000 00000000 
[  760.022675] EEH: PCI-E 20: 00000000 
[  760.022706] EEH: PCI-E AER capability register set follows:
[  760.022758] EEH: PCI-E AER 00: 1e020001 00000000 00000000 00462031 
[  760.022821] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000 
[  760.022881] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
[  760.022935] EEH: PCI-E AER 30: 00000000 00000000 
[  760.022979] PHB3 PHB#1 Diag-data (Version: 1)
[  760.023022] brdgCtl:     00000002
[  760.023059] RootSts:     0000000f 00400000 b0830008 00100147 00002000
[  760.023112] PhbSts:      0000001c00000000 0000001c00000000
[  760.023156] Lem:         0000000004000000 42498e367f502eae 0000000000000000
[  760.023210] InAErr:      0000000000004000 0000000000004000 00000000612400fd 04000000000000fd
[  760.023284] PE[253] A/B: 8000302500000000 8000000061240000
[  760.023325] EEH: Reset without hotplug activity
[  762.174778] EEH: Notify device drivers the completion of reset
[  762.174860] mpt3sas_cm0: PCI error: slot reset callback!!
[  762.174985] mpt3sas 0001:03:00.0: Using 64-bit DMA iommu bypass
[  762.175044] mpt3sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (535679552 kB)
[  762.232259] mpt3sas_cm0: MSI-X vectors supported: 96, no of cores: 16, max_msix_vectors: -1
[  762.233046] mpt3sas0-msix0: PCI-MSI-X enabled: IRQ 706
[  762.233091] mpt3sas0-msix1: PCI-MSI-X enabled: IRQ 707
[  762.233135] mpt3sas0-msix2: PCI-MSI-X enabled: IRQ 708
[  762.233179] mpt3sas0-msix3: PCI-MSI-X enabled: IRQ 709
[  762.233223] mpt3sas0-msix4: PCI-MSI-X enabled: IRQ 710
[  762.233266] mpt3sas0-msix5: PCI-MSI-X enabled: IRQ 711
[  762.233309] mpt3sas0-msix6: PCI-MSI-X enabled: IRQ 712
[  762.233352] mpt3sas0-msix7: PCI-MSI-X enabled: IRQ 713
[  762.233395] mpt3sas0-msix8: PCI-MSI-X enabled: IRQ 714
[  762.233439] mpt3sas0-msix9: PCI-MSI-X enabled: IRQ 715
[  762.233482] mpt3sas0-msix10: PCI-MSI-X enabled: IRQ 716
[  762.233525] mpt3sas0-msix11: PCI-MSI-X enabled: IRQ 717
[  762.233569] mpt3sas0-msix12: PCI-MSI-X enabled: IRQ 718
[  762.233612] mpt3sas0-msix13: PCI-MSI-X enabled: IRQ 719
[  762.233656] mpt3sas0-msix14: PCI-MSI-X enabled: IRQ 720
[  762.233699] mpt3sas0-msix15: PCI-MSI-X enabled: IRQ 721
[  762.233743] mpt3sas_cm0: iomem(0x00003fe080140000), mapped(0xd0000800813b0000), size(65536)
[  762.233806] mpt3sas_cm0: ioport(0x0000000000000000), size(0)
[  762.234135] mpt3sas_cm0: _base_event_notification: timeout
[  762.234182] mf:
[  762.234204] 	07000000 00000000 00000000 00000000 00000000 0f2f7fff ffffff7c ffffffff
[  762.234339] 	ffffffff 00000000 00000000
[  762.236160] Unable to handle kernel paging request for data at address 0xd0000800813b0030
[  762.236230] Faulting instruction address: 0xd000000031fb072c
[  762.236286] Oops: Kernel access of bad area, sig: 11 [#1]
[  762.236329] SMP NR_CPUS=1024 NUMA PowerNV
[  762.236399] Modules linked in: mpt3sas raid_class scsi_transport_sas vhost_net vhost macvtap macvlan ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c at24 nvmem_core ofpart ipmi_powernv powernv_flash ipmi_msghandler opal_prd mtd i2c_opal kvm_hv nfsd kvm_pr auth_rpcgss oid_registry nfs_acl lockd kvm grace sunrpc joydev ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops i40e ttm ixgbe mdio ptp drm pps_core i2c_core [last unloaded: raid_class]
[  762.237373] CPU: 8 PID: 779 Comm: eehd Tainted: G        W       4.9.0-4.el7.centos.ppc64le #1
[  762.237448] task: c000003fcf301500 task.stack: c000003fcf384000
[  762.237501] NIP: d000000031fb072c LR: d000000031fb070c CTR: c000000000115490
[  762.237564] REGS: c000003fcf3874b0 TRAP: 0300   Tainted: G        W        (4.9.0-4.el7.centos.ppc64le)
[  762.237638] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002084  XER: 20000000
[  762.237844] CFAR: c000000000a276a8 DAR: d0000800813b0030 DSISR: 40000000 SOFTE: 1 
GPR00: d000000031fb070c c000003fcf387730 d000000031fef390 d0000800813b0030 
GPR04: c000003fcf301500 0000000003fde404 00000060e3c47241 0000000000000000 
GPR08: c000003fed20ed00 d0000800813b0000 0000000000000000 00000000ffffffff 
GPR12: 0000000000002200 c00000000fdc4800 c0000000000fbd18 c000007949100040 
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR24: c000003fcf387920 0000000000000003 0000000000000005 0000000040000000 
GPR28: 0000000000001388 00000000c0000000 c000001f04e84810 0000000000000001 
NIP [d000000031fb072c] _base_wait_for_doorbell_ack+0x8c/0x1f0 [mpt3sas]
[  762.238822] LR [d000000031fb070c] _base_wait_for_doorbell_ack+0x6c/0x1f0 [mpt3sas]
[  762.238886] Call Trace:
[  762.238914] [c000003fcf387730] [d000000031fb070c] _base_wait_for_doorbell_ack+0x6c/0x1f0 [mpt3sas] (unreliable)
[  762.239015] [c000003fcf3877c0] [d000000031fb1c6c] _base_handshake_req_reply_wait+0x15c/0x7e0 [mpt3sas]
[  762.243871] [c000003fcf387880] [d000000031fb689c] _base_get_ioc_facts+0x10c/0x460 [mpt3sas]
[  762.250568] mpt3sas_cm0: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:8830/_scsih_probe()!
[  762.260515] [c000003fcf387950] [d000000031fb96d8] mpt3sas_base_hard_reset_handler+0x2c8/0x600 [mpt3sas]
[  762.270219] [c000003fcf387a30] [d000000031fbeba4] scsih_pci_slot_reset+0xa4/0x100 [mpt3sas]
[  762.278537] [c000003fcf387ab0] [c000000000042d48] eeh_report_reset+0x128/0x170
[  762.285474] [c000003fcf387b00] [c000000000041128] eeh_pe_dev_traverse+0x98/0x170
[  762.292412] [c000003fcf387b90] [c00000000004347c] eeh_handle_normal_event+0x3ec/0x510
[  762.300722] [c000003fcf387c30] [c000000000043858] eeh_handle_event+0x178/0x360
[  762.307665] [c000003fcf387ce0] [c000000000043bf8] eeh_event_handler+0x1b8/0x1c0
[  762.314598] [c000003fcf387d80] [c0000000000fbe20] kthread+0x110/0x130
[  762.321520] [c000003fcf387e30] [c00000000000c360] ret_from_kernel_thread+0x5c/0x7c
[  762.328465] Instruction dump:
[  762.332603] 40820074 386003e8 388005dc 48028219 e8410018 393f0001 7f9c4840 793f0020 
[  762.339539] 41de010c e93e00a8 38690030 7c0004ac <81290030> 0c090000 4c00012c 2f89ffff 
[  762.430835] ---[ end trace ee34b74dd6657653 ]---
[  762.430881] 
$ ./pest 8000302500000000 8000000061240000
Transaction type: DMA Write
TCE Page Fault
TCE Access Fault
LEM Bit Number 37
Requestor 0:0.0
MSI Data 0x0000
Fault Address = 0x0000000061240000
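
The log prints the frozen PE in hex (`PE#fd`) and the diag-data entry in decimal (`PE[253]`); these are the same PE. A minimal sketch for pulling both, plus the two PEST words passed to the `pest` tool, out of dmesg text follows. The regular expressions are written against the log lines pasted above only; the function name `parse_eeh` is made up for illustration, and treating the PHB number as hex is an assumption.

```python
import re

FROZEN_RE = re.compile(r"EEH: Frozen PE#([0-9a-fA-F]+) on PHB#([0-9a-fA-F]+)")
PEST_RE = re.compile(
    r"PE\[\s*(\d+)\] A/B:\s*([0-9a-fA-F]{16})\s+([0-9a-fA-F]{16})"
)


def parse_eeh(log_text):
    """Extract the frozen PE number and the PEST A/B words from dmesg text."""
    frozen = FROZEN_RE.search(log_text)
    pest = PEST_RE.search(log_text)
    if not frozen or not pest:
        return None
    return {
        "frozen_pe": int(frozen.group(1), 16),  # printed in hex: PE#fd
        "phb": int(frozen.group(2), 16),        # assumed hex, like the PE number
        "diag_pe": int(pest.group(1)),          # printed in decimal: PE[253]
        "pest_a": pest.group(2),
        "pest_b": pest.group(3),
    }


if __name__ == "__main__":
    sample = (
        "[  760.021015] EEH: Frozen PE#fd on PHB#1 detected\n"
        "[  760.023284] PE[253] A/B: 8000302500000000 8000000061240000\n"
    )
    info = parse_eeh(sample)
    print(f"./pest {info['pest_a']} {info['pest_b']}")
```

The sketch does not decode the PEST bits itself; that is left to the `pest` tool shown above.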
@rmatinata-ibm
Member

rmatinata-ibm commented Feb 9, 2017

@aik Can you please take a look at this as soon as possible? Thank you!
@paulusmack @laggarcia @bjking1 @sgarfinkle FYI

@aik

aik commented Feb 10, 2017

On my local setup it is enough to boot the host without the mpt3sas driver and then simply do "modprobe mpt3sas" - there is an EEH exactly as reported here. Will continue on Monday...

@mfoliveira
Author

mfoliveira commented Feb 10, 2017 via email

@aik

aik commented Feb 12, 2017

Yes please send the patch. Thanks.

ps. I just cannot make mpt3sas load on the upstream kernel at all. hm.

@mfoliveira
Author

@aik just sent via e-mail.

@aik

aik commented Feb 14, 2017

Thanks. Did not help though, it still crashes, slightly different, the net result is the same - mpt3sas does not bind to the device.

update. Turns out multilevel TCE tables for 32bit DMA do not work properly. Hm. Disabled them and can proceed.

How much RAM does the guest in the test get?

@aik

aik commented Feb 14, 2017

In the meantime, could you try patching QEMU like this?

diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index 9d090270f6..80a5d0e3dd 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -166,7 +166,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
     entries = create.window_size >> create.page_shift;
     pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
     pages = MAX(pow2ceil(pages) - 1, 1); /* Round up */
-    create.levels = ctz64(pages) / 6 + 1;
+    create.levels = 1;//ctz64(pages) / 6 + 1;
 
     ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
     if (ret) {

@mfoliveira
Author

@aik

> Thanks. Did not help though, it still crashes, slightly different, the net result is the same - mpt3sas does not bind to the device.

Surprised; this patch fixed this problem for us in several tests.
Can you please send me the Oops log? (at least stack trace + NIP/LR)

> update. Turns out multilevel TCE tables for 32bit DMA do not work properly. Hm. Disabled them and can proceed.

Cool. But wasn't this adapter/driver doing 64-bit DMA?
And, please, where can I find out more about multilevel TCE tables? (source files are OK; if there are docs, those are more than welcome).

> How much RAM does the guest in the test get?

32 GiB.

> In the meantime, could you try patching QEMU like this?

Yes, I'll set up a local box to try it (the original one is not accessible in a client's network).
It might take a while but I'll do it.

Thanks!

@aik

aik commented Feb 15, 2017

> Surprised; this patch fixed this problem for us in several tests. Can you please send me the Oops log? (at least stack trace + NIP/LR)

It does not make much sense. In fact, everything trying to use 4-level TCE tables fails with EEH, while 3 levels are OK. It has not been noticed so far because 32bit windows use only 1 level by default and most devices are 64bit only anyway. Only my test branch exposed the problem, which seems to be unrelated to what this bug is about.

> Cool. But wasn't this adapter/driver doing 64-bit DMA?

It is using 32bit for the coherent mask and 64bit for the noncoherent one, with different DMA pages for different purposes.

> And, please, where can I find out more about multilevel TCE tables? (source files are OK; if there are docs, those are more than welcome).

The IODA spec describes it in "Multi-level table TCE Fetching".

>> In the meantime, could you try patching QEMU like this?
>
> Yes, I'll set up a local box to try it (the original one is not accessible in a client's network).
> It might take a while but I'll do it.

Never mind, QEMU picks levels=1 for 32GB anyway so it won't make a difference.

For now, please try this particular patch on the host kernel:
aik@cbd0c45
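
The "levels=1 for 32GB" remark can be checked by redoing the arithmetic from the QEMU hunk quoted earlier in this thread. The sketch below is a Python transcription of those four lines only, not QEMU code. It assumes a 64 KiB IOMMU page (`page_shift=16`) and a 64 KiB host page size for `getpagesize()`; both values are assumptions for illustration, and `pow2ceil`/`ctz64` are local re-implementations of the helpers named in the hunk.

```python
def pow2ceil(n):
    """Smallest power of two >= n, for n >= 1."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def ctz64(n):
    """Count trailing zero bits; 64 when n is 0."""
    return 64 if n == 0 else (n & -n).bit_length() - 1


def tce_levels(window_size, page_shift=16, host_page_size=65536):
    """Mirror the levels computation from the quoted hunk (before the patch)."""
    entries = window_size >> page_shift
    pages = max((entries * 8) // host_page_size, 1)  # sizeof(uint64_t) == 8
    pages = max(pow2ceil(pages) - 1, 1)              # "Round up", as quoted
    return ctz64(pages) // 6 + 1


if __name__ == "__main__":
    print(tce_levels(32 << 30))  # window sized for a 32 GiB guest
```

With the rounding exactly as quoted, `pow2ceil(pages) - 1` is always odd (or clamped to 1), so this transcription returns one level for any window size. Treat that as an observation about the excerpt, not a statement about QEMU as a whole.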

@mfoliveira
Author

Hi @aik

>> Surprised; this patch fixed this problem for us in several tests. Can you please send me the Oops log? (at least stack trace + NIP/LR)
>
> Does not make much sense, in fact everything trying to use 4-level TCE tables fails with EEH [snip]

Okay, but I didn't say the patch fixes the Frozen PHB problem, only the Oops in the mpt3sas driver's slot-reset hook -- during the respective EEH recovery. :- )
If you still hit that Oops during EEH recovery there, I'd be interested in the stack trace/NIP/LR in order to improve the patch to catch more cases, please.

>> Cool. But wasn't this adapter/driver doing 64-bit DMA?
>
> It is using 32bit for coherent mask and 64bit for noncoherent, different DMA pages for different purposes.

Ah.

>> And, please, where can I find out more about multilevel TCE tables? (source files are OK; if there are docs, those are more than welcome).
>
> IODA spec describes it in "Multi-level table TCE Fetching".

Cool, thanks!

>> In the meantime, could you try patching QEMU like this?
>> [snip]
>
> Never mind, QEMU picks levels=1 for 32GB anyway so it won't make a difference.

Ack.

> For now, please try this particular patch on the host kernel:
> aik/linux@cbd0c45

Sure; posting results soon.

Thank you.

@aik

aik commented Feb 27, 2017

Any luck?

@mfoliveira
Author

> Any luck?

Sorry, should have posted news earlier.

While checking this patch I noticed there's something 'different' (a problem) happening in the guest, so I've been trying to confirm whether it's due to this patch, a regression between the 4.9 and the 4.10 kernel, a misbuilt qemu, or something else.

I can tell you that I no longer see this original problem in the host (very good news, thank you very much for the patch!), but I guess we cannot confirm it's all good until that other problem is understood.

I should return to this task today.

Thanks!

@mfoliveira
Author

Er, couldn't get to it today, sorry. Planning for tomorrow / Tuesday.

@mfoliveira
Author

So, it seems there's a regression w/ the 4.10 kernel in HostOS (without this patch applied) which produces adapter firmware faults in the PCI passthrough mode. This problem didn't happen w/ the 4.9 kernel.

I'll rebuild the 4.9 kernel w/ your patch, in order to validate it properly. Then track down this regression.

Sorry for the delay with this one.

@mfoliveira
Author

@aik

Your patch resolved the problem.
Tested 3 times, no errors (the error occurred every single time without this patch).

Kernel package version used for comparison:

# uname -r
4.10.0-5.gitb0bad18.el7.centos.ppc64le

The regression I mentioned is present in the original/unpatched kernel, and is likely a VFIO thing.

Thank you.

cuinutanix pushed a commit to NXPower/linux that referenced this issue May 4, 2017
[ Upstream commit 45caeaa ]

As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/IPv6 dccp patches as well.

We have seen a few incidents lately where a dst_entry has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

The freed dst_entry was found here:

static bool tcp_in_quickack_mode(struct sock *sk)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);
	const struct dst_entry *dst = __sk_dst_get(sk);

	return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
		(icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
}

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry is added for each such host, which
creates more dst_entries with lower reference counts and makes this more
probable.

- All vmcores showed a positive LockDroppedIcmps value, e.g.:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_cache can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this, skip do_redirect() if user space has the socket locked. Instead, let
the redirect take place later when user space does not have the socket
locked.
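
The race and the guard described above can be illustrated with a toy reference-count model. This is not kernel code: `DstEntry`, `Sock`, `icmp_redirect` and `run` are made-up names, the interleaving is scripted by hand instead of using real concurrency, and the deferred redirect ("later") is left out. It only shows how two releases on behalf of one socket free an entry that another socket still points to, and how skipping the redirect while the socket is owned avoids that.

```python
class DstEntry:
    """Toy route cache entry with a reference count."""

    def __init__(self, refcnt):
        self.refcnt = refcnt
        self.freed = False

    def release(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.freed = True


class Sock:
    """Toy socket holding one reference to a DstEntry."""

    def __init__(self, dst):
        self.dst_cache = dst
        self.owned_by_user = False


def icmp_redirect(sk, skip_if_owned):
    """Softirq side: drop the cached route unless the guard skips it."""
    if skip_if_owned and sk.owned_by_user:
        return False  # leave the redirect for later
    dst, sk.dst_cache = sk.dst_cache, None
    if dst is not None:
        dst.release()
    return True


def run(skip_if_owned):
    """Return True if socket b ends up pointing at a freed entry."""
    dst = DstEntry(refcnt=2)         # one reference each for sockets a and b
    a, b = Sock(dst), Sock(dst)
    a.owned_by_user = True           # user context holds a's lock...
    stale = a.dst_cache              # ...and has already read the pointer
    icmp_redirect(a, skip_if_owned)  # the redirect arrives in the middle
    a.dst_cache = None
    stale.release()                  # user context drops "its" reference
    a.owned_by_user = False
    return b.dst_cache.freed


if __name__ == "__main__":
    print("unguarded, b sees freed entry:", run(skip_if_owned=False))
    print("guarded,   b sees freed entry:", run(skip_if_owned=True))
```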

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out, the following commit now invalidates routes, which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
paulusmack pushed a commit that referenced this issue May 17, 2017
commit 45caeaa upstream.

paulusmack pushed a commit that referenced this issue May 17, 2017
commit 4dfce57 upstream.

There have been several reports over the years of NULL pointer
dereferences in xfs_trans_log_inode during xfs_fsr processes,
when the process is doing an fput and tearing down extents
on the temporary inode, something like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
    [exception RIP: xfs_trans_log_inode+0x10]
 #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
#10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
#11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
#12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
#13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
#14 [ffff8800a57bbe00] evict at ffffffff811e1b67
#15 [ffff8800a57bbe28] iput at ffffffff811e23a5
#16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
#17 [ffff8800a57bbe88] dput at ffffffff811dd06c
#18 [ffff8800a57bbea8] __fput at ffffffff811c823b
#19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
#20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
#21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
#22 [ffff8800a57bbf50] int_signal at ffffffff8161405d

As it turns out, this is because the i_itemp pointer, along
with the d_ops pointer, has been overwritten with zeros
when we tear down the extents during truncate.  When the in-core
inode fork on the temporary inode used by xfs_fsr was originally
set up during the extent swap, we mistakenly looked at di_nextents
to determine whether all extents fit inline, but this misses extents
generated by speculative preallocation; we should be using if_bytes
instead.

This mistake corrupts the in-memory inode, and code in
xfs_iext_remove_inline eventually gets bad inputs, causing
it to memmove and memset incorrect ranges; this became apparent
because the two values in ifp->if_u2.if_inline_ext[1] contained
what should have been in d_ops and i_itemp; they were memmoved due
to incorrect array indexing and then the original locations
were zeroed with memset, again due to an array overrun.

Fix this by properly using i_df.if_bytes to determine the number
of extents, not di_nextents.
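
The difference between the two counts can be shown with a small sketch. The constants here are assumptions for illustration (a 16-byte in-core extent record and room for 2 inline extents), and the function names are made up; only the idea of deriving the count from `if_bytes` instead of `di_nextents` comes from the commit message.

```python
XFS_BMBT_REC_SIZE = 16  # assumed: two 64-bit words per extent record
XFS_INLINE_EXTS = 2     # assumed capacity of the inline extent array


def incore_extent_count(if_bytes):
    """Number of extents actually held in the in-core fork."""
    return if_bytes // XFS_BMBT_REC_SIZE


def fits_inline(nextents):
    """True if that many extents fit in the inline array."""
    return nextents <= XFS_INLINE_EXTS


if __name__ == "__main__":
    di_nextents = 2                  # on-disk count
    if_bytes = 3 * XFS_BMBT_REC_SIZE  # in-core fork also holds a preallocated extent
    print("by di_nextents:", fits_inline(di_nextents))
    print("by if_bytes:   ", fits_inline(incore_extent_count(if_bytes)))
```

In this made-up case the on-disk count says the extents fit inline while the in-core size says they do not, which is the mismatch the commit message describes.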

Thanks to dchinner for looking at this with me and spotting the
root cause.

[nborisov: backported to 4.4]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
paulusmack pushed a commit that referenced this issue Nov 8, 2017
Thomas reported that 'perf buildid-list' gets a SEGFAULT due to a NULL
pointer deref when he ran it on data with namespace events.  It was
because build_id__mark_dso_hit_ops lacks the namespace event
handler and perf_tool__fill_defaults() didn't set it.

  Program received signal SIGSEGV, Segmentation fault.
  0x0000000000000000 in ?? ()
  Missing separate debuginfos, use: dnf debuginfo-install audit-libs-2.7.7-1.fc25.s390x bzip2-libs-1.0.6-21.fc25.s390x elfutils-libelf-0.169-1.fc25.s390x
  +elfutils-libs-0.169-1.fc25.s390x libcap-ng-0.7.8-1.fc25.s390x numactl-libs-2.0.11-2.ibm.fc25.s390x openssl-libs-1.1.0e-1.1.ibm.fc25.s390x perl-libs-5.24.1-386.fc25.s390x
  +python-libs-2.7.13-2.fc25.s390x slang-2.3.0-7.fc25.s390x xz-libs-5.2.3-2.fc25.s390x zlib-1.2.8-10.fc25.s390x
  (gdb) where
  #0  0x0000000000000000 in ?? ()
  #1  0x00000000010fad6a in machines__deliver_event (machines=<optimized out>, machines@entry=0x2c6fd18,
      evlist=<optimized out>, event=event@entry=0x3fffdf00470, sample=0x3ffffffe880, sample@entry=0x3ffffffe888,
      tool=tool@entry=0x1312968 <build_id.mark_dso_hit_ops>, file_offset=1136) at util/session.c:1287
  #2  0x00000000010fbf4e in perf_session__deliver_event (file_offset=1136, tool=0x1312968 <build_id.mark_dso_hit_ops>,
      sample=0x3ffffffe888, event=0x3fffdf00470, session=0x2c6fc30) at util/session.c:1340
  #3  perf_session__process_event (session=0x2c6fc30, session@entry=0x0, event=event@entry=0x3fffdf00470,
      file_offset=file_offset@entry=1136) at util/session.c:1522
  #4  0x00000000010fddde in __perf_session__process_events (file_size=11880, data_size=<optimized out>,
      data_offset=<optimized out>, session=0x0) at util/session.c:1899
  #5  perf_session__process_events (session=0x0, session@entry=0x2c6fc30) at util/session.c:1953
  #6  0x000000000103b2ac in perf_session__list_build_ids (with_hits=<optimized out>, force=<optimized out>)
      at builtin-buildid-list.c:83
  #7  cmd_buildid_list (argc=<optimized out>, argv=<optimized out>) at builtin-buildid-list.c:115
  #8  0x00000000010a026c in run_builtin (p=0x1311f78 <commands+24>, argc=argc@entry=2, argv=argv@entry=0x3fffffff3c0)
      at perf.c:296
  #9  0x000000000102bc00 in handle_internal_command (argv=<optimized out>, argc=2) at perf.c:348
  #10 run_argv (argcp=<synthetic pointer>, argv=<synthetic pointer>) at perf.c:392
  #11 main (argc=<optimized out>, argv=0x3fffffff3c0) at perf.c:536
  (gdb)

Fix it by adding a stub event handler for namespace event.
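
The "fill in a stub for every missing callback" pattern can be sketched as follows. This is a toy model in Python, not perf code: the callback table is a plain dict, the event kinds listed are examples, and calling `None` (a `TypeError`) stands in for the NULL function pointer call.

```python
def process_event_stub(event):
    """Default handler: accept the event and do nothing."""
    return 0


EVENT_KINDS = ("sample", "mmap", "comm", "namespaces")


def fill_defaults(tool):
    """Give every event kind a callable handler."""
    for kind in EVENT_KINDS:
        if tool.get(kind) is None:
            tool[kind] = process_event_stub
    return tool


def deliver(tool, kind, event):
    """Dispatch an event to the tool's handler for that kind."""
    return tool[kind](event)


if __name__ == "__main__":
    tool = {"sample": lambda event: 1, "namespaces": None}
    try:
        deliver(tool, "namespaces", {})
    except TypeError:
        print("missing handler: call through an unset callback failed")
    fill_defaults(tool)
    print("after fill_defaults:", deliver(tool, "namespaces", {}))
```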

Committer testing:

To clarify further, plain 'perf buildid-list' does not end in a
SEGFAULT when processing a perf.data file with namespace info:

  # perf record -a --namespaces sleep 1
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 2.024 MB perf.data (1058 samples) ]
  # perf buildid-list | wc -l
  38
  # perf buildid-list | head -5
  e2a171c7b905826fc8494f0711ba76ab6abbd604 /lib/modules/4.14.0-rc3+/build/vmlinux
  874840a02d8f8a31cedd605d0b8653145472ced3 /lib/modules/4.14.0-rc3+/kernel/arch/x86/kvm/kvm-intel.ko
  ea7223776730cd8a22f320040aae4d54312984bc /lib/modules/4.14.0-rc3+/kernel/drivers/gpu/drm/i915/i915.ko
  5961535e6732a8edb7f22b3f148bb2fa2e0be4b9 /lib/modules/4.14.0-rc3+/kernel/drivers/gpu/drm/drm.ko
  f045f54aa78cf1931cc893f78b6cbc52c72a8cb1 /usr/lib64/libc-2.25.so
  #

It is only when one asks which of those entries actually had
samples, i.e. when we use either -H or --with-hits, that we process
all the PERF_RECORD_ events. Since tools/perf/builtin-buildid-list.c
neither explicitly set a perf_tool.namespaces() callback nor had the
default stub set, processing a PERF_RECORD_NAMESPACE record ends in
a SEGFAULT:

  # perf buildid-list -H
  Segmentation fault (core dumped)
  ^C
  #

Reported-and-Tested-by: Thomas-Mich Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas-Mich Richter <tmricht@linux.vnet.ibm.com>
Fixes: f3b3614 ("perf tools: Add PERF_RECORD_NAMESPACES to include namespaces related info")
Link: http://lkml.kernel.org/r/20171017132900.11043-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
paulusmack pushed a commit that referenced this issue Apr 3, 2018
If System V shmget/shmat operations are used to create a hugetlbfs
backed mapping, it is possible to munmap part of the mapping and split
the underlying vma such that it is not huge page aligned.  This will
ultimately result in the following BUG:

  kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
  CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G         C  E 4.15.0-10-generic #11-Ubuntu
  NIP:  c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
  REGS: c000003fbcdcf810 TRAP: 0700   Tainted: G         C  E (4.15.0-10-generic)
  MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 20040000
  CFAR: c00000000036ee44 SOFTE: 1
  NIP __unmap_hugepage_range+0xa4/0x760
  LR __unmap_hugepage_range_final+0x28/0x50
  Call Trace:
    0x7115e4e00000 (unreliable)
    __unmap_hugepage_range_final+0x28/0x50
    unmap_single_vma+0x11c/0x190
    unmap_vmas+0x94/0x140
    exit_mmap+0x9c/0x1d0
    mmput+0xa8/0x1d0
    do_exit+0x360/0xc80
    do_group_exit+0x60/0x100
    SyS_exit_group+0x24/0x30
    system_call+0x58/0x6c
  ---[ end trace ee88f958a1c62605 ]---

This bug was introduced by commit 31383c6 ("mm, hugetlbfs:
introduce ->split() to vm_operations_struct").  A split function was
added to vm_operations_struct to determine if a mapping can be split.
This was mostly for device-dax and hugetlbfs mappings which have
specific alignment constraints.

Mappings initiated via shmget/shmat have their original vm_ops
overwritten with shm_vm_ops.  shm_vm_ops functions will call back to the
original vm_ops if needed.  Add such a split function to shm_vm_ops.
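
The forwarding pattern described above can be sketched in userspace as follows. This is a simplified model, not the kernel patch: `struct vma`, `struct vm_ops`, `huge_split` and `wrapper_split` are hypothetical stand-ins, and the alignment check only assumes that a huge-page-backed mapping refuses a split at an address that is not a multiple of the huge page size.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures, for illustration only. */
struct vma;

struct vm_ops {
	int (*split)(struct vma *vma, unsigned long addr);
};

struct vma {
	const struct vm_ops *ops;	/* ops currently installed (the wrapper) */
	const struct vm_ops *orig_ops;	/* ops saved before they were overwritten */
	unsigned long huge_page_size;	/* must be a power of two */
};

/* Underlying mapping: refuse a split that is not huge page aligned. */
int huge_split(struct vma *vma, unsigned long addr)
{
	if (addr & (vma->huge_page_size - 1))
		return -EINVAL;
	return 0;
}

/* Wrapper: call back to the original ops if they provide split(). */
int wrapper_split(struct vma *vma, unsigned long addr)
{
	if (vma->orig_ops && vma->orig_ops->split)
		return vma->orig_ops->split(vma, addr);
	return 0;	/* no constraint known, allow the split */
}

const struct vm_ops huge_ops = { .split = huge_split };
const struct vm_ops wrapper_ops = { .split = wrapper_split };
```

Without a `split` entry in the wrapper ops, the alignment check of the underlying mapping would never run, which is how a misaligned vma could come into existence in the scenario above.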

Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com
Fixes: 31383c6 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
liyi-ibm referenced this issue in liyi-ibm/linux Dec 6, 2018
This reverts commit e70a3aa.

This change causes use-after-free on dst->_metrics.
The crash trace looks like this:
[   97.763269] BUG: KASAN: use-after-free in ip6_mtu+0x116/0x140
[   97.769038] Read of size 4 at addr ffff881781d2cf84 by task svw_NetThreadEv/8801

[   97.777954] CPU: 76 PID: 8801 Comm: svw_NetThreadEv Not tainted 4.15.0-smp-DEV #11
[   97.777956] Hardware name: Default string Default string/Indus_QC_02, BIOS 5.46.4 03/29/2018
[   97.777957] Call Trace:
[   97.777971]  [<ffffffff895709db>] dump_stack+0x4d/0x72
[   97.777985]  [<ffffffff881651df>] print_address_description+0x6f/0x260
[   97.777997]  [<ffffffff88165747>] kasan_report+0x257/0x370
[   97.778001]  [<ffffffff894488e6>] ? ip6_mtu+0x116/0x140
[   97.778004]  [<ffffffff881658b9>] __asan_report_load4_noabort+0x19/0x20
[   97.778008]  [<ffffffff894488e6>] ip6_mtu+0x116/0x140
[   97.778013]  [<ffffffff892bb91e>] tcp_current_mss+0x12e/0x280
[   97.778016]  [<ffffffff892bb7f0>] ? tcp_mtu_to_mss+0x2d0/0x2d0
[   97.778022]  [<ffffffff887b45b8>] ? depot_save_stack+0x138/0x4a0
[   97.778037]  [<ffffffff87c38985>] ? __mmdrop+0x145/0x1f0
[   97.778040]  [<ffffffff881643b1>] ? save_stack+0xb1/0xd0
[   97.778046]  [<ffffffff89264c82>] tcp_send_mss+0x22/0x220
[   97.778059]  [<ffffffff89273a49>] tcp_sendmsg_locked+0x4f9/0x39f0
[   97.778062]  [<ffffffff881642b4>] ? kasan_check_write+0x14/0x20
[   97.778066]  [<ffffffff89273550>] ? tcp_sendpage+0x60/0x60
[   97.778070]  [<ffffffff881cb359>] ? rw_copy_check_uvector+0x69/0x280
[   97.778075]  [<ffffffff8873c65f>] ? import_iovec+0x9f/0x430
[   97.778078]  [<ffffffff88164be7>] ? kasan_slab_free+0x87/0xc0
[   97.778082]  [<ffffffff8873c5c0>] ? memzero_page+0x140/0x140
[   97.778085]  [<ffffffff881642b4>] ? kasan_check_write+0x14/0x20
[   97.778088]  [<ffffffff89276f6c>] tcp_sendmsg+0x2c/0x50
[   97.778092]  [<ffffffff89276f6c>] ? tcp_sendmsg+0x2c/0x50
[   97.778098]  [<ffffffff89352d43>] inet_sendmsg+0x103/0x480
[   97.778102]  [<ffffffff89352c40>] ? inet_gso_segment+0x15b0/0x15b0
[   97.778105]  [<ffffffff890294da>] sock_sendmsg+0xba/0xf0
[   97.778108]  [<ffffffff8902ab6a>] ___sys_sendmsg+0x6ca/0x8e0
[   97.778113]  [<ffffffff87dccac1>] ? hrtimer_try_to_cancel+0x71/0x3b0
[   97.778116]  [<ffffffff8902a4a0>] ? copy_msghdr_from_user+0x3d0/0x3d0
[   97.778119]  [<ffffffff881646d1>] ? memset+0x31/0x40
[   97.778123]  [<ffffffff87a0cff5>] ? schedule_hrtimeout_range_clock+0x165/0x380
[   97.778127]  [<ffffffff87a0ce90>] ? hrtimer_nanosleep_restart+0x250/0x250
[   97.778130]  [<ffffffff87dcc700>] ? __hrtimer_init+0x180/0x180
[   97.778133]  [<ffffffff87dd1f82>] ? ktime_get_ts64+0x172/0x200
[   97.778137]  [<ffffffff8822b8ec>] ? __fget_light+0x8c/0x2f0
[   97.778141]  [<ffffffff8902d5c6>] __sys_sendmsg+0xe6/0x190
[   97.778144]  [<ffffffff8902d5c6>] ? __sys_sendmsg+0xe6/0x190
[   97.778147]  [<ffffffff8902d4e0>] ? SyS_shutdown+0x20/0x20
[   97.778152]  [<ffffffff87cd4370>] ? wake_up_q+0xe0/0xe0
[   97.778155]  [<ffffffff8902d670>] ? __sys_sendmsg+0x190/0x190
[   97.778158]  [<ffffffff8902d683>] SyS_sendmsg+0x13/0x20
[   97.778162]  [<ffffffff87a1600c>] do_syscall_64+0x2ac/0x430
[   97.778166]  [<ffffffff87c17515>] ? do_page_fault+0x35/0x3d0
[   97.778171]  [<ffffffff8960131f>] ? page_fault+0x2f/0x50
[   97.778174]  [<ffffffff89600071>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   97.778177] RIP: 0033:0x7f83fa36000d
[   97.778178] RSP: 002b:00007f83ef9229e0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[   97.778180] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f83fa36000d
[   97.778182] RDX: 0000000000004000 RSI: 00007f83ef922f00 RDI: 0000000000000036
[   97.778183] RBP: 00007f83ef923040 R08: 00007f83ef9231f8 R09: 00007f83ef923168
[   97.778184] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f83f69c5b40
[   97.778185] R13: 000000000000001c R14: 0000000000000001 R15: 0000000000004000

[   97.779684] Allocated by task 5919:
[   97.783185]  save_stack+0x46/0xd0
[   97.783187]  kasan_kmalloc+0xad/0xe0
[   97.783189]  kmem_cache_alloc_trace+0xdf/0x580
[   97.783190]  ip6_convert_metrics.isra.79+0x7e/0x190
[   97.783192]  ip6_route_info_create+0x60a/0x2480
[   97.783193]  ip6_route_add+0x1d/0x80
[   97.783195]  inet6_rtm_newroute+0xdd/0xf0
[   97.783198]  rtnetlink_rcv_msg+0x641/0xb10
[   97.783200]  netlink_rcv_skb+0x27b/0x3e0
[   97.783202]  rtnetlink_rcv+0x15/0x20
[   97.783203]  netlink_unicast+0x4be/0x720
[   97.783204]  netlink_sendmsg+0x7bc/0xbf0
[   97.783205]  sock_sendmsg+0xba/0xf0
[   97.783207]  ___sys_sendmsg+0x6ca/0x8e0
[   97.783208]  __sys_sendmsg+0xe6/0x190
[   97.783209]  SyS_sendmsg+0x13/0x20
[   97.783211]  do_syscall_64+0x2ac/0x430
[   97.783213]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

[   97.784709] Freed by task 0:
[   97.785056] knetbase: Error: /proc/sys/net/core/txcs_enable does not exist
[   97.794497]  save_stack+0x46/0xd0
[   97.794499]  kasan_slab_free+0x71/0xc0
[   97.794500]  kfree+0x7c/0xf0
[   97.794501]  fib6_info_destroy_rcu+0x24f/0x310
[   97.794504]  rcu_process_callbacks+0x38b/0x1730
[   97.794506]  __do_softirq+0x1c8/0x5d0

Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>