
RH7: PCI: hv: Fix the affinity setting for the NVMe crash #713

Merged

Conversation

dcui
Contributor

@dcui dcui commented May 22, 2019

In the case of cpumask_equal(mask, cpu_online_mask) == false, "mask" may
be a superset of "cfg->domain", and the real affinity is still saved in
"cfg->domain" after __ioapic_set_affinity() returns. See the line
"cpumask_copy(cfg->domain, tmp_mask);" in RHEL 7.x's kernel function
__assign_irq_vector().

So we should always use "cfg->domain"; otherwise the NVMe driver may
fail to receive the expected interrupt, and later the buggy error
handling code in nvme_dev_disable() can cause the panic below:

[   71.695565] nvme nvme7: I/O 19 QID 0 timeout, disable controller
[   71.724221] ------------[ cut here ]------------
[   71.725067] WARNING: CPU: 4 PID: 11317 at kernel/irq/manage.c:1348 __free_irq+0xb3/0x280
[   71.725067] Trying to free already-free IRQ 226
[   71.725067] Modules linked in: ...
[   71.725067] CPU: 4 PID: 11317 Comm: kworker/4:1H Tainted: G OE  ------------ T 3.10.0-957.10.1.el7.x86_64 #1
[   71.725067] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[   71.725067] Workqueue: kblockd blk_mq_timeout_work
[   71.725067] Call Trace:
[   71.725067]  [<ffffffff8cf62e41>] dump_stack+0x19/0x1b
[   71.725067]  [<ffffffff8c897688>] __warn+0xd8/0x100
[   71.725067]  [<ffffffff8c89770f>] warn_slowpath_fmt+0x5f/0x80
[   71.725067]  [<ffffffff8c94ac83>] __free_irq+0xb3/0x280
[   71.725067]  [<ffffffff8c94aed9>] free_irq+0x39/0x90
[   71.725067]  [<ffffffffc046b33c>] nvme_dev_disable+0x11c/0x4b0 [nvme]
[   71.725067]  [<ffffffff8cca465c>] ? dev_warn+0x6c/0x90
[   71.725067]  [<ffffffffc046bb34>] nvme_timeout+0x204/0x2d0 [nvme]
[   71.725067]  [<ffffffff8cb55c6d>] ? blk_mq_do_dispatch_sched+0x9d/0x130
[   71.725067]  [<ffffffff8c8e015c>] ? update_curr+0x14c/0x1e0
[   71.725067]  [<ffffffff8cb505a2>] blk_mq_rq_timed_out+0x32/0x80
[   71.725067]  [<ffffffff8cb5064c>] blk_mq_check_expired+0x5c/0x60
[   71.725067]  [<ffffffff8cb53924>] bt_iter+0x54/0x60
[   71.725067]  [<ffffffff8cb5425b>] blk_mq_queue_tag_busy_iter+0x11b/0x290
[   71.725067]  [<ffffffff8cb505f0>] ? blk_mq_rq_timed_out+0x80/0x80
[   71.725067]  [<ffffffff8cb505f0>] ? blk_mq_rq_timed_out+0x80/0x80
[   71.725067]  [<ffffffff8cb4f1db>] blk_mq_timeout_work+0x8b/0x180
[   71.725067]  [<ffffffff8c8b9d8f>] process_one_work+0x17f/0x440
[   71.725067]  [<ffffffff8c8bae26>] worker_thread+0x126/0x3c0
[   71.725067]  [<ffffffff8c8bad00>] ? manage_workers.isra.25+0x2a0/0x2a0
[   71.725067]  [<ffffffff8c8c1c71>] kthread+0xd1/0xe0
[   71.725067]  [<ffffffff8c8c1ba0>] ? insert_kthread_work+0x40/0x40
[   71.725067]  [<ffffffff8cf75c24>] ret_from_fork_nospec_begin+0xe/0x21
[   71.725067]  [<ffffffff8c8c1ba0>] ? insert_kthread_work+0x40/0x40
[   71.725067] ---[ end trace b3257623bc50d02a ]---
[   72.196556] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[   72.211013] IP: [<ffffffff8c94aed9>] free_irq+0x39/0x90

It looks like the bug is more easily triggered when the VM has a lot of
vCPUs, e.g. the L64v2 or L80v2 VM sizes. Presumably, in such a VM, the NVMe
driver can pass a "mask" that has multiple bits set to 1 but is not equal
to "cpu_online_mask". Previously we incorrectly assumed the "mask" either
contains only one bit set to 1 or equals "cpu_online_mask".

Fixes: 9c8bbae ("RH7: PCI: hv: respect the affinity setting")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
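
To make the mask relationship concrete, here is a small userspace model of the selection logic described above. It uses plain C with 64-bit integers standing in for cpumasks; the names, the example values, and the shape of the "buggy" branch are illustrative assumptions, not the actual LIS pci-hyperv code.

```c
/*
 * Userspace model of the affinity selection described above (not the LIS
 * pci-hyperv code; CPU sets are modeled as 64-bit masks for illustration).
 *
 * "requested" plays the role of the "mask" the NVMe driver passes in, and
 * "effective" plays the role of "cfg->domain" after __assign_irq_vector()
 * has copied the vector's real target set into it.
 */
#include <stdint.h>
#include <stdio.h>

/* Pick the lowest set bit, i.e. this model's cpumask_first(). */
static int first_cpu(uint64_t mask)
{
    for (int cpu = 0; cpu < 64; cpu++)
        if (mask & (1ULL << cpu))
            return cpu;
    return -1;
}

int main(void)
{
    uint64_t online    = 0xFFULL;   /* CPUs 0-7 online                     */
    uint64_t requested = 0x0FULL;   /* driver asks for CPUs 0-3            */
    uint64_t effective = 0x04ULL;   /* vector is actually bound to CPU 2   */

    /*
     * Buggy assumption (modeled): "requested" is either a single CPU or
     * equal to the online mask, so its first bit is a valid MSI target.
     */
    int buggy_target = (requested == online) ? first_cpu(effective)
                                             : first_cpu(requested);

    /* Fixed logic: always take the target from the effective set. */
    int fixed_target = first_cpu(effective);

    printf("buggy target: CPU %d (may never receive the interrupt)\n",
           buggy_target);
    printf("fixed target: CPU %d (matches the programmed vector)\n",
           fixed_target);
    return 0;
}
```

With these example values the buggy branch targets CPU 0, which is in the requested mask but not in the effective set, so the interrupt would never arrive there; the fixed logic targets CPU 2, the CPU the vector was actually assigned to.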
@johnsongeorge-w
Contributor

Test results look good. Merging the changes to master.

@johnsongeorge-w johnsongeorge-w merged commit 433c71a into LIS:master May 23, 2019