RH7: PCI: hv: Fix the affinity setting for the NVMe crash #713
When cpumask_equal(mask, cpu_online_mask) == false, "mask" may be a
superset of "cfg->domain", and the real affinity is still saved in
"cfg->domain" after __ioapic_set_affinity() returns. See the line
"cpumask_copy(cfg->domain, tmp_mask);" in the RHEL 7.x kernel function
__assign_irq_vector().
So we should always use "cfg->domain"; otherwise the NVMe driver may
fail to receive the expected interrupt, and later the buggy error
handling code in nvme_dev_disable() can cause the panic below:
[ 71.695565] nvme nvme7: I/O 19 QID 0 timeout, disable controller
[ 71.724221] ------------[ cut here ]------------
[ 71.725067] WARNING: CPU: 4 PID: 11317 at kernel/irq/manage.c:1348 __free_irq+0xb3/0x280
[ 71.725067] Trying to free already-free IRQ 226
[ 71.725067] Modules linked in: ...
[ 71.725067] CPU: 4 PID: 11317 Comm: kworker/4:1H Tainted: G OE ------------ T 3.10.0-957.10.1.el7.x86_64 #1
[ 71.725067] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 05/18/2018
[ 71.725067] Workqueue: kblockd blk_mq_timeout_work
[ 71.725067] Call Trace:
[ 71.725067] [] dump_stack+0x19/0x1b
[ 71.725067] [] __warn+0xd8/0x100
[ 71.725067] [] warn_slowpath_fmt+0x5f/0x80
[ 71.725067] [] __free_irq+0xb3/0x280
[ 71.725067] [] free_irq+0x39/0x90
[ 71.725067] [] nvme_dev_disable+0x11c/0x4b0 [nvme]
[ 71.725067] [] ? dev_warn+0x6c/0x90
[ 71.725067] [] nvme_timeout+0x204/0x2d0 [nvme]
[ 71.725067] [] ? blk_mq_do_dispatch_sched+0x9d/0x130
[ 71.725067] [] ? update_curr+0x14c/0x1e0
[ 71.725067] [] blk_mq_rq_timed_out+0x32/0x80
[ 71.725067] [] blk_mq_check_expired+0x5c/0x60
[ 71.725067] [] bt_iter+0x54/0x60
[ 71.725067] [] blk_mq_queue_tag_busy_iter+0x11b/0x290
[ 71.725067] [] ? blk_mq_rq_timed_out+0x80/0x80
[ 71.725067] [] ? blk_mq_rq_timed_out+0x80/0x80
[ 71.725067] [] blk_mq_timeout_work+0x8b/0x180
[ 71.725067] [] process_one_work+0x17f/0x440
[ 71.725067] [] worker_thread+0x126/0x3c0
[ 71.725067] [] ? manage_workers.isra.25+0x2a0/0x2a0
[ 71.725067] [] kthread+0xd1/0xe0
[ 71.725067] [] ? insert_kthread_work+0x40/0x40
[ 71.725067] [] ret_from_fork_nospec_begin+0xe/0x21
[ 71.725067] [] ? insert_kthread_work+0x40/0x40
[ 71.725067] ---[ end trace b3257623bc50d02a ]---
[ 72.196556] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[ 72.211013] IP: [] free_irq+0x39/0x90
It looks like the bug is more easily triggered when the VM has many
vCPUs, e.g. the L64v2 or L80v2 VM sizes. Presumably, in such a VM, the
NVMe driver can pass a "mask" that has multiple bits set but is not
equal to "cpu_online_mask". Previously we incorrectly assumed that the
"mask" either contained only one set bit or was equal to
"cpu_online_mask".
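For illustration only (this is not the actual pci-hyperv change), the
minimal sketch below shows the idea: when the driver translates an
IRQ's affinity into the CPU set it reports to the host, it should walk
"cfg->domain" rather than the caller-supplied "mask". The function name
hv_fill_int_target_sketch() and the u64 output format are hypothetical;
only struct irq_cfg, cfg->domain, and the cpumask helpers are taken
from the RHEL 7.x (3.10-based) kernel.

```c
/*
 * Illustrative sketch only -- not the actual RH7 pci-hyperv patch.
 * Assumptions: a RHEL 7.x (3.10-based) x86 kernel where struct irq_cfg
 * has a cpumask_var_t "domain" member that __assign_irq_vector() fills
 * in via cpumask_copy(cfg->domain, tmp_mask); the function name and
 * the u64 output below are hypothetical and simplified.
 */
#include <linux/cpumask.h>
#include <asm/hw_irq.h>		/* struct irq_cfg on x86 */

static void hv_fill_int_target_sketch(struct irq_cfg *cfg,
				      const struct cpumask *mask, /* caller-supplied,
								     intentionally unused */
				      u64 *vcpu_mask)             /* hypothetical output */
{
	int cpu;

	/*
	 * Old, buggy idea: trust "mask" (or special-case
	 * cpumask_equal(mask, cpu_online_mask)).  When the caller passes
	 * a mask with several bits set that is not cpu_online_mask,
	 * "mask" can be a strict superset of cfg->domain, so the host
	 * may be told to target a vCPU the vector was never assigned
	 * to, and the NVMe interrupt is lost.
	 *
	 * Fix: always walk cfg->domain, which holds the affinity that
	 * __ioapic_set_affinity()/__assign_irq_vector() actually
	 * programmed.
	 */
	*vcpu_mask = 0;
	for_each_cpu_and(cpu, cfg->domain, cpu_online_mask)
		*vcpu_mask |= 1ULL << cpu;	/* simplified: assumes <= 64 vCPUs */
}
```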
Fixes: 9c8bbae ("RH7: PCI: hv: respect the affinity setting")
Signed-off-by: Dexuan Cui decui@microsoft.com