
Calling zio_interrupt() from vdev_disk_io_start() may cause RCU stall #3840

Closed
behlendorf opened this issue Sep 25, 2015 · 16 comments

@behlendorf
Contributor

This issue is for the remainder of the work alluded to in #3652. The vdev_disk_io_start() function needs to be updated so that it doesn't call zio_interrupt() directly in the case of an error. That work should instead be deferred to prevent the same zio from appearing on multiple taskqs concurrently.
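
For context, the error path in question looks roughly like the following (a simplified schematic, not a verbatim excerpt of vdev_disk.c):

	/*
	 * Schematic only.  vdev_disk_io_start() is already running in the
	 * context of a zio taskq thread.  Completing the zio inline via
	 * zio_interrupt() immediately re-dispatches it onto the interrupt
	 * taskq, so the same zio (and its embedded taskq_ent_t) can briefly
	 * be referenced by two taskqs.
	 */
	static void
	vdev_disk_io_start(zio_t *zio)
	{
		vdev_t *v = zio->io_vd;

		if (!vdev_readable(v)) {
			zio->io_error = SET_ERROR(ENXIO);
			zio_interrupt(zio);	/* direct call - the hazard */
			return;
		}

		/* ... normal bio submission ... */
	}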

------------[ cut here ]------------
WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: P
-- ------------   )
Hardware name: RHEV Hypervisor
list_add corruption. prev->next should be next (ffff88021495ca40), but was
ffff880123b05610. (prev=ffff880123b05610).
Modules linked in: zfs(P)(U) zcommon(P)(U) zunicode(P)(U) znvpair(P)(U)
zavl(P)(U) splat(U) spl(U) zlib_deflate nfsd nfs_acl auth_rpcgss exportfs
lockd sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter
ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack
ip6table_filter ip6_tables ipv6 sg serio_raw virtio_balloon virtio_console
virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_scsi virtio_blk sr_mod
cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror
dm_region_hash dm_log dm_mod [last unloaded: spl]
Pid: 14249, comm: z_wr_int_6 Tainted: P           -- ------------
2.6.32-573.3.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81077491>] ? warn_slowpath_common+0x91/0xe0
 [<ffffffff81077596>] ? warn_slowpath_fmt+0x46/0x60
 [<ffffffff812a38df>] ? __list_add+0x8f/0xa0
 [<ffffffffa0745ac0>] ? zio_execute+0x0/0x180 [zfs]
 [<ffffffffa021c42f>] ? taskq_dispatch_ent+0x15f/0x170 [spl]
 [<ffffffffa0745ac0>] ? zio_execute+0x0/0x180 [zfs]
 [<ffffffffa06e7a6e>] ? spa_taskq_dispatch_ent+0x7e/0xb0 [zfs]
 [<ffffffffa073f643>] ? zio_taskq_dispatch+0x73/0xa0 [zfs]
 [<ffffffffa06ff2c1>] ? vdev_disk_dio_free+0x51/0x60 [zfs]
 [<ffffffffa073f6a5>] ? zio_interrupt+0x15/0x20 [zfs]
 [<ffffffffa06ff349>] ? vdev_disk_dio_put+0x79/0x90 [zfs]      
 [<ffffffffa06ff750>] ? __vdev_disk_physio+0x3f0/0x460 [zfs]
 [<ffffffffa06ff84f>] ? vdev_disk_io_start+0x8f/0x1d0 [zfs]
 [<ffffffffa0740dcf>] ? zio_vdev_io_start+0xaf/0x300 [zfs]
 [<ffffffffa07436c5>] ? zio_nowait+0xb5/0x180 [zfs]
 [<ffffffffa06fbbeb>] ? vdev_probe+0x15b/0x240 [zfs]
 [<ffffffffa06fcdf0>] ? vdev_probe_done+0x0/0x280 [zfs]
 [<ffffffffa073fecd>] ? zio_vdev_io_done+0xdd/0x190 [zfs]
 [<ffffffffa0745b98>] ? zio_execute+0xd8/0x180 [zfs]
 [<ffffffffa021bfff>] ? taskq_thread+0x24f/0x520 [spl]
 [<ffffffff810672b0>] ? default_wake_function+0x0/0x20
 [<ffffffffa021bdb0>] ? taskq_thread+0x0/0x520 [spl]
 [<ffffffff810a101e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a0f80>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
---[ end trace d8ec92eb438a10af ]---
@bprotopopov
Contributor

@behlendorf can you please provide more detail as to the situations that might lead to a zio appearing on multiple taskqs concurrently?

@behlendorf
Contributor Author

@bprotopopov my concern was specifically with the vdev_probe_done() done callback, which does something a little strange by resubmitting the vd->vdev_probe_zio to zio_nowait(). It does this in the context of one of the taskq threads, so it's critical that taskq_thread() initializes t->tqent_list before calling the task's function.

The current code does clearly do this, so perhaps this is no longer an issue. To my knowledge we haven't had any subsequent reports of it.

@bprotopopov
Contributor

bprotopopov commented Mar 1, 2017

Hi, @behlendorf

to make sure we are on the same page, I believe you are referring to

static int
taskq_thread(void *args)
{
...
                if ((t = taskq_next_ent(tq)) != NULL) {
                        list_del_init(&t->tqent_list);
...
}

Given the list_del_init() above and the subsequently dropped tq->tq_lock, it would seem that __zio_execute() (or any other tqent_func()) runs in the context of a taskq thread unencumbered, and is free to re-dispatch the entry if it wishes to do so. One could be even more defensive by adding an INIT_LIST_HEAD(&t->tqent_list) in taskq_dispatch_ent() before adding to the lists, and/or perhaps an assert() to that effect.

I have also seen this stack on my systems a few times, but it never seemed to lead to the kind of meltdown one would expect from an entry actually showing up on two linked lists. That leads me to believe this is a warning about an uninitialized data structure that is subsequently re-initialized in a benign fashion.

What do you think?

@bprotopopov
Contributor

bprotopopov commented Mar 1, 2017

@behlendorf, I still feel like something doesn't quite fit in this picture.

Looking at taskq_dispatch_ent() and the warning above, I get the impression that the warning is reporting an invalid state of the list itself (head->prev->next != head), as opposed to an invalid state of the taskq_ent being added to it. I can only think of this happening when an entry is in fact added to two lists, and the next time we try to add, we find the (first) list in this state.

So this might be something more serious, e.g. a zio is dispatched twice in a row without going through the taskq_thread() context.
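
To make the failure mode concrete, here is a small user-space sketch of my own (a minimal doubly-linked list with the same shape as the kernel's list_head; the helper names mirror the kernel's but the bodies are simplified) showing how dispatching one entry onto two queues produces exactly this warning on a later add:

	#include <stdio.h>

	struct list_head { struct list_head *next, *prev; };

	static void list_init(struct list_head *h) { h->next = h->prev = h; }

	static void list_add_tail(struct list_head *new, struct list_head *head)
	{
		/* The CONFIG_DEBUG_LIST-style check on the destination list. */
		if (head->prev->next != head)
			printf("list_add corruption. prev->next should be next "
			    "(%p), but was %p. (prev=%p).\n", (void *)head,
			    (void *)head->prev->next, (void *)head->prev);
		new->prev = head->prev;
		new->next = head;
		head->prev->next = new;
		head->prev = new;
	}

	static void list_del_init(struct list_head *e)
	{
		e->prev->next = e->next;
		e->next->prev = e->prev;
		list_init(e);
	}

	int main(void)
	{
		struct list_head pend, prio, ent, other;

		list_init(&pend); list_init(&prio);
		list_init(&ent); list_init(&other);

		list_add_tail(&ent, &pend);   /* dispatch #1: fine */
		list_add_tail(&ent, &prio);   /* dispatch #2: same entry, second
					       * queue; pend.prev is now stale */
		list_del_init(&ent);          /* taskq thread pulls it off prio */
		list_add_tail(&other, &pend); /* next add to pend fires the
					       * "prev->next should be next"
					       * warning */
		return (0);
	}

Note that the stale pointer reported is the entry itself (was == prev), which matches the ffff880123b05610 value appearing twice in the warning above.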

@behlendorf
Contributor Author

Right. That's exactly what the warning is detecting, and what should have been prevented by the lines you posted above. If you're able to reproduce this on a test system, one thing that might be worth doing is adding another ASSERT at the top of taskq_dispatch_ent() to verify that the passed taskq_ent_t is always properly initialized. That way we could catch the issue slightly sooner.

        ASSERT(taskq_empty_ent(t));

@bprotopopov
Contributor

Yes, so, the stack above is the 'victim' stack, and the damage has already been done.

The fact is that this type of ASSERT()

 ASSERT(taskq_empty_ent(&zio->io_tqent));

already seems to be present in places where zios are dispatched. Which means that a zio is likely being acted on by two threads concurrently (each thread adding to a different list), and the ASSERT()s are not catching it due to timing and lack of serialization?

Maybe, in addition to the ASSERT()s, I could add some sort of atomic reference count that is incremented before the ASSERT()/dispatch and atomically decremented/tested in dispatch after the list_add(), to make sure only one thread at a time is dispatching a given zio?
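
A rough sketch of that kind of debug guard (everything here is hypothetical, not existing code: io_dispatching is not a real zio_t member, and ideally the decrement would live inside taskq_dispatch_ent() right after list_add_tail(), still under t->tqent_lock):

	/* Hypothetical debug field added to zio_t:  atomic_t io_dispatching; */

	/* Around every dispatch of this zio: */
	VERIFY3S(atomic_inc_return(&zio->io_dispatching), ==, 1);
	ASSERT(taskq_empty_ent(&zio->io_tqent));
	taskq_dispatch_ent(tq, func, zio, TQ_SLEEP, &zio->io_tqent);
	atomic_dec(&zio->io_dispatching);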

@bprotopopov
Contributor

bprotopopov commented Mar 1, 2017

Actually, there is a locked section around adding to the list already:

...
	spin_lock(&t->tqent_lock);

	/*
	 * Mark it as a prealloc'd task.  This is important
	 * to ensure that we don't free it later.
	 */
	t->tqent_flags |= TQENT_FLAG_PREALLOC;

	/* Queue to the priority list instead of the pending list */
	if (flags & TQ_FRONT)
		list_add_tail(&t->tqent_list, &tq->tq_prio_list);
	else
		list_add_tail(&t->tqent_list, &tq->tq_pend_list);

	t->tqent_id = tq->tq_next_id;
	tq->tq_next_id++;
	t->tqent_func = func;
	t->tqent_arg = arg;
	t->tqent_taskq = tq;
	t->tqent_birth = jiffies;

	spin_unlock(&t->tqent_lock);
...

The ASSERT(taskq_empty_ent(t)) could be simply placed after spin_lock(&t->tqent_lock) and before list_add_tail().
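
In other words, something along these lines (a sketch of the proposed placement, not necessarily the exact patch that was later merged):

	spin_lock(&t->tqent_lock);

	/* Catch a second dispatch of an entry that is still linked
	 * on some queue's pending or priority list. */
	ASSERT(taskq_empty_ent(t));

	t->tqent_flags |= TQENT_FLAG_PREALLOC;

	if (flags & TQ_FRONT)
		list_add_tail(&t->tqent_list, &tq->tq_prio_list);
	else
		list_add_tail(&t->tqent_list, &tq->tq_pend_list);
	...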

@behlendorf
Contributor Author

Yes, but the problem is I'm not sure it's going to tell you anything new. The stack trace above was dumped only a few lines farther down from the list_add_tail() in this same locked section.

@bprotopopov
Contributor

bprotopopov commented Mar 1, 2017

Well, no - the previous code did not check under the lock, so even if two threads were to independently check that the entry's list head was empty, they could do:

thread1 - check entry() - OK
thread2 - check entry() - OK
thread1 - list_add() (checks the list head - fine)
thread2 - list_add() (checks the list head - fine)

and then the next time someone removes the entry from the second list (which leaves the first list's head->prev->next pointing back at prev) and then adds to the first list, we would get the warning message.

With the check under the lock, I believe we will catch the second list_add() in the act, which seems like the info we need.

@behlendorf
Contributor Author

My mistake; I was under the impression that __list_add_valid() verified that the list head being added was initialized and empty. That seems not to be the case. Given that, I'm all for adding the ASSERT as you've described.
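
For reference, the check that fired in the trace above is in the RHEL 6 kernel's lib/list_debug.c (__list_add in the trace; in newer kernels the same checks live in __list_add_valid()), and it looks roughly like this:

	void __list_add(struct list_head *new, struct list_head *prev,
	    struct list_head *next)
	{
		WARN(next->prev != prev,
		    "list_add corruption. next->prev should be prev (%p), "
		    "but was %p. (next=%p).\n", prev, next->prev, next);
		WARN(prev->next != next,
		    "list_add corruption. prev->next should be next (%p), "
		    "but was %p. (prev=%p).\n", next, prev->next, prev);
		next->prev = new;
		new->next = next;
		new->prev = prev;
		prev->next = new;
	}

It only validates the neighbors of the insertion point in the destination list; it never looks at new->next or new->prev, so an entry that is still linked somewhere else sails through until a later add trips over the stale pointers.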

@bprotopopov
Contributor

@behlendorf unfortunately, I don't have access to the system that reproduced this anymore.
Do you think it might make sense to put the ASSERT() in, just so that next time someone sees this the info is available?

@behlendorf
Contributor Author

Yup, I think that'd be reasonable.

@bprotopopov
Contributor

Hi, @behlendorf

would you prefer that I open a separate issue with a short explanation of what this is for?

@behlendorf
Contributor Author

Let's just reference this issue, which includes the full discussion above, in the commit message.

@bprotopopov
Contributor

Sounds good

behlendorf pushed a commit to openzfs/spl that referenced this issue Aug 8, 2017
taskq work item to more than one queue concurrently. Also, please
see discussion in openzfs/zfs#3840.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Closes #609
tonyhutter pushed a commit to openzfs/spl that referenced this issue Aug 8, 2017
taskq work item to more than one queue concurrently. Also, please
see discussion in openzfs/zfs#3840.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Closes #609
Fabian-Gruenbichler pushed a commit to Fabian-Gruenbichler/spl that referenced this issue Sep 6, 2017
taskq work item to more than one queue concurrently. Also, please
see discussion in openzfs/zfs#3840.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Closes openzfs#609
@behlendorf
Contributor Author

Closing for now; this issue is no longer reproducible, and a debug patch was merged in case it surfaces again.
