Calling zio_interrupt() from vdev_disk_io_start() may cause RCU stall #3840
Comments
@behlendorf can you please provide more detail as to the situations that might lead to a zio appearing on multiple taskqs concurrently ? |
@bprotopopov my concern was specifically with the The current code does clearly do this so perhaps this is no longer an issue. To my knowledge we haven't had any subsequent reports of this. |
Hi, @behlendorf to make sure we are on the same page, I believe you are referring to
Given the I have also seen this stack on my systems a few times, but it never seemed to lead to a kind of meltdown one would expect from an entry actually showing up on two linked lists. Which leads me to believe that this is a warning for an uninitialized data structure that is subsequently re-initialized in a benign fashion. What do you think ? |
@behlendorf, I still feel like something does not quite fit right in this picture. Looking at So this might be something more serious, e.g. a |
Right. That's exactly what the warning is detecting, and what should have been prevented by the lines you posted above. If you're able to reproduce on a test system one thing which might be worth doing is adding another ASSERT to the top of
|
Yes, so, the stack above is the 'victim' stack, and the damage has already been done. The fact is that this type of ASSERT() already seems to be present in the places where zios are dispatched. That means a zio is likely being acted on by two threads concurrently (each thread adding it to a different list), and the ASSERT()s are not catching this because of timing and the lack of serialization? Maybe, in addition to the ASSERT()s, I can add some sort of atomic reference count that is incremented before the ASSERT()/dispatch and atomically decremented/tested in dispatch after list_add(), to make sure only one thread at a time is dispatching a given zio? |
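A minimal sketch of the atomic-counter idea proposed above, written with C11 atomics rather than kernel primitives. The `work_item_t` type and the `dispatch_enter()`/`dispatch_exit()` helpers are hypothetical names used only for illustration; they are not part of SPL or ZFS.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a work item; not the real taskq entry type. */
typedef struct work_item {
	atomic_int dispatchers;	/* threads currently dispatching this item */
} work_item_t;

/*
 * Called before the list insertion. Returns true if the caller is the only
 * thread dispatching this item, false if another dispatch is in flight.
 */
static bool
dispatch_enter(work_item_t *w)
{
	/* atomic_fetch_add returns the previous value; non-zero means overlap. */
	if (atomic_fetch_add(&w->dispatchers, 1) != 0) {
		atomic_fetch_sub(&w->dispatchers, 1);
		return (false);
	}
	return (true);
}

/* Called after the list insertion has completed. */
static void
dispatch_exit(work_item_t *w)
{
	int prev = atomic_fetch_sub(&w->dispatchers, 1);

	assert(prev >= 1);	/* exit without a matching enter is a bug */
	(void) prev;		/* silence unused warning when NDEBUG is set */
}
```

A failed `dispatch_enter()` identifies the second dispatcher at the moment it arrives, instead of at a later list operation when the damage has already been done.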
Actually, there is a locked section around adding to the list already:
The |
Yes, the problem is I'm not sure it's going to tell you anything new. The above stack trace was dumped only a few lines farther down from the |
Well, no - the previous code did not check under lock, so even if two threads were to check independently for non-empty list head, they could do thread1 - check entry() - OK and then next time someone removes from the second list (and resets the head->prev->next to prev) and then adds to the first list, we would get the warning message. With the check under lock, I believe we will catch the second list_add() in action, which seems like the info we need. |
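To illustrate the check-under-lock point, here is a user-space sketch assuming a circular doubly linked list similar in spirit to the kernel's `list_head`. All names (`struct node`, `struct queue`, `queue_dispatch()`) are hypothetical, and a pthread mutex stands in for the queue lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Minimal circular doubly linked list node (illustrative only). */
struct node { struct node *next, *prev; };

static void node_init(struct node *n) { n->next = n->prev = n; }
static bool node_unlinked(const struct node *n) { return (n->next == n); }

static void
node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void
node_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

struct queue { pthread_mutex_t lock; struct node head; };

/*
 * Check that the entry is unlinked while holding the queue lock, then add it.
 * Returns false instead of linking an entry that already sits on a list.
 */
static bool
queue_dispatch(struct queue *q, struct node *ent)
{
	bool ok;

	pthread_mutex_lock(&q->lock);
	ok = node_unlinked(ent);
	if (ok)
		node_add_tail(ent, &q->head);
	pthread_mutex_unlock(&q->lock);
	return (ok);
}
```

Because each queue has its own lock in this sketch, the check catches a second add that starts after the first one has finished, but two truly simultaneous adds to different queues can still race; it narrows the window rather than closing it.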
My mistake, I was under the mistaken impression that |
@behlendorf unfortunately, I don't have access to the system that reproduced this anymore |
Yup, I think that'd be reasonable. |
Hi, @behlendorf would you prefer I enter a separate issue with a short explanation of what this is for ? |
Let's just reference this issue which includes the full discussion above in the commit message. |
Sounds good |
taskq work item to more than one queue concurrently. Also, please see discussion in openzfs/zfs#3840. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Closes #609
Closing for now; this issue is no longer reproducible, and a debug patch was merged in case it surfaces again. |
This issue is for the remainder of the work alluded to in #3652. The vdev_disk_io_start() function needs to be updated so that it doesn't call zio_interrupt() directly in the case of an error. This work item should be deferred to prevent the same zio from appearing on multiple taskqs concurrently.
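The deferral described above can be sketched in plain C as follows. This is only an illustration of the pattern under stated assumptions: `io_t`, `io_start()`, `io_defer_complete()` and `deferred_run()` are hypothetical names, and a simple single-threaded queue stands in for whatever deferral mechanism the real code would use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical I/O request; not the real zio_t. */
typedef struct io io_t;
struct io {
	int error;		/* error recorded by the start path */
	int completions;	/* how many times the completion handler ran */
	io_t *next;		/* link on the deferred-completion queue */
};

static io_t *deferred_head;
static io_t **deferred_tail = &deferred_head;

/* Completion handler; in this sketch it only counts invocations. */
static void io_complete(io_t *io) { io->completions++; }

/* Queue the completion for later instead of running it inline. */
static void
io_defer_complete(io_t *io)
{
	io->next = NULL;
	*deferred_tail = io;
	deferred_tail = &io->next;
}

/* Start path: on error, defer the completion rather than calling it. */
static void
io_start(io_t *io, int error)
{
	if (error != 0) {
		io->error = error;
		io_defer_complete(io);
		return;
	}
	/* The success path would submit the request to the device here. */
}

/* Drain deferred completions; returns how many were processed. */
static int
deferred_run(void)
{
	int n = 0;

	while (deferred_head != NULL) {
		io_t *io = deferred_head;

		deferred_head = io->next;
		if (deferred_head == NULL)
			deferred_tail = &deferred_head;
		io_complete(io);
		n++;
	}
	return (n);
}
```

The point of the pattern is that the start path never runs the completion handler in its own context, so the request is handed off exactly once, from a single place.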