kernel: Fix double-list-removal corruption case in timeout handling #9620
This flag is vestigial. It gets set but never read. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
This fixes zephyrproject-rtos#8669, and is distressingly subtle for a one-line patch:

The list iteration code in _handle_expired_timeouts() would remove the timeout from our (temporary -- the dlist header is on the stack of our calling function) list of expired timeouts before invoking the handler. But sys_dlist_remove() only fixes up the containing list pointers, leaving garbage in the node. If the action of that handler is to re-add the timeout (which is very common!) then that will then try to remove it AGAIN from the same list.

Even then, the common case is that the expired list contains only one item, so the result is a perfectly valid empty list that affects nothing. But if you have more than one, you get a corrupt cycle in the iteration list and things get weird.

As it happens, there's no value in trying to remove this timeout from the temporary list at all. Just iterate over it naturally.

Really, this design is fragile: we shouldn't be reusing the list nodes in struct _timeout for this purpose and should figure out some other mechanism. But this fix should be good for now.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
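To make the failure mode concrete, here is a minimal standalone sketch of how removing a node a second time through stale pointers can corrupt a circular doubly linked list. It is an illustration under assumptions, not the Zephyr code: `struct node`, `list_init()`, `list_append()`, `list_remove()` and the two demo functions are hypothetical stand-ins, and the order of operations in the second demo is chosen only to expose the stale-pointer effect, not to reproduce the exact kernel sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly linked list (not sys_dlist). */
struct node {
	struct node *next;
	struct node *prev;
};

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_append(struct node *head, struct node *n)
{
	n->next = head;
	n->prev = head->prev;
	head->prev->next = n;
	head->prev = n;
}

/* Fixes up the neighbours only: n->next and n->prev keep their old values. */
static void list_remove(struct node *n)
{
	assert(n->next != NULL && n->prev != NULL);
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* One entry removed twice: both stale pointers refer to the list head,
 * so the second removal rewrites the head to itself and nothing breaks.
 */
bool single_item_double_remove_is_harmless(void)
{
	struct node head, a;

	list_init(&head);
	list_append(&head, &a);
	list_remove(&a);
	list_remove(&a);

	return head.next == &head && head.prev == &head;
}

/* Two entries: a is removed, the list changes, then a is removed again.
 * The stale a.next re-links b at the front while the tail pointer still
 * says the list is empty.
 */
bool two_item_double_remove_corrupts(void)
{
	struct node head, a, b;

	list_init(&head);
	list_append(&head, &a);
	list_append(&head, &b);

	list_remove(&a); /* a.next == &b and a.prev == &head are left behind */
	list_remove(&b); /* the list is now genuinely empty */
	list_remove(&a); /* stale pointers: head.next = &b, b.prev = &head */

	return head.next == &b && head.prev == &head;
}
```

Both functions return `true`: the single-entry case ends with a valid empty list, while the two-entry case ends with a head whose forward and backward links disagree, which is the kind of inconsistency that later list walks trip over.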
Codecov Report

| Coverage Diff | master | #9620 | +/- |
|---------------|--------|--------|--------|
| Coverage | 52.15% | 52.14% | -0.01% |
| Files | 212 | 212 | |
| Lines | 25916 | 25913 | -3 |
| Branches | 5582 | 5582 | |
| Hits | 13517 | 13513 | -4 |
| Misses | 10149 | 10149 | |
| Partials | 2250 | 2251 | +1 |
Hey @andyross, is this covered with tests or do we need to add one?
I keep going back and forth on this. The bug is narrowly specific: two timer events have to expire in the same tick, and the second one handled has to re-add itself to the timeout queue. Rather than spend time writing a test, I'd prefer to rework the code here to be simpler and drop the "re-use the dlist node" trick. Then it wouldn't have divergent code paths depending on the number of timeouts in a tick and wouldn't need a test.
Added #9627 to track the need for refactoring here.
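For discussion, one possible shape for the simpler approach is sketched below. This is not what this PR does (the PR just drops the removal and iterates over the list) and it is not taken from #9627. It assumes a drain loop that pops each expired entry from the head and clears its links before calling the handler, so a handler that re-adds the entry starts from a clean node and the loop behaves the same for one or many entries. Every name here (`struct timeout`, `list_pop_head()`, `run_expired()`, `rearm()`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
	struct node *prev;
};

/* Hypothetical timeout record; node is first so a node pointer can be cast. */
struct timeout {
	struct node node;
	void (*fn)(struct timeout *t, struct node *queue);
	int fired;
};

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_append(struct node *head, struct node *n)
{
	/* Catch an entry that is still linked somewhere else. */
	assert(n->next == NULL && n->prev == NULL);
	n->next = head;
	n->prev = head->prev;
	head->prev->next = n;
	head->prev = n;
}

/* Detach the first entry and clear its links so nothing stale remains. */
static struct node *list_pop_head(struct node *head)
{
	struct node *n = head->next;

	if (n == head) {
		return NULL;
	}
	head->next = n->next;
	n->next->prev = head;
	n->next = NULL;
	n->prev = NULL;
	return n;
}

/* Handler for a periodic entry: count the expiry and queue it again. */
static void rearm(struct timeout *t, struct node *queue)
{
	t->fired++;
	list_append(queue, &t->node);
}

/* Drain the expired list; returns the number of handlers invoked. */
static int run_expired(struct node *expired, struct node *queue)
{
	struct node *n;
	int handled = 0;

	while ((n = list_pop_head(expired)) != NULL) {
		struct timeout *t = (struct timeout *)n;

		t->fn(t, queue);
		handled++;
	}
	return handled;
}

/* Two periodic entries expiring in the same tick, both re-adding themselves. */
bool demo_two_rearming_timeouts(void)
{
	struct node expired, pending;
	struct timeout t1 = { .fn = rearm };
	struct timeout t2 = { .fn = rearm };

	list_init(&expired);
	list_init(&pending);
	list_append(&expired, &t1.node);
	list_append(&expired, &t2.node);

	int handled = run_expired(&expired, &pending);

	return handled == 2 && t1.fired == 1 && t2.fired == 1 &&
	       expired.next == &expired && expired.prev == &expired &&
	       pending.next == &t1.node && pending.prev == &t2.node;
}
```

The trade-off in this sketch is an extra pair of pointer writes per expired entry in exchange for a single code path; whether that fits the kernel's constraints is a question for the refactoring issue.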
new bug
@andyross @nashif @carlescufi when working on the new shell (#9362) we hit an issue (hardfault) after rebasing on master. I couldn't figure out the cause, so I ended up rebasing back and finally found that 2376a77 introduces this fault. The example is fairly simple: it has two periodic instances of k_timer. When only one is running, the app is stable; enabling the second periodic timer results in a hardfault. Can you recommend something?
Could this actually also be the source of some faults we've been seeing recently?
In my case it was usually a hardfault pointing to sys_dlist_peek_prev_no_check or an MPU fault (attempt to execute from a RAM address).
@jarz-nordic does reverting this commit also fix it for you if you use
@carlescufi @nordic-krch @andyross: Once I reverted this commit, the problem was gone.
@andyross as the original author of this patch, could you take a look at the reports here?