
kernel: Fix double-list-removal corruption case in timeout handling #9620

Merged
2 commits merged into zephyrproject-rtos:master on Aug 27, 2018

Conversation

andyross
Contributor

This fixes #8669, and is distressingly subtle for a one-line patch:

The list iteration code in _handle_expired_timeouts() would remove the
timeout from our (temporary -- the dlist header is on the stack of our
calling function) list of expired timeouts before invoking the
handler. But sys_dlist_remove() only fixes up the containing list
pointers, leaving garbage in the node. If the action of that handler
is to re-add the timeout (which is very common!), the re-add path will then
try to remove it AGAIN from the same list.

Even then, the common case is that the expired list contains only one
item, so the result is a perfectly valid empty list that affects
nothing. But if you have more than one, you get a corrupt cycle in
the iteration list and things get weird.

As it happens, there's no value in trying to remove this timeout from
the temporary list at all. Just iterate over it naturally.

Really, this design is fragile: we shouldn't be reusing the list nodes
in struct _timeout for this purpose and should figure out some other
mechanism. But this fix should be good for now.
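
For illustration, here is a self-contained toy (a hypothetical dnode type and helpers, not the kernel's actual sys_dlist code or call sequence) showing the class of hazard described above: an unlink that only patches up the neighbours leaves the removed node still pointing at them, so a later unlink through those stale pointers writes through them blindly and can leave the list half-linked.

```c
#include <stdio.h>

struct dnode {
	struct dnode *next, *prev;
};

static void list_init(struct dnode *head)
{
	head->next = head->prev = head;
}

static void list_append(struct dnode *head, struct dnode *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Mimics the behaviour described above: the neighbours are fixed up,
 * but n->next/n->prev are left pointing into the old list.
 */
static void list_remove(struct dnode *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

int main(void)
{
	struct dnode expired, a, b;

	list_init(&expired);
	list_append(&expired, &a);
	list_append(&expired, &b);	/* expired: head <-> a <-> b */

	list_remove(&a);	/* a handled; a keeps stale prev=head, next=&b */
	list_remove(&b);	/* b handled; 'expired' is now a valid empty list */

	/* A later unlink through a's stale pointers (the "remove it AGAIN"
	 * case) writes through them and re-links b on one side only.
	 */
	list_remove(&a);

	printf("head->next == &b?   %s\n", expired.next == &b ? "yes" : "no");
	printf("head->prev == head? %s\n", expired.prev == &expired ? "yes" : "no");
	/* Forward and backward links now disagree: the cycle is broken and
	 * any subsequent iteration over 'expired' misbehaves.
	 */
	return 0;
}
```

The exact ordering in the kernel differs, but the failure mode is the same class: once a node carries stale list pointers, any further removal through them corrupts whichever list those pointers land in. The fix sidesteps the first removal entirely, since the temporary expired list is discarded after the loop anyway.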

Andy Ross added 2 commits August 24, 2018 09:32
This flag is vestigial.  It gets set but never read.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
kernel: Fix double-list-removal corruption case in timeout handling (the full commit message repeats the PR description above).

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
andyross requested a review from andrewboie as a code owner on August 24, 2018
andyross requested reviews from ceolin, dcpleung and nashif on August 24, 2018
@codecov-io

Codecov Report

Merging #9620 into master will decrease coverage by <.01%.
The diff coverage is 0%.


@@            Coverage Diff             @@
##           master    #9620      +/-   ##
==========================================
- Coverage   52.15%   52.14%   -0.01%     
==========================================
  Files         212      212              
  Lines       25916    25913       -3     
  Branches     5582     5582              
==========================================
- Hits        13517    13513       -4     
  Misses      10149    10149              
- Partials     2250     2251       +1
Impacted Files Coverage Δ
kernel/sys_clock.c 95.08% <ø> (-0.16%) ⬇️
kernel/include/timeout_q.h 94.44% <0%> (-1.45%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0be1875...2376a77.

@nashif
Member

nashif commented Aug 25, 2018

Hey @andyross, is this covered with tests or do we need to add one?

@andyross
Contributor Author

I keep flipping on this. The bug is quite specific to trigger: two timer events have to expire in the same tick, and the second one handled needs to re-add itself to the timeout queue.

Aesthetically, rather than spend time writing a test, I'd rather rework the code here to be simpler and drop the "re-use the dlist node" trick. Then it wouldn't have divergent code paths depending on the number of timeouts in a tick, and wouldn't need a test.
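
For reference, a regression test along those lines might look roughly like the sketch below. It is hypothetical and not part of this PR, and it assumes the ztest and k_timer APIs of this era (millisecond arguments to k_timer_start(), k_timer_status_get() returning the expiry count since last read); whether both expirations reliably land on the same tick would need checking on real targets.

```c
#include <zephyr.h>
#include <ztest.h>

/* Two periodic timers with identical timing, so their timeouts expire on
 * the same tick and each expiry re-adds its timeout to the timeout queue.
 */
K_TIMER_DEFINE(t1, NULL, NULL);
K_TIMER_DEFINE(t2, NULL, NULL);

static void test_same_tick_periodic_timers(void)
{
	k_timer_start(&t1, 50, 50);
	k_timer_start(&t2, 50, 50);

	/* Let roughly ten periods elapse.  On an affected kernel the list
	 * corruption typically faults before the assertions run.
	 */
	k_sleep(520);

	zassert_true(k_timer_status_get(&t1) >= 5, "t1 stalled");
	zassert_true(k_timer_status_get(&t2) >= 5, "t2 stalled");
}

void test_main(void)
{
	ztest_test_suite(timeout_double_remove,
			 ztest_unit_test(test_same_tick_periodic_timers));
	ztest_run_test_suite(timeout_double_remove);
}
```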

@andyross
Contributor Author

Added #9627 to track the need for refactoring here.

nashif merged commit d8d5ec3 into zephyrproject-rtos:master on Aug 27, 2018
@findlayfeng
Contributor

new bug

@nordic-krch
Contributor

nordic-krch commented Sep 18, 2018

@andyross @nashif @carlescufi when working on the new shell (#9362) we encountered an issue (a hard fault) after rebasing on master. I couldn't figure it out, so I ended up rebasing back and finally found that 2376a77 introduces this fault.

The example is fairly simple: it has two periodic instances of k_timer. When only one is running the app is stable; enabling the second periodic timer results in a hard fault. Can you recommend something?
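
A minimal sketch of the kind of two-timer setup described above (hypothetical names, assuming the k_timer/printk APIs of this era; the exact timing needed to hit the fault may vary with tick rate):

```c
#include <zephyr.h>
#include <misc/printk.h>

static void expiry(struct k_timer *timer)
{
	printk("expired: %p\n", timer);
}

/* Two periodic timers; with only one running the app is stable, with both
 * running the fault described above is reported to appear.
 */
K_TIMER_DEFINE(timer_a, expiry, NULL);
K_TIMER_DEFINE(timer_b, expiry, NULL);

void main(void)
{
	k_timer_start(&timer_a, 100, 100);	/* duration/period in ms */
	k_timer_start(&timer_b, 100, 100);

	while (1) {
		k_sleep(1000);
	}
}
```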

@carlescufi
Member

Could this actually also be the source of some faults we've been seeing recently?

@nordic-krch
Contributor

In my case it was usually a hard fault pointing to sys_dlist_peek_prev_no_check, or an MPU fault (an attempt to execute from a RAM address).

@jakub-uC
Contributor

I've also observed this problem when I used k_sleep(1); in my application. When I replaced k_sleep(1); with k_busy_wait(1000); the problem was gone.
My error message is the following:
(screenshot of the fault log)

@carlescufi
Member

@jarz-nordic does reverting this commit also fix it for you if you use k_sleep(1)?

@jakub-uC
Contributor

jakub-uC commented Sep 18, 2018

@carlescufi @nordic-krch @andyross: Once I reverted this commit, the problem was gone.

@carlescufi
Member

@andyross as the original author of this patch, could you take a look at the reports here?

Successfully merging this pull request may close these issues: fault during my timer testing (#8669).