
kernel: Fix double-list-removal corruption case in timeout handling #9620

Merged
2 commits merged into zephyrproject-rtos:master on Aug 27, 2018

Conversation

andyross
Contributor

This fixes #8669, and is distressingly subtle for a one-line patch:

The list iteration code in _handle_expired_timeouts() would remove the
timeout from our (temporary -- the dlist header is on the stack of our
calling function) list of expired timeouts before invoking the
handler. But sys_dlist_remove() only fixes up the containing list
pointers, leaving garbage in the node. If the action of that handler
is to re-add the timeout (which is very common!), the re-add path will then
try to remove it AGAIN from the same list.

Even then, the common case is that the expired list contains only one
item, so the result is a perfectly valid empty list that affects
nothing. But if you have more than one, you get a corrupt cycle in
the iteration list and things get weird.

As it happens, there's no value in trying to remove this timeout from
the temporary list at all. Just iterate over it naturally.

Really, this design is fragile: we shouldn't be reusing the list nodes
in struct _timeout for this purpose and should figure out some other
mechanism. But this fix should be good for now.
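
For illustration, here is a self-contained toy (a hypothetical dnode type and helpers, not the kernel's actual sys_dlist code or call sequence) showing the class of hazard described above: an unlink that only patches up the neighbours leaves the removed node still pointing at them, so a later unlink through those stale pointers writes through them blindly and can leave the list half-linked.

```c
#include <stdio.h>

struct dnode {
	struct dnode *next, *prev;
};

static void list_init(struct dnode *head)
{
	head->next = head->prev = head;
}

static void list_append(struct dnode *head, struct dnode *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Mimics the behaviour described above: the neighbours are fixed up,
 * but n->next/n->prev are left pointing into the old list.
 */
static void list_remove(struct dnode *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

int main(void)
{
	struct dnode expired, a, b;

	list_init(&expired);
	list_append(&expired, &a);
	list_append(&expired, &b);	/* expired: head <-> a <-> b */

	list_remove(&a);	/* a handled; a keeps stale prev=head, next=&b */
	list_remove(&b);	/* b handled; 'expired' is now a valid empty list */

	/* A later unlink through a's stale pointers (the "remove it AGAIN"
	 * case) writes through them and re-links b on one side only.
	 */
	list_remove(&a);

	printf("head->next == &b?   %s\n", expired.next == &b ? "yes" : "no");
	printf("head->prev == head? %s\n", expired.prev == &expired ? "yes" : "no");
	/* Forward and backward links now disagree: the cycle is broken and
	 * any subsequent iteration over 'expired' misbehaves.
	 */
	return 0;
}
```

The exact ordering in the kernel differs, but the failure mode is the same class: once a node carries stale list pointers, any further removal through them corrupts whichever list those pointers land in. The fix sidesteps the first removal entirely, since the temporary expired list is discarded after the loop anyway.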

Andy Ross added 2 commits August 24, 2018 09:32
This flag is vestigial.  It gets set but never read.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
kernel: Fix double-list-removal corruption case in timeout handling (the full commit message repeats the PR description above).

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
andyross requested a review from andrewboie as a code owner on August 24, 2018
andyross requested reviews from ceolin, dcpleung and nashif on August 24, 2018
@codecov-io

Codecov Report

Merging #9620 into master will decrease coverage by <.01%.
The diff coverage is 0%.


@@            Coverage Diff             @@
##           master    #9620      +/-   ##
==========================================
- Coverage   52.15%   52.14%   -0.01%     
==========================================
  Files         212      212              
  Lines       25916    25913       -3     
  Branches     5582     5582              
==========================================
- Hits        13517    13513       -4     
  Misses      10149    10149              
- Partials     2250     2251       +1
Impacted Files Coverage Δ
kernel/sys_clock.c 95.08% <ø> (-0.16%) ⬇️
kernel/include/timeout_q.h 94.44% <0%> (-1.45%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0be1875...2376a77.

@nashif
Member

nashif commented Aug 25, 2018

Hey @andyross, is this covered with tests or do we need to add one?

@andyross
Contributor Author

I keep flipping on this. The bug is quite specific to trigger: two timer events have to expire in the same tick, and the second one handled needs to re-add itself to the timeout queue.

Aesthetically, rather than spend time writing a test, I'd rather rework the code here to be simpler and drop the "re-use the dlist node" trick. Then it wouldn't have divergent code paths depending on the number of timeouts in a tick, and wouldn't need a test.
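
For reference, a regression test along those lines might look roughly like the sketch below. It is hypothetical and not part of this PR, and it assumes the ztest and k_timer APIs of this era (millisecond arguments to k_timer_start(), k_timer_status_get() returning the expiry count since last read); whether both expirations reliably land on the same tick would need checking on real targets.

```c
#include <zephyr.h>
#include <ztest.h>

/* Two periodic timers with identical timing, so their timeouts expire on
 * the same tick and each expiry re-adds its timeout to the timeout queue.
 */
K_TIMER_DEFINE(t1, NULL, NULL);
K_TIMER_DEFINE(t2, NULL, NULL);

static void test_same_tick_periodic_timers(void)
{
	k_timer_start(&t1, 50, 50);
	k_timer_start(&t2, 50, 50);

	/* Let roughly ten periods elapse.  On an affected kernel the list
	 * corruption typically faults before the assertions run.
	 */
	k_sleep(520);

	zassert_true(k_timer_status_get(&t1) >= 5, "t1 stalled");
	zassert_true(k_timer_status_get(&t2) >= 5, "t2 stalled");
}

void test_main(void)
{
	ztest_test_suite(timeout_double_remove,
			 ztest_unit_test(test_same_tick_periodic_timers));
	ztest_run_test_suite(timeout_double_remove);
}
```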

@andyross
Contributor Author

Added #9627 to track the need for refactoring here.

nashif merged commit d8d5ec3 into zephyrproject-rtos:master on Aug 27, 2018
@findlayfeng
Contributor

new bug

@nordic-krch
Contributor

nordic-krch commented Sep 18, 2018

@andyross @nashif @carlescufi when working on the new shell (#9362) we encountered an issue (a hard fault) after rebasing on master. I couldn't figure it out, so I ended up rebasing back and finally found that 2376a77 introduces this fault.

The example is fairly simple: it has two periodic instances of k_timer. When only one is running the app is stable; enabling the second periodic timer results in a hard fault. Can you recommend something?
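
A minimal sketch of the kind of two-timer setup described above (hypothetical names, assuming the k_timer/printk APIs of this era; the exact timing needed to hit the fault may vary with tick rate):

```c
#include <zephyr.h>
#include <misc/printk.h>

static void expiry(struct k_timer *timer)
{
	printk("expired: %p\n", timer);
}

/* Two periodic timers; with only one running the app is stable, with both
 * running the fault described above is reported to appear.
 */
K_TIMER_DEFINE(timer_a, expiry, NULL);
K_TIMER_DEFINE(timer_b, expiry, NULL);

void main(void)
{
	k_timer_start(&timer_a, 100, 100);	/* duration/period in ms */
	k_timer_start(&timer_b, 100, 100);

	while (1) {
		k_sleep(1000);
	}
}
```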

@carlescufi
Member

Could this actually also be the source of some faults we've been seeing recently?

@nordic-krch
Contributor

In my case it was usually a hard fault pointing to sys_dlist_peek_prev_no_check, or an MPU fault (an attempt to execute from a RAM address).

@jakub-uC
Contributor

I've also observed this problem when I used k_sleep(1); in my application. When I replaced k_sleep(1); with k_busy_wait(1000); the problem was gone.
My error message is the following:
(screenshot of the fault log)

@carlescufi
Member

@jarz-nordic does reverting this commit also fix it for you if you use k_sleep(1)?

@jakub-uC
Contributor

jakub-uC commented Sep 18, 2018

@carlescufi @nordic-krch @andyross: Once I reverted this commit, the problem was gone.

@carlescufi
Member

@andyross as the original author of this patch, could you take a look at the reports here?

Successfully merging this pull request may close these issues: fault during my timer testing (#8669).