-
Notifications
You must be signed in to change notification settings - Fork 7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fault during my timer testing #8669
Comments
Confirmed on HEAD (the report was against 1.9 -- @wayen30 please try to validate against a recent version when reporting new bugs) with a little porting of the sample code. I can get a fault (inside the scheduler, implying a corrupt run queue) on x86, and on cortex_m3 I see a hang where the timeout handlers appear to run correctly, but the k_sleep() out of the main thread (sleep uses the same timeout framework internally) fails to wake up the second time it is called. Unsurprisingly, adding logging to try to inspect things has the side effect of delaying timing and "fixing" the bug. Sigh. The proximate cause seems to be the handling of more than one timeout in a single timer interrupt. We... don't actually seem to have a test for this condition, and if I adjust the numbers in the code above so they don't land at the same time it seems to work. I hate to say it, but we probably should make this bug a P1. This is pretty bad if we can't handle arbitrary timeout scheduling robustly. |
This fixes zephyrproject-rtos#8669, and is distressingly subtle for a one-line patch: The list iteration code in _handle_expired_timeouts() would remove the timeout from our (temporary -- the dlist header is on the stack of our calling function) list of expired timeouts before invoking the handler. But sys_dlist_remove() only fixes up the containing list pointers, leaving garbage in the node. If the action of that handler is to re-add the timeout (which is very common!) then that will then try to remove it AGAIN from the same list. Even then, the common case is that the expired list contains only one item, so the result is a perfectly valid empty list that affects nothing. But if you have more than one, you get a corrupt cycle in the iteration list and things get weird. As it happens, there's no value in trying to remove this timeout from the temporary list at all. Just iterate over it naturally. Really, this design is fragile: we shouldn't be reusing the list nodes in struct _timeout for this purpose and should figure out some other mechanism. But this fix should be good for now. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
This fixes #8669, and is distressingly subtle for a one-line patch: The list iteration code in _handle_expired_timeouts() would remove the timeout from our (temporary -- the dlist header is on the stack of our calling function) list of expired timeouts before invoking the handler. But sys_dlist_remove() only fixes up the containing list pointers, leaving garbage in the node. If the action of that handler is to re-add the timeout (which is very common!) then that will then try to remove it AGAIN from the same list. Even then, the common case is that the expired list contains only one item, so the result is a perfectly valid empty list that affects nothing. But if you have more than one, you get a corrupt cycle in the iteration list and things get weird. As it happens, there's no value in trying to remove this timeout from the temporary list at all. Just iterate over it naturally. Really, this design is fragile: we shouldn't be reusing the list nodes in struct _timeout for this purpose and should figure out some other mechanism. But this fix should be good for now. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
new bug~ |
hello
I found a fault during my timer testing.
zephyr version: 1.9.1
code:
analysis:
when timer1 & timer2 expired in the same tick, timer1 & timer2 will be dequeue from _timeout_q to expired.
In timer2 callback function, k_timer_start(&test_timer, K_MSEC(10), K_MSEC(20)) will re-insert timer1 to _timeout_q. After timer2 callback function, the expired sys_dlist(in _handle_expired_timeouts()) has changed.
The callback of timer linked in the _timeout_q will be called in order. when run last timeout(_timeout_q which actually is not a timer structure),run timeout->func will trigger a fault.
The text was updated successfully, but these errors were encountered: