Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fault during my timer testing #8669

Closed
wayen30 opened this issue Jul 2, 2018 · 2 comments · Fixed by #9620
Closed

fault during my timer testing #8669

wayen30 opened this issue Jul 2, 2018 · 2 comments · Fixed by #9620
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Milestone

Comments

@wayen30
Copy link

wayen30 commented Jul 2, 2018

hello
I found a fault during my timer testing.

zephyr version: 1.9.1
code:

_struct k_timer test_timer, test_timer2;
static void test_timeout_event(os_timer *timer)
{
}

static void test2_timeout_event(os_timer *timer)
{
    k_timer_start(&test_timer, K_MSEC(10), K_MSEC(20));
}

void test_timer(void)
{
    k_timer_init(&test_timer, test_timeout_event, NULL); 
    k_timer_init(&test_timer2, test2_timeout_event, NULL); 

    k_timer_start(&test_timer, K_MSEC(10), K_MSEC(20));  

    while(1) {
        k_timer_start(&test_timer2, K_MSEC(100), 0);  
        k_sleep(K_MSEC(1000));
    }
}

analysis:
when timer1 & timer2 expired in the same tick, timer1 & timer2 will be dequeue from _timeout_q to expired.
In timer2 callback function, k_timer_start(&test_timer, K_MSEC(10), K_MSEC(20)) will re-insert timer1 to _timeout_q. After timer2 callback function, the expired sys_dlist(in _handle_expired_timeouts()) has changed.
The callback of timer linked in the _timeout_q will be called in order. when run last timeout(_timeout_q which actually is not a timer structure),run timeout->func will trigger a fault.

@nashif nashif added the bug The issue is a bug, or the PR is fixing a bug label Jul 3, 2018
@nashif nashif added the priority: medium Medium impact/importance bug label Jul 13, 2018
@andyross
Copy link
Contributor

Confirmed on HEAD (the report was against 1.9 -- @wayen30 please try to validate against a recent version when reporting new bugs) with a little porting of the sample code. I can get a fault (inside the scheduler, implying a corrupt run queue) on x86, and on cortex_m3 I see a hang where the timeout handlers appear to run correctly, but the k_sleep() out of the main thread (sleep uses the same timeout framework internally) fails to wake up the second time it is called.

Unsurprisingly, adding logging to try to inspect things has the side effect of delaying timing and "fixing" the bug. Sigh.

The proximate cause seems to be the handling of more than one timeout in a single timer interrupt. We... don't actually seem to have a test for this condition, and if I adjust the numbers in the code above so they don't land at the same time it seems to work.

I hate to say it, but we probably should make this bug a P1. This is pretty bad if we can't handle arbitrary timeout scheduling robustly.

andyross pushed a commit to andyross/zephyr that referenced this issue Aug 24, 2018
This fixes zephyrproject-rtos#8669, and is distressingly subtle for a one-line patch:

The list iteration code in _handle_expired_timeouts() would remove the
timeout from our (temporary -- the dlist header is on the stack of our
calling function) list of expired timeouts before invoking the
handler.  But sys_dlist_remove() only fixes up the containing list
pointers, leaving garbage in the node.  If the action of that handler
is to re-add the timeout (which is very common!) then that will then
try to remove it AGAIN from the same list.

Even then, the common case is that the expired list contains only one
item, so the result is a perfectly valid empty list that affects
nothing.  But if you have more than one, you get a corrupt cycle in
the iteration list and things get weird.

As it happens, there's no value in trying to remove this timeout from
the temporary list at all.  Just iterate over it naturally.

Really, this design is fragile: we shouldn't be reusing the list nodes
in struct _timeout for this purpose and should figure out some other
mechanism.  But this fix should be good for now.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@nashif nashif added this to the v1.13.0 milestone Aug 26, 2018
nashif pushed a commit that referenced this issue Aug 27, 2018
This fixes #8669, and is distressingly subtle for a one-line patch:

The list iteration code in _handle_expired_timeouts() would remove the
timeout from our (temporary -- the dlist header is on the stack of our
calling function) list of expired timeouts before invoking the
handler.  But sys_dlist_remove() only fixes up the containing list
pointers, leaving garbage in the node.  If the action of that handler
is to re-add the timeout (which is very common!) then that will then
try to remove it AGAIN from the same list.

Even then, the common case is that the expired list contains only one
item, so the result is a perfectly valid empty list that affects
nothing.  But if you have more than one, you get a corrupt cycle in
the iteration list and things get weird.

As it happens, there's no value in trying to remove this timeout from
the temporary list at all.  Just iterate over it naturally.

Really, this design is fragile: we shouldn't be reusing the list nodes
in struct _timeout for this purpose and should figure out some other
mechanism.  But this fix should be good for now.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@findlayfeng
Copy link
Contributor

new bug~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants