fault during my timer testing #8669

wayen30 · 2018-07-02T08:33:48Z

hello
I found a fault during my timer testing.

zephyr version: 1.9.1
code:

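#include <zephyr.h>

struct k_timer test_timer, test_timer2;

/* Expiry handler for test_timer: intentionally empty. */
static void test_timeout_event(struct k_timer *timer)
{
}

/* Expiry handler for test_timer2: restarts test_timer from inside the callback. */
static void test2_timeout_event(struct k_timer *timer)
{
    k_timer_start(&test_timer, K_MSEC(10), K_MSEC(20));
}

void run_timer_test(void)
{
    k_timer_init(&test_timer, test_timeout_event, NULL);
    k_timer_init(&test_timer2, test2_timeout_event, NULL);

    /* Periodic timer: first expiry after 10 ms, then every 20 ms. */
    k_timer_start(&test_timer, K_MSEC(10), K_MSEC(20));

    while (1) {
        /* One-shot timer (period 0), re-armed once per second. */
        k_timer_start(&test_timer2, K_MSEC(100), 0);
        k_sleep(K_MSEC(1000));
    }
}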

analysis:
When timer1 and timer2 expire in the same tick, both are dequeued from _timeout_q and moved to the expired list.
In the timer2 callback, k_timer_start(&test_timer, K_MSEC(10), K_MSEC(20)) re-inserts timer1 into _timeout_q. Once the timer2 callback returns, the expired sys_dlist (in _handle_expired_timeouts()) is no longer the list the loop started with.
The callbacks of the timeouts now linked in _timeout_q are then called in order. When the loop reaches the last entry, which is _timeout_q itself and not a timer structure, calling timeout->func triggers a fault.
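The sequence described in the analysis can be sketched with a toy list that runs on a host machine. This is only an illustration under stated assumptions: `struct node`, `list_init`, `list_append`, `list_remove` and `walk_reaches_foreign_head` are hypothetical names, not Zephyr APIs, and the loop is assumed to cache the next pointer before running each callback, as "safe" list iterators commonly do. It is not the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy circular doubly linked list; the list head is a sentinel node. */
struct node {
    struct node *next;
    struct node *prev;
};

static void list_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

static void list_append(struct node *head, struct node *n)
{
    n->next = head;
    n->prev = head->prev;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink n from whichever list its pointers currently reference. */
static void list_remove(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/*
 * Walk the "expired" list. If restart_in_callback is nonzero, the
 * callback of t2 moves t1 onto "queue" while the walk is in progress.
 * Returns 1 if the walk steps onto the queue head (which is not a
 * timeout), 0 if it ends normally at the expired head.
 */
int walk_reaches_foreign_head(int restart_in_callback)
{
    struct node expired, queue, t2, t1;

    list_init(&expired);
    list_init(&queue);
    list_append(&expired, &t2);
    list_append(&expired, &t1);
    assert(expired.next == &t2 && t2.next == &t1);

    struct node *n = expired.next;

    while (n != &expired) {
        struct node *next = n->next; /* cached before the callback runs */

        if (n == &queue) {
            return 1; /* a kernel would call timeout->func on a list head here */
        }
        if (n == &t2 && restart_in_callback) {
            /* Models t2's callback restarting t1. */
            list_remove(&t1);
            list_append(&queue, &t1);
        }
        n = next;
    }
    return 0;
}
```

Calling `walk_reaches_foreign_head(1)` models the failing case: the cached pointer still refers to t1, whose links now lead into the other list, so the walk never sees the expired head again. With `0` the walk terminates normally.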


andyross · 2018-08-23T19:46:12Z

Confirmed on HEAD (the report was against 1.9 -- @wayen30 please try to validate against a recent version when reporting new bugs) with a little porting of the sample code. I can get a fault (inside the scheduler, implying a corrupt run queue) on x86, and on cortex_m3 I see a hang where the timeout handlers appear to run correctly, but the k_sleep() out of the main thread (sleep uses the same timeout framework internally) fails to wake up the second time it is called.

Unsurprisingly, adding logging to try to inspect things has the side effect of delaying timing and "fixing" the bug. Sigh.

The proximate cause seems to be the handling of more than one timeout in a single timer interrupt. We... don't actually seem to have a test for this condition, and if I adjust the numbers in the code above so they don't land at the same time it seems to work.

I hate to say it, but we probably should make this bug a P1. This is pretty bad if we can't handle arbitrary timeout scheduling robustly.

This fixes zephyrproject-rtos#8669, and is distressingly subtle for a one-line patch:

The list iteration code in _handle_expired_timeouts() would remove the timeout from our (temporary -- the dlist header is on the stack of our calling function) list of expired timeouts before invoking the handler. But sys_dlist_remove() only fixes up the containing list pointers, leaving garbage in the node. If the action of that handler is to re-add the timeout (which is very common!) then that will then try to remove it AGAIN from the same list.

Even then, the common case is that the expired list contains only one item, so the result is a perfectly valid empty list that affects nothing. But if you have more than one, you get a corrupt cycle in the iteration list and things get weird.

As it happens, there's no value in trying to remove this timeout from the temporary list at all. Just iterate over it naturally.

Really, this design is fragile: we shouldn't be reusing the list nodes in struct _timeout for this purpose and should figure out some other mechanism. But this fix should be good for now.

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
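To make the double-removal hazard concrete, here is a minimal host-side sketch. It assumes a remove operation that, as described above, only fixes up the neighbours and leaves the removed node's own pointers stale. All names (`struct node`, `list_remove`, `double_remove_single_ok`, `double_remove_pair_corrupts`) are hypothetical and chosen for illustration. The exact sequence inside the kernel may differ, so this shows one way a stale second removal can corrupt a list rather than a reproduction of the bug.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy circular doubly linked list; the list head is a sentinel node. */
struct node {
    struct node *next;
    struct node *prev;
};

static void list_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

static void list_append(struct node *head, struct node *n)
{
    n->next = head;
    n->prev = head->prev;
    head->prev->next = n;
    head->prev = n;
}

/* Fixes up the neighbours only; n->next and n->prev are left stale. */
static void list_remove(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* One entry: removing it twice still leaves a valid empty list. */
bool double_remove_single_ok(void)
{
    struct node expired, a;

    list_init(&expired);
    list_append(&expired, &a);
    list_remove(&a);
    list_remove(&a); /* stale pointers both reference the head */
    return expired.next == &expired && expired.prev == &expired;
}

/* Two entries: a stale second removal of a relinks b into the list. */
bool double_remove_pair_corrupts(void)
{
    struct node expired, a, b;

    list_init(&expired);
    list_append(&expired, &a);
    list_append(&expired, &b);
    list_remove(&a); /* a keeps prev = head, next = b */
    list_remove(&b); /* the list is now genuinely empty */
    list_remove(&a); /* stale: writes head.next = b and b.prev = head */

    /* Walking forward finds b again, walking backward finds nothing. */
    return expired.next == &b && expired.prev == &expired;
}
```

Both functions return true. In the second case the head's forward and backward links disagree about whether the list is empty, which is the kind of inconsistency that makes later iteration unpredictable.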


findlayfeng · 2018-08-30T06:35:16Z

new bug~

nashif added the bug label Jul 3, 2018

nashif added the priority: medium label Jul 13, 2018

nashif assigned andyross Jul 17, 2018

This was referenced Aug 24, 2018

kernel: Fix double-list-removal corruption case in timeout handling #9620

Merged

Clean up timeout handling #9627

Closed

nashif added this to the v1.13.0 milestone Aug 26, 2018

nashif closed this as completed in #9620 Aug 27, 2018
