tests/timer_api: Correct precision and fix correctness mistakes #32277
Conversation
Correct a bunch of precision/analysis errors in this test:

* Test items weren't consistent about tick alignment and resetting of the timestamp, so put these steps into init_timer_data() and call that immediately before k_timer_start().

* Many items would calculate the initial timestamp AFTER k_timer_start(), leading to an extra (third!) point where the timer computation could alias by an extra tick. Always do this consistently before the timer is started (via init_timer_data()).

* Tickless systems with high tick rates can easily advance the system uptime while the timer ISR is running, so the system can't expect perfect accuracy even there (this test was originally written for ticked systems where the ISR was by definition happening "at the same time"). (Unfortunately our most popular high tick rate tickless system, nRF5, also has a clock that doesn't divide milliseconds exactly, so it had a special path through all these precision comparisons and avoided the bugs. We finally found them on an x86 HPET system with 10 kHz ticks.)

* The interval validation was placing a minimum bound on the interval time but not a maximum (this mistake was what had hidden the failure to reset the timestamp mentioned above).

Longer term, the millisecond precision math in these tests is at this point an out-of-control complexity explosion. We should look at reworking the core OS tests of k_timer to use tick precision (which is by definition exact) pervasively and leave the millisecond stuff to a separate layer testing the alternative/legacy APIs.

Fixes zephyrproject-rtos#31964 (probably -- that was reported against up_squared, on which I had trouble reproducing, but it was a common failure on ehl_crb).

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
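[Editor's note] For readers not looking at the diff, a minimal sketch of the ordering the commit message describes is below. init_timer_data() and k_timer_start() are names from the PR and Zephyr; the struct fields and the tick-alignment sleep shown here are illustrative assumptions, not the test's exact code.

/* Sketch only: reset bookkeeping and take the reference timestamp
 * immediately before starting the timer, so no tick can elapse between
 * the timestamp read and k_timer_start(). Field names (tdata.timestamp,
 * tdata.expire_cnt, tdata.stop_cnt) are assumed.
 */
static void init_timer_data(void)
{
	tdata.expire_cnt = 0;
	tdata.stop_cnt = 0;

	k_usleep(1);                      /* align to a tick boundary */
	tdata.timestamp = k_uptime_get();
}

/* ... in each test item ... */
init_timer_data();                        /* timestamp taken BEFORE start */
k_timer_start(&timer, K_MSEC(DURATION), K_MSEC(PERIOD));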
This is one of those fixes where I spend more time writing the commit message than the code, to try to convince people I'm not just cheating to make the test pass. But I swear I'm not just cheating to make the test pass! Careful review appreciated.
Tested on HW (up_squared, ehl_crb), resolves the issue. LGTM.
@chen-png fyi
This can work because the epsilon passed to WITHIN_ERROR is 1, but I don't believe the changes are correct in the general case, and conflating upper and lower epsilons may cause future tests to be too lax.
A timer must never expire in less time than it was configured for, but it can appear to take longer due to delays. So the switch to a double-sided error with the same latitude in each direction isn't right.
The original motivation for allowing a timer to appear to expire one unit early was specifically for cases where the millisecond boundaries fell between ticks, and even though the correct duration did elapse that could not be confirmed using more coarse millisecond-aligned observations which discarded fractional ticks.
If millisecond conversions are kept, the interval check should continue to have a tighter lower bound: zero slop if conversions are precise, and if they aren't, just one unit below the expected duration regardless of whether the high-side epsilon is larger. (I don't believe two units below is possible when the tick frequency exceeds 1 kHz, and I'm skeptical it can happen in other cases either.)
But since we have a whole release cycle to address this it would be much better to update the test now to operate on ticks (as suggested in the commit message), and get rid of the whole lossy millisecond conversion. We spend far too much time going back and patching this thing to make it pass.
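[Editor's note] For concreteness, a rough sketch of what a tick-based check could look like, per the suggestion above and in the commit message. The variable names and the specific assertion are illustrative, not a proposed patch; the Zephyr calls (k_uptime_ticks(), k_timer_status_sync(), k_ms_to_ticks_ceil32()) are real APIs.

/* Measure in ticks so no lossy ms<->tick conversion is involved.
 * Illustrative only; not the actual test code.
 */
int64_t start = k_uptime_ticks();

k_timer_start(&timer, K_MSEC(DURATION), K_NO_WAIT);
k_timer_status_sync(&timer);              /* block until first expiry */

int64_t elapsed = k_uptime_ticks() - start;

/* The lower bound is exact: the timer must never fire before DURATION. */
zassert_true(elapsed >= k_ms_to_ticks_ceil32(DURATION),
	     "timer expired early");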
@@ -17,8 +17,8 @@ struct timer_data {
 #define DURATION 100
 #define PERIOD 50
 #define EXPIRE_TIMES 4
-#define WITHIN_ERROR(var, target, epsilon) \
-	(((var) >= (target)) && ((var) <= (target) + (epsilon)))
+#define WITHIN_ERROR(var, target, epsilon) (abs((target) - (var)) <= (epsilon))
This changes the check from a single-sided error to a two-sided one; i.e., in the past var < target would fail. Is the switch to two-sided really necessary? (Added: in the case where conversion is precise.)
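[Editor's note] A self-contained illustration of the behavioral difference being discussed (standalone C with hypothetical values, not Zephyr test code):

#include <assert.h>
#include <stdlib.h>

/* Old, single-sided check: early expiry (var < target) always fails. */
#define OLD_WITHIN_ERROR(var, target, epsilon) \
	(((var) >= (target)) && ((var) <= (target) + (epsilon)))

/* New, two-sided check: up to epsilon early OR late passes. */
#define NEW_WITHIN_ERROR(var, target, epsilon) \
	(abs((target) - (var)) <= (epsilon))

int main(void)
{
	/* target 100 ms, epsilon 1 ms, observed duration 99 ms (early) */
	assert(!OLD_WITHIN_ERROR(99, 100, 1));  /* old: rejected  */
	assert(NEW_WITHIN_ERROR(99, 100, 1));   /* new: tolerated */

	/* observed duration 101 ms (late): both tolerate one unit */
	assert(OLD_WITHIN_ERROR(101, 100, 1));
	assert(NEW_WITHIN_ERROR(101, 100, 1));
	return 0;
}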
		slop += 2 * k_ticks_to_ms_ceil32(1);
	}

	if (abs(interval - desired) > slop) {
Same issue here: interval less than desired should be an error for precise conversions, and for inexact conversions should be limited to one less.
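[Editor's note] A sketch of the asymmetric check this comment suggests; the exact_conversion flag and the assertion messages are placeholders, not existing test code (interval, desired, and slop are the variables from the hunk above).

/* Hypothetical asymmetric bound: essentially no undershoot allowed
 * (one unit only when the ms/tick conversion is inexact), while the
 * existing slop is kept on the high side.
 */
int64_t low_slop = exact_conversion ? 0 : 1;   /* assumed flag */

zassert_true((interval - desired) >= -low_slop,
	     "timer interval shorter than configured");
zassert_true((interval - desired) <= slop,
	     "timer interval too long");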
The time was being checked inside the ISR, though. An elapsed tick in that realm can absolutely result in a timer expiring "too early", and it was. It's not about unit precision at all, really, though your changes on Nordic to relax the requirements did have the effect of hiding the bugs there. Likewise the spots where the timestamp was being retrieved late would cause the same "too early" failure, though I believe I caught all those. I'll see if some of the other cases are being caught as false positives, but in general the test was wrong about what it was assuming.

As far as rewriting the test for clarity: yes, that should happen. But it's wrong now and causing failures. We need to merge this.
This must not happen. Applications cannot tolerate timers firing before they're supposed to fire. Let's figure out why that appears to be happening and what it's due to.
This path of hacking around each new failure we discover by further relaxing the "pass" criteria is not working for us. If the pass criteria are wrong, let's demonstrate that by eliminating the conversion error we already know is causing problems, then figure out what the real problem is. I'd rather rewrite this test to use ticks myself than continue having to tweak it every single release. Would you like me to do that? I don't have an Elkhart Lake CRB, though, so since that's a problematic platform somebody else would have to do the testing (or send me one).
I think you're still misunderstanding. The timer fired when it was supposed to fire. Time (real time) elapsed between the moment the kernel interrupt was delivered and the moment the counter was read in the ISR. And because that read was late, the interval measured until the next expiry comes out early, leading to a short count. This isn't something that can be prevented by design; it's a bug in the test, which assumed that ISRs see time as "frozen" (because time used to be implemented as a count of interrupts delivered). That stopped being true when everything became tickless. Please remove your -1 so we can merge this; if there's a new bug in the test, please file it.
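[Editor's note] To make the failure mode concrete, an illustrative expiry handler is below (not the test's actual code; the tdata fields are assumed). On a tickless system, uptime keeps advancing between the interrupt firing and the k_uptime_get() call, so the timestamp taken here is late, and the interval measured at the next expiry then looks short even though the timer itself fired on time.

static void expiry_fn(struct k_timer *timer)
{
	ARG_UNUSED(timer);

	/* Read "now" inside the ISR: on a tickless kernel this can
	 * already be a tick (or more) past the true expiry point.
	 */
	int64_t now = k_uptime_get();

	tdata.intervals[tdata.expire_cnt++] = now - tdata.timestamp;

	/* The late timestamp becomes the base for the next interval,
	 * which will therefore tend to measure "early"/short.
	 */
	tdata.timestamp = now;
}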
OK. I do not approve this, but I'll let it in, and will make correcting the test a priority, because I'm really tired of dealing with this hackery. |
Correct a bunch of precision/analysis errors in this test:
Test items weren't consistent about tick alignment and resetting of
the timestamp, so put these steps into init_timer_data() and call
that immediately before k_timer_start().
Many items would calculate the initial timestamp AFTER
k_timer_start(), leading to an extra (third!) point where the timer
computation could alias by an extra tick. Always do this
consistently before the timer is started (via init_timer_data()).
Tickless systems with high tick rates can easily advance the system
uptime while the timer ISR is running, so the system can't expect
perfect accuracy even there (this test was originally written for
ticked systems where the ISR was by definition happening "at the
same time").
(Unfortunately our most popular high tick rate tickless system,
nRF5, also has a clock that doesn't divide milliseconds exactly, so
it had a special path through all these precision comparisons and
avoided the bugs. We finally found them on an x86 HPET system with 10
kHz ticks.)
The interval validation was placing a minimum bound on the interval
time but not a maximum (this mistake was what had hidden the failure
to reset the timestamp mentioned above).
Longer term, the millisecond precision math in these tests is at this
point an out of control complexity explosion. We should look at
reworking the core OS tests of k_timer to use tick precision (which is
by definition exact) pervasively and leave the millisecond stuff to a
separate layer testing the alternative/legacy APIs.
Fixes #31964 (probably -- that was reported against up_squared, on
which I had trouble reproducing, but it was a common failure on
ehl_crb).
Signed-off-by: Andy Ross <andrew.j.ross@intel.com>