tests/timer_api: Correct precision and fix correctness mistakes #32277

andyross · 2021-02-12T20:22:19Z

Correct a bunch of precision/analysis errors in this test:

* Test items weren't consistent about tick alignment and resetting of
  the timestamp, so put these steps into init_timer_data() and call
  that immediately before k_timer_start().
* Many items would calculate the initial timestamp AFTER
  k_timer_start(), leading to an extra (third!) point where the timer
  computation could alias by an extra tick. Always do this
  consistently before the timer is started (via init_timer_data()).
* Tickless systems with high tick rates can easily advance the system
  uptime while the timer ISR is running, so the system can't expect
  perfect accuracy even there (this test was originally written for
  ticked systems where the ISR was by definition happening "at the
  same time").

  (Unfortunately our most popular high tick rate tickless system,
  nRF5, also has a clock that doesn't divide milliseconds exactly, so
  it had a special path through all these precision comparisons and
  avoided the bugs. We finally found it on an x86 HPET system with 10
  kHz ticks.)
* The interval validation was placing a minimum bound on the interval
  time but not a maximum (this mistake was what had hidden the failure
  to reset the timestamp mentioned above).

Longer term, the millisecond precision math in these tests is at this
point an out of control complexity explosion. We should look at
reworking the core OS tests of k_timer to use tick precision (which is
by definition exact) pervasively and leave the millisecond stuff to a
separate layer testing the alternative/legacy APIs.
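
To illustrate why tick precision is exact while millisecond precision is not, here is a minimal sketch. The helper names are hypothetical and are not the Zephyr conversion API; the 32768 Hz rate is an assumption standing in for a clock that doesn't divide milliseconds exactly, while 10 kHz is the HPET rate mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not the Zephyr API. */

/* Convert milliseconds to ticks, rounding up so a delay is never short. */
static uint64_t ms_to_ticks_ceil(uint64_t ms, uint64_t hz)
{
	assert(hz > 0);
	return (ms * hz + 999) / 1000;
}

/* Convert ticks to milliseconds, discarding any fractional millisecond. */
static uint64_t ticks_to_ms_floor(uint64_t ticks, uint64_t hz)
{
	assert(hz > 0);
	return (ticks * 1000) / hz;
}

/* Conversions are exact only if one millisecond is a whole number of ticks. */
static int ms_is_whole_ticks(uint64_t hz)
{
	return (hz % 1000) == 0;
}
```

At 10 kHz, 100 ms is exactly 1000 ticks and converts back to exactly 100 ms. At the assumed 32768 Hz, 100 ms rounds up to 3277 ticks, and a whole range of tick counts collapses onto the same millisecond value, so a comparison in milliseconds throws away information that a comparison in ticks keeps.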

Fixes #31964 (probably -- that was reported against up_squared, on
which I had trouble reproducing, but it was a common failure on
ehl_crb).

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>

andyross · 2021-02-12T20:23:44Z

This is one of those fixes where I spend more time writing the commit message than the code to try to convince people I'm not just cheating to make the test pass. But I swear I'm not just cheating to make the test pass! Careful review appreciated.

jenmwms

Tested on HW (up_squared, ehl_crb), resolves the issue. LGTM.

jenmwms · 2021-02-12T21:32:03Z

@chen-png fyi

pabigot

This can work because the epsilon to WITHIN_ERROR is 1, but I don't believe the changes are correct in the general case, and conflating upper and lower epsilons may cause future tests to be too lax.

A timer must never expire in less time than it was configured for, but it can appear to take longer due to delays. So the switch to a double-sided error with the same latitude in each direction isn't right.

The original motivation for allowing a timer to appear to expire one unit early was specifically for cases where the millisecond boundaries fell between ticks, and even though the correct duration did elapse that could not be confirmed using more coarse millisecond-aligned observations which discarded fractional ticks.

If millisecond conversions are kept, the interval check should continue to have a tighter lower bound: zero slop if conversions are precise, and if they aren't, just one unit below the expected duration, regardless of whether the high-side epsilon is larger. (I don't believe two units below is possible when the tick frequency exceeds 1 kHz, and I'm skeptical it can happen in other cases either.)

But since we have a whole release cycle to address this it would be much better to update the test now to operate on ticks (as suggested in the commit message), and get rid of the whole lossy millisecond conversion. We spend far too much time going back and patching this thing to make it pass.

pabigot · 2021-02-14T17:33:06Z

tests/kernel/timer/timer_api/src/main.c

@@ -17,8 +17,8 @@ struct timer_data {
 #define DURATION 100
 #define PERIOD 50
 #define EXPIRE_TIMES 4
-#define WITHIN_ERROR(var, target, epsilon)       \
-		(((var) >= (target)) && ((var) <= (target) + (epsilon)))
+#define WITHIN_ERROR(var, target, epsilon) (abs((target) - (var)) <= (epsilon))


This changes the check from a single-sided error to two-sided. I.e. in the past var < target would fail. Is the switch to two-sided really necessary? (added: in the case where conversion is precise)
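
The difference between the two forms of the macro can be checked in isolation. The sketch below copies both definitions from the hunk above; the `_OLD`/`_NEW` suffixes and the wrapper functions are hypothetical names added only so both versions can coexist in one file.

```c
#include <assert.h>
#include <stdlib.h>

/* One-sided form (before the patch): var may exceed target by up to
 * epsilon, but may never fall below it.
 */
#define WITHIN_ERROR_OLD(var, target, epsilon) \
	(((var) >= (target)) && ((var) <= (target) + (epsilon)))

/* Two-sided form (after the patch): var may differ from target by up to
 * epsilon in either direction.
 */
#define WITHIN_ERROR_NEW(var, target, epsilon) \
	(abs((target) - (var)) <= (epsilon))

/* Hypothetical wrappers so the macros can be exercised as functions. */
static int old_accepts(int var, int target, int epsilon)
{
	assert(epsilon >= 0);
	return WITHIN_ERROR_OLD(var, target, epsilon);
}

static int new_accepts(int var, int target, int epsilon)
{
	assert(epsilon >= 0);
	return WITHIN_ERROR_NEW(var, target, epsilon);
}
```

With a target of 100 and an epsilon of 1, both forms accept 100 and 101 and reject 102. They differ only at 99, which the old form rejects and the new form accepts.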

pabigot · 2021-02-14T17:37:34Z

tests/kernel/timer/timer_api/src/main.c

+		slop += 2 * k_ticks_to_ms_ceil32(1);
+	}
+
+	if (abs(interval - desired) > slop) {


Same issue here: interval less than desired should be an error for precise conversions, and for inexact conversions should be limited to one less.
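
A check with separate low-side and high-side slop, as suggested here, might look like the following sketch. `interval_within()` and its parameters are hypothetical and are not part of the patch, which uses a single `slop` in both directions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical asymmetric bound: the interval may fall short of the
 * desired value by at most lo_slop and exceed it by at most hi_slop.
 */
static bool interval_within(int64_t interval, int64_t desired,
			    int64_t lo_slop, int64_t hi_slop)
{
	assert(lo_slop >= 0 && hi_slop >= 0);
	return interval >= desired - lo_slop &&
	       interval <= desired + hi_slop;
}
```

For precise conversions the caller would pass `lo_slop = 0`, so any interval below the desired value fails. For inexact conversions it would pass `lo_slop = 1`, while `hi_slop` can stay larger to absorb delays.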

andyross · 2021-02-15T01:10:35Z

The time was being checked inside the ISR, though. An elapsed tick in that realm can absolutely result in a timer expiring "too early", and it was. It's not about unit precision at all, really, though your changes on Nordic to relax the requirements did have the effect of hiding the bugs there. Likewise the spots where the timestamp was being retrieved late would cause the same "too early" failure, though I believe I caught all those. I'll see if some of the other cases are being caught as false positives, but in general the test was wrong about what it was assuming.

As far as rewriting the test for clarity: yes, that should happen. But it's wrong now and causing failures. We need to merge this.

pabigot · 2021-02-15T11:23:54Z

> The time was being checked inside the ISR, though. An elapsed tick in that realm can absolutely result in a timer expiring "too early", and it was.

This must not happen. Applications cannot tolerate timers firing before they're supposed to fire.

Let's figure out why that appears to be happening, whether that's due to:

* use of imprecise millisecond conversions causing estimated elapsed time to be incorrectly calculated in the test, or
* failure to calculate the correct deadline to satisfy the specified delay, or
* use of relative deadlines when absolute ones are required (to mitigate "late-to-set" bugs), or
* failure of the timer infrastructure to correctly reflect the number of ticks elapsed at the point the handler is invoked, or
* whatever the root cause happens to be.

This path of hacking around each new failure we discover by further relaxing the "pass" criteria is not working for us. If the pass criteria are wrong, let's demonstrate that by eliminating the conversion error we already know is causing problems, then figure out what the real problem is.

I'd rather rewrite this test to use ticks myself than continue having to tweak it every single release. Would you like me to do that? I don't have an Elkhart Lake CRB though, so since that's a problematic platform somebody else would have to do the testing (or send me one).

andyross · 2021-02-15T15:12:38Z

I think you're still misunderstanding. The timer fired when it was supposed to fire. Time (real time) elapsed between the time the kernel interrupt was delivered and the moment the counter was retrieved in the ISR. And because this counter was late, the interval until the next timer will be early, leading to a short count. This isn't something that can be prevented by design; it's a bug in the test that assumed that ISRs see time as "frozen" (because time was implemented as a count of interrupts delivered). That stopped being true when everything became tickless.
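
A small model of the effect, with made-up numbers: assume a 10 kHz tick rate (10 ticks per millisecond) and a 50 ms period, so expiries land exactly 500 ticks apart. The function names and latency values are hypothetical; only the arithmetic matters.

```c
#include <assert.h>
#include <stdint.h>

#define TICKS_PER_MS 10 /* 10 kHz tick rate */

/* Millisecond uptime as the ISR would observe it: the expiry tick plus
 * the ticks that elapsed before the counter was read, truncated to ms.
 */
static int64_t observed_ms(int64_t expiry_tick, int64_t isr_latency_ticks)
{
	assert(isr_latency_ticks >= 0);
	return (expiry_tick + isr_latency_ticks) / TICKS_PER_MS;
}

/* Interval in ms between two consecutive expiries, measured from the
 * late counter reads rather than from the expiry ticks themselves.
 */
static int64_t measured_interval_ms(int64_t first_expiry,
				    int64_t first_latency,
				    int64_t second_expiry,
				    int64_t second_latency)
{
	return observed_ms(second_expiry, second_latency) -
	       observed_ms(first_expiry, first_latency);
}
```

With expiries at ticks 1000 and 1500, a first read that is 12 ticks late and a second read that is 1 tick late give observed times of 101 ms and 150 ms, so the measured interval is 49 ms even though the timer itself ran the full 500 ticks. Swap the two latencies and the same timer measures 51 ms.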

Please remove your -1 so we can merge this, if there's a new bug in the test please file it.

Beaten into submission.

pabigot · 2021-02-15T15:43:59Z

> ... it's a bug in the test that assumed that ISRs see time as "frozen" (because time was implemented as a count of interrupts delivered). That stopped being true when everything became tickless.
>
> Please remove your -1 so we can merge this, if there's a new bug in the test please file it.

OK. I do not approve this, but I'll let it in, and will make correcting the test a priority, because I'm really tired of dealing with this hackery.

andyross requested review from dcpleung and nashif as code owners February 12, 2021 20:22

andyross requested review from aasthagr and jenmwms February 12, 2021 20:22

github-actions bot added the area: Kernel and area: Tests labels Feb 12, 2021

dcpleung approved these changes Feb 12, 2021

jenmwms approved these changes Feb 12, 2021

aasthagr approved these changes Feb 12, 2021

jenmwms added the bug label Feb 12, 2021

nashif added the backport v2.5-branch label Feb 13, 2021

zephyrbot requested a review from ceolin February 13, 2021 13:27

zephyrbot assigned andyross Feb 13, 2021

pabigot previously requested changes Feb 14, 2021

pabigot mentioned this pull request Feb 15, 2021

reimplement tests/kernel/timer/timer_api #32339

Closed

nashif merged commit c2339db into zephyrproject-rtos:master Feb 16, 2021

zephyrbot mentioned this pull request Feb 16, 2021

[Backport v2.5-branch] tests/timer_api: Correct precision and fix correctness mistakes #32371

Merged
