Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests/timer_api: Correct precision and fix correctness mistakes #32277

Merged
merged 1 commit into from
Feb 16, 2021

Conversation

andyross
Copy link
Contributor

Correct a bunch of precision/analysis errors in this test:

  • Test items weren't consistent about tick alignment and resetting of
    the timestamp, so put these steps into init_timer_data() and call
    that immediately before k_timer_start().

  • Many items would calculate the initial timestamp AFTER
    k_timer_start(), leading to an extra (third!) point where the timer
    computation could alias by an extra tick. Always do this
    consistently before the timer is started (via init_timer-data()).

  • Tickless systems with high tick rates can easily advance the system
    uptime while the timer ISR is running, so the system can't expect
    perfect accuracy even there (this test was originally written for
    ticked systmes where the ISR was by definition happening "at the
    same time").

    (Unfortunately our most popular high tick rate tickless system,
    nRF5, also has a clock that doesn't divide milliseconds exactly, so
    it had a special path through all these precision comparisons and
    avoided the bugs. We finally found it on a x86 HPET system with 10
    kHz ticks.)

  • The interval validation was placing a minimum bound on the interval
    time but not a maximum (this mistake was what had hidden the failure
    to reset the timestamp mentioned above).

Longer term, the millisecond precision math in these tests is at this
point an out of control complexity explosion. We should look at
reworking the core OS tests of k_timer to use tick precision (which is
by definition exact) pervasively and leave the millisecond stuff to a
separate layer testing the alternative/legacy APIs.

Fixes #31964 (probably -- that was reported against up_squared, on
which I had trouble reproducing, but it was a common failure on
ehl_crb).

Signed-off-by: Andy Ross andrew.j.ross@intel.com

Correct a bunch of precision/analysis errors in this test:

* Test items weren't consistent about tick alignment and resetting of
  the timestamp, so put these steps into init_timer_data() and call
  that immediately before k_timer_start().

* Many items would calculate the initial timestamp AFTER
  k_timer_start(), leading to an extra (third!) point where the timer
  computation could alias by an extra tick.  Always do this
  consistently before the timer is started (via init_timer-data()).

* Tickless systems with high tick rates can easily advance the system
  uptime while the timer ISR is running, so the system can't expect
  perfect accuracy even there (this test was originally written for
  ticked systmes where the ISR was by definition happening "at the
  same time").

  (Unfortunately our most popular high tick rate tickless system,
  nRF5, also has a clock that doesn't divide milliseconds exactly, so
  it had a special path through all these precision comparisons and
  avoided the bugs.  We finally found it on a x86 HPET system with 10
  kHz ticks.)

* The interval validation was placing a minimum bound on the interval
  time but not a maximum (this mistake was what had hidden the failure
  to reset the timestamp mentioned above).

Longer term, the millisecond precision math in these tests is at this
point an out of control complexity explosion.  We should look at
reworking the core OS tests of k_timer to use tick precision (which is
by definition exact) pervasively and leave the millisecond stuff to a
separate layer testing the alternative/legacy APIs.

Fixes zephyrproject-rtos#31964 (probably -- that was reported against up_squared, on
which I had trouble reproducing, but it was a common failure on
ehl_crb).

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@andyross andyross requested review from aasthagr and jenmwms February 12, 2021 20:22
@github-actions github-actions bot added area: Kernel area: Tests Issues related to a particular existing or missing test labels Feb 12, 2021
@andyross
Copy link
Contributor Author

This is one of those fixes where I spend more time writing the commit message than the code to try to convince people I'm not just cheating to make the test pass. But I swear I'm not just cheating to make the test pass! Careful review appreciated.

Copy link
Collaborator

@jenmwms jenmwms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on HW (up_squared, ehl_crb), resolves the issue. LGTM.

@jenmwms jenmwms added the bug The issue is a bug, or the PR is fixing a bug label Feb 12, 2021
@jenmwms
Copy link
Collaborator

jenmwms commented Feb 12, 2021

@chen-png fyi

Copy link
Collaborator

@pabigot pabigot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can work because the epsilon to WITHIN_ERROR is 1, but I don't believe the changes are correct in the general case, and conflating upper and lower epsilons may cause future tests to be too lax.

A timer must never expire in less time than it was configured for, but it can appear to take longer due to delays. So the switch to a double-sided error with the same latitude in each direction isn't right.

The original motivation for allowing a timer to appear to expire one unit early was specifically for cases where the millisecond boundaries fell between ticks, and even though the correct duration did elapse that could not be confirmed using more coarse millisecond-aligned observations which discarded fractional ticks.

If millisecond conversions are kept the interval check should continue to have a tighter lower bound, zero slop if conversions are precise, and if they aren't just one units below the expected duration regardless of whether the high-side epsilon is larger. (I don't believe two units below is possible when the tick frequency exceeds 1 kHz, and I'm skeptical it can happen in other cases either.)

But since we have a whole release cycle to address this it would be much better to update the test now to operate on ticks (as suggested in the commit message), and get rid of the whole lossy millisecond conversion. We spend far too much time going back and patching this thing to make it pass.

@@ -17,8 +17,8 @@ struct timer_data {
#define DURATION 100
#define PERIOD 50
#define EXPIRE_TIMES 4
#define WITHIN_ERROR(var, target, epsilon) \
(((var) >= (target)) && ((var) <= (target) + (epsilon)))
#define WITHIN_ERROR(var, target, epsilon) (abs((target) - (var)) <= (epsilon))
Copy link
Collaborator

@pabigot pabigot Feb 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the check from a single-sided error to two-sided. I.e. in the past var < target would fail. Is the switch to two-sided really necessary? (added: in the case where conversion is precise)

slop += 2 * k_ticks_to_ms_ceil32(1);
}

if (abs(interval - desired) > slop) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here: interval less than desired should be an error for precise conversions, and for inexact conversions should be limited to one less.

@andyross
Copy link
Contributor Author

The time was being checked inside the ISR, though. An elapsed tick in that realm can absolutely result in a timer expiring "too early", and it was. It's not about unit precision at all, really, though your changes on Nordic to relax the requirements did have the effect of hiding the bugs there. Likewise the spots where the timestamp was being retrieved late would cause the same "too early" failure, though I believe I caught all those. I'll see if some of the other cases are being caught as false positives, but in general the test was wrong about what it was assuming.

As far as rewriting the test for clarity: yes, that should happen. But it's wrong now and causing failures. We need to merge this.

@pabigot
Copy link
Collaborator

pabigot commented Feb 15, 2021

The time was being checked inside the ISR, though. An elapsed tick in that realm can absolutely result in a timer expiring "too early", and it was.

This must not happen. Applications cannot tolerate timers firing before they're supposed to fire.

Let's figure out why that appears to be happening, whether that's due to:

  • use of imprecise millisecond conversions causing estimated elapsed time to be incorrectly calculated in the test, or
  • failure to calculate the correct deadline to satisfy the specified delay, or
  • use of relative deadlines when absolute ones are required (to mitigate "late-to-set" bugs), or
  • failure of the timer infrastructure to correctly reflect the number of ticks elapsed at the point the handler is invoked, or
  • whatever the root cause happens to be.

This path of hacking around each new failure we discover by further relaxing the "pass" criteria is not working for us. If the pass criteria are wrong, let's demonstrate that by eliminating the conversion error we already know is causing problems, then figure out what the real problem is.

I'd rather rewrite this test to use ticks myself than continue having to tweak it every single release. Would you like me to do that? I don't have an Elkhart Lake CRB though, so since that's a problematic platform somebody else would have to do the testing (or send me one).

@andyross
Copy link
Contributor Author

I think you're still misunderstanding. The timer fired when it was supposed to fire. Time (real time) elapsed between the time the kernel interrupt was delivered and the moment the counter was retrieved in the ISR. And because this counter was late, the interval until the next timer will be early, leading to a short count. This isn't something that can be prevented by design, it's a bug in the test that assumed that ISR's see time as "frozen" (because time was implemented as a count of interrupts delivered). That stopped being true when everything became tickless.

Please remove your -1 so we can merge this, if there's a new bug in the test please file it.

@pabigot pabigot dismissed their stale review February 15, 2021 15:43

Beaten into submission.

@pabigot
Copy link
Collaborator

pabigot commented Feb 15, 2021

... it's a bug in the test that assumed that ISR's see time as "frozen" (because time was implemented as a count of interrupts delivered). That stopped being true when everything became tickless.

Please remove your -1 so we can merge this, if there's a new bug in the test please file it.

OK. I do not approve this, but I'll let it in, and will make correcting the test a priority, because I'm really tired of dealing with this hackery.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel area: Tests Issues related to a particular existing or missing test bug The issue is a bug, or the PR is fixing a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

up_squared: tests/kernel/timer/timer_api failed.
6 participants