
time.monotonic_ns sometimes jumps forward, and then returns the correct value on the next call #5985

Closed
jwfmcclain opened this issue Feb 6, 2022 · 19 comments · Fixed by #6004 or #6202
Comments

@jwfmcclain

CircuitPython version

pi@serialpi:~ $ more /media/CIRCUITPY/boot_out.txt 
Adafruit CircuitPython 7.0.0 on 2021-09-20; Adafruit Feather M4 Express with samd51j19
Board ID:feather_m4_express

Code/REPL

# system is bigger than this, but this is the only code making calls to monotonic_ns
# (SECONDS_PER_NS, previous_now, monotonic_ns_calls, and count_string() are defined elsewhere)
import time

def get_new_now(prev):
    global previous_now
    global monotonic_ns_calls

    if prev is not None:
        previous_now = prev

    monotonic_ns_calls += 1
    now = time.monotonic_ns()

    if previous_now is not None:
        if previous_now > now:
            print(f"monotonic_ns MOVED BACKWARD! {previous_now} {now} : {monotonic_ns_calls} {count_string()}")
        elif now - previous_now > SECONDS_PER_NS * 3600 * 24:
            print(f"monotonic_ns JUMPED MORE THAN A DAY! {previous_now} {now} : {monotonic_ns_calls} {count_string()}")
            # monotonic_ns_calls += 1
            # now = time.monotonic_ns()
            # print(f"called monotonic_ns() again, now: {now}")

    return now

Behavior

monotonic_ns JUMPED MORE THAN A DAY! 4194303992614752 4456447999938977 : 13388088 loop_count: 3868063 state_clock: 3868062
[...other stuff...]
monotonic_ns MOVED BACKWARD! 4456447999938977 4194304025573733 : 13388089 loop_count: 3868064 state_clock: 3868063

Description

I have two Adafruit Feather M4s running programs that call time.monotonic_ns ~100 times a second. Each day, with roughly a 10% chance, it seems one of them makes a monotonic_ns call and gets a result more than a day in the future. The next call then seems to be OK.

I have an Adafruit Circuit Playground Bluefruit running similar code that doesn't seem to show this problem. Aside from the CPU, one difference is that the M4s set rtc.RTC().datetime once an hour... though the discontinuities in monotonic_ns don't seem correlated with those calls.

Additional information

No response

@jwfmcclain jwfmcclain added the bug label Feb 6, 2022
@dhalbert dhalbert added this to the 7.x.x milestone Feb 6, 2022
@jepler
Member

jepler commented Feb 6, 2022

Looking for interesting relationships between the two timestamps, I found that they're 262144 (2^18) seconds apart, almost exactly.

$ units '(4456447999938977-4194303992614752)ns'
        Definition: 262144.01 s

Is a power on time of >40 days required to reproduce this bug, or will it equally reproduce during the first day(s) of power on time?

@dhalbert
Collaborator

dhalbert commented Feb 6, 2022

This sounds like fetching two halves of a count non-atomically, but I thought we guarded against that.

@jwfmcclain
Author

The other pair I have (this has been going on for over a month, but it took me a while to get the right logging) is:

monotonic_ns jumped more than a day! 3932159995422367 4194303999938977 : 48220698 loop_count: 13949689 state_clock: 13949688
monotonic_ns MOVED BACKWARD! 4194303999938977 3932160026794439 : 48220699 loop_count: 13949690 state_clock: 13949689

About the same jump:

In [1]: (4194303999938977-3932160026794439) / 1_000_000_000                     
Out[1]: 262143.973144538

It certainly takes less than 40 days. The last two happened on Feb 3rd and 6th. And it doesn't seem periodic either... I haven't been keeping good enough records, but it feels Poisson.

@jepler
Member

jepler commented Feb 7, 2022

I have a general theory as to the cause, but didn't get a chance to prove it. My theory's specific to SAM D5x / E5x microcontrollers.

I found that "Our RTC is 32 bits and we're clocking it at 16.384khz" which means that 262144 seconds is the RTC overflow period.
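
A quick sanity check of that period (just arithmetic, not from the port's code): a 32-bit counter clocked at 16.384 kHz wraps every 2^32 / 16384 seconds.

```python
>>> (2 ** 32) / 16_384        # seconds until a 32-bit counter at 16.384 kHz overflows
262144.0
>>> 262144 / 3600 / 24        # overflow period in days
3.034074074074074
```

That matches the ~262144 s size of the jumps in the logs above.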

An interrupt will fire to increment the "overflowed_ticks" value, which together with the RTC count is read with interrupts disabled:

    common_hal_mcu_disable_interrupts();
    uint32_t count = RTC->MODE0.COUNT.reg;
    if (overflow_count != NULL) {
        *overflow_count = overflowed_ticks;
    }
    common_hal_mcu_enable_interrupts();

Here's where my account of what's going on is hazy. Your logs make it look like the erroneous timestamp is 3 days in the future. The possibility I see is for COUNT.reg to overflow between common_hal_mcu_disable_interrupts and the load of the register, but that would be extremely low-probability, and the low timer bits would all have to be close to zero. To fix that, you'd need to check whether the overflow flag is set and the count is very small (I think?) and, if so, increment *overflow_count here.

But if you catch it every 3 days, it can't be a rarity compared to a register that overflows every 3 days; and your jump seems to be in the wrong direction.

hmmmmm .. ! There is something special about your wrong values. "A tick is 976562.5 nanoseconds"; dividing out that factor does give an interesting number in hex:

>>> from math import floor
>>> stamps = 4194303992614752, 4456447999938977, 4194304025573733, 3932159995422367, 4194303999938977, 3932160026794439
>>> for s in stamps: print("%010x" % floor(s / 976562.5))
... 
00fffffff8
010fffffff
010000001a
00effffffb
00ffffffff
00f000001b

I still don't see what the bug is though. It looks like overflow_count has been incremented, but COUNT.reg hasn't. Silicon bug? Insufficient understanding of what it takes to ensure COUNT.reg is synchronized?
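
To make that concrete, here is a rough numerical model (my sketch, not the port's actual code; the exact composition is an assumption based on the numbers quoted above) of what happens if the overflow count has already been bumped while the COUNT register still reads a pre-overflow value:

```python
NS_PER_TICK = 976562.5                # "A tick is 976562.5 nanoseconds" (1/1024 s)
TICKS_PER_OVERFLOW = 2 ** 32 // 16    # 32-bit COUNT at 16.384 kHz, 16 subticks per tick

def model_ns(overflows, count):
    # Rough model: ticks = completed overflows plus the 32-bit COUNT scaled down to ticks.
    ticks = overflows * TICKS_PER_OVERFLOW + (count >> 4)
    return ticks * NS_PER_TICK

before = model_ns(0, 0xFFFFFF80)   # healthy read just before the wrap
buggy  = model_ns(1, 0xFFFFFFF0)   # overflow count bumped, but COUNT still pre-overflow
after  = model_ns(1, 0x00000180)   # next read, after COUNT has actually wrapped

print((buggy - before) / 1e9)      # ~ +262144 s: the "JUMPED MORE THAN A DAY" log line
print((after - buggy) / 1e9)       # ~ -262144 s: the "MOVED BACKWARD" on the next call
```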

@dhalbert
Collaborator

dhalbert commented Feb 7, 2022

but the possibility I see is for COUNT.reg to overflow between common_hal_mcu_disable_interrupts and the load of the register

I looked at the code earlier today and wondered the same thing, but it did seem like very low probability.

I looked at the errata and found this interesting problem. We are setting COUNTSYNC, which we have to set to be able to read the count. But we set it in rtc_init(), so I'm not sure how this problem would persist after the first read.

[screenshot of the errata entry referred to above]

@jwfmcclain
Author

I would be surprised if it was happening every 3 days... it would not surprise me if every 3 days it happens with 25% to 50% probability. I will add new cases to the ticket as they come in.

@jwfmcclain
Author

I should also add: it is possible there are times where it goes backwards in time and then jumps forward. Before I had my logging in place I probably would not have noticed (because the system would have otherwise appeared to work... it is the jump forward and then back that gets my code into a bad state, so I know to look). Back-then-forward would show up in this new logging, but the logs sometimes scroll off before I get to read them if nothing else bad appears to be happening.

TL;DR - I am confident I am seeing all the forward-then-back events; it is possible there are also back-then-forward events that I have missed.

@jwfmcclain
Author

I got another, this time on the "other" Feather; let's call it "unit 2":

monotonic_ns JUMPED MORE THAN A DAY! 262143935363780 524287999938977 : 5499007 loop_count: 2489985 state_clock: 2489984
monotonic_ns MOVED BACKWARD! 524287999938977 262144015563977 : 5499008 loop_count: 2489986 state_clock: 2489985
In [2]: (524287999938977 - 262143935363780) / 1_000_000_000                                                                                                                                                  
Out[2]: 262144.064575197
In [3]: 262143935363780 / (1_000_000_000 * 3600)                                                                                                                                                             
Out[3]: 72.81775982327223

@dhalbert
Collaborator

dhalbert commented Feb 9, 2022

Here is a more careful implementation:
https://github.com/tinygo-org/tinygo/blob/release/src/runtime/runtime_atsamd51.go
And a PR which fixed a problem:
tinygo-org/tinygo#1870
A long thread about the general technique (referenced in the code and PR above):
https://www.eevblog.com/forum/microcontrollers/correct-timing-by-timer-overflow-count/
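
As an illustration of the general technique those links describe (read the overflow count, read the hardware counter, read the overflow count again, and retry if an overflow slipped in between), here is a minimal Python model; get_hw_count and get_overflow_count are hypothetical stand-ins, and this is not the TinyGo or CircuitPython code:

```python
def consistent_ticks(get_hw_count, get_overflow_count):
    # Compose a 64-bit tick value from a 32-bit hardware counter plus a
    # software-maintained overflow count, without disabling interrupts.
    while True:
        overflows_before = get_overflow_count()
        count = get_hw_count()
        overflows_after = get_overflow_count()
        if overflows_before == overflows_after:
            # No overflow landed between the two reads, so the pair is consistent.
            return (overflows_after << 32) | count
        # An overflow interrupt fired mid-read; try again.
```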

@jwfmcclain
Author

Somewhat surprisingly (to me) unit 1 had the error again last night:

monotonic_ns JUMPED MORE THAN A DAY! 4456447997497564 4718591999938977 : 47534236 loop_count: 14258678 state_clock: 14258677
monotonic_ns MOVED BACKWARD! 4718591999938977 4456448026550295 : 47534237 loop_count: 14258679 state_clock: 14258678
In [29]: (4718591999938977 - 4456447997497564) / (1_000_000_000)                                                                                              
Out[29]: 262144.002441413

# ...

# time since last event
In [34]: (4456447997497564 - 4194303992614752) / (1_000_000_000 * 3600)                                                                                       
Out[34]: 72.81777913411445
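
That ~72.8-hour spacing is just the 262144-second overflow period expressed in hours:

```python
>>> 262144 / 3600
72.81777777777778
```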

So maybe it really is just happening every 3 days, and it just seemed longer because I was impatient for reproducibility while tweaking the logging....

Happy to keep on reporting, or to stop if it is just noise at this point.

@dhalbert
Collaborator

dhalbert commented Feb 9, 2022

We hope to give you a test version relatively soon, I think. And we want to try to reproduce this in a simpler way, maybe by presetting the counter value so we don't have to wait three days :). I just talked to @jepler about this, this morning.

@jwfmcclain
Author

FWIW, Unit 2 got the bug again

monotonic_ns JUMPED MORE THAN A DAY! 524287829284670 786431999938977 : 5503672 loop_count: 2486214 state_clock: 2486213
monotonic_ns MOVED BACKWARD! 786431999938977 524288017272958 : 5503673 loop_count: 2486215 state_clock: 2486214
In [47]: (786431999938977 - 524287829284670) / 1_000_000_000                                                                                                  
Out[47]: 262144.170654307

In [48]: (524287829284670 - 262144015563977) / (1_000_000_000 * 3600)                                                                                         
Out[48]: 72.81772603352583

Will try to get the build running tomorrow.

@jwfmcclain
Author

@dhalbert Installed 7.2.0-alpha.1-366-g528c2a322 on both M4 Feathers Thursday afternoon (evening?), and got the error on both boards Sunday night (EST).

Unit 1

pi@serialpi:~ $ cat /media/CIRCUITPY/boot_out.txt 
Adafruit CircuitPython 7.2.0-alpha.1-366-g528c2a322 on 2022-02-09; Adafruit Feather M4 Express with samd51j19
Board ID:feather_m4_express
pi@serialpi:~ $ grep monotonic_ns screenlog.0
monotonic_ns JUMPED MORE THAN A DAY! 262143997924816 524287999877941 : 34111132 loop_count: 11828534 state_clock: 11828533
monotonic_ns MOVED BACKWARD! 524287999877941 262144032226562 : 34111133 loop_count: 11828535 state_clock: 11828534
pi@serialpi:~ $ 
In [1]: (524287999877941-262143997924816) / 1_000_000_000                                                                                            
Out[1]: 262144.001953125

Unit 2

Big-iMac:2021-10-10 pdp8$ cat /Volumes/CIRCUITPY/boot_out.txt 
Adafruit CircuitPython 7.2.0-alpha.1-366-g528c2a322 on 2022-02-09; Adafruit Feather M4 Express with samd51j19
Board ID:feather_m4_express
Big-iMac:2021-10-10 pdp8$ grep monotonic_ns screenlog.0
monotonic_ns JUMPED MORE THAN A DAY! 262143996887218 524287999938977 : 14753113 loop_count: 4346831 state_clock: 4346830
monotonic_ns MOVED BACKWARD! 524287999938977 262144015808108 : 14753114 loop_count: 4346832 state_clock: 4346831
Big-iMac:2021-10-10 pdp8$
In [2]: (524287999938977 - 262143996887218) / 1_000_000_000                                                                                          
Out[2]: 262144.003051759

Is it possible I have the wrong build?

@dhalbert
Collaborator

That is the build with the commit of my changes :( . It's also in 7.2.0-alpha.2.

@dhalbert
Collaborator

I will try to make some other kind of test that sets the count register forward so it doesn't take three days for confirmation.

@dhalbert dhalbert reopened this Feb 14, 2022
@bludin

bludin commented Feb 14, 2022

(Probably just a coincidence: 3 days is also the half-period of the ticks_ms counter, which relies on the same clock as monotonic_ns, as far as I remember from another thread. Might there be a connection?)

@jwfmcclain
Author

At ~day 6 unit 1 didn't show the problem, but unit 2 did

monotonic_ns JUMPED MORE THAN A DAY! 524287999023437 786431999877941 : 14147094 loop_count: 4617955 state_clock: 4617954
monotonic_ns MOVED BACKWARD! 786431999877941 524288015747072 : 14147095 loop_count: 4617956 state_clock: 4617955

During the day unit 1 doesn't call monotonic_ns as often, so it might just be that it didn't call monotonic_ns during the window (but I'm not sure where the window falls vs. ET day/night...).

@dhalbert
Collaborator

Thanks for the continued testing. I had hoped I had found the problem in the non-atomic reading of the register and incrementing of the overflow count, but there must be something further that needs to be done atomically.

@jepler
Member

jepler commented Mar 25, 2022

I instrumented CircuitPython so that I could easily cause RTC overflows.

diff --git a/ports/atmel-samd/common-hal/rtc/RTC.c b/ports/atmel-samd/common-hal/rtc/RTC.c
index e2a67bd17..1d16e9b86 100644
--- a/ports/atmel-samd/common-hal/rtc/RTC.c
+++ b/ports/atmel-samd/common-hal/rtc/RTC.c
@@ -68,6 +68,9 @@ int common_hal_rtc_get_calibration(void) {
 }
 
 void common_hal_rtc_set_calibration(int calibration) {
+    mp_printf(&mp_plat_print, "Warping RTC in calibration setter\n");
+    RTC->MODE0.COUNT.reg = 0xffffff00;
+
     if (calibration > 127 || calibration < -127) {
         #if CIRCUITPY_FULL_BUILD
         mp_raise_ValueError(translate("calibration value out of range +/-127"));
diff --git a/ports/atmel-samd/supervisor/port.c b/ports/atmel-samd/supervisor/port.c
index 7be1fdb53..372b17095 100644
--- a/ports/atmel-samd/supervisor/port.c
+++ b/ports/atmel-samd/supervisor/port.c
@@ -241,6 +241,8 @@ static void rtc_init(void) {
     #endif
 
     RTC->MODE0.INTENSET.reg = RTC_MODE0_INTENSET_OVF;
+    RTC->MODE0.COUNT.reg = 0xffffff00;
+
 
     // Set all peripheral interrupt priorities to the lowest priority by default.
     for (uint16_t i = 0; i < PERIPH_COUNT_IRQn; i++) {
@@ -496,8 +498,14 @@ uint32_t port_get_saved_word(void) {
 // TODO: Move this to an RTC backup register so we can preserve it when only the BACKUP power domain
 // is enabled.
 static volatile uint64_t overflowed_ticks = 0;
+volatile bool overflow_flag;
+volatile uint32_t overflow_ticks;
 
 static uint32_t _get_count(uint64_t *overflow_count) {
+    if(overflow_flag) {
+        mp_printf(&mp_plat_print, "RTC_Handler overflowed with overflow_ticks=0x%08x\n",overflow_ticks);
+        overflow_flag = 0;
+    }
     while(1) {
         // Disable interrupts so we can grab the count and the overflow atomically.
         common_hal_mcu_disable_interrupts();
@@ -530,6 +538,10 @@ volatile bool _woken_up;
 void RTC_Handler(void) {
     uint32_t intflag = RTC->MODE0.INTFLAG.reg;
     if (intflag & RTC_MODE0_INTFLAG_OVF) {
+
+overflow_flag = true;
+overflow_ticks = RTC->MODE0.COUNT.reg;
+
         RTC->MODE0.INTFLAG.reg = RTC_MODE0_INTFLAG_OVF;
         // Our RTC is 32 bits and we're clocking it at 16.384khz which is 16 (2 ** 4) subticks per
         // tick.

and I ran this program:

import time
from rtc import RTC

while True:
    RTC().calibration = 1
    t0 = time.monotonic_ns()
    et = t0 + 100_000_000 # .1 second
    while (t1 := time.monotonic_ns()) < et: pass
    print(f"duration {t1-t0}")
    print()

It turns out that the RTC overflow interrupt can occur while the RTC count register is merely close to wrapping around. I saw values as small as overflow_ticks=0xfffffffd! However, the code is written with the assumption that delivery of the interrupt happens EXACTLY WHEN the count register will read 0.

Warping RTC in calibration setter
RTC_Handler overflowed with overflow_ticks=0xfffffffd
duration 262144014770509

Warping RTC in calibration setter
RTC_Handler overflowed with overflow_ticks=0xfffffffe
duration 100097663

Warping RTC in calibration setter
RTC_Handler overflowed with overflow_ticks=0xfffffffe
duration 262144015502941

Warping RTC in calibration setter
duration 262143983703604
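
The fix in the commit below follows from that: rather than trusting the interrupt to arrive exactly at the wrap, _get_count can detect the overflow itself by noticing COUNT wrapping from a high value to a low one between reads. A rough Python model of just that idea (my sketch; the actual change is in C and, per the patch below, tracks the previous reading in rtc_old_count):

```python
# read_count() is a hypothetical stand-in for reading the RTC COUNT register.
rtc_old_count = 0
overflowed_subticks = 0

def get_count(read_count):
    global rtc_old_count, overflowed_subticks
    count = read_count()
    if count < rtc_old_count:
        # COUNT wrapped from a high value to a low one since the last read:
        # account for one full overflow, whether or not the interrupt has fired yet.
        overflowed_subticks += 1 << 32
    rtc_old_count = count
    return overflowed_subticks + count
```

A model this simple only notices a wrap if it is called at least once per overflow period; it is meant to show the wrap-from-high-to-low detection the commit message describes, not the full implementation.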

jepler added a commit to jepler/circuitpython that referenced this issue Mar 25, 2022
I instrumented RTC_Handler and determined that on SAMD51 it was possible
for the interrupt to be delivered well before the actual overflow of the
RTC COUNT register (e.g., a value as small as 0xffff_fffd could be seen
at the time of overflow)

Rather than depending on the overflow interrupt coming in at the same time
as COUNT overflows (exactly), rely only on observed values of COUNT in
_get_count, overflowing when it wraps around from a high value to a low
one.

With this change, PLUS a second change so that it is possible to warp
the RTC counter close to an overflow and test in 20ms instead of 3 days,
there was no problem detected over 20000+ overflows. Before, a substantial
fraction (much greater than 10%) of overflows failed.

Fixes adafruit#5985

Change to common-hal/rtc/RTC.c for time warping (plus make rtc_old_count non-static):
```patch
 void common_hal_rtc_set_calibration(int calibration) {
+
+    common_hal_mcu_disable_interrupts();
+
+    RTC->MODE0.COUNT.reg = 0xffffff00;
+    rtc_old_count = 0;
+    do {
+        while ((RTC->MODE0.SYNCBUSY.reg & (RTC_MODE0_SYNCBUSY_COUNTSYNC | RTC_MODE0_SYNCBUSY_COUNT)) != 0) { }
+    } while (RTC->MODE0.COUNT.reg < 0xffffff00);
+    common_hal_mcu_enable_interrupts();
+
+    mp_printf(&mp_plat_print, "Warping RTC in calibration setter count=%08x rtc_old_count=%08x\n", RTC->MODE0.COUNT.reg, rtc_old_count);
```

Test program:
```python
import time
from rtc import RTC

i = 0
while True:
    RTC().calibration = 1 # Warps to ~16ms before overflow, with patch to RTC code
    t0 = time.monotonic_ns()
    et = t0 + 20_000_000 # 20ms
    while (t1 := time.monotonic_ns()) < et: pass
    i += 1
    print(f"{i:6d}: duration {t1-t0}")
    if t1-t0 > 200_000_000: break
    print()
```