time.monotonic_ns sometimes jumps forward, and then returns the correct value on the next call #5985
Looking for interesting relationships between the two timestamps, I found that they're 262144 (2^18) seconds apart, almost exactly.
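For example, subtracting one wrong/right pair (the full set of stamps is quoted later in this thread):

```python
# Difference between an erroneous and a correct monotonic_ns() reading,
# using the pair of stamps quoted further down in the thread.
wrong, right = 4456447999938977, 4194303999938977
delta_ns = wrong - right
print(delta_ns)                  # 262144000000000
print(delta_ns / 1_000_000_000)  # 262144.0 seconds, i.e. exactly 2**18
```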
Is a power-on time of >40 days required to reproduce this bug, or will it reproduce just as readily during the first day(s) of power-on time? |
This sounds like fetching two halves of a count non-atomically, but I thought we guarded against that. |
The other pair I have (this has been going on for over a month, but it took me a while to get the right logging) is:
About the same jump
It certainly takes less than 40 days; the last two happened on Feb 3 and 6. And it doesn't seem periodic either... I haven't been keeping good enough records, but it feels Poisson. |
I have a general theory as to the cause, but didn't get a chance to prove it. My theory's specific to SAM D5x / E5x microcontrollers. I found that "Our RTC is 32 bits and we're clocking it at 16.384khz", which means that 262144 seconds is the RTC overflow period. An interrupt fires to increment the "overflowed_ticks" value, which, together with the RTC count, is read with interrupts disabled:
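The snippet itself isn't quoted here; as a rough Python model of that read (the real code is C in ports/atmel-samd/supervisor/port.c, and the helper names below are invented):

```python
# Rough Python model of the guarded read described above; all helper
# names here are assumptions, not the real port code.
overflowed_ticks = 0  # incremented by the RTC overflow interrupt handler

def get_ticks(read_count, disable_irq=lambda: None, enable_irq=lambda: None):
    disable_irq()                            # keep the OVF handler out
    count = read_count()                     # 32-bit COUNT register
    ticks = overflowed_ticks + (count >> 4)  # 16 subticks per tick
    enable_irq()
    return ticks

# Overflow-period check: a 32-bit counter at 16.384 kHz wraps every
# 2**32 / 2**14 == 2**18 == 262144 seconds, about 3.03 days.
print(2**32 / 16384)  # 262144.0
```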
Here's where my account of what's going on is hazy. Your logs make it look like the erroneous timestamp is 3 days in the future, but the possibility I see is for … But if you catch it every 3 days, it can't be a rarity compared to a register that overflows every 3 days; and your jump seems to be in the wrong direction. Hmmmm... there is something special about your wrong values. "A tick is 976562.5 nanoseconds"; dividing out that factor does give an interesting number in hex:

```python
>>> from math import floor
>>> stamps = 4194303992614752, 4456447999938977, 4194304025573733, 3932159995422367, 4194303999938977, 3932160026794439
>>> for s in stamps: print("%010x" % floor(s / 976562.5))
...
00fffffff8
010fffffff
010000001a
00effffffb
00ffffffff
00f000001b
```

I still don't see what the bug is, though. It looks like …
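Spelling that out (the arithmetic below is a sanity check, not from the original comment): the bad readings sit exactly one 32-bit RTC overflow above their neighbors:

```python
# A tick is 976562.5 ns, i.e. 1e9 / 1024: the 16.384 kHz RTC divided by
# 16 subticks per tick gives 1024 ticks per second.
assert 1e9 / 1024 == 976562.5

# The bad readings are exactly 2**28 ticks above the good ones:
bad, good = 0x010FFFFFFF, 0x00FFFFFFFF
print(hex(bad - good))   # 0x10000000 == 2**28 ticks
print((1 << 28) / 1024)  # 262144.0 s: one full 32-bit RTC overflow
```
|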
I looked at the code earlier today and wondered the same thing, but it did seem like very low probability. I looked at the errata and found this interesting problem. We are setting … |
I would be surprised if it was happening every 3 days... it would not surprise me if, every 3 days, it happens with 25% to 50% probability. Will add new cases to the ticket as they come in. |
I should also add, it is possible there are times where we go backwards in time and then jump forward. Before I had my logging I probably would not have noticed (because the system would have otherwise appeared to work... it is the jumping forward then back that gets my code into a bad state and tells me to look). Back-then-forward would show up in this new logging, but the logs sometimes scroll off before I get to read them if nothing else bad appears to be happening. TL;DR: I am confident I am seeing all the forward-then-back events; it is possible there are also back-then-forward events that I have missed. |
I got another, this time on the "other" Feather; let's call it "unit 2"
|
Here is a more careful implementation:
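The posted implementation isn't reproduced here; as a hypothetical sketch of the general shape (Python pseudocode with invented names, not necessarily what was actually posted):

```python
# Hypothetical sketch: retry until the overflow counter is stable across
# the COUNT read, so the (overflow, count) pair is known to be consistent.
def get_count_carefully(read_count, read_overflow):
    while True:
        before = read_overflow()   # snapshot the overflow counter
        count = read_count()       # then read the COUNT register
        if read_overflow() == before:
            return before, count   # no overflow fired mid-read
        # an overflow interrupt landed between the reads; try again
```
|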
Somewhat surprisingly (to me) unit 1 had the error again last night:
So maybe it really is just happening every 3 days and it just seemed longer because I was impatient for reproducibility while tweaking the logging.... Happy to keep on reporting, or to stop if it is just noise at this point |
We hope to give you a test version relatively soon, I think. And we want to try to reproduce this in a simpler way, maybe by presetting the counter value so we don't have to wait three days :). I just talked to @jepler about this, this morning. |
FWIW, Unit 2 got the bug again
Will try to get the build running tomorrow. |
@dhalbert Installed 7.2.0-alpha.1-366-g528c2a322 on both M4 Feathers Thursday afternoon (evening?), and got the error on both boards Sunday night (EST). Unit 1
Unit 2
Is it possible I have the wrong build? |
That is the build with the commit of my changes :( . It's also in 7.2.0-alpha.2. |
I will try to make some other kind of test that sets the count register forward so it doesn't take three days for confirmation. |
(Probably just a coincidence: 3 days is also the half-period of the ticks_ms counter, which relies on the same clock as monotonic_ns, as far as I remember from another thread. Might there be a connection?)
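Checking that coincidence (assuming supervisor.ticks_ms wraps every 2**29 ms, as its documentation says), the two periods are close but not identical:

```python
# supervisor.ticks_ms is documented to wrap every 2**29 ms, so its
# half-period is 2**28 ms -- close to, but not the same as, the RTC's
# 262144 s overflow period.
print((1 << 28) / 1000 / 86400)  # ~3.107 days (ticks_ms half-period)
print(262144 / 86400)            # ~3.034 days (RTC overflow period)
```
|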
At ~day 6 unit 1 didn't show the problem, but unit 2 did
During the day unit 1 doesn't call monotonic_ns as often, so it might just be that it didn't call monotonic_ns during the window (but I'm not sure how the window lines up with ET day/night....) |
Thanks for the continued testing. I had hoped I had found the non-atomic reading of the register and incrementing of the overflow count, but there must be something further to do atomically.
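To make the suspected failure mode concrete, here's a toy simulation (Python, not the port code) of what happens if the overflow count and the COUNT register are combined from different sides of a wrap:

```python
# Toy model: an 8-bit "COUNT" keeps the numbers small, but the algebra
# matches the 32-bit RTC. total = overflowed*WRAP + count is only right
# if both values come from the same side of a wrap.
WRAP = 1 << 8

def combine(overflowed, count):
    return overflowed * WRAP + count

print(combine(0, 254), combine(1, 1))  # 254 257 -- consistent, monotonic
print(combine(1, 254))                 # 510 -- mixed epochs: a full period ahead
```
|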
I instrumented CircuitPython so that I could easily cause RTC overflows:

```patch
diff --git a/ports/atmel-samd/common-hal/rtc/RTC.c b/ports/atmel-samd/common-hal/rtc/RTC.c
index e2a67bd17..1d16e9b86 100644
--- a/ports/atmel-samd/common-hal/rtc/RTC.c
+++ b/ports/atmel-samd/common-hal/rtc/RTC.c
@@ -68,6 +68,9 @@ int common_hal_rtc_get_calibration(void) {
 }
 
 void common_hal_rtc_set_calibration(int calibration) {
+    mp_printf(&mp_plat_print, "Warping RTC in calibration setter\n");
+    RTC->MODE0.COUNT.reg = 0xffffff00;
+
     if (calibration > 127 || calibration < -127) {
         #if CIRCUITPY_FULL_BUILD
         mp_raise_ValueError(translate("calibration value out of range +/-127"));
diff --git a/ports/atmel-samd/supervisor/port.c b/ports/atmel-samd/supervisor/port.c
index 7be1fdb53..372b17095 100644
--- a/ports/atmel-samd/supervisor/port.c
+++ b/ports/atmel-samd/supervisor/port.c
@@ -241,6 +241,8 @@ static void rtc_init(void) {
     #endif
     RTC->MODE0.INTENSET.reg = RTC_MODE0_INTENSET_OVF;
 
+    RTC->MODE0.COUNT.reg = 0xffffff00;
+
     // Set all peripheral interrupt priorities to the lowest priority by default.
     for (uint16_t i = 0; i < PERIPH_COUNT_IRQn; i++) {
@@ -496,8 +498,14 @@ uint32_t port_get_saved_word(void) {
 // TODO: Move this to an RTC backup register so we can preserve it when only the BACKUP power domain
 // is enabled.
 static volatile uint64_t overflowed_ticks = 0;
+volatile bool overflow_flag;
+volatile uint32_t overflow_ticks;
 
 static uint32_t _get_count(uint64_t *overflow_count) {
+    if (overflow_flag) {
+        mp_printf(&mp_plat_print, "RTC_Handler overflowed with overflow_ticks=0x%08x\n", overflow_ticks);
+        overflow_flag = 0;
+    }
     while (1) {
         // Disable interrupts so we can grab the count and the overflow atomically.
         common_hal_mcu_disable_interrupts();
@@ -530,6 +538,10 @@ volatile bool _woken_up;
 void RTC_Handler(void) {
     uint32_t intflag = RTC->MODE0.INTFLAG.reg;
     if (intflag & RTC_MODE0_INTFLAG_OVF) {
+        overflow_flag = true;
+        overflow_ticks = RTC->MODE0.COUNT.reg;
+
         RTC->MODE0.INTFLAG.reg = RTC_MODE0_INTFLAG_OVF;
         // Our RTC is 32 bits and we're clocking it at 16.384khz which is 16 (2 ** 4) subticks per
         // tick.
```

and I ran this program:

```python
import time
from rtc import RTC

while True:
    RTC().calibration = 1  # with the patch above, this warps COUNT to 0xffffff00
    t0 = time.monotonic_ns()
    et = t0 + 100_000_000  # .1 second
    while (t1 := time.monotonic_ns()) < et: pass
    print(f"duration {t1-t0}")
    print()
```

It turns out that the RTC overflow interrupt can occur while the RTC count register is merely close to wrapping around: I saw values as small as 0xffff_fffd.
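That early interrupt would explain the forward-then-back symptom: once the handler has credited an overflow while COUNT is still at 0xffff_fffx, any read taken in that window double-counts the period; after COUNT really wraps, readings are consistent again. A toy Python model (scaled down to 8 bits, not the port code):

```python
# Toy model of the early-OVF failure, scaled down to an 8-bit counter.
WRAP = 1 << 8
overflowed = 0
count = 253                # COUNT is merely *close* to wrapping

overflowed += WRAP         # OVF interrupt fires EARLY and credits a wrap
print(overflowed + count)  # 509: reads jump a full period into the future

count = (count + 5) % WRAP # COUNT actually wraps (253 -> 2)
print(overflowed + count)  # 258: next reading is back on the correct timeline
```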
|
I instrumented RTC_Handler and determined that on SAMD51 it was possible for the interrupt to be delivered well before the actual overflow of the RTC COUNT register (e.g., a value as small as 0xffff_fffd could be seen at the time of overflow).

Rather than depending on the overflow interrupt coming in at exactly the same time as COUNT overflows, rely only on observed values of COUNT in _get_count, overflowing when it wraps around from a high value to a low one.

With this change, PLUS a second change so that it is possible to warp the RTC counter close to an overflow and test in 20ms instead of 3 days, there was no problem detected over 20000+ overflows. Before, a substantial fraction (much greater than 10%) of overflows failed.

Fixes adafruit#5985

Change to common-hal/rtc/RTC.c for time warping (plus make rtc_old_count non-static):

```patch
 void common_hal_rtc_set_calibration(int calibration) {
+
+    common_hal_mcu_disable_interrupts();
+
+    RTC->MODE0.COUNT.reg = 0xffffff00;
+    rtc_old_count = 0;
+    do {
+        while ((RTC->MODE0.SYNCBUSY.reg & (RTC_MODE0_SYNCBUSY_COUNTSYNC | RTC_MODE0_SYNCBUSY_COUNT)) != 0) {
+        }
+    } while (RTC->MODE0.COUNT.reg < 0xffffff00);
+    common_hal_mcu_enable_interrupts();
+
+    mp_printf(&mp_plat_print, "Warping RTC in calibration setter count=%08x rtc_old_count=%08x\n", RTC->MODE0.COUNT.reg, rtc_old_count);
```

Test program:

```python
import time
from rtc import RTC

i = 0
while True:
    RTC().calibration = 1  # Warps to ~16ms before overflow, with patch to RTC code
    t0 = time.monotonic_ns()
    et = t0 + 20_000_000  # 20ms
    while (t1 := time.monotonic_ns()) < et: pass
    i += 1
    print(f"{i:6d}: duration {t1-t0}")
    if t1 - t0 > 200_000_000: break
print()
```
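The heart of the fix, sketched in Python rather than the actual C (names mirror the port code, but this is only a sketch, not the real implementation):

```python
# Sketch of the fixed _get_count idea: infer overflow from observed COUNT
# values themselves, never from the OVF interrupt's timing.
overflowed_ticks = 0
rtc_old_count = 0

def get_count(read_count):
    global overflowed_ticks, rtc_old_count
    count = read_count()
    if count < rtc_old_count:
        # COUNT wrapped high -> low: credit exactly one 32-bit overflow,
        # i.e. 2**32 subticks == 2**28 ticks at 16 subticks per tick.
        overflowed_ticks += 1 << 28
    rtc_old_count = count
    return overflowed_ticks + (count >> 4)
```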
CircuitPython version
Code/REPL
Behavior
Description
I have two Adafruit Feather M4s running programs that call time.monotonic_ns ~100 times a second. Each day, with approximately a 10% chance, it seems one of them makes a monotonic_ns call and gets a result more than a day in the future. The next call then seems to be OK.
I have an Adafruit Bluefruit Circuit Playground running similar code that doesn't seem to present this problem. Aside from the CPU, one difference is that the M4s are setting rtc.RTC().datetime once an hour... though the discontinuities in monotonic_ns don't seem correlated with these calls.
Additional information
No response