Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Presence of boot.py in many cases will cause a boot loop on ESP32-S2 with TinyUF2 (rc, not beta) #5305

Closed
anecdata opened this issue Sep 6, 2021 · 19 comments · Fixed by #5348
Labels
bug espressif applies to multiple Espressif chips
Milestone

Comments

@anecdata
Copy link
Member

anecdata commented Sep 6, 2021

CircuitPython version

Adafruit CircuitPython 7.0.0-rc.1 on 2021-09-02; Adafruit FunHouse with ESP32S2

Code/REPL

None

Behavior

On an ESP32-S2 device (Fun House in this case, but on others as well) without TinyUF2, device behaves as expected.

On the same device with TinyUF2, starting clean from a fresh storage.erase_filesystem(), with either empty boot.py + empty code.py, or some (but not all) combinations of trivial and non-trivial boot.py + code.py, when reset the device will display the Blinka icon, then go into an apparent loop where nothing else is displayed (regardless of any print statements in boot.py) and the violet status lights and the red LED will flash every second.

To recover, go into safe mode and delete or rename the boot.py file, then hard reset.

Saw similar behaviors on rc.0, but not on various beta versions or prior.

UPDATE:
Further testing indicates that the presence of TinyUF2 does not seem relevant, the issue can be triggered with or without TinyUF2.

@anecdata anecdata added bug espressif applies to multiple Espressif chips labels Sep 6, 2021
@dhalbert dhalbert added this to the 7.0.0 milestone Sep 7, 2021
@anecdata
Copy link
Member Author

anecdata commented Sep 7, 2021

I think it's more complicated... if I start clean, empty boot & empty code will work repeatedly. It seems like there's some unexpected state being retained in some cases, and once the error triggers, it will keep triggering across hard resets until boot.py is removed. It would almost have to be something in flash?

Another behavior I hadn't seen before: Some non-trivial boot.py and non-trivial code.py combinations will not trigger the looping, but the code.py will not run. boot.py output: will display correctly on the built-in display, and boot-out.txt will be properly populated, but no code.py output: line or code output, simply Code done running.

@tannewt
Copy link
Member

tannewt commented Sep 7, 2021

Can you replicate this with a DEBUG build when it is outputting to the debug UART?

@anecdata
Copy link
Member Author

anecdata commented Sep 8, 2021

I haven't done a DEBUG build since it became too big to generate, but I'll ask around to see how to make it fit.

@dhalbert
Copy link
Collaborator

dhalbert commented Sep 8, 2021

I haven't done a DEBUG build since it became too big to generate, but I'll ask around to see how to make it fit.

Paulsk cut the wifi buffers in half and made it fit: https://discord.com/channels/327254708534116352/537365702651150357/881908023436640356

@anecdata
Copy link
Member Author

anecdata commented Sep 8, 2021

Switched to Cucumber RIS since Funhouse doesn't have debug pins exposed.

  1. erased flash
  2. installed TinyUF2 combined
  3. installed CP rc.1

Initial test with no boot.py and default code.py gives this in the debug console on startup:

# ...
I (904) cpu_start: Pro cpu up.
I (1318) spiram: SPI SRAM memory test OK
ESP-ROM:esp32s2-rc4-20191025

Creating an empty boot.py gave an ambiguous result on hard reset, repeatedly. No "Hello World!, but it's possible that tio didn't reconnect in time. It usually does. But this is the equivalent debug console out in this case:

# ...
I (908) cpu_start: Pro cpu up.
I (1315) ??f̆?f?fφ~
                  ????
                      i?U?
                          R?V?KW????сz-m??j5R?ESP-ROM:esp32s2-rc4-20191025

The gibberish varies each time. No boot loop in this case. Seems like something is getting clobbered.

(The ESP-ROM:esp32s2-rc4-20191025 string is the first line of a new restart)

@anecdata
Copy link
Member Author

anecdata commented Sep 8, 2021

If I put a larger code.py on it, it will go into a boot loop (about once per second), but the debug console in this particular case looks more normal on each iteration (no gibberish):

# ...
I (915) cpu_start: Pro cpu up.
I (1329) spiram: SPI SRAM memory test OK
ESP-ROM:esp32s2-rc4-20191025

Oh, and this appears earlier in the debug console every time when boot looping:

W (47) boot.esp32s2: PRO CPU has been reset by WDT.
W (52) boot.esp32s2: WDT reset info: PRO CPU PC=0x4004c671

These warning messages were not present in the case in the previous comment.

@anecdata
Copy link
Member Author

anecdata commented Sep 8, 2021

Manual bisect shows earliest S3 artifact with this issue is:
2021-08-27T22:59:12.000Z 2.6 MB adafruit-circuitpython-gravitech_cucumber_rs-en_US-20210827-31b9dd4.uf2
If I read the tea leaves right, I think that correlates to:
#5245
Someone else should check check my work.

@anecdata
Copy link
Member Author

anecdata commented Sep 8, 2021

Backing out just the "-1" and "+1" additions on lines 133 and 142 respectively in py/gc.c from that PR does indeed seem to fix the boot.py issue and allow more recent builds to run successfully, but of course that re-introduces other pesky issues.

@tannewt
Copy link
Member

tannewt commented Sep 9, 2021

What are the contents of your boot.py and code.py? This does sound like the issue we thought we had fixed.

@anecdata
Copy link
Member Author

anecdata commented Sep 9, 2021

I've been testing recently only with an empty boot.py.

I'm trying but having difficulty coming up with a minimal reproducible example for the initial code.py.

@Neradoc replicated the boot looping on FunHouse with an empty boot.py file and no code.py, but I don't know what his prior conditions were https://discord.com/channels/327254708534116352/327298996332658690/884511468844879883

Necessary but insufficient conditions seem to be:

  • Tiny UF2 (0.5.1 or 0.5.2 No: see Addendum
  • presence of boot.py file (can be empty)
  • CircuitPython builds from -en_US-20210827-31b9dd4 to present

Bu as far as I can tell, it needs those things but also has to be triggered by a code.py with presently unknown characteristics. Once it is triggered, the issue will persist through:

  • successive hard resets
  • going into bootloader mode
  • going into safe mode

Once triggered, you can even go into safe mode and reduce code.py down to empty, and the issue will still persist across hard resets. After that point, only two things will fix it:

  1. the deletion or renaming of boot.py and another hard reset will get things working again
  2. resetting from within the REPL and NOT while at the "Press any key to enter the REPL. Use CTRL-D to reload." prompt, even if there is still a boot.py

If we suspect corruption, a number of variables like build differences, port and board differences, etc., may affect the observed behavior.

I'll keep trying to get an example for an initial code.py.

Addendum:
Once realizing there was a difference between hard reset when at the "Press any key to enter the REPL. Use CTRL-D to reload." prompt vs. in the REPL, I restested without TinyUF2 and it will still boot loop if reset at the prompt instead of within the REPL. But of course restoring the offending code.py will again trigger the issue if hard reset at the "Press any key..." prompt or during code execution, and there is still a boot.py.

Addendum 2:
Just to make sure I hadn't made procedural errors in the bisect due to the above, I retested the last known good build, en_US-20210827-3e9daec.bin and it was fine. With an empty boot.py file and an offending code.py, I could hard reset it at any stage and it would restart properly.

Addendum 3:
I haven't been able to reproduce this scenario in my current config:

Once triggered, you can even go into safe mode and reduce code.py down to empty, and the issue will still persist across hard resets.

@anecdata
Copy link
Member Author

anecdata commented Sep 10, 2021

I don't really understand what that PR does, or the issues that it addressed. Any ideas or intuition about what kinds of code might trigger this... large code size or complex code to interpret? big buffers? fragmentation? stack vs heap vs pystack vs esp-idf? other? I finally bodged together 600+ lines of code snippets that will consistently trigger the issue, but it's ugly and introduces variables because a lot of modules get exercised. With some direction, I might be able to produce a more targeted example.

@tannewt
Copy link
Member

tannewt commented Sep 10, 2021

The issue we were trying to fix was a rare case where the ATB reads into the FTB. The ATB tracks block status and the FTB tracks whether each block has a finalizer. If the first block of the FTB looks like a TAIL block from the ATB, then the collector will read past the end of RAM and hang the chip. Once hung, the WDT will reset the chip after a time.

So it is a specific case of allocating something at the top of the heap and allocating something at the bottom with a finalizer. I talk about this in more detail in this stream: https://www.youtube.com/watch?v=DQVXNNUGvbk

I think the next step would be to print out the structure of the heap here and you can validate the addresses read here. The addresses (ptrs) should always be less than 0x3ff80000. Reading that exact value hangs.

@anecdata
Copy link
Member Author

That's all way over my head. I missed that deep dive, so I'll watch and see what I can learn.

@tannewt
Copy link
Member

tannewt commented Sep 13, 2021

I wasn't able to replicate this on an UM FeatherS2. Have you only replicated it on devices with a screen?

@anecdata
Copy link
Member Author

anecdata commented Sep 14, 2021

It affects any ESP32-S2 device I've tried so far: MagTag, FunHouse, Cucumber RS (no display). If I delete some code, doesn't seem to matter which code, more like how much, it's often fine again. Just verified that the latest build:
adafruit-circuitpython-gravitech_cucumber_rs-en_US-20210913-3e9cf8f.uf2 still triggers the issue with the 600+ line conglomeration of snippets. I'll put it in a gist and link below.

FeatherS2 seems harder to trigger, but I don't know why. It has more PSRAM, but it's not like anything I've done fills the 2MB of PSRAM in the other boards. I'd suggest testing on Saola or something else WROVER-based.

Addendum:
gist is here:
https://gist.github.com/anecdata/8e0a40181021b543a3a5b8fe123f3716

[EDIT: note that an empty (at least) boot.py must also be present - @dhalbert]

@dhalbert
Copy link
Collaborator

dhalbert commented Sep 14, 2021

@anecdata I think you mentioned on Monday that TinyUF2 is not necessary, at least with the example above? So let's edit the title, etc.

@anecdata
Copy link
Member Author

Just a note that in the large example, there are functions and function code that are never executed, but if I delete too much of that, the issue goes away. That leads me to think it has more to do with code size or complexity, and the interpretation to byte code stage. And indeed, if I have a code.py that simply imports the original offending but mpy-cross'd {code}.mpy, the issue does not arise. It seems to be the specific combination of a boot.py, plus a code.py actually containing certain (unknown) characteristics.

@tannewt
Copy link
Member

tannewt commented Sep 14, 2021

At what point does it fail with that code.py? It seems to get pretty far for me.

I also see you are enabling the WDT yourself. Please post more of your debug output. The S2 has multiple watchdogs and it should tell us which of the two reset the chip for you.

@anecdata
Copy link
Member Author

anecdata commented Sep 14, 2021

I don't think the user watchdog is required, it was just a handy snippet. The code will run and soft-reload fine. What it won't do is recover from a hard reset (or other behavior causing an auto-reset) after it's done a while True iteration or two. That results in boot loops with debug console like this:

ESP-ROM:esp32s2-rc4-20191025
Build:Oct 25 2019
rst:0x8 (TG1WDT_SYS_RST),boot:0x9 (SPI_FAST_FLASH_BOOT)
Saved PC:0x40028200
SPIWP:0xee
mode:DIO, clock div:1
load:0x3ffe6100,len:0x16a8
load:0x4004c000,len:0xeec
load:0x40050000,len:0x2bb8
entry 0x4004c310
I (24) boot: ESP-IDF v4.3-356-g48ae2309fd 2nd stage bootloader
I (24) boot: compile time 06:13:05
I (24) boot: chip revision: 0
I (27) qio_mode: Enabling default flash chip QIO
I (33) boot.esp32s2: SPI Speed      : 80MHz
I (37) boot.esp32s2: SPI Mode       : QIO
I (42) boot.esp32s2: SPI Flash Size : 4MB
W (47) boot.esp32s2: PRO CPU has been reset by WDT.
W (52) boot.esp32s2: WDT reset info: PRO CPU PC=0x4004c671
I (59) boot: Enabling RNG early entropy source...
I (64) boot: Partition Table:
I (68) boot: ## Label            Usage          Type ST Offset   Length
I (75) boot:  0 nvs              WiFi data        01 02 00009000 00005000
I (82) boot:  1 otadata          OTA data         01 00 0000e000 00002000
I (90) boot:  2 ota_0            OTA app          00 10 00010000 00160000
I (97) boot:  3 ota_1            OTA app          00 11 00170000 00160000
I (105) boot:  4 uf2              factory app      00 00 002d0000 00040000
I (112) boot:  5 ffat             Unknown data     01 81 00310000 000f0000
I (120) boot: End of partition table
I (624) esp_image: segment 0: paddr=00010020 vaddr=3f000020 size=37c98h (228504) map
I (663) esp_image: segment 1: paddr=00047cc0 vaddr=3ffcd930 size=035fch ( 13820) load
I (666) esp_image: segment 2: paddr=0004b2c4 vaddr=40028000 size=04d54h ( 19796) load
I (674) esp_image: segment 3: paddr=00050020 vaddr=40080020 size=f0298h (983704) map
I (846) esp_image: segment 4: paddr=001402c0 vaddr=4002cd54 size=10bd4h ( 68564) load
I (862) esp_image: segment 5: paddr=00150e9c vaddr=40070000 size=0002ch (    44) load
I (862) esp_image: segment 6: paddr=00150ed0 vaddr=50000000 size=01010h (  4112) load
I (880) boot: Loaded app from partition at offset 0x10000
I (880) boot: Disabling RNG early entropy source...
I (893) cache: Instruction cache 	: size 16KB, 4Ways, cache line size 32Byte
I (893) cache: Data cache 		: size 16KB, 4Ways, cache line size 32Byte
I (898) spiram: Found 16MBit SPI RAM device
I (903) spiram: SPI RAM mode: sram 40m
I (908) spiram: PSRAM initialized, cache is in normal (1-core) mode.
I (915) cpu_start: Pro cpu up.
I (1329) spiram: SPI SRAM memory test OK
ESP-ROM:esp32s2-rc4-20191025
Build:Oct 25 2019
rst:0x8 (TG1WDT_SYS_RST),boot:0x9 (SPI_FAST_FLASH_BOOT)
Saved PC:0x40083fcb
SPIWP:0xee
mode:DIO, clock div:1
load:0x3ffe6100,len:0x16a8
load:0x4004c000,len:0xeec
load:0x40050000,len:0x2bb8
entry 0x4004c310

If I rip the watchdog code out of the example and replace it with a couple of unused functions, it will still boot loop on reset:

rst:0x8 (TG1WDT_SYS_RST),boot:0x9 (SPI_FAST_FLASH_BOOT)
Saved PC:0x40028200
SPIWP:0xee
mode:DIO, clock div:1
load:0x3ffe6100,len:0x16a8
load:0x4004c000,len:0xeec
load:0x40050000,len:0x2bb8
entry 0x4004c310
I (24) boot: ESP-IDF v4.3-356-g48ae2309fd 2nd stage bootloader
I (24) boot: compile time 06:13:05
I (24) boot: chip revision: 0
I (27) qio_mode: Enabling default flash chip QIO
I (33) boot.esp32s2: SPI Speed      : 80MHz
I (37) boot.esp32s2: SPI Mode       : QIO
I (42) boot.esp32s2: SPI Flash Size : 4MB
W (47) boot.esp32s2: PRO CPU has been reset by WDT.
W (52) boot.esp32s2: WDT reset info: PRO CPU PC=0x4004c671

dhalbert pushed a commit that referenced this issue Sep 15, 2021
This ensures it has a known start state. An unknown state risks it
looking like a tail block similar to the problem fixed by #5245.

Fixes #5305
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug espressif applies to multiple Espressif chips
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants