QT PY ESP32-S2 core crash when Wi-Fi router is power-cycled #7230
Comments
I have done more testing using my 4 QT-PY ESP32-S2's. They will often run for a dozen hours and then all core crash within minutes of each other. I've been checking the memory and there don't appear to be any leaks. I've been trying to come up with a way to reliably reconnect the WiFi when it's disrupted. Pinging the gateway, and then reconnecting on no response, does work most of the time. However, certain types of disconnections seem to always cause a core crash. All 4 QT PY's are in the same room with direct line of sight to my access point. I don't notice any WiFi issues with any of the other equipment in the house, including an older MicroPython ESP32 in my vegetable garden outside which has been running great for over a year. btw: The docs state: "Reconnections are handled automatically once one connection succeeds." This is not the case with the QT PY ESP32-S2. I added multiple ping attempts before reconnection to avoid unnecessary reconnects. It seems that many failed pings to the gateway will succeed on a 2nd or 3rd try.
I also removed the RSSI check prior to WiFi connection because the start_scanning_networks() command was definitely causing more frequent core crashes. |
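The "retry the ping a few times before reconnecting" approach described above could be sketched roughly as follows. This is not the poster's code: the function name, the retry count, and the delay are assumptions chosen for illustration. The radio is passed in as a parameter so the logic can be exercised off-board; on the QT PY it would be wifi.radio.

```python
import time


def gateway_reachable(radio, attempts=3, delay=1.0, sleep=time.sleep):
    """Return True if the gateway answers any of `attempts` pings.

    `radio` is expected to behave like CircuitPython's wifi.radio.
    `attempts` and `delay` are illustrative defaults, not values from the thread.
    """
    gateway = radio.ipv4_gateway
    if gateway is None:
        # No gateway means the radio currently has no IP configuration.
        return False
    for attempt in range(attempts):
        # wifi.radio.ping() returns the round-trip time in seconds,
        # or None if the ping timed out.
        if radio.ping(gateway) is not None:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# On the board (hypothetical ssid/password variables, not executed here):
#   import wifi
#   if not gateway_reachable(wifi.radio):
#       wifi.radio.connect(ssid, password)
```

Only when every attempt fails does the caller fall through to a reconnect, which keeps the number of wifi.radio.connect() calls low.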
On The truest test of whether the device is connected is checking for an IPv4 address (
There are several automatic reconnection attempts, but not infinite, so connection retries are ultimately needed if there is an extended disconnection. |
@anecdata The crashes occur on wifi.radio.connect(). They don't occur on the pings. That's why I'm trying to minimize the reconnects. Thanks for the tip on the ipv4_address! I will try setting up one of the QT PY's with an ipv4_address check instead of pinging to determine if the WiFi connection has dropped. |
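A minimal sketch of the ipv4_address check with bounded reconnect attempts might look like this. The function name, retry count, and backoff are assumptions for illustration. Note that a Python try/except can only catch the ConnectionError that wifi.radio.connect() raises on an ordinary failure; it cannot intercept the hard crash reported in this issue.

```python
import time


def ensure_connected(radio, ssid, password, max_tries=3, backoff=5.0,
                     sleep=time.sleep):
    """Reconnect only when the radio has lost its IPv4 address.

    `radio` is expected to behave like CircuitPython's wifi.radio.
    `max_tries` and `backoff` are illustrative values.
    """
    for attempt in range(max_tries):
        # ipv4_address is None while the station is not connected.
        if radio.ipv4_address is not None:
            return True
        try:
            radio.connect(ssid, password)
        except ConnectionError as err:
            print("connect failed:", err)
            # Wait a little longer after each failed attempt.
            sleep(backoff * (attempt + 1))
    return radio.ipv4_address is not None


# On the board (hypothetical credentials, not executed here):
#   import wifi
#   while True:
#       if not ensure_connected(wifi.radio, "my-ssid", "my-password"):
#           print("still offline, will retry later")
#       time.sleep(10)
```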
How did you determine that the btw,
https://docs.espressif.com/projects/esp-idf/en/latest/esp32s2/api-guides/wifi.html#wi-fi-reconnect There was an issue with unstable wifi scanning, it's ostensibly fixed but maybe try an example without pings or scanning just to see if that makes any difference? |
I have relied on the print statements. I have print statements on the lines before and after wifi.radio.connect. When a crash occurs, the preceding print statement prints but the subsequent one never does. There is only 1 access point in the house. I have 2 QT PY's running now using only the ipv4_address to verify the WiFi connection. I also have 2 QT PY's running my previous code, slightly modified to reattempt failed pings before reconnecting. |
You can also do a |
Never tried to build CP. Is there a build guide for the QT PY ESP32-S2 or are there any prebuilt UF2's with debug enabled? |
@rdagger See https://learn.adafruit.com/building-circuitpython/ and particularly https://learn.adafruit.com/building-circuitpython/espressif-build. If you have or can set up a Linux box, that's generally easiest. The |
I was able to build with DEBUG=1. For some reason the I2S is not working but I don't really need it to test the WiFi. I will let you know how it goes and thanks for the build help. |
I’m stuck trying to debug the core crash. I tried multiple custom builds of CircuitPython using DEBUG=1 but I haven’t been able to collect any data yet because the core crash takes out the USB serial before any data is displayed. I’ve tried to implement a console uart to catch the debug data but I can’t get it working. The QT PY ESP32-S2 does not have a 'Send ESP_LOG output to TX/RX pins' sample in the sdkconfig file so I copied the section from the QT PY ESP32 Pico sdkconfig. I modified the console uart TX and RX pins to 5 and 16 to match the TX and RX labels on the QT PY ESP32-S2:
Unfortunately, I’m not getting any serial output on either GPIO pin. I hooked the QT PY up to an oscilloscope to verify no communication. Is there something else I have to modify or are my settings wrong? Btw: switching from a pinging approach to an ipv4_address check for WiFi connectivity has increased the time between core crashes but they still occur. |
That sdkconfig seems to work for me. Here is my build: Part of the output is:
|
@tannewt Thanks for doing that. I guess I'm missing something because I'm still not getting any serial communication. Did you use pins TX and RX (5 & 16)? I tried wiping the Pi just to be sure. Are there any special UART settings? I'm just using:
I should see something on TX when I reboot the QT PY right? |
btw: when I uploaded your firmware I got the following upon boot:
Is that the correct version? I also tried doing a factory reset of the QT PY and loading your UF2 again just to be sure. |
I'm not using a Pi to read the UART. I've got a USB to serial adapter board I use. That version number matches mine. Could you post a picture of the board? |
Wiring looks right to me. What does |
I did do a loop back test to make sure /dev/ttyAMA0 was the correct UART for my wiring and I verified it was the high speed UART (PL011). I tried on 2 Raspberry Pi's (B plus and 4). I'm on the road so I don't have my scope or any USB dongles. |
ls -l /dev dmesg | grep tty ttyAMA0 is the high speed UART. |
Ok, I have no idea why you aren't seeing output then. |
I'll try another QT PY ESP32-S2 when I get home. I did have some quality issues with this last batch such as a cold solder joint. Thanks for all your help! |
I reloaded CircuitPython 8.0.0-beta.4 from CircuitPython.com and ran the following code:
I then connected using Tio on the Pi with the same wiring pictured above and typed a test:
It showed up in the Mu serial console and was transmitted back to the Pi.
I think that shows that the Raspberry Pi and wiring are OK. |
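The snippet used for this loopback check is not preserved in the thread. A UART echo test of this kind could be sketched as below; this is a guess at the general shape, not the poster's code. The helper name and chunk size are assumptions, and the 115200 baud rate in the usage comment is taken from the PuTTY setting mentioned later in the thread.

```python
def echo_available(uart, chunk=32):
    """Read up to `chunk` bytes from the UART and send them straight back.

    `uart` is expected to behave like busio.UART: read() returns bytes,
    or None when nothing arrived before the timeout.
    """
    data = uart.read(chunk)
    if data:
        # Show the bytes on the USB serial console as well as echoing them.
        print(data)
        uart.write(data)
    return data or b""


# On the board (not executed here):
#   import board
#   import busio
#   uart = busio.UART(board.TX, board.RX, baudrate=115200, timeout=0.1)
#   while True:
#       echo_available(uart)
```

If text typed on the far end appears in the USB console and comes back on the wire, the wiring and the host-side serial port are working.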
I got a brand new QT PY ESP32-S2. I loaded your DEBUG=1 firmware above. I got a CP210x USB to serial dongle and connected it to the QT PY (TX to RX and RX to TX). I connected to the serial port using Putty at 115200 from a Windows computer. Unfortunately, I did not receive any communication from the QT PY. I tried rebooting the QT PY and nothing came through. I reversed RX and TX but still nothing. I hooked up the CP210x dongle to my scope with serial decoding enabled and was able to receive data from Putty. I hooked up the TX pin of the QT PY to the scope and I didn’t get anything. Perhaps I have inaccurate expectations. I thought the RX and TX pins would give REPL access like the serial screen in Mu. Is that not the case? Is there something I have to enable, or do to get the QT PY to transmit on the TX pin? Is there some better way to test my set up? |
My build should have debug output to TX, not the REPL. Try flashing the bin with esptool. (or |
Erasing the board and then flashing my firmware.bin with https://adafruit.github.io/Adafruit_WebSerial_ESPTool/ did the trick. Thanks! Unfortunately, my program gets stuck right after I2S is configured:
The debug console just keeps printing the following: The program itself hangs at this point. It's been printing the above for over 10 minutes now. The program will run if I comment out the I2S section, but I'm wondering if that could be part of what was originally causing the core crash. |
You may want to change the debug level in the debug sdkconfig You can change the I'd then add more prints to see where CP is getting stuck. |
I pointed you to the wrong debug level. That one is for the second stage bootloader. There is one further down in the file for the user program: https://github.com/adafruit/circuitpython/blob/main/ports/espressif/esp-idf-config/sdkconfig-debug.defaults#L76 |
That fixed it. Actually, the program is now up and running without any issues. For some reason the debug level was preventing the I2S from working. I'll let it run and hopefully I'll get a core crash soon with a back trace. Thanks for your patience! |
Not sure why but the program can't maintain a stable Adafruit IO connection with DEBUG=1 and also using I2S. My version without I2S ran for a day without dropping a connection. However, the 2 boards I just set up with I2S have been online for only an hour and have dropped 3 times with very unreliable service. Here's the REPL view:
And the only debug output:
I also have 2 boards without DEBUG=1 running the full program with I2S and they have been stable for a day. |
One of the QT PY's core crashed last night at 2AM. Here is the last REPL message:
Unfortunately, I didn't notice until this morning and my terminal only had 1000 lines of scrollback. The debug console keeps outputting the following lines (about 100 per minute) so I lost the debug output concerning the core crash:
|
The status LED blinks tend to output that. It is probably easiest to comment out that print in the IDF, clean, build and flash again. |
Is this the correct code to modify: Specifically comment out the following lines in supervisor/shared/safe_mode.c as shown:
|
I haven't had a chance to implement the changes above, but the other QT PY crashed about 2 hours ago and didn't go into safe mode. Instead, it just locked up. Here's the debug output:
There was no info in the REPL other than the following:
|
I ran decode_backtrace and got the following:
|
Bummer. I was hoping that backtrace would be more useful. Something must be corrupting the stack. I'm not sure what. |
Could there be an issue with running multiple QT PY's at the same time? I noticed they all use the same network hostname (espressif). I'll try giving them unique hostnames. Also, could you please let me know what file I need to modify to disable the safe mode status LED. The edits I made above to supervisor/shared/safe_mode.c did not work. |
Not sure if it's a coincidence but all 4 boards made it through the night without crashing since I gave them unique hostnames. I also tried rebooting my access point and running deauth attacks against the QT PY's. So far, they haven't crashed. I added a fifth board and I'll let them all run for a few days. Afterwards, if there are no crashes, I'll put the hostnames back to espressif and see if the problem recurs. |
It's been 6 days and I have not had another core crash on any of the 5 QT PY's. I think the problem was due to running multiple QT PY boards with identical default hostnames on the same network. The problem may be specific to my Asus Wi-Fi router. Many of the simultaneous core crashes did seem to coincide with deauths in the router log. |
I don't know that these kinds of errors should cause literal crashes, but that may be beyond our control. |
I think it may be worth using the cpy-MAC hostname in dhcp in addition to mdns. That'd make them much more likely to be unique. |
I generated a unique name using the least significant 19 bits of the microcontroller.cpu.uid Example: QTPY362041 That way it's easy to identify when using network tools. |
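Deriving a hostname from the low 19 bits of microcontroller.cpu.uid could be sketched like this. The byte order used to turn the UID into an integer and the unpadded decimal formatting are assumptions; the thread only gives the bit count and the example name. A 19-bit value is at most 524287, so the suffix never exceeds six digits.

```python
def hostname_from_uid(uid, prefix="QTPY", bits=19):
    """Build a hostname from the least significant `bits` of a UID.

    `uid` is a bytes-like object such as microcontroller.cpu.uid.
    Treating the UID as a big-endian integer is an assumption.
    """
    value = 0
    for byte in uid:
        # Accumulate the bytes as one big-endian integer.
        value = (value << 8) | byte
    # Keep only the lowest `bits` bits.
    value &= (1 << bits) - 1
    return prefix + str(value)


# On the board (not executed here); set the hostname before connecting:
#   import microcontroller
#   import wifi
#   wifi.radio.hostname = hostname_from_uid(microcontroller.cpu.uid)
```

For a made-up UID ending in the bytes 0x05 0x86 0x39, this yields QTPY362041, the same shape as the example name above.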
CircuitPython version
Code/REPL
Behavior
Had 4 QT-PY ESP32-S2's running in my office. They all core crashed when I restarted my Wi-Fi router. Simplified my code and reproduced the problem multiple times.
The code runs until the router is rebooted or powered down. Then it throws the following error:
Wi-Fi: off | Done | 8.0.0-beta.4-21-g8f414eb4e
Auto-reload is off.
Running in safe mode! Not running saved code.
You are in safe mode because:
CircuitPython core code crashed hard. Whoops!
Crash into the HardFault_Handler.
Please file an issue with the contents of your CIRCUITPY drive at
https://github.com/adafruit/circuitpython/issues
Description
I don't think this is a duplicate of "ping too frequently results in Safe Mode #5980" because I am waiting 1 second between pings and the code doesn't crash from frequent pings. Instead, it crashes while the Wi-Fi is trying to reconnect. Furthermore, I've had a more complicated version of the code running for several days on 4 QT-PY's and only encountered a core crash when the router went down.
Additional information
Removing the RSSI check that uses wifi.radio.start_scanning_networks() makes the crashes less frequent, but they can still occur during reconnect. Occasionally, the code will not crash and will reconnect properly. I suspect that the longer the router is down, the more likely the core crash. My router takes over a minute to restart.
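For reference, an RSSI check built on wifi.radio.start_scanning_networks() generally has the shape below. The original code from this report is not shown, so this is only an illustrative sketch; the function name is hypothetical. The scan should always be ended with stop_scanning_networks(), which is why it sits in a finally block.

```python
def rssi_for(radio, ssid):
    """Return the strongest RSSI seen for `ssid`, or None if it is not found.

    `radio` is expected to behave like CircuitPython's wifi.radio, whose
    scan yields objects with `ssid` and `rssi` attributes.
    """
    best = None
    try:
        for network in radio.start_scanning_networks():
            if network.ssid == ssid and (best is None or network.rssi > best):
                best = network.rssi
    finally:
        # Always end the scan, even if iteration raised.
        radio.stop_scanning_networks()
    return best


# On the board (hypothetical SSID, not executed here):
#   import wifi
#   print(rssi_for(wifi.radio, "my-ssid"))
```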