esp_ota_begin() starving network task of cpu time with large partitions and SPI RAM (IDFGH-261) #2083
Comments
We are having the same problem; good that you brought this up. Although we are using a smaller flash (4MB divided into two OTA partitions), we are hit quite heavily by this since our application needs to continue working during the OTA download. Our image is currently about 1MB.
The vTaskDelay(1) after spi_flash_guard_end() in spi_flash_erase_range workaround seems to solve the issue, but is horrendously kludgy. Any better solution?
Added Kconfig options to enable yield operation during flash erase. Disabled by default. Closes: #2083 Closes: IDFGH-261
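For illustration, a hypothetical sketch of what Kconfig options like those the commit message describes could look like. The option names, prompts, and defaults below are assumptions based only on the commit message ("enable yield operation during flash erase", "by default disable"), not the merged ESP-IDF definitions:

```kconfig
# Hypothetical names -- not taken from the actual commit.
config SPI_FLASH_YIELD_DURING_ERASE
    bool "Yield to other tasks during flash erase"
    default n
    help
        When enabled, the flash erase loop periodically yields so that
        other tasks (e.g. the networking task) get CPU time while a
        long erase, such as the one in esp_ota_begin(), is in progress.

config SPI_FLASH_ERASE_YIELD_TICKS
    int "Duration of each yield, in FreeRTOS ticks"
    depends on SPI_FLASH_YIELD_DURING_ERASE
    default 1
```

Keeping the option off by default preserves the old (fastest, non-yielding) erase behavior for applications that do not need concurrency during OTA.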
We use 16MB ESP32 modules with 4MB OTA partitions and external SPI RAM. Currently our firmware image is approximately 2.8MB in size. The flow of our code is to make an HTTP GET request for the firmware image, read the headers (in particular the download size), call esp_ota_begin() with the firmware download size, then download chunk by chunk calling esp_ota_write() for each chunk, and finally call esp_ota_end() when done. Our networking buffers are in internal RAM, not SPI RAM.
After our firmware exceeded about 2MB in size and we started using SPI RAM more in our application, we began to see random crashes during OTA firmware updates over Wi-Fi.
We narrowed the issue down to using networking functions (reading from the TCP/IP socket) after calling esp_ota_begin() with large image sizes (over approximately 2MB). The Espressif code calls a single esp_partition_erase_range(), which disables the SPI RAM cache and blocks any task trying to access it. If the system networking task is blocked for too long, it appears to mishandle Wi-Fi beacons and panics once it finally gets some CPU time.
If we change esp_ota_begin() to erase the partition in 256KB chunks (calling esp_partition_erase_range() multiple times in a loop), with a one-tick vTaskDelay() between chunks, the problem goes away and OTA works again. The vTaskDelay() is required: without it, the panic in pm_on_beacon_rx still happens.
I am not sure how to address this issue. The core spi_flash_erase_range(), which all of this depends on, already has a loop that erases one sector or block at a time, and it uses spi_flash_guard_start() and spi_flash_guard_end() correctly for each sector/block. It seems that other tasks are not getting enough CPU time between the spi_flash_guard_start()/spi_flash_guard_end() calls in that erase loop.
Adding a delay there seems kludgy. Perhaps a FreeRTOS call to allow other blocked tasks to run? I tried adding a taskYIELD() after the spi_flash_guard_end() in spi_flash_erase_range, but that didn't solve the problem (presumably because the task that called esp_ota_begin() was higher priority than the networking task). A vTaskDelay(1) in the same place does solve the problem, but seems horribly kludgy.
Overall, I'm just very uncomfortable with how invasive esp_ota_begin() is. With a 2.8MB image size, it blocks all other tasks that touch SPI RAM for about 17 seconds. That is not good. We should be able to OTA flash without starving other tasks in the system so badly.
There is a separate bug in pm_on_beacon_rx that is triggered by this, but that seems more a symptom than the root cause.