[v3.3.4] esp_ota_begin causes Wi-Fi stack starvation on 4 MiB partitions (IDFGH-4932)
#6723
Thanks for the very detailed report, and sorry for the inconvenience. We will look into this soon.
Hi @chrismerck, Thanks for the detailed instructions to reproduce this issue. I'm able to reproduce task watchdog getting triggered on v3.3.4 (could not reproduce the TBTT part).
Thanks,
Hi @shubhamkulkarni97 , Thank you for the incredibly prompt turn-around.
That's with just the IDF and some demo code! We then build a really great product on top of it. That takes a lot of flash space :) We have a local HTTP API to support local control by our app and other smart home systems, plus other protocols (UDP and telnet) for performance and legacy compatibility. Then we bring in Lua and several radio drivers. It adds up quick. This is something that seems to be a bit of a blind spot for Espressif, to be totally honest. It's virtually impossible to use the 4 MB modules for any serious application that's more than a thin-client, yet it is quite difficult to obtain dev kits with 8MB or 16MB, and the new mini modules are only available in 4MB.
We already do store our data in a separate 1MB database partition. We considered SPIFFS along with other embedded database libraries but unfortunately had to roll our own to fit our requirements. (Especially handling database compaction after record deletions.)
Wow, that's amazing, thank you. We may give this a try.
That's excellent news. The flash cache has been tricky for us. The C3 may have a nice effect of boosting performance of our Bond Bridge product, since we use Core1 bare-metal for time-sensitive radio operations, where our performance is currently limited by the flash cache (we need to disable all interrupts during critical radio operations, which can last seconds in some circumstances). The dual-core C3 + 16MB and a mini module C3 + 8MB (especially this one) are what we are waiting for. Cannot do much with 4MB, sadly.
@chrismerck, did you get a chance to try out the patch provided in the above comment? Does it fix the issue?
Thanks for your feedback. In case you have not seen it, please go through our product ordering guide. I do see several modules offering 8M/16M flash parts. For more information on availability, please get in touch with our sales team. However, we do see a lot of traction for 4M modules; in fact, we are also getting requests for 2M flash modules. There are just too many business verticals and use cases in this domain :-)
FYI, the C3 is a single-core RISC-V chip. However, the upcoming ESP32-S3 is a dual-core one.
Hopefully the patch provided earlier helped to fix your problem; closing this issue. Please feel free to re-open it in case you need further help.
We have not tested on the master branch, as that requires major work to upgrade from v3.3.4 IDF. I'm posting this in the hope that it helps someone else with this issue. Hopefully we will find the resources to upgrade our application to v4.x, as it seems that in the latest master this code has been heavily reworked.
Similar issues:
Environment
Problem Description
The IDF function esp_ota_begin blocks for the duration of the partition erasure, and during that period it causes serious performance problems with other subsystems.
The problem is caused by spi_flash_erase_range, which blocks all other execution while erasing each flash block (via spi_flash_disable_interrupts_caches_and_other_cpu). I speculate that, although the flash guard is released and re-acquired after each block, other waiting tasks are not given a chance to run. This is a fairly aggressive manoeuvre that may be acceptable for small partitions, but causes problems when the partition takes more than, say, 3 seconds to erase. In that case, application-level deadlines may be missed.
Workaround: SPI_FLASH_YIELD
To address this issue, the config SPI_FLASH_YIELD_DURING_ERASE was introduced (#5171), which effectively adds a vTaskDelay(1) after each block erase. This seems to have helped some users. However, in our case the 1-tick delay was apparently not sufficient to prevent starvation of Wi-Fi and modest application tasks.
One solution is to increase CONFIG_SPI_FLASH_ERASE_YIELD_TICKS to delay longer after each block erasure. However, there's still a shortcoming in this "ERASE_YIELD" trick: the CONFIG_SPI_FLASH_ERASE_YIELD_DURATION_MS variable is not used in a reasonable way. It checks whether a single block erase took longer than a threshold, which effectively defaults to 10 ms, at which setting the yield always executes. However, if this threshold is increased, then the yield will likely never run, and starvation will occur. Simply increasing the ticks may incur significant delays (64 blocks * number of ticks). It would make much more sense to check whether a certain total erase time has elapsed before calling vTaskDelay. This is exactly what we've done with our internal patchset on top of ESP-IDF.
We found that a 500 ms delay every 2000 ms worked adequately. It could likely also be more fine-grained, perhaps a 200 ms delay every 1000 ms.
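The "yield once a total erase-time budget has elapsed" idea can be sketched as follows. This is a minimal host-side illustration, not the actual internal patch and not ESP-IDF code: `erase_hooks_t`, `erase_with_time_budget`, and the `sim_*` helpers are hypothetical names invented for this example. On a device the hooks would wrap the real block erase, a millisecond clock, and vTaskDelay. The simulation assumes 250 ms per block, which follows from the roughly 4 s per megabyte figure if blocks are 64 KiB (also an assumption).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks so the sketch runs on a host. On a device they would
 * wrap the real block erase, a millisecond clock, and vTaskDelay(). */
typedef struct {
    void     (*erase_block)(size_t index, void *ctx);
    uint32_t (*now_ms)(void *ctx);
    void     (*yield_ms)(uint32_t ms, void *ctx);
    void     *ctx;
} erase_hooks_t;

/* Erase block_count blocks. Whenever at least budget_ms has passed since
 * the last pause (and more blocks remain), pause for pause_ms so that
 * other tasks can run. Returns the number of pauses taken. */
static unsigned erase_with_time_budget(const erase_hooks_t *h,
                                       size_t block_count,
                                       uint32_t budget_ms,
                                       uint32_t pause_ms)
{
    assert(h && h->erase_block && h->now_ms && h->yield_ms);
    unsigned pauses = 0;
    uint32_t window_start = h->now_ms(h->ctx);

    for (size_t i = 0; i < block_count; i++) {
        h->erase_block(i, h->ctx);
        /* Unsigned subtraction stays correct across clock wrap-around. */
        uint32_t elapsed = h->now_ms(h->ctx) - window_start;
        if (i + 1 < block_count && elapsed >= budget_ms) {
            h->yield_ms(pause_ms, h->ctx);
            pauses++;
            window_start = h->now_ms(h->ctx); /* new window starts after the pause */
        }
    }
    return pauses;
}

/* Host-side simulation: a fake clock where every block erase costs a
 * fixed amount of time (assumed, not measured). */
typedef struct {
    uint32_t clock_ms;
    uint32_t erase_cost_ms;
    uint32_t yielded_ms;
} sim_t;

static void sim_erase(size_t index, void *ctx)
{
    (void)index;
    sim_t *s = ctx;
    s->clock_ms += s->erase_cost_ms;
}

static uint32_t sim_now(void *ctx)
{
    return ((sim_t *)ctx)->clock_ms;
}

static void sim_yield(uint32_t ms, void *ctx)
{
    sim_t *s = ctx;
    s->clock_ms += ms;
    s->yielded_ms += ms;
}
```

With 64 simulated blocks at 250 ms each, a 2000 ms budget, and a 500 ms pause, this takes 7 pauses (3.5 s of yielding on top of 16 s of erasing). The pauses are concentrated where they give other tasks a real window to run, and their total cost does not scale with the block count the way a fixed per-block delay does.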
However, it appears that this logic has been removed in master, with spi_flash_erase_range being deprecated. I suppose we should be upgrading our application to IDF v4.x.
Documentation
I believe that the esp_ota_begin documentation should make it clear that this function blocks for the duration of the partition erasure, and that this may take on the order of 4 sec per megabyte erased.
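To make the scale concrete, here is a small sketch of the arithmetic. `estimate_erase_ms` is a hypothetical helper, not an ESP-IDF function, and the 4000 ms per MiB rate is only the rough figure observed above (treating a megabyte as 1 MiB), not a datasheet value.

```c
#include <assert.h>
#include <stdint.h>

/* Rough estimate of how long an erase blocks, given an observed erase
 * rate in milliseconds per MiB. Purely illustrative arithmetic. */
static uint32_t estimate_erase_ms(uint32_t partition_bytes, uint32_t ms_per_mib)
{
    assert(ms_per_mib > 0);
    const uint64_t mib = 1024u * 1024u;
    /* 64-bit intermediate avoids overflow; round up to the next millisecond. */
    return (uint32_t)(((uint64_t)partition_bytes * ms_per_mib + mib - 1) / mib);
}
```

Under that assumption, a 4 MiB partition gives `estimate_erase_ms(4 * 1024 * 1024, 4000)`, which is 16000 ms, so the calling task would be blocked for about 16 seconds.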
Similar Fixed Issue not Backported?
There was a claim (#5591 (comment)) in a closely related issue that some Wi-Fi interference from esp_ota_begin was fixed in commit 4761c40. However, it would appear that this has not been backported to v3.x IDF, though I could be mistaken.
Expected Behavior
Calling esp_ota_begin in one task does not cause:
However, it is acceptable that it may:
(*) Although it is unclear to me why erasing an unused partition requires blocking other task execution.
Actual Behavior
esp_ota_begin against a 4 MB partition with a modest number of background tasks running (normal priority) causes:
Steps to reproduce
esp_ota_erase.
Monitor:
Code to reproduce this issue
Steps above summarize what we are doing in our commercial application which I believe could be used to reproduce the issue in an example. However, at this time I have not prepared a separate public example project.
Debug Logs
E () wifi:esf_buf
W () wifi:Next TBTT incorrect!
Other items
Relevant part:
Relevant part showing 4 MiB partitions: