-
Notifications
You must be signed in to change notification settings - Fork 7.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TW#17597] OTA updates failing randomly inside spi_flash_write_inner #1479
Comments
I have a minimalized reproduction of this issue. |
Fixes all the issues or just with esp_ota_begin? |
Just the |
Hi @MrSurly, Thanks for being patient while someone gets back to you about this. I have a couple of questions:
The first question is because we've noticed, particularly on LANs where the TCP latency is minimal and bandwidth is much higher than the flash write bandwidth, the OTA loop can starve other tasks in the system. This may explain the the task WDT, although it doesn't explain the other part. The high-priority IPC (inter-processor) task being involved in the crash dump is related to flash write operations. The CPU doing a flash write will call spi_flash_guard_start() which signals the IPC task on the other CPU to prevent it from reading flash (via code cached in flash) until the flash write is over. This doesn't explain why it's not returning from this properly, though... which is why I mention heap corruption as a possibility.
Is this something you're able to share? |
Yes, this helps immensely. Seems much more reliable. I set it to iterate the download 20 times, and it failed during # 12 with:
Honestly, not sure. Using
For my actual project, there are several; stuff like keeping Wifi active and what not. For the example code below, there's only one user-space task.
Not for the example code; haven't tried yet for production.
I am downloading from my local workstation; latency is likely < 1ms
Hrm. Let me turn on the heap checking for my main code and we'll see
Yes, below. I can provide the supporting code if you actually wish to compile. // C
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stddef.h>
#include <string.h>
#include <sys/param.h> // MIN/MAX
// ESP
#include <esp_task_wdt.h>
#include <esp_wifi.h>
#include <esp_wifi_types.h>
#include <esp_log.h>
#include <esp_event_loop.h>
#include <esp_ota_ops.h>
#include <nvs_flash.h>
// local
#include "Http.hpp"
#include "Sha256.hpp"
#include "elapsedMillis.h"
#include "log.h"
#include "config.h"
#define HEXDIGIT(x) (x) <= 9 ? '0' + (x): 'A' + ((x)-10)
#define ARRAY_LEN(x) (sizeof(x) / sizeof(x[0]))
using namespace std;
class Main {
public:
Main() : m_haveIp(false) {}
static esp_err_t eventHandler(void *ctx, system_event_t *event) {
const char * TAG = "Event";
Main& self = *(Main*)ctx;
(void)self;
switch (event->event_id) {
case SYSTEM_EVENT_STA_START:
L(TAG, "STA_START");
break;
case SYSTEM_EVENT_STA_GOT_IP:
L(TAG, "Got IP");
self.m_haveIp = true;
break;
case SYSTEM_EVENT_STA_DISCONNECTED:
{
self.m_haveIp = false;
// This is a workaround as ESP32 WiFi libs don't currently
// auto-reassociate.
system_event_sta_disconnected_t *disconn = &event->event_info.disconnected;
L(TAG, "STA_DISCONNECTED, reason:%d", disconn->reason);
switch (disconn->reason) {
case WIFI_REASON_BEACON_TIMEOUT:
L(TAG,"beacon timeout");
break;
case WIFI_REASON_NO_AP_FOUND:
L(TAG,"no AP found");
break;
case WIFI_REASON_AUTH_FAIL:
L(TAG,"authentication failed");
break;
default:
// Let other errors through and try to reconnect.
break;
}
break;
}
default:
L(TAG, "Some unhandled event: %d", event->event_id);
}
return ESP_OK;
}
void restart() {
esp_restart();
}
bool downloadUpdate(const string& url, string updateHash) {
const char * TAG = "update";
const esp_partition_t *configured = esp_ota_get_boot_partition();
const esp_partition_t *running = esp_ota_get_running_partition();
const esp_partition_t *update_partition = esp_ota_get_next_update_partition(NULL);
if (configured != NULL ) {
L(TAG, "Configured for partition subtype %d at offset 0x%x", configured->subtype, update_partition->address);
} else {
L(TAG, "Configured partition is NULL");
}
if (running != NULL) {
L(TAG, "Running from partition subtype %d at offset 0x%x", running->subtype, update_partition->address);
} else {
L(TAG, "Running partition is NULL");
}
if(update_partition != NULL) {
L(TAG, "Writing to partition subtype %d at offset 0x%x", update_partition->subtype, update_partition->address);
} else {
L(TAG, "Running partition is NULL");
}
esp_ota_handle_t update_handle;
L(TAG, "update begin");
elapsedMillis beginTimer;
if (esp_ota_begin(update_partition, OTA_SIZE_UNKNOWN, &update_handle) == ESP_OK) {
L(TAG, "after update begin, %d milliseconds", (int) beginTimer);
Http http("mango/1.0");
int result = http.get(url);
if (result == 200) {
Sha256 sha256;
uint8_t buf [4096];
int read;
int buf_fill = 0;
int ttl_read = 0;
int timeout = 5000;
elapsedMillis printTimer = 3000;
while (true) {
read = http.read(buf + buf_fill, sizeof(buf) - buf_fill , timeout);
if ( read < 0 ) {
if (errno == EAGAIN) {
L(TAG, "read timeout");
continue;
}
L(TAG,"read error, aborting");
return false;
}
buf_fill += read;
if (read != 0 && buf_fill < sizeof(buf)) {
continue;
}
ttl_read += buf_fill;
sha256.update(buf, buf_fill);
//L(TAG, "> esp_ota_write");
if (esp_ota_write(update_handle, buf, buf_fill) != ESP_OK) {
L(TAG,"ota_esp_write failed");
return false;
}
//L(TAG, "< esp_ota_write");
if (printTimer > 500) {
L(TAG, "downloaded %d bytes", ttl_read);
printTimer = 0;
}
if (read == 0) {
break;
}
buf_fill = 0;
}
L(TAG, "downloaded %d bytes total", ttl_read);
if (esp_ota_end(update_handle) != ESP_OK) {
L(TAG,"esp_ota_end() failed");
return false;
}
uint8_t computedHash[65];
memset(computedHash, 0, sizeof(computedHash));
sha256.finish(computedHash);
for (std::string::iterator i = updateHash.begin(); i != updateHash.end(); ++i) {
*i = toupper(*i);
}
for (int i = 31; i >= 0; --i) {
uint8_t b = computedHash[i];
computedHash[i * 2] = HEXDIGIT(b >> 4);
computedHash[i * 2 + 1] = HEXDIGIT(b & 0xF);
}
L(TAG, "computed hash %s", computedHash);
if (updateHash != (char *)computedHash) {
L(TAG, "computed hash of %s does not equal update hash of %s, aborting", computedHash, updateHash.c_str());
return false;
}
L(TAG, "Updating bootable partition");
if (esp_ota_set_boot_partition(update_partition) != ESP_OK) {
L(TAG,"esp_ota_set_boot_partition() failed");
return false;
}
} else {
L(TAG,"HTTP GET failed");
return false;
}
} else {
L(TAG, "esp_ota_begin() returned an error");
return false;
}
L(TAG, "OTA update successful");
return true;
}
void wifi() {
const char * ssid = WIFI_SSID;
const char * pass = WIFI_PASS;
tcpip_adapter_init();
esp_event_loop_init(eventHandler, this);
const char * TAG = "initWifi";
wifi_init_config_t cfg = WIFI_INIT_CONFIG_DEFAULT();
L(TAG, "esp_wifi_init()");
if (esp_wifi_init(&cfg) != ESP_OK) {
L(TAG, "esp_wifi_init() failed");
restart();
}
L(TAG, "esp_wifi_set_storage()");
if(esp_wifi_set_storage(WIFI_STORAGE_RAM) != ESP_OK) {
L(TAG, "esp_wifi_set_storage() failed");
restart();
}
L(TAG, "esp_wifi_set_mode()");
if(esp_wifi_set_mode(WIFI_MODE_STA) != ESP_OK) {
L(TAG, "esp_wifi_set_mode() failed");
restart();
}
L(TAG, "esp_wifi_start()");
if(esp_wifi_start() != ESP_OK) {
L(TAG, "esp_wifi_start() failed");
restart();
}
wifi_config_t config;
memset(&config, 0, sizeof(config));
memcpy(config.sta.ssid, ssid, MIN(strlen(ssid), sizeof(config.sta.ssid)));
memcpy(config.sta.password, pass, MIN(strlen(pass), sizeof(config.sta.password)));
if(esp_wifi_set_config(ESP_IF_WIFI_STA, &config) != ESP_OK) {
L(TAG, "esp_wifi_set_config() failed");
restart();
}
if (esp_wifi_connect() != ESP_OK) {
L(TAG, "esp_wifi_connect() failed");
restart();
}
}
void run() {
nvs_flash_init(); // Required for Wifi, BT
const char * TAG = "main";
wifi();
while(!m_haveIp) {
L(TAG, "waiting for IP");
sleep(1);
}
for(int i = 0; i < 20; i++) {
L(TAG, "Begin test #%d\n", i + 1);
downloadUpdate("http://172.16.23.157/test.bin","13fbead996e7fa70612467c3ea478fb8db80db47b0d5f97e0f7662027bbf9cd9");
L(TAG, "End test #%d\n", i + 1);
}
L(TAG, "finished, idle");
while(true) {
sleep(3);
}
}
private:
volatile bool m_haveIp;
};
void mainTask (void* ctx) {
(void)ctx;
Main test;
test.run();
}
extern "C" {
void app_main() {
xTaskCreate(mainTask, "mainTask", 32768, NULL, tskIDLE_PRIORITY, NULL);
}
}; |
Just got this ... a little different this time since CPU 0 says "wifi". The ones after it say "IDLE" for CPU 0.
Also, sometimes the watchdog message is after download is complete, during a call to |
Not sure of the relevance, but upon success, this is seen when finishing the OTA
|
Thanks for all this info. There's a lot here, but one thing jumped out at me:
This will allow the task to migrate between cores depending on which is idle. Given the crash seems to be related to inter-core communciation in a busy system, this may be relevant. Can you please try changing this to xTaskCreatePinnedToCore(). You'll need another argument at the end of the args list, either 0 or 1 (for which core to pin to). If you're not sure, try 1. (This isn't a fix as much as trying to narrow down whatever the underlying cause of the bug is.) |
Edit: Pinning to core 0 seems to cause the WDT loop Using
|
So, this just happened, inside of
|
I am facing exactly same with OTA. I have external SPI RAM. There are many tasks running while OTA is running. If I run only wifi and OTA tasks, OTA completes successfully without watchdog trigger issue. |
Now I'm seeing WDT resets during normal operation, which appear flash related:
|
Hi @MrSurly , Sorry, I've been out sick the past week. Can I please take you up on the offer of a full project I can reproduce this in? Angus |
=( Sorry to hear that. I hope you're feeling better.
Absolutely. It's presently a branch from a client project. I'm out for the next 4 days -- I'll share a repo when I can get it properly sanitized and anonymized. |
@projectgus I use an image > 1MB. Usually fails around attempt # 10, with a WDT. For the production project I'm working on, this fails 3/4 times; probably because of more threads running ..? |
@projectgus Any word? |
I have one question, maybe its relevant or maybe not. Why you create main task with so low priority? Could you try something like: |
Hi @MrSurly , Sorry for the extended silence I've been busy with some other tasks and this slipped. @chegewara makes a very good point about the task priority. I totally missed this when you posted it earlier. This means that the main task is always competing with the idle task for CPU time, and this may have unexpected consequences (at minimum, degraded performance as the idle task generally assumes that any "real" task will pre-empt it). Try raising it to a value like the 5 suggested, or even Angus |
Increasing the priority seems to have entirely fixed the issue, though I don't understand the mechanism of why. |
This is good position to start. Sure, you dont need to know freeRTOS, but if you will learn just a basic stuff then it will help you a lot in work with esp-idf. Task has few parameters, one of them is priority. In your code it was set to tskIDLE_PRIORITY which is lowest possible. This priority is for idle freeRTOS task to make some cleanings etc. When you got your task set to tskIDLE_PRIORITY freeRTOS has been switching between your task and idle task even if your main task required more time to finish job. Basicaly for normal applications its good to set task priority between 5 and 10. But like i said, its good to read about it from this document to save time and health and avoid strange app behaviour. |
@MrSurly , To add to what @chegewara says, the "idle task" is special because it's assumed that it's the task that runs when there's nothing of higher priority to run. So it may do things like, for example, suspend the CPU clock to put the system into a low power state. The idle task can do this because it knows that if anything else needed to run, it would have a higher priority so it would already be running (in a hard real-time system like an RTOS, the highest priority "ready" task will always run before any lower priority task(s)). There is also a small difference in the ESP-IDF implementation of the FreeRTOS scheduler where we don't guarantee fair round-robin preemption of tasks at the same priority. The idle task never suspends waiting on a mutex or a semaphore (by design, because the assumption is that any other task in the system will have higher priority so why should the idle task ever need to block). So there's no guarantee that preemption will fairly wake up another task at the same "idle" priority if it's blocked on something. (Normally this scenario would trigger the ESP-IDF task watchdog to say "hey, task X is never blocking and might be starving something else out." But, again, the idle task is special as the assumption is that it's the lowest priority thing in the system so it can never starve anything out.) TLDR: Every task created should be at least Sorry that I didn't spot this in your sample code the first time around, and well done @chegewara for spotting it. |
Its my bad too, because i dont read most of issues ( i just dont feel i have knowledge to answer, this time i just got lucky), and im glad it solved your problem. Have a fun with esp32. |
Thank you both for this. That's what I get for copy-paste the task create code. Going to close this. |
EDIT 3:
esp_ota_begin
sometimes takes a few seconds to return. Sometimes, it never returns, and a WDT reboot occurs.EDIT 3.5: Increasing WDT timeout to 10s (per #578) fixes this the
esp_ota_begin
issue, but not the main issue outlined below.EDIT 2: Increasing IPC stack size does not help.
EDIT 1: Calling
esp_ota_write
with 4096 bytes (full flash sector) seems to help, a bit. It now sometimes completes. Probably because this results in fewer calls.Having a lot of trouble with OTA updates.
After doing a bit of digging, it appears that
spi_flash_write_inner
is not returning properly from where it's called fromflash_ops.c
:https://github.com/espressif/esp-idf/blob/master/components/spi_flash/flash_ops.c#L375
Line 377 is never reached.
I've updated to the latest IDF (hash f9ad17e), and the IDF submodules are up-to-date as well. Toolchain is:
xtensa-esp32-elf-gcc (crosstool-NG crosstool-ng-1.22.0-75-gbaf03c2) 5.2.0
Doesn't seem to matter if
SPI_FLASH_VERIFY_WRITE
(and setting all suboptions) is set or not. UnsettingCONFIG_SPI_FLASH_ROM_DRIVER_PATCH
seems to make it worse, when usinWhen it freezes, the serial monitor shows:
Relevant code section is below.
esp_ota_write
fails to return.sdkconfig:
The text was updated successfully, but these errors were encountered: