OTA boot partition selection & coredump not working (IDFGH-1717) #3954

ammaree · 2019-08-22T16:52:02Z

Environment

Development Kit: [none]
Kit version (for WroverKit/PicoKit/DevKitC): [NA]
Module or chip used: [ESP32-WROOM-32]
IDF version (run git describe --tags to find it):
v4.0-dev-1443-g39f090a4f
Build System: [Make]
Compiler version (run xtensa-esp32-elf-gcc --version to find it):
(crosstool-NG esp32-2019r1) 8.2.0
Operating System: [Win10]
Power Supply: [external 3.3V]

Problem Description

During late 2017 we developed our own mechanism to support 3 different OTA images without a factory partition. The logic was (and still is) that with 3 different FW images we had the flexibility to support different functionality and/or different versions and to switch between these as required.

The normal OTA functionality continues to work perfectly and we can download new FW via OTA as and when required. But, due to the stability of our FW we have not needed to use rollback during the last ~5 months, until very recently when it failed.

From tracing the flow we found the process hangs in esp_ota_set_boot_partition(). Is there any clear reason, possibly due to the significant changes that has been made over the last ~5 months, that would cause the hangup?

The coredump functionality stopped working in our dev environment somewhere during the last ~2 months. The situation is possibly similar to CoreDump to flash fails in master but works in v3.2.2 (IDFGH-1602) #3857

Expected Behavior

When esp_ota_set_boot_partition() is called to select an earlier (but still valid OTA) firmware, the operation should complete and return for restart to take place.
When unplanned restart happens, core dump info should be written to FLASH

Actual Behavior

When esp_ota_set_boot_partition() is called the MCU hangs indefinitely and a hard reset must be done
On crash/unplanned reboot, when we look for the core dump partition in order to do HTTPS upload to cloud, the length/size/num/ver are all 0xFFFFFFFF.

Steps to reproduce

Very easy and consistently reproduced in our dev environment. Only have to request a FW revert or select a specific boot image to trigger the hangup.
Also easily reproduced, just hard reboot a device in our test environment.

Code to reproduce this issue

Very difficult to provide sample code since the code is part of a very large code base, BUT the exact same code has been running since Jan 2018. Only update was done some months March 13 ( #1650 ) to calculate the core dump image size and discard the magic number that was removed.

esp_ota_set_boot_partition() related:

int32_t halFOTA_SetBootNumber(int32_t NewPart, uint32_t Mode) {
CTRACK("Partition Run=%d New=%d", CurPart, NewPart) ;
if (halFOTA_CheckRevert(NewPart) == erFAILURE) {
SL_ALRT("Unable to revert to P#%d breaking v%I", NewPart, versionLAST_BREAKING) ;
return erFAILURE ;
}
esp_partition_iterator_t sIter = esp_partition_find(ESP_PARTITION_TYPE_APP, mapLOG2SUB(NewPart), NULL) ;
IF_myASSERT(debugRESULT, sIter != 0) ;
const esp_partition_t * psPart = esp_partition_get(sIter) ;
IF_myASSERT(debugRESULT, mapSUB2LOG(psPart->subtype) == NewPart) ;
halFOTA_ReportPartition("New Boot", psPart, NULL) ;
vTaskDelay(1000) ;
SL_WARN("Boot P#%u (%p) in %s mode %s", NewPart, psPart->address, (Mode & fotaERASE_WIFI) ? "AP" : "STA", (Mode & fotaBOOT_REBOOT) ? "NOW" : "Later") ;
vTaskDelay(1000) ;
CTRACK() ;
ESP_ERROR_CHECK(esp_ota_set_boot_partition(psPart)) ; // set new boot partition
CTRACK() ;
esp_partition_iterator_release(sIter) ;
CTRACK() ;
if (Mode & fotaERASE_WIFI) {
halSTORAGE_DeleteKeyValue(halSTORAGE_STORE, halSTORAGE_KEY_WIFI) ; // WIFI blob
}
if (Mode & fotaERASE_VARS) {
halSTORAGE_DeleteKeyValue(halSTORAGE_STORE, halSTORAGE_KEY_VARS) ; // VARS blob
}
// if (Mode & (fotaERASE_OTA1 | fotaERASE_OTA2 | fotaERASE_OTA3)) { }// do something to erase partition on reboot
if (Mode & fotaBOOT_REBOOT) {
vRtosSetStatus(flagAPP_RESTART) ;
}
CTRACK() ;
return NewPart ;
}

Coredump related:

int32_t xHttpClientCoredumpUpload(void * pvPara) {
esp_partition_iterator_t sIter = esp_partition_find(ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_DATA_COREDUMP, NULL) ;
IF_myASSERT(debugRESULT, sIter != 0) ;
const esp_partition_t * psPart = esp_partition_get(sIter) ;
cd_hdr_t sCDhdr ;
int32_t iRV = esp_partition_read(psPart, 0, &sCDhdr, sizeof(sCDhdr)) ;
if ((sCDhdr.data_len == sCDhdr.tasks_num) && (sCDhdr.tcb_sz == sCDhdr.version)) {
SL_ALRT("Coredump not found: L=0x%X N=%d S=0x%X V=%d", sCDhdr.data_len, sCDhdr.tasks_num, sCDhdr.tcb_sz, sCDhdr.version) ;
return erFAILURE ;
}
if (iRV == ESP_OK) {
http_reqres_t sRR = { 0 } ;
sock_sec_t sSecure = { 0 } ;
sSecure.pPem = HostInfo[nvsVars.HostFOTA].pCert ;
sSecure.PemSize = HostInfo[nvsVars.HostFOTA].Size ;
sRR.sCtx.psSec = &sSecure ;
sRR.sCtx.pHost = HostInfo[nvsVars.HostFOTA].pName ;
sRR.pcQuery = "PUT /coredump/%m_%X_%X_%llu.bin" ;
sRR.handler = xHttpClientCoredumpUploadCB ;
sRR.hvContentType = ctApplicationOctetStream ;
sRR.pvArg = (void *) psPart ;
sRR.hvContentLength = (uint64_t) (sCDhdr.data_len) ;
iRV = xHttpClientExecuteRequest(&sRR, macSTA, esp_reset_reason(), DEV_FW_VER_NUM, sTSZ.usecs/MICROS_IN_SECOND) ;
if (iRV > 0) {
SL_NOT("Done: Len=%u Ver=%u Tsk=%u TCB=%u",sCDhdr.data_len, sCDhdr.version, sCDhdr.tasks_num, sCDhdr.tcb_sz) ;
}
}
esp_partition_iterator_release(sIter) ;
return iRV ;
}

Debug Logs

esp_ota_set_boot_partition() related:
Nothing to share here other than that it hangs in the function. No logs are shown.
Coredump related:

Guru Meditation Error: Core 0 panic'ed (StoreProhibited). Exception was unhandled.
Core 0 register dump:
PC : 0x400d9630 PS : 0x00060730 A0 : 0x800d97c5 A1 : 0x3ffbfeb0
0x400d9630: vCommandInterpret at C:/Dropbox/devs/ws/z-appl/iMacs/task_control.c:329
A2 : 0xffffffff A3 : 0x00000001 A4 : 0x00f002bf A5 : 0x3ffc1ed0
A6 : 0x00000000 A7 : 0x3ffc1ef8 A8 : 0x400d962c A9 : 0x3f4052b8
0x400d962c: vCommandInterpret at C:/Dropbox/devs/ws/z-appl/iMacs/task_control.c:329
A10 : 0x3ffb3360 A11 : 0x00000001 A12 : 0x00000001 A13 : 0x00000001
A14 : 0x00060b20 A15 : 0x00060023 SAR : 0x00000005 EXCCAUSE: 0x0000001d
EXCVADDR: 0xffffffff LBEG : 0x4000c2e0 LEND : 0x4000c2f6 LCOUNT : 0xffffffff
ELF file SHA256: 09eb963d9a76eaba94a6e3821f4d4a9b25f00dcbe93563e57e720b32966421fc
Backtrace: 0x400d962d:0x3ffbfeb0 0x400d97c2:0x3ffbfee0 0x400d98e4:0x3ffbff00 0x400d876c:0x3ffbff20 0x400d6516:0x3ffbff40
0x400d962d: vCommandInterpret at C:/Dropbox/devs/ws/z-appl/iMacs/task_control.c:329
0x400d97c2: vCommandHandler at C:/Dropbox/devs/ws/z-appl/iMacs/task_control.c:294
0x400d98e4: vControlCore at C:/Dropbox/devs/ws/z-appl/iMacs/task_control.c:572
0x400d876c: app_main at C:/Dropbox/devs/ws/z-appl/iMacs/application.c:187
0x400d6516: main_task at C:/Dropbox/devs/ws/z-sdk/esp-idf/components/esp32/cpu_start.c:558
0.6270: #0 esp_core_dump_flash Save core dump to flash...
0.6270: #0 esp_core_dump_common Found tasks: (18)!
.......
0.325: #0 esp_core_dump_flash Init core dump to flash
0.325: #0 esp_core_dump_flash Found partition 'coredump' @ 3f0000 65536 bytes
0.319: #0 preX esp_core_dump_common Failed to mmap core dump data (260)!
4.983: #0 main xHttpClientCoredumpUpload Coredump not found: L=0xFFFFFFFF N=-1 S=0xFFFFFFFF V=-1
0.319: #0 preX esp_core_dump_common Failed to mmap core dump data (260)!

Other items if possible

sdkconfig.h.txt

The text was updated successfully, but these errors were encountered:

KonstantinKondrashov · 2019-08-23T06:23:28Z

Hi @ammaree!
Could you clarify the stage otadata after not successful esp_ota_set_boot_partition ().
It was erased and device run with ota_0? Which partition was run after?

As I see, you are not using the ROLLBACK and ANTI-ROLLBACK options from IDF, right?

Which size of your FW? and could you show your partiton_table.csv.

Thanks.

ammaree · 2019-08-23T11:08:51Z

@KonstantinKondrashov
BEFORE: Running from partition 3 (OTA2)
SELECT: Change to partition 1 (OTA0)
AFTER: Hard reboot brings back to same condition booting partition 3 (OTA2)
No obvious or symptomatic change to otadata

We are not using ROLLBACK nor ANTI-ROLLBACK from IDF. We have separate logic that monitors the health of the device with the running firmware, primarily using the time since 1st boot and number of reboots since to determine whether we should roll back.

Size of firmware is 967,344 bytes
Partition table is:

nvs ,data,nvs ,0x00009000,0x00004000,
otadata ,data,ota ,0x0000d000,0x00002000,
phy_init,data,phy ,0x0000f000,0x00001000,
ota_0 ,0 ,ota_0 ,0x00010000,0x00100000,
ota_1 ,0 ,ota_1 ,0x00110000,0x00100000,
ota_2 ,0 ,ota_2 ,0x00210000,0x00100000,
FlashFS ,data,spiffs ,0x00310000,0x000e0000,
CoreDump,data,coredump,0x003f0000,0x00010000,

Output of partitions from running system is:

Appl TY=APP (0) ST=OTA (10) AD=0x00010000 SZ=0x100000 LA=' ota_0' EN='N' #1 v0.0.4.245 build=2019-08-23T09:23:00.000Z boot=2019-08-23T09:24:39.000Z count= 0
Appl TY=APP (0) ST=OTA (11) AD=0x00110000 SZ=0x100000 LA=' ota_1' EN='N' #2 v0.0.4.245 build=2019-08-23T10:53:28.000Z boot=2019-08-23T10:54:43.000Z count= 0
un TY=APP (0) ST=OTA (12) AD=0x00210000 SZ=0x100000 LA=' ota_2' EN='N' #3 v0.0.4.245 build=2019-08-23T10:59:08.000Z boot=2019-08-23T11:00:16.000Z count= 0
Data TY=DATA(1) ST=NVS (02) AD=0x00009000 SZ=0x004000 LA=' nvs' EN='N'
Data TY=DATA(1) ST=OTA (00) AD=0x0000d000 SZ=0x002000 LA=' otadata' EN='N'
Data TY=DATA(1) ST=PHY (01) AD=0x0000f000 SZ=0x001000 LA='phy_init' EN='N'
Data TY=DATA(1) ST=FFS (82) AD=0x00310000 SZ=0x0e0000 LA=' FlashFS' EN='N'
Data TY=DATA(1) ST=CDMP(03) AD=0x003f0000 SZ=0x010000 LA='CoreDump' EN='N'

KonstantinKondrashov · 2019-08-23T11:29:27Z

If you suspect that the hang occurred in esp_ota_set_boot_partition () and you can easily reproduce it, could you add additional logs to check where we hang? There are two functions: image_validate() and esp_rewrite_ota_data(). This is probably happening somewhere inside one of them.

esp_err_t esp_ota_set_boot_partition(const esp_partition_t *partition)
{
    if (partition == NULL) {
        return ESP_ERR_INVALID_ARG;
    }

    if (image_validate(partition, ESP_IMAGE_VERIFY) != ESP_OK) {
        return ESP_ERR_OTA_VALIDATE_FAILED;
    }

    // if set boot partition to factory bin ,just format ota info partition
    if (partition->type == ESP_PARTITION_TYPE_APP) {
        if (partition->subtype == ESP_PARTITION_SUBTYPE_APP_FACTORY) {
            const esp_partition_t *find_partition = esp_partition_find_first(ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_DATA_OTA, NULL);
            if (find_partition != NULL) {
                return esp_partition_erase_range(find_partition, 0, find_partition->size);
            } else {
                return ESP_ERR_NOT_FOUND;
            }
        } else {
            return esp_rewrite_ota_data(partition->subtype);
        }
    } else {
        return ESP_ERR_INVALID_ARG;
    }
}

Alvin1Zhang · 2019-08-28T01:52:50Z

@ammaree Thanks for reporting the issue, would you please help provide more logs as suggested by @KonstantinKondrashov ? Thanks.

ammaree · 2019-08-29T15:26:54Z

@Alvin1Zhang @KonstantinKondrashov

Gents, apologies. I have found the cause, it was due to the same task trying to take the same mutex twice, all in the process of logging messages whilst setting up the reboot. Solved by check for the current task already owning the mutex.

The issue of coredump not working still remains. Shall I close this issue and open a new one for the coredump problem or shall I leave this open as a reminder ?

KonstantinKondrashov · 2019-08-29T15:34:35Z

HI @ammaree!
It is well that it is resolved. Yes, could you close this issue and open a new with the coredump problem.
Thanks.

ammaree · 2019-09-04T18:02:17Z

@KonstantinKondrashov

Just an update. I have traced the problem as far as possible. The problem is related to the IDF 4.0 SPI Flash driver implementation.

Using the legacy implementation the coredump works, 17 tasks are dumped to flash and the application uploads the coredump to cloud on reboot. PERFECT.

When using the IDF 4.0 SPI Flash drive the core dump hangs in Line 130 of core_dump_flash.c where spi_flash_erase_range() is called. I did not debug further into spi_flash_erase_range() but the fault lies here. After a short delay the system reboots due to watchdog timeout as below.

W (61) boot: PRO CPU has been reset by WDT.
W (61) boot: WDT reset info: PRO CPU PC=0x40093dce
W (62) boot: WDT reset info: APP CPU PC=0x4016dc9e (waiti mode)

I hope this help to narrow down the problem.

github-actions bot changed the title ~~OTA boot partition selection & coredump not working~~ OTA boot partition selection & coredump not working (IDFGH-1717) Aug 22, 2019

ammaree closed this as completed Aug 30, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OTA boot partition selection & coredump not working (IDFGH-1717) #3954

OTA boot partition selection & coredump not working (IDFGH-1717) #3954

ammaree commented Aug 22, 2019

KonstantinKondrashov commented Aug 23, 2019

ammaree commented Aug 23, 2019

KonstantinKondrashov commented Aug 23, 2019

Alvin1Zhang commented Aug 28, 2019

ammaree commented Aug 29, 2019

KonstantinKondrashov commented Aug 29, 2019

ammaree commented Sep 4, 2019

OTA boot partition selection & coredump not working (IDFGH-1717) #3954

OTA boot partition selection & coredump not working (IDFGH-1717) #3954

Comments

ammaree commented Aug 22, 2019

Environment

Problem Description

Expected Behavior

Actual Behavior

Steps to reproduce

Code to reproduce this issue

Debug Logs

Other items if possible

KonstantinKondrashov commented Aug 23, 2019

ammaree commented Aug 23, 2019

KonstantinKondrashov commented Aug 23, 2019

Alvin1Zhang commented Aug 28, 2019

ammaree commented Aug 29, 2019

KonstantinKondrashov commented Aug 29, 2019

ammaree commented Sep 4, 2019