ESP32-C3 Stack Protection Debug Assist module triggering on SP load (IDFGH-13568) #14456

Open
3 tasks done
projectgus opened this issue Aug 28, 2024 · 3 comments
Labels
Status: Opened Issue is new Type: Bug bugs in IDF

Comments

@projectgus
Contributor

projectgus commented Aug 28, 2024

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.2.2, also v5.2.2-639-g43098fc4de

Espressif SoC revision.

ESP32-C3 (QFN32) (revision v0.4)

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

SEEED XIAO ESP32-C3

Power Supply used.

USB

What is the expected behavior?

Load SP register with a valid address (inside the current task's stack region) without Debug Assist hardware Stack Protection triggering.

What is the actual behavior?

Loading the SP register seems to intermittently trigger a hardware stack protector interrupt. All of the reported addresses look valid for the running task, i.e. there was no stack overflow or SP corruption.

Steps to reproduce.

Reproduction currently requires the MicroPython master branch and some Python code that sends a lot of data over Wi-Fi. (The original bug is micropython/micropython#15667)

It is probably possible to make a simpler reproducer. My best guess is that the key features are:

  • Frequent context switches and/or interrupts (possibly something else related to Wi-Fi activity, but interrupts are my guess).
  • Execution in the task is jumping around using a setjmp/longjmp style mechanism. For MicroPython this is "native NLR" implemented here: https://github.com/micropython/micropython/blob/master/py/nlrrv32.c#L53 however the original issue report was using libc setjmp/longjmp.

Note that all of the jumps are happening within the same task, and the stack pointer is saved and restored each time to/from a valid value for the current executing task.

Debug Logs.

Here's a sample crash:

MPY version : v1.24.0-preview.201.g24aa8ed762.dirty on 2024-08-28
IDF version : v5.2.2
Machine     : ESP32C3 module with ESP32C3

Guru Meditation Error: Core  0 panic'ed (Stack protection fault). 

Detected in task "mp_task" at 0x4200b1ee
0x4200b1ee: nlr_jump at /home/gus/ry/george/micropython/py/nlrrv32.c:55

Stack pointer: 0x3fca7ff0
Stack bounds: 0x3fca43a4 - 0x3fca83a0


Core  0 register dump:
Stack dump detected
MEPC    : 0x4200b200  RA      : 0x403829fa  SP      : 0x3fca7ff0  GP      : 0x3fc96e00  
0x4200b200: nlr_jump at /home/gus/ry/george/micropython/py/nlrrv32.c:55
0x403829fa: mp_execute_bytecode at /home/gus/ry/george/micropython/py/vm.c:285

TP      : 0x3fc6b838  T0      : 0x3fca7fa0  T1      : 0x40390f52  T2      : 0x0000003f  
0x40390f52: vTaskSuspend at /home/gus/ry/george/esp-idf-v5/components/freertos/FreeRTOS-Kernel/tasks.c:1960 (discriminator 1)

S0/FP   : 0x3fcabbe0  S1      : 0x3fcabc30  A0      : 0x3fca8010  A1      : 0x00000054  
A2      : 0x00000000  A3      : 0x3fcc99c0  A4      : 0x3fcc99c0  A5      : 0x3fca80e0  
A6      : 0x00000002  A7      : 0x21400000  S2      : 0x3c17034c  S3      : 0x3fcc9950  
S4      : 0x00000001  S5      : 0x00000062  S6      : 0x00000068  S7      : 0x3c16dc1c  
S8      : 0x0000001b  S9      : 0x3c16e000  S10     : 0x3c178419  S11     : 0x3c1781b6  
T3      : 0x00000000  T4      : 0x0003877f  T5      : 0x00000003  T6      : 0x00000001  
MSTATUS : 0x00001881  MTVEC   : 0x40380001  MCAUSE  : 0x0000001b  MTVAL   : 0x00004505  
0x40380001: _vector_table at ??:?

MHARTID : 0x00000000  


Backtrace:


0x4200b200 in nlr_jump (val=0x3fcabc30) at /home/gus/ry/george/micropython/py/nlrrv32.c:55
55          __asm volatile (
#0  0x4200b200 in nlr_jump (val=0x3fcabc30) at /home/gus/ry/george/micropython/py/nlrrv32.c:55
#1  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
ELF file SHA256: 7e6b188d6

Note that the Stack pointer address in the dump is valid for the bounds of the task.

This crash dump was created with a couple of additions in the nlr_jump function to try and get extra debug info:

4200b184 <nlr_jump>:
        "sw   x2, 60(x10)       \n" // Store SP.
        "jal  x0, nlr_push_tail \n" // Jump to the C part.
        );
}

NORETURN void nlr_jump(void *val) {
4200b184:       1141                    addi    sp,sp,-16
4200b186:       c226                    sw      s1,4(sp)
4200b188:       c04a                    sw      s2,0(sp)
4200b18a:       c606                    sw      ra,12(sp)
4200b18c:       84aa                    mv      s1,a0
    MP_NLR_JUMP_HEAD(val, top)
4200b18e:       dc1fc0ef                jal     ra,42007f4e <mp_thread_get_state>
4200b192:       01452903                lw      s2,20(a0)
4200b196:       c422                    sw      s0,8(sp)
4200b198:       00091563                bnez    s2,4200b1a2 <nlr_jump+0x1e>
4200b19c:       8526                    mv      a0,s1
4200b19e:       ec2fa0ef                jal     ra,42005860 <nlr_jump_fail>
4200b1a2:       842a                    mv      s0,a0
4200b1a4:       00992223                sw      s1,4(s2)
4200b1a8:       854a                    mv      a0,s2
4200b1aa:       71a420ef                jal     ra,4204d8c4 <nlr_call_jump_callbacks>
4200b1ae:       00092783                lw      a5,0(s2)
4200b1b2:       c85c                    sw      a5,20(s0)
    __asm volatile (
4200b1b4:       854a                    mv      a0,s2
4200b1b6:       000102b3                add     t0,sp,zero  // Note: stored pre-restore SP to t0
4200b1ba:       00852083                lw      ra,8(a0)
4200b1be:       4540                    lw      s0,12(a0)
4200b1c0:       4904                    lw      s1,16(a0)
4200b1c2:       01452903                lw      s2,20(a0)
4200b1c6:       01852983                lw      s3,24(a0)
4200b1ca:       01c52a03                lw      s4,28(a0)
4200b1ce:       02052a83                lw      s5,32(a0)
4200b1d2:       02452b03                lw      s6,36(a0)
4200b1d6:       02852b83                lw      s7,40(a0)
4200b1da:       02c52c03                lw      s8,44(a0)
4200b1de:       03052c83                lw      s9,48(a0)
4200b1e2:       03452d03                lw      s10,52(a0)
4200b1e6:       03852d83                lw      s11,56(a0)
4200b1ea:       03c52103                lw      sp,60(a0)
4200b1ee:       0001                    nop  // <-- address the Debug Assist reports
4200b1f0:       0001                    nop
4200b1f2:       0001                    nop
4200b1f4:       0001                    nop
4200b1f6:       0001                    nop
4200b1f8:       0001                    nop
4200b1fa:       0001                    nop
4200b1fc:       0001                    nop
4200b1fe:       0001                    nop
4200b200:       4505                    li      a0,1  // <-- MEPC when the protection actually triggers
4200b202:       00008067                ret
  • The debug assist always points to the instruction after loading SP as the one which triggered protection.
  • Adding the add t0,sp,zero means temp register t0 holds the "before restore" SP value in the crash dump. Note that this SP value is also inside the task bounds.
  • Note that neither SP value is close to the stack limit. I doubled the task stack size and re-tested just in case, it crashes the same.
  • Adding the NOPs at the end means that the exception register dump is valid for all register values at the time of triggering (otherwise the CPU exception triggers a couple of instructions after returning, which makes it harder to follow). This is also why the Backtrace doesn't decode here: the SP has just been updated, so it doesn't point to the executing frame. The stack isn't corrupt, though; if you take the NOPs out, the Backtrace decodes correctly.

More Information.

  • I suspect that a context switch or an interrupt occurring immediately before or after the lw sp,60(a0) instruction is causing the stack protection to trigger.
  • Tried some simple patches in components/esp_system/port/include/private/esp_private/hw_stack_guard.h, such as adding fence instructions and big nop blocks at the end of the ESP_HW_STACK_GUARD_MONITOR_STOP_CPU0 and ESP_HW_STACK_GUARD_MONITOR_START_CPU0 macros, in case there was some race with the Debug Assist registers changing during a context switch. It still crashes, although I don't really know what I'm doing there.
  • Have NOT tried disabling interrupts inside nlr_jump. That seems like a possible workaround but also doesn't seem like it should be necessary...?

Happy to try anything you recommend. I might even be able to provide a C reproducer that uses setjmp/longjmp.

@projectgus projectgus added the Type: Bug bugs in IDF label Aug 28, 2024
@github-actions github-actions bot changed the title ESP32-C3 Stack Protection Debug Assist module triggering on SP load ESP32-C3 Stack Protection Debug Assist module triggering on SP load (IDFGH-13568) Aug 28, 2024
@espressif-bot espressif-bot added the Status: Opened Issue is new label Aug 28, 2024
@Lapshin
Collaborator

Lapshin commented Aug 30, 2024

Hi @projectgus , thank you for reporting!

That's a bizarre bug you caught. Could you please provide a reproducer?

I tried to reproduce it with this code + holding a key pressed :D
#include <stdio.h>
#include <inttypes.h>
#include "sdkconfig.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_chip_info.h"
#include "esp_flash.h"
#include "esp_system.h"
#include "esp_intr_alloc.h"
#include "soc/periph_defs.h"
#include "hal/uart_ll.h"

// UART0 RX interrupt: drain the FIFO and acknowledge the interrupt,
// so that every key press produces interrupt activity.
void interrupt_handler(__attribute__((unused)) void *)
{
    int fifolen = uart_ll_get_rxfifo_len(&UART0);
    while (fifolen != 0) {
        unsigned char data;
        uart_ll_read_rxfifo(&UART0, &data, 1);
        fifolen--;
    }
    uart_ll_clr_intsts_mask(&UART0, UART_INTR_RXFIFO_FULL | UART_INTR_RXFIFO_TOUT);
}

void app_main(void) {
    esp_intr_alloc(ETS_UART0_INTR_SOURCE, 0, interrupt_handler, NULL, NULL);
    while (1) {
        asm volatile("add t0,sp,zero");   // t0 = current SP
        asm volatile("sw sp,16(t0)");     // save SP to memory at 16(t0)
        asm volatile("addi sp,sp,-100");  // move SP, still inside the task stack
        asm volatile("nop");
        asm volatile("nop");
        asm volatile("lw sp,16(t0)");     // reload SP from memory, as nlr_jump does
        vTaskDelay(50 / portTICK_PERIOD_MS);
    }
}

However, I could not reproduce the fault with v5.2.2 (3b8741b).

projectgus added a commit to projectgus/micropython that referenced this issue Sep 3, 2024
Workaround for what appears to be an upstream issue:
espressif/esp-idf#14456

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
projectgus added a commit to projectgus/micropython that referenced this issue Sep 3, 2024
Workaround for what appears to be an upstream issue:
espressif/esp-idf#14456

Re-enables the stack protector watchpoint which was the
ESP-IDF default before stack protector was enabled
(and still the default for Xtensa CPUs).

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
@projectgus
Contributor Author

@Lapshin I haven't had any luck yet either. Maybe it actually requires high Wi-Fi traffic. I will keep at it and let you know.

dpgeorge pushed a commit to projectgus/micropython that referenced this issue Sep 4, 2024
Workaround for what appears to be an upstream issue:
espressif/esp-idf#14456

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
graeme-winter pushed a commit to winter-special-projects/micropython that referenced this issue Sep 21, 2024
Workaround for what appears to be an upstream issue:
espressif/esp-idf#14456

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
@dhalbert

dhalbert commented Oct 23, 2024

We ran into what seems like the same problem in CircuitPython: adafruit/circuitpython#9749 (see a simple test program there), and fixed it with CONFIG_ESP_SYSTEM_HW_STACK_GUARD=n for ESP32-C3 and ESP32-C6: adafruit/circuitpython#9748, after seeing MicroPython's fix for this.

EDIT: CONFIG_ESP_SYSTEM_HW_STACK_GUARD is enabled when SOC_ASSIST_DEBUG_SUPPORTED is defined, which is true for ESP32-C2, C3, C6, P4, and will eventually be true for C5 and C61, according to comments in ESP-IDF. So we decided to turn it off for all relevant builds.
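
For anyone applying the same workaround, it amounts to a Kconfig override roughly like the fragment below. Placing it in a project's sdkconfig.defaults is an assumption about the build setup. The second line is my reading of the MicroPython commit message above ("Re-enables the stack protector watchpoint"); that option name does not appear in this thread, so treat it as an assumption and check it against your ESP-IDF version.

```
# Disable the Debug Assist based hardware stack guard (workaround from this thread)
CONFIG_ESP_SYSTEM_HW_STACK_GUARD=n
# Assumed: fall back to the FreeRTOS end-of-stack watchpoint instead
CONFIG_FREERTOS_WATCHPOINT_END_OF_STACK=y
```

This only hides the symptom; it does not explain why the stack guard triggers on a valid SP.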
