
Crash in VFS esp_vfs_write / uart_write / _lock_acquire(release)_recursive (IDFGH-385) #2470

Closed
ammaree opened this issue Sep 25, 2018 · 10 comments

ammaree commented Sep 25, 2018

Environment

  • Development Kit: Custom hardware
  • Kit version (for WroverKit/PicoKit/DevKitC): NA
  • Core (if using chip or module): ESP-WROOM32
  • IDF version (git rev-parse --short HEAD to get the commit id.): LATEST (today)
  • Development Env: Eclipse
  • Operating System: Windows
  • Power Supply: external 3.3V

Problem Description

Repeated crash in uart_write / _lock_acquire_recursive AND/OR uart_write / _lock_release_recursive
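For context, this path is reached by any ordinary stdio write; a minimal sketch of the kind of code involved (illustrative only, not our production code; task names, stack sizes and priorities are made up):

```c
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

/* Any printf()/ESP_LOGx() on stdout ends up in esp_vfs_write() ->
 * uart_write(), which takes the recursive newlib lock seen in the
 * backtraces; two tasks writing concurrently exercise that lock path. */
static void writer_task(void *arg)
{
    const char *tag = (const char *)arg;
    for (;;) {
        printf("%s: periodic status line\n", tag);  /* esp_vfs_write() -> uart_write() */
        vTaskDelay(pdMS_TO_TICKS(10));
    }
}

void app_main(void)
{
    xTaskCreate(writer_task, "writerA", 3072, (void *)"A", 5, NULL);
    xTaskCreate(writer_task, "writerB", 3072, (void *)"B", 5, NULL);
}
```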

Expected Behavior

The device should not crash; it must work consistently.

Actual Behavior

Crashes occur at irregular intervals on 4 out of 50+ devices.

Steps to reproduce

We have found no way to reproduce the crash consistently. We have had 6 crashes on 4 devices out of 50+ during a 24-hour period.

Code to reproduce this issue

Other items if possible

The sdkconfig is attached; it is the same for all crashes since all devices run the exact same firmware.
Coredumps are attached for 6 events (binaries all zipped into one archive, plus decoded text).
The ELF file is not attached.

Debug Logs

30aea432bc04_4p9_1537855759 esp_vfs_write.txt
30aea432c9a4_4p9_1537884881 esp_vfs_write.txt
30aea432c70c_4p9_1537848949 esp_vfs_write.txt
30aea432c95c_4p9_1537853193 esp_vfs_write.txt
30aea432c95c_4p9_1537859708 esp_vfs_write.txt
30aea432c95c_4p9_1537861983 esp_vfs_write.txt

sdkconfig.txt
uart_write.zip


igrr commented Sep 26, 2018

Could you please also attach the panic handler output which is sent to the console on crash? That is to see what kind of exception/crash this is.

Alvin1Zhang changed the title Crash in VFS esp_vfs_write / uart_write / _lock_acquire(release)_recursive [TW#26499] Crash in VFS esp_vfs_write / uart_write / _lock_acquire(release)_recursive Sep 26, 2018

ammaree commented Sep 26, 2018 via email


igrr commented Sep 26, 2018

The backtraces suggest that this might be a stack overflow, but without seeing panic handler output it's impossible to tell exactly. Please try reproducing this on a device that you have physical access to.
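In the meantime, one way to gather more data is to periodically log every task's minimum free stack, so a task creeping towards overflow shows up in the log before the panic. A sketch only, assuming CONFIG_FREERTOS_USE_TRACE_FACILITY=y in sdkconfig; task name, stack size and interval are arbitrary:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"

static const char *TAG = "stackmon";

/* Logs each task's high-water mark (minimum free stack ever observed).
 * On ESP-IDF the value should be in bytes, matching the stack sizes
 * passed to xTaskCreate(). */
static void stack_monitor_task(void *arg)
{
    (void)arg;
    for (;;) {
        UBaseType_t n = uxTaskGetNumberOfTasks();
        TaskStatus_t *status = pvPortMalloc(n * sizeof(TaskStatus_t));
        if (status != NULL) {
            n = uxTaskGetSystemState(status, n, NULL);
            for (UBaseType_t i = 0; i < n; i++) {
                ESP_LOGI(TAG, "%-16s min free stack: %u",
                         status[i].pcTaskName,
                         (unsigned)status[i].usStackHighWaterMark);
            }
            vPortFree(status);
        }
        vTaskDelay(pdMS_TO_TICKS(60000));   /* once a minute */
    }
}

void start_stack_monitor(void)   /* hypothetical helper, call once from app_main() */
{
    xTaskCreate(stack_monitor_task, "stackmon", 3072, NULL, tskIDLE_PRIORITY + 1, NULL);
}
```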


ammaree commented Sep 26, 2018

@igrr
Thanks to @projectgus's suggested temporary fix, our build environment is working again.
Other than the reset code that we will embed into the coredump filename, what supporting info should we try to gather?
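(For reference, a sketch of how that reset code could be read at boot; the helper name is made up, and esp_reset_reason() in recent IDF versions would be the higher-level alternative:)

```c
#include "rom/rtc.h"

/* Reset cause of the PRO CPU, embedded as a single digit in the
 * coredump filename. Helper name is hypothetical. */
int get_reset_code(void)
{
    return (int)rtc_get_reset_reason(0);
}
```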


igrr commented Sep 26, 2018

Ideally, the entire panic handler output. I understand that it's not feasible to get it remotely, hence my suggestion to try reproducing the same issue locally.


ammaree commented Sep 26, 2018

I have 3 test devices running the same firmware locally, but not one has crashed, so I assume it is usage related.

The production devices are deployed in student hostels to control access to rooms and the dispensing of hot/cold shower water, so device activity on site is 24-hour and completely random.

Maybe the panic handler output should be redirected to flash as well, together with the coredump?
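Something along those lines can at least be approximated for the application log: tee log output into a small ring buffer kept in RTC slow memory, which survives a panic reboot, and upload it on the next boot together with the coredump. This is a sketch of the idea only, not an existing IDF feature; note it will not capture the panic handler's own output, which goes straight to the UART and bypasses esp_log. Buffer size and function names are made up:

```c
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include "esp_attr.h"
#include "esp_log.h"

#define LOG_RING_SIZE 2048

/* RTC_NOINIT_ATTR keeps the buffer out of normal .bss/.data init, so its
 * contents survive a software/panic reset (but not a power cycle). */
static RTC_NOINIT_ATTR char     s_log_ring[LOG_RING_SIZE];
static RTC_NOINIT_ATTR uint32_t s_log_head;

static int ring_vprintf(const char *fmt, va_list args)
{
    char line[160];
    va_list copy;

    va_copy(copy, args);
    int len = vsnprintf(line, sizeof(line), fmt, copy);
    va_end(copy);

    if (len > (int)sizeof(line) - 1) {
        len = sizeof(line) - 1;
    }
    for (int i = 0; i < len; i++) {              /* single writer assumed */
        s_log_ring[s_log_head++ % LOG_RING_SIZE] = line[i];
    }
    return vprintf(fmt, args);                   /* still echo to the console */
}

void log_ring_install(void)
{
    /* After a power-on reset the buffer holds garbage and should be
     * cleared; after a panic reset it holds the last log lines. */
    esp_log_set_vprintf(ring_vprintf);
}
```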


ammaree commented Sep 28, 2018

Hi @igrr @gerekon

I have not been able to get any of the local test devices (3 running for 48 hours) to crash, but have had 8 coredumps at the remote site (from 50 devices over 48 hours), of which 6 are reasonably similar and fit into 2 patterns. I have spent a lot of time trying to make sense of the current thread stack, but with no luck. Some common trends can, however, be identified.

Of Pattern 1 we have 3, maybe 4, reasonably similar occurrences, all zipped together.
Of Pattern 2 we have 2 occurrences, also zipped together. I think the cause is common between them, but would really appreciate your help on where to start.

The reset cause is the single digit in the filename just after the MAC address.

Pattern 2.zip
Pattern 1.zip

Any help to stabilize this site would be appreciated.

Thanks


igrr commented Nov 22, 2018

Hi @ammaree, sorry for the late response. What is the stack size of these two tasks where the issue happens? One common thing in every core dump is that the used stack size is 2568 or 2572 bytes at the point where the panic handler is invoked, which suggests that this might be a stack overflow if the stack size was set to 2560 bytes.
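A quick way to check: log uxTaskGetStackHighWaterMark() at the deepest call point of the suspect tasks, and if the headroom is close to zero, grow the size passed to xTaskCreate() (in bytes on ESP-IDF). A sketch only; the helper and task names are made up:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"

/* Call at the deepest point of the suspect task; headroom near 0 means
 * the stack given to xTaskCreate() is too small. */
void report_stack_headroom(const char *where)
{
    ESP_LOGW("stack", "%s: %u bytes of stack headroom left",
             where, (unsigned)uxTaskGetStackHighWaterMark(NULL));
}

/* If the headroom is marginal, grow the stack at creation time, e.g.: */
/* xTaskCreate(task_fn, "suspectTask", 4096, NULL, 5, NULL); */
```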


ammaree commented Nov 22, 2018

Hi @igrr

It is very difficult to know which task is running.

Pattern 1:
Starting from #0 in the current thread stack and working downwards, it is only at MQTTPublish() that the first reference to one of our functions comes in, BUT there is no link between MQTTPublish() and __swbuf_r() in the source. Even higher up in the stack trace we have 2 successive calls where no function-call link exists, namely between xPrint() and MD5Update().

Both xMqttPublish() and xMqttPublishBuild() ----> xPrint() are part of the MQTTtx task, and that stack size is large, ~7 kB. But ultimately, based on #0 down to __swbuf_r(), I have no idea which task was running.

Pattern 2:
Starting from #0 down to i2c_reset_tx_fifo() I have no idea which task was running. From vTaskEvents() to xPrint() makes sense; TaskEvents has a stack of 2816 bytes.

projectgus changed the title [TW#26499] Crash in VFS esp_vfs_write / uart_write / _lock_acquire(release)_recursive Crash in VFS esp_vfs_write / uart_write / _lock_acquire(release)_recursive (IDFGH-385) Mar 12, 2019

ammaree commented May 3, 2019

Closing this issue as well. The problem went away sometime during the last couple of weeks, possibly related to changes in the VFS module.

ammaree closed this as completed May 3, 2019