
Crash in VFS esp_vfs_write / uart_write / _lock_acquire(release)_recursive (IDFGH-385) #2470

Closed
ammaree opened this issue Sep 25, 2018 · 10 comments

ammaree commented Sep 25, 2018

Environment

  • Development Kit: Custom hardware
  • Kit version (for WroverKit/PicoKit/DevKitC): NA
  • Core (if using chip or module): ESP-WROOM32
  • IDF version (git rev-parse --short HEAD to get the commit id.): LATEST (today)
  • Development Env: Eclipse
  • Operating System: Windows
  • Power Supply: external 3.3V

Problem Description

Repeated crash in uart_write / _lock_acquire_recursive AND/OR uart_write / _lock_release_recursive
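For context, this path is reached by any ordinary stdio write; a minimal sketch of the kind of code involved (illustrative only, not our production code; task names, stack sizes and priorities are made up):

```c
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

/* Any printf()/ESP_LOGx() on stdout ends up in esp_vfs_write() ->
 * uart_write(), which takes the recursive newlib lock seen in the
 * backtraces; two tasks writing concurrently exercise that lock path. */
static void writer_task(void *arg)
{
    const char *tag = (const char *)arg;
    for (;;) {
        printf("%s: periodic status line\n", tag);  /* esp_vfs_write() -> uart_write() */
        vTaskDelay(pdMS_TO_TICKS(10));
    }
}

void app_main(void)
{
    xTaskCreate(writer_task, "writerA", 3072, (void *)"A", 5, NULL);
    xTaskCreate(writer_task, "writerB", 3072, (void *)"B", 5, NULL);
}
```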

Expected Behavior

The device should not crash; it must work consistently.

Actual Behavior

Crashes occur at irregular intervals on 4 out of 50+ devices.

Steps to reproduce

We have found no way to reproduce the crash consistently. We have had 6 crashes on 4 devices out of 50+ during a 24-hour period.

Code to reproduce this issue

Other items if possible

The sdkconfig is attached; it is the same for all crashes since all devices run the exact same firmware.
Coredumps are attached for 6 events (binaries all zipped into one archive, plus decoded text).
The ELF file is not attached.

Debug Logs

30aea432bc04_4p9_1537855759 esp_vfs_write.txt
30aea432c9a4_4p9_1537884881 esp_vfs_write.txt
30aea432c70c_4p9_1537848949 esp_vfs_write.txt
30aea432c95c_4p9_1537853193 esp_vfs_write.txt
30aea432c95c_4p9_1537859708 esp_vfs_write.txt
30aea432c95c_4p9_1537861983 esp_vfs_write.txt

sdkconfig.txt
uart_write.zip


igrr commented Sep 26, 2018

Could you please also attach the panic handler output which is sent to the console on crash? That is to see what kind of exception/crash this is.

Alvin1Zhang changed the title Crash in VFS esp_vfs_write / uart_write / _lock_acquire(release)_recursive [TW#26499] Crash in VFS esp_vfs_write / uart_write / _lock_acquire(release)_recursive Sep 26, 2018

ammaree commented Sep 26, 2018 via email


igrr commented Sep 26, 2018

The backtraces suggest that this might be a stack overflow, but without seeing panic handler output it's impossible to tell exactly. Please try reproducing this on a device that you have physical access to.
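In the meantime, one way to gather more data is to periodically log every task's minimum free stack, so a task creeping towards overflow shows up in the log before the panic. A sketch only, assuming CONFIG_FREERTOS_USE_TRACE_FACILITY=y in sdkconfig; task name, stack size and interval are arbitrary:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"

static const char *TAG = "stackmon";

/* Logs each task's high-water mark (minimum free stack ever observed).
 * On ESP-IDF the value should be in bytes, matching the stack sizes
 * passed to xTaskCreate(). */
static void stack_monitor_task(void *arg)
{
    (void)arg;
    for (;;) {
        UBaseType_t n = uxTaskGetNumberOfTasks();
        TaskStatus_t *status = pvPortMalloc(n * sizeof(TaskStatus_t));
        if (status != NULL) {
            n = uxTaskGetSystemState(status, n, NULL);
            for (UBaseType_t i = 0; i < n; i++) {
                ESP_LOGI(TAG, "%-16s min free stack: %u",
                         status[i].pcTaskName,
                         (unsigned)status[i].usStackHighWaterMark);
            }
            vPortFree(status);
        }
        vTaskDelay(pdMS_TO_TICKS(60000));   /* once a minute */
    }
}

void start_stack_monitor(void)   /* hypothetical helper, call once from app_main() */
{
    xTaskCreate(stack_monitor_task, "stackmon", 3072, NULL, tskIDLE_PRIORITY + 1, NULL);
}
```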


ammaree commented Sep 26, 2018

@igrr
Thanks to @projectgus's suggested temporary fix, our build environment is working again.
Other than the reset code that we will embed into the coredump filename, what supporting info should we try to gather?
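(For reference, a sketch of how that reset code could be read at boot; the helper name is made up, and esp_reset_reason() in recent IDF versions would be the higher-level alternative:)

```c
#include "rom/rtc.h"

/* Reset cause of the PRO CPU, embedded as a single digit in the
 * coredump filename. Helper name is hypothetical. */
int get_reset_code(void)
{
    return (int)rtc_get_reset_reason(0);
}
```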


igrr commented Sep 26, 2018

Ideally, the entire panic handler output. I understand that it's not feasible to get it remotely, hence my suggestion to try reproducing the same issue locally.


ammaree commented Sep 26, 2018

I have 3 test devices running the same firmware locally, but not one has crashed, so I assume it is usage related.

The production devices are deployed in student hostels to control access to rooms and the dispensing of hot/cold shower water, so device activity on site is 24-hour and completely random.

Maybe the panic handler output should be redirected to flash as well, together with the coredump?
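Something along those lines can at least be approximated for the application log: tee log output into a small ring buffer kept in RTC slow memory, which survives a panic reboot, and upload it on the next boot together with the coredump. This is a sketch of the idea only, not an existing IDF feature; note it will not capture the panic handler's own output, which goes straight to the UART and bypasses esp_log. Buffer size and function names are made up:

```c
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include "esp_attr.h"
#include "esp_log.h"

#define LOG_RING_SIZE 2048

/* RTC_NOINIT_ATTR keeps the buffer out of normal .bss/.data init, so its
 * contents survive a software/panic reset (but not a power cycle). */
static RTC_NOINIT_ATTR char     s_log_ring[LOG_RING_SIZE];
static RTC_NOINIT_ATTR uint32_t s_log_head;

static int ring_vprintf(const char *fmt, va_list args)
{
    char line[160];
    va_list copy;

    va_copy(copy, args);
    int len = vsnprintf(line, sizeof(line), fmt, copy);
    va_end(copy);

    if (len > (int)sizeof(line) - 1) {
        len = sizeof(line) - 1;
    }
    for (int i = 0; i < len; i++) {              /* single writer assumed */
        s_log_ring[s_log_head++ % LOG_RING_SIZE] = line[i];
    }
    return vprintf(fmt, args);                   /* still echo to the console */
}

void log_ring_install(void)
{
    /* After a power-on reset the buffer holds garbage and should be
     * cleared; after a panic reset it holds the last log lines. */
    esp_log_set_vprintf(ring_vprintf);
}
```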


ammaree commented Sep 28, 2018

Hi @igrr @gerekon

I have not been able to get any of the local test devices (3 running for 48 hours) to crash, but have had 8 coredumps at the remote site (from 50 devices over 48 hours), of which 6 are reasonably similar and fit into 2 patterns. I have spent a lot of time trying to make sense of the current thread stack, but with no luck. Some common trends can, however, be identified.

Of Pattern 1 we have 3, maybe 4, reasonably similar occurrences, all zipped together.
Of Pattern 2 we have 2 occurrences, also zipped together. I think the cause is common between them, but would really appreciate your help on where to start.

The reset cause is the single digit in the filename just after the MAC address.

Pattern 2.zip
Pattern 1.zip

Any help to stabilize this site would be appreciated.

Thanks


igrr commented Nov 22, 2018

Hi @ammaree, sorry for the late response. What is the stack size of these two tasks where the issue happens? One common thing in every core dump is that the used stack size is 2568 or 2572 bytes at the point where the panic handler is invoked, which suggests that this might be a stack overflow if the stack size was set to 2560 bytes.
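A quick way to check: log uxTaskGetStackHighWaterMark() at the deepest call point of the suspect tasks, and if the headroom is close to zero, grow the size passed to xTaskCreate() (in bytes on ESP-IDF). A sketch only; the helper and task names are made up:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"

/* Call at the deepest point of the suspect task; headroom near 0 means
 * the stack given to xTaskCreate() is too small. */
void report_stack_headroom(const char *where)
{
    ESP_LOGW("stack", "%s: %u bytes of stack headroom left",
             where, (unsigned)uxTaskGetStackHighWaterMark(NULL));
}

/* If the headroom is marginal, grow the stack at creation time, e.g.: */
/* xTaskCreate(task_fn, "suspectTask", 4096, NULL, 5, NULL); */
```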


ammaree commented Nov 22, 2018

Hi @igrr

It is very difficult to know which task is running.

Pattern 1:
Starting from #0 in the current thread stack and working downwards, it is only at MQTTPublish() that the first reference to one of our functions comes in, BUT there is no link between MQTTPublish() and __swbuf_r() in the source. Even higher up in the stack trace we have 2 successive calls where no function-call link exists, namely between xPrint() and MD5Update().

Both xMqttPublish() and xMqttPublishBuild() ----> xPrint() are part of the MQTTtx task, and that stack size is large, ~7 kB. But ultimately, based on #0 down to __swbuf_r(), I have no idea which task was running.

Pattern 2:
Starting from #0 down to i2c_reset_tx_fifo() I have no idea which task was running. From vTaskEvents() to xPrint() makes sense; TaskEvents has a stack of 2816 bytes.

projectgus changed the title [TW#26499] Crash in VFS esp_vfs_write / uart_write / _lock_acquire(release)_recursive Crash in VFS esp_vfs_write / uart_write / _lock_acquire(release)_recursive (IDFGH-385) Mar 12, 2019

ammaree commented May 3, 2019

Closing this issue as well. The problem went away sometime during the last couple of weeks, possibly related to changes in the VFS module.

ammaree closed this as completed May 3, 2019