
Coredump inconsistency.. (IDFGH-676) #1650

Closed · ammaree opened this issue Feb 22, 2018 · 25 comments
Labels: Resolution: Done · Status: Done (issue is done internally)

ammaree commented Feb 22, 2018

We are trying to understand the core dump module but need some clarity on a number of items. Some background first.

We have implemented a mechanism to track planned and unplanned (crash) restarts and to make an informed decision about automatically reverting to a previous version of OTA firmware. Whenever an unplanned restart is detected, the full core dump partition is uploaded to our cloud host using HTTP PUT functionality, uniquely identified by MAC address and time.
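In outline, the upload looks roughly like the sketch below (using the esp_http_client API from current ESP-IDF, which did not exist when this issue was opened; the URL handling, buffer size, and error paths are illustrative only):

#include "esp_http_client.h"
#include "esp_partition.h"

// Sketch: PUT the whole core dump partition to a cloud endpoint.
esp_err_t upload_coredump(const char *url)
{
    const esp_partition_t *part = esp_partition_find_first(
        ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_DATA_COREDUMP, NULL);
    if (part == NULL) {
        return ESP_ERR_NOT_FOUND;
    }
    esp_http_client_config_t cfg = { .url = url, .method = HTTP_METHOD_PUT };
    esp_http_client_handle_t client = esp_http_client_init(&cfg);
    if (client == NULL) {
        return ESP_FAIL;
    }
    esp_err_t err = esp_http_client_open(client, part->size);
    if (err == ESP_OK) {
        char buf[1024];
        // The partition size is a multiple of the flash sector size, so
        // reading in fixed 1 KB chunks never overruns the partition.
        for (size_t off = 0; off < part->size && err == ESP_OK; off += sizeof(buf)) {
            if (esp_partition_read(part, off, buf, sizeof(buf)) != ESP_OK ||
                esp_http_client_write(client, buf, sizeof(buf)) < 0) {
                err = ESP_FAIL;
            }
        }
        esp_http_client_close(client);
    }
    esp_http_client_cleanup(client);
    return err;
}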

http://esp-idf.readthedocs.io/en/latest/api-guides/core_dump.html#save-core-dump-to-flash states that the core dump size is the sum of all the TCBs and all the stacks, plus a small additional overhead. Based on this we calculated a core dump partition size of ~140 KB, and for safety made it 192 KB.

We configured the core dump to use logging level 5, expecting the dump to contain more information, but like @nkolban in https://www.esp32.com/viewtopic.php?f=13&t=1099 we found that only the log output printed prior to/during core dump generation increases; the amount of information in the core dump itself stays the same.

Q1: How should we calculate the core dump partition size, given that 20 tasks with a total of ~140 KB of allocated stack only create a core dump of ~16 KB?

Q2: What is the exact syntax required for the 'espcoredump.py' utility to take its input from the uploaded binary file? We have tried:

python /c/Dropbox/devs/ws/z-sdk/esp-idf/components/espcoredump/espcoredump.py --chip esp32 info_corefile --core /c/Dropbox/devs/ws/z-appl/ewm-irmacos/coredump/30aea432bc00_1.bin --core-format raw --print-mem

and a number of variations on the above, but the help message says:

usage: espcoredump info_corefile [-h] [--gdb GDB] [--core CORE]
                                 [--core-format CORE_FORMAT] [--off OFF]
                                 [--save-core SAVE_CORE] [--print-mem]
                                 prog
espcoredump info_corefile: error: too few arguments

The "prog" parameter is not documented anywhere but from the python source it appears to be the path to the ELF file.

Adding the path [+filename] to the command line causes an error message:
Growing up stacks are not supported for now!
Failed to create corefile!

Which parameters are we missing or getting wrong?

FayeY changed the title Coredump inconsistency.. [TW#18663] Coredump inconsistency.. Feb 24, 2018

chmorgan commented Mar 2, 2018

I'm also interested in doing the same, pushing the core dump to AWS etc. I'll be watching here to see if you get additional clarification.

One point I was curious about was whether enabling flash output of core dumps entirely disables UART output. I'd prefer to have both enabled in the typical case where I'm debugging and watching a serial terminal.


ammaree commented Mar 4, 2018

@projectgus @igrr

Any chance of shedding some light on the two questions, please?


gerekon commented Mar 7, 2018

@ammaree

http://esp-idf.readthedocs.io/en/latest/api-guides/core_dump.html#save-core-dump-to-flash states that the core dump size is the sum of all the TCBs and all the stacks, plus a small additional overhead. Based on this we calculated a core dump partition size of ~140 KB, and for safety made it 192 KB.

The core dump module saves only the part of each task's stack that is actually in use at the moment of the crash, so the dump size is much smaller than the theoretical maximum.

Q1: How should we calculate the core dump partition size, given that 20 tasks with a total of ~140 KB of allocated stack only create a core dump of ~16 KB?

With the optimization mentioned above in mind, I see the following possible approaches:

  • Using a stack debug watchpoint. Enable the "Set a debug watchpoint as a stack overflow check" item in Component config -> FreeRTOS. For every task, sequentially decrease its stack size and run your test bench; if the tests pass, repeat. The stack size at which the watchpoint is triggered marks the minimal space to be reserved for that task.
  • Using the FreeRTOS API. The OS provides the uxTaskGetStackHighWaterMark() function, which can be used to find the minimal space needed to store each task's stack (see the sketch below).

When calculating the core dump partition size you can add extra space on top of the minimum for every task, for example by multiplying it by a coefficient of 1.15, so the formula looks like: part_size = small_overhead_size + tcbs_size + 1.15 * (min_stack_sz_task_1 + ... + min_stack_sz_task_N).
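A minimal sketch of the second approach, using uxTaskGetSystemState() (which requires configUSE_TRACE_FACILITY) to read every task's high water mark in one pass; the log tag and function name are illustrative, not part of any IDF API:

#include <stdlib.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"

static const char *TAG = "stacks";

// Log the stack high water mark of every task. The high water mark is the
// minimum free stack (in words) observed so far, so the stack a task really
// needs is its configured size minus this remaining headroom.
void log_stack_high_water_marks(void)
{
    UBaseType_t n = uxTaskGetNumberOfTasks();
    TaskStatus_t *tasks = malloc(n * sizeof(TaskStatus_t));
    if (tasks == NULL) {
        return;
    }
    n = uxTaskGetSystemState(tasks, n, NULL);
    for (UBaseType_t i = 0; i < n; i++) {
        ESP_LOGI(TAG, "%-16s min free stack: %u words",
                 tasks[i].pcTaskName, (unsigned) tasks[i].usStackHighWaterMark);
    }
    free(tasks);
}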

Q2: What is the exact syntax required for the 'espcoredump.py' utility to take its input from the uploaded binary file?

Command syntax for this: espcoredump.py info_corefile -c <path_to_core_file> -t raw <path_to_prog_elf_file>. Note that you should remove the first 4 bytes from the raw data read from flash before feeding it to the Python script.
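For example, assuming the uploaded dump was saved as 30aea432bc00_1.bin and the matching application ELF is at build/app.elf (a hypothetical path), the leading 4 bytes could be stripped with dd before decoding:

dd if=30aea432bc00_1.bin of=core.bin bs=1 skip=4
python $IDF_PATH/components/espcoredump/espcoredump.py info_corefile -c core.bin -t raw build/app.elf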


ammaree commented Mar 11, 2018

@gerekon
Thanks for the feedback.
We have been, and still are, using uxTaskGetStackHighWaterMark() to profile the stack size of each task.

To close this issue, just three more points:

  1. What would happen to a) the core dump writing process and b) the core dump interpretation if the partition were slightly too small?
  2. Is there any way to know the actual size of the core dump content? This could save us uploading unnecessary data over slow links.
  3. Can this information please be used to update and correct the documentation?

Thanks


ammaree commented Mar 11, 2018

@gerekon
Thanks for the info, it works perfectly: we simply avoid uploading the first 4 bytes and then the decode works correctly; see the attached file.

In addition to the questions above, another request (or two):

#1 Would it be possible to add a mechanism whereby the user application can register area(s) of memory to be included in the core dump? It would be of great value to us if we had visibility of, for example, our stdout buffer, since it would contain the last error messages before the crash.

#2 Since you have access to the task names inside the TCBs, would it be possible to add logic to identify the tasks under "THREADS INFO" as well as "CORE DUMP MEMORY CONTENTS"?

30aea432bc00_1520768624.txt


gerekon commented Mar 12, 2018

@ammaree

What would happen to a) the core dump writing process and b) the core dump interpretation if the partition were slightly too small?

The core dump will be aborted and no data will be written to flash. A respective error message will be printed to the UART.

Is there any way to know the actual size of the core dump content? This could save us uploading unnecessary data over slow links.
Can this information please be used to update and correct the documentation?

There should be a special API for this, but I do not see it in the GitHub repo (I will check). In any case, you can read the actual length from the second 4 bytes of the RAW data in flash; see the comments on the flash-saving routine ("@brief Saves core dump to flash."). Note that the length should be decreased by 4 bytes (to skip the magic number at the beginning).
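A minimal sketch of reading that length on the device, assuming the layout described above (a 4-byte magic word followed by a 4-byte total length; note that the format changed in later IDF versions, as discussed further down this thread):

#include <stdint.h>
#include <stddef.h>
#include "esp_partition.h"

// Returns the number of core dump bytes worth uploading (total length minus
// the 4-byte magic word), or 0 on error.
size_t coredump_payload_size(void)
{
    const esp_partition_t *part = esp_partition_find_first(
        ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_DATA_COREDUMP, NULL);
    if (part == NULL) {
        return 0;
    }
    uint32_t hdr[2]; // hdr[0] = magic word, hdr[1] = total length in bytes
    if (esp_partition_read(part, 0, hdr, sizeof(hdr)) != ESP_OK) {
        return 0;
    }
    return hdr[1] > sizeof(uint32_t) ? hdr[1] - sizeof(uint32_t) : 0;
}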

Would it be possible to add a mechanism whereby the user application can register area(s) of memory to be included in the core dump? It would be of great value to us if we had visibility of, for example, our stdout buffer, since it would contain the last error messages before the crash.

I think yes. We will discuss and plan this feature.

Since you have access to the task names inside the TCBs, would it be possible to add logic to identify the tasks under "THREADS INFO" as well as "CORE DUMP MEMORY CONTENTS"?

This is also possible, but it needs a GDB update. We will discuss and plan this feature too.

Thanks a lot for your suggestions! I think we can implement them.


ammaree commented Mar 14, 2018

@gerekon

Apologies, I could and should have found this info myself.
Core dump partition size: will keep it at 192 KB for now, just to be safe; we can reduce it later.

I will close this issue now and look forward to the extra feature to attach a memory buffer to the core dump.

ammaree closed this as completed Mar 14, 2018

ammaree commented Sep 25, 2018

@gerekon
Sorry, I know I closed the issue, but just following up. I am using the latest version of IDF (as of today) and need to identify the task causing a crash.

The "Current Thread Stack" stops with a message "Backtrace stopped: Cannot access memory at address 0x4000bd86" or "Backtrace stopped: previous frame identical to this frame (corrupt stack?)"

In 1 case the "Current Thread Stack" has only a single entry before the "Backtrace stopped: Cannot access memory at address 0x4000bd86" message

All help appreciated.

ammaree reopened this Sep 25, 2018

igrr commented Sep 25, 2018

0x4000bd86 points to a function in ROM code. Please see https://docs.espressif.com/projects/esp-idf/en/latest/api-guides/core_dump.html#rom-functions-in-backtraces.

gerekon closed this as completed Sep 25, 2018
gerekon reopened this Sep 25, 2018

ammaree commented Sep 25, 2018

I have added the reference to esp32_rom.elf and it adds the ROM functions, but that does not help for a single entry with addresses:
#0 _xt_lowint1 () at C:/Dropbox/devs/ws/z-sdk/esp-idf/components/freertos\xtensa_vectors.S:1079
Backtrace stopped: Cannot access memory at address 0x7ffd2180

or:
#0 0x4008e7c7 in _frxt_dispatch () at C:/Dropbox/devs/ws/z-sdk/esp-idf/components/freertos\portasm.S:447
#1 0x4008e5ec in uxQueueMessagesWaiting (xQueue=0x0) at C:/Dropbox/devs/ws/z-sdk/esp-idf/components/freertos/queue.c:1746
#2 0x4008e386 in xQueueGiveFromISR (xQueue=0x3ffc37c8, pxHigherPriorityTaskWoken=0x0) at C:/Dropbox/devs/ws/z-sdk/esp-idf/components/freertos/queue.c:1419
#3 0x4008e444 in xPortGetCoreID () at C:/Dropbox/devs/ws/z-sdk/esp-idf/components/freertos/include/freertos/portable.h:208
#4 xQueueGenericReceive (xQueue=0x3ffc37c8, pvBuffer=, xTicksToWait=2148019936, xJustPeeking=1) at C:/Dropbox/devs/ws/z-sdk/esp-idf/components/freertos/queue.c:1493
#5 0x40082df8 in lock_acquire_generic (lock=, delay=4294967295, mutex_type=) at C:/Dropbox/devs/ws/z-sdk/esp-idf/components/newlib/locks.c:162
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

In neither of these cases can I identify the task, which makes it impossible to trace further.
What is the best suggestion to a) identify the task and b) get more of a stack trace to follow the path?


gerekon commented Sep 26, 2018

@ammaree The crash behavior seems to be unstable. Also, 0x7ffd2180 is not a valid address; it falls into a reserved address range for the ESP32 (see the spec). With these two facts in mind, I suppose that your program corrupts the stack. In this case it is impossible to get a valid call stack (it has been overwritten by the program).
So currently I see one solution to your problem: you can try to connect to the crashed program with GDB and check the pxCurrentTCB variable from tasks.c.
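For example (a hedged sketch: on the ESP32 port pxCurrentTCB is a per-core array, and the exact expression may differ between IDF versions), from a GDB session attached to the halted target:

(gdb) print pxCurrentTCB[0]->pcTaskName
(gdb) print pxCurrentTCB[1]->pcTaskName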


ammaree commented Sep 27, 2018

Hi @gerekon

Thanks for your feedback. I have increased the size of all task stacks, and now the symptoms have changed. With the latest firmware, built using IDF as of yesterday morning (27 Sep at 08h00 UTC), the core dump now (only on some crashes) yields even less info.

The dump is empty and complains about overlapping regions. I have attached a couple of examples and the resulting decoded dumps. Any chance you can shed some light here?

overlap regions.zip

PS: The 4 in the bin/txt filenames indicates the reset cause


gerekon commented Sep 29, 2018

@ammaree We are planning to update the core dump script to give more info under such conditions (overlapping regions). In general this message can be seen when the TCB or stack of one task overlaps another task's TCB or stack. This should not happen in the usual case.
In your situation it seems that the TCB of one task is corrupted, because according to its data that task occupies 60820 bytes of stack. It could also be due to infinite recursion, which eats all the stack and runs beyond it; in that case simply increasing the stack size will not help you.
I modified the script to print a bit more info about the data in the core dump (espcoredump.zip). Below is its output for one of your binaries:

espcoredump.py v0.2-dev
Read TCB 368 bytes @ 0x3ffcdd10
Read stack 1912 bytes @ 0x3ffcd590
Read TCB 368 bytes @ 0x3ffbf23c
Read stack 60820 bytes @ 0x3ffb04a0
Read TCB 368 bytes @ 0x3ffbf9b4
Read stack 364 bytes @ 0x3ffbf840
Read TCB 368 bytes @ 0x3ffd2c6c
Read stack 404 bytes @ 0x3ffd2ad0
Read TCB 368 bytes @ 0x3ffd1ff4
Read stack 476 bytes @ 0x3ffd1e10
Read TCB 368 bytes @ 0x3ffc1bb4
Read stack 492 bytes @ 0x3ffc19c0
Read TCB 368 bytes @ 0x3ffcfa00
Read stack 712 bytes @ 0x3ffcf730
Read TCB 368 bytes @ 0x3ffd4750
Read stack 440 bytes @ 0x3ffd4590
Read TCB 368 bytes @ 0x3ffcef88
Read stack 720 bytes @ 0x3ffcecb0
Read TCB 368 bytes @ 0x3ffc0628
Read stack 384 bytes @ 0x3ffc04a0
Read TCB 368 bytes @ 0x3ffd077c
Read stack 388 bytes @ 0x3ffd05f0
Read TCB 368 bytes @ 0x3ffd51c8
Read stack 368 bytes @ 0x3ffd5050
Read TCB 368 bytes @ 0x3ffbc970
Skip TCB 368 bytes @ 0x3ffbc970. (Reason: Can not add overlapping region [3ffbc970..3ffbcadf] to ELF file. Conflict with existing [3ffb04a0..3ffbf233].)
Read stack 392 bytes @ 0x3ffbc7e0
Skip task's (3ffbc970) stack 392 bytes @ 0x3ffbc7e0. (Reason: Can not add overlapping region [3ffbc7e0..3ffbc967] to ELF file. Conflict with existing [3ffb04a0..3ffbf233].)
Read TCB 368 bytes @ 0x3ffbd144
Skip TCB 368 bytes @ 0x3ffbd144. (Reason: Can not add overlapping region [3ffbd144..3ffbd2b3] to ELF file. Conflict with existing [3ffb04a0..3ffbf233].)
Read stack 396 bytes @ 0x3ffbcfb0
Skip task's (3ffbd144) stack 396 bytes @ 0x3ffbcfb0. (Reason: Can not add overlapping region [3ffbcfb0..3ffbd13b] to ELF file. Conflict with existing [3ffb04a0..3ffbf233].)
Read TCB 368 bytes @ 0x3ffd5c40
Read stack 424 bytes @ 0x3ffd5a90
Read TCB 368 bytes @ 0x3ffafc90
Read stack 392 bytes @ 0x3ffafb00
Read TCB 368 bytes @ 0x3ffc4964
Read stack 492 bytes @ 0x3ffc4770
Read TCB 368 bytes @ 0x3ffc3240
Read stack 472 bytes @ 0x3ffc3060
Done!


ammaree commented Sep 29, 2018

Hi @gerekon, a very concerning situation.

In all my own private code I have only two occurrences of recursion:
#1 in a forced CR-before-LF situation, limited to one level only;
#2 in a JSON writer routine: when closing a parent with open child objects, it recurses a maximum of two levels to close the child objects first.

I will again check the json_parser and Paho MQTT libraries, but as far as I know there is nothing there.


ammaree commented Oct 29, 2018

Hi @gerekon

I have been travelling for the last 4 weeks and returned to find 310 crash dump files generated by 38 different devices during that period.
I also saw that a new version of the toolkit has been released, dated 20181001. I have installed it, but before I start the crash dump analysis I have a couple of questions.
Q1: Can I use the new toolkit to analyse crash dumps created with the IDF as of 27 September?
Q2: If yes, can I use the modified version of espcoredump.py emailed to me on Sept 29th (modified to handle overlapping regions), or should I use the default provided in the latest IDF from GitHub?
Q3: Is there anything else I need to update to ensure the best possible dump analysis, maybe a new or better beta version of GDB?

Thanks


ammaree commented Nov 1, 2018

@gerekon @igrr any comments please, gents? I would like to start analysing.


gerekon commented Nov 5, 2018

@ammaree

Q1: Can I use the new toolkit to analyse crash dumps created with the IDF as of 27 September?

Yes.

Q2: If yes, can I use the modified version of espcoredump.py emailed to me on Sept 29th (modified to handle overlapping regions), or should I use the default provided in the latest IDF from GitHub?

Yes, you should use the modified version, because the espcoredump.py script in IDF has not been updated yet. There are a couple of improvements planned for it, so it will be updated later.

Q3: Is there anything else I need to update to ensure the best possible dump analysis, maybe a new or better beta version of GDB?

No special updates for core dump are needed.

Let me know if you have problems.


ammaree commented Feb 22, 2019

A luta continua... ("the struggle continues")

@gerekon @igrr
Although we have had relative stability for a period, we remain with a number of crash core dumps that we cannot make any sense of. We are using esp-idf master as of the latest updates about 3 weeks ago.

Any help to urgently resolve these crashes will be highly appreciated, even if I have to use a beta version that at least provides some information.

Andre

30aea432bbfc_5_48C_1550822685_0.txt

30aea432c6e0_5_48C_1550806228_0.txt

30aea432c6ec_5_48C_1550787188_0.txt

30aea432c9b8_4_48C_1550684836_0.txt

30aea432c9c4_4_48C_1550684642_0.txt

30aea432c778_7_48C_1550686824_0.txt

30aea432c960_4_48C_1550684978_0.txt


ammaree commented Mar 4, 2019

@gerekon @igrr @projectgus

Upgraded last week to the latest IDF and core dump component. Rebuilt the same application with the new IDF and deployed it on some test bench devices. Trying again to redo the crash analysis, but not having any luck. The latest core dump now just results in:

espcoredump.py v0.3-dev
A fatal error occurred: Core dump version '5' is not supported! Should be up to '1'.

Also, the old version of espcoredump.py now only shows 6 TCBs and stacks detected instead of the existing 18, with no error messages.

This is close to becoming a small crisis, not being able to do crash analysis and fix problems. Any recommendations, PLEASE?


ammaree commented Mar 6, 2019

Gents @gerekon @igrr @projectgus

I have rechecked the options on my side. Using the latest IDF, but the core dump bombs out with a version error. The older v2 (modified by @gerekon) produces some output, but it does not make any sense.

Any input on the above problem? I am sitting with a number of core dumps that cannot be analyzed, and a regular stream of devices continuing to reboot...

Please?


gerekon commented Mar 6, 2019

@ammaree Could you post the core dump file and the ELF file of the application?

projectgus changed the title [TW#18663] Coredump inconsistency.. Coredump inconsistency.. (IDFGH-676) Mar 12, 2019

ammaree commented Mar 12, 2019

@gerekon

I think I might have found the problem. The core dump does not seem to start with a 4-byte (uint32_t) magic number, but starts directly with the size, followed by the version, number of tasks, and TCB size.

Is this correct?


gerekon commented Mar 13, 2019

I think I might have found the problem. The core dump does not seem to start with a 4-byte (uint32_t) magic number, but starts directly with the size, followed by the version, number of tasks, and TCB size.

Is this correct?

Yes, there have been changes in the format. For now you must use the proper version of espcoredump.py for the generated dumps. We are planning to fix the versioning problem in the next IDF release.
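For reference, a hedged sketch of the newer flash header layout as described above (no leading magic word; the field names are illustrative and may not match the IDF internals exactly):

typedef struct {
    uint32_t data_len;  // total length of the core dump data, in bytes
    uint32_t version;   // core dump format version
    uint32_t tasks_num; // number of task (TCB + stack) entries that follow
    uint32_t tcb_sz;    // size of a task control block, in bytes
} core_dump_header_t;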

sticilface commented:

the proper version of espcoredump.py

Can I ask where you get this? I'm getting "cannot use Version 12, expected 1". If I disable version checking in espcoredump.py, I get the error referring to growing-up stacks not being supported. I have a core dump but can't analyse it. Any advice?

Alvin1Zhang commented:

Thanks for reporting, feel free to reopen.

0xFEEDC0DE64 pushed a commit to 0xFEEDC0DE64/esp-idf that referenced this issue May 5, 2021