Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recovery from a hard fault (emergency mode) not working #3599

Closed
1 task done
mha1 opened this issue May 15, 2023 · 22 comments · Fixed by #3603
Closed
1 task done

Recovery from a hard fault (emergency mode) not working #3599

mha1 opened this issue May 15, 2023 · 22 comments · Fixed by #3603
Labels
bug 🪲 Something isn't working

Comments

@mha1
Copy link
Contributor

mha1 commented May 15, 2023

Is there an existing issue for this problem?

  • I have searched the existing issues

What part of EdgeTX is the focus of this bug?

Transmitter firmware

Current Behavior

Bug or feature?

From what I understand a CPU hard fault (e.g. division by zero, misaligned memory access, etc.) causes the CPU to reset. At restart the RTC backup registers holding expected shutdown indicators will be used to decide a hard fault reset occurred (= unexpected shutdown) and the well known Emergency Mode screen shows up. Reading @3djc's issue #3579 I understand "color radios have a dedicated area of memory so that in the event of emergency mode, data needed to safely keep your model flying despite the reset would be safe and not depending on sdcard". All in all I'd expect to have at least the RF link and mixers up and running after a hard fault reset (if it's not a permanent one) to maintain control of the model.

Here's my experiment. I injected a hard fault using gcc's __builtin_trap() in a part of the GUI to allow me event based triggering of a hard fault by clicking a selection field I misuse for this experiment. The manually triggered hard fault actually happens. The radio shows the Emergency screen and the RF link goes down with the rx in failsfe. But the radio link - MPM in my experiment - doesn't come back up. As there is no permanent hard fault I'd expect at least a boot into a (degraded) core system, maybe without the GUI, coming back to life.

Expected Behavior

In case of a spurious hard fault at least a degraded core system should be operative.

Steps To Reproduce

  • add line __builtin_trap(); after line resetMultiProtocolsOptions(moduleIdx); in
    auto md = &g_model.moduleData[moduleIdx];
    choice = new Choice(
    this, rect_t{}, 0, 0, [=]() { return md->subType; },
    [=](int16_t newValue) {
    md->subType = newValue;
    resetMultiProtocolsOptions(moduleIdx);
    SET_DIRTY();
    });
  • build firmware and flash
  • select internal MPM, choose e.g. Mode MULTI HoTT
  • make sure RF link and receiver are working
  • select the RF Protocol field and select the same value again
  • observe Emergency Mode but RF link not coming back
  • turn off radio
  • turn on radio
  • observe RF link being back on

Version

Nightly (Please give date/commit below)

Transmitter

Radiomaster TX16S / TX16SMK2

Operating System (OS)

No response

OS Version

No response

Anything else?

latest nightly

@mha1 mha1 added bug 🪲 Something isn't working triage Bug report awaiting review / sorting labels May 15, 2023
@philmoz
Copy link
Collaborator

philmoz commented May 15, 2023

The 'internalModule' property in the RadioData struct is not saved to the RTC backup memory.
On startup in EM this property defaults to 0 so the radio does not know there is an internal module.

I removed the NOBACKUP from this property in datastruct_private.h and my TX16S now reconnects to the receiver on EM startup. Not sure if this is the correct solution or there is some initialisation of this value that is being missed elsewhere.

@mha1
Copy link
Contributor Author

mha1 commented May 16, 2023

Awesome, if not the solution a good start at a minimum. This is one of those "need to understand history" things.

@pfeerick
Copy link
Member

BTW, another way to do this sort of testing is to use the "Test" SF with a debug build, and uncomment the while (1); in testFunc() so that the watchdog should then trigger an EM reboot ;)

@3djc
Copy link
Collaborator

3djc commented May 16, 2023

BTW, another way to do this sort of testing is to use the "Test" SF with a debug build, and uncomment the while (1); in testFunc() so that the watchdog should then trigger an EM reboot ;)

yeap, designed for that very purpose :). I think the EM reboot is broken in EdgeTX in quite a few places, but it is a significant pîece of work to properly test and fix it with all the existing combination of radio and modules, as there are several issues existing, but this is chalenging to do at the rate of change EdgeTX is having

@philmoz
Copy link
Collaborator

philmoz commented May 16, 2023

The g_eeGeneral.internalModule property appears to only be set in the radio hardware setup page - there is no other initialisation of it that I can see. So including it in the backup data should be enough (for this case).

@3djc
Copy link
Collaborator

3djc commented May 16, 2023

Size must be checked as there are only something like 4k available in backup storage on F4, so every byte counts

@philmoz
Copy link
Collaborator

philmoz commented May 16, 2023

Size must be checked as there are only something like 4k available in backup storage on F4, so every byte counts

The data is run-length compressed. On my, albeit fairly simple, models it is using < 700 bytes.

@mha1
Copy link
Contributor Author

mha1 commented May 16, 2023

How about a don't give up strategy. Check size of compressed model data to be backed up. If it fits mark it as present. EM will try quicK EM warm boot. If it doesn't fit mark backup memory as not usable. Let EM try a full boot in case of empty backup memory without switch checks, model notes and the likes preventing turning on RF.

@3djc
Copy link
Collaborator

3djc commented May 16, 2023

Well, I like the idea of trying backup first then regular if first ones fails, but I would set a size check at compile so that dev can manage that size (since if it is predefined anyway)

@pfeerick pfeerick removed the triage Bug report awaiting review / sorting label May 16, 2023
@mha1
Copy link
Contributor Author

mha1 commented May 16, 2023

Absolutely, all is known at compile time. A little bit of sizeof() math together with a static assert would even keep track of changes pushing the memory limit.

@philmoz
Copy link
Collaborator

philmoz commented May 16, 2023

Well, I like the idea of trying backup first then regular if first ones fails, but I would set a size check at compile so that dev can manage that size (since if it is predefined anyway)

The size of the RTC backup data can't be calculated at compile time. It depends on how much the data gets compressed with the run-length compression that is done.

The uncompressed size is currently 4650 bytes on the TX16S - already well over the available 4094 bytes.

@mha1
Copy link
Contributor Author

mha1 commented May 16, 2023

true that is. leaves runtime checking the size or maybe at build time by a unit test.

@mha1
Copy link
Contributor Author

mha1 commented May 16, 2023

I just tried to do a normal boot after a watchdog reset by changing horus/board.h inline bool UNEXPECTED_SHUTDOWN() to return false in case of a watchdog reset. Reboot works flawless. MPM RF takes about 2-3s to be back on again after loosing it. Of course switch warnings and model notes need to be skipped.

@pfeerick
Copy link
Member

pfeerick commented May 17, 2023

I just tried to do a normal boot after a watchdog reset by changing horus/board.h inline bool UNEXPECTED_SHUTDOWN() to return false in case of a watchdog reset.

Er... but couldn't that just reboot back right into whatever caused the watchdog to reset? Which is the whole point of EM mode? i.e. We just need to make sure it's doing its job properly.

@mha1
Copy link
Contributor Author

mha1 commented May 17, 2023

Short answer:
not necessarily

Long answer:
I tried this to see if a full reboot (with switch checks and model notes skipped) can be part of a "never give up strategy". This strategy should do the EM type reboot if that is possible. But for other reasons. It is much faster to get RF back compared to a full boot. So there is less time where the model is not controllable. And as the SD card was deemed potentially unreliable the EM type reboot is done to load the last model data from backup RAM. So it's perfectly ok to attempt an EM type reboot as a first priority. But what to do if this is not possible for e.g. the reason this issue exists (backup data not existing) or model data size doesn't fit into the backup memory? Giving up is no option. A fallback must be to try a full boot. It might be successful.

There is also an inherent assumption in your question. The assumption you made was there are specific or permanent conditions still present after the reboot driving the system into the same hard fault again. It can happen and if this is the case you are indeed done but at least you tried.

There are a lot of fault scenarios where a reboot will save your model. Some of those scenarios are kinda esoteric but real. Examples:

  • Coding errors which causes hard faults in some part of code that is only run upon request (e.g. on touch). The faulty code will not be executed again unless you create the condition again.
  • A bit flip due to temperature variations or cosmic radiation is very unlikely but it happens. It's unprotected consumer electronic we are dealing with. No radiation hardening, no error correcting memory. (Not very scientific read, another one)

BTW: How degraded is the system after a successful EM reboot supposed to be? E.g. is the GUI task running? Are software timers running (logging, ...) ?

@pfeerick
Copy link
Member

pfeerick commented May 17, 2023

But the inherent assumption is the one I think you have to make... as it is worst-case scenario... hope for the best, but plan for the worst. The GUI task is not running, nor is Lua in EM mode. I believe the intent is that essentially only the mixer and RF are active - in order to remove as many points of failure as possible.

So attempt EM via RTC backup first, and then we would need a new EM from SD as the fallback for that, that skips all startup checks, and start-up as much of the rest of the system that we've determined we can rely on. So still a reduced operating mode, requiring a manual reboot to restore full functionality (assuming the system can do so - i.e. the cause wasn't failing or intermittent SD card, some intermittent power or coding issue, whatever).

@mha1
Copy link
Contributor Author

mha1 commented May 17, 2023

We are on the same page. In an emergency try to reset into the absolute minimal system required to control the model. This means excluding as many potential fault contributors (GUI, LUA, logging, ...) as possible. The question is about the source of the model data. It is wise to use backed up to RAM model data if possible as the SD card might be corrupted by the fault itself and restoring data from RAM is much faster. If restoring model data from backup RAM is not possible (corrupt - need to CRC the backup RAM, or not present due to size) you'd have to try and use SD card data.

@mha1
Copy link
Contributor Author

mha1 commented May 17, 2023

@philmoz I can confirm removing NOBACKUP here solves the internal MPM RF reconnect issue

NOBACKUP(uint8_t internalModule ENUM(ModuleType));

@pfeerick I'd prefer to have this change in 2.9 and as it is the same in 2.8 also in 2.8.5. And let's keep working on safeguarding the backup memory (if necessary) together with restoring model data from SD Card in EM mode if for whatever reason restoring backup memory is not possible.

@mha1
Copy link
Contributor Author

mha1 commented May 17, 2023

I just tried hard faulting the system with an external ELRS module. EM works fine with a fast RF reconnect and full control.

But I noticed a difference between EM with external ELRS vs EM with internal MPM (fix above included). As soon as the radio goes to EM with the external ELRS the radio starts beeping and when trying to power it down it'll show the "receiver still connected message". There is no EM beeping with the internal MPM and no message at power down.

@pfeerick
Copy link
Member

It's in for the next nightly, and unless there are unforeseen side-effects, yup, can go in for 2.8.5

mha1 added a commit to mha1/edgetx that referenced this issue Jul 6, 2023
pfeerick pushed a commit that referenced this issue Jul 14, 2023
…ce (#3730)

* justification: #3666

- checks SD card free space at startup, start of logging and writing screenshots
- issues warning to user if SD card has less than 50MB free space
- will not write logs or screenshots if free space is less than 50MB at activation timerReset
- fixes crash of warning popup in logs.cpp (#3728)

* update: warning message if SD card fills up to threshold while logging

* empty commit

* call popup in non-blocking mode

* make b&w compatible to non-blocking popup calls

* 2nd attempt

* Popup:

- turns on backlight if dimmed due to inactivity setting
- if backlight is dimmed due to inactivity setting backlight can turnd on again by any controls

* #3599

* fixed b&w UI

* updated languages CN, CZ, DA, DE, FR, HE, IT, PL, PT, SE, TW

* chore: Update translations
pfeerick pushed a commit that referenced this issue Jul 14, 2023
…ce (#3730)

* justification: #3666

- checks SD card free space at startup, start of logging and writing screenshots
- issues warning to user if SD card has less than 50MB free space
- will not write logs or screenshots if free space is less than 50MB at activation timerReset
- fixes crash of warning popup in logs.cpp (#3728)

* update: warning message if SD card fills up to threshold while logging

* empty commit

* call popup in non-blocking mode

* make b&w compatible to non-blocking popup calls

* 2nd attempt

* Popup:

- turns on backlight if dimmed due to inactivity setting
- if backlight is dimmed due to inactivity setting backlight can turnd on again by any controls

* #3599

* fixed b&w UI

* updated languages CN, CZ, DA, DE, FR, HE, IT, PL, PT, SE, TW

* chore: Update translations
pfeerick pushed a commit that referenced this issue Jul 30, 2023
…ce (#3730)

* justification: #3666

- checks SD card free space at startup, start of logging and writing screenshots
- issues warning to user if SD card has less than 50MB free space
- will not write logs or screenshots if free space is less than 50MB at activation timerReset
- fixes crash of warning popup in logs.cpp (#3728)

* update: warning message if SD card fills up to threshold while logging

* empty commit

* call popup in non-blocking mode

* make b&w compatible to non-blocking popup calls

* 2nd attempt

* Popup:

- turns on backlight if dimmed due to inactivity setting
- if backlight is dimmed due to inactivity setting backlight can turnd on again by any controls

* #3599

* fixed b&w UI

* updated languages CN, CZ, DA, DE, FR, HE, IT, PL, PT, SE, TW

* chore: Update translations
pfeerick added a commit that referenced this issue Jul 30, 2023
…ce (#3802)

* fix: Prevent functional degradation if SD has less than 50MB free space (#3730)

* justification: #3666

- checks SD card free space at startup, start of logging and writing screenshots
- issues warning to user if SD card has less than 50MB free space
- will not write logs or screenshots if free space is less than 50MB at activation timerReset
- fixes crash of warning popup in logs.cpp (#3728)

* update: warning message if SD card fills up to threshold while logging

* empty commit

* call popup in non-blocking mode

* make b&w compatible to non-blocking popup calls

* 2nd attempt

* Popup:

- turns on backlight if dimmed due to inactivity setting
- if backlight is dimmed due to inactivity setting backlight can turnd on again by any controls

* #3599

* fixed b&w UI

* updated languages CN, CZ, DA, DE, FR, HE, IT, PL, PT, SE, TW

* chore: Update translations

* chore: Remove magic number

* chore: JP language update for SD card full (#3809)

* fix: Don't spam simulator log with file close msgs

---------

Co-authored-by: Michael <mha1@users.noreply.github.com>
@sam2b
Copy link

sam2b commented Aug 7, 2023

Hello. Thank you for the research and development on this issue.
In the last 2 months I experienced the exact same behavior including Emergency Mode message, claiming two of my planes. :-(
I only found an empty zero-byte radio_error.yml file at the time.
Additionally, I had had plenty of space on my SD card, but I suspect SD card corruption (card gone bad).
Since replacing the SD card with a new & higher quality, the behavior has stopped.
I trust this fix will prevent any more crashes. :-)
Thanks so much.

@raphaelcoeffic
Copy link
Member

Closing now as it has been fixed in v2.8.5, v2.9 release candidates and main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🪲 Something isn't working
Projects
None yet
6 participants