-
-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recovery from a hard fault (emergency mode) not working #3599
Comments
The 'internalModule' property in the RadioData struct is not saved to the RTC backup memory. I removed the NOBACKUP from this property in datastruct_private.h and my TX16S now reconnects to the receiver on EM startup. Not sure if this is the correct solution or there is some initialisation of this value that is being missed elsewhere. |
Awesome, if not the solution a good start at a minimum. This is one of those "need to understand history" things. |
BTW, another way to do this sort of testing is to use the "Test" SF with a debug build, and uncomment the |
yeap, designed for that very purpose :). I think the EM reboot is broken in EdgeTX in quite a few places, but it is a significant pîece of work to properly test and fix it with all the existing combination of radio and modules, as there are several issues existing, but this is chalenging to do at the rate of change EdgeTX is having |
The g_eeGeneral.internalModule property appears to only be set in the radio hardware setup page - there is no other initialisation of it that I can see. So including it in the backup data should be enough (for this case). |
Size must be checked as there are only something like 4k available in backup storage on F4, so every byte counts |
The data is run-length compressed. On my, albeit fairly simple, models it is using < 700 bytes. |
How about a don't give up strategy. Check size of compressed model data to be backed up. If it fits mark it as present. EM will try quicK EM warm boot. If it doesn't fit mark backup memory as not usable. Let EM try a full boot in case of empty backup memory without switch checks, model notes and the likes preventing turning on RF. |
Well, I like the idea of trying backup first then regular if first ones fails, but I would set a size check at compile so that dev can manage that size (since if it is predefined anyway) |
Absolutely, all is known at compile time. A little bit of sizeof() math together with a static assert would even keep track of changes pushing the memory limit. |
The size of the RTC backup data can't be calculated at compile time. It depends on how much the data gets compressed with the run-length compression that is done. The uncompressed size is currently 4650 bytes on the TX16S - already well over the available 4094 bytes. |
true that is. leaves runtime checking the size or maybe at build time by a unit test. |
I just tried to do a normal boot after a watchdog reset by changing horus/board.h |
Er... but couldn't that just reboot back right into whatever caused the watchdog to reset? Which is the whole point of EM mode? i.e. We just need to make sure it's doing its job properly. |
Short answer: Long answer: There is also an inherent assumption in your question. The assumption you made was there are specific or permanent conditions still present after the reboot driving the system into the same hard fault again. It can happen and if this is the case you are indeed done but at least you tried. There are a lot of fault scenarios where a reboot will save your model. Some of those scenarios are kinda esoteric but real. Examples:
BTW: How degraded is the system after a successful EM reboot supposed to be? E.g. is the GUI task running? Are software timers running (logging, ...) ? |
But the inherent assumption is the one I think you have to make... as it is worst-case scenario... hope for the best, but plan for the worst. The GUI task is not running, nor is Lua in EM mode. I believe the intent is that essentially only the mixer and RF are active - in order to remove as many points of failure as possible. So attempt EM via RTC backup first, and then we would need a new EM from SD as the fallback for that, that skips all startup checks, and start-up as much of the rest of the system that we've determined we can rely on. So still a reduced operating mode, requiring a manual reboot to restore full functionality (assuming the system can do so - i.e. the cause wasn't failing or intermittent SD card, some intermittent power or coding issue, whatever). |
We are on the same page. In an emergency try to reset into the absolute minimal system required to control the model. This means excluding as many potential fault contributors (GUI, LUA, logging, ...) as possible. The question is about the source of the model data. It is wise to use backed up to RAM model data if possible as the SD card might be corrupted by the fault itself and restoring data from RAM is much faster. If restoring model data from backup RAM is not possible (corrupt - need to CRC the backup RAM, or not present due to size) you'd have to try and use SD card data. |
@philmoz I can confirm removing NOBACKUP here solves the internal MPM RF reconnect issue edgetx/radio/src/datastructs_private.h Line 874 in f293693
@pfeerick I'd prefer to have this change in 2.9 and as it is the same in 2.8 also in 2.8.5. And let's keep working on safeguarding the backup memory (if necessary) together with restoring model data from SD Card in EM mode if for whatever reason restoring backup memory is not possible. |
I just tried hard faulting the system with an external ELRS module. EM works fine with a fast RF reconnect and full control. But I noticed a difference between EM with external ELRS vs EM with internal MPM (fix above included). As soon as the radio goes to EM with the external ELRS the radio starts beeping and when trying to power it down it'll show the "receiver still connected message". There is no EM beeping with the internal MPM and no message at power down. |
It's in for the next nightly, and unless there are unforeseen side-effects, yup, can go in for 2.8.5 |
…ce (#3730) * justification: #3666 - checks SD card free space at startup, start of logging and writing screenshots - issues warning to user if SD card has less than 50MB free space - will not write logs or screenshots if free space is less than 50MB at activation timerReset - fixes crash of warning popup in logs.cpp (#3728) * update: warning message if SD card fills up to threshold while logging * empty commit * call popup in non-blocking mode * make b&w compatible to non-blocking popup calls * 2nd attempt * Popup: - turns on backlight if dimmed due to inactivity setting - if backlight is dimmed due to inactivity setting backlight can turnd on again by any controls * #3599 * fixed b&w UI * updated languages CN, CZ, DA, DE, FR, HE, IT, PL, PT, SE, TW * chore: Update translations
…ce (#3730) * justification: #3666 - checks SD card free space at startup, start of logging and writing screenshots - issues warning to user if SD card has less than 50MB free space - will not write logs or screenshots if free space is less than 50MB at activation timerReset - fixes crash of warning popup in logs.cpp (#3728) * update: warning message if SD card fills up to threshold while logging * empty commit * call popup in non-blocking mode * make b&w compatible to non-blocking popup calls * 2nd attempt * Popup: - turns on backlight if dimmed due to inactivity setting - if backlight is dimmed due to inactivity setting backlight can turnd on again by any controls * #3599 * fixed b&w UI * updated languages CN, CZ, DA, DE, FR, HE, IT, PL, PT, SE, TW * chore: Update translations
…ce (#3730) * justification: #3666 - checks SD card free space at startup, start of logging and writing screenshots - issues warning to user if SD card has less than 50MB free space - will not write logs or screenshots if free space is less than 50MB at activation timerReset - fixes crash of warning popup in logs.cpp (#3728) * update: warning message if SD card fills up to threshold while logging * empty commit * call popup in non-blocking mode * make b&w compatible to non-blocking popup calls * 2nd attempt * Popup: - turns on backlight if dimmed due to inactivity setting - if backlight is dimmed due to inactivity setting backlight can turnd on again by any controls * #3599 * fixed b&w UI * updated languages CN, CZ, DA, DE, FR, HE, IT, PL, PT, SE, TW * chore: Update translations
…ce (#3802) * fix: Prevent functional degradation if SD has less than 50MB free space (#3730) * justification: #3666 - checks SD card free space at startup, start of logging and writing screenshots - issues warning to user if SD card has less than 50MB free space - will not write logs or screenshots if free space is less than 50MB at activation timerReset - fixes crash of warning popup in logs.cpp (#3728) * update: warning message if SD card fills up to threshold while logging * empty commit * call popup in non-blocking mode * make b&w compatible to non-blocking popup calls * 2nd attempt * Popup: - turns on backlight if dimmed due to inactivity setting - if backlight is dimmed due to inactivity setting backlight can turnd on again by any controls * #3599 * fixed b&w UI * updated languages CN, CZ, DA, DE, FR, HE, IT, PL, PT, SE, TW * chore: Update translations * chore: Remove magic number * chore: JP language update for SD card full (#3809) * fix: Don't spam simulator log with file close msgs --------- Co-authored-by: Michael <mha1@users.noreply.github.com>
Hello. Thank you for the research and development on this issue. |
Closing now as it has been fixed in |
Is there an existing issue for this problem?
What part of EdgeTX is the focus of this bug?
Transmitter firmware
Current Behavior
Bug or feature?
From what I understand a CPU hard fault (e.g. division by zero, misaligned memory access, etc.) causes the CPU to reset. At restart the RTC backup registers holding expected shutdown indicators will be used to decide a hard fault reset occurred (= unexpected shutdown) and the well known Emergency Mode screen shows up. Reading @3djc's issue #3579 I understand "color radios have a dedicated area of memory so that in the event of emergency mode, data needed to safely keep your model flying despite the reset would be safe and not depending on sdcard". All in all I'd expect to have at least the RF link and mixers up and running after a hard fault reset (if it's not a permanent one) to maintain control of the model.
Here's my experiment. I injected a hard fault using gcc's
__builtin_trap()
in a part of the GUI to allow me event based triggering of a hard fault by clicking a selection field I misuse for this experiment. The manually triggered hard fault actually happens. The radio shows the Emergency screen and the RF link goes down with the rx in failsfe. But the radio link - MPM in my experiment - doesn't come back up. As there is no permanent hard fault I'd expect at least a boot into a (degraded) core system, maybe without the GUI, coming back to life.Expected Behavior
In case of a spurious hard fault at least a degraded core system should be operative.
Steps To Reproduce
__builtin_trap();
after lineresetMultiProtocolsOptions(moduleIdx);
inedgetx/radio/src/gui/colorlcd/mpm_settings.cpp
Lines 167 to 174 in 3d6bc63
Version
Nightly (Please give date/commit below)
Transmitter
Radiomaster TX16S / TX16SMK2
Operating System (OS)
No response
OS Version
No response
Anything else?
latest nightly
The text was updated successfully, but these errors were encountered: