-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exception and Reset Log possible race conditions #411
Comments
The potential race conditions in exception handling was also noted in a previous trac ticket, now in issue #76 |
Sure, I'm happy to take it on, but this is one area that reality of varying system architectures is somewhat mismatched to the original API/requirements (which had a bit of a simplistic VxWorks/flat memory/non-MMU view of the world). A reason this has been sitting in the "todo" bucket for so long is that it first needs some community agreement on what the requirements for exception handling should be. It likely means moving at least some aspects of the exception handling to the PSP so will require coordinated change to the requirement, PSP, and CFE at the same time. Yes, will likely have to deprecate |
Sounds good. It would help if you could initiate this conversation and work it with the architects and/or coordinate with @astrogeco to bring it up at the CCB. Note #509 removes the exception handling requirements from cFE (so it can be implemented in the PSP layer). The requirements related to the ER Log remain since we still need a way to command a clear of the log and command writing the log to the file (cFE capabilities), but utilizing the PSP under the hood. |
Suggestions/Proposal for fixing this issue The root of the issue is that the exception handler within CFE ( This is another area where the "better" approach likely depends on the mode of operation and goals, in particular whether it is a "debug" build or an actual deployment build. On a "debug" build it is preferable to do an On a flight-like or "release" build it is preferable to capture and log as much as possible internally. Not all platforms can support this, however, and the solution is platform-dependent. Suggesting a sequence that the MCP750 could do as an example (even though it is generally still a debug platform). Proposed sequence for VxWorks MCP750 (for a flight-like example)
Background Task:
Concerns/considerations with this sequence:
If the hardware supports it, it might be useful to trigger the watchdog directly in the exception hook but in a "delay" fashion, so if the background task never completes, the system still gets a reset (again, requires HW that can actually do this, not all can). Proposed Sequence for DEBUG builds The default behavior for handling this type of exception (i.e. if the PSP/OSAL did not register any handler for this event, the kernel would would handle it equivalently to an "abort" or debugger trap). This will either generate a core dump or otherwise let the developer debug the situation with full context info. Please review and provide comments/thoughts, particularly @skliper, @acdumore and @jwilmot, I'd like to hear your thoughts on this approach. I'll then move forward with implementing it. |
I like it. |
The proposed solution sounds good to me too. I think the original implementation worked for VxWorks with the exception hook if I remember correctly, but I had to hack the implementation for RTEMS and FreeRTOS because of the problems with calling CFE_ES_ProcessCoreException from an ISR/Exception handler. I suppose another potential implementation could be to let the PSP handle everything. Make sure the PSP can access the Performance Log, Exception and Reset Log, syslog, etc, so the PSP could just log the exception data from ES and reset directly. I guess it would have to stop the scheduler to do this. (is that even possible on Linux?) But is that just moving the problems? |
Could you make the different behavior selectable independent of debug mode? There may be users who would like to control this behavior explicitly or enforce it be consistent between build types. |
I'm fine if it defaults to behaving this way, just asking that it not be required to behave this way (allow folks to configure it to behave more flight-like while in debug mode). |
It is based on your PSP and its capabilities, not the BUILDTYPE of the build. In the Maybe there is some signal handler technique I'm not aware of for the arithmetic/floating point exception stuff, but for the other stuff like SIGSEGV/SIGBUS/SIGILL, it really is futile to try to continue running because the task is in an indeterminate state. So for pc-linux, I think the only choice should be to just use the default signal behavior for these exceptions, which will abort the CFE and generate a coredump. Any recovery could be done in a parent/watchdog process that detects this abnormal exit and does a "processor restart" of the CFE to get things running again. In theory, this could detect the presence of the core file and make an ER log entry after it restarts, or it could be logged by the external process. My recommendation is to save that for another day, or let someone who actually wants that to implement it. The other, better way to make pc-linux more robust WRT exceptions is to run each app/module in a separate memory-protected process. This would be FAR more reliable. I think its doable but it would be a good challenge. But then the exception handling would "just work". When using the mcp750/vxworks, this is the only platform that can really implement exception handling as it is currently designed because of the fact that VxWorks suspends the individual task and has the exception hook feature. So this will be employed when using this PSP. I've been thinking about @acudmore 's suggestion of letting the PSP do the log entirely and I think that makes a lot of sense. most everything related to the exception is platform-specific so it may make more sense to just keep it entirely "owned" by the PSP and only offer a simplified API that ES can call if it needs to add some entry into it. |
This "DEBUG builds" is what threw me (this sounded like a reference to BUILDTYPE), completely on-board with PSP handled. It was the initial direction I thought this was going to begin with per What about retaining the clear/write to file commands? |
Move exception handling to a PSP function. In this approach the CFE only logs data after the event as a background job. Replaces the CFE_ES_ProcessCoreException with a simple notification that causes the ES background job to run, which in turn polls the PSP for logged exceptions and writes entries to the ES ER log. Both the PSP execption scan and the ER log file dump are converted to background jobs.
Move exception handling to a PSP function. In this approach the CFE only logs data after the event as a background job. Replaces the CFE_ES_ProcessCoreException with a simple notification that causes the ES background job to run, which in turn polls the PSP for logged exceptions and writes entries to the ES ER log. Both the PSP execption scan and the ER log file dump are converted to background jobs.
Move exception handling to a PSP function. In this approach the CFE only logs data after the event as a background job. Replaces the CFE_ES_ProcessCoreException with a simple notification that causes the ES background job to run, which in turn polls the PSP for logged exceptions and writes entries to the ES ER log. Both the PSP execption scan and the ER log file dump are converted to background jobs.
Move exception handling to a PSP function. In this approach the CFE only logs data after the event as a background job. Replaces the CFE_ES_ProcessCoreException with a simple notification that causes the ES background job to run, which in turn polls the PSP for logged exceptions and writes entries to the ES ER log. Both the PSP execption scan and the ER log file dump are converted to background jobs.
Move exception handling to a PSP function. In this approach the CFE only logs data after the event as a background job. Replaces the CFE_ES_ProcessCoreException with a simple notification that causes the ES background job to run, which in turn polls the PSP for logged exceptions and writes entries to the ES ER log. Both the PSP execption scan and the ER log file dump are converted to background jobs.
Move exception handling to a PSP function. In this approach the CFE only logs data after the event as a background job. Replaces the CFE_ES_ProcessCoreException with a simple notification that causes the ES background job to run, which in turn polls the PSP for logged exceptions and writes entries to the ES ER log. Both the PSP execption scan and the ER log file dump are converted to background jobs.
Correct items flagged as warnings in documenation build, and remove now-unused definition in sample platform config
Fix #411, rework exception handling in CFE
Describe the bug
CFE_ES_ClearERLogCmd and CFE_ES_WriteToERLog both modify shared CFE_ES_ResetDataPtr values. CFE_ES_ProcessCoreException and CFE_ES_ResetCFE both use CFE_ES_WriteToERLog (both are API's, so could be out of ES context).
To Reproduce
Looks to me like if CFE_ES_ClearERLogCmd gets interrupted by the processing of an app core exception, the log could get corrupted.
Expected behavior
No race.
Code snips
See functions above.
Versions
Additional context
Not observed, via code review.
Reporter Info
Jacob Hageman - NASA/GSFC
The text was updated successfully, but these errors were encountered: