-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rfc] [air/tune/train] Improve trial/training failure error printing #27946
[rfc] [air/tune/train] Improve trial/training failure error printing #27946
Conversation
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
python/ray/train/_internal/utils.py
Outdated
except Exception as exc: | ||
# Other (e.g. training) errors should be directly raised | ||
_ray_start_tb = True # noqa: F841 | ||
raise exc.with_traceback( | ||
shorten_tb(exc.__traceback__, attr="_ray_start_tb") | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @amogkam if we just catch RayTaskErrors in addition to RayActorErrors this will lead to test failures as training is restarted and a different exception is raised.
What is the expected behavior here? IMO it looks like task errors should fail immediately (as it's likely a logic/syntax error) and only actor failures should be retried. If that's the case (as in the current implementation) maybe we can add better comments for this. Lmk
Overall huge fan of this change. Just wish we could avoid the weird 'magic' variable instead. perhaps attach an attribute to the exception? |
Signed-off-by: Kai Fricke <kai@anyscale.com>
Yeah, I wasn't happy either with the hackiness of the approach. I've refactored the code to use a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. would be great to add a test here on the stacktrace.
Follow-up from #27946, this removes unnecessary calls to `skip_exceptions` to simplify the code and thus make it easier for contributors to understand how to use the utility. Signed-off-by: Kai Fricke kai@anyscale.com
Follow-up from ray-project#27946, this removes unnecessary calls to `skip_exceptions` to simplify the code and thus make it easier for contributors to understand how to use the utility. Signed-off-by: Kai Fricke kai@anyscale.com Signed-off-by: ilee300a <ilee300@anyscale.com>
Why are these changes needed?
When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run.
The main problems here are:
The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured).
The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code.
Deduplicating traceback printing
The solution here is to use
logger.error
instead oflogger.exception
in thefunction_trainable.py
to avoid printing a traceback in the trainable.Additionally, we introduce an environment variable
TUNE_PRINT_ALL_TRIAL_ERRORS
which defaults to 1. If set to 0, trial errors will not be printed at all in the console (only the error.txt files will exist).To be discussed: We could also default this to 0, but I think the expectation is to see at least some failure output in the console logs per default.
Removing internal wrappers from tracebacks
The solution here is to introcude a magic local variable
_ray_start_tb
. In two places, we use this magic local variable to reduce the stacktrace. A utilityshorten_tb
looks for the last occurence of_ray_start_tb
in the stacktrace and starts the traceback from there. This takes only linear time. If the magic variable is not present, the full traceback is returned - this means that if the error does not come up in user code, the full traceback is returned, giving visibility in possible internal bugs. Additionally there is an env variableRAY_AIR_FULL_TRACEBACKS
which disables traceback shortening.Output before
Output after
Example
Tuning script:
Current output
Medium output
467444 lines (~65% reduction)Short output
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.