[rfc] [air/tune/train] Improve trial/training failure error printing #27946
cc @amogkam: if we just catch RayTaskErrors in addition to RayActorErrors, this will lead to test failures, because training is restarted and a different exception is raised.
What is the expected behavior here? IMO task errors should fail immediately (they most likely indicate a logic or syntax error), and only actor failures should be retried. If that's the case (as in the current implementation), maybe we can add better comments explaining this. Let me know.
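To make the proposed behavior concrete, here is a minimal sketch of the policy described above: task errors propagate immediately, while actor failures are retried up to a limit. It is not the code in this PR. `TaskError`, `ActorError`, `run_with_retries`, and `max_actor_failures` are hypothetical names; the two exception classes are local stand-ins for `ray.exceptions.RayTaskError` and `ray.exceptions.RayActorError` so the snippet runs without Ray installed.

```python
class TaskError(Exception):
    """Hypothetical stand-in for ray.exceptions.RayTaskError (user code raised)."""


class ActorError(Exception):
    """Hypothetical stand-in for ray.exceptions.RayActorError (worker actor died)."""


def run_with_retries(train_fn, max_actor_failures=3):
    """Run train_fn, retrying only when the failure is an actor failure.

    Assumption: an ActorError means the worker was lost, which may be
    transient, so restarting is worthwhile. A TaskError most likely means
    a logic or syntax error in user code, so retrying would only repeat
    the failure. It is deliberately not caught here and fails immediately.
    """
    failures = 0
    while True:
        try:
            return train_fn()
        except ActorError:
            failures += 1
            if failures > max_actor_failures:
                # Retry budget exhausted: surface the last actor failure.
                raise
```

With this shape, a caller that raises `TaskError` is invoked exactly once, which would keep tests that expect the original exception from seeing a different one after a restart.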