[SPARK-46922][CORE][SQL] Do not wrap runtime user-facing errors #44953
Conversation
@@ -74,6 +74,26 @@ private[spark] object SparkThrowableHelper {
    errorClass.startsWith("INTERNAL_ERROR")
  }

  def isRuntimeUserError(e: Throwable): Boolean = {
Would it make sense to make this an optional field in error-classes.json?
Yeah, I can add an optional bool flag.
I need some suggestions on the naming. We want to avoid the "job aborted" wrapper for user-facing errors, and user-facing errors are errors with error classes, so I think "user-facing" is a good name there. The other case is task retry, where I picked the name "user error", which may not be proper, as what we really care about is whether the error is runtime and non-transient. For example, an out-of-memory error is user-facing, but it can be transient and the task may succeed after a retry.
also cc @superdupershant
Aren't retriable/transient errors in the minority? Shouldn't we flag those instead?
Ideally I agree with you, but the current behavior is to retry all errors, and I'm afraid of introducing regressions by not retrying most of them, as it's hard to identify all the errors that need a retry.
@@ -2895,7 +2901,7 @@ private[spark] class DAGScheduler(
   /** Fails a job and all stages that are only used by that job, and cleans up relevant state. */
   private def failJobAndIndependentStages(
       job: ActiveJob,
-      error: SparkException): Unit = {
+      error: Exception): Unit = {
Is this inevitable? Nvm. I found the reason.
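For readers who paused at the same spot: the widening is presumably needed because, once user-facing errors are passed through unwrapped, the failure reaching this method is no longer guaranteed to be a SparkException (for example, SparkArithmeticException implements SparkThrowable but extends ArithmeticException). A small illustrative sketch, not code from this PR:

import org.apache.spark.{SparkException, SparkThrowable}

// Illustrative only: the failure may carry an error class without being a
// SparkException, so a parameter typed as SparkException would no longer fit.
def describe(e: Exception): String = e match {
  case se: SparkException => s"wrapped: ${se.getMessage}"
  case st: SparkThrowable => s"user-facing, error class: ${st.getErrorClass}"
  case other => s"other: ${other.getClass.getSimpleName}"
}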
+1, LGTM from my side. Thank you, @cloud-fan .
        true
      case "INVALID_ARRAY_INDEX" | "INVALID_ARRAY_INDEX_IN_ELEMENT_AT" |
           "INVALID_INDEX_OF_ZERO" => true
      // TODO: add more user-facing runtime errors (mostly ANSI errors).
How about creating a Map and checking the given error class against it? The code will be more maintainable and faster.
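A minimal sketch of that suggestion, assuming a plain allow-list of error-class names (a Set here; a Map works the same way if each entry needs extra metadata). The entries are just the ones visible in this hunk, not the PR's final list:

import org.apache.spark.SparkThrowable

object RuntimeUserErrors {
  // Hypothetical allow-list; expected to grow (mostly ANSI runtime errors).
  private val errorClasses: Set[String] = Set(
    "INVALID_ARRAY_INDEX",
    "INVALID_ARRAY_INDEX_IN_ELEMENT_AT",
    "INVALID_INDEX_OF_ZERO")

  // Constant-time membership check instead of a long pattern match.
  def isRuntimeUserError(e: Throwable): Boolean = e match {
    case t: SparkThrowable if t.getErrorClass != null => errorClasses.contains(t.getErrorClass)
    case _ => false
  }
}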
@@ -74,6 +74,26 @@ private[spark] object SparkThrowableHelper {
    errorClass.startsWith("INTERNAL_ERROR")
  }

  def isRuntimeUserError(e: Throwable): Boolean = {
I don't know if "UserError" is the right concept here. These are more like data-dependent errors. I think all exceptions with an error class are user-facing.
          job,
      val finalException = exception.collect {
        // If the error is well defined (has an error class and is not internal error), we treat
        // it as user-facing, and expose this error to the end users directly.
This includes more exception types than just those that match isRuntimeUserError(), right? I think we might want a different name somewhere; people might get "user-facing error" and "user error" mixed up.
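For context, a hedged standalone sketch of the unwrapping idea being discussed (not the exact PR code). Note that its predicate is intentionally broader than isRuntimeUserError, which is exactly the naming concern raised above:

import org.apache.spark.{SparkException, SparkThrowable}

// If the failure carries a non-internal error class, surface it directly;
// otherwise keep the classic "Job aborted ..." wrapper around the cause.
def chooseFinalException(reason: String, exception: Option[Throwable]): Exception =
  exception.collect {
    case e: Exception with SparkThrowable
        if e.getErrorClass != null && !e.getErrorClass.startsWith("INTERNAL_ERROR") =>
      e
  }.getOrElse(new SparkException(reason, exception.orNull))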
@@ -976,6 +976,14 @@ private[spark] class TaskSetManager(
          info.id, taskSet.id, tid, ef.description))
        return
      }
      if (ef.exception.exists(SparkThrowableHelper.isRuntimeUserError)) {
This will be incredibly helpful!
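A hedged sketch of the intent behind this check (standalone illustration, not TaskSetManager code; see the update further down, where the retry change was pulled out of this PR): if a task failed with a well-defined runtime user error, retrying is unlikely to help, so the task set should fail fast instead of burning the remaining attempts.

import org.apache.spark.SparkThrowable

// Returns true when retrying is pointless because the failure is a
// deterministic, user-facing runtime error (simplified predicate; the PR uses
// the narrower isRuntimeUserError allow-list, since some error-classed
// failures, e.g. OOM, can be transient).
def shouldFailFast(exception: Option[Throwable]): Boolean =
  exception.exists {
    case e: SparkThrowable => e.getErrorClass != null && !e.getErrorClass.startsWith("INTERNAL_ERROR")
    case _ => false
  }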
        )
      )
      sparkContext.listenerBus.waitUntilEmpty()
      // TODO: Spark should not re-try tasks on this error.
Nit, but we should ideally file a JIRA for this, e.g., TODO(SPARK-XXXXX): ...
UPDATE: I removed the task retry part, as I need to investigate further which errors should be retried. Now this PR only contains the error-unwrapping part.
This is super helpful!
+1, LGTM.
thanks for the review, merging to master!
…ead files

### What changes were proposed in this pull request?
This is a followup of #44953 to refine the newly added `FAILED_READ_FILE` error. It's better to always throw the `FAILED_READ_FILE` error if anything goes wrong during file reading: this is more predictable and makes it easier for users to do error handling. This PR adds sub error classes to `FAILED_READ_FILE` so that users can tell what went wrong more quickly.

### Why are the changes needed?
Better error reporting.

### Does this PR introduce _any_ user-facing change?
No, `FAILED_READ_FILE` is not released yet.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45723 from cloud-fan/error.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
It's not user-friendly to always wrap task runtime errors with SparkException("Job aborted ..."), as users need to scroll down quite a bit to find the real error. This PR throws user-facing runtime errors directly, i.e. errors that define an error class and are not internal errors. This PR also fixes some error-wrapping issues.
Why are the changes needed?
Report errors better.
Does this PR introduce any user-facing change?
Yes, now users can see the actual error directly instead of looking for the cause of the "job aborted" error.
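A hedged example of what this looks like from the user's side, assuming a local session, ANSI mode, and a query that divides by zero inside a task (the exact exception type and error class depend on the query; DIVIDE_BY_ZERO is just one case):

import org.apache.spark.SparkThrowable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("unwrap-demo").getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.range(5).createOrReplaceTempView("t")

try {
  // Fails at runtime inside a task when id == 2.
  spark.sql("SELECT 6 / (id - 2) FROM t").collect()
} catch {
  // Before this PR the interesting error was buried as the cause of
  // SparkException("Job aborted ..."); now the well-defined error is thrown directly.
  case e: SparkThrowable => println(s"error class: ${e.getErrorClass}")
} finally {
  spark.stop()
}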
How was this patch tested?
new test
Was this patch authored or co-authored using generative AI tooling?
No