
[SPARK-46922][CORE][SQL] Do not wrap runtime user-facing errors #44953

Closed
wants to merge 4 commits

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Jan 30, 2024

What changes were proposed in this pull request?

It's not user-friendly to always wrap task runtime errors with `SparkException("Job aborted ...")`, as users need to scroll down quite a bit to find the real error. This PR throws user-facing runtime errors directly, i.e., errors that define an error class and are not internal errors.

This PR also fixes some error wrapping issues.
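A minimal sketch of the idea, with illustrative stand-ins (`SparkThrowable` here is a simplified trait, and `UserFacingError`, `isUserFacing`, and `finalException` are hypothetical names, not Spark's actual implementation):

```scala
// Simplified model of an error that carries an error class.
trait SparkThrowable { def getErrorClass: String }

class UserFacingError(errorClass: String)
  extends RuntimeException(errorClass) with SparkThrowable {
  override def getErrorClass: String = errorClass
}

def isUserFacing(e: Throwable): Boolean = e match {
  // A user-facing error defines an error class and is not an internal error.
  case st: SparkThrowable =>
    st.getErrorClass != null && !st.getErrorClass.startsWith("INTERNAL_ERROR")
  case _ => false
}

// Instead of always wrapping in SparkException("Job aborted ..."),
// rethrow user-facing errors directly and wrap only the rest.
def finalException(cause: Throwable): Throwable =
  if (isUserFacing(cause)) cause // expose the real error directly
  else new RuntimeException("Job aborted due to stage failure", cause)
```

With this shape, a well-defined error such as a division-by-zero reaches the user as-is, while unclassified failures still get the "Job aborted" wrapper with the original error as the cause.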

Why are the changes needed?

Report errors better.

Does this PR introduce any user-facing change?

Yes, now users can see the actual error directly instead of looking for the cause of "job aborted" error.

How was this patch tested?

A new test.

Was this patch authored or co-authored using generative AI tooling?

No

@cloud-fan
Contributor Author

cc @MaxGekk @srielau @dongjoon-hyun

@@ -74,6 +74,26 @@ private[spark] object SparkThrowableHelper {
errorClass.startsWith("INTERNAL_ERROR")
}

def isRuntimeUserError(e: Throwable): Boolean = {
Contributor

Would it make sense that this as an optional field to error-classes.json?

Contributor Author

@cloud-fan cloud-fan commented Jan 31, 2024

Yeah, I can add an optional boolean flag.

I need some suggestions on the naming. We want to avoid the "job aborted" wrapper for user-facing errors, and user-facing errors are errors with error classes, so "user-facing" should be a good name. Another case is task retry, where I picked the name "user error", which may not be proper, as what we really care about is whether the error is runtime and non-transient. For example, out-of-memory is a user-facing error, but it can be transient, and the task may succeed after a retry.
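One possible shape for such an optional flag in error-classes.json (the field name `userFacing` is hypothetical here — the naming is exactly what's under discussion, and the message text is elided):

```json
{
  "DIVIDE_BY_ZERO" : {
    "message" : [ "<message template>" ],
    "userFacing" : true
  }
}
```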

Contributor

Aren't retriable/transient errors in the minority? Shouldn't we flag those instead?

Contributor Author

Ideally I agree with you, but the current behavior is to retry all errors, and I'm afraid of introducing regressions by not retrying most errors, as it's hard to identify all the errors that need to be retried.

@@ -2895,7 +2901,7 @@ private[spark] class DAGScheduler(
/** Fails a job and all stages that are only used by that job, and cleans up relevant state. */
private def failJobAndIndependentStages(
job: ActiveJob,
error: SparkException): Unit = {
error: Exception): Unit = {
Member

@dongjoon-hyun dongjoon-hyun commented Jan 30, 2024

Is this inevitable? Nvm. I found the reason.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM from my side. Thank you, @cloud-fan .

true
case "INVALID_ARRAY_INDEX" | "INVALID_ARRAY_INDEX_IN_ELEMENT_AT" |
"INVALID_INDEX_OF_ZERO" => true
// TODO: add more user-facing runtime errors (mostly ANSI errors).
Member

How about creating a Map and checking the given error class in it? The code will be more maintainable and faster.
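An illustrative sketch of this suggestion: keeping the user-facing runtime error classes in a pre-built collection makes the membership check O(1) and adding a new class a one-line change. The class names below are the ones visible in the diff; the `val`/`def` names are hypothetical.

```scala
// Pre-built lookup of user-facing runtime error classes.
// Extend this set as more ANSI runtime errors are classified.
val userFacingRuntimeErrorClasses: Set[String] = Set(
  "INVALID_ARRAY_INDEX",
  "INVALID_ARRAY_INDEX_IN_ELEMENT_AT",
  "INVALID_INDEX_OF_ZERO"
)

def isRuntimeUserError(errorClass: String): Boolean =
  userFacingRuntimeErrorClasses.contains(errorClass)
```

A `Set` (rather than a `Map`) suffices here since only membership matters, not an associated value.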

@@ -74,6 +74,26 @@ private[spark] object SparkThrowableHelper {
errorClass.startsWith("INTERNAL_ERROR")
}

def isRuntimeUserError(e: Throwable): Boolean = {


I don't know if "UserError" is the right concept here. These are more so data-dependent errors. I think all exceptions with an error class are user-facing.

job,
val finalException = exception.collect {
// If the error is well defined (has an error class and is not internal error), we treat
// it as user-facing, and expose this error to the end users directly.


This includes more exception types than just those that match isRuntimeUserError(), right? I think we might want a different name somewhere; people might get "user-facing error" and "user error" mixed up.

@@ -976,6 +976,14 @@ private[spark] class TaskSetManager(
info.id, taskSet.id, tid, ef.description))
return
}
if (ef.exception.exists(SparkThrowableHelper.isRuntimeUserError)) {


This will be incredibly helpful!

@cloud-fan cloud-fan changed the title [SPARK-46922][CORE][SQL] Better handling for runtime user errors [SPARK-46922][CORE][SQL] Do not wrap runtime user-facing errors Feb 2, 2024
)
)
sparkContext.listenerBus.waitUntilEmpty()
// TODO: Spark should not retry tasks on this error.
Member

Nit, but we should ideally file a JIRA for this, e.g., TODO(SPARK-XXXXX): ...

@cloud-fan
Contributor Author

UPDATE: I removed the task retry part, as I need to investigate further which errors should be retried. Now this PR only contains the unwrap-error part.

Contributor

@allisonwang-db allisonwang-db left a comment

This is super helpful!

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@github-actions github-actions bot added the PYTHON label Feb 7, 2024
@cloud-fan
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 5789316 Feb 7, 2024
MaxGekk pushed a commit that referenced this pull request Mar 27, 2024
…ead files

### What changes were proposed in this pull request?

This is a followup of #44953 to refine the newly added `FAILED_READ_FILE` error. It's better to always throw the `FAILED_READ_FILE` error if anything goes wrong during file reading. This is more predictable and makes it easier for users to do error handling. This PR adds sub-error classes to `FAILED_READ_FILE` so that users can tell what went wrong more quickly.

### Why are the changes needed?

better error reporting

### Does this PR introduce _any_ user-facing change?

no, `FAILED_READ_FILE` is not released yet.

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45723 from cloud-fan/error.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
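For illustration, sub-error classes in Spark's error-classes.json generally nest under the parent error class roughly as shown below. The sub-class names and message texts here are hypothetical examples, not necessarily the ones this commit added:

```json
{
  "FAILED_READ_FILE" : {
    "message" : [ "Encountered error while reading file <path>." ],
    "subClass" : {
      "FILE_NOT_EXIST" : {
        "message" : [ "File does not exist." ]
      },
      "PARQUET_COLUMN_DATA_TYPE_MISMATCH" : {
        "message" : [ "Data type mismatch when reading Parquet column." ]
      }
    }
  }
}
```

A reported error then combines the parent and sub-class, e.g. `FAILED_READ_FILE.FILE_NOT_EXIST`, which tells users both that a file read failed and why.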
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024