
[SPARK-46922][CORE][SQL] Do not wrap runtime user-facing errors #44953

Closed
wants to merge 4 commits

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Jan 30, 2024

What changes were proposed in this pull request?

It's not user-friendly to always wrap task runtime errors with `SparkException("Job aborted ...")`, as users need to scroll down quite a bit to find the real error. This PR throws user-facing runtime errors directly, i.e., errors that define an error class and are not internal errors.

This PR also fixes some error wrapping issues.
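A minimal sketch of the idea, with illustrative stand-ins (`SparkThrowable` here is a simplified trait, and `UserFacingError`, `isUserFacing`, and `finalException` are hypothetical names, not Spark's actual implementation):

```scala
// Simplified model of an error that carries an error class.
trait SparkThrowable { def getErrorClass: String }

class UserFacingError(errorClass: String)
  extends RuntimeException(errorClass) with SparkThrowable {
  override def getErrorClass: String = errorClass
}

def isUserFacing(e: Throwable): Boolean = e match {
  // A user-facing error defines an error class and is not an internal error.
  case st: SparkThrowable =>
    st.getErrorClass != null && !st.getErrorClass.startsWith("INTERNAL_ERROR")
  case _ => false
}

// Instead of always wrapping in SparkException("Job aborted ..."),
// rethrow user-facing errors directly and wrap only the rest.
def finalException(cause: Throwable): Throwable =
  if (isUserFacing(cause)) cause // expose the real error directly
  else new RuntimeException("Job aborted due to stage failure", cause)
```

With this shape, a well-defined error such as a division-by-zero reaches the user as-is, while unclassified failures still get the "Job aborted" wrapper with the original error as the cause.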

Why are the changes needed?

Report errors better.

Does this PR introduce any user-facing change?

Yes, now users can see the actual error directly instead of looking for the cause of "job aborted" error.

How was this patch tested?

A new test.

Was this patch authored or co-authored using generative AI tooling?

No

@cloud-fan
Contributor Author

cc @MaxGekk @srielau @dongjoon-hyun

@@ -74,6 +74,26 @@ private[spark] object SparkThrowableHelper {
errorClass.startsWith("INTERNAL_ERROR")
}

def isRuntimeUserError(e: Throwable): Boolean = {
Contributor

Would it make sense that this as an optional field to error-classes.json?

Contributor Author

@cloud-fan cloud-fan commented Jan 31, 2024

Yeah, I can add an optional boolean flag.

I need some suggestions on the naming. We want to avoid the "job aborted" wrapper for user-facing errors, and user-facing errors are errors with error classes, so "user-facing" should be a good name. Another case is task retry, where I picked the name "user error", which may not be proper, as what we really care about is whether the error is runtime and non-transient. For example, out-of-memory is a user-facing error, but it can be transient, and the task may succeed after a retry.
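One possible shape for such an optional flag in error-classes.json (the field name `userFacing` is hypothetical here — the naming is exactly what's under discussion, and the message text is elided):

```json
{
  "DIVIDE_BY_ZERO" : {
    "message" : [ "<message template>" ],
    "userFacing" : true
  }
}
```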

Contributor

Aren't retriable/transient errors in the minority? Shouldn't we flag those instead?

Contributor Author

Ideally I agree with you, but the current behavior is to retry all errors, and I'm afraid of introducing regressions by not retrying most errors, as it's hard to identify all the errors that need to be retried.

@@ -2895,7 +2901,7 @@ private[spark] class DAGScheduler(
/** Fails a job and all stages that are only used by that job, and cleans up relevant state. */
private def failJobAndIndependentStages(
job: ActiveJob,
error: SparkException): Unit = {
error: Exception): Unit = {
Member

@dongjoon-hyun dongjoon-hyun commented Jan 30, 2024

Is this inevitable? Nvm. I found the reason.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM from my side. Thank you, @cloud-fan .

true
case "INVALID_ARRAY_INDEX" | "INVALID_ARRAY_INDEX_IN_ELEMENT_AT" |
"INVALID_INDEX_OF_ZERO" => true
// TODO: add more user-facing runtime errors (mostly ANSI errors).
Member

How about creating a Map and checking the given error class in it? The code will be more maintainable and faster.
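An illustrative sketch of this suggestion: keeping the user-facing runtime error classes in a pre-built collection makes the membership check O(1) and adding a new class a one-line change. The class names below are the ones visible in the diff; the `val`/`def` names are hypothetical.

```scala
// Pre-built lookup of user-facing runtime error classes.
// Extend this set as more ANSI runtime errors are classified.
val userFacingRuntimeErrorClasses: Set[String] = Set(
  "INVALID_ARRAY_INDEX",
  "INVALID_ARRAY_INDEX_IN_ELEMENT_AT",
  "INVALID_INDEX_OF_ZERO"
)

def isRuntimeUserError(errorClass: String): Boolean =
  userFacingRuntimeErrorClasses.contains(errorClass)
```

A `Set` (rather than a `Map`) suffices here since only membership matters, not an associated value.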

@@ -74,6 +74,26 @@ private[spark] object SparkThrowableHelper {
errorClass.startsWith("INTERNAL_ERROR")
}

def isRuntimeUserError(e: Throwable): Boolean = {


I don't know if "UserError" is the right concept here. These are more so data-dependent errors. I think all exceptions with an error class are user-facing.

job,
val finalException = exception.collect {
// If the error is well defined (has an error class and is not internal error), we treat
// it as user-facing, and expose this error to the end users directly.


This includes more exception types than just those that match isRuntimeUserError(), right? I think we might want a different name somewhere; people might get "user-facing error" and "user error" mixed up.

@@ -976,6 +976,14 @@ private[spark] class TaskSetManager(
info.id, taskSet.id, tid, ef.description))
return
}
if (ef.exception.exists(SparkThrowableHelper.isRuntimeUserError)) {


This will be incredibly helpful!

@cloud-fan cloud-fan changed the title [SPARK-46922][CORE][SQL] Better handling for runtime user errors [SPARK-46922][CORE][SQL] Do not wrap runtime user-facing errors Feb 2, 2024
)
)
sparkContext.listenerBus.waitUntilEmpty()
// TODO: Spark should not retry tasks on this error.
Member

Nit, but we should ideally file a JIRA for this, e.g., TODO(SPARK-XXXXX): ...

@cloud-fan
Contributor Author

UPDATE: I removed the task retry part, as I need to investigate further which errors should be retried. Now this PR only contains the unwrap-error part.

Contributor

@allisonwang-db allisonwang-db left a comment

This is super helpful!

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@github-actions github-actions bot added the PYTHON label Feb 7, 2024
@cloud-fan
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 5789316 Feb 7, 2024
MaxGekk pushed a commit that referenced this pull request Mar 27, 2024
…ead files

### What changes were proposed in this pull request?

This is a followup of #44953 to refine the newly added `FAILED_READ_FILE` error. It's better to always throw the `FAILED_READ_FILE` error if anything goes wrong during file reading. This is more predictable and makes it easier for users to do error handling. This PR adds sub-error classes to `FAILED_READ_FILE` so that users can tell what went wrong more quickly.

### Why are the changes needed?

better error reporting

### Does this PR introduce _any_ user-facing change?

no, `FAILED_READ_FILE` is not released yet.

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45723 from cloud-fan/error.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
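For illustration, sub-error classes in Spark's error-classes.json generally nest under the parent error class roughly as shown below. The sub-class names and message texts here are hypothetical examples, not necessarily the ones this commit added:

```json
{
  "FAILED_READ_FILE" : {
    "message" : [ "Encountered error while reading file <path>." ],
    "subClass" : {
      "FILE_NOT_EXIST" : {
        "message" : [ "File does not exist." ]
      },
      "PARQUET_COLUMN_DATA_TYPE_MISMATCH" : {
        "message" : [ "Data type mismatch when reading Parquet column." ]
      }
    }
  }
}
```

A reported error then combines the parent and sub-class, e.g. `FAILED_READ_FILE.FILE_NOT_EXIST`, which tells users both that a file read failed and why.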
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024