
Adds filter for failed and non-completed stages #1558

Open
wants to merge 9 commits into base: dev

Conversation

sayedbilalbari
Collaborator

Fixes #1552

Currently we store the stage information using the StageModelManager class, where we map incoming stage information during the following events -

  1. doSparkListenerStageCompleted
  2. doSparkListenerStageSubmitted

So a stage's information is updated once when the stage is submitted and once when it completes. A StageCompleted event also arrives for a failed attempt (e.g., there will be two StageSubmitted and two StageCompleted events for a stage that fails on its first attempt and succeeds on the second).
Both of those StageInfo objects are updated in the map.
This PR adds extra filters so that aggregate metrics are calculated only for stages that have not failed and have completed successfully.

Changes -

  1. Add a new method, getAllCompletedStages, for filtering StageInfo objects down to those that are successful and completed
  2. Change previous usages of getAllStages to getAllCompletedStages

Effects -

This change ensures that all stage-level aggregate metrics take into account only the successful (completed and not failed) stages.
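For illustration, a minimal sketch of what the new filter could look like, assuming StageModelManager exposes per-attempt StageModel objects via getAllStages; the hasFailed helper here is hypothetical (the PR's hasCompleted wraps StageInfo.completionTime, and a failure check would wrap StageInfo.failureReason):

// Hedged sketch only, not the exact tools code: keep stage attempts that
// completed (completionTime defined) and did not fail (no failureReason).
def getAllCompletedStages: Iterable[StageModel] =
  getAllStages.filter(sm => sm.hasCompleted && !sm.hasFailed)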

Collaborator

@amahussein left a comment

Did you see any change in the output as an impact of that change? If you do a diff between the outputs from before/after, is there any change in the metrics?

@@ -321,7 +321,7 @@ class AppSparkMetricsAnalyzer(app: AppBase) extends AppAnalysisBase(app) {
   val emptyNodeNames = Seq.empty[String]
   val emptyDiagnosticMetrics = HashMap.empty[String, AccumProfileResults]
   // TODO: this has stage attempts. we should handle different attempts
-  app.stageManager.getAllStages.map { sm =>
+  app.stageManager.getAllCompletedStages.map { sm =>
Collaborator

Do we still need the TODO? If not, we can replace it with an explanation of why we are using getAllCompletedStages.

Collaborator Author

Have removed the TODO

@@ -321,7 +321,7 @@ class AppSparkMetricsAnalyzer(app: AppBase) extends AppAnalysisBase(app) {
   val emptyNodeNames = Seq.empty[String]
   val emptyDiagnosticMetrics = HashMap.empty[String, AccumProfileResults]
   // TODO: this has stage attempts. we should handle different attempts
-  app.stageManager.getAllStages.map { sm =>
+  app.stageManager.getAllCompletedStages.map { sm =>
Collaborator

There is a tricky part in this: inside the loop, we retrieve the tasks belonging to a certain stage. If a stage is still incomplete, some of its tasks could already be there, right?
I am comparing this to the other behavior (BehaviorB), which is to get the most representative attempt for a stageID, vs. getting only the successful attempts.
For example, in BehaviorB we will have the tasks of an incomplete stage, given that it did not fail and there is no other attempt that has succeeded.
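To make the two policies concrete, here is a hedged sketch of the difference; StageAttempt and its fields are invented for illustration and are not the actual tools API:

// Invented types, for illustration only.
case class StageAttempt(attemptId: Int, completed: Boolean, failed: Boolean)

// BehaviorB: pick one most-representative attempt per stage; this may be an
// incomplete attempt when no attempt has succeeded and it has not failed.
def mostRepresentative(attempts: Seq[StageAttempt]): Option[StageAttempt] =
  attempts.find(a => a.completed && !a.failed)
    .orElse(attempts.filterNot(_.failed).sortBy(_.attemptId).lastOption)

// This PR's policy: only attempts that completed without failing contribute.
def successfulOnly(attempts: Seq[StageAttempt]): Seq[StageAttempt] =
  attempts.filter(a => a.completed && !a.failed)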

Collaborator Author

  • So, as per my understanding, the question is whether we want only the successful attempts of a stage in the diagnostics, vs. attempts that were running and not completed when there is no successful attempt of that stage.
  • The above behavior can be discussed, as it warrants understanding whether qualX or other downstream consumers want this in the output. If yes, it can be included.
  • Another point I came across while debugging: even for successful stages, there can be tasks that were killed midway (speculative execution).
  • These tasks come with accumulables and update the accumulable values.
  • Currently these killed-task updates are included in the rolling metrics and are part of the associated diagnostic metrics.
  • As per my understanding, a killed task will have a corresponding successful attempt coming in, but for a true representation of a stage, killed tasks should be included in the metric. The question is whether or not they should be put in a separate category.

Collaborator Author

@amahussein Have updated the logic -

  • Updated the current code to accommodate all successful attempts for a stage and aggregate them.
  • As for failed/killed tasks, we do account for the accumulables coming in with them:
    • Failed tasks do not come in with accumulables.
    • Killed tasks do come in with accumulables.
  • The logic we follow (see the sketch below) ->
    • Get all tasks associated with a particular stage attempt (in this case, only the successful attempts).
    • These can include tasks that failed or were killed.
    • Failed tasks do not contribute to the stage-level metrics.
    • As for killed tasks, the major metrics I have seen them contribute are -
      • sw_writeTime
      • executorRuntime
      • jvmGCTime
      • input_bytesRead
      • sw_bytesWritten
      • sw_recordsWritten
      • input_RecordsRead
      • peakExecutionMemory
  • In my understanding, these should ideally be included in the stage-aggregated metrics, and that is what happens in the current logic.
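A hedged sketch of that aggregation rule; TaskRecord, its fields, and the status values are invented for illustration (the real code aggregates many more metrics from task accumulables):

// Invented record type; real task models carry many more metric fields.
case class TaskRecord(status: String, executorRunTime: Long, peakExecutionMemory: Long)

// Aggregate task metrics for the successful attempt(s) of a stage.
// Failed tasks report no accumulables, so excluding them loses nothing;
// killed tasks (e.g. speculative duplicates) do report partial accumulables
// and are kept so the stage totals reflect the work actually performed.
def aggregateStage(tasks: Seq[TaskRecord]): (Long, Long) = {
  val contributing = tasks.filterNot(_.status == "FAILED")
  val totalRunTime = contributing.map(_.executorRunTime).sum
  val peakMemory = contributing.map(_.peakExecutionMemory).foldLeft(0L)((a, b) => math.max(a, b))
  (totalRunTime, peakMemory)
}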

def hasCompleted: Boolean = {
  stageInfo.completionTime.isDefined
}

Collaborator

Is this similar to the logic in the Spark code?
We could benefit from having a getState or getStatus similar to what Spark does; then this would also be a column in the final output, which will be helpful for us to troubleshoot the behavior.

Collaborator Author

private[spark] def getStatusString: String = {
  if (completionTime.isDefined) {
    if (failureReason.isDefined) {
      "failed"
    } else {
      "succeeded"
    }
  } else {
    "running"
  }
}

This is the logic in Spark. A succeeded stage is one for which completionTime is defined and which does not have a failureReason. I have used similar reasoning for filtering completed stages.
There are just three statuses - failed, succeeded, and running.
If a stage has a completion time and has not failed, it has succeeded; otherwise it is running.
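If we wanted to surface this as a column per the reviewer's suggestion, one possible shape - a hedged sketch, assuming StageModel wraps the Spark StageInfo in a field named stageInfo (the method name getStatus is illustrative):

// Hedged sketch: derive a Spark-like status string from the wrapped
// StageInfo so it could be emitted as a column in the final output.
def getStatus: String = {
  if (stageInfo.completionTime.isDefined) {
    if (stageInfo.failureReason.isDefined) "failed" else "succeeded"
  } else {
    "running"
  }
}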

@amahussein added the bug (Something isn't working) and core_tools (Scope the core module (scala)) labels on Feb 21, 2025
@eordentlich
Collaborator

@leewyang can you take a look at this wrt qualx?

Development

Successfully merging this pull request may close these issues.

[BUG] Aggregate metric per stage is missing filter for stage attempts