[FEA] Add filtered diagnostic output for GPU slowness in Profiler tool #1548
base: dev
Conversation
Signed-off-by: cindyyuanjiang <cindyj@nvidia.com>
Just a few doubts
@@ -72,8 +72,7 @@ object IOAccumDiagnosticMetrics {
)

val SHUFFLE_WRITE_TIME_METRIC_NAMES = Set(
  "shuffle write time", // common across all Spark eventlogs
  "rs. shuffle write time" // only in GPU eventlogs
Why has this been removed?
@@ -115,3 +114,100 @@ object IOAccumDiagnosticMetrics {
    metricNamesToKeyMap(metric)
  }
}

object FilteredAccumDiagnosticMetrics {
I am assuming this is named FilteredAccumDiagnosticMetrics because it represents a subset of metrics that are more relevant to the profiler output, the candidates being the names mentioned. Is there a doc somewhere highlighting why these operators were chosen?
@@ -351,6 +375,8 @@ class AppSQLPlanAnalyzer(app: AppBase, appIndex: Int) extends AppAnalysisBase(ap
      updateIODiagnosticMetricsMap(sqlAccumProileResult)
    }

    accumIdToNodeInfoMap(metric.accumulatorId) = (metric.sqlID, metric.nodeID, metric.nodeName)
Just a thought: we iterate over all the SQLMetricInfoCase objects and use them to maintain a mapping of accumId -> (sqlID, nodeID, nodeName).
Can there be a case where an entry belonging to the same accumId is overridden in the next iteration? In other words, can one accumId belong to multiple combinations of sqlID, nodeID, and nodeName?
In my (admittedly minimal) understanding, an accumId is agnostic of SQL, and multiple SQLs and nodes should be able to update the same accumulable.
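To make the concern concrete, here is a minimal sketch (all ids and node names are made up) of how a later assignment to the map would silently replace an earlier one for the same accumulator id:

```scala
import scala.collection.mutable.HashMap

object OverwriteSketch extends App {
  val accumIdToNodeInfoMap = HashMap.empty[Long, (Long, Long, String)]

  // Two plan nodes reporting the same accumulator id...
  accumIdToNodeInfoMap(42L) = (0L, 1L, "GpuShuffleCoalesce")
  accumIdToNodeInfoMap(42L) = (1L, 7L, "GpuHashAggregate") // silently replaces the entry above

  // ...so only the last writer's node info survives.
  assert(accumIdToNodeInfoMap(42L) == (1L, 7L, "GpuHashAggregate"))
}
```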
val nodeInfoToFilterMetricsMap =
  HashMap.empty[(Long, Long, String, Int), HashMap[String, StatisticsMetrics]]

for (stageAccumResult <- filteredStageAccumList) {
So the logic here, to be clear:
- We first get filteredStageAccumList from the stage-level aggregated accum metrics by doing a name match
- Then we iterate over that list (granularity is accumMeta + stageID)
- For each accumId, we find the sqlID, nodeID, and nodeName
- We maintain a cache of (sqlID, nodeID, nodeName, stageID) -> (accumName, statistic)
- In this, can there be multiple accumulable updates coming in for the same key? The value distinction is correct (name + statistic), but at the key level there can be multiple accum updates (see the sketch after this list)
- Also, a contextual doubt: for all the metrics being filtered, is generating them from a SQL metric view relevant? Per a previous discussion, the SQL metrics may not always be that accurate
- A better way could be getting them from the internal.metrics.* accums, but that is only relevant if the SQL view is not needed downstream (otherwise this can be ignored)
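A hedged sketch of the flow described above, using stand-in types (StatisticsMetrics and StageAccumResult here are simplified guesses, not the tool's actual definitions):

```scala
import scala.collection.mutable.HashMap

// Stand-in types for illustration only.
case class StatisticsMetrics(min: Long, med: Long, max: Long, total: Long)
case class StageAccumResult(accumId: Long, stageId: Int, name: String, stats: StatisticsMetrics)

object FilterCacheSketch {
  val accumIdToNodeInfoMap = HashMap.empty[Long, (Long, Long, String)]
  val nodeInfoToFilterMetricsMap =
    HashMap.empty[(Long, Long, String, Int), HashMap[String, StatisticsMetrics]]

  def cacheFilteredMetrics(filteredStageAccumList: Seq[StageAccumResult]): Unit = {
    for (res <- filteredStageAccumList) {
      // Resolve the accum id to its plan-node info, if known
      accumIdToNodeInfoMap.get(res.accumId).foreach { case (sqlId, nodeId, nodeName) =>
        // Cache under (sqlID, nodeID, nodeName, stageID)
        val key = (sqlId, nodeId, nodeName, res.stageId)
        val inner = nodeInfoToFilterMetricsMap.getOrElseUpdate(key, HashMap.empty)
        // The doubt above: a second update with the same (key, name)
        // overwrites the earlier statistic rather than merging it.
        inner(res.name) = res.stats
      }
    }
  }
}
```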
for (stageAccumResult <- filteredStageAccumList) {
  val accumId = stageAccumResult.accMetaRef.id
  // Retrieve node information if available
  accumIdToNodeInfoMap.get(accumId).foreach { case (sqlId, nodeId, nodeName) =>
Is there a way to eliminate this new map creation? We already have the accumId and the stageID, so we can get the accum + stage stats from accumManager, and I see a map sqlPlanNodeIdToStageIds above. This implementation is cleaner, but it would be good to verify whether the extra map can be avoided while keeping things clean.
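For illustration only, a sketch of the alternative shape this comment hints at; accumIdsForNode and statsForStage are invented stand-ins for whatever lookups the plan graph and accumManager actually expose, and the map shapes are guesses:

```scala
import scala.collection.mutable.HashMap

// Stand-in type for illustration only.
case class StatisticsMetrics(min: Long, med: Long, max: Long, total: Long)

object NoExtraMapSketch {
  // The existing map the comment points to: (sqlID, nodeID) -> stage ids
  val sqlPlanNodeIdToStageIds = HashMap.empty[(Long, Long), Set[Int]]
  // Hypothetical per-node accumulator ids and per-stage stats lookup
  def accumIdsForNode(sqlId: Long, nodeId: Long): Seq[Long] = Seq.empty
  def statsForStage(accumId: Long, stageId: Int): Option[StatisticsMetrics] = None

  def collect(): Seq[((Long, Long, Int), StatisticsMetrics)] = {
    for {
      ((sqlId, nodeId), stageIds) <- sqlPlanNodeIdToStageIds.toSeq
      stageId <- stageIds
      accumId <- accumIdsForNode(sqlId, nodeId)
      stats <- statsForStage(accumId, stageId)
    } yield ((sqlId, nodeId, stageId), stats) // no accumId -> node-info map needed
  }
}
```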
Contributes to #1374
Changes
- New output file filtered_diagnostic_metrics.csv
- Added generateFilteredDiagnosticAccums to compute filtered diagnostic metrics in core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala
- Added FilteredAccumDiagnosticMetrics to store the selected filter-related metric names and methods
- Added FilteredDiagnosticResult to represent the filtered diagnostic metrics result
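For readers skimming the description, a rough sketch of what the new result row could look like; the field names are guesses assembled from this PR's discussion (the (sqlID, nodeID, nodeName, stageID) key plus per-metric statistics), not the actual case class:

```scala
// Hypothetical shape of one row in filtered_diagnostic_metrics.csv.
case class FilteredDiagnosticResult(
    appIndex: Int,
    sqlId: Long,
    stageId: Int,
    nodeId: Long,
    nodeName: String,
    metricName: String,
    min: Long,
    median: Long,
    max: Long,
    total: Long)
```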
Testing
- Added test "test filtered diagnostic metrics" in core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala
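A shape-only sketch of how such a test could be framed in ScalaTest; runProfilerOnEventlog is a made-up stand-in for the suite's real helper that runs the tool and collects output file names:

```scala
import org.scalatest.funsuite.AnyFunSuite

class FilteredDiagnosticsSketchSuite extends AnyFunSuite {
  // Stand-in: pretend to run the Profiler and return the output files produced.
  def runProfilerOnEventlog(eventlog: String): Seq[String] =
    Seq("filtered_diagnostic_metrics.csv")

  test("test filtered diagnostic metrics") {
    val outputs = runProfilerOnEventlog("path/to/gpu_eventlog")
    // The new CSV should be among the Profiler's outputs.
    assert(outputs.contains("filtered_diagnostic_metrics.csv"))
  }
}
```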
Example Output