
Handle invalid cluster recommendation for Dataproc #1537

Draft · wants to merge 2 commits into base: dev

Conversation


@parthosa parthosa commented Feb 7, 2025

Fixes #1521.

Currently, the AutoTuner/Bootstrapper recommends 1 x n1-standard-16 instance for the input CPU job, which used 8 cores and 2 instances. However, Dataproc does not support clusters with only one worker node.

This PR introduces validateRecommendedCluster, a validation mechanism for recommended cluster configurations. Platform-specific classes can override this method to enforce platform-specific constraints.

Logic

  • Dataproc: If the recommended clusterInfo has fewer worker nodes than the minimum supported by Dataproc (i.e., 2), mark the recommendation as invalid.
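The validation hook described above could look something like the following sketch. Only the method name validateRecommendedCluster comes from the PR; the case class, field names, and Either-based signature are illustrative stand-ins for the tool's actual internals.

```scala
// Hypothetical sketch of the platform-specific validation hook.
// RecommendedClusterInfo and the Either signature are assumptions;
// only validateRecommendedCluster is named in this PR.
case class RecommendedClusterInfo(numWorkerNodes: Int, instanceType: String)

trait Platform {
  // Base implementation accepts any recommendation.
  def validateRecommendedCluster(
      info: RecommendedClusterInfo): Either[String, RecommendedClusterInfo] =
    Right(info)
}

class DataprocPlatform extends Platform {
  // Dataproc does not support clusters with a single worker node.
  private val minWorkerNodes = 2

  override def validateRecommendedCluster(
      info: RecommendedClusterInfo): Either[String, RecommendedClusterInfo] =
    if (info.numWorkerNodes < minWorkerNodes) {
      Left(s"Dataproc requires at least $minWorkerNodes worker nodes, " +
        s"but the recommendation has ${info.numWorkerNodes}")
    } else {
      Right(info)
    }
}
```

With this shape, a 1-worker recommendation for Dataproc comes back as a Left and is marked invalid, while other platforms keep the permissive base behavior until they override the hook.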

Code Changes

Enhancements to cluster recommendation validation:

Improvements to test coverage:

Tests

  • Added an invalid event log and a unit test for Dataproc to verify that no cluster is recommended.
  • In the future, we could add similar constraints and unit tests for other platforms.

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa added the bug (Something isn't working) and core_tools (Scope the core module (scala)) labels on Feb 7, 2025
@parthosa parthosa self-assigned this Feb 7, 2025
@parthosa parthosa marked this pull request as ready for review February 7, 2025 22:50
@amahussein amahussein left a comment

Thanks @parthosa !

I understand that we don't want to make an invalid cluster recommendation. I am a little concerned about the implications of completely dropping the recommendation, especially for internal development and testing. Have we evaluated the impact of that decision on testing environments?
This reminds me of the "minimum CPU-core threshold feature" that temporarily almost killed all the testing environments. Similarly, after making this change, many eventlogs might not generate any cluster recommendations.

If this is going to impact the testing environments, then we might consider allowing the constraints to be toggled on/off via configuration/env variables, or enforcing the closest valid cluster recommendation so the AutoTuner always generates a valid one.

// TODO: This should be extended for validating the recommended cluster information
// for other platforms.
test(s"test invalid recommended cluster information JSON for platform - dataproc") {
val logFile = s"$logDir/cluster_information/platform/invalid/dataproc.zstd"
Collaborator

shall we rename that file to be more specific for the case it represents?

Collaborator Author

Renamed it to dataproc_invalid_num_workers.zstd

recommendedNodeInstanceInfo = Some(recommendedNodeInstance)
recommendedClusterInfo = Some(validCluster)
case Left(reason) =>
logWarning(s"Failed to generate a cluster recommendation. Reason: $reason")
Collaborator

Can't we add an AutoTuner comment to record that case instead of a logWarning? That way, it is kept as part of the output folder and can be processed as part of the AutoTuner's output.
We know that log messages are ignored when it comes to justifying the output.

Collaborator Author

That makes sense. Updated the logic to add it as an AutoTuner comment.
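The change discussed here, routing the validation failure into the AutoTuner's comments output rather than a log line, might be sketched like this. The comments buffer, ClusterInfo shape, and applyValidation helper are simplified stand-ins, not the tool's actual internals.

```scala
// Minimal sketch of surfacing a failed cluster recommendation as an
// AutoTuner comment instead of a logWarning. All names here are
// illustrative assumptions.
import scala.collection.mutable.ListBuffer

case class ClusterInfo(numWorkers: Int)

val comments = ListBuffer.empty[String]
var recommendedClusterInfo: Option[ClusterInfo] = None

def applyValidation(result: Either[String, ClusterInfo]): Unit = result match {
  case Right(validCluster) =>
    recommendedClusterInfo = Some(validCluster)
  case Left(reason) =>
    // Record the reason in the AutoTuner output so users can see why no
    // cluster was recommended, rather than burying it in the logs.
    comments += s"Failed to generate a cluster recommendation. Reason: $reason"
}
```

Because the reason lands in the comments collection, it ends up in the output folder alongside the other AutoTuner recommendations and can be post-processed like any other comment.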

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa
Collaborator Author

Have we evaluated the impact of that decision on testing environments?

Tested this change partially in our internal test pipelines. The changes do not affect them.

This reminds me of the "minimum CPU-core threshold feature" that almost killed all the testing environment temporarily.

This scenario is different, as it does not impact the qualification numbers. All jobs that were previously qualified will still be qualified.

On a side note:
The changes in this PR are similar to many existing cases where we cannot give a valid cluster recommendation due to missing properties. This raises a broader question: should we even qualify apps where we cannot give a valid cluster recommendation?

@amahussein
Collaborator

This scenario is different, as it does not impact the qualification numbers. All jobs that were previously qualified will still be qualified.

On a side note: The changes in this PR are similar to many existing cases where we cannot give a valid cluster recommendation due to missing properties. This raises a broader question: should we even qualify apps where we cannot give a valid cluster recommendation?

There are two different aspects in question:

  • My point was not about qualifying a job. It was about generating a recommended cluster configuration from a given eventlog. With this PR, I cannot tell which eventlogs are going to get a successful cluster recommendation vs. which ones are not. It is almost trial-and-error. This leads to two other troubleshooting issues:
    • How to initialize the sampling eventlogs to test end-to-end: which ones are supposed to generate recommendations vs. which do not?
    • When troubleshooting: how do we justify the missing cluster recommendation to a customer? Is the reason that their job is small, or something else?

@amahussein
Collaborator

On a side note:
The changes in PR is similar to many existing cases where we cannot give a valid cluster recommendation due to missing properties. This brings a broader question, should we even qualify the apps where we cannot give a valid cluster recommendation.

In the scenario in question, our logic that calculates the cluster size does not take CSP restrictions into consideration. That is a completely different thing from a missing property.
The fix for this is to initialize the cluster recommendation to meet the platform restriction, then increase the executors/cores if applicable.

Qualifying a job is another dimension and it is certainly not AutoTuner's purpose to disqualify or to exclude jobs.
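The alternative suggested above, clamping the recommendation up to the platform minimum instead of dropping it, could be sketched as follows. The function name, the (workers, coresPerNode) shape, and the constant-total-cores heuristic are all assumptions for illustration, not the tool's actual sizing logic.

```scala
// Hypothetical sketch of "enforce the closest valid cluster": start from
// the platform minimum, then size per-node cores so the total core count
// stays roughly constant. Names and heuristic are illustrative only.
def clampToPlatformMinimum(
    numWorkers: Int,
    totalCores: Int,
    minWorkers: Int): (Int, Int) = {
  if (numWorkers >= minWorkers) {
    // Already valid: keep the worker count, split cores evenly.
    (numWorkers, totalCores / numWorkers)
  } else {
    // Bump workers up to the platform minimum and shrink per-node cores
    // so the overall capacity is preserved as closely as possible.
    val coresPerNode = math.max(1, totalCores / minWorkers)
    (minWorkers, coresPerNode)
  }
}
```

Under this heuristic, the 1 x 16-core recommendation from the PR description would become 2 x 8-core workers on Dataproc, so the AutoTuner always emits a valid cluster instead of none.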

@tgravescs
Collaborator

Can you please update the description to explain the logic this PR introduces to fix this situation?

What I see is:

This PR introduces validateRecommendedCluster, a validation mechanism for recommended cluster configurations. Platform-specific classes can override this method to enforce platform-specific constraints.

That is great, but it doesn't tell me what validateRecommendedCluster is actually doing to address the issue. Does it recommend using 2 workers of half the size, does it recommend using a single-node cluster, etc.?

@parthosa
Collaborator Author

@tgravescs Updated the PR description with the logic introduced.

@parthosa
Collaborator Author

Marking this as draft pending further requirements.

@parthosa parthosa marked this pull request as draft February 11, 2025 22:04
@tgravescs
Collaborator

I'm surprised at this restriction in Dataproc. I agree with Ahmed: it seems odd to go through all the logic to pick a node and then at the last second drop it all because it's only 1 node. It also seems odd for us to recommend 2 nodes, because that adds cost for another GPU. It would be good to decide what we do want to recommend: 2 nodes, or maybe a single node instead of a cluster (if single node, you need extra cores for the driver). I'm guessing going with 2 nodes is the easiest option, and it seems like that logic would be easy to update, especially if the job was already run on Dataproc CPU.

Development

Successfully merging this pull request may close these issues.

[BUG] AutoTuner should not recommend jobs with 1 worker nodes in Dataproc
3 participants