Handle invalid cluster recommendation for Dataproc #1537
base: dev
Conversation
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Thanks @parthosa!
I understand that we don't want to make an invalid cluster recommendation. I am a little concerned about the implications of completely dropping the recommendations, especially for internal development and testing. Have we evaluated the impact of that decision on testing environments?
This reminds me of the "minimum CPU-core threshold" feature that temporarily broke almost all of the testing environments. Similarly, after making this change, many eventlogs might not generate any cluster recommendations.
If this is going to impact the testing environments, then we might consider allowing the constraints to be toggled on/off via configuration/environment variables, or enforcing the closest valid cluster recommendation so the AutoTuner always generates a valid recommendation.
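As a rough illustration of the configuration/environment-variable toggle suggested here (the variable name and object are hypothetical, not part of the actual tools code):

object ClusterConstraintToggle {
  // Defaults to enforcing CSP constraints when the variable is unset, so only
  // pipelines that explicitly opt out skip the validation.
  def enforceCspConstraints: Boolean =
    sys.env.get("RAPIDS_TOOLS_ENFORCE_CSP_CONSTRAINTS")
      .forall(v => v.trim.equalsIgnoreCase("true"))
}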
// TODO: This should be extended for validating the recommended cluster information
// for other platforms.
test(s"test invalid recommended cluster information JSON for platform - dataproc") {
  val logFile = s"$logDir/cluster_information/platform/invalid/dataproc.zstd"
Shall we rename that file to be more specific to the case it represents?
Renamed it to dataproc_invalid_num_workers.zstd
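For context, a rough sketch of what the renamed test might assert (runQualificationAndTestClusterInfo is mentioned in the PR description, but its signature and the summary fields below are assumptions):

test(s"test invalid recommended cluster information JSON for platform - dataproc") {
  val logFile = s"$logDir/cluster_information/platform/invalid/dataproc_invalid_num_workers.zstd"
  val clusterSummary = runQualificationAndTestClusterInfo(logFile, platformName = "dataproc")
  // With only one recommended worker, Dataproc's minimum is not met,
  // so no recommended cluster info should be emitted.
  assert(clusterSummary.exists(_.recommendedClusterInfo.isEmpty))
}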
    recommendedNodeInstanceInfo = Some(recommendedNodeInstance)
    recommendedClusterInfo = Some(validCluster)
  case Left(reason) =>
    logWarning(s"Failed to generate a cluster recommendation. Reason: $reason")
Can't we add an AutoTuner comment for that case instead of logWarning? That way, it will be kept as part of the output folder and can be processed as part of the AutoTuner's output.
We know that log messages are ignored when it comes to justifying the output.
That makes sense. Updated the logic to add it as an AutoTuner comment.
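For illustration, the updated branch might look roughly like this (validateRecommendedCluster matches the PR description, while appendComment and the surrounding variable names are assumptions about the AutoTuner code):

platform.validateRecommendedCluster(recommendedCluster) match {
  case Right(validCluster) =>
    recommendedNodeInstanceInfo = Some(recommendedNodeInstance)
    recommendedClusterInfo = Some(validCluster)
  case Left(reason) =>
    // Recorded as an AutoTuner comment so the reason is preserved in the output
    // folder and can be processed with the rest of the AutoTuner's output.
    appendComment(s"Failed to generate a cluster recommendation. Reason: $reason")
}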
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Tested this change partially in our internal test pipelines. The changes do not affect them.
This scenario is different as it does not impact the qualification numbers. All jobs that were previously qualified will remain qualified. On a side note:
There are two different aspects in question:
In the scenario in question, our logic that calculates the cluster size does not take the CSP restriction into consideration. That's a completely different thing from a missing property. Qualifying a job is another dimension, and it is certainly not the AutoTuner's purpose to disqualify or exclude jobs.
So can you please update the description to explain the logic this PR introduces to fix this situation? What I see is:
That is great, but this doesn't tell me what validateRecommendedCluster is doing to actually address the issue. Does it recommend using 2 workers of half the size, does it recommend using a single-node cluster, etc.?
@tgravescs Updated the PR description with the logic introduced.
Marking this as draft pending further requirements.
I'm surprised at this restriction in Dataproc. I agree with Ahmed. It seems odd to go through all the logic to pick a node and then at the last second drop it all because it's only 1 node. It also seems odd for us to recommend 2 nodes, because it's going to add cost to have another GPU. It would be good to decide what we do want to recommend - should we do 2 nodes, or maybe a single-node cluster instead of a multi-node cluster - if single node, you need extra cores for the driver. I'm guessing going with 2 nodes is the easiest thing, and it seems like that logic would be easy to update to handle, especially if it was already run on Dataproc CPU.
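As a rough sketch of the "bump to the platform minimum" alternative raised here (the types and the minimum of 2 workers are assumptions for illustration):

object ClosestValidClusterSketch {
  final case class ClusterSketch(instanceType: String, numWorkers: Int)

  // Raise the worker count to the CSP minimum instead of dropping the
  // recommendation, accepting the cost of the extra GPU node.
  def closestValidCluster(recommended: ClusterSketch, minWorkers: Int = 2): ClusterSketch =
    if (recommended.numWorkers < minWorkers) recommended.copy(numWorkers = minWorkers)
    else recommended
}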
Fixes #1521.
Currently, AutoTuner/Bootstrapper recommends a 1 x n1-standard-16 instance for the input CPU job, which used 8 cores and 2 instances. However, Dataproc does not support clusters with only one worker node.
This PR introduces validateRecommendedCluster, a validation mechanism for recommended cluster configurations. Platform-specific classes can override this method to enforce platform-specific constraints.
Logic
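A simplified sketch of how the validation hook and the Dataproc override might look (class shapes, field names, and the minimum-worker constant below are illustrative assumptions, not the actual Platform.scala code):

final case class RecommendedClusterInfo(instanceType: String, numWorkers: Int)

class Platform {
  // Base implementation accepts any recommendation; platform subclasses override
  // this to enforce CSP-specific constraints.
  def validateRecommendedCluster(
      cluster: RecommendedClusterInfo): Either[String, RecommendedClusterInfo] = Right(cluster)
}

class DataprocPlatform extends Platform {
  // Assuming a minimum of 2 worker nodes for illustration.
  private val minWorkerNodes = 2

  override def validateRecommendedCluster(
      cluster: RecommendedClusterInfo): Either[String, RecommendedClusterInfo] = {
    if (cluster.numWorkers < minWorkerNodes) {
      Left(s"Dataproc requires at least $minWorkerNodes worker nodes, " +
        s"but the recommendation had ${cluster.numWorkers}")
    } else {
      Right(cluster)
    }
  }
}

If validation fails, the reason is recorded as an AutoTuner comment instead of emitting the invalid cluster recommendation (see the review thread above).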
Code Changes
Enhancements to cluster recommendation validation:
core/src/main/scala/com/nvidia/spark/rapids/tool/Platform.scala: Introduced the validateRecommendedCluster method to validate the recommended cluster configuration, allowing subclasses to provide platform-specific validation.
core/src/main/scala/com/nvidia/spark/rapids/tool/Platform.scala: Implemented platform-specific validation in DataprocPlatform to ensure the number of worker nodes meets the minimum required by the platform.
Improvements to test coverage:
core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala: Modified tests to compare actual cluster information against expected values and added a new test to validate the recommended cluster information for invalid configurations.
core/src/test/scala/com/nvidia/spark/rapids/tool/qualification/QualificationSuite.scala: Refactored the runQualificationAndTestClusterInfo method to return the cluster summary, improving test readability and maintainability.
Test