
Handle invalid cluster recommendation for Dataproc #1537

Draft · wants to merge 2 commits into base: dev

Conversation


@parthosa parthosa commented Feb 7, 2025

Fixes #1521.

Currently, the AutoTuner/Bootstrapper recommends 1 x n1-standard-16 instance for the input CPU job, which used 8 cores and 2 instances. However, Dataproc does not support clusters with only one worker node.

This PR introduces validateRecommendedCluster, a validation mechanism for recommended cluster configurations. Platform-specific classes can override this method to enforce platform-specific constraints.

Logic

  • Dataproc: If the recommended clusterInfo has fewer worker nodes than the minimum supported by Dataproc (i.e., 2), mark the recommendation as invalid.
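The validation hook described above could look something like the following sketch. Only the method name validateRecommendedCluster comes from the PR; the case class, field names, and Either-based signature are illustrative stand-ins for the tool's actual internals.

```scala
// Hypothetical sketch of the platform-specific validation hook.
// RecommendedClusterInfo and the Either signature are assumptions;
// only validateRecommendedCluster is named in this PR.
case class RecommendedClusterInfo(numWorkerNodes: Int, instanceType: String)

trait Platform {
  // Base implementation accepts any recommendation.
  def validateRecommendedCluster(
      info: RecommendedClusterInfo): Either[String, RecommendedClusterInfo] =
    Right(info)
}

class DataprocPlatform extends Platform {
  // Dataproc does not support clusters with a single worker node.
  private val minWorkerNodes = 2

  override def validateRecommendedCluster(
      info: RecommendedClusterInfo): Either[String, RecommendedClusterInfo] =
    if (info.numWorkerNodes < minWorkerNodes) {
      Left(s"Dataproc requires at least $minWorkerNodes worker nodes, " +
        s"but the recommendation has ${info.numWorkerNodes}")
    } else {
      Right(info)
    }
}
```

With this shape, a 1-worker recommendation for Dataproc comes back as a Left and is marked invalid, while other platforms keep the permissive base behavior until they override the hook.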

Code Changes

Enhancements to cluster recommendation validation:

Improvements to test coverage:

Tests

  • Added an invalid event log and a unit test for Dataproc to verify that no cluster is recommended.
  • In the future, we could add similar constraints and unit tests for other platforms.

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa added the bug (Something isn't working) and core_tools (Scope the core module (scala)) labels on Feb 7, 2025
@parthosa parthosa self-assigned this Feb 7, 2025
@parthosa parthosa marked this pull request as ready for review February 7, 2025 22:50
@amahussein amahussein left a comment

Thanks @parthosa !

I understand that we don't want to make an invalid cluster recommendation. I am a little concerned about the implications of completely dropping the recommendation, especially for internal development and testing. Have we evaluated the impact of that decision on testing environments?
This reminds me of the "minimum CPU-core threshold feature" that temporarily almost killed all the testing environments. Similarly, after making this change, many eventlogs might not generate any cluster recommendations.

If this is going to impact the testing environments, then we might consider allowing the constraints to be toggled on/off via configuration/env variables, or enforcing the closest valid cluster recommendation so the AutoTuner always generates a valid one.

// TODO: This should be extended for validating the recommended cluster information
// for other platforms.
test(s"test invalid recommended cluster information JSON for platform - dataproc") {
val logFile = s"$logDir/cluster_information/platform/invalid/dataproc.zstd"
Collaborator

shall we rename that file to be more specific for the case it represents?

Collaborator Author

Renamed it to dataproc_invalid_num_workers.zstd

recommendedNodeInstanceInfo = Some(recommendedNodeInstance)
recommendedClusterInfo = Some(validCluster)
case Left(reason) =>
logWarning(s"Failed to generate a cluster recommendation. Reason: $reason")
Collaborator

Can't we add an AutoTuner comment to record that case instead of a logWarning? That way, it is kept as part of the output folder and can be processed as part of the AutoTuner's output.
We know that log messages are ignored when it comes to justifying the output.

Collaborator Author

That makes sense. Updated the logic to add it as an AutoTuner comment.
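The change discussed here, routing the validation failure into the AutoTuner's comments output rather than a log line, might be sketched like this. The comments buffer, ClusterInfo shape, and applyValidation helper are simplified stand-ins, not the tool's actual internals.

```scala
// Minimal sketch of surfacing a failed cluster recommendation as an
// AutoTuner comment instead of a logWarning. All names here are
// illustrative assumptions.
import scala.collection.mutable.ListBuffer

case class ClusterInfo(numWorkers: Int)

val comments = ListBuffer.empty[String]
var recommendedClusterInfo: Option[ClusterInfo] = None

def applyValidation(result: Either[String, ClusterInfo]): Unit = result match {
  case Right(validCluster) =>
    recommendedClusterInfo = Some(validCluster)
  case Left(reason) =>
    // Record the reason in the AutoTuner output so users can see why no
    // cluster was recommended, rather than burying it in the logs.
    comments += s"Failed to generate a cluster recommendation. Reason: $reason"
}
```

Because the reason lands in the comments collection, it ends up in the output folder alongside the other AutoTuner recommendations and can be post-processed like any other comment.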

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa
Collaborator Author

Have we evaluated the impact of that decision on testing environments?

Tested this change partially in our internal test pipelines. The changes do not affect them.

This reminds me of the "minimum CPU-core threshold feature" that almost killed all the testing environment temporarily.

This scenario is different, as it does not impact the qualification numbers. All jobs that were previously qualified will still be qualified.

On a side note:
The changes in this PR are similar to many existing cases where we cannot give a valid cluster recommendation due to missing properties. This raises a broader question: should we even qualify apps where we cannot give a valid cluster recommendation?

@amahussein
Collaborator

This scenario is different, as it does not impact the qualification numbers. All jobs that were previously qualified will still be qualified.

On a side note: The changes in this PR are similar to many existing cases where we cannot give a valid cluster recommendation due to missing properties. This raises a broader question: should we even qualify apps where we cannot give a valid cluster recommendation?

There are two different aspects in question:

  • My point was not about qualifying a job. It was about generating a recommended cluster configuration from a given eventlog. With this PR, I cannot tell which eventlogs are going to get a successful cluster recommendation vs. which ones are not. It is almost trial-and-error. This leads to two other troubleshooting issues:
    • How to initialize the sampling eventlogs to test end-to-end: which ones are supposed to generate recommendations vs. which do not?
    • When troubleshooting: how do we justify the missing cluster recommendation to a customer? Is the reason that their job is small, or something else?

@amahussein
Collaborator

On a side note:
The changes in PR is similar to many existing cases where we cannot give a valid cluster recommendation due to missing properties. This brings a broader question, should we even qualify the apps where we cannot give a valid cluster recommendation.

In the scenario in question, our logic that calculates the cluster size does not take CSP restrictions into consideration. That is a completely different thing from a missing property.
The fix for this is to initialize the cluster recommendation to meet the platform restriction, then increase the executors/cores if applicable.

Qualifying a job is another dimension and it is certainly not AutoTuner's purpose to disqualify or to exclude jobs.
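The alternative suggested above, clamping the recommendation up to the platform minimum instead of dropping it, could be sketched as follows. The function name, the (workers, coresPerNode) shape, and the constant-total-cores heuristic are all assumptions for illustration, not the tool's actual sizing logic.

```scala
// Hypothetical sketch of "enforce the closest valid cluster": start from
// the platform minimum, then size per-node cores so the total core count
// stays roughly constant. Names and heuristic are illustrative only.
def clampToPlatformMinimum(
    numWorkers: Int,
    totalCores: Int,
    minWorkers: Int): (Int, Int) = {
  if (numWorkers >= minWorkers) {
    // Already valid: keep the worker count, split cores evenly.
    (numWorkers, totalCores / numWorkers)
  } else {
    // Bump workers up to the platform minimum and shrink per-node cores
    // so the overall capacity is preserved as closely as possible.
    val coresPerNode = math.max(1, totalCores / minWorkers)
    (minWorkers, coresPerNode)
  }
}
```

Under this heuristic, the 1 x 16-core recommendation from the PR description would become 2 x 8-core workers on Dataproc, so the AutoTuner always emits a valid cluster instead of none.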

@tgravescs
Collaborator

Can you please update the description to explain the logic this PR introduces to fix this situation?

What I see is:

This PR introduces validateRecommendedCluster, a validation mechanism for recommended cluster configurations. Platform-specific classes can override this method to enforce platform-specific constraints.

That is great, but it doesn't tell me what validateRecommendedCluster is actually doing to address the issue. Does it recommend using 2 workers of half the size, does it recommend using a single-node cluster, etc.?

@parthosa
Collaborator Author

@tgravescs Updated the PR description with the logic introduced.

@parthosa
Collaborator Author

Marking this as draft pending further requirements.

@parthosa parthosa marked this pull request as draft February 11, 2025 22:04
@tgravescs
Collaborator

I'm surprised at this restriction in Dataproc. I agree with Ahmed: it seems odd to go through all the logic to pick a node and then at the last second drop it all because it's only 1 node. It also seems odd for us to recommend 2 nodes, because that adds cost for another GPU. It would be good to decide what we do want to recommend: 2 nodes, or maybe a single node instead of a cluster (if single node, you need extra cores for the driver). I'm guessing going with 2 nodes is the easiest option, and it seems like that logic would be easy to update, especially if the job was already run on Dataproc CPU.

Development

Successfully merging this pull request may close these issues.

[BUG] AutoTuner should not recommend jobs with 1 worker nodes in Dataproc
3 participants