[Tune] [PBT] [Doc] Add example PBT notebook #28519

Merged on Oct 4, 2022 (23 commits)

@justinvyu (Contributor) commented Sep 14, 2022

Summary

Example notebook

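See the Colab version of the notebook here: https://colab.research.google.com/github/justinvyu/ray/blob/pbt_example_notebook/doc/source/tune/examples/pbt_visualization/pbt_visualization.ipynb. Some images that will show up in the Ray docs once merged are missing from it.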

The purpose of the example is to help new users understand what PBT is doing under the hood, and provide an example of using PBT with a function trainable. The notebook gives recommendations on how to set PBT-specific parameters such as perturbation_interval and how to perform checkpointing alongside PBT.

The example notebook replicates an experiment found in the original PBT paper (https://arxiv.org/pdf/1711.09846.pdf). The toy example in the paper is a good way of verifying that PBT behavior is correct (trials are being correctly exploited and the correct checkpoints are being used).

Notebook plots: [images omitted]

Figure in paper: [image: pbt_paper_exp]

PBT logging enhancements

This PR also includes some quality-of-life improvements to PBT logging that make it easier to understand what is happening.

Example:

hyperparam_mutations = {
    "a": tune.uniform(0, 1),
    "b": list(range(5)),
    "c": {
        "d": tune.uniform(2, 3),
        "e": {"f": [-1, 0, 1]},
    },
}

The above hyperparameter mutations config results in the following log after the PBT explore step:

[PBT] [Explore] Perturbed the hyperparameter config of trial trial_1:
a : 0.5 --- (* 1.2) --> 0.6
b : 2 --- (shift right) --> 3
c :
    d : 2.5 --- (* 1.2) --> 3.0
    e :
        f : 0 --- (shift right) --> 1
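
For readers who want to see how a nested log like the one above could be produced, here is a minimal sketch. It is not the code from this PR: the function name `format_explore_log` and the idea of passing the operations as a nested dict of short strings are assumptions made for illustration.

```python
from typing import Dict, List


def format_explore_log(old: Dict, new: Dict, operations: Dict, indent: int = 0) -> List[str]:
    """Render one line per hyperparameter, indenting nested sections by 4 spaces.

    Hypothetical helper: `operations` mirrors the config structure and holds a
    short description (e.g. "* 1.2") at each leaf.
    """
    pad = " " * indent
    lines: List[str] = []
    for key, op in operations.items():
        if isinstance(op, dict):
            # Nested section: print the key, then recurse one level deeper.
            lines.append(f"{pad}{key} :")
            lines.extend(format_explore_log(old[key], new[key], op, indent + 4))
        else:
            lines.append(f"{pad}{key} : {old[key]} --- ({op}) --> {new[key]}")
    return lines


# Values copied from the example log above.
old = {"a": 0.5, "b": 2, "c": {"d": 2.5, "e": {"f": 0}}}
new = {"a": 0.6, "b": 3, "c": {"d": 3.0, "e": {"f": 1}}}
ops = {"a": "* 1.2", "b": "shift right", "c": {"d": "* 1.2", "e": {"f": "shift right"}}}

log_lines = format_explore_log(old, new, ops)
print("\n".join(log_lines))
```

Keeping the operations in the same nested shape as the config means the formatter needs no knowledge of the search space itself.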

Some docs restructuring

I also took the chance to nest some Tune examples in sub-sections, since there were a lot of frameworks specific examples that were overloading the table of contents.

Related PRs

This test helped uncover some bugs fixed in the following PRs:

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Add missing logging logic from previous commit


Add missing resample operation log

@justinvyu added the tune and docs labels Sep 14, 2022
@justinvyu justinvyu self-assigned this Sep 14, 2022
@justinvyu justinvyu marked this pull request as draft September 14, 2022 21:56
…ons`

@justinvyu justinvyu marked this pull request as ready for review September 28, 2022 18:45
@Yard1 Yard1 requested review from c21 and removed request for c21 September 28, 2022 19:00
@xwjiang2010 (Contributor) left a comment

Thanks so much!
This looks great! Only some nits..

    custom_explore_fn: Optional[Callable],
-) -> Dict:
+) -> Tuple[Dict, Dict]:
    """Return a config perturbed as specified.
Contributor:

update?

Contributor Author:

What should be updated? The return type should be updated already.

Contributor:

The operations are also returned.
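
To make the two-part return value concrete, here is a rough sketch of a function with the `Tuple[Dict, Dict]` shape and a docstring that describes both parts. It is an illustration only, not the function under review: `explore_sketch`, its parameters, and the rule that every non-dict leaf is treated as a continuous value are all assumptions.

```python
import random
from typing import Dict, Tuple


def explore_sketch(
    config: Dict,
    mutations: Dict,
    rng: random.Random,
    factors: Tuple[float, float] = (1.2, 0.8),
) -> Tuple[Dict, Dict]:
    """Return a perturbed copy of `config` and the operations applied.

    Returns:
        new_config: same structure as `config`, with mutated values.
        operations: same structure as `mutations`; each leaf is a short
            string describing what was done to that hyperparameter.
    """
    new_config = dict(config)  # shallow copy; nested dicts are rebuilt below
    operations: Dict = {}
    for key, spec in mutations.items():
        if isinstance(spec, dict):
            # Nested search space: recurse so both outputs keep the nesting.
            new_config[key], operations[key] = explore_sketch(
                config[key], spec, rng, factors
            )
        else:
            # Assumption for this sketch: every leaf is continuous and is
            # scaled by one of the perturbation factors.
            factor = rng.choice(factors)
            new_config[key] = config[key] * factor
            operations[key] = f"* {factor}"
    return new_config, operations


config = {"a": 0.5, "c": {"d": 2.5}}
mutations = {"a": None, "c": {"d": None}}
new_config, operations = explore_sketch(config, mutations, random.Random(0))
```

Returning the operations alongside the config lets the caller log what happened without re-deriving it from the old and new values.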

Example of using a Trainable function with HyperBandScheduler.
Also uses the AsyncHyperBandScheduler.
- :doc:`/tune/examples/pbt_visualization/pbt_visualization`:
Configuring and running PBT and understanding the underlying algorithm behavior with a simple example.
Contributor:

mention this is to illustrate synchronous pbt?

…nd make runnable via colab

@krfricke (Contributor) left a comment

Amazing tutorial and changes! Just two minor nits.

cc @maxpumperla for docs approval (incl. structure)

new_config[key] = distribution[
    min(len(distribution) - 1, distribution.index(config[key]) + 1)
]
shift = random.choice([-1, 1])
Contributor:

Can we add a comment here that explains what we're doing?

Contributor:

Generally, let's add a few comments for this whole exploration block

new_idx = distribution.index(config[key]) + shift
new_idx = min(max(new_idx, 0), len(distribution) - 1)
new_config[key] = distribution[new_idx]
operations[key] = f"shift {'left' if shift == -1 else 'right'}"
Contributor:

Should this be "shift left (noop)" or similar if old_idx == new_idx? E.g. when we select shift = -1 when we're already at the first item

Contributor Author:

Right, this should include some indicator that we've hit the end of the list.

A few thoughts on this actually:

  • We might want to just wrap around from the end if trying to shift left at the first item. The fact that we don't wrap implies some kind of ordering to the items, but there is no guarantee there. Example:
    • "a": [100, 1, 50, 75]
    • 75 <- (shift left & wrap around) -- 100 -- (shift right) -> 1
    • It seems arbitrary that 100 can only be perturbed to the right.
  • Should a noop be an option here rather than always shifting left and right? SHERPA's implementation includes a "no shift" option.
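
The wrap-around idea from the first bullet can be sketched with modular indexing. This is only an illustration of the proposal, not what was merged; `shift_wrap` is a hypothetical name.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def shift_wrap(choices: Sequence[T], current: T, shift: int) -> T:
    """Move `shift` positions from `current`, wrapping past either end."""
    idx = choices.index(current)
    # Python's % always returns a non-negative result for a positive modulus,
    # so index -1 wraps to the last item.
    return choices[(idx + shift) % len(choices)]


choices = [100, 1, 50, 75]
left = shift_wrap(choices, 100, -1)   # wraps around to the last item
right = shift_wrap(choices, 100, +1)  # ordinary shift right
```

With wrapping, every item has two neighbours, so no value is limited to a single direction just because of its position in the list.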

Contributor:

Hm what does the original paper say?

Everything else seems to have an option not to shift, so I actually believe just going with [-1, 0, 1] should be fine.

Wrapping around is basically a matter of preference. Best would be to distinguish between ordinal and categorical, but since we don't have this, I'd say let's assume ordinal and leave the logic as is.

Contributor Author:

  1. The original paper doesn't really mention specifics on perturb, so I think adding the no-shift makes sense to align with other implementations.

  2. I see, can leave this for a PR in the future.
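
Putting the outcome of this thread together, a clamped shift with a no-shift option might look like the sketch below. It assumes the `[-1, 0, 1]` choice and the "(noop)" label suggested in the comments above; the helper name `shift_clamped` and the "no shift" label are hypothetical, and the merged code may differ.

```python
import random
from typing import Any, List, Tuple


def shift_clamped(choices: List[Any], current: Any, rng: random.Random) -> Tuple[Any, str]:
    """Pick a neighbour of `current` (or stay put), clamped to the list ends."""
    shift = rng.choice([-1, 0, 1])
    old_idx = choices.index(current)
    # Clamp so that shifting past either end leaves the index unchanged.
    new_idx = min(max(old_idx + shift, 0), len(choices) - 1)
    if shift == 0:
        label = "no shift"
    else:
        label = f"shift {'left' if shift == -1 else 'right'}"
        if new_idx == old_idx:
            # A move past the end of the list was requested, so nothing changed.
            label += " (noop)"
    return choices[new_idx], label


value, label = shift_clamped([-1, 0, 1], -1, random.Random(0))
```

Clamping treats the list as ordinal, which matches the decision above to leave wrap-around for a future PR.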

@justinvyu justinvyu requested a review from krfricke September 30, 2022 21:24
@krfricke krfricke merged commit da50ef4 into ray-project:master Oct 4, 2022
amogkam pushed a commit that referenced this pull request Nov 18, 2022
Follow-up to #28519, fixing the hierarchy to match the hierarchy contained within the pages.

Also moved MLFlow from ML Framework Examples to Experiment Tracking Examples

Signed-off-by: Matthew Deng <matt@anyscale.com>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
See [here](https://colab.research.google.com/github/justinvyu/ray/blob/pbt_example_notebook/doc/source/tune/examples/pbt_visualization/pbt_visualization.ipynb) for a Colab version of the notebook! Some images are missing that would show up in the Ray docs once merged.

The purpose of the example is to help new users understand what PBT is doing under the hood, and provide an example of using PBT with a function trainable. The notebook gives recommendations on how to set PBT-specific parameters such as `perturbation_interval` and how to perform checkpointing alongside PBT.

The example notebook replicates an experiment found in the [original PBT paper](https://arxiv.org/pdf/1711.09846.pdf). The toy example in the paper is a good way of verifying that PBT behavior is correct (trials are being correctly exploited and the correct checkpoints are being used).

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Follow-up to ray-project#28519, fixing the hierarchy to match the hierarchy contained within the pages.

Also moved MLFlow from ML Framework Examples to Experiment Tracking Examples

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>