[RLlib] Docs update: New Algorithm checkpoints, Policy checkpoints and Policy model exports in native format #28812

Merged

Conversation

@sven1977 sven1977 commented Sep 27, 2022

Docs update: New Algorithm checkpoints, Policy checkpoints and Policy model exports in native format

Details:
On Algorithm checkpoints:

  • All Algorithm checkpoints now use the AIR Checkpoint mechanism. I.e. my_algo.restore([some AIR checkpoint]) works as well as Algorithm.restore([some path to a checkpoint dir]). The checkpoint directory structure will change from:
.
..
checkpoint-[some iter num]

to:

.
..
policies/
    policy_1/
        policy_state.pkl
    policy_2/
        policy_state.pkl
checkpoint_version.txt
state.pkl
  • Algorithm checkpoints now have a version (e.g. "v0", "v1") stored in the checkpoint dir under "checkpoint_version.txt". This helps keep checkpoint handling fully backward compatible from Ray 2.0 on. Test cases introduced in this PR confirm that this is, and remains, the case.
  • Algorithm checkpoints now contain a sub-directory ("policies") which has further sub-directories (named after the policies' IDs) that contain the individual policy checkpoints (see below). This allows for easier decomposition and re-assembly of Policies within an Algorithm checkpoint (e.g. restore an Algorithm from a checkpoint, but only with policies A and B, instead of the original A, B, and C, or restoring a Policy instance individually).
  • Algorithm gets two new static utilities: from_checkpoint() and from_state(). Both return a new Algorithm object, given a checkpoint (directory or object) or a state dict, respectively. I.e.: my_new_algo = Algorithm.from_checkpoint([path to AIR checkpoint OR AIR checkpoint obj]). Nothing other than the checkpoint is needed, in particular not the original config.
  • Test cases have been added to keep checkpoint backward compatibility and to test these new utilities and dir structures.
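
To make the new layout concrete, here is a minimal stdlib-only sketch that builds a stand-in for the directory structure listed above and then inspects it. It does not use RLlib: `make_mock_checkpoint`, `describe_checkpoint`, and `load_policy_states` are hypothetical helpers, the pickled payloads are placeholders, and the version string "v1" is simply taken from the example in this description.

```python
import pickle
import tempfile
from pathlib import Path


def make_mock_checkpoint(root, policy_ids, version="v1"):
    """Create a stand-in for the new layout (hypothetical helper, not RLlib)."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for pid in policy_ids:
        policy_dir = root / "policies" / pid
        policy_dir.mkdir(parents=True)
        # Placeholder payload; a real policy_state.pkl holds the policy's state.
        (policy_dir / "policy_state.pkl").write_bytes(pickle.dumps({"policy_id": pid}))
    (root / "checkpoint_version.txt").write_text(version)
    (root / "state.pkl").write_bytes(pickle.dumps({"policy_ids": list(policy_ids)}))
    return root


def describe_checkpoint(checkpoint_dir):
    """Return the version string and the policy IDs found in a checkpoint dir."""
    checkpoint_dir = Path(checkpoint_dir)
    version_file = checkpoint_dir / "checkpoint_version.txt"
    # A directory without the version file is reported with version None.
    version = version_file.read_text().strip() if version_file.is_file() else None
    policies_dir = checkpoint_dir / "policies"
    policy_ids = (
        sorted(p.name for p in policies_dir.iterdir() if p.is_dir())
        if policies_dir.is_dir()
        else []
    )
    return {"version": version, "policy_ids": policy_ids}


def load_policy_states(checkpoint_dir, keep=None):
    """Unpickle only the selected policies' state files (all if keep is None)."""
    found = describe_checkpoint(checkpoint_dir)["policy_ids"]
    wanted = found if keep is None else [pid for pid in found if pid in keep]
    states = {}
    for pid in wanted:
        state_file = Path(checkpoint_dir) / "policies" / pid / "policy_state.pkl"
        with state_file.open("rb") as f:
            states[pid] = pickle.load(f)
    return states


with tempfile.TemporaryDirectory() as tmp:
    ckpt = make_mock_checkpoint(Path(tmp) / "ckpt", ["policy_1", "policy_2", "policy_3"])
    info = describe_checkpoint(ckpt)
    subset = load_policy_states(ckpt, keep={"policy_1", "policy_2"})
```

Because each policy lives in its own sub-directory, picking a subset is a matter of choosing which sub-directories to read, which is the decomposition the "policies" bullet describes.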

On Policy Checkpoints:

  • Policy checkpoints now use the AIR Checkpoint mechanism. I.e. Policy.export_checkpoint() produces an AIR Checkpoint directory with all the policy's state in it.
  • Policy gets two new static utilities: from_checkpoint() and from_state(), both of which return new Policy objects, given a Policy checkpoint dir or object or a Policy state, respectively.
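
The from_checkpoint() / from_state() pairing can be sketched as two alternate constructors, where the checkpoint variant only loads the state and hands it to the state variant. The class below is a toy stand-in, not RLlib's Policy: `ToyPolicy`, its fields, and its use of a single `policy_state.pkl` file are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


class ToyPolicy:
    """Toy stand-in used to illustrate the pattern; not RLlib's Policy class."""

    def __init__(self, weights, steps_trained=0):
        self.weights = list(weights)
        self.steps_trained = steps_trained

    def get_state(self):
        """Return everything needed to rebuild this object."""
        return {"weights": list(self.weights), "steps_trained": self.steps_trained}

    def export_checkpoint(self, checkpoint_dir):
        """Write the state into a checkpoint directory and return its path."""
        checkpoint_dir = Path(checkpoint_dir)
        checkpoint_dir.mkdir(parents=True, exist_ok=True)
        with (checkpoint_dir / "policy_state.pkl").open("wb") as f:
            pickle.dump(self.get_state(), f)
        return checkpoint_dir

    @staticmethod
    def from_state(state):
        """Build a new object from a state dict alone."""
        return ToyPolicy(state["weights"], state["steps_trained"])

    @staticmethod
    def from_checkpoint(checkpoint_dir):
        """Build a new object from a checkpoint directory alone."""
        with (Path(checkpoint_dir) / "policy_state.pkl").open("rb") as f:
            state = pickle.load(f)
        return ToyPolicy.from_state(state)


with tempfile.TemporaryDirectory() as tmp:
    original = ToyPolicy([0.1, 0.2], steps_trained=500)
    restored = ToyPolicy.from_checkpoint(original.export_checkpoint(Path(tmp) / "policy_1"))
```

Routing from_checkpoint() through from_state() keeps a single code path for reconstruction, so both entry points stay consistent.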

On native keras/PyTorch models being part of a Policy checkpoint (optional):

  • A new config option, config.checkpointing(checkpoints_contain_native_model_files=True), makes Policies also try to write their NN model as a native Keras/torch saved model into the given checkpoint directory (under the sub-dir "model"). This may still fail (gracefully) in some cases, e.g. for certain TfModelV2s whose keras self.base_model cannot be discovered easily. This problem will be fully solved by the ongoing RLModule/RLTrainer API efforts.
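
The "may still fail (gracefully)" behavior can be pictured as follows: the policy state is always written first, and the optional native export into the "model" sub-directory is attempted afterwards inside a try/except. This is a sketch of the idea only, not the code in this PR; `write_policy_checkpoint` and the two exporter callables are hypothetical.

```python
import logging
import pickle
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def write_policy_checkpoint(checkpoint_dir, policy_state, export_native_model=None):
    """Write the state, then optionally try a native model export.

    Returns True only if a native model was requested and written.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    # The policy state is written unconditionally, before any optional step.
    with (checkpoint_dir / "policy_state.pkl").open("wb") as f:
        pickle.dump(policy_state, f)

    if export_native_model is None:
        return False
    model_dir = checkpoint_dir / "model"
    model_dir.mkdir(exist_ok=True)
    try:
        export_native_model(model_dir)
    except Exception as exc:
        # Graceful failure: warn and keep the otherwise valid checkpoint.
        logger.warning("Native model export failed, checkpoint kept: %s", exc)
        return False
    return True


def working_export(model_dir):
    """Hypothetical exporter that succeeds."""
    (Path(model_dir) / "model.bin").write_bytes(b"\x00")


def failing_export(model_dir):
    """Hypothetical exporter that fails, e.g. because no base model was found."""
    raise RuntimeError("base_model not found")


with tempfile.TemporaryDirectory() as tmp:
    ok = write_policy_checkpoint(Path(tmp) / "a", {"w": [1.0]}, working_export)
    failed = write_policy_checkpoint(Path(tmp) / "b", {"w": [1.0]}, failing_export)
    state_survived = (Path(tmp) / "b" / "policy_state.pkl").is_file()
```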

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

sven1977 added 30 commits June 8, 2022 17:46
…l_export_overhaul

Signed-off-by: sven1977 <svenmika1977@gmail.com>

# Conflicts:
#	rllib/examples/export/onnx_tf.py
#	rllib/examples/export/onnx_torch.py
#	rllib/policy/dynamic_tf_policy_v2.py
#	rllib/policy/eager_tf_policy_v2.py
#	rllib/policy/tests/test_policy.py
#	rllib/policy/torch_policy_v2.py
Signed-off-by: sven1977 <svenmika1977@gmail.com>
@@ -214,6 +214,7 @@ parts:
- file: rllib/user-guides
sections:
- file: rllib/rllib-models
- file: rllib/rllib-checkpoints-and-exports
Contributor

cool, thanks! @sven1977 as this is a user guide, I think you also want to add this new doc to the panels in user-guides.rst

Contributor

hmm, maybe we should take a closer look at that gallery, too. For instance, I see that "connectors" are also not in this gallery, although they show up in the TOC (and main navigation of the docs). this should align

Contributor Author

  • was already added to user-guides.rst
  • @gjoliver on adding connectors docs to list of RLlib user guides (user-guides.rst).


from ray.rllib.algorithms.ppo import PPOConfig # noqa

# Create a new Algorithm (which contains a Policy, which contains a NN Model).
Contributor

This whole import is too long and consists of 80%+ comments. I think this doesn't read very well on the docs (as a reader I'm not sure if I'm supposed to read all this, it looks as if all those comments are there by "accident"):

[Screenshot, 2022-10-07 13:23:55]

Contributor

basically this import and maybe the last one suffer from this a little, other than that I think this looks great btw.

Contributor Author

I'll split it up and move the comments into the rst file.

Contributor Author

Left the keras-related comments in the code, but split up the 3 different ways to save your models (direct, via policy checkpoint, via algo checkpoint).

Signed-off-by: sven1977 <svenmika1977@gmail.com>
@maxpumperla maxpumperla left a comment

Almost there, just the TOC and include file.

@@ -140,7 +140,7 @@ Serving and Offline
- `Saving experiences <https://github.com/ray-project/ray/blob/master/rllib/examples/saving_experiences.py>`__:
Example of how to externally generate experience batches in RLlib-compatible format.
- `Finding a checkpoint using custom criteria <https://github.com/ray-project/ray/blob/master/rllib/examples/checkpoint_by_custom_criteria.py>`__:
-Example of how to find a checkpoint after a `Tuner.fit()` via some custom defined criteria.
+Example of how to find a `checkpoint <rllib-saving-and-loading-algos-and-policies.html>`__ after a `Tuner.fit()` via some custom defined criteria.
Contributor

not a big deal, but using a :ref: here is more stable and prevents outdated links when moving files etc.

Contributor Author

Done, but can you check whether I did this correctly? Not sure how sphinx infers the actual html file. E.g. I don't find any serve-rllib-tutorial either (referenced further above this one here).

@maxpumperla maxpumperla Oct 19, 2022

@sven1977 yeah, I don't like the way Sphinx solves this. Essentially, if there's a file called serve-rllib-tutorial.md in the source, located relative to the referencing file, you can use that as a doc ref. Otherwise (and this is usually much better), you'd just add a .. _serve-rllib-tutorial: label in rst, or (serve-rllib-tutorial)= in markdown, to the respective doc and reference that, see here:

https://raw.githubusercontent.com/ray-project/ray/23b3a599b9df8100558c477e94b0b19b1a38ac27/doc/source/serve/tutorials/rllib.md

Contributor

so in this case, please put .. _rllib-saving-and-loading-algos: at the top of your new doc/source/rllib/rllib-saving-and-loading-algos-and-policies.rst file. then you can reference it as:

:ref:`my reference <rllib-saving-and-loading-algos>`

this should also fix the build

@@ -0,0 +1,308 @@
# flake8: noqa

# __create-algo-checkpoint-begin__
Contributor

By convention (and I think this is a good one to keep consistent), we have a doc_code folder like here for Tune:

https://github.com/ray-project/ray/tree/master/doc/source/tune/doc_code

which has all the code referenced in docs. You can then basically copy this block in the bazel BUILD file to auto-test all imported snippets. I'm aware that this would likely also happen with rllib/examples but I think it'd be great to keep this code close to doc-sources (which e.g. makes it likely easier for new contributors to understand the structure).

https://github.com/ray-project/ray/blob/master/doc/BUILD#L155-L164
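
The `# __create-algo-checkpoint-begin__` comment in the diff above looks like a snippet marker, i.e. a way for the docs to include only the region of a script between two marker comments (Sphinx's literalinclude directive supports this via its start-after and end-before options). The sketch below imitates that selection in plain Python so the effect is easy to see. It assumes a matching `# __create-algo-checkpoint-end__` marker, which is not shown in this conversation, and `extract_snippet` is a hypothetical helper, not part of Sphinx or Ray.

```python
def extract_snippet(source, name):
    """Return the lines strictly between the begin and end markers for `name`."""
    begin = f"# __{name}-begin__"
    end = f"# __{name}-end__"  # assumed naming for the closing marker
    lines = source.splitlines()
    stripped = [line.strip() for line in lines]
    # list.index raises ValueError if a marker is missing, which surfaces typos.
    start = stripped.index(begin) + 1
    stop = stripped.index(end, start)
    return "\n".join(lines[start:stop])


example = """\
# flake8: noqa

# __create-algo-checkpoint-begin__
print("snippet body")
# __create-algo-checkpoint-end__

print("not shown in the docs")
"""

snippet = extract_snippet(example, "create-algo-checkpoint")
```

Since the whole file still runs as an ordinary script, a CI job can execute it end to end while the docs show only the marked region.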

Contributor

Should be a simple move... apart from that, this is exactly what I had in mind! :D

Contributor Author

Dumb question on this one:
If we currently run all rllib/examples/documentation/* via the CI, how would we reference the new location (doc/source/rllib/doc_code/*) from the rllib/BUILD file?

Contributor

we should simply consider those two different CI runs. I didn't realise there was an rllib/examples/documentation folder already. We could either leave it as is, or (in a follow-up PR) move the contents of that folder into doc/source/rllib/doc_code. We can then delete that part from the rllib/BUILD and add the resp. bit to doc/BUILD. that way we don't have to think about the issue you brought up. wdyt?

Contributor

let's simply skip this discussion for now and fix this later, don't think it's a big deal either way

Signed-off-by: sven1977 <svenmika1977@gmail.com>
@maxpumperla maxpumperla left a comment

approving, pending the ref fix

Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 merged commit 9ece8ac into ray-project:master Oct 20, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…d Policy model exports in native format (ray-project#28812)

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@sven1977 sven1977 deleted the docs_update_checkpointing_and_exports branch December 27, 2024 20:27