[RLlib] Docs update: New Algorithm checkpoints, Policy checkpoints and Policy model exports in native format #28812

sven1977 · 2022-09-27T12:19:23Z

Docs update: New Algorithm checkpoints, Policy checkpoints and Policy model exports in native format

Details:
On Algorithm checkpoints:

All Algorithm checkpoints now use the AIR Checkpoint mechanism, i.e. my_algo.restore([some AIR checkpoint]) works as well as Algorithm.restore([some path to a checkpoint dir]). The checkpoint directory structure changes from:

.
..
checkpoint-[some iter num]

to:

.
..
policies/
    policy_1/
        policy_state.pkl
    policy_2/
        policy_state.pkl
checkpoint_version.txt
state.pkl

Algorithm checkpoints now have a version (e.g. "v0", "v1") stored in the checkpoint dir under "checkpoint_version.txt". This will help keep checkpoint handling fully backward compatible from Ray 2.0 on. Test cases introduced in this PR confirm that this is, and remains, the case.
Algorithm checkpoints now contain a sub-directory ("policies") which has further sub-directories (named after the policies' IDs) that contain the individual policy checkpoints (see below). This allows for easier decomposition and re-assembly of Policies within an Algorithm checkpoint (e.g. restoring an Algorithm from a checkpoint, but only with policies A and B instead of the original A, B, and C, or restoring a Policy instance individually).
Algorithm gets two new static utilities: from_checkpoint() and from_state(), both of which return new Algorithm objects, given a checkpoint (dir or object) or a state dict, respectively. For example: my_new_algo = Algorithm.from_checkpoint([path to AIR checkpoint OR AIR checkpoint obj]). No original config or other information is needed other than the checkpoint.
Test cases have been added to keep checkpoint backward compatibility and to test these new utilities and dir structures.
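
To make the new layout concrete, here is a minimal standard-library sketch that writes a toy directory with the structure listed above and then reads it back. It does not call RLlib at all. The helper names (`write_toy_checkpoint`, `describe_checkpoint`, `load_policy_states`), the pickled payloads, and the assumption that `checkpoint_version.txt` holds only the bare version string are illustrative, not taken from the PR.

```python
import pickle
import tempfile
from pathlib import Path


def write_toy_checkpoint(root: Path, policy_ids, version="v1"):
    """Write a toy checkpoint using the directory layout described above."""
    root.mkdir(parents=True, exist_ok=True)
    # Assumption: the version file holds just the version string.
    (root / "checkpoint_version.txt").write_text(version)
    with open(root / "state.pkl", "wb") as f:
        pickle.dump({"toy_algorithm_state": True}, f)
    for pid in policy_ids:
        policy_dir = root / "policies" / pid
        policy_dir.mkdir(parents=True)
        with open(policy_dir / "policy_state.pkl", "wb") as f:
            pickle.dump({"policy_id": pid}, f)


def describe_checkpoint(root: Path):
    """Return (version, sorted policy IDs) found in a checkpoint directory."""
    version = (root / "checkpoint_version.txt").read_text().strip()
    policy_ids = sorted(
        p.name
        for p in (root / "policies").iterdir()
        if (p / "policy_state.pkl").is_file()
    )
    return version, policy_ids


def load_policy_states(root: Path, wanted):
    """Load only the requested policies' pickled states (e.g. A and B, not C)."""
    states = {}
    for pid in wanted:
        with open(root / "policies" / pid / "policy_state.pkl", "rb") as f:
            states[pid] = pickle.load(f)
    return states


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint"
    write_toy_checkpoint(ckpt, ["policy_A", "policy_B", "policy_C"])
    version, policy_ids = describe_checkpoint(ckpt)
    subset = load_policy_states(ckpt, ["policy_A", "policy_B"])
```

Because each policy lives in its own sub-directory, picking a subset only requires opening the matching folders; the algorithm-level `state.pkl` is never touched in `load_policy_states`.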

On Policy Checkpoints:

Policy checkpoints now use the AIR Checkpoint mechanism. I.e. Policy.export_checkpoint() produces an AIR Checkpoint directory with all the policy's state in it.
Policy gets two new static utilities: from_checkpoint() and from_state(), both of which return new Policy objects, given a Policy checkpoint (dir or object) or a Policy state, respectively.
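
The relationship between the two static constructors can be sketched with a toy class. `ToyPolicy` below is a hypothetical stand-in, not the RLlib `Policy`; only the method names `from_state`, `from_checkpoint` and `export_checkpoint` and the file name `policy_state.pkl` come from the description above, and the state format (a dict with a `weights` list) is an assumption made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


class ToyPolicy:
    """Hypothetical stand-in for a Policy; NOT the RLlib class."""

    def __init__(self, weights):
        self.weights = list(weights)

    def get_state(self):
        # Everything needed to rebuild this object later.
        return {"weights": list(self.weights)}

    def export_checkpoint(self, directory):
        # Write the full state into a checkpoint directory.
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        with open(directory / "policy_state.pkl", "wb") as f:
            pickle.dump(self.get_state(), f)

    @staticmethod
    def from_state(state):
        # Build a brand-new object from a state dict alone.
        return ToyPolicy(state["weights"])

    @staticmethod
    def from_checkpoint(directory):
        # Read the state from disk, then delegate to from_state().
        with open(Path(directory) / "policy_state.pkl", "rb") as f:
            return ToyPolicy.from_state(pickle.load(f))


with tempfile.TemporaryDirectory() as tmp:
    original = ToyPolicy([0.1, 0.2, 0.3])
    original.export_checkpoint(tmp)
    restored = ToyPolicy.from_checkpoint(tmp)
```

In this sketch `from_checkpoint()` is a thin wrapper that loads a state and hands it to `from_state()`, so neither path needs an existing instance or the original config.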

On native keras/PyTorch models being part of a Policy checkpoint (optional):

A new config option, config.checkpointing(checkpoints_contain_native_model_files=True), makes Policies also try to write their NN model as a native keras/torch saved model into the given checkpoint directory (under the sub-dir "model"). This may still fail (gracefully) in some cases, e.g. for certain TfModelV2s where the keras self.base_model (of the TfModelV2) cannot be discovered easily. This problem will be fully solved by the ongoing RLModule/RLTrainer API efforts.

Why are these changes needed?

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

maxpumperla · 2022-10-07T11:07:02Z

doc/source/_toc.yml

@@ -214,6 +214,7 @@ parts:
          - file: rllib/user-guides
            sections:
              - file: rllib/rllib-models
+              - file: rllib/rllib-checkpoints-and-exports


cool, thanks! @sven1977 as this is a user guide, I think you also want to add this new doc to the panels in user-guides.rst

hmm, maybe we should take a closer look at that gallery, too. For instance, I see that "connectors" are also not in this gallery, although they show up in the TOC (and main navigation of the docs). this should align

was already added to user-guides.rst

@gjoliver on adding connectors docs to list of RLlib user guides (user-guides.rst).

maxpumperla · 2022-10-07T11:29:03Z

rllib/examples/documentation/checkpoints_and_exports.py

+
+from ray.rllib.algorithms.ppo import PPOConfig  # noqa
+
+# Create a new Algorithm (which contains a Policy, which contains a NN Model).


This whole import is too long and consists of 80%+ comments. I think this doesn't read very well on the docs (as a reader I'm not sure if I'm supposed to read all this, it looks as if all those comments are there by "accident"):

basically this import and maybe the last one suffer from this a little, other than that I think this looks great btw.

I'll split it up and move the comments into the rst file.

Left the keras-related comments in the code, but split up the 3 different ways on how to save your models (direct, via policy checkpoint, via algo checkpoint).

maxpumperla

Almost there, just the TOC and include file.

doc/source/_toc.yml

maxpumperla · 2022-10-17T15:51:31Z

doc/source/rllib/rllib-examples.rst

@@ -140,7 +140,7 @@ Serving and Offline
 - `Saving experiences <https://github.com/ray-project/ray/blob/master/rllib/examples/saving_experiences.py>`__:
   Example of how to externally generate experience batches in RLlib-compatible format.
 - `Finding a checkpoint using custom criteria <https://github.com/ray-project/ray/blob/master/rllib/examples/checkpoint_by_custom_criteria.py>`__:
-   Example of how to find a checkpoint after a `Tuner.fit()` via some custom defined criteria.
+   Example of how to find a `checkpoint <rllib-saving-and-loading-algos-and-policies.html>`__ after a `Tuner.fit()` via some custom defined criteria.


not a big deal, but using a :ref: here is more stable and prevents outdated links when moving files etc.

Done, but can you check whether I did this correctly? Not sure how Sphinx infers the actual html file. E.g. I don't find any serve-rllib-tutorial either (referenced further above this one here).

@sven1977 yeah, I don't like the way Sphinx solves this. essentially, if there's a file called serve-rllib-tutorial.md in the source, relative to the same file, you can use that as doc ref, otherwise (and this is usually much better), you'd just add a .. _serve-rllib-tutorial: tag in rst or (serve-rllib-tutorial)= in markdown in the respective doc and reference that, see here:

https://raw.githubusercontent.com/ray-project/ray/23b3a599b9df8100558c477e94b0b19b1a38ac27/doc/source/serve/tutorials/rllib.md

so in this case, please put .. _rllib-saving-and-loading-algos: at the top of your new doc/source/rllib/rllib-saving-and-loading-algos-and-policies.rst file. then you can reference it as:

:ref:`my reference <rllib-saving-and-loading-algos>`

this should also fix the build
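
As a rough aid for the label-based approach described in this thread, the following Python sketch checks whether an rst source defines an explicit target of the form `.. _label:` and builds the matching `:ref:` text. The helper names `has_rst_label` and `ref_role` are hypothetical, the sample document title is a placeholder, and the regex only covers the simple one-label-per-line case; it is not how Sphinx itself resolves references.

```python
import re


def has_rst_label(rst_text: str, label: str) -> bool:
    """True if the text defines the explicit target `.. _<label>:` on its own line."""
    pattern = r"^\.\. _" + re.escape(label) + r":\s*$"
    return re.search(pattern, rst_text, flags=re.MULTILINE) is not None


def ref_role(link_text: str, label: str) -> str:
    """Build the :ref: role text that points at a label."""
    return f":ref:`{link_text} <{label}>`"


# Placeholder document: label at the top, followed by a made-up title.
doc = """.. _rllib-saving-and-loading-algos:

Placeholder Title
=================
"""

label_found = has_rst_label(doc, "rllib-saving-and-loading-algos")
reference = ref_role("my reference", "rllib-saving-and-loading-algos")
```

Since the label, not the file name, is what gets referenced, the link keeps working if the rst file is later moved or renamed.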

doc/source/rllib/rllib-training.rst

doc/source/rllib/user-guides.rst

maxpumperla · 2022-10-17T15:58:09Z

rllib/examples/documentation/saving_and_loading_algos_and_policies.py

@@ -0,0 +1,308 @@
+# flake8: noqa
+
+# __create-algo-checkpoint-begin__


By convention (and I think this is a good one to keep consistent), we have a doc_code folder like here for Tune:

https://github.com/ray-project/ray/tree/master/doc/source/tune/doc_code

which has all the code referenced in docs. You can then basically copy this block in the bazel BUILD file to auto-test all imported snippets. I'm aware that this would likely also happen with rllib/examples but I think it'd be great to keep this code close to doc-sources (which e.g. makes it likely easier for new contributors to understand the structure).

https://github.com/ray-project/ray/blob/master/doc/BUILD#L155-L164

Should be a simple move... apart from that, this is exactly what I had in mind! :D

Dumb question on this one:
If we currently run all rllib/examples/documentation/* via the CI, how would we reference into the new location (doc/source/rllib/doc_code/*) from the rllib/BUILD file?

we should simply consider those two different CI runs. I didn't realise there was an rllib/examples/documentation folder already. We could either leave it as is, or (in a follow-up PR) move the contents of that folder into doc/source/rllib/doc_code. We can then delete that part from the rllib/BUILD and add the resp. bit to doc/BUILD. that way we don't have to think about the issue you brought up. wdyt?

let's simply skip this discussion for now and fix this later, don't think it's a big deal either way
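
For readers unfamiliar with the `# __create-algo-checkpoint-begin__` marker in the diff above: such comment tags delimit the part of an example file that the docs pull in. The sketch below shows the idea in plain Python. The matching `-end__` marker name, the helper `extract_snippet`, and the sample file contents are assumptions for illustration; the docs build itself does this through Sphinx, not through this function.

```python
def extract_snippet(source: str, name: str) -> str:
    """Return the lines between `# __<name>-begin__` and `# __<name>-end__`.

    Raises ValueError if either marker is missing.
    """
    begin = f"# __{name}-begin__"
    end = f"# __{name}-end__"  # assumed counterpart of the begin marker
    lines = source.splitlines()
    stripped = [line.strip() for line in lines]
    start = stripped.index(begin) + 1
    stop = stripped.index(end, start)
    return "\n".join(lines[start:stop]).strip("\n")


# Hypothetical example file with one tagged region.
example_file = """# flake8: noqa

# __create-algo-checkpoint-begin__
checkpoint_dir = "/tmp/hypothetical_checkpoint"
print(checkpoint_dir)
# __create-algo-checkpoint-end__
"""

snippet = extract_snippet(example_file, "create-algo-checkpoint")
```

Keeping the tagged file runnable as a whole is what lets CI test exactly the code that readers see, wherever the file ends up living.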

maxpumperla

approving, pending the ref fix

sven1977 added 30 commits June 8, 2022 17:46

wip

d026a99

Merge branch 'master' of https://github.com/ray-project/ray into mode…

50b6d69

…l_export_overhaul # Conflicts: # rllib/BUILD

wip.

74cc999

Merge branch 'master' of https://github.com/ray-project/ray into mode…

f6de7cd

…l_export_overhaul

wip.

8a24cd8

wip.

de46f48

wip

e5611c1

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

01a9e47

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip.

eed6218

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip.

69352dc

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

88ec1d6

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip.

f8c796e

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip.

6ad1e80

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Merge branch 'master' of https://github.com/ray-project/ray into mode…

191550d

…l_export_overhaul

Merge branch 'master' of https://github.com/ray-project/ray into mode…

52235db

…l_export_overhaul

wip

5e44e2e

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

fb8e89b

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

49429bd

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

f5a5430

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Merge branch 'master' of https://github.com/ray-project/ray into mode…

41e9dd8

…l_export_overhaul

wip

5261c53

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

94399e1

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Merge branch 'master' of https://github.com/ray-project/ray into mode…

8ddba8b

…l_export_overhaul

wip

799be4b

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

0117db2

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

e3bdf70

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip.

3071270

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip.

82f5b41

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip.

81c7d30

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

9a1fd3d

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 added 3 commits October 12, 2022 14:00

wip

7270f0c

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Merge branch 'master' of https://github.com/ray-project/ray into docs…

cfbeb0e

…_update_checkpointing_and_exports

wip

95d7d5b

Signed-off-by: sven1977 <svenmika1977@gmail.com>

maxpumperla requested changes Oct 17, 2022

sven1977 added 3 commits October 18, 2022 16:56

Merge branch 'master' of https://github.com/ray-project/ray into docs…

ca7d85a

…_update_checkpointing_and_exports

wip

14a9cde

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Merge branch 'master' of https://github.com/ray-project/ray into docs…

16d5787

…_update_checkpointing_and_exports

maxpumperla approved these changes Oct 19, 2022

sven1977 added 6 commits October 19, 2022 14:23

wip

748f81e

Signed-off-by: sven1977 <svenmika1977@gmail.com>

Merge branch 'master' of https://github.com/ray-project/ray into docs…

2f40fb9

…_update_checkpointing_and_exports

wip

ca0b246

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

4ce3cb3

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

9e5e393

Signed-off-by: sven1977 <svenmika1977@gmail.com>

wip

f949a70

Signed-off-by: sven1977 <svenmika1977@gmail.com>

sven1977 merged commit 9ece8ac into ray-project:master Oct 20, 2022

rickyyx mentioned this pull request Dec 10, 2022

[core] stress_test_placement_group avg_pg_creation_time regression #30980

Closed

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022

[RLlib] Docs update: New Algorithm checkpoints, Policy checkpoints an…

7b08737

…d Policy model exports in native format (ray-project#28812) Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

rusu24edward mentioned this pull request Mar 26, 2024

Support rllib 2.3.0 LLNL/Abmarl#508

Closed

sven1977 deleted the docs_update_checkpointing_and_exports branch December 27, 2024 20:27
