From 5e8fced1f15b797e533ad790a0864ca3663ff726 Mon Sep 17 00:00:00 2001
From: Eric Liang
Date: Sun, 26 May 2019 13:22:43 -0700
Subject: [PATCH] add rnn state info

---
 doc/source/rllib-concepts.rst | 61 +++++++++++++++++++++++++++++++++--
 doc/source/rllib.rst          |  4 +--
 2 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/doc/source/rllib-concepts.rst b/doc/source/rllib-concepts.rst
index d16ce4e66b6a..06e890832295 100644
--- a/doc/source/rllib-concepts.rst
+++ b/doc/source/rllib-concepts.rst
@@ -8,7 +8,7 @@ Policies
 
 Policy classes encapsulate the core numerical components of RL algorithms. This typically includes the policy model that determines actions to take, a trajectory postprocessor for experiences, and a loss function to improve the policy given postprocessed experiences. For a simple example, see the policy gradients `graph definition `__.
 
-Most interaction with deep learning frameworks is isolated to the `Policy interface `__, allowing RLlib to support multiple frameworks. To simplify the definition of policies, RLlib includes `Tensorflow `__ and `PyTorch-specific `__ templates. You can also write your own from scratch. Here is an example:
+Most interaction with deep learning frameworks is isolated to the `Policy interface `__, allowing RLlib to support multiple frameworks. To simplify the definition of policies, RLlib includes `Tensorflow <#building-policies-in-tensorflow>`__ and `PyTorch-specific <#building-policies-in-pytorch>`__ templates. You can also write your own from scratch. Here is an example:
 
 .. code-block:: python
 
@@ -46,6 +46,63 @@ Most interaction with deep learning frameworks is isolated to the `Policy interf
 
         def set_weights(self, weights):
             self.w = weights["w"]
+
+The above basic policy, when run, will produce sample batches with the basic ``obs``, ``new_obs``, ``actions``, ``rewards``, ``dones``, and ``infos`` columns. There are two more mechanisms to pass along and emit extra information:
+
+**Policy recurrent state**: Suppose you want to compute actions based on the current timestep of the episode. While it is possible to have the environment provide this as part of the observation, we can instead compute and store it as part of the Policy recurrent state:
+
+.. code-block:: python
+
+    def get_initial_state(self):
+        """Returns initial RNN state for the current policy."""
+        return [0]  # list of single state element (t=0)
+        # you could also return multiple values, e.g., [0, "foo"]
+
+    def compute_actions(self,
+                        obs_batch,
+                        state_batches,
+                        prev_action_batch=None,
+                        prev_reward_batch=None,
+                        info_batch=None,
+                        episodes=None,
+                        **kwargs):
+        assert len(state_batches) == len(self.get_initial_state())
+        new_state_batches = [[
+            t + 1 for t in state_batches[0]
+        ]]
+        return ..., new_state_batches, {}
+
+    def learn_on_batch(self, samples):
+        # can access array of the state elements at each timestep
+        # or state_in_1, 2, etc. if there are multiple state elements
+        assert "state_in_0" in samples.keys()
+        assert "state_out_0" in samples.keys()
+
+
+**Extra action info output**: You can also emit extra outputs at each step that will be available for learning. For example, you might want to output the behaviour policy logits as extra action info, which can be used for importance weighting, but in general arbitrary values can be stored here (as long as they are convertible to numpy arrays):
+
+.. code-block:: python
+
+    def compute_actions(self,
+                        obs_batch,
+                        state_batches,
+                        prev_action_batch=None,
+                        prev_reward_batch=None,
+                        info_batch=None,
+                        episodes=None,
+                        **kwargs):
+        action_info_batch = {
+            "some_value": ["foo" for _ in obs_batch],
+            "other_value": [12345 for _ in obs_batch],
+        }
+        return ..., [], action_info_batch
+
+    def learn_on_batch(self, samples):
+        # can access array of the extra values at each timestep
+        assert "some_value" in samples.keys()
+        assert "other_value" in samples.keys()
+
+
 
 Building Policies in TensorFlow
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -427,7 +484,7 @@ Trainers
 
 Trainers are the boilerplate classes that put the above components together, making algorithms accessible via Python API and the command line. They manage algorithm configuration, setup of the rollout workers and optimizer, and collection of training metrics. Trainers also implement the `Trainable API `__ for easy experiment management.
 
-Example of three equivalent ways of interacting with the PPO trainer:
+Example of three equivalent ways of interacting with the PPO trainer, all of which log results in ``~/ray_results``:
 
 .. code-block:: python
 
diff --git a/doc/source/rllib.rst b/doc/source/rllib.rst
index 724a3caf83d5..e77a0ab427f8 100644
--- a/doc/source/rllib.rst
+++ b/doc/source/rllib.rst
@@ -95,8 +95,8 @@ Offline Datasets
 * `Input API `__
 * `Output API `__
 
-Building Custom Algorithms
---------------------------
+Concepts and Building Custom Algorithms
+---------------------------------------
 
 * `Policies `__
   - `Building Policies in TensorFlow `__
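
The snippets in the second hunk above are method fragments of a larger policy class. As a usage note, the following standalone sketch shows how the two mechanisms might fit together end to end; ``TimestepCounterPolicy``, its dummy action, and the ``"timestep"`` info key are hypothetical illustrations rather than RLlib API, and the driver loop only approximates how a rollout worker threads ``state_batches`` between ``compute_actions`` calls.

.. code-block:: python

    # Illustrative stand-in for a custom policy; no RLlib imports are assumed.
    class TimestepCounterPolicy:
        """Tracks the episode timestep as a single recurrent state element."""

        def get_initial_state(self):
            # One recurrent state element, starting at t=0.
            return [0]

        def compute_actions(self, obs_batch, state_batches, **kwargs):
            assert len(state_batches) == len(self.get_initial_state())
            # Advance the per-episode timestep counter for each batch item.
            new_state_batches = [[t + 1 for t in state_batches[0]]]
            actions = [0 for _ in obs_batch]  # dummy constant action
            # Extra per-step outputs that would end up in the sample batch.
            action_info_batch = {"timestep": list(state_batches[0])}
            return actions, new_state_batches, action_info_batch


    if __name__ == "__main__":
        policy = TimestepCounterPolicy()
        # Mimic a rollout worker threading state for one episode, batch size 1.
        state = [[s] for s in policy.get_initial_state()]
        for step in range(3):
            obs_batch = [[0.0]]  # dummy observation batch
            actions, state, info = policy.compute_actions(obs_batch, state)
            print(step, actions, state, info)  # state advances to 1, 2, 3

In an actual policy, the same values would then show up in ``learn_on_batch(samples)`` under the ``state_in_0`` / ``state_out_0`` columns and the extra action info keys, as described in the hunk above.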