Fix typos in docs for Multi Adapter RL (MARL). (#1312)
* Fix more typos

* Fix typos in docs.
elhusseiniali authored Feb 2, 2024
1 parent 3f7cee7 commit ae87b3a
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions docs/source/multi_adapter_rl.mdx
@@ -1,6 +1,6 @@
# Multi Adapter RL (MARL) - a single base model for everything

-Here we present an approach that uses a single base model for the entire PPO algorithm - which includes retrieving the reference logits, computing the active logits and the rewards. This feature is experimental as we did not tested the convergence of the approach. We encourage the community to let us know if they potentially face into any issue.
+Here we present an approach that uses a single base model for the entire PPO algorithm - which includes retrieving the reference logits, computing the active logits and the rewards. This feature is experimental as we did not test the convergence of the approach. We encourage the community to let us know if they potentially face issues.

## Requirements

@@ -48,7 +48,7 @@ trainer = PPOTrainer(

...
```
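For context (not part of the diff): the collapsed lines above set up the model that `PPOTrainer` receives. Below is a minimal sketch of that kind of setup, assuming the `trl` and `peft` APIs from around this release; the base model name and reward-adapter id are placeholders.

```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead

# Placeholder identifiers for the base model and the trained reward adapter.
model_name = "huggyllama/llama-7b"
rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"

# LoRA configuration for the policy adapter that PPO will train.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# A single base model carries the policy adapter, the value head and the
# reward adapter queried by `compute_reward_score`.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
)

# `model` is then passed to `PPOTrainer` as in the snippet above.
```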
-Then inside your PPO training loop, call the `compute_reward_score` method by accessing to the `model` attribute from `PPOTrainer`.
+Then inside your PPO training loop, call the `compute_reward_score` method by accessing the `model` attribute from `PPOTrainer`.

```python
rewards = trainer.model.compute_reward_score(**inputs)
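# --- Illustrative continuation, not part of this diff ---
# A hedged sketch of how this call might sit inside one PPO step. `tokenizer`,
# `batch`, `query_tensors` and `response_tensors` are assumed to come from the
# usual PPO loop and are not defined in this excerpt; the shape of the returned
# scores depends on the reward adapter, here we assume one score per token.
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
inputs = tokenizer(texts, padding=True, return_tensors="pt").to(trainer.accelerator.device)
raw_rewards = trainer.model.compute_reward_score(**inputs)
rewards = [score[-1] for score in raw_rewards]  # keep one scalar per sample (last token)
stats = trainer.step(query_tensors, response_tensors, rewards)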
@@ -58,8 +58,8 @@

### Control on the adapter name

-If you are familiar with the `peft` library, you know that you can use multiple adapters inside the same model. What you can do is to train multiple adapters on the same base model to fine-tune on different policies.
-In this case, you want to have a control on the adapter name you want to activate back, after retrieving the reward. For that, simply pass the appropriate `adapter_name` to `ppo_adapter_name` argument when calling `compute_reward_score`.
+If you are familiar with the `peft` library, you know that you can use multiple adapters inside the same model. What you can do is train multiple adapters on the same base model to fine-tune on different policies.
+In this case, you want to be able to control the adapter name you want to activate back, after retrieving the reward. For that, simply pass the appropriate `adapter_name` to `ppo_adapter_name` argument when calling `compute_reward_score`.

```python
adapter_name_policy_1 = "policy_1"
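# --- Illustrative continuation, not part of this diff ---
# A hedged sketch (assumption) of passing `ppo_adapter_name` so that the chosen
# policy adapter is re-activated after the reward has been computed.
rewards = trainer.model.compute_reward_score(**inputs, ppo_adapter_name=adapter_name_policy_1)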
@@ -97,4 +97,4 @@ trainer = PPOTrainer(
...
)
...
-```
+```
