This project showcases the implementation of two versions of the Proximal Policy Optimization (PPO) algorithm, PPO-AdaptiveKL and PPO-Clip, applied to the Swimmer environment in the Gymnasium framework. These implementations explore the trade-off between policy-update stability and learning efficiency in continuous control tasks.
- PPO-Clip: This is the more commonly used version of PPO. It uses a clipping mechanism to restrict the policy update ratio, preventing large changes that might destabilize learning. This approach is simple yet effective, yielding stable training results.
- PPO-AdaptiveKL: This variant introduces a penalty on the KL divergence between the old and new policies. The penalty coefficient is adapted dynamically during training, allowing more flexibility when updates are safe and restricting them when needed.
The PPO-Clip algorithm limits how much the policy can change by clipping the probability ratio between the new and old policies, which keeps updates smooth and stabilizes learning. The policy parameters are obtained by maximizing the clipped surrogate objective:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

In this formula, $L^{CLIP}(\theta)$ is the clipped surrogate objective, $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping range.
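Below is a minimal sketch of what this objective can look like in code. It is an illustration rather than the exact implementation in this repository; the function name `ppo_clip_loss` and the default clip range of 0.2 are assumptions.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, returned as a loss to minimize."""
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t),
    # computed from log-probabilities for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the element-wise pessimistic bound and negate it,
    # so minimizing the loss maximizes the clipped objective.
    return -torch.min(unclipped, clipped).mean()
```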
Figure: PPO-Clip reward evolution during training episodes. The steady rise shows the model gradually improving policy performance while maintaining stability.
PPO-clip.mp4
In contrast, PPO-AdaptiveKL adjusts the KL divergence penalty dynamically. This allows larger updates when the divergence between the policies is small, and restricts updates when the divergence grows too large, making learning more flexible.
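As a rough sketch, the KL-penalized objective and the coefficient adaptation rule from the original PPO paper could look like the following. The names `ppo_kl_penalty_loss` and `update_beta` and the target KL of 0.01 are illustrative assumptions, not necessarily what this repository uses.

```python
import torch

def ppo_kl_penalty_loss(new_log_probs, old_log_probs, advantages, beta):
    """Surrogate objective with a KL penalty, returned as a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Sample-based estimate of KL(pi_old || pi_theta).
    approx_kl = (old_log_probs - new_log_probs).mean()
    return -(ratio * advantages).mean() + beta * approx_kl

def update_beta(beta, measured_kl, kl_target=0.01):
    """Adapt the penalty coefficient after each policy update:
    relax it when the new policy stays close to the old one,
    tighten it when the divergence grows too large."""
    if measured_kl < kl_target / 1.5:
        beta /= 2.0
    elif measured_kl > kl_target * 1.5:
        beta *= 2.0
    return beta
```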
Figure: PPO-AdaptiveKL reward progression. The adaptive KL penalty leads to more fluctuation, but the reward eventually converges to a stable trajectory.
PPO-adaptive.mp4
Both algorithms are effective in controlling the swimmer, but they exhibit different characteristics:
- PPO-Clip maintains more stable and predictable updates, with smoother convergence.
- PPO-AdaptiveKL introduces more flexibility by adapting to the policy's behavior, allowing for potentially faster learning, but at the cost of occasional instability.
For a more detailed discussion of the algorithms, mathematical details, and hyperparameters, please refer to the full report.
This project is released under the MIT License, and I’d be thrilled if you use and improve my work!