This project showcases the implementation of two versions of the Proximal Policy Optimization (PPO) algorithm, PPO-AdaptiveKL and PPO-Clip, applied to the Swimmer environment in the Gymnasium framework. These implementations explore the trade-off between policy-update stability and learning efficiency in continuous control tasks.
- PPO-Clip: This is the more commonly used version of PPO. It uses a clipping mechanism to restrict the policy update ratio, preventing large changes that might destabilize learning. This approach is simple yet effective, yielding stable training results.
- PPO-AdaptiveKL: This variant introduces a penalty on the KL divergence between the old and new policies. The penalty coefficient is adapted dynamically during training, allowing more flexibility when updates are safe and restricting them when needed.
The PPO-Clip algorithm limits how much the policy can change by clipping the probability ratio between the new and old policies, which keeps updates smooth and stabilizes learning. The policy parameters are obtained by maximizing the clipped surrogate objective:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

In this formula, $L^{CLIP}(\theta)$ is the clipped surrogate objective, $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping range.
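Below is a minimal sketch of what this objective can look like in code. It is an illustration rather than the exact implementation in this repository; the function name `ppo_clip_loss` and the default clip range of 0.2 are assumptions.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, returned as a loss to minimize."""
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t),
    # computed from log-probabilities for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the element-wise pessimistic bound and negate it,
    # so minimizing the loss maximizes the clipped objective.
    return -torch.min(unclipped, clipped).mean()
```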
Figure: PPO-Clip reward evolution during training episodes. The steady rise shows the model gradually improving policy performance while maintaining stability.
PPO-clip.mp4
In contrast, PPO-AdaptiveKL adjusts the KL divergence penalty dynamically. This allows larger updates when the divergence between the policies is small, and restricts updates when the divergence grows too large, making learning more flexible.
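As a rough sketch, the KL-penalized objective and the coefficient adaptation rule from the original PPO paper could look like the following. The names `ppo_kl_penalty_loss` and `update_beta` and the target KL of 0.01 are illustrative assumptions, not necessarily what this repository uses.

```python
import torch

def ppo_kl_penalty_loss(new_log_probs, old_log_probs, advantages, beta):
    """Surrogate objective with a KL penalty, returned as a loss to minimize."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Sample-based estimate of KL(pi_old || pi_theta).
    approx_kl = (old_log_probs - new_log_probs).mean()
    return -(ratio * advantages).mean() + beta * approx_kl

def update_beta(beta, measured_kl, kl_target=0.01):
    """Adapt the penalty coefficient after each policy update:
    relax it when the new policy stays close to the old one,
    tighten it when the divergence grows too large."""
    if measured_kl < kl_target / 1.5:
        beta /= 2.0
    elif measured_kl > kl_target * 1.5:
        beta *= 2.0
    return beta
```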
Figure: PPO-AdaptiveKL reward progression. The adaptive KL penalty leads to more fluctuation, but the reward eventually converges to a stable trajectory.
PPO-adaptive.mp4
Both algorithms are effective in controlling the swimmer, but they exhibit different characteristics:
- PPO-Clip maintains more stable and predictable updates, with smoother convergence.
- PPO-AdaptiveKL introduces more flexibility by adapting to the policy's behavior, allowing for potentially faster learning, but at the cost of occasional instability.
For a more detailed discussion of the algorithms, mathematical details, and hyperparameters, please refer to the full report.
This project is released under the MIT License, and I’d be thrilled if you use and improve my work!