A PPO approach to the Swimmer environment of Gymnasium.

Gymnasium Swimmer Environment

This project showcases the implementation of two variants of Proximal Policy Optimization (PPO), PPO-AdaptiveKL and PPO-Clip, applied to the Swimmer environment in the Gymnasium framework. These implementations explore the trade-off between policy-update stability and learning efficiency in continuous control tasks.
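Both variants train on the same environment. As a point of reference, here is a minimal sketch, not taken from the repository's code, of how Swimmer-v4 can be created and probed with a random policy; it assumes gymnasium is installed with the MuJoCo extras (pip install "gymnasium[mujoco]").

```python
# Minimal sketch: create the Swimmer environment and run one random rollout.
# This is illustrative only and not the training code from this repository.
import gymnasium as gym

env = gym.make("Swimmer-v4")
obs, info = env.reset(seed=0)

print("Observation space:", env.observation_space)  # 8-dimensional by default (joint angles and velocities)
print("Action space:", env.action_space)            # 2 joint torques, each bounded in [-1, 1]

episode_return = 0.0
for _ in range(1000):
    action = env.action_space.sample()               # random policy stands in for the learned one
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:
        break

print("Random-policy return:", episode_return)
env.close()
```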

Overview

  • PPO-Clip: This is the more commonly used version of PPO. It clips the probability ratio between the new and old policies, preventing large updates that might destabilize learning. This approach is simple yet effective, yielding stable training results.

  • PPO-AdaptiveKL: This variant adds a penalty on the KL divergence between the old and new policies. The penalty coefficient is adapted dynamically during training against a target KL value, allowing larger updates when they are safe and restricting them when needed.

PPO-Clip Algorithm

The PPO-Clip algorithm limits how much the policy can change by clipping the probability ratio between the new and old policies. This ensures smoother updates and stabilizes learning. To determine the policy parameters, we use the following update:

Policy update formula and the definition of the surrogate objective L (equation images in the repository).
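For readers viewing this page as plain text, the update shown in those images is presumably the standard PPO-Clip objective from Schulman et al. (2017); the notation below is a reconstruction rather than a copy of the repository's images.

```math
\begin{aligned}
\theta_{k+1} &= \arg\max_{\theta}\; \mathbb{E}_{s,a \sim \pi_{\theta_k}}\!\left[\, L(s, a, \theta_k, \theta) \,\right],\\[4pt]
L(s, a, \theta_k, \theta) &= \min\!\left(
    \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\;
    \operatorname{clip}\!\left( \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_k}(a \mid s)},\, 1 - \epsilon,\, 1 + \epsilon \right) A^{\pi_{\theta_k}}(s, a)
\right)
\end{aligned}
```

Here the ratio compares the new and old policy probabilities, A is the advantage estimate under the old policy, and epsilon (commonly around 0.2) sets the clip range; taking the minimum keeps each update inside a small trust region around the old policy.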

Reward Progression

Figure: PPO-Clip reward evolution during training episodes. The steady rise shows the model gradually improving policy performance while maintaining stability.

PPO-clip.mp4
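To make the clipping concrete, here is a minimal PyTorch-style sketch of the clipped surrogate loss. The function name, tensor names, and the default clip range of 0.2 are illustrative and not taken from this repository's implementation.

```python
# Sketch of the PPO-Clip surrogate loss; names and the clip range are illustrative,
# not this repository's actual implementation.
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over a batch (to be minimized)."""
    ratio = torch.exp(log_probs_new - log_probs_old)             # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                 # ascent on min(...) == descent on its negative
```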

PPO-AdaptiveKL Algorithm

In contrast, PPO-AdaptiveKL adjusts the KL divergence penalty dynamically. This allows larger updates when the divergence between the policies is small, and restricts updates when the divergence grows too large, making learning more flexible.
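A common way to implement this, following the adaptive-KL variant described by Schulman et al. (2017), is to scale a penalty coefficient beta up or down depending on how the measured KL compares to a target value. The sketch below uses that standard rule; the target KL and the scaling factors are commonly used defaults, not necessarily the values chosen in this repository.

```python
# Sketch of the KL-penalized surrogate and the standard adaptive-beta rule;
# constants are common defaults, not necessarily this repository's choices.
import torch

def kl_penalty_loss(log_probs_new, log_probs_old, advantages, beta):
    """Return (loss, approx_kl): penalized surrogate to minimize and the measured KL."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    approx_kl = (log_probs_old - log_probs_new).mean()   # sample-based estimate of KL(pi_old || pi_new)
    surrogate = (ratio * advantages).mean()
    return -(surrogate - beta * approx_kl), approx_kl

def update_beta(beta, measured_kl, kl_target=0.01):
    """Loosen the penalty when KL stays small, tighten it when KL grows too large."""
    if measured_kl < kl_target / 1.5:
        beta /= 2.0    # updates were conservative, so allow larger steps
    elif measured_kl > kl_target * 1.5:
        beta *= 2.0    # policy drifted too far, so penalize harder
    return beta
```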

Reward Progression

Figure: PPO-AdaptiveKL reward progression. The adaptive KL penalty produces more fluctuation during training, but the reward eventually converges to a stable trajectory.

PPO-adaptive.mp4

Results Summary

Both algorithms are effective in controlling the swimmer, but they exhibit different characteristics:

  • PPO-Clip maintains more stable and predictable updates, with smoother convergence.
  • PPO-AdaptiveKL introduces more flexibility by adapting to the policy's behavior, allowing for potentially faster learning, but at the cost of occasional instability.

For a more detailed discussion of the algorithms, mathematical details, and hyperparameters, please refer to the full report.

License

This project is released under the MIT License, and I’d be thrilled if you use and improve my work!
