PPO (Proximal Policy Optimization)
A stable reinforcement learning algorithm used in the RL stage.
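PPO updates the policy by maximizing a clipped surrogate objective: the probability ratio between the new and old policy is clipped to a small interval around 1, so a single update cannot move the policy too far and destabilize training. PPO is simpler to implement than TRPO, which enforces a similar constraint with a more complex second-order procedure, and it is generally more stable than unconstrained policy gradient methods such as A3C.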
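
To make the clipping concrete, here is a minimal sketch of the clipped surrogate loss in plain Python. The function name ppo_clip_loss, its parameter names, and the default clip_eps of 0.2 are illustrative assumptions, not taken from any particular library; a real training loop would compute this on tensors with automatic differentiation.

```python
import math

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate loss over a batch (hypothetical helper).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current policy and the policy that collected the data.
    advantages: advantage estimates for those actions.
    clip_eps: clipping range (0.2 is an assumed, commonly used value).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), from log-probs.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Negate so that minimizing the loss maximizes the objective.
    return -total / len(advantages)
```

With a positive advantage, the objective stops improving once the ratio exceeds 1 + clip_eps, so there is no incentive to push the policy further in one update. With a negative advantage and a large ratio, the unclipped term is kept, so moves that make a bad action more likely are still penalized in full.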