PPO
- class PPO(model, act_dim=None, policy_lr=None, value_lr=None, epsilon=0.2)[source]
Bases: parl.core.fluid.algorithm.Algorithm
- __init__(model, act_dim=None, policy_lr=None, value_lr=None, epsilon=0.2)[source]
PPO algorithm
- Parameters
model (parl.Model) – model defining the forward networks of the policy and the value function.
act_dim (int) – dimension of the action space.
policy_lr (float) – learning rate of the policy model.
value_lr (float) – learning rate of the value model.
epsilon (float) – epsilon used in the CLIP loss (default 0.2).
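To illustrate the role of epsilon, the clipped surrogate objective that PPO minimizes can be sketched in plain NumPy. This is a minimal illustration of the standard CLIP loss, not PARL's implementation; the function name `clip_loss` is hypothetical:

```python
import numpy as np

def clip_loss(ratio, advantages, epsilon=0.2):
    """Clipped surrogate objective: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)].

    ratio: pi_new(a|s) / pi_old(a|s), shape (batch_size,)
    advantages: advantage estimates, shape (batch_size,)
    """
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

# With epsilon=0.2, a ratio of 1.5 is clipped to 1.2 before weighting a
# positive advantage, capping the incentive to move the policy further.
loss = clip_loss(np.array([1.5, 0.9]), np.array([1.0, 1.0]))  # -> -1.05
```

Smaller epsilon values keep the updated policy closer to the old one; 0.2 is the default used here.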
- policy_learn(obs, actions, advantages, beta=None)[source]
- Learn the policy model with one of:
CLIP loss: Clipped Surrogate Objective
KLPEN loss: Adaptive KL Penalty Objective
- Parameters
obs – Tensor, (batch_size, obs_dim)
actions – Tensor, (batch_size, act_dim)
advantages – Tensor, (batch_size,)
beta – Tensor, (1,) or None. If None, use the CLIP loss; otherwise, use the KLPEN loss.
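The beta switch described above can be sketched as follows. This is a schematic NumPy version of the two objectives, not PARL's fluid-graph implementation; the function name `policy_loss` and the per-sample `kl` argument are hypothetical:

```python
import numpy as np

def policy_loss(ratio, advantages, kl=None, beta=None, epsilon=0.2):
    """Mirror the documented switch: beta=None -> CLIP, else KLPEN.

    ratio: pi_new(a|s) / pi_old(a|s), shape (batch_size,)
    kl: per-sample KL(pi_old || pi_new), required only for KLPEN
    """
    if beta is None:
        # CLIP: pessimistic bound on the surrogate objective
        clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
        return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
    # KLPEN: surrogate objective minus an adaptive KL penalty
    return -np.mean(ratio * advantages - beta * kl)

# KLPEN with beta=1.0: -(1.0 * 2.0 - 1.0 * 0.5) = -1.5
klpen = policy_loss(np.array([1.0]), np.array([2.0]),
                    kl=np.array([0.5]), beta=1.0)
```

In the KLPEN variant, beta is typically adapted between updates: increased when the measured KL exceeds a target, decreased when it falls below it.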