PPO

class PPO(model, act_dim=None, policy_lr=None, value_lr=None, epsilon=0.2)[source]

Bases: parl.core.fluid.algorithm.Algorithm

__init__(model, act_dim=None, policy_lr=None, value_lr=None, epsilon=0.2)[source]

PPO algorithm. A minimal construction sketch follows the parameter list below.

Parameters
  • model (parl.Model) – model defining the forward networks of the policy and the value function.

  • act_dim (int) – dimension of the action space.

  • policy_lr (float) – learning rate of the policy model.

  • value_lr (float) – learning rate of the value model.

  • epsilon (float) – epsilon used in the CLIP loss (default 0.2).

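A minimal construction sketch based on the signature above. MyPPOModel is a hypothetical placeholder for a user-defined parl.Model that bundles the policy and value networks, and the hyperparameter values are illustrative only::

    from parl.algorithms import PPO  # import path may differ across PARL versions

    # MyPPOModel is a hypothetical user-defined parl.Model that provides the
    # policy network and the value network used by this algorithm.
    model = MyPPOModel(obs_dim=obs_dim, act_dim=act_dim)

    algorithm = PPO(model,
                    act_dim=act_dim,   # dimension of the action space
                    policy_lr=3e-4,    # learning rate of the policy model
                    value_lr=1e-3,     # learning rate of the value model
                    epsilon=0.2)       # clipping range of the CLIP loss
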
policy_learn(obs, actions, advantages, beta=None)[source]

Learn the policy model with one of two objectives (sketched after the parameter list below):
  1. CLIP loss: Clipped Surrogate Objective

  2. KLPEN loss: Adaptive KL Penalty Objective

See: https://arxiv.org/pdf/1707.02286.pdf

Parameters
  • obs – Tensor, (batch_size, obs_dim)

  • actions – Tensor, (batch_size, act_dim)

  • advantages – Tensor, (batch_size, )

  • beta – Tensor of shape (1), or None. If None, the CLIP loss is used; otherwise, the KLPEN loss is used.

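The two objectives can be sketched in a few lines of NumPy (illustration only, not the library's PaddlePaddle implementation). Here ratio is the per-sample probability ratio between the current policy and the old policy snapshot, kl is the per-sample KL divergence from the old policy, and beta is the adaptive penalty coefficient::

    import numpy as np

    def clip_loss(ratio, advantages, epsilon=0.2):
        # Clipped Surrogate Objective: take the elementwise minimum of the
        # unclipped and clipped surrogates, negated so it can be minimized.
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
        return -np.mean(np.minimum(unclipped, clipped))

    def klpen_loss(ratio, advantages, kl, beta):
        # Adaptive KL Penalty Objective: surrogate minus a KL penalty weighted
        # by the adaptive coefficient beta, negated for minimization.
        return -np.mean(ratio * advantages - beta * kl)
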
predict(obs)[source]

Use the policy model of self.model to predict the means and logvars of actions.

sample(obs)[source]

Use the policy model of self.model to sample actions (see the sketch below).

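Given the predicted means and logvars above, sampling amounts to drawing from a diagonal Gaussian. A NumPy sketch of the idea (the library's actual sampling is implemented with PaddlePaddle ops and may differ in detail)::

    import numpy as np

    def sample_action(means, logvars, rng=np.random.default_rng()):
        # means, logvars: arrays of shape (act_dim,) predicted by the policy model.
        # Draw a = mean + std * noise, with std = exp(logvar / 2).
        noise = rng.standard_normal(means.shape)
        return means + np.exp(logvars / 2.0) * noise
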
sync_old_policy()[source]

Synchronize the weights of self.model.policy_model to self.old_policy_model.

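The old policy snapshot is what the surrogate objectives take their probability ratio against, so it has to be refreshed before each round of policy updates. A sketch of the logical ordering of one PPO update; the epoch counts are arbitrary, and in the fluid-based API these calls are normally wired into a parl.Agent that builds and runs the corresponding programs rather than invoked directly like this::

    def ppo_update(algorithm, obs, actions, advantages, returns,
                   policy_epochs=10, value_epochs=10):
        # Snapshot the current policy as the "old" policy used in the ratio.
        algorithm.sync_old_policy()
        # Several passes over the same batch with the clipped surrogate (beta=None).
        for _ in range(policy_epochs):
            algorithm.policy_learn(obs, actions, advantages)
        # Fit the value model to empirical returns with the squared-error cost.
        for _ in range(value_epochs):
            algorithm.value_learn(obs, returns)
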
value_learn(obs, val)[source]

Learn the value model with a squared-error cost.

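A short NumPy sketch of the squared-error cost (illustration only); predicted_values stands for the value model's outputs for obs, and val for the regression targets::

    import numpy as np

    def value_loss(predicted_values, val):
        # Mean squared error between predicted values and target values.
        return np.mean((predicted_values - val) ** 2)
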
value_predict(obs)[source]

Use the value model of self.model to predict the value of obs.