PPO

class PPO(model, act_dim=None, policy_lr=None, value_lr=None, epsilon=0.2)[source]

Bases: parl.core.fluid.algorithm.Algorithm

__init__(model, act_dim=None, policy_lr=None, value_lr=None, epsilon=0.2)[source]

PPO algorithm. A construction sketch follows the parameter list below.

Parameters
  • model (parl.Model) – model defining the forward networks of the policy and the value function.

  • act_dim (int) – dimension of the action space.

  • policy_lr (float) – learning rate of the policy model.

  • value_lr (float) – learning rate of the value model.

  • epsilon (float) – epsilon used in the CLIP loss (default 0.2).

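A minimal construction sketch (not taken from the PARL source). It assumes PPO is importable from parl.algorithms and that the model is a parl.Model subclass exposing its policy and value networks as policy_model / value_model attributes (the policy_model attribute is implied by sync_old_policy below; the rest of the layout follows PARL's PPO example). Layer sizes and the act_dim value are illustrative, and the full Gaussian-policy interface (log-variances, sampling) is omitted for brevity.

    import parl
    from parl import layers
    from parl.algorithms import PPO


    class PolicyModel(parl.Model):
        """Policy network: outputs action means (logvars omitted here)."""

        def __init__(self, act_dim):
            self.fc1 = layers.fc(size=64, act='tanh')
            self.fc2 = layers.fc(size=act_dim, act='tanh')

        def policy(self, obs):
            return self.fc2(self.fc1(obs))


    class ValueModel(parl.Model):
        """State-value network: outputs one value per observation."""

        def __init__(self):
            self.fc1 = layers.fc(size=64, act='tanh')
            self.fc2 = layers.fc(size=1)

        def value(self, obs):
            return self.fc2(self.fc1(obs))


    class ActorCritic(parl.Model):
        """Bundles the two sub-models under policy_model / value_model."""

        def __init__(self, act_dim):
            self.policy_model = PolicyModel(act_dim)
            self.value_model = ValueModel()


    act_dim = 6  # illustrative: a 6-dimensional continuous action space
    model = ActorCritic(act_dim)
    alg = PPO(model, act_dim=act_dim, policy_lr=3e-4, value_lr=3e-4,
              epsilon=0.2)

A training loop would then typically call sync_old_policy() once per iteration before running policy_learn() and value_learn() inside the usual fluid program setup; PARL's PPO example shows the full pipeline.
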
policy_learn(obs, actions, advantages, beta=None)[source]

Learn the policy model with one of:
  1. CLIP loss: Clipped Surrogate Objective

  2. KLPEN loss: Adaptive KL Penalty Objective

See: https://arxiv.org/pdf/1707.02286.pdf

Parameters
  • obs – Tensor, (batch_size, obs_dim)

  • actions – Tensor, (batch_size, act_dim)

  • advantages – Tensor (batch_size, )

  • beta – Tensor (1) or None. If None, the CLIP loss is used; otherwise, the KLPEN loss is used (both objectives are written out below).

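For reference, the two objectives have the standard PPO form, where r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t) is the probability ratio against the old policy, \hat{A}_t is the advantage estimate, and epsilon / beta are the arguments described above:

    L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]

    L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t \left[ r_t(\theta)\,\hat{A}_t \;-\; \beta\, \mathrm{KL}\!\left[ \pi_{\theta_{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t) \right] \right]
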
predict(obs)[source]

Use the policy model of self.model to predict the means and log-variances (logvars) of the action distribution

sample(obs)[source]

Use the policy model of self.model to sample actions

sync_old_policy()[source]

Synchronize weights of self.model.policy_model to self.old_policy_model

value_learn(obs, val)[source]

Learn the value model with a squared-error cost (see the objective below)

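As a sketch of the objective (assuming val holds the value targets, e.g. empirical discounted returns), the value model is fit by minimizing the mean squared error

    L^{V} = \hat{\mathbb{E}}_t \left[ \left( V(s_t) - \mathrm{val}_t \right)^2 \right]
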
value_predict(obs)[source]

Use the value model of self.model to predict the value of obs