Getting Started
Goals of this tutorial:

- Understand PARL's abstractions at a high level
- Train an agent to solve the Cartpole problem with the Policy Gradient algorithm

This tutorial assumes basic familiarity with the policy gradient method.
Model
First, let's build a Model that predicts an action given an observation. Since PARL is an object-oriented framework, we build our model on top of parl.Model and implement its forward function.
Here, we construct a neural network with two fully connected layers.
import parl
from parl import layers


class CartpoleModel(parl.Model):
    def __init__(self, act_dim):
        hid1_size = act_dim * 10
        # Two fully connected layers: a tanh hidden layer followed by
        # a softmax layer that outputs action probabilities.
        self.fc1 = layers.fc(size=hid1_size, act='tanh')
        self.fc2 = layers.fc(size=act_dim, act='softmax')

    def forward(self, obs):
        out = self.fc1(obs)
        out = self.fc2(out)
        return out
Algorithm
An Algorithm updates the parameters of the model passed to it. In general, we define the loss function in the Algorithm.

In this tutorial, we solve the Cartpole benchmark with the Policy Gradient algorithm, which is already implemented in our repository, so we can simply import it from parl.algorithms.

We have also published various other algorithms in PARL; please visit this page for more details. If you want to implement a new algorithm, please follow this tutorial.
model = CartpoleModel(act_dim=2)
algorithm = parl.algorithms.PolicyGradient(model, lr=1e-3)
Note that each algorithm should implement two functions (a sketch follows below):

- learn: updates the model's parameters given transition data
- predict: predicts an action given the current environmental state
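For intuition, here is a minimal sketch of what these two functions might look like in a policy-gradient-style Algorithm, written against the fluid API used elsewhere in this tutorial. The class name MyPolicyGradient is hypothetical; this is an illustration of the interface, not PARL's actual PolicyGradient implementation.

# Illustrative sketch only -- not PARL's actual implementation.
import parl
from parl import layers
import paddle.fluid as fluid


class MyPolicyGradient(parl.Algorithm):  # hypothetical name
    def __init__(self, model, lr):
        self.model = model
        self.lr = lr

    def predict(self, obs):
        # Forward pass: the model outputs action probabilities.
        return self.model(obs)

    def learn(self, obs, action, reward):
        # REINFORCE-style loss: cross_entropy returns -log pi(a|s), so
        # minimizing reduce_mean(-log pi(a|s) * reward) raises the
        # probability of actions that led to high reward.
        act_prob = self.model(obs)
        neg_log_prob = layers.cross_entropy(act_prob, action)
        cost = layers.reduce_mean(neg_log_prob * reward)
        optimizer = fluid.optimizer.Adam(learning_rate=self.lr)
        optimizer.minimize(cost)
        return cost

In practice, you can simply use the ready-made parl.algorithms.PolicyGradient, as we did above.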
Agent
Now we pass the algorithm to an agent, which interacts with the environment to generate training data. Users should build their agents on top of parl.Agent and implement four functions:

- build_program: defines the fluid programs. In general, two programs are built here: one for prediction and the other for training.
- learn: preprocesses transition data and feeds it into the training program.
- predict: feeds the current environmental state into the prediction program and returns an action to execute.
- sample: samples an action given the current state; this function is usually used for exploration.
import numpy as np
import paddle.fluid as fluid
import parl
from parl import layers


class CartpoleAgent(parl.Agent):
    def __init__(self, algorithm, obs_dim, act_dim):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        super(CartpoleAgent, self).__init__(algorithm)

    def build_program(self):
        # One program for prediction and one for training.
        self.pred_program = fluid.Program()
        self.train_program = fluid.Program()

        with fluid.program_guard(self.pred_program):
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            self.act_prob = self.alg.predict(obs)

        with fluid.program_guard(self.train_program):
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            act = layers.data(name='act', shape=[1], dtype='int64')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            self.cost = self.alg.learn(obs, act, reward)

    def sample(self, obs):
        # Sample an action from the predicted distribution (exploration).
        obs = np.expand_dims(obs, axis=0)
        act_prob = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.act_prob])[0]
        act_prob = np.squeeze(act_prob, axis=0)
        act = np.random.choice(range(self.act_dim), p=act_prob)
        return act

    def predict(self, obs):
        # Pick the most probable action (greedy, used for evaluation).
        obs = np.expand_dims(obs, axis=0)
        act_prob = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.act_prob])[0]
        act_prob = np.squeeze(act_prob, axis=0)
        act = np.argmax(act_prob)
        return act

    def learn(self, obs, act, reward):
        act = np.expand_dims(act, axis=-1)
        feed = {
            'obs': obs.astype('float32'),
            'act': act.astype('int64'),
            'reward': reward.astype('float32')
        }
        cost = self.fluid_executor.run(
            self.train_program, feed=feed, fetch_list=[self.cost])[0]
        return cost
Start Training
First, let's build an agent. As shown in the code below, we usually build a model, then an algorithm, and finally an agent.
OBS_DIM = 4  # CartPole-v0 observations are 4-dimensional

model = CartpoleModel(act_dim=2)
alg = parl.algorithms.PolicyGradient(model, lr=1e-3)
agent = CartpoleAgent(alg, obs_dim=OBS_DIM, act_dim=2)
Then we use this agent to interact with the environment and run about 1000 training episodes, after which the agent can solve the problem.
def run_episode(env, agent, train_or_test='train'):
    obs_list, action_list, reward_list = [], [], []
    obs = env.reset()
    while True:
        obs_list.append(obs)
        if train_or_test == 'train':
            # Sample actions for exploration during training.
            action = agent.sample(obs)
        else:
            # Act greedily during evaluation.
            action = agent.predict(obs)
        action_list.append(action)

        obs, reward, done, info = env.step(action)
        reward_list.append(reward)

        if done:
            break
    return obs_list, action_list, reward_list
env = gym.make("CartPole-v0")

for i in range(1000):
    obs_list, action_list, reward_list = run_episode(env, agent)
    if i % 10 == 0:
        logger.info("Episode {}, Reward Sum {}.".format(i, sum(reward_list)))

    batch_obs = np.array(obs_list)
    batch_action = np.array(action_list)
    # Turn the raw rewards into discounted, normalized returns
    # (see the sketch of this helper below).
    batch_reward = calc_discount_norm_reward(reward_list, GAMMA)

    agent.learn(batch_obs, batch_action, batch_reward)

    if (i + 1) % 100 == 0:
        # Periodically evaluate with the greedy policy.
        _, _, reward_list = run_episode(env, agent, train_or_test='test')
        total_reward = np.sum(reward_list)
        logger.info('Test reward: {}'.format(total_reward))
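The helper calc_discount_norm_reward and the constant GAMMA are defined in the full training script linked in the Summary. As a reference, here is a sketch of what such a helper typically computes: per-step discounted returns, normalized to zero mean and unit variance. The version shipped in the repository may differ in detail.

import numpy as np

def calc_discount_norm_reward(reward_list, gamma):
    # Sketch of the standard REINFORCE return computation; the helper
    # in the QuickStart example may differ in detail.
    discounted = np.zeros(len(reward_list), dtype='float32')
    running = 0.0
    # Walk backwards so each step accumulates its discounted future reward.
    for t in reversed(range(len(reward_list))):
        running = reward_list[t] + gamma * running
        discounted[t] = running
    # Normalize to zero mean and unit variance to stabilize training.
    discounted = (discounted - discounted.mean()) / discounted.std()
    return discounted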
Summary
In this tutorial, we have shown how to build an agent step by step to solve the Cartpole problem.

The complete training code can be found here. To try it out quickly, run the following commands:
# Install dependencies
pip install paddlepaddle
pip install gym
git clone https://github.com/PaddlePaddle/PARL.git
cd PARL
pip install .
# Train model
cd examples/QuickStart/
python train.py