## Background

(Previously: Introduction to RL, Part 3)

The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.

### Quick Facts

• VPG is an on-policy algorithm.
• VPG can be used for environments with either discrete or continuous action spaces.
• The Spinning Up implementation of VPG supports parallelization with MPI.

### Key Equations

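Let $\pi_\theta$ denote a policy with parameters $\theta$, and $J(\pi_\theta)$ denote the expected finite-horizon undiscounted return of the policy. The gradient of $J(\pi_\theta)$ is

$$\nabla_\theta J(\pi_\theta) = \underset{\tau \sim \pi_\theta}{\mathrm{E}}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \, A^{\pi_\theta}(s_t, a_t) \right],$$

where $\tau$ is a trajectory and $A^{\pi_\theta}$ is the advantage function for the current policy.

The policy gradient algorithm works by updating policy parameters via stochastic gradient ascent on policy performance:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_{\theta_k})$$

Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise using the finite-horizon undiscounted policy gradient formula.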
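To make the update concrete, here is a minimal sketch of stochastic gradient ascent on a softmax policy, using only the Python standard library. It is not the Spinning Up implementation: the two-armed bandit setting (one-step trajectories), the batch-mean baseline used as a crude advantage estimate, and every function name below are assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def vpg_step(theta, batch, alpha):
    """One gradient ascent step; batch is a list of (action, advantage) pairs."""
    grad = [0.0] * len(theta)
    for action, adv in batch:
        g = grad_log_softmax(theta, action)
        for i in range(len(theta)):
            grad[i] += g[i] * adv
    grad = [g / len(batch) for g in grad]  # sample mean estimates the expectation
    return [t + alpha * g for t, g in zip(theta, grad)]


def train_bandit(mean_rewards=(0.0, 1.0), iters=200, batch_size=64,
                 alpha=0.5, seed=0):
    """Hypothetical toy problem: each action yields a noisy scalar reward."""
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)
    for _ in range(iters):
        probs = softmax(theta)
        # On-policy: actions are sampled from the current policy.
        actions = rng.choices(range(len(theta)), weights=probs, k=batch_size)
        rewards = [mean_rewards[a] + rng.gauss(0.0, 0.1) for a in actions]
        baseline = sum(rewards) / len(rewards)  # stand-in for a value estimate
        batch = [(a, r - baseline) for a, r in zip(actions, rewards)]
        theta = vpg_step(theta, batch, alpha)
    return softmax(theta)
```

Calling `train_bandit()` returns the final action probabilities. Under these assumptions, most of the probability mass should end up on the action with the higher mean reward, which is the "push up good actions, push down bad ones" behavior described earlier.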

### Exploration vs. Exploitation

VPG trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.
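One way to quantify "randomness in action selection" for a discrete policy is the entropy of its action distribution. The sketch below is illustrative only: the logits are made-up values, and the helper names are not part of Spinning Up.

```python
import math


def softmax(logits):
    """Convert logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_entropy(probs):
    """Shannon entropy in nats; higher means more exploratory sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits: an untrained policy vs. one that favors action 0.
early_policy = softmax([0.0, 0.0, 0.0, 0.0])
late_policy = softmax([4.0, 0.0, 0.0, 0.0])

print(policy_entropy(early_policy))  # maximal for four actions: log(4)
print(policy_entropy(late_policy))   # smaller: sampling is less random
```

Tracking a quantity like this during training can help show when a policy has stopped exploring, although low entropy alone does not indicate whether the policy has reached a good optimum or a poor local one.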

### Pseudocode

## Documentation

spinup.vpg(env_fn, actor_critic= , ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, pi_lr=0.0003, vf_lr=0.001, train_v_iters=80, lam=0.97, max_ep_len=1000, logger_kwargs={}, save_freq=10)
Parameters:
• env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
• actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent's TensorFlow computation graph:

| Symbol | Shape | Description |
|---|---|---|
| pi | (batch, act_dim) | Samples actions from policy given states. |
| logp | (batch,) | Gives log probability, according to the policy, of taking actions a_ph in states x_ph. |
| logp_pi | (batch,) | Gives log probability, according to the policy, of the action sampled by pi. |
| v | (batch,) | Gives the value estimate for states in x_ph. (Critical: make sure to flatten this!) |

• ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to VPG.
• seed (int) – Seed for random number generators.
• steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
• epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
• gamma (float) – Discount factor. (Always between 0 and 1.)
• pi_lr (float) – Learning rate for policy optimizer.
• vf_lr (float) – Learning rate for value function optimizer.
• train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
• lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
• max_ep_len (int) – Maximum length of trajectory / episode / rollout.
• logger_kwargs (dict) – Keyword args for EpochLogger.
• save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
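The gamma and lam parameters above both feed into the advantage estimate. As a rough illustration of how they interact, here is a minimal standard-library sketch of GAE-Lambda advantages and discounted rewards-to-go for a single trajectory. The function names and the `last_value` bootstrap argument are assumptions made for this example, not Spinning Up's API.

```python
def discount_cumsum(xs, discount):
    """Return out[t] = sum over k >= t of discount**(k - t) * xs[k]."""
    out = [0.0] * len(xs)
    running = 0.0
    for t in reversed(range(len(xs))):
        running = xs[t] + discount * running
        out[t] = running
    return out


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.97):
    """GAE-Lambda: discounted sum of TD residuals with factor gamma * lam.

    last_value is the value estimate for the state after the final step
    (0 if the episode terminated there).
    """
    vals = list(values) + [last_value]
    deltas = [rewards[t] + gamma * vals[t + 1] - vals[t]
              for t in range(len(rewards))]
    return discount_cumsum(deltas, gamma * lam)


def rewards_to_go(rewards, last_value, gamma=0.99):
    """Discounted return from each step, usable as value-function targets."""
    return discount_cumsum(list(rewards) + [last_value], gamma)[:-1]
```

With lam=0 the advantage reduces to the one-step TD residual (lower variance, more bias from the value estimate). With lam=1 it becomes the discounted return minus the value baseline (higher variance, less bias). Values of lam close to 1, as the parameter description suggests, sit between these extremes.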

### Saved Model Contents

The computation graph saved by the logger includes:

| Key | Value |
|---|---|
| x | TensorFlow placeholder for state input. |
| pi | Samples an action from the agent, conditioned on states in x. |
| v | Gives value estimate for states in x. |

This saved model can be accessed either by