Trust Region Policy Optimization


(Previously: Background for VPG)

TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be. The constraint is expressed in terms of KL-Divergence, a measure of (something like, but not exactly) distance between probability distributions.

This is different from normal policy gradient, which keeps new and old policies close in parameter space. But even seemingly small differences in parameter space can produce very large differences in performance, so a single bad step can collapse the policy performance. This makes it dangerous to use large step sizes with vanilla policy gradients, which hurts their sample efficiency. TRPO nicely avoids this kind of collapse, and tends to quickly and monotonically improve performance.

Quick Facts

  • TRPO is an on-policy algorithm.
  • TRPO can be used for environments with either discrete or continuous action spaces.
  • The Spinning Up implementation of TRPO supports parallelization with MPI.

Key Equations

Let \pi_{\theta} denote a policy with parameters \theta. The theoretical TRPO update is:

\begin{aligned}
\theta_{k+1} = \arg \max_{\theta} \;& {\mathcal L}(\theta_k, \theta) \\
\text{s.t.} \;& \bar{D}_{KL}(\theta || \theta_k) \leq \delta
\end{aligned}

where {\mathcal L}(\theta_k, \theta) is the surrogate advantage, a measure of how policy \pi_{\theta} performs relative to the old policy \pi_{\theta_k} using data from the old policy:

{\mathcal L}(\theta_k, \theta) = \underset{s,a \sim \pi_{\theta_k}}{\mathrm{E}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a) \right],

and \bar{D}_{KL}(\theta || \theta_k) is an average KL-divergence between policies across states visited by the old policy:

\bar{D}_{KL}(\theta || \theta_k) = \underset{s \sim \pi_{\theta_k}}{\mathrm{E}} \left[ D_{KL}\left(\pi_{\theta}(\cdot|s) || \pi_{\theta_k}(\cdot|s) \right) \right].
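
As a rough illustration of the two quantities just defined, the sketch below estimates the surrogate advantage and the average KL-divergence from a small batch, assuming a discrete-action policy represented as a table of softmax probabilities. The helper names (softmax, surrogate_advantage, mean_kl) and the random batch are hypothetical, and the advantages are assumed to be mean-centered over the batch.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def surrogate_advantage(new_probs, old_probs, actions, advantages):
    """Sample estimate of L(theta_k, theta): mean of probability ratio times advantage."""
    rows = np.arange(len(actions))
    ratio = new_probs[rows, actions] / old_probs[rows, actions]
    return float(np.mean(ratio * advantages))

def mean_kl(new_probs, old_probs):
    """Sample estimate of the average D_KL(pi_theta || pi_theta_k) over states."""
    per_state = np.sum(new_probs * (np.log(new_probs) - np.log(old_probs)), axis=-1)
    return float(np.mean(per_state))

# Hypothetical batch: 6 states, 3 discrete actions, actions sampled from the old policy.
rng = np.random.default_rng(0)
old_logits = rng.normal(size=(6, 3))
new_logits = old_logits + 0.1 * rng.normal(size=(6, 3))
old_probs, new_probs = softmax(old_logits), softmax(new_logits)
actions = np.array([rng.choice(3, p=p) for p in old_probs])
advantages = rng.normal(size=6)
advantages -= advantages.mean()  # assumption: advantages are mean-centered over the batch

# At theta = theta_k the ratio is 1 everywhere, so both estimates vanish.
print(surrogate_advantage(old_probs, old_probs, actions, advantages))
print(mean_kl(old_probs, old_probs))
# For a nearby new policy, the KL estimate is small and positive.
print(mean_kl(new_probs, old_probs))
```

Note that the KL is taken with the new policy as the first argument, matching the order \bar{D}_{KL}(\theta || \theta_k) used in the constraint.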

You Should Know

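The objective and constraint are both zero when \theta = \theta_k. Furthermore, the gradient of the constraint with respect to \theta is zero when \theta = \theta_k. Proving these facts requires some subtle command of the relevant math. It's an exercise worth doing, whenever you feel ready!

The theoretical TRPO update isn't the easiest to work with, so TRPO makes some approximations to get an answer quickly. We Taylor expand the objective and constraint to leading order around \theta_k (first order for the objective, and second order for the constraint, whose value and gradient both vanish there), writing g for the gradient of the surrogate advantage and H for the Hessian of the average KL-divergence, both evaluated at \theta_k: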

\begin{aligned}
{\mathcal L}(\theta_k, \theta) &\approx g^T (\theta - \theta_k) \\
\bar{D}_{KL}(\theta || \theta_k) &\approx \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k)
\end{aligned}

resulting in an approximate optimization problem,

\begin{aligned}
\theta_{k+1} = \arg \max_{\theta} \;& g^T (\theta - \theta_k) \\
\text{s.t.} \;& \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \leq \delta.
\end{aligned}

You Should Know

By happy coincidence, the gradient g of the surrogate advantage function with respect to \theta, evaluated at \theta = \theta_k, is exactly equal to the policy gradient, \nabla_{\theta} J(\pi_{\theta})! Try proving this, if you feel comfortable diving into the math.

This approximate problem can be analytically solved by the methods of Lagrangian duality [1], yielding the solution:

\theta_{k+1} = \theta_k + \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g.
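
A quick numerical sanity check of this closed-form step is sketched below, assuming a small random gradient g and a synthetic symmetric positive-definite H (both made up for illustration; natural_gradient_step is a hypothetical helper name). The step should sit exactly on the boundary of the quadratic constraint and point in an ascent direction for the linearized objective.

```python
import numpy as np

def natural_gradient_step(g, H, delta):
    """Return sqrt(2*delta / (g^T H^{-1} g)) * H^{-1} g for a small dense H."""
    x = np.linalg.solve(H, g)                   # x = H^{-1} g, without forming the inverse
    step_size = np.sqrt(2.0 * delta / (g @ x))  # scale so the quadratic constraint is tight
    return step_size * x

# Synthetic problem (assumption): random gradient and a positive-definite H.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 0.1 * np.eye(4)
g = rng.normal(size=4)
delta = 0.01

step = natural_gradient_step(g, H, delta)
quadratic_kl = 0.5 * step @ H @ step  # value of the approximate constraint at the step
linear_gain = g @ step                # value of the approximate objective at the step

print(quadratic_kl, delta)  # the two should agree up to rounding
print(linear_gain)          # positive, since H is positive definite
```

A dense solve is only practical here because the toy problem has four parameters.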

If we were to stop here, and just use this final result, the algorithm would be exactly calculating the Natural Policy Gradient. A problem is that, due to the approximation errors introduced by the Taylor expansion, this may not satisfy the KL constraint, or actually improve the surrogate advantage. TRPO adds a modification to this update rule: a backtracking line search,

\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g,

where \alpha \in (0,1) is the backtracking coefficient, and j is the smallest nonnegative integer such that \pi_{\theta_{k+1}} satisfies the KL constraint and produces a positive surrogate advantage.

Lastly: computing and storing the matrix inverse, H^{-1}, is painfully expensive when dealing with neural network policies with thousands or millions of parameters. TRPO sidesteps the issue by using the conjugate gradient algorithm to solve Hx = g for x = H^{-1} g, requiring only a function which can compute the matrix-vector product Hx instead of computing and storing the whole matrix H directly. This is not too hard to do: we set up a symbolic operation to calculate
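
The backtracking rule can be sketched as follows. This is a minimal illustration, not the Spinning Up implementation: surrogate_fn and kl_fn are hypothetical callables standing in for evaluations of the true surrogate advantage and average KL at a candidate \theta, and falling back to \theta_k when no exponent is accepted is an assumption of this sketch.

```python
import numpy as np

def backtracking_line_search(theta_k, full_step, surrogate_fn, kl_fn, delta,
                             backtrack_coeff=0.8, backtrack_iters=10):
    """Try steps backtrack_coeff**j * full_step for j = 0, 1, ... and return
    the first candidate that satisfies the KL limit with positive surrogate."""
    for j in range(backtrack_iters):
        theta_new = theta_k + backtrack_coeff**j * full_step
        if kl_fn(theta_new) <= delta and surrogate_fn(theta_new) > 0:
            return theta_new, j
    # Assumption of this sketch: keep the old parameters if every candidate fails.
    return theta_k, None

# Toy problem (made up): H = I and g = [1, 0], so the proposed full step
# is sqrt(2 * delta) * g. The "true" KL has an extra quartic term, so the
# quadratic model underestimates it and the full step violates the limit.
delta = 0.01
theta_k = np.zeros(2)
g = np.array([1.0, 0.0])
full_step = np.sqrt(2.0 * delta / (g @ g)) * g

kl_fn = lambda theta: 0.5 * theta @ theta + 50.0 * (theta @ theta) ** 2
surrogate_fn = lambda theta: g @ theta - theta @ theta

theta_next, j = backtracking_line_search(theta_k, full_step, surrogate_fn, kl_fn, delta)
print(j, kl_fn(theta_next), surrogate_fn(theta_next))
```

The default values of backtrack_coeff and backtrack_iters mirror the defaults shown in the spinup.trpo signature.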

Hx = \nabla_{\theta} \left( \left(\nabla_{\theta} \bar{D}_{KL}(\theta || \theta_k)\right)^T x \right),

which gives us the correct output without computing the whole matrix.
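
To make the conjugate gradient idea concrete, here is a minimal NumPy sketch that solves Hx = g using only a matrix-vector product function. The hvp argument is a hypothetical callable; in this toy check it multiplies by an explicit small matrix, whereas in TRPO it would be the symbolic Hessian-vector product above. The early-exit tolerance is an addition of this sketch.

```python
import numpy as np

def conjugate_gradient(hvp, g, cg_iters=10, tol=1e-10):
    """Approximately solve H x = g given only hvp(v) = H v, for positive-definite H."""
    x = np.zeros_like(g)
    r = g.copy()   # residual g - H x, which equals g at x = 0
    p = r.copy()   # current search direction
    r_dot = r @ r
    for _ in range(cg_iters):
        Hp = hvp(p)
        alpha = r_dot / (p @ Hp)   # exact minimizing step length along p
        x += alpha * p
        r -= alpha * Hp
        new_r_dot = r @ r
        if new_r_dot < tol:        # residual is negligible, stop early
            break
        p = r + (new_r_dot / r_dot) * p   # next direction, H-conjugate to the previous ones
        r_dot = new_r_dot
    return x

# Toy check (assumption): an explicit positive-definite H, used only to verify the result.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + np.eye(5)
g = rng.normal(size=5)

x_cg = conjugate_gradient(lambda v: H @ v, g)
x_exact = np.linalg.solve(H, g)
print(np.max(np.abs(x_cg - x_exact)))  # small
```

The solver never indexes into H; it only calls hvp, which is what allows the full matrix to be avoided for large policies.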

[1] See Convex Optimization by Boyd and Vandenberghe, especially chapters 2 through 5.

Exploration vs. Exploitation

TRPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.
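
The link between policy randomness and exploration can be illustrated with a toy categorical policy. In the sketch below, scaling up a fixed set of logits stands in (as an assumption, purely for illustration) for a policy becoming more confident during training: the entropy falls and sampled actions concentrate on the preferred action.

```python
import numpy as np

def softmax(logits):
    """Softmax over a 1-D array of logits, shifted for numerical stability."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution with nonzero probabilities."""
    return float(-np.sum(probs * np.log(probs)))

rng = np.random.default_rng(0)
base_logits = np.array([1.0, 0.5, 0.0])  # hypothetical action preferences

entropies = []
for scale in (1.0, 5.0, 20.0):  # larger scale = sharper, less random policy
    probs = softmax(scale * base_logits)
    sampled = rng.choice(len(probs), size=1000, p=probs)  # on-policy action sampling
    preferred_fraction = float(np.mean(sampled == 0))     # how often action 0 is chosen
    entropies.append(entropy(probs))
    print(scale, entropies[-1], preferred_fraction)
```

Once the entropy is close to zero, actions other than the preferred one are almost never tried, which is how a policy can stay stuck in a local optimum.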


spinup.trpo(env_fn, actor_critic= , ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, delta=0.01, vf_lr=0.001, train_v_iters=80, damping_coeff=0.1, cg_iters=10, backtrack_iters=10, backtrack_coeff=0.8, lam=0.97, max_ep_len=1000, logger_kwargs={}, save_freq=10, algo='trpo')
  • env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
  • actor_critic

    A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent's Tensorflow computation graph:

    Symbol    Shape             Description
    pi        (batch, act_dim)  Samples actions from policy given states in x_ph.
    logp      (batch,)          Gives log probability, according to the policy, of taking actions a_ph in states x_ph.
    logp_pi   (batch,)          Gives log probability, according to the policy, of the action sampled by pi.
    info      N/A               A dict of any intermediate quantities (from calculating the policy or log probabilities) which are needed for analytically computing KL divergence (e.g. sufficient statistics of the distributions).
    info_phs  N/A               A dict of placeholders for old values of the entries in info.
    d_kl      ()                A symbol for computing the mean KL divergence between the current policy (pi) and the old policy (as specified by the inputs to info_phs) over the batch of states given in x_ph.
    v         (batch,)          Gives the value estimate for states in x_ph. (Critical: make sure to flatten this!)
  • ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to TRPO.
  • seed (int) – Seed for random number generators.
  • steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
  • epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
  • gamma (float) – Discount factor. (Always between 0 and 1.)
  • delta (float) – KL-divergence limit for TRPO / NPG update. (Should be small for stability. Values like 0.01, 0.05.)
  • vf_lr (float) – Learning rate for value function optimizer.
  • train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
  • damping_coeff (float) –

    Artifact for numerical stability, should be smallish. Adjusts Hessian-vector product calculation:

    Hv \rightarrow (\alpha I + H)v

    where \alpha is the damping coefficient (distinct from the backtracking coefficient \alpha used in the line search). Probably don't play with this hyperparameter.

  • cg_iters (int) –

    Number of iterations of conjugate gradient to perform. Increasing this will lead to a more accurate approximation to H^{-1} g, and possibly slightly-improved performance, but at the cost of slowing things down.

    Also probably don't play with this hyperparameter.

  • backtrack_iters (int) – Maximum number of steps allowed in the backtracking line search. Since the line search usually doesn't backtrack, and usually only steps back once when it does, this hyperparameter doesn't often matter.
  • backtrack_coeff (float) – How far back to step during backtracking line search. (Always between 0 and 1, usually above 0.5.)
  • lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
  • max_ep_len (int) – Maximum length of trajectory / episode / rollout.
  • logger_kwargs (dict) – Keyword args for EpochLogger.
  • save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
  • algo – Either 'trpo' or 'npg': this code supports both, since they are almost the same.
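
The damping_coeff adjustment described in the parameter list above can be sketched as a thin wrapper around a Hessian-vector product function. This is an illustrative sketch only: damped_hvp is a hypothetical helper, and the singular 2x2 matrix is a made-up example of why adding \alpha I helps numerically.

```python
import numpy as np

def damped_hvp(hvp, damping_coeff):
    """Wrap hvp(v) = H v so that it returns (damping_coeff * I + H) v instead."""
    return lambda v: hvp(v) + damping_coeff * v

# Made-up example: H is singular, so H x = g has no solution for this g.
H = np.array([[2.0, 0.0],
              [0.0, 0.0]])
g = np.array([1.0, 1.0])
damping_coeff = 0.1  # same value as the default in the signature above

hvp = lambda v: H @ v
damped = damped_hvp(hvp, damping_coeff)

# The wrapped product matches multiplication by the explicitly damped matrix.
H_damped = H + damping_coeff * np.eye(2)
print(damped(g), H_damped @ g)

# The damped system is positive definite and can be solved.
x = np.linalg.solve(H_damped, g)
print(x)
```

Directions in which H has little or no curvature end up with curvature of roughly \alpha, which bounds how large the solved step can become along them.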

Saved Model Contents

The computation graph saved by the logger includes:

Key  Value
x    Tensorflow placeholder for state input.
pi   Samples an action from the agent, conditioned on states in x.
v    Gives value estimate for states in x.

This saved model can be accessed either by


Why These Papers?

Schulman 2015 is included because it is the original paper describing TRPO. Schulman 2016 is included because our implementation of TRPO makes use of Generalized Advantage Estimation for computing the policy gradient. Kakade and Langford 2002 is included because it contains theoretical results which motivate and deeply connect to the theoretical foundations of TRPO.