Deep Deterministic Policy Gradient
Background
(Previously: Introduction to RL Part 1: The Optimal Q-Function and the Optimal Action)
Deep Deterministic Policy Gradient (DDPG) is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.
This approach is closely connected to Q-learning, and is motivated the same way: if you know the optimal action-value function Q^*(s,a), then in any given state, the optimal action a^*(s) can be found by solving

a^*(s) = \arg\max_a Q^*(s,a).
DDPG interleaves learning an approximator to Q^*(s,a) with learning an approximator to a^*(s), and it does so in a way which is specifically adapted for environments with continuous action spaces. But what does it mean that DDPG is adapted specifically for environments with continuous action spaces? It relates to how we compute the max over actions in \max_a Q^*(s,a).
When there are a finite number of discrete actions, the max poses no problem, because we can just compute the Q-values for each action separately and directly compare them. (This also immediately gives us the action which maximizes the Q-value.) But when the action space is continuous, we can't exhaustively evaluate the space, and solving the optimization problem is highly non-trivial. Using a normal optimization algorithm would make calculating \max_a Q(s,a) a painfully expensive subroutine. And since it would need to be run every time the agent wants to take an action in the environment, this is unacceptable.
Because the action space is continuous, the function Q^*(s,a) is presumed to be differentiable with respect to the action argument. This allows us to set up an efficient, gradient-based learning rule for a policy \mu(s) which exploits that fact. Then, instead of running an expensive optimization subroutine each time we wish to compute \max_a Q(s,a), we can approximate it with \max_a Q(s,a) \approx Q(s, \mu(s)). See the Key Equations section for details.
Quick Facts

- DDPG is an off-policy algorithm.
- DDPG can only be used for environments with continuous action spaces.
- DDPG can be thought of as being deep Q-learning for continuous action spaces.
- The Spinning Up implementation of DDPG does not support parallelization.
Key Equations

Here, we'll explain the math behind the two parts of DDPG: learning a Q-function, and learning a policy.

The Q-Learning Side of DDPG
First, let's recap the Bellman equation describing the optimal action-value function, Q^*(s,a). It's given by

Q^*(s,a) = \underset{s' \sim P}{\mathbb{E}}\left[ r(s,a) + \gamma \max_{a'} Q^*(s', a') \right],

where s' \sim P is shorthand for saying that the next state, s', is sampled by the environment from a distribution P(\cdot | s,a).
This Bellman equation is the starting point for learning an approximator to Q^*(s,a). Suppose the approximator is a neural network Q_{\phi}(s,a), with parameters \phi, and that we have collected a set \mathcal{D} of transitions (s,a,r,s',d) (where d indicates whether state s' is terminal). We can set up a mean-squared Bellman error (MSBE) function, which tells us roughly how closely Q_{\phi} comes to satisfying the Bellman equation:

L(\phi, \mathcal{D}) = \underset{(s,a,r,s',d) \sim \mathcal{D}}{\mathbb{E}}\left[ \left( Q_{\phi}(s,a) - \left(r + \gamma (1 - d) \max_{a'} Q_{\phi}(s',a') \right) \right)^2 \right]
Here, in evaluating (1 - d), we've used a Python convention of evaluating True to 1 and False to zero. Thus, when d==True (which is to say, when s' is a terminal state), the Q-function should show that the agent gets no additional rewards after the current state. (This choice of notation corresponds to what we later implement in code.)
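This done-flag convention can be sketched in a few lines of Python (a minimal illustration with a hypothetical bellman_target helper, not the Spinning Up code):

```python
# Sketch of the Bellman backup r + gamma * (1 - d) * max_a' Q(s', a').
# The boolean done flag d evaluates to 0/1 in arithmetic, so terminal
# transitions contribute no future value beyond the immediate reward.

def bellman_target(r, d, max_next_q, gamma=0.99):
    """Backup value for one transition; d is True if s' is terminal."""
    return r + gamma * (1 - d) * max_next_q

print(bellman_target(r=1.0, d=True, max_next_q=5.0))   # 1.0: only the reward
print(bellman_target(r=1.0, d=False, max_next_q=5.0))  # approximately 5.95
```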
Q-learning algorithms for function approximators, such as DQN (and all its variants) and DDPG, are largely based on minimizing this MSBE loss function. There are two main tricks employed by all of them which are worth describing, and then a specific detail for DDPG.
Trick One: Replay Buffers. All standard algorithms for training a deep neural network to approximate Q^*(s,a) make use of an experience replay buffer. This is the set \mathcal{D} of previous experiences. In order for the algorithm to have stable behavior, the replay buffer should be large enough to contain a wide range of experiences, but it may not always be good to keep everything. If you only use the very-most recent data, you will overfit to that and things will break; if you use too much experience, you may slow down your learning. This may take some tuning to get right.
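A replay buffer of this kind can be sketched with standard-library containers (the Spinning Up implementation instead preallocates numpy arrays, but the behavior is the same):

```python
import random
from collections import deque

# Minimal fixed-capacity replay buffer sketch (not the Spinning Up code).
class ReplayBuffer:
    def __init__(self, capacity):
        # deque with maxlen evicts the oldest experience once full
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s2, d):
        self.buffer.append((s, a, r, s2, d))

    def sample_batch(self, batch_size):
        # Uniform sampling over everything stored: transitions gathered
        # under outdated policies are fair game for MSBE minimization.
        return random.sample(self.buffer, batch_size)
```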
You Should Know
We've mentioned that DDPG is an off-policy algorithm: this is as good a point as any to highlight why and how. Observe that the replay buffer should contain old experiences, even though they might have been obtained using an outdated policy. Why are we able to use these at all? The reason is that the Bellman equation doesn't care which transition tuples are used, or how the actions were selected, or what happens after a given transition, because the optimal Q-function should satisfy the Bellman equation for all possible transitions. So any transitions that we've ever experienced are fair game when trying to fit a Q-function approximator via MSBE minimization.
Trick Two: Target Networks. Q-learning algorithms make use of target networks. The term

r + \gamma (1 - d) \max_{a'} Q_{\phi}(s',a')

is called the target, because when we minimize the MSBE loss, we are trying to make the Q-function be more like this target. Problematically, the target depends on the same parameters we are trying to train: \phi. This makes MSBE minimization unstable. The solution is to use a set of parameters which comes close to \phi, but with a time delay; that is to say, a second network, called the target network, which lags the first. The parameters of the target network are denoted \phi_{\text{targ}}.
In DQN-based algorithms, the target network is just copied over from the main network every fixed number of steps. In DDPG-style algorithms, the target network is updated once per main network update by polyak averaging:

\phi_{\text{targ}} \leftarrow \rho \phi_{\text{targ}} + (1 - \rho) \phi,

where \rho is a hyperparameter between 0 and 1 (usually close to 1). (This hyperparameter is called polyak in our code.)
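The polyak-averaged update can be sketched on plain Python lists of parameters (a real implementation applies the same rule elementwise to network weights):

```python
# Sketch of a polyak-averaged target-network update (not the Spinning Up
# code): targ <- polyak * targ + (1 - polyak) * main, elementwise.
def polyak_update(targ_params, main_params, polyak=0.995):
    return [polyak * t + (1 - polyak) * m
            for t, m in zip(targ_params, main_params)]

targ = polyak_update([0.0, 0.0], [1.0, 2.0])
print(targ)  # approximately [0.005, 0.01]: the target lags the main network
```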
DDPG Detail: Calculating the Max Over Actions in the Target. As mentioned earlier: computing the maximum over actions in the target is a challenge in continuous action spaces. DDPG deals with this by using a target policy network to compute an action which approximately maximizes Q_{\phi_{\text{targ}}}. The target policy network is found the same way as the target Q-function: by polyak averaging the policy parameters over the course of training.
Putting it all together, Q-learning in DDPG is performed by minimizing the following MSBE loss with stochastic gradient descent:

L(\phi, \mathcal{D}) = \underset{(s,a,r,s',d) \sim \mathcal{D}}{\mathbb{E}}\left[ \left( Q_{\phi}(s,a) - \left(r + \gamma (1 - d) Q_{\phi_{\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s')) \right) \right)^2 \right],

where \mu_{\theta_{\text{targ}}} is the target policy.
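The full loss can be sketched with the networks stood in by plain callables (a hypothetical ddpg_q_loss helper, not the Spinning Up code):

```python
# Sketch of the DDPG Q-loss over a batch of (s, a, r, s', d) transitions.
# q, q_targ, and mu_targ are callables standing in for the main Q-network,
# target Q-network, and target policy network respectively.
def ddpg_q_loss(batch, q, q_targ, mu_targ, gamma=0.99):
    total = 0.0
    for (s, a, r, s2, d) in batch:
        # The backup uses only *target* networks; the target policy supplies
        # the (approximately) maximizing action in the continuous space.
        backup = r + gamma * (1 - d) * q_targ(s2, mu_targ(s2))
        total += (q(s, a) - backup) ** 2
    return total / len(batch)
```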
The Policy Learning Side of DDPG
Policy learning in DDPG is fairly simple. We want to learn a deterministic policy \mu_{\theta}(s) which gives the action that maximizes Q_{\phi}(s,a). Because the action space is continuous, and we assume the Q-function is differentiable with respect to action, we can just perform gradient ascent (with respect to policy parameters only) to solve

\max_{\theta} \underset{s \sim \mathcal{D}}{\mathbb{E}}\left[ Q_{\phi}(s, \mu_{\theta}(s)) \right].

Note that the Q-function parameters are treated as constants here.
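A toy version of this gradient ascent can be worked out by hand (everything here is hypothetical and for illustration only): take a known Q(s,a) = -(a - 2s)^2 and a linear policy \mu_{\theta}(s) = \theta s, so the Q-maximizing policy is \theta = 2.

```python
# Gradient ascent on E_s[Q(s, mu_theta(s))] with Q(s,a) = -(a - 2s)^2
# and mu_theta(s) = theta * s. The Q parameters are held fixed; only
# theta is updated. The ascent should drive theta toward 2.
def policy_grad(theta, states):
    # d/dtheta of mean_s Q(s, theta*s) = mean_s of -2*(theta*s - 2s)*s
    return sum(-2.0 * (theta * s - 2.0 * s) * s for s in states) / len(states)

theta, states, lr = 0.0, [0.5, 1.0, 1.5], 0.1
for _ in range(200):
    theta += lr * policy_grad(theta, states)  # ascent step
print(round(theta, 4))  # 2.0
```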
Exploration vs. Exploitation

DDPG trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make DDPG policies explore better, we add noise to their actions at training time. The authors of the original DDPG paper recommended time-correlated OU noise, but more recent results suggest that uncorrelated, mean-zero Gaussian noise works perfectly well. Since the latter is simpler, it is preferred. To facilitate getting higher-quality training data, you may reduce the scale of the noise over the course of training. (We do not do this in our implementation, and keep noise scale fixed throughout.)

At test time, to see how well the policy exploits what it has learned, we do not add noise to the actions.
You Should Know
Our DDPG implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the start_steps keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal DDPG exploration.
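Training-time action selection, combining the start_steps trick with clipped Gaussian noise, can be sketched as follows (a hypothetical get_action helper, not the Spinning Up code):

```python
import random

# Sketch of training-time action selection for a 1-D action space.
def get_action(mu, s, t, act_limit=1.0, act_noise=0.1, start_steps=10000):
    if t < start_steps:
        # Pure uniform-random exploration for the first start_steps steps
        return random.uniform(-act_limit, act_limit)
    # Deterministic policy plus mean-zero Gaussian noise, clipped to bounds
    a = mu(s) + act_noise * random.gauss(0.0, 1.0)
    return max(-act_limit, min(act_limit, a))

# At test time noise is dropped entirely: the agent just uses mu(s).
```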
Documentation

spinup.ddpg(env_fn, actor_critic=, ac_kwargs={}, seed=0, steps_per_epoch=5000, epochs=100, replay_size=1000000, gamma=0.99, polyak=0.995, pi_lr=0.001, q_lr=0.001, batch_size=100, start_steps=10000, act_noise=0.1, max_ep_len=1000, logger_kwargs={}, save_freq=1)
Parameters:

- env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
- actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent's Tensorflow computation graph:

  | Symbol | Shape            | Description |
  |--------|------------------|-------------|
  | pi     | (batch, act_dim) | Deterministically computes actions from policy given states. |
  | q      | (batch,)         | Gives the current estimate of Q* for states in x_ph and actions in a_ph. |
  | q_pi   | (batch,)         | Gives the composition of q and pi for states in x_ph: q(x, pi(x)). |

- ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to DDPG.
- seed (int) – Seed for random number generators.
- steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
- epochs (int) – Number of epochs to run and train agent.
- replay_size (int) – Maximum length of replay buffer.
- gamma (float) – Discount factor. (Always between 0 and 1.)
- polyak (float) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to

  \theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1 - \rho) \theta,

  where \rho is polyak. (Always between 0 and 1, usually close to 1.)
- pi_lr (float) – Learning rate for policy.
- q_lr (float) – Learning rate for Q-networks.
- batch_size (int) – Minibatch size for SGD.
- start_steps (int) – Number of steps for uniform-random action selection, before running real policy. Helps exploration.
- act_noise (float) – Stddev for Gaussian exploration noise added to policy at training time. (At test time, no noise is added.)
- max_ep_len (int) – Maximum length of trajectory / episode / rollout.
- logger_kwargs (dict) – Keyword args for EpochLogger.
- save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
Saved Model Contents
The computation graph saved by the logger includes:
| Key | Value |
|-----|-------|
| x   | Tensorflow placeholder for state input. |
| a   | Tensorflow placeholder for action input. |
| pi  | Deterministically computes an action from the agent, conditioned on states in x. |
| q   | Gives action-value estimate for states in x and actions in a. |
This saved model can be accessed either by

- running the trained policy with the test_policy.py tool,
- or loading the whole saved graph into a program with restore_tf_graph.
References

Relevant Papers

- Deterministic Policy Gradient Algorithms, Silver et al. 2014
- Continuous Control With Deep Reinforcement Learning, Lillicrap et al. 2016

Why These Papers?

Silver 2014 is included because it establishes the theory underlying deterministic policy gradients (DPG). Lillicrap 2016 is included because it adapts the theoretically-grounded DPG algorithm to the deep RL setting, giving DDPG.