Key Papers in Deep RL

What follows is a list of papers in deep RL that are worth reading. This is far from comprehensive, but should provide a useful starting point for someone looking to do research in the field.

1. Model-Free RL

a. Deep Q-Learning

[1] Playing Atari with Deep Reinforcement Learning, Mnih et al, 2013. Algorithm: DQN. (See the code sketch after this list.)
[2] Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht and Stone, 2015. Algorithm: Deep Recurrent Q-Learning.
[3] Dueling Network Architectures for Deep Reinforcement Learning, Wang et al, 2015. Algorithm: Dueling DQN.
[4] Deep Reinforcement Learning with Double Q-learning, Hasselt et al, 2015. Algorithm: Double DQN.
[5] Prioritized Experience Replay, Schaul et al, 2015. Algorithm: Prioritized Experience Replay (PER).
[6] Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel et al, 2017. Algorithm: Rainbow DQN.
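
To make the difference between the DQN [1] and Double DQN [4] update targets concrete, here is a minimal NumPy sketch. The tabular arrays q_online and q_target and the toy transition values are illustrative stand-ins, not the convolutional networks and Atari frames used in the papers.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 5, 3, 0.99

    q_online = rng.normal(size=(n_states, n_actions))  # online Q estimates
    q_target = q_online.copy()                          # periodically-synced target copy

    def dqn_target(reward, next_state, done):
        # Standard DQN: bootstrap from the max of the target network.
        return reward + gamma * (1.0 - done) * q_target[next_state].max()

    def double_dqn_target(reward, next_state, done):
        # Double DQN: the online network selects the action, the target network evaluates it.
        a_star = q_online[next_state].argmax()
        return reward + gamma * (1.0 - done) * q_target[next_state, a_star]

    # One fabricated transition (r, s', done), purely for illustration.
    print(dqn_target(reward=1.0, next_state=2, done=0.0))
    print(double_dqn_target(reward=1.0, next_state=2, done=0.0))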

b. Policy Gradients

[7] Asynchronous Methods for Deep Reinforcement Learning, Mnih et al, 2016. Algorithm: A3C.
[8] Trust Region Policy Optimization, Schulman et al, 2015. Algorithm: TRPO.
[9] High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al, 2015. Algorithm: GAE.
[10] Proximal Policy Optimization Algorithms, Schulman et al, 2017. Algorithm: PPO-Clip, PPO-Penalty. (See the code sketch after this list.)
[11] Emergence of Locomotion Behaviours in Rich Environments, Heess et al, 2017. Algorithm: PPO-Penalty.
[12] Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, Wu et al, 2017. Algorithm: ACKTR.
[13] Sample Efficient Actor-Critic with Experience Replay, Wang et al, 2016. Algorithm: ACER.
[14] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al, 2018. Algorithm: SAC.
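
As one concrete reference point for this family, here is a minimal NumPy sketch of the PPO-Clip surrogate objective from [10]. The batch of log-probabilities and advantages is fabricated, and the clip range eps=0.2 is a commonly used default rather than anything mandated by the paper.

    import numpy as np

    def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
        ratio = np.exp(logp_new - logp_old)              # pi_new(a|s) / pi_old(a|s)
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        # PPO-Clip maximizes the per-sample minimum of the unclipped and clipped surrogates.
        return np.mean(np.minimum(ratio * advantages, clipped * advantages))

    rng = np.random.default_rng(0)
    adv = rng.normal(size=64)                            # fabricated advantage estimates
    logp_old = rng.normal(size=64)
    logp_new = logp_old + 0.05 * rng.normal(size=64)     # a slightly-updated policy
    print(ppo_clip_objective(logp_new, logp_old, adv))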

c. Deterministic Policy Gradients

[15] Deterministic Policy Gradient Algorithms, Silver et al, 2014. Algorithm: DPG.
[16] Continuous Control With Deep Reinforcement Learning, Lillicrap et al, 2015. Algorithm: DDPG.
[17] Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al, 2018. Algorithm: TD3. (See the code sketch after this list.)
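
The following is a minimal NumPy sketch of the clipped double-Q target used in TD3 [17], combining target policy smoothing with the minimum over two target critics. The linear critics, tanh actor, dimensions, and noise settings are illustrative stand-ins, not the architectures or hyperparameters from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    obs_dim, act_dim, gamma = 4, 2, 0.99
    W1 = rng.normal(size=obs_dim + act_dim)              # toy target critic 1
    W2 = rng.normal(size=obs_dim + act_dim)              # toy target critic 2
    policy_W = rng.normal(size=(obs_dim, act_dim))       # toy target actor

    def td3_target(reward, next_obs, done, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
        # Target policy smoothing: add clipped noise to the target action.
        a_next = np.tanh(next_obs @ policy_W) * act_limit
        noise = np.clip(noise_std * rng.normal(size=act_dim), -noise_clip, noise_clip)
        a_next = np.clip(a_next + noise, -act_limit, act_limit)
        x = np.concatenate([next_obs, a_next])
        # Clipped double-Q: bootstrap from the smaller of the two target critics.
        return reward + gamma * (1.0 - done) * min(x @ W1, x @ W2)

    print(td3_target(reward=1.0, next_obs=rng.normal(size=obs_dim), done=0.0))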

d. Distributional RL

[18] A Distributional Perspective on Reinforcement Learning, Bellemare et al, 2017. Algorithm: C51.
[19] Distributional Reinforcement Learning with Quantile Regression, Dabney et al, 2017. Algorithm: QR-DQN. (See the code sketch after this list.)
[20] Implicit Quantile Networks for Distributional Reinforcement Learning, Dabney et al, 2018. Algorithm: IQN.
[21] Dopamine: A Research Framework for Deep Reinforcement Learning, Anonymous, 2018. Contribution: Introduces Dopamine, a code repository containing implementations of DQN, C51, IQN, and Rainbow. Code link.
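
As a small illustration of the distributional viewpoint, here is a NumPy sketch of the quantile-regression Huber loss at the heart of QR-DQN [19]. The two sets of quantile estimates are fabricated; in the paper they would come from the online and target networks for the selected actions.

    import numpy as np

    def quantile_huber_loss(pred_quantiles, target_quantiles, kappa=1.0):
        n = len(pred_quantiles)
        taus = (np.arange(n) + 0.5) / n                  # quantile midpoints tau_i
        # Pairwise TD errors between every target quantile and every predicted quantile.
        u = target_quantiles[None, :] - pred_quantiles[:, None]
        huber = np.where(np.abs(u) <= kappa,
                         0.5 * u ** 2,
                         kappa * (np.abs(u) - 0.5 * kappa))
        # The asymmetric weight |tau - 1{u < 0}| turns the Huber loss into quantile regression.
        weight = np.abs(taus[:, None] - (u < 0.0).astype(float))
        return np.mean(weight * huber)

    rng = np.random.default_rng(0)
    print(quantile_huber_loss(rng.normal(size=8), rng.normal(size=8)))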

e. Policy Gradients with Action-Dependent Baselines

[22] Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic, Gu et al, 2016. Algorithm: Q-Prop.
[23] Action-depedent Control Variates for Policy Optimization via Stein's Identity, Liu et al, 2017. Algorithm: Stein Control Variates.
[24] The Mirage of Action-Dependent Baselines in Reinforcement Learning, Tucker et al, 2018. Contribution: Interestingly, critiques and reevaluates claims from earlier papers (including Q-Prop and Stein control variates) and finds important methodological errors in them.

f. Path-Consistency Learning

[25] Bridging the Gap Between Value and Policy Based Reinforcement Learning, Nachum et al, 2017. Algorithm: PCL.
[26] Trust-PCL: An Off-Policy Trust Region Method for Continuous Control, Nachum et al, 2017. Algorithm: Trust-PCL.

g. Other Directions for Combining Policy-Learning and Q-Learning

[27] Combining Policy Gradient and Q-learning, O'Donoghue et al, 2016. Algorithm: PGQL.
[28] The Reactor: A Fast and Sample-Efficient Actor-Critic Agent for Reinforcement Learning, Gruslys et al, 2017. Algorithm: Reactor.
[29] Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning, Gu et al, 2017. Algorithm: IPG.
[30] Equivalence Between Policy Gradients and Soft Q-Learning, Schulman et al, 2017. Contribution: Reveals a theoretical link between these two families of RL algorithms.

h. Evolutionary Algorithms

[31] Evolution Strategies as a Scalable Alternative to Reinforcement Learning, Salimans et al, 2017. Algorithm: ES. (See the code sketch below.)
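
Since [31] is the lone entry here, a minimal NumPy sketch of its evolution-strategies gradient estimator may help: perturb the parameters with Gaussian noise, score each perturbation with the (black-box) return, and step along the return-weighted noise. The quadratic episode_return function is a made-up stand-in for rolling out a policy in an environment, and the hyperparameters are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(10)                                 # policy parameters
    target = rng.normal(size=10)

    def episode_return(params):
        return -np.sum((params - target) ** 2)           # toy objective: higher is better

    sigma, lr, pop_size = 0.1, 0.02, 200
    for _ in range(300):
        eps = rng.normal(size=(pop_size, theta.size))    # one noise vector per worker
        returns = np.array([episode_return(theta + sigma * e) for e in eps])
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # fitness shaping
        theta += lr / (pop_size * sigma) * eps.T @ returns               # ES gradient step

    print(episode_return(np.zeros(10)), episode_return(theta))   # return improves toward 0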

2. Exploration

a. Intrinsic Motivation

[32] VIME: Variational Information Maximizing Exploration, Houthooft et al, 2016. Algorithm: VIME.
[33] Unifying Count-Based Exploration and Intrinsic Motivation, Bellemare et al, 2016. Algorithm: CTS-based Pseudocounts.
[34] Count-Based Exploration with Neural Density Models, Ostrovski et al, 2017. Algorithm: PixelCNN-based Pseudocounts.
[35] #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning, Tang et al, 2016. Algorithm: Hash-based Counts.
[36] EX2: Exploration with Exemplar Models for Deep Reinforcement Learning, Fu et al, 2017. Algorithm: EX2.
[37] Curiosity-driven Exploration by Self-supervised Prediction, Pathak et al, 2017. Algorithm: Intrinsic Curiosity Module (ICM).
[38] Large-Scale Study of Curiosity-Driven Learning, Burda et al, 2018. Contribution: Systematic analysis of how surprisal-based intrinsic motivation performs in a wide variety of environments.
[39] Exploration by Random Network Distillation, Burda et al, 2018. Algorithm: RND. (See the code sketch after this list.)
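
To give one concrete example from this list, here is a minimal NumPy sketch of the intrinsic reward in Random Network Distillation [39]: a fixed, randomly initialized target network and a predictor trained to match it on visited states, with the prediction error serving as the exploration bonus. The single-layer networks and the least-squares "training" step are illustrative simplifications of the paper's setup.

    import numpy as np

    rng = np.random.default_rng(0)
    obs_dim, feat_dim = 8, 16
    W_target = rng.normal(size=(obs_dim, feat_dim))      # fixed random target network
    W_pred = np.zeros((obs_dim, feat_dim))               # predictor, trained on visited states

    def target_features(obs):
        return np.tanh(obs @ W_target)                   # target features are never trained

    def intrinsic_reward(obs):
        # Novel states are poorly predicted, so they receive a larger bonus.
        return np.sum((obs @ W_pred - target_features(obs)) ** 2)

    def train_predictor(visited_obs):
        global W_pred
        # Fit the (linear) predictor to the target features on states seen so far.
        W_pred, *_ = np.linalg.lstsq(visited_obs, target_features(visited_obs), rcond=None)

    visited = rng.normal(size=(128, obs_dim))
    train_predictor(visited)
    print(intrinsic_reward(visited[0]))                        # familiar state: smaller bonus
    print(intrinsic_reward(10.0 * rng.normal(size=obs_dim)))   # unfamiliar state: much larger bonus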

b. Unsupervised RL

[40] Variational Intrinsic Control, Gregor et al, 2016. Algorithm: VIC.
[41] Diversity is All You Need: Learning Skills without a Reward Function, Eysenbach et al, 2018. Algorithm: DIAYN.
[42] Variational Option Discovery Algorithms, Achiam et al, 2018. Algorithm: VALOR.

3. Transfer and Multitask RL

[43] Progressive Neural Networks, Rusu et al, 2016. Algorithm: Progressive Networks.
[44] Universal Value Function Approximators, Schaul et al, 2015. Algorithm: UVFA.
[45] Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al, 2016. Algorithm: UNREAL.
[46] The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously, Cabi et al, 2017. Algorithm: IU Agent.
[47] PathNet: Evolution Channels Gradient Descent in Super Neural Networks, Fernando et al, 2017. Algorithm: PathNet.
[48] Mutual Alignment Transfer Learning, Wulfmeier et al, 2017. Algorithm: MATL.
[49] Learning an Embedding Space for Transferable Robot Skills, Hausman et al, 2018.
[50] Hindsight Experience Replay, Andrychowicz et al, 2017. Algorithm: Hindsight Experience Replay (HER). (See the code sketch after this list.)
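
Since HER [50] is conceptually simple, a short sketch may be useful. This uses plain Python dataclasses and a 1-D goal-reaching toy with a sparse reward; the Transition fields, the "final" relabeling strategy, and the distance tolerance are illustrative choices rather than details from the paper.

    from dataclasses import dataclass, replace
    from typing import List

    @dataclass
    class Transition:
        obs: float
        action: float
        reward: float
        next_obs: float
        goal: float

    def sparse_reward(next_obs: float, goal: float, tol: float = 0.1) -> float:
        return 0.0 if abs(next_obs - goal) < tol else -1.0

    def her_relabel(episode: List[Transition]) -> List[Transition]:
        # "Final" strategy: pretend the state actually reached was the goal all along.
        achieved = episode[-1].next_obs
        return [replace(t, goal=achieved, reward=sparse_reward(t.next_obs, achieved))
                for t in episode]

    # The original episode never reaches goal=5.0, so every reward is -1 ...
    episode = [Transition(obs=float(i), action=0.1, reward=-1.0,
                          next_obs=float(i) + 0.1, goal=5.0) for i in range(3)]
    # ... but the relabeled copy earns reward 0 at its final step, giving the learner signal.
    replay_buffer = episode + her_relabel(episode)
    print(replay_buffer[-1])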

4. Hierarchy

[51] Strategic Attentive Writer for Learning Macro-Actions, Vezhnevets et al, 2016. Algorithm: STRAW.
[52] FeUdal Networks for Hierarchical Reinforcement Learning, Vezhnevets et al, 2017. Algorithm: Feudal Networks.
[53] Data-Efficient Hierarchical Reinforcement Learning, Nachum et al, 2018. Algorithm: HIRO.

5. Memory

[54] Model-Free Episodic Control, Blundell et al, 2016. Algorithm: MFEC.
[55] Neural Episodic Control, Pritzel et al, 2017. Algorithm: NEC.
[56] Neural Map: Structured Memory for Deep Reinforcement Learning, Parisotto and Salakhutdinov, 2017. Algorithm: Neural Map.
[57] Unsupervised Predictive Memory in a Goal-Directed Agent, Wayne et al, 2018. Algorithm: MERLIN.
[58] Relational Recurrent Neural Networks, Santoro et al, 2018. Algorithm: RMC.

7. Meta-RL

[68] RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning, Duan et al, 2016. Algorithm: RL^2.
[69] Learning to Reinforcement Learn, Wang et al, 2016.
[70] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Finn et al, 2017. Algorithm: MAML. (See the code sketch after this list.)
[71] A Simple Neural Attentive Meta-Learner, Mishra et al, 2018. Algorithm: SNAIL.
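
To illustrate the structure of the inner and outer loops in MAML [70], here is a minimal NumPy sketch using the first-order approximation (often called FOMAML) for brevity. The family of 1-D linear-regression tasks, the step sizes, and the batch sizes are made-up illustrations, far simpler than the sinusoid and RL tasks in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(2)                                  # shared initialization [w, b]
    alpha, beta = 0.1, 0.01                              # inner / outer step sizes

    def task_grad(params, slope):
        # Gradient of 0.5 * mean squared error for the task y = slope * x.
        x = rng.uniform(-1.0, 1.0, size=20)
        err = params[0] * x + params[1] - slope * x
        return np.array([np.mean(err * x), np.mean(err)])

    for _ in range(1000):
        outer_grad = np.zeros_like(theta)
        for slope in rng.uniform(0.5, 2.0, size=5):      # sample a small batch of tasks
            adapted = theta - alpha * task_grad(theta, slope)   # inner adaptation step
            outer_grad += task_grad(adapted, slope)             # first-order outer gradient
        theta -= beta * outer_grad / 5                   # meta-update of the initialization

    print(theta)   # the meta-learned initialization (here it drifts toward the average task)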

8. Scaling RL

[72] Accelerated Methods for Deep Reinforcement Learning, Stooke and Abbeel, 2018. Contribution: Systematic analysis of parallelization in deep RL across algorithms.
[73] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, Espeholt et al, 2018. Algorithm: IMPALA.
[74] Distributed Prioritized Experience Replay, Horgan et al, 2018. Algorithm: Ape-X.
[75] Recurrent Experience Replay in Distributed Reinforcement Learning, Anonymous, 2018. Algorithm: R2D2.
[76] RLlib: Abstractions for Distributed Reinforcement Learning, Liang et al, 2017. Contribution: A scalable library of RL algorithm implementations. Documentation link.

10. Safety

[81] Concrete Problems in AI Safety, Amodei et al, 2016. Contribution: Establishes a taxonomy of safety problems, serving as an important jumping-off point for future research. We need to solve these!
[82] Deep Reinforcement Learning From Human Preferences, Christiano et al, 2017. Algorithm: LFP.
[83] Constrained Policy Optimization, Achiam et al, 2017. Algorithm: CPO.
[84] Safe Exploration in Continuous Action Spaces, Dalal et al, 2018. Algorithm: DDPG+Safety Layer.
[85] Trial without Error: Towards Safe Reinforcement Learning via Human Intervention, Saunders et al, 2017. Algorithm: HIRL.
[86] Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning, Eysenbach et al, 2017. Algorithm: Leave No Trace.

13. Bonus: Classic Papers in RL Theory or Review

[99] Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al, 2000. Contributions: Established policy gradient theorem and showed convergence of policy gradient algorithm for arbitrary policy classes.
[100] An Analysis of Temporal-Difference Learning with Function Approximation, Tsitsiklis and Van Roy, 1997. Contributions: Variety of convergence results and counter-examples for value-learning methods in RL.
[101] Reinforcement Learning of Motor Skills with Policy Gradients, Peters and Schaal, 2008. Contributions: Thorough review of policy gradient methods at the time, many of which are still serviceable descriptions of deep RL methods.
[102] Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002. Contributions: Early roots for monotonic improvement theory, later leading to theoretical justification for TRPO and other algorithms.
[103] A Natural Policy Gradient, Kakade, 2002. Contributions: Brought natural gradients into RL, later leading to TRPO, ACKTR, and several other methods in deep RL.
[104] Algorithms for Reinforcement Learning, Szepesvari, 2009. Contributions: Unbeatable reference on RL before deep RL, containing foundations and theoretical background.