Classic and Modern Reinforcement Learning

At the Deep Reinforcement Learning Symposium at NIPS this year I had the pleasure of shaking hands with Dimitri Bertsekas, whose work has been foundational to the mathematical theory of reinforcement learning. I still turn to Neuro-Dynamic Programming (Bertsekas and Tsitsiklis, 1996) when searching for tools to explain sample-based algorithms such as TD. Earlier this summer I read parts of Dynamic Programming and Optimal Control: Volume 2, which is full of gems, any one of which could grow into a full-blown NIPS or ICML paper (try it yourself). You can imagine my excitement at finally meeting the legend. So when he candidly asked me, “What’s deep reinforcement learning, anyway?” it was clear that “Q-learning with neural networks” wasn’t going to cut it.

There is an answer I do like to give. The novelty in deep reinforcement learning isn’t the combination of deep networks with RL methods; one of the oft-cited early successes of RL, TD-Gammon, was all about the MLP. Nor is it really in its chief algorithms: experience replay is 25 years old. Instead, it seems to me that the distinctive characteristic of deep reinforcement learning is its emphasis on perceptually complex, messy environments such as the Arcade Learning Environment, richly textured 3D mazes, and vision-based robotics. As Rich Sutton would put it, deep reinforcement learning is not a solution, but rather a collection of problems.

What’s quite surprising is how little attention the field has given to the classic reinforcement learning literature, from the method of temporal differences, originally designed as a purely predictive algorithm, to the wealth of results on exploration in Markov Decision Processes produced in the mid-2000s. The neglect is odd because many of the challenges posed by the environments characteristic of deep RL have strong roots in early reinforcement learning research. As the saying goes, those who cannot remember the past must eventually repeat it, and I think the community will benefit from a modern exposure to that literature.

The aim of this blog is then really twofold. The first is to take a journey through some of the less appreciated results in reinforcement learning and decision making, and to present them in a more accessible way. The second is to provide, when the occasion arises, simple proofs or illustrations of notions from RL folklore, filling in the gaps, if you will, left by past authors.

There are, of course, many people better suited to this task than I. Don’t expect, then, material which has been better explained elsewhere. In particular, a new edition of Sutton and Barto’s seminal book is due to be released soon, and I encourage you to consult it.

References

Reinforcement Learning: An Introduction. Sutton and Barto, 2nd edition (draft available online).
Dynamic Programming and Optimal Control: Volume 2. Bertsekas (2012).
Neuro-Dynamic Programming. Bertsekas and Tsitsiklis (1996).
Algorithms for Reinforcement Learning. Szepesvári (2010).