Reinforcement Learning: An Introduction

Oct 12, 2019

Typically, if we want a computer to play a game, we use something known as brute force computation. This is a conventional technique in which the computer is made to consider all possible outcomes of a situation, identify the optimal outcome, and then perform a series of actions that will get it there.

A prime example of this is the series of chess games between world champion Garry Kasparov and the IBM supercomputer Deep Blue. Deep Blue played chess by repeatedly applying brute force computation: searching through enormous numbers of possible move sequences, finding lines in which it would come out ahead, and playing the moves that would get it there.
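To make the idea concrete, here is a minimal sketch of brute force search on a game far smaller than chess. The game (a simple version of Nim in which players alternately take 1 to 3 stones and whoever takes the last stone wins) and the function names `can_win` and `best_move` are illustrative choices for this article, not anything Deep Blue actually used.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # remember positions already searched
def can_win(stones: int) -> bool:
    """Return True if the player to move can force a win."""
    # No stones left: the opponent just took the last one and won.
    if stones == 0:
        return False
    # Try every legal move. A move is winning if it leaves the
    # opponent in a position from which they cannot force a win.
    return any(
        not can_win(stones - take)
        for take in (1, 2, 3)
        if take <= stones
    )


def best_move(stones: int):
    """Return a winning number of stones to take, or None if none exists."""
    for take in (1, 2, 3):
        if take <= stones and not can_win(stones - take):
            return take
    return None


if __name__ == "__main__":
    for pile in range(1, 9):
        print(pile, can_win(pile), best_move(pile))
```

The search looks at every reachable position, which is only practical because this game is tiny.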

Now this seems to be all fine and well: Deep Blue went on to win its 1997 rematch against Kasparov this way. But here’s the thing: with more complex problems this method is totally infeasible, and in some cases even impossible! For instance, in the board game Go, the number of possible configurations of the board is greater than the number of atoms in the observable universe. So, even if each atom in the observable universe were able to store one such arrangement, we would still not have enough space! In this situation, brute force computation is physically impossible!
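The comparison can be checked with a quick back-of-the-envelope calculation. The sketch below uses a loose upper bound (every one of the 361 points on a 19 × 19 board is empty, black, or white, ignoring whether the position is legal) and assumes the commonly quoted rough estimate of 10^80 atoms in the observable universe. Both figures are approximations used only for illustration.

```python
# Loose upper bound: each of the 19 x 19 = 361 points is empty, black, or white.
# This over-counts, because it includes arrangements that are not legal positions.
BOARD_POINTS = 19 * 19
upper_bound = 3 ** BOARD_POINTS

# Assumption: a commonly quoted order-of-magnitude estimate of the
# number of atoms in the observable universe.
ATOMS_ESTIMATE = 10 ** 80

# The number of digits minus one gives the order of magnitude.
print(f"board arrangements (upper bound): about 10^{len(str(upper_bound)) - 1}")
print(f"atoms (rough estimate):           about 10^{len(str(ATOMS_ESTIMATE)) - 1}")
print("more arrangements than atoms:", upper_bound > ATOMS_ESTIMATE)
```

The number of legal positions is smaller than this upper bound, but it is still far larger than the atom estimate, so storing or enumerating every position is out of reach either way.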

Because of this, we turn to reinforcement learning (RL). RL is a subfield of artificial intelligence that is being developed to push past the limits of space and time that brute force computation runs into. Its goal is to get the computer to learn in an inherently different way. Instead of relying on the ability to enumerate all possible outcomes, the computer learns more like a human does.

Now why does this matter? Reinforcement learning could enable us to solve really complex problems much faster than humans, or even conventional computer programs, can. At this point, RL can seem impractical, since its best-known applications are games. However, as we begin to understand the technology better, we can apply it to practical scenarios including transportation, education, healthcare, finance, art generation and a host of other domains.

What is Reinforcement Learning?

Consider a kid learning how to ride a bike. The first time the kid gets on the bike, she doesn’t really know how to balance it, so she probably tips over and gets a couple of bruises. The next time she gets on, she is able to pedal a couple of meters, but then perhaps she hits a bump in the road and falls over. The next time, she rides into the curb. The time after that, she goes too slowly. And so on, until eventually she gets it! She learns how to ride through trial and error, by figuring out what works and what doesn’t.

Reinforcement learning is exactly this: trial and error. It is based on the idea that computers can learn and improve from experience rather than being explicitly instructed. This is analogous to the kid who learns how to ride a bike by getting on and practising, rather than by having someone who already knows how to ride give her a comprehensive explanation of exactly how she should do it.

Terminology

Reinforcement: Decisions lead to payoffs. Good decisions are rewarded and bad decisions are penalized.
Agent: The learner and decision-maker. The agent takes actions.
Action (A): A single move that the agent can make.
State (S): A single configuration of the environment.
Environment: The world through which the agent moves. The environment takes the agent’s current state and action as input, and returns as output the agent’s reward and its next state.
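
The definition of the environment above maps naturally onto a single function. The sketch below is a made-up toy world, not a standard library API: the class name `LineWorld`, the five positions on a line, and the reward values are all assumptions chosen for illustration.

```python
from typing import Tuple


class LineWorld:
    """Hypothetical toy environment: positions 0..4 on a line, goal at 4."""

    GOAL = 4
    ACTIONS = (-1, +1)  # step left, step right

    def step(self, state: int, action: int) -> Tuple[float, int]:
        """Take the current state and an action; return (reward, next_state)."""
        # Move, but never leave the line.
        next_state = min(max(state + action, 0), self.GOAL)
        # Assumed reward scheme: reaching the goal pays off,
        # every other step carries a small penalty.
        reward = 1.0 if next_state == self.GOAL else -0.1
        return reward, next_state


if __name__ == "__main__":
    env = LineWorld()
    print(env.step(3, +1))  # stepping right from position 3 reaches the goal
    print(env.step(0, -1))  # stepping left from position 0 hits the wall
```

Here the state is just an integer position and an action is a step of -1 or +1. Real environments are usually far richer, but the input and output have the same shape.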

The Feedback Loop

An agent that is using reinforcement learning goes through a feedback loop, repeated many times over, whenever it attempts a task. Each pass through the loop consists of four stages:

The environment exists in a state.
The agent chooses and performs an action.
The environment changes to another state.
A reward is returned to the agent based on the outcome of its action.
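
The four stages above can be sketched as a short loop. Everything in this example is assumed for illustration: the toy world (positions 0 to 4 on a line, with the goal at 4), the reward values, and the names `env_step`, `run_episode` and `random_policy`. The agent here does not learn yet; it only shows the shape of the loop.

```python
import random

GOAL = 4  # assumed toy world: positions 0..4 on a line, goal at 4


def env_step(state, action):
    """Toy environment: return (reward, next_state) for a state and action."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.1
    return reward, next_state


def run_episode(policy, start=0, max_steps=100):
    """Run one attempt at the task and return (total_reward, history)."""
    state = start                    # 1. the environment exists in a state
    total_reward = 0.0
    history = []
    for _ in range(max_steps):
        action = policy(state)       # 2. the agent chooses and performs an action
        reward, next_state = env_step(state, action)  # 3. the state changes
        total_reward += reward       # 4. a reward is returned to the agent
        history.append((state, action, reward, next_state))
        state = next_state
        if state == GOAL:
            break
    return total_reward, history


def random_policy(state):
    """An agent with no knowledge: pick left or right at random."""
    return random.choice((-1, +1))


if __name__ == "__main__":
    total, steps = run_episode(random_policy)
    print(f"finished after {len(steps)} steps with total reward {total:.1f}")
```

A learning algorithm would sit inside this loop and use each recorded reward to change how the policy picks its next action.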

Learning algorithms

Learning algorithms are mathematical tools, implemented by the programmer, that allow the agent to conduct trial and error effectively when performing a task. A learning algorithm interprets the rewards and penalties returned to the agent by the environment and uses that feedback to improve the agent’s future choices. Which learning algorithm a programmer implements depends on the characteristics of the task the agent is trying to learn (the number of players involved in the game, the number of actions they have, what performing the task successfully looks like, and so on).
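
As one concrete illustration, the sketch below uses tabular Q-learning, a well-known learning algorithm. This article does not single out any particular algorithm, so treat it as just one possible choice. The toy world (positions 0 to 4 on a line, with the goal at 4), the reward values, and the settings `alpha`, `gamma` and `epsilon` are all assumptions made for the example.

```python
import random
from collections import defaultdict

GOAL = 4           # assumed toy world: positions 0..4 on a line, goal at 4
ACTIONS = (-1, 1)  # step left, step right


def env_step(state, action):
    """Toy environment: return (reward, next_state)."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.1
    return reward, next_state


def greedy_action(q, state):
    """Pick the action with the highest learned value in this state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Learn action values by trial and error; returns the Q-table."""
    rng = random.Random(seed)
    q = defaultdict(float)  # value estimate for each (state, action), starts at 0
    for _ in range(episodes):
        state = 0
        for _ in range(100):
            # Mostly exploit what has been learned, sometimes explore.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(q, state)
            reward, next_state = env_step(state, action)
            done = next_state == GOAL
            # Q-learning update: nudge the estimate toward the reward plus
            # the discounted value of the best action in the next state.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


if __name__ == "__main__":
    q = train()
    print({s: greedy_action(q, s) for s in range(GOAL)})
```

After training, the greedy action in each position should be the step toward the goal. The agent was never told this rule; it arrives at it from rewards and penalties alone.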

Google DeepMind Challenge Match

The Google DeepMind Challenge Match was a series of five Go games played in March 2016 between world champion Lee Sedol and AlphaGo. AlphaGo is a computer program from Google DeepMind that was trained to play Go in part through reinforcement learning. Over the five games, AlphaGo won every game except the fourth!

This illustrates the power of reinforcement learning. As it is applied to more practical problems, its potential is immense.