# Using Reinforcement Learning to achieve the Nash Equilibrium — A Python Implementation: Part 1

Let’s say that you’re at a casino, all dressed up in your fancy clothes hoping to make some big bucks. Your eyes shine and your heart races as you stand in front of a line of brightly coloured slot machines.

These slot machines have two settings: they either eat the money you feed it or return double its value to you. It’s been rumored that these slot machines have specific probabilities with which they will double your money, but no one knows what they are.

You — in your schemes to get rich — obviously want to figure out which slot machine has the highest chance of returning your money and then play that machine. With your newly acquired wealth, you’ll then drive off into the sunset in your new Lamborghini to your Californian beach house. You’ll flex your rolex watch and Gucci sneakers while partying with Drake on an exclusive yacht and you’ll import bottled air from the Arctic because why not? Simple.

Except… how will you figure out which slot machine has the highest chance of doubling your money?

In reinforcement learning, this problem is known as the **multi-armed bandit problem** — how should you dedicate your resources to several different options when you can never be certain about what will come of pulling each?

Heck, I don’t know, but a computer certainly does. Or can with the help of a **learning algorithm**.

This is the beginning of a four part series that will cover the python implementation of different learning algorithms which solve the multi-armed bandit problem. Each article will cover one of four learning algorithms — **epsilon greedy, UCB 1, linear reward penalty and linear reward penalty lagging anchor**.

The game which we will get be getting the computer agent to learn is based on Michihiro Kandori’s “Simple Card Game”. Two players (Red and Black) each have four cards (King, Ace, 2 and 3). Each player simultaneously plays a card, and depending on the combination, one player will win. The player who wins gets one point and the player who loses is deducted a point.

This part will specifically focus on the epsilon greedy algorithm.

## Exploration-Exploitation Tradeoff

Our multi-armed bandit problem can be described, in part, as the exploration the **exploration-exploitation** tradeoff.

Let’s say that you and your friends are trying to decide where to eat. In the past, you’ve always gone to a Mexican restaurant around the corner, and you’ve all really enjoyed it. However, this time, one of your friends mentions that a new Lebanese place has opened up down the street, and it’s supposed to be really good. Should you go to the Mexican restaurant, which you know to be really good, or should you try the Lebanese place which has the potential to be better or worse? Decisions?!

This is the exploration-exploitation tradeoff: at what point should you exploit options which you think to be the best rather than exploring options which have the potential to be better or worse (or vice-versa)?

The **Epsilon-Greedy Algorithm** is a learning algorithm which makes use of the exploration-exploitation tradeoff by instructing the computer to explore (i.e. choose a random option) with probability epsilon and exploit (i.e. choose the option which so far seems to be the best) the remainder of the time. Usually, epsilon is set to be around 10%.

In this implementation, we are only going to have Black use the learning algorithm to play the game — Red’s card will always be generated at random.

So, what strategy enables Black to get the highest expected reward?

In game theory, such a strategy is termed “playing the **Nash Equilibrium”**. It is the optimal outcome of a game from which no player can benefit by changing strategies unless his opponent does so as well.

In this specific game it can be derived using linear programing.

- Both players should play
**King with frequency 40% and Ace, 2 and 3 with frequency 20% each**. - It also predicts that the
**Black player should win 60%**of the time, and the**Red player should win 40%**of the time.

So, basically when the computer agent converge to playing such a strategy — when it has acheived the Nash Equilibrium — it have solved the multi-armed bandit problem!

Alright, here we go.

## Python Implementation of Epsilon Greedy

Firstly, we import the numpy and random libraries. We use the former to formulate the payoff table, and the latter to generate a random card when prompted to explore.

We now define a bunch of variables, lists and matrices.

- The variable “epsilon” which will be the rate at which the agent explores and set it to 0.1.
- The list “cards” contains the set of available cards (with 0 denoting king and 1 denoting Ace).
- The lists “black_rewards” and “red_rewards” contain the cumulative reward received from each card for each player.
- The lists “black_counts” and “red_counts” contain the number of times each card was played by each player.
- The lists “black_values” and “red_values” contain the average reward received from each card by each player.
- The numpy array is Black’s payoff table.

It is important to note that the nth index of each list corresponds to information for card n.

We define the “explore” function which randomly returns one of the cards, and the “black_exploit” function which returns Black’s greedy card (the card that currently has the highest value in the “values” list).

We define the “black_select_card” function which decides whether Black explores or exploits.

- It generates a random number between 0 and 1.
- If the number is greater than epsilon, it returns the exploit function.
- Otherwise, it returns the explore function.

The “red_exploit” and “red_select_card” serve similar purposes for Red.

The update function updates all parameters.

- It determines which player wins by consulting the payoff table
- Stores the result in “who_wins”, and adjusts the counts and rewards lists accordingly.
- It then updates each player’s values list by dividing their rewards list by their counts list to produce the average reward attained by each card.

The update function takes two arguments — “red_choice” and “black_choice”. They are defined later in the code.

We now define the “play_games” function. The function takes one argument — i — which denotes the number of rounds in the game. For each of the i iterations,

- It refers the players’ choices to the “black_select_card” and “red_select_card” functions
- Stores the results under “black_choice” and “red_choice”.
- It then executes the update function using the arguments “red_choice” and “black_choice”. They are used to access the played card’s respective index for each list and payoff table.

After all the iterations, it prints the rewards, counts and values lists.

And so finally, we can print i = 10,000 and we see that

A card distribution of about 38% King, 20% Ace, 21% 2 and 21% 3 — which is awfully close to the Nash equilibrium! So, there we have it: the epsilon greedy algorithm.

Stay tuned for Part 2 which will be focusing on the UCB 1 algorithm!