Q-Learning

The Q-Learning equation:
Q(state, action) = R(state, action) + γ * max[Q(next state, all actions)]

Q-Learning learns an action-value association: for each state, it stores a value for every action that can be taken from that state.
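
The update can be written as a short function. This is a minimal sketch, not code from the class: the name q_update and its parameters are made up for illustration, and it reads the bracketed term as the maximum over the next state's action values, which is how the worked example further down uses it. A next state with no onward actions is assumed to contribute zero.

```python
def q_update(reward, gamma, next_q_values):
    """Return R(state, action) + gamma * max over the next state's action values."""
    # Assumption: a next state with no listed actions adds no future value.
    best_next = max(next_q_values, default=0.0)
    return reward + gamma * best_next


# No immediate reward, but the next state has one action worth 100.
print(q_update(0.0, 0.5, [0.0, 100.0]))  # 50.0
# Immediate reward of 100 and nothing to look ahead to.
print(q_update(100.0, 0.5, []))  # 100.0
```

The two calls correspond to the two kinds of update in the example grid: a value passed back from a neighbor, and a reward collected directly.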

Relevant Slides from class

Day 10 covers reinforcement learning.
Slides 18-23 are specific to the procedure below.

Artificial Intelligence (Russell/Norvig)

Equation 21.7, p776.

The algorithm is executed as follows:
For each state in the set of states, initialize the table entry for every action that can be taken from that state to zero.
Observe your current state. Then, until convergence, do:

Select an action (A) (Left, Right, Up, Down).
Collect an immediate reward (R) for action A.
Update the table entry for the state you are leaving and action A, using the values stored for the new state.
Move to the new state and observe.
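
The steps above can be sketched as a loop. Everything concrete here is an assumption made for illustration: a 2 x 3 grid with states A to F, a reward of 100 for entering F, γ = 0.5, actions picked uniformly at random, and an episode that ends at F and restarts at A. The names legal_moves and run are hypothetical.

```python
import random

GRID = ["ABC", "DEF"]  # assumed layout: two rows of three states
MOVES = {"Left": (0, -1), "Right": (0, 1), "Up": (-1, 0), "Down": (1, 0)}
GAMMA = 0.5
WIN_STATE, WIN_REWARD = "F", 100.0


def legal_moves(state):
    """Map each legal action in `state` to the state it leads to."""
    row, col = next((r, line.index(state))
                    for r, line in enumerate(GRID) if state in line)
    result = {}
    for action, (dr, dc) in MOVES.items():
        r, c = row + dr, col + dc
        if 0 <= r < len(GRID) and 0 <= c < len(GRID[0]):
            result[action] = GRID[r][c]
    return result


def run(episodes, seed=0):
    rng = random.Random(seed)
    # Initialize every (state, action) entry to zero.
    q = {(s, a): 0.0 for line in GRID for s in line for a in legal_moves(s)}
    for _ in range(episodes):
        state = "A"  # observe the current state
        while state != WIN_STATE:
            moves = legal_moves(state)
            action = rng.choice(sorted(moves))  # select an action at random
            next_state = moves[action]
            # Collect the immediate reward for this action.
            reward = WIN_REWARD if next_state == WIN_STATE else 0.0
            # Assumption: F ends the episode, so nothing is carried forward from it.
            if next_state == WIN_STATE:
                future = 0.0
            else:
                future = max(q[(next_state, a)] for a in legal_moves(next_state))
            # Update the entry for the state being left and the action taken.
            q[(state, action)] = reward + GAMMA * future
            state = next_state  # move to the new state and observe
    return q


q = run(episodes=200)
for key in sorted(q):
    print(key, q[key])
```

A fixed number of episodes stands in for "until convergence"; a fuller version might stop once no entry changes by more than a small tolerance.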

How it works:
The agent can be thought of as an "explorer". From the starting state, the agent sets out, choosing its moves at random.
All Q-values are initialized to zero; only the "win" and "lose" states carry a reward.
Once the agent finds the "win" or "lose" state, the entry for the state and action that led to it can be updated.
As the agent explores more and more states, more and more entries are updated.

Applying the algorithm:
Suppose we have a game grid with six states, and we start in state A.
Some probability is assigned to each of the legal actions:
Left = .25, Right = .25, Up = .3, Down = .2. Let state F be the "win" state, with a reward of 100; the other states begin with a reward of zero. Set γ = 0.5 for ease of calculation.
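
One way to draw an action with these probabilities is sketched below. The notes do not say what happens to the probability of a move that is not legal from the current state, so this sketch assumes the remaining probabilities are simply rescaled. The names ACTION_PROBS and choose_action are hypothetical.

```python
import random

ACTION_PROBS = {"Left": 0.25, "Right": 0.25, "Up": 0.3, "Down": 0.2}


def choose_action(legal_actions, rng=random):
    """Pick one legal action, weighted by ACTION_PROBS.

    random.choices treats the weights as relative, so leaving out the
    illegal actions rescales the rest (an assumption, see above).
    """
    actions = list(legal_actions)
    weights = [ACTION_PROBS[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
# From state A only Right and Down are legal.
picks = [choose_action(["Right", "Down"], rng) for _ in range(1000)]
print(picks.count("Right") / len(picks))  # expected to be near 0.25 / 0.45
```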

State
A B C
D E F
The initial configuration for the values table looks like:
Action   A     B     C     D     E     F
Right    0     0     -     0     0     -
Left     -     0     0     -     0     0
Down     0     0     0     -     -     -
Up       -     -     -     0     0     0
(- marks a move that is not legal from that state)
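
The initial table can be generated from the grid layout instead of being typed by hand. This is a sketch under the assumption that a move is legal exactly when it stays on the 2 x 3 grid; the names GRID, MOVES, and initial_table are made up for illustration.

```python
GRID = ["ABC", "DEF"]
MOVES = {"Right": (0, 1), "Left": (0, -1), "Down": (1, 0), "Up": (-1, 0)}


def initial_table():
    """Return {(state, action): 0.0} for every move that stays on the grid."""
    table = {}
    for r, row in enumerate(GRID):
        for c, state in enumerate(row):
            for action, (dr, dc) in MOVES.items():
                if 0 <= r + dr < len(GRID) and 0 <= c + dc < len(row):
                    table[(state, action)] = 0.0
    return table


table = initial_table()
# Print one row per action, in the same order as the table in the notes.
for action in MOVES:
    cells = [f"{s}-{action} {table[(s, action)]:g}" if (s, action) in table else s
             for s in "ABCDEF"]
    print(" | ".join(cells))
```

Keying the table by (state, action) pairs keeps illegal moves out of it entirely, so a later max over a state's actions never sees them.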

The first iteration might look like:
The agent is in state A, where the legal actions are Right and Down. Suppose it chooses Right (at random, because there are no rewards yet), which leads to state B.
A-Right = 0.0 + 0.5 * max[B-Right, B-Left, B-Down] = 0.
Update A-Right and continue.

State
A B C
D E F

Action   A     B     C     D     E     F
Right    0     0     -     0     0     -
Left     -     0     0     -     0     0
Down     0     0     0     -     -     -
Up       -     -     -     0     0     0

We choose Right again, which leads to state C.
B-Right = 0.0 + 0.5 * max[C-Left, C-Down] = 0.
Update B-Right and continue.

State
A B C
D E F

Action   A     B     C     D     E     F
Right    0     0     -     0     0     -
Left     -     0     0     -     0     0
Down     0     0     0     -     -     -
Up       -     -     -     0     0     0

Now we are at state C. F is the reward state, and moving Down from C reaps that reward.
C-Down = 100 + 0.5 * max[F-Left, F-Up] = 100 + 0.5 * 0 = 100.

State
A B C
D E F

Action   A     B     C     D     E     F
Right    0     0     -     0     0     -
Left     -     0     0     -     0     0
Down     0     0     100   -     -     -
Up       -     -     -     0     0     0

Now things get more interesting.
Suppose the agent starts again at A and runs the exact same sequence.
A-Right = 0.0 + 0.5 * max[B-Right, B-Left, B-Down] = 0. Choose Right. Update A-Right and continue.
Now, when B-Right is updated, the max function looks at the actions available from C, the state that B-Right leads to, and this gives B-Right a value that "favors" the B-Right path.
The legal actions for C are Left and Down.
Consulting the table, the value of C-Left is 0, but the value of C-Down is 100, so C-Down is favored.
B-Right = 0.0 + 0.5 * max[C-Left, C-Down] = 0.0 + 0.5 * 100 = 50.

State
A B C
D E F

Action   A     B     C     D     E     F
Right    0     50    -     0     0     -
Left     -     0     0     -     0     0
Down     0     0     100   -     -     -
Up       -     -     -     0     0     0
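
The walkthrough so far can be replayed in a few lines to check the arithmetic. This sketch assumes the reward of 100 is collected on the move into F and that F passes nothing further back; the transition map NEXT and the helper replay are hypothetical names. A third pass, which the walkthrough does not cover, is included to show the value reaching A-Right.

```python
GAMMA = 0.5
# Assumed transitions for the 2 x 3 grid: state -> {action: next state}.
NEXT = {
    "A": {"Right": "B", "Down": "D"},
    "B": {"Left": "A", "Right": "C", "Down": "E"},
    "C": {"Left": "B", "Down": "F"},
    "D": {"Right": "E", "Up": "A"},
    "E": {"Left": "D", "Right": "F", "Up": "B"},
    "F": {"Left": "E", "Up": "C"},
}


def replay(path, q):
    """Apply the update once for each (state, action) step, in order."""
    for state, action in path:
        nxt = NEXT[state][action]
        reward = 100.0 if nxt == "F" else 0.0
        # Assumption: F is the end of the run, so it has no future value.
        future = 0.0 if nxt == "F" else max(q[(nxt, a)] for a in NEXT[nxt])
        q[(state, action)] = reward + GAMMA * future


q = {(s, a): 0.0 for s, acts in NEXT.items() for a in acts}
path = [("A", "Right"), ("B", "Right"), ("C", "Down")]
history = []
for episode in range(1, 4):
    replay(path, q)
    history.append((q[("A", "Right")], q[("B", "Right")], q[("C", "Down")]))
    print(episode, history[-1])
# 1 (0.0, 0.0, 100.0)
# 2 (0.0, 50.0, 100.0)
# 3 (25.0, 50.0, 100.0)
```

Each pass moves the reward one step further back along the path, halved by γ each time.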

Remember that the agent chooses at random, so non-favored paths are also taken, and eventually every value in the table is calculated.
This is especially true when the agent first starts exploring. Before long, the favored paths are known,
and the optimal path can be followed.
For this example, after just a few iterations, the table looks like:

Action   A     B     C     D     E     F
Right    25    50    -     50    100   -
Left     -     12.5  25    -     25    0
Down     25    25    100   -     -     -
Up       -     -     -     12.5  25    0
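
To see where the table settles, the sketch below replaces random exploration with repeated sweeps over every state-action pair. That is a deterministic shortcut rather than the random walk described above, but it applies the same update. It assumes F is terminal, with the reward of 100 collected on entry and nothing passed back; NEXT, converge, and greedy_path are hypothetical names. Under these assumptions the result matches the table above in every entry except B-Down, which comes out as 50 (half of E-Right) rather than 25.

```python
GAMMA = 0.5
# Assumed transitions for the 2 x 3 grid: state -> {action: next state}.
NEXT = {
    "A": {"Right": "B", "Down": "D"},
    "B": {"Left": "A", "Right": "C", "Down": "E"},
    "C": {"Left": "B", "Down": "F"},
    "D": {"Right": "E", "Up": "A"},
    "E": {"Left": "D", "Right": "F", "Up": "B"},
    "F": {"Left": "E", "Up": "C"},
}


def converge(sweeps=50):
    """Sweep all state-action pairs repeatedly, applying the same update."""
    q = {(s, a): 0.0 for s, acts in NEXT.items() for a in acts}
    for _ in range(sweeps):
        for s, acts in NEXT.items():
            if s == "F":
                continue  # F is treated as terminal; its entries stay at 0
            for a, nxt in acts.items():
                reward = 100.0 if nxt == "F" else 0.0
                future = 0.0 if nxt == "F" else max(q[(nxt, b)] for b in NEXT[nxt])
                q[(s, a)] = reward + GAMMA * future
    return q


def greedy_path(q, start="A"):
    """Follow the highest-valued action from each state until F is reached."""
    path = [start]
    while path[-1] != "F":
        s = path[-1]
        # Ties go to the action listed first in NEXT[s].
        best = max(NEXT[s], key=lambda a: q[(s, a)])
        path.append(NEXT[s][best])
    return path


q = converge()
for s in "ABCDE":
    print(s, {a: q[(s, a)] for a in NEXT[s]})
print(greedy_path(q))  # ['A', 'B', 'C', 'F']
```

From A the two actions tie at 25, so the route through D and E is just as good; the tie-break here happens to pick the route through B and C.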

Here's the data as it would appear on the gameboard: