Extra Credit Algorithm Walkthroughs - QLearning



So let's say that you are trapped in a house, and each room is represented by A, B, C, D, E and F. Your goal is to get to D, which is the exit to the outside (you can get outside by going through the garage, out the front, out the third story window, etc., and of course staying outside in the warm Utah spring air is beneficial as well!). The reward matrix of the graph is listed below, along with what the graph looks like in picture form.

Reward matrix (going from the state in the left column to the state in the top row produces the following reward. For example, A -> E = 0 is the only allowable action that A can take. A -> D has - because it is impossible.):
state\action A B C D E F
A - - - - 0 -
B - - - - 0 -
C - - - 100 - -
D - - 0 100 0 0
E 0 0 - 100 - -
F - - - 100 - -
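
One way to hold this table in code, as a minimal Python sketch. The names STATES, ROWS, REWARD and allowed_actions are my own, and None stands in for the - entries.

```python
# Room labels; an action is named after the room it leads to.
STATES = ["A", "B", "C", "D", "E", "F"]

# The reward table, one row per starting room, copied from above.
ROWS = {
    "A": "- - - - 0 -",
    "B": "- - - - 0 -",
    "C": "- - - 100 - -",
    "D": "- - 0 100 0 0",
    "E": "0 0 - 100 - -",
    "F": "- - - 100 - -",
}

# REWARD[s][a] is the reward for moving from s to a, or None if impossible.
REWARD = {
    s: {a: (None if cell == "-" else int(cell)) for a, cell in zip(STATES, row.split())}
    for s, row in ROWS.items()
}


def allowed_actions(state):
    """Rooms reachable from `state`, in table order."""
    return [a for a in STATES if REWARD[state][a] is not None]


print(allowed_actions("A"))  # ['E']
print(allowed_actions("E"))  # ['A', 'B', 'D']
```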

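graph: (picture not reproduced here; going by the reward matrix, the doors are A-E, B-E, C-D, D-E and D-F, and D also loops back to itself)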

Now that we know what our reward matrix is, we need to define our Q matrix so that it starts out knowing nothing.
We will build it up slowly using this expression: Q(state, action) = R(state, action) + γ * Max{Q(next state, every action allowed from the next state)}, with γ as our learning parameter (the discount factor), which I will set to .5. Basically this makes it so we don't put too much weight on future rewards.
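
A minimal sketch of that update as a Python function, following the same shape as the worked steps in this walkthrough (the reward for the move, plus .5 times the best Q value available from the room you land in). The names q_update, q, reward and gamma are my own, and I'm assuming an action is just the name of the room it leads to.

```python
def q_update(q, reward, state, action, gamma=0.5):
    """Apply one update for moving from `state` into room `action`.

    q[s][a] holds the current estimate; reward[s][a] is None when the
    move is impossible. Assumes the action's name is also the next state.
    """
    next_state = action
    # Only moves that are actually allowed out of the next room count.
    future = [q[next_state][a] for a, r in reward[next_state].items() if r is not None]
    q[state][action] = reward[state][action] + gamma * max(future, default=0.0)
    return q[state][action]


# Two rows of the house, just enough to try the function out.
reward = {
    "C": {"A": None, "B": None, "C": None, "D": 100, "E": None, "F": None},
    "D": {"A": None, "B": None, "C": 0, "D": 100, "E": 0, "F": 0},
}
q = {s: {a: 0.0 for a in "ABCDEF"} for s in reward}
print(q_update(q, reward, "C", "D"))  # 100.0
```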

Okay, so let's get started. First, I will initialize my Q matrix to all 0.

Q matrix:
state\action A B C D E F
A 0 0 0 0 0 0
B 0 0 0 0 0 0
C 0 0 0 0 0 0
D 0 0 0 0 0 0
E 0 0 0 0 0 0
F 0 0 0 0 0 0
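
In code, the all-zero start could look something like the sketch below. Q and format_table are names I made up; format_table just prints the matrix in the same layout as the tables here.

```python
STATES = ["A", "B", "C", "D", "E", "F"]

# Q[s][a] starts at 0 for every pair, meaning nothing has been learned yet.
Q = {s: {a: 0 for a in STATES} for s in STATES}


def format_table(q):
    """Render a Q matrix in the same layout as the tables in this walkthrough."""
    lines = ["state\\action " + " ".join(STATES)]
    for s in STATES:
        lines.append(s + " " + " ".join(str(q[s][a]) for a in STATES))
    return "\n".join(lines)


print(format_table(Q))
```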

And here is our reward matrix again for easy viewing.
Reward matrix:
state\action A B C D E F
A - - - - 0 -
B - - - - 0 -
C - - - 100 - -
D - - 0 100 0 0
E 0 0 - 100 - -
F - - - 100 - -

Suppose we start in row E, for example. There are only three options: the edge EA, the edge EB and the edge ED.
Let's say that by random selection, we choose to go to D via the edge ED. We can now plug the values into the formula given above:
Q(E, D) = R(E, D) + .5 * Max{Q(D, E), Q(D, F), Q(D, C), Q(D, D)} = 100 + .5 * 0 = 100
We now update our Q matrix to look like this:
Q matrix:
state\action A B C D E F
A 0 0 0 0 0 0
B 0 0 0 0 0 0
C 0 0 0 0 0 0
D 0 0 0 0 0 0
E 0 0 0 100 0 0
F 0 0 0 0 0 0

For the next iteration, let us say that we pick BE. Our function would look like this:
Q(B, E) = R(B, E) + .5 * Max{Q(E, A), Q(E, B), Q(E, D)} = 0 + .5 * 100 = 50
The reason our max function picks from Q(E, A), Q(E, B) and Q(E, D) is that those are all the possible moves E can make. We take whichever one E currently sees as the best choice and add its discounted value to our reward. We apply the discount so that B, which is two doors from the outside, doesn't get the same weight as E, which is only one door away.
Our Q matrix now looks like this:
Q matrix:
state\action A B C D E F
A 0 0 0 0 0 0
B 0 0 0 0 50 0
C 0 0 0 0 0 0
D 0 0 0 0 0 0
E 0 0 0 100 0 0
F 0 0 0 0 0 0
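
The two updates so far (ED, then BE) can be replayed with a short Python sketch. The names ROWS, REWARD, GAMMA, Q and update are my own; None stands in for the - entries, and an action is assumed to be the name of the room it leads to.

```python
STATES = ["A", "B", "C", "D", "E", "F"]
ROWS = {
    "A": "- - - - 0 -",
    "B": "- - - - 0 -",
    "C": "- - - 100 - -",
    "D": "- - 0 100 0 0",
    "E": "0 0 - 100 - -",
    "F": "- - - 100 - -",
}
REWARD = {
    s: {a: (None if cell == "-" else int(cell)) for a, cell in zip(STATES, row.split())}
    for s, row in ROWS.items()
}
GAMMA = 0.5

# Start from the all-zero Q matrix.
Q = {s: {a: 0.0 for a in STATES} for s in STATES}


def update(state, action):
    """Q(state, action) = R(state, action) + GAMMA * best Q out of the next room."""
    best_next = max(Q[action][a] for a in STATES if REWARD[action][a] is not None)
    Q[state][action] = REWARD[state][action] + GAMMA * best_next


update("E", "D")  # 100 + .5 * 0
update("B", "E")  # 0 + .5 * 100
print(Q["E"]["D"], Q["B"]["E"])  # 100.0 50.0
```

Note that the order matters: if BE had been updated before ED, Q(B, E) would still be 0, because E's row would not have learned anything yet.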

The values will eventually converge to give the highest values to the moves that lead directly into D, and lower values to the moves that are one step further away (like A and B picking E). Of course, because D can go to itself over and over, the values will keep growing with repeated updates (with a discount of .5, Q(D, D) levels off at 200), but my table at this point looks like this.
Q matrix:
state\action A B C D E F
A 0 0 0 0 50 0
B 0 0 0 0 50 0
C 0 0 0 100 0 0
D 0 0 0 100 0 0
E 0 0 0 100 0 0
F 0 0 0 100 0 0
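
To see what happens if you keep going, here is a rough Python sketch of the whole loop. It is not the only way to do it: the function train, its parameters num_updates and seed, and the choice to pick uniformly among all allowed moves (rather than picking a room first and then a door) are my own simplifications.

```python
import random

STATES = ["A", "B", "C", "D", "E", "F"]
ROWS = {
    "A": "- - - - 0 -",
    "B": "- - - - 0 -",
    "C": "- - - 100 - -",
    "D": "- - 0 100 0 0",
    "E": "0 0 - 100 - -",
    "F": "- - - 100 - -",
}
# None marks an impossible move (the "-" entries).
REWARD = {
    s: {a: (None if cell == "-" else int(cell)) for a, cell in zip(STATES, row.split())}
    for s, row in ROWS.items()
}
GAMMA = 0.5


def train(num_updates=5000, seed=0):
    """Repeatedly pick a random allowed move and apply the update to it."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in STATES} for s in STATES}
    moves = [(s, a) for s in STATES for a in STATES if REWARD[s][a] is not None]
    for _ in range(num_updates):
        state, action = rng.choice(moves)
        # Best value currently available from the room we land in.
        best_next = max(q[action][a] for a in STATES if REWARD[action][a] is not None)
        q[state][action] = REWARD[state][action] + GAMMA * best_next
    return q


q = train()
for s in STATES:
    print(s, [round(q[s][a]) for a in STATES])
```

With this many updates the entries should settle well above the snapshot in the table: moves straight into D end up around 200, A -> E and B -> E around 100, and E -> A and E -> B around 50. The ordering is the same as before, with the moves closest to D scoring highest.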