Extra Credit Algorithm Walkthroughs - QLearning
Let's say that you are trapped in a house whose rooms are
labeled A, B, C, D, E and F. Your goal is to get to D, which is
the exit to the outside (you can get outside by going through the
garage, out the front, out the third story window, etc., and of course
staying outside in the warm Utah spring air is beneficial as well!). The reward matrix of the graph is listed
below, along with what the graph looks like in picture form.
Reward matrix (going from the state in the left column to the state in the top row produces the listed reward; for example, A -> E = 0 is the only allowable action for A, and A -> D shows - because that move is impossible):
| state\action | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| A | - | - | - | - | 0 | - |
| B | - | - | - | - | 0 | - |
| C | - | - | - | 100 | - | - |
| D | - | - | 0 | 100 | 0 | 0 |
| E | 0 | 0 | - | 100 | - | - |
| F | - | - | - | 100 | - | - |
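The reward matrix can also be written down as data. Here is a minimal Python sketch, assuming a nested dictionary where a missing entry plays the role of "-" (an impossible move). The names `REWARDS` and `allowed_actions` are made up for this illustration.

```python
# Reward matrix from the table above. A move that is missing from a row
# corresponds to "-" in the table, meaning it is impossible.
REWARDS = {
    "A": {"E": 0},
    "B": {"E": 0},
    "C": {"D": 100},
    "D": {"C": 0, "D": 100, "E": 0, "F": 0},
    "E": {"A": 0, "B": 0, "D": 100},
    "F": {"D": 100},
}


def allowed_actions(state):
    """Return the rooms reachable from `state`, in alphabetical order."""
    return sorted(REWARDS[state])


print(allowed_actions("A"))  # ['E']
print(allowed_actions("E"))  # ['A', 'B', 'D']
```

Storing only the possible moves keeps the "-" entries from ever being picked by accident.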
graph:
Now that we know what our reward matrix is, we need to define our
Q matrix so that it starts out knowing nothing.
We will update it slowly using this expression:
Q(state, action) = R(state, action) + γ * Max{Q(next state, all actions)}
with γ as our learning parameter (the discount on future reward), which I
will set to .5. Basically
this makes it so we don't put too much weight on future rewards.
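Judging from the worked updates later in this walkthrough, the rule takes the reward for the move and adds the discounted best Q value available from the room you land in. A small Python sketch of that rule, where `q_update` and `GAMMA` are names I am assuming for illustration:

```python
GAMMA = 0.5  # the learning parameter (discount) of .5 used in this walkthrough


def q_update(reward, next_q_values, gamma=GAMMA):
    """Return reward + gamma * (best Q value reachable from the next room)."""
    return reward + gamma * max(next_q_values)


# Moving into a room whose Q values are all still 0 just yields the reward.
print(q_update(100, [0, 0, 0, 0]))  # 100.0
# Moving (for reward 0) into a room whose best Q value is 100 yields half of it.
print(q_update(0, [0, 0, 100]))     # 50.0
```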
Okay, so let's get started.
First, I will initialize my Q matrix to all 0.
Q matrix:
| state\action | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| A | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 0 | 0 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 0 | 0 | 0 | 0 |
| D | 0 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 0 | 0 | 0 | 0 | 0 |
| F | 0 | 0 | 0 | 0 | 0 | 0 |
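In code, the all-zero starting point is a one-liner. This sketch assumes the Q matrix is stored as a dictionary of dictionaries keyed by room letter; `STATES` and `Q` are illustrative names.

```python
STATES = ["A", "B", "C", "D", "E", "F"]

# One row per state, one column per action, everything starting at 0.
Q = {state: {action: 0 for action in STATES} for state in STATES}

for state in STATES:
    print(state, [Q[state][action] for action in STATES])
```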
And here is our reward matrix again for easy viewing.
Reward matrix:
| state\action | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| A | - | - | - | - | 0 | - |
| B | - | - | - | - | 0 | - |
| C | - | - | - | 100 | - | - |
| D | - | - | 0 | 100 | 0 | 0 |
| E | 0 | 0 | - | 100 | - | - |
| F | - | - | - | 100 | - | - |
Suppose we start in row E, for example. There are only three
options: the edge EA, the edge EB, and the edge ED.
Let's say that by random selection, we choose to go to D via the edge
ED. From our function given above, we can now plug in the values
for Q(E, D):
Q(E, D) = R(E, D) + .5 * Max{Q(D, C), Q(D, D), Q(D, E), Q(D, F)} = 100 + .5 * 0 = 100
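The same step can be checked in code. This is a sketch under my own assumptions: the reward matrix is stored as a dictionary with impossible moves left out, and `update` is a hypothetical helper, not something defined in the original walkthrough.

```python
GAMMA = 0.5
STATES = ["A", "B", "C", "D", "E", "F"]
REWARDS = {  # impossible moves ("-" in the table) are simply left out
    "A": {"E": 0},
    "B": {"E": 0},
    "C": {"D": 100},
    "D": {"C": 0, "D": 100, "E": 0, "F": 0},
    "E": {"A": 0, "B": 0, "D": 100},
    "F": {"D": 100},
}
Q = {s: {a: 0.0 for a in STATES} for s in STATES}


def update(Q, state, action):
    """Apply Q(state, action) = R(state, action) + GAMMA * max Q(next, .)."""
    # The action is the room we move to, so it is also the next state.
    best_next = max(Q[action][nxt] for nxt in REWARDS[action])
    Q[state][action] = REWARDS[state][action] + GAMMA * best_next
    return Q[state][action]


print(update(Q, "E", "D"))  # 100.0
```

Only the entry in row E, column D changes; every other cell stays at 0.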
We now update our Q matrix to look like this:
Q matrix:
| state\action | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| A | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 0 | 0 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 0 | 0 | 0 | 0 |
| D | 0 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 0 | 0 | 100 | 0 | 0 |
| F | 0 | 0 | 0 | 0 | 0 | 0 |
For the next iteration, let us say that we pick BE. Our function would
look like this:
Q(B, E) = R(B, E) + .5 * Max{Q(E, A), Q(E, B), Q(E, D)} = 0 + .5 * 100 = 50
The reason our max function picks from Q(E, A), Q(E, B) and Q(E, D) is
that those are all the possible choices available from E. We
take whatever E sees as its best choice and add it, discounted, to
our own reward. We apply the discount of .5
so that B does not get the same weight as E, for which the goal of
being outside is only one door away.
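To see which of E's options the max actually picks, here is a short Python sketch. It assumes the Q values for E's three exits (A, B and D, as listed earlier) are the ones left after the first update; the variable names are illustrative.

```python
GAMMA = 0.5

# Q values for the moves out of E after the first update:
# only E -> D has been learned so far.
q_from_e = {"A": 0.0, "B": 0.0, "D": 100.0}
reward_b_to_e = 0  # R(B, E) from the reward matrix

# Find E's best move and its value.
best_move, best_value = max(q_from_e.items(), key=lambda item: item[1])
q_b_e = reward_b_to_e + GAMMA * best_value

print(best_move, best_value)  # D 100.0
print(q_b_e)                  # 50.0
```

B inherits half of E's best value, which is how the discount makes a room two doors from the exit worth less than a room one door away.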
Our Q matrix now looks like this:
Q matrix:
| state\action | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| A | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 0 | 0 | 0 | 0 | 50 | 0 |
| C | 0 | 0 | 0 | 0 | 0 | 0 |
| D | 0 | 0 | 0 | 0 | 0 | 0 |
| E | 0 | 0 | 0 | 100 | 0 | 0 |
| F | 0 | 0 | 0 | 0 | 0 | 0 |
The values will eventually converge to give the highest value to the
states that are right next to D, and lower values to those that are one step further away (like
A and B picking E). Of course, because D can go to itself over and over, its value will keep growing with further updates, but my table looks like this:
Q matrix:
| state\action | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| A | 0 | 0 | 0 | 0 | 50 | 0 |
| B | 0 | 0 | 0 | 0 | 50 | 0 |
| C | 0 | 0 | 0 | 100 | 0 | 0 |
| D | 0 | 0 | 0 | 100 | 0 | 0 |
| E | 0 | 0 | 0 | 100 | 0 | 0 |
| F | 0 | 0 | 0 | 100 | 0 | 0 |
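The table above can be reproduced with a short script. The update order below (ED, BE, AE, CD, FD, DD) is my own assumption; it is one order that happens to produce exactly these numbers, and the walkthrough does not say which order was actually used. The script then keeps sweeping over every allowed move to show where the values settle when D keeps feeding itself. `REWARDS`, `update` and `snapshot` are illustrative names.

```python
GAMMA = 0.5
STATES = ["A", "B", "C", "D", "E", "F"]
REWARDS = {  # impossible moves ("-" in the table) are left out
    "A": {"E": 0},
    "B": {"E": 0},
    "C": {"D": 100},
    "D": {"C": 0, "D": 100, "E": 0, "F": 0},
    "E": {"A": 0, "B": 0, "D": 100},
    "F": {"D": 100},
}


def update(Q, state, action):
    """Apply Q(state, action) = R(state, action) + GAMMA * max Q(next, .)."""
    best_next = max(Q[action][nxt] for nxt in REWARDS[action])
    Q[state][action] = REWARDS[state][action] + GAMMA * best_next


Q = {s: {a: 0.0 for a in STATES} for s in STATES}

# Assumed update order that reproduces the table above.
for state, action in [("E", "D"), ("B", "E"), ("A", "E"),
                      ("C", "D"), ("F", "D"), ("D", "D")]:
    update(Q, state, action)
snapshot = {s: dict(Q[s]) for s in STATES}  # copy of the table above

# Keep sweeping over every allowed move until the values settle.
for _ in range(50):
    for state in STATES:
        for action in REWARDS[state]:
            update(Q, state, action)

for state in STATES:
    print(state, [round(Q[state][a], 2) for a in STATES])
```

With a discount of .5, the moves straight into D approach 200 and the moves from A and B into E approach 100, so the ordering described above (rooms next to D score highest) still holds after more iterations.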