| A - Right | 0 | B - Right | 0 | C |   | D - Right | 0 | E - Right | 0 | F |   |
| A |   | B - Left | 0 | C - Left | 0 | D |   | E - Left | 0 | F - Left | 0 |
| A - Down | 0 | B - Down | 0 | C - Down | 0 | D |   | E |   | F |   |
| A |   | B |   | C |   | D - Up | 0 | E - Up | 0 | F - Up | 0 |
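As a rough sketch, the table above could be held in a dictionary keyed by (state, action). The board layout (A B C on the top row, D E F below) is inferred from the legal actions listed in the table, and the names `MOVES` and `q_table` are illustrative assumptions, not part of the original.

```python
# Legal moves on the assumed 2x3 board: A B C on top, D E F below.
# Each action maps to the state the agent lands in.
MOVES = {
    "A": {"Right": "B", "Down": "D"},
    "B": {"Right": "C", "Left": "A", "Down": "E"},
    "C": {"Left": "B", "Down": "F"},
    "D": {"Right": "E", "Up": "A"},
    "E": {"Right": "F", "Left": "D", "Up": "B"},
    "F": {"Left": "E", "Up": "C"},
}

# One entry per legal (state, action) pair, every value starting at 0.
q_table = {(s, a): 0.0 for s, actions in MOVES.items() for a in actions}

print(len(q_table))              # 14 filled cells, matching the table
print(q_table[("A", "Right")])   # 0.0
```

Blank cells in the table (for example C - Right) simply have no key here, since those moves are not legal.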
The first iteration might look like:
The agent starts at A and chooses A-Right (at random, because there are no rewards yet), which moves it to state B.
A-Right = 0.0 + 0.5 * max[B-Right, B-Left, B-Down] = 0.
Update A-Right and continue.
| A - Right | 0 | B - Right | 0 | C |   | D - Right | 0 | E - Right | 0 | F |   |
| A |   | B - Left | 0 | C - Left | 0 | D |   | E - Left | 0 | F - Left | 0 |
| A - Down | 0 | B - Down | 0 | C - Down | 0 | D |   | E |   | F |   |
| A |   | B |   | C |   | D - Up | 0 | E - Up | 0 | F - Up | 0 |
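A single update step can be sketched as a small function. It follows the form used later in this walkthrough for B-Right, where the max is taken over the actions available from the state being entered. The function name `update` and its parameters are hypothetical, chosen only for illustration.

```python
GAMMA = 0.5  # discount factor used throughout the walkthrough


def update(q, state, action, next_state, next_actions, reward=0.0):
    """Set q[(state, action)] to reward + GAMMA * best value reachable from next_state."""
    best_next = max(q[(next_state, a)] for a in next_actions)
    q[(state, action)] = reward + GAMMA * best_next
    return q[(state, action)]


# Only the entries needed for this one step, all still 0.
q = {
    ("A", "Right"): 0.0,
    ("B", "Right"): 0.0,
    ("B", "Left"): 0.0,
    ("B", "Down"): 0.0,
}

# Moving Right from A lands in B, so the max runs over B's actions.
print(update(q, "A", "Right", "B", ["Right", "Left", "Down"]))  # 0.0
```

Because every value is still 0, the update leaves A-Right at 0, which is why the table is unchanged at this point.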
B-Right = 0.0 + 0.5 * max[C-Left, C-Down] = 0.
The agent chooses Right again, which moves it to state C. Update B-Right and continue.
| A - Right | 0 | B - Right | 0 | C |   | D - Right | 0 | E - Right | 0 | F |   |
| A |   | B - Left | 0 | C - Left | 0 | D |   | E - Left | 0 | F - Left | 0 |
| A - Down | 0 | B - Down | 0 | C - Down | 0 | D |   | E |   | F |   |
| A |   | B |   | C |   | D - Up | 0 | E - Up | 0 | F - Up | 0 |
Now the agent is at state C. F is the reward state, so moving Down from C reaps that reward.
C-Down = 100 + 0.5 * max[F-Left, F-Up] = 100 + 0.5 * 0 = 100.
| A - Right | 0 | B - Right | 0 | C |   | D - Right | 0 | E - Right | 0 | F |   |
| A |   | B - Left | 0 | C - Left | 0 | D |   | E - Left | 0 | F - Left | 0 |
| A - Down | 0 | B - Down | 0 | C - Down | 100 | D |   | E |   | F |   |
| A |   | B |   | C |   | D - Up | 0 | E - Up | 0 | F - Up | 0 |
Now things get more interesting.
Suppose the agent starts again at A and runs the exact same sequence.
A-Right = 0.0 + 0.5 * max[B-Right, B-Left, B-Down] = 0. Choose Right. Update A-Right and continue.
Now when B-Right runs the max function over C's actions, it picks up a weighted value that "favors" the B-Right path.
The legal actions for C are Left and Down.
Consulting the table, the value of C-Left is 0, but the value for C-Down is 100. C-Down is favored.
B-Right = 0.0 + 0.5 * max[C-Left, C-Down] = 0.0 + 0.5 * 100 = 50.
| A - Right | 0 | B - Right | 50 | C |   | D - Right | 0 | E - Right | 0 | F |   |
| A |   | B - Left | 0 | C - Left | 0 | D |   | E - Left | 0 | F - Left | 0 |
| A - Down | 0 | B - Down | 0 | C - Down | 100 | D |   | E |   | F |   |
| A |   | B |   | C |   | D - Up | 0 | E - Up | 0 | F - Up | 0 |
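The two passes through Right, Right, Down can be replayed in a few lines. This is a minimal sketch, assuming the board layout inferred from the table and assuming that entering F pays a reward of 100 (which is what the C - Down entry reflects); `MOVES`, `REWARD`, and `run_episode` are illustrative names.

```python
GAMMA = 0.5
REWARD = {"F": 100.0}  # assumption: entering F pays 100, every other move pays 0

# Assumed 2x3 board: A B C on top, D E F below.
MOVES = {
    "A": {"Right": "B", "Down": "D"},
    "B": {"Right": "C", "Left": "A", "Down": "E"},
    "C": {"Left": "B", "Down": "F"},
    "D": {"Right": "E", "Up": "A"},
    "E": {"Right": "F", "Left": "D", "Up": "B"},
    "F": {"Left": "E", "Up": "C"},
}


def run_episode(q, start, actions):
    """Apply the update rule along a fixed sequence of actions."""
    state = start
    for action in actions:
        nxt = MOVES[state][action]
        best_next = max(q[(nxt, a)] for a in MOVES[nxt])
        q[(state, action)] = REWARD.get(nxt, 0.0) + GAMMA * best_next
        state = nxt
    return state


q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}
path = ["Right", "Right", "Down"]

run_episode(q, "A", path)
print(q[("C", "Down")], q[("B", "Right")])                    # 100.0 0.0

run_episode(q, "A", path)
print(q[("C", "Down")], q[("B", "Right")], q[("A", "Right")])  # 100.0 50.0 0.0
```

The reward only reaches B-Right on the second pass, because on the first pass C-Down was still 0 when B-Right was updated.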
Remember the agent chooses at random, so non-favored paths are taken; eventually all the weights are calculated.
This is especially true when the agent first starts exploring. Before long, the favored paths are known,
and the optimal path can be followed.
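Random exploration can be sketched as a loop that restarts at A, picks a legal action at random, and applies the same update until the agent reaches F. The board layout, the reward of 100 for entering F, the fixed seed, and the name `train` are all assumptions made for this illustration.

```python
import random

GAMMA = 0.5
GOAL = "F"

# Assumed 2x3 board: A B C on top, D E F below.
MOVES = {
    "A": {"Right": "B", "Down": "D"},
    "B": {"Right": "C", "Left": "A", "Down": "E"},
    "C": {"Left": "B", "Down": "F"},
    "D": {"Right": "E", "Up": "A"},
    "E": {"Right": "F", "Left": "D", "Up": "B"},
    "F": {"Left": "E", "Up": "C"},
}


def train(episodes, seed=0):
    """Run random-walk episodes from A to F, updating the table on every move."""
    rng = random.Random(seed)  # seeded so repeated runs explore the same way
    q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}
    for _ in range(episodes):
        state = "A"
        while state != GOAL:
            action = rng.choice(list(MOVES[state]))  # purely random choice
            nxt = MOVES[state][action]
            reward = 100.0 if nxt == GOAL else 0.0
            q[(state, action)] = reward + GAMMA * max(q[(nxt, a)] for a in MOVES[nxt])
            state = nxt
    return q


q = train(episodes=500)
for (state, action), value in sorted(q.items()):
    print(f"{state}-{action}: {value}")
```

Each episode ends when F is reached, so F - Left and F - Up are never updated and stay at 0. With enough episodes the other values should settle, since each one depends only on values closer to F.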
For this example, after just a few iterations, the table looks like:
| A - Right | 25 | B - Right | 50 | C |   | D - Right | 50 | E - Right | 100 | F |   |
| A |   | B - Left | 12.5 | C - Left | 25 | D |   | E - Left | 25 | F - Left | 0 |
| A - Down | 25 | B - Down | 50 | C - Down | 100 | D |   | E |   | F |   |
| A |   | B |   | C |   | D - Up | 12.5 | E - Up | 25 | F - Up | 0 |
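Once the table is filled in, following the optimal path means picking the highest-valued action in each state. The sketch below rebuilds the table with deterministic sweeps over every state-action pair rather than random exploration, purely to keep the example short and repeatable; the board layout, the reward of 100 for entering F, and the names `build_table` and `greedy_path` are assumptions.

```python
GAMMA = 0.5
GOAL = "F"

# Assumed 2x3 board: A B C on top, D E F below.
MOVES = {
    "A": {"Right": "B", "Down": "D"},
    "B": {"Right": "C", "Left": "A", "Down": "E"},
    "C": {"Left": "B", "Down": "F"},
    "D": {"Right": "E", "Up": "A"},
    "E": {"Right": "F", "Left": "D", "Up": "B"},
    "F": {"Left": "E", "Up": "C"},
}


def build_table(sweeps=10):
    """Apply the update rule to every non-goal state-action pair, several times over."""
    q = {(s, a): 0.0 for s in MOVES for a in MOVES[s]}
    for _ in range(sweeps):
        for state in MOVES:
            if state == GOAL:
                continue  # the agent stops at F, so its entries stay 0
            for action, nxt in MOVES[state].items():
                reward = 100.0 if nxt == GOAL else 0.0
                q[(state, action)] = reward + GAMMA * max(q[(nxt, a)] for a in MOVES[nxt])
    return q


def greedy_path(q, start):
    """Follow the highest-valued action from start until the goal is reached."""
    path = [start]
    state = start
    while state != GOAL:
        # On a tie, max() keeps the first action listed in MOVES[state].
        action = max(MOVES[state], key=lambda a: q[(state, a)])
        state = MOVES[state][action]
        path.append(state)
    return path


print(greedy_path(build_table(), "A"))  # ['A', 'B', 'C', 'F']
```

A-Right and A-Down tie at 25, so A, D, E, F is an equally good route; the sketch returns A, B, C, F only because Right is listed first.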
Here's the data as it would appear on the gameboard:
|