Extra Credit Algorithm Walkthroughs - MDPs



Bob, our venerable AI professor, decides to head over to Wendover for the weekend to play some Texas Hold'em. He has been playing for a while and knows that most of the people from Salt Lake play like rocks - very conservatively. After some personal testing, he has come up with an MDP for a successful play style based on the number of chips he has.
(This is the transition matrix for the graph. The leftmost column gives Bob's current state and the action he takes; the remaining columns give the probability of ending up in each state. Every row for a state where Bob can still act adds up to 1! Out and Winner are terminal states, so their rows are all 0.):

| State - action | Out | Small chip stack | Starting chip stack | Lead chip stack | Winner |
| --- | --- | --- | --- | --- | --- |
| Out | 0 | 0 | 0 | 0 | 0 |
| Small chip stack - conservative | 2/3 | 0 | 1/3 | 0 | 0 |
| Small chip stack - aggressive | 1/2 | 0 | 1/2 | 0 | 0 |
| Starting chip stack - conservative | 0 | 2/3 | 0 | 1/3 | 0 |
| Starting chip stack - aggressive | 2/10 | 1/10 | 0 | 1/2 | 2/10 |
| Lead chip stack - conservative | 0 | 0 | 2/3 | 0 | 1/3 |
| Lead chip stack - aggressive | 0 | 0 | 1/2 | 0 | 1/2 |
| Winner | 0 | 0 | 0 | 0 | 0 |
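
The table can also be written down as a small data structure, which makes the row sums easy to check. The following is a minimal Python sketch: the names `STATES`, `TRANSITIONS`, and `row_sums` are hypothetical, chosen only for illustration, and the Out and Winner rows are left out because they are all 0 in the table.

```python
from fractions import Fraction as F

# Column order of the table above.
STATES = ["Out", "Small", "Starting", "Lead", "Winner"]

# (state, action) -> probability of landing in each state, in STATES order.
# Values are copied from the table; Out and Winner rows (all 0) are omitted.
TRANSITIONS = {
    ("Small", "conservative"):    [F(2, 3), 0, F(1, 3), 0, 0],
    ("Small", "aggressive"):      [F(1, 2), 0, F(1, 2), 0, 0],
    ("Starting", "conservative"): [0, F(2, 3), 0, F(1, 3), 0],
    ("Starting", "aggressive"):   [F(2, 10), F(1, 10), 0, F(1, 2), F(2, 10)],
    ("Lead", "conservative"):     [0, 0, F(2, 3), 0, F(1, 3)],
    ("Lead", "aggressive"):       [0, 0, F(1, 2), 0, F(1, 2)],
}

def row_sums(transitions):
    """Return the total probability of each (state, action) row."""
    return {key: sum(row) for key, row in transitions.items()}

for key, total in row_sums(TRANSITIONS).items():
    print(key, total)
```

Using `Fraction` keeps the thirds and tenths exact, so each row can be compared against 1 without rounding error.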

With a graph that looks like this:

Here is how I calculated the first table:


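First, we initialize the value of Small (and of every other state) to 0 at t=0.
Next, at t=1, we set the value to its current reward. Since every value from t=0 is 0, only the immediate rewards count. The formal way is max(2/3*(0 + .5*0) + 1/3*(10 + .5*0), 1/2*(0 + .5*0) + 1/2*(10 + .5*0)) = max(3.333, 5) = 5.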
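
The t=1 calculation can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the step above: the helper name `expected_value` is hypothetical, and the rewards (0 for dropping to Out, 10 for moving up) and the .5 discount are simply the numbers that appear in the formula.

```python
GAMMA = 0.5  # discount on future value, as used in the walkthrough

def expected_value(outcomes, gamma=GAMMA):
    """Sum p * (reward + gamma * next_value) over (p, reward, next_value) outcomes."""
    return sum(p * (reward + gamma * next_value) for p, reward, next_value in outcomes)

# Small chip stack at t=1: every value from t=0 is still 0.
conservative = expected_value([(2/3, 0, 0), (1/3, 10, 0)])
aggressive = expected_value([(1/2, 0, 0), (1/2, 10, 0)])

# Keep whichever play style has the larger expected value.
value_t1 = max(conservative, aggressive)
print(round(conservative, 3), aggressive, value_t1)  # 3.333 5.0 5.0
```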
For t=2, we use the max function again, just as in the previous step.
The max function determines which playing style is better, conservative or aggressive, so we select whichever one produces the larger expected reward.
The Out branch adds nothing here, because Out is terminal and its value stays at 0.
The conservative play style gives us:
2/3 (probability of losing game) * (0 (reward for losing game) + .5 (future penalty) * 0 (value of Out)) + 1/3 * (10 + .5 * 5 (previous value at t-1)) = 4.166
while the aggressive play style gives us:
1/2 * (0 + .5 * 0) + 1/2 * (10 + .5 * 5) = 6.25
The max function compares 4.166 to 6.25 and selects the larger one (6.25).
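
The t=2 comparison can be sketched in Python as well. The names `expected_value`, `PREVIOUS`, and `q_values` are hypothetical. The sketch also assumes that the Out branch carries a future value of 0, since Out is terminal, which is what the stated totals 4.166 and 6.25 imply.

```python
GAMMA = 0.5  # discount on future value, as used in the walkthrough

def expected_value(outcomes, gamma=GAMMA):
    """Sum p * (reward + gamma * next_value) over (p, reward, next_value) outcomes."""
    return sum(p * (reward + gamma * next_value) for p, reward, next_value in outcomes)

PREVIOUS = 5.0  # value carried over from t=1

# Assumption: dropping to Out yields reward 0 and future value 0.
q_values = {
    "conservative": expected_value([(2/3, 0, 0.0), (1/3, 10, PREVIOUS)]),
    "aggressive": expected_value([(1/2, 0, 0.0), (1/2, 10, PREVIOUS)]),
}

# Pick the play style with the larger expected value.
best_action = max(q_values, key=q_values.get)
print(best_action, q_values[best_action])  # aggressive 6.25
```

Keeping the per-action totals in a dictionary shows both the winning value and which play style produced it.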
This process repeats for each state (Out, Small, Starting chip stack, Lead chip stack, and Winner), using the probabilities for each.
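
Written generally, the update applied to each state at every step has the form below. This is the standard value-iteration update, given as a sketch rather than as the exact rule used for the tables. The symbols $V_t$, $P$, $R$, and $\gamma$ are notation introduced here for illustration, with $\gamma = 0.5$ matching the discount used in the walkthrough.

```latex
V_t(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V_{t-1}(s') \bigr],
\qquad V_0(s) = 0
```

Here $a$ ranges over the two play styles (conservative and aggressive), and $s'$ ranges over the states that the row for $(s, a)$ can reach.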
Here is my final table.