If a system can be modeled as a Markov chain, there are simple algorithms for predicting states of the system based on either known states at another time, or given some clues about the states. A Markov chain is a specific type of Bayes’ net. For both, inference determines probabilities of a certain outcome based on known elements in other parts of the graph. The Markov chain, however, can be unlimited in length and has restraints on its structure that simplify the inference calculations. A Markov chain can be unlimited in length because they usually refer to changes of a state over time and the time period may be ‘infinite’. The limitation is what state dependencies are allowed at any given time. A set of states at one set in time can only be linked to states at a limited number of previous time points. In a First-Order Markov Chain the current states can only be linked to the states in the previous time. In the terms of a Bayes’ net, states at one time point, for a First-Order Markov Chain, can only be linked to its parent nodes and not to any of its grandparents.
Many real systems can be modeled as Markov chains despite this restraint on dependencies. Most physics equations, if discretized, can be modeled as a Markov chain. For example if you know the speed and location of a ball in the air at any time, you can predict where it will be at the next instant, no matter how it got there. As another example, knowing the physical properties of water, if you know a cup of water's temperature and the air temperature you can calculate how much it will have heated up or cooled down by the next time step. Although Markov chains are better for discrete systems, these examples show the basic idea. (Kalman filters are the Markov chain equivalent for continuous space systems).
In a Markov Chain there are two parts:
Transitions are how we get from state 1 at time t to state 2 at time t+1. To do this we need the transition probabilities.
By the definition of a first order Markov chain the probabilities of a state at time t+1 only depends on the probabilities of that state at time t:
P(X(t+1)|X(0:t))=P(X(t+1)|X(t))
The transition probabilities are either given to us or can be determined by the ‘physics’ of the system. When we have P(X(t+1)|X(t)) we can then find P(X(t+1)) using probability:
P(X(t+1))=P(X(t))*P(X(t+1)|X(t))
Example: A system can be in either of two states A or B, and at time t we know it is 30% likely to be in A and 70% likely to be in B. The transition probabilities are given that A will stay in A 40% of the time and B will stay in B 20%. What is the probability of being in state A or B at time t+1?
Solution: P(X(t))=[.3,.7]. For transition probabilities we can draw the state transitions in a lattice structure as follows:
Since A stays in A 40% the line from P(A,t) to P(A,t+1) is weighted 40%. In math terms P(A(t+1)|P(A(t))=.4. So P(B(t+1)|P(A(t))=.6 because they must sum to one. The same thing is true for B staying in B 20%: P(B(t+1)|B(t))=.2 and P(A(t+1)|B(t))=.8. To find P(A(t+1)) and P(B(t+1)) we use equation 2 above. In equation 2 the X stands for all the states, and from the lattice above we can see that some of the probability of being in A at time t+1 comes from being in state A and some from being in state B at time t.
P(A(t+1)) = P(A(t))*P(A(t+1)|A(t)) + P(B(t))*P(A(t+1)|B(t))We can restate the above equations in the terms of matrices as follows:
![P(X(t+1)=[0.4,0.8;0.6,0.2]](eqp1.bmp)
![*P(X(t))=[0.4,0.8;0.6,0.2]*](PxteqT.bmp)
![[0.3;0.7]=[0.68;0.32]](PteqPtp1.bmp)
A "stationary distribution" is a set of state probabilities that transition to the same probabilities. Such a stationary distribution can be found by repeating the same equations above until it doesn't change any more. From the example above we try:
P(A(t+1)) = 0.3*0.4 + 0.7*0.8 = 0.68
P(B(t+1)) = 0.3*0.6 + 0.7*0.2 = 0.32
P(A(t+2)) = 0.68*0.4 + 0.32*0.8 = 0.528
P(B(t+2)) = 0.68*0.6 + 0.32*0.2 = 0.472
P(A(t+3)) = 0.528*0.4 + 0.472*0.8 = 0.5888
P(B(t+3)) = 0.528*0.6 + 0.472*0.2 = 0.4112
P(A(t+4)) = 0.5888*0.4 + 0.4112*0.8 = 0.5645
P(B(t+4)) = 0.5888*0.6 + 0.4112*0.2 = 0.4355
P(A(t+5)) = 0.5645*0.4 + 0.4355*0.8 = 0.5742
P(B(t+5)) = 0.5645*0.6 + 0.4355*0.2 = 0.4258
P(A(t+6)) = 0.5742*0.4 + 0.4258*0.8 = 0.5703
P(B(t+6)) = 0.5742*0.6 + 0.4258*0.2 = 0.4297
P(A(t+7)) = 0.5703*0.4 + 0.4297*0.8 = 0.5719
P(B(t+7)) = 0.5703*0.6 + 0.4297*0.2 = 0.4281
P(A(t+8)) = 0.5719*0.4 + 0.4281*0.8 = 0.5713
P(B(t+8)) = 0.5719*0.6 + 0.4281*0.2 = 0.4287
P(A(t+9)) = 0.5713*0.4 + 0.4287*0.8 = 0.5715
P(B(t+9)) = 0.5713*0.6 + 0.4287*0.2 = 0.4285
P(A(t+10)) = 0.5715*0.4 + 0.4285*0.8 = 0.5714
P(B(t+10)) = 0.5715*0.6 + 0.4285*0.2 = 0.4286
P(A(t+11)) = 0.5714*0.4 + 0.4286*0.8 = 0.5714
P(B(t+11)) = 0.5714*0.6 + 0.4286*0.2 = 0.4286
You could also solve for it directly as follows. Since a stationary distribution gets the same probabilities as it transitions from we could use the two transition equations above and set P(X(t+1)=P(X(t)). Since we also know that P(X(t)) sums to one we can eliminate one of the equations(which are redundant) and replace it with sum(X(t))=1. For the example above we now have two equations and two variables. These steps are repeated below:
Our transition equations and summation equation:
P(A(t+1)) = P(A(t))*0.4 + P(B(t))*0.8
P(B(t+1)) = P(A(t))*0.6 + P(B(t))*0.2
P(A(t))+P(B(t))=1
Remove one of the redundant transition equations:
P(A(t+1)) = P(A(t))*0.4 + P(B(t))*0.8
P(A(t))+P(B(t))=1
Replace t+1 with t (in fact I will take out all t's):
P(A) = P(A)*0.4 + P(B)*0.8
P(A)+P(B)=1
We now have two equations and two variables so solve for P(A) and P(B):
P(B) = 1-P(A)
P(A) = P(A)*0.4 + (1-P(A))*0.8
=> P(A) = P(A)*0.4 + 0.8 - P(A)*0.8
=> P(A) = P(A)*(0.4 - 0.8) + 0.8
=> P(A) = P(A)*(-0.4) + 0.8
=> P(A) - P(A)*(-0.4) = 0.8
=> P(A) + P(A)*(0.4) = 0.8
=> P(A)*(1 + 0.4) = 0.8
=> P(A)*(1.4) = 0.8
=> P(A) = 0.8/1.4 = 4/7 = 0.5714
P(B) = 1-P(A) = 1-4/7 = 3/7 = 0.4286
So the stationary distribution is [P(A),P(B)] = [4/7,3/7]
We could do the same thing with matrices, where the transition matrix is T.
P(X(t+1))=T*P(X(t))
![T=[0.4,0.8;0.6,0.2]](Tmat.bmp)
Replace t+1 with t (again I will take out all t's from the notation):
P(X)=T*P(X)
Comine P(X). I is an nxn matrix with only the main diagonals with 1s and n is the number of states. 0 is a column of n zeros:
I*P(X)=T*P(X)
0=T*P(X)-I*P(X)
0=(T-I)*P(X)
T-I=
-
= ![T-I=[-0.6,0.8;0.6,-0.8]](TmI.bmp)
Notice that each row in T-I is a multiple of the other. This shows that the two equations are redundant, which is why we must replace one of them with a row that represents a different equation. Now we need to add the sum(P(X))=1 equation. To do this we replace one row of (T-I) with ones and the same row in the column of zeros. For our example we replace the bottom row:
0=(T-I)*P(X)
=> rowofones(0)=rowofones(T-I)*P(X)
=> ![[0;1]=[-0.6,0.8;1,1]*](rowones.bmp)

Then solve for P(X):
rowofones(0)=rowofones(T-I)*P(X)
rowofones(T-I)*P(X)=rowofones(0)
=>inverse(rowofones(T-I))*rowofones(T-I)*P(X)=inverse(rowofones(T-I))*rowofones(0)
=>P(X)=inverse(rowofones(T-I))*rowofones(0)
=>P(X)=![[-5/7,4/7;5/7,3/7]](inverse.bmp)
=![[4/7;3/7]](Ans.bmp)
One of the advantages of this last method is that it gives an equation that will work whether there are two states or 50 or more. If there is a single solution the equation
P(X)=inverse(rowofones(T-I))*rowofones(0) will find the stationary distribution. If you understand a little bit more of Linear algebra you will be able to look at rowofones(T-I) when there is no inverse and tell if there are more than one stationary distributions or none.

Example: The same system as above can still be in either of two states A or B. At any given time we don't know exactly what they are, but we have some sensors that give us some extra information. Our sensors have four levels: high, medium high, medium low, and low. If the state is A we will get high 75% of the time, medium high 20% of the time and medium low 5% of the time. Detecting B is harder and we get a reading of high 1% of the time, medium high 10% of the time medium low 30% of the time and then low 59% of the time. We are now at time t+1 and so from the problem above we believe we are 68% likely to be in A and 32% likely to be in B. If we now get a sensor reading of medium high what is the new state probabilities?
Solution: P(X(t+1))=[.68,.32]. A table of P(E|X) would look like:
| E | P(E|A) | P(E|B) |
|---|---|---|
| high | 0.75 | 0.01 |
| medium high | 0.20 | 0.10 |
| medium low | 0.05 | 0.30 |
| low | 0.00 | 0.59 |
Now we have the tools for predicting the states of a system at any time given some initial probability distribution for the states and given some set of evidences (sensor readings, observations etc.). There are state transitions between times and obervations at certain times. We start at the time of the initial state distribution. We apply any evidence we may have at that time and then we apply the state transition to reach the next time step. We continue doing this until we reach the desired time step and have applied all known evidences. Any evidences before the initial probability can be ignored since P(X(t+1)) only depends on P(X(t)) and not on any states before time t or on any evidence before time t+1. For example, if we have the same system as above, where at time t the probabiltiy is A; 30%, B: 10%, and we have sensor readings for time t-3, t-1, t+1, t+2, and t+4, we find the same probabilities at time t+1 as we just found above even though we have more evidence. To find the probability at t+4 we calculate in the way just stated as follows:
The sensor readings at time t+2 is medium low, we missed the reading at time t+3, but at time t+4 it was high. We already have P(X(t+1)|E(t+1)), so for time t+2, we calculate P(X(t+2)) from that using the transitions probabilities given above:
P(A(t+2)) = 0.8095*0.4 + 0.1905*0.8 = 0.4762
P(B(t+2)) = 0.8095*0.6 + 0.1905*0.2 = 0.5238
We could calculate the probabilities using just the first equation and then find P(B(t+2)) by 1-P(A(t+2)), but doing both helps as a check that there were no obvious math errors. We now apply the evidence at time t+2 of medium low:
P(X(t+2)=A)*P(E(t+2)=medium low|X(t+2)=A) = 0.4762*0.05 = 0.02381
P(X(t+2)=B)*P(E(t+2)=medium low|X(t+2)=B) = 0.5238*0.30 = 0.15714
Normalize by 0.02381+0.15714=0.18095:
P(A(t+2)|E(t+2))=0.02381/0.18095=0.1316
P(B(t+2)|E(t+2))=0.15714/0.18095=0.8684
We then transition to time t+3:
P(A(t+3)) = 0.1316*0.4 + 0.8684*0.8 = 0.7474
P(B(t+3)) = 0.1316*0.6 + 0.8684*0.2 = 0.2526
We don't have a sensor reading at time t+3 so we transition again, but to time t+4:
P(A(t+4)) = 0.7474*0.4 + 0.2526*0.8 = 0.5011
P(B(t+4)) = 0.7474*0.6 + 0.2526*0.2 = 0.4989
We do have a sensor reading at time t+4 so we apply the evidence of high at time t+4:
P(X(t+4)=A)*P(E(t+4)=high|X(t+4)=A) = 0.5011*0.75 = 0.375825
P(X(t+4)=B)*P(E(t+4)=high|X(t+4)=B) = 0.4989*0.01 = 0.004989
Normalize by 0.375825+0.004989=0.380814:
P(A(t+4)|E(t+4))=0.375825/0.38081=0.9869
P(B(t+4)|E(t+4))=0.004989/0.38081=0.0131
We now have our final answer of where we think the system is at time t+4. It is 99% sure to be in state A.
Sometimes we don't want to know the probability at a single time, but rather want what the most likely sequence of states were to get the system where it is now. This is different than finding the most probable state at each time point because later states can now effect beliefs of where we were in earlier states. For example we may believe we are more likely to be in a certain state, but when we transition to the next state, and apply our evidence we find we are in a state that would suggest we actually were more likely to have come from a different state than 'the most probable' state at that time. Using evidence later in time can give us better estimates of the past than if we had just stopped at that time point.
Calculating the most likely sequence is very similar to the method of applying observations and transitions, but with some small, but improtant differences. We already looked at how to find the probability at some final time tf given some intial state at time ti and all known evidences from time ti to tf. This is shown mathematically as:
P(X(tf) | X(ti),E(ti:tf))
For sequences we don't want this probability at a single time, but All the states that would maximize the probability of a sequence of states and a set of occuring all happening. Mathematically this looks like:
X(ti:tf) that obtains Max(P(X(ti:tf), E(ti:tf))), or
argMax(P(X(ti:tf), E(ti:tf)))
P(X(tf) | X(ti),E(ti:tf)) is calculated in the same way shown above in applying transitions and evidences. For our example we found P(X(t+4)|X(t),E(t+1),E(t+2),E(t+4)).
Example: From the same system as above we want to know instead what the probability of having the sequence of X(t)=A, X(t+1)+A, X(t+2)=B, X(t+3)=A, X(t+4)=A, E(t+1)=Medium High (or MH), E(t+2)=Medium Low (or ML), and E(t+4)=High (or H) all occuring. (Note that this is not yet finding the most likely sequence, just the probability of a single sequence happening).
Solution: We want to find P(X(t:t+4)=A,A,B,A,A, E(t+1,t+2,t+4)=MH,ML,H). We can find this directly from our Markov structure and using the dependency structure by:
P(X(t:t+4)=A,A,B,A,A, E(t+1,t+2,t+4)=MH,ML,H) = P(X(t)=A)*P(X(t+1)=A | X(t)=A) * P(E(t+1)=MH|X(t+1)=A) * P(X(t+2)=B|X(t+1)=A) * P(E(t+2)=ML|X(t+2)=B) * P(X(t+3)=A|X(t+2)=B) * P(X(t+4)=A|X(t+3)=A) * P(E(t+4)=H|X(t+4)=A)
We see our starting distribution in the P(X(t)) term, transition type terms in the P(X(t+n)|X(t+n-1)) terms and evidence type terms in the P(E|X) terms. We can calculate it all at once or in parts. We will do it in parts here since that will be the normal flow when we try to find the most likely sequence:
P(X(t)=A)=.3
P(X(t)=A,X(t+1)=A) = P(X(t)=A)*P(X(t+1)=A|P(X(t))=A) = 0.3*0.4 = 0.12
P(X(t:t+1)=A,A,E(t+1)=MH) = 0.12*P(E(t+1)=MH|X(t+1)=A) = 0.12*0.2 = 0.024
P(X(t:t+2)=A,A,B,E(t+1)=MH) = 0.024* P(X(t+2)=B|X(t+1)=A) = 0.024*0.6 = 0.0144
From we transition probabilities and evidence table we get the following:
P(E(t+2)=ML|X(t+2)=B) = 0.3
P(X(t+3)=A|X(t+2)=B) = 0.8
P(X(t+4)=A|X(t+3)=A) = 0.4
P(E(t+4)=H|X(t+4)=A) = 0.75
Continuing from where left off before we have:
P(X(t:t+2)=A,A,B, E(t+1)=MH) = 0.024*0.6 = 0.0144
P(X(t:t+2)=A,A,B, E(t+1,t+2)=MH,ML) = 0.0144*0.3 = 0.00432 = 4.32E-3
P(X(t:t+3)=A,A,B,A, E(t+1,t+2)=MH,ML) = 4.32E-3*0.8 = 3.456E-3
P(X(t:t+4)=A,A,B,A,A, E(t+1,t+2)=MH,ML) = 3.456e-3*0.4 = 1.3824E-3
P(X(t:t+4)=A,A,B,A,A, E(t+1,t+2,t+4)=MH,ML,H) = 1.3824E-3*0.75 = 1.0368E-3
Thus the probability that this sequence would occur with the given evidence is 0.1%.
Note that we keep track of what state we came from to get the sequence probability shown. This is like a back pointer that we can use later to know how we got to the current state. Also note that the probabilities that we are keeping track of no longer sum to one. This is because we they are sequence probabilities and we are not keeping track of every possible sequence. If we were to stop here we would say that if we were in state A, we would have most likely come from B. If we were in B, we most likely would have come from A. We could furthur say that it is more likely that we followed the sequence B,A than A,B because 0.56 is higher than 0.18. This is then the most likely sequence up to this point, without applying any evidence. We now want to apply evidence. The only difference from before is that we don't normalize the data any more. (Unless it is applied at time t to find our initial probabilities, but we don't have any evidence for time t. Our initial probabilities should always sum to one). We don't normalize because we are again using a sequence probability and not a total probability. So for our example we would have:
For X(t+1)=A: 0.56*P(E(t+1)=MH|X(t+1)=A) = 0.56*0.2 = 0.112
For X(t+1)=B: 0.18*P(E(t+1)=MH|X(t+1)=B) = 0.18*0.1 = 0.018
This shows we are still more likely to be in A than B at time t+1 and that we would have gotten there from the sequence BA. If for some reason we knew we were in B we would know we still got there from A (we keep track since later evidence may show that we were in B). We could represent this in a lattice diagram similar to the one shown above. We write the sequence probabilities after applying evidence at that time in each box (the first set of boxes at the initial time should always sum to one). We keep track of the backpointers or how we got to each state at that time with a bold arrow.

Now We will continue our example and find the most likely sequence from time t to t+4 using the new transition, max and observation steps.
Transition to t+2:
P(X(t:t+1)=BA,E(t+1)=MH,X(t+2)=A) = 0.112*0.4 = 0.0448
P(X(t:t+1)=BA,E(t+1)=MH,X(t+2)=B) = 0.112*0.6 = 0.0672
P(X(t:t+1)=AB,E(t+1)=MH,X(t+2)=A) = 0.018*0.8 = 0.0144
P(X(t:t+1)=AB,E(t+1)=MH,X(t+2)=B) = 0.018*0.2 = 0.0036
Max step at t+2:
For X(t+2)=A: Max(X(t:t+1)=BA:0.0448, X(t:t+1)=AB:0.0144) = X(t:t+1)=BA:0.0448
For X(t+2)=B: Max(X(t:t+1)=BA:0.0672, X(t:t+1)=AB:0.0036) = X(t:t+1)=BA:0.0672
Observation at t+2:
For X(t+2)=A: 0.0448*P(E(t+2)=ML|X(t+2)=A) = 0.0448*0.05 = 0.00224
For X(t+2)=B: 0.0672*P(E(t+2)=ML|X(t+2)=B) = 0.0672*0.30 = 0.02016
Our lattice now looks like this:

Transition to t+3:
P(X(t:t+2)=BAA,E(t+1,t+2)=MH,ML,X(t+3)=A) = 0.00224*0.4 = 0.896E-3
P(X(t:t+2)=BAA,E(t+1,t+2)=MH,ML,X(t+3)=B) = 0.00224*0.6 = 1.344E-3
P(X(t:t+2)=BAB,E(t+1,t+2)=MH,ML,X(t+3)=A) = 0.02016*0.8 = 16.128E-3
P(X(t:t+2)=BAB,E(t+1,t+2)=MH,ML,X(t+3)=B) = 0.02016*0.2 = 4.032E-3
Max step at t+3:
For X(t+3)=A: Max(X(t:t+2)=BAA:0.896E-3, X(t:t+2)=BAB:16.128E-3) = X(t:t+2)=BAB:16.128E-3
For X(t+3)=B: Max(X(t:t+2)=BAA:1.344E-3, X(t:t+2)=BAB:4.032E-3) = X(t:t+2)=BAB:4.032E-3
No Observation at t+3 so the lattice now looks like this:

Transition to t+4:
P(X(t:t+3)=BABA,E(t+1,t+2)=MH,ML,X(t+4)=A) = 0.016128*0.4 = 6.4512E-3
P(X(t:t+3)=BABA,E(t+1,t+2)=MH,ML,X(t+4)=B) = 0.016128*0.6 = 9.6768E-3
P(X(t:t+3)=BABB,E(t+1,t+2)=MH,ML,X(t+4)=A) = 0.004032*0.8 = 3.2256E-3
P(X(t:t+3)=BABB,E(t+1,t+2)=MH,ML,X(t+4)=B) = 0.004032*0.2 = 0.8064E-3
Max step at t+4:
For X(t+4)=A: Max(X(t:t+3)=BABA:6.4512E-3, X(t:t+3)=BABB:3.2256E-3) = X(t:t+3)=BABA:6.4512E-3
For X(t+4)=B: Max(X(t:t+3)=BABA:9.6768E-3, X(t:t+3)=BABB:0.8064E-3) = X(t:t+3)=BABA:9.6768E-3
Observation at t+4:
For X(t+4)=A: 6.4512E-3*P(E(t+4)=H|X(t+4)=A) = 6.4512E-3*0.75 = 4.8384E-3
For X(t+4)=B: 9.6768E-3*P(E(t+4)=H|X(t+4)=B) = 9.6768E-3*0.01 = 0.096768E-3
The final lattice now looks like this:

We find then that the most likely sequence given the evidence is BABAA because A has the greatest sequence probability at t+4 and BABA was the sequence that led up to it. We could follow the arrows back from A at t+4 to retrace our steps graphically if we had not been keeping track of the path in the Max step as we have been.