Markov Chains


Overview

If a system can be modeled as a Markov chain, there are simple algorithms for predicting states of the system based on either known states at another time, or given some clues about the states. A Markov chain is a specific type of Bayes’ net. For both, inference determines probabilities of a certain outcome based on known elements in other parts of the graph. The Markov chain, however, can be unlimited in length and has restraints on its structure that simplify the inference calculations. A Markov chain can be unlimited in length because they usually refer to changes of a state over time and the time period may be ‘infinite’. The limitation is what state dependencies are allowed at any given time. A set of states at one set in time can only be linked to states at a limited number of previous time points. In a First-Order Markov Chain the current states can only be linked to the states in the previous time. In the terms of a Bayes’ net, states at one time point, for a First-Order Markov Chain, can only be linked to its parent nodes and not to any of its grandparents.

Many real systems can be modeled as Markov chains despite this restraint on dependencies. Most physics equations, if discretized, can be modeled as a Markov chain. For example if you know the speed and location of a ball in the air at any time, you can predict where it will be at the next instant, no matter how it got there. As another example, knowing the physical properties of water, if you know a cup of water's temperature and the air temperature you can calculate how much it will have heated up or cooled down by the next time step. Although Markov chains are better for discrete systems, these examples show the basic idea. (Kalman filters are the Markov chain equivalent for continuous space systems).

How it Works

In a Markov Chain there are two parts:

  1. Applying Transitions
  2. Applying Evidence

If a Markov chain has this second part it is often called a Hidden Markov Model or HMM. Evidence can also be called observations.

Applying Transitions in a Markov Chain

Standard Transitions

Transitions are how we get from state 1 at time t to state 2 at time t+1. To do this we need the transition probabilities. By the definition of a first order Markov chain the probabilities of a state at time t+1 only depends on the probabilities of that state at time t:

P(X(t+1)|X(0:t))=P(X(t+1)|X(t))
The transition probabilities are either given to us or can be determined by the ‘physics’ of the system. When we have P(X(t+1)|X(t)) we can then find P(X(t+1)) using probability:

P(X(t+1))=P(X(t))*P(X(t+1)|X(t))

Example: A system can be in either of two states A or B, and at time t we know it is 30% likely to be in A and 70% likely to be in B. The transition probabilities are given that A will stay in A 40% of the time and B will stay in B 20%. What is the probability of being in state A or B at time t+1?

Solution: P(X(t))=[.3,.7]. For transition probabilities we can draw the state transitions in a lattice structure as follows:
Probability Lattice

Since A stays in A 40% the line from P(A,t) to P(A,t+1) is weighted 40%. In math terms P(A(t+1)|P(A(t))=.4. So P(B(t+1)|P(A(t))=.6 because they must sum to one. The same thing is true for B staying in B 20%: P(B(t+1)|B(t))=.2 and P(A(t+1)|B(t))=.8. To find P(A(t+1)) and P(B(t+1)) we use equation 2 above. In equation 2 the X stands for all the states, and from the lattice above we can see that some of the probability of being in A at time t+1 comes from being in state A and some from being in state B at time t.

P(A(t+1)) = P(A(t))*P(A(t+1)|A(t)) + P(B(t))*P(A(t+1)|B(t))
P(B(t+1)) = P(A(t))*P(B(t+1)|A(t)) + P(B(t))*P(B(t+1)|B(t))

P(A(t+1)) = 0.3*0.4 + 0.7*0.8 = .68
P(B(t+1)) = 0.3*0.6 + 0.7*0.2 = .32

A good check that the final probability is plausible is that all the probabilities sum to one:
.68 + .32 = 1

We can restate the above equations in the terms of matrices as follows:

P(X(t+1)=[0.4,0.8;0.6,0.2]*P(X(t))=[0.4,0.8;0.6,0.2]*[0.3;0.7]=[0.68;0.32]

Stationary Distributions

A "stationary distribution" is a set of state probabilities that transition to the same probabilities. Such a stationary distribution can be found by repeating the same equations above until it doesn't change any more. From the example above we try:

P(A(t+1)) = 0.3*0.4 + 0.7*0.8 = 0.68
P(B(t+1)) = 0.3*0.6 + 0.7*0.2 = 0.32

P(A(t+2)) = 0.68*0.4 + 0.32*0.8 = 0.528
P(B(t+2)) = 0.68*0.6 + 0.32*0.2 = 0.472

P(A(t+3)) = 0.528*0.4 + 0.472*0.8 = 0.5888
P(B(t+3)) = 0.528*0.6 + 0.472*0.2 = 0.4112

P(A(t+4)) = 0.5888*0.4 + 0.4112*0.8 = 0.5645
P(B(t+4)) = 0.5888*0.6 + 0.4112*0.2 = 0.4355

P(A(t+5)) = 0.5645*0.4 + 0.4355*0.8 = 0.5742
P(B(t+5)) = 0.5645*0.6 + 0.4355*0.2 = 0.4258

P(A(t+6)) = 0.5742*0.4 + 0.4258*0.8 = 0.5703
P(B(t+6)) = 0.5742*0.6 + 0.4258*0.2 = 0.4297

P(A(t+7)) = 0.5703*0.4 + 0.4297*0.8 = 0.5719
P(B(t+7)) = 0.5703*0.6 + 0.4297*0.2 = 0.4281

P(A(t+8)) = 0.5719*0.4 + 0.4281*0.8 = 0.5713
P(B(t+8)) = 0.5719*0.6 + 0.4281*0.2 = 0.4287

P(A(t+9)) = 0.5713*0.4 + 0.4287*0.8 = 0.5715
P(B(t+9)) = 0.5713*0.6 + 0.4287*0.2 = 0.4285

P(A(t+10)) = 0.5715*0.4 + 0.4285*0.8 = 0.5714
P(B(t+10)) = 0.5715*0.6 + 0.4285*0.2 = 0.4286

P(A(t+11)) = 0.5714*0.4 + 0.4286*0.8 = 0.5714
P(B(t+11)) = 0.5714*0.6 + 0.4286*0.2 = 0.4286

You could also solve for it directly as follows. Since a stationary distribution gets the same probabilities as it transitions from we could use the two transition equations above and set P(X(t+1)=P(X(t)). Since we also know that P(X(t)) sums to one we can eliminate one of the equations(which are redundant) and replace it with sum(X(t))=1. For the example above we now have two equations and two variables. These steps are repeated below:

Our transition equations and summation equation:

P(A(t+1)) = P(A(t))*0.4 + P(B(t))*0.8
P(B(t+1)) = P(A(t))*0.6 + P(B(t))*0.2
P(A(t))+P(B(t))=1

Remove one of the redundant transition equations:

P(A(t+1)) = P(A(t))*0.4 + P(B(t))*0.8
P(A(t))+P(B(t))=1

Replace t+1 with t (in fact I will take out all t's):

P(A) = P(A)*0.4 + P(B)*0.8
P(A)+P(B)=1

We now have two equations and two variables so solve for P(A) and P(B):

P(B) = 1-P(A) P(A) = P(A)*0.4 + (1-P(A))*0.8
=> P(A) = P(A)*0.4 + 0.8 - P(A)*0.8
=> P(A) = P(A)*(0.4 - 0.8) + 0.8
=> P(A) = P(A)*(-0.4) + 0.8
=> P(A) - P(A)*(-0.4) = 0.8
=> P(A) + P(A)*(0.4) = 0.8
=> P(A)*(1 + 0.4) = 0.8
=> P(A)*(1.4) = 0.8
=> P(A) = 0.8/1.4 = 4/7 = 0.5714
P(B) = 1-P(A) = 1-4/7 = 3/7 = 0.4286

So the stationary distribution is [P(A),P(B)] = [4/7,3/7]

We could do the same thing with matrices, where the transition matrix is T.

P(X(t+1))=T*P(X(t))
T=[0.4,0.8;0.6,0.2]

Replace t+1 with t (again I will take out all t's from the notation):

P(X)=T*P(X)

Comine P(X). I is an nxn matrix with only the main diagonals with 1s and n is the number of states. 0 is a column of n zeros:

I*P(X)=T*P(X)
0=T*P(X)-I*P(X)
0=(T-I)*P(X)
T-I=T=[0.4,0.8;0.6,0.2]-[1,0;0,1] = T-I=[-0.6,0.8;0.6,-0.8]

Notice that each row in T-I is a multiple of the other. This shows that the two equations are redundant, which is why we must replace one of them with a row that represents a different equation. Now we need to add the sum(P(X))=1 equation. To do this we replace one row of (T-I) with ones and the same row in the column of zeros. For our example we replace the bottom row:

0=(T-I)*P(X)
=> rowofones(0)=rowofones(T-I)*P(X)
=> [0;1]=[-0.6,0.8;1,1]*P(X)

Then solve for P(X):

rowofones(0)=rowofones(T-I)*P(X)
rowofones(T-I)*P(X)=rowofones(0)
=>inverse(rowofones(T-I))*rowofones(T-I)*P(X)=inverse(rowofones(T-I))*rowofones(0)
=>P(X)=inverse(rowofones(T-I))*rowofones(0)
=>P(X)=[-5/7,4/7;5/7,3/7][0;1]=[4/7;3/7]

One of the advantages of this last method is that it gives an equation that will work whether there are two states or 50 or more. If there is a single solution the equation P(X)=inverse(rowofones(T-I))*rowofones(0) will find the stationary distribution. If you understand a little bit more of Linear algebra you will be able to look at rowofones(T-I) when there is no inverse and tell if there are more than one stationary distributions or none.

Applying Evidence in a Markov Chain

Using Evidence

In a Markov Chain the evidence at time t only depends on the state at time t:

P(E(t)|X(t),X(0:t-1),E(0:t-1))=P(E(t)|X(t))

In a Bayes' Net type model this looks like the following figure where T=P(X(t+1)|P(X(t)):
Probability and Evidence dependency picture
At any time t we can find the probabilty distribution P(X(t)) before applying evidence from P(X(t-1)) by applying transition probabilities in the same way as shown in the section above. After obtaining P(X(t)) (or P(X(t)|X(t-1)) to be concise) we then want to find P(X(t)|E(t)) (or P(X(t)|X(t-1),E(t))). To do this we apply "Bayes rule" where P(X|E)=P(E|X)*P(X)/P(E). We should have P(E|X) from our model, but we are not likely to have P(E). We find, however, that we don't need it since it acts as a normalizing term. We then find P(X|E) by normalizing P(E|X)*P(X). This could also be stated mathematically as P(X|E) = P(E|X)*P(X)/Σi(P(E|Xi)*P(Xi)).

Example: The same system as above can still be in either of two states A or B. At any given time we don't know exactly what they are, but we have some sensors that give us some extra information. Our sensors have four levels: high, medium high, medium low, and low. If the state is A we will get high 75% of the time, medium high 20% of the time and medium low 5% of the time. Detecting B is harder and we get a reading of high 1% of the time, medium high 10% of the time medium low 30% of the time and then low 59% of the time. We are now at time t+1 and so from the problem above we believe we are 68% likely to be in A and 32% likely to be in B. If we now get a sensor reading of medium high what is the new state probabilities?

Solution: P(X(t+1))=[.68,.32]. A table of P(E|X) would look like:

E P(E|A) P(E|B)
high 0.75 0.01
medium high 0.20 0.10
medium low 0.05 0.30
low 0.00 0.59

We then calculate probabilties P(X)*P(E|X):

P(X=A)*P(E=medium high|X=A) = 0.68*0.20 = .136
P(X=B)*P(E=medium high|X=B) = 0.32*0.10 = .032

Then to find P(X|E) we normalize the results above:

sum(P(X)*P(E|X))=0.136+0.032=0.168
P(A|E)=0.136/0.168=0.8095
P(B|E)=0.032/0.168=0.1905

Because the sensor agreed with where we thought we were, now we are more certain of the state we think the system is in.

Using Transitions and Evidence

Now we have the tools for predicting the states of a system at any time given some initial probability distribution for the states and given some set of evidences (sensor readings, observations etc.). There are state transitions between times and obervations at certain times. We start at the time of the initial state distribution. We apply any evidence we may have at that time and then we apply the state transition to reach the next time step. We continue doing this until we reach the desired time step and have applied all known evidences. Any evidences before the initial probability can be ignored since P(X(t+1)) only depends on P(X(t)) and not on any states before time t or on any evidence before time t+1. For example, if we have the same system as above, where at time t the probabiltiy is A; 30%, B: 10%, and we have sensor readings for time t-3, t-1, t+1, t+2, and t+4, we find the same probabilities at time t+1 as we just found above even though we have more evidence. To find the probability at t+4 we calculate in the way just stated as follows:

The sensor readings at time t+2 is medium low, we missed the reading at time t+3, but at time t+4 it was high. We already have P(X(t+1)|E(t+1)), so for time t+2, we calculate P(X(t+2)) from that using the transitions probabilities given above:

P(A(t+2)) = 0.8095*0.4 + 0.1905*0.8 = 0.4762
P(B(t+2)) = 0.8095*0.6 + 0.1905*0.2 = 0.5238

We could calculate the probabilities using just the first equation and then find P(B(t+2)) by 1-P(A(t+2)), but doing both helps as a check that there were no obvious math errors. We now apply the evidence at time t+2 of medium low:

P(X(t+2)=A)*P(E(t+2)=medium low|X(t+2)=A) = 0.4762*0.05 = 0.02381
P(X(t+2)=B)*P(E(t+2)=medium low|X(t+2)=B) = 0.5238*0.30 = 0.15714

Normalize by 0.02381+0.15714=0.18095:
P(A(t+2)|E(t+2))=0.02381/0.18095=0.1316
P(B(t+2)|E(t+2))=0.15714/0.18095=0.8684

We then transition to time t+3:

P(A(t+3)) = 0.1316*0.4 + 0.8684*0.8 = 0.7474
P(B(t+3)) = 0.1316*0.6 + 0.8684*0.2 = 0.2526

We don't have a sensor reading at time t+3 so we transition again, but to time t+4:

P(A(t+4)) = 0.7474*0.4 + 0.2526*0.8 = 0.5011
P(B(t+4)) = 0.7474*0.6 + 0.2526*0.2 = 0.4989

We do have a sensor reading at time t+4 so we apply the evidence of high at time t+4:

P(X(t+4)=A)*P(E(t+4)=high|X(t+4)=A) = 0.5011*0.75 = 0.375825
P(X(t+4)=B)*P(E(t+4)=high|X(t+4)=B) = 0.4989*0.01 = 0.004989

Normalize by 0.375825+0.004989=0.380814:
P(A(t+4)|E(t+4))=0.375825/0.38081=0.9869
P(B(t+4)|E(t+4))=0.004989/0.38081=0.0131

We now have our final answer of where we think the system is at time t+4. It is 99% sure to be in state A.

Most Likely Sequence

Sometimes we don't want to know the probability at a single time, but rather want what the most likely sequence of states were to get the system where it is now. This is different than finding the most probable state at each time point because later states can now effect beliefs of where we were in earlier states. For example we may believe we are more likely to be in a certain state, but when we transition to the next state, and apply our evidence we find we are in a state that would suggest we actually were more likely to have come from a different state than 'the most probable' state at that time. Using evidence later in time can give us better estimates of the past than if we had just stopped at that time point.

Calculating the most likely sequence is very similar to the method of applying observations and transitions, but with some small, but improtant differences. We already looked at how to find the probability at some final time tf given some intial state at time ti and all known evidences from time ti to tf. This is shown mathematically as:

P(X(tf) | X(ti),E(ti:tf))

For sequences we don't want this probability at a single time, but All the states that would maximize the probability of a sequence of states and a set of occuring all happening. Mathematically this looks like:

X(ti:tf) that obtains Max(P(X(ti:tf), E(ti:tf))), or
argMax(P(X(ti:tf), E(ti:tf)))

P(X(tf) | X(ti),E(ti:tf)) is calculated in the same way shown above in applying transitions and evidences. For our example we found P(X(t+4)|X(t),E(t+1),E(t+2),E(t+4)).

Example: From the same system as above we want to know instead what the probability of having the sequence of X(t)=A, X(t+1)+A, X(t+2)=B, X(t+3)=A, X(t+4)=A, E(t+1)=Medium High (or MH), E(t+2)=Medium Low (or ML), and E(t+4)=High (or H) all occuring. (Note that this is not yet finding the most likely sequence, just the probability of a single sequence happening).

Solution: We want to find P(X(t:t+4)=A,A,B,A,A, E(t+1,t+2,t+4)=MH,ML,H). We can find this directly from our Markov structure and using the dependency structure by:

P(X(t:t+4)=A,A,B,A,A, E(t+1,t+2,t+4)=MH,ML,H) = P(X(t)=A)*P(X(t+1)=A | X(t)=A) * P(E(t+1)=MH|X(t+1)=A) * P(X(t+2)=B|X(t+1)=A) * P(E(t+2)=ML|X(t+2)=B) * P(X(t+3)=A|X(t+2)=B) * P(X(t+4)=A|X(t+3)=A) * P(E(t+4)=H|X(t+4)=A)

We see our starting distribution in the P(X(t)) term, transition type terms in the P(X(t+n)|X(t+n-1)) terms and evidence type terms in the P(E|X) terms. We can calculate it all at once or in parts. We will do it in parts here since that will be the normal flow when we try to find the most likely sequence:

P(X(t)=A)=.3
P(X(t)=A,X(t+1)=A) = P(X(t)=A)*P(X(t+1)=A|P(X(t))=A) = 0.3*0.4 = 0.12
P(X(t:t+1)=A,A,E(t+1)=MH) = 0.12*P(E(t+1)=MH|X(t+1)=A) = 0.12*0.2 = 0.024
P(X(t:t+2)=A,A,B,E(t+1)=MH) = 0.024* P(X(t+2)=B|X(t+1)=A) = 0.024*0.6 = 0.0144
From we transition probabilities and evidence table we get the following:
P(E(t+2)=ML|X(t+2)=B) = 0.3
P(X(t+3)=A|X(t+2)=B) = 0.8
P(X(t+4)=A|X(t+3)=A) = 0.4
P(E(t+4)=H|X(t+4)=A) = 0.75
Continuing from where left off before we have:
P(X(t:t+2)=A,A,B, E(t+1)=MH) = 0.024*0.6 = 0.0144
P(X(t:t+2)=A,A,B, E(t+1,t+2)=MH,ML) = 0.0144*0.3 = 0.00432 = 4.32E-3
P(X(t:t+3)=A,A,B,A, E(t+1,t+2)=MH,ML) = 4.32E-3*0.8 = 3.456E-3
P(X(t:t+4)=A,A,B,A,A, E(t+1,t+2)=MH,ML) = 3.456e-3*0.4 = 1.3824E-3
P(X(t:t+4)=A,A,B,A,A, E(t+1,t+2,t+4)=MH,ML,H) = 1.3824E-3*0.75 = 1.0368E-3

Thus the probability that this sequence would occur with the given evidence is 0.1%.

We now want to find the sequence of states that will maximize this probability. We could either list every possible outcome (AAAAA, AAAAB, AAABA, AAABB, ...) and choose the one with the highest probability, or we can do it the way that will be shown as follows. Again we have transition steps and observation steps.
For a stndard transition step we would sum probabilties of coming from any previous state. For example, at the beginning we did:

P(X(t+1)=A) = 0.3*0.4 + 0.7*0.8 = 0.68
P(X(t+1)=B) = 0.3*0.6 + 0.7*0.2 = 0.32

Here however the probability will be 'coming' from only one state (because we want the sequence probability and not the total probability) so we do each part seperately and don't sum:

P(X(t)=A,X(t+1)=A) = 0.3*0.4 = 0.12
P(X(t)=A,X(t+1)=B) = 0.3*0.6 = 0.18
P(X(t)=B,X(t+1)=A) = 0.7*0.8 = 0.56
P(X(t)=B,X(t+1)=B) = 0.7*0.2 = 0.14

Now we add a max step because we only care about the most likely way we got to X(t+1):

For X(t+1)=A: Max(X(t)=A:0.12, X(t)=B:0.56) = X(t)=B:0.56
For X(t+1)=B: Max(X(t)=A:0.18, X(t)=B:0.14) = X(t)=A:0.18

Note that we keep track of what state we came from to get the sequence probability shown. This is like a back pointer that we can use later to know how we got to the current state. Also note that the probabilities that we are keeping track of no longer sum to one. This is because we they are sequence probabilities and we are not keeping track of every possible sequence. If we were to stop here we would say that if we were in state A, we would have most likely come from B. If we were in B, we most likely would have come from A. We could furthur say that it is more likely that we followed the sequence B,A than A,B because 0.56 is higher than 0.18. This is then the most likely sequence up to this point, without applying any evidence. We now want to apply evidence. The only difference from before is that we don't normalize the data any more. (Unless it is applied at time t to find our initial probabilities, but we don't have any evidence for time t. Our initial probabilities should always sum to one). We don't normalize because we are again using a sequence probability and not a total probability. So for our example we would have:

For X(t+1)=A: 0.56*P(E(t+1)=MH|X(t+1)=A) = 0.56*0.2 = 0.112
For X(t+1)=B: 0.18*P(E(t+1)=MH|X(t+1)=B) = 0.18*0.1 = 0.018

This shows we are still more likely to be in A than B at time t+1 and that we would have gotten there from the sequence BA. If for some reason we knew we were in B we would know we still got there from A (we keep track since later evidence may show that we were in B). We could represent this in a lattice diagram similar to the one shown above. We write the sequence probabilities after applying evidence at that time in each box (the first set of boxes at the initial time should always sum to one). We keep track of the backpointers or how we got to each state at that time with a bold arrow.

Sequence Lattice for times t and t+1

Now We will continue our example and find the most likely sequence from time t to t+4 using the new transition, max and observation steps.

Transition to t+2:
P(X(t:t+1)=BA,E(t+1)=MH,X(t+2)=A) = 0.112*0.4 = 0.0448
P(X(t:t+1)=BA,E(t+1)=MH,X(t+2)=B) = 0.112*0.6 = 0.0672
P(X(t:t+1)=AB,E(t+1)=MH,X(t+2)=A) = 0.018*0.8 = 0.0144
P(X(t:t+1)=AB,E(t+1)=MH,X(t+2)=B) = 0.018*0.2 = 0.0036

Max step at t+2:
For X(t+2)=A: Max(X(t:t+1)=BA:0.0448, X(t:t+1)=AB:0.0144) = X(t:t+1)=BA:0.0448
For X(t+2)=B: Max(X(t:t+1)=BA:0.0672, X(t:t+1)=AB:0.0036) = X(t:t+1)=BA:0.0672

Observation at t+2:
For X(t+2)=A: 0.0448*P(E(t+2)=ML|X(t+2)=A) = 0.0448*0.05 = 0.00224
For X(t+2)=B: 0.0672*P(E(t+2)=ML|X(t+2)=B) = 0.0672*0.30 = 0.02016

Our lattice now looks like this:

Sequence Lattice for times t to t+2

Transition to t+3:
P(X(t:t+2)=BAA,E(t+1,t+2)=MH,ML,X(t+3)=A) = 0.00224*0.4 = 0.896E-3
P(X(t:t+2)=BAA,E(t+1,t+2)=MH,ML,X(t+3)=B) = 0.00224*0.6 = 1.344E-3
P(X(t:t+2)=BAB,E(t+1,t+2)=MH,ML,X(t+3)=A) = 0.02016*0.8 = 16.128E-3
P(X(t:t+2)=BAB,E(t+1,t+2)=MH,ML,X(t+3)=B) = 0.02016*0.2 = 4.032E-3

Max step at t+3:
For X(t+3)=A: Max(X(t:t+2)=BAA:0.896E-3, X(t:t+2)=BAB:16.128E-3) = X(t:t+2)=BAB:16.128E-3
For X(t+3)=B: Max(X(t:t+2)=BAA:1.344E-3, X(t:t+2)=BAB:4.032E-3) = X(t:t+2)=BAB:4.032E-3

No Observation at t+3 so the lattice now looks like this:

Sequence Lattice for times t to t+3

Transition to t+4:
P(X(t:t+3)=BABA,E(t+1,t+2)=MH,ML,X(t+4)=A) = 0.016128*0.4 = 6.4512E-3
P(X(t:t+3)=BABA,E(t+1,t+2)=MH,ML,X(t+4)=B) = 0.016128*0.6 = 9.6768E-3
P(X(t:t+3)=BABB,E(t+1,t+2)=MH,ML,X(t+4)=A) = 0.004032*0.8 = 3.2256E-3
P(X(t:t+3)=BABB,E(t+1,t+2)=MH,ML,X(t+4)=B) = 0.004032*0.2 = 0.8064E-3

Max step at t+4:
For X(t+4)=A: Max(X(t:t+3)=BABA:6.4512E-3, X(t:t+3)=BABB:3.2256E-3) = X(t:t+3)=BABA:6.4512E-3
For X(t+4)=B: Max(X(t:t+3)=BABA:9.6768E-3, X(t:t+3)=BABB:0.8064E-3) = X(t:t+3)=BABA:9.6768E-3

Observation at t+4:
For X(t+4)=A: 6.4512E-3*P(E(t+4)=H|X(t+4)=A) = 6.4512E-3*0.75 = 4.8384E-3
For X(t+4)=B: 9.6768E-3*P(E(t+4)=H|X(t+4)=B) = 9.6768E-3*0.01 = 0.096768E-3

The final lattice now looks like this:

Sequence Lattice for times t to t+4

We find then that the most likely sequence given the evidence is BABAA because A has the greatest sequence probability at t+4 and BABA was the sequence that led up to it. We could follow the arrows back from A at t+4 to retrace our steps graphically if we had not been keeping track of the path in the Max step as we have been.

References

Artificial Intelligence, A Modern Approach, Second Edition, Chapter 15.
Utah Class Lecture Slides cs5300-HMMs