| Files you'll edit: | |
lsa.py |
You will implement latent semantic analysis here. |
hmm.py |
You will implement code for hidden Markov models here. |
| Files you might want to look at: | |
datasets.py |
Includes (in python format) some simple toy data sets. |
util.py |
Some helpful utilities, including plotting functions. |
What to submit: You will handin all of the python files listed above under "Files you'll edit" as well as a partners.txt file that lists the names and uids of all members in your team. Finally, you'll hand in a writeup.pdf file that answers all the written questions in this assignment (denoted by WU#: in this .html file).
Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.
Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.
Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, class time, and the mailing list are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask. One more piece of advice: if you don't know what a variable is, print it out.
>>> (nipsText, nipsDict) = datasets.readDocuments("nips.txt")
>>> nipsText.shape
(1254, 6547)
>>> (U,Sigma,V,topics) = lsa.lsa(nipsText, 10, nipsDict)
>>> norm(dot(U, dot(diag(Sigma), V)) - nipsText)
9175.9946201118455
>>> (U,Sigma,V,topics) = lsa.lsa(nipsText, 20, nipsDict)
>>> norm(dot(U, dot(diag(Sigma), V)) - nipsText)
8827.3855374102623
The first genreates 10 topics, the second generates 20
topics. As you can see, the reconstruction error goes down as the
number of topics goes up.
Warning: this will take a bit of time (a few minutes). If you want
to test your implementation in smaller settings, try on subset of
nipsText, or random data, and check by trying to reconstruction
X=USV.
We can now take a look at the top words for every topic:
>>> topics[0,:]
array(['yorktown', 'aronszajn', 'massimiliano', 'massart', 'thorsten',
'bie', 'alexsmola', 'lectures', 'karush', 'scholz', 'holloway',
'hush', 'dordrecht', 'uncorrupted', 'xiaojin', 'rockafellar',
'mincuts', 'fung', 'grg', 'datasetshtml'],
dtype='|S22')
>>> topics[1,:]
array(['policy', 'agents', 'player', 'game', 'reward', 'agent', 'actions',
'regret', 'mdp', 'equilibrium', 'one', 'arm', 'robot', 'expert',
'opponent', 'nash', 'equilibria', 'planning', 'word', 'games'],
dtype='|S22')
This first topic looks mostly like names (though not entirely) and the
second looks a lot like reinforcement learning.
WU1: Look through the top 20 topics and print them all out.
Try to figure out what they're talking about. How many of them are
redundant? Are any just totally random looking?
WU2: If you want to generate pretty pictures,
use datasets.plotWords(V,nipsDict). Find a place in the
resulting image that looks particularly good and include a copy of it
in your writeup.
Extra credit: I've also give you the collaborative filtering
data from P1. If you feel like (this is worth 10% bonus), run LSA on
this data (you'll have to massage it into the right format) and
present the "topics"... what are they? Can you see any trends?
>>> (a,b,pi) = datasets.getHMMData()
>>> a
array([[ 0.66666667, 0.33333333],
[ 0.5 , 0.5 ]])
>>> b
array([[ 0.66666667, 0.25 , 0.08333333],
[ 0.25 , 0.25 , 0.5 ]])
>>> pi
array([ 0.5, 0.5])
>>> hmm.viterbi(array([0,1,1,2]), a, b, pi)
array([0, 0, 0, 1])
>>> hmm.viterbi(array([0,2,1,2]), a, b, pi)
array([0, 1, 1, 1])
The observation sequences are Hot Cold Cold Wet and Hot Wet Cold Wet,
respectively. The inferred latent state sequences are G G G B and G B
B B, respectively.
If you need help debugging: I added statements to
print al and ze at the end of the viterbi function.
Here's the output I get:
>>> hmm.viterbi(array([0,1,1,2]), a, b, pi) [[-0.69314718 -1.5040774 -3.29583687 -5.08759634 -7.97796809] [-0.69314718 -2.77258872 -3.98898405 -5.78074352 -6.8793558 ]] [[-1 0 0 0 0] [-1 1 0 0 0]] array([0, 0, 0, 1]) >>> hmm.viterbi(array([0,2,1,2]), a, b, pi) [[-0.69314718 -1.5040774 -4.39444915 -5.37527841 -8.26565017] [-0.69314718 -2.77258872 -3.29583687 -5.37527841 -6.76157277]] [[-1 0 0 1 0] [-1 1 0 1 1]] array([0, 1, 1, 1])
>>> hmm.viterbi(array([0,1,1,2]), a, b, pi) [[-0.69314718 -1.5040774 -3.29583687 -5.08759634 -7.16703788] [-0.69314718 -2.19722458 -3.98898405 -5.78074352 -7.16703788]] [[-1 0 0 0 1] [-1 0 0 0 1]] array([0, 0, 0, 1]) >>> hmm.viterbi(array([0,2,1,2]), a, b, pi) [[-0.69314718 -1.5040774 -3.58351894 -5.37527841 -7.04925484] [-0.69314718 -2.19722458 -3.58351894 -5.66296048 -7.04925484]] [[-1 0 1 0 1] [-1 0 1 1 1]] array([0, 1, 1, 1])
>>> 0.5 + 0.25 0.75 >>> log(0.75) -0.2876820724517809 >>> util.addLog( log(0.5) , log(0.25) ) -0.2876820724517809If you say addLog(a,b) then it's effectively computing log(exp(a) + exp(b)), but in such a way that there's no numeric underflow. If you're curious, you're encouraged to look at the code to see how it works! So, keep in mind that: you replace probability multiplication with log probability addition, you replace one with zero, replace zero with -inf, and you can now replace probability addition with log probability addLogition. Now, armed with this new tool, let's implement the forward algorithm. Once you've got a solution, here's some debugging help based on the same data we had from the Viterbi case:
>>> hmm.forward(array([0,1,1,2]), a, b, pi)
array([[-0.69314718, -1.25624123, -2.67140358, -4.06259134, -5.56850996],
[-0.69314718, -1.75093747, -3.09162132, -4.47051216, -5.70230466]])
>>> hmm.forward(array([0,2,1,2]), a, b, pi)
array([[-0.69314718, -1.25624123, -2.82648449, -4.11756738, -5.58815463],
[-0.69314718, -1.75093747, -2.96983593, -4.47862365, -5.71699198]])
From there, you can move on to the backward algorithm. Here's some
similar debugging help:
>>> hmm.backward(array([0,1,1,2]), a, b, pi)
array([[-4.56743938, -4.17757522, -2.89037176, -2.48490665, 0. ],
[-5.5405585 , -4.13148411, -2.61843804, -0.69314718, 0. ]])
>>> hmm.backward(array([0,2,1,2]), a, b, pi)
array([[-4.6660574 , -5.27618751, -2.89037176, -2.48490665, 0. ],
[-5.37008359, -3.43833693, -2.61843804, -0.69314718, 0. ]])
I've provided a sanityCheck function that ensures that you
get the same value for "sum al_i be_i" for all time positions.
>>> al = hmm.forward(array([0,1,1,2]), a, b, pi) >>> be = hmm.backward(array([0,1,1,2]), a, b, pi) >>> hmm.sanityCheck(al, be)If it prints something, that's bad! Finally, we have to implement parameter re-estimation based on the forward and backward tables. This goes in reestimate. Again, you'll need to use the magic of addLog to make this work. At the end, pi should contain log probabilities for the pi value. The normalizeLog function will turn an unnormalized vector of log probabilities into a normalized vector of probabilities. Here's an example of how it works:
>>> v = log(array([1,2,2,3,2])) >>> v array([ 0. , 0.69314718, 0.69314718, 1.09861229, 0.69314718]) >>> util.normalizeLog(v) array([ 0.1, 0.2, 0.2, 0.3, 0.2])What it's done is addLoged all the values in the vector and then used this as a normalization constant. It turns the result into a bunch of probabilities. Big hint: In the notes and the book and elsewhere, it makes a big deal out of re-estimating values as the fraction of something that you care about (expected counts) to the sum of expected counts. It's not worth computing the denominators. Just compute the numerators and let normalizeLog ensure that these probabilities sum to one. (This is all the denominator is doing, anyway). Here's re-estimation in practice:
>>> al = hmm.forward(array([0,1,1,2]), a, b, pi)
>>> be = hmm.backward(array([0,1,1,2]), a, b, pi)
>>> (a_new, b_new, pi_new) = hmm.reestimate(array([0,1,1,2]), al, be, a, b, pi)
>>> a_new
array([[ 0.53662942, 0.46337058],
[ 0.39886289, 0.60113711]])
>>> b_new
array([[ 0.35001693, 0.55333559, 0.09664748],
[ 0.14235731, 0.44259786, 0.41504483]])
>>> pi_new
array([ 0.72574077, 0.27425923])
And here's another one:
>>> al = hmm.forward(array([0,2,1,2]), a, b, pi)
>>> be = hmm.backward(array([0,2,1,2]), a, b, pi)
>>> (a_new, b_new, pi_new) = hmm.reestimate(array([0,2,1,2]), al, be, a, b, pi)
>>> a_new
array([[ 0.34624522, 0.65375478],
[ 0.35236106, 0.64763894]])
>>> b_new
array([[ 0.43532693, 0.30443136, 0.26024171],
[ 0.13435433, 0.21603434, 0.64961132]])
>>> pi_new
array([ 0.66907982, 0.33092018])
Once you've got this all set up, we can run EM on this single
example:
>>> (a_em,b_em,pi_em,logProbs) = hmm.runEM(array([0,1,1,2]), 2, 3)
iteration 1... log probability -5.25485
iteration 2... log probability -4.15148
iteration 3... log probability -4.14962
iteration 4... log probability -4.14593
iteration 5... log probability -4.13452
iteration 6... log probability -4.09672
iteration 7... log probability -3.99093
iteration 8... log probability -3.80666
iteration 9... log probability -3.64349
iteration 10... log probability -3.49871
iteration 11... log probability -3.29489
iteration 12... log probability -3.0354
iteration 13... log probability -2.83776
iteration 14... log probability -2.78245
iteration 15... log probability -2.77644
iteration 16... log probability -2.77452
iteration 17... log probability -2.77356
iteration 18... log probability -2.77308
iteration 19... log probability -2.77283
iteration 20... log probability -2.77271
iteration 21... log probability -2.77265
iteration 22... log probability -2.77262
iteration 23... log probability -2.7726
iteration 24... log probability -2.7726
iteration 25... log probability -2.77259
...
iteration 50... log probability -2.77259
>>> a_em
array([[ 2.54354295e-293, 1.00000000e+000],
[ 1.00000000e+000, 5.67486661e-014]])
>>> b_em
array([[ 5.00000000e-001, 5.00000000e-001, 1.49055507e-282],
[ 0.00000000e+000, 5.00000000e-001, 5.00000000e-001]])
>>> pi_em
array([ 1., 0.])
Note: There is internal randomization so you won't necessary
get these exact results: try running a couple of times.
WU4: Why does EM settle in this configuration? What is it
saying? What happens when you use three states instead of two?
What's the resulting probability of the data? What about four states?
What happens and why?
Now, we do some fun data. Take a look at words.txt. This
contains a small news article. We're going to run EM on this data
where each character is an observation. We can load this data
as:
>>> (words, wordsDict) = datasets.readCharacterFile("words.txt")
>>> words
array([ 8, 1, 22, ..., 12, 5, 19])
>>> wordsDict[words]
array(['h', 'a', 'v', ..., 'l', 'e', 's'],
dtype='|S1')
A space character has gotten translated to observation 0, and then "a" is 1, "b" is 2, ..., and "z" is 27. There are no other characters allowed. WU5: Run EM on this data using two states. Run it a few times until you get a final log probability at least -9250. (It should happen in two or three trials. This will take a few minutes to run.) When you do this, it will print out the resulting initial state probabilities, transition probabilities and emission probabilities. What has it learned? Include all of this output in your write-up.