\documentclass[fleqn]{article}

\usepackage{haldefs}
\usepackage{notes}
\usepackage{url}
\usepackage{verbatim}
\usepackage{fancyvrb}
\usepackage{color}

\definecolor{lightgreen}{rgb}{0.8,1.0,0.8}

\DefineVerbatimEnvironment%
  {matlab}{Verbatim}
  {baselinestretch=1.0,frame=single,fillcolor=\color{lightgreen}}


\begin{document}
\lecture{CS5350: Machine Learning}{Project 5: Clustering}{Due end of semester}

\leftright{Introduction}{}

In project 5, we experiment with unsupervised learning.  We use both
the $K$-means algorithm and Expectation Maximization for Gaussian
mixture models.  Refer to the course notes for the formulation.

Your implementation objectives are: {\tt clusterInit.m}, which should
initialize random clusters according to the ``furthest first''
heuristic, and {\tt gmm.m}, which should implement the Gaussian
mixture model.  You will have to implement both the E-step
(computation of ``pie charts'') and the M-step (updates of the means
$\mu$, the variances $\si^2$ and the class priors $\pi$).  Most of
this stuff you can copy and modify from the {\tt Gauss.m} code in
project 4.

There is a {\tt test\_kmeans} function and a {\tt test\_gmm} function
for testing the implementation on some simple 2D data.  If you don't see
reasonable clusters coming out, something is probably broken.

As usual, there is a {\tt run} script which works on
(surprise surprise!) MNIST data!

\leftright{Testing K-means}{\emph{(0\%)}}

This is just for fun, to give you a feel for the data.

There is a script, {\tt test\_kmeans}, for (not surprisingly!) testing
the {\tt kmeans} function (which I have implemented for you).  This
function reads in some simple two dimensional data and then tries to
cluster it.  The figures it produces are:

\bei
\i A plot of the data, unclustered
\i The data clustered with $K=2$ and random initialization
\i The data clustered with $K=3$ and random initialization
(this is run 16 times... you should see a small amount of variability
in the outputs).  The plots also include scores.
\i The data clustered with $K=3$ and ``furthest'' initialization
(this is run 16 times... you should see a small amount of variability
in the outputs).  The plots also include scores.  Note that this won't
work until you do the furthest-first initialization below.
\i A plot of $K \in \{2,3,4,5,6,8,10,15,20\}$ versus score.
\i The data clustered for $K \in \{2,3,4,5,6\}$, together
with scores.
\eni

To verify that things seem to be going okay, aside from just checking
to see if your clusters look reasonable in the plots, the scores that
I get in Figure 6 are: $120$, $58$, $47$, $38$, $32$.  There will be a
small amount of variation due to the random initialization, but they
should be reasonably close.

\leftright{Furthest-First Initialization}{\emph{(30\%)}}

You'll find a partial implementation of the furthest-first heuristic
in {\tt clusterInit.m}.  You should finish the implementation.  You
should begin by selecting a ``bogus center'' and then iteratively
finding $K$-many ``real'' centers.  These ``real'' centers should be
as far as possible from any of the previous centers.  More formally,
if you've already selected $4$ centers, the $5$th center should be the
point whose \emph{minimum} distance to any of the $4$ centers is
\emph{maximum}.
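
As a sketch of this rule in symbols (the notation is assumed here and
is not taken from {\tt clusterInit.m}): write $x_1, \dots, x_N$ for the
data points and $c_1, \dots, c_k$ for the centers chosen so far, and
assume Euclidean distance.  The next center is then
\[
  c_{k+1} = x_{n^\ast}, \qquad
  n^\ast = \arg\max_{1 \le n \le N} \; \min_{1 \le j \le k} \| x_n - c_j \| .
\]
For the first ``real'' center, the only center available to measure
against is the bogus one, so $c_1$ is simply the point furthest from
it.  Whether the bogus center stays in the set for later picks is left
to the partial implementation.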

After you've implemented this, you should re-try {\tt test\_kmeans} to
make sure things look reasonable.

\leftright{Gaussian Mixture Models}{\emph{(40\%)}}

Now, you need to implement EM for Gaussian mixture models in {\tt
  gmm.m}.  See the code for details on the implementation; in this
case, we're using ``full, shared'' covariance.  Almost all the code
can be copied and modified from {\tt Gauss.m}.
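
For reference, one common way to write the EM updates under a ``full,
shared'' covariance is sketched below.  The symbols $r_{nk}$ (the ``pie
chart'' entry for point $n$ and cluster $k$), $N$ (the number of
points) and $\Sigma$ (the single shared covariance matrix) are notation
assumed here; the course notes and {\tt gmm.m} may use different names
or add smoothing.  The E-step computes
\[
  r_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma)}
                {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma)} ,
\]
and the M-step then sets
\[
  \pi_k = \frac{1}{N} \sum_{n=1}^{N} r_{nk}, \qquad
  \mu_k = \frac{\sum_{n=1}^{N} r_{nk} \, x_n}{\sum_{n=1}^{N} r_{nk}}, \qquad
  \Sigma = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K}
           r_{nk} \, (x_n - \mu_k)(x_n - \mu_k)^\top .
\]
Because the covariance is shared, every cluster contributes to the one
matrix $\Sigma$, weighted by its responsibilities.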

Once your implementation is up, you can run {\tt test\_gmm.m} to test
it.  This is basically identical to {\tt test\_kmeans.m} but a bit
shorter.  In Figure 2, you should see two good clusters and have a
score of about $-192$.  The Figure 3 scores should all be about
$-191$.  Your Figure 4 plot should be more or less decreasing, though
I have a slight upward bump at $K=3$.  Figure 5 should be reasonably
intuitively pleasing; my scores are $-192, -190, -199, -226, -235$ for
the different settings.

\leftright{Putting it all together\dots}{\emph{(30\%)}}

As usual, there is a {\tt run} script that puts this all together.
This one generates a \emph{large} number of plots.  However, in
contrast to previous projects, it only takes a few minutes to run to
completion (under $10$ minutes for me on my crummy laptop).  All of
this is based on the digits data from P1.

Please answer the questions associated with the figures in your
write-up.

\begin{enumerate}
\item For experiment 1, we just do $K$-means clustering with the
  furthest-first initialization on the MNIST data with different
  values of $K$.  The scores you get should be roughly $22600, 19900,
  17500, 16400, 15600$.  This takes me about $1.5$ minutes.  It
  produces the following Figures:

  \begin{description}
    \item[Fig 1] displays a plot of the \emph{scores} produced by
      $K$-means (the blue line) as well as ``regularized'' scores that try
      to take into account whether having more clusters is ``worth
      it'' or not.  The two versions of regularization we use are
      called BIC (Bayesian Information Criterion) and AIC (Akaike
      Information Criterion).  If you had to choose a value of $K$
      using each of the three scoring methods, which would you choose?
      Based on what you know about this data, which scoring function
      do you like best?

    \item[Fig 2] displays ``classification'' accuracy results for the
      different values of $K$.  That is, we compare the clusters
      produced by $K$-means to the \emph{true} clusters.  In all cases,
      higher scores are better.  The different lines are different
      methods for computing such a ``similarity of clusters'' score.
      Which seem to have good behaviour?  Why?

    \item[Fig 3-5] display the means found for $K=5, 10, 15$.  For
      each of these, do you see clusters that look like the actual
      digits?  How big does $K$ have to be before you essentially see
      a mean for each ``true'' digit?  Why do you suppose some digits
      are ``over-represented?''
    \end{description}

\item For experiment 2, we first reduce the dimensionality of the data
  using PCA.  We then test the Gaussian mixture model implementation
  for $5$ classes.  This takes me $30$ seconds.

  \begin{description}
  \item[Fig 6] plots the complete- (red xs) and incomplete- (blue os)
    log likelihoods for six runs of GMM with five clusters.  As a
    verification of implementation, you should see that the
    \emph{incomplete} log likelihood never decreases from one
    iteration to the next.  Moreover,
    the complete should lower-bound the incomplete (i.e., the blue
    line should always be above the red line).  Is there a significant
    difference between the ILLs and CLLs?  Do the CLLs seem to
    converge roughly to the same value across all runs?
  \end{description}

\item Next, the GMM is tried with different numbers of clusters, from
  $1$ to $16$.  This takes me $5.5$ minutes.

  \begin{description}
    \item[Fig 7] shows the incomplete log likelihood as a function of
      number of clusters.  It should be more or less monotonically
      increasing.  Why doesn't it drop after $K=10$, which seems to be
      a ``good'' number of clusters for this data?

    \item[Fig 8-12] show the means found by GMM for different values
      of $K$.  How do these compare to the $K$-means means?  How big
      does $K$ have to be before you essentially see a mean for each
      ``true'' digit?  Why do you suppose some digits are
      ``over-represented?''  Does this differ from your answer to the
      (nearly identical) question about $K$-means?
  \end{description}

\item Finally, we use clustering as a method of speeding up
  $K$-nearest-neighbor classification.  What we do is: for each class $y$,
  generate $K$ clusters.  We think of these $K$ clusters as
  ``prototypical'' points for class $y$.  Thus, for MNIST, since there
  are $10$ classes, we end up with $10\times K$ points.  These are now
  the ``training data'' that is fed into the KNN classifier.  When
  $K=1$, each class just gets a single representative; when $K=5$,
  each class gets five representatives.  This takes about $10$ seconds
  for me.

  \begin{description}
  \item[Fig 13] displays test accuracy and test time as a function of
    the number of clusters per (true) class.  The test time should
    monotonically increase as a function of the number of clusters
    (per class).  Does test time increase linearly?  Why or why not?
    Does the test accuracy monotonically increase: that is, does
    adding more clusters always help?  Why or why not?  What number of
    clusters would you choose?
  \end{description}
\end{enumerate}
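
Two textbook identities may help when answering the questions above.
Both use assumed notation ($\hat{L}$ for the maximized likelihood, $p$
for the number of free parameters, $N$ for the number of data points),
and the {\tt run} script may scale or sign its ``regularized'' scores
differently; in particular, how it turns a $K$-means score into a
likelihood is not specified here.  The standard definitions of the
criteria in Fig 1 are
\[
  \mathrm{AIC} = 2p - 2 \ln \hat{L}, \qquad
  \mathrm{BIC} = p \ln N - 2 \ln \hat{L},
\]
so BIC penalizes each extra parameter more heavily than AIC whenever
$\ln N > 2$, that is, for $N \ge 8$.  For Fig 6, assuming the complete
log likelihood (CLL) is the expected one under the responsibilities
$r_{nk}$ (the ``pie charts'') and that both quantities are evaluated
with the same parameters, the gap to the incomplete log likelihood
(ILL) is the entropy of those responsibilities:
\[
  \mathrm{ILL} - \mathrm{CLL}
    = - \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln r_{nk} \;\ge\; 0 ,
\]
which is zero only when every point is assigned to a single cluster
with certainty.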


\leftright{What to hand in}{}

Please hand in (a) your {\tt results.mat} file; (b) a {\tt
  writeup.pdf} containing the answers to all the questions in the
previous section, plus your plots (please include your plots in the
write-up, not as separate files); and (c) your code.  Note that you do
\emph{not} have to discuss anything related to the ``toy'' data set.



\end{document}

