\documentclass[fleqn]{article}

\usepackage{haldefs}
\usepackage{notes}
\usepackage{url}

\begin{document}
\lecture{CS5350: Machine Learning}{HW2A: PAC learning}{Due 6 Nov 2008}

\section{Written Exercises}

\bee
\i Generalize the algorithm for the rectangle learning problem to
$D$-dimensional space.  In particular, suppose $\cC$ is the class of
all $D$-dimensional axis-aligned hyperrectangles (i.e.,
high-dimensional boxes), and $\cH$ is the same.  Show that $\cC$ is
efficiently PAC learnable using $\cH$, and derive the sample
complexity of your algorithm as a function of $D$.  How does this
relate to our discussion of feature selection?

\i Consider our standard decision tree learning algorithm on $D$
binary features.

\bee
\i Suppose that the concept class is \emph{any} binary function and we
allow ourselves to build decision trees that are as deep as we need.
(In the noise-free setting, we know that a decision tree can compute
any binary function.)  Use the Occam bound to show that
this problem is PAC learnable.  What is the sample
complexity?

\i Suppose we know that the concept class is limited to decision trees
of maximum depth $\ka$ and we limit our hypothesis space to decision
trees of maximum depth $\ka$.  (The depth of a decision tree is the
maximum number of features used in any decision.)  Can you use the
Occam bound to show whether this problem is PAC learnable?  Why or why
not?  What is the sample complexity in terms of $\ka$ and $D$?  \ene


\i {\bf 6350 Only:} Prove the Occam bound.  In particular, consider a
fixed but unknown concept class $\cC$, distribution $\cD$ and concept
$c \in \cC$.  Suppose we have $N$ labeled examples from $\cD$ (labeled
with $c$).  Let $\cL$ be a polynomial-time learning algorithm that
outputs a hypothesis $h \in \cH$ consistent with the training sample,
where $\cH$ is \emph{finite}.  Show that $\cL$ PAC-learns $\cC$ using $\cH$ so long
as $\log \card{\cH} \leq bN\ep - \log (1/\de)$ for some constant $b$.

\i Consider boosting decision stumps.  I claimed in class that this
learns a linear classifier.  Show formally that this is true.  (For
simplicity, you may assume Boolean features.)  What advantages does
boosting decision stumps have over, say, SVM learning?
\ene
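
\paragraph{Note on the Occam bound.}  For the exercises that ask for a
sample complexity, it may help to read the Occam condition as a lower
bound on $N$.  The following is only an algebraic rearrangement of the
inequality $\log \card{\cH} \leq bN\ep - \log (1/\de)$ stated above,
assuming $b > 0$ and $\ep > 0$; the constant $b$ is left unspecified,
exactly as in the exercise:
\[
  N \;\geq\; \frac{1}{b\,\ep} \left( \log \card{\cH} + \log (1/\de) \right) .
\]
Under this reading, bounding the sample complexity for a given
hypothesis class comes down to bounding $\log \card{\cH}$.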


\end{document}
