MLRG/spring09


CS7941: Theoretical Machine Learning

Time: Thu 10:45am-12:05pm (see schedule for each week)

Location: MEB 3105


Expected Student Involvement

TBD.

Participants

Schedule and Readings

Date | Papers | Presenters
Introduction to Computational Learning Theory
R 15 Jan | PAC Learning; Kearns and Vazirani, Chapter 1 | Hal
R 22 Jan | Occam's Razor; Kearns and Vazirani, Chapter 2 | Amit
R 29 Jan | Learning with Noise; Kearns and Vazirani, Chapter 5 | Kris
Sample Complexity and Infinite Hypothesis Spaces
R 5 Feb | VC dimension; Kearns and Vazirani, Chapter 3 | Shuying, Zhan
R 19 Feb | Rademacher complexity; Bartlett and Mendelson | Piyush
T 24 Feb? | Covering numbers; Zhang | Seth
R 26 Feb | Pseudodimension, Fat Shattering Dimension; Bartlett, Long and Williamson | Adam
R 5 Mar | PAC-Bayes; McAllester | Ruihong
Boosting
R 12 Mar | Introduction to Boosting; Kearns and Vazirani, Chapter 4 | Arvind
R 26 Mar | Boosting and margins; Schapire, Freund, Bartlett and Lee | Jiarong
Assorted Topics
R 2 Apr | Hardness of learning; Kearns and Vazirani, Chapter 6 | Jagadeesh & Pravin
R 9 Apr | Portfolio selection; Blum and Kalai | Parasaran, Olga
R 16 Apr | Game theory and learning; Freund and Schapire | Nathan
R 23 Apr | TBD | Scott

Related Classes

Additional Readings

Reading Summaries

PAC Learning / 15 Jan (Daumé III)

This chapter motivates and defines the notion of PAC learning by working through a few "partial" definitions. The final definition is: say that C is PAC learnable with H if there exists an algorithm that takes examples (x,c(x)) for x \sim D and outputs h \in H that has error at most ε with probability at least 1 − δ. This must be true for all choices of c \in C and all distributions D. The algorithm is efficient if its runtime is poly(size(c), 1/ε, 1/δ), where size(\cdot) is some measure of the size of a concept. Intuitively, what's going on is that there's some unknown function c mapping inputs to binary labels that we want to "learn." We call the thing we learn h, and we would like it to have low error (at most ε). The hypothesis space we choose, H, should be at least as expressive as the class C of things we're trying to learn. The reason we have to allow the process to fail with probability δ is that we might just get supremely unlucky in the random training examples we sample from D.
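As a compact restatement of the definition (a sketch in my own notation; err_D(h) denotes \Pr_{x \sim D}[h(x) \neq c(x)]):

\forall c \in C,\ \forall D,\ \forall \epsilon, \delta \in (0,1): \quad \Pr_{S \sim D^m}\left[\mathrm{err}_D(h_S) \le \epsilon\right] \ge 1 - \delta

where h_S \in H is the hypothesis the algorithm outputs on sample S, and efficiency requires both m and the runtime to be poly(size(c), 1/\epsilon, 1/\delta).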

One of the key ideas that isn't emphasized much in this chapter, but will be very important for the rest of the class, is the notion of sample complexity. The sample complexity of a PAC learning algorithm is the number of samples needed to guarantee the PAC properties. In exactly the same way that algorithms people study computational complexity (because computation is the limited resource they care about), computational learning theory people study sample complexity (because samples are the limited resource they care about).

Two parts of the definition whose necessity is not obvious:

  • Why do we need to include size(c) in the notion of efficiency?
  • Why shouldn't we just have C = H?

The first issue is minor and also easy to understand. If we don't even allow ourselves time to write down the concept we want to learn, how can we be expected to learn it?

The second issue is the fun one, and we'll spend the majority of our time discussing it. Basically, what we'll show is that imposing C = H makes some problems that otherwise "should be" efficiently PAC learnable fail to be. In particular, we'll show that if C is the set of 3-term DNF formulae, then we have two results: (1) C is not efficiently PAC learnable using H = C; but (2) C is efficiently PAC learnable using H = 3-CNF formulae. This is a really cool result because it says very specifically that how you represent things matters in a very real sense: in some representations we can learn efficiently, in others we cannot.

We will begin by showing the positive result: that we can learn 3-term DNFs with 3-CNFs. In order to do this, we'll first show that we can learn boolean conjunctions with boolean conjunctions. Boolean conjunctions are easy (in the sense that we do them in CS 5350). Once we can do that, the idea is to take three literals u,v,w (one from each term) and create a meta-variable y_{u,v,w} = u \lor v \lor w. We then learn a conjunction of meta-variables and map this back to a 3-CNF. The technical detail is arguing that the mapping from 3-CNFs over the original variables to boolean conjunctions over the meta-variables is one-to-one: each example x translates to a meta-example (by evaluating every y_{u,v,w} on x) on which the conjunction evaluates exactly as the corresponding 3-CNF does on x.
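A minimal sketch of the conjunction-learning step (the standard elimination algorithm from Kearns and Vazirani, Chapter 1; the Python below and its names are my own illustration):

  # Elimination algorithm for learning a boolean conjunction.
  # Start with all 2n literals (x_i and not-x_i); every positive example
  # deletes the literals it falsifies. Only positive examples are used.
  def learn_conjunction(positive_examples):
      n = len(positive_examples[0])
      # A literal is (index, polarity): True means x_i, False means not-x_i.
      literals = {(i, p) for i in range(n) for p in (True, False)}
      for x in positive_examples:          # x is a tuple of 0/1 values
          for i, bit in enumerate(x):
              # x_i = 1 falsifies not-x_i; x_i = 0 falsifies x_i.
              literals.discard((i, bit == 0))
      return literals

  def evaluate(literals, x):
      # h(x) = AND over surviving literals.
      return all((x[i] == 1) == p for (i, p) in literals)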

Next we show the negative result. We'll need some CS-theory definitions. RP is the class of languages accepted by randomized polynomial-time algorithms with one-sided error: an input in the language is accepted with probability at least 1/2, and an input not in the language is always rejected. It is widely assumed that RP \neq NP (yes, Scott, they should get on that...). Next, the graph 3-coloring problem is: given an undirected graph G = (V,E), is there a mapping from vertices to the set {red,green,blue} so that no edge has the same color on both endpoints? This problem is NP-complete.

What we want to show is that learning 3-term DNFs with 3-term DNFs is hard. Specifically, we will show that if we could solve that learning problem efficiently, then we could also solve the graph 3-coloring problem efficiently, where "efficiently" means in RP. Assuming RP \neq NP, this is a contradiction. The big question is how to turn a graph 3-coloring instance into a learning problem. The construction is easy. For each vertex, we create a positive example with a 0 in that vertex's coordinate and 1s elsewhere; for each edge, we create a negative example with 0s at both endpoints' coordinates and 1s elsewhere. First, we argue that the graph is 3-colorable if and only if the resulting labeled examples can be represented by a 3-term DNF. Now, we simply set ε = 1/(2|S|), where S is the example set, and put a uniform distribution over the examples. If we could PAC learn this efficiently, we would be guaranteed a hypothesis that makes no errors: each example carries probability mass 1/|S| > ε, so a hypothesis with error at most ε cannot misclassify any of them.
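A sketch of the construction in code (my own rendering of the reduction; the names are illustrative):

  # Turn a graph G = (V, E) into a labeled sample over {0,1}^|V|.
  # Vertices become positive examples (a single 0 at the vertex's position);
  # edges become negative examples (0s at both endpoints).
  def graph_to_sample(num_vertices, edges):
      sample = []
      for v in range(num_vertices):
          x = [1] * num_vertices
          x[v] = 0
          sample.append((tuple(x), 1))   # positive example for vertex v
      for (u, v) in edges:
          x = [1] * num_vertices
          x[u] = x[v] = 0
          sample.append((tuple(x), 0))   # negative example for edge (u, v)
      return sample

Roughly, a 3-coloring yields a consistent 3-term DNF by giving each color class c the term conjoining x_i over all vertices i not colored c, and conversely.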

Occam's Razor / 22 Jan (Amit Goyal)

This chapter considers a different definition of learning: rather than measuring the predictive power of a hypothesis, it judges a hypothesis by how succinctly it explains the data. This new definition will be called Occam learning. The idea comes from the philosophy expounded by William of Occam: one should not make more assumptions than the minimum needed. This principle is often called the principle of parsimony, and it underlies all scientific modeling and theory building.

In our simple concept learning, how can we define succinctness?

Succinctness will be measured by the size of the representation of the hypothesis. Equivalently, we can define succinctness via the cardinality of the hypothesis class: if the cardinality of the hypothesis class is large, then we need more bits to specify a member of that class than of a smaller class. Thus, an algorithm is an Occam algorithm if it finds a short hypothesis that is consistent with the data.

Consistent Learner

Given m examples (x_i, c(x_i)), i = 1,\ldots,m, a consistent learner outputs h \in H such that h(x_i) = c(x_i) for all i.

Occam's Razor, Cardinality version

We will prove the result below in class; it provides a general bound on the number of training examples (the sample complexity) sufficient for any consistent learner to successfully learn any target concept in H: m \ge \frac{1}{\epsilon}(\ln|H| + \ln(1/\delta)). Notice that m grows linearly in 1/ε, logarithmically in 1/δ, and logarithmically in the size |H| of the hypothesis space.
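As a quick sanity check (a sketch; the example uses the standard count |H| = 3^n for conjunctions over n variables, each variable appearing positively, negatively, or not at all):

  # Sample size sufficient for a consistent learner (cardinality version):
  # m >= (1/epsilon) * (ln|H| + ln(1/delta)).
  import math

  def occam_sample_bound(log_H, epsilon, delta):
      """log_H is ln|H|; returns the smallest sufficient integer m."""
      return math.ceil((log_H + math.log(1.0 / delta)) / epsilon)

  # Boolean conjunctions over n = 10 variables: |H| = 3^10.
  print(occam_sample_bound(log_H=10 * math.log(3), epsilon=0.1, delta=0.05))  # 140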

Dependence on |H|

Why do we have a dependence on the size of the hypothesis space |H|? The bigger the hypothesis space, the more rules there are that can explain the training examples. However, this doesn't carry over to test examples. For example:

  • If we represent all our training examples as a table, the training error will be zero but such a hypothesis will not have any predictive power.
  • Everyone in the class wrote down a 0 or a 1 for 10 training points and for 10 test points. These were matched against 20 random coin tosses I made before class. The most accurate guess had a training error of 20%. This matches our intuition that, given many hypotheses, we will find one that fits the training data very well. However, the person who made the best guess on the training data ended up with a test error of 60% when the guesses were matched against the test data.

We will revisit the boolean-conjunction algorithm from last time (which used only positive examples) and show that this new result gives it a better sample-complexity bound. Additionally, we will present a new algorithm that also uses negative examples and learns conjunctions with few relevant variables.

Rademacher and Gaussian complexities / 19 Feb (Piyush Rai)

Learning seeks to minimize the expected error (also called the "true error") on test data. However, during training all we have is a finite amount of training data, so estimating the expected error (and minimizing it) is not possible in general. As a surrogate, we seek to minimize an upper bound on the expected error. Luckily, such an upper bound can be computed from the training data alone, plus a measure of the "complexity" of the function class being learned from. The general form of this bound looks like this:

Expected error <= Empirical error + (function class complexity dependent penalty term)

The bound holds simultaneously for every function f in the class under consideration. The empirical error above is simply the fraction of training examples misclassified by f.
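One standard instantiation (a sketch, not the specific theorem from the paper; here I use the convention in which the Rademacher average carries a 1/n factor and no absolute value, which differs from the definition below by constant factors): for a class F of [0,1]-valued functions, with probability at least 1 − δ over the n training examples, simultaneously for all f \in F,

\mathbb{E}[f(X)] \le \frac{1}{n}\sum_{i=1}^n f(X_i) + 2R_n(F) + \sqrt{\frac{\ln(1/\delta)}{2n}}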

Why care about error bounds? They are used for model selection via a scheme called "complexity regularization": choose the model class containing the function that achieves the best upper bound on the expected error. Clearly, we want the tightest possible upper bound on the expected error, and since the bound depends on the complexity of the function class being considered, the measure of complexity plays a crucial role in obtaining a tight one.

Last time, we saw one measure of function class complexity (VC dimension) that is based purely on combinatorial arguments and doesn't take into account the training data. There has been empirical and theoretical evidence that bounds assuming a fixed complexity penalty (i.e., one that doesn't depend on training data) are not universally effective. This motivates us to consider some alternative complexity measures that are data-dependent. Examples include Maximum Discrepancy, Rademacher Complexity, and Gaussian Complexity. It has been shown that error bounds depending on such measures can never be much worse than VC-dimension based results.

All these measures can be computed using the training data. For example, the (empirical) Rademacher complexity of a function class F is defined as:

\hat{R}_n(F) = \mathbb{E}\left[\sup_{f \in F} \left|\frac{2}{n}\sum_{i=1}^n \sigma_i f(X_i)\right| \,\Big|\, X_1,\ldots,X_n\right]

where the σ_i are independent uniform {±1}-valued random variables and the expectation is taken over the sequence σ. The Rademacher complexity of F is R_n(F) = \mathbb{E}[\hat{R}_n(F)], where the expectation is now taken over all samples of size n.

Gaussian complexity is defined in the same manner, replacing each σ_i by g_i, where the g_i are independent N(0,1) random variables.

Intuitively, the Rademacher (and Gaussian) complexity measures the extent to which some function in the class F can be correlated with a random noise sequence of length n. High achievable correlation implies "wiggliness" (and thus high complexity) of F.
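As a toy illustration (my own sketch, not from the paper): for a finite function class the sup is a max over finitely many functions, so the empirical Rademacher complexity can be estimated by Monte Carlo, using the 2/n-with-absolute-value convention from the definition above.

  # Monte Carlo estimate of the empirical Rademacher complexity of a
  # finite function class, in the 2/n-with-absolute-value convention.
  import itertools
  import random

  def empirical_rademacher(function_values, num_trials=2000):
      """function_values: one tuple (f(X_1), ..., f(X_n)) per function f."""
      n = len(function_values[0])
      total = 0.0
      for _ in range(num_trials):
          sigma = [random.choice((-1, 1)) for _ in range(n)]
          total += max(abs((2.0 / n) * sum(s * fx for s, fx in zip(sigma, f)))
                       for f in function_values)
      return total / num_trials

  # Two constant functions can barely correlate with noise...
  n = 10
  print(empirical_rademacher([tuple([1] * n), tuple([-1] * n)]))   # ~0.5
  # ...while the class of all 2^n sign patterns fits any noise exactly.
  print(empirical_rademacher(list(itertools.product((-1, 1), repeat=n))))  # 2.0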

By definition, however, computing these complexity measures can be prohibitively hard for many interesting function classes (since it involves an optimization over the function class F, which is usually a hard problem). But, fortunately, there exist simpler schemes that allow us to compute these complexities for specific cases. The basic idea is that some function classes can be represented as a combination of functions from simpler "basis" classes (think of kernel expansion form of the kernel-based SVM solution). The theory of Rademacher and Gaussian complexities tells us that estimating the complexity of such a class just requires estimating the complexities of the basis classes. We will look at some examples of such cases.
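The structural results have roughly this flavor (a rough sketch from memory; see Bartlett and Mendelson for the precise statements and constants):

R_n(F_1 + F_2) \le R_n(F_1) + R_n(F_2), \qquad R_n(cF) = |c|\,R_n(F), \qquad R_n(\mathrm{conv}\,F) = R_n(F)

R_n(\phi \circ F) \le 2 L_\phi\,R_n(F) \quad \text{for } \phi \text{ Lipschitz with constant } L_\phi,\ \phi(0) = 0

The last (contraction) property is what lets us pass from a base class, such as the kernel expansions mentioned above, through a loss function without blowing up the complexity.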

Past Semesters
