MLRG/spring09

Revision as of 17:13, 15 January 2009

CS7941: Theoretical Machine Learning

Time: Thr 10:45-12:05pm (see schedule for each week)

Location: MEB 3105


Expected Student Involvement

TBD.

Participants

Schedule and Readings

Date | Papers | Presenters

Introduction to Computational Learning Theory
R 15 Jan | PAC Learning; Kearns and Vazirani, Chapter 1 | Hal
R 22 Jan | Occam's Razor; Kearns and Vazirani, Chapter 2 | Amit
R 29 Jan | Learning with Noise; Kearns and Vazirani, Chapter 5 | John

Sample Complexity and Infinite Hypothesis Spaces
R 5 Feb | VC dimension; Kearns and Vazirani, Chapter 3 | Arvind
R 12 Feb | Rademacher complexity; Bartlett and Mendelson | Piyush
R 19 Feb | Covering numbers; Zhang | Seth
R 26 Feb | Pseudodimension, Fat Shattering Dimension; Bartlett, Long and Williamson | Adam
R 5 Mar | PAC-Bayes; McAllester | Ruihong

Boosting
R 12 Mar | Introduction to Boosting; Kearns and Vazirani, Chapter 4 | Arvind
R 26 Mar | Boosting and margins; Schapire, Freund, Bartlett and Lee | Jiarong

Assorted Topics
R 2 Apr | Hardness of learning; Kearns and Vazirani, Chapter 6 | Jagadeesh
R 9 Apr | Portfolio selection; Blum and Kalai | Parasaran
R 16 Apr | Game theory and learning; Freund and Schapire | Nathan
R 23 Apr | TBD |

Related Classes

Additional Readings

Reading Summaries

PAC Learning / 15 Jan (Daumé III)

This chapter motivates and defines the notion of PAC learning by working through a few "partial" definitions. The final definition is: say that C is PAC learnable with H if there exists an algorithm that takes examples (x,c(x)) for x ~ D and outputs h \in H that has error at most ε with probability at least 1 − δ. This must be true for all choices of c \in C and all distributions D. It is efficient if its runtime is poly(size(c), 1/ε, 1/δ), where size(\cdot) is some measure of the size of a concept. Intuitively, what's going on is that there's some unknown function c that maps inputs to binary labels that we want to "learn." We call the thing we learn h, and we would like to learn a function with low error (at most ε). The space of hypotheses we choose, H, should be at least as expressive as the set of things we're trying to learn, C. The reason we have to allow the process to fail with probability δ is that we might just get supremely unlucky in the random training examples we sample from D.
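To make the definition concrete, here is a minimal simulation (not from the chapter; the threshold concept class and all names here are illustrative): the target is c(x) = 1 iff x ≥ θ under a uniform D on [0,1], the learner is the trivially consistent one that outputs the smallest positive example it saw, and we check that the true error of its hypothesis is small.

```python
import random

def pac_learn_threshold(examples):
    """Consistent learner for thresholds c(x) = 1 iff x >= theta:
    output the smallest positive example seen (or 1.0 if none)."""
    positives = [x for x, y in examples if y == 1]
    return min(positives) if positives else 1.0

random.seed(0)
theta = 0.3                      # the unknown target concept
m = 500                          # number of training samples
sample = [(x, int(x >= theta)) for x in (random.random() for _ in range(m))]
theta_hat = pac_learn_threshold(sample)

# Under the uniform D, the true error of h is the measure of the region
# where h and c disagree: the interval [theta, theta_hat).
error = theta_hat - theta
print(error < 0.05)  # True: with m = 500 samples the gap is tiny w.h.p.
```

The δ in the definition accounts for unlucky samples: with small probability, no training point lands just above θ and the gap (hence the error) stays large.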

One of the key ideas that's not made hugely important in this chapter, but will be very important for the rest of class, is the notion of sample complexity. The sample complexity of a PAC learning algorithm is the number of samples needed in order to guarantee that we obey the PAC properties. In exactly the same way that algorithms people study computational complexity (because computation is the limited resource that they care about), computational learning theory people care about sample complexity (because samples are the limited resource that they care about).
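As a taste of what a sample-complexity statement looks like, here is the standard bound for a consistent learner over a finite hypothesis class, m ≥ (1/ε)(ln|H| + ln(1/δ)) — this particular bound belongs to the Occam's Razor material rather than this chapter, and the code is only a sketch:

```python
import math

def sample_complexity(h_size, eps, delta):
    """Samples sufficient for a consistent learner over a finite class H
    to be (eps, delta)-PAC: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# Boolean conjunctions over n = 10 variables: |H| = 3^10, since each
# variable appears positively, negated, or not at all.
print(sample_complexity(3 ** 10, eps=0.1, delta=0.05))  # → 140
```

Note the logarithmic dependence on |H| and on 1/δ: doubling the hypothesis class or halving the failure probability costs only an additive number of samples.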

Two parts of the definition are not obviously necessary:

  • Why do we need to include size(c) in the notion of efficiency?
  • Why shouldn't we just have C = H?

The first issue is minor and also easy to understand. If we don't even allow ourselves time to write down the concept we want to learn, how can we be expected to learn it?

The second issue is the fun one, and we'll spend the majority of our time discussing it. Basically, what we'll show is that imposing C = H makes some problems not efficiently PAC learnable that otherwise "should be." In particular, if C is the set of 3-term DNF formulae, then we get two results: (1) C is not efficiently PAC learnable using H = C; but (2) C is efficiently PAC learnable using H = 3-CNF formulae. This is a really cool result because it says, very concretely, that representation matters: under some representations we can learn efficiently, under others we cannot.

We will begin by showing the positive result: that we can learn 3-term DNFs with 3-CNFs. To do this, we first show that we can learn boolean conjunctions with boolean conjunctions. Boolean conjunctions are easy (in the sense that we do them in CS 5350). Once we can do that, the idea is to take three literals u, v, w and create a meta-variable y_{u,v,w} = u \lor v \lor w. We then learn a conjunction over the meta-variables and map the result back to a 3-CNF. The technical work is in arguing that the mapping from 3-CNFs over the original variables to boolean conjunctions over the meta-variables is one-to-one (each clause of a 3-CNF is a disjunction of three literals, hence exactly one meta-variable).
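The conjunction-learning subroutine everything rests on is the standard elimination algorithm: start with every literal and delete any literal that a positive example falsifies. A small sketch (the encoding of a literal as an (index, polarity) pair is my own choice):

```python
from itertools import product

def learn_conjunction(examples, n):
    """Elimination algorithm: start with all 2n literals and drop any
    literal falsified by a positive example.  A literal (i, b) means
    'x_i == b'; the hypothesis is the AND of the surviving literals."""
    literals = {(i, b) for i in range(n) for b in (True, False)}
    for x, y in examples:
        if y == 1:
            literals -= {(i, b) for (i, b) in literals if x[i] != b}
    return literals

# Target: x0 AND (NOT x2), over n = 3 variables; train on all 2^3 inputs.
n = 3
def target(x):
    return x[0] and not x[2]
examples = [(x, int(target(x))) for x in product([False, True], repeat=n)]
h = learn_conjunction(examples, n)
print(sorted(h))  # → [(0, True), (2, False)], i.e. x0 AND NOT x2
```

To learn a 3-term DNF as a 3-CNF, one would run this same algorithm over the O(n^3) meta-variables y_{u,v,w} instead of the original literals.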

Next we show the negative result. We'll need some CS-theory definitions. RP is the class of languages accepted by randomized polynomial-time algorithms that never accept an input outside the language and accept each input in the language with probability at least 1/2. It is widely assumed that RP \neq NP (yes, Scott, they should get on that...). Next, the graph 3-coloring problem is: given an undirected graph G = (V,E), is there a mapping from vertices to the set {red, green, blue} such that no edge has both endpoints the same color? This problem is NP-complete.

What we want to show is that learning 3-term DNFs with 3-term DNFs is hard. Specifically, we will show that if we could solve that learning problem efficiently, then we could also solve the graph 3-coloring problem efficiently, where "efficiently" means in RP. Assuming RP \neq NP, this cannot happen. The big question is how to turn a graph 3-coloring instance into a learning problem. The construction is easy. The positive examples are vectors encoding the vertices; the negative examples are vectors encoding the edges. First, we argue that the graph is 3-colorable if and only if the resulting labeled example set S can be represented by a 3-term DNF. Now, we simply set ε = 1/(2|S|) and put a uniform distribution over the examples. If we were able to PAC learn this efficiently, then with probability at least 1 − δ we would find a hypothesis whose error is below the weight 1/|S| of any single example — that is, a hypothesis that makes no errors on S, from which a 3-coloring can be read off.
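The example construction in the reduction can be sketched as follows, assuming the encoding used by Kearns and Vazirani (the vector for vertex v is all ones except a zero at position v, and the vector for an edge has zeros at both endpoints); the helper name is mine:

```python
def coloring_to_examples(vertices, edges):
    """Reduce graph 3-coloring to learning 3-term DNF: each vertex v
    becomes a positive example (all ones, 0 at position v) and each
    edge (u, v) a negative example (all ones, 0s at positions u, v)."""
    n = len(vertices)
    def vec(zeros):
        return tuple(0 if i in zeros else 1 for i in range(n))
    positives = [vec({v}) for v in vertices]
    negatives = [vec({u, v}) for (u, v) in edges]
    return positives, negatives

# A triangle is 3-colorable, so this labeled sample is consistent with
# some 3-term DNF (intuitively, one term per color class).
pos, neg = coloring_to_examples([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
print(pos)  # → [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(neg)  # → [(0, 0, 1), (1, 0, 0), (0, 1, 0)]
```

The construction is polynomial in the size of the graph, which is what lets an efficient learner be turned into an RP algorithm for 3-coloring.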

Past Semesters
