MLRG/spring09
CS7941: Theoretical Machine Learning
Time: Thu 10:45am-12:05pm (see schedule for each week)
Location: MEB 3105
Expected Student Involvement
TBD.
Participants
- Hal Daumé III, Assistant Professor, School of Computing
- Arvind Agarwal, PhD Student, School of Computing
- Nathan Gilbert, PhD Student, School of Computing
- Piyush Rai, PhD Student, School of Computing
- John Moeller, PhD Student, School of Computing
- Parasaran Raman, PhD Student, School of Computing
- Ruihong Huang, PhD Student, School of Computing
- Seth Juarez, PhD Student, School of Computing
- Shuying Shen, PhD Student, Department of Biomedical Informatics
- Jiarong Jiang, PhD Student, School of Computing
- Olga Patterson, PhD Student, Department of Biomedical Informatics
- Jagadeesh Jagarlamudi, PhD Student, School of Computing
- Adam R. Teichert, MS Student, School of Computing
- Amit Goyal, PhD Student, School of Computing
- Pravin Chandrasekaran, MS Student, School of Computing
- Kristina Doing-Harris, MA, MS, PhD (Glasgow), BS Student, School of Computing
- Scott Alfeld, MS Student, School of Computing
- Zhan Wang, PhD Student, School of Computing
Schedule and Readings
| Date | Papers | Presenters |
|---|---|---|
| Introduction to Computational Learning Theory | | |
| R 15 Jan | PAC Learning; Kearns and Vazirani, Chapter 1 | Hal |
| R 22 Jan | Occam's Razor; Kearns and Vazirani, Chapter 2 | Amit |
| R 29 Jan | Learning with Noise; Kearns and Vazirani, Chapter 5 | Kris |
| Sample Complexity and Infinite Hypothesis Spaces | | |
| R 5 Feb | VC dimension; Kearns and Vazirani, Chapter 3 | Shuying, Zhan |
| T 17 Feb | Rademacher complexity; Bartlett and Mendelson | Piyush |
| R 19 Feb | Covering numbers; Zhang | Seth |
| R 26 Feb | Pseudodimension, Fat Shattering Dimension; Bartlett, Long and Williamson | Adam |
| R 5 Mar | PAC-Bayes; McAllester | Ruihong |
| Boosting | | |
| R 12 Mar | Introduction to Boosting; Kearns and Vazirani, Chapter 4 | Arvind |
| R 26 Mar | Boosting and margins; Schapire, Freund, Bartlett and Lee | Jiarong |
| Assorted Topics | | |
| R 2 Apr | Hardness of learning; Kearns and Vazirani, Chapter 6 | Jagadeesh & Pravin |
| R 9 Apr | Portfolio selection; Blum and Kalai | Parasaran, Olga |
| R 16 Apr | Game theory and learning; Freund and Schapire | Nathan |
| R 23 Apr | TBD | Scott |
Related Classes
- Machine Learning Theory, CMU by Avrim Blum
- Computational Learning Theory, Penn by Michael Kearns and Koby Crammer
- Learning Theory, TTI-C by Sham Kakade
- Introduction to Computational Learning Theory, Columbia by Rocco Servedio
- Theoretical Machine Learning, Princeton by Rob Schapire
- The Computational Complexity of Machine Learning, U Texas by Adam Klivans
Additional Readings
Reading Summaries
PAC Learning / 15 Jan (Daumé III)
This chapter motivates and defines the notion of PAC learning by working through a few "partial" definitions. The final definition is: say that C is PAC learnable with H if there exists an algorithm that takes examples (x, c(x)) for x ~ D and outputs an h ∈ H that has error at most ε with probability at least 1 − δ. This must be true for all choices of c ∈ C and all distributions D. It is efficient if its runtime is poly(size(c), 1/ε, 1/δ), where size(c) is some measure of the size of a concept. Intuitively, what's going on is that there's some unknown function c that maps inputs to binary labels that we want to "learn." We call the thing we learn h and we would like to learn a function with low error (ε). The space of hypotheses we choose, H, should be at least as representative as the set of things we're trying to learn, C. The reason why we have to allow the process to fail with probability δ is that we might just get supremely unlucky in the random training examples that we sample from D.
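To make the moving parts concrete, here is a minimal Python sketch (ours, not the chapter's) of the objects in the definition: a target concept c, a hypothesis h, and a distribution D, with err(h) estimated by sampling. All names and the choice of a uniform D are illustrative assumptions.

```python
import random

def sample_input(n):
    """Draw x ~ D; here D is uniform over boolean vectors of length n (an assumption)."""
    return tuple(random.randint(0, 1) for _ in range(n))

def error(h, c, n, num_samples=10000):
    """Monte Carlo estimate of err(h) = Pr_{x ~ D}[h(x) != c(x)]."""
    draws = (sample_input(n) for _ in range(num_samples))
    return sum(h(x) != c(x) for x in draws) / num_samples

# Example: the target is x1 AND x2; the hypothesis only checks x1.
c = lambda x: x[0] == 1 and x[1] == 1
h = lambda x: x[0] == 1
print(error(h, c, n=3))  # roughly 0.25 under the uniform D
```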
One of the key ideas that's not made hugely important in this chapter, but will be very important for the rest of class, is the notion of sample complexity. The sample complexity of a PAC learning algorithm is the number of samples needed in order to guarantee that we obey the PAC properties. In exactly the same way that algorithms people study computational complexity (because computation is the limited resource that they care about), computational learning theory people care about sample complexity (because samples are the limited resource that they care about).
The things that are not quite obvious why we need them are:
- Why do we need to include size(c) in the notion of efficiency?
- Why shouldn't we just have C = H?
The first issue is minor and also easy to understand. If we don't even allow ourselves time to write down the concept we want to learn, how can we be expected to learn it?
The second issue is the fun one, and we'll spend the majority of our time discussing it. Basically, what we'll show is that imposing C = H makes some problems not efficiently PAC learnable that otherwise "should be." In particular, we'll show that if C is the set of 3-term DNF formulae, then we have two results: (1) C is not efficiently PAC learnable using H = C; but (2) C is efficiently PAC learnable using H = 3-CNF formulae. This is a really cool result because it says very specifically that how you represent things matters in a very real sense: in some representations we can learn efficiently, in others we cannot.
We will begin by showing the positive result: that we can learn 3-term DNFs with 3-CNFs. In order to do this, we'll first show that we can learn boolean conjunctions with boolean conjunctions. Boolean conjunctions are easy (in the sense that we do them in CS 5350). Once we can do that, the idea is to take three literals u, v, w and create a meta-term y_{u,v,w} = u ∨ v ∨ w. We then learn a conjunction of meta-terms and map this back to a 3-CNF. The technical details are arguing that the mapping from 3-CNFs over the original variable set to boolean conjunctions over the meta-term set is one-to-one.
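Here is a rough Python sketch of both pieces, under our own encoding choices (literals as (variable, sign) pairs, examples as dicts): the elimination algorithm for conjunctions, and the meta-term construction.

```python
from itertools import product

def learn_conjunction(positive_examples, variables):
    """Elimination algorithm: start with both literals (v, 1) = "v" and
    (v, 0) = "NOT v" for every variable, then delete any literal that some
    positive example falsifies. Examples are dicts mapping variable -> 0/1."""
    literals = {(v, s) for v in variables for s in (0, 1)}
    for x in positive_examples:
        literals = {(v, s) for (v, s) in literals if x[v] == s}
    return literals  # the conjunction of the surviving literals

def to_meta_terms(x, n):
    """Re-encode an assignment x over variables 0..n-1 as an assignment over
    meta-terms y_{u,v,w} = u OR v OR w, one per triple of literals."""
    lits = [(i, s) for i in range(n) for s in (0, 1)]
    val = lambda lit: int(x[lit[0]] == lit[1])
    return {(u, v, w): int(val(u) or val(v) or val(w))
            for u, v, w in product(lits, repeat=3)}
```

Running learn_conjunction on the meta-term encodings of the positive examples (with the literal triples as the variable set) yields a conjunction of meta-terms, which maps directly back to a 3-CNF.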
Next we show the negative result. We'll need some CS-theory definitions. RP is the set of languages that can be accepted, with one-sided error, by polynomial-time algorithms that are allowed to flip coins. It is widely assumed that RP ≠ NP (yes, Scott, they should get on that...). Next, the graph 3-coloring problem is: let G = (V, E) be an undirected graph; we ask if there is a mapping from vertices to the set {red, green, blue} so that no edge has the same color on both endpoints. This problem is NP-complete.
What we want to show is that learning 3-term DNFs with 3-term DNFs is hard. Specifically, we will show that if we could solve that learning problem efficiently, then we could also solve the graph 3-coloring problem efficiently, where we define "efficiently" as RP. Assuming RP ≠ NP, this is a contradiction. The big question is how to turn a graph 3-coloring problem into a learning problem. The construction is easy. The positive examples are indicator vectors for the vertices; the negative examples are indicator vectors for the edges. First, we argue that the graph is 3-colorable if and only if the resulting labeled examples can be represented by a 3-term DNF. Now, we simply set ε = 1/(2|S|), where |S| is the number of examples, and put a uniform distribution over the examples. If we were able to PAC learn this efficiently, then we'd be guaranteed to find a hypothesis that makes no errors.
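For concreteness, a small sketch of the example construction from Kearns and Vazirani (the function name and tuple encoding are our choices): the vertex examples have a single 0, at the vertex's position, and the edge examples have two.

```python
def graph_to_examples(num_vertices, edges):
    """The positive example for vertex i is the all-ones vector with a 0 at
    position i; the negative example for edge (i, j) has 0s at positions i
    and j and 1s elsewhere."""
    vec = lambda zeros: tuple(0 if k in zeros else 1 for k in range(num_vertices))
    positives = [vec({i}) for i in range(num_vertices)]
    negatives = [vec({i, j}) for (i, j) in edges]
    return positives, negatives

# A triangle is 3-colorable, so some 3-term DNF is consistent with this sample.
pos, neg = graph_to_examples(3, [(0, 1), (1, 2), (0, 2)])
```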
Occam's Razor / 22 Jan (Amit Goyal)
This chapter considers a different definition of learning: rather than measuring the predictive power of a hypothesis, it judges hypotheses on the basis of how succinctly they explain the data. This new definition is called Occam learning. The idea behind it comes from the philosophy expounded by William of Occam: one should not make more assumptions than the minimum needed. This principle is often called the principle of parsimony, and it underlies all scientific modeling and theory building.
In our simple concept learning, how can we define succinctness?
Succinctness will be measured by the size of the representation of the hypothesis. Equivalently, we can define succinctness by the cardinality of the hypothesis class: if the cardinality of the hypothesis class is large, then we need more bits to represent a member of that class than we would for a smaller class. Thus an algorithm is an Occam algorithm if it finds a short hypothesis that is consistent with the data.
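As a tiny illustration of the counting view (our example, not the chapter's):

```python
import math

# Pointing to one hypothesis in a finite class H takes about log2 |H| bits,
# so "short description" and "small class" are two views of the same quantity.
bits = lambda class_size: math.ceil(math.log2(class_size))
print(bits(3 ** 10))  # conjunctions over 10 variables (|H| = 3^10): 16 bits
```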
Consistent Learner
Given m examples (x_i, c(x_i)), a learner is consistent if it outputs a hypothesis h with h(x_i) = c(x_i) for all i = 1, ..., m.
Occam's Razor, Cardinality version
We will prove the result below in class; it provides a general bound on the number of training examples (the sample complexity) sufficient for any consistent learner to successfully learn any target concept in H:

m ≥ (1/ε)(ln |H| + ln(1/δ)).

Notice m grows linearly in 1/ε and logarithmically in 1/δ and in the size of the hypothesis space |H|.
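To get a feel for the numbers, here is the bound evaluated in Python; taking |H| = 3^n for boolean conjunctions over n variables (each variable appears positively, negatively, or not at all) and the particular ε, δ are our choices, not the chapter's.

```python
import math

# Evaluating m >= (1/eps) * (ln|H| + ln(1/delta)).
def occam_bound(ln_H, eps, delta):
    return math.ceil((ln_H + math.log(1 / delta)) / eps)

# Boolean conjunctions over n = 20 variables: |H| = 3**20, eps = 0.1, delta = 0.05.
print(occam_bound(20 * math.log(3), eps=0.1, delta=0.05))  # about 250 examples
```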
Dependence on | H |
Why do we have a dependence on the size of the hypothesis space |H|? The bigger the hypothesis space, the more rules there are to explain the training examples. However, this doesn't hold for test examples. For example:
- If we represent all our training examples as a table, the training error will be zero but such a hypothesis will not have any predictive power.
- Everyone in the class writes down a 0 or a 1 for 10 training points and for 10 test points. This is matched against 20 random coin tosses made by me before class. The highest-accuracy guess had a training error of 20%. This matches well with our intuition that, given many hypotheses, we will find a hypothesis that works very well on the training data. However, the person who made the best guess for the training data ended up with a test error of 60% when it came time to match the guesses against the test data.
We will revisit the boolean-conjunction algorithm from last time (which used only positive examples) and show that this new result gives a better bound on its sample complexity. Additionally, we will propose a new algorithm that also considers negative examples and learns conjunctions with few relevant variables.
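As a back-of-the-envelope preview of why few relevant variables help (the counting here is our own assumption, not the chapter's): restricting H to conjunctions of at most k literals shrinks ln |H| from about n ln 3 to at most k ln(2n + 1).

```python
import math

# Conjunctions of at most k of the 2n literals number at most (2n + 1)**k
# (each of k slots picks a literal or nothing), so the cardinality bound
# scales with k*log(n) rather than n when few variables are relevant.
def occam_bound(ln_H, eps, delta):
    return math.ceil((ln_H + math.log(1 / delta)) / eps)

n, k, eps, delta = 100, 5, 0.1, 0.05
print(occam_bound(n * math.log(3), eps, delta))          # all conjunctions: ~1129
print(occam_bound(k * math.log(2 * n + 1), eps, delta))  # at most k literals: ~296
```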