MLRG/spring09
CS7941: Theoretical Machine Learning
Time: Thu 10:45am-12:05pm (see schedule for each week)
Location: MEB 3105
Expected Student Involvement
TBD.
Participants
- Hal Daumé III, Assistant Professor, School of Computing
- Arvind Agarwal, PhD Student, School of Computing
- Nathan Gilbert, PhD Student, School of Computing
- Piyush Rai, PhD Student, School of Computing
- John Moeller, PhD Student, School of Computing
- Parasaran Raman, PhD Student, School of Computing
- Ruihong Huang, PhD Student, School of Computing
- Seth Juarez, PhD Student, School of Computing
- Shuying Shen, PhD Student, Department of Biomedical Informatics
- Jiarong Jiang, PhD Student, School of Computing
- Olga Patterson, PhD Student, Department of Biomedical Informatics
- Jagadeesh Jagarlamudi, PhD Student, School of Computing
- Adam R. Teichert, MS Student, School of Computing
- Amit Goyal, PhD Student, School of Computing
- Pravin Chandrasekaran, MS Student, School of Computing
- Kristina Doing-Harris, MA, MS, PhD (Glasgow), BS Student, School of Computing
- Scott Alfeld, MS Student, School of Computing
- Zhan Wang, PhD Student, School of Computing
- Avishek Saha, PhD Student, School of Computing
Schedule and Readings
| Date | Papers | Presenters |
|---|---|---|
| Introduction to Computational Learning Theory | ||
| R 15 Jan | PAC Learning; Kearns and Vazirani, Chapter 1 | Hal |
| R 22 Jan | Occam's Razor; Kearns and Vazirani, Chapter 2 | Amit |
| R 29 Jan | Learning with Noise; Kearns and Vazirani, Chapter 5 | Kris |
| Sample Complexity and Infinite Hypothesis Spaces | ||
| R 5 Feb | VC dimension; Kearns and Vazirani, Chapter 3 | Shuying, Zhan |
| R 19 Feb | Rademacher complexity; Bartlett and Mendelson | Piyush |
| R 26 Feb | Covering numbers; Zhang | Seth |
| R 5 Mar | Pseudodimension, Fat Shattering Dimension; Bartlett, Long and Williamson | Adam |
| T 10 Mar | PAC-Bayes; McAllester | Ruihong |
| Boosting | ||
| R 12 Mar | Introduction to Boosting; Kearns and Vazirani, Chapter 4 | Arvind |
| R 26 Mar | Boosting and margins; Schapire, Freund, Bartlett and Lee. Optional: Evidence Contrary to the Statistical View of Boosting (plus responses and rejoinder) | Jiarong |
| Assorted Topics | ||
| R 2 Apr | Hardness of learning; Kearns and Vazirani, Chapter 6 | Jagadeesh & Pravin |
| R 9 Apr | Portfolio selection; Blum and Kalai | Parasaran, Olga |
| R 16 Apr | Game theory and learning; Freund and Schapire | Nathan |
| R 23 Apr | Learning finite state automata; Kearns and Vazirani, Chapter 8 | Scott |
Related Classes
- Machine Learning Theory, CMU by Avrim Blum
- Computational Learning Theory, Penn by Michael Kearns and Koby Crammer
- Learning Theory, TTI-C by Sham Kakade
- Introduction to Computational Learning Theory, Columbia by Rocco Servedio
- Theoretical Machine Learning, Princeton by Rob Schapire
- The Computational Complexity of Machine Learning, U Texas by Adam Klivans
Additional Readings
Reading Summaries
PAC Learning / 15 Jan (Daumé III)
This chapter motivates and defines the notion of PAC learning by working through a few "partial" definitions. The final definition is: say that C is PAC learnable with H if there exists an algorithm that takes examples (x, c(x)) for x ∼ D and outputs a hypothesis h ∈ H that has error at most ε with probability at least 1 − δ. This must be true for all choices of c ∈ C and all distributions D. It is efficient if its runtime is poly(size(c), 1/ε, 1/δ), where size(c) is some measure of the size of a concept. Intuitively, what's going on is that there's some unknown function c that maps inputs to binary labels that we want to "learn." We call the thing we learn h and we would like to learn a function with low error (ε). The space of hypotheses we choose, H, should be at least as representative as the set of things we're trying to learn, C. The reason why we have to allow the process to fail with probability δ is because we might just get supremely unlucky in the random training examples that we sample from D.
One of the key ideas that's not made hugely important in this chapter, but will be very important for the rest of class, is the notion of sample complexity. The sample complexity of a PAC learning algorithm is the number of samples needed in order to guarantee that we obey the PAC properties. In exactly the same way that algorithms people study computational complexity (because computation is the limited resource that they care about), computational learning theory people care about sample complexity (because samples are the limited resource that they care about).
The things that are not quite obvious why we need them are:
- Why do we need to include size(c) in the notion of efficiency?
- Why shouldn't we just have C = H?
The first issue is minor and also easy to understand. If we don't even allow ourselves time to write down the concept we want to learn, how can we be expected to learn it?
The second issue is the fun one, and we'll spend the majority of our time discussing it. Basically, what we'll show is that imposing C = H makes some problems not efficiently PAC learnable that otherwise "should be." In particular, we'll show that if C is the set of 3-term DNF formula, then we have two results. (1) C is not efficiently PAC learnable using H = C; but (2) C is efficiently PAC learnable using H = 3-CNF formulae. This is really a cool result because it says very specifically that how you represent things matters in a very real sense: in some representations we can learn efficiently, in others we cannot.
We will begin by showing the positive result: that we can learn 3-term DNFs with 3-CNFs. In order to do this, we'll first show that we can learn boolean conjunctions with boolean conjunctions. Boolean conjunctions are easy (in the sense that we do them in CS 5350). Once we can do that, the idea is to take three literals u, v, w and create a meta-variable y_{u,v,w} whose value is u ∨ v ∨ w. We then learn a conjunction of meta-variables and map this back to a 3-CNF. The technical details are arguing that the mapping from 3-CNFs over the original variable set to boolean conjunctions over the meta-variable set is one-to-one.
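To make the construction concrete, here is a sketch of the expansion behind the reduction, using the meta-variable notation above (the indexing convention is mine):

```latex
% A 3-term DNF is equivalent to a 3-CNF by distributing OR over AND:
(T_1 \vee T_2 \vee T_3) \;\equiv\; \bigwedge_{u \in T_1,\; v \in T_2,\; w \in T_3} (u \vee v \vee w)
% Introducing one meta-variable per clause turns the 3-CNF into a plain conjunction,
% which we already know how to learn:
y_{u,v,w} := u \vee v \vee w
\qquad\Longrightarrow\qquad
\bigwedge_{u,v,w} (u \vee v \vee w) \;\equiv\; \bigwedge_{u,v,w} y_{u,v,w}
```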
Next we show the negative result. We'll need some CS-theory definitions. RP is the set of languages accepted by randomized (coin-flipping) polynomial-time algorithms with one-sided error. It is widely assumed that RP ≠ NP (yes, Scott, they should get on that...). Next, the graph 3-coloring problem is: let G = (V,E) be an undirected graph; we ask if there is a mapping from vertices to the set {red, green, blue} so that no edge has the same color at both of its endpoints. This problem is NP-complete.
What we want to show is that learning 3-term DNFs with 3-term DNFs is hard. Specifically, we will show that if we could solve that learning problem efficiently, then we could also solve the graph 3-coloring problem efficiently, where "efficiently" means in RP. Assuming RP ≠ NP, this is a contradiction. The big question is how to turn a graph 3-coloring instance into a learning problem. The construction is easy. The positive examples are indicator vectors for the vertices; the negative examples are indicator vectors for the edges. First, we argue that the graph is 3-colorable if and only if the resulting labeled examples can be represented by a 3-term DNF. Now, we simply set ε = 1/(2|S|) and put a uniform distribution over the examples. If we were able to PAC learn this efficiently, then we'd be guaranteed to find a hypothesis that makes no errors.
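Here is a small sketch of how a graph instance turns into a labeled sample. The exact encoding of the "indicator vectors" is my reading of the Kearns and Vazirani construction (each example is the all-ones vector with zeros at the named vertices), so treat the details as an assumption:

```python
def coloring_to_sample(num_vertices, edges):
    """Turn a graph 3-coloring instance into a labeled sample for 3-term DNF learning.
    Positive examples: one per vertex (all ones except a 0 at that vertex).
    Negative examples: one per edge (all ones except 0s at both endpoints)."""
    def vec(zero_positions):
        return tuple(0 if i in zero_positions else 1 for i in range(num_vertices))
    sample = [(vec({v}), 1) for v in range(num_vertices)]      # positive examples
    sample += [(vec({u, v}), 0) for (u, v) in edges]           # negative examples
    return sample

# Tiny example: a triangle (3-colorable), so a consistent 3-term DNF exists.
print(coloring_to_sample(3, [(0, 1), (1, 2), (0, 2)]))
```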
Occam's Razor / 22 Jan (Amit Goyal)
This chapter considers a different definition of learning: rather than measuring the predictive power of a hypothesis, it judges hypotheses on the basis of how succinctly they explain the data. This new definition is called Occam learning. The idea behind it comes from the philosophy expounded by William of Occam. The principle states that one should not make more assumptions than the minimum needed. This principle is often called the principle of parsimony, and it underlies all scientific modeling and theory building.
In our simple concept learning, how can we define succinctness?
Succinctness will be measured by the size of the representation of the hypothesis concept. Equivalently, we can define succinctness by the cardinality of the hypothesis class: if the cardinality of the hypothesis class is large, then we need more bits to represent a hypothesis from that class than from a smaller class. Thus an algorithm is an Occam algorithm if it finds a short hypothesis which is consistent with the data.
Consistent Learner
Given m examples (x_i, c(x_i)), a learner is consistent if it outputs a hypothesis h with h(x_i) = c(x_i) for all i = 1, ..., m.
Occam's Razor, Cardinality version
We will prove in class the result below, which provides a general bound on the number of training examples (the sample complexity) sufficient for any consistent learner to successfully learn any target concept in H:
m ≥ (1/ε)(ln|H| + ln(1/δ)).
Notice m grows linearly in 1/ε, logarithmically in 1/δ, and logarithmically in the size of the hypothesis space |H|.
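For reference, here is a minimal sketch of the usual union-bound argument behind this cardinality bound (constants and the exact statement proved in class may differ):

```latex
% Probability that some hypothesis with true error > \epsilon survives m examples:
\Pr[\exists h \in H:\ h \text{ consistent with } m \text{ examples and } \mathrm{error}(h) > \epsilon]
  \;\le\; |H|\,(1-\epsilon)^m \;\le\; |H|\, e^{-\epsilon m}.
% Requiring the right-hand side to be at most \delta and solving for m:
|H|\, e^{-\epsilon m} \le \delta
  \;\Longleftrightarrow\;
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right).
```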
Dependence on |H|
Why do we have a dependence on the hypothesis space size |H|? The bigger the hypothesis space, the more rules there are to explain the training examples. However, this doesn't carry over to test examples. E.g.:
- If we represent all our training examples as a table, the training error will be zero but such a hypothesis will not have any predictive power.
- Everyone in the class writes down a 0 or a 1 for 10 training points and for 10 test points. This is matched against 20 random coin tosses done by me before the class. The highest-accuracy guess had a training error of 20%. This matches well with our intuition that, given many hypotheses, we will find one that works very well on the training data. However, the person who made the best guess for the training data ended up with a test error of 60% when it came time to match the guesses against the test data.
We will take the same algorithm for boolean conjunctions that we saw last time (it used only positive examples) and show that, using this new result, we can obtain a better sample complexity bound. Additionally, we will look at a new algorithm which also considers negative examples and learns conjunctions with few relevant variables.
Learning with Noise / 29 Jan (Kristina Doing-Harris)
The goal is to PAC learn in the presence of classification noise.
Let C be a concept class and H be a representation class over X. We say that C is PAC learnable using H in the presence of classification noise, if there exists an algorithm L with the following property: for any concept c ∈ C, any distribution D on X, any 0 ≤ η < ½, and any 0 < ε < 1, 0 < δ < 1, and η0 (where η ≤ η0 < ½), if L is given access to the noisy example oracle EX_η(c,D) and inputs ε, δ, and η0, then with probability at least 1 - δ, L outputs a hypothesis concept h ∈ H satisfying error(h) ≤ ε. This probability is taken over the randomization in the calls to the noisy oracle and any internal randomization of L.
If L runs in time polynomial in n, 1/ε, 1/δ, and 1/(1-2η0), we say that C is efficiently PAC learnable using H in the presence of classification noise.
In pursuit of this goal they define the Statistical Query Model. Let C be a concept class and H be a representation class over X. We say that C is efficiently learnable from statistical queries using H, if there exists an algorithm L such that for any concept c ∈ C, any distribution D on X, and any 0 < ε < ½, if L is given access to STAT(c,D) and input ε, then:
- for every query (χ, τ) made by L, the predicate χ can be evaluated in time polynomial in 1/ε, n, and size(c) (i.e. the requested statistics are sufficiently "simple"), and 1/τ is bounded by a polynomial in 1/ε, n, and size(c) (i.e. the tolerance is sufficiently coarse);
- L will halt in time bounded by a polynomial in (1/ε, n, size(c)).
- L will output a hypothesis h ∈ H satisfying error(h) ≤ ε.
A statistical query (χ, τ) is a request for Pχ, where Pχ = Pr_{x∼D}[χ(x, c(x)) = 1], and STAT(c,D) returns an approximation Pχ_hat of Pχ such that Pχ − τ ≤ Pχ_hat ≤ Pχ + τ, where τ is the tolerance of STAT(c,D).
Theorem: If C is efficiently learnable from statistical queries using H, then C is efficiently PAC learnable using H. This follows because you are given an EX(c,D) oracle for PAC learning, which gives you the "truth" from c. You can then simulate STAT(c,D) using EX(c,D) by estimating each query from polynomially many samples, and since the tolerances are only polynomially coarse, learning remains efficient.
Then they define an algorithm to simulate Statistical Queries given (ε, δ, and η0):
- First define τ_min, with the polynomial bound 1/(4r(1/ε, n, size(c))) on the inverse tolerance for all queries of the statistical query algorithm L.
- Define Δ to be c_0 τ_min/(2(1 − 2η_0)²).
- For i = 0 to ⎡1/(2Δ)⎤ − 1:
  - Set η̂ = iΔ.
  - Simulate the statistical query algorithm L with accuracy parameter ε, using η̂ as the guessed noise rate. More precisely, for every statistical query (χ, τ) made by L:
    - Randomly sample from the noisy oracle to compute estimates p̂₁ for p₁ = Pr_{EX(c,D)}[x ∈ X₁], q̂ for q = Pr_{EX_η(c,D₁)}[x = 1], and r̂ for r = Pr_{EX_η(c,D)}[(χ = 1) & (x ∈ X₂)]. Here X₁ and X₂ are the partitions of X defined by χ. These estimates should be accurate (with high probability) within an additive error τ′ = τ/27.
    - Define P̂_χ = p̂₁(q̂ − η̂)/(1 − 2η̂) + r̂.
    - Return P̂_χ to L.
  - Let h_i be the hypothesis returned by the ith simulation of L.
- For i = 0 to ⎡1/(2Δ)⎤ − 1:
  - Let γ_i = Pr_{EX_η(c,D)}[h_i(x) ≠ b], where b is the (noisy) observed label.
  - Randomly sample from EX_η(c,D) to obtain estimates γ̂_i that are accurate within additive error (ε/2)(1 − 2η_0), and output the h_i with the lowest γ̂_i.
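As a rough illustration of the structure above, here is a minimal Python sketch of the query-answer formula and the outer sweep over guessed noise rates. The helper names (`run_L_with_guess`, `estimate_noisy_error`) and the selection details are placeholders of mine, not the paper's exact procedure:

```python
def combine_estimates(p1_hat, q_hat, r_hat, eta_hat):
    """Answer one statistical query from sample estimates, using the formula in the notes:
    P_chi_hat = p1_hat * (q_hat - eta_hat) / (1 - 2 * eta_hat) + r_hat."""
    return p1_hat * (q_hat - eta_hat) / (1 - 2 * eta_hat) + r_hat

def sweep_noise_rates(run_L_with_guess, estimate_noisy_error, delta):
    """Outer loop: try each guessed noise rate eta_hat = i * delta in [0, 1/2),
    simulate L once per guess, and return the hypothesis whose estimated error
    against the *noisy* labels is smallest."""
    guesses = [i * delta for i in range(int(0.5 / delta))]
    candidates = [run_L_with_guess(eta_hat) for eta_hat in guesses]  # one h_i per guess
    return min(candidates, key=estimate_noisy_error)
```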
The Vapnik-Chervonenkis Dimension / 5 Feb (Shuying Shen & Zhan Wang)
The task of this chapter is to solve this problem: how many random examples does a learning algorithm need to draw before it has sufficient information to learn an unknown target concept chosen from the concept class C? We want a general measure of complexity for concept classes of infinite cardinality. We define a purely combinatorial measure, the Vapnik-Chervonenkis dimension, which assigns to each concept class C a single number that characterizes the sample size needed to PAC learn C.
Let X be the instance space. For a set S = {x₁, ..., x_m} ⊆ X, define Π_C(S) = {(c(x₁), ..., c(x_m)) : c ∈ C}. Then Π_C(S) is the set of all the behaviors or dichotomies on S that are induced or realized by C. If Π_C(S) = {0,1}^m (where m = |S|), then we say S is shattered by C. That is, S is shattered by C if C realizes all possible dichotomies of S. With the above definitions we get the definition of the VC dimension of C: VCD(C) denotes the cardinality d of the largest set S shattered by C; if arbitrarily large finite sets can be shattered by C, then VCD(C) = ∞. For example, consider linear halfspaces in the plane: a line can realize every labeling of some arrangement of 3 sample points, but no arrangement of 4 points can be shattered, so the VC dimension of halfspaces in the plane is 3.
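Shattering is easy to check by brute force for simple classes. Here is a tiny sketch for intervals on the real line, a class I chose (rather than the halfspace example above) because every hypothesis is determined by two endpoints, so no separability solver is needed. It confirms that 2 points can be shattered but 3 cannot, i.e. intervals have VC dimension 2:

```python
from itertools import product

def interval_labels(points, a, b):
    """Label each point 1 if it lies inside the closed interval [a, b], else 0."""
    return tuple(1 if a <= x <= b else 0 for x in points)

def shattered_by_intervals(points):
    """Check whether every labeling of `points` is realized by some interval [a, b]."""
    # Candidate endpoints: the points themselves plus values just outside them
    # (an interval with a > b is empty and realizes the all-zeros labeling).
    candidates = sorted(points) + [min(points) - 1, max(points) + 1]
    realized = {interval_labels(points, a, b)
                for a in candidates for b in candidates}
    return all(lab in realized for lab in product([0, 1], repeat=len(points)))

print(shattered_by_intervals([0.0, 1.0]))       # True: 2 points are shattered
print(shattered_by_intervals([0.0, 1.0, 2.0]))  # False: the labeling 1,0,1 is impossible
```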
Define, for any natural number m, Π_C(m) = max{|Π_C(S)| : |S| = m}, i.e. the largest number of dichotomies realizable on any m sample points; this is also a measure of the complexity of C. Next define, for any natural numbers m and d, the function Φ_d(m) = Φ_d(m − 1) + Φ_{d−1}(m − 1) with initial conditions Φ_d(0) = Φ_0(m) = 1.
Based on the definitions above, we give Lemma 3.1: if VCD(C) = d, then for any m, Π_C(m) ≤ Φ_d(m). Lemma 3.2: for m ≥ d ≥ 1, Φ_d(m) ≤ (em/d)^d.
Next we want to know how many sample points an algorithm needs in order to be a PAC learning algorithm. The fundamental concept we need is the class of error regions with respect to c and C, Δ(c) = {c Δ h : h ∈ C}, where c Δ h denotes the symmetric difference (the set of points on which c and h disagree). Note that VCD(C) = VCD(Δ(c)). We then refine the definition of Δ(c) to consider only those error regions with weight at least ε under the fixed target distribution D; thus let Δ_ε(c) = {r ∈ Δ(c) : Pr_D[r] ≥ ε}. We can now make the definition: for any ε > 0, we say that a set S is an ε-net for Δ(c) if every region in Δ_ε(c) is "hit" by a point in S, that is, if for every r ∈ Δ_ε(c) we have S ∩ r ≠ ∅.
Thus an ε-net for Δ(c) is a set that hits all of the regions with weight at least ε. If we can bound the probability that the random sample S fails to form an ε-net for Δ(c), then we have bounded the probability that a hypothesis consistent with S has error greater than ε. For the case of finite C, for any fixed error region r ∈ Δ_ε(c), the probability that we fail to hit r in m random examples is at most (1 − ε)^m. Thus the probability that we fail to hit some region in Δ_ε(c) is bounded above by |Δ(c)|(1 − ε)^m, which in turn is bounded by |C|(1 − ε)^m. For finite X, this gives us a bound of Φ_d(|X|)(1 − ε)^m on the probability of failing to draw an ε-net for Δ(c).
We would like the size of the required sample to depend only on the VC dimension d, ε, and δ, independent of |C| and |X|; the following argument gives an upper bound on the number of examples required for PAC learning that depends only on those quantities. We draw a multiset S₁ of m random examples from D and let A denote the event that the elements of S₁ fail to form an ε-net for Δ(c). Our goal is to upper bound the probability of event A. If A occurs, then S₁ misses some region r ∈ Δ_ε(c). We fix this region r and then draw another multiset S₂ of m random examples from D. Since each element of S₂ has probability at least ε of hitting r, if m = Ω(1/ε) then the probability that S₂ hits r at least εm/2 times is at least 1/2 (by Markov's inequality). Now let B be the combined event, over the random draws of S₁ and S₂, that A occurs on the draw of S₁ and S₂ has at least εm/2 hits in a region of Δ_ε(c) that is missed by S₁; then Pr[B|A] ≥ 1/2. Since the definition of event B already requires that event A occurs on S₁, we also have Pr[B] = Pr[B|A]Pr[A], so Pr[A] ≤ 2Pr[B]. Thus we can upper bound the probability of event A by upper bounding the probability of event B. To analyze the probability of event B, we need only consider the behaviors of the regions of Δ_ε(c) on the combined sample of 2m points.
We can draw a multiset S of 2m instances and divide it randomly into S₁ and S₂. Once S is drawn and fixed, we may also fix a region r ∈ Δ_ε(c) satisfying |r ∩ S| ≥ εm/2. For this fixed S and fixed r, we analyze the probability that S₁ ∩ r = ∅, i.e. that all of the points of S lying in r end up in S₂. We then obtain the bound on the probability of event B by summing over all possible fixed r (there are at most Π_{Δ(c)}(2m) ≤ Φ_d(2m) distinct behaviors of such regions on S) and applying the union bound. So we need only bound the probability that all l ≥ εm/2 of the points of S that fall in r land in S₂ (that is, the probability that S₁ ∩ r = ∅). Because each of these l points is equally likely to be assigned to S₁ or S₂ under the random partition, the probability that the random partitioning of S results in S₁ ∩ r = ∅ is at most 2^{−εm/2}. Based on Theorem 2.2, we get Theorem 3.3: an algorithm L is a PAC learning algorithm for C provided it is given a random sample of m examples from EX(c,D), where m obeys m ≥ c₀((1/ε)log(1/δ) + (d/ε)log(1/ε)) for some constant c₀ > 0.
Finally, a lower bound shows that Theorem 3.3 is tight to within a factor of O(log(1/ε)) (ignoring the dependence on δ). Theorem 3.5: any algorithm for PAC learning a concept class of VC dimension d must use Ω(d/ε) examples in the worst case.
Rademacher and Gaussian complexities / 19 Feb (Piyush Rai)
Learning seeks to minimize the expected error (also called the "true error") on test data. However, during training, all we have is a finite amount of training data, so estimating the expected error (and minimizing it) is not possible in general. As a surrogate, we seek to minimize an upper bound on the expected error. Luckily, this upper bound can be computed based only on the training data, plus knowledge of a measure of "complexity" of the function class being learned from. The general form of this bound, given the function class we are trying to learn from, looks like this:
Expected error ≤ Empirical error + penalty(complexity of the function class, number of examples)
The bound holds for each function f from the function class being considered. The empirical error above is simply the proportion of training examples misclassified by f. In addition, the complexity term goes to zero as the number of examples tends to infinity.
Why care about the error bound? These bounds are used for doing model selection via a scheme called "complexity regularization": choose the model class that contains the function achieving the best upper bound on the expected error. Clearly, we want the tightest possible upper bound on the expected error, and since the bound depends on the complexity of the function class being considered, the measure of complexity plays a crucial role in obtaining a tight bound.
Last time, we saw one measure of function class complexity (VC dimension) that is based purely on combinatorial arguments and doesn't take into account the training data. There has been empirical and theoretical evidence that bounds assuming a fixed complexity penalty (i.e., one that doesn't depend on training data) are not universally effective. This motivates us to consider some alternative complexity measures that are data-dependent. Examples include Maximum Discrepancy, Rademacher Complexity, and Gaussian Complexity. It has been shown that error bounds depending on such measures can never be much worse (and in some cases even better) than VC-dimension based results.
All these measures can be computed from the training data. For example, the (empirical) Rademacher complexity of a function class F on a sample x₁, ..., x_n is defined as R̂_n(F) = E_σ[ sup_{f∈F} | (2/n) Σ_{i=1}^n σ_i f(x_i) | ], where the σ_i are independent uniform {±1}-valued random variables and the expectation is taken over all sequences σ. The Rademacher complexity of F is R_n(F) = E[R̂_n(F)], where the expectation is taken over all samples of size n. Gaussian complexity is defined likewise, replacing σ_i by g_i, where each g_i is a Gaussian N(0,1) random variable.
Intuitively, the Rademacher (and Gaussian) complexity is a measure of the extent to which some function in class F (or rather its predictions on a sample of size n) can be correlated to a noisy sequence of length n. High correlation naturally implies "wiggliness" (and thus high complexity) of F.
By definition, however, computing these complexity measures can be prohibitively hard for many interesting function classes (since it involves an optimization over the function class F, which is usually a hard problem). But, fortunately, there exist simpler schemes that allow us to compute these complexities for specific cases. The basic idea is that some function classes can be represented as a combination of functions from simpler "basis" classes (think of kernel expansion form of the kernel-based SVM solution). The theory of Rademacher and Gaussian complexities tells us that estimating the complexity of such a class just requires estimating the complexities of the basis classes. We will look at some examples of such cases.
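As a concrete (if toy) illustration of the definition above, here is a minimal Monte Carlo sketch that estimates the empirical Rademacher complexity of a finite function class from its predictions on a fixed sample; the setup and function names are mine:

```python
import numpy as np

def empirical_rademacher(predictions, num_trials=2000, seed=None):
    """Monte Carlo estimate of the empirical Rademacher complexity.
    `predictions` is a (num_functions, n) array: each row holds one function's
    values f(x_1), ..., f(x_n) on the fixed sample."""
    rng = np.random.default_rng(seed)
    _, n = predictions.shape
    total = 0.0
    for _ in range(num_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)     # Rademacher noise sequence
        correlations = predictions @ sigma / n       # (1/n) sum_i sigma_i f(x_i) per function
        total += np.abs(correlations).max()          # sup over the (finite) class
    return 2.0 * total / num_trials                  # factor 2 and |.| follow the definition above

# Example: 50 random {+1,-1}-valued "classifiers" evaluated on 100 points.
preds = np.sign(np.random.default_rng(0).standard_normal((50, 100)))
print(empirical_rademacher(preds, seed=1))
```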
Covering Number Bounds / 26 Feb (Seth Juarez)
Under Construction
Fat Shattering Dimension / 5 Mar (Adam R. Teichert)
We've seen that our ability to generalize well depends in some respect on the "size" of our hypothesis class. However, even for fairly simple hypothesis classes (axis-aligned rectangles, for example), the hypothesis class is infinite. VC dimension allowed us to compare infinite hypothesis classes. Still, we have so far been looking at learning discrete functions (mostly +/- or 0/1 valued). Fat shattering is a generalization of VC dimension which allows us to talk about the learnability of [0,1]-valued functions. We also discuss learnability of these real-valued functions in the presence of noise.
The game we play with Fat-shattering is that for a set of d example points and given some function class F (the functions in F map data points to real valued labels between 0 and 1),
- We assign a value between 0 and 1 to each of the points.
- Now, the other player comes along and gets to choose whether to move each of our labels (the values that we assigned in step 1) up or down by some value γ.
- Finally, we get to pick a function f ∈ F on the data points to try to "clear" the shifted values. When I say "clear", I mean that, on the data points whose labels were moved up, the function must give a value that is at least as high as the shifted labels, and on the labels that were moved down, the function must produce a value that is no higher than the lowered labels.
If there is a way of assigning the values in step 1 so that we can always find a function f ∈ F that "clears" the shifted labels no matter which labels our opponent chooses to shift in each direction, we say that the points are γ-shattered by F. For a given γ, the largest number of points that we can γ-shatter with F is denoted fat_F(γ). If such a largest number exists and is finite for every γ > 0, then we say that F has a finite fat-shattering function.
The main result of the paper is that the property of a class of [0,1]-valued functions having a finite fat-shattering function is equivalent to the property of the class being learnable from a finite training sample with observation noise [which noise must fit some mild conditions (see the paper)]. Also, the paper proves that if the fat-shattering function grows polynomially and different restrictions are placed on the type of noise (Gaussian noise fits the bill for instance), the real-valued function class is learnable from a "small" sample with noise.
PAC-Bayesian Theorems / 10 March (Ruihong Huang)
As we saw before, PAC algorithms output a concept which performs well on the training data and is guaranteed, with high probability (1 − δ), to have small error (ε) on test data drawn i.i.d. from the same distribution as the training data.
Bayesian algorithms, by contrast, account for uncertainty in the data generation process by returning the whole hypothesis space, weighted by the product of the given prior and the likelihood of the training data. In general, if the data is indeed generated according to the given prior, better performance can be obtained; but what if that condition does not hold?
The PAC-Bayesian theorems and corresponding algorithms attempt to get the best of both PAC and Bayesian approaches by combining them together.
The following part gives two PAC-Bayesian theorems, each preceded by a preliminary theorem. The first theorem deals with the realizable case, while the second deals with the unrealizable case in a more general sense. From them, we can derive new PAC-Bayesian algorithms.
Here we just give the main results of the theorems; please refer to the paper for the detailed descriptions.
Preliminary Theorem 1:
According to it, we can derive a reasonable algorithm which outputs a consistent concept with the highest prior probability.
Theorem 1:
Accounting for uncertainty, if the target concept is selected according to the prior used in the algorithm, the optimal predictor is formed from a subset of the concepts consistent with the sample, each weighted by its prior probability. Thus, the above theorem gives a performance guarantee for this case.
Preliminary Theorem 2:
It implies an algorithm which selects a concept minimizing a function of P(c) and the loss function.
Theorem 2:
Accordingly, the algorithm implied here tries to find a set U minimizing a function of P(U) and the loss function. The natural question is then how to identify such a subset. For this problem, we will give a lemma relating large margins to the existence of a good U.
Boosting: Strong and Weak Learning / 12 March (Arvind Agarwal)
Recall that in PAC learning we are interested in an algorithm that outputs a good hypothesis (one with small error with respect to the data distribution) with high confidence. Mathematically, the algorithm outputs a hypothesis h such that Pr[error(h) ≤ ε] ≥ 1 − δ. Note that there are two parameters that qualify and control the performance of a PAC learning algorithm: the error parameter ε and the confidence parameter δ. The smaller the values of these parameters, the stronger the guarantee on the hypothesis output by the learning algorithm. In our definition of PAC learning, we demanded that a learning algorithm be able to achieve arbitrarily small values of these parameters, and that its running time be polynomial in 1/ε and 1/δ.
Let's say that we do not have such an algorithm, but instead have a weaker, still useful, algorithm L that achieves the PAC criteria not for any values of ε and δ but only for some fixed ε0 and δ0. The question is: is there any way we can use L as a subroutine to get a new algorithm L' that achieves the PAC criteria for any values of ε and δ? The answer is positive in the strongest possible sense: even an efficient algorithm whose hypotheses are only slightly better than random guessing can be used to obtain an efficient algorithm meeting the PAC criteria for any values of ε and δ. Algorithm L is known as a weak PAC learning algorithm and L' as a strong PAC learning algorithm.
We are using the weak learning algorithm L to obtain a strong learning algorithm. In other words, we want to use L, whose hypotheses were only slightly better than "random guessing", to obtain a new algorithm whose hypotheses have small error with high confidence. We want to use L to reduce the error and at the same time increase the confidence. There are two problems that we have to deal with:
Boosting the Confidence: This part is easy because algorithm L does give us "acceptable" hypotheses, just with small confidence. All we do is call L multiple times and hope that, across these calls, an acceptable hypothesis appears with high confidence. It turns out that if we allow an additional error γ, then picking out that acceptable hypothesis from the set of hypotheses output by these calls is easy and can be done with high confidence.
Boosting the Accuracy: This part is a little tricky because now we have an algorithm that gives us "unacceptable" hypotheses (whose error is larger than the desired error) with the given confidence, and it might happen that L always gives us a hypothesis that is not acceptable (error greater than ε). In such a situation, the idea is to first call L on all of the examples, and then run L again with emphasis on the examples where the hypothesis output by the first run (h1) made errors. Note that this new run gives us new information, i.e. it makes correct predictions on (many of) the examples where h1 made mistakes. Now, if these two hypotheses could somehow be combined, then the combined hypothesis would perform better than either individual hypothesis and hence would give us an error better than that of L. Call this new algorithm L'.
Given such an algorithm L', all we do is run L' recursively, decreasing the error at each step, until we reach the desired error ε.
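Here is a minimal sketch of the confidence-boosting step described above (run the weak learner several times, then select among the candidates on a fresh validation sample); the function names, the selection-by-validation detail, and the default constants are my own illustration, not the chapter's exact procedure:

```python
def boost_confidence(weak_learner, draw_example, k=20, m_val=1000):
    """Run the weak learner k times and keep the candidate hypothesis with the
    smallest error on a fresh validation sample.
    weak_learner() -> a hypothesis (a callable x -> label); it is assumed to return
    an acceptable hypothesis with at least some constant probability.
    draw_example() -> a fresh labeled pair (x, y) from the target distribution."""
    candidates = [weak_learner() for _ in range(k)]        # k independent runs of L
    validation = [draw_example() for _ in range(m_val)]    # fresh sample used only for selection
    def val_error(h):
        return sum(h(x) != y for x, y in validation) / len(validation)
    return min(candidates, key=val_error)                  # hypothesis with lowest estimated error
```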
Boosting the margin / 26 March (Jiarong Jiang)
Recall voting methods: we expect to achieve high accuracy by voting the predictions of several classifiers, yet by Occam's Razor, low generalization error seems to demand simple classifiers. The intuition of the whole analysis is that by voting the base classifiers we might not actually increase the complexity, but rather smooth their predictions. Define the classification margin as the difference between the weight assigned to the correct label and the maximal weight assigned to any incorrect label. Then an example is classified correctly if and only if its margin is positive.
Generalization error as a function of margin distributions
This part contains the analysis of both finite and infinite base-classifier spaces. The two main results are:
Let D be a distribution over X × {−1, +1}, and let S be a sample of m examples chosen independently at random according to D. Assume that the base-classifier space H is finite, and let δ > 0. Then with probability at least 1 − δ over the random choice of the training set S, every weighted average function f of base classifiers satisfies the following bound for all θ > 0:
Pr_D[y f(x) ≤ 0] ≤ Pr_S[y f(x) ≤ θ] + O( (1/√m) · ( (log m · log|H|)/θ² + log(1/δ) )^{1/2} ).
If H is infinite with VC dimension d, we instead have
Pr_D[y f(x) ≤ 0] ≤ Pr_S[y f(x) ≤ θ] + O( (1/√m) · ( (d log²(m/d))/θ² + log(1/δ) )^{1/2} ).
However, these two bounds only start to be meaningful when the training set is very large, and the actual performance of AdaBoost is much better than the bounds predict, so they are quite loose.
The effect of boosting margins
This part analyzes why AdaBoost is especially suited to the task of maximizing the number of training examples with large margin.
Theorem 5 establishes that even if the predictions of the base classifiers are only slightly better than random guessing, some settings of the parameters lead to an exponential decrease of Pr_S[y f(x) ≤ θ] (the fraction of training examples with margin at most θ) as a function of the number of rounds T.
Inherent Unpredictability / 2 Apr (Jagadeesh & Pravin)
In the first chapter, we learned that if RP ≠ NP, then k-term DNF is not PAC learnable using k-term DNF, although it is learnable using k-CNF. In this session we consider concept classes that are hard to PAC learn regardless of the hypothesis class (unlike the previous statement), because of the complexity of prediction itself. We will show that there exist concept classes that are inherently unpredictable: the VC dimension of the concept class is polynomial, yet C is not PAC learnable by any algorithm that produces a hypothesis in polynomial time.
Discrete Cube Root Problem:
Two primes p and q are chosen such that 3 does not divide Φ(N) = (p−1)(q−1), where N = pq. Now we choose x from the group Z_N^* formed under multiplication modulo N. Let f_N(x) = x^3 mod N and consider the problem of inverting this function, i.e. the problem of computing x on inputs N and f_N(x) (but not p, q).
Discrete Cube Root assumption:
Based on analysis from cryptography, the Discrete Cube Root Assumption states that for any polynomial p(n) there is no algorithm running in time p(n) that, given inputs N and y = f_N(x), outputs x with probability more than 1/p(n). Here N is an n-bit number that is the product of two randomly chosen primes such that 3 does not divide Φ(N) = (p−1)(q−1), and x is chosen uniformly at random from the group Z_N^*.
Discrete Cube Root Learning problem
In the rest of the talk, we will identify a class of functions that is inherently unpredictable (in other words, cannot be PAC learned in polynomial time). In order to prove that a particular learning problem is "hard to learn", we will proceed by reducing the Discrete Cube Root problem to this learning problem (like a standard NP-hardness reduction).
The learning problem: given a set of training examples (y_i, x_i) with y_i = f_N(x_i), we need to learn, in time polynomial in n, a hypothesis h that agrees with the cube-root function (the inverse of f_N(x) = x^3 mod N, where N is n bits long and is the product of two primes p, q such that 3 is relatively prime to (p−1)(q−1)) at least a 1/p(n) fraction of the time.
Discrete Cube Root Problem: given N and y = f_N(x) = x^3 mod N, we need to find the x corresponding to the given y.
Reduction from the Discrete Cube Root problem: randomly sample x_i (i = 1, ..., k), compute y_i = f_N(x_i), and form the pairs (y_i, x_i). Now if the learning problem were solvable, we could use these generated samples as training data to learn a hypothesis h, and then predict the correct x for the given input value y with probability at least 1/p(n), contradicting the Discrete Cube Root Assumption.
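For concreteness, here is a small sketch of how anyone, knowing only N, can generate labeled examples for this learning problem; the toy primes are my own choice (a real instance would use cryptographic-size primes):

```python
import math, random

def gen_examples(p=1019, q=1031, k=5):
    """Generate labeled examples (y, x) for the cube-root learning problem.
    p and q are illustrative small primes with 3 not dividing (p-1)(q-1)."""
    N = p * q
    phi = (p - 1) * (q - 1)
    assert phi % 3 != 0, "need 3 coprime to phi(N) so that cubing is a bijection on Z_N^*"
    examples = []
    for _ in range(k):
        x = random.randrange(2, N)
        while math.gcd(x, N) != 1:        # keep x in the multiplicative group Z_N^*
            x = random.randrange(2, N)
        y = pow(x, 3, N)                  # y = f_N(x) = x^3 mod N
        examples.append((y, x))           # input y, label x (the cube root)
    return N, examples

N, data = gen_examples()
print(N, data[:2])
```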
The only problem with our formulation that keeps it different from the PAC model is that the function class being learned is a class of multivalued functions rather than boolean functions. An easy fix: since we know that x is n bits long, we check whether each of its bits can be learned with sufficient accuracy (and hence the final output is learned with the required accuracy). Precisely, for each multivalued function g we define n boolean functions g^1, ..., g^n, where g^i(y) is the i-th bit of g(y). Now we let Cn be the boolean function class obtained by including g^i in Cn for every i and every valid N.
Theorem: Under the Discrete Cube Root Assumption, the concept class C is not efficiently PAC learnable (using any polynomially evaluatable hypothesis class).
Inherently Unpredictable Boolean Circuits: We have seen that inherently unpredictable classes exist; we now ask how "computationally simple" such classes can be. An obvious way to characterize this is by the resources required to evaluate a concept in C. In the subsequent section it is shown that we can construct a polynomial-size boolean circuit for the Discrete Cube Root learning problem. This means that if the class of polynomial-size boolean circuits were efficiently PAC learnable, then we could invert f(x) = x^3 mod N in polynomial time, breaking the Discrete Cube Root Assumption. Hence, under the Discrete Cube Root Assumption, the class of polynomial-size boolean circuits is not efficiently PAC learnable (using any polynomially evaluatable hypothesis class).
In fact, using the technique of iterated products, we can construct a polynomial-size, log-depth boolean circuit for the Discrete Cube Root learning problem. Hence the class of polynomial-size log-depth boolean circuits is not efficiently PAC learnable either. Since any class of functions that can compute iterated products is not efficiently PAC learnable, the final section shows that the class of functions represented by neural networks is also not efficiently PAC learnable (even when constrained to weights in {0,1} for the linear threshold function at each node).
Universal Portfolios with and without transaction costs / April 9 (Olga and Parasaran)
A constant rebalanced portfolio (CRP) is an investment strategy which keeps the same distribution of wealth among a set of stocks from period to period: after rebalancing at the beginning of each period, the proportion of the total wealth in each stock is the same. In previous research, Cover (1996) defined a portfolio selection algorithm, called the universal portfolio strategy, that asymptotically performs as well as the best constant rebalanced portfolio chosen in hindsight (and hence at least as well as the best single stock). This article extends the universal model by adding a fixed percentage commission cost c < 1 to each transaction.
Definitions:
m - number of assets
n - number of periods
Market performance is given by the sequence of vectors x_i = (x_{i1}, ..., x_{im}) for i = 1, ..., n, where x_{ij} is the nonnegative price relative of the jth stock for the ith period.
Price relative is the ratio of the closing price of a stock to the opening price of the stock.
A portfolio is a distribution of wealth among the stocks: b = (b_1, ..., b_m) with b_j ≥ 0 and Σ_j b_j = 1.
During one period the wealth increases by the factor b · x_i = Σ_j b_j x_{ij}.
The CRP strategy achieves wealth over n periods of S_n(b) = Π_{i=1}^n b · x_i.
If commissions c (with 0 < c < 1) are charged only for purchases and not for sales, the optimal fraction of wealth to move each period when rebalancing from b to b' is given by an implicit equation, which can be solved approximately (please refer to the paper for both expressions).
Using this approximation gives commission costs from each stock, which are not optimal and in some cases will cause the same stock to be purchased and sold within the same transaction.
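As a quick numerical illustration of the CRP wealth S_n(b) defined above, here is a minimal sketch (the price relatives in the toy example are made up):

```python
import numpy as np

def crp_wealth(b, X):
    """Wealth of a constant rebalanced portfolio b over all periods.
    b: length-m weight vector (nonnegative, sums to 1).
    X: (n, m) array of price relatives x_ij (closing price / opening price)."""
    return float(np.prod(X @ b))     # S_n(b) = prod_i (b . x_i)

# Toy example: two assets, three periods.
X = np.array([[1.05, 0.98],
              [0.97, 1.10],
              [1.02, 1.01]])
print(crp_wealth(np.array([0.5, 0.5]), X))   # 50/50 rebalanced each period
print(crp_wealth(np.array([1.0, 0.0]), X))   # all wealth kept in the first asset
```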
Game Theory, Online Prediction and Boosting / 16 April (Nathan Gilbert)
Today's paper is the second installment of fun topics related to learning theory. The authors show a connection between (who would have guessed) some basic game theoretic principles and online prediction/boosting. This is a very easy to read exposition of how these three topics interrelate. This is not a good overview of game theory, online prediction or boosting (how could it be in 8 pages?), if you are looking for that go somewhere else.
Most of the algorithms described in this paper work under the assumption of a two-player game in normal form, that is, a game that can be represented by a matrix where one player chooses moves according to the rows and the other according to the columns. The entries M(i,j) of the game matrix M represent the loss accrued by the row player for a given pair of moves. The row player's objective is to minimize this loss, whereas in a zero-sum game the column player's objective is to maximize the row player's loss.
In sequentially played games (where the players take turns), you might expect that the player who goes second has an advantage: they have already seen the move of the first player, so they might know a good counter-move. It turns out that, when mixed strategies are allowed, the order of play does not matter. Von Neumann showed this when he proved the minmax theorem, which states that max_Q min_P M(P,Q) = min_P max_Q M(P,Q).
The authors motivate the online prediction connection by showing how a sequential two-player game can be cast as an online learning problem. The row and column players become the learner and the environment, respectively. We are given a game matrix M, and the game is played in a series of rounds.
Then on round t = 1, ..., T:
1. the learner chooses a mixed strategy (a probabilistic strategy) P_t;
2. the environment chooses a mixed strategy Q_t;
3. the learner is then permitted to observe the loss M(i, Q_t) it would have suffered had it played any given row i;
4. the learner suffers loss M(P_t, Q_t).
The goal of the learner is to do nearly as well as the best fixed strategy in hindsight, given Q_1, ..., Q_T (the sequence of plays chosen by the environment).
The authors give a modified version of the Littlestone and Warmuth online learning algorithm and prove bounds on how close to the optimal strategy this approximation comes. The bounds given in this paper are on the order of O(√(ln n / T)) per round, where n is the number of rows of M; the authors allude to other bounds that are tighter.
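Here is a minimal sketch of a multiplicative-weights learner in the spirit of the Littlestone and Warmuth algorithm (the exact update rule and learning rate in the paper differ; the setup, names, and toy game below are mine):

```python
import numpy as np

def mw_play(M, env_strategies, eta=0.1):
    """Multiplicative-weights sketch for the row player of a loss matrix M (entries in [0,1]).
    env_strategies: list of column mixed strategies Q_1, ..., Q_T chosen by the environment.
    Returns the learner's average loss and the best fixed row's average loss, for comparison."""
    n_rows, _ = M.shape
    w = np.ones(n_rows)                       # one weight per pure row strategy
    losses, row_losses = [], np.zeros(n_rows)
    for Q in env_strategies:
        P = w / w.sum()                       # learner's mixed strategy P_t
        loss_per_row = M @ Q                  # M(i, Q_t) for every row i (revealed after playing)
        losses.append(P @ loss_per_row)       # learner suffers M(P_t, Q_t)
        row_losses += loss_per_row
        w *= np.exp(-eta * loss_per_row)      # downweight rows that did poorly
    return np.mean(losses), row_losses.min() / len(env_strategies)

# Toy 2x2 game; the environment plays the same mixed strategy every round.
M = np.array([[0.0, 1.0], [1.0, 0.3]])
Qs = [np.array([0.7, 0.3])] * 200
print(mw_play(M, Qs))
```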
Next, the authors show that boosting can be obtained by running the same online prediction algorithm on the dual of the game. The dual of a game is determined by reversing the order of the players' turns and reversing what maximum and minimum loss mean for the game. More formally, the dual of a game M is given by M' = 1 − Mᵀ.
Learning finite state automata / 23 April (Scott)
Notes available here.
Past Semesters
- Fall 2008 (Graphical Models)
- Summer 2008 (Multitask Learning)
- Spring 2008 (Kernel Methods)
- Fall 2007 (Unsupervised Bayes)
- Summer 2007 (Manifolds)