MLRG/spring10

From ResearchWiki


Revision as of 21:54, 18 February 2010


CS7941: Topics in Machine Learning

Time: Thursdays, 10:45-12:05

Location: MEB 3105, except as noted

Topic: Structured Prediction

Expectations

Each week will consist of two related papers, and there will be two presenters. Each presenter is to (a) advocate their approach and (b) critique the other approach (kindly... especially if it's my paper!). The presentations will go roughly as follows:

  • 20 mins for presenter A to talk about paper A
  • 20 mins for presenter B to talk about paper B
  • 10 mins for presenter A to critique paper B
  • 10 mins for presenter B to critique paper A
  • 20 mins open discussion trying to reach a resolution

The presenters should write short summaries ahead of time (by 11:59pm the day before the meeting) about their papers, posted at the bottom of this wiki.

You may take this course for 1 credit or 2 credits. If you take it for one credit, you only need to do what's above. If you take it for two credits, then you additionally need to help with the development of an implementation of Searn on top of John Langford's VW engine.

Participants

Schedule

Please do not sign up until the semester starts, except for day 1... we need to give "the youth" a chance to sign up before senior folks do!

Date | Papers | Presenter
Introduction to Structured Prediction
14 Jan | Maximum Entropy Markov Models versus Conditional Random Fields | Hal vs Piyush
21 Jan | M3Ns versus SVMstruct | _ vs _
28 Jan | Incremental Perceptron versus Searn | Thanh vs Clifton
04 Feb | Dependency Estimation versus Density Estimation | Zhan vs Amit
Theory
11 Feb | Generalization Bounds versus Structure Compilation | Avishek vs Jags
18 Feb | Negative Results versus Positive Results | Seth vs Kris
25 Feb | Learning with Hints versus Multiview Learning | Ruihong vs Adam
Dynamic Programming and Search
04 Mar | Factorie versus Compiling Comp Ling | Jiarong vs Youngjun
11 Mar | Minimum Spanning Trees versus Matrix-Tree Theorem | Adam vs Jags
18 Mar | Contrastive Divergence versus Contrastive Estimation | Arvind vs Amit
01 Apr | Searching over Partitions versus End-to-end Machine Translation | Ruihong vs Matthew O.
Inverse Optimal Control (aka IRL)
08 Apr | Learning to Search versus Apprenticeship Learning | Kris vs Lalindra
15 Apr | MaxEnt IRL versus Apprenticeship RL | Zhan vs _
22 Apr | IO Heuristic Control versus Parsing with IOC | _ vs _

Paper Summaries

14 Jan: MEMMs and CRFs (Hal and Piyush)

Maximum Entropy Markov Models

Maximum entropy Markov models are essentially HMMs where the p(x_n|y_n)*p(y_n|y_{n-1}) has been replaced with a general (conditional) exponential model. Another way of thinking of it is to train "classifiers" to predict p(y_n | y_{n-1}, x). Note that this can depend on all of x, not just x_n. This gives major advantages because we can now throw in arbitrary overlapping features, without having to worry about killing ourselves with the naive Bayes assumption that we're forced to make in the generative model. The transition probabilities can also be estimated with classifiers. All this, and the Viterbi algorithm still works for decoding and the forward-backward algorithm still works for computing expectations (useful for un- or semi-supervised learning, or learning with latent variables). Training is straightforward: you just train a maxent classifier, predicting, for instance, y_10 given x and y_9 for all of the training examples. Testing is also straightforward: you run Viterbi on the (probabilistic) output of the classifier. (In the original formulation, they trained separate transition models for each "previous state", but this is a bad idea and no one does it any more.) Note that the estimated distribution provides a locally-conditioned transition model and a locally-conditioned classifier, which makes training much more efficient (the normalizing constant only has to sum over #tags possibilities). The typical way in which features are defined is as functions of (previous tag, current tag, entire input), and then a single giant weight vector is learned over these features.
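The decoding step above can be sketched in a few lines. This is a minimal illustration, not the original implementation: the local maxent classifier is stubbed out as a toy function returning log p(y_n | y_{n-1}, x), and Viterbi maximizes the sum of these locally normalized log-probabilities. All names here are illustrative.

```python
import math

TAGS = ["N", "V"]

def viterbi(n_words, log_p):
    """Viterbi decoding for an MEMM.

    log_p(prev_tag, tag, n) is the locally normalized classifier score
    log p(y_n = tag | y_{n-1} = prev_tag, x); prev_tag is None at n = 0.
    """
    # best[t] = (score of the best tag path ending in t, that path)
    best = {t: (log_p(None, t, 0), [t]) for t in TAGS}
    for n in range(1, n_words):
        best = {
            t: max((best[p][0] + log_p(p, t, n), best[p][1] + [t]) for p in TAGS)
            for t in TAGS
        }
    return max(best.values())[1]

def toy_log_p(prev, tag, n):
    # stand-in for a trained maxent classifier p(tag | prev tag, x);
    # it simply prefers N at position 0 and V at position 1
    return math.log(0.9 if tag == ["N", "V"][n] else 0.1)

print(viterbi(2, toy_log_p))  # ['N', 'V']
```

Because every factor is locally normalized, no global partition function is needed at training or test time, which is exactly the efficiency advantage noted above.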

Conditional Random Fields

Conditional random fields (CRFs) offer a probabilistic framework for labeling and segmenting structured data (examples: predicting POS tags for sentences, information extraction, syntactic disambiguation, aligning biological sequences, etc.). CRFs are basically undirected graphical models that define a log-linear distribution over label sequences given a particular observation sequence: p(y|x,w) = \frac{1}{Z_{x;w}}\exp[w^T\phi(x,y)], where the normalization constant Z_{x;w} = \sum_{y' \in \mathcal{Y}}\exp[w^T\phi(x,y')]. The normalization constant here is basically a sum over all hypothesized outputs.
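For a tiny chain this distribution can be computed directly by brute-force enumeration of \mathcal{Y}, which makes the global normalization explicit. A hedged sketch, with a hypothetical indicator-count feature map phi (emission and transition counts), not any particular paper's features:

```python
import itertools
import math
from collections import defaultdict

LABELS = [0, 1]

def score(w, phi, x, y):
    # linear score w^T phi(x, y)
    return sum(w[f] * v for f, v in phi(x, y).items())

def crf_prob(w, phi, x, y):
    # p(y|x,w) = exp(w^T phi(x,y)) / Z_{x;w}, where Z sums over ALL
    # candidate label sequences y' -- this global sum is what
    # forward-backward computes efficiently for chains
    Z = sum(math.exp(score(w, phi, x, yp))
            for yp in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(w, phi, x, y)) / Z

def phi(x, y):
    # hypothetical feature map: emission and transition indicator counts
    feats = defaultdict(float)
    for n, lab in enumerate(y):
        feats[("emit", x[n], lab)] += 1
        if n > 0:
            feats[("trans", y[n - 1], lab)] += 1
    return feats

w = defaultdict(float, {("emit", "a", 0): 2.0, ("emit", "b", 1): 2.0})
x = ("a", "b")
probs = {y: crf_prob(w, phi, x, y) for y in itertools.product(LABELS, repeat=len(x))}
print(max(probs, key=probs.get))  # (0, 1)
```

The probabilities sum to 1 by construction; the exponential cost of enumerating \mathcal{Y} is the reason the forward-backward trick mentioned below matters.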

Very much like maximum entropy Markov models (MEMMs), which are a generalization of logistic regression to sequences, CRFs too are discriminative (aka conditional) models. The conditional nature helps in incorporating richer representations of the observations, modeling overlapping features, and capturing long-range dependencies in the observations, while at the same time ensuring that inference remains tractable. Such properties of the observations are very hard to capture in generative models such as HMMs, and even when one can, inference becomes intractable in such models. Conditional models such as CRFs (and MEMMs) circumvent this difficulty by directly modeling the probability distribution of the quantity we care about (in this case, the label sequence), without having to bother with modeling assumptions about the observations.

CRFs bring in all the benefits of MEMMs, along with two additional advantages: 1) CRFs can also be applied to structures other than sequences, so they can be used to label graph structures where nodes represent labels and edges represent dependencies between labels. 2) CRFs give a principled way to deal with the label-bias problem faced by MEMMs (or any conditional model that does per-state normalization). MEMMs use per-state normalization in the conditional distributions of the state given the observation. This causes a bias towards states which have very few outgoing transitions and, in the extreme cases, may even result in states completely ignoring the observations. CRFs deal with this issue by enforcing a *per-sequence* normalization as opposed to the per-state normalization (but note that there has been some prior work which addresses the label-bias problem in the MEMM framework too). Maximum-likelihood learning in CRFs becomes harder with such a choice: basically, the nasty normalization term becomes difficult to take the gradient of, but a straightforward modification of the forward-backward algorithm of HMMs comes in to rescue us here.
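The label-bias effect can be seen numerically in a toy example (ours, not from the paper): under per-state normalization, a state with a single outgoing transition passes probability 1 to its successor no matter what the observation says.

```python
def normalize(scores):
    # MEMM-style per-state normalization: each conditional distribution
    # p(next state | current state, observation) is normalized on its own
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()}

def next_dist(state, x):
    # hypothetical raw classifier scores given observation x
    if state == "A":
        # two successors: mass shifts with the observation, as it should
        return normalize({"A1": 9.0 if x == "evidence_for_A1" else 1.0,
                          "A2": 1.0 if x == "evidence_for_A1" else 9.0})
    if state == "B":
        # one successor: ANY raw score normalizes to 1.0, so the
        # observation is ignored entirely -- the label-bias effect
        return normalize({"B1": 0.001})

for x in ("evidence_for_A1", "evidence_against_A1"):
    print(x, next_dist("A", x), next_dist("B", x))
```

A CRF's per-sequence normalization avoids this, because unnormalized scores along entire paths compete before any normalization happens.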


21 Jan: M3N and SVMStruct

M3N

SVMStruct

28 Jan: Incremental Perceptron and Searn (Thanh and Clifton)

Incremental Perceptron

Many structured prediction algorithms deal with the argmax problem y^ = \argmax_{y \in GEN(x)}{w^T\phi(x,y)}. So does the structured perceptron (Collins 2002), which makes the significant assumption that Φ admits efficient search over Y (using dynamic programming). In the general case, the argmax might not be tractable. Two possible solutions mentioned are reranking (Collins 2000) and the incremental perceptron (this paper, Collins 2004). Reranking first produces an "n-best" list of outputs and uses a second model to pick an output from this list, which avoids the intractable argmax problem. Reranking supports flexible loss functions, but it is inefficient and assumes reasonable performance from the baseline model.

As an alternative approach, the incremental perceptron, a variant of the structured perceptron, deals with the case where the argmax may not be exactly computable. It uses heuristic search for the argmax, replacing it with an incremental beam search strategy (which considers a much smaller set of candidates): y^ = \argmax_{y \in F_n}{w^T\phi(x,y)}. The paper introduces two refinements: "repeated use of hypotheses" and "early update". The first maintains a cache of examples and repeatedly iterates over them, updating the model whenever the gold-standard parse is not the best-scoring parse among the stored candidates (it dynamically generates constraints, i.e. incorrect parses, and reuses them to update the model, whereas the original algorithm looks at only one constraint per sentence and is extremely wasteful with the constraints implied by previously parsed sentences). Early update aborts the search as soon as an error has been detected, rather than allowing the parser to continue to the end of the sentence; this gives less noisy input to the parameter estimation algorithm and also improves efficiency.

The incremental perceptron handles the case where the argmax is not exactly computable, is simple, and places minimal requirements on Y and \phi, but it supports only 0/1 loss. Note that its incremental beam search is only a heuristic: there is no guarantee that the procedure finds the highest-scoring parse, and a search error occurs when \argmax_{h \in F_n}{w^T\phi(h)} \ne \argmax_{h \in H_n}{w^T\phi(h)}. Therefore, the improvement must be demonstrated empirically.
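The beam search with early update can be sketched as follows. This is our own minimal illustration under simplifying assumptions (a generic feature map phi over partial labelings, a plain dict weight vector), not Collins and Roark's implementation:

```python
def update(w, phi, x, good, bad):
    # standard perceptron update: reward gold features, punish predicted
    for f, v in phi(x, good).items():
        w[f] = w.get(f, 0.0) + v
    for f, v in phi(x, bad).items():
        w[f] = w.get(f, 0.0) - v

def perceptron_step(w, phi, x, gold, labels, beam_size=2):
    """One incremental-perceptron step with beam search and early update.

    Builds the output left to right, keeping the beam_size best partial
    hypotheses; as soon as the gold prefix falls out of the beam, the
    search is aborted and the model is updated against the current best
    partial hypothesis (the 'early update' refinement).
    """
    def score(y):
        return sum(w.get(f, 0.0) * v for f, v in phi(x, y).items())

    beam = [()]
    for n in range(len(gold)):
        cands = [y + (lab,) for y in beam for lab in labels]
        beam = sorted(cands, key=score, reverse=True)[:beam_size]
        if tuple(gold[:n + 1]) not in beam:
            update(w, phi, x, tuple(gold[:n + 1]), beam[0])
            return beam[0], True        # early update fired
    if beam[0] != tuple(gold):
        update(w, phi, x, tuple(gold), beam[0])
        return beam[0], True
    return beam[0], False

def phi(x, y):
    # hypothetical features: counts of (word, label) pairs seen so far
    feats = {}
    for word, lab in zip(x, y):
        feats[(word, lab)] = feats.get((word, lab), 0) + 1
    return feats

w = {}
x, gold = ("a", "b"), ("0", "1")
for _ in range(5):
    perceptron_step(w, phi, x, gold, labels=("0", "1"))
pred, updated = perceptron_step(w, phi, x, gold, labels=("0", "1"))
print(pred, updated)  # ('0', '1') False
```

Because only the beam is ever scored, each step touches a handful of candidates rather than the full output set, which is the whole point of the approximation.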

Searn

4 Feb: Dependency Estimation and Density Estimation (Zhan and Amit)

Dependency Estimation

Density Estimation

11 Feb: Generalization Bounds versus Structure Compilation (Avishek and Jags)

Generalization Bounds

The aim of traditional (non-Bayesian) generalization theory is to formally bound the expected test error, given the empirical training error and some additional terms (which depend on the complexity of the hypothesis space). These traditional bounds, in many cases, can be further improved by careful refinement using PAC-Bayesian analysis; the high-level idea being to formally incorporate a notion of prior belief, which constrains the hypothesis space and results in tighter bounds.

This paper provides PAC-Bayesian generalization bounds and consistency results for structured prediction. The generalization results demonstrate the efficacy of using Hamming distortions (Theorem 2) over 0-1 loss functions (Theorem 1). The proofs of the theorems follow from the PAC-Bayesian theorem and standard concentration arguments that bound the magnitude of the margin to a small value with high probability. In addition, these bounds also demonstrate the advantages of using Hamming distances in the training algorithm irrespective of the loss function used at test time. At this point it is worth noting that both Theorem 1 and Theorem 2 are non-convex in their parameters.

The above results have been proved for logistic regression, which usually applies to binary classification. SVMs, a popular non-probabilistic alternative to logistic regression, have previously been extended to multiclass structured prediction, where the different multiclass SVMs vary in their formulation of the multiclass hinge loss. This paper shows that the different notions of multiclass hinge loss are non-convex in their parameters. Thereafter, these loss functions are appropriately modified to achieve convexity, but in the process they lose consistency. The paper also shows that the initial non-convex bounds (Theorems 1 and 2) are actually consistent, as they converge to the optimal classifier in the limit of infinite training data, thereby suggesting that for structured prediction non-convex objectives might result in superior (consistent) performance as compared to their convex counterparts.

Finally, these bounds are generalized for non-linear cases and for generalized version of "parts", a notion that captures the decoding specifics for structured prediction.

Structure Compilation

Structure compilation is motivated by the need for a "quick response during test time" and is willing to spend more training time as a trade-off. Structured models like MEMMs and CRFs consider the interactions among the labels of the nodes (e.g., the POS tag of each word in POS tagging), whereas ILR (Independent Logistic Regression) decides each label independently of the other nodes and hence is faster. Structure compilation therefore uses ILR to achieve its goal. To compensate for the information lost by not considering the interactions among labels, structure compilation proposes to use more features (which we call the expanded feature set).

It assumes the availability of a small labeled corpus and a considerable amount of unlabeled data. The main idea is the following: first train the CRF on the small labeled data and use this model to predict labels for the unlabeled data. Then use both the original labeled data and the newly labeled data to train the ILR model (with expanded features). Use this ILR model at test time.

The reason for the need for huge amounts of unlabeled data is simple: ILR tends to overfit on a small corpus, so to prevent overfitting it needs to be trained on bigger datasets. Theoretical analysis and experiments show that if the original labeled data is small, structure compilation will catch up to the performance of the CRF.

18 Feb: Positive Results versus Negative Results (Kris and Seth)

Positive Results

Solving the structural SVM training problem with the Cutting Plane algorithm requires a separation oracle that can determine the most violated constraints to a desired precision ε. The existence of a separation oracle computable in polynomial time yields three theoretical guarantees: Polynomial Time Termination, Correctness, and an Empirical Risk Bound. There are two points at which one must determine the argmax over y in Y: during prediction (w^T\psi(x,y)) and while generating output from the separation oracle (w^T\psi(x,y) + \Delta(y_i,y)). It is sometimes impossible to determine this argmax exactly.

The authors demonstrate two classes of Approximate Inference which retain the theoretical guarantees and thus can be used when exact inference is intractable. At a high level, these algorithms draw new linear separators, which are fewer in number than the exact data requires. The first class of algorithms uses Undergenerating Approximations, which search over a subset of the true output set Y and restrict their approximations to within \epsilon of the true optimum. The authors demonstrate that an Undergenerating Approximation algorithm will still terminate in polynomial time, is correct to within a factor of 1/\rho, and that the risk bound is only increased by a factor of (1-\rho). The second class of algorithms uses Overgenerating Approximations, which search over a superset of the true output set Y. The linear separators defined by these algorithms include all of the original in-class examples (y=1), but some of the original out-of-class examples (y=0) are mapped to a third class, designated y=1/2 (LProg) or y=null (Cut). LProg and Cut both have Equivalence (maximizing solutions are transmutable) and Persistence (unambiguous labels are optimal labels), as well as the three theoretical guarantees.

During testing of the Approximations, the authors found that Loopy Belief Propagation (LBP) tended to fall into "horrible" local maxima. This finding is probably related to the discussion of the shortcomings of approximation algorithms in the following paper. Notably, this failure did not occur when LBP was combined with Greedy search.

They present losses for each combination of separation oracle used in learning and predictive inference procedure used in classification, for all five approximations and the Exact method. For the Scene and Yeast datasets, the model trained with LProg is the best or among the best performers regardless of the method used at test time. For the Synth1 dataset, LBP, Greedy, and their Combine perform best. For the Reuters dataset, Combine used at training outperforms Exact regardless of the method used at test time; the same is true for the Mediamill and Synth2 datasets if LProg is used at training. Their overall preference is for the Overgenerating Approximations, because training with them results in a model that can use either approximation for prediction, while training with an Undergenerating Approximation will cause a relaxed inference to yield too many ambiguous labelings.
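In the simplest special case — independent binary labels with Hamming loss — the loss-augmented argmax computed by the separation oracle decomposes over labels, which the following sketch illustrates (our toy decomposition, not the paper's general oracle):

```python
def separation_oracle(margins, gold):
    """Most violated constraint for independent binary labels y_j
    under Hamming loss: argmax_y [w^T psi(x,y) + Delta(y_i, y)].

    margins[j] = w^T psi_j(x, 1) - w^T psi_j(x, 0), the model's score
    margin for setting label j to 1; gold is the true labeling y_i.
    Because both the score and the loss decompose, each label can be
    chosen independently: disagree with the gold label whenever the
    +1 loss for disagreeing outweighs the score margin.
    """
    y = []
    for m, g in zip(margins, gold):
        value_one = m + (1 if g == 0 else 0)   # score + loss if y_j = 1
        value_zero = 1 if g == 1 else 0        # score + loss if y_j = 0
        y.append(1 if value_one > value_zero else 0)
    return y

# j=0: gold 0 but margin +0.5 toward 1; plus the +1 loss -> outputs 1
# j=1: gold 1 with strong margin +2.0 -> loss can't outweigh it, keeps 1
# j=2: gold 1 but margin -0.5; the +1 loss for flipping wins -> outputs 0
print(separation_oracle([0.5, 2.0, -0.5], [0, 1, 1]))  # [1, 1, 0]
```

In the full cutting-plane loop, the returned labeling is added to the working set of constraints and the QP is re-solved; an undergenerating or overgenerating oracle stands in for this exact argmax when Y does not decompose.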

Additional Optional Reading

See Structured Prediction on Braque.

Past Semesters
