MLRG/spring10
Revision as of 18:23, 10 February 2010
CS7941: Topics in Machine Learning
Time: Thursdays, 10:45-12:05
Location: MEB 3105, except as noted
Topic: Structured Prediction
Expectations
Each week will consist of two related papers, and there will be two presenters. Each presenter is to (a) advocate their approach and (b) critique the other approach (kindly... especially if it's my paper!). The presentations will go roughly as follows:
- 20 mins for presenter A to talk about paper A
- 20 mins for presenter B to talk about paper B
- 10 mins for presenter A to critique paper B
- 10 mins for presenter B to critique paper A
- 20 mins open discussion trying to reach a resolution
The presenters should write short summaries of their papers ahead of time (by 11:59pm the day before the meeting) and post them at the bottom of this wiki page.
You may take this course for 1 credit or 2 credits. If you take it for one credit, you only need to do what's above. If you take it for two credits, then you additionally need to help with the development of an implementation of Searn on top of John Langford's VW engine.
Participants
- Hal Daumé III, Assistant Professor, School of Computing
- Jagadeesh Jagarlamudi, PhD Student, School of Computing
- Lalindra De Silva, PhD Student, School of Computing
- Piyush Rai, PhD Student, School of Computing
- Zhan Wang, PhD Student, School of Computing
- Thanh Nguyen, PhD Student, School of Computing
- Amit Goyal, PhD Student, School of Computing
- Adam R. Teichert, MS Student, School of Computing
- Ruihong Huang, PhD Student, School of Computing
- Kristina Doing-Harris, Post Doctoral Fellow, Biomedical Informatics Department
- Jiarong Jiang, PhD Student, School of Computing
- Youngjun Kim, PhD Student, School of Computing
- Seth Juarez, PhD Student, School of Computing
- Sandeep P, MS Student, School of Computing
Schedule
Please do not sign up until the semester starts, except for day 1... we need to give "the youth" a chance to sign up before senior folks do!
| Date | Papers | Presenter |
|---|---|---|
| *Introduction to Structured Prediction* | | |
| 14 Jan | Maximum Entropy Markov Models versus Conditional Random Fields | Hal vs Piyush |
| 21 Jan | M3Ns versus SVMstruct | _ vs _ |
| 28 Jan | Incremental Perceptron versus Searn | Thanh vs Clifton |
| 04 Feb | Dependency Estimation versus Density Estimation | Zhan vs Amit |
| *Theory* | | |
| 11 Feb | Generalization Bounds versus Structure Compilation | Avishek vs Jags |
| 18 Feb | Negative Results versus Positive Results | Seth vs Kris |
| 25 Feb | Learning with Hints versus Multiview Learning | Ruihong vs Adam |
| *Dynamic Programming and Search* | | |
| 04 Mar | Factorie versus Compiling Comp Ling | Jiarong vs Youngjun |
| 11 Mar | Minimum Spanning Trees versus Matrix-Tree Theorem | Adam vs Jags |
| 18 Mar | Contrastive Divergence versus Contrastive Estimation | Arvind vs Amit |
| 01 Apr | Searching over Partitions versus End-to-end Machine Translation | Ruihong vs Matthew O. |
| *Inverse Optimal Control (aka IRL)* | | |
| 08 Apr | Learning to Search versus Apprenticeship Learning | Kris vs Lalindra |
| 15 Apr | MaxEnt IRL versus Apprenticeship RL | Zhan vs _ |
| 22 Apr | IO Heuristic Control versus Parsing with IOC | _ vs _ |
Paper Summaries
14 Jan: MEMMs and CRFs (Hal and Piyush)
Maximum Entropy Markov Models
Maximum entropy Markov models (MEMMs) are essentially HMMs in which the product p(x_n|y_n)*p(y_n|y_{n-1}) has been replaced with a general (conditional) exponential model. Another way of thinking about it is that we train "classifiers" to predict p(y_n | y_{n-1}, x). Note that this can depend on all of x, not just x_n. This is a major advantage: we can now throw in arbitrary overlapping features without worrying about killing ourselves with the naive Bayes assumption we're forced to make in the generative model. The transition probabilities can also be estimated with classifiers.

All this, and the Viterbi algorithm still works for decoding, and the forward-backward algorithm still works for computing expectations (useful for un- or semi-supervised learning, or learning with latent variables). Training is straightforward: you just train a maxent classifier, predicting, for instance, y_10 given x and y_9, over all of the training examples. Testing is also straightforward: you run Viterbi on the (probabilistic) output of the classifier. (In the original formulation, separate transition models were trained for each "previous state", but this is a bad idea and no one does it any more.)

Note that the estimated distribution is a locally-conditioned transition model, which makes training much more efficient: the normalizing constant only has to sum over #tags possibilities. Features are typically defined as functions of (previous tag, current tag, entire input), and a single giant weight vector is learned over these features.
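The decode step can be sketched in a few lines. Everything here is a hypothetical toy: `local_prob` stands in for a trained maxent classifier p(y_n | y_{n-1}, x) (a real MEMM would actually use the input x and position n), and the tag set and preference table are made up for illustration.

```python
import math

TAGS = ["DET", "NOUN", "VERB"]

def local_prob(prev_tag, tag, x, n):
    # Toy stand-in for a trained classifier p(y_n | y_{n-1}, x):
    # slightly prefers starting with DET, then DET->NOUN, then NOUN->VERB.
    prefs = {("DET", "NOUN"): 0.6, ("NOUN", "VERB"): 0.6, (None, "DET"): 0.6}
    base = prefs.get((prev_tag, tag), 0.2)
    total = sum(prefs.get((prev_tag, t), 0.2) for t in TAGS)
    return base / total  # locally normalized: sums to 1 over #tags

def viterbi(x):
    # delta[n][y] = log-prob of the best tag sequence ending in y at position n
    delta = [{y: math.log(local_prob(None, y, x, 0)) for y in TAGS}]
    back = []
    for n in range(1, len(x)):
        delta.append({})
        back.append({})
        for y in TAGS:
            best_prev = max(
                TAGS,
                key=lambda yp: delta[n - 1][yp] + math.log(local_prob(yp, y, x, n)),
            )
            back[-1][y] = best_prev
            delta[-1][y] = delta[n - 1][best_prev] + math.log(local_prob(best_prev, y, x, n))
    # Trace back the best path from the highest-scoring final tag
    path = [max(TAGS, key=lambda y: delta[-1][y])]
    for b in reversed(back):
        path.append(b[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))  # → ['DET', 'NOUN', 'VERB']
```

Because each `local_prob` is already a normalized distribution, there is no global partition function to compute at decode time, which is exactly what makes MEMM training and testing cheap.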
Conditional Random Fields
Conditional random fields (CRFs) offer a probabilistic framework for labeling and segmenting structured data (examples: predicting POS tags for sentences, information extraction, syntactic disambiguation, aligning biological sequences, etc.). CRFs are basically undirected graphical models that define a log-linear distribution over label sequences given a particular observation sequence: p(y|x,w) = \frac{1}{Z_{x;w}}\exp[w^T\phi(x,y)], where the normalization constant Z_{x;w} = \sum_{y' \in \mathcal{Y}}\exp[w^T\phi(x,y')]. The normalization constant here is basically a sum over all hypothesized outputs.
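To make the formula concrete, here is a brute-force sketch of p(y|x,w) that enumerates every y' to compute Z_{x;w} exactly. The feature map `phi`, tag set, and weights are hypothetical toys; a real CRF would never enumerate the exponentially many outputs, and would use dynamic programming instead.

```python
import itertools
import math

TAGS = ["A", "B"]

def phi(x, y, feature_index):
    # Sparse feature counts over (prev_tag, tag) transitions and (word, tag)
    # emissions; only features present in feature_index are kept.
    counts = {}
    for n, (word, tag) in enumerate(zip(x, y)):
        prev = y[n - 1] if n > 0 else "<s>"
        for f in [("trans", prev, tag), ("emit", word, tag)]:
            if f in feature_index:
                i = feature_index[f]
                counts[i] = counts.get(i, 0) + 1
    return counts

def score(w, counts):
    # w^T phi(x, y) over the sparse feature counts
    return sum(w[i] * c for i, c in counts.items())

def prob(x, y, w, feature_index):
    num = math.exp(score(w, phi(x, y, feature_index)))
    # Z_{x;w}: sum over ALL hypothesized label sequences y'
    Z = sum(math.exp(score(w, phi(x, list(yp), feature_index)))
            for yp in itertools.product(TAGS, repeat=len(x)))
    return num / Z

feature_index = {("emit", "dog", "A"): 0, ("trans", "A", "B"): 1}
w = [1.0, 0.5]
print(round(prob(["dog", "runs"], ["A", "B"], w, feature_index), 3))  # → 0.487
```

Note that `phi` scores the whole sequence at once and the normalization is per sequence, not per state; this is the structural difference from the MEMM above.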
Very much like maximum entropy Markov models (MEMMs), which generalize logistic regression to sequences, CRFs are discriminative (aka conditional) models. The conditional nature helps in incorporating richer representations of the observations, modeling overlapping features, and capturing long-range dependencies in the observations, while at the same time ensuring that inference remains tractable. Such properties of the observations are very hard to capture in generative models such as HMMs, and even if one could, inference becomes intractable in such models. Conditional models such as CRFs (and MEMMs) circumvent this difficulty by directly modeling the probability distribution of the quantity we care about (in this case, the label sequence), without having to bother with modeling assumptions on the observations.
CRFs bring in all the benefits of MEMMs, plus two additional advantages: 1) CRFs can also be applied to structures other than sequences, so they can be used to label graph structures where nodes represent labels and edges represent dependencies between labels. 2) CRFs give a principled way to deal with the label-bias problem faced by MEMMs (or any conditional model that does per-state normalization). MEMMs normalize each conditional distribution of the state given the observation per state. This causes a bias towards states that have very few outgoing transitions and, in extreme cases, may even result in states completely ignoring the observations. CRFs deal with this issue by enforcing *per-sequence* normalization as opposed to per-state normalization (note, however, that there has been some prior work that addresses the label-bias problem within the MEMM framework too). Maximum-likelihood learning in CRFs becomes harder with global normalization: the nasty normalization term becomes difficult to take the gradient of, but a straightforward modification of the forward-backward algorithm of HMMs comes to the rescue here.
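The forward-backward point can be illustrated for the normalizer alone. This is a minimal sketch under assumptions: `psi` is a made-up local log-potential (a real CRF would compute w^T phi on one clique there), and the forward recursion computes the same log Z as brute-force enumeration, but in O(length * |TAGS|^2) rather than O(|TAGS|^length).

```python
import itertools
import math

TAGS = ["A", "B"]

def psi(prev, tag, n):
    # Toy local log-potential; a real CRF would compute w^T phi(prev, tag, x, n).
    return 0.5 if (prev, tag) == ("A", "B") else 0.1

def log_Z_forward(length):
    # alpha[y] = log of the summed exp-scores of all prefixes ending in tag y
    alpha = {y: psi("<s>", y, 0) for y in TAGS}
    for n in range(1, length):
        alpha = {y: math.log(sum(math.exp(alpha[yp] + psi(yp, y, n)) for yp in TAGS))
                 for y in TAGS}
    return math.log(sum(math.exp(a) for a in alpha.values()))

def log_Z_brute(length):
    # Direct sum over all |TAGS|^length label sequences, for comparison only.
    total = 0.0
    for yseq in itertools.product(TAGS, repeat=length):
        s = psi("<s>", yseq[0], 0)
        for n in range(1, length):
            s += psi(yseq[n - 1], yseq[n], n)
        total += math.exp(s)
    return math.log(total)

print(abs(log_Z_forward(4) - log_Z_brute(4)) < 1e-9)  # → True
```

The same alpha (and a symmetric beta) recursion gives the clique marginals needed for the gradient, which is the "straightforward modification of forward-backward" mentioned above.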
21 Jan: M3N and SVMStruct
M3N
SVMStruct
28 Jan: Incremental Perceptron and Searn (Thanh and Clifton)
Incremental Perceptron
Many structured prediction algorithms deal with the argmax problem y^ = \argmax_{y \in GEN(x)}{w^T*\phi(x,y)}. The structured perceptron (Collins 2002) is one of them, and it makes the significant assumption that \phi admits efficient search over Y (using dynamic programming). In the general case, the argmax might not be tractable. Two possible solutions mentioned are reranking (Collins 2000) and the incremental perceptron (this paper; Collins 2004). Reranking first produces an "n-best" list of outputs and uses a second model to pick an output from this list, which avoids the intractable argmax problem. Reranking supports flexible loss functions, but it is inefficient and assumes reasonable performance from the baseline model.

As an alternative, the incremental perceptron, a variant of the structured perceptron, deals with the case where the argmax is not analytically available. It replaces the exact argmax with heuristic incremental beam search strategies, which return a much smaller set of candidates: y^ = \argmax_{y \in F_n}{w^T*\phi(x,y)}. The paper introduces two refinements: "repeated use of hypotheses" and "early update". The first maintains a cache of examples and repeatedly iterates over them, updating the model whenever the gold-standard parse is not the best-scoring parse among the stored candidates (it dynamically generates constraints, i.e. incorrect parses, and uses these constraints to update the model, whereas the original algorithm looks at only one constraint per sentence and is extremely wasteful with the constraints implied by previously parsed sentences). Early update aborts the search as soon as it detects that an error has been made, rather than allowing the parser to continue to the end of the sentence; this gives less noisy input to the parameter estimation algorithm and also improves efficiency.

The incremental perceptron thus handles the case where the argmax is not analytically available; it is simple and places minimal requirements on Y and \phi, though it supports only 0/1 loss. Note that the incremental beam search is only a heuristic: there is no guarantee that the procedure finds the highest-scoring parse, and search errors occur when \argmax_{h \in F_n}{w^T*\phi(h)} \ne \argmax_{h \in H_n}{w^T*\phi(h)}. The improvement therefore has to be shown in empirical experiments.
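The early-update idea can be sketched compactly. This is a hypothetical toy, not the paper's parser: states are tag prefixes, `features` is a made-up sparse feature map, and one call runs beam search, stopping with a perceptron update the moment the gold prefix falls off the beam (or, at the end, if the top hypothesis is wrong).

```python
TAGS = ["A", "B"]
BEAM = 2

def features(x, prefix):
    # Hypothetical sparse features over the last transition in the prefix.
    prev = prefix[-2] if len(prefix) > 1 else "<s>"
    return [("trans", prev, prefix[-1]), ("emit", x[len(prefix) - 1], prefix[-1])]

def all_features(x, prefix):
    return [f for p in range(1, len(prefix) + 1) for f in features(x, prefix[:p])]

def score(w, x, prefix):
    return sum(w.get(f, 0.0) for f in all_features(x, prefix))

def perceptron_update(w, x, good, bad):
    for f in all_features(x, good):
        w[f] = w.get(f, 0.0) + 1.0
    for f in all_features(x, bad):
        w[f] = w.get(f, 0.0) - 1.0

def early_update(w, x, gold):
    """One pass over x; returns True if an update was made."""
    beam = [()]
    for n in range(len(x)):
        cands = [prefix + (t,) for prefix in beam for t in TAGS]
        beam = sorted(cands, key=lambda p: -score(w, x, p))[:BEAM]
        gold_prefix = tuple(gold[:n + 1])
        if gold_prefix not in beam:
            # Early update: the gold prefix fell off the beam, so update
            # against the current best partial hypothesis and abort the search.
            perceptron_update(w, x, gold_prefix, beam[0])
            return True
    if beam[0] != tuple(gold):
        perceptron_update(w, x, tuple(gold), beam[0])
        return True
    return False

w = {}
x, gold = ["dog", "runs"], ["A", "B"]
while early_update(w, x, gold):
    pass  # repeat until the beam decodes the gold sequence
```

Updating on a prefix rather than a full (possibly garbage) sequence is what keeps the constraints fed to the learner clean, per the paper's argument above.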
Searn
4 Feb: Dependency Estimation and Density Estimation (Zhan and Amit)
Dependency Estimation
Density Estimation
Additional Optional Reading
See Structured Prediction on Braque.
Past Semesters
- Fall 2009 (Online Learning)
- Summer 2009 (Assorted Topics)
- Spring 2009 (Learning Theory)
- Fall 2008 (Graphical Models)
- Summer 2008 (Multitask Learning)
- Spring 2008 (Kernel Methods)
- Fall 2007 (Unsupervised Bayes)
- Summer 2007 (Manifolds)