MLRG/spring10

CS7941: Topics in Machine Learning

Time: Thursdays, 10:45-12:05

Location: MEB 3105, except as noted

Topic: Structured Prediction

Expectations

Each week will consist of two related papers, and there will be two presenters. Each presenter is to (a) advocate their approach and (b) critique the other approach (kindly... especially if it's my paper!). The presentations will go roughly as follows:

  • 20 mins for presenter A to talk about paper A
  • 20 mins for presenter B to talk about paper B
  • 10 mins for presenter A to critique paper B
  • 10 mins for presenter B to critique paper A
  • 20 mins open discussion trying to reach a resolution

The presenters should write short summaries of their papers ahead of time (by 11:59pm the day before the meeting) and post them at the bottom of this page.

You may take this course for 1 credit or 2 credits. If you take it for one credit, you only need to do what's above. If you take it for two credits, then you additionally need to help with the development of an implementation of Searn on top of John Langford's VW engine.

Participants

Schedule

Please do not sign up until the semester starts, except for day 1... we need to give "the youth" a chance to sign up before senior folks do!

Date   | Papers                                                           | Presenters
Introduction to Structured Prediction
14 Jan | Maximum Entropy Markov Models versus Conditional Random Fields  | Hal vs Piyush
21 Jan | M3Ns versus SVMstruct                                            | _ vs _
28 Jan | Incremental Perceptron versus Searn                              | Thanh vs Clifton
04 Feb | Dependency Estimation versus Density Estimation                  | Zhan vs Amit
Theory
11 Feb | Generalization Bounds versus Structure Compilation               | Avishek vs _
18 Feb | Negative Results versus Positive Results                         | Seth vs Kris
25 Feb | Learning with Hints versus Multiview Learning                    | Ruihong vs Adam
Dynamic Programming and Search
04 Mar | Factorie versus Compiling Comp Ling                              | Jiarong vs Youngjun
11 Mar | Minimum Spanning Trees versus Matrix-Tree Theorem                | Adam vs _
18 Mar | Contrastive Divergence versus Contrastive Estimation             | Arvind vs Amit
01 Apr | Searching over Partitions versus End-to-end Machine Translation  | Ruihong vs Matthew O.
Inverse Optimal Control (aka IRL)
08 Apr | Learning to Search versus Apprenticeship Learning                | Kris vs Lalindra
15 Apr | MaxEnt IRL versus Apprenticeship RL                              | Zhan vs _
22 Apr | IO Heuristic Control versus Parsing with IOC                     | _ vs _

Paper Summaries

14 Jan: MEMMs and CRFs (Hal and Piyush)

Maximum Entropy Markov Models

Maximum entropy Markov models are essentially HMMs in which the generative factor p(x_n|y_n) * p(y_n|y_{n-1}) has been replaced with a general (conditional) exponential model. Another way of thinking of it is that we train "classifiers" to predict p(y_n | y_{n-1}, x). Note that this can depend on all of x, not just x_n. This gives major advantages because we can now throw in arbitrary overlapping features, without having to worry about killing ourselves with the naive Bayes assumption that we're forced to make in the generative model. The transition probabilities can also be estimated with classifiers. All this, and the Viterbi algorithm still works for decoding, and the forward-backward algorithm still works for computing expectations (useful for un- or semi-supervised learning, or learning with latent variables).

Training is straightforward: you just train a maxent classifier predicting, for instance, y_10 given x and y_9, for all of the training examples. Testing is also straightforward: you run Viterbi on the (probabilistic) output of the classifier. (In the original formulation, they trained separate transition models for each "previous state", but this is a bad idea and no one does it any more.) Note that the estimated distribution is a locally-conditioned transition model, which makes training much more efficient: the normalizing constant only has to sum over #tags possibilities. The typical way features are defined is as functions of (previous tag, current tag, entire input), and then a single giant weight vector is learned over these features.
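To make the decoding step concrete, here is a minimal Python sketch (an illustration, not reference MEMM code): Viterbi run over the locally normalized outputs of an assumed classifier interface log_p(prev, tag, x, n), which stands in for any trained maxent model of p(y_n | y_{n-1}, x).

  def viterbi(x, tags, log_p):
      """Viterbi decoding for an MEMM.
      log_p(prev, tag, x, n) is assumed to return the locally normalized
      log probability log p(y_n = tag | y_{n-1} = prev, x); 'START' marks
      the initial state."""
      N = len(x)
      # delta[n][t] = best log score of any tag sequence ending in t at position n
      delta = [{t: log_p('START', t, x, 0) for t in tags}]
      back = []
      for n in range(1, N):
          # best_prev[t] = best tag at position n-1 if position n is tagged t
          best_prev = {t: max(tags, key=lambda s: delta[n-1][s] + log_p(s, t, x, n))
                       for t in tags}
          back.append(best_prev)
          delta.append({t: delta[n-1][best_prev[t]] + log_p(best_prev[t], t, x, n)
                        for t in tags})
      # follow backpointers from the best final tag
      path = [max(tags, key=lambda t: delta[N-1][t])]
      for n in range(N-2, -1, -1):
          path.insert(0, back[n][path[0]])
      return path

Each step only consults the classifier's #tags-way distribution, which is exactly the efficiency point made above.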

Conditional Random Fields

Conditional random fields (CRFs) offer a probabilistic framework for labeling and segmenting structured data (examples: predicting POS tags for sentences, information extraction, syntactic disambiguation, aligning biological sequences, etc.). CRFs are basically undirected graphical models that define a log-linear distribution over label sequences given a particular observation sequence: p(y|x,w) = \frac{1}{Z_{x;w}}\exp[w^T\phi(x,y)], where the normalization constant Z_{x;w} = \sum_{y' \in \mathcal{Y}}\exp[w^T\phi(x,y')]. The normalization constant here is basically a sum over all hypothesized outputs.
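To make the per-sequence normalization concrete, here is a brute-force Python sketch (illustrative names, not from the paper): it enumerates all |tags|^n label sequences, so it is only a toy-scale sanity check; real implementations compute Z with the forward algorithm instead.

  import itertools, math

  def log_Z(x, tags, w, phi):
      """Brute-force log partition function: log of the sum over all label
      sequences y' of exp(w . phi(x, y')).  phi(x, y) is an assumed feature
      function returning a vector the same length as w."""
      scores = [sum(wi * fi for wi, fi in zip(w, phi(x, y)))
                for y in itertools.product(tags, repeat=len(x))]
      m = max(scores)  # log-sum-exp for numerical stability
      return m + math.log(sum(math.exp(s - m) for s in scores))

  def log_prob(x, y, tags, w, phi):
      """log p(y | x, w) = w . phi(x, y) - log Z_{x;w}."""
      return sum(wi * fi for wi, fi in zip(w, phi(x, y))) - log_Z(x, tags, w, phi)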

Very much like maximum entropy Markov models (MEMMs), which generalize logistic regression to sequences, CRFs are discriminative (aka conditional) models. The conditional nature helps incorporate richer representations of the observations, model overlapping features, and capture long-range dependencies in the observations, while ensuring that inference remains tractable. Such properties of the observations are very hard to capture in generative models such as HMMs, and even if one could, inference becomes intractable in such models. Conditional models such as CRFs (and MEMMs) circumvent this difficulty by directly modeling the probability distribution of the quantity we care about (in this case, the label sequence), without having to bother with modeling assumptions about the observations.

CRFs bring in all the benefits of MEMMs, plus two additional advantages: 1) CRFs can be applied to structures other than sequences, so they can label general graph structures where nodes represent labels and edges represent dependencies between labels. 2) CRFs give a principled way to deal with the label-bias problem faced by MEMMs (or any conditional model that does per-state normalization). An MEMM normalizes per state in the conditional distribution of each state given the observation. This causes a bias towards states that have very few outgoing transitions and, in extreme cases, may even result in states completely ignoring the observations. A CRF deals with this issue by enforcing *per-sequence* normalization rather than per-state normalization (note, however, that there has been prior work addressing the label-bias problem within the MEMM framework too). Maximum-likelihood learning in CRFs becomes harder as a result: the nasty normalization term makes the gradient difficult to compute directly, but a straightforward modification of the forward-backward algorithm for HMMs comes to the rescue.
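For reference, the gradient that the summary alludes to is standard exponential-family algebra (not specific to either paper). Differentiating the conditional log-likelihood gives

  \nabla_w \log p(y|x,w) = \phi(x,y) - \sum_{y' \in \mathcal{Y}} p(y'|x,w)\,\phi(x,y') = \phi(x,y) - \mathbb{E}_{y' \sim p(\cdot|x,w)}[\phi(x,y')]

i.e., observed features minus expected features. For linear-chain CRFs, \phi decomposes over adjacent label pairs, so the expectation reduces to edge marginals, which is exactly what the forward-backward modification computes.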


21 Jan: M3N and SVMStruct

M3N

SVMStruct

28 Jan: Incremental Perceptron and Searn (Thanh and Clifton)

Incremental Perceptron

Many structured prediction algorithms deal with the argmax problem \hat{y} = \argmax_{y \in GEN(x)} w^T \phi(x,y). So does the structured perceptron (Collins, 2002), which makes the significant assumption that \phi admits efficient search over Y (using dynamic programming). In the general case, the argmax might not be tractable. Two possible solutions are mentioned: reranking (Collins, 2000) and the incremental perceptron (this paper; Collins and Roark, 2004). Reranking first produces an "n-best" list of outputs and uses a second model to pick an output from this list, which avoids the intractable argmax problem. Reranking supports flexible loss functions, but it is inefficient and assumes reasonable performance from the baseline model.

As an alternative, the incremental perceptron, a variant of the structured perceptron, deals with the case where the argmax is not analytically available. It replaces the exact argmax with heuristic incremental beam search, which considers a much smaller candidate set: \hat{y} = \argmax_{y \in F_n} w^T \phi(x,y). The paper introduces two refinements: "repeated use of hypotheses" and "early update". The first maintains a cache of examples and repeatedly iterates over them, updating the model whenever the gold-standard parse is not the best-scoring parse among the stored candidates; in effect, it dynamically generates constraints (incorrect parses) and reuses them, whereas the original algorithm looks at only one constraint per sentence and is extremely wasteful with the constraints implied by previously parsed sentences. Early update aborts the search as soon as an error has been detected, rather than allowing the parser to continue to the end of the sentence; this gives less noisy input to the parameter estimation algorithm and also improves efficiency.

The incremental perceptron thus handles the case where the argmax is not analytically available, is simple, and places minimal requirements on Y and \phi, though it supports only 0/1 loss. Note that its incremental beam search is only a heuristic: there is no guarantee that the procedure finds the highest-scoring parse, and a search error occurs whenever \argmax_{h \in F_n} w^T \phi(h) \ne \argmax_{h \in H_n} w^T \phi(h). The improvement therefore has to be demonstrated empirically, which is what the experiments do.
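As a rough illustration of how beam search and early update fit together, here is a minimal Python sketch; the interfaces extend(h, x) (yielding the one-step extensions of a partial hypothesis, represented as a tuple of decisions) and phi(x, h) (returning a feature dictionary) are assumptions for this example, not Collins and Roark's actual code.

  def beam_train_step(w, x, gold, extend, phi, beam_size=10):
      """One perceptron step with beam search and early update.
      If the gold prefix falls off the beam, abort the search immediately
      (early update) and update against the current best partial
      hypothesis instead of decoding to the end of the sentence."""
      def score(h):
          return sum(w.get(f, 0.0) * v for f, v in phi(x, h).items())
      beam = [()]  # start from the empty hypothesis
      for n in range(len(gold)):
          candidates = [h2 for h in beam for h2 in extend(h, x)]
          beam = sorted(candidates, key=score, reverse=True)[:beam_size]
          if tuple(gold[:n+1]) not in beam:
              perceptron_update(w, phi(x, tuple(gold[:n+1])), phi(x, beam[0]))
              return  # early update: stop searching this sentence
      if beam[0] != tuple(gold):
          perceptron_update(w, phi(x, tuple(gold)), phi(x, beam[0]))

  def perceptron_update(w, gold_feats, guess_feats):
      """Standard perceptron update: w += phi(gold) - phi(guess)."""
      for f, v in gold_feats.items():
          w[f] = w.get(f, 0.0) + v
      for f, v in guess_feats.items():
          w[f] = w.get(f, 0.0) - v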

Searn

Additional Optional Reading

See Structured Prediction on Braque.

Past Semesters
