MLRG/spring10

From ResearchWiki

Revision as of 20:25, 14 January 2010 by Piyush (Talk | contribs)


CS7941: Topics in Machine Learning

Time: Thursdays, 10:45-12:05

Location: MEB 3105, except as noted

Topic: Structured Prediction

Expectations

Each week will consist of two related papers, and there will be two presenters. Each presenter is to (a) advocate their approach and (b) critique the other approach (kindly... especially if it's my paper!). The presentations will go roughly as follows:

  • 20 mins for presenter A to talk about paper A
  • 20 mins for presenter B to talk about paper B
  • 10 mins for presenter A to critique paper B
  • 10 mins for presenter B to critique paper A
  • 20 mins open discussion trying to reach a resolution

Each presenter should post a short summary of their paper to the bottom of this wiki page ahead of time (by 11:59pm the day before the meeting).

You may take this course for 1 credit or 2 credits. If you take it for one credit, you only need to do what's above. If you take it for two credits, then you additionally need to help develop an implementation of Searn on top of John Langford's VW (Vowpal Wabbit) engine.

Participants

Schedule

Please do not sign up until the semester starts, except for day 1... we need to give "the youth" a chance to sign up before senior folks do!

Date | Papers | Presenter

*Introduction to Structured Prediction*
14 Jan | Maximum Entropy Markov Models versus Conditional Random Fields | Hal vs Piyush
21 Jan | M3Ns versus SVMstruct | _ vs _
28 Jan | Incremental Perceptron versus Searn | _ vs _
04 Feb | Dependency Estimation versus Density Estimation | _ vs _

*Theory*
11 Feb | Generalization Bounds versus Structure Compilation | _ vs _
18 Feb | Negative Results versus Positive Results | _ vs _
25 Feb | Learning with Hints versus Multiview Learning | _ vs _

*Dynamic Programming and Search*
04 Mar | Factorie versus Compiling Comp Ling | _ vs _
11 Mar | Minimum Spanning Trees versus Matrix-Tree Theorem | _ vs _
18 Mar | Contrastive Divergence versus Contrastive Estimation | _ vs _
01 Apr | Searching over Partitions versus End-to-end Machine Translation | _ vs _

*Inverse Optimal Control (aka IRL)*
08 Apr | Learning to Search versus Apprenticeship Learning | _ vs _
15 Apr | MaxEnt IRL versus Apprenticeship RL | _ vs _
22 Apr | IO Heuristic Control versus Parsing with IOC | _ vs _

Paper Summaries

14 Jan: MEMMs and CRFs (Hal and Piyush)

Maximum Entropy Markov Models

Maximum entropy Markov models (MEMMs) are essentially HMMs in which the generative factor p(x_n|y_n)*p(y_n|y_{n-1}) has been replaced with a general (conditional) exponential model. Another way of thinking about it is that we train "classifiers" to predict p(y_n | y_{n-1}, x). Note that this can depend on all of x, not just x_n. This is a major advantage: we can now throw in arbitrary overlapping features without killing ourselves with the naive Bayes assumption that we're forced to make in the generative model. The transition probabilities can also be estimated with classifiers.

All this, and the Viterbi algorithm still works for decoding, and the forward-backward algorithm still works for computing expectations (useful for un- or semi-supervised learning, or learning with latent variables). Training is straightforward: you just train a maxent classifier, predicting, for instance, y_10 given x and y_9, over all of the training examples. Testing is also straightforward: you run Viterbi on the (probabilistic) output of the classifier. (In the original formulation, a separate transition model was trained for each "previous state," but this is a bad idea and no one does it any more.) Because the model is locally conditioned (each transition distribution is normalized on its own), training is very efficient: the normalizing constant only has to sum over #tags possibilities.

Features are typically defined as functions of (previous tag, current tag, entire input), and a single giant weight vector is learned over these features.
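As a concrete toy sketch of the decoding step, here is Viterbi driven by a local classifier p(y_n | y_{n-1}, x). The `local_prob` function below is a hand-made stand-in for a trained maxent classifier, and the tag set is invented purely for illustration:

```python
import math

TAGS = ["N", "V"]

def local_prob(prev_tag, tag, x, n):
    # Hypothetical local distribution p(y_n = tag | y_{n-1} = prev_tag, x):
    # this toy model simply favors alternating tags and ignores x entirely.
    # A real MEMM would get this from a trained maxent classifier.
    return 0.8 if tag != prev_tag else 0.2

def viterbi(x, tags=TAGS, start="<s>"):
    """Highest-probability tag sequence under the local classifier."""
    # delta[t] = best log-probability of any prefix ending in tag t
    delta = {t: math.log(local_prob(start, t, x, 0)) for t in tags}
    back = []  # back-pointers: back[i-1][t] = best previous tag for t at position i
    for i in range(1, len(x)):
        new_delta, pointers = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] + math.log(local_prob(p, t, x, i)))
            new_delta[t] = delta[best_prev] + math.log(local_prob(best_prev, t, x, i))
            pointers[t] = best_prev
        back.append(pointers)
        delta = new_delta
    # Follow back-pointers from the best final tag.
    seq = [max(tags, key=lambda t: delta[t])]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq))

print(viterbi(["the", "dog", "runs"]))  # alternating tags under this toy model
```

The same dynamic program works unchanged whether the local scores come from a generative HMM or from a discriminative classifier, which is the point of the MEMM construction.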

Conditional Random Fields

Conditional random fields (CRFs) offer a probabilistic framework for labeling and segmenting structured data (examples: predicting POS tags for sentences, information extraction, syntactic disambiguation, aligning biological sequences, etc.). CRFs are basically undirected graphical models that define a log-linear distribution over label sequences given a particular observation sequence: p(y|x,w) = \frac{1}{Z_{x;w}}\exp[w^T\phi(x,y)], where the normalization constant Z_{x;w} = \sum_{y' \in \mathcal{Y}}\exp[w^T\phi(x,y')] is a sum over all hypothesized outputs.
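To make the normalizer concrete, here is a brute-force sketch that computes Z_{x;w} by enumerating every label sequence for a tiny input. The `score` function (standing in for w^T\phi(x,y)) and the label set are made up for illustration; real CRFs never enumerate like this, but it shows exactly what Z sums over:

```python
import itertools
import math

LABELS = ["A", "B"]

def score(x, y):
    # Hypothetical w^T phi(x, y): reward label "A" on even positions,
    # plus a small reward for adjacent labels that agree. Invented features.
    s = sum(1.0 for i, t in enumerate(y) if t == "A" and i % 2 == 0)
    s += sum(0.5 for a, b in zip(y, y[1:]) if a == b)
    return s

def crf_prob(x, y, labels=LABELS):
    # Z_{x;w}: sum of exp(score) over ALL |labels|^len(x) candidate sequences.
    Z = sum(math.exp(score(x, yp))
            for yp in itertools.product(labels, repeat=len(x)))
    return math.exp(score(x, y)) / Z

x = ["w1", "w2", "w3"]
probs = {y: crf_prob(x, y) for y in itertools.product(LABELS, repeat=len(x))}
# The exp-scores divided by Z form a proper distribution over the 2^3 sequences.
assert abs(sum(probs.values()) - 1.0) < 1e-9
```

Since Z grows as |labels|^n, practical CRF implementations exploit the chain structure to compute it with a forward-style recursion instead of this enumeration.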

Very much like maximum entropy Markov models (MEMMs), which generalize logistic regression to sequences, CRFs are discriminative (aka conditional) models. The conditional nature makes it possible to incorporate richer representations of the observations, model overlapping features, and capture long-range dependencies in the observations, all while keeping inference tractable. Such properties of the observations are very hard to capture in generative models such as HMMs, and even when one can, inference becomes intractable. Conditional models such as CRFs (and MEMMs) circumvent this difficulty by directly modeling the probability distribution of the quantity we care about (in this case, the label sequence), without having to make modeling assumptions about the observations.

CRFs bring all the benefits of MEMMs, plus two additional advantages: 1) CRFs can be applied to structures other than sequences, so they can be used to label general graph structures where nodes represent labels and edges represent dependencies between labels. 2) CRFs give a principled way to deal with the label-bias problem faced by MEMMs (or any conditional model that does per-state normalization). MEMMs normalize each state's conditional distribution given the observation separately. This biases the model toward states with few outgoing transitions and, in the extreme case, can cause states to ignore the observations entirely. CRFs deal with this issue by enforcing *per-sequence* normalization rather than per-state normalization (note, however, that there has been some prior work addressing the label-bias problem within the MEMM framework too). Maximum-likelihood learning in CRFs is harder as a result: the nasty global normalization term is more difficult to take the gradient of, but a straightforward modification of the HMM forward-backward algorithm comes to the rescue.
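A minimal sketch of that forward-style recursion, assuming a made-up local potential in place of real learned scores: it computes log Z_{x;w} for a linear chain in O(n |T|^2) time and checks the result against the exponential brute-force sum:

```python
import itertools
import math

TAGS = ["A", "B"]

def log_potential(prev_tag, tag, x, i):
    # Hypothetical local score w^T phi(prev_tag, tag, x, i); invented values.
    return 0.7 if tag == "A" else 0.3

def forward_log_z(x, tags=TAGS, start="<s>"):
    # alpha[t] = log of the summed exp-scores of all prefixes ending in tag t.
    alpha = {t: log_potential(start, t, x, 0) for t in tags}
    for i in range(1, len(x)):
        alpha = {t: math.log(sum(math.exp(alpha[p] + log_potential(p, t, x, i))
                                 for p in tags))
                 for t in tags}
    return math.log(sum(math.exp(a) for a in alpha.values()))

def brute_force_log_z(x, tags=TAGS, start="<s>"):
    # Naive version: sum exp(total score) over all |tags|^len(x) sequences.
    total = 0.0
    for y in itertools.product(tags, repeat=len(x)):
        s = log_potential(start, y[0], x, 0)
        s += sum(log_potential(y[i - 1], y[i], x, i) for i in range(1, len(y)))
        total += math.exp(s)
    return math.log(total)

x = ["w1", "w2", "w3"]
assert abs(forward_log_z(x) - brute_force_log_z(x)) < 1e-9
```

The same alpha (and its backward counterpart) also yields the edge marginals needed for the gradient of the CRF log-likelihood, which is why the forward-backward modification mentioned above suffices for training.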

Additional Optional Reading

See Structured Prediction on Braque.

Past Semesters
