MLRG/spring10
CS7941: Topics in Machine Learning
Time: Thursdays, 10:45-12:05
Location: MEB 3105, except as noted
Topic: Structured Prediction
Expectations
Each week will consist of two related papers, and there will be two presenters. Each presenter is to (a) advocate their approach and (b) critique the other approach (kindly... especially if it's my paper!). The presentations will go roughly as follows:
- 20 mins for presenter A to talk about paper A
- 20 mins for presenter B to talk about paper B
- 10 mins for presenter A to critique paper B
- 10 mins for presenter B to critique paper A
- 20 mins open discussion trying to reach a resolution
The presenters should write short summaries ahead of time (by 11:59pm the day before the meeting) about their papers, posted at the bottom of this wiki.
You may take this course for 1 credit or 2 credits. If you take it for one credit, you only need to do what's above. If you take it for two credits, then you additionally need to help with the development of an implementation of Searn on top of John Langford's VW engine.
Participants
- Hal Daumé III, Assistant Professor, School of Computing
Schedule
Please do not sign up until the semester starts, except for day 1... we need to give "the youth" a chance to sign up before senior folks do!
| Date | Papers | Presenter |
|---|---|---|
| Introduction to Structured Prediction | ||
| 14 Jan | Maximum Entropy Markov Models versus Conditional Random Fields | Hal vs Piyush |
| 21 Jan | M3Ns versus SVMstruct | _ vs _ |
| 28 Jan | Incremental Perceptron versus Searn | _ vs _ |
| 04 Feb | Dependency Estimation versus Density Estimation | _ vs _ |
| Theory | ||
| 11 Feb | Generalization Bounds versus Structure Compilation | _ vs _ |
| 18 Feb | Negative Results versus Positive Results | _ vs _ |
| 25 Feb | Learning with Hints versus Multiview Learning | _ vs _ |
| Dynamic Programming and Search | ||
| 04 Mar | Factorie versus Compiling Comp Ling | _ vs _ |
| 11 Mar | Minimum Spanning Trees versus Matrix-Tree Theorem | _ vs _ |
| 18 Mar | Contrastive Divergence versus Contrastive Estimation | _ vs _ |
| 01 Apr | Searching over Partitions versus End-to-end Machine Translation | _ vs _ |
| Inverse Optimal Control (aka IRL) | ||
| 08 Apr | Learning to Search versus Apprenticeship Learning | _ vs _ |
| 15 Apr | MaxEnt IRL versus Apprenticeship RL | _ vs _ |
| 22 Apr | IO Heuristic Control versus Parsing with IOC | _ vs _ |
Paper Summaries
14 Jan: MEMMs and CRFs (Hal and Piyush)
Maximum Entropy Markov Models
Coming soon..
Conditional Random Fields
Conditional random fields (CRFs) offer a probabilistic framework for labeling and segmenting structured data (examples: predicting POS tags for sentences, information extraction, syntactic disambiguation, aligning biological sequences, etc.). CRFs are basically undirected graphical models that define a log-linear distribution over label sequences given a particular observation sequence: p(y|x,w) = \frac{1}{Z_{x;w}}\exp[w^T\phi(x,y)], where the normalization constant Z_{x;w} = \sum_{y' \in \mathcal{Y}}\exp[w^T\phi(x,y')]. The normalization constant here is a sum over all hypothesized outputs.
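To make the definition above concrete, here is a toy sketch (not from the paper) that computes p(y|x,w) by brute-force enumeration of Z_{x;w} over all label sequences; the label set and the feature map `phi` are made-up choices for illustration.

```python
# Brute-force CRF probability on a tiny label space, directly following
# p(y|x,w) = exp(w^T phi(x,y)) / Z_{x;w}. Toy example only.
import itertools
import numpy as np

LABELS = ["DET", "NOUN", "VERB"]

def phi(x, y):
    """Toy joint feature vector: simple word-label and label-bigram features."""
    feats = np.zeros(len(LABELS) * 2 + len(LABELS) ** 2)
    for t, (word, label) in enumerate(zip(x, y)):
        i = LABELS.index(label)
        feats[i] += word.istitle()                    # capitalized word under this label
        feats[len(LABELS) + i] += word.endswith("s")  # plural-ish suffix under this label
        if t > 0:                                     # label-bigram feature
            j = LABELS.index(y[t - 1])
            feats[2 * len(LABELS) + j * len(LABELS) + i] += 1
    return feats

def crf_prob(x, y, w):
    """p(y|x,w) with Z_{x;w} computed by explicit enumeration of all label sequences."""
    score = np.exp(w @ phi(x, y))
    Z = sum(np.exp(w @ phi(x, yp)) for yp in itertools.product(LABELS, repeat=len(x)))
    return score / Z

x = ["The", "dogs", "bark"]
w = np.random.randn(len(LABELS) * 2 + len(LABELS) ** 2) * 0.1
print(crf_prob(x, ("DET", "NOUN", "VERB"), w))
```

The explicit enumeration makes the exponential cost of Z visible; the forward recursion sketched further below avoids it for chain-structured problems.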
Very much like maximum entropy Markov models (MEMMs), which generalize logistic regression to sequences, CRFs are discriminative (aka conditional) models. The conditional nature makes it possible to incorporate richer representations of the observations, model overlapping features, and capture long-range dependencies in the observations, while keeping inference tractable. Such properties of the observations are very hard to capture in generative models such as HMMs, and even when they can be, inference in such models becomes intractable. Conditional models such as CRFs (and MEMMs) circumvent this difficulty by directly modeling the probability distribution of the quantity we care about (in this case, the label sequence), without having to make modeling assumptions about the observations.
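To make the "richer observation representation" point concrete, here is a small hypothetical example (feature names invented for illustration) of the overlapping and long-range observation features a conditional model can use freely, since x is only conditioned on and never generated:

```python
# Hedged illustration of overlapping / long-range observation features for a
# conditional model. These would be awkward to fold into an HMM's generative
# story, but a CRF or MEMM can simply score them.
def observation_features(x, t, label):
    """Features for position t of sentence x under a hypothesized label."""
    word = x[t]
    return {
        f"word={word.lower()}_label={label}": 1.0,
        f"is_capitalized_label={label}": float(word.istitle()),
        f"suffix3={word[-3:]}_label={label}": 1.0,                # overlaps with word identity
        f"sentence_has_said_label={label}": float("said" in x),   # long-range: looks at the whole x
        f"next_word={x[t+1].lower() if t + 1 < len(x) else '</s>'}_label={label}": 1.0,
    }

print(observation_features(["Utah", "researchers", "said", "hello"], 0, "NOUN"))
```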
CRFs bring in all the benefits of MEMMs and offer two additional advantages: 1) CRFs can also be applied to structures other than sequences, so they can be used to label graph structures where nodes represent labels and edges represent dependencies between labels. 2) CRFs give a principled way to deal with the label-bias problem faced by MEMMs (or any conditional model that does per-state normalization). MEMMs normalize the conditional distribution of each state given the observation on a per-state basis. This causes a bias towards states that have very few outgoing transitions and, in extreme cases, may even result in states completely ignoring the observations. CRFs deal with this issue by enforcing *per-sequence* normalization as opposed to per-state normalization (note, however, that there has been [http://www.gatsby.ucl.ac.uk/~ywteh/research/newcost/icml2002.pdf some prior work] that addresses the label-bias problem within the MEMM framework too). Maximum-likelihood learning in CRFs becomes harder under per-sequence normalization: the normalization term is awkward to take the gradient of, but a straightforward modification of the forward-backward algorithm for HMMs comes to the rescue here.
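As a sketch of why per-sequence normalization stays tractable for linear chains, here is the standard forward recursion (not the authors' code) that computes log Z_{x;w} in O(TK^2) time; the per-position score matrix `emit` and the label-transition matrix `trans` are assumed inputs for the example, standing in for the w^T\phi scores.

```python
# Minimal forward recursion for the log-partition function of a linear-chain CRF.
import itertools
import numpy as np
from scipy.special import logsumexp

def forward_log_partition(emit, trans):
    """log Z_{x;w} for a linear-chain CRF: emit is (T, K), trans is (K, K)."""
    T, K = emit.shape
    alpha = emit[0].copy()                                     # log alpha_1(y)
    for t in range(1, T):
        # log alpha_t(y) = logsumexp_{y'} [alpha_{t-1}(y') + trans(y', y)] + emit_t(y)
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[t]
    return logsumexp(alpha)

# Sanity check against brute-force enumeration on a tiny problem.
T, K = 4, 3
rng = np.random.default_rng(0)
emit, trans = rng.normal(size=(T, K)), rng.normal(size=(K, K))
brute = logsumexp([sum(emit[t, y[t]] for t in range(T))
                   + sum(trans[y[t - 1], y[t]] for t in range(1, T))
                   for y in itertools.product(range(K), repeat=T)])
print(np.allclose(forward_log_partition(emit, trans), brute))  # True
```

The same forward pass, combined with a symmetric backward pass, gives the marginal expectations needed for the gradient of the log-likelihood.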
More to come..
Additional Optional Reading
See Structured Prediction on Braque.
Past Semesters
- Fall 2009 (Online Learning)
- Summer 2009 (Assorted Topics)
- Spring 2009 (Learning Theory)
- Fall 2008 (Graphical Models)
- Summer 2008 (Multitask Learning)
- Spring 2008 (Kernel Methods)
- Fall 2007 (Unsupervised Bayes)
- Summer 2007 (Manifolds)