MLRG/spring10
CS7941: Topics in Machine Learning
Time: Thursdays, 10:45-12:05
Location: MEB 3105, except as noted
Topic: Structured Prediction
Expectations
Each week will consist of two related papers, and there will be two presenters. Each presenter is to (a) advocate their approach and (b) critique the other approach (kindly... especially if it's my paper!). The presentations will go roughly as follows:
- 20 mins for presenter A to talk about paper A
- 20 mins for presenter B to talk about paper B
- 10 mins for presenter A to critique paper B
- 10 mins for presenter B to critique paper A
- 20 mins open discussion trying to reach a resolution
The presenters should write short summaries ahead of time (by 11:59pm the day before the meeting) about their papers, posted at the bottom of this wiki.
You may take this course for 1 credit or 2 credits. If you take it for one credit, you only need to do what's above. If you take it for two credits, then you additionally need to help with the development of an implementation of Searn on top of John Langford's VW engine.
Participants
- Hal Daumé III, Assistant Professor, School of Computing
- Jagadeesh Jagarlamudi, PhD Student, School of Computing
- Lalindra De Silva, PhD Student, School of Computing
- Piyush Rai, PhD Student, School of Computing
- Zhan Wang, PhD Student, School of Computing
- Thanh Nguyen, PhD Student, School of Computing
- Amit Goyal, PhD Student, School of Computing
- Adam R. Teichert, MS Student, School of Computing
- Ruihong Huang, PhD Student, School of Computing
- Kristina Doing-Harris, Post Doctoral Fellow, Biomedical Informatics Department
- Jiarong Jiang, PhD Student, School of Computing
- Youngjun Kim, PhD Student, School of Computing
- Seth Juarez, PhD Student, School of Computing
- Sandeep P, MS Student, School of Computing
Schedule
Please do not sign up until the semester starts, except for day 1... we need to give "the youth" a chance to sign up before senior folks do!
| Date | Papers | Presenter |
|---|---|---|
| Introduction to Structured Prediction | ||
| 14 Jan | Maximum Entropy Markov Models versus Conditional Random Fields | Hal vs Piyush |
| 21 Jan | M3Ns versus SVMstruct | _ vs _ |
| 28 Jan | Incremental Perceptron versus Searn | Thanh vs Clifton |
| 04 Feb | Dependency Estimation versus Density Estimation | Zhan vs Amit |
| Theory | ||
| 11 Feb | Generalization Bounds versus Structure Compilation | Avishek vs Jags |
| 18 Feb | Negative Results versus Positive Results | Seth vs Kris |
| 25 Feb | Learning with Hints versus Multiview Learning | Ruihong vs Adam |
| Dynamic Programming and Search | ||
| 04 Mar | Factorie versus Compiling Comp Ling | Jiarong vs Youngjun |
| 11 Mar | Maximum Spanning Trees versus Matrix-Tree Theorem | Adam vs Jags |
| 18 Mar | Contrastive Divergence versus Contrastive Estimation | Arvind vs Amit |
| 01 Apr | Searching over Partitions versus End-to-end Machine Translation | Ruihong vs Matthew O. |
| Inverse Optimal Control (aka IRL) | ||
| 08 Apr | Learning to Search versus Apprenticeship Learning | Kris vs Lalindra |
| 15 Apr | MaxEnt IRL versus Apprenticeship RL | Zhan vs Abhishek |
| 22 Apr | IO Heuristic Control versus Parsing with IOC | Jiarong vs Jiarong(if no one else) |
Paper Summaries
14 Jan: MEMMs and CRFs (Hal and Piyush)
Maximum Entropy Markov Models
Maximum entropy Markov models are essentially HMMs where the p(x_n|y_n)*p(y_n|y_{n-1}) has been replaced with a general (conditional) exponential model. Another way of thinking of it is to train "classifiers" to predict p(y_n | y_{n-1}, x). Note that this can depend on all of x, not just x_n. This gives major advantages because we can now throw in arbitrary overlapping features, without having to worry about killing ourselves with the naive Bayes assumption that we're forced to make in the generative model. The transition probabilities can also be estimated with classifiers. All this, and the Viterbi algorithm still works for decoding and the forward-backward algorithm still works for computing expectations (useful for un- or semi-supervised learning, or learning with latent variables). Training is straightforward: you just train a maxent classifier, predicting, for instance, y_10 given x and y_9 for all of the training examples. Test is also straightforward: you run Viterbi on the (probabilistic) output of the classifier. (In the original formulation, they trained separate transition models for each "previous state", but this is a bad idea and no one does it any more.) Note that the estimated distribution provides a locally-conditioned transition model, and locally-conditioned classifier, which makes training much more efficient (the normalizing constant only has to sum over #tags possibilities). The typical way in which features are defined are as functions of (previous tag, current tag, entire input) and then a single giant weight vector is learned over these features.
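As a concrete sketch of decoding (nothing here is from the paper itself): assume a trained local classifier local_prob(prev_tag, x, n, tag) that returns p(y_n = tag | y_{n-1} = prev_tag, x); Viterbi then just combines these locally normalized scores in log space.
```python
import numpy as np

def viterbi_memm(x, tags, local_prob, start_tag="<s>"):
    """Decode argmax_y prod_n p(y_n | y_{n-1}, x) for a trained MEMM.

    local_prob(prev_tag, x, n, tag) is an assumed interface to the trained maxent
    classifier p(y_n = tag | y_{n-1} = prev_tag, x); it is a placeholder here.
    """
    N, T = len(x), len(tags)
    delta = np.full((N, T), -np.inf)    # best log-score of a tag prefix ending in tag t at position n
    back = np.zeros((N, T), dtype=int)  # backpointers
    for t, tag in enumerate(tags):
        delta[0, t] = np.log(local_prob(start_tag, x, 0, tag))
    for n in range(1, N):
        for t, tag in enumerate(tags):
            scores = [delta[n - 1, s] + np.log(local_prob(tags[s], x, n, tag)) for s in range(T)]
            back[n, t] = int(np.argmax(scores))
            delta[n, t] = scores[back[n, t]]
    best = [int(np.argmax(delta[N - 1]))]           # best final tag
    for n in range(N - 1, 0, -1):                   # follow backpointers
        best.append(back[n, best[-1]])
    return [tags[t] for t in reversed(best)]
```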
Conditional Random Fields
Conditional random fields (CRFs) offer a probabilistic framework for labeling and segmenting structured data (examples: predicting POS tags for sentences, information extraction, syntactic disambiguation, aligning biological sequences, etc.) . CRFs are basically undirected graphical models that define a log-linear distribution over label sequences given a particular observation sequence: p(y|x,w) = \frac{1}{Z_{x;w}}\exp[w^T\phi(x,y)], where the normalization constant Z_{x;w} = \sum_{y' \in \mathcal{Y}}\exp[w^T\phi(x,y')]. The normalization constant here is basically a sum over all hypothesized outputs.
Very much like maximum entropy Markov models (MEMMs), which are a generalization of logistic regression models to sequences, CRFs too are discriminative (aka conditional) models. The conditional nature makes it possible to incorporate richer representations of the observations, model overlapping features, and capture long-range dependencies in the observations, while at the same time ensuring that inference remains tractable. Such properties of the observations are very hard to capture in generative models such as HMMs, and even if one could, inference becomes intractable in such models. Conditional models such as CRFs (and MEMMs) circumvent this difficulty by directly modeling the probability distribution of the quantity we care about (in this case, the label sequence), without having to bother with modeling assumptions about the observations.
CRFs bring in all the benefits of MEMMs, except for two additional advantages: 1) CRFs can also be applied to structures other than sequences. So they can be used to label graph structures where nodes represent labels and edges represent dependencies between labels. 2) CRFs give a principled way to deal with the label-bias problem faced by MEMMs (or any conditional model that does per state normalization). MEMM uses per state normalization in the conditional distributions of the state given the observation. This causes a bias towards states which have very few outgoing transitions and, in the extreme cases, may even result in states completely ignoring the observations. CRF deals with this issue by enforcing a *per-sequence* normalization as opposed to the per-state normalization (but note however that there has been some prior work which addresses the label-bias problem in the MEMM framework too). Maximum-likelihood learning in CRFs becomes harder with such an assumption. Basically, the nasty normalization term becomes difficult to take gradient of, but a straightforward modification of the forward-backward algorithm of HMMs comes in to rescue us here.
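For contrast with the per-state normalization above, here is a minimal sketch of the per-sequence normalization: the forward recursion that computes log Z_{x;w} for a linear-chain CRF. The emission and transition score arrays are our assumed precomputation of w^T\phi, not anything specified by the paper.
```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(emit, trans):
    """log Z(x; w) for a linear-chain CRF via the forward recursion.

    emit[n, t]  : score w . phi for labeling position n with tag t   (assumed precomputed)
    trans[s, t] : score w . phi for the tag transition s -> t        (assumed precomputed)
    The sum ranges over whole label sequences, which is exactly the per-sequence
    normalization that avoids the label-bias problem of per-state normalization.
    """
    N, T = emit.shape
    alpha = emit[0].copy()                                   # log-weights of length-1 prefixes
    for n in range(1, N):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[n]
    return float(logsumexp(alpha))
```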
21 Jan: M3N and SVMStruct
M3N
SVMStruct
28 Jan: Incremental Perceptron and Searn (Thanh and Clifton)
Incremental Perceptron
Many structured prediction algorithms deal with the argmax problem y^ = \argmax_{y \in GEN(x)}{w^T*\phi(x,y)}. So does the structured perceptron (Collins 2002), which makes the significant assumption that \phi admits efficient search over Y (using dynamic programming). In the general case, the argmax might not be tractable. Two possible solutions mentioned are reranking (Collins 2000) and the incremental perceptron (this paper, Collins 2004). Reranking first produces an "n-best" list of outputs and uses a second model to pick an output from this n-best list, which avoids the intractable argmax problem. Reranking supports flexible loss functions, but it is inefficient and assumes reasonable performance from the baseline model.
As an alternative approach, the incremental perceptron, a variant of the structured perceptron, deals with the case where the argmax may not be analytically available. It uses heuristic methods for finding the argmax, replacing the exact argmax with incremental beam search (which considers a much smaller set of candidates): y^ = \argmax_{y \in F_n}{w^T*\phi(x,y)}. The paper introduces two refinements: "repeated use of hypotheses" and "early update". The first refinement maintains a cache of examples and repeatedly iterates over them, updating the model whenever the gold-standard parse is not the best-scoring parse among the stored candidates (it dynamically generates constraints, i.e. incorrect parses, and reuses them, whereas the original algorithm only looks at one constraint per sentence and is extremely wasteful with the constraints implied by previously parsed sentences). Early update aborts the search as soon as it detects that an error has been made, rather than allowing the parser to continue to the end of the sentence; this gives less noisy input to the parameter estimation algorithm and also improves efficiency.
The incremental perceptron thus handles the case where the argmax is not analytically available; it is simple, supports only 0/1 loss, and places minimal requirements on Y and \phi. Note that its incremental beam search is only a heuristic: there is no guarantee that the procedure will find the highest-scoring parse, and search errors occur when \argmax_{h \in F_n}{w^T*\phi(h)} \ne \argmax_{h \in H_n}{w^T*\phi(h)}. Therefore, the improvement has to be demonstrated empirically.
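A hedged sketch of the idea (the extend and phi helpers are hypothetical stand-ins for a concrete parser or tagger, and gold is assumed to be a sequence of output symbols):
```python
def incremental_perceptron_step(x, gold, w, phi, extend, beam_size=5):
    """One training step of an incremental (beam-search) perceptron with early update.

    extend(partial) is assumed to enumerate the one-step extensions of a partial output;
    phi(x, partial) is an assumed feature map returning a numpy vector.
    Returns the (possibly updated) weight vector.
    """
    beam = [()]                                              # start from the empty partial output
    for n in range(len(x)):
        candidates = [c for p in beam for c in extend(p)]
        candidates.sort(key=lambda c: w.dot(phi(x, c)), reverse=True)
        beam = candidates[:beam_size]
        gold_prefix = tuple(gold[: n + 1])
        if gold_prefix not in beam:                          # early update: search has already erred
            return w + phi(x, gold_prefix) - phi(x, beam[0])
    if beam[0] != tuple(gold):                               # ordinary perceptron update at the end
        return w + phi(x, tuple(gold)) - phi(x, beam[0])
    return w
```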
Searn
4 Feb: Dependency Estimation and Density Estimation (Zhan and Amit)
Dependency Estimation
Density Estimation
11 Feb: Generalization Bounds versus Structure Compilation (Avishek and Jags)
Generalization Bounds
The aim of traditional (non-Bayesian) generalization theory is to formally bound the expected test error, given the empirical training error and some additional terms (which depend on the complexity of the hypothesis space). These traditional bounds can in many cases be further improved by careful refinement using PAC-Bayesian analysis; the high-level idea is to formally incorporate a notion of prior belief that constrains the hypothesis space, resulting in tighter bounds.
This paper provides PAC-Bayesian generalization bounds and consistency results for structured prediction. The generalization results demonstrate the efficacy of using Hamming distortions (Theorem 2) over 0-1 loss functions (Theorem 1). The proofs of the theorems follow from the PAC-Bayesian theorem and standard concentration arguments that bound the magnitude of the margin to a small value with high probability. In addition, these bounds also demonstrate the advantages of using Hamming distances in the training algorithm irrespective of the loss function used at test time. At this point it is worth noting that both Theorem 1 and Theorem 2 are non-convex in their parameters.
The above results have been proved for logistic regression, which usually applies to binary classification. SVMs, a popular non-probabilistic alternative to logistic regression, have previously been extended to multiclass structured prediction, where the different multiclass SVMs vary in their formulation of the multiclass hinge loss. This paper shows that the different notions of multiclass hinge loss are non-convex in their parameters. Thereafter, these loss functions are appropriately modified to achieve convexity, but in the process they lose consistency. The paper also shows that the initial non-convex bounds (Theorems 1 and 2) are actually consistent, as they converge to the optimal classifier in the limit of infinite training data, thereby suggesting that for structured prediction non-convex objectives might result in superior (consistent) performance compared to their convex counterparts.
Finally, these bounds are generalized to non-linear cases and to a generalized version of "parts", a notion that captures the decoding specifics of structured prediction.
Structure Compilation
Structure compilation is motivated by the need for a quick response at test time, and it is willing to spend more training time as a trade-off. Structured models like MEMMs and CRFs consider the interactions among the labels of neighboring nodes (for example, the POS tags of adjacent words in POS tagging), whereas ILR (Independent Logistic Regression) decides each node's label independently of the others and hence is faster. Structure compilation therefore uses ILR at test time. To make up for the information lost by ignoring the interactions among labels, structure compilation uses more features (an expanded feature set).
It assumes the availability of a small labeled corpus and a considerable amount of unlabeled data. The main idea is the following: first train the CRF on the small labeled data and use this model to predict labels for the unlabeled data. Then use both the original labeled data and the newly auto-labeled data to train the ILR model (with expanded features). This ILR model is the one used at test time.
The reason for needing a large amount of unlabeled data is simple: ILR tends to overfit on a small corpus, so to prevent overfitting it needs to be trained on larger data sets. Theoretical analysis and experiments show that, when the original labeled data is small, structure compilation catches up to the performance of the CRF.
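A minimal sketch of the pipeline, assuming a CRF object with a fit/predict interface and a hypothetical expand_features helper (neither is from the paper):
```python
from sklearn.linear_model import LogisticRegression

def compile_structure(crf, labeled, unlabeled, expand_features):
    """Sketch of structure compilation.

    labeled   : list of (tokens, tags) pairs (the small labeled corpus)
    unlabeled : list of token sequences (the large unlabeled corpus)
    expand_features(tokens, i) : assumed per-token feature expansion (e.g. surrounding
    words), meant to compensate for dropping the label interactions.
    """
    crf.fit(labeled)                                              # 1. train the structured model
    auto = [(toks, crf.predict(toks)) for toks in unlabeled]      # 2. auto-label the unlabeled corpus
    X, y = [], []
    for toks, tags in list(labeled) + auto:                       # 3. train per-token logistic regression
        for i, tag in enumerate(tags):
            X.append(expand_features(toks, i))
            y.append(tag)
    ilr = LogisticRegression(max_iter=1000)
    ilr.fit(X, y)                                                 # the fast model used at test time
    return ilr
```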
18 Feb: Positive Results versus Negative Results (Kris and Seth)
Positive Results
In structural SVMs we want to solve the quadratic program \min_{w, \xi \ge 0} \frac{1}{2}||w||^2 + \frac{C}{n}\sum_i \xi_i subject to, for every i and every y in Y \setminus \{y_i\}: w^T\psi(x_i,y_i) >= w^T\psi(x_i,y) + \delta(y_i,y) - \xi_i. Solving this with the cutting-plane algorithm requires a separation oracle that can determine the most violated constraints to a desired precision ε. The existence of a separation oracle that can be computed in polynomial time yields three theoretical guarantees: polynomial-time termination, correctness, and an empirical risk bound. There are two points at which one must determine the argmax over y in Y: during prediction (w^T\psi(x,y)) and while generating output from the separation oracle (w^T\psi(x,y) + \delta(y_i,y)). It is sometimes impossible to determine this argmax exactly.
The authors demonstrate two classes of approximate inference which retain the theoretical guarantees and thus can be used when exact inference is intractable. At a high level, these algorithms draw new linear separators, which are fewer in number than the exact data requires. The first class of algorithms uses Undergenerating Approximations, which are drawn from a subset of the true output set Y. They restrict their approximations to within \epsilon of the true optimum. The authors demonstrate that an undergenerating approximation algorithm will still terminate in polynomial time, is correct to within a factor of 1/\rho, and that the risk bound is only increased by a factor of (1-\rho). The second class of algorithms uses Overgenerating Approximations, which are drawn from a superset of the true output set Y. The linear separators defined by these algorithms include all of the original in-class examples (y=1), but some of the original out-of-class examples (y=0) are mapped to a third class. In their examples this third class is designated by y=1/2 (LProg) and by y=null (Cut). LProg and Cut both have Equivalence (maximizing solutions are transmutable) and Persistence (unambiguous labels are optimal labels) as well as the three theoretical guarantees.
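Their experiments include multi-label classification with pairwise label interactions, where exact loss-augmented inference is intractable. As a concrete (hedged) example of an undergenerating approximation, here is a greedy separation oracle that searches for argmax_y w^T\psi(x,y) + \delta(y_i,y) by single-label flips; the unary/pairwise score arrays are our assumed parameterization, not necessarily the paper's.
```python
import numpy as np

def greedy_separation_oracle(unary, pairwise, y_true, max_passes=10):
    """Greedy undergenerating separation oracle for pairwise multi-label prediction.

    unary[j]       : model score for turning label j on
    pairwise[j, k] : extra score when labels j and k are both on (assumed symmetric, zero diagonal)
    y_true         : 0/1 numpy vector; Hamming loss is used as delta for loss augmentation.
    Only labelings reachable by single flips are explored, so the returned labeling is a
    (possibly suboptimal) member of the true output set Y, i.e. an undergenerating approximation.
    """
    def score(y):
        return unary @ y + 0.5 * y @ pairwise @ y + np.sum(y != y_true)

    y = y_true.copy()
    for _ in range(max_passes):
        improved = False
        for j in range(len(unary)):
            flipped = y.copy()
            flipped[j] = 1 - flipped[j]
            if score(flipped) > score(y):
                y, improved = flipped, True
        if not improved:
            break
    return y
```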
During testing of the approximations, the authors found that Loopy Belief Propagation [LBP] tended to fall into “horrible” local maxima. This finding is probably related to the discussion of the shortcomings of approximation algorithms in the following paper. Notably, this failure did not occur when LBP was combined with Greedy search.
They present losses for each combination of separation oracle used in learning and of predictive inference procedure used in classification for all five approximations and the Exact. For the Scene and Yeast Datasets the model trained with LProg is the best or one of the best performers regardless of method used at test. For the Synth1 Dataset LBP, Greedy and their Combine perform best. For the Reuters Dataset Combine used at training outperforms Exact regardless of the method used at test. The same is true for the Mediamill and Synth2 Datasets, if LProg is used at training. Their overall preference is for the Overgenerating Approximations because training with them results in a model that can use either approximation for prediction, while training with an Undergenerating Approximation will cause a relaxed inference to yield too many ambiguous labelings.
25 Feb: Learning with Hints and Multi-view Learning (Ruihong and Adam)
Learning with Hints
Multi-View Learning
The general goal of multi-view learning is to leverage redundant views of the same data. For example, if we have two different, pretty good ways of predicting a label for a new example, when we train the two models, we can avoid over-fitting by looking for parameters for the two models which not only have low training error but which also tend to agree with each other when they try to label available unlabeled training data. So, in short, we can use the amount that the two models would disagree when labeling unlabeled data as a "regularizer"---a penalty term in our calculation of expected loss which encourages our two models not to overfit.
So, rather than using something like "co-training" to explicitly add co-labeled examples to the training data when multiple models agree on how to label the example, the paper today by Ganchev et al. uses the agreement between two models on unlabeled data as a regularizer when training the models.
Our co-regularized objective, then, is the sum of the independent expected loss of each model, plus the expected disagreement between the models on unlabeled training data. They measure this disagreement using the Bhattacharyya distance, which they claim is a natural measure of distance between distributions. It also leads to a slightly more general form for the agreement regularizing term which becomes a minimization of the KL divergence between some agreement joint distribution over label pairs and the product of the individual distributions over labels.
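A schematic form of the co-regularized objective (our notation, a sketch rather than the paper's exact formulation): \min_{\theta_1,\theta_2} L_1(\theta_1) + L_2(\theta_2) + c \sum_{x \in U} [ -\log \sum_y \sqrt{p_1(y|x;\theta_1)\, p_2(y|x;\theta_2)} ], where U is the unlabeled set and the bracketed term is the Bhattacharyya distance between the two views' predictive distributions on x; it is zero when the two models agree exactly and grows as they diverge.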
The authors also give an EM style algorithm for optimizing the regularized loss and show how the method can be used when you only expect partial agreement (like when there are two different tagsets).
4 Mar: Factorie versus Compiling Comp Ling (Jiarong and Youngjun)
FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs
FACTORIE is a software library for representing factor graphs in a way that preserves their statistical semantics. It can combine declarative and procedural domain knowledge and performs efficient inference and learning over large and complex graphs. It does well in both speed and accuracy.
A factor graph is a bipartite graph representing the factorization of a function. For inference and imperative constraint preservation, FACTORIE uses a DiffList that records the variables changed by a proposal step, with undo and redo capabilities. It also provides methods to compute proposal distributions and uses property-preserving proposal functions to avoid deterministic factors. The imperative structure definition includes unroll methods, which find a factor's other variable neighbors given a single variable from the DiffList; this also allows the implementation of factor templates. For imperative variable coordination, they introduce a value-assignment method to quickly assign values without inference. They support a separation between factor neighbors and sufficient statistics and allow a user-defined function to perform the variable-to-statistics mapping.
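To make the DiffList idea concrete, here is a toy Python rendering of the mechanism (FACTORIE itself is a Scala library; none of the names below are its actual API):
```python
class Variable:
    """A mutable variable in the factor graph."""
    def __init__(self, value):
        self.value = value

class Diff:
    """Records one variable's old and new value so a change can be undone or redone."""
    def __init__(self, variable, new_value):
        self.variable, self.old_value, self.new_value = variable, variable.value, new_value
    def redo(self):
        self.variable.value = self.new_value
    def undo(self):
        self.variable.value = self.old_value

class DiffList(list):
    """Collects the Diffs made by one proposal step."""
    def set(self, variable, new_value):
        d = Diff(variable, new_value)
        d.redo()                      # apply the change immediately
        self.append(d)
    def undo(self):
        for d in reversed(self):
            d.undo()
    def redo(self):
        for d in self:
            d.redo()
```
A Metropolis-Hastings style sampler can apply a proposal through DiffList.set, score only the factors touching the changed variables (found through the unroll methods), and call undo() if the proposal is rejected.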
Compiling Comp Ling
The “parsing as deduction” framework has provided an elegant notation for specifying a variety of parsing algorithms, including algorithms for probabilistic or other semiring-weighted parsing. Weighted deduction is a powerful theoretical formalism that encompasses many NLP algorithms.
The authors generalize modern probabilistic parsing techniques to a broader class of weighted deductive algorithms and suggest that deductive inference should play the unifying role that regular expressions and FSMs have played elsewhere.
This paper proposes a declarative specification language, Dyna, that replaces Prolog’s Horn clauses with “Horn equations” over terms with values. The authors give a bottom-up “inside” algorithm for general semiring-weighted deduction, based on a prioritized agenda, and a general “outside” algorithm that correctly computes weight gradients even when the inside algorithm is pruned. They then discuss Dyna-to-Dyna program transformations and evaluate their first implementation of a Dyna-to-C++ compiler. They hope this compiler will facilitate EMNLP research.
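As one concrete instance of the semiring-weighted deduction that Dyna generalizes, here is a minimal inside (sum-product) algorithm for a CNF PCFG; the dictionary-based grammar encoding is our assumption. In Dyna notation the inner loop is roughly the single rule constit(X,I,K) += rewrite(X,Y,Z) * constit(Y,I,J) * constit(Z,J,K) (our paraphrase).
```python
from collections import defaultdict

def inside(words, lex, rules, root="S"):
    """Total probability of all parses of `words` under a CNF PCFG (inside algorithm).

    lex[(A, w)]      = p(A -> w)     -- assumed encoding of lexical rules
    rules[(A, B, C)] = p(A -> B C)   -- assumed encoding of binary rules
    Swapping (+, *) for (max, *) gives Viterbi; other semirings give other algorithms,
    which is the kind of generality the paper's agenda-based deduction engine targets.
    """
    n = len(words)
    chart = defaultdict(float)          # chart[(A, i, k)] = inside weight of A over words[i:k]
    for i, w in enumerate(words):
        for (A, word), p in lex.items():
            if word == w:
                chart[(A, i, i + 1)] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (A, B, C), p in rules.items():
                    if (B, i, j) in chart and (C, j, k) in chart:
                        chart[(A, i, k)] += p * chart[(B, i, j)] * chart[(C, j, k)]
    return chart[(root, 0, n)]
```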
11 Mar: Maximum Spanning Trees and Matrix-Tree Theorem (Adam and Jags)
Maximum Spanning Trees
Dependency trees link each word in a sentence to a single parent (another word) on which it most closely depends syntactically (for instance, determiners and adjectives would be children of the noun that they modify, and the main verb of a sentence would typically be the only child of the dummy root node). A "projective" dependency tree is one that doesn't have crossing edges---that is, one in which a node and its descendants are all found in a contiguous part of the sentence. Non-projective trees allow crossing edges.
Since all words in the sentence are required to participate in the tree, we can see that any dependency tree will be a spanning tree of the graph formed by connecting each pair of words with edges in both directions and by adding an edge from the dummy root node to each of the word nodes. Dependency parsing is just finding the best such spanning tree. By weighting edges using the dot product of a feature vector of the edge with some weight vector, parsing boils down to finding the Maximum Spanning Tree of the weighted graph, and training the parser becomes the task of learning a good weight vector from parsed data.
The paper outlines the Chu-Liu-Edmonds algorithm, a greedy, recursive algorithm that can run in O(n^2) time to find a maximum spanning tree and, thus, a non-projective dependency parse. The algorithm lets each node greedily choose its favorite parent (the one that gives the highest score). If the resulting graph is a tree, it is the maximum spanning tree. If it isn't a tree, there must be a cycle, so the algorithm instructs to find a cycle, collapse the nodes in the cycle to a single node, update the weights to and from the new node, and then recursively call the algorithm on the new graph.
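A hedged sketch of the first two steps (greedy head selection and cycle detection); the contraction and recursion that complete Chu-Liu-Edmonds are only indicated in the comments.
```python
import numpy as np

def greedy_heads(score):
    """score[h, d]: weight of the edge head h -> dependent d; node 0 is the artificial root.
    Each non-root node picks its highest-scoring head (first step of Chu-Liu-Edmonds)."""
    n = score.shape[0]
    head = np.zeros(n, dtype=int)
    for d in range(1, n):
        candidates = [h for h in range(n) if h != d]
        head[d] = max(candidates, key=lambda h: score[h, d])
    return head

def find_cycle(head):
    """Return the nodes of a cycle in the greedy graph, or None if it is already a tree.
    If a cycle is found, Chu-Liu-Edmonds contracts it to a single node, rescores the edges
    into and out of the contracted node, and recurses on the smaller graph."""
    n = len(head)
    for start in range(1, n):
        seen, v = set(), start
        while v != 0 and v not in seen:
            seen.add(v)
            v = head[v]
        if v != 0:                      # walked back onto a visited node: v lies on a cycle
            cycle, u = [v], head[v]
            while u != v:
                cycle.append(u)
                u = head[u]
            return cycle
    return None
```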
Training the weight vector used for scoring is done using two variants of the online Margin Infused Relaxed Algorithm (MIRA), which address the exponential number of constraints in the optimization. One approach, "single-best MIRA", changes the weights as little as possible on each update while guaranteeing that the score given to the new labeled example is at least as large as the score given to the highest-scoring possible dependency tree for the sentence plus the number of wrong parent assignments made in that tree. "Factored MIRA" updates the weights as little as possible while ensuring that the new weights score each incoming edge in the true tree at least 1.0 higher than any other incoming edge to the same node.
Experiments and results are also given in the paper.
Matrix-Tree Theorem
This paper proposes another solution for finding non-projective dependency trees (see the summary of the previous paper) for a given sentence. Though this solution is more computationally expensive than the previous one, it is more general and gives better results. Moreover, it enables global training on the entire training data rather than training in an online fashion (as in the averaged perceptron).
The parameter update step in many training algorithms involves computing the partition function (to compute the loss under the current weight vector) and marginal probabilities (for gradient-style updates). Unlike projective dependency trees, non-projective trees do not admit a dynamic-programming solution for computing these quantities. This paper uses the Matrix-Tree theorem and shows that the partition function and the marginals can be computed using the determinant and the inverse of a matrix, respectively.
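A hedged numpy sketch of the partition-function computation (this follows one common construction of the Laplacian; the exact treatment of the root varies between papers):
```python
import numpy as np

def log_partition(scores):
    """log Z over all dependency trees via the directed Matrix-Tree theorem.

    scores[h, d] = w . f(h, d) for the edge head h -> dependent d; node 0 is the root.
    Build the Laplacian of the exponentiated edge scores and take the determinant of the
    minor obtained by deleting the root's row and column.  The marginal edge probabilities
    come from the inverse of the same minor (not shown).
    """
    A = np.exp(scores)
    np.fill_diagonal(A, 0.0)                                  # no self-loops
    L = -A
    idx = np.arange(len(A))
    L[idx, idx] = A.sum(axis=0)                               # L[d, d] = total weight of edges into d
    sign, logdet = np.linalg.slogdet(L[1:, 1:])               # delete the root row and column
    return logdet
```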
Experiments conducted on different languages with varying degrees of non-projectivity show that this method outperforms averaged-perceptron-style learning (the previous approach).
18 Mar: Contrastive Divergence and Contrastive Estimation (Arvind and Amit)
Contrastive Estimation
In this paper, the goal is to use unannotated text to solve sequence labeling problems like part-of-speech (POS) tagging, named-entity recognition, and parsing. To incorporate domain knowledge into a model structure, we need to handle various kinds of features. Log-linear models allow us to do that.
In general, to handle unannotated text, Expectation-Maximization (EM) algorithms are combined with log-linear models. One of the major problems of using EM is that computing the normalization (partition) function is computationally expensive. For POS tagging, we would have to sum over all possible sentences and all possible POS-tag structures for them.
In this work, the authors not only use where the learner pushes probability mass to, but also explicitly make use of where that probability mass is taken from. To find the best POS label sequence for a sentence s, comparing against all possible sentences is not ideal, so they restrict the comparison to sentences which are close to s. These close constructions define the neighborhood of s. For example, they define neighborhoods by deleting one word from the sentence, swapping adjacent words, or considering all permutations of the words of s. The neighborhood construction defines as neighbors those sentences which are very similar to s. The assumption is that if the model can differentiate between these close constructions, it will easily handle cases where the constructions are totally different from one another. The neighborhood should be something that lets the model differentiate between the best and second-best tagged predictions. This idea is very similar to the supervised setting of M3Ns, except that here the authors are working with unlabeled data. They call this framework contrastive estimation (CE).
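For concreteness, two of these neighborhoods are trivial to generate (a sketch; a sentence is just a list of words, and the names follow the DEL1WORD / TRANS1 conventions as we recall them from the paper):
```python
def del1word(sentence):
    """All sentences obtained by deleting exactly one word."""
    return [sentence[:i] + sentence[i + 1:] for i in range(len(sentence))]

def trans1(sentence):
    """All sentences obtained by transposing one pair of adjacent words."""
    out = []
    for i in range(len(sentence) - 1):
        s = list(sentence)
        s[i], s[i + 1] = s[i + 1], s[i]
        out.append(s)
    return out
```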
In experiments, they show that CE performs better than EM on POS tagging: CE is both faster and more accurate. In their other paper, they show that CE is also useful for grammar induction, which is closely related to parsing. Selecting an appropriate neighborhood is a crucial step in their setting; they show that always using bigger neighborhoods is not good. Hence their model differs from EM, which effectively uses all possible sentence constructions as the neighborhood. Neighborhoods should be constructed so that they are discriminative in nature.
Contrastive Divergence
This paper presents another approach to handle the log-partition problem. Log-partition function creates a problem only when one is maximizing the likelihood function : p(x|th) = p(x,th)/Z(th). We know that maximizing the likelihood is equivalent to minimizing the KL divergence between the empirical distribution and the true distribution. This relationship between KL and maximum likelihood (ML) can be used to deal with the log partition problem.
For this to understand, let p_0 be the empirical distribution, and p_inf be the final distribution that is obtained from the ML estimation. Note that if one were to use MCMC method instead of ML, then MCMC method will eventually converge to the true distribution; call it p_inf. MCMC method will also give us a series of intermediate distributions i.e., p_1,p_2...p_inf. Given such a series of distributions, a contrastive divergence is: CD = KL(p_0||p_{inf}) - KL(p_n||p_{inf})
The paper suggests that one can minimize CD instead of maximizing the likelihood and obtain a similar solution. Although the paper does not give any guarantees (neither on convergence nor on the accuracy of the solution), it shows that in practice the CD solution is as good as the ML solution.
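The following is a minimal sketch of a single CD-1 update for a small binary restricted Boltzmann machine, the setting where contrastive divergence is most commonly used; the architecture, learning rate, and omission of bias terms are simplifying assumptions. The point is that the update never touches the partition function: it contrasts the data ($p_0$) with the distribution after one Gibbs step ($p_1$) instead of with $p_\infty$.
 # One CD-1 update for a toy binary RBM (biases omitted for brevity).
 import numpy as np
 rng = np.random.default_rng(0)
 n_visible, n_hidden, lr = 6, 4, 0.1
 W = 0.01 * rng.standard_normal((n_visible, n_hidden))
 def sigmoid(x):
     return 1.0 / (1.0 + np.exp(-x))
 def cd1_update(W, v0):
     # Positive phase: hidden units driven by the data (p_0).
     h0_prob = sigmoid(v0 @ W)
     h0 = (rng.random(n_hidden) < h0_prob).astype(float)
     # One Gibbs step away from the data: reconstruction (p_1).
     v1 = (rng.random(n_visible) < sigmoid(h0 @ W.T)).astype(float)
     h1_prob = sigmoid(v1 @ W)
     # Approximate gradient: <v h>_data - <v h>_one-step; no Z(theta) required.
     return W + lr * (np.outer(v0, h0_prob) - np.outer(v1, h1_prob))
 v = rng.integers(0, 2, n_visible).astype(float)
 W = cd1_update(W, v)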
13 Apr: Learning to Search and Apprenticeship Learning via IRL (Kris and Lalindra)
Learning to Search
This paper describes the Maximum Margin Planning (MMP) algorithm, which learns a cost function under which example trajectories are optimal paths through a Markov Decision Process without a reward function (MDP\R). In contrast to other methods, they work with only positive examples and no negative examples, so they cannot infer a cost function from labeled examples directly and must employ an alternate technique: at each iteration, MMP increases the cost function in regions of the feature space encountered along the planned path and decreases it in regions encountered along the example path.
To make training harder and thus generalization better, they use loss augmentation. At each iteration, they lower the cost of undesirable state-action pairs to make them more likely to be chosen by the planner. This is accomplished by adding a term, a generalization of Hamming loss, that grows gradually from 0 at state-action pairs on the example path to 1 far from it, thereby identifying paths that are "almost correct". This forces the algorithm to keep making updates until the cost of the example path is lower than that of any other path by a margin that scales with that path's loss.
By defining a policy implicitly as the optimal controller for a cost function built from a set of features, the learned policy generalizes to new MDPs outside the training set. The planner is an MDP solver that returns $\mu^*=\arg\min_{\mu\in\mathcal{G}} c^T\mu$, where $\mathcal{G}$ is the set of flow (state-action visitation frequency) vectors. This formulation gives rise to a convex objective function, which can be optimized efficiently with the subgradient method. Positivity constraints on the costs are upheld in the linear case by the comparison with the minimum-cost example path. The resulting MMP objective is an upper bound on the structured loss, generalizing the way the SVM's binary hinge loss upper-bounds the zero-one loss. When both the environment and the policies are deterministic, the matrix-vector product $F_i\mu$ can be computed efficiently with sparse multiplication, and they use A* to find the optimal deterministic policy through the environment.
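Below is a minimal sketch of one loss-augmented subgradient step of MMP with a linear cost $c(s)=w^Tf(s)$ on a toy grid where the agent moves only right or down; the grid, features, Hamming-style loss, and the simple DP planner standing in for A* are all illustrative assumptions.
 # One subgradient step of max-margin planning on a toy right/down grid.
 import numpy as np
 H, GW, D = 4, 4, 3                              # grid size and feature dimension
 rng = np.random.default_rng(0)
 F = rng.random((H, GW, D))                      # f(s): per-cell feature vectors
 def plan(cost):
     """DP shortest path from (0,0) to (H-1,GW-1) moving right/down; returns visited cells."""
     best = np.full((H, GW), np.inf)
     best[0, 0] = cost[0, 0]
     for i in range(H):
         for j in range(GW):
             if i > 0:
                 best[i, j] = min(best[i, j], best[i-1, j] + cost[i, j])
             if j > 0:
                 best[i, j] = min(best[i, j], best[i, j-1] + cost[i, j])
     path, i, j = [(H-1, GW-1)], H-1, GW-1       # trace the optimal path back to the start
     while (i, j) != (0, 0):
         if i > 0 and (j == 0 or best[i-1, j] <= best[i, j-1]):
             i -= 1
         else:
             j -= 1
         path.append((i, j))
     return path
 def mmp_subgradient_step(w, example_path, lr=0.1, lam=0.01):
     cost = F @ w                                # linear cost of each cell
     loss = np.ones((H, GW))                     # Hamming-style loss: 0 on the example path,
     for (i, j) in example_path:                 # 1 everywhere else
         loss[i, j] = 0.0
     planned = plan(cost - loss)                 # loss-augmented planning
     mu_star = np.zeros((H, GW))
     mu_ex = np.zeros((H, GW))
     for (i, j) in planned:
         mu_star[i, j] += 1.0
     for (i, j) in example_path:
         mu_ex[i, j] += 1.0
     # Raise the cost where the planner went, lower it where the example went.
     g = np.tensordot(mu_star - mu_ex, F, axes=([0, 1], [0, 1]))
     return w + lr * (g - lam * w)
 w = np.zeros(D)
 demo = plan(F @ np.ones(D))                     # stand-in for an expert demonstration
 w = mmp_subgradient_step(w, demo)
Repeating this step drives the example path's cost below that of any competing path by a margin that grows with the competing path's loss.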
To generalize the algorithm to non-linear cost functions, which is necessary to make feature selection tractable, they use functional gradient descent. The functional gradients are first projected onto a predefined set of hypothesis functions $H$, the "direction set". Operationally, the functional gradient is a regression data set that indicates where, and how strongly, the cost function should be increased or decreased at a collection of points in feature space. This is gradient descent applied directly to the function $f(x)$ instead of to a parameterized function $f(x,w)$: the objective is $L[f]=\sum_n l_n[f]+\|f\|^2$, and its functional gradient with respect to $f$ points in the direction in which $f$ is off. To learn the regression function $h$, they assign a +1/-1 label to each feature vector and use an off-the-shelf regression algorithm to learn a function with the desired properties; $f$ is then updated as $f \leftarrow f + \eta h$.
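A minimal sketch of one step of this functional gradient procedure follows, assuming a cost function represented as a sum of regression trees (scikit-learn's DecisionTreeRegressor plays the role of the off-the-shelf regressor) and made-up feature vectors for the planned and example paths.
 # One functional-gradient ("boosting") step for the cost function f.
 import numpy as np
 from sklearn.tree import DecisionTreeRegressor
 rng = np.random.default_rng(0)
 def functional_gradient_step(f, planned_feats, example_feats, eta=0.5):
     """f is a list of (regressor, step) pairs, so f(x) = sum_k step_k * h_k(x)."""
     # Regression targets: +1 where the cost should rise (planner's path),
     #                     -1 where it should fall (example path).
     X = np.vstack([planned_feats, example_feats])
     y = np.concatenate([np.ones(len(planned_feats)), -np.ones(len(example_feats))])
     h = DecisionTreeRegressor(max_depth=2).fit(X, y)   # project the gradient onto the direction set
     f.append((h, eta))                                 # f <- f + eta * h
     return f
 def cost(f, x):
     return sum(step * h.predict(x.reshape(1, -1))[0] for h, step in f)
 planned = rng.random((10, 3))    # feature vectors visited by the current planner
 example = rng.random((8, 3))     # feature vectors visited by the demonstration
 f = functional_gradient_step([], planned, example)
 print(cost(f, rng.random(3)))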
In the non-linear case the positivity constraint is upheld by making modifications to the log of the cost function and exponentiating log-costs before planning. Thus the algorithm implements an exponentiated variant of functional gradient descent.
Apprenticeship Learning via IRL
This paper presents an approach to imitation learning (learning by observation) using inverse reinforcement learning (IRL). IRL is the problem of recovering the reward function an agent is optimizing in a given scenario. Since recovering this underlying reward function exactly is difficult, one way to approach the problem is to mimic the behavior of an expert and thereby derive policies or behaviors close to what the expert demonstrates.
The authors approach this problem by assuming that the reward function the expert is optimizing can be expressed as a linear combination of known features. Given such a feature mapping, one can derive a policy whose feature expectations are close to those of the expert's demonstrated policy. The authors show that this reformulated problem in fact produces behaviors or policies similar to what the expert would produce, that their algorithm converges in fewer iterations than contemporary methods, and that it is guaranteed to terminate when certain conditions are satisfied.
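A minimal sketch of the projection variant of this algorithm is shown below; the MDP solver is left as a stub (here it simply returns arbitrary feature expectations), and the expert's feature expectations and tolerance are made up, so this only illustrates the control flow of matching feature expectations.
 # Projection-style apprenticeship learning: match the expert's feature expectations.
 import numpy as np
 def solve_mdp_feature_expectations(w):
     """Placeholder for an MDP solver: should return the feature expectations of an
     optimal policy for reward w . phi.  Here it returns arbitrary values."""
     rng = np.random.default_rng(int(abs(w).sum() * 1e6) % (2**32))
     return rng.random(w.shape)
 def apprenticeship_learning(mu_expert, eps=1e-3, max_iter=50):
     mu_bar = solve_mdp_feature_expectations(np.ones_like(mu_expert))  # any initial policy
     for _ in range(max_iter):
         w = mu_expert - mu_bar                 # reward weights: direction toward the expert
         if np.linalg.norm(w) <= eps:           # expert's feature expectations (nearly) matched
             break
         mu_i = solve_mdp_feature_expectations(w)
         d = mu_i - mu_bar
         mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d   # projection step
     return w, mu_bar
 w, mu_bar = apprenticeship_learning(np.array([0.7, 0.2, 0.9]))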
15 Apr: MaxEnt IRL and Apprenticeship RL (Zhan and Abhishek)
22 Apr: IO Heuristic Control and Parsing with IOC (Jiarong)
Inverse Optimal Heuristic Control for Imitation Learning
This paper presents a way to combine MDP-based (inverse optimal control) methods and behavioral cloning. The advantage of the MDP approach is that it moves beyond a one-step look-ahead and learns a policy that correctly predicts entire sequences of actions; however, it does not scale as the state space grows. Behavioral cloning, on the other hand, inherits the scalability of existing supervised learning algorithms.
The way to combine the two methods is to change the energy function of the original Gibbs model. In the Gibbs model the energy is the Q value, $c(s,a)+J(T_s^a)$, so that $p(a|s)=\frac{\exp(-c(s,a)-J(T^{a}_s))}{\sum_{a'\in A_s}\exp(-c(s,a')-J(T^{a'}_s))}$. This is similar to multinomial logistic regression, and a behavioral cloning model based on it uses $E(s,a)=w^Tf(s,a)$ instead of $Q^*(s,a)$ as the energy function.
The combination of the two uses the energy function $\tilde{E}(o,a)=E(o,a)+Q_M^*(o,a)$, where $o$ is the observation and $\phi(o,a)$ gives a list of state-action pairs. Applying this energy function with a linear parameterization of the energy, denote by $w_v$ the weights of the value features (the behavioral-cloning part) and by $w_c$ the weights of the cost features (the MDP part). One can then use gradient descent on the negative log-likelihood of the data. The authors argue that the objective is "almost convex" and can be optimized directly; an effective strategy is (1) set $w_v=0$ and optimize the almost-convex IOC Gibbs model, (2) fix $w_c$ and optimize $w_v$, and (3) jointly optimize $w_c$ and $w_v$. In the convex-approximation part, the authors also present a soft-min version of the MDP energy function, which gives somewhat better results.
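The following sketch illustrates the combined Gibbs model and the staged optimization, with made-up value and cost features, a single expert action, and plain gradient descent in place of the authors' optimizer; the MDP term is represented abstractly by cost features rather than an actual $Q^*_M$.
 # Combined Gibbs policy: energy = value-feature term + cost-feature (MDP) term.
 import numpy as np
 def action_probs(w_v, w_c, Fv, Fc):
     """Fv, Fc: (num_actions, dim) feature matrices for one observation o."""
     energy = Fv @ w_v + Fc @ w_c                # E~(o,a)
     z = np.exp(-(energy - energy.min()))        # softmax over negative energies
     return z / z.sum()
 def nll_and_grads(w_v, w_c, Fv, Fc, a_expert):
     p = action_probs(w_v, w_c, Fv, Fc)
     nll = -np.log(p[a_expert])
     # Gradient of -log p: f(o, a_expert) - E_p[f(o, a)] (lower energy = higher probability).
     g_v = Fv[a_expert] - p @ Fv
     g_c = Fc[a_expert] - p @ Fc
     return nll, g_v, g_c
 rng = np.random.default_rng(0)
 Fv, Fc = rng.random((5, 3)), rng.random((5, 4))     # 5 actions, toy features
 w_v, w_c, a_star, lr = np.zeros(3), np.zeros(4), 2, 0.5
 # Staged optimization: (1) w_c alone with w_v = 0, (2) w_v with w_c fixed, (3) joint.
 for stage in ("cost", "value", "joint"):
     for _ in range(100):
         _, g_v, g_c = nll_and_grads(w_v, w_c, Fv, Fc, a_star)
         if stage in ("cost", "joint"):
             w_c -= lr * g_c
         if stage in ("value", "joint"):
             w_v -= lr * g_v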
Training Parsers by Inverse Reinforcement Learning
In this paper, the authors define PCFG parsing as an MDP and propose a unified framework covering five different inverse reinforcement learning algorithms. A PCFG is a 5-tuple $G=(W,N,S,R,\sigma)$: $W$ is the set of terminal symbols, $N$ the set of nonterminal symbols, $S$ the start symbol, $R$ a subset of the set of all possible production rules, and $\sigma$ a scoring function such that the exponentiated scores of the rules sharing the same left-hand-side nonterminal sum to 1. A constituent of a parse tree is a triple $(N_c,start_c,end_c)$, where $start_c$ and $end_c$ are the indices of the first and last words forming the constituent. If $(N,i,j)$ has two children $c_{left}$ and $c_{right}$, then $c_{left}=(N_{left},i,k)$ and $c_{right}=(N_{right},k+1,j)$, and there is a rule $N\rightarrow N_{left}N_{right}$. The probability of a parse tree given the grammar is proportional to the product, over rules, of the exponentiated (occurrence count times score), so finding the most probable parse of a sentence under $G$ amounts to finding the maximum-scoring tree.
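A minimal sketch of finding the maximum-scoring tree under a toy grammar in Chomsky normal form is given below; the grammar, its scores, and the brute-force CKY-style chart are made up for illustration, and a tree's score is simply the sum of the scores of the rules it uses (so its probability is proportional to the exponential of that sum).
 # Maximum-scoring parse for a toy scored CNF grammar (CKY-style chart).
 from collections import defaultdict
 import math
 binary = {("S", ("NP", "VP")): 0.0, ("NP", ("D", "N")): -0.1, ("VP", ("V", "NP")): -0.2}
 lexical = {("D", "the"): -0.1, ("N", "dog"): -0.3, ("N", "cat"): -0.4, ("V", "saw"): -0.2}
 def max_scoring_parse(words):
     n = len(words)
     chart = defaultdict(lambda: -math.inf)       # (i, j, nonterminal) -> best score
     back = {}
     for i, w in enumerate(words):                # length-1 spans: lexical rules
         for (A, word), s in lexical.items():
             if word == w and s > chart[(i, i + 1, A)]:
                 chart[(i, i + 1, A)] = s
                 back[(i, i + 1, A)] = w
     for length in range(2, n + 1):               # longer spans: binary rules
         for i in range(n - length + 1):
             j = i + length
             for k in range(i + 1, j):
                 for (A, (B, C)), s in binary.items():
                     cand = s + chart[(i, k, B)] + chart[(k, j, C)]
                     if cand > chart[(i, j, A)]:
                         chart[(i, j, A)] = cand
                         back[(i, j, A)] = (k, B, C)
     return chart[(0, n, "S")], back
 score, back = max_scoring_parse(["the", "dog", "saw", "the", "cat"])
 print(score)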
PCFG parsing can be formalized as an MDP $M_{1K}^G = (\mathcal{X}, \{A(x)\}, T, \mathcal{X}_T, r)$. $\mathcal{X}$ is the state space, representing partial parse trees with root $(S,1,K)$. $A(x)$ is the set of admissible actions in $x$, with $A(x)=A_1(x)\cup A_2(x)$, where $A_1(x)$ contains the rules that split a nonterminal into nonterminals and $A_2(x)$ the rules that rewrite a nonterminal as a terminal word. $T$ is a deterministic transition function, $\mathcal{X}_T$ is the set of terminal states, and $r$ is the reward function, which is $-\infty$ if the terminal state is not a complete parse tree and otherwise $r(x,a)=\sigma(R^a)$.
The unified goal of the IRL algorithms is to minimize some dissimilarity function $J(r,D)$ that measures the difference between the optimal behavior with respect to the selected reward function $r$ and the expert's observed behavior. With a linear parameterization, $r_\theta(x,a)=\theta^T\phi(x,a)$, where $\phi(x,a)$ is a feature extractor. The parameter update is $\theta_{k+1}=g(g^{-1}(\theta_k)+\alpha\Delta_k)$, where $g$ is a link function (here $g(x)=x$ or $g(x)=\exp(x)$). The authors discuss the dissimilarity functions and update terms $\Delta_k$ for the projection algorithm, the multiplicative-weights algorithm for apprenticeship learning, max-margin planning, policy matching, and maximum-entropy IRL.
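A tiny sketch of this generalized update is given below; the parameter values and update term are made up, and the exp link assumes positive parameters.
 # theta_{k+1} = g(g^{-1}(theta_k) + alpha * Delta_k) for the two link functions used.
 import numpy as np
 def linked_update(theta, delta, alpha, link="identity"):
     if link == "identity":               # g(x) = x: ordinary additive (sub)gradient step
         return theta + alpha * delta
     if link == "exp":                    # g(x) = exp(x): multiplicative-weights style step
         return np.exp(np.log(theta) + alpha * delta)
     raise ValueError(link)
 theta = np.array([0.5, 1.0, 2.0])        # assumed positive so the exp link is well defined
 delta = np.array([0.1, -0.2, 0.05])      # Delta_k from whichever IRL criterion is in use
 print(linked_update(theta, delta, 0.1))
 print(linked_update(theta, delta, 0.1, link="exp"))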
Additional Optional Reading
See Structured Prediction on Braque.
Past Semesters
- Fall 2009 (Online Learning)
- Summer 2009 (Assorted Topics)
- Spring 2009 (Learning Theory)
- Fall 2008 (Graphical Models)
- Summer 2008 (Multitask Learning)
- Spring 2008 (Kernel Methods)
- Fall 2007 (Unsupervised Bayes)
- Summer 2007 (Manifolds)