MLRG/fall07

Time: Friday, 1:00-2:00 (except as noted below)

Topic: Unsupervised Bayesian

Location: MEB Graphics Annex

Schedule

Date Papers Presenter
31 Aug Tutorial on variational approximation methods (Jaakkola, 2000) Hal
11 Sep (2pm) An Introduction to MCMC for Machine Learning (Andrieu, de Freitas, Doucet and Jordan, 2003) Piyush
21 Sep A Correlated Topic Model of Science (Blei and Lafferty, 2007) Nathan
Gibbs likelihoods for Bayesian tracking (Roth, Sigal and Black, 2004) Arvind
28 Sep Understanding Belief Propagation and its Generalizations (Yedidia, Freeman, and Weiss, 2001) Piyush
5 Oct Language evolution by iterated learning with Bayesian agents (Griffiths and Kalish, 2007) Cecily
Gene function classification using Bayesian models with hierarchy-based priors (Shahbaba and Neal, 2006); Improving Classification When a Class Hierarchy is Available (Shahbaba and Neal, 2007), http://ba.stat.cmu.edu/journal/2007/vol02/issue01/shahbaba.pdf Amit
16 Oct (2pm) Markov chain sampling methods for Dirichlet process mixture models (Neal, 1998) Hal
Background: A Bayesian analysis of some nonparametric problems (Ferguson, 1973); Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems (Antoniak, 1974); Ferguson distributions via Pólya urn schemes (Blackwell and MacQueen, 1973); A constructive definition of Dirichlet priors (Sethuraman, 1994). See also Yee Whye's tutorial.
19 Oct Variational inference for Dirichlet process mixtures (Blei and Jordan, 2004) Avishek
26 Oct Hierarchical Dirichlet Processes (Teh, Jordan, Beal and Blei, 2006) Nathan
30 Oct (2pm) Bayesian nonparametric latent feature models (Griffiths and Ghahramani, 2006) Amit
6 Nov (2pm) Stick-breaking Construction for the Indian Buffet Process (Teh, Gorur and Ghahramani, 2007) Avishek
16 Nov Introduction to Gaussian Processes (MacKay, 2000) Nathan
Other intro material: Gaussian processes for machine learning (Seeger, 2004); Gaussian processes (Williams, 2002)
20 Nov (2pm) The Infinite PCFG using Hierarchical Dirichlet Processes (Liang, Petrov, Jordan and Klein, 2007) Amit
Other related material: Talk on the Infinite PCFG using HDP (Liang, Petrov, Jordan and Klein, 2007); The infinite tree (Finkel, Grenager and Manning, 2007); Talk on Structured Bayesian Nonparametric Models with Variational Inference (Liang and Klein, 2007)
27 Nov (2pm) Bayesian agglomerative clustering with Coalescents (Teh, Daume III, Roy, 2007) Arvind

Paper Summaries

31 Aug 07: Variational (Hal)

The Bayesian inference problem is typically to estimate a posterior distribution p(θ,z | x), where θ is a set of parameters, z is a set of latent variables and x is observed data. Variational inference is an approximation method where we seek to approximate this posterior with a distribution q(θ,z), where q \in Q and Q is a set of well-behaved distributions. Typically we assume q(θ,z) factorizes at least with respect to θ and z, yielding q(θ)q(z). If we choose both to be point masses, we get Hard EM; if we choose the first to be a point mass and the latter to be the true distribution, we obtain EM; if we allow both to be distributions, we get mean-field variational EM.

Now, the big question is: what do we mean by p being similar to q? We typically measure this by KL(q || p), which is typically referred to as mode-finding (switching the order of the KL yields mean-finding, but this is something else entirely). Note that by the definition of KL, p needn't be normalized in order to perform a minimization over q. The next step is to relax the optimization problem so that it becomes tractable, done by shrinking Q. The mean-field approximation fully factorizes q, so that q(\theta) = \prod_i q_i(\theta_i). Once factorized as such, we obtain the following algorithm: for each i, optimize q_i(\theta_i) holding all other coordinates \theta_{-i} fixed. (When we talk about Gibbs sampling in two weeks this will look very familiar.) The update is q_i(\theta_i) \propto \exp[ E_{q_{-i}} \log p(\theta_i | \theta_{-i}, x) ] \propto \exp[ E_{q_{-i}} \log p(\theta, x) ]. This is a coordinate descent algorithm; more complex algorithms are possible, but beyond the scope.

The important thing to remember is: Variational means turning an inference problem into an optimization problem with respect to KL(q || p). Mean-field means fully factorizing q. Both can be changed. Switching the KL yields message passing algorithms. Not factorizing q yields collapsed algorithms. There is a continuum of options.
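To make the mean-field coordinate updates concrete, here is a small sketch (my own toy example, not from the tutorial): we approximate a correlated bivariate Gaussian p(theta) = N(mu, inv(Lam)) with a fully factorized q(theta_1)q(theta_2). For a Gaussian target the updates have a closed form: q_i is Gaussian with variance 1/Lam_ii and mean mu_i - (Lam_ij/Lam_ii)(E[theta_j] - mu_j). The particular mu and Lam below are arbitrary assumptions.

import numpy as np

# Target: p(theta) = N(mu, inv(Lam)), a correlated 2-D Gaussian (toy values).
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 2.0]])      # precision matrix

# Mean-field approximation q(theta) = q1(theta1) q2(theta2), each q_i Gaussian.
m = np.zeros(2)                   # variational means, arbitrary initialization
v = 1.0 / np.diag(Lam)            # optimal variational variances are fixed at 1/Lam_ii

for sweep in range(50):           # coordinate descent: update one factor at a time
    for i in range(2):
        j = 1 - i
        # q_i(theta_i) is proportional to exp( E_{q_j}[ log p(theta) ] ), a Gaussian with this mean
        m[i] = mu[i] - (Lam[i, j] / Lam[i, i]) * (m[j] - mu[j])

print("variational means:      ", m)
print("variational variances:  ", v)
print("true marginal variances:", np.diag(np.linalg.inv(Lam)))

The classic mean-field behavior shows up here: the means are recovered exactly, but the variational variances (1/Lam_ii) underestimate the true marginal variances.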

(Suresh) I did some digging on this projection business I was talking about. Here's the first observation. The Chow-Liu result I was referring to is a method for "factoring" a general joint distribution into a dependence tree with only pairwise dependence relationships. They show, very elegantly, that this problem can be solved optimally via an MST construction. The idea is that you weight each edge of the graph of variables by the mutual information of these variables, and compute a maximum spanning tree under this weight function.

Here's where I think the compare-and-contrast kicks in. In the original variational formulation, you want to minimize the KL distance between p and q, and replace q by the mean-field approximation, because of computational ease. You know that the minimizer in the original space is p itself, and in the mean-field space, it's some other q' that you compute. In order to estimate the error of the final solution, you'd need to analyze how "far" q' is from p, and since these spaces are not metric in the usual sense, it's not clear how to do that.

With the Chow-Liu method, what you're doing is making a controlled perturbation of q to a q' that is guaranteed to be closest among all possible q'. In other words, you're projecting onto a larger space (because you allow pairwise dependencies) and so you'd expect the answer to be better. However, whether the computation can still proceed is not clear to me: clearly, the space is not factorable in the same way, and maybe this goes to the "collapsed algorithms" Hal referred to above. But it's a tree, and that screams dynamic programming, or even divide and conquer. I think it's an interesting question to ask whether using this more powerful projection gives a better answer, and if it can yield efficient algorithms in the same way.

(Hal) The wikipedia entry is a bit on the short side, but the general approach seems cool. In general, for those who don't know, if your graphical model is tree structured, then inference is tractable. (Formally, complexity scales roughly as N^W, where W is the tree-width of the graph; trees have tree-width 1.) There has been a recent set of papers by Wainwright and others that proposes approximating dense graphs with effectively random spanning trees; they are in some ways optimal (Kolmogorov and Wainwright have a paper on this). They don't cite Chow and Liu, though, so I'm not sure if anyone remembers this connection anymore :).

11 Sep 07: MCMC Techniques for sampling (Piyush)

Many of the problems that we typically encounter in Bayesian inference, optimization, statistical mechanics, model selection, etc, deal with complex forms of the probability distributions of interest, and/or high dimensionality of state space (continuous/discrete) on which they are defined. This leads to the following issues:

  1. Sampling from such target distributions is very difficult since their form is usually unknown.
  2. Computations of various quantities of interest, defined over such target distributions (e.g. marginalization, expectation or posterior calculations), are rendered intractable since they typically involve integrals over very high dimensional spaces.

MCMC techniques have emerged as useful tools to deal with such problems. MCMC techniques are based on the basic idea of Monte Carlo integration and state space exploration using Markov chains. Their popularity is also attributed to their applicability in a wide range of problems, where we are faced with large state spaces.

Monte Carlo integration: The idea behind Monte Carlo integration is to approximate an exact integral by an empirical average. Instead of evaluating the integrand at evenly spaced points, we evaluate it at randomly drawn points: an expectation E_p[f(x)] is approximated by (1/N) \sum_{i=1}^N f(x^i), where the x^i are samples from the target distribution p(x). The simplest case is when p(x) has some known, easy-to-sample form (e.g. Gaussian) so we can draw from it directly. If the form of p(x) is such that it isn't easy to sample from, we resort to other techniques such as rejection/importance sampling, or MCMC.

Rejection and importance sampling: These approaches take into account some prior knowledge about the target distribution p(x). This prior knowledge comes in the form of a proposal distribution q(x) which we expect to be in some sense "similar" to p(x). We also expect the proposal distribution to cover the support of p(x) (by "support", we mean the smallest set whose complement has probability zero under p; in practice, q shouldn't be zero wherever p is nonzero).

Rejection sampling is based on sampling from q(x), which acts as an "envelope" for p(x) (p(x) <= M q(x), where M is some multiplicative constant). Samples generated from q(x) are then accepted with acceptance probability p(x) / (M q(x)). The algorithm is shown below.

Set i = 1
while i <= N
  1. Sample x^i ~ q(x) and u ~ U(0,1)
  2. if u < p(x^i) / (M q(x^i))
         accept x^i and increment i
     else
         reject x^i

This can be shown to be equivalent to sampling from p(x). However, finding a good M is not always easy, especially in high dimensions, so rejection sampling is not suited to such cases. Importance sampling chooses a proposal distribution q(x) such that p(x) = w(x)q(x); samples are drawn from q(x), and each sample is weighted by w(x) = p(x) / q(x) (the importance weight). Unfortunately, importance sampling too isn't suited to high dimensions.
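As a concrete toy illustration of rejection sampling (not taken from the paper): the target below is a two-component Gaussian mixture, the proposal is a single wide Gaussian, and the bound M is an assumption chosen generously after checking p(x)/q(x) numerically.

import numpy as np

def p(x):   # target density: an equal-weight mixture of N(-2,1) and N(2,1)
    return 0.5 * np.exp(-0.5 * (x + 2)**2) / np.sqrt(2 * np.pi) \
         + 0.5 * np.exp(-0.5 * (x - 2)**2) / np.sqrt(2 * np.pi)

def q(x):   # proposal density: a wide zero-mean Gaussian, N(0, 3^2)
    return np.exp(-0.5 * (x / 3.0)**2) / (3.0 * np.sqrt(2 * np.pi))

M = 3.0     # envelope constant: p(x) <= M q(x) everywhere (max of p/q is about 2 here)

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = np.random.normal(0.0, 3.0)      # draw x ~ q
        u = np.random.uniform(0.0, 1.0)     # draw u ~ U(0,1)
        if u < p(x) / (M * q(x)):           # accept with probability p(x) / (M q(x))
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(5000)
print(xs.mean(), xs.std())   # roughly 0 and sqrt(5), about 2.24, for this mixture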

Another drawback of rejection as well as importance sampling is that they require proposals that are "good" everywhere. This is not always possible. We instead want a proposal distribution which is good "locally". Most of the MCMC techniques have been devised keeping this requirement in mind.

Markov Chain Monte Carlo techniques: MCMC techniques offer a more powerful framework to sample from a large class of distributions. Essentially, MCMC methods are based on maintaining a random walk of parameters over parameter space, constructing a Markov Chain whose stationary distribution is the same as the target distribution we want to sample from. The chain is constructed in a manner that it spends more time in the most important regions. As with rejection and importance sampling, MCMC techniques too draw samples from a proposal distribution. However, we also maintain a record of the current state xi and the proposal distribution q(x | xi) depends on the current state. In contrast, the samples drawn in rejection sampling, for example, are statistically independent from each other.

A Markov chain is characterized by the following properties:

  1. Given the current state, the future state is independent of all other states of the system: p(x^{i+1} | x^i, x^{i-1}, ..., x^1) = p(x^{i+1} | x^i)
  2. Transition probabilities T(x^{i+1} | x^i) = p(x^{i+1} | x^i) denote the probability of switching from x^i to x^{i+1}. For a homogeneous Markov chain the transition kernel is the same at every step, and it must satisfy \sum_{x^{i+1}} T(x^{i+1}|x^i) = 1.

A Markov chain will converge to its invariant distribution if:

  1. The Markov chain has a transition kernel (transition matrix for discrete case) which ensures its irreducibility (i.e. the transition graph is connected) and aperiodicity.
  2. It fulfils a reversibility condition, also known as detailed balance: p^*(x^{i+1}) T(x^i | x^{i+1}) = p^*(x^i) T(x^{i+1} | x^i), where p^*(x) is the desired invariant (target) distribution.

The Markov chain converges to the required invariant distribution, regardless of the choice of initial distribution. This property is called ergodicity.

The Metropolis Hastings algorithm: The MH algorithm is a generic (and probably the most popular) MCMC technique based on drawing samples from a proposal distribution. In fact, many other algorithms can be seen as special cases of the MH algorithm. The MH algorithm makes use of a proposal distribution q(x* | x) which depends on the current value of x. Unlike rejection and importance sampling, q(x* | x) need not look similar to the target distribution p(x). The only requirement is that it should be some fixed density from which it is easy to sample. q(x* | x) is used to draw a sample x*. Then we move from x to x* according to an acceptance probability A(x, x*). The algorithm is sketched below.

1. Initialize x^0.
2. for i = 0 : N-1
   - Sample u ~ U(0,1)
   - Sample x* ~ q(x*|x^i)
   - A(x^i, x*) = min{1, [p(x*) q(x^i|x*)] / [p(x^i) q(x*|x^i)]}
   - if u < A(x^i, x*)
        x^{i+1} = x*
     else
        x^{i+1} = x^i

Note that unlike rejection sampling, rejected points are not discarded. Instead, a rejection causes the current state to be recorded again as the next sample.

Intuitively, we can interpret the above acceptance probability as follows: we want to move to x* if p(x*) is high, and to stay at x^i if p(x^i) is high. The proposal terms correct for sampling bias: if q(x* | x^i) unnecessarily favors proposing x*, the acceptance probability is reduced so that we don't move there too often; conversely, if the reverse proposal q(x^i | x*) is large, the move is accepted more readily.

We can see other samplers, such as the independent sampler and the Metropolis algorithm, as special cases of the MH algorithm. For the independent sampler, the proposal distribution is independent of the current state: q(x* | x) = q(x*). The Metropolis algorithm, on the other hand, assumes a symmetric proposal distribution such that q(x | x*) = q(x* | x).
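Here is a minimal random-walk Metropolis sampler (the symmetric-proposal special case just mentioned, so the q terms cancel in the acceptance ratio). The target and step size are toy assumptions; note that only an unnormalized p is needed.

import numpy as np

def p_unnorm(x):
    # unnormalized target: the same two-component Gaussian mixture shape as above
    return np.exp(-0.5 * (x + 2)**2) + np.exp(-0.5 * (x - 2)**2)

def metropolis(n_samples, step=1.0, x0=0.0):
    x = x0
    samples = []
    for _ in range(n_samples):
        x_star = x + np.random.normal(0.0, step)        # symmetric proposal q(x*|x)
        A = min(1.0, p_unnorm(x_star) / p_unnorm(x))    # acceptance probability
        if np.random.uniform() < A:
            x = x_star                                  # accept the move
        samples.append(x)                               # on rejection, x is repeated
    return np.array(samples)

chain = metropolis(20000)
print(chain[5000:].mean(), chain[5000:].std())   # after burn-in: about 0 and 2.2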

Simulated annealing: Sometimes, instead of approximating a target distribution p(x), we are interested in finding its global maximum (e.g. for maximum likelihood or maximum a posteriori estimation). Although we could still run a Markov chain, sample from p(x), and keep the sample which maximizes p(x), this would be very inefficient since the sampled values only rarely come from near the actual mode of the distribution. Simulated annealing instead uses a variant of the MH sampler in which the invariant distribution at iteration i is p^{1/T_i}(x) instead of p(x), where T_i is a decreasing cooling schedule with \lim_{i\to\infty}T_i = 0. This works because p^{\infty}(x) is the density which concentrates itself around the global maxima of p(x). Careful selection of the proposal distribution q and an appropriate cooling schedule is very important.
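A compact sketch of the annealed variant (my own toy setup): the same random-walk kernel, but targeting p^{1/T_i}(x) with a temperature that decays toward zero, so late iterations concentrate near the global maximum. The cooling schedule below is an arbitrary assumption.

import numpy as np

def p_unnorm(x):
    # unnormalized target with a global maximum near x = -2 and a smaller mode near x = 2
    return np.exp(-0.5 * (x + 2)**2) + 0.5 * np.exp(-0.5 * (x - 2)**2)

def simulated_annealing(n_iters=5000, step=1.0, x0=5.0):
    x, best = x0, x0
    for i in range(1, n_iters + 1):
        T = 1.0 / np.log(1.0 + i)                                     # slowly decreasing temperature
        x_star = x + np.random.normal(0.0, step)
        A = min(1.0, (p_unnorm(x_star) / p_unnorm(x)) ** (1.0 / T))   # invariant dist is p^(1/T)
        if np.random.uniform() < A:
            x = x_star
        if p_unnorm(x) > p_unnorm(best):
            best = x
    return best

print(simulated_annealing())   # should end up near the global maximum at x = -2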

Mixture and cycles of MCMC kernels: Given two different transition kernels K1 and K2, we can construct mixture hybrid kernels (νK1 + (1 − ν)K2 for some 0 < ν < 1) or cycle hybrid kernels (K1K2) to devise more sophisticated sampling schemes. Mixture kernels can be used to sample at different granularity levels (e.g. first exploring a vast state space of the target distribution using a global proposal, and then narrowing down the search with a local proposal to explore the finer details). Cycle hybrid kernels are typically used when the state space is a multivariate vector. We split the vector into various blocks and sequential updates can be made to each of these blocks. Gibbs sampling is a common example of cycles of MCMC kernels.

Gibbs sampling: Gibbs sampling is used when the joint distribution is not known explicitly (or is hard to sample from directly) but the full conditional distributions p(x_j | x_1, ..., x_{j−1}, x_{j+1}, ..., x_n) are known (assume x to be an n-dimensional vector here). The Gibbs sampler can in fact be obtained as a special case of the MH sampler in which the proposal distribution is q(x^*|x^i) = p(x_j^*|x_{-j}^i). The acceptance probability A(x^i, x^*) comes out to be 1 in this case, which means that the MH proposals are always accepted (there is no accept/reject step in the Gibbs sampler).

The sampling scheme is as follows:

1. Initialize x_i^0  i = 1,..,n
2. for j = 0 : T
   - Sample x_1^{j+1} \sim p(x_1 | x_2^j, x_3^j... , x_n^j)
   - Sample x_2^{j+1} \sim p(x_2 | x_1^{j+1}, x_3^j... , x_n^j)
   ........
   - Sample x_k^{j+1} \sim p(x_k | x_1^{j+1}, x_2^{j+1}... , x_{k-1}^{j+1}, x_{k+1}^j,..., x_n^j)
   .........
   - Sample x_n^{j+1} \sim p(x_n | x_1^{j+1}, x_2^{j+1}... , x_{n-1}^{j+1})

There are cases in which the conditional distributions don't belong to a standard family of distributions. In such cases, we can embed MH steps inside the Gibbs sampler. The Gibbs sampler has some variants as well. In particular, we can use a blocked Gibbs sampler, simultaneously (jointly) drawing more than one component of x. Another variant is the collapsed Gibbs sampler, which can be used when it is possible to integrate out some components of x (often used when we don't care about the components that can be integrated out).
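As a toy illustration (not from the tutorial), here is a Gibbs sampler for a zero-mean bivariate Gaussian with correlation rho, where both full conditionals are known in closed form; rho and the chain length are assumptions.

import numpy as np

def gibbs_bivariate_normal(n_samples, rho=0.8):
    # Full conditionals: x1 | x2 ~ N(rho*x2, 1-rho^2) and x2 | x1 ~ N(rho*x1, 1-rho^2).
    x1, x2 = 0.0, 0.0
    out = np.zeros((n_samples, 2))
    for j in range(n_samples):
        x1 = np.random.normal(rho * x2, np.sqrt(1 - rho**2))   # sample x1 given x2
        x2 = np.random.normal(rho * x1, np.sqrt(1 - rho**2))   # sample x2 given the new x1
        out[j] = (x1, x2)
    return out

s = gibbs_bivariate_normal(20000)
print(np.corrcoef(s[5000:].T))   # empirical correlation should be close to 0.8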

Auxiliary variable samplers: Sometimes, it's convenient to introduce an auxiliary variable into the target distribution and then sample from the extended target density p(x,u), where u is the auxiliary variable. We draw samples (xi,ui) from the joint distribution p(x,u) and ignore ui to obtain the marginal samples xi. Examples include Hybrid Monte Carlo and slice sampling. Sometimes it is also useful to introduce more than one auxiliary variable.

Reversible jump MCMC: Often, we are faced with the problem of selecting the best model from a collection of models having different dimensionality. The general MCMC techniques aren't suitable since it would involve switching between models of different dimensionality. Reversible jump MCMC is an example of trans-dimensional approach to MCMC which can be used in model selection problems. It can move between models of different dimensions (and hence the name trans-dimensional).

Combination of MCMC with other techniques (variational optimization, exact inference) can also be used for efficient sampling. In particular, it is often possible to divide a hidden variable into two groups, sample for one first, and use exact inference to compute the other (by marginalizing the posterior, for example).

Some Monte Carlo demonstrations

21 Sep 07: A Correlated Topic Model of Science (Nathan)

A shorter version of the paper can be found here: Correlated Topic Models (It's only 8 pages compared to 30+)

Topic models, in the context of document collection analysis, refer to explicit models of the correlation between the words in a document and latent (hidden) semantic themes. This normally corresponds to some generative probabilistic model that uses a number of distributions over a vocabulary V to describe the collection of documents.

This paper builds upon the Latent Dirichlet Allocation (LDA) model, which assumes that words in a document come from a mixture of topics. These topics are shared by all documents, but the topic proportions (i.e. how much of topic i document j gets) are document-specific and are drawn from a Dirichlet distribution. LDA is a bag-of-words model, which means that the words in a document are assumed to be exchangeable. While this may seem to be a bad idea from a linguistic point of view, this assumption allows the model to represent documents as vectors of word counts in a high dimensional space without regard for word order. This in turn allows for efficient computation of the semantic themes, although it disregards the possibility that the presence of one topic may be correlated with the presence of another.

The authors show that by drawing topic proportions from a Dirichlet distribution, one gives up any hope of exploiting latent information about correlated topics. They propose a correlated topic model (CTM), which replaces the Dirichlet with a logistic normal distribution. This change introduces a covariance structure across the components of the model, allowing for a more realistic model of the latent topic structure in which the presence of one latent topic may be correlated with the presence of another.

The generative process for CTM :

Given topics \beta_{1:K}, a K-vector μ and a K \times K covariance matrix Σ:

  1. For each document d = 1 \dots D
    1. Draw \eta_d \vert \{\mu, \Sigma\} \sim Normal(\mu,\Sigma).
    2. For n \in \{1, \dots, N_d\}:
      1. Draw a topic assignment z_{d,n} \vert \eta_d from Mult(f(\eta_d)), where f(\eta) = \exp\{\eta\} / \sum_k \exp\{\eta_k\} maps \eta onto the simplex.
      2. Draw a word w_{d,n} \vert \{z_{d,n},\beta_{1:K}\} from Mult(\beta_{z_{d,n}}).

Of course there is a catch: by throwing out the Dirichlet in favor of a logistic normal distribution, you also lose the conjugacy between the Dirichlet and the multinomial in LDA, which is the basis for the efficient approximate posterior inference achieved with that model. Thus, since exact inference in the CTM is no longer tractable, the authors provide a variational inference method to approximate the posterior.
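To make the generative process above concrete, here is a small simulation of it (the vocabulary size, number of topics, and parameter values are all toy assumptions); the softmax map plays the role of f(η).

import numpy as np

rng = np.random.default_rng(0)
K, V, N_d = 4, 1000, 50                          # topics, vocabulary size, words per document

beta = rng.dirichlet(np.ones(V) * 0.1, size=K)   # K topic distributions over the vocabulary
mu = np.zeros(K)
Sigma = 0.5 * np.eye(K) + 0.5                    # covariance that couples the topics

def generate_document():
    eta = rng.multivariate_normal(mu, Sigma)     # eta_d ~ N(mu, Sigma)
    theta = np.exp(eta) / np.exp(eta).sum()      # f(eta): softmax map onto the simplex
    words = []
    for _ in range(N_d):
        z = rng.choice(K, p=theta)               # topic assignment z_{d,n}
        w = rng.choice(V, p=beta[z])             # word w_{d,n} from topic z's distribution
        words.append(w)
    return words

doc = generate_document()
print(len(doc), doc[:10])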

21 Sep 07: Gibbs Likelihoods for Bayesian Tracking (Arvind)

This paper addresses the problems of modeling the appearance of humans and distinguishing human appearance from the appearance of general scenes. Given a 3D model of the person projected into an image, we model the likelihood of observing various image cues (filter responses, i.e. edge, ridge and motion) given the locations and orientations of the limbs. More specifically, we want to sample some pixels from the image (from both the foreground and background regions) containing the human and compute the likelihood p(f | x) at these points.

There has been work in the past on modeling this likelihood using Bayesian methods where it is assumed that all features (filter responses) are conditionally independent. Given that we have three different kinds of filter responses (edge, ridge and motion), the joint conditional density under this assumption factorizes as p(f | φ) = p(f_e | φ) p(f_r | φ) p(f_m | φ). It is well known (and can also be guessed) that these filter responses are not conditionally independent. Besides directly violating this assumption, modeling the joint conditional density (the probability that we observe all filter responses) in this way has other difficulties as well: many features or measurements are required for robust tracking, which makes the dimensionality of the joint data space high, and training data may be too limited to fully populate the probability space.

To address these problems, we model the likelihood using a Gibbs model

p(f|x)=\frac{1}{Z(\Lambda)}exp(-\sum_i{\langle\lambda^{(i)},\phi^{(i)}(f,x)\rangle})

where the λ^{(i)} are weight functions and the φ^{(i)} can be thought of as "selector functions" that choose bins of marginal histograms along various directions. This basically means that if feature f_i takes values in \{1,\cdots,N\} and we choose to discretize them into 32 bins, then the selector function φ^{(i)}(f,x) equals j if f_i falls into the jth bin. A nice thing about the selector functions is that they give us the freedom to choose the directions along which the histograms are taken. We can choose the directions of the marginals, which recovers the original Bayesian model because it does not account for any correlation among features, or we can choose other directions, e.g. the diagonal of (f_1, f_2) (selector function φ(f,x) = j if (f_1 + f_2) / 2 falls into the jth bin), which models some kind of correlation between features f_1 and f_2. (Refer to section 3 of the paper for more technical details.)

Experiments show that modeling the joint conditional density with Gibbs likelihoods, which lets us model the dependencies between features, captures the distribution underlying the data well while not suffering from overfitting.

28 Sep 07: Inference using Belief Propagation (Piyush)

Graphical models such as Bayesian networks and pairwise Markov Random Fields (MRFs) let us encode the statistical relationships among a given set of variables. We can use these to represent the joint probability of a set of variables. For inference problems, we are interested in the marginal probabilities of some hidden (i.e. unobserved) variables of interest, which essentially means taking the joint probability function and summing over all states of all other variables. The marginalization sum can be done directly for small graphs where the number of variables is small. However, the cost of direct marginalization grows exponentially with the number of hidden nodes in the graph.

Belief Propagation (BP) is an inference technique based on two main concepts: the "belief" of a node and "message passing" between nodes (or groups of nodes) in a graph. It turns out that the belief of a node i (generally denoted b_i(x_i)) is essentially the marginal probability of that node. BP gives us a way to do efficient (exact or approximate) marginalization whose complexity is only linear in the number of hidden nodes in the graphical network.

Standard BP algorithms use a graph structure called "factor graph" which is used to represent joint probability function of a set of variables in terms of a product of functions which encode some relationship among various groups of nodes in the graph. In particular, in a factor graph with N variables and M functions (called compatibility functions), the joint probability can be written as:

p(x) = \frac{1}{Z} \prod_a \psi_a(\{x\}_a)

Factor graphs are in fact a generalization of a more specific graph structure called Tanner Graph used in error correcting codes where the compatibility functions are basically parity check functions for a group of bits.

BP has been used for doing inference in a variety of problems, including Bayesian networks, pairwise MRFs and factor graphs. Applying it to pairwise MRFs is easier than applying it directly to Bayesian networks or factor graphs, since the latter two are more complicated structurally (Bayesian nets are directed graphs, and factor graphs consist of two types of nodes -- one type corresponds to hidden variables and the other to the compatibility functions). However, the BP algorithm obtained by converting a Bayesian net or factor graph to a pairwise MRF is mathematically equivalent to the BP algorithm defined directly on the original graphical model.

The belief propagation algorithm defines a belief equation for each hidden node in terms of a "local evidence" function at that node and all the messages being passed to that node from the neighboring nodes.

b_i(x_i) = k \phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i)

φ_i(x_i) is the local evidence function at node i, m_{ji}(x_i) is the message passed from node j to node i, and k is a normalization constant. Intuitively, m_{ji}(x_i) is a message from hidden node j to hidden node i about what state i should be in: it is a vector with one component per state that i can take, and each component denotes how likely node j thinks it is that node i is in the corresponding state. The messages are defined by self-consistent message update rules.

m_{ij}(x_j) = \sum_{x_i} \phi_i(x_i) \psi_{ij}(x_i,x_j) \prod_{k \in N(i) \setminus j} m_{ki}(x_i)

ψ_{ij}(x_i,x_j) is the compatibility function of the two hidden nodes x_i and x_j; the sum runs over the states of x_i, and the product over k runs over all neighbors of node i excluding j.

By reorganizing the sums, it turns out that the belief at node i is its exact marginal probability in a singly connected graph. Each message needs to be computed only once in a singly connected graph, and that's what makes BP efficient: we can perform the marginalization for all hidden nodes in time linear in the number of links in the graph.
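As a sanity check on the claim that BP is exact on singly connected graphs, here is a small sum-product implementation for a chain-structured pairwise MRF (the random potentials are assumptions of the example), compared against brute-force marginalization.

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n, S = 5, 3                                # 5 nodes in a chain, 3 states each
phi = rng.random((n, S)) + 0.1             # local evidence phi_i(x_i)
psi = rng.random((n - 1, S, S)) + 0.1      # pairwise compatibilities psi_{i,i+1}(x_i, x_{i+1})

# Forward and backward messages along the chain (each computed exactly once).
fwd = [np.ones(S) for _ in range(n)]       # fwd[i]: message into node i from node i-1
bwd = [np.ones(S) for _ in range(n)]       # bwd[i]: message into node i from node i+1
for i in range(1, n):
    fwd[i] = psi[i - 1].T @ (phi[i - 1] * fwd[i - 1])
for i in range(n - 2, -1, -1):
    bwd[i] = psi[i] @ (phi[i + 1] * bwd[i + 1])

beliefs = phi * np.array(fwd) * np.array(bwd)
beliefs /= beliefs.sum(axis=1, keepdims=True)

# Brute-force marginals for comparison.
joint = np.zeros([S] * n)
for x in product(range(S), repeat=n):
    val = np.prod([phi[i, x[i]] for i in range(n)])
    val *= np.prod([psi[i, x[i], x[i + 1]] for i in range(n - 1)])
    joint[x] = val
joint /= joint.sum()
for i in range(n):
    exact = joint.sum(axis=tuple(a for a in range(n) if a != i))
    print(np.allclose(beliefs[i], exact))   # True for every node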

One starts with an unbiased (e.g. uniform) initialization of the messages, and the hope is that simple iteration of the message updates will lead to convergence. Convergence can be an issue for graphs with loops, but BP has still been successful in a wide variety of problems.

It turns out that the fixed points of BP can be shown to be equivalent to the stationary points (minima) of the Bethe approximation to the free energy, a concept from statistical mechanics. Free energy minimization, although slower, is at least guaranteed to converge. Another concept is the Kikuchi approximation to the free energy, which is based on the cluster variation method. This method defines free energies for different regions of a graph and uses them to compute the total free energy of the system (essentially the sum of the local energies of all regions, correcting for the energies of regions over-counted in the summation). The Bethe approximation is the special case of the Kikuchi approximation in which the regions are chosen to be the pairs of connected hidden nodes (and the individual nodes).

In Generalized Belief Propagation (GBP), instead of a single node passing a message to another single node, we have one group of nodes passing messages to another group of nodes. It can be shown that the fixed points of GBP are equivalent to the stationary points of a corresponding Kikuchi approximation to the free energy.

GBP is based on a sub-region, message and belief construction scheme which I will talk about.


04 Oct 07: Language Evolution by Iterated Learning With Bayesian Agents (Cecily)

The aim of the paper is to model human language learning as an iterative learning process. The authors introduce a Markov chain in which the data from a learner in generation n-1 helps a learner in generation n formulate hypotheses and generate data, and then the cycle repeats. The paper then applies linear algebra and describes some general limitations such as ergodicity. The paper then gives two trivial examples using binary scenarios. In the first scenario, the authors assume a two-language situation, and they show that sinks appear and ergodicity is violated. In the second scenario, they extend a trivial example from a different publication in which the learner must decide whether or not to utter a nonsense word Gavagai, and they show that Bayesian inference provides a theoretical framework describing how learners alter their hypotheses based on data, but it does not indicate which hypothesis a learner will select. The authors then give two learning algorithms for hypothesis selection. The first uses Gibbs sampling and samples from the posterior distribution; the authors go into a fair amount of detail on using this approach in the paper. The second choice for hypothesis selection is Maximum A Posteriori estimation, which makes use of the EM algorithm. This approach also favors languages with higher probability, but it is more vulnerable to noise. The paper then applies this new theoretical advance to the old research question of compositional versus holistic languages. The discussion includes a number of simulated experiments, and the paper shows graphs and charts from these simulations. The authors avoid adopting a particular position on a couple of linguistic issues (e.g. holistic versus compositional), but they suggest that their general approach could be used to generate evidence for either, and that more empirical data is needed to determine which theory better mimics human language learning. The authors conclude by revisiting the limitations of their claims and discussing the relationship to prior work.


04 Oct 07: Improving Classification Using a Hierarchy-Based Prior (Amit)

Given: prior knowledge of how the classes can be arranged in a hierarchy, based on how easily they can be distinguished. This information can be used to improve classification.

Big Picture:

The new method uses a Bayesian form of the multinomial logit (MNL, a.k.a. "softmax") model, with a prior that introduces correlations between the parameters for classes that are nearby in the tree. The authors compare the performance on simulated data of the new method, the ordinary MNL model, and a model that uses the hierarchy in a different way. They also test the new method on page layout analysis and document classification problems, and find that it performs better than the other methods.

Motivation:

Their original motivation for studying hierarchical classification schemes was prediction of the biological functions of genes. Functions are usually presented in a hierarchical form, starting with very general classes (e.g., cell processes) and becoming more specific in lower levels of the hierarchy (e.g., cell division).

Ordinary MNL model:

When there are three or more classes, we can use a generalization known as the multinomial logit (MNL) model (called "softmax" in the machine learning literature):

P(y = j|x,\alpha,\beta) = \frac {\exp(\alpha_j + x\beta_j)} {\sum_{j'=1}^c \exp(\alpha_{j'} + x \beta_{j'})}

where c is the number of classes.

Tree MNL:

The formula remains the same, but we do classification in stages, say two steps. The canonical example is named-entity recognition: first we identify whether a particular entity is a named entity or not, and at the next step we look at the named entities and classify them into person, location and organization. Here we could have directly classified entities as person, organization or location, but instead we use the hierarchy.

Correlated MNL:

Here we include a hierarchy-based prior:

β1 = φ11 + φ21

β2 = φ11 + φ22

β3 = φ12 + φ23

β4 = φ12 + φ24

In the above equations, the pairs (β1, β2) and (β3, β4) share a common term in their priors. The reason is that classes 1 and 2, and classes 3 and 4, are close to each other in the hierarchy; that's why their coefficients share a common component.
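To see the induced correlation numerically, here is a small sketch (the variances are made-up assumptions) that draws the φ terms and composes the β coefficients exactly as in the four equations above; β1 and β2 end up correlated because they share φ11, and likewise β3 and β4 share φ12.

import numpy as np

rng = np.random.default_rng(0)
n_draws = 5000

def draw_betas():
    # one phi term per branch of the hierarchy (variances are arbitrary choices)
    phi11, phi12 = rng.normal(0, 1.0), rng.normal(0, 1.0)      # upper-level branches
    phi21, phi22, phi23, phi24 = rng.normal(0, 0.5, size=4)    # leaf-level terms
    beta1 = phi11 + phi21
    beta2 = phi11 + phi22
    beta3 = phi12 + phi23
    beta4 = phi12 + phi24
    return beta1, beta2, beta3, beta4

B = np.array([draw_betas() for _ in range(n_draws)])
print(np.corrcoef(B.T).round(2))   # beta1-beta2 and beta3-beta4 are correlated, the rest are not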

In this paper they show that the correlated MNL model performs better than the ordinary MNL and tree MNL models.

16 Oct 07: Dirichlet processes (Hal)

I'm going to be pretty fast and loose here, because I don't want to deal with measure-theoretic issues. Let X be a probability space, G_0 a distribution over X and alpha a positive real. We say that G is distributed DP(alpha, G_0) if the following holds: for every finite partition {X_1, ..., X_K} of X (for any K), [ G(X_1), ..., G(X_K) ] is distributed Dir(alpha G_0(X_1), ..., alpha G_0(X_K)), where Dir is the usual Dirichlet distribution. What this means is that however we partition the space X, applying G to this partition makes it look Dirichlet. Note that X could be infinite (even uncountable). What we're doing is defining a distribution over an uncountable space (X) by looking at what happens to finite pieces of it. This is something we'll see again when we talk about other "processes." The general story will be that a Z-Process will be a distribution over something that looks infinite, but when restricted to be finite, behaves as if it were just Z.

Let's think about what G looks like with respect to G_0 and alpha. First, note that G is a distribution. This means that a DP is a distribution over distributions. (This is exactly the same way that a multinomial is a distribution, and so a dirichlet is a distribution over distributions.) Suppose that alpha tends toward infinity. Now, we know that however we partition X, we have [ G(X_1), ..., G(X_K) ] distributed Dir(alpha G_0(X_1), ..., alpha G_0(X_K)). As alpha tends toward infinity, the Dirichlet tends toward a point mass at [ G_0(X_1), ..., G_0(X_K) ]. For this reason, G_0 is called the mean of the DP -- the expected draw from the DP is exactly G_0. As we talked about for the Dirichlet, the alpha parameter essentially controls the variance. High alpha means low variance and vice versa.

Now, let's think about what draws from G look like. In other words, let G ~ DP(alpha, G_0). Since G is a distribution over X, we can draw from it. Suppose we draw x ~ G. Now, what's the posterior distribution of G given x? It's just DP(alpha+1, (alpha G_0 + delta_x) / (alpha+1)). If we instead draw x_1, ..., x_N, then the posterior is DP(alpha+N, (alpha G_0 + sum_n delta_{x_n}) / (alpha+N)). In other words, as we observe more data, the variance of the DP goes down. The mean of the DP becomes a mixture of the original mean G_0 and a bunch of point masses at the drawn locations.

What follows from here is a way of thinking about drawing x_1, ..., x_N from G ~ DP(alpha, G_0), with G marginalized out. It turns out that, conditioned on the N previous observations, the predictive distribution for x_{N+1} is (alpha G_0 + sum_n delta_{x_n}) / (alpha + N). This leads to an interpretation of a DP as a Polya urn, due to Blackwell and MacQueen. Let's associate each distinct value of x with a different color, so that a fresh draw from G_0 yields a new color. We have an urn which begins empty. In the first step, we draw a color from G_0, paint a ball with that color, and put it in the urn. Now, at step N+1, we have two choices. We either draw a ball from the urn, or draw a new color from G_0. We do the former with probability proportional to N, the latter with probability proportional to alpha. If we draw a new color from G_0, we do the same thing as before. On the other hand, if we draw a ball from the urn, we paint a new ball the same color, and put BOTH back into the urn. The sequence of draws obtained this way is exactly the sequence of draws from a DP. This also shows that the process is exchangeable: it's order-independent.
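A small simulation of this urn scheme (colors represented as integer labels; G_0 is left implicit since only the pattern of new-color events matters here):

import numpy as np

def polya_urn(n, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    colors = []                    # the sequence of draws, as integer color labels
    counts = []                    # counts[k] = number of balls of color k in the urn
    for i in range(n):
        # new color from G_0 with prob alpha/(alpha+i); existing color k with prob counts[k]/(alpha+i)
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            counts.append(0)       # paint a ball with a fresh color drawn from G_0
        counts[k] += 1
        colors.append(k)
    return colors, counts

colors, counts = polya_urn(100, alpha=2.0)
print(len(counts), counts)         # the number of distinct colors grows roughly like alpha*log(n)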

Now, suppose we want to use a DP in a mixture model. Why would we want to do that? Well, we're going to associate each color with a cluster. So the model will be something like: G ~ DP(alpha, G_0); th_n ~ G; x_n ~ F(th_n). Think about a mixture of Gaussians: there, the prior over means is a broad zero-mean Gaussian and F is again a Gaussian; the ths are the means. In the DP case, we let G_0 be a broad zero-mean Gaussian (because this is the mean of G) and keep F as a Gaussian centered at th_n. We know (because of probability theory magic) that draws from G will be discrete (with probability one), so we're going to get re-use of our ths, i.e., many data points will have the same th, which means they have the same color and hence are in the same cluster.

The simplest way to sample here is to introduce a cluster assignment variable c (so that c_n = the cluster data point n is in) and sample over c and the corresponding th_k, for k in the range of c. Neal has written out how to do this; I won't repeat it here. One thing we can notice is that if F is conjugate to G_0, then we can actually integrate out the ths and sample only over the cs. This is the most common sampler for DPs, so I'll write it here. Suppose we have a vector c of length N and we want to resample the nth component of it. There are two choices. Either c_n = k, for 1 <= k <= K, where K is the total number of clusters in c (after removing c_n), or c_n is a new color. In the first case, p(c_n = k) is proportional to n_k * prob(x_n | x_{c == k}). In the latter case, p(c_n = K+1) is proportional to alpha * prob(x_n | {}). The two probabilities are just the marginal probability of x_n given the other elements in the cluster (in the latter case, there are no other elements). In both cases, when conjugate priors are used, this can be computed easily.
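Here is a compact version of this collapsed sampler for the simplest conjugate case: a 1-D Gaussian mixture with known observation variance and a Gaussian G_0 over cluster means, so prob(x_n | x_{c==k}) is the Normal-Normal posterior predictive. All hyperparameters and the test data below are toy assumptions.

import numpy as np

def norm_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def predictive(x, members, sigma2=1.0, tau02=9.0, mu0=0.0):
    # marginal prob(x | members of the cluster): Normal-Normal posterior predictive
    m = len(members)
    post_var = 1.0 / (1.0 / tau02 + m / sigma2)
    post_mean = post_var * (mu0 / tau02 + np.sum(members) / sigma2)
    return norm_pdf(x, post_mean, post_var + sigma2)

def dp_collapsed_gibbs(x, alpha=1.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    c = np.zeros(len(x), dtype=int)                    # start with everything in one cluster
    for _ in range(iters):
        for n in range(len(x)):
            c[n] = -1                                  # remove point n from its cluster
            ks = [k for k in np.unique(c) if k >= 0]
            w = [np.sum(c == k) * predictive(x[n], x[c == k]) for k in ks]   # n_k * prob(x_n | ...)
            w.append(alpha * predictive(x[n], np.array([])))                 # alpha * prob(x_n | {})
            w = np.array(w) / np.sum(w)
            j = rng.choice(len(w), p=w)
            c[n] = ks[j] if j < len(ks) else max(ks, default=-1) + 1         # maybe open a new cluster
    return c

data = np.concatenate([np.random.normal(-4, 1, 40), np.random.normal(4, 1, 40)])
print(np.unique(dp_collapsed_gibbs(data), return_counts=True))   # typically two dominant clusters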

19 Oct 07: Variational Inference for Dirichlet Process Mixtures (Avishek)

This paper presents a variational inference algorithm for Dirichlet Process (DP) mixture model. Given a sample {x_1, \ldots, x_N} from a DP mixture, the goal is to compute an approximation to the predictive density: p(x|x_1,\dots,x_N,\alpha,G_0) = \int p(x|\eta)p(\eta|x_1,\ldots,x_N,\alpha,G_0) d\eta

The above posterior distribution is difficult to compute, and MCMC sampling techniques, notably Gibbs sampling, exist for approximating such posteriors in DP mixtures. However, MCMC sampling techniques can be prohibitively slow. In addition, they are not deterministic, and although they provide theoretical guarantees of accuracy, it is very difficult to assess their convergence. Variational inference, on the other hand, is a fast and deterministic approach to posterior approximation. It is based on reformulating the posterior computation problem as an optimization problem and then solving a relaxed version of that optimization problem. The main drawback variational methods suffer from is the chance of getting trapped in local optima.

The variational inference algorithm in this paper is based on the stick breaking representation of DP:

 \pi_i (v) = v_i \prod_{j=1}^{i-1} (1-v_j)

where v_i ~ Beta(1, α) and π_i(v) denotes the mixing proportion obtained by successively breaking a unit-length stick into an infinite number of pieces, each successive piece being a fraction v_i of what remains of the stick. Initially, a distribution q (called the variational distribution) is proposed to approximate the actual posterior distribution p. q has a number of free parameters called variational parameters. The aim of the inference algorithm is to iteratively adjust these variational parameters so that q resembles p as closely as possible. Note that this paper uses a finite (truncated) stick-breaking representation for q to approximate the infinite-dimensional representation of p.
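For reference, a quick sketch of the (truncated) stick-breaking construction with v_i ~ Beta(1, α); the truncation level T is an assumption of this illustration, analogous to the truncation used for q in the paper.

import numpy as np

def stick_breaking_weights(alpha=2.0, T=50, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=T)                                 # v_i ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])    # prod_{j<i} (1 - v_j)
    return v * remaining                                             # pi_i = v_i * prod_{j<i} (1 - v_j)

pi = stick_breaking_weights()
print(pi[:5], pi.sum())   # weights decay quickly; the sum approaches 1 as T grows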

The variational inference algorithm in this paper is based on mean-field methods, which essentially means that the variational distribution q is represented in a fully factorized form to yield a computationally efficient inference scheme. Mean-field methods aim to minimize the KL divergence between the proposed distribution q_v(w) and the actual posterior p(w | x, θ), which is given by:

D(q_v(w) || p(w | x, θ)) = E_q[log q_v(W)] − E_q[log p(W, x | θ)] + log p(x | θ)

where v are the variational parameters, θ are the hyperparameters, W is the set of latent variables and x is the set of observed variables. Note that the log marginal probability log p(x | θ) does not depend on the variational parameters and hence can be ignored in the optimization. The mean-field framework is used to derive a coordinate ascent algorithm which maximizes a lower bound on the log marginal likelihood of the data with respect to the variational parameters. The variational algorithm is then compared with the collapsed Gibbs sampler and the blocked Gibbs sampler; experimental results demonstrate the superior performance of the variational approach in terms of convergence time and standard error.

26 Oct 07: Hierarchical Dirichlet Processes (Nathan)

The paper for this week builds upon the Dirichlet process framework we have been exploring over the past few weeks. Hierarchical models take the notion that a parameter is itself a random variable, and have many applications, especially in "multi-task learning" or when working with grouped or relational data sets. The canonical example of a hierarchical model is the Gaussian means problem: a "grand mean" μ_0 is drawn from some distribution, a set of K means is independently drawn from a Gaussian centered at μ_0, and the data are then drawn from these K Gaussian distributions.

Dirichlet Process recap

This is probably old news at this point...

Let (Θ,B) be a measurable space, where B is a σ-field (σ-algebra) of subsets of Θ.

Let G_0 be a probability measure on (Θ,B); this is called the base measure of G, for reasons that will become apparent shortly.

Let α0 be a positive real number. This is called the concentration parameter.

A Dirichlet process G is a random probability measure over (Θ,B) such that for any finite partition (A_1, A_2, \dots, A_r) of Θ the random vector (G(A_1), G(A_2), \dots, G(A_r)) is distributed as a finite dimensional Dirichlet distribution. i.e.

(G(A_1), G(A_2), \dots, G(A_r)) \sim Dir(\alpha_0 G_0(A_1), \alpha_0 G_0(A_2), \dots, \alpha_0 G_0(A_r))

Dirichlet processes are utilized in mixture models in this paper, so how is this done?

Dirichlet Mixture Models: Given a set of exchangeable data x = (x_1, x_2, \dots, x_n) and a draw G ~ DP(α_0, G_0) from a Dirichlet process, independently draw n latent factors φ_i from G. For each i = 1, \dots, n, draw x_i ~ F(φ_i) for some distribution F.

Hierarchical Dirichlet Processes

The paper describes the Hierarchical Dirichlet Process (HDP) framework in the context of grouped data. Assume that you have J groups of data, each with n_j exchangeable data points (x_{j1}, \dots, x_{jn_{j}}), where each group is modeled by a Dirichlet process mixture model.

The HDP is a distribution over random probability measures over (Θ,B), with one probability measure G_j for each group j and one global probability measure G_0. The global measure is itself distributed as a DP with concentration parameter μ and base measure H. Each x_ji is associated with a factor φ_ji, with φ_ji drawn from G_j and x_ji drawn from F(φ_ji).

Chinese Restaurant Franchise

There is a Chinese Restaurant Franchise which has a shared menu across all of its restaurants. At each table of each restaurant, one dish (hopefully sesame chicken) is ordered from this menu by the first customer who sits there. All subsequent customers who sit at this table share this dish. The same dish can be served at multiple tables in multiple restaurants.
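A small simulation of the franchise metaphor (the group sizes and concentration parameters below are arbitrary assumptions): each restaurant seats customers CRP-style at tables, and each newly opened table orders its dish from a top-level CRP shared by the whole franchise.

import numpy as np

def chinese_restaurant_franchise(num_groups=3, customers_per_group=20, alpha0=1.0, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    dish_table_counts = []              # m_k: number of tables (across all restaurants) serving dish k
    group_dishes = []
    for j in range(num_groups):
        table_counts = []               # n_jt: customers at table t of restaurant j
        table_dish = []                 # dish served at table t of restaurant j
        dishes = []
        for _ in range(customers_per_group):
            # choose a table: existing table t with prob proportional to n_jt, new table prop. to alpha0
            w = np.array(table_counts + [alpha0], dtype=float)
            t = rng.choice(len(w), p=w / w.sum())
            if t == len(table_counts):  # new table: order its dish from the franchise-level CRP
                dw = np.array(dish_table_counts + [gamma], dtype=float)
                k = rng.choice(len(dw), p=dw / dw.sum())
                if k == len(dish_table_counts):
                    dish_table_counts.append(0)   # a brand-new dish (a fresh draw from the menu prior H)
                dish_table_counts[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            dishes.append(table_dish[t])
        group_dishes.append(dishes)
    return group_dishes

print(chinese_restaurant_franchise())   # dish labels are shared across the groups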

6 Nov 07: Stick-breaking Construction for the Indian Buffet Process (Avishek)

The Indian Buffet Process (IBP) is a Bayesian nonparametric distribution used in latent feature modeling. It can be expressed in the standard IBP form or in a stick-breaking form. This paper derives the stick-breaking representation of the IBP. It starts with a general introduction to the IBP and then moves on to a Gibbs sampling based posterior inference scheme for the IBP. One drawback of the Gibbs sampling based approach is its conjugacy requirement, needed to efficiently compute the integral which marginalizes out some of the unknown parameters. This conjugacy requirement severely limits the applicability of the Gibbs sampler to a wide variety of distribution functions. The stick-breaking construction, on the other hand, has no such conjugacy requirement and hence scores over the Gibbs sampler.

The stick breaking representation has been constructed by imposing a specific ordering on the feature probabilities. It is to be noted that the stick breaking representation doesn't integrate out the feature probabilities. This paper also establishes a relation between the stick breaking representations of the Indian Buffet Process and the Dirichlet Process (DP). This relationship implies that a range of techniques for and extensions to the DP can be directly applied to the IBP.

The stick breaking representation of the IBP has been utilized to develop a slice sampler. This slice sampler is useful in cases where the conjugacy requirement of the Gibbs sampler is not satisfied. Thus, the slice sampler is more generally applicable than the Gibbs sampler.

This paper also explores the conversion of the standard IBP to its stick breaking representation and vice-versa. Such conversions allow us to take advantage of efficient MCMC approaches in both the representations.
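For concreteness, here is a sketch of the construction the paper derives: decreasingly ordered feature probabilities μ_(k) = ν_k μ_(k-1) with ν_k ~ Beta(α, 1), truncated at K features, with each entry of the binary feature matrix Z drawn as Bernoulli(μ_(k)). The truncation level and parameter values are assumptions for illustration.

import numpy as np

def ibp_stick_breaking(N=10, K=30, alpha=3.0, seed=0):
    rng = np.random.default_rng(seed)
    nu = rng.beta(alpha, 1.0, size=K)           # nu_k ~ Beta(alpha, 1)
    mu = np.cumprod(nu)                         # mu_(k) = nu_k * mu_(k-1): decreasing in (0,1)
    Z = (rng.random((N, K)) < mu).astype(int)   # z_{nk} ~ Bernoulli(mu_(k)), independently
    return mu, Z

mu, Z = ibp_stick_breaking()
print(mu[:5])               # feature probabilities decay toward zero
print(Z.sum(axis=0)[:10])   # later features are possessed by fewer objects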

16 Nov 07: Gaussian Processes (Nathan)

A stochastic process is an abstraction of a probability distribution to functions. It allows us to cast supervised learning as learning a function given examples, instead of learning a weight vector. When we focus on processes that are Gaussian in nature, inference in the model becomes relatively easy for certain applications. A Gaussian Process (GP) can be thought of as an infinitely long vector where every cell specifies the function value f(x) for some input x. GPs also have the special property that if you only ask for f(x) values at a finite number of points, then you receive the proper (finite-dimensional Gaussian) answer regardless of the infinitely many other values.
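The finite-marginal property can be seen directly in code: to "ask for" f(x) at a finite set of inputs, we just draw from the corresponding multivariate Gaussian, with covariance given by a kernel. The squared-exponential kernel and its hyperparameters below are assumptions for illustration.

import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # squared-exponential covariance k(x, x') = variance * exp(-(x - x')^2 / (2 l^2))
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
X = np.linspace(-5, 5, 100)                                 # any finite set of query inputs
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))                # jitter for numerical stability
f = rng.multivariate_normal(np.zeros(len(X)), K, size=3)    # three sample functions from the GP prior
print(f.shape)                                              # (3, 100): f evaluated at the 100 inputs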

19 November 2007: The infinite PCFG using Hierarchical Dirichlet Processes (Amit)

Background:

In this paper the authors tackle the most well known problem in Natural Language Processing, i.e. parsing. People have shown accuracy up to 90% in parsing by using different grammars or different representations. Well known parsing formalisms include head-driven phrase structure grammar (HPSG), PCFG (probabilistic context free grammar), lexicalized PCFG and dependency parsing, along with unsupervised approaches and manual refinement of the grammar. In parsing, everyone uses Penn Treebank data to train and produce parse trees. Most approaches use the same symbols (like NP, VP, PP, etc.) or manually try to refine the grammar.

Motivation and Technique used:

The motivation behind this paper is to turn manual grammar refinement into automatic refinement, so that the grammar can grow by itself. Since the data should speak for itself, the best approach to tackle the problem is a nonparametric one. Hence the authors use an HDP to define an infinite PCFG and perform inference with a variational technique (a structured mean-field approximation to the true posterior over parameters).

Overview of PCFG:

S -> NP VP (0.8)
S -> NP S' (0.1)
S -> W NP VP (0.1)
NP -> DT NP (0.5)
NP -> adj n (0.15)
NP -> adj adj (0.05)
NP -> n (0.3)

In the above example, S is the root symbol of the tree (the sentence). There are three rules by which S can expand into different tree structures, and the probabilities must sum to 1 for each non-terminal on the left-hand side (i.e. S and NP in the above grammar). I have tried to use Chomsky normal form in the above grammar since it is easy to handle binary trees (and this paper also uses CNF).

By letting the grammar grow I mean that the number of symbols (like NP, VP, PP) can be as large or as small as needed, depending on the amount of data available.

Grammar induction: How many grammar symbols (NP, VP, etc.)?

Grammar refinement: How many grammar subsymbols (NP-loc, NP-subj, etc.)?

Previous Work:

The question of how to select the appropriate grammar complexity has been studied in earlier work. It is well known that more complex models necessarily have higher likelihood and thus a penalty must be imposed for more complex grammars. Examples of such penalized likelihood procedures include Stolcke and Omohundro (1994), which used an asymptotic Bayesian model selection criterion and Petrov et al. (2006), which used a split-merge algorithm which procedurally determines when to switch between grammars of various complexities. These techniques are model selection techniques that use heuristics to choose among competing statistical models; in contrast, the HDP-PCFG relies on the Bayesian formalism to provide implicit control over model complexity within the framework of a single probabilistic model. Johnson et al. (2006) also explored nonparametric grammars, but they do not give an inference algorithm for recursive grammars, e.g., grammars including rules of the form A -> BC and B -> DA. Recursion is a crucial aspect of PCFGs and our inference algorithm does handle it. Finkel et al. (2007) independently developed another nonparametric model of grammars. Though their model is also based on hierarchical Dirichlet processes and is similar to ours, they present a different inference algorithm which is based on sampling. Kurihara and Sato (2004) and Kurihara and Sato (2006) applied variational inference to PCFGs. Their algorithm is similar to ours, but they did not consider nonparametric models.

Structure of the Paper:

Since this paper is targeted at an NLP audience, in Section 2 the authors discuss various models: the Bayesian finite mixture model, the DP mixture model, the HDP-HMM, the HDP-PCFG and the HDP-PCFG for grammar refinement, as well as variational inference, the structured mean-field approximation and coordinate-wise ascent. I will focus on the HDP-PCFG and then use it to give the idea of the HDP-PCFG for grammar refinement. Section 3 contains experiments, and I will also talk about that section since it is interesting to see how the HDP-PCFG compares with a normal PCFG.

HDP-PCFG

β ~ GEM(α) [generate distribution over symbols]
For each symbol z ∈ {1, 2, . . . }:
 \phi_z^E \sim Dirichlet(\alpha^E) [generate emission probs]
 \phi_z^B \sim DP(\alpha^B, \beta \beta^T) [binary production probs]
For each nonterminal node i:
 (z_{L(i)}, z_{R(i)}) \sim Multinomial(\phi_{z_i}^B) [child symbols]
For each preterminal node i:
 x_i \sim Multinomial(\phi_{z_i}^E) [terminal symbol]

Suggested Readings

Variational Algorithms for Approximate Bayesian Inference : Matthew Beal (PhD thesis)

Probabilistic Inference using Markov Chain Monte Carlo Methods : Radford Neal (PhD thesis)
