MLRG/fall08

From ResearchWiki

Revision as of 09:09, 5 December 2008 by Piyush (Talk | contribs)

CS7941: Inference in Graphical Models

Time: Fri 2:00-3:20pm (see schedule for each week)

Location: MEB 3105

Abstract

Graphical models (both Bayesian networks and Markov random fields) are a convenient method for encoding and reasoning about probability distributions. This seminar will focus initially on representational issues and then move to computational methods for inference in graphical models. Our interest will primarily be on approximate inference, since exact inference in most models is intractable. The focus will be on non-Bayesian approaches, though many things we discuss are applicable to Bayesian problems as well.

Students with a working knowledge of probability are well suited for this seminar. Each student should expect to present at least one week during the semester.

The rough plan is to begin with classical approaches to graphical models, including exact inference via the junction tree algorithm. We will quickly move on to approximate methods, focusing mostly on message passing and linear programming approaches.

The structure of the semester will be roughly as follows:

  • Introduction to Bayes' nets and graphical models: semantics, variable elimination, sum/max product algorithm, and factor graphs
  • Computational complexity: junction trees, triangulations and tree-width, belief propagation
  • Exponential families: maximum entropy principle, cumulant generating functions, conjugate duality
  • Variational inference: exact inference on trees, mean-field and structured mean-field, inner bounds of the marginal polytope
  • Statistical physics: Bethe free energy, log-determinant relaxation and outer bounds of the marginal polytope
  • Linear programming methods: tree-reweighted message passing, more outer bounds

This is a lot of material. Things will be dropped or skimmed depending on time and interest levels. Students should only come if they are prepared to do the required reading before the seminar.

Expected Student Involvement

Presentations will be given by pairs of students. The logistics of this will depend somewhat on the make-up of the participants, but the goal would be to pair CS students with non-CS students in most cases. Sign-ups will happen quickly, so please take care to make sure you're signed up for at least one week, though probably two. The expectations, with timing, are:

  • At least 3 days before a presentation, the presenters should meet with Hal for one hour to discuss the material and decide what to present.
  • By class time, everyone should have submitted at least 2 questions to the Wiki for discussion. (Click on the "discussion" tab along the top of the wiki to enter your questions.)
  • Everyone should actively participate in the discussions.

If you're interested in participating in the programming parts, please visit the programming subpage.

Participants

Schedule and Readings

Most of our initial readings will be from the technical report Graphical models, exponential families, and variational inference by Martin Wainwright and Mike Jordan. I know this tome is a bit daunting, but it's not as bad as it looks at first glance! We refer to this in the readings as WJ.

When videos are listed, they are stored in Theora video format; viewers are available for pretty much every platform. Please please please don't distribute these or give away the password (which was sent to the ml@cs mailing list).

Date Papers Presenters
Introduction to Bayes' nets and graphical models
F 29 Aug Introduction, semantics and applications (WJ, 1-2.4) Hal
F 5 Sep Factor graphs, sum-product and max-product (KFL01, section 1,2, example 6 from section 3 B, section 4 A, appendix A and B; Optional reading for some more foundational stuff: WJ, 2.5-2.5.1)

video A and video B

Piyush (CS), Zheng (BI)
F 19 Sep Junction tree algorithm and triangulations (WJ, 2.5.2; HD96, 4)

video

Nathan (CS), Seth (CS)
Variational Inference
F 26 Sep Math camp: exponential families and conjugate duality (WJ, 3)

video

Amit (CS), Piyush (CS)
T 30 Sep, 5:15pm Variational principle, exact inference on trees (WJ, 4)

video

Arvind (CS), Senthil (BMI)
F 3 Oct Mean-field and structured mean-field (WJ, 5)

video

Jagadeesh (CS), Stephen (BMI)
Statistical Physics
F 10 Oct Bethe free energy and sum-product (WJ, 6)

video

Anthony (BMI), Jiarong (CS)
F 24 Oct Kikuchi free energy and generalized sum-product (WJ 7)

video

Seth (CS)
Convex Relaxations
F 31 Oct Convex relaxations (WJ 8) Avishek (CS)
F 7 Nov Semi-definite relaxations (WJ 9) John M. (CS), Nathan (CS)
F 14 Nov Mode computations (WJ 10) Zhan Wang (CS), Jagadeesh
T 2 Dec Exact max-product and outer bounds (GJ07) Ruihong (CS), Jiarong
F 5 Dec Outer bounds on the marginal polytope (SJ07 and SMGJW08) Piyush

Additional Readings

These readings are listed just for reference if people are interested.


Reading Summaries

29 Aug: Introduction, semantics and applications (Hal)

High-level summary: graphical models = graph theory + probability theory.

Applications:

  • Hierarchical Bayesian Models. Almost all hierarchical Bayesian models can be (and often are!) represented by directed graphs, with nodes as r.v.s and edges denoting conditional dependence. The standard problems in Bayesian modeling (computing marginal likelihoods, computing expectations, etc.) are graphical model inference problems. (Cautionary note: most hierarchical Bayesian models involve r.v.s with continuous densities; the material we'll cover this semester doesn't really address this. Our focus is essentially entirely on discrete r.v.s.)
  • Bioinformatics and Language: Biology problems and language problems can often be seen as inference problems over relatively simple structures (like sequences or trees). These are naturally encoded as, say, HMMs or PCFGs, which are amenable to graphical model analysis.
  • Image Processing: A standard image representation is as a 2D lattice (undirected!) model, with the nodes representing pixels.

Background:

In most of the applications listed above, the underlying graph structures are relatively "dense." This is not promising, because non-probabilistic algorithms on graphs typically have trouble with dense graphs. It turns out (as we shall see soon) that the same is true in graphical model inference. The denser the graph, the harder (computationally!) the inference problem. For most real world problems, we're talking NP-hard.

Since these problems are NP-hard (and usually the really bad variety of NP-hard), we're going to go for approximations. One standard approximation method is to construct an MCMC algorithm. We won't discuss MCMC here (perhaps some other semester, or perhaps in one of the Stochastics classes in the Math department).

The approach we take instead is variational: that is, we attempt to convert the inference problems into optimization problems, and solve the optimization problems. The optimization problem will, of course, also be computationally intractable. However, by relaxing the optimization problem, we will obtain an (efficient) approximate solution, which will lead to an (efficient) approximate solution to the inference problem.

Graphical Models:

Let G = (V,E) be a graph. We will associate each vertex s \in V with a random variable x_s \in X_s. The graphical model consists of a collection of probability distributions over the random variables defined by the vertices of the graph that factorize according to the edges. (Notation: if A \subseteq V, define x_A = \{ x_s : s \in A \}.) There are two types of graphical models, just as there are two types of graphs:

  • Directed graphical models: here, let π(s) be the set of parents of node s; then we have p(\vec x) = \prod_{s \in V} p(x_s | x_{\pi(s)})
  • Undirected graphical models: here, the distribution factorizes according to functions defined on the cliques of the graph (aka fully-connected subsets of V). For every clique C, let ψC be some function that maps the values of the random variables in C (i.e., xC) to a positive real value. Then, p(\vec x) = \frac 1 Z \prod_C \psi_C(x_C).
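
As a concrete illustration of the directed factorization, here is a small Python sketch (the CPT values are made up for illustration) that builds p(a,b,c) = p(a) p(b) p(c | a,b) for three binary variables and checks that it normalizes:

```python
import itertools

# Tiny directed model A -> C <- B over binary variables (hypothetical CPTs).
p_a = {0: 0.6, 1: 0.4}
p_b = {0: 0.7, 1: 0.3}
p_c_given_ab = {  # p(c=1 | a, b); p(c=0 | a, b) is the complement
    (0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.9,
}

def joint(a, b, c):
    # p(a, b, c) = p(a) * p(b) * p(c | a, b)
    pc1 = p_c_given_ab[(a, b)]
    return p_a[a] * p_b[b] * (pc1 if c == 1 else 1 - pc1)

# A product of valid CPTs is automatically normalized.
total = sum(joint(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))
print(round(total, 10))  # 1.0
```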

Inference Problems:

Given a probability distribution p(\cdot) defined by a graphical model, we might want to:

  • compute the likelihood of some observed data (observed data = fixed values of a subset of the rvs)
  • compute the marginal distribution p(xA) for some fixed A \subset V (test question: why not A \subseteq V?)
  • compute the conditional distribution p(xA | xB) for disjoint A,B with A \cup B \subset V (same question!)
  • compute the mode of the density; i.e., solve \arg\max_{\vec x} p(\vec x)

Note that the first three are essentially the same: they involve one or more marginalizations. The last one is fundamentally different.

05 Sept: Factor Graphs and the Sum-Product Algorithm (Piyush and Zheng)

The Problem:

Given a function g(x1,...,xn) over a collection of random variables, compute the marginals g_i(x_i) = \sum_{\sim \{x_i\}}g(x_1, ..., x_n). Here, a marginal gi(xi) means we are summing the function g(x1,...,xn) over all possible values (i.e. all 'configurations') of all random variables except xi. We assume each xi belongs to some domain (or alphabet) Ai. The global function g(x1,...,xn) is assumed to factorize into a product of local functions: g(x_1, ..., x_n) = \prod_{j \in J}f_j(X_j), where J is some discrete index set. The precise nature of this factorization depends on the kind of graphical model we have (for example, the Bayes nets and MRFs we saw last week).
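
The marginal above can always be computed by brute-force enumeration, which is what the sum-product algorithm will improve upon. A minimal sketch, with two made-up factors over three binary variables:

```python
import itertools

# g factorizes as f1(x1, x2) * f2(x2, x3); the factor values are hypothetical.
f1 = {(x1, x2): 1.0 + x1 + 2 * x2 for x1 in (0, 1) for x2 in (0, 1)}
f2 = {(x2, x3): 1.0 + 3 * x2 * x3 for x2 in (0, 1) for x3 in (0, 1)}

def g(x1, x2, x3):
    return f1[(x1, x2)] * f2[(x2, x3)]

def marginal(i):
    # g_i(x_i): sum g over all configurations of the other variables.
    out = {0: 0.0, 1: 0.0}
    for x in itertools.product((0, 1), repeat=3):
        out[x[i]] += g(*x)
    return out

print(marginal(0))  # {0: 17.0, 1: 24.0}
```

Note the exponential cost: the loop visits all |A1|·|A2|·|A3| configurations, which is exactly what factorization lets us avoid.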


The Technique:

It turns out that we can visualize such factorization using a bipartite graph called a factor graph and use message-passing algorithms to do inference on probability distributions associated with these graphs. In particular, we shall consider the problem of finding marginals for such distributions and see how a message-passing technique called the sum-product algorithm helps us do this efficiently. The sum-product algorithm operates on factor graphs. To apply it for other graphical models (such as Bayes net or MRF), we first need to convert them into an equivalent factor graph.

Note that our focus will be on tree-structured factor graphs (the majority of applications deal with such graphs), for which the sum-product algorithm gives exact results. The algorithm is also applicable to factor graphs with cycles, but there the results will be approximate.

Factor Graphs:

A factor graph is a bipartite graph that consists of a set of factor nodes f_j (one for each local function) and a set of variable nodes x_i. There is an edge between f_j and x_i if and only if x_i is an argument of f_j. It turns out that various marginalization algorithms can be described as sets of message passing operations on a factor graph.

  • An incoming message at node xi is a vector of length | Ai |, where | Ai | is the cardinality of the domain Ai of xi.

We can think of each vertex in a factor graph as a processor, and of the edges as channels over which the processors can communicate (i.e. pass messages). Using message passing operations, factor graphs give us a way to efficiently compute the marginals. The versatility of factor graphs lies in the fact that they represent various types of graphical models (e.g., Bayes nets, MRFs) in a common framework under which inference can be carried out efficiently. We shall see examples of how to construct such representations.

The Sum-Product Algorithm:

Repeatedly invoking a simple message passing procedure for each node separately would give us the marginal for each individual node, but it would fail to share the intermediate terms arising in the individual computations. The sum-product algorithm is a dynamic-programming algorithm that shares these intermediate terms and computes all the marginals simultaneously and in parallel.

Message passing starts at the leaf nodes of the factor graph. Each node v waits until it has received messages on all but one of its incident edges, then uses those messages to compute the message to be sent on the remaining edge. The algorithm terminates once messages have been passed in both directions along every edge; thus, on a factor graph with n edges, 2n messages are passed before the sum-product algorithm terminates. On termination, the marginal function gi(xi) is simply the product of all incoming messages at xi. Because the message passing operations involve computing sums and products, the procedure is called the sum-product algorithm.

Since there are two types of nodes in a factor graph, two types of messages are passed in the sum-product algorithm:

  • variable node x to factor node f: \mu_{x \rightarrow f}(x) = \prod_{h \in n(x) \backslash \{f\} }\mu_{h \rightarrow x}(x)
  • factor node f to variable node x: \mu_{f \rightarrow x}(x) = \sum_{\sim \{x\}}(f(X)\prod_{y \in n(f) \backslash \{x\} }\mu_{y \rightarrow f}(y))

In the above, n(v) denotes the set of neighbors of node v, and X = n(f) is the set of arguments of the function f. Important: Notice that the message from a variable node to a factor node is significantly simpler than the message from a factor node to a variable node. This is because (a) there is no local function at a variable node, and (b) no summary operator is required, since the product of incoming messages is already a function of the single variable x.
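
The two message equations can be traced by hand on a small chain-structured factor graph x1 - fa - x2 - fb - x3 (the factor tables below are made up for illustration). The leaf variable nodes send the unit message, each factor sums out its other argument, and the marginal at x2 is the product of its incoming messages:

```python
# Chain-structured factor graph x1 - fa - x2 - fb - x3 over binary variables.
# The factor tables are hypothetical.
fa = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}   # fa(x1, x2)
fb = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}   # fb(x2, x3)

# Leaf variable nodes send the unit (all-ones) message.
msg_x1_to_fa = {0: 1.0, 1: 1.0}
msg_x3_to_fb = {0: 1.0, 1: 1.0}

# Factor-to-variable messages: multiply incoming messages into the local
# function and sum out every argument except the target variable.
msg_fa_to_x2 = {x2: sum(fa[(x1, x2)] * msg_x1_to_fa[x1] for x1 in (0, 1))
                for x2 in (0, 1)}
msg_fb_to_x2 = {x2: sum(fb[(x2, x3)] * msg_x3_to_fb[x3] for x3 in (0, 1))
                for x2 in (0, 1)}

# The marginal at x2 is the product of all incoming messages.
g2 = {x2: msg_fa_to_x2[x2] * msg_fb_to_x2[x2] for x2 in (0, 1)}

# Brute-force check against direct summation.
brute = {x2: sum(fa[(x1, x2)] * fb[(x2, x3)] for x1 in (0, 1) for x3 in (0, 1))
         for x2 in (0, 1)}
print(g2 == brute)  # True
```

On a longer chain the savings become apparent: each message is reused by every marginal downstream of it, instead of being recomputed per node.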

Special Instances of the Sum-Product Algorithm:

The sum-product algorithm is quite general and it subsumes a wide variety of practical algorithms such as the forward/backward algorithm, the Viterbi algorithm, and the Belief Propagation algorithm in Bayes nets. In particular, we shall see examples of the forward/backward algorithm used for inference in HMMs, and the Belief Propagation in Bayes nets, and see how they turn out to be special instances of a sum-product algorithm operating on a factor graph.

19 Sept: Junction Tree Algorithm and Triangulations (Nathan and Seth)

Motivation

D-Separation of Graphs: In order to get a better understanding of the method used to convert DAGs to undirected trees, it will be useful to understand d-separation. There seems to be some disagreement over what the 'd' stands for in d-separation, with one camp claiming directional and another claiming dependence. Both conventions have value in understanding what the separation entails.

If two variables X and Y are d-separated relative to a variable Z in a directed graph, then they are conditionally independent given Z in every probability distribution this graph can represent. X and Y are independent conditional on Z if observing X gives you no extra information about Y once you have observed Z, i.e. once you know Z, X adds no information to what you know about Y.

Consider all directed graphs of three variables, A,B,C, where C is observed:

  • A -> C -> B - The path from A to B is blocked by conditioning on C. A and B are d-separated in this case.
  • A <- C <- B - The path from B to A is blocked by conditioning on C. A and B are d-separated in this case.
  • A <- C -> B - Here we are conditioning on a common cause, the effects are independent. A and B are d-separated in this case.
  • A -> C <- B - In this orientation, C is called a collider. A and B are not conditionally independent given C, thus A and B are d-connected. (Note: if C and all of its descendants are unobserved, then this case is blocking.)

In short, a path is blocked if it contains a node at which either:

  • the arrows on the path meet head-to-tail or tail-to-tail and the node is observed, or
  • the arrows meet head-to-head and neither the node nor any of its descendants is observed.

If every path between the nodes A and B is blocked, then A is d-separated from B by C (and A and B are then also conditionally independent given C).
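
The single-node blocking rules can be sketched in a few lines (the function name and edge encoding are our own, and this simplified version ignores the descendants-of-a-collider subtlety that the full rules include):

```python
def blocks(left_arrow, right_arrow, node_observed):
    """Does the middle node C block the three-node path A ?- C -? B?

    left_arrow is the direction of the A-C edge as written ('->' means A -> C);
    right_arrow is the direction of the C-B edge as written ('->' means C -> B).
    A collider (head-to-head) blocks when C is unobserved; the other two
    orientations block when C is observed.
    """
    collider = (left_arrow == '->' and right_arrow == '<-')
    return (not node_observed) if collider else node_observed

# The four three-node cases with C observed:
print(blocks('->', '->', True))   # chain A -> C -> B: blocked
print(blocks('<-', '<-', True))   # chain A <- C <- B: blocked
print(blocks('<-', '->', True))   # common cause A <- C -> B: blocked
print(blocks('->', '<-', True))   # collider A -> C <- B: NOT blocked
```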

Converting DAGS to Undirected Trees

  • Moralization: connect ("marry") the parents of each node and then drop all edge directions, producing an undirected graph.
  • Triangulation: add edges so that every cycle of length four or more has a chord (an edge between two non-adjacent nodes on the cycle).
  • Construct Junction Tree: take the maximal cliques of the triangulated graph as nodes and connect them into a tree so that the running intersection property holds.

The Junction Tree Algorithm:

Essentially, this algorithm is very similar to the sum-product message passing algorithm from the last meeting. It has some neat properties, notably the running intersection property, which asserts that inference about variables will be consistent across the graph, i.e. that local consistency ensures global consistency. This property is a consequence of the triangulation step above and states that if a variable is contained in two cliques, then it must also be contained in every clique on the path that connects them.
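
The running intersection property is easy to check mechanically; a sketch (the function name and the example cliques are our own):

```python
def has_running_intersection(cliques, tree_edges):
    """Check the running intersection property of a clique tree.

    cliques: list of frozensets of variables; tree_edges: (i, j) index pairs
    forming a tree over the cliques.  For every variable, the cliques that
    contain it must induce a connected subtree.
    """
    adj = {i: set() for i in range(len(cliques))}
    for i, j in tree_edges:
        adj[i].add(j)
        adj[j].add(i)
    for v in set().union(*cliques):
        holders = {i for i, c in enumerate(cliques) if v in c}
        # Flood-fill within the holder cliques from one of them.
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node] & (holders - seen):
                seen.add(nxt)
                stack.append(nxt)
        if seen != holders:
            return False
    return True

good = [frozenset('ABC'), frozenset('BCD'), frozenset('CDE')]
print(has_running_intersection(good, [(0, 1), (1, 2)]))  # True: valid path
# Connecting {A,B,C} to {C,D,E} directly strands B's cliques:
print(has_running_intersection(good, [(0, 2), (2, 1)]))  # False
```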

26 Sept: Exponential Families and Convex Analysis (Amit and Piyush)

It turns out that a wide variety of message-passing algorithms (some of which we have already seen, and others we will be seeing in the coming weeks) can also be understood as solving exact or approximate variational problems. Exponential families (and tools from convex analysis) provide a unified framework to develop these variational principles.

Exponential Families: An exponential family defines a parametrized collection of (probability) density functions: p(x;\theta) = \exp\{\langle\theta,\phi(x)\rangle-A(\theta)\}. The quantity A acts as a normalization constant and is commonly known as the log partition function: A(\theta) = \log \int_{X^n} \exp \langle\theta,\phi(x)\rangle\, dx. The functions \phi_\alpha : X^n \rightarrow R are known as sufficient statistics, and \phi = \{\phi_\alpha : \alpha \in I \} is the vector of sufficient statistics. It is easy to see that p(x;θ) depends on x only through φ(x) (hence the name sufficient statistics); another way to think of φ is that it is sufficient to describe the family. \theta = \{\theta_\alpha : \alpha \in I \} is a vector called the exponential (or canonical) parameter, used to identify members of the family associated with φ. \Theta := \{\theta \in R^d \,|\, A(\theta) < \infty \} is the set θ comes from. We will show that A is a convex function of θ, which in turn implies that Θ is a convex set.

We will see some examples of distributions (Bernoulli, Gaussian, and another distribution defined as a graphical model -- the Ising model) and show that each of these can be represented as exponential families.

Representations: An exponential family admits either of the following two representations:

  • Minimal: We define the exponential family with a collection of functions φ = {φα} such that there exists no linear combination \langle a,\phi(x)\rangle=\sum_{\alpha \in I}a_\alpha \phi_\alpha(x) equal to a constant. This condition yields a minimal representation, in which a unique parameter vector θ is associated with each distribution.
  • Overcomplete: Rather than using a minimal representation, we can use an overcomplete representation. Here there exists an entire affine subset of parameter vectors θ, each associated with the same distribution.

Properties of the log partition function: The log partition function A(θ) is both smooth and convex in θ. It can also be seen as the cumulant generating function of the random vector φ(x). In particular, it is easy to see that the first cumulant gives us the mean of φ(x):

\frac{\partial A}{\partial \theta_{\alpha}}(\theta) = E_{\theta}[\phi_{\alpha}(x)] := \int \phi_{\alpha}(x)p(x;\theta)dx

Higher cumulants can be similarly defined.
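
The first-cumulant identity is easy to verify numerically for the Bernoulli family, where φ(x) = x and A(θ) = log(1 + e^θ); a finite-difference derivative of A should match E_θ[φ(x)]:

```python
import math

# Bernoulli as an exponential family: phi(x) = x, A(theta) = log(1 + e^theta).
def A(theta):
    return math.log(1 + math.exp(theta))

theta = 0.7
# Central finite-difference approximation of dA/dtheta.
eps = 1e-6
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)

# Mean of phi(x) under p(x; theta): E[x] = e^theta / (1 + e^theta).
mean = math.exp(theta) / (1 + math.exp(theta))
print(abs(dA - mean) < 1e-8)  # True
```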

Mean (expectation) parameters: Let's consider the set of vectors defined by taking expectation of φ(x) w.r.t. an arbitrary distribution:

\mathcal{M} := \{\mu \in R^d |\ \exists p(.)\ s.t. \int \phi(x)p(x)dx = \mu \}

Now let's take an arbitrary member of an exponential family p(x;θ) defined by φ(x) and define a mapping:

\Lambda(\theta) := E_\theta[\phi(x)] = \int \phi(x)p(x;\theta)dx

The mapping Λ associates to each θ a vector of mean parameters μ: = Λ(θ) belonging to the set \mathcal{M}. It can also be seen easily that \Lambda(\theta) = \nabla A(\theta).

The mapping Λ between the set Θ of exponential parameters θ and the set \mathcal{M} of mean parameters μ is interesting and worth remembering. In particular, the following conditions hold:

  • The mapping Λ is one-to-one if and only if the exponential representation is minimal.
  • The mapping Λ is onto the relative interior of \mathcal{M} (i.e.,\Lambda(\Theta) = ri \mathcal{M}).

For minimal or overcomplete representations, we say that the pair (θ,μ) is dually coupled if μ = Λ(θ) and hence \theta \in \Lambda^{-1}(\mu).

Some conjugate duality

Following the duality between θ and μ, we define the Legendre-Fenchel dual of the log partition function A(θ) as:

A^*(\mu) := \sup_{\theta \in \Theta}\{\langle\mu,\theta\rangle - A(\theta)\}

It turns out that the dual function A*(μ) is actually the negative entropy of p(x;θ(μ)), where θ(μ) is an element of the inverse image Λ^{-1}(μ), for \mu \in \mathcal{M}. In terms of this dual, we can give the log partition function a variational representation:

A(\theta) =  \sup_{\mu \in \mathcal{M}}\{\langle\theta,\mu\rangle - A^*(\mu)\}

So it turns out that the variational representation of A(θ) just reduces to an optimization problem over the domain \mathcal{M}.
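
For the Bernoulli family this variational representation can be checked by brute force: A*(μ) is the negative entropy μ log μ + (1-μ) log(1-μ) on (0,1), and a grid search over the mean parameter recovers A(θ) = log(1 + e^θ):

```python
import math

theta = 1.3

# Dual of the Bernoulli log partition function: negative entropy on (0, 1).
def A_star(mu):
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

# Grid search over the (relative interior of the) mean parameter set.
best = max(theta * mu - A_star(mu) for mu in (i / 10000 for i in range(1, 10000)))

exact = math.log(1 + math.exp(theta))  # A(theta) in closed form
print(abs(best - exact) < 1e-4)  # True
```

The grid supremum sits just below the true value, as any inner approximation of the sup must.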

The duality between A(θ) and A*(μ) is also very important. For example, this duality gives us several alternative ways to compute the KL divergence between two exponential family members.

The take-home message: It's important to keep in mind the properties of the log partition function A(θ) and the mean parameters μ = Eθ[φ(x)]. As we will see in the subsequent weeks, in the variational framework for doing inference (in particular, the problems of computing marginals), it's essentially these quantities that we need to deal with.

30 Sept: Variational principles, exact inferences on trees (Arvind and Senthil)

Given a graphical model, we are interested in some probability distribution p(x;θ), where x is the vector of random variables and θ is the set of parameters associated with the graph. Note that the size of x equals the number of nodes in the graph, while the size of θ depends on the graph structure.

As stated in the last meeting, we will deal only with a special class of probability distributions called exponential families. Exponential families possess special properties which not only make them nice and easy to handle but also cover most real-world models. Despite these convenient properties, it may not always be possible to do exact inference unless the graph possesses some special structure, e.g. a tree. To do inference on more complicated graphs, we resort to approximate methods; one such family of methods is known as variational methods.

Recall that the problem of doing inference for an exponential family boils down to computing the log partition function A(θ) and the mean parameters μ = E_θ[φ(x)] for a given distribution p(x;θ). We can compute the log partition function and mean parameters using Fenchel-Legendre duality, which states:

A(\theta) =  \sup_{\mu \in \mathcal{M}}\{\langle\theta,\mu\rangle - A^*(\mu)\}

The above result is important in two senses. First, it converts the problem of computing the mean parameters into an optimization problem and thus gives a way to compute them. Second, it yields the log partition function A(θ) as the value of that optimization problem.

Having stated the above result, it is tempting to assume that the problem of computing mean parameters and the log partition function is now solved, since it has been reduced to a convex optimization problem. This is true for some exponential families, for which the above equation can be written in an explicit form, but not for general exponential families. The problem poses two main challenges which make this optimization hard to solve in general.

  1. In many cases, the constraint set \mathcal{M} of realizable mean parameters is extremely difficult to characterize in an explicit manner.
  2. the negative entropy function A* is defined indirectly, so it too lacks an explicit form.

Both of these issues are handled by approximating the set of mean parameters and the dual function A*. As we will see in the coming meetings, different approximate inference algorithms arise from the variational method depending on the kind of approximation used.

Set of realizable mean parameters

We show two important classes of exponential families for which \mathcal{M} is straightforward to characterize: arbitrary Gaussian distributions and multinomial distributions on junction trees. In the multinomial example, \mathcal{M} is given by a convex polytope whose boundaries are formed by the constraints on the marginals. We will see the construction of this polytope for two representations of the tree: the minimal representation and the overcomplete representation.


Nature of the dual function A* (from last lecture)

The Fenchel-Legendre conjugate of the log partition function A, denoted A*, is defined as:

A^*(\mu) := \sup_{\theta \in \Theta}\{\langle\mu,\theta\rangle - A(\theta)\}

This guarantees that A* is convex. But, A* lacks a closed form expression which makes the solution computationally challenging. However, exact characterizations can be provided for Gaussian and Tree-structured problems.

  • General properties of A*

As given in Table 2 of WJ (page 24), closed-form expressions for A* are the exception rather than the rule. So it follows that A* is usually defined implicitly via the composition of two steps:

  1. First, compute the exponential parameter θ(μ) in the inverse image Λ^{-1}(μ)
  2. Then calculate the negative entropy of the distribution p(x;θ(μ))

This approach follows from:

A^*(\mu) = -H (p(x;\theta(\mu))) \qquad \text{if}\ \mu \in \mathrm{ri}\, \mathcal{M}

Even though A* is not available in closed form, various properties can be inferred from its variational definition.

Properties of A*

  • A* is always convex and lower semi-continuous.

Other properties depend on the nature of the exponential family.

In a minimal representation,

  1. A* is differentiable on int \mathcal{M}, and \nabla A^*(\mu) = \Lambda^{-1}(\mu)
  2. A* is strictly convex

Despite these desirable properties, A* still poses substantial computational challenges. Both steps in the decomposition of A* are difficult:

  1. The inverse image Λ^{-1}(μ) of the mean parameter mapping does not have a closed-form expression, though it is well defined mathematically. Iterative methods are typically necessary to compute this mapping. However, these iterative algorithms presuppose that it is possible to do exact inference, which is the original problem we are trying to solve.
  2. Even if we are able to compute the inverse image of the mean parameter mapping, it is generally not possible to compute the entropy for a large problem.

As with \mathcal{M}, there are two important cases where A* can be characterized in closed form: Gaussian distributions and tree-structured problems. Only tree-structured problems are discussed below.

Tree structured problems

A consequence of the junction tree representation is that A* always has a closed-form representation. Let us consider a simple tree T = (V,E(T)) to illustrate this. Consider the canonical representation of the overcomplete tree-structured multinomial distribution discussed above under \mathcal{M}. The mean parameters μ = {μ_s, μ_st} correspond to local marginals associated with single nodes and single edges. In particular, we will use the local marginal functions μs(xs) and μst(xs,xt).

By a special case of junction tree decomposition, a tree-structured distribution factorizes in terms of local marginal distributions as:

p(x) = \prod_{s \in V}\mu_s(x_s) \prod_{(s, t) \in E(T)} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s) \mu_t(x_t)}

We substitute the above value of p(x) into the entropy

H(p(x;\theta)) = - \int_{X^n} p(x; \theta ) \log[p(x;\theta)]\, \nu(dx)

and integrate. We get:

A_{tree}^* = - \sum_{s \in V}{H_s (\mu_s)} + \sum_{(s,t) \in E(T)}{I_{st}(\mu_{st})}

where Hs denotes the single-node entropy and Ist the mutual information term for edge (s,t).

Combining this with the expression for our characterization of the marginal polytope MARG(T), we obtain the following:

A(\theta) = \max_{\mu \in MARG(T)}{\{ \langle\theta, \mu\rangle + \sum_{s \in V}{H_s (\mu_s)}  - \sum_{(s,t) \in E(T)}{I_{st}(\mu_{st})} \}}

This has a simple structure and is a convex problem. The constraint set MARG(T) is a polytope defined by only O(n) constraints. Thus, we find that A* has a closed-form representation for tree-structured problems.
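
The entropy decomposition behind A*_tree can be sanity-checked on a single-edge tree, where p = μ_st and the identity H(p) = H_s + H_t - I_st is exact (the joint marginal values below are made up):

```python
import math

# Joint marginal mu_st over one edge (s, t) of binary variables (hypothetical).
mu_st = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
mu_s = {x: mu_st[(x, 0)] + mu_st[(x, 1)] for x in (0, 1)}
mu_t = {y: mu_st[(0, y)] + mu_st[(1, y)] for y in (0, 1)}

H_joint = -sum(p * math.log(p) for p in mu_st.values())
H_s = -sum(p * math.log(p) for p in mu_s.values())
H_t = -sum(p * math.log(p) for p in mu_t.values())
I_st = sum(mu_st[(x, y)] * math.log(mu_st[(x, y)] / (mu_s[x] * mu_t[y]))
           for x in (0, 1) for y in (0, 1))

# H(p) = H_s + H_t - I_st, so A*_tree = -(H_s + H_t) + I_st = -H(p).
print(abs(H_joint - (H_s + H_t - I_st)) < 1e-10)  # True
```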

3 Oct: Mean-field and structured mean-field (Jagadeesh and Stephen)

In the previous class, we saw how to do exact inference on trees using the variational principle. An arbitrary graph can be converted into its junction tree representation, on which inference can then be performed. But, as noted at the end of the previous class, the junction tree representation can result in graphs with large clique size (treewidth); computing the marginal probability of an individual node then requires marginalizing over a large set of variables and may be very expensive. Instead of transforming to a junction tree, in this class we will see a technique for performing approximate inference on the original graph itself.

The two main difficulties associated with the variational principle are the nature of the constraint set \mathcal{M} and the lack of a tractable form for the dual function A*. Mean field theory relaxes some of the constraints in the original graph, resulting in a subset of (tractable) distributions for which A* is relatively easy to characterize.


Tractable families: A tractable subgraph is a subgraph H of G over which it is feasible to perform exact calculations. If \mathcal{I}(H) denotes the subset of indices associated with cliques in H, then the set of exponential parameters corresponding to distributions structured according to H is given by: \mathcal{E}(H) := \{ \theta \in \Theta | \theta_{\alpha} = 0 \; \forall \alpha \in \mathcal{I}\backslash \mathcal{I}(H) \}

Example: The subgraph H_0 = (V,\emptyset) is the simplest subgraph, with parameters \mathcal{E}(H_0) := \{\theta \in \Theta | \theta_{st}=0 \; \forall \;(s,t) \in E \}, where θst refers to the parameters associated with the edge (s,t). The associated distributions are products over the nodes: p(x;\theta) = \prod_{s\in V}p(x_s;\theta_s).

Instead of dropping all the edges, a more structured approximation is to consider a subgraph forming a spanning tree T of the original graph. Accordingly, the subspace of distributions is given by \mathcal{E}(T) := \{\theta | \theta_{st}=0 \; \forall \;(s,t) \not\in E(T) \}

For a given subgraph H, the set of all mean parameters realizable by tractable distributions is: \mathcal{M}_{tract}(G;H):=\{\mu \in R^d \;|\;\mu = E_{\theta}[\phi(x)] \text{ for some } \theta \in \mathcal{E}(H)\}. Since any μ that arises from a tractable distribution is certainly a valid mean parameter, \mathcal{M}_{tract}(G;H)\subseteq \mathcal{M}(G) holds.

Optimization and Lower bounds

From the variational principle A(\theta) = \sup_{\mu \in \mathcal{M}}\{\langle\mu,\theta\rangle - A^*(\mu)\}, it is clear that any valid mean parameter specifies a lower bound on the log partition function, i.e. A(\theta) \geq \langle\theta,\mu\rangle - A^*(\mu). Since the dual function A* lacks an explicit form, it is not always possible to compute this lower bound. The mean field approach overcomes this difficulty by restricting μ to a tractable subset \mathcal{M}_{tract} for which the dual function has an explicit form A^*_H. Moreover, the mean field method chooses the best approximation to the mean parameters, μMF, measured in terms of the tightness of the bound. So the mean field approximation is the solution of \sup_{\mu \in \mathcal{M}_{tract}(G;H)}\{\langle\mu,\theta\rangle - A^*_H(\mu)\}


Relation to KL-divergence

Mean field theory aims to find the approximate probability distribution, constrained to the subgraph, that minimizes the KL-divergence to the probability distribution represented by the original graph.

The KL-divergence between the probability distributions governed by parameters \theta^1,\; \theta^2 is D(\theta^1 \| \theta^2) = A(\theta^2) - A(\theta^1) - \langle \mu^1, \theta^2-\theta^1\rangle. For the dually coupled pair (\theta^1, \mu^1), Fenchel's inequality holds with equality, yielding D(\theta^1 \| \theta^2) = D(\mu^1\|\theta^2) = A(\theta^2) + A^*(\mu^1) - \langle \mu^1, \theta^2\rangle

Thus, when mean field maximizes \sup_{\mu \in \mathcal{M}_{tract}(G;H)}\{\langle\mu,\theta\rangle - A^*_H(\mu)\}, it is finding the best approximating distribution (among those consistent with the subgraph) to the original distribution by minimizing the KL-divergence between them.

Naive mean field updates

The naive mean field approach corresponds to the trivial subgraph without any edges, so the joint distribution fully factorizes over the node distributions.

Example

With the fully disconnected graph H_0, the tractable set is \mathcal{M}_{tract}(G;H_0) := \{ (\mu_s,\mu_{st}) \;|\; 0\leq\mu_s\leq1, \; \mu_{st} = \mu_s \mu_t\} and the dual function is A^*_{H_0}(\mu) = \sum_{s\in V} [\mu_s \log \mu_s + (1-\mu_s) \log (1-\mu_s)].

The naive mean field problem becomes \max_{\mu \in \mathcal{M}_{tract}(G;H_0)}\{\langle\mu,\theta\rangle - A^*_{H_0}(\mu)\}. Substituting μst = μsμt gives \max_{\{\mu_s\} \in [0,1]^n}\Big\{\sum_{s\in V} \theta_s \mu_s + \sum_{(s,t)\in E} \theta_{st} \mu_s \mu_t - \sum_{s\in V} [\mu_s \log \mu_s + (1-\mu_s) \log (1-\mu_s)] \Big\}.

Computing the derivative with respect to each μs and setting it to zero yields the iterative updates \mu_s \leftarrow \sigma(\theta_s + \sum_{t\in N(s)} \theta_{st} \mu_t), where σ is the logistic function.
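These updates can be sketched in a few lines. The following is a minimal coordinate-ascent sketch for a pairwise binary MRF; the chain, parameter values, and function names are illustrative, not from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def naive_mean_field(theta_s, theta_st, num_sweeps=200):
    """Coordinate-ascent naive mean field for a pairwise binary MRF
    p(x) proportional to exp(sum_s theta_s[s] x_s + sum_(s,t) theta_st[(s,t)] x_s x_t).
    Returns mu[s], the factorized approximation to P(x_s = 1)."""
    nbrs = {s: [] for s in theta_s}
    for (s, t), w in theta_st.items():
        nbrs[s].append((t, w))
        nbrs[t].append((s, w))
    mu = {s: 0.5 for s in theta_s}  # uniform initialization
    for _ in range(num_sweeps):
        for s in theta_s:  # cycle through the updates mu_s <- sigma(...)
            mu[s] = sigmoid(theta_s[s] + sum(w * mu[t] for t, w in nbrs[s]))
    return mu

# toy 3-node chain: positive field on node 0, negative field on node 2
mu = naive_mean_field({0: 0.5, 1: 0.0, 2: -0.5}, {(0, 1): 1.0, (1, 2): 1.0})
```

Because the objective is non-convex in general, the fixed point reached can depend on the initialization; the sweeps simply iterate the updates until they stop changing.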

Structured mean field

Instead of considering a fully disconnected graph, we can consider an arbitrary tractable subgraph H of the original graph G. Let \mathcal{I}(H) be the subset of indices corresponding to the sufficient statistics associated with H, and let \mu(H) := \{\mu_{\alpha} | \alpha \in \mathcal{I}(H)\} be the associated set of mean parameters. Then:

1) the subvector μ(H) can be an arbitrary member of \mathcal{M}(H), the set of realizable mean parameters defined by the subgraph H

2) the dual function A^*_H depends only on μ(H), and not on the mean parameters μβ for indices β in the complement \mathcal{I}^c(H):=\mathcal{I}(G)\backslash\mathcal{I}(H)

The remaining mean parameters μβ appear only in the inner product term, where they are constrained to be nonlinear functions μβ = gβ(μ(H)) of the subvector μ(H). So the final optimization problem can be rewritten in the form: \sup_{\mu \in \mathcal{M}_{tract}(G;H)}\{\langle\mu,\theta\rangle - A^*_H(\mu)\} = \sup_{\mu(H) \in \mathcal{M}(H)}\Big\{\sum_{\alpha \in \mathcal{I}(H)} \theta_{\alpha}\mu_{\alpha} + \sum_{\alpha \in \mathcal{I}^c(H)} \theta_{\alpha} g_{\alpha}(\mu(H)) - A^*_H(\mu(H)) \Big\}

Again, taking the derivative and setting it to zero yields the mean parameter updates \gamma_{\beta}(H) \leftarrow \theta_{\beta} + \sum_{\alpha \in \mathcal{I}(G)\backslash\mathcal{I}(H)} \theta_\alpha  \frac{\partial g_{\alpha}}{\partial \mu_\beta}(\mu(H)), where γβ(H) is the exponential parameter associated with μβ(H).


Geometric view of mean field

The variational problem may be non-convex, so there may be local optima and the mean field updates can have multiple solutions. One way to understand this non-convexity is in terms of the set of tractable mean parameters: it can be shown that the set \mathcal{M}_{tract}(G;H) is non-convex. A practical consequence is that the mean field updates are often sensitive to the initial conditions.

Parameter Estimation and Variational EM

Viewed in the EM setting, the E-step (computing the expectation of the hidden variables) corresponds to estimating the mean parameters, while the M-step ensures that the (approximate) likelihood of the data increases with each iteration.

10 Oct: Bethe entropy approximation and sum-product algorithm (Anthony and Jiarong)

Previously, we saw how a local message-passing algorithm, specifically the sum-product algorithm, performs exact inference on tree-structured graphs, where it is guaranteed to converge to the exact solution. This is not the case for graphs with cycles: there, the sum-product algorithm can only converge to a fixed point of an approximate free energy known as the Bethe free energy.

What about mean field?

Both mean field and sum-product are message-passing algorithms, although their variational interpretations differ. Mean field restricts the optimization to a subset of distributions for which the negative entropy A^* and the mean parameters are easily characterized. By contrast, the sum-product algorithm enlarges the constraint set (to an outer bound) and approximates the entropy function.

Bethe entropy approximation

What do we need?

Let's consider a pairwise Markov random field defined by a graph G=(V,E) and represent it in the following form based on canonical overcomplete representation:

p(x;\theta) \propto exp\{\sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s,x_t)\}

where θs(xs) and θst(xs,xt) are the node and edge potential functions.

Also, let's denote the marginal polytope associated with this exponential representation as MARG(G).

For tree-structured graphs, the entropy -A^* decomposes exactly into a sum of single-node entropies H_s(\mu_s) minus edgewise mutual information terms I_{st}(\mu_{st}).

H_s(\mu_s) := -\sum_{x_s}\mu_s(x_s)\log\mu_s(x_s)
I_{st}(\mu_{st}) := \sum_{x_s,x_t}\mu_{st}(x_s,x_t)\log\frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s) \mu_t(x_t)} = H_s(\mu_s) + H_t(\mu_t) - H_{st}(\mu_{st})

For graphs with cycles, the Bethe approximation to the negative entropy is defined as:

-A^*(\mu) \approx H_{Bethe}(\mu) := \sum_{s \in V}H_s(\mu_s) - \sum_{(s,t) \in E}I_{st}(\mu_{st})
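The Bethe entropy can be computed directly from a set of pseudomarginals. Below is a small sketch; the node and edge tables are illustrative. For a factorized pair, the mutual information term vanishes, so H_Bethe reduces to the sum of node entropies.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with natural logarithm."""
    return -sum(q * math.log(q) for q in p if q > 0)

def bethe_entropy(tau_s, tau_st):
    """H_Bethe(tau) = sum_s H_s(tau_s) - sum_(s,t) I_st(tau_st),
    using the identity I_st = H_s + H_t - H_st.
    tau_s: dict s -> probability list; tau_st: dict (s,t) -> 2D joint table."""
    total = sum(entropy(p) for p in tau_s.values())
    for (s, t), joint in tau_st.items():
        h_joint = entropy([q for row in joint for q in row])
        total -= entropy(tau_s[s]) + entropy(tau_s[t]) - h_joint
    return total

# independent pair: the joint is the outer product of the node marginals,
# so the mutual information is zero
tau_s = {0: [0.3, 0.7], 1: [0.6, 0.4]}
tau_st = {(0, 1): [[0.18, 0.12], [0.42, 0.28]]}
hb = bethe_entropy(tau_s, tau_st)
```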

Next, we impose only a subset of the constraints defining MARG(G), which yields an outer bound on it. The usual μ notation is changed to τ to signify marginal-like quantities on this relaxed set, denoted LOCAL(G); such τ are known as pseudomarginals.

LOCAL(G) = \{\tau \geq 0 \;|\; \sum_{x_s}\tau_s(x_s)=1 \;\forall s \in V, \;\; \sum_{x_s}\tau_{st}(x_s,x_t)=\tau_t(x_t) \;\forall (s,t) \in E\}

Note that LOCAL(G) \supseteq MARG(G)

Bethe variational problem

By combining LOCAL(G) with the entropy approximation HBethe, we get:

\max_{\tau \in LOCAL(G)} \{ <\theta,\tau> + \sum_{s \in V}H_s(\tau_s) - \sum_{(s,t) \in E}I_{st}(\tau_{st}) \}

Solving the Bethe variational problem

The sum-product algorithm can be derived as an iterative method for solving a Lagrangian formulation of the Bethe variational problem.

Let's consider the following partial Lagrangian corresponding to the Bethe variational problem:

L(\tau;\lambda) := <\theta,\tau> + \sum_{s \in V}H_s(\tau_s) - \sum_{(s,t) \in E}I_{st}(\tau_{st}) + \sum_{(s,t) \in E}[\sum_{x_s}\lambda_{ts}(x_s)C_{ts}(x_s) + \sum_{x_t}\lambda_{st}(x_t)C_{st}(x_t)], where C_{ts}(x_s) := \tau_s(x_s) - \sum_{x_t}\tau_{st}(x_s,x_t) encodes the marginalization constraints.

Taking the derivatives of the Lagrangian w.r.t. τs and τst, setting them to zero, and simplifying yields:

\tau_s(x_s) = k \exp(\theta_s(x_s))\prod_{t \in N(s)}M_{ts}(x_s)
\tau_{st}(x_s,x_t) = k' \exp(\theta_{st}(x_s,x_t)+\theta_s(x_s)+\theta_t(x_t)) \prod_{u \in N(s)\backslash t}M_{us}(x_s)\prod_{u \in N(t)\backslash s}M_{ut}(x_t)

In the end, we get a familiar sum-product message update:

M_{ts}(x_s) = k \sum_{x_t}\exp\{\theta_{st}(x_s,x_t)+\theta_t(x_t)\} \prod_{u \in N(t)\backslash s}M_{ut}(x_t)

The resulting pseudomarginals are locally consistent, but global consistency is not guaranteed.
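A minimal sketch of these message updates for a pairwise binary MRF follows; the graph and parameter tables are illustrative. On a tree the resulting pseudomarginals coincide with the exact marginals, while on a graph with cycles they are only locally consistent.

```python
import math

def sum_product(theta_s, theta_st, num_iters=100):
    """Sum-product updates M_ts(x_s) for a pairwise binary MRF; returns the
    node pseudomarginals tau_s.  theta_s[s] and theta_st[(s,t)] are tables."""
    nbrs = {s: set() for s in theta_s}
    for (s, t) in theta_st:
        nbrs[s].add(t)
        nbrs[t].add(s)

    def etheta(s, t, xs, xt):  # edge potential regardless of stored orientation
        if (s, t) in theta_st:
            return theta_st[(s, t)][xs][xt]
        return theta_st[(t, s)][xt][xs]

    # message from t to s, initialized uniform
    M = {(t, s): [1.0, 1.0] for s in theta_s for t in nbrs[s]}
    for _ in range(num_iters):
        for (t, s) in list(M):
            new = []
            for xs in (0, 1):
                total = 0.0
                for xt in (0, 1):
                    val = math.exp(etheta(s, t, xs, xt) + theta_s[t][xt])
                    for u in nbrs[t]:
                        if u != s:
                            val *= M[(u, t)][xt]
                    total += val
                new.append(total)
            z = sum(new)
            M[(t, s)] = [v / z for v in new]  # normalize for stability
    # tau_s(x_s) proportional to exp(theta_s) times incoming messages
    tau = {}
    for s in theta_s:
        vals = [math.exp(theta_s[s][xs]) * math.prod(M[(t, s)][xs] for t in nbrs[s])
                for xs in (0, 1)]
        z = sum(vals)
        tau[s] = [v / z for v in vals]
    return tau

# 3-node chain (a tree), so the pseudomarginals are exact marginals
tau = sum_product({0: [0.0, 0.3], 1: [0.0, -0.2], 2: [0.0, 0.1]},
                  {(0, 1): [[0.0, 0.0], [0.0, 0.8]],
                   (1, 2): [[0.0, 0.0], [0.0, 0.5]]})
```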

Nature of fixed points

In the junction tree algorithm for exact inference, we get a reparameterization:

p(x)=\Pi_{s\in V}\mu_s(x_s)\Pi_{(s,t)\in E(T)}\frac{\mu_{st}(x_s,x_t)}{\mu_s(x_s)\mu_t(x_t)}

If we apply the sum-product algorithm to a graph with cycles, any fixed point still yields a similar reparameterization:

p(x;\theta)\equiv p(x;\tau^*)=\frac{1}{Z(\tau^*)}\Pi_{s\in V}\tau^*_s(x_s)\Pi_{(s,t)\in E}\frac{\tau^*_{st}(x_s,x_t)}{\tau^*_s(x_s)\tau^*_t(x_t)}

This holds regardless of the method used to solve the BVP: the iterates remain within the affine subset C(θ) of exponential parameters that reparameterize the original distribution.

28 Oct: Kikuchi free energy and generalized sum-product (Seth)

Previously, the solution to the Bethe variational problem was locally consistent but not guaranteed to be globally consistent, because the relaxed polytope enforced only pairwise consistency between vertices. A useful extension, therefore, is to contract the polytope, better approximating MARG(G) by enforcing local consistency across larger sets of interconnected vertices, or clusters.

Hypertrees

A hypergraph G = (V,E) (similar to a junction tree) consists of a vertex set V and a set of hyperedges E, where each hyperedge h is a subset of V. The hyperedges can be viewed as a partially ordered set (poset) under containment: hyperedge g < h if g is contained within h. Hypertrees are acyclic hypergraphs; a hypertree specifies a junction tree via its maximal hyperedges and their intersections.

Hypertree Factorization and Entropy

Hypertree factorization is defined as:

\log \mu_h(x_h) = \sum_{g \in \mathcal{D}^{+}(h)} \log \varphi_g(x_g)

where

\log \varphi_h(x_h) = \sum_{g \in \mathcal{D}^{+}(h)} \omega (g, h) \log \mu_g(x_g)

and \mathcal{D}^{+}(h) represents the descendants of h, including h itself. The Möbius function ω ensures that terms are not "overcounted" when calculating the approximation.

The entropy is approximated as the sum of entropies of the basic cluster of nodes minus the entropies of the over-counted cluster intersections, minus the over-counted intersections of intersections, and so on.

Approximate variational principle

Constructing the approximation using these augmented hypertrees leads to the following relaxation of MARG(G):

\mbox{LOCAL}_t(G) = \left\lbrace  \tau \geq 0 ~ \vert ~ \sum_{x'_h} \tau_h(x'_h) = 1 ~\forall h \in E, ~~\sum_{\{x'_h| x'_g = x_g\}} \tau_h(x'_h) = \tau_g(x_g) ~~\forall g < h \right\rbrace

where τ is the collection of marginals associated with the hyperedges. With this new formulation we can write the approximate variational principle:

\max_{\tau \in LOCAL_t(G)} \left\lbrace \langle \theta, \tau \rangle + H_{app}(\tau) \right\rbrace

where Happ(τ) is the approximate entropy calculated as described above.

Generalized sum product

This approach uses the same message passing of the previous algorithms with the exception that the messages are passed parent-to-child. The compatibility function ψh is defined as:

\psi_h(x_h) := \psi'_h(x_h) \prod_{g \in \mathcal{S}(h)} \psi'_g(x_g)

where \mathcal{S}(h) := \left\lbrace g \in E' \setminus E ~|~ g < h \right\rbrace and the pseudomarginal τh takes the following form:

\tau_h(x_h) = \kappa \left[ \prod_{g \in \mathcal{D}^{+}(h)} \psi_g(x_g) \right] ~\left[ \prod_{g \in \mathcal{D}^{+}(h)}  \prod_{f \in Par(g) \setminus \mathcal{D}^{+}(h)} M_{f \rightarrow g} (x_g)  \right]

31 Oct: Convex relaxations (Avishek)

Associated with any MRF is the log partition function used to normalize the distribution; it is also useful for parameter estimation and large deviation bounds. For any acyclic graph (i.e., a tree), the log partition function can be computed by recursive message passing. For graphs with cycles, exact computation of the log partition function is computationally intractable, and we resort to approximations of this important quantity. In the last few lectures, we came across two broad classes of variational approximations for MRFs: mean field methods and Bethe/Kikuchi approaches. Mean field methods provide a lower bound on the log partition function, whereas Bethe/Kikuchi approaches provide neither an upper nor a lower bound. In addition, the variational formulations in both approaches are non-convex, which leads to substantial algorithmic difficulties due to the existence of local optima. In this talk, we look at an alternative formulation of the variational problem (using convex relaxations) that preserves the convexity of the variational problem and, in addition, provides an upper bound on the log partition function.


Convex Relaxations:

The aim of an optimization problem is to find a global optimum. In many cases, however, it may be hard to compute the global optimum, or even to characterize the set of feasible solutions. One way to approximate such a problem is to relax it, but the main disadvantage is that one can get trapped in a local optimum. To avoid this, the objective function together with the constraints should be convex (more precisely, the objective should be convex/concave and the constraint set should definitely be convex). This allows efficient computation of the global optimum of the approximation.


Upper bounds:

The aforementioned log partition function can be upper bounded using tree-structured distributions. First, we show that for any \mu \in ri\, MARG(G) a spanning tree T = (V,E(T)) can be used to obtain the upper bound  - A^*(\mu) \leq - A^*(\mu(T)) . We then show that a similar result holds for a convex combination of trees, so the upper bound holds in expectation over spanning trees drawn from a distribution \mathcal{\rho}.


Variational Formulation:

Using the upper bound arguments we show that the variational problem can be formulated as a convex problem of the following form:

A( \theta ) \leq \max_{\tau \in LOCAL(G)} \left\lbrace  \langle \theta, \tau \rangle + \sum_{s\in V} H_s(\tau_s) - \sum_{(s,t) \in E} \rho_{st} I_{st}(\tau_{st}) \right\rbrace

In addition, if  \rho_{st} > 0 for all edges (s,t) \in E, then the above formulation has a unique maximum.


Link to BVP (Bethe Variational Problem):

A natural approach to solve the above variational problems (for both BVP and convex relaxation) is to attach Lagrange multipliers to the constraints and then solve the Lagrangian. The key insight by Yedidia, Freeman and Weiss was that the sum-product updates are a particular technique to find such Lagrangian stationary points. Given the similarity in the variational formulations of BVP and Convex-relaxed version, we can again use the sum-product updates to find the fixed points in the Lagrangian formulation of the convex variational problem. The updates take place according to the following recursion:

M_{ts}^{n+1}(x_s) = k \sum_{x'_t \in \mathcal{X}^t} exp\{\frac{1}{\rho_{st}}\theta_{st}(x_s,x'_t)+\theta_t(x'_t)\} \left\lbrace  \frac{\prod_{v \in N(t)\backslash s}[M_{vt}^n(x'_t)]^{\rho_{vt}}}{[M_{st}^n(x'_t)]^{(1-\rho_{ts})}} \right\rbrace

Note that for ρ = 1 on all edges, the above expression reduces to the familiar sum-product updates for the BVP. Having ρst = 1 for every edge is possible only if every edge belongs to every spanning tree, which holds only if the underlying graph is a tree. Thus the message-passing algorithm for the convex relaxation generalizes the BVP updates to graphs with cycles. For a graph with cycles, however, the vector ρ = 1 lies outside the spanning tree polytope, and hence the BVP formulation is no longer convex.
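A sketch of a single tree-reweighted message update, following the recursion above; the table shapes and argument names are illustrative. With ρ = 1 everywhere, the update reduces to the ordinary sum-product message.

```python
import math

def trw_message(theta_st, theta_t, incoming, rho_edge, M_rev):
    """One tree-reweighted sum-product update of the message M_ts(x_s).

    theta_st[xs][xt], theta_t[xt]: edge and node potentials (illustrative)
    incoming: list of (M_vt, rho_vt) for v in N(t) \\ {s}
    rho_edge: edge appearance probability rho_st for edge (s,t)
    M_rev: the reverse message M_st(x_t) from the previous iteration
    Returns the normalized updated message over x_s."""
    new = []
    for xs in range(len(theta_st)):
        total = 0.0
        for xt in range(len(theta_t)):
            val = math.exp(theta_st[xs][xt] / rho_edge + theta_t[xt])
            for M_vt, rho_vt in incoming:
                val *= M_vt[xt] ** rho_vt      # reweighted incoming messages
            val /= M_rev[xt] ** (1.0 - rho_edge)  # division by reverse message
            total += val
        new.append(total)
    z = sum(new)
    return [v / z for v in new]

theta_st = [[0.2, -0.1], [-0.1, 0.4]]
theta_t = [0.1, -0.3]
M_vt = [0.6, 0.4]
# with rho = 1 everywhere the update must match ordinary sum-product
msg = trw_message(theta_st, theta_t, [(M_vt, 1.0)], 1.0, [0.5, 0.5])
```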


Link to Kikuchi method (using hypergraphs):

For simplicity of exposition, we first show the convex relaxation procedure for a pairwise Markov random field (MRF), where the maximal cliques of the underlying graph have size two. We then work out an example showing that the analysis is also applicable to higher-order graphs (hypergraphs) obtained from arbitrary graphs (with cycles) via the Kikuchi method.


7 Nov: Semi-definite relaxations (John and Nathan)

Introduction

We impose positive semidefiniteness on covariance and other moment matrices to establish tightness of bounds on the marginal polytope.

Semi-definite constraints

Linear Matrix Inequalities (LMIs):

Let F(\mu) = F_0 + \sum_{i=1}^d\mu_i F_i, where Fi,F(μ) are matrices

If we constrain F(μ) to be positive semidefinite, then the constraint is an LMI.

Moment Matrices:

Given a random vector \mathbf{y} with n components, \lambda_{st} = \mathbb{E}[y_s y_t] are its second-order moments. Then we have the moment matrix M[\lambda] = \mathbb{E}[\mathbf{y}\mathbf{y}^T].

We can generate other moments simply by making \mathbf{y} a function of \mathbf{x} (for example, \mathbf{y} = [1\ \mathbf{x}] or \mathbf{y} = [1\ x_1\ x_2\ x_3\ x_1x_2\ x_2x_3\ x_1x_3]).

For a moment matrix to be valid, it must be positive semidefinite (PSD).
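An illustrative check: the sketch below constructs E[yy^T] for y = [1, x1, x2] under an explicit distribution (the distribution is made up for illustration) and verifies positive definiteness via Sylvester's criterion, which suffices here because the moments are non-degenerate.

```python
def moment_matrix(dist):
    """M = E[y y^T] for y = (1, x1, x2) under dist: dict (x1, x2) -> prob."""
    M = [[0.0] * 3 for _ in range(3)]
    for (x1, x2), p in dist.items():
        y = (1.0, float(x1), float(x2))
        for i in range(3):
            for j in range(3):
                M[i][j] += p * y[i] * y[j]
    return M

def leading_minors(M):
    """Leading principal minors of a 3x3 matrix; all positive iff the
    matrix is positive definite (Sylvester's criterion)."""
    d1 = M[0][0]
    d2 = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    d3 = (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
          - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
          + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
    return [d1, d2, d3]

# a full-support distribution over (x1, x2) in {0,1}^2 (illustrative)
dist = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
M = moment_matrix(dist)
minors = leading_minors(M)
```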

Multi-index notation

Multi-index notation is just a shorthand for representing monomials. If we have n-vectors \mathbf{x} and α, then we say that \mathbf{x}^\alpha := \prod_{s=1}^n {x_s}^{\alpha_s}, and α is called a multi-index.

For two multi-indices α and β, α + β is just the component-wise sum of the two vectors.

There's a result for multinomial distributions of \mathbf{x} that allows us to simplify the math for high individual powers of xs. It's a handy trick, but probably not worth proving in the presentation.
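The notation is easy to exercise in code. A small sketch (the values are illustrative) verifying that x^α · x^β = x^{α+β}, plus the simplification x_s^k = x_s for binary variables:

```python
def monomial(x, alpha):
    """x^alpha := prod_s x_s^{alpha_s} for a multi-index alpha."""
    out = 1.0
    for xs, a in zip(x, alpha):
        out *= xs ** a
    return out

x = [2.0, 3.0, 0.5]
a = [1, 0, 2]
b = [0, 2, 1]
ab = [i + j for i, j in zip(a, b)]  # component-wise sum of multi-indices
```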

First-order outer bound

Let μα = E[xα] denote a mean parameter (or moment) for a given multi-index α. Let Kk,n denote the hypergraph on n nodes whose hyperedges are the subsets of size at most k. (For example, K2,n corresponds to the complete graph on n nodes.) Then the associated marginal polytope is MARG(K_{k,n}) := \{\mu_{\alpha} \in R^{\vert I_k \vert} \vert \alpha \in I_k\}, which corresponds to all the valid moment vectors (μα). In short, for any hypergraph G, MARG(G) denotes its associated marginal polytope.

During the presentation, we will be presenting an example focusing on a pairwise Markov random field.

Projections

In order to move from constraints on K2,n to constraints on G, we define a projection from K2,n onto the multi-index sets of G. A consequence is that ΠG(SDEF1) is an outer bound on MARG(G). SDEF1 is also a strict outer bound on MARG(K2,n); we work an example demonstrating this.

Higher order constraints

Here we just move to higher-order moment matrices of dimension |I_t|\times|I_t| to bound hypergraphs, and build a hierarchy of bounds. We demonstrate that SDEFn provide an exact characterization of MARG(G) for any hypergraph G.

How does this tie into graphs?

Recall that the complexity of a given marginal polytope (MARG) depends on the structure of the underlying graph. Just as with the other methods we have seen so far, if your (hyper)graph is a (hyper)tree, then the complexity of MARG can be fairly reasonable. This translates to a small set of constraints characterizing MARG for a given hypertree.

For any hypertree: SDEF(G) = MARG(G), which essentially just means that a much lower order of constraints is all that is needed to fully characterize the marginal polytope. Compare/Contrast this with Proposition 1 on Junction Trees if you have extra time.

There is more we can say about hypertrees: Given a hypertree G, then LOCAL(G) = MARG(G) = SDEF(G). This tells us that LOCAL(G) can be seen as an intersection of locally defined semidefinite constraint sets:  LOCAL(G) = \cap_{g \in E} SDEF(H) where H is a subhypergraph with a vertex set equal to g.


Log-det relaxations

This is a short example of using semidefinite constraints in approximate inference. This example will be a Gaussian-based approximation, whereas before with the Bethe variational principle, we saw a tree-based approximation.


14 Nov: Approximate computation of modes (Zhan and Jagadeesh)

Mode computation

We have talked about the variational methods for approximate computation for log partition function and mean parameters. Another method for doing these is mode computation of p(x;θ).

Mode computation is the problem of finding a configuration x^* \in X^n that maximizes p(x;θ). Recall (pp. 18) that p(x;θ) has the form p(x;\theta) = \exp\{<\theta,\phi(x)> - A(\theta)\}; the problem is therefore equivalent to finding a configuration x^* \in \arg\max_{x}<\theta,\phi(x)>.

We introduce a scalar β > 0 and rescale the exponential parameter θ to βθ, so the log partition function becomes A(βθ) (note that \beta\theta \in \Theta). Since A(βθ) itself diverges as \beta \rightarrow \infty, we work with A(βθ)/β instead. This leads to Theorem 4 (support function representation):

For all \theta \in \Theta: \max_{x \in X^n} {<\theta, \phi(x)>} = \sup_{\mu \in \mbox{cl}\,\mathcal{M}} <\theta,\mu> = \lim_{\beta \rightarrow +\infty}\frac{A(\beta\theta)}{\beta}.

Theorem 4 says that mode computation is equivalent to maximizing a linear function over \mathcal{M} (recall the definition of \mathcal{M} on pp. 22). Theorem 2 (pp. 24) helps explain this conclusion: \lim_{\beta \rightarrow +\infty}\frac{A(\beta\theta)}{\beta} = \lim_{\beta \rightarrow +\infty}\frac{1}{\beta}\sup_{\mu \in \mathcal{M}} \{<\beta\theta,\mu>-A^*(\mu)\} = \lim_{\beta \rightarrow +\infty}\sup_{\mu \in \mathcal{M}}\{<\theta,\mu>-\frac{1}{\beta}A^*(\mu)\}.

This means that the limit above equals to the support function of M.


Exact mode computation on multinomial case

In the finite discrete case, the set of realizable mean parameters \mathcal{M} is a polytope. We use MARG(G) to denote the set of realizable marginals associated with the potentials on the graph G, i.e. the marginal polytope.

The support function representation in Theorem 4 shows that, in the multinomial case, mode computation is a linear programming problem: maximize < θ,μ > over MARG(G), where the vector θ specifies a direction in mean-parameter space. Sliding a hyperplane with normal θ outward until it just touches the polytope, the optimum is attained on the intersection of the hyperplane and the polytope; depending on the direction of θ, this may be a single configuration or many.


Tree-structured case

For a tree-structured graph T = (V, E(T)), let μs(xs) and μst(xs,xt) denote the node and edge marginal functions of T. The corresponding exponential parameters are:  \theta_{s}(x_{s}) := \sum_{j \in X_{s}}{\theta_{s;j}I_{j}(x_{s})} and \theta_{st}(x_{s},x_{t}) := \sum_{(j,k) \in X_{s} \times X_{t}}{\theta_{st;jk}I_{jk}(x_{s},x_{t})}.

The local constraints set of the tree marginals in overcomplete form is (for Graph T):

LOCAL(T) := \{\mu \ge 0 \;|\; \sum_{x_{s}}\mu_{s}(x_{s}) = 1, \; \sum_{x_{t}}\mu_{st}(x_{s},x_{t}) = \mu_{s}(x_{s})\}

Combining this with Theorem 4 we get:

\max_{\mu \in LOCAL(T)}{<\mu,\theta>} = \max_{\mu \in LOCAL(T)}{\{\sum_{s \in V}{\sum_{x_{s}}{\mu_{s}(x_{s})\theta_{s}(x_{s})}} + \sum_{(s,t) \in E(T)}{\sum_{x_{s},x_{t}}{\mu_{st}(x_{s},x_{t})\theta_{st}(x_{s},x_{t})}}\}}

This gives the mode computation problem for the tree-structured case as a linear program over LOCAL(T).

Recall the message-passing algorithm on trees. Using Theorem 4, the message update is modified to a max-product form:

M_{st}(x_{s}) = \kappa\max_{x_{t} \in X_{t}}{[exp{\{\theta_{st}(x_{s},x_{t}) + \theta_{t}(x_{t})\}} \prod_{u \in N(t)/s}{M_{ut}(x_{t})}]}.

Compare it with the original message expression:

M_{st}(x_{s}) \leftarrow \kappa\sum_{x_{t}^\prime}{\{\Psi_{st}(x_{s},x_{t}^\prime)\Psi_{t}(x_{t}^\prime) \prod_{u \in N(t)/s}{M_{ut}(x_{t}^\prime)}\}}.
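The max-product recursion can be sketched as follows for a pairwise binary MRF; the graph and parameter tables are illustrative. On a tree with a unique maximizer, decoding each node's max-marginal recovers the mode.

```python
import math

def max_product(theta_s, theta_st, num_iters=50):
    """Max-product message passing for a pairwise binary MRF; exact on trees
    when the maximizer is unique.  theta_s[s], theta_st[(s,t)] are tables."""
    nbrs = {s: set() for s in theta_s}
    for (s, t) in theta_st:
        nbrs[s].add(t)
        nbrs[t].add(s)

    def etheta(s, t, xs, xt):  # edge potential regardless of stored orientation
        if (s, t) in theta_st:
            return theta_st[(s, t)][xs][xt]
        return theta_st[(t, s)][xt][xs]

    M = {(t, s): [1.0, 1.0] for s in theta_s for t in nbrs[s]}
    for _ in range(num_iters):
        for (t, s) in list(M):
            new = []
            for xs in (0, 1):
                best = 0.0
                for xt in (0, 1):  # max over x_t instead of sum
                    val = math.exp(etheta(s, t, xs, xt) + theta_s[t][xt])
                    for u in nbrs[t]:
                        if u != s:
                            val *= M[(u, t)][xt]
                    best = max(best, val)
                new.append(best)
            z = sum(new)
            M[(t, s)] = [v / z for v in new]  # normalize for stability
    # decode: each node takes the argmax of its max-marginal
    x = {}
    for s in theta_s:
        scores = [math.exp(theta_s[s][xs]) * math.prod(M[(t, s)][xs] for t in nbrs[s])
                  for xs in (0, 1)]
        x[s] = 0 if scores[0] >= scores[1] else 1
    return x

# 3-node chain with a unique MAP assignment
x_map = max_product({0: [0.0, 1.0], 1: [0.0, -0.5], 2: [0.0, 0.2]},
                    {(0, 1): [[0.0, 0.0], [0.0, 1.2]],
                     (1, 2): [[0.0, 0.0], [0.0, 0.4]]})
```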

Relaxations of the exact principle

We saw that the zero-temperature limit of the variational principle yields exact mode computation. Similarly, the zero-temperature limit of a relaxed variational principle yields a corresponding relaxation of the mode problem.

Given OUT(G), a compact and convex outer bound on MARG(G), and B^*, a convex approximation to the dual function A^*, we have: \max_{\mu \in MARG(G)} \langle \theta,\mu \rangle \leq \max_{\tau \in OUT(G)} \langle \theta,\tau \rangle

Consider the case of a pairwise MRF. In the canonical overcomplete representation, the set of constraints on the mean parameters μ, i.e. LOCAL(G), forms a strict outer bound on MARG(G) unless G is a tree-structured graph. So \max_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \max_{\mu \in MARG(G)} \langle \theta,\mu \rangle \leq \max_{\tau \in LOCAL(G)} \langle \theta,\tau \rangle.

Since LOCAL(G) is a polytope, we again have an LP, and the optimum must be attained at a vertex. Call a vertex of LOCAL(G) integral if all of its components are zero or one, and fractional otherwise.

An interesting property is that all vertices of MARG(G) are also vertices of the relaxed polytope LOCAL(G). When G is not tree-structured, LOCAL(G) contains additional vertices (an outer approximation) that lie outside MARG(G) and are fractional.

The distinction between integral and fractional vertices is important because it determines whether the LP relaxation specified by LOCAL(G) is tight. If the optimum is attained only at fractional vertices, then it is a vertex of LOCAL(G) and not of MARG(G), so it lies strictly outside MARG(G), and the approximation to the mode is not tight.

2 Dec: Fixing Max-Product: Convergent Message Passing Algorithm for MAP LP-Relaxations (Jiarong and Ruihong)

MPLP is a coordinate descent algorithm in the dual of MAPLPR whose updates are structurally similar to max-product. Given a graph G = (V,E) with n vertices and potentials θij(xi,xj) for all edges ij\in E, the MAP problem is to find an assignment x^M that maximizes the function

f(x;\theta)=\sum_{ij\in E}\theta_{ij}(x_i,x_j)

Define

M_L(G)=\{\mu\ge0 | \sum_{\hat{x}_i}\mu_{ij}(\hat{x}_i,x_j)=\mu_j(x_j), \sum_{\hat{x}_j}\mu_{ij}(x_i,\hat{x}_j)=\mu_i(x_i), \sum_{x_i}\mu_i(x_i)=1, \forall ij\in E, \forall i\in V\}

Then the solution to the linear program MAPLPR, \mu^{L*}=\arg\max_{\mu\in M_L(G)}\mu\cdot\theta, yields an upper bound on the MAP value. Moreover, it has an equivalent convex dual, shown below, which allows the derivation of a simple message passing algorithm.

(DMAPLPR) \quad \quad \min \sum_i \max_{x_i} \sum_{k\in N(i)} \max_{x_k}\beta_{ki}(x_k,x_i),

s.t. βji(xj,xi) + βij(xi,xj) = θij(xi,xj)

where the dual variables are βij(xi,xj) for all ij, ji\in E and values of xi and xj.

Block Coordinate Descent in the Dual

At every iteration we fix all variables except a subset and optimize over that subset. This gives two kinds of update rules for the above dual, both of which have forms similar to max-product messages and make the algorithm converge.

Edge-based MPLP: fixing all the β variables except those corresponding to one edge ij \in E and minimizing DMAPLPR over the non-fixed variables βij, βji, we get the update rule for edge ij, ji:

\lambda_{ji}(x_i) = -1/2 \lambda_i^{-j}(x_i)+1/2 max_{x_j}[\lambda_j^{-i}(x_j)+\theta_{ij}(x_i,x_j)],

where \lambda_i^{-j}(x_i) = \sum_{k \in N(i)-j}\lambda_{ki}(x_i) and \lambda_{ki}(x_i) = max_{x_k}\beta_{ki}(x_k,x_i).

Node-based MPLP: fixing all the β variables except those that involve a given node i \in V, the update rule for the local star around i is:

\gamma_{ij}(x_j) = max_{x_i}[\theta_{ij}(x_i,x_j) - \gamma_{ji}(x_i) + 2/(|N(i)|+1) \sum_{k \in N(i)}\gamma_{ki}(x_i)],

where \gamma_{ji}(x_i) = max_{x_j}[\theta_{ij}(x_i,x_j)+\lambda_j^{-i}(x_j)].
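The edge-based updates can be sketched as follows; the graph, potentials, and names are illustrative. Assuming the updates are applied as above, the resulting dual value upper-bounds the MAP value and is non-increasing over sweeps.

```python
import math

def mplp_edge_updates(theta, num_states, num_iters=100):
    """Edge-based MPLP for the pairwise MAP problem max_x sum_ij theta_ij(x_i,x_j).

    theta: dict (i,j) -> table theta[(i,j)][x_i][x_j]; lam[(j,i)][x_i] are the
    dual messages lambda_ji(x_i).  Returns (lam, dual); dual >= MAP value."""
    nbrs = {}
    for (i, j) in theta:
        nbrs.setdefault(i, set()).add(j)
        nbrs.setdefault(j, set()).add(i)
    lam = {(j, i): [0.0] * num_states for i in nbrs for j in nbrs[i]}

    def lam_minus(i, j, xi):  # lambda_i^{-j}(x_i) = sum over k in N(i)\{j}
        return sum(lam[(k, i)][xi] for k in nbrs[i] if k != j)

    def th(i, j, xi, xj):
        return theta[(i, j)][xi][xj] if (i, j) in theta else theta[(j, i)][xj][xi]

    for _ in range(num_iters):
        for (i, j) in theta:
            # both messages of the edge are computed from the old lambdas
            new_ji = [-0.5 * lam_minus(i, j, xi)
                      + 0.5 * max(lam_minus(j, i, xj) + th(i, j, xi, xj)
                                  for xj in range(num_states))
                      for xi in range(num_states)]
            new_ij = [-0.5 * lam_minus(j, i, xj)
                      + 0.5 * max(lam_minus(i, j, xi) + th(i, j, xi, xj)
                                  for xi in range(num_states))
                      for xj in range(num_states)]
            lam[(j, i)], lam[(i, j)] = new_ji, new_ij

    # dual objective: sum_i max_{x_i} sum_{k in N(i)} lambda_ki(x_i)
    dual = sum(max(sum(lam[(k, i)][xi] for k in nbrs[i])
                   for xi in range(num_states))
               for i in nbrs)
    return lam, dual

# toy 3-cycle with binary states (illustrative potentials)
theta = {(0, 1): [[0.0, 0.3], [0.3, 0.0]],
         (1, 2): [[0.2, 0.0], [0.0, 0.2]],
         (0, 2): [[0.0, 0.1], [0.4, 0.0]]}
lam, dual = mplp_edge_updates(theta, 2)
```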

Convergence Properties

The MPLP algorithm improves the dual objective at every iteration and converges to a local optimum, but it is not guaranteed to converge to the global optimum. In some cases it does converge to the global optimum. Proposition 2: If the resulting beliefs bi(xi) each have a unique maximizer x_i^*, then x^* is the global optimum; this is a fairly natural property of the dual form and the updating process. Proposition 3: When all xi are binary variables, MPLP will converge to the dual optimum; in this case the linear programming relaxation coincides with the integer program.

The experimental results compare favorably with previous approaches on two measures: hit time (measuring convergence time) and belief change (measuring convergence speed).

5 Dec: Some new results on estimating MAP and marginals (Piyush)

Message Passing based tighter LP relaxations for MAP

The MAP problem amounts to solving an integer program and is thus known to be NP hard. Linear Programming (LP) based relaxations give us a work-around to this problem and are known to also give optimality guarantees. It is also well known that LP relaxations can be solved efficiently using message passing algorithms such as Belief Propagation (see, for example, this paper). When the relaxation is tight, LP can provably find the MAP solution. However, standard LP relaxations (using off-the-shelf solvers) are usually not tight enough for many real world problems, as the generic solvers do not make use of the graph structure. This has led to the exploration of higher-order cluster-based LP relaxations where local consistency is enforced between cluster marginals. In cluster-based LP relaxations, messages are sent between clusters. As the size of the clusters grows, tighter and tighter relaxations are possible. Unfortunately, at the same time, the computational cost can become prohibitively expensive due to the size and type of the clusters (e.g., with 100 states per node, a 3-node cluster has 10^6 joint states), so we clearly cannot have too many such clusters in our approximation.

The paper proposes a cluster-pursuit method that allows adding clusters incrementally and helps determine which clusters are guaranteed to improve the approximation the most. An LP can be solved in either of two formulations: primal and dual. Although clusters can be added in the primal LP, this is expensive, so the paper works with the dual formulation. There are also other benefits of working with the dual formulation:

  • Dual LP provides an upper bound on the MAP value so we can categorically seek those clusters that minimize this bound the most. Such an assessment is difficult to obtain in the primal.
  • Dual LP allows a "warm-start" after adding each cluster so we don't have to solve the optimization problem again from scratch.

The dual LP is solved using the generalized max-product LP (MPLP) message passing algorithm we saw last time.

To summarize, the paper gives an algorithm based on MPLP that achieves the following objectives:

  • Monotonically decreases the upper bound on MAP.
  • Chooses clusters which give a guaranteed bound improvement.
  • Allows a "warm-start" after adding each cluster so that we don't have to solve the optimization problem again from scratch.
  • Scales to very large problems.

The experimental results show that the proposed method can find MAP configuration in protein side-chain prediction, protein design, and stereo problems, where standard LP relaxations fail.

New Outer Bounds on the Marginal Polytope

Inference in graphical models such as Markov Random Fields requires us to solve for the marginal probabilities of a subset of variables, or to find the maximum a posteriori (MAP) assignment of the variables. Both problems usually require approximate methods. The variational approach casts them as non-linear optimization problems over the so-called "marginal polytope", which is essentially the set from which valid marginals can come. The marginal polytope is extremely hard to characterize, and various simplifications of it have been proposed. For example, some message passing algorithms for estimating marginals (such as BP or tree-reweighted sum-product) operate within what we call the "local consistency polytope", characterized by pairwise consistent marginals. Because of the relatively smaller number of constraints, this is an "outer bound" (OUTER) on the marginal polytope.

The paper explores the possibility of obtaining tighter outer bounds on the marginal polytope and proposes a cutting-plane algorithm that iterates between solving a relaxed problem and adding additional constraints to the set being optimized over. The cutting-plane algorithm is a well-known technique for solving integer programs. Given a (potentially) infeasible solution, the cutting-plane algorithm quickly finds the violating constraints from a very large class of valid constraints on the set of integral solutions. The violated constraints are then added to the current set being optimized over and the process continues until no more violating constraints are found.

Cycle inequalities: We know that the marginal polytope can be defined by the intersection of a large set of linear inequalities. A basic set of such inequalities defines what we know as LOCAL(G). The paper focuses on another class of inequalities called the cycle inequalities and the proposed cutting-plane algorithm works by iteratively finding all violated cycle inequalities until no more are found.

Cycle inequalities are based on the idea of cut edges. Given a graph G = (V,E) and an assignment \mathbf{x} \in \{0,1\}^n over its vertices, an edge (i,j) \in E is cut if x_i \neq x_j. A cycle inequality arises from the fact that any cycle (i.e. a path starting at some vertex and traversing back to itself) must have an even number (possibly zero) of cut edges. For a cycle C and any F \subseteq C such that |F| is odd, this constraint can be written as:

\sum_{(i,j) \in C \setminus F}\delta(x_i \neq x_j) + \sum_{(i,j) \in F}\delta(x_i = x_j) \geq 1.

Since this constraint is valid for all assignments \mathbf{x} \in \{0,1\}^n, it must be true in expectation as well. Thus:

\sum_{(i,j) \in C \setminus F}(\mu_{ij;10} + \mu_{ij;01}) + \sum_{(i,j) \in F}(\mu_{ij;00} + \mu_{ij;11}) \geq 1

would be valid for any \mu \in \mathcal{M}_{\{0,1\}}.
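As a sanity check, the assignment-level inequality can be verified by brute force on a small cycle. The sketch below (function and variable names are our own, not the paper's) enumerates every binary assignment on a triangle and checks the constraint for an odd-sized F:

```python
from itertools import product

def cycle_inequality_holds(cycle_edges, F, x):
    # left-hand side of the cycle inequality for a fixed assignment x:
    # cut terms for edges in C \ F, agreement terms for edges in F
    lhs = 0
    for (i, j) in cycle_edges:
        if (i, j) in F:
            lhs += int(x[i] == x[j])
        else:
            lhs += int(x[i] != x[j])
    return lhs >= 1

# triangle cycle 0-1-2-0; any odd-sized F gives a valid inequality
cycle = [(0, 1), (1, 2), (2, 0)]
for F in ({(0, 1)}, {(0, 1), (1, 2), (2, 0)}):
    assert all(cycle_inequality_holds(cycle, F, x)
               for x in product([0, 1], repeat=3))
```

Since the inequality holds for every assignment, it also holds in expectation, which is exactly the inequality on \mu above.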

So this gives us a way to detect valid constraints that are violated by the candidate pseudo-marginals. Using a shortest-path algorithm on a suitably constructed graph, the set of violated cycle inequalities can be found in time O(n^2 \log n + n|E|).
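The separation step can be sketched with the standard two-copy ("mirror") graph construction: each vertex gets two copies, staying on the same side of the mirror costs an edge's disagreement mass, and switching sides costs its agreement mass, so a shortest path from a vertex to its own copy corresponds to a cycle with an odd-sized F. A minimal Dijkstra-based sketch under these assumptions (names are our own, not the paper's code):

```python
import heapq

def shortest_path_length(adj, src, dst):
    # plain Dijkstra over an adjacency dict {node: [(nbr, weight), ...]}
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

def has_violated_cycle_inequality(nodes, edges, mu):
    # mu[(i,j)] = (mu00, mu01, mu10, mu11); build the two-copy graph
    adj = {(i, s): [] for i in nodes for s in (0, 1)}
    for (i, j) in edges:
        m00, m01, m10, m11 = mu[(i, j)]
        cut, agree = m01 + m10, m00 + m11
        for s in (0, 1):
            adj[(i, s)].append(((j, s), cut))        # same side: cut mass
            adj[(j, s)].append(((i, s), cut))
            adj[(i, s)].append(((j, 1 - s), agree))  # switch side: agreement mass
            adj[(j, s)].append(((i, 1 - s), agree))
    # a path from (i,0) to (i,1) shorter than 1 witnesses a violated inequality
    best = min(shortest_path_length(adj, (i, 0), (i, 1)) for i in nodes)
    return best < 1.0
```

Running the search from every one of the n vertices is what gives the quoted bound. For instance, pseudo-marginals that put all their mass on disagreement around a triangle are locally consistent but violate the cycle inequality with F = C, and the search above finds a zero-length mirror path.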

At each step, the set of newly found constraints is intersected with OUTER to give a tighter outer bound. The algorithm is sketched below:

1. OUTER = LOCAL(G)

2. repeat

3. \mu^* = \arg\max_{\mu \in \text{OUTER}}\{\langle \theta,\mu \rangle - B^*(\mu)\}

4. Choose projection graph G_\pi, e.g. single, k, or full

5. \mathcal{C} = \text{find\_Violated\_Inequalities}(G_\pi, \pi(\mu^*))

6. OUTER = OUTER \cap \mathcal{C}

7. until \mathcal{C} = \mathbb{R}^d (no violated inequalities were found)

Step 4 is not needed for binary pairwise MRFs. For non-binary and/or non-pairwise MRFs, the paper proposes a technique based on projection graphs that projects the marginal polytope onto different binary marginal polytopes, so that the procedure outlined above can still be applied.
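To make the loop concrete in the simplest setting, here is a small self-contained sketch of the MAP (linear-objective) case, i.e. step 3 with the entropy term B^* dropped, assuming SciPy's linprog is available. It optimizes over LOCAL(G) for a frustrated triangle (every edge prefers disagreement), then adds one cycle inequality as a cutting plane; all variable names are our own:

```python
import numpy as np
from scipy.optimize import linprog

# Toy binary pairwise MRF on a frustrated triangle.  Variables:
# q[i] = mu_i(1) for i = 0..2, then per edge the four pairwise
# marginals mu_e(00), mu_e(01), mu_e(10), mu_e(11).
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]
nvar = len(nodes) + 4 * len(edges)

def edge_vars(k):
    s = len(nodes) + 4 * k
    return list(range(s, s + 4))

# objective: maximize total disagreement mass (linprog minimizes, so negate)
c = np.zeros(nvar)
for k, _ in enumerate(edges):
    _, i01, i10, _ = edge_vars(k)
    c[i01] = c[i10] = -1.0

# LOCAL(G): normalization + marginalization constraints for every edge
A_eq, b_eq = [], []
for k, (i, j) in enumerate(edges):
    i00, i01, i10, i11 = edge_vars(k)
    row = np.zeros(nvar); row[[i00, i01, i10, i11]] = 1.0
    A_eq.append(row); b_eq.append(1.0)                 # sums to one
    row = np.zeros(nvar); row[[i10, i11]] = 1.0; row[i] = -1.0
    A_eq.append(row); b_eq.append(0.0)                 # consistent with mu_i(1)
    row = np.zeros(nvar); row[[i01, i11]] = 1.0; row[j] = -1.0
    A_eq.append(row); b_eq.append(0.0)                 # consistent with mu_j(1)

bounds = [(0.0, 1.0)] * nvar
lp1 = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=bounds)
# LOCAL(G) is fooled by the frustrated cycle: value 3 at all-1/2 marginals

# cutting plane: the cycle inequality with F = C (|C| = 3, odd) says the
# agreement mass around the cycle is at least 1
cut = np.zeros(nvar)
for k, _ in enumerate(edges):
    i00, _, _, i11 = edge_vars(k)
    cut[[i00, i11]] = 1.0
lp2 = linprog(c, A_ub=-cut.reshape(1, -1), b_ub=[-1.0],
              A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=bounds)
# with the cut added, the LP value drops to the true MAP value 2
```

The LP over LOCAL(G) alone returns the fractional value 3, while one cycle cut already recovers the integral MAP value 2, which mirrors the behavior the paper exploits at scale.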

The experimental results on computing marginals show that the proposed method does well even on tightly coupled graphs, whereas the accuracies of the tree-reweighted and log-determinant schemes degrade as the coupling strength increases. Results are also shown on a MAP problem (protein side-chain prediction), where the method solves the MAP problem on some very difficult datasets for which the LP relaxation based on LOCAL(G) does not give satisfactory solutions.
