MLRG/fall08

CS7941: Inference in Graphical Models

Time: Fri 2:00-3:20pm (see schedule for each week)

Location: MEB 3105

Abstract

Graphical models (both Bayesian networks and Markov random fields) are a convenient method for encoding and reasoning about probability distributions. This seminar will focus initially on representational issues and then move to computational methods for inference in graphical models. Our interest is primarily in approximate inference, since exact inference in most models is intractable. The focus will be on non-Bayesian approaches, though much of what we discuss is applicable to Bayesian problems as well.

Students with a working knowledge of probability are well suited for this seminar. Each student should expect to present at least one week during the semester.

The rough plan is to begin with classical approaches to graphical models, including exact inference via the junction tree algorithm. We will quickly move on to approximate methods, focusing mostly on message passing and linear programming approaches.

The structure of the semester will be roughly as follows:

  • Introduction to Bayes' nets and graphical models: semantics, variable elimination, sum/max product algorithm, and factor graphs
  • Computational complexity: junction trees, triangulations and tree-width, belief propagation
  • Exponential families: maximum entropy principle, cumulant generating functions, conjugate duality
  • Variational inference: exact inference on trees, mean-field and structured mean-field, inner bounds of the marginal polytope
  • Statistical physics: Bethe free energy, log-determinant relaxation and outer bounds of the marginal polytope
  • Linear programming methods: tree-reweighted message passing, more outer bounds

This is a lot of material. Things will be dropped or skimmed depending on time and interest levels. Students should only come if they are prepared to do the required reading before the seminar.

Expected Student Involvement

Presentations will be given by pairs of students. The logistics will depend somewhat on the make-up of the participants, but the goal is to pair CS students with non-CS students in most cases. Sign-ups will happen quickly, so please make sure you're signed up for at least one week, and probably two. The expectations, with timing, are:

  • At least 3 days before a presentation, the presenters should meet with Hal for one hour to discuss the material and decide what to present.
  • By class time, everyone should have submitted at least 2 questions to the Wiki for discussion. (Click on the "discussion" tab along the top of the wiki to enter your questions.)
  • Everyone should actively participate in the discussions.

If you're interested in participating in the programming parts, please visit the programming subpage.

Participants

Schedule and Readings

Most of our initial readings will be from the technical report Graphical models, exponential families, and variational inference by Martin Wainwright and Mike Jordan. I know this tome is a bit daunting, but it's not as bad as it looks at first glance! We refer to this in the readings as WJ.

When videos are listed, they are stored in Theora video format; viewers are available for pretty much every platform. Please please please don't distribute these or give away the password (which was sent to the ml@cs mailing list).

Introduction to Bayes' nets and graphical models

  • F 29 Aug: Introduction, semantics and applications (WJ, 1-2.4). Presenter: Hal
  • F 5 Sep: Factor graphs, sum-product and max-product (KFL01, sections 1-2, example 6 from section 3-B, section 4-A, appendices A and B; optional reading for more foundational material: WJ, 2.5-2.5.1). Videos: video A, video B. Presenters: Piyush (CS), Zheng (BI)
  • F 19 Sep: Junction tree algorithm and triangulations (WJ, 2.5.2; HD96, 4). Video: video. Presenters: Nathan (CS), Seth (CS)

Variational Inference

  • F 26 Sep: Math camp: exponential families and conjugate duality (WJ, 3). Presenters: Amit (CS), Piyush (CS)
  • T 30 Sep, 5:15pm: Variational principle, exact inference on trees (WJ, 4). Presenters: Arvind (CS), Senthil (BMI)
  • F 3 Oct: Mean-field and structured mean-field (WJ, 5). Presenters: Jagadeesh (CS), Stephen (BMI)

Statistical Physics

  • F 10 Oct: Bethe free energy and sum-product (WJ, 6). Presenters: Anthony (BMI), Jiarong (CS)
  • F 24 Oct: Kikuchi free energy and generalized sum-product (WJ, 7). Presenter: Seth (CS)

Convex Relaxations

  • F 31 Oct: Convex relaxations (WJ, 8). Presenter: Avishek (CS)
  • F 7 Nov: Semi-definite relaxations (WJ, 9). Presenter: John M. (CS)
  • F 14 Nov: Mode computations (WJ, 10). Presenter: Zhan Wang (CS)
  • F 21 Nov: Exact max-product and outer bounds (GJ07). Presenter: Ruihong (CS)
  • F 5 Dec: Outer bounds on the marginal polytope (SJ07 and SMGJW08)

Additional Readings

These readings are listed just for reference if people are interested.


Reading Summaries

29 Aug: Introduction, semantics and applications (Hal)

High-level summary: graphical models = graph theory + probability theory.

Applications:

  • Hierarchical Bayesian Models. Almost all hierarchical Bayesian models can be (and often are!) represented by directed graphs, with nodes as r.v.s and edges denoting conditional dependence. The standard problems in Bayesian modeling (computing marginal likelihoods, computing expectations, etc.) are graphical model inference problems. (Cautionary note: most hierarchical Bayesian models involve r.v.s with continuous densities; the material we'll cover this semester doesn't really address this. Our focus is almost entirely on discrete r.v.s.)
  • Bioinformatics and Language: Biology problems and language problems can often be seen as inference problems over relatively simple structures (like sequences or trees). These are naturally encoded as, say, HMMs or PCFGs, which are amenable to graphical model analysis.
  • Image Processing: A standard image representation is as a 2D lattice (undirected!) model, with the nodes representing pixels.

Background:

In most of the applications listed above, the underlying graph structures are relatively "dense." This is not promising, because non-probabilistic algorithms on graphs typically have trouble with dense graphs. It turns out (as we shall see soon) that the same is true in graphical model inference. The denser the graph, the harder (computationally!) the inference problem. For most real world problems, we're talking NP-hard.

Since these problems are NP-hard (and usually the really bad variety of NP-hard), we're going to go for approximations. One standard approximation method is to construct an MCMC algorithm. We won't discuss MCMC here (perhaps some other semester, or perhaps take one of the Stochastics classes in the Math department).

The approach we take instead is variational: that is, we attempt to convert the inference problems into optimization problems, and solve the optimization problems. The optimization problem will, of course, also be computationally intractable. However, by relaxing the optimization problem, we will obtain an (efficient) approximate solution, which will lead to an (efficient) approximate solution to the inference problem.

Graphical Models:

Let G = (V,E) be a graph. We will associate each vertex s \in V with a random variable x_s \in X_s. A graphical model consists of a collection of probability distributions over the random variables at the vertices that factorize according to the graph structure. (Notation: if A \subseteq V, define x_A = \{ x_s : s \in A \}.) There are two types of graphical models, just as there are two types of graphs:

  • Directed graphical models: here, let π(s) be the set of parents of node s; then p(\vec x) = \prod_{s \in V} p(x_s | x_{\pi(s)})
  • Undirected graphical models: here, the distribution factorizes according to functions defined on the cliques of the graph (i.e., fully-connected subsets of V). For every clique C, let \psi_C be some function that maps the values of the random variables in C (i.e., x_C) to a positive real value. Then, p(\vec x) = \frac 1 Z \prod_C \psi_C(x_C). (A small numeric sketch of both factorizations follows this list.)
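
As a quick illustration (not from WJ; the CPTs and potentials below are made up), here is a minimal Python sketch of both factorizations on a three-variable chain x1 - x2 - x3, checking that each defines a proper joint distribution.

```python
import itertools

# Three binary variables on a chain x1 - x2 - x3 (all numbers made up).

# Directed version: p(x) = p(x1) * p(x2 | x1) * p(x3 | x2).
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # [x1][x2]
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # [x2][x3]

def p_directed(x1, x2, x3):
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Undirected version: p(x) = (1/Z) * psi_12(x1, x2) * psi_23(x2, x3);
# the cliques of a chain are just its edges.
psi_12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_23 = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}

Z = sum(psi_12[x1, x2] * psi_23[x2, x3]
        for x1, x2, x3 in itertools.product([0, 1], repeat=3))

def p_undirected(x1, x2, x3):
    return psi_12[x1, x2] * psi_23[x2, x3] / Z

# Both factorizations define proper joint distributions (they sum to 1).
configs = list(itertools.product([0, 1], repeat=3))
print(sum(p_directed(*x) for x in configs))    # 1.0
print(sum(p_undirected(*x) for x in configs))  # 1.0
```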

Inference Problems:

Given a probability distribution p(\cdot) defined by a graphical model, we might want to:

  • compute the likelihood of some observed data (observed data = fixed values of a subset of the rvs)
  • compute the marginal distribution p(x_A) for some fixed A \subset V (test question: why not A \subseteq V?)
  • compute the conditional distribution p(x_A | x_B) for disjoint A,B with A \cup B \subset V (same question!)
  • compute the mode of the density; i.e., solve \arg\max_{\vec x} p(\vec x)

Note that the first three are essentially the same: they involve one or more marginalizations. The last one is fundamentally different.
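
For concreteness, here is a brute-force version of all four queries on the same kind of tiny chain model (again with made-up CPTs). The point is that every query reduces to sums, or a max, over all 2^n configurations, which is exactly what becomes intractable as n grows.

```python
import itertools

# Brute-force versions of the four queries on a tiny chain p(x1,x2,x3) =
# p(x1) p(x2|x1) p(x3|x2), with made-up binary CPTs (illustration only).
p1 = {0: 0.6, 1: 0.4}
p2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p2[x1][x2]
p3 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # p3[x2][x3]

def joint(x1, x2, x3):
    return p1[x1] * p2[x1][x2] * p3[x2][x3]

pairs = list(itertools.product([0, 1], repeat=2))
triples = list(itertools.product([0, 1], repeat=3))

# 1. Likelihood of observed data, e.g. the observation x3 = 1.
lik = sum(joint(x1, x2, 1) for x1, x2 in pairs)

# 2. Marginal p(x_A) for A = {2}, i.e. p(x2).
marg_x2 = {v: sum(joint(x1, v, x3) for x1, x3 in pairs) for v in (0, 1)}

# 3. Conditional p(x2 | x3 = 1) = p(x2, x3 = 1) / p(x3 = 1).
cond_x2 = {v: sum(joint(x1, v, 1) for x1 in (0, 1)) / lik for v in (0, 1)}

# 4. Mode: the most probable full configuration (a max, not a sum).
mode = max(triples, key=lambda x: joint(*x))

print(lik, marg_x2, cond_x2, mode)
```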

05 Sept: Factor Graphs and the Sum-Product Algorithm (Piyush and Zheng)

The Problem:

Given a function g(x_1,...,x_n) over a collection of random variables, compute the marginals g_i(x_i) = \sum_{\sim \{x_i\}} g(x_1, ..., x_n). Here, a marginal g_i(x_i) means we sum the function g(x_1,...,x_n) over all possible values (i.e., all 'configurations') of all random variables except x_i. We assume each x_i belongs to some domain (or alphabet) A_i. The global function g(x_1,...,x_n) is assumed to factorize into a product of local functions: g(x_1, ..., x_n) = \prod_{j \in J} f_j(X_j), where J is some discrete index set. The precise nature of this factorization depends on the kind of graphical model we have (for example, the Bayes nets and MRFs we saw last week).


The Technique:

It turns out that we can visualize such a factorization using a bipartite graph called a factor graph and use message-passing algorithms to do inference on probability distributions associated with these graphs. In particular, we shall consider the problem of finding marginals for such distributions and see how a message-passing technique called the sum-product algorithm helps us do this efficiently. The sum-product algorithm operates on factor graphs. To apply it to other graphical models (such as Bayes nets or MRFs), we first need to convert them into an equivalent factor graph.

Note that our focus will be on tree-structured factor graphs (a majority of applications deal only with such graphs), for which the sum-product algorithm gives exact results. The algorithm is also applicable to factor graphs with cycles, but there the results will be approximate.

Factor Graphs:

A factor graph is a bipartite graph that consists of a set of factor nodes f_j (one for each local function) and a set of variable nodes x_i. There is an edge between f_j and x_i if and only if x_i is an argument of f_j. It turns out that various marginalization algorithms can be described as a set of message-passing operations on a factor graph.

  • An incoming message at node x_i is a vector of length |A_i|, where |A_i| is the cardinality of the domain A_i of x_i.

We can think of each vertex in a factor graph as a processor, and of the edges as channels over which the processors communicate (i.e., pass messages). Using these message-passing operations, factor graphs give us a way to compute the marginals efficiently. The versatility of factor graphs lies in the fact that they represent various types of graphical models (e.g., Bayes nets, MRFs) in a common framework under which inference in such models can be efficiently carried out. We shall see examples of how to come up with such representations.
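
As a concrete (hypothetical) example, the factor graph for g(x1, x2, x3) = fA(x1, x2) fB(x2, x3) can be stored as a simple bipartite adjacency structure; messages then live on the edges of this structure. All names and domains below are made up for illustration.

```python
# A factor graph for g(x1, x2, x3) = fA(x1, x2) * fB(x2, x3) stored as a
# bipartite adjacency structure (a made-up example). Each factor node lists
# its arguments; the variable-side neighborhoods n(x) are derived from that.
factors = {
    "fA": ["x1", "x2"],
    "fB": ["x2", "x3"],
}
domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1]}   # the alphabets A_i

# n(x): factor nodes adjacent to each variable node.
neighbors_of_var = {v: [f for f, args in factors.items() if v in args]
                    for v in domains}
print(neighbors_of_var)   # {'x1': ['fA'], 'x2': ['fA', 'fB'], 'x3': ['fB']}

# A message passed along the edge (f, x), in either direction, is a vector
# of length |A_x|:
message_length = {x: len(domains[x]) for x in domains}
print(message_length)     # {'x1': 2, 'x2': 2, 'x3': 2}
```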

The Sum-Product Algorithm:

Repeatedly invoking this simple single-marginal computation for each node separately would give us the marginal of each individual node, but it would fail to share the intermediate terms arising in the individual computations. The sum-product algorithm is a dynamic-programming algorithm that effectively shares these intermediate terms and computes all the marginals simultaneously and in parallel.

Message passing starts at the leaf nodes of the factor graph. Each node v waits until it has received messages on all but one of its incident edges, and then uses these messages to compute the message to be sent on the remaining edge. The algorithm terminates once two messages (one in each direction) have been passed along every edge in the factor graph. Thus, on a factor graph with n edges, 2n messages must be passed before the sum-product algorithm terminates. On termination, the marginal function g_i(x_i) is simply the product of all incoming messages destined to x_i. The message-passing operations involve computing various sums and products, hence the name sum-product algorithm.

Since there are two types of nodes in a factor graph, two types of messages are passed in the sum-product algorithm:

  • variable node x to factor node f: \mu_{x \rightarrow f}(x) = \prod_{h \in n(x) \backslash \{f\} }\mu_{h \rightarrow x}(x)
  • factor node f to variable node x: \mu_{f \rightarrow x}(x) = \sum_{\sim \{x\}}(f(X)\prod_{y \in n(f) \backslash \{x\} }\mu_{y \rightarrow f}(y))

In all of the above, n(v) denotes the set of neighbors of node v, and X = n(f) is the set of arguments of the function f. Important: notice that the expression for the message from a variable node to a factor node looks significantly simpler than that for a message from a factor node to a variable node. This is because (a) there is no local function attached to a variable node, and (b) no summary (summation) operator is needed, since the product of the incoming messages is already a function of x alone.
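
Below is a minimal Python sketch of the sum-product algorithm on a small tree-structured factor graph. It follows the two message equations above, but computes messages recursively rather than with an explicit leaf-to-root schedule (equivalent on a tree). The factorization g(x1,x2,x3) = fA(x1,x2) fB(x2,x3) and its table entries are made up for illustration, and the result is checked against brute-force marginalization.

```python
import itertools

domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1]}

def fA(x1, x2):                       # made-up local function fA(x1, x2)
    return [[2.0, 1.0], [1.0, 3.0]][x1][x2]

def fB(x2, x3):                       # made-up local function fB(x2, x3)
    return [[1.0, 4.0], [2.0, 1.0]][x2][x3]

factors = {"fA": (["x1", "x2"], fA), "fB": (["x2", "x3"], fB)}

def n_var(x):
    """Factor nodes adjacent to variable node x."""
    return [f for f, (args, _) in factors.items() if x in args]

def msg_var_to_factor(x, f, val):
    # Product of messages into x from all neighboring factors except f.
    out = 1.0
    for h in n_var(x):
        if h != f:
            out *= msg_factor_to_var(h, x, val)
    return out

def msg_factor_to_var(f, x, val):
    # Summary over ~{x}: sum over the other arguments of f.
    args, func = factors[f]
    others = [y for y in args if y != x]
    total = 0.0
    for combo in itertools.product(*(domains[y] for y in others)):
        assignment = dict(zip(others, combo))
        assignment[x] = val
        term = func(*(assignment[a] for a in args))
        for y in others:
            term *= msg_var_to_factor(y, f, assignment[y])
        total += term
    return total

def marginal(x):
    # Unnormalized marginal: product of all incoming factor-to-variable messages.
    unnorm = {v: 1.0 for v in domains[x]}
    for f in n_var(x):
        for v in domains[x]:
            unnorm[v] *= msg_factor_to_var(f, x, v)
    Z = sum(unnorm.values())
    return {v: unnorm[v] / Z for v in domains[x]}

# Check against brute-force marginalization of g(x1, x2, x3) = fA * fB.
def g(x1, x2, x3):
    return fA(x1, x2) * fB(x2, x3)

brute = {v: sum(g(v, x2, x3) for x2, x3 in itertools.product([0, 1], repeat=2))
         for v in [0, 1]}
Zg = sum(brute.values())
print(marginal("x1"))                      # exact marginal via message passing
print({v: brute[v] / Zg for v in [0, 1]})  # same numbers by brute force
```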

Special Instances of the Sum-Product Algorithm:

The sum-product algorithm is quite general and it subsumes a wide variety of practical algorithms such as the forward/backward algorithm, the Viterbi algorithm, and the Belief Propagation algorithm in Bayes nets. In particular, we shall see examples of the forward/backward algorithm used for inference in HMMs, and the Belief Propagation in Bayes nets, and see how they turn out to be special instances of a sum-product algorithm operating on a factor graph.

19 Sept: Junction Tree Algorithm and Triangulations (Nathan and Seth)

Motivation

D-Separation of Graphs: In order to get a better understanding of the method used to convert DAGs to undirected trees, it will be useful to understand d-separation. There seems to be some disagreement over what the 'd' stands for in d-separation, with one camp claiming directional and another claiming dependence. Both conventions have value in understanding what the separation entails.

If two variables X and Y are d-separated relative to a variable Z in a directed graph, then they are conditionally independent given Z in every probability distribution that the graph can represent. X and Y are independent conditional on Z if observing X gives you no extra information about Y once you have observed Z; i.e., once you know Z, X adds no information to what you know about Y.

Consider all directed graphs of three variables, A,B,C, where C is observed:

  • A -> C -> B - The path from A to B is blocked by conditioning on C. A and B are d-separated in this case.
  • A <- C <- B - The path from B to A is blocked by conditioning on C. A and B are d-separated in this case.
  • A <- C -> B - Here we are conditioning on a common cause, the effects are independent. A and B are d-separated in this case.
  • A -> C <- B - In this orientation, C is called a collider. Conditioning on C does not block the path, so A and B are d-connected given C and need not be conditionally independent. (Note: if C is not observed, and none of its descendants are observed, then this path is blocked.)

In short, a path is blocked (contributing to d-separation) if, at some node along it:

  • the arrows meet head-to-tail or tail-to-tail at a node that is observed (i.e., in the conditioning set), or
  • the arrows meet head-to-head at a node that is neither observed nor has any observed descendants.

If every path between the nodes A and B is blocked in this way, then A is d-separated from B by C (and A and B are then also conditionally independent given C).
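
A small numeric sanity check of the collider case (the network A -> C <- B with made-up CPTs, not from the readings): A and B are exactly independent marginally, but conditioning on C makes them dependent ("explaining away").

```python
import itertools

# Collider A -> C <- B with made-up binary CPTs. Marginally, A and B are
# independent (the head-to-head node C blocks the path when unobserved);
# conditioning on C = 1 makes them dependent ("explaining away").
pA = {0: 0.5, 1: 0.5}
pB = {0: 0.7, 1: 0.3}

def pC(c, a, b):                      # p(C = c | A = a, B = b), noisy-OR-ish
    p1 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.6, (1, 1): 0.95}[(a, b)]
    return p1 if c == 1 else 1.0 - p1

def joint(a, b, c):
    return pA[a] * pB[b] * pC(c, a, b)

def prob(pred):
    """Probability of the event described by a predicate over (a, b, c)."""
    return sum(joint(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3)
               if pred(a, b, c))

# Marginal independence: p(A=1, B=1) equals p(A=1) * p(B=1).
print(prob(lambda a, b, c: a == 1 and b == 1),
      prob(lambda a, b, c: a == 1) * prob(lambda a, b, c: b == 1))

# Conditioned on C = 1 the product rule fails: A and B are d-connected given C.
pc1 = prob(lambda a, b, c: c == 1)
print(prob(lambda a, b, c: a == 1 and b == 1 and c == 1) / pc1,
      (prob(lambda a, b, c: a == 1 and c == 1) / pc1) *
      (prob(lambda a, b, c: b == 1 and c == 1) / pc1))
```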

Converting DAGs to Undirected Trees

  • Moralization: for each node, connect ("marry") every pair of its parents, then drop the directions on all edges; the result is the moral graph. (A small sketch of this step follows the list.)
  • Triangulation: add edges to the moral graph until every cycle of length four or more has a chord, giving a chordal (triangulated) graph.
  • Construct Junction Tree: build a tree whose nodes are the maximal cliques of the triangulated graph, connected so that the running intersection property holds.
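
Here is a minimal sketch of just the moralization step, on a small made-up DAG (the triangulation and junction-tree steps are omitted): marry the parents of each node, then drop edge directions.

```python
import itertools

# Moralization of a made-up DAG, given as a child -> parents map:
# connect ("marry") all pairs of parents of each node, then drop directions.
parents = {
    "A": [],
    "B": [],
    "C": ["A", "B"],   # A -> C <- B: moralization adds the edge A - B
    "D": ["C"],
}

moral_edges = set()
for child, pars in parents.items():
    for p in pars:                                   # drop directions
        moral_edges.add(frozenset((p, child)))
    for p, q in itertools.combinations(pars, 2):     # marry co-parents
        moral_edges.add(frozenset((p, q)))

print(sorted(tuple(sorted(e)) for e in moral_edges))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D')]
```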

The Junction Tree Algorithm:

Essentially, this algorithm is very similar to the sum-product message-passing algorithm from last meeting. There are some neat properties, notably the running intersection property, which ensures that inference about variables is consistent across the graph, i.e., that local consistency implies global consistency. This property is a consequence of the triangulation step above and states that if a variable is contained in two cliques, then it must also be contained in every clique on the path that connects them.

26 Sept: Exponential Families and Convex Analysis (Amit and Piyush)

It turns out that a wide variety of message-passing algorithms (some of which we have already seen, and others we will be seeing in the coming weeks) can also be understood as solving exact or approximate variational problems. Exponential families (and tools from convex analysis) provide a unified framework to develop these variational principles.

Exponential Families: An exponential family defines a parametrized collection of (probability) density functions: p(x;θ) = exp{ < θ,φ(x) > − A(θ)}. The quantity A acts as a normalization constant and is commonly known as the log partition function: A(\theta) = \log \int_{X^n} \exp <\theta,\phi(x)> dx. The functions \phi_\alpha : X^n \rightarrow R are known as sufficient statistics, and \phi = \{\phi_\alpha : \alpha \in I \} is the vector of sufficient statistics; it is easy to see that p(x;θ) depends on x only through φ(x) (hence the name). \theta = \{\theta_\alpha : \alpha \in I \} is the exponential (or canonical) parameter vector, used to identify members of the family associated with φ; it lives in the set \Theta := \{\theta \in R^d \mid A(\theta) < \infty \}. We will show that A is a convex function of θ, which in turn implies that Θ is a convex set.

We will see some examples of distributions (Bernoulli, Gaussian, and another distribution defined as a graphical model -- the Ising model) and show that each of these can be represented as exponential families.

Representations: An exponential family admits either of the following two representations:

  • Minimal: We define the exponential family with a collection of functions φ = {φα} such that no nontrivial linear combination <a,\phi(x)>=\sum_{\alpha \in I}a_\alpha \phi_\alpha(x) is equal to a constant. This condition leads to a minimal representation, in which a unique parameter vector θ is associated with each distribution.
  • Overcomplete: Rather than using a minimal representation we can use an overcomplete representation. Here an entire affine subset of parameter vectors θ is associated with the same distribution.

Properties of the log partition function: The log partition function A(θ) is both smooth and convex in θ. It can also be seen as the cumulant generating function of the random vector φ(x). In particular, it is easy to see that the first cumulant gives us the mean of φ(x):

\frac{\partial A}{\partial \theta_{\alpha}}(\theta) = E_{\theta}[\phi_{\alpha}(x)] := \int \phi_{\alpha}(x)p(x;\theta)dx

Higher cumulants can be similarly defined.
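
A quick numerical check for the Bernoulli case mentioned above (with φ(x) = x and A(θ) = log(1 + e^θ); the particular θ is arbitrary): the first derivative of A matches the mean of φ(x), and the second derivative matches its variance.

```python
import math

# Bernoulli as an exponential family: p(x; theta) = exp(theta * x - A(theta)),
# with x in {0, 1}, phi(x) = x, and A(theta) = log(1 + exp(theta)).
def A(theta):
    return math.log(1.0 + math.exp(theta))

def p(x, theta):
    return math.exp(theta * x - A(theta))

theta = 1.3          # an arbitrary canonical parameter
eps = 1e-4           # step for the finite-difference derivatives

# First derivative of A versus the mean parameter E_theta[phi(x)].
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)
mean = sum(x * p(x, theta) for x in (0, 1))
print(dA, mean)      # both equal the sigmoid 1 / (1 + e^{-theta}), about 0.786

# Second derivative of A versus the variance of phi(x) (the second cumulant).
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps ** 2
var = sum((x - mean) ** 2 * p(x, theta) for x in (0, 1))
print(d2A, var)      # both equal sigma * (1 - sigma), about 0.168
```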

Mean (expectation) parameters: Let's consider the set of vectors defined by taking expectation of φ(x) w.r.t. an arbitrary distribution:

\mathcal{M} := \{\mu \in R^d \mid \exists\ p(\cdot)\ s.t.\ \mu = \int \phi(x)p(x)dx\}

Now let's take an arbitrary member of an exponential family p(x;θ) defined by φ(x) and define a mapping:

\Lambda(\theta) := E_\theta[\phi(x)] = \int \phi(x)p(x;\theta)dx

The mapping Λ associates to each θ a vector of mean parameters μ: = Λ(θ) belonging to the set \mathcal{M}. It can also be seen easily that \Lambda(\theta) = \nabla A(\theta).

The mapping Λ between the set Θ of exponential parameters θ and the set \mathcal{M} of mean parameters μ is interesting and worth remembering. In particular, the following conditions hold:

  • The mapping Λ is one-to-one if and only if the exponential representation is minimal.
  • The mapping Λ is onto the relative interior of \mathcal{M} (i.e., \Lambda(\Theta) = \mathrm{ri}\,\mathcal{M}).

For minimal or overcomplete representations, we say that the pair (θ,μ) is dually coupled if μ = Λ(θ) and hence \theta \in \Lambda^{-1}(\mu).

Some conjugate duality

Following the duality between θ and μ, we define the Legendre-Fenchel dual of the log partition function A(θ) as:

A^*(\mu) := \sup_{\theta \in \Theta}\{<\mu,\theta> - A(\theta)\}

It turns out that the dual function A^*(μ) is actually the negative entropy of p(x;θ(μ)), where θ(μ) is an element of the inverse image \Lambda^{-1}(\mu), for \mu \in \mathcal{M}. In terms of this dual, we can define the log partition function in a variational representation:

A(\theta) =  \sup_{\mu \in \mathcal{M}}\{<\theta,\mu> - A^*(\mu)\}

So it turns out that the variational representation of A(θ) just reduces to an optimization problem over the domain \mathcal{M}.
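
Continuing the Bernoulli example (an arbitrary θ and a coarse grid, purely for illustration): the dual A^*(μ) is the negative Bernoulli entropy, and maximizing <θ,μ> − A^*(μ) over μ recovers A(θ), with the maximizer being the corresponding mean parameter.

```python
import math

# Variational representation of A(theta) for the Bernoulli family:
# A(theta) = sup_{mu in M} { theta * mu - A*(mu) }, with M = [0, 1] and
# A*(mu) = mu log mu + (1 - mu) log(1 - mu)  (the negative entropy).
def A(theta):
    return math.log(1.0 + math.exp(theta))

def A_star(mu):
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)

theta = -0.7                     # an arbitrary canonical parameter

# Coarse grid search over the interior of M (endpoints excluded to avoid log 0).
grid = [i / 10000.0 for i in range(1, 10000)]
best_mu = max(grid, key=lambda mu: theta * mu - A_star(mu))

print(A(theta))                            # approximately 0.4032
print(theta * best_mu - A_star(best_mu))   # the same value, up to grid error
print(best_mu, 1.0 / (1.0 + math.exp(-theta)))  # maximizer = mean parameter
```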

The duality between A(θ) and A^*(μ) is also very important. For example, this duality gives us several alternative ways to compute the KL divergence between two exponential family members.

The take-home message: It's important to keep in mind the properties of the log partition function A(θ) and the mean parameters μ = Eθ[φ(x)]. As we will see in the subsequent weeks, in the variational framework for doing inference (in particular, the problems of computing marginals), it's essentially these quantities that we need to deal with.

Past Semesters
