Talk:MLRG/fall08

Revision as of 05:35, 5 September 2008 by Piyush (Talk | contribs)

29 Aug

Hal

This is just an example of a question?
And another example?
Hal's alter-ego: The answer is blah.

Seth

What is the general difference between the MCMC methods and the variational methods presented in this paper?
(Piyush): At a high level, both MCMC and variational methods are approximate inference techniques whose goal is to approximate some target probability distribution. The difference lies in the kind of approximation they provide. MCMC gives a sampling-based stochastic approximation (one that can vary across runs), whereas variational inference gives an optimization-based deterministic approximation (the same across runs). The important thing to remember, however, is that in the limit of infinite computational resources MCMC can produce exact samples (i.e. if the chain is run long enough), so its approximation arises only from the finite limit on resources. Variational methods, on the other hand, solve an optimization problem: minimize the difference between the true distribution and a candidate distribution. Exact optimization is computationally intractable, so we usually solve a relaxed problem (thereby yielding an approximate solution).
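To make the stochastic-vs-deterministic distinction concrete, here is a minimal sketch with a toy Gaussian target and made-up numbers (everything here is a hypothetical illustration, not from the paper): a Metropolis sampler's estimate of the target mean varies with the random seed, while a variational fit by gradient descent lands on the same answer every run.

```python
import random
import math

# Toy target: an unnormalized Gaussian density with mean 3, std 1 (hypothetical).
def unnormalized_target(x):
    return math.exp(-0.5 * (x - 3.0) ** 2)

# --- MCMC: a Metropolis sampler gives a *stochastic* approximation ---
def metropolis_mean(n_samples, seed):
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, 1.0)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < min(1.0, unnormalized_target(proposal) / unnormalized_target(x)):
            x = proposal
        total += x
    return total / n_samples

# --- Variational: gradient descent gives a *deterministic* approximation ---
# With candidate q = N(mu, 1), minimizing KL(q || p) here reduces to
# minimizing (mu - 3)^2 / 2, so the gradient in mu is simply (mu - 3).
def variational_mean(steps=200, lr=0.1):
    mu = 0.0
    for _ in range(steps):
        mu -= lr * (mu - 3.0)
    return mu

print(metropolis_mean(5000, seed=1))  # varies with the seed
print(metropolis_mean(5000, seed=2))  # a different run, a different estimate
print(variational_mean())             # identical on every run
```

Both estimates hover near the true mean of 3; only the MCMC one changes from run to run.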
On page 3 it mentions a compatibility function \psi_C : \mathcal{X}^n \rightarrow \mathbb{R}_{+} when describing the factorization of a probability distribution in an undirected graphical model. What does this function represent?
(Piyush): The compatibility function for each clique can be thought of as an (unnormalized) probability distribution over the set of nodes in that clique. A high value of the compatibility function for a certain configuration of the nodes indicates that such a configuration is highly likely.
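A minimal sketch of this view, with made-up numbers: a compatibility table over a clique of two binary nodes, where large entries mark the configurations the clique considers likely; dividing by the table's sum turns this single clique into a proper distribution over its nodes.

```python
# A compatibility function psi_C for a clique of two binary variables,
# stored as a table of nonnegative values (hypothetical numbers).
psi = {
    (0, 0): 5.0,  # both off: highly compatible
    (0, 1): 0.5,
    (1, 0): 0.5,
    (1, 1): 5.0,  # both on: highly compatible
}

# psi is unnormalized; dividing by its sum turns this single clique's
# table into a proper probability distribution over its nodes.
Z = sum(psi.values())
prob = {config: value / Z for config, value in psi.items()}

print(prob[(1, 1)])  # 5.0 / 11.0, about 0.4545
```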

John

Given the factorization of a directed graph model, it seems obvious to me that it would need to be acyclic. How would a non-acyclic directed graph model work? Would it be anything like the undirected graph model (working with SCC's perhaps)? Or are non-acyclic directed graph models just rare enough that they don't need to be defined?
(hal) Cyclic (or "non-acyclic"!) graphs definitely exist and they will be a fairly major problem for the simple algorithms! I usually think of cycles as a sort of "feedback loop." However, in practice, it's fairly rare that we end up with cyclic (aka "loopy") directed models. Almost all examples of loopy models that we'll see are undirected.
(John) Ok, fine, I'll just call them "loopy." "Cyclic" implies exclusivity to me. :-p

Suresh

Why is the undirected graphical model formulated in terms of cliques instead of in terms of neighbors as the directed model is ?
(Piyush) In the parent-child relationships of directed graphical models, each term in the factorized distribution represents the probability of a node given all its parents. In undirected graphical models, however, we don't have such parent-child relationships, so we define the factorization in terms of a set of cliques, with each clique's compatibility function scoring how likely a particular configuration of its nodes is. Also, during factorization it suffices to use maximal cliques: any smaller clique is a subset of some maximal clique, so its compatibility function can be absorbed into that maximal clique's function.
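As a toy illustration (hypothetical compatibility values), here is the factorization p(x) = (1/Z) \prod_C \psi_C(x_C) for a three-node chain x1 - x2 - x3, whose maximal cliques are the edges {x1, x2} and {x2, x3}:

```python
from itertools import product

# Chain x1 - x2 - x3: the maximal cliques are {x1, x2} and {x2, x3}.
# A hypothetical compatibility function favoring equal neighbors.
def psi(a, b):
    return 2.0 if a == b else 1.0

# Partition function Z, computed by brute force over all 8 configurations.
Z = sum(psi(x1, x2) * psi(x2, x3)
        for x1, x2, x3 in product((0, 1), repeat=3))

# Joint distribution p(x) = (1/Z) * psi(x1, x2) * psi(x2, x3).
def p(x1, x2, x3):
    return psi(x1, x2) * psi(x2, x3) / Z

# The normalized product is a valid distribution: it sums to one.
total = sum(p(*cfg) for cfg in product((0, 1), repeat=3))
print(Z, total)  # 18.0 1.0
```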

John

Factorizing distributions: this seems to be (especially in the directed case) a variation of the chain Bayes' rule. Is my characterization very far off?
(hal) It's basically chain rule + independence assumptions. Order the variables topologically and apply the chain rule. Now, remove from the conditioning side anything that is not a parent. Voilà.
(sam) Where does the independence assumption come into play? Isn't the topology of the graph implying dependencies? In the directed case a node depends on its parents; in the undirected case a node depends on its cliques and the compatibility functions on them.
(John) If I understand correctly, you'd get the chain rule if you had a graph where every vertex had as parents each vertex that preceded it topologically, thus implying maximal dependency (because that's a maximal DAG). Removing edges from such a graph allows you to remove terms from the individual factors in the product. I have to admit that I'm still a little confused about the clique relationships (though Hal indicated that more clarity on cliques in GM's was forthcoming).
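Hal's recipe (chain rule, then drop non-parents from the conditioning side) can be sketched on a toy DAG A -> B, A -> C with made-up conditional tables: the full chain rule would give p(c | a, b), but since C's only parent is A, that factor collapses to p(c | a).

```python
from itertools import product

# DAG: A -> B, A -> C.  The full chain rule gives p(a) p(b|a) p(c|a,b);
# the graph's independence assumption lets us drop b from C's conditioning.
p_a = {0: 0.6, 1: 0.4}                                # hypothetical numbers
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    # p(a, b, c) = p(a) * p(b | a) * p(c | a)
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

# A valid factorization: the joint sums to 1 with no extra normalizer.
print(sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3)))  # 1.0
```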
Is p(\vec x) the probability of all the vertices' r.v.'s matching the values in \vec x? (i.e., does [x_s] = \vec x?)
(hal) \vec x is just the collection of all the rvs.
(John) Ok. I think that's what I was getting at. I was getting hung up on notation a little.


5 September

Samuel

Page 501 KFL01, clarification question about the second column. The whole goal of this procedure is to "compile" one function for a particular marginalization, by which I assume it does not literally compute a numerical value. So in the end a value can be plugged into the compiled function and the result can be computed immediately, without having to go through the whole factor graph. Is this right?
(Piyush): I think an example would help here. In our graphical model setting, we assume we are in a discrete land (i.e. each r.v. x_i \in A_i, where A_i = {0, ..., m-1} is a set of discrete values). In this discrete case, each marginal distribution g_i(x_i) is just a probability mass function and can be represented as a vector X of size m. Since it's a pmf, the sum of the entries in X is 1. The sum-product algorithm, upon termination, produces one such vector for each r.v. Now, let's say the pmf vector for x_i is X_i. Then the probability p(x_i = c) is just X_i(c). So what you mean by a "compiled" function is actually the pmf vector, and the individual probabilities for different values of the r.v. can be read off straightaway from this vector. There isn't any function evaluation going on (or needed). Note that if our marginal is a function of 2 random variables, we would have a 2D table (instead of a 1D vector) for the pmf. It's straightforward to generalize this to a higher number of random variables.
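A small sketch of this "pmf vector" reading, with made-up numbers: marginalizing a 2D joint table over x2 leaves a plain length-m vector X1, and reading off p(x1 = c) is just an index lookup, with no function evaluation.

```python
# A joint pmf over two ternary variables (m = 3), as a 2D table
# (hypothetical numbers; rows index x1, columns index x2).
joint = [
    [0.10, 0.05, 0.05],
    [0.20, 0.10, 0.10],
    [0.05, 0.05, 0.30],
]

# Marginalizing out x2 "compiles" the marginal of x1 into a plain
# length-m vector X1; nothing else needs to be evaluated afterwards.
X1 = [sum(row) for row in joint]

# Reading off p(x1 = c) is just indexing into the vector.
print(X1)     # approximately [0.2, 0.4, 0.4]
print(X1[2])  # p(x1 = 2), approximately 0.4
```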

Stephen P

1. Page 506-7 KFL01 (Example 6), Figure 12d. How can you find f(x_1) or f(y_1 | x_1) if you can't observe x_1 in the hidden Markov model?
(Piyush): Right, you don't observe x and only observe y, but we have models that define the quantities above. In an HMM (or similar models), f(y|x), the probability of y given x, is usually called an observation model, and p(x_i | x_{i-1}), the probability of x_i given x_{i-1}, is usually called a transition model. Given these models and the observed data (y), the inference procedure estimates the hidden variables x (as p(x|y)) for us. To use these models in the inference procedure, you don't need to know x.
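A minimal sketch (a hypothetical two-state HMM with made-up numbers) of how the transition and observation models alone support inference about the unobserved x: a forward filtering pass computes p(x_t | y_1..y_t) without ever seeing x.

```python
# A tiny 2-state HMM (hypothetical numbers): x is never observed,
# but the transition and observation models let us infer p(x | y).
init = [0.5, 0.5]                      # p(x_1)
trans = [[0.9, 0.1], [0.1, 0.9]]       # p(x_i | x_{i-1})
obs = [[0.8, 0.2], [0.3, 0.7]]         # p(y | x), with binary y

def forward_filter(ys):
    """Return the filtering distribution p(x_t | y_1..y_t) for each t."""
    alpha = init[:]
    history = []
    for t, y in enumerate(ys):
        if t > 0:
            # Predict: push the previous posterior through the transition model.
            alpha = [sum(alpha[r] * trans[r][s] for r in (0, 1)) for s in (0, 1)]
        # Update: weight each state by how well it explains the observation y.
        alpha = [alpha[s] * obs[s][y] for s in (0, 1)]
        z = sum(alpha)
        alpha = [a / z for a in alpha]
        history.append(alpha[:])
    return history

filtered = forward_filter([0, 0, 1])
print(filtered[-1])  # posterior over the hidden state after seeing y = 0, 0, 1
```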
2. KFL01 (general question). The authors show that Bayesian Networks can be converted into factor graphs, which have better properties for generalized inference. Why not forget expressing models as directed acyclic graphs and just do everything as factor graphs?
(Piyush): This is a modeling issue. The reason you would want to use a Bayes net (rather than a factor graph directly) is that there may be causal relationships in your model that you need to express, and these are captured by the parent-child relationships of the Bayes net formalism. Yes, you could also try expressing them by being "direct" as you suggest, but in this setting that would be much more difficult and much less intuitive than using a Bayes net.