Talk:MLRG/fall08
From ResearchWiki
Revision as of 20:46, 30 September 2008
29 Aug
Hal
- This is just an example of a question?
- And another example?
- Hal's alter-ego: The answer is blah.
Seth
- What is the general difference between the MCMC methods and the variational methods presented in this paper?
- (Piyush): On a high level, both MCMC and variational methods are approximate inference techniques where the goal is to approximate some target probability distribution. The difference lies in the kind of approximations they provide. MCMC gives a sampling-based stochastic approximation (i.e. one which can vary across runs) whereas variational inference is an optimization-based deterministic approximation (i.e. the same across runs). The important thing to remember, however, is that in the limit of infinite computational resources, MCMC can produce exact samples (i.e. if the chain is run long enough), so the approximation arises only because of the finite limit on resources. On the other hand, variational methods solve an optimization problem of minimizing the difference between the true distribution and a candidate distribution. Exact optimization is computationally intractable, so we usually solve a relaxed problem (thereby yielding an approximate solution).
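A tiny toy sketch of the stochastic-vs-deterministic distinction (the die example and all numbers are made up for illustration; neither function is an actual MCMC or variational algorithm):

```python
import random

# Hypothetical toy target: the mean of a fair six-sided die (exactly 3.5).

def sampling_estimate(n_samples, seed):
    """Sampling-based (MCMC-flavored) approximation: depends on the
    seed/run, but converges to the exact value as n_samples grows."""
    rng = random.Random(seed)
    draws = [rng.randint(1, 6) for _ in range(n_samples)]
    return sum(draws) / n_samples

def deterministic_estimate():
    """Closed-form (variational-flavored) deterministic answer:
    identical on every run."""
    return sum(v * (1 / 6) for v in range(1, 7))

est_a = sampling_estimate(10_000, seed=0)
est_b = sampling_estimate(10_000, seed=1)
exact = deterministic_estimate()
# est_a and est_b generally differ run to run, but both land near 3.5.
```

With finite samples the two sampling runs give (slightly) different answers; the deterministic computation never varies, which mirrors the distinction in the reply above.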
- On page 3 it mentions a compatibility function ψC when describing the factorization of a probability distribution in an undirected graphical model. What does this function represent?
- (Piyush): The compatibility function for each clique can be thought of as an (unnormalized) probability distribution over the set of nodes in that clique. A high value of the compatibility function, for a certain configuration of the nodes, would indicate that such a configuration is highly likely.
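A small made-up example of a compatibility function for a two-node binary clique (all table values are hypothetical):

```python
# A hypothetical compatibility function psi for a two-node clique {A, B},
# each variable binary. Higher values mark more likely joint
# configurations; the table need not be normalized.
psi = {
    (0, 0): 4.0,  # A and B agreeing is favored
    (0, 1): 0.5,
    (1, 0): 0.5,
    (1, 1): 4.0,
}

# Normalizing against the sum over all configurations turns it into a
# proper distribution over the clique.
Z = sum(psi.values())
p = {config: value / Z for config, value in psi.items()}
# p[(0, 0)] == 4.0 / 9.0, so agreement is highly likely.
```

The unnormalized values express relative preference; normalization over the whole graph is what the partition function handles in general.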
John
- Given the factorization of a directed graph model, it seems obvious to me that it would need to be acyclic. How would a non-acyclic directed graph model work? Would it be anything like the undirected graph model (working with SCC's perhaps)? Or are non-acyclic directed graph models just rare enough that they don't need to be defined?
- (hal) Cyclic (or "non-acyclic"!) graphs definitely exist and they will be a fairly major problem for the simple algorithms! I usually think of cycles as a sort of "feedback loop." However, in practice, it's fairly rare that we end up with cyclic (aka "loopy") directed models. Almost all examples of loopy models that we'll see are undirected.
- (John) Ok, fine, I'll just call them "loopy." "Cyclic" implies exclusivity to me. :-p
Suresh
- Why is the undirected graphical model formulated in terms of cliques instead of in terms of neighbors, as the directed model is?
- (Piyush) In the parent-child relationship of directed graphical models, each term in the factorized distribution represents the probability of a node given all its parents. In undirected graphical models, however, we don't have such parent-child relationships. Thus we define it in terms of a set of cliques, with each clique representing a probability distribution of how likely a particular configuration of the set of nodes in that clique is. Also, during factorization, we use maximal cliques since smaller cliques must be subsets of the maximal cliques.
John
- Factorizing distributions: this seems to be (especially in the directed case) a variation of the chain Bayes' rule. Is my characterization very far off?
- (hal) It's basically chain rule + independence assumptions. Order the variables topologically and apply the chain rule. Now, remove from the conditioning side anything that is not a parent. Voilà.
- (sam) Where does the independence assumption come into play? Isn't the topology of the graph implying dependencies? In the directed case a node depends on its parents; in the undirected case a node depends on its cliques and the compatibility functions on them.
- (John) If I understand correctly, you'd get the chain rule if you had a graph where every vertex had as parents each vertex that preceded it topologically, thus implying maximal dependency (because that's a maximal DAG). Removing edges from such a graph allows you to remove terms from the individual factors in the product. I have to admit that I'm still a little confused about the clique relationships (though Hal indicated that more clarity on cliques in GMs was forthcoming).
- Is p(x) the probability of all the vertices' r.v.'s matching the values in x? (i.e., does p(x) = p(x1,...,xn)?)
- (hal) x is just the collection of all the r.v.'s.
- (John) Ok. I think that's what I was getting at. I was getting hung up on notation a little.
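The "chain rule + drop non-parents" recipe above can be sketched on a hypothetical three-node DAG A → B, A → C (all probability tables are made up):

```python
# The chain rule gives p(a, b, c) = p(a) p(b | a) p(c | a, b); since B is
# not a parent of C in this DAG, the last factor drops to p(c | a).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # [a][b]
p_c_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # [a][c]

def joint(a, b, c):
    """Factorized joint after removing non-parents from the conditioning."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
# total sums to 1: the factorized terms still define a valid joint.
```

Removing the A → B or A → C edge would correspondingly drop a from the conditioning side of that factor.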
5 September
Samuel
- Page 501 KFL01, clarification question about the second column. The whole goal of this procedure is to "compile" one function for a particular marginalization, by which I assume is meant not to literally compute the numerical value: in the end a value can be plugged into the compiled function and the result computed immediately, without having to go through the whole factor graph. Is this right?
- (Piyush): I think an example would help here. For our graphical model settings, we assume that we are in a discrete land (i.e. each r.v. xi takes values in Ai, where Ai = {0,...,m-1} is a set of discrete values). For this discrete case, each marginal distribution gi(xi) would just be a probability mass function and can be represented as a vector X of size m. Since it's a pmf, the sum of the entries in X would be 1. The sum-product algorithm, upon termination, produces one such vector for each r.v. Now, let's say the pmf vector for xi is Xi. Then the probability p(xi = c) is just Xi(c). So what you mean by a "compiled" function is actually the pmf vector, and the individual probabilities for different values of the r.v. can be read off straightaway from this vector. There isn't any function evaluation going on (or needed). Note that if our marginal is a function of 2 random variables, we would have a 2D table (instead of a 1D vector) for the pmf. It's straightforward to generalize this to a higher number of random variables.
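A minimal sketch of that pmf-vector view (all numbers are made up; this is just the data layout that sum-product produces, not the algorithm itself):

```python
# Hypothetical output of sum-product for one discrete r.v. x_i taking
# values in {0, ..., m-1}: the marginal g_i is just a pmf vector.
X_i = [0.1, 0.5, 0.3, 0.1]   # length m = 4, entries sum to 1

def prob(c):
    """Reading off p(x_i = c) is a table lookup, not a function evaluation."""
    return X_i[c]

# A marginal over two r.v.s would instead be a 2D table:
X_ij = [[0.2, 0.1],
        [0.3, 0.4]]          # X_ij[a][b] = p(x_i = a, x_j = b)
```

So the "compiled function" is literally a lookup table; there is nothing left to evaluate once the messages have been passed.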
Stephen P
- 1. Page 506-7 KFL01 (Example 6), Figure 12d. How can you find f(x1) or f(y1 | x1) if you can't observe x1 in the hidden Markov model?
- (Piyush): Yes, you don't observe x and just observe y, but we have models that define the quantities above. In an HMM (or similar models), f(y|x) (the probability of y given x) is usually called an observation model and p(xi | xi−1) (the probability of xi given xi−1) is usually called a transition model. Given these models and the observed data y, the inference procedure estimates the hidden variables x (as p(x|y)) for us. To use these models in the inference procedure, you don't need to know x.
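A toy sketch of this, assuming a hypothetical two-state HMM with made-up tables; the forward pass below computes p(y) using only the transition and observation models plus the observed y, never the hidden x:

```python
# transition[prev][cur] = p(x_i = cur | x_{i-1} = prev)
transition = [[0.9, 0.1],
              [0.2, 0.8]]
# observation[state][obs] = p(y_i = obs | x_i = state)
observation = [[0.7, 0.3],
               [0.4, 0.6]]
initial = [0.5, 0.5]

def forward(ys):
    """Forward pass: alpha[s] = p(x_t = s, y_1..t). Note that the hidden
    x is never observed; only the ys and the two models enter."""
    alpha = [initial[s] * observation[s][ys[0]] for s in (0, 1)]
    for y in ys[1:]:
        alpha = [sum(alpha[p] * transition[p][s] for p in (0, 1))
                 * observation[s][y]
                 for s in (0, 1)]
    return alpha

alpha = forward([0, 1, 1])
likelihood = sum(alpha)   # p(y_1..3), marginalizing out the hidden states
```

Normalizing alpha at the final step gives p(x_t | y), i.e. exactly the kind of posterior over hidden variables the reply describes.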
- 2. KFL01 (general question). The authors show that Bayesian Networks can be converted into factor graphs, which have better properties for generalized inference. Why not forget expressing models as directed acyclic graphs and just do everything as factor graphs?
- (Piyush): This is a modeling issue. The reason why you would want to use a Bayes net (and not directly a factor graph) is that there could be causal relationships in your model that you might need to express. This can be done only using parent-child relationships in a Bayes net formalism. Well, yes, you could also try doing it by being "direct" as you suggest but actually it would be much more difficult and much less intuitive than using a Bayes net in this setting.
Scott
- In figures 3 and 4, what's the meaning of the stuff in the dashed-line boxes? For example in Fig 3(a), why is there the summary for x2 when that doesn't appear in (3)?
- (Piyush): The dashed-line boxes denote trivial operations, drawn just for the sake of completeness. For example, in 3(a), fB is a function of x2 only, so the summary over ("not") x2 isn't really needed. Thus the summary operator is shown dashed. The same reason applies to the product operator for node x2.
Anthony
- The graphs shown in this article were mostly cycle-free (acyclic) until you reach section V. If we have a cyclic graph like in fig. 20, can we solve the problem by breaking one of the vertices (to make it acyclic) instead of using the methods described in this section? Would there be a penalty to this type of message-passing algorithm? Which node should we choose? Arbitrarily?
- (Piyush): I think it will result in incorrect estimates of the marginals, since by breaking one of the vertices you would also remove many other edges used for passing messages. So, instead, we rely on certain other techniques for loopy factor graphs, such as message-passing schedules or clustering some of the nodes together, without sacrificing the graph connectivity. We shall see more on this later during the semester.
Suresh
- In your expression for the message that a variable node passes to a function node, there's something I don't get. The message is defined as a product of messages from all neighbors of v to f (terms of the form μh→f). But this graph is bipartite! So all neighbors of v are function nodes and would not be passing messages to f themselves.
- (Suresh) - ah, you have a typo in the summary: the messages are of the form μh→v (from the neighboring function nodes h into v), not μh→f
- (Piyush): Thanks. Fixed.
Avishek
- We can apply sum-product algorithms to MRFs and Bayes nets only if we can represent them as factor graphs. Is it ALWAYS possible to convert MRFs and Bayes nets to factor graphs?
- - if YES, then why not straightaway model them as factor graphs?
- - if NO, then how do we efficiently compute marginal probabilities in such situations?
- (Piyush): Although there is an equivalence, the representational goals are different for MRF/Bayes-net and factor graphs. Something like an MRF or a Bayes net captures the relationships among random variables in the model. On the other hand, a factor graph captures the computational aspects of the model. Although the factor graph does contain the random variables (apart from the factor nodes), it may not be an appropriate choice to model the relationships among the random variables.
Nathan
- Are there any extensions of these message-passing algorithms to graphs that contain cycles? Is there some 'allowable' amount of cycles? Would the girth of the graph help determine this?
- (Piyush): Yes, there are such algorithms (see, for example, section 5). However, for factor graphs with cycles, the inference will be approximate and not exact (as it is for tree-structured factor graphs).
Seth
- Are there any restrictions on the types of functions fj that can/should be used on the factor nodes of the graph?
- (Piyush): In general, this could be any function that expresses how likely the configuration of the random variables associated with that function is.
Jagadeesh
- Many times when we are not sure about the dependencies between the random variables, we will alter the dependencies, learn the parameters again, and check which model performs better. For example, in the HMM case the common dependency is p(yi | si); instead of this we can also try p(yi | si, si−1). In both cases the dependency graph changes only a little (some extra edges are added) and hence most of the messages don't change. Of course the marginal probabilities p(yi) will change. But my question is: in such cases, instead of learning the parameters separately for the two models, can we learn the parameters for the most general model and from them infer the parameters of a more restricted model?
- (Piyush): I doubt if it would be possible in general. Also note that the factor graph for the HMM model you suggest (i.e. an observation model p(yi | si,si − 1)) would be loopy so it would not be possible to even do the inference exactly, in the first place!
- (Amit) : Piyush isn't it still acyclic?
- (Piyush): No it does have loops (although the factor graph isn't a directed graph).
Senthil
- How does Belief Propagation work in graphs with parameter tying (Hidden Markov Models and Dynamic Bayesian Networks)? More specifically, can this algorithm be used in training such networks (parameter learning) rather than inference? Sorry if this is covered later in the course or outside the syllabus of this course.
- (Piyush): I think it should be possible since in a Bayesian setting, even the parameters are treated as random variables. So learning the parameters should be the same as inferring them.
- (Samuel): Is it that easy? Don't we need to know at least the factorization of the distribution in order to do inference (and hence already know the marginalizations, so it's just a matter of computing them efficiently)?
Ruihong
- Given one original graph, we can construct more than one factor graph (many of which contain some unnecessary factor nodes). So, given a factor graph, how can we get the smallest factor graph (one which has only the necessary factor nodes) for the same original graph? I think this matters because, as the scale comes down, we can reduce the number of message-passing operations accordingly.
19 September
Anthony
- With reference to Fig. 7c on pg 14 (WJ, 2.5.2), any idea why the separator sets (rectangles [2 6] and [2 4]) are not on the same side as (rectangles [4 8] and [6 8])? At first glance, I would think that (rectangles [2 6] and [2 4]) should be the opposite, no?
- (Nathan) Yeah, I noticed that too, my bet is that it is a typo.
- What is meant by 'lack a chord'?
- (Anthony) Found the definition on wikipedia - "..a graph is chordal if each of its cycles of four or more nodes has a chord, which is an edge joining two nodes that are not adjacent in the cycle." (http://en.wikipedia.org/wiki/Chordal_graph)
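A small made-up illustration of that definition: the 4-cycle 1-2-3-4-1 lacks a chord, and adding the edge (1, 3) supplies one (the helper below is just a naive check for one cycle, not a full chordality test):

```python
# Edges of the 4-cycle 1-2-3-4-1 (no chord present).
cycle_edges = {(1, 2), (2, 3), (3, 4), (1, 4)}

def has_chord(cycle, edges):
    """cycle: node list in cyclic order. Returns True if some pair of
    non-adjacent cycle nodes is joined by an edge (i.e. a chord)."""
    n = len(cycle)
    for i in range(n):
        for j in range(i + 2, n):
            if (i, j) == (0, n - 1):      # adjacent around the wrap
                continue
            u, v = cycle[i], cycle[j]
            if (u, v) in edges or (v, u) in edges:
                return True
    return False

square = [1, 2, 3, 4]
# has_chord(square, cycle_edges) is False; adding (1, 3) makes it True.
```

A graph is chordal when every cycle of length four or more passes this check, which is why the plain 4-cycle is the smallest non-chordal example.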
Stephen
- On page 12 of Building Join Trees with Belief Networks, it uses a weight to determine the order in which nodes should be removed. It states the weight of a given node is the number of values of V. I'm not sure I understand what this means. Is this the # of discrete states this variable can hold? Or something else?
- How does the concept of d-separation relate to the concept of a graph having the Markov property? Or in other words, would the Markov property still hold if a graph contained nodes that are d-connected?
- (Nathan) I don't think so. If a graphical model had nodes that were d-connected, then it would not have the Markov property, because you could find a state that is conditionally dependent on a previous state.
26 September
Sam The first two paragraphs have a couple of things I can't make sense of.
- What is the measure ν defined via the Lebesgue measure dx?
- (Anthony) I'm guessing the v here refers to volume. Lebesgue measure (dx) is a standard way to measure length, area or volume.
- (Piyush) On a high level, you can think of a measure as a function that operates on a set. Probability is an example of a measure where the set is essentially the set of events you care about.
- What is a dominating measure?
- (Piyush) If μ is a dominating measure for another measure, say P, then it means that μ has mass everywhere P has mass.
- Why are the functions φα called potentials or sufficient statistics?
- (Piyush) They are called sufficient statistics since the probability distribution p(x;θ) depends on the data x only through these functions. We also call them potentials since (in a way) they also express how likely a particular configuration of the data is.
Stephen I'm sure these are really obvious to most of you, but...
- What's the difference between p(x;y) and p(x|y)?
- (Piyush) p(x;y) usually means a probability distribution over x parametrized by y. p(x|y) is the probability distribution of x "given y" (in other words after you observe y).
- What is an additive decomposition (see page 20)?
- (Piyush) You can write a probability distribution P as a *product* of a set of local functions. Alternatively you can also write P in an exponential form and, in the exponent, the same local functions would appear as being summed/added over. Hence the name additive decomposition.
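A one-configuration numerical sketch of this (factor values made up): the product of local functions equals the exponential of the sum of their logs, which is the additive decomposition in the exponent.

```python
import math

# Hypothetical local-function values at a single configuration x.
factors = [0.5, 0.8, 0.25]

# Product form: P(x) proportional to the product of local functions.
product_form = 1.0
for f in factors:
    product_form *= f

# Exponential form: the same local functions appear *summed* in the
# exponent (as their logs).
additive_form = math.exp(sum(math.log(f) for f in factors))
# product_form and additive_form agree (0.1 here, up to floating point).
```

So "additive decomposition" is just the product factorization viewed inside the exponent of the exponential-family form.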
30 September
Piyush
- In section 4.2.1 they say that the entropy H(p(x;θ(μ))) is, in general, not possible to compute easily for large problems. What makes a problem "large" in this context? Does it mean there might be very many *multivariate* random variables such that the integration involved in computing H would be intractable?
Stephen Again, these are probably obvious to most, but:
- In Proposition 4, what does the ri before the big italicized M stand for? And what does it mean?
- (Piyush) ri(M) means the relative interior of M. For all our purposes, we can just think of it as the interior of M.
- Near the top of page 27, A* is referred to as the negative entropy function and the dual function. Are these the same thing? If not, when should one or the other be used?
- (Piyush) They are the same.
- What does it mean for a function to be Lipschitz (pardon my French)?
- (Piyush) If a function is Lipschitz, it just means that it has a bounded first derivative. Mathematically, for a function f, you can write this condition as: |f(x) − f(y)| ≤ C|x − y|, where C is a (finite) constant. We want this to be true for our log partition function A so that the mean parameters, which are the first derivatives of A, are finite as well.
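A quick numerical sanity check of the condition |f(x) − f(y)| ≤ C|x − y| (functions and grid chosen arbitrarily): sin has |f'| ≤ 1, so its difference quotients never exceed C = 1, while x² on a growing domain does not admit any single C.

```python
import math

def max_quotient(f, xs):
    """Largest difference quotient |f(x)-f(y)| / |x-y| over a grid;
    a lower bound on the Lipschitz constant of f on that grid."""
    return max(abs(f(x) - f(y)) / abs(x - y)
               for x in xs for y in xs if x != y)

xs = [i / 10 for i in range(-100, 101)]     # grid on [-10, 10]
q_sin = max_quotient(math.sin, xs)          # stays <= 1 (|cos| <= 1)
q_sq = max_quotient(lambda x: x * x, xs)    # grows with the domain width
```

On a wider grid q_sq keeps growing (it equals max |x + y|), which is exactly the failure of a global Lipschitz bound for x².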