Talk:MLRG/summer07
Laplacian Eigenmaps
- (Hal) Many of the algorithms we've talked about essentially use a neighborhood graph. Do we have any idea how sensitive the algorithms are to the choice of neighborhood? Why do we use Euclidean distance for constructing these graphs... what happens if we use something else?
- (Suresh) In theory, it shouldn't matter, because the definition of a manifold is that it is locally Euclidean. Also, using l2 means that various optimizations are easier (read: not NP-hard).
- (Sam) I might be mixing something up here. For building the adjacency graph, Euclidean distances are used, but for the weights in LE (I assume this should work for LLE too) heat kernels are often used. The Diffusion Maps paper talks about how to choose kernels that approximate different operators on the underlying manifold.
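To make the graph construction concrete, here is a minimal sketch (Python/NumPy; the neighborhood size k, the heat-kernel width t, and the function name are all illustrative assumptions, not from any particular paper) of a k-nearest-neighbor adjacency graph with heat-kernel weights of the kind used in Laplacian Eigenmaps:

    import numpy as np

    def heat_kernel_graph(X, k=10, t=1.0):
        # Symmetric kNN graph with heat-kernel weights W_ij = exp(-||xi - xj||^2 / t).
        n = X.shape[0]
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared Euclidean distances
        W = np.zeros((n, n))
        for i in range(n):
            nbrs = np.argsort(sq[i])[1:k + 1]                 # k nearest neighbors of point i (excluding itself)
            W[i, nbrs] = np.exp(-sq[i, nbrs] / t)
        W = np.maximum(W, W.T)        # symmetrize: keep an edge if either endpoint selected it
        D = np.diag(W.sum(axis=1))    # degree matrix
        L = D - W                     # unnormalized graph Laplacian
        return W, D, L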
- (Hal) One nice thing about PCA is that it has a probabilistic interpretation; do we see any light for finding a probabilistic interpretation of ISOMAP/LLE/LE/etc.?
- (Piyush) Recently, there has been some work on latent variable models in the context of nonlinear dimensionality reduction. Instead of defining "definite" neighborhoods (in the high- and low-dimensional spaces), latent processes are used to define pairwise relations between points in both the high- and low-dimensional spaces, and embeddings can be achieved by minimizing some divergence function (KL divergence, for example) between the probability distributions of the two spaces. Stochastic neighbor embedding (SNE) is another example where the low-dimensional embeddings are achieved using such a probabilistic setting.
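As a rough illustration of the kind of objective being described, here is a minimal sketch of an SNE-style KL cost (not a full optimizer; a single assumed bandwidth sigma replaces the per-point bandwidths used in the SNE paper, and all names are made up):

    import numpy as np

    def sne_kl_cost(X, Y, sigma=1.0):
        # Sum over i of KL(P_i || Q_i), where P_i and Q_i are the neighbor
        # distributions induced by the high-dimensional X and the low-dimensional Y.
        def neighbor_probs(Z, s):
            sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
            P = np.exp(-sq / (2 * s ** 2))
            np.fill_diagonal(P, 0.0)              # a point is not its own neighbor
            return P / P.sum(axis=1, keepdims=True)
        P = neighbor_probs(X, sigma)              # p_{j|i} in the high-dimensional space
        Q = neighbor_probs(Y, 1.0)                # q_{j|i} in the embedding space
        eps = 1e-12
        return np.sum(P * np.log((P + eps) / (Q + eps)))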
- (Hal) The eigs are computed on L, not D, right?
- Right. I fixed the typo.
- (Suresh) What is the relationship between the graph Laplacian and the Laplace-Beltrami operator?
- The connection works roughly like this. The Laplace-Beltrami operator is the generalization to arbitrary manifolds of the "normal" Laplacian operator in R^n, which is merely the sum of the second derivatives, Δf = Σ_i ∂²f/∂x_i². Now, how do we define the second derivative? It's the rate of change of the first derivative, which can be rewritten as f''(x) ≈ (f(x + ε) - 2f(x) + f(x - ε)) / ε². Notice that if we think of x + ε and x - ε as "neighbours" of x, and pretend that ε = 1, then we are essentially averaging the difference in weights between x and its neighbours. This is essentially what the graph Laplacian is doing.
- (Sam) This is essentially the finite difference approach. Imagine a 1-D function f from which we only have discrete samples at intervals h. We can approximate the first derivative at f(x) by central differences: f'(x) ≈ (f(x + h/2) - f(x - h/2)) / h. We can do the same for computing the derivative at x + h/2 and x - h/2, which gives us approximations to f'(x + h/2) and f'(x - h/2). Now doing the central differences on f'(x) gives f''(x) ≈ (f'(x + h/2) - f'(x - h/2)) / h; plugging in the approximations of the derivatives gives f''(x) ≈ (f(x + h) - 2f(x) + f(x - h)) / h², which is what we have above from Suresh for the second derivative. Note that we don't need to compute the first derivative for computing the second derivative and therefore don't need samples at x ± h/2. An intuition on what the Laplacian does (summing up all the unmixed partial derivatives) is given by looking at the Laplacian as the divergence of the gradient. An intuitive description of divergence can be found here. (A small numerical check of the connection to the graph Laplacian is sketched below.)
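As a hedged numerical check of this correspondence (assuming a path graph over regularly spaced samples with unit edge weights; nothing here is from a specific paper), the graph Laplacian of the chain applied to the samples matches the scaled second derivative:

    import numpy as np

    h = 0.01
    x = np.arange(0.0, 1.0, h)
    f = np.sin(2 * np.pi * x)          # samples of a smooth 1-D function

    # path graph: each sample is connected to its two grid neighbours with weight 1
    n = len(x)
    W = np.zeros((n, n))
    for i in range(n - 1):
        W[i, i + 1] = W[i + 1, i] = 1.0
    D = np.diag(W.sum(axis=1))
    L = D - W

    # (L f)_i = 2 f(x_i) - f(x_{i-1}) - f(x_{i+1}), which is -h^2 f''(x_i) up to O(h^4) away from the ends
    Lf = L @ f
    fpp = -4 * np.pi ** 2 * np.sin(2 * np.pi * x)              # analytic second derivative
    print(np.max(np.abs(Lf[1:-1] - (-h ** 2) * fpp[1:-1])))    # prints a number on the order of 1e-6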
- (Suresh) Why do we have this weird scaling constraint y^T D y = 1?
- (Suresh) This actually makes sense if you set up the Laplacian the right way. The book by Fan Chung Graham on Spectral Graph Theory has more details on this: here is the revised version of Chapter 1, which contains the relevant details.
- (Sam) What do you mean by setting it up right? Do you mean the normalized graph Laplacian? How does it make sense?
- I believe that the normalized graph Laplacian is the same as the one defined by Fan Chung in the linked article, yes. --Suresh 17:25, 16 May 2007 (MDT)
- (Sam) I think in the Diffusion Maps paper they show that the normalized graph Laplacian approximates the Fokker-Planck operator and is better suited for clustering. The "unnormalized" graph Laplacian seemed to yield better results in my experience. I still don't understand how the scaling factor makes sense. One version of the normalized graph Laplacian is D^{-1}L: instead of solving the eigenproblem Ly = λy, we solve the generalized eigenproblem Ly = λDy. Is that why the scaling factor makes sense? Because we want to fix the length of Dy in the normalized case?
- (Suresh) The relevant discussion is on page 16-17 of the linked chapter. Basically, the reason to use the normalized Laplacian is that it symmetrizes the Laplacian in the following sense: you can think of the entries as transition probabilities, and this normalization ensures that the probability of entering an edge is the same whether from one side or the other. Notice that otherwise, if we merely used the degrees (or weighted degrees), an edge might be traversed with different probabilities from either side. Symmetrizing makes the analysis of conductance a lot easier. But I am willing to agree that there isn't a fundamental difference in the theory if you normalize or don't, as long as you keep track of what's going on. The difference in practice is interesting though.
- (Sam) My explanation for the difference in practice is: if we have noisy data with vertices (outliers) that connect (shortcut) non-local parts of the manifold (e.g. across the gap in the swiss roll), such a vertex will probably have smaller weights to all the vertices it connects to (if we use Gaussian weights with an appropriate sigma) than a "regular" vertex has. If we normalize by the degree (which is small too), we put more weight on the edges that shortcut, which increases the effect of the shortcut.
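For reference, a minimal sketch (assuming a symmetric weight matrix W, for example from the graph sketch above; scipy's symmetric eigensolver is used for both variants) of computing the embedding with the ordinary eigenproblem Ly = λy versus the generalized eigenproblem Ly = λDy, whose eigenvectors come out satisfying the y^T D y = 1 normalization:

    import numpy as np
    from scipy.linalg import eigh

    def laplacian_eigenmap(W, d=2, normalized=True):
        # Return a d-dimensional embedding from a symmetric weight matrix W.
        D = np.diag(W.sum(axis=1))
        L = D - W
        if normalized:
            vals, vecs = eigh(L, D)    # generalized problem L y = lambda D y; eigh scales y so that y^T D y = 1
        else:
            vals, vecs = eigh(L)       # ordinary problem L y = lambda y
        # drop the trivial constant eigenvector (eigenvalue ~ 0) and keep the next d
        return vecs[:, 1:d + 1]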
Embeddings
- (Sam) Solution attempt for the l∞ embedding: Let x_1, ..., x_n be our n points, denote by f_k(x) the k-th coordinate of x, and let d(x_i, x_j) be the original metric. Let's set f_1(x_1) to 0 and f_1(x_i) to d(x_1, x_i). So the first coordinates contain the distances from x_1 to all other points. Now we need to have the distances for x_2 to x_3, ..., x_n. By setting f_2(x_2) = 0 and f_2(x_i) = d(x_2, x_i) we achieve this; furthermore, we guarantee by the triangle inequality that we don't change any of the distances for x_1, since we set f_2(x_1) = d(x_2, x_1). We have that |f_2(x_1) - f_2(x_i)| = |d(x_2, x_1) - d(x_2, x_i)| ≤ d(x_1, x_i), which tells us that the new coordinate contributes at most d(x_1, x_i) and that max_k |f_k(x_1) - f_k(x_i)| = |f_1(x_1) - f_1(x_i)| = d(x_1, x_i). Therefore we can construct a zero distortion l∞ embedding in n - 1 dimensions by setting f_k(x_i) = d(x_k, x_i) for k = 1, ..., n - 1. (A quick numerical check of this construction is sketched below, after this thread.)
- Yup. exactly right.
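As a quick sanity check of the construction above (a minimal sketch assuming an arbitrary finite metric given as a symmetric distance matrix; the random points are only there to generate one):

    import numpy as np

    rng = np.random.default_rng(0)
    pts = rng.standard_normal((6, 3))                          # 6 points, just to induce a metric
    Dmat = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

    n = Dmat.shape[0]
    F = Dmat[:, :n - 1]        # coordinate k of point i is d(x_k, x_i), for k = 1, ..., n-1

    # the l_infinity distance between rows recovers the original metric exactly
    Linf = np.max(np.abs(F[:, None, :] - F[None, :, :]), axis=-1)
    print(np.allclose(Linf, Dmat))                             # True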
Probabilistic Embeddings
- There is a paper by Thomas Hofmann on Probabilistic LSI: is this also relevant?
- (Piyush) Yes, this seems relevant. Both PCA and Latent Semantic Indexing are basically dimensionality reduction techniques based on the SVD. The eigenvectors calculated in LSI are essentially document vector coordinates, while in PCA these are principal components. There are a few subtle (and a few not so subtle :)) differences, however. PCA works by decomposition of a covariance matrix (always a square matrix) of the data, whereas LSI seeks a decomposition of a "co-occurrence matrix" (term-document pairs; this generally isn't a square matrix). The probabilistic approach to LSI described in the paper is also based on a latent variable formulation, but it appears to be different from probabilistic PCA. This approach defines the joint (term-document) probability in terms of a mixture over the components of a latent class variable. In probabilistic PCA, however, each observation is assumed to be generated by some transformation of a latent variable of low dimensionality.
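A hedged sketch of the two decompositions being contrasted (plain NumPy with made-up shapes; this is just the standard linear algebra, not the PLSI model from the paper):

    import numpy as np

    rng = np.random.default_rng(0)

    # PCA: eigendecomposition of the (square) covariance matrix of the data
    X = rng.standard_normal((100, 20))           # 100 samples, 20 features
    C = np.cov(X, rowvar=False)                  # 20 x 20 covariance matrix
    evals, evecs = np.linalg.eigh(C)
    pcs = evecs[:, ::-1][:, :5]                  # top 5 principal components

    # LSI: truncated SVD of the (generally rectangular) term-document matrix
    A = rng.poisson(1.0, size=(500, 50)).astype(float)   # 500 terms, 50 documents
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    doc_coords = (np.diag(s[:5]) @ Vt[:5]).T     # 5-dimensional document coordinates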
- (Sam) What do we gain from GP-LVM over Kernel PCA?
- (Piyush) Kernel PCA, by itself, is non-probabilistic so it lacks the benefits offered by methods like PPCA or GPLVM, such as dealing with missing data. Furthermore, GPLVM lets us optimize kernel parameters which is not the case with Kernel PCA where the parameters have some fixed value. Note, however, that Kernel PCA does have a probabilistic version which, in fact, has been formulated on the basis of GPLVM. Here is a recent paper.
- (Sam) I am confused by the usage of the terms "optimizing the parameters" in PPCA and "optimizing with respect to latent variables" in DPPCA. Aren't latent variables and parameters optimized in both cases, just by different means?
- (Piyush) I guess this confusion arose because GPLVM does optimize the kernel parameters, apart from the latent variables (for nonlinear kernels). Otherwise, "parameter" was meant to be just the matrix W which transforms X to Y (y_n = W x_n + e_n). Latent variables (X) and parameters (W) are never optimized at the same time in either of the two approaches. PPCA first obtains the marginalized likelihood of the data (Y) by marginalizing over the latent variables (X), and then optimizes with respect to the parameter (W), using maximum likelihood. DPPCA, on the other hand, does the opposite: it obtains the marginalized likelihood of the data by marginalizing over the mapping parameter (which is valid since the parameter W in this case is treated as a random variable), and then optimizes with respect to the latent variables, using maximum likelihood.
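To make the symmetry explicit, here is a summary of the two marginalizations in LaTeX (notation follows the y_n = W x_n + e_n model above with Gaussian priors; the exact covariance forms are in the PPCA and GPLVM papers):

    % PPCA: integrate out the latent variables X, then optimize the mapping W
    p(Y \mid W) = \int p(Y \mid X, W)\, p(X)\, dX
      \quad\Rightarrow\quad \hat{W} = \arg\max_W \log p(Y \mid W)

    % DPPCA: integrate out the mapping W, then optimize the latent variables X
    p(Y \mid X) = \int p(Y \mid X, W)\, p(W)\, dW
      \quad\Rightarrow\quad \hat{X} = \arg\max_X \log p(Y \mid X)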
- (Sam) I guess I have some misconceptions here. Just to clarify (please correct me if I am wrong): in PPCA we are trying to find the parameters (W) that maximize the likelihood of Y being generated from the latent variables X with a given prior (Gaussian in this case). So here is one thing I don't understand: how are we "placing" the x_i? And in DPPCA, are we trying to find "locations" of the x_i that maximize the likelihood of Y being generated from X? But this depends on the mapping parameters, no?
- (suresh) The PPCA paper explains this: you have to remember that this is a probabilistic embedding, and so there are no x to place as such: all we can place is the "average" x, which can be computed from the inverse mapping using an ML argument (see Section 3.4). You can also write the conditional probability of x given y as a normal distribution centered around the above mean.
- (Piyush) Blame it on the underlying obnoxious maths! :) Anyway, here is how it works out: in the case of PPCA, the conditional distribution of the latent variables x_i given the observed variables y_i turns out to be another Gaussian (this can be calculated using Bayes' rule). This distribution looks like this:
p(x | y) = N(x | M^{-1}W^T y, σ²M^{-1}) (y in the expression is assumed to be already centered), where M = W^T W + σ²I. W and σ² can be estimated using maximum likelihood, and this gives us x (in fact it will be given by the mean of the above distribution, as Suresh pointed out above). More details are in the original Tipping-Bishop paper on PPCA.
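A minimal numerical sketch of evaluating this posterior (W and σ² are assumed to already be estimated, e.g. by maximum likelihood; the shapes and values below are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    D_obs, d = 5, 2                      # observed and latent dimensionality
    W = rng.standard_normal((D_obs, d))  # mapping parameter (assumed already estimated)
    sigma2 = 0.1                         # noise variance (assumed already estimated)
    y = rng.standard_normal(D_obs)       # one centered observation

    M = W.T @ W + sigma2 * np.eye(d)
    mean_x = np.linalg.solve(M, W.T @ y)    # posterior mean  M^{-1} W^T y
    cov_x = sigma2 * np.linalg.inv(M)       # posterior covariance  sigma^2 M^{-1}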
- (Piyush) And for DPPCA, the marginalized likelihood turns out to be independent of W (it gets integrated out), and setting the gradient of the log likelihood w.r.t. X to zero and simplifying yields a solution for X given by X = U L V^T, where U is an N×d matrix whose columns are the first d eigenvectors of YY^T, L is a d×d diagonal matrix, and V is an arbitrary d×d rotation matrix. (Exact details of these matrices are given in the GPLVM papers listed under suggested readings.)
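As a hedged sketch of the core computation (only U is extracted here; the exact diagonal scaling L is given in the GPLVM papers, and V can be taken as the identity; Y is assumed to be an N×D centered data matrix):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, d = 50, 10, 2
    Y = rng.standard_normal((N, D))
    Y -= Y.mean(axis=0)                       # center the data

    evals, evecs = np.linalg.eigh(Y @ Y.T)    # eigendecomposition of the N x N matrix YY^T
    order = np.argsort(evals)[::-1]           # eigenvalues in decreasing order
    U = evecs[:, order[:d]]                   # first d eigenvectors, an N x d matrix
    # X = U L V^T, with the diagonal L from the GPLVM paper; with V = I this is just X = U L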
- (suresh) There is still one thing I don't understand though: the input modelling gives us p(y | x), and p(x) is defined for us. What I don't understand is how we get p(x | y), since for this, we need p(y), and I don't know how to construct p(y).
- (Piyush) p(y) is the marginalized probability we got after integrating x out. It's in terms of W and although it's written as p(y|W), it indeed is just p(y), since W is just a parameter which p(y) depends on (the random x has already been integrated out).
- (Sam) Thanks for all the answers. It's somewhat clearer now; I am not used to thinking in the probabilistic setting. Some more questions (yes, it takes a while until it gets into my thick skull). One nice thing about PPCA is that we now have a probability distribution over X given Y. Can we write such a distribution in the case of DPPCA? And in both PPCA and DPPCA, is d fixed (the user's choice)?
- (Piyush) Writing a similar distribution wouldn't make sense since X is kind of deterministic for the DPPCA case. The number of latent dimensions d is given beforehand for PPCA and DPPCA.
- In class we talked about X being deterministic (in DPPCA) since we have to take the gradients with respect to X, but when we marginalize over W we have a distribution over Y given X. Does this make any sense if we assume X to be deterministic?
- (Piyush) It's kind of a misnomer, since what we have is a marginalized distribution of Y with the other random variables (W in the DPPCA case) integrated out. The marginal distribution of Y is in terms of the "deterministic" X, which we later take the gradient with respect to.
- Is DPPCA therefore kind of a misnomer too? I would assume we call it PPCA because we end up with a distribution over X, whereas in DPPCA there is no such thing, so we don't have a probabilistic PCA in the sense of PPCA.
- (Piyush) I would still like to assume that the term "dual" for DPPCA is coined because of how the two approaches differ in terms of what is integrated out and what is optimized over. DPPCA treats the mapping parameter as a random variable, so it is still based on a probabilistic assumption.
- They mention that Isomap has to be used as an initialization for the GP-LVM in order to avoid local optima. Doesn't that defeat its purpose? How well would Isomap alone perform in contrast to the GP-LVM in the NN-classification examples?
- (Piyush) I think, reconstruction-accuracy-wise, GPLVM should be more or less similar to Isomap (hopefully not worse, if not better), though I haven't seen results that compare the two on such a basis. I think it would be interesting to run both methods on the swiss-roll data and observe the reconstruction errors. However, GPLVM also provides additional advantages due to its probabilistic formulation, which Isomap doesn't. For example, Isomap can't deal with the problem of missing data. Also, Isomap doesn't produce an explicit mapping (for example, mapping a new point from a similar distribution, Y to X, or a generative mapping of type X to Y). GPLVM, at least, does give us a generative mapping (X to Y), which lets us do useful things such as generating fantasy data from any point in latent space. So I would say GPLVM is richer in terms of things that can be achieved with it.
- (Sam) Since DPPCA marginalizes over the parameters, we need a prior on the parameters. In the paper I only saw this mentioned briefly, and they used a Gaussian. Is there a justification for using a Gaussian prior on the parameters?
- (Piyush) A Gaussian prior is chosen for W since the likelihood too is Gaussian, and the product of the two is again Gaussian. This makes the evaluation of the marginalized likelihood easier (and also means that we can view the marginalized likelihood in the DPPCA case as a product of D independent Gaussian processes, where D is the data dimension).
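As a hedged sketch of what that product looks like (following the usual DPPCA/GPLVM derivation with a unit-variance Gaussian prior on the rows of W and noise precision β; the exact form is in the GPLVM papers):

    % marginal likelihood of the data after integrating out W, one Gaussian factor per data dimension
    p(Y \mid X) = \prod_{j=1}^{D} \mathcal{N}\!\left( y_{:,j} \,\middle|\, 0,\; X X^{T} + \beta^{-1} I \right)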
- So the choice of priors (in PPCA and DPPCA) is due to the need for conjugate priors to solve the problem, right? What if we had knowledge about the type of distribution of X that generated Y? Is there a possibility in PPCA to incorporate this knowledge by choosing a different prior?
- (Piyush) Typically, conjugate priors arise from the need for the posterior of some distribution to have the same form as the prior. The case here is slightly different, though choosing a conjugate prior still helped. In this case, we just want the marginal probability of Y to have a familiar form (Gaussian). By choosing the priors as Gaussian, the marginal probability turns out to again be Gaussian (since, for DPPCA for example, it's the product of p(Y|W) and p(W) integrated over W, and this product turns out to be Gaussian, which makes our job easier). The case with PPCA is similar.
- (Suresh) To spin off a new thread from Piyush's last reply: what priors are "allowed" for the whole scheme to work out? Will any conjugate pair of distributions work? Piyush's last note appears to suggest that conjugacy is not critical for PPCA to work. Concretely, under what prior assumptions on x can we recover a formulation for W on which we can base the method?
- (Piyush) I think it's not about conjugacy as such. Since PPCA adopts a linear-Gaussian framework to begin with, the data Y (and so the latent space X which generates it) are assumed to be Gaussian. PCA is typically (always?) known to work well only on Gaussian data. I think this could be the reason why they assume everything to be Gaussian in coming up with the probabilistic formulation.
- (Piyush) This paper on multinomial PCA (a discrete version of PCA) is interesting. It's related to the problem of document components. It turns out that the hidden variable x and the observation, both, follow distributions that are indeed conjugate to each other (Dirichlet and multinomial respectively). In fact, for PPCA too, we have gaussian and gaussian which are conjugates.
- (hal) My $0.02. As Piyush mentions, there's multinomial PCA. basically for any likelihood function F, if you have a conjugate prior G, you can make a PCA out of it and call it F-PCA. you can do this even if G is not conjugate; it just won't be efficient. multinomial pca actually has another name: LDA (as in latent dirichlet allocation, not linear discriminant analysis :P). you can actually be more clever if you want. if you let F be a likelihood, and let G0 be conjugate to F, then F/DP(alpha G0) can be made into pca as well and still be reasonably efficient (DP = dirichlet process). in the case of gaussians, what happens is that if you integrate out W, you get an infinite mixture of gaussians with constrained covariance (due to the low rank)... (okay, this is a conjecture, but i'm 99.9% sure it's true). you'd have to do something like sampling or variational EM for inference though. it's probably also worth making clear that in the nonlinear version of GPPCA, the prior on W isn't a gaussian...in fact, W isn't there at all. rather we replace the linear mapping Y = WX with a non-linear mapping Y = F(X), where now F (as a function) is a random variable. we place a gaussian process prior on F (think of a gaussian process as an infinite-dimensional gaussian distribution and you won't be led too far astray). this is where the kernel trick comes from.