Talk:Algorithms Seminar/Spring09

23 Jan, k-means

Nathan

What does it mean for an algorithm to be O(X)-competitive?
An approximation algorithm A that seeks a solution minimizing some objective \phi is O(X)-competitive if \phi_A \le O(X) \cdot \phi_{OPT}, where \phi_A is the value of the objective function obtained by A and \phi_{OPT} is the value obtained by an optimal (exact) algorithm.
What are some good methods to help make k-means robust to outliers?
As Suresh discussed last week, the "mean" measure is inherently sensitive to outliers. Likewise, any algorithm that does a good job of minimizing the sum of squared distances will also be sensitive to outliers. In other words, it seems to me that any attempt to make the k-means algorithm less sensitive to outliers is bound to also make it a worse approximation to the optimal k-means solution. So, if you don't want to be sensitive to outliers, I would look into k-medians instead of k-means.
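To make that point concrete, here is a tiny illustration (a throwaway Python snippet of my own, not anything from the seminar) of how a single outlier drags the mean but barely moves the median, which is exactly why k-medians is the more robust choice:

# One outlier moves the mean far more than the median.
import numpy as np

points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is the outlier
print(np.mean(points))    # 22.0 -- dragged toward the outlier
print(np.median(points))  # 3.0  -- essentially unaffected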

Piyush

My feeling is that the k-means algorithm might fare poorly if there are substantial differences between the original cluster sizes (or, in general, between the cluster distributions). Say, for example, the data has one very large cluster and two reasonably small ones (and assume they are all spherical). The way k-means partitions (i.e., by minimizing the sum of squared distances from the cluster centers) might make the algorithm "split" the large cluster and assign some of its points to the smaller clusters. If this does happen, it would clearly not be the correct clustering. Are there ways to circumvent such a problem in k-means?
You have certainly brought up an important limitation of k-means. Again, I think the problem is inherent in the choice of objective function rather than in the k-means algorithm's ability to approximate the "optimal" k-means solution. On the other hand, if I'm not mistaken, training a Gaussian mixture model with EM under the restriction that all Gaussians share the same covariance matrix turns out to be very similar to the k-means algorithm. EM, however, need not impose the single-covariance-matrix constraint, so if you have reason to suppose that your clusters vary in size and shape, you might consider using EM and allowing each cluster to have a different covariance matrix (if you can still assume some flavor of Gaussian for each).
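As a rough illustration of that last suggestion, here is a hedged sketch (assuming scikit-learn, which is my choice, not something from the discussion) that fits both a hard k-means clustering and an EM-trained Gaussian mixture in which each component gets its own full covariance matrix; the data generation is purely hypothetical:

# Sketch: compare hard k-means labels with an EM-trained Gaussian mixture that
# is allowed a separate full covariance matrix (and weight) per component.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# One large spherical cluster and one small elongated one (hypothetical data).
big = rng.normal(loc=[0, 0], scale=2.0, size=(500, 2))
small = rng.normal(loc=[8, 0], scale=[0.3, 1.5], size=(50, 2))
X = np.vstack([big, small])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm_labels = gmm.fit_predict(X)

# The mixture model can adapt each component's shape and weight, so it can be
# less prone to "splitting" the big cluster than plain k-means on data like this.
print(np.bincount(km_labels), np.bincount(gmm_labels))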

John Meier

What is a hypersphere/hyperplane?
A hypersphere is just the generalization of a sphere to d dimensions (a hypersphere in 2 dimensions is a circle; in 3 dimensions it is a sphere).
A hyperplane is likewise the generalization of a plane to d dimensions: it is a flat subspace of dimension d-1, so in 2 dimensions it is a line and in 3 dimensions it is an ordinary plane.
How does superpolynomial time differ from non-deterministic polynomial time?
As was pointed out during the seminar, the class NP applies to problems for which there exists an algorithm (whether or not we know what the algorithm is) that can verify a solution to the problem in polynomial time. Note that stating that a problem is in NP is a declaration of what can be done.
Stating that an algorithm takes super-polynomial time means that its worst-case running time is not bounded by any polynomial in the input size: for every polynomial p, there are inputs of size n on which the algorithm takes more than p(n) steps. Note that this is a statement of what the algorithm cannot do.
It seems like k-means++ offers several improvements over traditional k-means without requiring any significant tradeoffs. Has k-means++ become the norm in applications that have been using the original algorithm?
As we discussed in class, the paper is still relatively new. Also, in the paper, it was only tested on a small number of data sets. Maybe my Wikipedia article on k-means++ will help to improve its popularity. :-)
Are there any methods for adaptively changing k, the number of clusters, during a run of the algorithm? (e.g., increment k and try again if the maximum point-to-centroid distance in any cluster is greater than a threshold)
There are certainly other clustering algorithms that do not require k to be specified in advance (an algorithm named "Cobweb" comes to mind).
In class, you also asked a related question: What do you do if a cluster has no points in it?
I could imagine adding a check to each iteration of the algorithm so that if a cluster ever loses all of its data points, you randomly re-seed the empty cluster at some data point from the input (possibly using the same initialization method that was used originally, maybe k-means++). This might do strange things, though, if you really did pick a k that was too big.
Another possibility, if you are sure that you have the right k, would be to leave the algorithm as is (assuming there is some randomness in the initialization or in the optimization process) and just re-run it until you get k legitimate clusters.
If you aren't sure about k, you could decrement k and try again at the lower k. These are just some thoughts.
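Here is a minimal sketch of that re-seeding idea, written as a single Lloyd-style update step; all names are hypothetical, and the choice to re-seed at a uniformly random data point is just one option (the k-means++ D^2 weighting could be substituted):

# Minimal sketch of the "re-seed an empty cluster" idea from the discussion.
# This is one Lloyd-style update, not a full k-means implementation.
import numpy as np

def lloyd_step(X, centers, rng):
    # Assign each point to its nearest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    new_centers = centers.copy()
    for j in range(len(centers)):
        members = X[labels == j]
        if len(members) == 0:
            # Empty cluster: re-seed it at a randomly chosen data point
            # (one could also reuse the k-means++ D^2 weighting here).
            new_centers[j] = X[rng.integers(len(X))]
        else:
            new_centers[j] = members.mean(axis=0)
    return new_centers, labels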

Links to find authors' slides

http://theory.stanford.edu/~sergei/slides/
http://www.stanford.edu/~darthur/

Wikipedia article

Since this is my first article, Wikipedia recommended that I start it as a subpage of my user page and then move it over to the real Wikipedia when I'm satisfied with it. Here is what I have so far:
k-means++
(John Meier) The Wikipedia article looks like a good start. The only significant thing I see missing is the actual description of how k-means++ chooses the initial cluster centers (and possibly some of the intuition behind the proof).
(Suresh) You need a lot more work on the page. First, there's no need to waste time discussing k-means: you should link to the relevant wiki page. There's no detail on the approach itself. Also read Wikipedia:YFA for style guidelines. Look at related pages (like Lloyd's algorithm) for a template on how to structure the page. And above all, it's time to promote the page to an actual wiki page, which will make it visible to others for comments.

30 Jan, Hierarchical methods

John Meier

I'm a bit unclear on a few points related to divisive clustering as described in the IR book chapter:
1) How exactly is a "flat" clustering algorithm (e.g. k-means) applied at each step? What k would be used at each level of the hierarchy? Also, it seems like running k-means to completion at the top level (all points in one cluster, or in other words unclustered) would give you a finished clustering, so what would be accomplished by running it recursively?
At each step, we invoke the flat clustering algorithm with a fixed value of k (say, k = 2).
At each level we use k = 2 and continue recursively until we end up with singleton clusters.
Running k-means at the top level will give two clusters (for k = 2). Thus we accomplish a division of the top-level cluster into two smaller clusters.
2) The book chapter states that top-down clustering is linear in the number of points and clusters, but it seems like the run time should be dependent on the complexity of the flat algorithm applied at each step, correct?
The flat algorithm has complexity linear in the number of points and clusters. The divisive algorithm makes multiple calls to the flat clustering, and for a fixed number of top levels the number of calls is also fixed. Hence, for a fixed number of top levels, the overall complexity of divisive clustering is linear in the number of points and clusters.
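A hedged sketch of this recursive k = 2 scheme (bisecting/divisive clustering), assuming scikit-learn for the flat step; the stopping rule (a maximum depth and a minimum cluster size) is just one illustrative choice:

# Top-down (divisive) clustering with a flat 2-means split at each level.
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, depth=0, max_depth=3, min_size=5):
    if depth == max_depth or len(X) <= min_size:
        return [X]  # leaf cluster
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    left, right = X[labels == 0], X[labels == 1]
    if len(left) == 0 or len(right) == 0:
        return [X]  # the split degenerated; keep this cluster as a leaf
    return (divisive(left, depth + 1, max_depth, min_size)
            + divisive(right, depth + 1, max_depth, min_size))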
Are there any other useful heuristics for bottom-up HAC besides the four mentioned (single-link, complete-link, group-average, centroid)? Would using a blend of some of these heuristics produce interesting behavior (for example, choose the next merge based on a weighted blend of the single-link/complete-link criteria)?
Although not equivalent in exact terms, the group-average criterion might be considered somewhat similar to a weighted blend of the single and complete linkages.

John Moeller

The Cluster Analysis Wikipedia article talks about using "The sum of all intra-cluster variance" as a distance criterion. How would calculating information local to one cluster be helpful? Do they mean "the sum of all inter-cluster variance"? I'm not sure what that would entail though.
We might aim to minimize "the sum of all intra-cluster variance". At any point, each of the K clusters has an intra-cluster variance. The next merge step selects the two clusters whose merge minimizes the overall sum of the intra-cluster variances of the merged cluster and the remaining K-2 clusters.
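That merge rule, picking the pair whose merge keeps the total intra-cluster variance smallest, is essentially Ward's criterion, so one hedged way to try it (assuming SciPy; the data here is a placeholder) is:

# Ward's linkage merges the pair of clusters that minimizes the increase in
# total within-cluster variance; fcluster then cuts the resulting tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))   # hypothetical data
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")     # cut into 3 clusters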

Parasaran Raman

Since the biggest problem with hierarchical clustering is that an incorrect assignment made early in the process cannot be corrected, are there any specific modifications we could make to the technique (maybe a multi-pass approach) to overcome this handicap? Moreover, even though hierarchical clustering gives us a nice variety of potential clusterings, how difficult is it to interpret the complex hierarchy, which is often more confusing than meaningful?

Adam

It looks like the main problem with centroid clustering is that the similarity between merged clusters does not decrease monotonically over subsequent mergings [I'm sure there are other problems with it, but the book seemed to focus on that one]. What, then, if we slightly tweak the similarity function so that it penalizes cluster similarity by the size of the clusters being considered for a merge? For example, one naive approach (although I don't guarantee that it is a bad one :-) would be to define the similarity between two clusters as the negative distance between their centroids plus the negative radius of each (and by radius, I mean the distance from the farthest point in the cluster to the centroid, although other penalties, such as the sum of all distances from the points in the cluster to the centroid, could be used instead). At the very least, I think that would fix the inversion example given in the book. Adding such a penalty on merging large clusters would seem to put additional pressure on keeping clusters of similar sizes, which could be good or bad, but perhaps adding the negative square root of the radii could still solve the inversion problem without such heavy consequences. Anyway, do you see any other problems or any possible advantages of such an approach, or have you already seen it in the literature?
Although I agree that such a penalty scheme might help in keeping the cluster sizes similar, I am not convinced that it will always work. It does work for the example in the book. But suppose we slightly tweak the book's example so that the three points now have coordinates d_1 = (1+\epsilon, 1), d_2 = (11, 1) and d_3 = (6, 1+2\sqrt{3}). In that case, d_1 and d_2 have a similarity of -(10-\epsilon), while merging all three gives a similarity of about -8.46, which again shows an inversion. So one potential problem with this approach is that it is sensitive to the distance between d_1 and d_2: it works fine when d_1 and d_2 are close and fails when the separation between them is large. However, I haven't seen any such approach in the literature.

Seth

Has there been work on an HAC algorithm that uses a combination of the methods described?

6 Feb, Correlation clustering

John Meier

In the correlation clustering paper, the authors prove that although the optimal result for either minimizing disagreements or maximizing agreements is the same, the approximation bounds are different (constant factor for minimizing, PTAS for maximizing). Is there any intuitive reason for why this is the case?
(Jags) Though the optimal clustering is the same whether we minimize disagreements or maximize agreements, the two objective functions behave differently with respect to how well they can be approximated.
(Suresh) Think of it this way: agreements and disagreements sum to a constant (the total number of pairs), so when one is large, the other is small. Intuitively, approximating a large value is easier than approximating a small value.

Adam

Given an arbitrary data set (a graph, in this case, I suppose), there is some non-empty set of optimal clusterings of the data with respect to the "minimize intra-cluster '−' edges and inter-cluster '+' edges" objective function. The paper proves that its algorithm can find a clustering that is a 3-approximation with respect to that objective function, right? Well, I want to ask some questions about the size (in terms of the number of clusters) of the optimal clusterings as well as of the clusterings found by the approximation algorithm:
  • Can we say anything about the number of clusters in each optimal clustering? (Are they all necessarily of the same size? Can they differ arbitrarily in size?)
(Jags) From the number of points alone we cannot say anything about the number of clusters or the size of each cluster. We can easily construct both extremes from a set of n points: 1) declaring all the points similar to each other yields one big cluster containing every node, while 2) declaring all the nodes dissimilar yields n singleton clusters. From the graph structure, however, we may be able to say something (by solving the clustering problem :( ).
  • Does it make sense to analyze the algorithm in terms of its approximation relative to the number of clusters it finds as compared to the number of clusters in some optimal solution?
(Jags) The PTAS presented in the paper does start with the same number of clusters as the optimal solution. The whole analysis is based on the fact that each of the sub-clusters we identify is approximately similar to the corresponding optimal cluster.

Piyush

Can we view correlation clustering as a special case of spectral clustering which works by looking at the spectrum of the pairwise similarity matrix of the data? In other words, if we just look at the spectrum of the +1/-1 valued "similarity matrix" in the case of correlation clustering, can we hope to discover something useful?
(Jags) Though from a mathematical perspective it (the problem) may seem like a special case, the two approaches actually operate in completely different scenarios. In correlation clustering, the similar/dissimilar labels indicate affinity or rivalry between objects, and the aim is to find something like communities with different opinions, whereas in spectral clustering all the objects are similar in kind but vary in their degree of similarity.
I was comparing it to agglomerative clustering when I previously posted my comment. I was wrong: yes, it is a special case.

Thanh Nguyen

What are the positive and negative points of correlation clustering, and when should we use it?
How does correlation clustering handle outliers?
(Jags) Outliers are a kind of noise in the labellings. If the noise is independent and occurs with probability < 1/2, then the algorithm w.h.p. correctly clusters all large clusters and introduces O(n^{3/2}) additional mistakes.
How does correlation clustering compare to other graph-partitioning methods (like NCut, which also tries to minimize the disassociation between groups and maximize the association within groups)?
You can ignore this question because it is similar to Piyush's.

Parasaran Raman

The correlation clustering paper discusses weighted agreements/disagreements and hints at the failure of the approximation algorithms there. What exactly is the problem with the weighted case? Isn't the simple case of correlation clustering the same as assigning all weights the value 1?
(Jags) In the weighted case, the objective function is the sum of the similarity weights within clusters plus the sum of the weights of the negative edges between clusters. This value is not necessarily the same as the number of agreements. We can indeed use the same approximation algorithm and achieve O(n^2) new mistakes, but this does not give the same approximation bounds because the optimal value is not the same.
A perfect clustering happens when there is total similarity and no dissimilarity, so the (closest possible to) perfect clustering should automatically decide the number of clusters. Can we arrive at different clusterings (with different numbers of clusters) that are all equally close to perfect?
(Jags) A perfect clustering is possible even if you have dissimilarity: as long as there are no "erroneous triangles", a perfect clustering exists.
I guess your next question is about uniqueness. Given the graph, the optimal clustering (defined by the number-of-agreements objective) need not be unique. Consider an erroneous triangle (a, b, c) in which the edges (a,b) and (a,c) are similar and (b,c) is dissimilar. With this setting, the clusterings ({a,b,c}), ({a,b},{c}) and ({a,c},{b}) are all optimal, each with 2 agreements. So it is possible to arrive at different clusterings.
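A small brute-force check of this triangle example (a throwaway Python sketch of my own, not code from the paper):

# Erroneous triangle: (a,b) and (a,c) are "+", (b,c) is "-".
labels = {("a", "b"): "+", ("a", "c"): "+", ("b", "c"): "-"}

def agreements(partition):
    count = 0
    for (u, v), sign in labels.items():
        same = partition[u] == partition[v]
        if (sign == "+" and same) or (sign == "-" and not same):
            count += 1
    return count

candidates = {
    "{a,b,c}":     {"a": 0, "b": 0, "c": 0},
    "{a,b},{c}":   {"a": 0, "b": 0, "c": 1},
    "{a,c},{b}":   {"a": 0, "b": 1, "c": 0},
    "{a},{b,c}":   {"a": 0, "b": 1, "c": 1},
    "{a},{b},{c}": {"a": 0, "b": 1, "c": 2},
}
for name, p in candidates.items():
    print(name, agreements(p))
# The first three clusterings all achieve 2 agreements; the remaining two do
# worse, so the optimum is indeed not unique.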

Raj Varma

Does the correlation function have any role in deciding what 'OPT' is? If so, how good is the actual 'OPT' that we compare the approximation algorithms against?
(Jags) Yes, the correlation function determines the structure of the graph and hence the optimal clustering as well.

Jiarong

Does the cost function in the weighted-edge case act as a normalization to [0,1]? What's the intuition behind that?

Pravin

As the problem here is an optimization problem (maximizing agreements or minimizing disagreements), do you think a robust search-optimization technique like particle swarm optimization (PSO) could be used to improve the performance of the cluster divisions?

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4631230

In the above paper, correlation clustering has been implemented using a genetic algorithm (GA).

13 Feb, Spectral Clustering

John Meier

The formulation of the similarity matrix and the different types of graph Laplacians makes sense to me, but I don't quite see why the algorithms call for running k-means on the "rows" of a matrix consisting of the first k eigenvectors of L as columns. Is it correct to think of a row in the (n by k) eigenvector matrix as the projection of a data point into the subspace formed by the first k eigenvectors of L?
(Nathan) Yup, this sounds right.
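To make that step concrete, here is a hedged sketch (unnormalized Laplacian, scikit-learn for the final k-means; the function name is mine) of exactly that pipeline: the rows of the n-by-k eigenvector matrix are the embedded points that k-means then clusters.

# Build an unnormalized graph Laplacian, take its first k eigenvectors as
# columns, and run k-means on the rows of that n x k matrix.
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(W, k):
    """W: symmetric (n x n) similarity matrix, k: number of clusters."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                 # unnormalized Laplacian
    _, eigvecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    U = eigvecs[:, :k]                        # first k eigenvectors as columns
    # Each row of U is the spectral embedding of one data point.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)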
There are lots of k's being thrown around (consider using the k-nearest-neighbor similarity matrix, finding the first k eigenvectors of the resulting graph Laplacian, and then running k-means). It seems that the number of eigenvectors we use is tied strongly to the number of clusters we might expect to find with k-means (since disconnected components will each have an eigenvector with an eigenvalue of 0), and the author describes some heuristics for choosing this k (looking for the first k eigenvalues before a "jump" to value k+1). However, I couldn't find any suggestion on how to set k for the k-nearest-neighbor similarity matrix. Do you think this could be set as a function of the data, like the number of points, average density, etc.? Or would this have to be something tuned empirically?
(Nathan) Most likely, the k used in creating the knn similarity graph is going to be data dependent, so you will probably choose k based on some assumptions and through trial and error arrive at a solution that works. Choosing the right "k" is a problem throughout clustering.
Are there any other useful graph Laplacians besides the three described in the von Luxburg paper?
(Nathan) My guess is yes; there is an entire field of graph theory devoted to these topics, and I'm sure they have applications beyond clustering. See Here, Here and Here.

Thanh Nguyen

There are two ways to cut a graph: a 2-way cut (using the second-smallest eigenvector) or a k-way cut (using the k smallest eigenvectors). There are a few issues with the k-way cut: (1) high approximation error from rounding the real-valued solution back to a discrete one (this is the relaxation we make for the indicator vectors so that we can get an approximate solution via the minimum property of the Rayleigh quotient [1]). The error accumulates with every eigenvector taken, and all eigenvectors have to satisfy a global mutual orthogonality constraint, so solutions based on higher eigenvectors become unreliable. The 2-way cut, by contrast, restarts the partitioning problem on each subgraph individually. (2) The k-way cut can exacerbate oversegmentation (because of the oscillatory nature of the continuous eigenvectors). In which cases is the k-way cut better than the 2-way cut?
Both have trade-offs. The 2-way cut is computationally wasteful (only the second eigenvector is used), while the k-way cut computes the eigenvectors only once. Secondly, when the clusters are not well pronounced, the 2-way cut may stop when there is no clear gap. If we roughly know the number of clusters, the k-way cut overcomes this situation by simply making k continuous cuts (no attempt is made to identify and exclude oscillatory eigenvectors). And in the case of oversegmentation, a post-processing step can be applied to merge similar segments.

Adam

Do you see any insightful comparisons between the way eigenvectors are used to reduce data dimensionality in Principal Component Analysis (PCA) and the way they are used in spectral clustering?
(Nathan) See this.

Piyush

All spectral clustering algorithms use a similarity matrix, which seems to be critical to the quality of the resulting clustering. Computing the similarity matrix, however, requires some tunable parameters (ε for the ε-neighborhood, k for k-nearest neighbors, etc.). Do there exist principled ways of choosing these parameters, or are they usually chosen manually?
(Nathan) It seems from this tutorial that they are chosen manually. Background knowledge of your data sets may help in finding a decent range of values.
Indeed, the spectral clustering algorithms seem easy to implement. But what about scalability? All these algorithms have to solve an eigenvalue problem, which I think gets more and more expensive as the data size grows. What are some of the common techniques used to speed things up?
(Nathan) I believe some of these techniques would scale very well. The biggest problem I can foresee is if someone chose the fully connected similarity-graph representation: since this creates a dense matrix, many algorithms for finding the spectrum may not scale.

Jags

The facts in Proposition 1 rely on the fact that the sum of the weights from a vertex to all its neighbors adds up to the degree of the vertex. Should the similarity matrix always respect this?
(Thanh) From a normal adjacency matrix, we can use the ε-neighborhood, k-nearest-neighbor, mutual k-nearest-neighbor, or fully connected construction to build the similarity graph. The graph Laplacian should always take the vertex degrees into account (as we can see in the definitions of L, L_sym, and L_rw) so that it has the "nice" properties, such as positive semi-definiteness and n non-negative real eigenvalues (as in Propositions 1-4), that we really want to exploit in the spectral clustering algorithms.
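For reference, a small sketch of my own writing out those three Laplacians for a symmetric similarity matrix W with positive degrees (an illustration, not library code):

# The three graph Laplacians from the tutorial, for symmetric W with d_i > 0.
import numpy as np

def laplacians(W):
    d = W.sum(axis=1)                      # vertex degrees
    D = np.diag(d)
    L = D - W                              # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt    # symmetric normalized Laplacian
    L_rw = np.diag(1.0 / d) @ L            # random-walk normalized Laplacian
    return L, L_sym, L_rw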

27 Feb, Model based clustering using EM

Adam

  • A spherical Gaussian is one that has a covariance matrix equal to the identity matrix times some scalar, right?
(Piyush) Yup, right!
  • The paper doesn't present any guarantees for using EM on mixtures of non-spherical Gaussians (I think) and maybe the bounds that they prove would completely fall apart if you allowed the Gaussian clusters to have arbitrary covariance matrices, but can you see any room for adapting the proof to work for clusters from non-spherical Gaussians so long as they are "close enough" to spherical (ie. some looser restriction on the covariance matrix)?
(Piyush) For a non-spherical Gaussian, the radius is defined as \sqrt{\mathrm{trace}(\Sigma)} (which becomes \sigma\sqrt{n} in the spherical case), so we can define separation for the non-spherical case as well. Yes, the guarantees provided in the paper assume spherical Gaussians. Although I don't have a clear sense of whether the guarantees in this paper would admit a natural extension to the non-spherical case, there has been some recent work on more general cases as well. See this paper, which has some results for non-spherical Gaussians. For other separation-based results for more general mixture distributions, see this paper.
  • What sort of restriction on the covariance matrix do you think would make sense for a definition of "close enough"?
(Piyush) Although the definition depends on the covariances of the Gaussians, note that the closeness depends on the distance between the *means* of the Gaussians, and you want that distance to be large to get a good clustering. What matters even more is having as little overlap as possible between the *volumes* of the Gaussians, and in reasonably high dimensions the overlap stays small even if the centers are very close. As an example (from the paper), even if two spheres in some high dimension have centers that are very close (say (1/100)-separated), there is still only a very tiny overlap between their volumes.

John Meier

  • There were some brief mentions of applications of the EM technique during the seminar (e.g., filling in corrupted survey data, identifying a language by its vowel sounds), but I'm still a bit unclear on what kinds of datasets are well suited to this kind of clustering. Are there any other examples of domains that meet the dimensionality and distribution requirements for running EM?
(Piyush) The most important distinction between EM-based clustering and other approaches is that it is probabilistic in nature. That means it also gives us a measure of confidence in the clustering obtained (recall that the cluster-assignment vector \mathbf{z}_n consists of values between 0 and 1, with each value \mathbf{z}_{n,k} giving the probability that point n belongs to cluster k). This is in contrast with non-probabilistic approaches, say k-means, which only give a hard assignment: although we do get a clustering in the end, we can't say how confident we are in it. In addition, as we discussed, the EM framework gives us a principled way to deal with missing observations. Think of a gene-expression dataset where some of the features of a sample may be missing or corrupted due to measurement error. The EM framework allows us to treat these missing values as latent variables (just like \mathbf{z}_n) which can be estimated in the E step. As for your question about distributions, the GMM we discussed in class assumes the data is generated from a mixture of Gaussian distributions, but the EM framework is much more general and can also be applied to mixtures of other distributions (or even to problems other than clustering).
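As a hedged illustration of where those soft assignments come from, here is a minimal E-step sketch (SciPy assumed; pis, mus and Sigmas stand for hypothetical current parameter estimates) that computes \mathbf{z}_{n,k} as the posterior probability of point n under component k:

# E step: posterior "responsibilities" z_{n,k} for each point and component.
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigmas):
    """X: (n, d) data; pis: (K,) mixing weights; mus: (K, d); Sigmas: (K, d, d)."""
    K = len(pis)
    resp = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])                                    # unnormalized responsibilities, (n, K)
    resp /= resp.sum(axis=1, keepdims=True)
    return resp                           # each row sums to 1: soft assignments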
  • The discussion of "EM" vs. "ME" was interesting, but a little over my head. Is there a good overview of the "ME" perspective anywhere?
(Piyush) This paper describes the maximization-expectation algorithm.

6 Mar, Choosing K

Pravin

Why 2k in AIC?

In the model-fitting problem, computing the exact divergence between the observed model and a fitted model would require the true density of the observed model, which is not available. Akaike noted that -2\ln(L) serves as a biased estimate of this divergence (call it D). He arrived at the following bias adjustment:

bias(T) = E[D - \{-2\ln(L)\}]

and found that this value converges to 2k as T tends to infinity.

The exact derivation is provided in the following paper:

Joseph E. Cavanaugh, "Criteria for Linear Model Selection Based on Kullback's Symmetric Divergence", Department of Biostatistics, The University of Iowa.
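A hedged sketch of putting AIC = 2k - 2\ln(L) (with k the number of free parameters) to work for choosing the number of mixture components, assuming scikit-learn, whose GaussianMixture exposes .aic() directly; the data here is a placeholder:

# Fit mixtures with different numbers of components and pick the lowest AIC.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

aics = {}
for k in range(1, 8):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    aics[k] = gm.aic(X)          # lower is better

best_k = min(aics, key=aics.get)
print(aics, best_k)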

27 Mar, BIRCH and CURE

Thanh

In BIRCH, a clustering feature (CF) summarizes information about the objects in a sub-cluster, and the CF tree stores these clustering features hierarchically. Is the algorithm sensitive to the CF-tree parameters, like the threshold diameter and the branching factor?

Parasaran

Since BIRCH makes local decisions without scanning all the data, it seems that BIRCH will face the same problem that HAC does (points that do not really belong to the same cluster being grouped together). How do they achieve good efficiency then? Is there some way to mitigate this problem?

John Meier

In BIRCH, the authors state that it can easily be proved that all of the distance metrics can be computed from the three values in the CF triples, but how, for example, can the linear sum and the sum of squares of the points in a cluster be used to get the cluster diameter?
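A sketch of the standard expansion, assuming the CF triple stores N (the number of points), the linear sum LS = \sum_i x_i, and the scalar sum of squares SS = \sum_i \|x_i\|^2, and that the diameter D is defined as the root-mean-square pairwise distance:

D^2 = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} \|x_i - x_j\|^2 = \frac{2N \cdot SS - 2\,\|LS\|^2}{N(N-1)},

since \sum_{i,j} \|x_i - x_j\|^2 = \sum_{i,j} \left( \|x_i\|^2 + \|x_j\|^2 - 2\,x_i^{\top} x_j \right) = 2N \cdot SS - 2\,\|LS\|^2. The radius and the centroid-based distances come out of the same three quantities in a similar way.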

10 Apr, Comparing clusterings

John Meier

The authors suggest that the VI measure is not limited to comparisons between clustering spaces over data sets of the same size, and seem to suggest (in the "3 generative sample comparison to a gold standard" example on page 17) that the measure makes sense when used on clusterings of different points. Could you clarify what they mean? It seems like the calculation of the VI value depends on the two clusterings being built on the same set of points, but apparently this need not be the case?
