Talk:Algorithms Seminar/Spring09

23 Jan, k-means

Nathan

What does it mean for an algorithm to be O(X)-competitive?
An approximation algorithm A that seeks a solution minimizing some objective \phi is O(X)-competitive if \phi_A \le O(X) \cdot \phi_{OPT}, where \phi_A is the objective value obtained by A and \phi_{OPT} is the objective value obtained by the exact (optimal) algorithm.
What are some good methods to help make Kmeans robust regarding outliers?
As Suresh discussed last week, the "mean" measure is inherently sensitive to outliers. Likewise, any algorithm that does a good job of minimizing the sum of squared distances will also be sensitive to outliers. In other words, it seems to me that any attempt to make the k-means algorithm less sensitive to outliers is bound to also make the algorithm a worse approximation to the optimal k-means solution. So, if you don't want to be sensitive to outliers, I would look into k-medians instead of k-means.
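As a minimal sketch (mine, not from the seminar) of why the objective itself is the culprit: a single far-away point drags the mean, and hence the best single k-means center, while the median barely moves.

 import numpy as np

 cluster = np.array([1.0, 1.1, 0.9, 1.2, 0.8])
 with_outlier = np.append(cluster, 100.0)

 # The optimal 1-means center is the mean; the optimal 1-median center is the median.
 print("1-means center:", cluster.mean(), "->", with_outlier.mean())           # jumps toward the outlier
 print("1-median center:", np.median(cluster), "->", np.median(with_outlier))  # barely moves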

Piyush

My feeling is that the k-means algorithm might fare poorly if there are substantial differences between the original cluster sizes (or, in general, between the cluster distributions). Say, for example, the data has one very large and two reasonably small clusters (and assume they are all spherical). The way k-means partitioning works (i.e., minimizing the sum of squared distances from the cluster centers) might make the algorithm "split" the large cluster and assign some of its points to the smaller clusters. If this does happen, this would clearly not be the correct clustering. Are there ways to circumvent such a problem in k-means?
You have certainly brought up an important limitation of k-means. Again, I think that the problem is inherent in the choice of objective function rather than in the k-means algorithm's ability to approximate the "optimal" k-means solution. On the other hand, if I'm not mistaken, training a mixture model of Gaussians using EM with the restriction that all Gaussians are forced to have the same covariance matrix turns out to be very similar to the k-means algorithm. EM, however, needn't impose the single-covariance-matrix constraint, so if you have reason to suppose that your clusters vary in size and shape, you might consider using EM and allowing each cluster to have a different covariance matrix (if you can still assume some flavor of Gaussian for each).
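A sketch of that comparison on synthetic data (my own illustration, assuming scikit-learn is available): k-means tends to hand some of the large cluster's boundary points to the small clusters' centers, while a full-covariance Gaussian mixture can fit each cluster's own spread.

 import numpy as np
 from sklearn.cluster import KMeans
 from sklearn.mixture import GaussianMixture

 rng = np.random.default_rng(0)
 big = rng.normal([0, 0], 2.5, size=(600, 2))     # one large, spread-out cluster
 small1 = rng.normal([6, 0], 0.3, size=(30, 2))   # two small, tight clusters
 small2 = rng.normal([-6, 0], 0.3, size=(30, 2))
 X = np.vstack([big, small1, small2])

 km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
 gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)

 # k-means typically grabs part of the big cluster for the small centers;
 # the mixture with per-component covariances usually keeps it intact.
 print("k-means cluster sizes:", np.bincount(km.labels_))
 print("GMM cluster sizes:    ", np.bincount(gmm.predict(X)))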

John Meier

What is a hypersphere/hyperplane?
A hypersphere is just the generalization of a sphere to d dimensions (e.g., a hypersphere in 2 dimensions is a circle, and in 3 dimensions it is a sphere).
A hyperplane is likewise the generalization of a plane: in d-dimensional space it is a (d-1)-dimensional flat subspace (e.g., in 2 dimensions it is a line, and in 3 dimensions it is a plane).
How does superpolynomial time differ from non-deterministic polynomial time?
As was pointed out during the seminar, the class NP applies to problems for which there exists an algorithm (whether or not we know what the algorithm is) that can verify a solution to the problem in polynomial time. Note that stating that a problem is in NP is a declaration of what can be done.
Stating that an algorithm takes super-polynomial time means that its worst-case running time is not bounded by any polynomial in the input size, i.e., for every polynomial bound there are inputs on which the algorithm exceeds that bound. Note that this is a statement of what the algorithm cannot do.
It seems like k-means++ offers several improvements over traditional k-means without requiring any significant tradeoffs. Has k-means++ become the norm in applications that had been using the original algorithm?
As we discussed in class, the paper is still relatively new. Also, in the paper, it was only tested on a small number of data sets. Maybe my Wikipedia article on k-means++ will help to improve its popularity. :-)
Are there any methods for adaptively changing k, the number of clusters, during a run of the algorithm? (e.g., increment k and try again if the maximum point-to-centroid distance in any cluster is greater than a threshold)
There certainly are other clustering algorithms that do not require that k be specified in advance (an algorithm named "Cobweb" comes to mind).
In class, you also asked a related question: What do you do if a cluster has no points in it?
I could imagine adding into the algorithm a check at each iteration so that if a cluster ever loses all of its data points, you reassign the empty cluster's center to some randomly chosen data point from the input (possibly using the same initialization method that was used originally---maybe k-means++); a sketch of this is given below. This might do strange things, though, if you really did pick a k that was too big.
Another possibility, if you are sure that you have the right k, would be to leave the algorithm as is (assuming there is some randomness in the initialization or in the optimization process) and just re-run the algorithm until you get k legitimate clusters.
If you aren't sure about k, you could decrement k and try again at the lower k. These are just some thoughts.
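Here is a minimal sketch of that re-seeding idea (my own illustration, not from the k-means++ paper): plain Lloyd iterations in which any center that loses all of its points is reset to a randomly chosen data point.

 import numpy as np

 def lloyd_with_reseed(X, k, iters=50, seed=0):
     X = np.asarray(X, dtype=float)
     rng = np.random.default_rng(seed)
     centers = X[rng.choice(len(X), size=k, replace=False)].copy()
     for _ in range(iters):
         # assignment step: nearest center for every point
         d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
         labels = d2.argmin(axis=1)
         # update step, re-seeding any center that lost all of its points
         for j in range(k):
             members = X[labels == j]
             if len(members) == 0:
                 centers[j] = X[rng.integers(len(X))]  # re-seed from the data
             else:
                 centers[j] = members.mean(axis=0)
     return centers, labels

 # toy usage: three well-separated blobs
 X = np.vstack([np.random.default_rng(1).normal(c, 0.3, size=(30, 2))
                for c in ([0, 0], [5, 5], [0, 5])])
 centers, labels = lloyd_with_reseed(X, k=3)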

Links to find authors' slides

http://theory.stanford.edu/~sergei/slides/
http://www.stanford.edu/~darthur/

Wikipedia article

Since this is my first article, Wikipedia recommended that I start it as a sub page to my user page and then move it over to the real Wikipedia when I'm satisfied with it. Here is what I have so far:
k-means++
(John Meier) The Wikipedia article looks like a good start. The only significant thing I see missing is the actual description of how k-means++ chooses the initial cluster centers (and possibly some of the intuition behind the proof); a sketch of the seeding step is included below for reference.
(Suresh) You need a lot more work on the page. Firstly, there's no need to waste time discussing k-means: you should link to the relevant wiki page. There's no detail on the approach itself. Also read Wikipedia:YFA for style guidelines. Look at related pages (like Lloyd's algorithm) for a template on how to structure the page. And above all, it's time to promote the page to an actual wiki page, which will make it visible for others to comment on.
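For reference, a minimal sketch of the seeding step the draft article still needs to describe: the first center is a uniformly random data point, and each subsequent center is a data point chosen with probability proportional to its squared distance from the nearest center picked so far (the D^2 weighting from the paper).

 import numpy as np

 def kmeanspp_seed(X, k, seed=None):
     X = np.asarray(X, dtype=float)
     rng = np.random.default_rng(seed)
     centers = [X[rng.integers(len(X))]]      # first center: uniform at random
     for _ in range(1, k):
         C = np.array(centers)
         # squared distance from each point to its nearest already-chosen center
         d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
         centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # D^2 weighting
     return np.array(centers)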

30 Jan, Hierarchical methods

John Meier

I'm a bit unclear on a few points related to divisive clustering as described in the IR book chapter:
1) How exactly is a "flat" clustering algorithm (e.g. k-means) applied at each step? What k would be used at each level of the hierarchy? Also, it seems like running k-means to completion at the top level (all points in one cluster, or in other words unclustered) would give you a finished clustering, so what would be accomplished by running it recursively?
At each step, we invoke the flat clustering algorithm with a fixed value of k (say, 2).
At each level we can use k=2 and continue recursively until we end up with singleton clusters.
Running k-means at the top level will give two clusters (for k=2). Thus we accomplish a division of the top-level cluster into two smaller clusters, and the recursion then divides each of those further (a sketch of the full recursion is given after question 2 below).
2) The book chapter states that top-down clustering is linear in the number of points and clusters, but it seems like the run time should be dependent on the complexity of the flat algorithm applied at each step, correct?
The flat algorithm has complexity linear in the number of points and clusters. The divisive algorithm requires multiple calls to the flat clustering, and for a fixed number of levels the number of such calls is also fixed. Hence, for a fixed number of levels, the overall complexity of divisive clustering remains linear in the number of points and clusters.
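A sketch of that recursion (my own illustration, assuming scikit-learn's KMeans as the flat algorithm): bisect with k=2 and recurse on each side until the clusters are singletons, recording the hierarchy as nested tuples of point indices.

 import numpy as np
 from sklearn.cluster import KMeans

 def divisive(X, indices=None, min_size=1):
     # top-down clustering: split with a flat k=2 clustering, then recurse
     if indices is None:
         indices = np.arange(len(X))
     if len(indices) <= min_size:
         return list(indices)                 # leaf cluster
     labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
     left, right = indices[labels == 0], indices[labels == 1]
     if len(left) == 0 or len(right) == 0:    # degenerate split: stop recursing
         return list(indices)
     return (divisive(X, left, min_size), divisive(X, right, min_size))

 # e.g. divisive(np.array([[0.0, 0], [0.1, 0], [5, 5], [5.1, 5]])) returns nested pairs of indices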
Are there any other useful heuristics for bottom-up HAC besides the four mentioned (single-link, complete-link, group-average, centroid)? Would using a blend of some of these heuristics produce interesting behavior (for example, choose the next merge based on a weighted blend of the single-link/complete-link criteria)?
Although not exactly equivalent, the group-average criterion can be viewed as somewhat similar to a weighted blend of single and complete linkage.
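For concreteness, a hypothetical blended criterion of the kind asked about above (not from the chapter): score each candidate merge by a weighted mix of the single-link and complete-link similarities, with alpha=1 recovering single-link and alpha=0 recovering complete-link.

 import numpy as np

 def blended_similarity(A, B, alpha=0.5):
     A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
     d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))  # pairwise distances
     single_link = -d.min()     # similarity of the closest pair
     complete_link = -d.max()   # similarity of the farthest pair
     return alpha * single_link + (1 - alpha) * complete_link

Plugging this in as the merge criterion of a standard HAC loop would interpolate between the chaining behavior of single-link and the compact clusters of complete-link, though I don't know what guarantees, if any, it carries.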

John Moeller

The Cluster Analysis Wikipedia article talks about using "The sum of all intra-cluster variance" as a distance criterion. How would calculating information local to one cluster be helpful? Do they mean "the sum of all inter-cluster variance"? I'm not sure what that would entail though.
We might aim to minimize "the sum of all intra-cluster variance". At any point, each of the K current clusters has an intra-cluster variance. The next merge step selects the two clusters whose merge minimizes the overall sum of intra-cluster variances over the merged cluster and the remaining K-2 clusters; a sketch is given below.
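A minimal sketch of that criterion (essentially what Ward's method does): pick the merge that increases the total intra-cluster variance, i.e., the sum of squared distances to each cluster's centroid, by the least.

 import numpy as np

 def intra_variance(C):
     # sum of squared distances from the cluster's points to its centroid
     C = np.asarray(C, dtype=float)
     return ((C - C.mean(axis=0)) ** 2).sum()

 def merge_cost(A, B):
     # increase in the total intra-cluster variance if clusters A and B are merged
     return intra_variance(np.vstack([A, B])) - intra_variance(A) - intra_variance(B)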

Parasaran Raman

Since the biggest problem of hierarchical clustering is that an incorrect assignment made early in the process cannot be corrected, are there any specific modifications we could make to the technique (maybe a multi-pass approach) to overcome this handicap? Moreover, even though hierarchical clustering gives us a nice variety of potential clusterings, how difficult is it to interpret the resulting complex hierarchy, which is often more confusing than meaningful?

Adam

It looks like the main problem with centroid clustering is the non-monotonically decreasing similarity between merged clusters in subsequent mergings [I'm sure there are other problems with it, but the book seemed to focus on that one]. What if, then, we slightly tweak the similarity function so that it penalizes cluster similarity for the size of the clusters being considered for a merge? For example, one naive approach (although I don't guarantee that it is a bad one :-) would be to define the similarity between two clusters as the negative distance between their centroids plus the negative radius of each (and by radius, I mean the distance from the farthest point in the cluster to the centroid, although other penalties, such as the sum of all distances from the points in the cluster to the centroid, could also be used). At the very least, I think that would fix the inversion example given in the book. Adding such a penalty on merging large clusters would seem to put additional pressure on keeping clusters of similar sizes, which could be good or bad, but perhaps adding the negative square root of their radii instead could still solve the inversion problem without such heavy consequences. Anyway, do you see any other problems or any possible advantages of such an approach, or have you already seen it in the literature?
Although I agree that such a penalty scheme might help in keeping the cluster sizes similar, I am not convinced that it will always work. It does work for the example in the book. But suppose we slightly tweak the book's example so that the three points now have coordinates d1(1+\eps,1), d2(11,1) and d3(6,1+2sqrt(3)). In that case, d1 and d2 have a similarity of -(10-\eps), while merging all three has a similarity of about -8.46, which again shows an inversion (a quick numerical check is sketched below). So, one potential problem with this approach is that it is sensitive to the distance between d1 and d2: it works fine when d1 and d2 are close together and fails when the separation between them is large. However, I haven't seen any such approach in the literature.
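A quick numerical check of this counter-example (my own, using the proposed similarity: minus the centroid distance, minus the radius of each cluster):

 import numpy as np

 def radius(C):
     # farthest point-to-centroid distance within a cluster
     C = np.asarray(C, dtype=float)
     return np.linalg.norm(C - C.mean(axis=0), axis=1).max()

 def penalized_sim(A, B):
     # proposed similarity: -(centroid distance) - radius(A) - radius(B)
     A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
     return -np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)) - radius(A) - radius(B)

 eps = 1e-3
 d1, d2, d3 = [1 + eps, 1], [11, 1], [6, 1 + 2 * np.sqrt(3)]
 first = penalized_sim([d1], [d2])       # merging {d1} and {d2}: about -(10 - eps)
 second = penalized_sim([d1, d2], [d3])  # then merging {d1, d2} with {d3}: about -8.46
 print(first, second, "inversion:", second > first)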

Seth

Has there been work on an HAC algorithm that uses a combination of the methods described?