Talk:Algorithms Seminar/Spring09
23 Jan, k-means
Nathan
- What does it mean for an algorithm to be O(X)-competitive?
- An approximation algorithm A that seeks a solution minimizing some objective φ is O(X)-competitive if φ_A ≤ O(X)·φ_OPT, where φ_A is the value of the objective function obtained by A and φ_OPT is the value of the objective function obtained by the optimal (exact) solution.
- What are some good methods for making k-means robust to outliers?
- As Suresh discussed last week, the "mean" measure is inherently sensitive to outliers. Likewise, any algorithm that does a good job of minimizing the sum of squared distances will also be sensitive to outliers. In other words, it seems to me that any attempt to make the k-means algorithm less sensitive to outliers is bound to also make it a worse approximation to the optimal k-means solution. So, if you don't want to be sensitive to outliers, I would look into k-medians instead of k-means.
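A tiny numeric illustration of the point above (hypothetical data, assuming NumPy is available): a single outlier drags the mean, and hence the sum-of-squared-distances objective, far from the bulk of the points, while the median stays put.

    import numpy as np

    # Toy 1-D data: a tight cluster plus one outlier.
    points = np.array([1.0, 1.1, 0.9, 1.05, 100.0])

    mean = points.mean()        # pulled toward the outlier (about 20.8)
    median = np.median(points)  # stays with the bulk of the data (1.05)

    # The k=1 k-means objective (sum of squared distances) is minimized by the
    # mean; the k-medians-style objective (sum of distances) by the median.
    print("mean =", mean, " median =", median)
    print("SSE about the mean:", np.sum((points - mean) ** 2))
    print("sum of |distances| about the median:", np.sum(np.abs(points - median)))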
Piyush
- My feeling is that the k-means algorithm might fare poorly if there are substantial differences between the original cluster sizes (or in general between the cluster distributions). Say, for example, the data has one very large and two reasonably small clusters (and assume they all are spherical-shaped clusters). The way k-means partitioning works (i.e. minimizing sum of square distances from the cluster centers) might make the algorithm "split" the large cluster and assign some of its points to the smaller clusters. If this does happen, this would clearly not be the correct clustering. Are there ways to circumvent such a problem in k-means?
- You have certainly brought up an important limitation of k-means. Again, I think that the problem is inherent in the choice of objective function rather than in the k-means algorithm's ability to approximate the "optimal" k-means solution. On the other hand, if I'm not mistaken, training a mixture of Gaussians using EM, with the restriction that all Gaussians share the same spherical covariance matrix, turns out to be very similar to the k-means algorithm. EM, however, needn't impose that constraint, so if you have reason to suppose that your clusters vary in size and shape, you might consider using EM and allowing each cluster to have a different covariance matrix (if you can still assume some flavor of Gaussian for each).
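As a rough, hedged illustration of that suggestion (the data and parameters here are made up, and it assumes NumPy and scikit-learn are available), one could compare k-means against an EM-trained Gaussian mixture in which each component gets its own full covariance matrix:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # One large spherical cluster and two small ones, as in the question above.
    big = rng.normal(loc=[0.0, 0.0], scale=3.0, size=(500, 2))
    small1 = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
    small2 = rng.normal(loc=[-10.0, 10.0], scale=0.5, size=(50, 2))
    X = np.vstack([big, small1, small2])

    # k-means minimizes the sum of squared distances, so it may split the big cluster.
    km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # EM on a Gaussian mixture with a separate full covariance per component
    # lets clusters of very different sizes and shapes coexist.
    gm_labels = GaussianMixture(n_components=3, covariance_type="full",
                                random_state=0).fit_predict(X)

    print("k-means cluster sizes:", np.bincount(km_labels))
    print("GMM/EM cluster sizes: ", np.bincount(gm_labels))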
John Meier
- What is a hypersphere/hyperplane?
- A hypersphere is just a generalization of a sphere to d dimensions (i.e. a hypersphere in 2 dimensions is a circle, in 3 dimensions it is a sphere).
- A hyperplane is likewise a generalization of a plane: in d-dimensional space a hyperplane is a (d-1)-dimensional flat, so in 2 dimensions it is a line and in 3 dimensions it is an ordinary plane.
- How does superpolynomial time differ from non-deterministic polynomial time?
- As was pointed out during the seminar, the class NP applies to problems for which there exists an algorithm (whether or not we know what the algorithm is) that can verify a solution to the problem in polynomial time. Note that stating that a problem is in NP is a declaration of what can be done.
- Stating that an algorithm takes super-polynomial time means that its running time is not bounded by any polynomial in the input size: for every polynomial p(n), there are arbitrarily large inputs on which the algorithm takes more than p(n) steps. Note that this is a statement of what the algorithm cannot do.
- It seems like k-means++ offers several improvements over traditional k-means without requiring any significant tradeoffs. Has k-means++ become the norm in applications that have been using the original algorithm?
- As we discussed in class, the paper is still relatively new. Also, in the paper, it was only tested on a small number of data sets. Maybe my Wikipedia article on k-means++ will help to improve its popularity. :-)
- Are there any methods for adaptively changing k, the number of clusters, during a run of the algorithm? (e.g., increment k and try again if the maximum point-to-centroid distance in any cluster is greater than a threshold)
- There certainly are other clustering algorithms that do not require that k be specified in advance (an algorithm named "Cobweb" comes to mind).
- In class, you also asked a related question: What do you do if a cluster has no points in it?
- I could imagine adding a check at each iteration so that if a cluster ever loses all of its data points, you re-seed the empty cluster with a randomly chosen data point from the input (possibly using the same initialization method that was used originally, maybe k-means++); see the sketch after these notes. This might do strange things, though, if you really did pick a k that was too big.
- Another possibility, if you are sure that you have the right k, would be to leave the algorithm as is (assuming there is some randomness in the initialization or in the optimization process) and just re-run the algorithm until you get k legitimate clusters.
- If you aren't sure about k, you could decrement k and try again at the lower k. These are just some thoughts.
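Here is a minimal sketch of that empty-cluster check, wrapped around plain Lloyd's iterations in NumPy; the function name and details are hypothetical, not from the paper or the seminar:

    import numpy as np

    def lloyd_with_reseeding(X, k, n_iter=100, rng=None):
        # Plain Lloyd's iterations; empty clusters are re-seeded from the data.
        # This is only a sketch of the idea discussed above (no convergence
        # test, no k-means++ seeding).
        if rng is None:
            rng = np.random.default_rng()
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(n_iter):
            # Assign each point to its nearest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                members = X[labels == j]
                if len(members) == 0:
                    # Empty cluster: re-seed it with a randomly chosen data point.
                    centers[j] = X[rng.integers(len(X))]
                else:
                    centers[j] = members.mean(axis=0)
        return centers, labels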
Links to find authors' slides
Wikipedia article
- Since this is my first article, Wikipedia recommended that I start it as a sub page to my user page and then move it over to the real Wikipedia when I'm satisfied with it. Here is what I have so far:
- (John Meier) The Wikipedia article looks like a good start. The only significant thing I see missing is the actual description of how k-means++ specifies the initial cluster centers (and possibly some of the intuition behind the proof).
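For what it's worth, here is a hedged sketch of the seeding step the paper describes (the D^2 weighting): the first center is chosen uniformly at random, and each subsequent center is chosen with probability proportional to its squared distance to the nearest center picked so far. The function name is made up and the code is only illustrative.

    import numpy as np

    def kmeanspp_seed(X, k, rng=None):
        # D^2 seeding as described in the k-means++ paper (sketch only).
        if rng is None:
            rng = np.random.default_rng()
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Squared distance from each point to its nearest chosen center.
            d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
            probs = d2 / d2.sum()
            centers.append(X[rng.choice(len(X), p=probs)])
        return np.array(centers)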
30 Jan, Hierarchical methods
John Meier
- I'm a bit unclear on a few points related to divisive clustering as described in the IR book chapter:
- 1) How exactly is a "flat" clustering algorithm (e.g. k-means) applied at each step? What k would be used at each level of the hierarchy? Also, it seems like running k-means to completion at the top level (all points in one cluster, or in other words unclustered) would give you a finished clustering, so what would be accomplished by running it recursively?
- 2) The book chapter states that top-down clustering is linear in the number of points and clusters, but it seems like the run time should be dependent on the complexity of the flat algorithm applied at each step, correct?
- Are there any other useful heuristics for bottom-up HAC besides the four mentioned (single-link, complete-link, group-average, centroid)? Would using a blend of some of these heuristics produce interesting behavior (for example, choose the next merge based on a weighted blend of the single-link/complete-link criteria)?
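As a toy take on the blending question, here is a hypothetical, purely illustrative sketch (NumPy only, naive cubic-time loops) of a bottom-up pass whose merge criterion is a weighted blend of the single-link and complete-link distances:

    import numpy as np

    def blended_hac(X, alpha=0.5, n_clusters=2):
        # alpha = 1 gives pure single-link, alpha = 0 pure complete-link.
        clusters = [[i] for i in range(len(X))]
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances

        def blend(a, b):
            sub = D[np.ix_(a, b)]
            return alpha * sub.min() + (1 - alpha) * sub.max()

        while len(clusters) > n_clusters:
            # Merge the pair of clusters with the smallest blended distance.
            i, j = min(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda ij: blend(clusters[ij[0]], clusters[ij[1]]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return clusters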
John Moeller
- The Cluster Analysis Wikipedia article talks about using "The sum of all intra-cluster variance" as a distance criterion. How would calculating information local to one cluster be helpful? Do they mean "the sum of all inter-cluster variance"? I'm not sure what that would entail though.
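One possible reading of that phrase is a Ward-style criterion: merge the pair of clusters whose union causes the smallest increase in the total within-cluster (intra-cluster) sum of squares. A minimal sketch under that assumption (made-up names and data):

    import numpy as np

    def merge_cost(A, B):
        # Increase in total within-cluster sum of squares if clusters A and B
        # (arrays of points) were merged -- one way to read "the sum of all
        # intra-cluster variance" as a merge criterion (Ward-style).
        def wcss(C):
            return np.sum((C - C.mean(axis=0)) ** 2)
        return wcss(np.vstack([A, B])) - (wcss(A) + wcss(B))

    # Merging two nearby tight clusters is cheap; merging distant ones is not.
    A = np.array([[0.0, 0.0], [0.1, 0.0]])
    B = np.array([[0.2, 0.0], [0.3, 0.0]])
    C = np.array([[10.0, 0.0], [10.1, 0.0]])
    print(merge_cost(A, B))  # small increase
    print(merge_cost(A, C))  # large increase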