Filed under: Papers

[author]Suresh Venkatasubramanian and Qiushi Wang[/author]

*ALENEX11: Workshop on Algorithms Engineering and Experimentation (in conjunction with SODA 2011)*

**Abstract:**

The Johnson-Lindenstrauss Lemma states that a set of $n$ points may be embedded in a space of dimension $O(\log n/\eps^2)$ while preserving all pairwise distances within a factor of $(1+\epsilon)$ with high probability. It has inspired a number of proofs that extend the result, simplify it, and improve the efficiency of computing the resulting embedding. The lemma is a critical tool in the realm of dimensionality reduction and high dimensional approximate computational geometry. It is also employed for data mining in domains that analyze intrinsically high dimensional objects such as images and text. However, while algorithms for performing the dimensionality reduction have become increasingly sophisticated, there is little understanding of the behavior of these embeddings in practice. In this paper, we present the first comprehensive study of the empirical behavior of algorithms for dimensionality reduction based on the JL Lemma.

Our study answers a number of important questions about the quality of the embeddings and the performance of algorithms used to compute them. Among our key results:

- Determining a likely range for the big-Oh constant in practice for the dimension of the target space, and demonstrating the accuracy of the predicted bounds.
- Finding `best in class’ algorithms over wide ranges of data size and source dimensionality, and showing that these depend heavily on parameters of the data as well its sparsity.
- Developing the best implementation for each method, making use of non-standard optimized codes for key subroutines.
- Identifying critical computational bottlenecks that can spur further theoretical study of efficient algorithms.

Links: PDF

Tags: CCF 0953066

4 Comments so farYou should compare to PCA, no?

Comment by Sariel 10.06.10 @ 9:15 pmWe could I guess. There are two reasons not to though: firstly, PCA doesn’t actually preserve distances, and secondly, we have a fair bit of evidence (from the MDS paper) that PCA does a terrible job at distances (which motivated the iterative improvement schemes we use there).

But if you’re think of the larger picture of ‘dimensionality reduction’, then yes, PCA is a player. Maybe we’ll add something in before we arxiv it.

Comment by suresh 10.06.10 @ 9:40 pmPCA should be optimal (among all linear projections) if you minimize \sum (d(y_i, y_j)^2 – \delta(x_i, x_j)^2)^2, the squared distances. What is the intuition why it behaves badly for the distances itself?

Comment by Sam 11.02.10 @ 10:39 amSam, that statement is not true. PCA is not optimal for the distance function you mention. Moreover, JL results are for minimizing the max ratio, not the average difference.

Comment by suresh 11.02.10 @ 11:02 amLeave a commentLine and paragraph breaks automatic, e-mail address never displayed, HTML allowed:

`<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>`