Large boolean matrices are a basic representational
unit in a variety of applications, with some notable examples being
interactive visualization systems, mining large graph structures, and
association rule mining. Designing space and time efficient
scalable storage and query mechanisms for such large matrices is a
challenging problem.
We present a lossless compression strategy to store and access such
large matrices efficiently on disk. Our approach is
based on viewing the columns of the matrix as points in a very high
dimensional Hamming space, and then formulating an appropriate
optimization problem that reduces to solving an instance of the Traveling
Salesman Problem on this space.
Finding good solutions to large TSP's in high dimensional Hamming spaces
is itself a challenging and little-explored problem -- we
cannot readily exploit geometry to avoid the
need to examine all N2 inter-city distances and instances can be
too large for standard TSP codes to run in main memory.
Our multi-faceted approach adapts classical TSP heuristics by means
of instance-partitioning and sampling, and may be of independent interest.
For instances derived from interactive visualization and telephone
call data we obtain significant improvement in access time over
standard techniques, and for the visualization application we also
make significant improvements in compression.