In many problems in data mining and machine learning, data items that need to be clustered or classified are not points in a high-dimensional space, but are distributions (points on a high dimensional simplex). For distributions, natural measures of distance are not the $\ell_p$ norms and variants, but information-theoretic measures like the Kullback-Leibler distance, the Hellinger distance, and others. Efficient estimation of these distances is a key component in algorithms for manipulating distributions. Thus, sublinear resource constraints, either in time (property testing) or space (streaming) are crucial.
In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. We start by resolving two open questions regarding property testing of distributions. Firstly, we show a tight bound for estimating bounded, symmetric \emph{f-divergences} between distributions in a general property testing (sublinear time) framework (the so-called \emph{combined oracle model}). This yields optimal algorithms for estimating such well known distances as the Jensen-Shannon divergence and the Hellinger distance. Secondly, we close a $(\log n)/H$ gap between upper and lower bounds for estimating entropy $H$ in this model. We provide an optimal algorithm over all values of the entropy, and for small entropy the improvement is significant. In a stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part of the algorithm is an interesting use of $F_0$ (the number of distinct elements) estimation algorithms; we also provide other results along the space/time/approximation tradeoff curve.
Our results have interesting structural implications that connect sublinear time- and space-constrained algorithms. The mediating model is the random order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Patterson. We show that any property testing algorithm in the combined oracle model for permutation invariant functions can be simulated in the random order model in a single pass. This addresses a question raised by Feigenbaum~\etal regarding the relationship between property testing and stream algorithms. Further we give a polylog-space PTAS for the estimating the entropy of a one pass random order stream. This bound cannot be achieved in the combined oracle model.