Algorithms Seminar/Spring08




Spring 2008: CS 7936 - The Geometry of Information Spaces

Friday 10:45-12:05 WEB 1460


Summary

A finite dimensional distribution can be viewed as a point in a space, and distances between distributions can be interpreted as distances in this space. Doing this puts a geometry on spaces of distributions.

It turns out that understanding this geometry allows us to realize that inference procedures and algorithms designed for different families of distributions are really instances of a single general procedure, and that many different families of distributions are merely members of a common class. A simple example of this is the class of Bregman divergences: these include well known distance functions like \ell_2^2 and the KL distance, and algorithms designed for these distance measures can actually be generalized to all Bregman divergences.

A more complicated example of this same idea is the Cramer-Rao theorem: as originally stated, it is a way of placing a lower bound on the variance of an unbiased estimator of the parameter of a distribution. However, it can be seen as a measure of the curvature of a statistically defined manifold.

To what extent does the "geometry" of spaces of distributions give us more power in doing data analysis? In this seminar, we will explore such topics.


Talks

Date | Paper(s) | Presenter
11 Jan | Background on information theory, and a primer on the Legendre-Fenchel Transform. Also see Convex Analysis, by Rockafellar | Suresh
Bregman distances
18 Jan | The geometry of Bregman distances I (Sec 2.1-2.3, Sec 3) | Amit
25 Jan | The geometry of Bregman distances II (Sec 4-6 (talk to me)) | Arvind
1 Feb | Bregman distances and exponential families (Sec 4 only) | Piyush
8 Feb | Bregman clustering (Sec 3, 5) | Nathan
15 Feb | Unifying Adaboost and logistic regression via Bregman distances (not beyond Sec 7) |
Other information distances
29 Feb | The importance of f-divergences | Avishek
29 Feb (to be rescheduled) | An axiomatic approach to motivating information distances | Avishek
Statistical Manifolds
28 Mar | Differential Geometry Primer I (Ch 2 of Guy Lebanon's thesis) | Piyush
4 Apr | Exponential families as statistical manifolds, Fisher Information, and the geometry of multinomials and Gaussians (Kass/Vos, Ch 7 (handout)) | Suresh
18 Apr | Application: Hyperplane classifiers on the multinomial manifold | Nathan
25 Apr | The importance of Fisher information: Cencov's theorem |

Paper Summaries

1/11/08: An overview, and duality

Many inference procedures boil down to computing certain distances between distributions. The Kullback-Leibler distance shows up often, as does the Euclidean l^2_2 distance. The total variation distance is used to measure how quickly a process (like a Markov chain) converges to its stationary distribution.

Once we endow a space with a distance measure, we often want to do more: clustering, for example, nearest neighbour searches, and a whole variety of other questions. If the distance under consideration is a metric, then we can usually exploit the triangle inequality and algorithmic methods for metric spaces. But what if it is not? Many information distances are not symmetric, and don't even satisfy a directed triangle inequality.

It turns out that there is an underlying geometry that describes the behaviour of these measures. Endowing a space with geometry means that rather than thinking of points in an abstract space, you think of a smooth space with certain structure, which you can use to understand the behaviour of distance measures.

This is the algorithmic perspective: using geometry to design good algorithms. There's an equally important aspect to this 'geometrization': the use of geometry to explain certain results. We will see examples of this as we go along. For example, two completely different inference procedures will turn out to be identical once we realize that the underlying distance measures being used are just different examples of a general class of distance functions on a certain space.

More profoundly, we'll see that basic results in statistics about the variance of estimators can be viewed as questions about the curvature of a related surface, and that almost all the distance functions we consider can be seen as implied by distance measures on this surface.

Convex Conjugates: The linked paper is a good reference for the basics of conjugacy, but a definitive reference is Convex Analysis, by Rockafellar. The chapters on conjugacy, support lines and polars are invaluable and really explain the concepts well.

One way of thinking about the conjugate of a function is to imagine it as a hard surface, and pretend that you're taking a ruler (or a metal sheet) and running it over the bottom of the surface, recording tangents as you go, while staying as close below the function as possible. If you think about it this way, then the relation to projective duality becomes a bit clearer: this is essentially how we think about projective duals (the Gauss map is another useful concept).

This way of thinking about the conjugate also explains why, if f is not convex, f^{**} \ne f. This is because the metal sheet can only access the convex portions of the surface, and so it can only recover the convex hull of the shape, rather than the original shape.
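To make the "metal sheet" picture concrete, here is a small numerical sketch (my own illustration, not from the linked references; it assumes only numpy) that computes a discretized Legendre-Fenchel conjugate on a grid and checks that the biconjugate of a non-convex function recovers only its convex hull:

 import numpy as np

 def conjugate(xs, fs, ys):
     """Discrete Legendre-Fenchel transform: f*(y) = max_x (x*y - f(x))."""
     return np.max(np.outer(ys, xs) - fs[np.newaxis, :], axis=1)

 xs = np.linspace(-2.0, 2.0, 401)
 fs = (xs**2 - 1.0)**2                     # a non-convex double-well function
 ys = np.linspace(-10.0, 10.0, 401)

 f_star = conjugate(xs, fs, ys)            # the conjugate f*
 f_star_star = conjugate(ys, f_star, xs)   # the biconjugate f**

 # f** lower-bounds f everywhere, and between the two wells it traces out
 # the convex hull of f (roughly zero) rather than f itself.
 print(np.all(f_star_star <= fs + 1e-8))   # True
 print(fs[200], f_star_star[200])          # f(0) = 1, while f**(0) is close to 0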

1/18/08: Bregman divergences

Motivation:

This paper builds a framework for defining and constructing Voronoi diagrams for a broad class of distance functions called Bregman divergences. Bregman divergences include not only the traditional (squared) Euclidean distance but also various divergence measures based on entropic functions. To understand Voronoi diagrams for Bregman divergences we first have to study Bregman divergences themselves. This talk will cover the definition of Bregman divergences and some of their main properties. After that we will see how functions belonging to this class behave in the primal and dual spaces. We will also cover elements of Bregman geometry, i.e., Bregman bisectors and balls.

Definition of Bregman divergences:

For any two points p and q of  X \subseteq R^d , the Bregman divergence  D_F(\cdot||\cdot) : X \times X \to R from p to q, associated with a strictly convex and differentiable function F (called the generator function of the divergence), is defined as

 D_F(p||q) = F(p)- F(q)-  <\nabla  F(q),p-q>

Properties of Bregman divergences:

  1. It is not symmetric, i.e., D_F(p||q) \ne D_F(q||p) in general. It becomes symmetric if the Hessian  \nabla ^2 F is constant on X.
  2. Non negativity  D_F(p||q) \ge 0
  3. Convexity: DF(p | | q) is convex in its first argument p but not necessarily in its second argument q.
  4. Linearity:  D_{F_1+ \lambda F_2}(p||q) =  D_{F_1}(p||q) + \lambda D_{F_2}(p||q)
  5. Invariance under affine terms:

G(x) = F(x) + <a,x> + b differs from F(x) only by an affine term, yet D_F(p||q) = D_G(p||q). For example, the Mahalanobis distance, generated by F(x) = x^T Q x with Q a positive definite matrix, is a generalization of the squared Euclidean distance. (A small numerical sketch of the definition and these properties follows below.)
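The sketch (my own illustration, not from the paper) instantiates D_F for the squared Euclidean norm and for the negative Shannon entropy (whose divergence is the KL distance) and checks non-negativity and asymmetry:

 import numpy as np

 def bregman(F, grad_F, p, q):
     """D_F(p||q) = F(p) - F(q) - <grad F(q), p - q>."""
     return F(p) - F(q) - np.dot(grad_F(q), p - q)

 # Generator 1: F(x) = ||x||^2, which gives the (symmetric) squared Euclidean distance.
 sq = lambda x: np.dot(x, x)
 grad_sq = lambda x: 2.0 * x

 # Generator 2: F(x) = sum_i x_i log x_i (negative entropy) on the positive
 # orthant, which gives the (asymmetric) KL divergence on the simplex.
 negent = lambda x: np.sum(x * np.log(x))
 grad_negent = lambda x: np.log(x) + 1.0

 p = np.array([0.2, 0.3, 0.5])
 q = np.array([0.4, 0.4, 0.2])

 print(bregman(sq, grad_sq, p, q), np.sum((p - q)**2))               # equal
 print(bregman(negent, grad_negent, p, q), np.sum(p * np.log(p/q)))  # equal (KL)
 print(bregman(negent, grad_negent, q, p))                           # different: not symmetric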

Legendre Duality:

We know this from the previous lecture: F^*(x') = \sup_{x \in X} \{ <x,x'> - F(x) \}. We also know that F^* is strictly convex, and that the supremum is attained at the point x with  x' = \nabla F(x). We will mostly be making use of these three properties.

  • \nabla F^* = \nabla ^{-1} F

The most important property is that the gradient of the dual is the inverse of the gradient of the primal.

D_F(p||q) = D_{F^*}(q'||p')

The above statement says that the divergence from p to q in the primal space equals the divergence from q' to p' in the dual space. Consider the squared function F(x) = x^2/2. Its divergence is the (halved) squared Euclidean distance, which is symmetric: D_F(p||q) = D_F(q||p). We also know one more thing about F(x) = x^2/2, namely \nabla F^* = \nabla F (both are the identity), so the primal and dual points coincide: q' corresponds to q and p' corresponds to p. Alternatively, note that the Hessian  \nabla ^2 F is constant, so by property 1 above D_F(p||q) = D_F(q||p). Hence the squared Euclidean distance is just one member of the more general class of Bregman distances. For the Burg entropy F(x) = -\log(x), \nabla F^* = \nabla F also holds, but since F is not quadratic the Hessian is not constant, and D_F(p||q) = D_F(q||p) fails. The reason is that even though the gradient maps have the same form, the dual points q', p' are not equal to q, p respectively.
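The identity D_F(p||q) = D_{F^*}(q'||p') is easy to check numerically. A sketch of my own, using the negative entropy generator on the positive orthant, whose conjugate is F^*(y) = \sum_i \exp(y_i - 1) (a standard computation):

 import numpy as np

 # Generator: negative entropy F(x) = sum_i x_i log x_i on the positive orthant.
 F      = lambda x: np.sum(x * np.log(x))
 gradF  = lambda x: np.log(x) + 1.0
 # Its Legendre conjugate, with grad F* = (grad F)^{-1}.
 Fstar     = lambda y: np.sum(np.exp(y - 1.0))
 gradFstar = lambda y: np.exp(y - 1.0)

 def bregman(F, gradF, p, q):
     return F(p) - F(q) - np.dot(gradF(q), p - q)

 p = np.array([0.2, 0.3, 0.5])
 q = np.array([0.4, 0.4, 0.2])
 p_dual, q_dual = gradF(p), gradF(q)     # primal points mapped into the dual space

 print(np.allclose(gradFstar(gradF(p)), p))                                 # grad F* inverts grad F
 print(bregman(F, gradF, p, q), bregman(Fstar, gradFstar, q_dual, p_dual))  # the two values agree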

Elements of Bregman Geometry:

We discuss several basic geometric properties that will be useful when studying Bregman Voronoi diagrams. Specifically, we characterize Bregman bisectors, Bregman balls.

Since Bregman divergences are not symmetric, we can define several types of bisectors. There are two types of Bregman bisectors which are defined as:

H_F(p,q)={x \in X | D_F(x||p)=D_F(x||q)}

H_F^'(p,q)={x \in X | D_F(p||x)=D_F(q||x)}

They coincide when the divergence is symmetric. In general, however, they are distinct: the bisectors of the first type are linear, while the bisectors of the second type are potentially curved (but always linear in the gradient space).

p and q lie on opposite sides of H_F(p,q).

 H_F(p,q) = \nabla^{-1}F(H^'_{F^*}(q^',p^'))

 H^'_F(p,q) = \nabla^{-1}F(H_{F^*}(q^',p^'))

  • The main point is that although the two types of bisectors differ in the primal space, the bisector of the first type in the primal space is the image (under \nabla^{-1}F) of a bisector of the second type in the dual space, and vice versa.

Projection, orthogonality:

Three-point property:

For any triple p,q and r of points of X we have:

  • D_F(p||q) + D_F(q||r) = D_F(p||r) + <p-q, r'-q'>

1. The inner product <p-q, r'-q'> can be positive or negative. When it is non-negative, the identity yields a triangle-inequality-like bound D_F(p||r) \le D_F(p||q) + D_F(q||r); when it is negative, no such inequality holds. (The sketch below checks the identity numerically.)

2. If <p-q, r'-q'> is zero, the identity becomes a Pythagorean theorem, i.e., pq \perp_F qr. The important thing to note is that pq \perp_F qr does not imply qr \perp_F pq: Bregman orthogonality is not symmetric. In the gradient (dual) space, however, the corresponding segments are orthogonal, i.e., r'q' \perp q'p', because <r'-q', p-q> is also zero (the inner product is symmetric).

Γ_F(p,q) is Bregman orthogonal to the Bregman bisector H_F(p,q), while Λ_F(p,q) is Bregman orthogonal to H_{F^*}(p,q).
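A quick numerical check of the three-point identity, using the KL divergence (an illustration of my own, not from the paper):

 import numpy as np

 F     = lambda x: np.sum(x * np.log(x))   # negative entropy generator (KL divergence)
 gradF = lambda x: np.log(x) + 1.0

 def D(p, q):
     return F(p) - F(q) - np.dot(gradF(q), p - q)

 p = np.array([0.2, 0.3, 0.5])
 q = np.array([0.4, 0.4, 0.2])
 r = np.array([0.1, 0.6, 0.3])

 lhs = D(p, q) + D(q, r)
 rhs = D(p, r) + np.dot(p - q, gradF(r) - gradF(q))
 print(np.isclose(lhs, rhs))   # True: the identity holds exactly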

1/25/08: Bregman Voronoi diagrams, Bregman triangulations and their applications

The Voronoi diagram of a finite set of objects is a fundamental geometric structure that subdivides the embedding space into regions, each region consisting of the points that are closer to a given object than to the others. More specifically, given a set of points \{p_1,\cdots ,p_n\} \subset R^d, there is a cell associated with each point p_i such that any point x in that cell is closer to p_i than to any other point p_{j \ne i}. Saying that one point is closer to another presupposes some distance function. If the distance function is Euclidean, we call the result a Euclidean Voronoi diagram. We obtain generalized Voronoi diagrams, called Bregman Voronoi diagrams, by using a Bregman divergence instead of the Euclidean distance. We can also think of the Voronoi diagram as the structure obtained from the bisectors of every pair of points p_i and p_j. Since we have three different kinds of Bregman bisectors, we have three different kinds of Bregman Voronoi diagrams, one for each bisector (first type, second type, and symmetrized).
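As a concrete illustration (a brute-force sketch of my own, not the algorithms of the paper), here is how query points can be assigned to first-type Bregman Voronoi cells, i.e., to the site p_i minimizing D_F(x||p_i), using the negative entropy generator (KL divergence) on the probability simplex:

 import numpy as np

 # Negative entropy generator on the simplex: the associated divergence is KL.
 F     = lambda x: np.sum(x * np.log(x))
 gradF = lambda x: np.log(x) + 1.0

 def D(x, p):
     return F(x) - F(p) - np.dot(gradF(p), x - p)

 # Sites and a few query points, all lying in the 2-simplex.
 sites   = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.3, 0.4, 0.3]])
 queries = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.34, 0.33, 0.33]])

 # First-type Bregman Voronoi assignment: x belongs to the cell of the site
 # that minimizes D_F(x || p_i).
 for x in queries:
     cell = np.argmin([D(x, p) for p in sites])
     print(x, "-> cell of site", cell)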

A Bregman Voronoi diagram gives rise to a Bregman triangulation when transformed into a dual form (this duality is different from the geometric point-line duality); in fact this is true for any Voronoi diagram. Consider a face f of the Voronoi diagram (here, a cell): its dual face f* is the convex hull of the sites (points) associated with the cells neighboring f. If no subset of d+2 sites lies on the same d-dimensional sphere (in the two-dimensional case, if no four points lie on a circle), the set of dual faces constitutes a triangulation embedded in R^d whose vertices are the sites. This triangulation is called the Delaunay triangulation. Bregman Delaunay triangulations can be obtained by simply using a Bregman Voronoi diagram instead of the Euclidean Voronoi diagram and Bregman spheres instead of Euclidean spheres.

There are many applications of Bregman triangulations (or Voronoi diagrams). One of them is mesh generation: a mesh of size ε is generated in such a way that the maximum distance of any point x from its closest site p_i is not greater than ε, while the distance between any two sites is always greater than ε.

The take-away message of this talk is that the properties of ordinary Voronoi diagrams (e.g., converting them into triangulations through duality) extend to Bregman Voronoi diagrams, so we can use the same standard techniques that we use for problems involving the Euclidean distance to solve more general problems involving Bregman divergences.

2/1/08: Bregman Divergences and Exponential Families

Motivation: In clustering, we typically minimize some sort of distortion measure (Euclidean, Mahalanobis, etc.) to come up with the "optimal" clustering. Often, we want to work with more generic measures of distortion. Clustering (almost always) also assumes that the data comes from a certain probability distribution. So another generalization can be in terms of the probability distributions that we consider. Such generalizations for clustering (and also for other problems dealing with distances/distortions) can be studied using two very important tools -- Bregman divergences (generalizing distortion measures) and exponential families (generalizing probability distributions which take a certain form). It turns out that there is a close connection between exponential families and Bregman divergences, and this has been studied in the literature before.

This paper formalizes the connection and shows that there exists a one-to-one and onto correspondence (i.e., a bijection) between regular exponential families and a large class of Bregman divergences (called regular Bregman divergences).

To establish this correspondence, we will use properties of exponential families, notions of Legendre duality between convex functions of certain types, and certain theorems from convex analysis, which we shall see in the talk.

It turns out that the general form expressing the relationship looks like this: p_{\{\psi,\theta\}}(\textbf{x}) = \exp(-d_{\phi}(\textbf{x},\mu))b_{\phi}(\textbf{x})

Here p_{\{\psi,\theta\}} is a probability density from a regular exponential family, d_\phi(.,.) is the corresponding Bregman divergence, \psi (the cumulant function) and \theta (the natural parameter) are the parameters associated with the exponential family distribution, \phi is the conjugate function to \psi, \textbf{x} \in dom(\phi), \mu is the expectation parameter associated with the regular exponential family, and b_\phi(.) is another uniquely determined function.

We'll also see a simple example of a distribution from the exponential family -- the univariate Gaussian, for which the corresponding Bregman divergence turns out to be the L_2^2 distance. Similarly, for multinomial distributions, the corresponding Bregman divergence is the KL divergence.
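A quick sanity check of this correspondence for the unit-variance univariate Gaussian (a sketch of my own; it assumes scipy is available): here \phi(x) = x^2/2, so d_\phi(x,\mu) = (x-\mu)^2/2 and b_\phi(x) = 1/\sqrt{2\pi}.

 import numpy as np
 from scipy.stats import norm

 # Unit-variance Gaussian written as exp(-d_phi(x, mu)) * b_phi(x).
 def d_phi(x, mu):
     return 0.5 * (x - mu)**2           # squared-loss Bregman divergence

 def b_phi(x):
     return 1.0 / np.sqrt(2.0 * np.pi)  # the uniquely determined carrier term

 mu = 1.3
 xs = np.linspace(-3.0, 5.0, 9)
 lhs = norm.pdf(xs, loc=mu, scale=1.0)
 rhs = np.exp(-d_phi(xs, mu)) * b_phi(xs)
 print(np.allclose(lhs, rhs))           # True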

2/8/08: Clustering with Bregman divergences

This paper sets out to answer the following question: for what type of cost functions can we define an iterative relocation scheme (think Kmeans) such that the distortion with respect to the cluster representatives decreases at every step? The answer turns out to be that you can use such a scheme with an arbitrary Bregman divergence. An even stronger result from Banerjee shows that this scheme will only work when the distortion is a Bregman divergence.

Several well known loss functions, such as the squared Euclidean distance and the KL divergence, have been used in clustering algorithms before. This paper shows us how we can take a step back from these specific loss functions and look at an entire class of functions (Bregman divergences) that can successfully be used in Kmeans-style clustering.

Section 3 of this paper defines Bregman information, which motivates a loss function that allows us to solve the hard clustering problem for arbitrary Bregman divergences.

Bregman Information:

This idea is motivated by Shannon's rate-distortion problem, which involves finding a coding scheme of a given rate (bits per symbol) such that the expected distortion between the source r.v. and the decoded r.v. is minimized.

Consider a r.v. X that takes the values X = \{x_i\}^n_{i=1} \subset S \subseteq \mathbb{R}^d, where S is a convex set, with discrete probability measure ν, and let the distortion be measured by a Bregman divergence dφ. The expected distortion with respect to a single representative s is Eν[dφ(X,s)], which depends on the choice of s. The minimum achievable value is called the Bregman information of the r.v. X with Bregman divergence dφ:

I_{\phi}(X) = \min_{s \in ri(S)} E_{\nu} [d_{\phi}(X,s)]
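A useful fact from the paper is that the minimizing representative s is always the mean E_ν[X], regardless of which Bregman divergence is used. A small sketch (with made-up data) computing the Bregman information under the KL divergence:

 import numpy as np

 # KL divergence as a Bregman divergence (generator: negative entropy).
 F     = lambda x: np.sum(x * np.log(x))
 gradF = lambda x: np.log(x) + 1.0

 def d_phi(x, s):
     return F(x) - F(s) - np.dot(gradF(s), x - s)

 # A random variable taking four values (points in the simplex) with measure nu.
 X  = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.25, 0.5, 0.25]])
 nu = np.array([0.4, 0.3, 0.2, 0.1])

 mean = nu @ X                 # E_nu[X]: the optimal representative for any d_phi
 bregman_information = sum(w * d_phi(x, mean) for w, x in zip(nu, X))
 print(mean, bregman_information)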

2/15/08: Bregman Optimization

This is a good example of two entirely different problems unified by the use of Bregman divergences. What we also get is a look at optimization with Bregman divergences, and how a nice duality theory holds.

Adaboost:

Adaboost is a 'meta-algorithm' that addresses the problem of learning with weak learners. Suppose you're given examples (x_i, y_i), i = 1 \ldots n, where y_i \in \{-1,+1\}, and you want a classifier that can correctly predict these outcomes. At your disposal are so-called 'weak learners', or candidate functions f_j that map each x_i to a predicted label. Can you find a weight vector λ so that the combination of these learners with weights λ is a good predictor?

The standard measure is to count the number of misclassifications, i.e., the number of cases where y_i f_λ(x_i) < 0. But minimizing this directly is usually hard. Instead, we replace the sharp threshold of y_i f_λ(x_i) < 0 by the smoother cost function exp(-y_i f_λ(x_i)) and attempt to minimize the sum of these costs over all λ.

The actual algorithm runs an update procedure starting with a distribution over the training instances, and refining the distribution to emphasize instances that are currently not well predicted by the weighted combination. The weights on the learners are also updated appropriately.

The key result for Adaboost is that if each hypothesis has error rate \epsilon_i = 1/2 - \gamma_i, then the overall learner has training error at most exp(-\sum_i \gamma_i^2).
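A minimal sketch of the update procedure on toy data, with a fixed, hand-picked set of weak learners given by their ±1 predictions (purely illustrative, not the paper's formulation):

 import numpy as np

 # Toy labels and three weak learners, each given by its +/-1 predictions
 # on the five training points.
 y = np.array([+1, +1, -1, -1, +1])
 H = np.array([[+1, +1, -1, +1, +1],
               [+1, -1, -1, -1, +1],
               [-1, +1, -1, -1, -1]])

 n = len(y)
 D = np.ones(n) / n           # distribution over training instances
 lam = np.zeros(len(H))       # weights on the weak learners

 for t in range(10):
     errs = np.array([np.sum(D[h != y]) for h in H])    # weighted error of each learner
     j = np.argmin(errs)                                # pick the currently best learner
     eps = errs[j]
     alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # its vote
     lam[j] += alpha
     D *= np.exp(-alpha * y * H[j])                     # emphasize misclassified instances
     D /= D.sum()

 f = np.sign(lam @ H)                                   # combined classifier
 print(lam, np.mean(f != y))                            # learner weights and training error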

Logistic Regression.

Rather than doing the above, suppose we postulate that y is a stochastic function of x, and the goal is to determine the probabilities p(y | x). One technique for doing this is to write the probability p(y | x) as

p(y=+1|x) = \frac{1}{1 + \exp(-f(x))}
In which case, we can write down the likelihood function for the data, and maximize it, yielding the optimization 
\min_\lambda \sum_i \ln(1 + \exp(-y_i f_\lambda(x_i)))

Bregman optimization:

Suppose you want to find a Bregman projection: specifically, you want to find a point in a subspace closest to a given point, where "close" is defined by a Bregman divergence. Using the convexity of the function d(p,q) (as a function of p), you can write down a duality theorem that characterizes optimal solutions in terms of the Bregman distance d(p,q) and the Bregman distance d(q,p), via a "Pythagorean" theorem. Moreover, this duality leads to a natural iterative optimization scheme.

Both of the above problems can be placed in this framework, which allows them to be solved by a unified iterative algorithm.
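As a small, self-contained illustration of a Bregman projection (my own sketch, not the paper's algorithm): the KL projection of a distribution q onto the affine set \{p : <a,p> = c, \sum_i p_i = 1\} has the exponential form p_i \propto q_i \exp(\lambda a_i), and the multiplier λ can be found by bisection, since <a, p(λ)> is monotone in λ.

 import numpy as np

 def kl_projection(q, a, c, lo=-50.0, hi=50.0, iters=100):
     """Bregman (KL) projection of q onto {p : <a,p> = c, sum(p) = 1}."""
     def p_of(lam):
         w = q * np.exp(lam * a)          # exponential tilt of q
         return w / w.sum()
     for _ in range(iters):               # bisection on the multiplier lambda
         mid = 0.5 * (lo + hi)
         if np.dot(a, p_of(mid)) < c:
             lo = mid
         else:
             hi = mid
     return p_of(0.5 * (lo + hi))

 q = np.array([0.4, 0.4, 0.2])
 a = np.array([1.0, 2.0, 3.0])            # constraint: E_p[a] = c
 p = kl_projection(q, a, c=2.2)
 print(p, np.dot(a, p))                   # the constraint holds up to bisection accuracy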

2/28/08: On divergences, surrogate loss functions and decentralized detection (Avishek)

Applications in decentralized detection aim to select optimal discriminant functions and compression rules by minimizing some kind of loss function. The most commonly used loss function is the 0-1 loss. However, the 0-1 loss function is not very well-behaved in the sense that it is not convex. This makes minimizing the probability of error for the 0-1 loss intractable. In the machine learning community, this problem is circumvented by replacing the 0-1 loss function with computationally efficient and well-behaved loss functions like the hinge loss, exponential loss, logistic loss, etc. On the other hand, the signal processing community, inspired by Blackwell's seminal work on comparison of experiments, uses the f-divergence measure to minimize the probability of error due to misclassification. This paper aims at establishing a correspondence between the class of well-behaved loss functions and the f-divergences, both of which act as surrogates to the 0-1 loss function. In addition, it demonstrates that for any given loss function, regardless of its convexity, there exists a function f such that minimizing the loss is equivalent to maximizing the f-divergence. Each surrogate loss function induces exactly one f-divergence; however, many surrogate loss functions may induce the same f-divergence. This leads to the notion of equivalence between loss functions which induce the same kind of f-divergence.
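For reference, an f-divergence has the form D_f(P||Q) = \sum_i q_i f(p_i/q_i) for a convex f with f(1) = 0, and different choices of f recover familiar distances. A small sketch (my own, not from the paper):

 import numpy as np

 def f_divergence(f, p, q):
     """D_f(P||Q) = sum_i q_i * f(p_i / q_i), for convex f with f(1) = 0."""
     return np.sum(q * f(p / q))

 p = np.array([0.2, 0.3, 0.5])
 q = np.array([0.4, 0.4, 0.2])

 kl        = f_divergence(lambda t: t * np.log(t), p, q)          # Kullback-Leibler
 total_var = f_divergence(lambda t: 0.5 * np.abs(t - 1.0), p, q)  # total variation
 hellinger = f_divergence(lambda t: (np.sqrt(t) - 1.0)**2, p, q)  # squared Hellinger

 print(kl, np.sum(p * np.log(p / q)))           # the two KL formulas agree
 print(total_var, 0.5 * np.sum(np.abs(p - q)))  # ... and so do the TV formulas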

The correspondence between loss functions and f-divergences has been proved in both directions:

φ-risk to f-divergence:

It has been shown that given some risk function Rφ, we can derive the relation for an equivalent divergence such that minimizing the risk corresponds to maximizing the divergence measure.

f-divergence to φ-risk:

This part is a bit tricky. First, the paper presents a procedure (Lemma 5) to construct a function ψ from some risk function φ and relates (Proposition 7a) this ψ to the f-divergence. This implies that given some f-divergence, we should be able to derive (Theorem 8) ψ from it and subsequently the φ loss can be reconstructed (Proposition 7b) from the ψ-function.

Corollary 9 presents some additional properties which may be utilized in verifying whether some f-divergence is realizable for a given surrogate loss function φ.

Finally, the surrogate loss functions are compared to quantizer designs. Inequalities that exist for f-divergences are extended to the loss functions. The paper also introduces the notion of universal equivalence between loss functions, in the sense that loss functions which induce the same f-divergence are equivalent. It then tries to construct the class of loss functions which are universally equivalent to the 0-1 loss function.

3/28/08: Elements of Riemannian Geometry

We will discuss relevant concepts from Riemannian Geometry that we typically encounter while studying statistical manifolds. Notions of topological manifolds, manifold with a boundary, differentiable maps, differentiable manifolds, examples, diffeomorphism, tangent spaces, submanifolds, Riemannian manifolds (differentiable manifold with a metric), geodesic distance and geodesic curves, push-forward maps, pull-back metric, isometry, etc. More to come..

4/18/08: Application: Hyperplane classifiers on the multinomial manifold

This paper details an application of the statistical manifolds we have been discussing over the past few weeks. Specifically, the authors point out the benefits of using linear classifiers (ease of understanding, ease of use, etc.) but also bring to light a shortcoming: standard linear classifiers assume that the underlying geometry of your data set is Euclidean, and "real world" data sets often do not have this property. Their solution is to extend standard hyperplane and margin classifiers to cover multinomial geometries as well.

The application of this extension is categorizing text documents: given a set of topics (sports, politics, economics) and a set of documents, label each document with one or more topics. Modeling this problem with multinomial distributions means working on the simplex of multinomial models, viewed as a Riemannian manifold equipped with the Fisher information metric.

There will be discussion of how to construct the multinomial manifold, how to define a hyperplane and a margin on the n-sphere, and how to extend logistic regression to statistical manifolds.
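For concreteness: under the Fisher information metric, the multinomial simplex maps isometrically (up to a constant factor whose convention varies by author) onto a portion of the sphere via p \mapsto \sqrt{p}, so geodesic distances have a closed form. A small sketch of my own:

 import numpy as np

 def fisher_geodesic_distance(p, q):
     """Geodesic distance between two multinomials under the Fisher metric.

     Uses the square-root embedding p -> sqrt(p), which maps the simplex onto
     the positive orthant of the unit sphere; the geodesic distance is then
     (up to a factor-of-2 convention) the spherical arc length.
     """
     cos_angle = np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0)
     return 2.0 * np.arccos(cos_angle)

 p = np.array([0.2, 0.3, 0.5])
 q = np.array([0.4, 0.4, 0.2])
 print(fisher_geodesic_distance(p, q))
 print(fisher_geodesic_distance(p, p))   # (essentially) zero for identical distributions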

4/25/08: Covariant derivatives, α-connections and invariants

Some links:

Seminar Outline

  • Refresher on entropy, mutual information, KL-distance, Legendre-Fenchel duality: how it generalizes linear duality
  • Bregman distances: generalization of projective duality, Voronoi diagrams
  • Bregman distances and exponential families: the Cramer transform
  • Csiszar and information distances
  • Information distances and inference (Chernoff, Bhattacharya, KL, Jensen-Shannon)
  • Mutual information geometry (?): embedding information distances, the Hellinger distance, Jeffreys prior
  • Amari-style geometry: Cramer Rao theorem reinterpreted via Efron/Dawid as a statement about curvature
  • Cencov's result: Fisher information as a "geometric invariant" under Markov transforms
  • alpha-divergences
  • Jun Yang paper on the representability of Bregman distances.

Past Semesters
