Talk:Algorithms Seminar/Spring08
From ResearchWiki
Contents |
Jan, 11
Sam
- What I got from this paper is that the Legendre-Fenchel duality can be described as a mapping from position to slopes of the convex envelope of a function. Where and how is this duality used? Is it right that if we an compute the Legendre-Fenchel duality of a non-convex function we can apply convex optimization techniques to the dual to find the global minimum of the non-convex function?
- (Suresh) That's not right. The dual of the LF-dual is the "convex hull" of the graph of the function, so it doesn't quite work. However, solving the dual problem in this way does give a lower bound on the optimal solution.
- (Sam) Doesn't the convex hull contain the minimal point or at least contain the infimum over f (if f is bounded from below) ?
- (Suresh) That's not right. The dual of the LF-dual is the "convex hull" of the graph of the function, so it doesn't quite work. However, solving the dual problem in this way does give a lower bound on the optimal solution.
Jan, 18
Nathan
Ques: I am having hard time understanding what is going on in section 2.3 (and the end of 3.1)? I understand the main point as being that you can take a non-symmetric Bregman divergence, and turn it into a symmetric (but not necessarily Bregman) divergence i.e. SF(p,q). What does this do for us in the big picture?
Amit
Ans: SF(p,q) is symmetrized divergence because F(x) is convex whereas
can be any function not necessarily be convex.
Piyush
- Can there be an intuitive explanation for the hyperplane vs hypersurface distinction of 1st and 2nd type of Bregman bisectors (sec 3.1), and similarly for the convex vs non-convex distinction of 1st and 2nd type of Bregman balls (sec 3.2)? Can just looking at the equations in Lemma 4 make it clear for Bregman bisector case?
- (Piyush): The first equation in Lemma 4 in linear in x so it represents a hyperplane. The second one is linear in x' (belonging to the gradient space, and not necessarily in the original space) and thus it's a hypersurface. For balls case, we can look at the sign of the double derivatives to decide convexity.
- On notations -- sec 2.1 defines D_f(.||.):X --> R. Shouldn't it be X x X --> R ?
- (Piyush): Indeed, that's an omission.
Arvind
- Author says that we can obtain other divergences (i.e. KL divergence) using Bregman divergence. Some of them are even given in table 1. I don't understand this because what author says KL divergence (in table 1) is not really a KL divergence.
- (Arvind) Thats is actually unnormalized KL divergence.
- Can we define Bregman divergence for non-convex function (or for functions that are not strictly convex)?
- (Sam) I guess you could do it but it would lead to negative divergences measures, points that are far away can have zero divergence and all sorts of strange behaviors. With the Bregman divergence you have at least that if you measure from p to any other point that you have an ordering that is consistent with the measure e.g if D(q | | r) < D(q | | p) then
. So I guess it doesn't really make any sense to use such a function as some sort of measure.
- (Sam) I guess you could do it but it would lead to negative divergences measures, points that are far away can have zero divergence and all sorts of strange behaviors. With the Bregman divergence you have at least that if you measure from p to any other point that you have an ordering that is consistent with the measure e.g if D(q | | r) < D(q | | p) then
- What exactly is the application of Bregman Divergence? Is it used only as similarity measure? Section 3 talks about Bregman geometry (i.e. bisector, spheres, lifting map), why do we care for them?
- (Arvind) Basically we study Bregman Divergence because this is a general class from which we can deduce many specific divergences i.e. KL, hellinger etc. Applications of this will be clear in the coming talks
Sam
- Would something like kernelized Voronoi diagramms be possible?
- I don't quite see why type 1 bisectors are linear.
- (Sam) I was afraid I would have to do some symbol crunching. But i think we see that by placing the x as the first argument this corresponds to the equal distance to two gradient hyperplanes to f at p and q which is the bisector hyperplane of the two gradient hyperplanes. I know they wrote that in Lemma 4 but all these symbols made me a little dizzy.
Jan, 25
Nathan
- What would happen if the Bregman divergence were symmetric, would
also be a type-1 Bregman Voronoi diagram?
- (Sam) I think it would just reduce to the original Voroni diagram (with according distance measure, i am having trouble imagining what an
Voronoi diagram might look like) since we are just looking for all points of equal distance to points in the data set.
- (Sam) I think it would just reduce to the original Voroni diagram (with according distance measure, i am having trouble imagining what an
- Why are the cells in type-2 Bregman Voronoi diagrams curved?
Amit
I think you are right. For Euclidean case all three should converge. About Brergman voronoi diagrams be curved somehow relates to bisectors being curved. I still don't have any clue why do we care about symmetrized bregman divergences? What about voronoi diagrams in dual space. Is it like the bisector case. Linear becomes curve and curve becomes linear. why do we care about Bregman triangulation in dual space?
Piyush
- Are there specific situations in which one would prefer working with one type of triangulation (straight edge - Delaunay) of the Bregman Voronoi diagrams over the other (curved edge - geodesic)?
- Can we say anything about shape of cells in a third-type Bregman Voronoi diagram? In Fig 9, they (kind of) look straight-edged (i.e. forming convex polyhedra) for KL divergence but look slightly curved for Itakura-Saito divergence. Does it mean that the shape of cells for third-type Bregman diagrams depends on the type of divergence being looked at? We also note that, in both cases in fig 9, the green edges (corresponding to third-type) lie between red (first-type) and blue (second-type). Does that too signify anything?
Avishek
- I am unable to understand the construction of symmetrized Voronoi diagrams in terms of first-type Bregman voronoi diagrams and manifolds (last para of Section 4.1).
Feb, 1
Nathan
- A few questions on notation: L3 from defn 4: What comprises the set bd(Θ)? What does int(dom(Ψ)) signify, specifically, the int/dom functions? (I really should have taken analysis when I had the chance...)
- (Piyush): The set bd(Θ) means the set of boundary elements of set Θ. Intuitively, this means that, for any point
, there exist arbitrarily close points that are in Θ, and also arbitrarily close points that are NOT in Θ.
- (Piyush): The set bd(Θ) means the set of boundary elements of set Θ. Intuitively, this means that, for any point
- Thm 4: What is meant by bφ being a uniquely determined function in the context of this theorem?
- (Piyush): The uniqueness comes from the one-to-one correspondence of the exponential family distribution p{ψ,θ} and the corresponding Bregman divergence dφ, in terms of which bφ is defined.
- (From section 4.5) I think it is pretty cool you can formulate Bregman divergences in terms of parameters of an expon. fam. distribution, but I don't have a good grasp at what d(x,μ) is telling us about x and μ?
- (Piyush): Notice from
that
in a sense expresses how the distribution p{ψ,θ} drops off once we start going away from the expectation parameter μ. How exactly this dropping off happens, depends on the specific distribution of the exponential family (it is in
sense for Gaussians, KL divergence sense for multinomials, and so on..)
- (Piyush): Notice from
Arvind
- While defining an exponential family, we never talk about supplying p0 to it. Is it something that can be computed using other inputs (natural statistics, natural parameters etc.)? Does it behave like a normalization constant?
- (Arvind) It's not a normalization constant.
- I don't see how we get
(In para just below theorem 2)?
- (Arvind) it's easy to see. Just expand out the expression (integral expression in the para) and differentiate it with respect to θ
Feb, 8
Arvind
- What exactly is ν (probability measure) in the entire formulation? What is it defined over?
- (Nathan) ν can be whatever you want it to be, but I'm sure some assignments are going to work better for your data set than others. It is defined over the r.v. X, and you need this probability measure so you can define Bregman information in terms of some expected value involving ν.
- (Suresh) If it helps, you can "pretend" that ν is merely the uniform distribution, so you're just computing averages over the data. In general, you could envisage ν as being some kind of prior.
Feb, 15
Avishek
- Last para on Page 7. How does the closure of Q imply Lagrange multipliers can be infinite and hence an unique solution exists ?
- I didn't understand the proof of Theorem 1 very well. It was so very "sketchy" :)
- Is there some criteria to select good Bregman divergences to be optimized, for a given loss function ?
Piyush
- Is there some rationale behind choosing the values of
and q0 for ExpLoss and LogLoss problems, or they choose such values only for computational convenience (basically to make things look familiar in some form)? Is the same true for the choice of lifting function F and the resulting Bregman divergence as well? Are these choice made keeping in mind the particular loss function?
- Is the number of columns (n) of the mxn matrix M same as the size of hypothesis set? If so, is the method of Bregman distance optimization applicable to only those learning algorithms that work in rounds (i.e. "iterative") choosing an "optimal" hypothesis at each step (such as AdaBoost and Logistic regression)?
Nathan
- Can any optimization problem be formulated in terms of Bregman divergences? Or, much like how we showed a criterion for using iterative descent methods in the last seminar, this paper shows us how to implement a parallel update scheme for different optimization problems given certain Bregman divergences.
Feb, 29
Nathan
- I am having trouble understanding how Lemma 18 goes on to prove Theorem 17?
- What do they mean by experiment design? Which they are denoting as the value Q later in the paper.
- In proposition 4, what is the "fusion center"?