Talk:MLRG/spring08
16 Jan 08
Amit
- In the paper they say that we get a unique optimal hyperplane. Why can't there be more than one hyperplane that separates the data well? Also, geometrically, it can be shown that the margin to the closest training point on either side of the hyperplane is 1/||w||.
- (Piyush): Note that we are solving a convex optimization problem in w and thus there will be a unique solution for it.
- (Hal): The important thing is that there may be many separating hyperplanes (in general, there are either none, one, or infinitely many --- and "one" is exceedingly unlikely). However, only one of them will have maximal margin.
- Also, why do we use >= 1 in the constraints? He makes an argument, which doesn't make sense to me, about the length of w being decreased.
- (Hal): It's hard to argue for this without drawing pictures, and it's actually probably going to be on a homework assignment for the machine learning class... so I'm a bit hesitant to give a formal answer. Intuitively, what's happening is that we can formulate the maximum margin problem in one of two ways. We can either say we want a vector w with norm 1 that achieves maximal margin, or we can ask for a vector w with minimal norm that separates the data with some positive margin. The reason we have to constrain the norm of w to 1 in the first case is because if we get a margin of gamma, and then replace w with 10*w, we get a margin of 10*gamma, which looks a lot better. However, the 1 here is clearly inconsequential---it's just a scaling factor. A similar argument on the other formulation shows that the "force the positive margin to 1, rather than to say 10" is also irrelevant... it's just a scaling factor.
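- A tiny numerical sketch of the scaling argument above (toy data, and w and b are made up for illustration, not from the paper): the functional margin min_i y_i(w·x_i + b) scales with w, while the geometric margin, which divides by ||w||, does not; the "norm 1" or "margin 1" constraint only pins down this arbitrary scale.
 import numpy as np
 
 X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])  # toy separable data
 y = np.array([1, 1, -1, -1])
 w, b = np.array([1.0, 1.0]), 0.0
 
 def functional_margin(w, b):
     return np.min(y * (X @ w + b))
 
 def geometric_margin(w, b):
     return functional_margin(w, b) / np.linalg.norm(w)
 
 print(functional_margin(w, b), functional_margin(10 * w, 10 * b))  # second is 10x larger
 print(geometric_margin(w, b), geometric_margin(10 * w, 10 * b))    # identical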
Avishek
- How/where does kernel PCA score over normal PCA? We already have PCA, so what's the motivation for bringing up kernel PCA?
- More importantly, given some problem instance, how do we decide on whether to apply normal PCA or its kernel variant ?
- (Piyush): Basically, PCA assumes that the data lies on a low dimensional linear subspace. However, many datasets will be intrinsically nonlinear and thus there would be a need to kernelize. Note however that if the data yields to a manifold representation, it would be more appropriate to use some other manifold learning techniques (LLE, ISOMAP, Laplacian Eigenmaps, etc) instead of kernel PCA.
- (Sam): Just a small remark to the above answer. I think LLE, Isomap, Laplacian Eigenmaps, etc can be formulated as kernel PCA with a specific kernel for each of them.
- (Suresh): Specifically, this paper: A kernel view of the dimensionality reduction of manifolds, J. H. Ham, D. D. Lee, and S. Mika, and B. Schoelkopf, Proceedings of the International Conference on Machine Learning (2004).
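- A minimal kernel PCA sketch (not from the papers; the RBF kernel, its width, and the toy data are arbitrary choices): with a plain dot-product kernel this reduces to ordinary PCA, while a nonlinear kernel gives a genuinely different embedding.
 import numpy as np
 
 def kernel_pca(X, kernel, n_components=2):
     n = X.shape[0]
     K = np.array([[kernel(a, b) for b in X] for a in X])
     one = np.ones((n, n)) / n
     Kc = K - one @ K - K @ one + one @ K @ one        # center the data in feature space
     vals, vecs = np.linalg.eigh(Kc)                   # eigenvalues in ascending order
     vals, vecs = vals[::-1][:n_components], vecs[:, ::-1][:, :n_components]
     alphas = vecs / np.sqrt(np.maximum(vals, 1e-12))  # scale so feature-space PCs have unit norm
     return Kc @ alphas                                # projections of the training points
 
 rbf = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
 linear = lambda a, b: a @ b
 X = np.random.randn(50, 3)
 Z_lin = kernel_pca(X, linear)   # matches ordinary PCA of the centered data (up to sign)
 Z_rbf = kernel_pca(X, rbf)      # nonlinear embedding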
Arvind
- If h < m, how come the test risk R[f] is independent of the underlying probability distribution that generated the data? (Eq. 1.19)
- (Sam): I would assume "independent" here means that this is an upper bound on the risk no matter what probability distribution generated the data. The actual risk for a particular probability distribution might be lower.
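- For reference, the bound in question has (roughly) the following standard VC form; the exact constants should be checked against Eq. 1.19, but the point is that it holds with probability at least 1 - η over the draw of the m training points, for *any* generating distribution:
 R[f] \le R_{\mathrm{emp}}[f] + \sqrt{ \frac{ h\left(\ln\frac{2m}{h} + 1\right) + \ln\frac{4}{\eta} }{ m } }
- The distribution enters only through the random sample; the right-hand side can be computed without knowing it, which is the sense in which the bound is distribution independent.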
Evrard
- What do the eigenvalues in the principal component analysis method allow us to infer on the sample data?
- (Evrard): PCA allows us to reduce the dimensionality of the sample data. The eigenvectors give the principal directions of the sample data, and each eigenvalue tells us how much of the sample variance lies along the corresponding eigenvector. In particular, we look at the eigenvector corresponding to the maximum eigenvalue to infer the main direction of the data.
Nathan
- After equation 1.39, the authors state that the sum of the slack variables (\sum \xi_i) is an upper bound on the number of training errors. Why is this so?
- (Piyush): In order for a point to be misclassified, y_i(w·x_i + b) must be <= 0, and the constraint y_i(w·x_i + b) >= 1 - \xi_i then forces \xi_i to be at least 1, so (\sum \xi_i) is an upper bound on the number of training errors.
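- Spelling that step out (just the standard argument, not quoted from the paper): since all \xi_i >= 0 and every misclassified point has \xi_i >= 1,
 \#\{\text{training errors}\} = \sum_{i:\, y_i(w\cdot x_i+b)\le 0} 1 \;\le\; \sum_{i:\, y_i(w\cdot x_i+b)\le 0} \xi_i \;\le\; \sum_i \xi_i .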
- How does one find "good" values for C, \nu, and \xi? Cross-validation? Guess and check till the paper deadline?
- (Piyush): Usually C controls the trade-off between margin maximization (the ||w||^2 term) and the penalty for misclassification. A high C means a large penalty but a smaller margin (remember our overall goal is to minimize the primal objective function). In practice, a small value of C is generally chosen. For the hinge loss, the \xi_i terms don't appear in the dual objective, so they aren't a concern. \nu is also chosen by some sort of cross-validation.
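- A small cross-validation sketch for tuning C (and an RBF width) with scikit-learn; the grid values and toy data are arbitrary choices, and this is just one common way of doing the model selection mentioned above.
 from sklearn.datasets import make_classification
 from sklearn.model_selection import GridSearchCV
 from sklearn.svm import SVC
 
 X, y = make_classification(n_samples=200, random_state=0)
 grid = GridSearchCV(
     SVC(kernel="rbf"),
     param_grid={"C": [0.01, 0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
     cv=5,                      # 5-fold cross-validation over the grid
 )
 grid.fit(X, y)
 print(grid.best_params_, grid.best_score_)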
Sam
- Can we say something about the VC-dimension/capacity of hyperplane classifiers when using the kernel trick? How does the VC-dimension/capacity behave under the kernel trick?
- (Piyush): The VC dimension of SVMs with kernels can be very large. It can be shown that even for a polynomial kernel, the VC dimension scales combinatorially with the degree of polynomial (high degree => high capacity). For Gaussian RBF kernels, small width implies a high capacity. For Gaussian RBF kernels with sufficiently small widths, the SVM can have infinite VC dimension. Burges's SVM tutorial gives some theorems and examples on these.
- (Hal): The fact that SVMs can do well even with infinite VC dimension is a bit surprising and counter-intuitive. As far as I know, there's no good theory that really explains this in satisfactory detail (correct me if I'm wrong). Late in the semester when we talk about minimum description length stuff, we'll get some arguments for why RBF is so nice, but it still doesn't answer this question.
- (Suresh): It's not clear to me why the RBF space has infinite VC dimension, but I do know that the VC-dimension of spaces of hyperplanes with separation at least γ is 1/γ^2. Might that have something to do with this (i.e. the intrinsic dimension doesn't matter anymore)?
- (Sam): I am not sure if this is correct, please correct me if I'm wrong. I think as the width of the RBF goes to zero the SVM degenerates to a nearest neighbor classifier, which in turn can be shown to have infinite VC-dimension.
- How do Parzen windows (or kernel density estimation in general) relate to, e.g., normal density estimation in some dot product space? What does the maximum likelihood (or the unbiased) estimate of the covariance/mean in some dot product space correspond to for the kernel density estimate in the original space?
- (Hal): I don't really know that anyone has looked at this. Certainly you can do maximum likelihood gaussian estimation in a dot-product space, at least for the mean---for the covariance matrix, you'd probably need some representer theorem. I suspect this will start looking a lot like gaussian processes. Regarding kernel density estimation, RBF kernels give you something that in practice looks a lot like a kernel density estimate with a gaussian window.
Sidd
- I'm not very clear on the role played by the "1" on the RHS of inequality 1.25. They do have an explanation about it in the following paragraph, but I'm afraid I didn't quite grasp their explanation. Does this "1" then appear as the "-1" in the Lagrangian in 1.26? Or is that something different?
- In their explanation about the "1", they seem to indicate that any positive value in place of the "1" would work. So if we were to replace the "-1" in the Lagrangian with some other negative value, would that be okay too?
- (Hal): See the above response to Amit.
- We're using a subset of training examples (the support vectors) to determine the maximum separation hyperplane between the positive and negative examples. Is it then reasonable to infer that the distance of a new test example from the hyperplane can be a "confidence score" of its classification? Why or why not? Is the answer to this question covered by any of the assumptions made for the derivation?
- (Hal): Sort-of. People do interpret distance from hyperplane as a measure of confidence. The problem is that it can be misleading. Suppose we have training data that all lies pretty near the origin, but with positive examples "above" and negative examples "below" the x-axis. Now, we get a test example that's 10 miles above the x-axis. This will have a huge distance from the origin and hence a huge "confidence" as prescribed by the model, but in a sense we really have no idea what the classification should be because it's nothing like the training data. In this way, gaussian processes are preferable because they actually model the data by points inside the clusters of data (rather than just the boundary) and so in this case, the 10-mile-away-point would actually get low confidence. Nevertheless, people do this and it works pretty well in practice. You can also plug the distance-to-hyperplane into a logistic function to get a form of a probability estimate out... John Platt has a paper on this that's spurred a bunch more work.
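- A sketch of the sigmoid idea mentioned at the end, in the spirit of Platt's method, but using a held-out calibration split and an off-the-shelf logistic regression as a stand-in for his exact fitting procedure (so not his paper's method verbatim):
 from sklearn.datasets import make_classification
 from sklearn.linear_model import LogisticRegression
 from sklearn.model_selection import train_test_split
 from sklearn.svm import SVC
 
 X, y = make_classification(n_samples=300, random_state=0)
 X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)
 
 svm = SVC(kernel="linear").fit(X_tr, y_tr)
 scores = svm.decision_function(X_cal).reshape(-1, 1)   # signed distances to the hyperplane (scaled)
 platt = LogisticRegression().fit(scores, y_cal)        # sigmoid map from score to probability
 
 print(platt.predict_proba(scores[:5])[:, 1])           # "confidence" expressed as a probability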
23 Jan 08
Nathan
- Notation question: This notation is used a lot in these papers: k(x, ·). What exactly does the "·" represent in these functions?
- (chang): (.) is a slot where the argument is placed.
- (nathan): yeah, that is what I am asking. What is the argument? Can it be anything?
- (Piyush): In feature space, think of k(x,.) as a function placed at x and defined over the entire domain of input patterns ( which means that "." could be the entire X). Essentially, k(x,.) (or rather \phi which induces it) maps a pattern x into a kernel shaped *function* sitting at x and it lets us compute the similarity of pattern x with all other patterns (x') by doing k(x,x') (which is actually equivalent to doing <\phi(x),\phi(x')>).
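- A small sketch of the k(x, .) reading above; the RBF kernel is an arbitrary concrete choice. Fixing the first argument turns the two-argument kernel into an ordinary function over the whole input domain, "sitting at" x.
 import numpy as np
 
 def rbf(x, x_prime, width=1.0):
     x, x_prime = np.asarray(x), np.asarray(x_prime)
     return np.exp(-np.sum((x - x_prime) ** 2) / (2 * width ** 2))
 
 def k_at(x):
     """Return the function k(x, .): it still expects the second argument."""
     return lambda x_prime: rbf(x, x_prime)
 
 f = k_at([0.0, 0.0])   # a kernel-shaped function centered at x = (0, 0)
 print(f([0.0, 0.0]))   # 1.0: similarity of x with itself
 print(f([1.0, 1.0]))   # similarity of x with another pattern x'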
Piyush
- (1) Empirical kernel map (w.r.t. a set of n points) \phi_n basically approximates \phi by mapping an input pattern x \in R^N to a point in R^n (instead of mapping it to an infinite dimensional object). Section 2.2.6 says that using the empirical kernel map, we don't lose information while applying a *linear* algorithm in the feature space (which is R^n), since everything now takes place in a linear span of the mapped training points. Does linear span here mean the way we typically represent the solution in terms of the training points ( \sum alpha_i*k(x,x_i))? If so, can there be other kinds of linear spans of mapped training points that correspond to valid empirical (not necessary kernel) maps in R^n?
- (2) Section 2.2.6 also says that the space induced by the empirical kernel map (\phi_m) isn't a dot product space in R^m and that we need to endow it with a dot product. The argument given is that the \phi(x_i) don't usually form an orthonormal system. How does that imply the first statement?
- (Avishek): Let us start with a vector space. A vector space consists of "orthogonal" basis vectors. Any vector within the vector space can be expressed in terms of these basis vectors. An inner product is defined within an inner product space, which is again a vector space. So the inner product space has to contain some orthogonal basis vectors. For any non-orthogonal system (NOS), all pairs of vectors in the system are non-orthogonal. Hence, an NOS can't have any basis vectors, since by definition basis vectors are mutually orthogonal. As a result the NOS can never be a vector space, and hence no dot product can be defined for an NOS.
- (Suresh): I'm not sure this is correct. There's no such thing as "orthogonal" in a vector space unless you define an inner product. What's happening is that the inherited dot product from R^n doesn't satisfy the kernel condition on this set of training points (intuitively because they might form some weird ellipsoidal shape in the space). So you rescale the eigenvalues to convert the ellipse into a sphere, in which case the property holds again.
Arvind
- What does f mean in equation 2.29? Is it the same as f(.)? If so, then what is f(x)? Is it equivalent to writing f(x)(.)?
- (Piyush): In 2.29, k(.,x) is the reproducing kernel (at point x): taking its dot product with some function f gives the value of f at x (i.e. it "reproduces" f's value at x, or in a sense is the "representer of evaluation" of f at x, as the line below equation 2.29 says). f(.) here is a function having a single argument (x). For a specific case, f itself can be another kernel function k(x',.), whose dot product with k(x,.) gives f's value at x, which is k(x',x). We can view the same thing w.r.t. k(x,.) too.
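- In symbols, this is the standard reproducing property (restating the above, not quoting the paper):
 \langle f, k(\cdot, x) \rangle = f(x), \qquad \text{and in particular} \qquad \langle k(\cdot, x'), k(\cdot, x) \rangle = k(x', x).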
Evrard
- There is a typo in section 6.4.1 of Eigenfunctions. Should the formula not be: Av = λv, where λ is the eigenvalue and v is the corresponding eigenvector?
- (Sidd): Also, I was wondering what A is. Should that be M?
Suresh
- I'm still confused about the fundamental difference between a Mercer kernel and an RKHS kernel. Is it merely a matter of continuous (Mercer) vs. discrete (RKHS)?
- (Piyush): Given a kernel, there can be different ways of constructing the associated feature map. The RKHS map and Mercer's map are two such ways of looking at it. The RKHS map is associated with k (the kernel function) while Mercer's map is associated with the \ell_2 feature space. Page 39 in the paper emphasizes this point (equivalence of feature spaces).
Chang
- I don't understand Theorem 2.10 well. What does eq. (2.38) mean?
- (Piyush): Theorem 2.10 mathematically formalizes the notion of equivalence between a kernel and the corresponding dot product in a feature space. The operator T_k has the kernel k associated with it. Statement 2 in theorem 2.10 presents the conditions for a valid map and shows the equivalence of k(x,x') with the dot product in a particular feature space (which can be finite or infinite dimensional). The actual mappings can be seen in terms of normalized orthogonal eigenfunctions of the operator T_k (equation 2.40).
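- For reference, the expansion behind this equivalence in its standard (Mercer) form, with \psi_j and \lambda_j the eigenfunctions and eigenvalues of the operator T_k:
 k(x, x') = \sum_j \lambda_j \, \psi_j(x) \, \psi_j(x'), \qquad \phi(x) = \big(\sqrt{\lambda_j}\, \psi_j(x)\big)_j, \qquad \langle \phi(x), \phi(x') \rangle = k(x, x').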
Sidd
- In the tutorial "From Zero to Reproducing...", the Riesz representation theorem says that a unique "vector u" exists such that φ(f) = < f, u >. Is the vector u in this case actually a function, like f is? If not, how do we define a dot-product between f (which is a function) and u (which is a vector)?
Amit
- Mercer's Theorem 2.10 doesn't make sense to me either.
- (Piyush): See the answer to Chang's question above.
Sam
- I got lost in these two papers. So here is a whole bunch of questions.
0 -> RKHS :
- 1. In Section 5.1 there is an example of how the dot product induces the 2-norm. Does a dot product always induce some norm? Can we define a Hilbert space with a norm different from the norm induced by the dot product, e.g. use the l_1 norm together with the standard l_2 dot product?
- (Sam): Yes and yes. I was told, though, that it would be weird and make no sense. Why does the induced norm make sense?
- 2. Notation in Section 6.2: why is the Dirac functional written as an element of the Hilbert space? The superscript p, I assume, denotes a power and not a derivative? Why is the first norm an element of R; are there some brackets missing? What is the meaning of this whole definition?
- (Sam): It's not an element of the Hilbert space. It shows that we can represent φ(f), where φ is a bounded functional defined on the Hilbert space, by some < f,u >. I understand that more or less now, but what can we use it for? Where is the connection to the kernel trick?
- 3. Section 6.4. Why are we starting from a universe and not from a field? What is a dot product on a universe? Don't we need the properties of a field to define a dot product?
Kernel Tutorial :
- 1. Definition 2.4, what does this tell us about K?
- (Sam): It's a notion of positiveness for matrices (or, I assume, integral operators), similar to the notion of a positive scalar.
- 2. I don't understand the connection between the integral operators and the kernel used here (Formula 2.16)?
- 3. Formula 2.22: How is it possible to describe a possibly infinite dimensional vector space using only finite linear combinations?
- (Sam): Formula 2.22 can be written in terms of integrals and everything else will still hold.
30 Jan 08
Sidd
- (a) Hmmm... I probably missed the core idea presented in the paper. I thought we had already established that any algorithm that can be represented as a function of dot-products in a feature space (which is a Hilbert space) can be kernelized. Is this paper formalizing that idea?
- (Suresh): No. the ability to kernelize is not enough to ensure that the space being minimized over is finite dimensional. you need the representer theorem to reduce dimensionality.
- (b) I don't understand the role of the cost function (18) in the derivation. Does that correspond to the "learning" algorithm (i.e. learning of weights in a perceptron)?
- (Suresh): I think so. Basically the cost function measures the gap between what the learning algorithm should be doing and what it is doing.
- (c) What do they mean by "admits a representation of the form" (19)?
- (Suresh): That's the conclusion of the theorem: it says that any function minimizing cost can be written in that form.
Sam
- In the paper here training points get mapped to functions right? It wasn't so clear to me in the previous papers.
- (Suresh): correct.
- What is meant by hard constraints? And why are hard constraints included by allowing the cost function to take on infinity?
- (Suresh): a hard constraint is one that MUST be satisfied. Allowing the cost function to take the value infinity means that we can penalize SEVERELY any violation of a hard constraint. In other words, any solution that violates a hard constraint will have cost infinity, and all solutions that satisfy the hard constraints will have finite cost, and thus a cost-minimizing solution will be guaranteed to have finite cost if the hard constraints admit a feasible solution.
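- A toy sketch of the "infinite cost encodes a hard constraint" point; the constraint y_i * f(x_i) >= 1 and the hinge-style soft alternative are arbitrary illustrations, not the paper's notation.
 import math
 
 def cost(predictions, labels, hard=True):
     total = 0.0
     for p, y in zip(predictions, labels):
         if hard:
             if y * p < 1:
                 return math.inf            # any violation makes the cost infinite
         else:
             total += max(0.0, 1 - y * p)   # soft version: pay a finite penalty instead
     return total
 
 print(cost([1.2, -1.1], [1, -1]))   # hard constraints satisfied: finite cost
 print(cost([0.5, -1.1], [1, -1]))   # hard constraint violated: inf, so never a minimizer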
- At the end of page 5 he refers to the monotonicity of the cost functions and jumps to all sorts of conclusions in this paragraph that I can't follow. Convexity of g implies no local minima? Non-strict monotonicity implies that at least one minimizer admits the expansion?
- (Suresh): Monotonicity of g is needed to make the argument that the "orthogonal component" doesn't matter. Further, if g is convex, as well as the loss function, then the entire objective is convex and has a single local minimum that's also global. If g is not strictly monotone, then there's a "space" of solutions at the bottom of the landscape. However, this space will have a finite dimensional boundary that can be used to find a solution of the same cost.
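- The orthogonal-component argument in symbols (the standard proof sketch, not quoted from the paper): decompose
 f = \sum_i \alpha_i \, k(\cdot, x_i) + f_\perp, \qquad \langle f_\perp, k(\cdot, x_i) \rangle = 0 \ \ \forall i.
- Then f(x_j) = \langle f, k(\cdot, x_j) \rangle is unaffected by f_\perp, so the loss term doesn't change, while \|f\|^2 = \|\sum_i \alpha_i k(\cdot, x_i)\|^2 + \|f_\perp\|^2, so a monotone g(\|f\|) can only increase when f_\perp is present; dropping f_\perp never hurts, which is why some minimizer of the stated form always exists.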
- Just as a remark: This paper answers my question from last meeting about why in the kernel tutorial a finite linear combination of functions was used as the vector space. The class of interesting functions for a learning problem is contained in this space.
Piyush
- What advantages does biased regularization offer? From an SVM context, does it mean we can regularize the bias term b?
- (Suresh): I imagine that it allows us to "bias" the desirable space of functions, by saying that we need to be as close to some f_0 as possible, within the constraints. I'm not sure about the SVM issue.
- Do hard constraints mean that the penalty incurred when f(.) deviates from the "true" prediction is considered very high? Or is it that we don't allow any slackness while learning?
- (Suresh): see my answer to Sam above.
Nathan
- How can we get by with an arbitrary cost function in Theorem 4? It seems to me that one could come up with some cost function that wouldn't minimize the risk in any useful way.
- (Suresh): sure you can. but the representer theorem doesn't care about the "meaning" of the solution per se. it merely says that the solution is low dimensional.
- I don't understand how hard constraints are included by taking R ∪ {∞} as the range of the cost function?
- (Suresh): see my answer to Sam above.
Chang
- How does the loss function measure the goodness of functions for classification?
- What guarantees that we always get optimal algorithms? How do we choose an algorithm among many local minima?
Avishek
- I couldn't understand Example 9 on Bayesian MAP Estimates.
- (Suresh): what they're saying is that for a MAP problem, you want to MAximize the Posterior probability, which for some problems is written as exp(-c...) for the loss term and exp(-g(|f|)) for the regularizer. Maximizing the probability is equivalent to minimizing the negative log posterior, which recovers the original formulation.
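- Written out, this is the usual MAP-to-regularization identity: if p(data | f) \propto \exp(-\sum_i c(x_i, y_i, f(x_i))) and p(f) \propto \exp(-g(\|f\|)), then
 \hat{f}_{\mathrm{MAP}} = \arg\max_f \, p(f \mid \mathrm{data}) = \arg\min_f \Big[ \sum_i c(x_i, y_i, f(x_i)) + g(\|f\|) \Big],
- which is exactly the regularized risk functional that the representer theorem applies to.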
Amit
- Why is the function space infinite-dimensional, and how does the representer theorem help us?
Arvind
- I am having a hard time getting any intuition out of Theorem 4, especially eq. 16.
6 Feb 08
Suresh
- I don't quite get the intuition behind the definition for the string kernel, even after reading the Haussler paper. Here's my question. In Haussler's paper, he derives conditions under which the string kernel induces the Hamming distance between strings. Is there any way in which one can derive a meaningful distance function from the proposed kernel ?
- It is interesting in the experiments that they conclude that "small to moderate substring sizes" are the best for F1. Since small to moderate in their case is 4-7, doesn't this negate the original claim that this approach allows them to extend context beyond what standard k-gram based methods can do ?
- Also, they show that F1 is best when λ is small. This again appears to point to the uselessness of longer substrings, because for small λ, those strings will almost never get incorporated.
Nathan
- I'm assuming that the std word kernel described in the Lodhi paper is pretty close to cosine similarity based approaches?
- Why should I use string kernels in X type of text processing? Are there X's for which string kernels will perform better than others?
Piyush
- Can string kernels be equally useful for other kinds of sequence data such as protein sequences? Or do people use different types of kernels based on the notion of similarity in a given context -- local alignment strategies for protein sequences, for example?
- The parameter λ is said to penalize the subsequences that have interior gaps in them. Why do they set λ < 1? Is it because, with λ < 1, large subsequence lengths (spans) for a given pair yield a small similarity value K?
- (Amit): λ accounts for the interior gaps. They pick λ < 1 to give less weight to non-contiguous subsequences.
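- A brute-force sketch of the gap-weighted subsequence kernel on tiny strings (the paper uses a dynamic program; this enumeration is only feasible for toy inputs, and k=2, lam=0.5, and the unnormalized form are arbitrary choices). It shows how lam < 1 down-weights non-contiguous matches via the span length.
 from itertools import combinations
 from collections import defaultdict
 
 def subseq_features(s, k, lam):
     feats = defaultdict(float)
     for idx in combinations(range(len(s)), k):
         u = "".join(s[i] for i in idx)
         span = idx[-1] - idx[0] + 1      # length of the match including interior gaps
         feats[u] += lam ** span
     return feats
 
 def ssk(s, t, k=2, lam=0.5):
     fs, ft = subseq_features(s, k, lam), subseq_features(t, k, lam)
     return sum(v * ft[u] for u, v in fs.items())
 
 print(ssk("cat", "cat"))    # contiguous matches dominate
 print(ssk("cat", "cart"))   # matches with interior gaps in "cart" pick up extra factors of lam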
Arvind
- In section 2, last line of the first paragraph, the paper says that "it is possible to work in very high dimensional spaces such as those induced by kernels without overfitting". Questions I have:
- 1) If overfitting means classifying all training points correctly so as to leave little room for generalization, then increasing the dimension should cause overfitting in the same manner as increasing the complexity of the function. I have never seen this concept before, or maybe this point (overfitting through dimension) is just not as emphasized. Is there any reason for that?
- 2) Why do kernels not cause overfitting?
Avishek
- SSK measures similarity by matching string subsequences in documents. It has been claimed that SSKs can capture semantic information of documents. I do not understand how this is possible.
- 1. Suppose two different articles are reporting some criminal incident and each of them uses a different set of words to convey the same meaning, e.g., homicide/murder, cops/police, residents/inhabitants. Can the SSK match these documents well?
- 2. Let's assume that we are trying to match a very large document and its summary. Now, most of the string subsequences in the original document will have a zero similarity score, but all the string subsequences in the summary will have a high similarity score. So SimilarityScore_(Summ->Doc) \neq SimilarityScore_(Doc->Summ). How will the SSK take care of such a situation?
Sidd
- So, it looks like they are showing that an SVM with substring similarity performs as well as an SVM trained with word n-gram frequencies. Is there any work that shows whether this captures any syntactic and semantic properties of text?
Evrard
- What is the accuracy level of this method on finding passages in documents, such as if used within a search engine?
- (Evrard): It is not suitable for web sites or other dynamic information. It is mostly used in genome sequencing algorithms.
Chang
- 1) Why does Definition 1 need to include the empty sequence in a string?
- 2) In section 5, to get (3), we need the Φ(s_i) to be orthogonal. However, to get (4), we need the (s_i) to be not fully orthogonal. I am confused about the relationship between these.
13 Feb 08
Nathan
- The authors claim: the kernel methods we propose can, arguably, be motivated and derived only through the geometry of statistical manifolds. I buy it for the kernel methods, but are they themselves going to give us results that cannot be achieved through other means?
Avishek
- From Page 9 of the paper:
- "While geodesic distances are difficult to compute in general, in the case of the multinomial information geometry we can easily compute the geodesics by observing that the standard Euclidean metric on the surface of the positive n-sphere is the pull-back of the Fisher information metric on the simplex. This relationship is suggested by the form of the Fisher information given in equation (10)"
- I didn't understand this paragraph. What is an n-sphere? What does "pull-back of the Fisher information metric on the simplex" mean? How does Eq. 10 imply this relationship?
Piyush
- In section 3.2 (for the Gaussian/spherical geometry case), it's shown that the Hellinger distance (d_H) approaches the information distance (d) under a certain limiting condition. Is this condition implied by the sparse statistics of typical text problems? In other words, does this condition mean the points lie near the boundary of the simplex?
- The paper says that the spherical geometry can also be used non-parametrically (does it mean, for example, that we won't necessarily have an n-sphere for an n-simplex?) for other probability measures leading to a spherical geometry in a Hilbert space. Things work when they transform the n-simplex to the n-sphere and assume a diffeomorphic transformation F. But can we find such a transformation if we think of a spherical geometry in a Hilbert space? Also, pull-back, from what I gathered on its wikipedia entry is a linear map. Will we have a linear map for this general case as well?
- The last para of 3.2.1 mentions the "asymptotic expansion in (8)". Isn't this expansion the un-numbered equation above it for the leading order correction term ψ0? Equation 8 on page 143 seems to be something else (or am I mistaken?).
- Which equation precisely is the "equation (10)" that they mention while suggesting the relationship between the Euclidean metric on the sphere surface and the pull-back of the Fisher information metric on the simplex (second to last para on page 137)? I guess it's one of the two equations in Appendix B on page 159. Which of the two do they mean (both are un-numbered :P)?
Sidd
- One of the points made in this paper (I think) is that the kernel depends on our intuition regarding the distribution of the data. Are there ways to automatically get an approximate notion of the distribution of any given data set?
Amit
- Why do we consider the Fisher information as a Riemannian metric? Can we have some other metric with which we can express the Fisher information better?
Chang
- Regarding the first sentence of 2.1.1, "For most geometries...": could you provide some examples to explain it? For the heat equation there is only one bounded solution (the initial model turns into a Cauchy problem), so what does it mean that 'there is no closed form solution'?
- In addition, for a heat diffusion problem, how about using a Fourier series to solve for the kernel within a period? Or do we use the parametrix expansion just because of the short time?
Evrard
- Since geometry in general is not confined to 2D representations, can this kernel be adapted to input data with some relationship that can be expressed in a multi-dimensional geometry of statistical families?
22 Feb 08
Nathan
- The authors imply that a kernel does not have to be positive definite; I thought this was required?
Piyush
- If a kernel K is positive definite symmetric then -K is negative definite symmetric. I think just flipping the sign of K in the defining condition \sum_{i,j} c_i c_j K(x_i, x_j) >= 0 should let us see this (or does it entail something more?). The converse, however, doesn't hold in general (section 5.4). Why so?
Amit
- The similarity functions defined by weighted FSTs are not guaranteed to be positive definite and symmetric. Why is that so? How do we check whether these functions are PDS?
Chang
- The transitions are in a cycle when p[π] = n[π]. What does the statement that a weighted automaton or transducer is acyclic mean? If a cyclic transducer leads to an infinite sum, how does that happen?
27 Feb 08
Avishek
I don't know much about the Fisher information matrix, but what's the intuition/motivation for using this particular matrix here?
12 Mar 08
Nathan
- For hyperkernels constructed with the power series method, what does the radius of convergence R tell us?
- (Chang): It means that within the radius R, the Taylor series is convergent and equal to the value of g.
Piyush
- Doesn't the representer theorem hold for any choice of the regularizer function (it should just be a monotonically increasing function)? If so, then for the hyper-kernel case, why do they only consider the l2 norm of the kernel function as the regularization term?
(Chang): The Hilbert space here is a space of square-integrable functions, and the regularization term is based on its inner product, so I think it should be the l2 norm.
- The hyper-kernel method makes the kernel function k itself belong to an RKHS with an additional requirement (the symmetry). I wonder what would limit us to go even further up with such a hierarchy (i.e. think of hyper-hyper-...hyper-kernels). Do we restrict ourselves to one level (i.e. just hyper-kernel) just because of the additional constraints that would need to be imposed in the definition of each new RKHS? Or is there some other practical/theoretical basis for this?
(Chang): I think the hyper-kernel Hilbert space should have properties similar to the kernel Hilbert space. The basic element of the kernel Hilbert space is a function f, and the basic element of the hyper-kernel Hilbert space is a kernel function over such f's. I agree with you to some degree: in this hierarchy we have underlying limitations for the hyper case, because the basic element at the hyper level is a kernel function, not an ordinary function.
28 Mar 08
Piyush
The algorithms presented in the paper have running times linear in n and linear in c (the number of constraints). The constraints they use are distance constraints. Will the number c always be sublinear/linear/better-than-quadratic in the number of points? Can't there be cases having O(n^2) constraints (the number of point pairs)? Given that the projection step alone takes O(c r^2) time, for a reasonable value of r (e.g. O(n)), can't the running time again get close to (or more than) O(n^3)?
Evrard
What is meant by scalability of the kernel method in machine learning?
18 Apr 08
Piyush
Can there be cases when neither NML nor NMAP exist? Or is it the case that NMAP would always exist since the regret is penalized in this case and hence minimax (penalized) regret would never go to infinity?
Have people tried analyzing similar bounds for losses other than log loss?
30 Apr 08
Nathan
Is there any reconciliation between this work and other standard feature selection techniques? How do they compare in practice?
- (Piyush) I don't think there is. I think the reason behind using kernels for feature generation is different from why we do feature selection in general. In standard feature selection, we typically want to get rid of features that are irrelevant from the learning perspective. Kernel-generated features, on the other hand, give us the power of kernels without actually having to pay for it (i.e. without actually going to a high dimensional feature space and without kernelizing an algorithm). Besides, here we aren't actually getting a *smaller* number of features than what the original data has: feature generation using kernels gives us a small (but still reasonably big) set of features so as to be as expressive as if we used the associated kernel in the standard way.
Chang
For this paper, could I understand it like this: the kernel functions help map the data space to a feature space which preserves the linear separation property, and some constraints determine the number of dimensions in the feature space.
- (Piyush) I think a way to think of such mappings would be the following two-step process: first, a kernel function maps the data to some high dimensional feature space. Then, using random projections, the kernel-induced feature space representations (which are implicit) are again mapped to some lower dimensional feature space (where the features are actually explicit). Using the small number of features thus obtained is approximately as good as using the kernel in the standard manner. The advantage is that you don't need the ability to kernelize your algorithm and yet you obtain the same expressive power as the kernel. (A concrete sketch of this two-step picture is at the end of this thread.)
>The map is from X to R^d. If we do some binary classification in the feature space, could we invert the map from R^d back to X? I mean, if the map cannot be inverted, how could we know the class of the data in the set X?
- (Piyush) You don't actually need an inverse map. Note that the class labelings are consistent no matter what space you live in. There is no mapping of labels.
>This situation would happen when the mapping matrix is nearly singular.
- (Piyush) There exist pseudo inverses to take care of such situations but as I said you don't actually need such an inverse map.
>How do we compare the computational cost of classifying in the feature space versus in the data space, since the mapping also needs some computation?
- (Piyush) Again, the mappings are implicit: you don't have to actually compute them. Since the inputs appear only as dot products, in the kernelized version they would appear as φ(x)·φ(y), which is the same as K(x,y). Evaluating the kernel is assumed to be a constant-time operation.
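- One concrete instance of the two-step picture mentioned above (a sketch; not necessarily the construction used in the paper): random Fourier features give an explicit low-dimensional map z(x) whose ordinary dot products approximate an RBF kernel, so a linear algorithm run on z(x) behaves roughly like the kernelized algorithm.
 import numpy as np
 
 rng = np.random.default_rng(0)
 d_in, d_out = 5, 500
 W = rng.normal(size=(d_out, d_in))            # frequencies drawn for the RBF kernel exp(-||x-y||^2 / 2)
 b = rng.uniform(0, 2 * np.pi, size=d_out)
 
 def z(x):
     return np.sqrt(2.0 / d_out) * np.cos(W @ x + b)
 
 x, y = rng.normal(size=d_in), rng.normal(size=d_in)
 print(z(x) @ z(y))                            # approximate kernel value
 print(np.exp(-0.5 * np.sum((x - y) ** 2)))    # exact kernel value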