= Machine Learning Reading Group =

Informal reading group at [http://www.cs.utah.edu Utah] on machine learning topics (led by [http://hal3.name Hal]). Papers, discussion, etc. will be posted here.

'''Time:''' Wed, Mon, or Fri 3:40-5:00pm (see the schedule for each week)

'''Topic:''' Everything Kernels

'''Location:''' TBA (probably MEB Graphics Annex, again)

==Schedule==

{| border="1" style="width: 100%; text-align:left" class="content"
|-
! Date || Papers || Presenter
|-
| colspan="3" bgcolor="#dddddd" style="text-align:center" | '''Introduction, Motivation and Optimization'''
|-
| W 16 Jan
|| '''Introduction to Kernel Methods: SVMs and Kernel PCA'''
<i>[http://www.learning-with-kernels.org/sections Learning with Kernels], Chapter 1 (focus on sec 1, 4, 5, 7)</i>
|| Piyush
|-
| W 23 Jan
|| '''Reproducing Kernel Hilbert Spaces'''
<i>[http://www.learning-with-kernels.org/sections Learning with Kernels], Chapter 2 (focus on sec 2)</i><br/>
<i>[http://www.cs.utah.edu/~hal/publications.html#daume04rkhs From Zero to RKHS] (focus on sec 6)</i>
|| Avishek
|-
| W 30 Jan
|| '''Kernel-based learning and representer theorems'''
<i>[http://www.cs.utah.edu/~hal/tmp/P139.pdf A Generalized Representer Theorem]</i>
|| Suresh
|-
| colspan="3" bgcolor="#dddddd" style="text-align:center" | '''Kernels for Specific Types of Data'''
|-
| W 6 Feb
|| '''String and Subsequence Kernels'''
<i>[http://jmlr.csail.mit.edu/papers/volume2/lodhi02a/lodhi02a.pdf Text Classification using String Kernels]</i>
<i>[http://axiom.anu.edu.au/~vishy/papers/VisSmo02.pdf Fast Kernels for String and Tree Matching]</i>
|| Amit
|-
| W 13 Feb
|| '''Heat Kernels on Graphs'''
<i>[http://www.jmlr.org/papers/volume6/lafferty05a/lafferty05a.pdf Diffusion kernels on statistical manifolds]</i> (also: [http://www.stat.purdue.edu/%7Elebanon/papers/idk.pdf shorter version])
|| Arvind
|-
| F 22 Feb
|| '''Rational Kernels'''
<i>[http://www.jmlr.org/papers/volume5/cortes04a/cortes04a.pdf Rational Kernels: Theory and Algorithms]</i>
|| Sid
|-
| W 27 Feb
|| '''Fisher Kernels and Finite State Automata'''
<i>[http://www.ecs.soton.ac.uk/~cjs/publications/FSA_NIPS02.pdf String Kernels, Fisher Kernels and Finite State Automata]</i>
|| Amit
|-
| colspan="3" bgcolor="#dddddd" style="text-align:center" | '''Learning the Kernel'''
|-
| F 7 Mar
|| '''Semi-definite programming methods'''
<i>[http://jmlr.csail.mit.edu/papers/volume5/lanckriet04a/lanckriet04a.pdf Learning the kernel matrix with SDP]</i>
|| Piyush
|-
| W 12 Mar
|| '''Hyperkernel methods'''
<i>[http://jmlr.csail.mit.edu/papers/volume6/ong05a/ong05a.pdf Learning the Kernel with Hyperkernels]</i>
|| Chang
|-
| F 28 Mar
|| '''Rank-constrained methods'''
<i>[http://www.icml2006.org/icml_documents/camera-ready/064_Learning_Low_Rank_Ke.pdf Learning Low-Rank Kernel Matrices]</i>
|| Nathan
|-
| colspan="3" bgcolor="#dddddd" style="text-align:center" | '''Kernel Theory'''
|-
| M 31 Mar
|| '''Kernels that are always good'''
<i>[http://jmlr.csail.mit.edu/papers/volume7/micchelli06a/micchelli06a.pdf Universal kernels]</i>
|| Chang
|-
| W 9 Apr
|| '''RBF and consistency'''
<i>[http://books.nips.cc/papers/files/nips18/NIPS2005_0170.pdf Worst-case bounds for GPs]</i>
|| Evrard
|-
| W 16 Apr
|| '''Kernel random projections'''
<i>[http://citeseer.ist.psu.edu/751943.html Kernels as features]</i>
|| Piyush
|-
| W 23 Apr
|| TBA
||
|}

==Paper Summaries==

=== 16 Jan: Introduction to Kernel Methods (Piyush) ===

Many learning problems require a similarity measure between input patterns. The dot (inner) product <math>\langle \textbf{x},\textbf{x}' \rangle</math> is one of the most commonly used similarity measures, and it is particularly attractive because of its computational simplicity and geometric interpretation. However, for many interesting problems the patterns do not live in a dot product space, so in order to use the dot product as a similarity measure we first need to map the patterns into some dot product space. Moreover, even when the patterns already live in a dot product space, we often want an alternate representation (e.g., a nonlinear map) that is better suited to the problem at hand. Kernel methods let us achieve both of these efficiently.

Let <math>X</math> be the input space and <math>\phi</math> a mapping of patterns from <math>X</math> to a (usually high-dimensional) dot product space <math>H</math>, also called the <i>feature space</i>:

<math>\phi: X \rightarrow H</math>

<math>\textbf{x} := \phi(x)</math>

Using a kernel function <math>k</math>, we can define a similarity measure between patterns in the input space <math>X</math> via dot products in the feature space <math>H</math>:

<math> k(x,x') := \langle \textbf{x},\textbf{x}' \rangle = \langle \phi(x),\phi(x') \rangle </math>

The nice thing about the kernel function <math>k</math> is that it lets us compute inner products in <math>H</math> without carrying out the explicit mapping <math>\phi(x)</math> of the input patterns from <math>X</math> to <math>H</math>.

<b>Hyperplane learning in a dot product space:</b> Hyperplane learning in a dot product space <math>H</math> is a useful example for understanding the utility of kernels. Consider a hyperplane classifier in <math>H</math>:

<math> \langle \textbf{w},\textbf{x} \rangle + b = 0 </math> where <math> \textbf{w} \in H, b \in \mathbb{R} </math>

The corresponding decision function for a new test pattern <math>\textbf{x}</math> is <math>f(\textbf{x}) = \operatorname{sgn}(\langle \textbf{w},\textbf{x} \rangle + b)</math>, the value of <math>f(\textbf{x})</math> indicating which side of the hyperplane the pattern lies on. Among the many possible separating hyperplanes for a problem, the one with the maximum margin of separation between training points is called the optimal hyperplane. Geometrically, the margin to the closest training point on either side of the hyperplane is <math>\frac{1}{\|\textbf{w}\|}</math>. The problem of finding the maximum margin hyperplane can therefore be formulated as a quadratic program:

<math> \min_{\textbf{w} \in H, b \in \mathbb{R}} \ \ \tau(\textbf{w}) = \frac{1}{2}\| \textbf{w} \|^2 </math>
subject to <math> y_i(\langle \textbf{w},\textbf{x}_i \rangle + b) \geq 1 </math> for all <math> i=1,\dots,m </math> training points.

This constrained optimization problem can be written as a Lagrangian (the primal problem):

<math> L(\textbf{w},b,\alpha) = \frac{1}{2} \|\textbf{w}\|^2 - \sum_{i=1}^m \alpha_i( y_i(\langle \textbf{w},\textbf{x}_i \rangle + b) - 1)</math>

Minimizing the Lagrangian with respect to the primal variables <math>\textbf{w}</math> and <math>b</math> leads to

<math> \sum_{i=1}^m \alpha_i y_i = 0 </math> and <math> \textbf{w} = \sum_{i=1}^m \alpha_i y_i\textbf{x}_i </math>

In practice, we usually solve the dual problem (for reasons that will become apparent shortly) in terms of the Lagrange multipliers <math>\alpha</math>:

<math> \max_{\alpha \in \mathbb{R}^m} \ \ \ W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle \textbf{x}_i,\textbf{x}_j \rangle </math>

subject to <math> \alpha_i \geq 0 </math> for all <math>i=1,\dots,m</math> and <math> \sum_{i=1}^m \alpha_i y_i = 0 </math>

The decision function for a test pattern <math>\textbf{x}</math> can then be written as

<math> f(\textbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^m y_i\alpha_i \langle \textbf{x},\textbf{x}_i \rangle + b\right) </math>

where <math>b</math> can be found by exploiting the constraints.

<b>The kernel trick for hyperplane learning in an arbitrary space:</b> To apply the above technique to patterns in <math>X</math> (which need not be a dot product space), we employ the kernel trick, which lets us avoid explicitly computing the mapping <math>\phi</math>:

<math> \langle \textbf{x},\textbf{x}' \rangle = \langle \phi(x),\phi(x') \rangle = k(x,x') </math>

The dual problem above can now be written entirely in terms of the kernel function:

<math> \max_{\alpha \in \mathbb{R}^m} W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i,x_j)</math>

subject to <math> \alpha_i \geq 0 </math> for all <math>i=1,\dots,m</math> and <math>\sum_{i=1}^m \alpha_i y_i = 0 </math>

And the decision function becomes:

<math> f(x) = \operatorname{sgn}\left(\sum_{i=1}^m y_i\alpha_i \langle \phi(x),\phi(x_i) \rangle + b\right) = \operatorname{sgn}\left(\sum_{i=1}^m y_i\alpha_i k(x,x_i) + b\right) </math>

Thus, even when the input patterns do not lie in a dot product space, kernels give us a way to map them into a high-dimensional feature space (endowed with a dot product) and work there instead. As mentioned earlier, such mappings are also useful for obtaining alternate representations of the input: for example, a set of points that is not linearly separable in the original space can be mapped to a feature space where it becomes linearly separable, so an SVM (essentially a linear classifier) can be applied in the new feature space.

We were able to employ the kernel trick for hyperplane classifiers (such as the SVM) because the problem could be written in terms of inner products between the input patterns. The same technique applies to any problem that can be formulated in terms of inner products, and nonlinear generalizations can be obtained for all such learning problems. PCA (a linear dimensionality reduction technique), for example, works with the covariance matrix of the input patterns, which is expressed in terms of dot products; this makes it easy to derive its nonlinear generalization, kernel PCA.
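As a concrete (if tiny) illustration of computing inner products in <math>H</math> without forming <math>\phi</math> explicitly, here is a minimal numeric sketch for the homogeneous degree-2 polynomial kernel on <math>\mathbb{R}^2</math>; the feature map and test points below are illustrative, not taken from the chapter.

<pre>
# Minimal check of the kernel trick for the degree-2 polynomial kernel on R^2:
# the explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) satisfies
# <phi(x), phi(x')> = <x, x'>^2, so k can be evaluated without ever forming phi.
import numpy as np

def phi(x):
    """Explicit map into the 3-dimensional feature space H."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, xp):
    """The same quantity computed directly in the input space."""
    return float(x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp), k(x, xp))   # both print 1.0
</pre>
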
=== 23 Jan: Reproducing Kernel Hilbert Spaces (Avishek) ===

The last talk familiarized us with the concept of kernels as similarity measures. We learned that it is often beneficial to map a data separation problem from the input space to a new feature space in which the mapped data is linearly separable and hence easier to classify, and that the way to achieve this is the "kernel trick": applying a kernel in the input space is equivalent to computing dot products in the mapped feature space. This week's talk delves deeper into the theory of kernels and tries to answer questions such as: (a) what kinds of functions can give rise to kernels? (b) given some kernel, is it possible to come up with a feature space amenable to the "kernel trick"? (c) how can we construct a feature space for an arbitrary kernel such that the mapping properties from input to feature space are preserved? (d) starting with some feature space, can we construct a kernel for it?

Section 2 of the first reading provides some basic definitions for kernel functions and then presents the construction of feature spaces for them. The next two sections cover Hilbert spaces for reproducing kernels, Mercer kernels, and the Mercer map. The other tutorial starts with an introductory refresher on the essential underlying mathematical concepts, and then walks through the fundamentals of reproducing kernels, i.e., the properties a Hilbert space must satisfy for a reproducing kernel to exist in it. It also delineates the construction of reproducing kernel Hilbert spaces (RKHSs) and describes two approaches for building feature spaces for reproducing kernels.
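To make the abstract machinery a bit more concrete, here is a small numeric sketch, assuming a Gaussian kernel and functions given by finite kernel expansions: <math>f = \sum_i \alpha_i k(\cdot, x_i)</math>, the RKHS inner product is computed from the Gram values, and taking the second argument to be a kernel section <math>k(\cdot, x)</math> recovers the reproducing property <math>\langle f, k(\cdot,x)\rangle = f(x)</math>. The kernel width, centers, and coefficients are made up.

<pre>
# Reproducing property in an RKHS with the Gaussian kernel.
import numpy as np

def k(x, y, gamma=0.5):
    return np.exp(-gamma * (x - y)**2)

# f represented by its centers and expansion coefficients
centers = np.array([-1.0, 0.0, 2.0])
alpha   = np.array([0.7, -0.2, 1.5])

def inner(centers_f, coef_f, centers_g, coef_g):
    """RKHS inner product of two finite kernel expansions."""
    G = k(centers_f[:, None], centers_g[None, :])
    return coef_f @ G @ coef_g

x = 0.3
f_at_x = alpha @ k(centers, x)                                   # pointwise evaluation of f
repro  = inner(centers, alpha, np.array([x]), np.array([1.0]))   # <f, k(., x)>
print(f_at_x, repro)   # identical up to floating point
</pre>
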
=== 30 Jan: The Representer Theorem (Suresh) ===

To understand the power and value of the representer theorem, we need to start with the idea of regularization, for which [http://web.mit.edu/9.520/www/Classes/class02.pdf these notes] are very helpful.

In a basic prediction setup, you are given pairs <math>(x_i, y_i)</math>, and the goal is to train a function <math>f</math> such that <math>f(x_i) = y_i</math>. In general this will not be exactly achievable, so we define a loss function <math>V</math> and measure the average cost
<math>\frac{1}{m}\sum_{i=1}^m V(f(x_i), y_i)</math>.

Examples:
* <math>V(x,y) = \|x - y\|_2^2</math>
* <math>V(x,y) = \|x - y\|_1</math>
* <math>V(x,y) = (x-y-\epsilon)_+</math> (hinge loss)

How do we choose <math>f</math>? If we don't restrict <math>f</math>, we can always solve the problem exactly (the problem is ill-posed), but then overfitting can happen. We want to restrict the solutions to "stable" ones: solutions that exist and that vary smoothly as the data points change (changing one data point should not change the function drastically).

This is the idea of [http://en.wikipedia.org/wiki/Tikhonov_regularization (Tikhonov) regularization]. In general, we define some measure of stability (usually a norm) over the function space and solve a modified optimization problem of the form

<math>\min_f \frac{1}{m}\sum_{i=1}^m V(f(x_i), y_i) + \gamma \|f\|^2</math>

But now we have to optimize over functions in the function space, which is an infinite-dimensional space!

This is where the representer theorem comes to the rescue. In plain English, it says:

'''''"If functions are drawn from an RKHS, then the space of functions that minimize the cost functional has dimension equal to the number of training points."'''''

Assume the smoothness penalty on <math>f</math> is represented by some <math>g(\|f\|)</math>. Note that for the special case <math>g = \|f\|^2</math> with <math>\ell_2</math> loss, this problem was solved earlier using a spline basis.

Proof sketch:

We can write down the kernel functions generated by each data point: <math>x \rightarrow k(\cdot, x)</math>. This set of functions spans a subspace of dimension (at most) <math>m</math>.

Now any function in the function space can be decomposed into a component in this subspace and a component orthogonal to it. By the magic of the kernel trick, the orthogonal component disappears from the loss term. Also, because <math>g</math> is assumed to be strictly monotone, we can split the norm of <math>f</math> into subspace and orthogonal terms, and <math>g</math> can only decrease by eliminating the orthogonal terms. Therefore any minimizing <math>f</math> is a function solely of the training data.

Message: if we use kernels, the intrinsic dimensionality of a learning problem is only the size of the training set.
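A short sketch of the representer theorem at work, using kernel ridge regression (Tikhonov regularization with squared loss and a Gaussian kernel): the minimizer has the form <math>f = \sum_{i=1}^m c_i k(\cdot, x_i)</math>, so the infinite-dimensional problem collapses to an <math>m \times m</math> linear system. The data, bandwidth, and regularization constant are made up, and the <math>\gamma m I</math> term assumes the <math>\tfrac{1}{m}</math>-scaled loss above.

<pre>
# Kernel ridge regression: the representer theorem reduces the problem to m coefficients.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=40)
y = np.sin(X) + 0.1 * rng.normal(size=40)

def k(a, b, bandwidth=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * bandwidth**2))

gamma = 1e-2
K = k(X, X)
c = np.linalg.solve(K + gamma * len(X) * np.eye(len(X)), y)   # representer coefficients

def f(x_new):
    """The learned function, expressed entirely through the training points."""
    return k(np.atleast_1d(x_new), X) @ c

print(f(1.0), np.sin(1.0))   # prediction vs. the noiseless target
</pre>
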
=== 5 Feb: String Kernels (Amit) ===

''' Motivation '''

The motivation for studying string kernels is their wide applicability: text classification (spam, web spam, categorization), security (network traffic, viruses, trojans), biology (promoter and splice site prediction), and many more.

''' Why String Kernels? '''

The first question that comes to mind is why we need string kernels for, say, a text classification task, and why they should perform better. The question makes sense, since we already have a classical technique for this task: map a document to a high-dimensional feature vector in which each entry represents the presence or absence of a feature. So what is the problem with this method? Looking carefully, one realizes that this approach loses all word-order information, retaining only the frequencies of terms in the document, and that information is crucial for tasks like text classification and biological sequence analysis, where the order of the sequence matters a great deal. Also, to make this approach efficient, we remove stop words and apply stemming, which loses inflectional information. One can argue that English is relatively poor in inflectional morphology, but suppose we are working with agglutinative languages, where most words are formed by joining morphemes together; there it would not be a good idea to drop such information. So how should we represent our features so that we do not lose crucial information? Here kernel methods come to our rescue as an effective alternative to explicit feature extraction: feature extraction is itself a challenging task, and kernels choose good features implicitly (decision trees are another class of learning algorithms that choose their features implicitly). Kernels are also well suited to computing similarity, which is exactly what we need here: a similarity between documents.

''' Overview of String Kernels '''

We will look at the string subsequence kernel (SSK) and its variants: the n-gram kernel, the standard word kernel, spectrum kernels, and the weighted degree kernel.

The general idea for text classification is to define a kernel between two text documents that compares them by means of the substrings they contain: the more substrings they have in common, the more similar they are. Importantly, these substrings need not be contiguous.

''' How to make String Kernels fast? '''

String kernels have an inherent disadvantage: they are slow. The running time of the SSK is O(n|s||t|), where n is the length of the subsequences and |s| and |t| are the document lengths; even this requires dynamic programming and is still quadratic in document length. For larger amounts of training data, the authors propose an approximate method. In the other paper, the authors do some clever preprocessing by sorting the non-zero entries of the two documents beforehand, so the time taken to compute the similarity between two documents is linear in the combined number of non-zero entries.
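For a feel of how such kernels are computed, here is a sketch of the simplest member of the family discussed above, the contiguous k-spectrum (k-gram) kernel. The gap-weighted subsequence kernel (SSK) of the paper additionally allows non-contiguous matches with a decay factor; its dynamic program is not reproduced here. The example documents are made up.

<pre>
# k-spectrum kernel: compare two strings by the length-k contiguous substrings they share.
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """Sum over shared k-grams of the product of their counts in s and t."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())

print(spectrum_kernel("science is organized knowledge",
                      "wisdom is organized life", k=3))
</pre>
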
=== 13 Feb: Heat Kernels (Arvind) ===

This paper discusses using heat kernels for a text classification problem. Like any other kernel, a heat kernel measures the similarity between two points, but in this case the similarity is problem dependent. Heat kernels here are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, which means there is no single heat kernel applicable to all problems: heat kernels depend on the geometry of the data. To use these kernels, we first hypothesize about this geometry (the Riemannian manifold), which is equivalent to hypothesizing about the distribution of the data; once we know (by assumption) the distribution, we can derive the heat kernel for that distribution (or for the Riemannian manifold it defines) and use it. One nice property of heat kernels is that they take us beyond the standard Euclidean metric and let us define the distance between two points based on the actual geometry of the data.
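A small sketch of the closed-form approximation used in the paper for the multinomial family (text modeled as points on the probability simplex): the Fisher-metric geodesic distance is <math>d(\theta,\theta') = 2\arccos\left(\sum_i \sqrt{\theta_i\theta_i'}\right)</math>, and the first-order (parametrix) approximation to the heat kernel is proportional to <math>\exp(-d^2/4t)</math>. The normalizing constant is dropped here and the example distributions are made up.

<pre>
# Unnormalised parametrix approximation of the multinomial diffusion (heat) kernel.
import numpy as np

def multinomial_diffusion_kernel(theta, theta_prime, t=0.1):
    """Heat-kernel approximation between two points on the probability simplex."""
    bc = np.clip(np.sum(np.sqrt(theta * theta_prime)), -1.0, 1.0)  # Bhattacharyya coefficient
    d = 2.0 * np.arccos(bc)                                        # Fisher-metric geodesic distance
    return np.exp(-d**2 / (4.0 * t))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(multinomial_diffusion_kernel(p, q))
</pre>
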
=== 22 Feb: Rational Kernels (Sid) ===

This paper describes a way to create kernels based on ''weighted finite-state transducers'' (weighted FSTs). These kernels are meant for data in the form of sequences of symbols (i.e., strings). The basic idea is that a notion of similarity between symbol sequences can be captured using a weighted FST: the FST has weights associated with states and state transitions, so it is possible to compute a weight for a specific output generated by the FST from a given input. In other words, for a given task, the weight assigned by the FST to a pair of strings can be viewed as the similarity between the two data points. This forms the basis of the work: the creation of kernels for strings using weighted FSTs. The name ''rational'' comes from the "rational relations or rational transductions represented by the weighted transducer".

Because the similarity functions defined by weighted FSTs are not guaranteed to be ''positive definite and symmetric'', they do not always define a kernel. The paper therefore provides some guidelines for constructing ''positive definite symmetric rational kernels'' from weighted FSTs. Finally, it gives some examples of existing kernels that can be reformulated as rational kernels.
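A toy sketch of the construction (not Cortes et al.'s composition algorithm): an epsilon-free weighted transducer over equal-length strings can be represented by one matrix per (input symbol, output symbol) pair, <math>T(x,z)</math> is the total weight of paths that read <math>x</math> and emit <math>z</math>, and the kernel built from <math>T \circ T^{-1}</math>, namely <math>K(x,y) = \sum_z T(x,z)T(y,z)</math>, is positive definite symmetric by construction. The alphabet, number of states, and weights below are made up, and the sum over outputs is done by brute force.

<pre>
# Toy rational kernel from an epsilon-free weighted transducer (brute force over outputs).
import itertools
import numpy as np

ALPHABET = "ab"
NUM_STATES = 2
rng = np.random.default_rng(0)

lam = np.array([1.0, 0.0])   # initial weights
rho = np.array([0.0, 1.0])   # final weights
M = {(a, b): rng.uniform(0.0, 0.5, size=(NUM_STATES, NUM_STATES))
     for a in ALPHABET for b in ALPHABET}   # one transition matrix per (input, output) pair

def transducer_weight(x, z):
    """T(x, z) = lam^T M[x1,z1] ... M[xn,zn] rho (equal-length strings only)."""
    v = lam.copy()
    for a, b in zip(x, z):
        v = v @ M[(a, b)]
    return float(v @ rho)

def rational_kernel(x, y):
    """K(x, y) = sum over output strings z of T(x, z) * T(y, z); PSD by construction."""
    assert len(x) == len(y)
    return sum(transducer_weight(x, z) * transducer_weight(y, z)
               for z in (''.join(t) for t in itertools.product(ALPHABET, repeat=len(x))))

print(rational_kernel("abba", "baba"))
</pre>
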
=== 27 Feb: Fisher Kernels and Finite State Automata (Amit) ===

This paper shows how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be reconstructed. The probabilistic modeling approach suggests extending the Markov process to consider subsequences of varying length, rather than the standard fixed-length approach used in the string kernel. The authors use the Fisher score and Fisher information to construct a Fisher kernel, and then use it to deduce the n-gram and string kernels. They then extend Fisher kernels to finite state machines, giving a more flexible approach that can treat subsequences of different lengths as features. The advantage of this is that we do not need to choose the value of k a priori.
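A toy sketch of the basic Fisher kernel recipe, for the simplest generative model in this line of work, a unigram (multinomial) document model: the Fisher score is the gradient of the document log-likelihood with respect to the parameters, and the kernel is an inner product of scores, here using the common identity approximation to the Fisher information matrix. The corpus, smoothing, and documents are made up.

<pre>
# Toy Fisher kernel under a smoothed unigram model with identity Fisher information.
from collections import Counter
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the log"]
vocab = sorted(set(" ".join(corpus).split()))
counts = Counter(" ".join(corpus).split())
theta = np.array([counts[w] + 1.0 for w in vocab])
theta /= theta.sum()                      # smoothed unigram parameters

def fisher_score(doc):
    """d/d theta_w of log p(doc | theta) = count_w(doc) / theta_w."""
    c = Counter(doc.split())
    return np.array([c[w] for w in vocab]) / theta

def fisher_kernel(d1, d2):
    return float(fisher_score(d1) @ fisher_score(d2))

print(fisher_kernel("the cat sat", "the dog sat"))
</pre>
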
=== 7 Mar: Learning the kernel matrix with SDP (Piyush) ===

Choosing a kernel for a given problem requires some sort of prior knowledge: typically we need to be able to specify the geometry of the feature space, and we need some notion of similarity between input instances in that space. This amounts to a model selection task, which we would generally like to avoid. That begs the question: can we "learn" the kernel automatically? The answer, for certain specific problem settings, is yes. The problem is further simplified by the observation that, in many situations, we are not actually required to learn the kernel function explicitly for the entire input space. For example, in a transductive setting we are given a set of labeled training data and another set of unlabeled test data; instead of learning an explicit labeling function, the task is just to complete the labeling of the unlabeled part. In such a setting, the task reduces to learning just the Gram matrix (i.e., the kernel matrix) associated with the given points (which are all we care about), rather than the kernel function.

'''Semidefinite programming:''' Semidefinite programming deals with optimizing convex functions over the convex cone of symmetric, positive semidefinite (PSD) matrices. A semidefinite <i>program</i> consists of a linear objective, [http://en.wikipedia.org/wiki/Linear_matrix_inequality linear matrix inequality (LMI)] constraints, and affine equality constraints (of the form Ax = b).

'''Learning the kernel matrix with SDP:''' Any symmetric PSD matrix is a valid kernel matrix, and the set of symmetric PSD matrices forms a convex cone. We define a cost function to assess the quality of a kernel matrix, and we specifically look for cost functions that are convex in the kernel matrix K. Under these conditions, we end up with a semidefinite programming problem for learning the kernel matrix.

'''Transforming a convex optimization problem to a semidefinite program:''' It can be shown that learning the kernel matrix requires us to minimize a "generalized performance measure"

<math>\omega_{C,\tau}(K) = \max_{\mathbf{\alpha}} \; 2\mathbf{\alpha}^T\textbf{e} - \mathbf{\alpha}^T(G(K) + \tau I)\mathbf{\alpha} \; : \; C \geq \alpha \geq 0 , \; \alpha^T y = 0 </math>

on the training data, with respect to a kernel matrix K in some convex subset <math>\kappa</math> of the positive semidefinite matrices. The important point is that this objective is convex in K. The expression above is the standard dual formulation of the (general) maximum margin problem, where <math>G_{ij}(K) = [K]_{ij}y_iy_j = k(x_i,x_j)y_iy_j</math> and the other parameters have their usual meanings: <math>\alpha</math> are the Lagrange multipliers, C is the soft margin parameter, <math>\tau \geq 0</math>, and <math>\textbf{e}</math> is the vector of ones. The minimization problem can then be written as

<math>\min_{K \in \kappa}\omega_{C,\tau}(K_{tr})\ \ s.t.\ \ \operatorname{trace}(K) = c</math>

where <math>K_{tr}</math> is the block of the kernel matrix associated with the training data and c is a constant (i.e., the trace is bounded).

This is, in general, a difficult convex optimization problem. However, as we shall see in the talk, problems of this type can be equivalently cast as SDPs (whose constraints are LMIs rather than general convex functions) using tools from standard convex analysis such as Lagrangian duality and the [http://en.wikipedia.org/wiki/Schur_complement Schur complement] lemma. The good news is that there exist techniques for solving the resulting SDP efficiently.

The transformation mentioned above does not by itself account for any entanglement between the training and test blocks of the kernel matrix being learned, which is necessary if we want to do any sort of transduction: only the <math>K_{tr}</math> block of the kernel matrix is being optimized, and it is hoped that the "mixed" block of the matrix (involving both training and test data) will be '''learned''' along with it. The entanglement is achieved by constraining the kernel matrix K to a linear subspace, <math>K = \sum_{i=1}^m \mu_iK_i</math>. This controls the capacity of the search space of possible kernel matrices, prevents overfitting, and gives good generalization on test data; the assumed form of the kernel matrix and the bounded trace(K) condition are what make this work. The matrices <math>K_i</math> essentially serve as "rough" estimates of similarity on the input space (which contains both training and test data). By finding an optimal linear combination of such matrices, we are "fine tuning" our estimate of similarity on the input space.

It turns out that SDP formulations for specific problems such as the hard margin and the 1-norm/2-norm soft margin problems can be recovered from the SDP formulation of the above problem. We can also learn the 2-norm soft-margin parameter C using the SDP.

'''Extending SDP to the inductive setting:''' The standard framework applies to the transductive setting, where we are given both the training and the test data beforehand. To generalize to the inductive setting, where we only have the training data, one would need to solve the transduction problem for each new test point. This would amount to solving a new SDP (of dimension <math>n_{tr}+1</math>) for each new test point, which would be computationally very expensive. A possible solution is a recursive approach that uses the previously solved SDP of dimension <math>n_{tr}</math> to solve the new SDP of dimension <math>n_{tr}+1</math>; such an approach could also be used in an online learning setting.
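A minimal cvxpy sketch of the ''shape'' of the problem: a kernel matrix constrained to the linear span <math>K = \sum_i \mu_i K_i</math>, to the PSD cone, and to a bounded trace. For brevity the objective here is a simple linear alignment of the training block with <math>yy^T</math> rather than the paper's margin-based <math>\omega_{C,\tau}</math> (which requires the dualization and Schur complement steps described above); the data, the three candidate RBF kernels, and the trace bound are made up.

<pre>
# Sketch: learn K = sum_i mu_i K_i over {K PSD, trace(K) = c} with cvxpy,
# using a simple linear alignment objective as a stand-in for omega_{C,tau}.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=30) > 0, 1.0, -1.0)
n, n_tr = len(X), 20                        # first 20 points play the role of training data

def rbf(X, gamma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

Ks = [rbf(X, g) for g in (0.1, 1.0, 10.0)]  # fixed candidate kernels K_i
mu = cp.Variable(len(Ks))                   # combination weights (may be negative)
K = cp.Variable((n, n), PSD=True)           # the learned kernel matrix

Y = np.outer(y[:n_tr], y[:n_tr])
objective = cp.Maximize(cp.sum(cp.multiply(K[:n_tr, :n_tr], Y)))
constraints = [K == sum(mu[i] * Ks[i] for i in range(len(Ks))),
               cp.trace(K) == float(n)]     # trace(K) = c keeps the problem bounded
prob = cp.Problem(objective, constraints)
prob.solve()
print("learned weights mu:", mu.value)
</pre>
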
=== 12 March: Learning the Kernel with Hyperkernels (Chang) ===

'''Objective:'''
Choose a kernel suitable for estimation with a support vector machine.

'''Motivation:'''
* A Gaussian RBF kernel with a single width cannot find a kernel width suitable for every direction: the classification function becomes dominated by the dimension with the largest variance. The traditional fix is to normalize each dimension independently.
* Instead of normalizing the input data, we can make the kernel adaptive, allowing an independent scale for each dimension. This lets the kernel handle unnormalised data (see the sketch below).
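As a sketch of the motivating fix only (not of the paper's hyperkernel machinery): an RBF kernel with an independent length scale per input dimension, so unnormalised data with very different per-dimension variances need not be rescaled by hand. The data and length scales below are made up and fixed; the point of the paper is to learn the kernel rather than hand-pick such scales.

<pre>
# RBF kernel with one length scale per input dimension:
# k(x, z) = exp(-0.5 * sum_d ((x_d - z_d) / l_d)^2)
import numpy as np

def ard_rbf(X, Z, lengthscales):
    Xs, Zs = X / lengthscales, Z / lengthscales
    sq = (Xs**2).sum(1)[:, None] + (Zs**2).sum(1)[None, :] - 2 * Xs @ Zs.T
    return np.exp(-0.5 * np.clip(sq, 0.0, None))

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=5), 100.0 * rng.normal(size=5)])  # unnormalised data
print(ard_rbf(X, X, lengthscales=np.array([1.0, 100.0])))
</pre>
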
'''Method:'''
Define a reproducing kernel Hilbert space on the space of kernels itself, called the hyper-RKHS, and introduce the hyperkernel, which is a kernel on kernels. For some pointwise-positive, symmetric kernels, such as the Gaussian RBF kernel, we can use them to construct a hyperkernel; after a Taylor expansion, the hyperkernel is effectively a linear combination of kernels.

Back to the original problem: to achieve the goal, the paper introduces a quality functional, which measures the "badness" of a kernel function. The optimization problem for the regularized quality functional, which aims to prevent overfitting, has a mix of linear, quadratic, and semidefinite constraints; therefore, the optimum can be found by solving a semidefinite programming problem.

'''Outline of the paper:'''
* Define quality functionals. A bound on <math>\Pr\left\{|R_{\mathrm{emp}}(f, X, Y)-R(f)| \geq \epsilon \right\}</math> is given whose value is independent of the particular value of the quality functional.
* Define the hyper-RKHS <math>\mathcal{H}'</math>.
** The functions in this space are themselves kernels; the underlying domain is <math>X \times X</math>, and the kernel of this space is the hyperkernel.
** Regularized quality functional: <math>Q_{\mathrm{reg}}(k, X, Y) := \frac{1}{m}\sum_{i=1}^m (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2 + \frac{\lambda_Q}{2}\|k\|_{\mathcal{H}'}^2</math>
** Mixture-of-Gaussian-processes view: exponentiate the negative of the regularized quality functional and compare to Gaussian process estimation, assuming the covariance function of the GP is distributed according to a Wishart distribution. We find we are studying a mixture of GPs, and the maximum likelihood estimator leads to the same optimization problem as minimizing the regularized quality functional.
* Some examples of hyperkernels.
* Optimization problems for regularized-risk-based quality functionals: considering only positive semidefinite kernels, the regularized quality functional can be rewritten (equation 18 in the paper) and compared with a quadratic minimax problem, so semidefinite programming can be used to carry out the optimization; to achieve the goal stated at the beginning of the paper, several additional constraints are imposed.
=== 28 Mar: Learning Low-Rank Kernels (Nathan) ===

Given a set of data points <math>a_1 \dots a_n</math> and a kernel function <math>k(a_i, a_j)</math>, we generally think of the kernel matrix <math>K</math> as an <math>n \times n</math> matrix whose <math>(i, j)</math> entry is the value of the kernel function <math>k(a_i, a_j)</math>. This raises scalability concerns: most kernel-based algorithms require <math>O(n^3)</math> time to work with the kernel matrix, and simply storing it requires <math>O(n^2)</math> space. Working with a low-rank factorization <math>K = GG^T</math>, where <math>G</math> is <math>n \times r</math> with <math>r \ll n</math>, reduces both costs, which is what motivates learning kernel matrices under a rank constraint.
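A sketch of one standard low-rank construction (the Nystrom approximation), just to make the storage argument concrete; it is not the learning algorithm of the paper. The data, kernel, and rank below are made up.

<pre>
# Nystrom approximation: K ~= G G^T with G of size n x r, so O(nr) storage instead of O(n^2).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 10))

def rbf(X, Z, gamma=0.5):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

r = 50
landmarks = rng.choice(len(A), size=r, replace=False)
C = rbf(A, A[landmarks])                 # n x r block of the kernel matrix
W = C[landmarks]                         # r x r block on the landmark points
# G = C W^{-1/2}, so that G G^T = C W^{-1} C^T; eigendecomposition used for numerical safety
evals, evecs = np.linalg.eigh(W)
W_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(np.clip(evals, 1e-12, None))) @ evecs.T
G = C @ W_inv_sqrt                       # the n x r low-rank factor

K_exact = rbf(A, A)
print("relative approximation error:",
      np.linalg.norm(K_exact - G @ G.T) / np.linalg.norm(K_exact))
</pre>
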
==Past Semesters==

* [[MLRG/summer07|Summer 2007]]
* [[MLRG/fall07|Fall 2007]]

[[Category:Seminars]]