MLRG

Machine Learning Reading Group

Informal reading group at Utah on machine learning topics (led by Hal). Papers, discussion, etc. will be posted here.

Time: Wed, Mon or Fri 3:40-5:00pm (see schedule for each week)

Topic: Everything Kernels

Location: TBA (probably MEB Graphics Annex, again)

Schedule

Date | Topic | Readings | Presenter

Introduction, Motivation and Optimization
W 16 Jan | Introduction to Kernel Methods: SVMs and Kernel PCA | Learning with Kernels, Chapter 1 (focus on sec 1, 4, 5, 7) | Piyush
W 23 Jan | Reproducing Kernel Hilbert Spaces | Learning with Kernels, Chapter 2 (focus on sec 2); From Zero to RKHS (focus on sec 6) | Avishek
W 30 Jan | Kernel-based learning and representer theorems | A Generalized Representer Theorem | Suresh

Kernels for Specific Types of Data
W 6 Feb | String and Subsequence Kernels | Text Classification using String Kernels; Fast Kernels for String and Tree Matching | Amit
W 13 Feb | Heat Kernels on Graphs | Diffusion kernels on statistical manifolds (also: shorter version) | Arvind
F 22 Feb | Rational Kernels | Rational Kernels: Theory and Algorithms | Sid
W 27 Feb | Fisher Kernels and Finite State Automata | String Kernels, Fisher Kernels and Finite State Automata | Amit

Learning the Kernel
F 7 Mar | Semi-definite programming methods | Learning the kernel matrix with SDP | Piyush
W 12 Mar | Hyperkernel methods | Learning the Kernel with Hyperkernels | Chang
W 26 Mar | Rank-constrained methods | Learning Low-Rank Kernel Matrices | Nathan

Kernel Theory
W 2 Apr | Kernels that are always good | Universal kernels | Chang
W 9 Apr | RBF and consistency | Worst-case bounds for GPs | Evrard
W 16 Apr | Kernel random projections | Kernels as features | Piyush
W 23 Apr | TBA

Paper Summaries

16 Jan: Introduction to Kernel Methods (Piyush)

Many learning problems require us to come up with similarity measures between input patterns. The dot (inner) product <\textbf{x},\textbf{x}'> is one of the most commonly used measures of similarity, and it is particularly attractive because of its computational simplicity and geometric interpretation. However, for many interesting problems the patterns do not live in a dot product space, so in order to use the dot product as a similarity measure we need a mapping of these patterns to some dot product space. Moreover, even when the patterns do live in a dot product space, we often need an alternate representation (e.g. via a nonlinear map) that is more suitable to the problem at hand. Kernel methods let us achieve both in an efficient manner.

Let X be the input space and φ a mapping of patterns from X to a (usually high-dimensional) dot product space H (also called the feature space). This mapping can be written as:

\phi : X \rightarrow H

\textbf{x} := \phi(x)

Using a kernel function k, we can define a similarity measure between patterns in input space X using dot products in the feature space H.

 k(x,x') := <\textbf{x},\textbf{x}'> = <\phi(x),\phi(x')>

The nice thing about the kernel function k is that it lets us compute the inner products in H without carrying out the explicit mappings (φ(x)) of the input patterns from X to H.
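
To make the "no explicit mapping" point concrete, here is a minimal sketch (not from the chapter) using the degree-2 homogeneous polynomial kernel, one of the few kernels whose feature map is small enough to write out explicitly:

import numpy as np

def phi(x):
    # explicit feature map: all pairwise products x_i * x_j, flattened into R^(n^2)
    return np.outer(x, x).ravel()

def k(x, xp):
    # kernel evaluation: one dot product in the input space, then squared
    return np.dot(x, xp) ** 2

x, xp = np.random.randn(5), np.random.randn(5)
print(np.dot(phi(x), phi(xp)))  # inner product in the 25-dimensional feature space
print(k(x, xp))                 # the same number, computed in the 5-dimensional input space

Both print statements produce the same number; for a Gaussian kernel the analogous feature space is infinite-dimensional, so only the kernel-side computation is possible.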

Hyperplane Learning in a Dot Product Space: A hyperplane learning algorithm in a dot product space H is a useful example for understanding the utility of kernels. Consider a hyperplane classifier in some dot product space H:

 <\textbf{w},\textbf{x}> + b = 0 where  w \in H, b \in R

The corresponding decision function for a new test pattern \textbf{x} is f(\textbf{x}) = sgn(<\textbf{w},\textbf{x}> + b), the value of f(\textbf{x}) indicating which side of the hyperplane the pattern lies on. Among the many possible separating hyperplanes for a problem, the one with the maximum margin of separation between the training points is called the optimal hyperplane. Geometrically, it can be shown that under the canonical scaling used below the distance from the hyperplane to the closest training point on either side is \frac{1}{||\textbf{w}||}. The problem of finding the maximum margin hyperplane can then be formulated as a quadratic programming problem:

 \min_{\textbf{w} \in H, b \in R} \ \ \tau(\textbf{w}) = \frac{1}{2}|| \textbf{w} ||^2 subject to  y_i(<\textbf{w},\textbf{x}_i> + b) \geq 1 (for all i = 1,...,m training points)

The above constrained optimization problem can be written as a Lagrangian.

 L(\textbf{w},b,\alpha) = \frac{1}{2} ||\textbf{w}||^2 - \sum_{i=1}^m \alpha_i( y_i(<\textbf{w},\textbf{x}_i> + b) - 1). (primal problem)

Minimizing the Lagrangian with respect to the primal variables \textbf{w} and b leads to

 \sum_{i=1}^m \alpha_iy_i = 0 and  \textbf{w} = \sum_{i=1}^m \alpha_iy_i\textbf{x}_i

In practice, we usually solve the dual problem (for reasons which will become apparent later) in terms of the Lagrange multipliers α.

 \max_{\alpha \in R^m} \ \ \ W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j <\textbf{x}_i,\textbf{x}_j>

subject to \alpha_i \geq 0 for all i = 1,...,m and  \sum_{i=1}^m \alpha_i y_i = 0

The decision function for a test pattern \textbf{x} can be written as

 f(\textbf{x}) = sgn(\sum y_i\alpha_i <\textbf{x},\textbf{x}_i> + b)

where b can be recovered by exploiting the constraints (for any support vector, i.e. a training point with \alpha_i > 0, the margin constraint holds with equality).

The Kernel Trick for hyperplane learning in an arbitrary space: To apply the above techniques to patterns in X (which is not a dot product space), we employ the kernel trick, which lets us avoid explicitly computing the mapping φ.

 <\textbf{x},\textbf{x}'>=<\phi(x),\phi(x')>=k(x,x')

The dual problem above can now be written in terms of the kernel function.

 \max_{\alpha \in R^m} W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i,x_j)

subject to \alpha_i \geq 0 for all i = 1,...,m and \sum_{i=1}^m \alpha_i y_i = 0

And the decision function becomes:

 f(x) = sgn(\sum y_i\alpha_i <\phi(x),\phi(x_i)> + b) = sgn(\sum y_i\alpha_i k(x,x_i) + b)
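
As an illustration (a sketch assuming numpy and scikit-learn are available; the RBF kernel and the toy labels are invented for the example), we can hand a precomputed Gram matrix to an off-the-shelf SVM solver and check that its decision function has exactly the form above:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)    # a toy non-linear labeling

def rbf(A, B, gamma=1.0):
    # Gram matrix of the Gaussian (RBF) kernel k(a, b) = exp(-gamma ||a - b||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = rbf(X, X)
clf = SVC(kernel='precomputed').fit(K, y)

x_test = rng.randn(3, 2)
K_test = rbf(x_test, X)                        # k(x, x_i) for every training point x_i

# Manual decision values: dual_coef_ stores y_i * alpha_i for the support vectors.
manual = K_test[:, clf.support_] @ clf.dual_coef_.ravel() + clf.intercept_
print(np.allclose(manual, clf.decision_function(K_test)))   # True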

Thus we see that even when the input patterns do not lie in a dot product space, kernels give us a way to map them into a high-dimensional feature space (endowed with a dot product) and work in that feature space instead. As mentioned earlier, such mappings can also be used to obtain alternate representations of the input, which can be useful in some cases: for example, a set of points that is not linearly separable in the original space can be mapped to a feature space where it becomes linearly separable, so that an essentially linear classifier such as the SVM can still be applied there.

We were able to employ the kernel trick for hyperplane classifiers (the SVM, for example) because we could write the problem in terms of inner products between the input patterns. The same technique can be used for any problem that can be formulated in terms of inner products, and thus nonlinear generalizations can be obtained for all such learning problems. PCA (a linear dimensionality reduction technique), for example, works with the covariance matrix of the input patterns, which is essentially expressed in terms of dot products; this makes it easy to derive its nonlinear generalization, kernel PCA.
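
As a companion sketch (assumptions: numpy only, an RBF Gram matrix, and projections of the training points only; this is the textbook recipe rather than code from the chapter), kernel PCA can be carried out entirely from the Gram matrix:

import numpy as np

def kernel_pca(K, n_components=2):
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center the data in feature space
    lam, V = np.linalg.eigh(Kc)                  # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]               # reorder to descending
    # projection of training point i onto component a is sqrt(lam_a) * V[i, a]
    return V[:, :n_components] * np.sqrt(np.clip(lam[:n_components], 0, None))

X = np.random.randn(50, 3)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                            # RBF Gram matrix
print(kernel_pca(K).shape)                       # (50, 2)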

23 Jan: Reproducing Kernel Hilbert Spaces (Avishek)

The last talk familiarized us with the concept of kernels. We were introduced to the notion of a kernel as a similarity measure, and we learned that it is sometimes beneficial to map a data separation problem from the input space to a new feature space in which the mapped data is linearly separable and hence easier to classify. The way to achieve this was by means of the "kernel trick": applying a kernel in the input space is equivalent to computing dot products in the mapped feature space. This week's talk delves deeper into the theory of kernels and tries to answer questions like: a) what kind of functions can give rise to kernels? b) given some kernel, is it possible to come up with a feature space amenable to the "kernel trick"? c) how can we construct a feature space for an arbitrary kernel such that the mapping properties from input to feature space are preserved? d) starting with some feature space, can we construct a kernel for it?

Section 2 of the first paper provides some basic definitions for kernel functions. Subsequently, it presents the construction of feature spaces for kernel functions. The next two sections explain Hilbert spaces for reproducing kernels, Mercer kernels, and the Mercer map. The other tutorial starts with an introductory refresher on some essential underlying mathematical concepts. Thereafter, it deftly walks us through the fundamental concepts of reproducing kernels, i.e., given some Hilbert space, what properties it needs to satisfy so that a reproducing kernel may exist for it. It also delineates the construction of reproducing kernel Hilbert spaces (RKHS) and describes two approaches for building feature spaces for reproducing kernels.
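
As a concrete anchor for questions (b) and (c) above, the following standard identities (not specific to either tutorial) summarize the canonical feature map and the reproducing property of an RKHS \mathcal{H}_k with kernel k:

 \phi(x) := k(\cdot, x) \in \mathcal{H}_k
 \langle f, k(\cdot, x) \rangle_{\mathcal{H}_k} = f(x) \ \ \text{for all } f \in \mathcal{H}_k \ \ \text{(reproducing property)}
 \langle \phi(x), \phi(x') \rangle_{\mathcal{H}_k} = k(x, x')

In words: every kernel comes with a ready-made feature space, namely the RKHS itself, and the feature map sends each point to the kernel function centered at it.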

30 Jan: The Representer Theorem (Suresh)

To understand the power and value of the representer theorem, we need to start with the idea of regularization, for which these notes are very helpful.

In a basic prediction setup, you are given pairs (x_i, y_i), and the goal is to train a function f such that f(x_i) = y_i. Of course in general, this will not be doable, so we'll define a loss function V, and estimate the average cost \frac{1}{m}\sum V(f(x_i), y_i).

Examples:

  • V(x,y) = \|x - y\|_2^2
  • V(x,y) = \|x - y\|_1
  • V(x,y) = (1 - xy)_+ (hinge loss)

How do we choose f? The problem is that if we don't restrict f, we can always fit the data exactly (the problem is ill-posed), and overfitting can result. We want to restrict the solutions to "stable" ones: solutions that exist and vary smoothly as the data points change (changing one data point should not change the function drastically).

This is the idea of (Tikhonov) regularization. In general, we define some kind of measure of stability (usually a norm) over the function space, and write a modified optimization of the form

\min_f \frac{1}{m}\sum V(f(x_i), y_i) + \gamma \|f\|^2

But now we have to optimize over functions in the function space, and this is an infinite-dimensional space!

This is where the representer theorem comes to the rescue. In simple English, it says,

"If functions are drawn from an RKHS, then the space of functions that minimize the cost functional has dimension equal to the number of training points"

Assume the 'smoothness' penalty on f is given by g(\|f\|) for some strictly monotonically increasing function g. Note that for the special case g(\|f\|) = \|f\|^2, with l2 loss, this problem was solved earlier using a spline basis.

Proof sketch:

We can write down the kernel function generated by each data point: x_i \rightarrow k(\cdot, x_i). These m functions span a subspace of dimension at most m.

Now any function in the function space can be decomposed into a component lying in this subspace and a component orthogonal to it. By the magic of the kernel trick (the reproducing property), the orthogonal component does not affect the values f(x_i) and hence drops out of the loss term. Also, the squared norm of f splits into the subspace term plus the orthogonal term, and because g is assumed to be strictly monotone, g can only decrease when the orthogonal component is eliminated. Therefore any minimizing f is a linear combination of the kernel functions at the training points.

Message: if we use kernels, the effective dimensionality of the learning problem is at most the size of the training set.
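
A concrete instance of this message is kernel ridge regression, sketched below under stated assumptions (numpy only; the RBF kernel and the toy sine data are invented for illustration): with squared loss and g(\|f\|) = \|f\|^2, the infinite-dimensional minimization collapses, via the representer theorem, to an m x m linear system for the coefficients alpha.

import numpy as np

def rbf(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(60)

reg, m = 0.1, len(X)
K = rbf(X, X)
# Minimizing (1/m) sum_i (f(x_i) - y_i)^2 + reg * ||f||^2 over the RKHS gives
# alpha = (K + reg * m * I)^{-1} y for the representer coefficients.
alpha = np.linalg.solve(K + reg * m * np.eye(m), y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = rbf(X_test, X) @ alpha                 # f(x) = sum_j alpha_j k(x_j, x)
print(f_test)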

6 Feb: String Kernels (Amit)

Motivation

The motivation behind studying string kernels is their wide applicability: text classification (spam, web spam, categorization), security (network traffic, viruses, trojans), biology (promoter and splice-site prediction), and many more.

Why String Kernels?

So the first question that may come to anyone's mind is: why do we need string kernels for, say, a text classification task, and why should they perform better? The question makes sense, since we already have a classical technique for this task, in which we map a document to a high-dimensional feature vector whose entries record the presence or absence of features. So what is the problem with that approach? If one looks carefully, one realizes that it loses all word-order information, retaining only the frequency of the terms in the document, and that information is crucial for tasks like text classification and for biological sequences, where order matters a great deal. Also, to make that approach efficient we remove stop words and do stemming, which loses inflectional information. One can argue that English is relatively poor in inflectional morphology, but suppose we are working on agglutinative languages, where most words are formed by joining morphemes together; there it would not be a good idea to drop such information. So how should we represent our features so that we do not lose crucial information? Here kernel methods come to our rescue as an effective alternative to explicit feature extraction. They are very useful, since feature extraction is itself a challenging task and kernels choose good features implicitly. (Decision trees are another class of learning algorithms that choose their features implicitly.) We also know that kernels compute similarity, and that is exactly what we need here: to compute the similarity between documents.

Overview of String Kernels

We will look at the string subsequence kernel (SSK) and its variants, such as the n-gram kernel, the standard word kernel, the spectrum kernel, and the weighted degree kernel.

The general idea for text classification is to define a kernel between two text documents that compares them by means of the substrings they contain: the more substrings they have in common, the more similar they are. Importantly, these substrings need not be contiguous.
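
As a concrete (and much simplified) illustration of this idea, the sketch below implements the k-spectrum kernel, which counts contiguous substrings of length k shared by two strings; the full SSK described in the paper additionally allows non-contiguous matches, down-weighted by a gap-decay factor, but the "shared substrings" picture is the same. The function names and the value k=3 are choices made for this sketch:

from collections import Counter

def spectrum(s, k):
    # feature map: counts of all contiguous substrings of length k
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    # dot product of the two (sparse) count vectors
    fs, ft = spectrum(s, k), spectrum(t, k)
    return sum(c * ft[g] for g, c in fs.items())

print(spectrum_kernel("science is organized knowledge",
                      "wisdom is organized life"))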

How to make String Kernels fast?

There are some inherent disadvantages, the main one being that these kernels are slow to compute. The dynamic-programming algorithm for SSK runs in O(n|s||t|) time, where n is the length of the subsequences and |s| and |t| are the document lengths, which is still quadratic in document length. For larger amounts of training data the authors therefore propose an approximate method. In the other paper, the authors do some clever preprocessing: by sorting the non-zero entries of d1 and d2 beforehand, the time taken to compute the similarity between the two documents becomes linear in the combined number of non-zero entries.

13 Feb: Heat Kernels (Arvind)

This paper talks about using heat kernels for a text classification problem. Like any other kernel, heat kernels measure the similarity between two points, but in this case the similarity is problem dependent. Heat kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, which means that there is no single heat kernel applicable to all problems: heat kernels depend on the geometry of the data. To use these kernels, we first hypothesize this geometry (a Riemannian manifold), which is equivalent to hypothesizing the distribution of the data; once we know (by assumption, of course) the distribution, we can derive the heat kernel for that distribution (or for the Riemannian manifold it defines) and use it. One nice thing about heat kernels is that they take us beyond the standard Euclidean metric and let us define the distance between two points based on the actual geometry of the data.
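
For the text experiments the paper uses the multinomial family, where the Fisher geometry has a closed form. The sketch below (numpy only; normalization constants are omitted, so treat it as an assumption-laden illustration rather than the paper's exact kernel) uses the geodesic distance on the probability simplex and a leading-order Gaussian approximation of the heat kernel:

import numpy as np

def multinomial_diffusion_kernel(p, q, t=0.5):
    # p, q: probability vectors (e.g. smoothed, normalized term counts).
    # Under the Fisher information metric the simplex maps to a sphere, giving
    # the geodesic distance d(p, q) = 2 * arccos( sum_i sqrt(p_i * q_i) ).
    d = 2.0 * np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))
    # Gaussian-in-geodesic-distance approximation of the heat kernel
    # (normalization factors omitted in this sketch).
    return np.exp(-d ** 2 / (4.0 * t))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(multinomial_diffusion_kernel(p, q))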

22 Feb: Rational Kernels (Sid)

This paper describes a way to create kernels based on weighted finite state transducers (FSTs). These kernels are essentially meant for data in the form of sequences of symbols (i.e. strings). The basic idea is that a notion of similarity between symbol sequences can be captured using a weighted FST. The FST has weights associated with states and state transitions, so it is possible to compute a weight for a specific output generated by the FST from a given input. In other words, for a given task, the weight assigned by the FST to a pair of strings can be viewed as the similarity between the two data points. This forms the basis of the work: the creation of kernels for strings using weighted FSTs. The name rational comes from the "rational relations or rational transductions represented by the weighted transducer".

Because the similarity functions defined by a weighted FST are not guaranteed to be positive definite and symmetric, they do not always define a kernel. The paper therefore provides some guidelines for constructing positive definite symmetric (PDS) rational kernels from weighted FSTs. Finally, it gives some examples of existing kernels that can be reformulated as rational kernels.
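
One of the paper's central constructions for guaranteeing positive definiteness is to compose a transducer T with its inverse, giving K(x, y) = \sum_z T(x, z) T(y, z). The sketch below makes that concrete without a real weighted-FST library: the transducer is abstracted as a plain function from an input string to a dictionary of weighted outputs, and the toy transducer (a bigram emitter invented for this example) makes K count shared bigrams:

from collections import Counter

def bigram_transducer(s):
    # stand-in for a weighted FST: maps an input string to weighted outputs
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def rational_kernel(x, y, T=bigram_transducer):
    Tx, Ty = T(x), T(y)
    # summing the products of weights over common outputs z is the weight that
    # the composed machine T o T^{-1} assigns to the pair (x, y)
    return sum(w * Ty[z] for z, w in Tx.items())

print(rational_kernel("abab", "abba"))   # counts shared bigrams with multiplicity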

27 Feb: Fisher Kernels and Finite State Automata (Amit)

This paper shows how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be reconstructed. The probabilistic modeling approach suggests extending the Markov process to consider subsequences of varying length, rather than the standard fixed-length approach used in the string kernel. The authors use the Fisher score and Fisher information to construct a Fisher kernel, and then deduce the n-gram and string kernels from it. After that they extend Fisher kernels to finite state machines (FSMs) to obtain a more flexible way of using subsequences of different lengths as features. The advantage of this is that we do not need to choose the value of k a priori.
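
To make the Fisher-kernel machinery concrete, here is a hedged sketch for the simplest case, a unigram (1-stage Markov) document model. The Fisher score of a document is U_x = \nabla_\theta \log p(x|\theta) and the kernel is K(x, y) = U_x^T I^{-1} U_y; the toy data, the empirical estimate of the Fisher information I, and the unconstrained gradient are simplifying assumptions for illustration, not the paper's exact construction:

import numpy as np

vocab_size, n_docs = 6, 20
rng = np.random.RandomState(0)
theta = rng.dirichlet(np.ones(vocab_size))          # unigram model parameters
counts = rng.multinomial(30, theta, size=n_docs)    # toy "documents" (word counts)

# log p(x | theta) = sum_w c_w(x) * log theta_w  =>  score_w = c_w(x) / theta_w
# (gradient taken ignoring the simplex constraint, for simplicity)
U = counts / theta                                   # rows are Fisher scores U_x

I_hat = np.cov(U, rowvar=False) + 1e-6 * np.eye(vocab_size)   # empirical Fisher information
K = U @ np.linalg.solve(I_hat, U.T)                  # Gram matrix K(x, y) = U_x^T I^{-1} U_y
print(K.shape)                                       # (20, 20)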

Past Semesters
