= Machine Learning Reading Group =

Informal reading group at [http://www.cs.utah.edu Utah] on machine learning topics (led by [http://hal3.name Hal]). Papers, discussion, etc. will be posted here.

'''Time:''' Wed, Mon or Fri 3:40-5:00pm (see schedule for each week)

'''Topic:''' Everything Kernels

'''Location:''' TBA (probably MEB Graphics Annex, again)

==Schedule==

{| border="1" style="width: 100%; text-align:left" class="content"
|-
! Date || Papers || Presenter
|-
| colspan="3" bgcolor="#dddddd" style="text-align:center" | '''Introduction, Motivation and Optimization'''
|-
| W 16 Jan
|| '''Introduction to Kernel Methods: SVMs and Kernel PCA'''
<i>[http://www.learning-with-kernels.org/sections Learning with Kernels], Chapter 1 (focus on sec 1, 4, 5, 7)</i>
|| Piyush
|-
| W 23 Jan
|| '''Reproducing Kernel Hilbert Spaces'''
<i>[http://www.learning-with-kernels.org/sections Learning with Kernels], Chapter 2 (focus on sec 2)</i><br/>
<i>[http://www.cs.utah.edu/~hal/publications.html#daume04rkhs From Zero to RKHS] (focus on sec 6)</i>
|| Avishek
|-
| W 30 Jan
|| '''Kernel-based learning and representer theorems'''
<i>[http://www.cs.utah.edu/~hal/tmp/P139.pdf A Generalized Representer Theorem]</i>
|| Suresh
|-
| colspan="3" bgcolor="#dddddd" style="text-align:center" | '''Kernels for Specific Types of Data'''
|-
| W 6 Feb
|| '''String and Subsequence Kernels'''
<i>[http://jmlr.csail.mit.edu/papers/volume2/lodhi02a/lodhi02a.pdf Text Classification using String Kernels]</i><br/>
<i>[http://axiom.anu.edu.au/~vishy/papers/VisSmo02.pdf Fast Kernels for String and Tree Matching]</i>
|| Amit
|-
| W 13 Feb
|| '''Heat Kernels on Graphs'''
<i>[http://www.jmlr.org/papers/volume6/lafferty05a/lafferty05a.pdf Diffusion kernels on statistical manifolds]</i> (also: [http://www.stat.purdue.edu/%7Elebanon/papers/idk.pdf shorter version])
|| Arvind
|-
| F 22 Feb
|| '''Rational Kernels'''
<i>[http://www.jmlr.org/papers/volume5/cortes04a/cortes04a.pdf Rational Kernels: Theory and Algorithms]</i>
|| Sid
|-
| W 27 Feb
|| '''Fisher Kernels and Finite State Automata'''
<i>[http://www.ecs.soton.ac.uk/~cjs/publications/FSA_NIPS02.pdf String Kernels, Fisher Kernels and Finite State Automata]</i>
|| Amit
|-
| colspan="3" bgcolor="#dddddd" style="text-align:center" | '''Learning the Kernel'''
|-
| F 7 Mar
|| '''Semi-definite programming methods'''
<i>[http://jmlr.csail.mit.edu/papers/volume5/lanckriet04a/lanckriet04a.pdf Learning the kernel matrix with SDP]</i>
|| Piyush
|-
| W 12 Mar
|| '''Hyperkernel methods'''
<i>[http://jmlr.csail.mit.edu/papers/volume6/ong05a/ong05a.pdf Learning the Kernel with Hyperkernels]</i>
|| Chang
|-
| W 26 Mar
|| '''Rank-constrained methods'''
<i>[http://www.icml2006.org/icml_documents/camera-ready/064_Learning_Low_Rank_Ke.pdf Learning Low-Rank Kernel Matrices]</i>
|| Nathan
|-
| colspan="3" bgcolor="#dddddd" style="text-align:center" | '''Kernel Theory'''
|-
| W 2 Apr
|| '''Kernels that are always good'''
<i>[http://jmlr.csail.mit.edu/papers/volume7/micchelli06a/micchelli06a.pdf Universal kernels]</i>
|| Chang
|-
| W 9 Apr
|| '''RBF and consistency'''
<i>[http://books.nips.cc/papers/files/nips18/NIPS2005_0170.pdf Worst-case bounds for GPs]</i>
|| Evrard
|-
| W 16 Apr
|| '''Kernel random projections'''
<i>[http://citeseer.ist.psu.edu/751943.html Kernels as features]</i>
|| Piyush
|-
| W 23 Apr
|| TBA
||
|-
|}

==Paper Summaries==

=== 16 Jan: Introduction to Kernel Methods (Piyush) ===

Many learning problems require us to come up with similarity measures between input patterns. The dot (inner) product <math>\langle\textbf{x},\textbf{x}'\rangle</math> is one of the most commonly used measures of similarity. It is particularly attractive because of its computational simplicity and geometric interpretation. However, for many interesting problems the patterns do not live in a dot product space, so in order to use the dot product as a similarity measure we need a mapping of these patterns into some dot product space. Moreover, even when the patterns do live in a dot product space, we often need an alternate representation (e.g., a nonlinear map) that is better suited to the problem at hand. Kernel methods help us achieve both of these efficiently.

Let <math>X</math> be the input space and <math>\phi</math> a mapping of patterns from <math>X</math> into a (usually high-dimensional) dot product space <math>H</math> (also called the <i>feature space</i>):

<math>\phi: X \rightarrow H </math>

<math>\textbf{x} := \phi(x)</math>

Using a kernel function <math>k</math>, we can define a similarity measure between patterns in the input space <math>X</math> via dot products in the feature space <math>H</math>:

<math> k(x,x') := \langle\textbf{x},\textbf{x}'\rangle = \langle\phi(x),\phi(x')\rangle </math>

The nice thing about the kernel function <math>k</math> is that it lets us compute inner products in <math>H</math> without ever carrying out the explicit mapping <math>\phi(x)</math> of the input patterns from <math>X</math> to <math>H</math>.
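
As a quick sanity check of this identity, here is a minimal sketch (assuming Python with NumPy; not from the book) comparing the homogeneous quadratic kernel <math>k(x,x') = \langle x,x'\rangle^2</math> with an explicit degree-2 feature map, whose dot product gives the same value:

<pre>
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for the quadratic kernel k(x, x') = (x . x')**2.
    Maps a d-dimensional x to all d*d products x_i * x_j."""
    return np.outer(x, x).ravel()

def k(x, xp):
    """Quadratic kernel computed directly in the input space."""
    return np.dot(x, xp) ** 2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)

# The kernel evaluates the feature-space inner product without forming phi explicitly.
assert np.isclose(k(x, xp), np.dot(phi(x), phi(xp)))
</pre>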

<b>Hyperplane learning in a dot product space:</b> The hyperplane learning algorithm in a dot product space is a useful example for understanding the utility of kernels. Consider a hyperplane classifier in some dot product space <math>H</math>:

<math> \langle\textbf{w},\textbf{x}\rangle + b = 0 </math> where <math> \textbf{w} \in H, b \in R </math>

The corresponding decision function for a new test pattern <math>\textbf{x}</math> is <math>f(\textbf{x}) = \mathrm{sgn}(\langle\textbf{w},\textbf{x}\rangle + b) </math>, the value of <math>f(\textbf{x})</math> indicating which side of the hyperplane the pattern lies on. Among the many possible separating hyperplanes for a problem, the one with the maximum margin of separation between the training points is called the optimal hyperplane. Geometrically, it can be shown that the margin to the closest training point on either side of the hyperplane is <math>\frac{1}{||\textbf{w}||}</math>. It then turns out that finding the maximum margin hyperplane can be formulated as a quadratic programming problem:

<math> \min_{\textbf{w} \in H, b \in R} \ \ \tau(\textbf{w}) = \frac{1}{2}|| \textbf{w} ||^2 </math>
subject to <math> y_i(\langle\textbf{w},\textbf{x}_i\rangle + b) \geq 1 </math> (for all <math> i=1,\ldots,m </math> training points)

The above constrained optimization problem can be written as a Lagrangian:

<math> L(\textbf{w},b,\alpha) = \frac{1}{2} ||\textbf{w}||^2 - \sum_{i=1}^m \alpha_i( y_i(\langle\textbf{w},\textbf{x}_i\rangle + b) - 1)</math> (primal problem)

Minimizing the Lagrangian with respect to the primal variables <math>\textbf{w}</math> and <math>b</math> leads to

<math> \sum_{i=1}^m \alpha_i y_i = 0 </math> and <math> \textbf{w} = \sum_{i=1}^m \alpha_i y_i\textbf{x}_i </math>

In practice, we usually solve the dual problem (for reasons which will become apparent later) in terms of the Lagrange multipliers <math> \alpha </math>:

<math> \max_{\alpha \in R^m} \ \ \ W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle\textbf{x}_i,\textbf{x}_j\rangle </math>

subject to <math> \alpha_i \geq 0 </math> for all <math>i=1,\ldots,m</math> and <math> \sum_{i=1}^m \alpha_i y_i = 0 </math>

The decision function for a test pattern <math>\textbf{x}</math> can be written as

<math> f(\textbf{x}) = \mathrm{sgn}\left(\sum_{i=1}^m y_i\alpha_i \langle\textbf{x},\textbf{x}_i\rangle + b\right) </math>

where <math>b</math> can be recovered by exploiting the constraints.

<b>The kernel trick for hyperplane learning in an arbitrary space:</b> To apply the above technique to patterns in <math>X</math> (which need not be a dot product space), we employ the kernel trick, which lets us avoid explicitly computing the mapping <math> \phi </math>:

<math> \langle\textbf{x},\textbf{x}'\rangle = \langle\phi(x),\phi(x')\rangle = k(x,x') </math>

The dual problem above can now be written in terms of the kernel function:

<math> \max_{\alpha \in R^m} W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i,x_j)</math>

subject to <math> \alpha_i \geq 0 </math> for all <math>i=1,\ldots,m</math> and <math>\sum_{i=1}^m \alpha_i y_i = 0 </math>

And the decision function becomes:

<math> f(x) = \mathrm{sgn}\left(\sum_{i=1}^m y_i\alpha_i \langle\phi(x),\phi(x_i)\rangle + b\right) = \mathrm{sgn}\left(\sum_{i=1}^m y_i\alpha_i k(x,x_i) + b\right) </math>

Thus, even when the input patterns do not lie in a dot product space, kernels give us a way to carry out a mapping to a high-dimensional feature space (endowed with a dot product) and to work in the new feature space instead. As mentioned earlier, such mappings can also be used to obtain alternate representations of the input, which can be useful in some cases. For example, the SVM is essentially a linear classifier, but a set of points that is not linearly separable in the original space can be mapped to a new feature space where it becomes linearly separable, and we can then apply the SVM in that feature space.

We were able to employ the kernel trick for hyperplane classifiers (the SVM, for example) because we could write the problem in terms of inner products between the input patterns. It turns out that the same technique can be used for any problem that can be formulated in terms of inner products, and thus nonlinear generalizations can be obtained for all such learning problems. PCA (a linear dimensionality reduction technique), for example, works with the covariance matrix of the input patterns, which is essentially expressed in terms of dot products. This makes it easy to derive its nonlinear generalization, kernel PCA.
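
The kernel trick in practice can be seen in the small sketch below (an illustration, not part of the chapter), which trains an SVM with a precomputed RBF Gram matrix on data that is not linearly separable in the input space; it assumes scikit-learn and NumPy are available:

<pre>
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
# Labels depend only on the radius, so no separating line exists in the plane.
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

K_train = rbf_kernel(X, X)
clf = SVC(kernel='precomputed').fit(K_train, y)

X_test = rng.normal(size=(50, 2))
K_test = rbf_kernel(X_test, X)   # kernel values between test and training points
pred = clf.predict(K_test)
print("test accuracy:", np.mean(pred == (np.linalg.norm(X_test, axis=1) > 1.0)))
</pre>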

=== 23 Jan: Reproducing Kernel Hilbert Spaces (Avishek) ===

The last talk familiarized us with the concept of kernels. We were introduced to the notion of a kernel as a similarity measure, and we learned that it is often beneficial to map a classification problem from the input space to a new feature space in which the mapped data become linearly separable and hence easier to classify. The way to achieve this is the "kernel trick": applying a kernel in the input space is equivalent to computing dot products in the mapped feature space. This week's talk delves deeper into the theory of kernels and tries to answer questions such as: a) what kinds of functions can give rise to kernels? b) given some kernel, is it possible to come up with a feature space amenable to the "kernel trick"? c) how can we construct a feature space for an arbitrary kernel such that the mapping properties from input to feature space are preserved? d) starting with some feature space, can we construct a kernel for it?

Section 2 of the first paper provides some basic definitions for kernel functions and then presents the construction of feature spaces for kernel functions. The next two sections explain Hilbert spaces for reproducing kernels, Mercer kernels, and the Mercer map. The other tutorial starts with an introductory refresher on some essential underlying mathematical concepts. Thereafter, it walks us through the fundamental concepts of reproducing kernels, i.e., given some Hilbert space, what properties it needs to satisfy so that a reproducing kernel may exist in it. It also delineates the construction of reproducing kernel Hilbert spaces (RKHSs) and describes two approaches for building feature spaces for reproducing kernels.

=== 30 Jan: The Representer Theorem (Suresh) ===

To understand the power and value of the representer theorem, we need to start with the idea of regularization, for which [http://web.mit.edu/9.520/www/Classes/class02.pdf these notes] are very helpful.

In a basic prediction setup, you are given pairs <math>(x_i, y_i)</math>, and the goal is to train a function <math>f</math> such that <math>f(x_i) = y_i</math>. In general this will not be achievable exactly, so we define a loss function <math>V</math> and estimate the average cost
<math>\frac{1}{m}\sum_{i=1}^m V(f(x_i), y_i)</math>.

Examples:
* <math>V(x,y) = \|x - y\|_2^2</math> (squared loss)
* <math>V(x,y) = \|x - y\|_1</math> (absolute loss)
* <math>V(x,y) = (|x-y|-\epsilon)_+</math> (the <math>\epsilon</math>-insensitive loss used in support vector regression)

How do we choose <math>f</math>? The problem is that if we don't restrict <math>f</math>, we can always fit the data exactly (the problem is ill-posed), and overfitting results. We want to restrict the solutions to "stable" ones that exist and vary smoothly as the data points change (changing one data point should not change the function drastically).

This is the idea of [http://en.wikipedia.org/wiki/Tikhonov_regularization (Tikhonov) regularization]. In general, we define some measure of stability (usually a norm) over the function space and write a modified optimization problem of the form

<math>\min_f \frac{1}{m}\sum_{i=1}^m V(f(x_i), y_i) + \gamma \|f\|^2</math>

But now we have to optimize over functions in the function space, and this is an infinite-dimensional space!

This is where the representer theorem comes to the rescue. In plain English, it says:

'''''"If functions are drawn from an RKHS, then the minimizer of the cost functional lies in a subspace whose dimension equals the number of training points."'''''

Assume the 'smoothness' penalty on <math>f</math> is given by some strictly monotonic function <math>g(\|f\|)</math>. Note that for the special case <math>g(\|f\|) = \|f\|^2</math> with <math>l_2</math> loss, this problem was solved much earlier using a spline basis.

Proof sketch:

Each data point generates a kernel function <math>x_i \rightarrow k(\cdot, x_i)</math>. This set of functions spans a subspace of dimension at most <math>m</math>.

Now any function in the RKHS can be written in terms of a component in this subspace and a component orthogonal to it. By the reproducing property, the orthogonal component disappears from the loss term. Also, because <math>g</math> is assumed to be strictly monotonic, we can split the norm of <math>f</math> into the subspace and orthogonal parts, and <math>g</math> can only decrease by eliminating the orthogonal part. Therefore any minimizing <math>f</math> is a function solely of the training data.

Message: if we use kernels, the intrinsic dimensionality of a learning problem is only the size of the training set.
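
As a concrete illustration (a sketch assuming Python with NumPy, not taken from the paper): for squared loss and <math>g(\|f\|) = \gamma\|f\|^2</math>, kernel ridge regression has the closed-form minimizer <math>f(x) = \sum_{i=1}^m \alpha_i k(x_i, x)</math> with <math>\alpha = (K + \gamma m I)^{-1} y</math>, a function determined entirely by the training points, exactly as the theorem promises.

<pre>
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

gamma_reg, m = 0.01, len(X)
K = rbf(X, X)
# Representer theorem: the minimizer is f(x) = sum_i alpha_i k(x_i, x),
# so we only need to solve for the m coefficients alpha.
alpha = np.linalg.solve(K + gamma_reg * m * np.eye(m), y)

X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
f_new = rbf(X_new, X) @ alpha    # predictions depend only on the training points
print(f_new)
</pre>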

=== 5 Feb: String Kernels (Amit) ===

'''Motivation'''

The motivation behind studying string kernels is their wide applicability in applications such as text classification (spam, web spam, categorization), security (network traffic, viruses, trojans), biology (promoter and splice site prediction), and many more.

'''Why String Kernels?'''

The first question that comes to mind is: why do we need string kernels for, say, a text classification task, and why should they perform better? The question makes sense because we already have a classical technique for this task, in which we map a document to a high-dimensional feature vector whose entries represent the presence or absence of individual terms. So what is the problem with that approach? Looking carefully, one realizes that it loses all word-order information, retaining only the frequency of terms in the document, and that information is crucial for tasks like text classification and biological sequence analysis, where the order of the sequence matters a great deal. Also, to make that approach efficient we remove stop words and apply stemming, which discards inflectional information. One can argue that English is relatively poor in inflectional morphology, but suppose we are working with agglutinative languages, where most words are formed by joining morphemes together; there it would not be a good idea to drop such information. So how should we represent our features so that we don't lose crucial information? This is where kernel methods come to our rescue: they are an effective alternative to explicit feature extraction, which is a challenging task in itself, because kernels implicitly choose good features. (Decision trees are another class of learning algorithms that choose their features implicitly.) And since kernels are designed to compute similarity, they are a natural fit here, where we need to compute the similarity between documents.

'''Overview of String Kernels'''

In this talk we will look at the string subsequence kernel (SSK) and its variants: the n-gram kernel, the standard word kernel, spectrum kernels, and the weighted degree kernel.

The general idea for text classification is to define a kernel between two documents that compares them by means of the substrings they contain: the more substrings they have in common, the more similar they are. The important point is that these substrings need not be contiguous.

'''How to Make String Kernels Fast?'''

String kernels have an inherent disadvantage: they are slow. The dynamic-programming computation of the SSK runs in O(n|s||t|) time, where n is the length of the subsequences and |s| and |t| are the document lengths, which is still quadratic in document length. For larger amounts of training data the authors therefore propose an approximate method. In the other paper, the authors do some clever preprocessing: by sorting the non-zero entries of the two documents' representations beforehand, the time taken to compute the similarity between the documents becomes linear in the combined number of non-zero entries.
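
As a toy illustration of the contiguous special case (a sketch, not the gap-weighted SSK dynamic program from the paper), the p-spectrum kernel simply takes the inner product of the two documents' p-gram count vectors:

<pre>
from collections import Counter

def spectrum_kernel(s, t, p=3):
    """p-spectrum kernel: inner product of the p-gram count vectors of s and t,
    i.e. the number of shared contiguous substrings of length p, with multiplicity."""
    grams_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    grams_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(grams_s[g] * grams_t[g] for g in grams_s)

print(spectrum_kernel("science is organized knowledge",
                      "wisdom is organized life", p=3))
</pre>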

=== 13 Feb: Heat Kernels (Arvind) ===
This paper is about using heat kernels for a text classification problem. Like any other kernel, a heat kernel measures the similarity between two points, but here the similarity is problem dependent. Heat kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, which means that there is no single heat kernel applicable to all problems: heat kernels depend on the geometry of the data. To use these kernels, we first hypothesize about this geometry (the Riemannian manifold), which is equivalent to hypothesizing about the distribution of the data. Once we know the distribution (by assumption, of course), we can derive the heat kernel for that distribution (or for the Riemannian manifold it defines) and use it. One nice property of heat kernels is that they take us beyond the standard Euclidean metric and let us define the distance between two points based on the actual geometry of the data.
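
For intuition, the discrete analogue on a graph (a sketch, not the statistical-manifold construction in the paper; assumes NumPy and SciPy) is obtained by exponentiating the negative graph Laplacian, <math>K = e^{-\beta L}</math>; the entry <math>K_{ij}</math> measures how much heat diffuses from node i to node j in time <math>\beta</math>:

<pre>
import numpy as np
from scipy.linalg import expm

# Adjacency matrix of a small path graph: 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A   # graph Laplacian

beta = 0.5                       # diffusion time
K = expm(-beta * L)              # heat (diffusion) kernel; symmetric PSD

# Nearby nodes end up more "similar" than distant ones.
print(K[0, 1], ">", K[0, 3])
</pre>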

=== 22 Feb: Rational Kernels (Sid) ===
This paper describes a way to create kernels based on ''weighted finite-state transducers'' (FSTs). These kernels are meant for data in the form of sequences of symbols (i.e., strings). The basic idea is that a notion of similarity between symbol sequences can be captured using a weighted FST, which has weights associated with its states and state transitions. It is therefore possible to compute a weight for a specific output generated by the FST for a given input, and for a given task the weight the FST assigns to a pair of strings can be viewed as the similarity between the two data points. This forms the basis of the work: the creation of kernels for strings using weighted FSTs. The name ''rational'' comes from the "rational relations or rational transductions represented by the weighted transducer".

Because the similarity functions defined by weighted FSTs are not guaranteed to be ''positive definite and symmetric'', they do not always define a kernel. The paper therefore provides guidelines for constructing ''positive definite symmetric rational kernels'' from weighted FSTs. Finally, it shows that several existing kernels can be reformulated as rational kernels.

=== 27 Feb: Fisher Kernels and Finite State Automata (Amit) ===
This paper shows how the generation of documents can be modeled as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be recovered. The probabilistic modeling approach suggests extending the Markov process to consider subsequences of varying length, rather than the fixed length used in the standard string kernel. The authors use the Fisher score and the Fisher information to construct a Fisher kernel, and then use it to derive the n-gram and string kernels. They then extend Fisher kernels to finite state automata, giving a more flexible way to treat subsequences of different lengths as features. The advantage is that we do not need to choose the value of k a priori.

=== 7 Mar: Learning the kernel matrix with SDP (Piyush) ===

Choosing a kernel for a given problem requires some sort of prior knowledge: typically a) we need to be able to specify the geometry of the feature space, and b) we need some notion of similarity between input instances in that space. This boils down to a model selection task, which is something we would generally like to avoid. This begs the question: can we "learn" the kernel automatically? The answer, for certain specific problem settings, is yes! The problem is further simplified by the observation that, in many situations, we are not actually required to learn the kernel function explicitly over the entire input space. For example, in a transductive setting we are given a set of labeled training data and another set of unlabeled test data; instead of learning an explicit labeling function, the task is just to complete the labeling of the unlabeled part. Our task therefore reduces to learning the Gram matrix (i.e., the kernel matrix) associated with the given points, rather than the kernel function itself.

'''Semidefinite programming:''' Semidefinite programming deals with optimizing convex functions over the convex cone of symmetric, positive semi-definite matrices. A semidefinite program consists of a linear objective, [http://en.wikipedia.org/wiki/Linear_matrix_inequality linear matrix inequality (LMI)] constraints, and affine equality constraints (of the form Ax = b).

'''Learning the kernel matrix with SDP:''' It can be shown that learning the kernel matrix requires us to minimize a "generalized performance measure"

<math>\omega_{C,\tau}(K) = \max_{\alpha} \left\{ 2\alpha^T\textbf{e} - \alpha^T(G(K) + \tau I)\alpha \ :\ C \geq \alpha \geq 0 ,\ \alpha^T y = 0 \right\} </math>

on the training data, with respect to the kernel matrix <math>K</math> ranging over some convex subset <math>\kappa</math> of positive semi-definite matrices. The above is the standard dual formulation of the (general) maximum margin problem, where <math>G_{ij}(K) = [K]_{ij}y_iy_j = k(x_i,x_j)y_iy_j</math> and the other parameters have their usual meanings: <math>\alpha</math> are the Lagrange multipliers, <math>C</math> is the soft margin parameter, <math>\tau \geq 0</math>, and <math>\textbf{e}</math> is the vector of all ones. The minimization problem can then be written as

<math>\min_{K \in \kappa}\ \omega_{C,\tau}(K_{tr})\ \ \mathrm{s.t.}\ \ \mathrm{trace}(K) = c</math>

where <math>K_{tr}</math> is the part of the kernel matrix associated with the training data.

This is, in general, a difficult convex optimization problem. However, as we shall see in the talk, problems of this type can be equivalently cast as an SDP (whose constraints are LMIs rather than general convex functions) using tools from standard convex analysis such as Lagrangian duality and the [http://en.wikipedia.org/wiki/Schur_complement Schur complement] lemma. The good news is that there exist techniques for solving the resulting SDP efficiently.

Such a transformation can also be achieved when the kernel matrix <math>K</math> is constrained to a linear subspace, <math>K = \sum_{i=1}^m \mu_i K_i</math>. It turns out that SDP formulations for specific problems such as the hard margin and 1-norm/2-norm soft margin problems can be recovered from the SDP formulation of the above problem. We can also learn the 2-norm soft-margin parameter <math>C</math> using SDP.
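
To get a feel for learning a convex combination <math>K = \sum_i \mu_i K_i</math> without an SDP solver, here is a lightweight stand-in (not the paper's SDP formulation): it maximizes the alignment between <math>K</math> and the label matrix <math>yy^T</math> over the simplex using SciPy, with a set of base kernels chosen purely for illustration:

<pre>
import numpy as np
from scipy.optimize import minimize

def alignment(mu, Ks, yyT):
    """Frobenius alignment between K = sum_i mu_i K_i and the label matrix y y^T."""
    K = sum(w * Ki for w, Ki in zip(mu, Ks))
    return np.sum(K * yyT) / (np.linalg.norm(K) * np.linalg.norm(yyT))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=60))
yyT = np.outer(y, y)

# Candidate base kernels: linear, quadratic, and an RBF.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Ks = [X @ X.T, (X @ X.T) ** 2, np.exp(-0.5 * sq)]

# Maximize alignment over the simplex (mu >= 0, sum mu = 1).
cons = ({'type': 'eq', 'fun': lambda mu: mu.sum() - 1.0},)
res = minimize(lambda mu: -alignment(mu, Ks, yyT), x0=np.ones(3) / 3,
               bounds=[(0, 1)] * 3, constraints=cons)
print("learned kernel weights:", np.round(res.x, 3))
</pre>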

==Past Semesters==

* [[MLRG/summer07|Summer 2007]]
* [[MLRG/fall07|Fall 2007]]

[[Category:Seminars]]