MLRG/summer08


Machine Learning Reading Group

Informal reading group at Utah on machine learning topics (led by Hal). Papers, discussion, etc. will be posted here.

Time: Wed 3:40-5:00pm (see schedule for each week)

Topic: Multitask learning, transfer learning and domain adaptation

Location: Graphics Annex

Schedule

Historical Interest

W 07 May (Hal):
  Rich Caruana (1997). Multitask Learning. Machine Learning, 28:41-75.
  Sebastian Thrun (1996). Is Learning the nth Thing Any Easier than Learning the First? Advances in NIPS, 640-646.

The Bayesian Way

W 14 May (Cecily):
  Rajat Raina, Andrew Y. Ng and Daphne Koller (2006). Constructing Informative Priors Using Transfer Learning. ICML.

W 21 May (Amit):
  Kai Yu, Volker Tresp and Anton Schwaighofer (2005). Learning Gaussian Processes from Multiple Tasks. ICML.

W 28 May (Nathan):
  Ya Xue, Xuejun Liao, Lawrence Carin and Balaji Krishnapuram (2007). Multi-Task Learning for Classification with Dirichlet Process Priors. JMLR.

The Kernel Way

W 04 Jun (Arvind):
  Theodoros Evgeniou, Charles Micchelli and Massimiliano Pontil (2005). Learning Multiple Tasks with Kernel Methods. Journal of Machine Learning Research, 6:615-637.

Learning Theory and Generalization Bounds

W 11 Jun (Piyush):
  Shai Ben-David and Reba Schuller (2003). Exploiting Task Relatedness for Multiple Task Learning. (NIPS version; longer COLT version.)

W 25 Jun (Arvind):
  Abernethy, Bartlett and Rakhlin (2007). Multitask Learning with Expert Advice. Technical report, UC Berkeley.

W 2 Jul (Cecily):
  M. M. Mahmud and Sylvian Ray (2007). Transfer Learning Using Kolmogorov Complexity: Basic Theory and Empirical Evaluations. NIPS.

Domain Adaptation

W 16 Jul (Seth):
  Rie Kubota Ando and Tong Zhang (2005). A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, 6:1817-1853.

W 23 Jul (Nathan):
  John Blitzer, Ryan McDonald and Fernando Pereira (2006). Domain Adaptation with Structural Correspondence Learning. EMNLP.

W 30 Jul (Piyush):
  Shai Ben-David, John Blitzer, Koby Crammer and Fernando Pereira (2006). Analysis of Representations for Domain Adaptation. NIPS.
  John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira and Jenn Wortman (2007). Learning Bounds for Domain Adaptation. NIPS.

W 06 Aug (Seth):
  Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau and Motoaki Kawanabe (2007). Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation. NIPS.

W 13 Aug (Amit):
  Steffen Bickel, Michael Bruckner and Tobias Scheffer (2007). Discriminative Learning for Differing Training and Test Distributions. ICML.
  See also: NIPS 2006 Workshop on Learning when Test and Training Inputs Have Different Distributions.

W 20 Aug (Nathan):
  Jiayuan Huang, Alex Smola, Arthur Gretton, Karsten Borgwardt and Bernhard Scholkopf (2007). Correcting Sample Selection Bias by Unlabeled Data. NIPS.

Paper Summaries

14 May 08: Constructing Informative Priors Using Transfer Learning (Cecily)

This paper uses the 20 Newsgroups data; the authors learn weights for term vectors using logistic regression. (I had a much nicer summary, but, being as wiki-challenged as I am, I lost it when my hotspot internet connection expired, and I don't have time to write a new one.)

21 May 08: Learning Gaussian Processes from Multiple Tasks (Amit)

Introduction: The multi-task learning problem is formulated here in a hierarchical Bayesian framework. The framework exploits the equivalence between parametric linear models and nonparametric Gaussian processes.

Structure of the Paper: Section 2 shows how linear models can be viewed as Gaussian Processes (GPs). GPs have two advantages over linear models. First, a GP works directly on the kernel matrix over the finite set of training points, so model complexity is independent of the dimensionality of x; this lets us work even in an infinite-dimensional feature space. Second, non-linear functions can be handled through the kernel K(x,x') = <φ(x),φ(x')>. Section 3 shows how to learn from multiple tasks in a probabilistic framework. Section 4 gives a generative story that uses linear models for multi-task learning; the hyperprior over the tasks is a normal-inverse-Wishart distribution, the conjugate prior for a multivariate Gaussian. An EM algorithm is used to estimate the parameters, and the complexity of this model is O(kmd^3), where m is the number of tasks, k is the number of EM iterations and d is the dimensionality of the data. Section 5 uses Gaussian processes to derive models for both the transductive and the inductive setting. This model is useful only when n << d, i.e., when the number of training points is much smaller than the dimensionality of the feature space: the advantage of the Gaussian process is that its complexity, O(kmn^3), depends only on the training set size rather than on the dimensionality. The rest of the paper covers experiments on a toy problem and a text categorization task, where the Gaussian process model beats the other models.
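To make the linear-model/GP equivalence from Section 2 concrete, here is a minimal sketch (my own illustration, not code from the paper): with a standard normal prior on the weights of f(x) = w'x, GP regression with the linear kernel K = XX' gives exactly the same predictive mean as Bayesian linear regression, while only ever touching the n x n kernel matrix.

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 20, 5
  X = rng.normal(size=(n, d))
  y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
  noise = 0.1 ** 2

  # Function-space (GP) view: cost depends on n, not on d.
  K = X @ X.T                                   # linear kernel K(x,x') = <x,x'>
  alpha = np.linalg.solve(K + noise * np.eye(n), y)
  x_star = rng.normal(size=d)
  gp_mean = (X @ x_star) @ alpha                # k*' (K + s^2 I)^{-1} y

  # Weight-space view: posterior mean of w under the same model.
  w_post = np.linalg.solve(X.T @ X + noise * np.eye(d), X.T @ y)
  print(np.allclose(gp_mean, w_post @ x_star))  # True: the two views agree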


28 May 08: Multi-Task Learning for Classification with Dirichlet Process Priors (Nathan)

The Goal

The authors consider two cases of the multi-task learning problem. The first is symmetric multi-task learning (SMTL), which deals with training classifiers for a set of tasks jointly. The second is asymmetric multi-task learning (AMTL), which uses the posterior density function from the SMTL case as the prior for a new task (not previously seen in the SMTL step). The latter has the advantage of not requiring all the data from the previous M tasks and is thus arguably more efficient.

The Paper

The authors present an inference algorithm for each situation described above; both are applications of hierarchical Bayesian models utilizing Dirichlet Process priors (as the title might suggest :). In the symmetric case, a truncated stick-breaking formulation of the Dirichlet Process is used to share information across tasks in the form of a prior distribution G, from which the parameters of each task-specific logistic regression classifier are drawn. In short, the classifiers are trained on local (task-specific) training data but also have access to the other tasks via the parameters drawn from the DP prior G. To actually compute G, an approximation method called variational inference is used; it will be described briefly.
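For intuition, here is a sketch of the truncated stick-breaking construction of G (my own illustration; the concentration parameter alpha0, truncation level T and parameter dimensionality d are arbitrary choices, not values from the paper):

  import numpy as np

  rng = np.random.default_rng(0)
  alpha0, T, d = 1.0, 20, 10            # concentration, truncation level, parameter dim

  v = rng.beta(1.0, alpha0, size=T)     # stick-breaking proportions
  v[-1] = 1.0                           # close the stick at the truncation level
  pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # mixture weights
  theta = rng.normal(size=(T, d))       # atoms: candidate classifier parameters

  # Each task draws its logistic-regression parameters from G; tasks that
  # land on the same atom share a classifier, which is how information flows.
  task_params = theta[rng.choice(T, size=5, p=pi)]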

For the asymmetric case, essentially, the SMTL algorithm is run for M tasks, and the posterior from that step is used to formulate a prior for the (M+1)-th task. Inference over this model uses a Markov Chain Monte Carlo algorithm, namely Metropolis-Hastings. The AMTL approach is relatively inexpensive if one already happens to have the posterior from an SMTL experiment lying around.

The Dirichlet Process and its formulation will also be addressed, along with a criterion, also presented in this paper, for determining whether tasks are similar.


Other useful links

Overview of Dirichlet Processes

Variational Inference in DP Mixtures

Lots of Variational inference links

June 4, 08 : Learning Multiple Tasks with Kernel Methods (Arvind)

So far we have been looking at Bayesian ways of doing multi-task learning. In this paper we see how kernels can be used to learn multiple tasks. We all know how to use a kernel to learn a single task (or function); the authors extend this idea to learning multiple tasks simultaneously using special kernels called multi-task kernel functions.

The authors first consider the linear case and show that if all the functions (tasks) are assumed to be linear (f_l(x) = u_l'x), the multi-task learning problem can be formulated as a regularization problem (equation 6 in the paper):

R(u)=\frac{1}{mn}\sum_{l\in N_n}\sum_{j\in N_m}L(y_{jl},u_l' x_{jl})+ \gamma J(u) -- eq(1)

The regularizer J(u) controls the degree of relatedness among the tasks. Since it is hard to say anything about the regularizer when the problem is cast this way, the authors present another formulation that is more intuitive and gives a better picture of task relatedness. We still assume linear functions. In the new formulation, the weight vectors of all the tasks are combined into a single vector w that is common to all tasks; the linear function is f_l(x) = w'B_l x, where w is shared across problems and B_l is task specific. Using this form of the function gives the following formulation of the multi-task problem:

S(w)=\frac{1}{mn}\sum_{l\in N_n}\sum_{j\in N_m}L(y_{jl},w'B_lx_{jl})+ \gamma w'w -- eq(2)

The authors show that these two formulations are related: in eq(1) task relatedness is encoded by the regularizer J(u), while in eq(2) it is encoded by B_l.
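As a small illustration of how B_l encodes task relatedness (my own sketch, not one of the paper's constructions), take a B_l that stacks a block shared by all tasks on top of a task-private block; the induced multi-task kernel K((x,l),(z,q)) = (B_l x)'(B_q z) then gives same-task pairs an extra similarity term:

  import numpy as np

  n_tasks, d = 3, 4

  def B(l):
      # First d rows: shared component; remaining rows: task l's private block.
      shared = np.eye(d)
      private = np.zeros((n_tasks * d, d))
      private[l * d:(l + 1) * d] = np.eye(d)
      return np.vstack([shared, private])

  def mt_kernel(x, l, z, q):
      return (B(l) @ x) @ (B(q) @ z)    # works out to <x,z> * (1 + [l == q])

  x = np.random.default_rng(0).normal(size=d)
  print(mt_kernel(x, 0, x, 0), mt_kernel(x, 0, x, 1))   # 2*||x||^2 vs ||x||^2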

Later, the authors work through three example instantiations for three different choices of B_l, giving for each the relation between J(u) and B_l. Finally, to show the merits of the algorithm, they compare it against a single-task kernel machine (learning one function per task) and a one-task baseline (learning a single function for all tasks on the combined data).

June 11, 08 : Exploiting Task Relatedness for Multiple Task Learning (Piyush)

So far, we have seen empirical evidence that multitask learning can outperform single task learning in certain cases. This paper is among the first to provide theoretical justification for this fact.

The paper presents a framework that views task relatedness not from an algorithmic perspective but on the basis of the similarity of the example-generating distributions underlying the tasks. It defines a data generation mechanism that determines the notion of task relatedness. Essentially, this definition makes use of a set of transformations F: it assumes that the data for each task can be thought of as generated by picking a transformation function from this set and applying it to some "fixed" distribution P. Although this notion doesn't capture task relatedness in all MTL scenarios, it holds for several interesting real-world MTL problems, and it makes it possible to give a generalization error bound for each task. This is in contrast with a previous result (corollary 13 in this paper) which only bounds the average generalization error over the different tasks.

The paper defines a generalized VC-dimension parameter for the MTL case and shows that it can be significantly smaller than the standard VC-dimension. It analyzes the relationship between this parameter and the standard VC-dimension, and gives conditions under which the generalization guarantees for MTL (in terms of the per-task sample size needed) are better than those for single task learning.

June 25, 08 : Multitask Learning with Expert Advice (Arvind)

In this paper we see how to do multitask learning with expert advice in an online setting (sequential prediction). The problem can be stated as follows: a forecaster must make a sequence of predictions given access to a number of experts, each of which makes its own prediction at every round. The forecaster combines the experts' predictions to form its own, taking into account each expert's past performance. There are two variations of this online problem: the sequential multitask problem, where task-specific examples come in sequence (the forecaster makes a sequence of predictions on task one, then another sequence on task two, and so on), and the shifting multitask problem, where the order of the tasks is arbitrary.

The single-task version of the sequential prediction problem has only one task and many experts; the forecaster makes predictions taking each expert's opinion into account (see the sketch below). In the multitask version there are many (K) tasks, and the authors point out a trade-off between the number of experts used for prediction and the complexity of the prediction algorithm. At one extreme, the number of experts equals the number of tasks and each expert predicts for its own task (the unrelated case); at the other, a single expert predicts for every task (the fully related case). We are interested in the case where the number of experts m is more than one but less than the number of tasks.
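For reference, the single-task forecaster looks roughly like the following (a standard exponentially weighted average forecaster with square loss; this is the building block, not the paper's hyperexpert algorithm, and the learning rate eta is an arbitrary choice):

  import numpy as np

  def forecast(expert_preds, outcomes, eta=0.5):
      """expert_preds: (rounds, N) predictions in [0,1]; outcomes: (rounds,)."""
      cum_loss = np.zeros(expert_preds.shape[1])
      my_preds = []
      for preds, y in zip(expert_preds, outcomes):
          w = np.exp(-eta * cum_loss)
          w /= w.sum()                  # weight experts by past performance
          my_preds.append(w @ preds)    # weighted-average prediction
          cum_loss += (preds - y) ** 2  # square loss per expert
      return np.array(my_preds)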

Once the number of experts to be used is fixed, we can define a mapping π from each task to one of the m experts; such a mapping is called a hyperexpert. Given access to N experts, we would like to predict using these hyperexperts (each hyperexpert consists of an m-sized subset of the N experts together with a mapping from tasks to experts). Since there are exponentially many subsets, it is intractable to enumerate all the hyperexperts and predict the way we did in the single-task version. The authors provide an algorithm that uses sampling techniques to approximate the prediction that is hard to compute in its exact form, and they show experimentally that the algorithm gives good approximations and converges quickly to the exact case.


July 17, 08 : A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data (Seth)

This paper tackles semi-supervised learning by augmenting a target labeled problem with related problems generated from unlabeled data. The idea is that a general lower-dimensional predictive structure Θ can be learned from the related unlabeled auxiliary problems and used to improve performance on the target labeled problem.

The first part of the paper delves into the theoretical foundations of supervised learning and extends them to the multi-task case. The standard method of finding a predictor f\in \mathcal{H} given a set of training examples \{(\mathbf{X}_i, Y_i)\} is to minimize a particular loss function L:

\hat{f} = \arg \min_{f \in \mathcal{H}}\sum_{i=1}^{n} L(f(\mathbf{X}_i), Y_i)

In the multi-task setting this is easily extended to each of the m tasks (indexed by l) as follows:

\hat{f}_{l,\theta} = \arg \min_{f \in \mathcal{H}_{l, \theta}}\sum_{i=1}^{n_l} L(f(\mathbf{X}_i^{l}), Y_i^{l})

where there are n_l data samples for each of the m tasks, given a fixed structural parameter θ.

An algorithm is subsequently proposed, using linear models, that finds the common low-dimensional feature space θ shared by the problem set via the following joint formulation

[\hat{\theta},\{\hat{f}_l\}] = \arg \min_{\theta \in \Gamma, \{f_l \in \mathcal{H}_{l, \theta}\}}\sum_{l=1}^{m} \frac{1}{n_l} \sum_{i=1}^{n_l} L(f_l(\mathbf{X}_i^{l}), Y_i^{l})

where the functions f \in \mathcal{H}_{l, \theta} are linear functions with a shared parameter θ.

Using that formulation, the paper then describes an alternating structure optimization (ASO) in which some of the parameters are held fixed while the others are optimized, alternating between the two groups. This yields the algorithm used in the experiments.
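The alternating scheme can be sketched as follows (my simplified reading, with square loss and ridge penalties standing in for the paper's choices; predictors take the form u_l = w_l + theta' v_l with theta orthonormal):

  import numpy as np

  def aso(tasks, h=2, iters=10, lam=1e-2):
      """tasks: list of (X_l, y_l) pairs; returns theta (h x d).
      Assumes h <= number of tasks."""
      d = tasks[0][0].shape[1]
      theta = np.linalg.qr(np.random.randn(d, h))[0].T   # random orthonormal init
      for _ in range(iters):
          U_eff = []
          for X, y in tasks:
              # theta fixed: per-task ridge regression on [x ; theta x] features.
              Z = np.hstack([X, X @ theta.T])
              w = np.linalg.solve(Z.T @ Z + lam * np.eye(d + h), Z.T @ y)
              U_eff.append(w[:d] + theta.T @ w[d:])      # effective d-dim predictor
          # Predictors fixed: theta spans the top-h directions of the stacked u_l.
          U, _, _ = np.linalg.svd(np.array(U_eff).T, full_matrices=False)
          theta = U[:, :h].T
      return theta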

The paper then examines how to create these auxiliary problems, using two methodologies: partially supervised and unsupervised.

Using these tools, the paper then presents the results of three different experiments and compares them to other similar algorithms. The first was text categorization using the 20 Newsgroups and Reuters-RCV1 corpora. The second was named entity chunking using the CoNLL'03 shared task corpora. The final experiment was handwritten digit classification using the MNIST data.

Overall, this method proved to be extremely useful when the number of available labeled samples was low.

July 28, 08 : Domain Adaptation with Structural Correspondence Learning (Nathan)

In many Natural Language Processing tasks we are confronted with the problem of lacking adequate (read: nonexistent) labeled data for a particular domain. To combat this problem, the authors have developed a domain adaptation method, Structural Correspondence Learning (SCL), that improves the performance of classifiers on the target domain without the need for labeled target data.

In short, SCL is a method of inducing a set of cross-domain features. This paper works the example of Part of Speech (PoS) tagging, where the job is to produce the correct part of speech tag for each word in a sentence. A lot of work in PoS tagging has focused on a few select domains, such as news articles from the Wall Street Journal or the New York Times, hence there are labeled data sets for those domains. What if we want to tag something other than news articles, say the MEDLINE corpus? This is where SCL helps us. The task of tagging news articles is called the supervised task and the task of tagging the medical corpus is the target task. SCL models correlations between the supervised and target features, identifying what the authors call pivot features, which serve as a foundation for adapting to the new domain. Pivot features are features that behave in the same manner across both domains.

SCL Algorithm

1. Choose m pivot features. Create m binary prediction problems p_{l}(x), l = 1 \dots m.
2. For l = 1 to m:
    \hat{w}_{l} = \arg\min_{w} \left(\sum_{j} L(w \cdot x_{j}, p_{l}(x_{j})) + \lambda \|w\|^{2}\right)
3. W = [\hat{w}_{1} | \hat{w}_{2} | \dots | \hat{w}_{m}]; \; [U\,D\,V^{T}] = \mbox{SVD}(W); \; \theta = U^{T}_{[1:h,:]}
4. Return f, a predictor trained on the original features x_t and the new shared features \theta x_t.

To learn correlations between pivot and non-pivot features, alternating structural optimization (ASO) is used, the method Seth discussed last seminar. There were some differences between the Ando and Zhang ASO formulation and this paper's. For instance, the low-rank dimension of the feature representation (the value h above) was set to 25 to give a faster run-time. Only the positive entries in the pivot weight vectors were used to compute the SVD, again for time and space reasons. One further optimization: only half of the ASO features were used in the final model; these features were selected by training with stochastic gradient descent and then keeping the features with the largest weight variance across the different labels. Also, to duplicate the Ando/Zhang results, the authors had to rescale the projection features to have an average L1 norm on the training set five times that of the binary-valued features; the rescaling parameter was determined on a tuning set.
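A bare-bones rendering of the four steps (my own sketch: ridge regression stands in for the modified-Huber pivot predictors, and h and the pivot set are placeholders; none of the refinements above are included):

  import numpy as np

  def scl_projection(X_unlabeled, pivot_ids, h=25, lam=1.0):
      """X_unlabeled: (n, d) features pooled from BOTH domains.
      pivot_ids: indices of the m pivot features. Returns theta (h x d)."""
      n, d = X_unlabeled.shape
      W = []
      for p in pivot_ids:
          X = X_unlabeled.copy()
          y = X[:, p].copy()
          X[:, p] = 0.0                 # mask the pivot being predicted
          w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
          W.append(w)                   # steps 1-2: one predictor per pivot
      U, _, _ = np.linalg.svd(np.array(W).T, full_matrices=False)
      return U[:, :h].T                 # step 3: theta from the SVD of W

  # Step 4: train the final model on np.hstack([X_train, X_train @ theta.T]).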

July 31, 08 : Analysis of Representations and Learning Bounds for Domain Adaptation (Piyush)

Analysis of Representations: The first paper analyzes (theoretically and empirically) the conditions under which effective domain adaptation is possible. It is shown that a good feature representation, shared by both the source and the target domain, is crucial. Choosing a common representation is expected to help because it makes the two domains appear to have similar distributions, thereby enabling effective domain adaptation. The paper formalizes this intuition by providing a theoretical generalization bound on the target domain. The bound (see below) consists mainly of two terms: the empirical error of the chosen hypothesis on the source domain, and a measure of divergence between the distributions associated with the source and target domains. Conventionally, the divergence measure is defined in terms of the L1 (variational) distance, which cannot be computed from finite samples. The authors therefore propose an alternative, the \mathcal{A}-distance, which acts as a proxy to the L1 distance and can be computed using finite samples from the two distributions. Using this approximation, the generalization bound comes out to be (with probability 1 - δ):

\epsilon_T(h) \le \hat{\epsilon}_S(h) + \sqrt{\frac{4}{m}\left(d\log \frac{2em}{d} + \log \frac{4}{\delta}\right)} + \lambda + d_{\mathcal{H}}(\tilde{\mathcal{U}}_S,\tilde{\mathcal{U}}_T) + 4 \sqrt{\frac{d \log(2m') + \log \frac{4}{\delta}}{m'}}

As mentioned above, the first and the fourth terms of this expression are the ones of interest. m is the number of labeled examples (under the feature representation) from the source domain, e is the base of the natural log, d is the VC dimension of the considered hypothesis space \mathcal{H}, λ (assumed to be small) is a measure of the closeness of the hypothesis h to the true labeling function, d_{\mathcal{H}} denotes the \mathcal{A}-distance with respect to our hypothesis space \mathcal{H}, and \tilde{\mathcal{U}}_S and \tilde{\mathcal{U}}_T are unlabeled samples of size m' each, from the source and target respectively.

Finally, it is also shown that computing the \mathcal{A}-distance between two distributions is closely related to learning a classifier that achieves minimum error on the binary classification problem of separating points generated from the two distributions.

d_{\mathcal{H}}(\tilde{\mathcal{U}}_S,\tilde{\mathcal{U}}_T)  = 2(1 - 2\min_{h'\in\mathcal{H}}err(h'))

The authors also note that since it is NP-hard to even approximate the error of the optimal hyperplane classifier (for arbitrary distributions), they approximate it by minimizing a convex upper bound on the error (as we do in standard classification settings); in particular, they minimize a modified Huber loss using stochastic gradient descent.
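In code, the proxy looks roughly like this (my own sketch; sklearn's SGDClassifier with modified_huber loss approximately matches the loss being minimized, and using training error rather than held-out error is a simplification):

  import numpy as np
  from sklearn.linear_model import SGDClassifier

  def proxy_a_distance(Us, Ut):
      """Us, Ut: unlabeled source/target samples under the representation."""
      X = np.vstack([Us, Ut])
      y = np.concatenate([np.zeros(len(Us)), np.ones(len(Ut))])
      clf = SGDClassifier(loss="modified_huber").fit(X, y)
      err = np.mean(clf.predict(X) != y)   # error separating the two domains
      return 2.0 * (1.0 - 2.0 * err)       # near-chance error => small distance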

The analysis also justifies why the Structural Correspondence Learning (SCL) technique (discussed last week) works well in practice for domain adaptation. The experiments section considers the PoS tagging problem of adapting from Wall Street Journal (WSJ) text to MEDLINE abstracts. The authors choose two feature representations, random projections and SCL, and show that while SCL effectively achieves low values for both relevant terms in the generalization bound (the \mathcal{A}-distance and the empirical source error), random projections can only minimize the \mathcal{A}-distance, not the empirical source error. This observation is supported by the resulting empirical target errors: SCL gives a much smaller empirical target error than random projections. The experimental results thus bear out what the theoretical generalization bound suggests.

Important note: Today's second paper actually suggests a correction to the above generalization bound, using \hat{d}_{\mathcal{H}\Delta\mathcal{H}} instead of d_{\mathcal{H}}. \mathcal{H}\Delta\mathcal{H} is the symmetric difference hypothesis space, \mathcal{H}\Delta\mathcal{H} = \{h\oplus h':h,h' \in \mathcal{H}\}; it is essentially the set of hypotheses that label as positive exactly those points on which a pair of hypotheses in \mathcal{H} disagree.

Learning Bounds: The second paper considers the scenario where there is a very small amount of labeled data for the target domain but a large amount for the source domain. The paper argues that minimizing only the empirical target risk in such a situation may not be a good idea, and it therefore proposes an algorithm that minimizes a convex combination of the source and target empirical risks:

\hat{\epsilon}_{\alpha}(h) = \alpha\hat{\epsilon}_{T}(h) + (1 - \alpha)\hat{\epsilon}_{S}(h)

Further assume we have βm labeled examples from the target and (1 - β)m labeled examples from the source domain, along with m' unlabeled examples from each domain. Then a bound on the target risk holds (with probability 1 - δ) for a domain adaptation algorithm that minimizes \hat{\epsilon}_{\alpha}(h):

\epsilon_T(\hat{h}) \leq \epsilon_T(h^*_T) + 2\sqrt{\frac{\alpha^2}{\beta} + \frac{(1-\alpha)^2}{(1-\beta)}}\sqrt{\frac{d \log(2m) - \log \delta}{2m}} + 2(1-\alpha)\left(\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\tilde{\mathcal{U}}_S,\tilde{\mathcal{U}}_T) + 4\sqrt{\frac{d \log(2m') + \log ( \frac{4}{\delta} )}{m'}} + \lambda \right)

where \hat{h} is the empirical minimizer of \hat{\epsilon}_{\alpha}(h) over all the labeled data (source and target), and h^*_T is the target risk minimizer.

Since most of the terms in the above bound are either computationally intractable or very large (for instance, the VC-dimension parameter d), the authors propose approximations for them. λ is very small and can be neglected; \hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\tilde{\mathcal{U}}_S,\tilde{\mathcal{U}}_T) can be approximated by a quantity \zeta(\tilde{\mathcal{U}}_S,\tilde{\mathcal{U}}_T) between 0 and 1 (0 means the distributions are indistinguishable, 1 that they are perfectly separable); and the VC-dimension parameter d is replaced by a tighter constant C. The resulting approximation to the above bound can then be written as:

f(\alpha) = \sqrt{\frac{C}{m}\left(\frac{\alpha^2}{\beta} + \frac{(1-\alpha)^2}{(1-\beta)}\right)} + (1-\alpha)\zeta(\tilde{\mathcal{U}}_S,\tilde{\mathcal{U}}_T)

The authors finally evaluate the goodness of the predicted bound given by the above expression by looking at the achieved empirical target error on a sentiment classification task. Various domain adaptation settings are considered: (a) different distribution distances, (b) different numbers of target instances, (c) different numbers of source instances. In all cases, the value of α that minimizes the generalization bound above also minimizes the empirical target error. Thus minimizing f(α) with respect to α can help in choosing the "correct" α (see the sketch below), and a classifier trained to minimize \hat{\epsilon}_{\alpha}(h) can be expected to work well in practice.
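Since f(α) is a one-dimensional function of known quantities, choosing α amounts to a grid search (a direct transcription of the expression above; C and ζ would be estimated from the data):

  import numpy as np

  def best_alpha(beta, m, zeta, C=1.0):
      """beta: fraction of labeled data from the target; m: total labeled size."""
      alphas = np.linspace(0.0, 1.0, 101)
      f = (np.sqrt(C / m * (alphas ** 2 / beta + (1 - alphas) ** 2 / (1 - beta)))
           + (1 - alphas) * zeta)
      return alphas[np.argmin(f)]

  # e.g. best_alpha(beta=0.05, m=2000, zeta=0.7) leans heavily on the target data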

Finally, the authors generalize this setting to the problem of learning from multiple sources (instead of just one source and one target): an empirical α-weighted error (again a convex combination) is considered, and a generalization bound is given for this setting.

August 5, 08 : Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation (Seth)

The authors describe a method whereby the effects of covariate shift between the training and test data are alleviated through importance estimation. A couple of definitions:

Covariate Shift: the phenomenon in which the training data distribution p_{tr}(\mathbf{x}) differs from the test data distribution p_{te}(\mathbf{x}) while the conditional distribution of the output values P(y|\mathbf{x}) remains unchanged.

Importance: To alleviate the covariate shift problem, weighted variants of the original distribution p_{tr}(\mathbf{x}) are used in order to restore consistency. The weights w(\mathbf{x}) are called the importance and are estimated as follows:

w(\mathbf{x}) = \frac{p_{te}(\mathbf{x})}{p_{tr}(\mathbf{x})}

The authors note that the naive approach to estimating w(\mathbf{x}) is to estimate the training and test densities independently and then take the corresponding ratio. The goal of this paper is to estimate w(\mathbf{x}) without having to estimate either distribution.

The proposed method is an optimization problem that minimizes the Kullback-Leibler divergence from the true test input density to its estimate. In other words, minimize

KL[p_{te}(\mathbf{x}) || \hat{p}_{te}(\mathbf{x})]

where

\hat{p}_{te}(\mathbf{x}) = \hat{w}(\mathbf{x})p_{tr}(\mathbf{x})

To model w(\mathbf{x}), the authors choose a standard linear model

w(\mathbf{x}) = \sum_{\ell = 1}^{b}\alpha_\ell\varphi_\ell(\mathbf{x})

where \{\alpha_\ell\}_{\ell=1}^{b} are learned parameters and \{\varphi_\ell(\mathbf{x})\}_{\ell=1}^{b} are basis functions satisfying \varphi_\ell(\mathbf{x}) \geq 0 for all \mathbf{x} \in \mathcal{D} and \ell = 1,2,\ldots,b (a Gaussian kernel is used later).

The paper then describes the actual optimization problem and gives a general algorithm for estimating w(\mathbf{x}) (sketched below). It also supplies a method for choosing a proper basis function \varphi_\ell(\mathbf{x}) using likelihood cross-validation (LCV). Finally, the authors show empirical results, first on an artificial data set and then on regression and classification benchmark data sets.
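A minimal KLIEP-style iteration might look like this (my own sketch: Gaussian basis functions centred on the test points, a crude projected-gradient ascent in place of the paper's exact updates, and a fixed kernel width where the paper would use LCV):

  import numpy as np

  def kliep(X_tr, X_te, width=1.0, steps=200, lr=1e-3):
      phi = lambda X, C: np.exp(-((X[:, None, :] - C[None]) ** 2).sum(-1)
                                / (2 * width ** 2))
      A = phi(X_te, X_te)                   # basis values at the test points
      B = phi(X_tr, X_te)                   # basis values at the training points
      alpha = np.ones(len(X_te))
      for _ in range(steps):
          alpha += lr * (A.T @ (1.0 / (A @ alpha)))   # ascend sum log w(x_te)
          alpha = np.maximum(alpha, 0.0)              # alpha >= 0
          alpha /= (B @ alpha).mean()                 # constraint: mean_tr w = 1
      return B @ alpha                      # estimated importance at training points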

August 13, 08 : Discriminative Learning for Differing Training and Test Distributions (Amit)

The authors address classification problems in which the training instances are governed by a distribution that is arbitrarily different from the test distribution. They derive a discriminative solution: neither the training nor the test distribution is modeled explicitly. They formulate the general covariate shift problem as one integrated optimization problem, and they derive a kernel logistic regression classifier for differing training and test distributions.

In earlier approaches, w(\mathbf{x}) = \frac{p_{te}(\mathbf{x})}{p_{tr}(\mathbf{x})}

First w(x) is learned somehow; then a classifier is trained on the weighted training examples. All the earlier work on learning from biased samples has followed this idea of solving two optimization problems sequentially. In this paper, by contrast, the problem is not split into two sequential steps.
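For contrast, the two-step baseline being argued against is just importance-weighted training (a sketch; the weights would come from some separate estimator such as KLIEP above, and sklearn's sample_weight handles the second step):

  from sklearn.linear_model import LogisticRegression

  def two_step_baseline(X_tr, y_tr, importance_weights):
      # Step 1 (done elsewhere): estimate w(x) = p_te(x) / p_tr(x).
      # Step 2: train a classifier on importance-weighted training examples.
      return LogisticRegression().fit(X_tr, y_tr, sample_weight=importance_weights)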

August 20, 08 : Correcting Sample Selection Bias by Unlabeled Data (Nathan)

Today's paper continues on the path of domain adaptation, though it is called covariate shift in this setting. In short, the approach reweights the training examples using an importance weight of the kind we have seen the past couple of meetings, specifically \beta(x) = \frac{P_{test}(x)}{P_{train}(x)}.

Generally, you would need to estimate the densities P_{test}(x) and P_{train}(x) to get at the true value of β(x), but this is computationally very expensive. The approach taken here is to match the two distributions after they have been projected into feature space, learning β at the same time. They call this kernel mean matching, and it turns out to be a surprisingly straightforward optimization problem.
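The optimization can be sketched as a small quadratic program (my own illustration, solved with scipy's generic SLSQP rather than a dedicated QP solver; the bound B, the kernel width, and the equality form of the normalization constraint are simplifying assumptions):

  import numpy as np
  from scipy.optimize import minimize

  def kmm(X_tr, X_te, width=1.0, B=10.0):
      k = lambda A, C: np.exp(-((A[:, None, :] - C[None]) ** 2).sum(-1)
                              / (2 * width ** 2))
      K = k(X_tr, X_tr)                                     # (n_tr, n_tr)
      kappa = len(X_tr) / len(X_te) * k(X_tr, X_te).sum(1)  # pull toward test mean
      n = len(X_tr)
      obj = lambda b: 0.5 * b @ K @ b - kappa @ b           # mean matching in feature space
      res = minimize(obj, np.ones(n), bounds=[(0.0, B)] * n,
                     constraints=[{"type": "eq", "fun": lambda b: b.sum() - n}])
      return res.x                                          # beta(x_i) per training point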

Past Semesters
