MLRG/summer09
CS7941: Assorted Topics in Machine Learning
Time: Tuesdays, noon-1:30
Location: Graphics annex (MEB 3515), except as noted
Each week will include a discussion of 2-4 papers from a recent conference (past 24 months or so) on roughly the same topic. Your job as the presenter is to do a compare/contrast thing and teach us about some area. If you want to run your selections by me ahead of time, that's fine. If not, that's fine, too.
Please don't sign up for dates until you've chosen papers and please try to let the new students sign up first so they can get times/papers that they're most comfortable with.
Feel free to select papers from ML conferences (ICML, NIPS, UAI, etc) *or* applications conferences (ACL, CVPR, SIGIR, ISMB, etc.) but be sure that the focus in the latter case is on the machine learning, not on the application. Of course, if you'd like to choose a technical ML paper and then an applied paper and present them together, that might be even more fun!
Participants
- Hal Daumé III, Assistant Professor, School of Computing
- Arvind Agarwal, PhD Student, School of Computing
- Amit Goyal, PhD Student, School of Computing
- Ruihong Huang, PhD Student, School of Computing
- Jagadeesh J, PhD Student, School of Computing
- Piyush Rai, PhD Student, School of Computing
- Adam R. Teichert, MS Student, School of Computing
- Jiarong Jiang, PhD Student, School of Computing
- Nathan Gilbert, PhD Student, School of Computing
Schedule
| Date | Papers | Presenter |
|---|---|---|
| May 12 (in LCR) | Learning with auxiliary information: Self-taught learning, Translated Learning | Piyush |
| May 19 | Deep Boltzmann Machines and Analyzing human feature learning as nonparametric Bayesian inference | Ruihong |
| Jun 9 | Sparse Coding: Differentiable Sparse Coding, Resolution Limits of Sparse Coding in High Dimensions | Jiarong |
| Jun 23 | TBD | |
| Jun 30 | TBD | |
| Jul 7 | TBD | |
| Jul 21 | TBD | |
| Jul 28 | TBD | |
| Aug 11 | TBD | |
| Aug 18 | TBD |
Paper Summaries
Learning with auxiliary information / May 12 (Piyush)
Supervised learning requires labeled data for training, and it is often difficult to obtain "enough" labeled data to reliably train a model. Driven by this problem, much recent effort in machine learning has gone into using some kind of "auxiliary information" that could potentially help a supervised learning task when labeled training data is scarce. A range of techniques have been proposed that use one kind of auxiliary information or another to assist learning. The auxiliary information can be provided by unlabeled data (semi-supervised learning), multiple views of the data (multiview learning), or related domains or tasks (domain adaptation or multitask learning), and can be exploited by techniques such as co-training, self-training, etc.
Among these techniques, semi-supervised learning (SSL) in particular has been well studied, and there is empirical and theoretical evidence that, with the "right" model, unlabeled data can indeed be expected to help. However, SSL assumes that the unlabeled data comes from the same generative distribution as the labeled data and that the underlying labels of the unlabeled examples belong to the same classes as those of the labeled data.
Self-taught learning
A recently proposed model, [self-taught learning](http://www.stanford.edu/~hllee/icml07-selftaughtlearning.pdf), removes both of the above restrictions of SSL. In particular, it assumes neither that the unlabeled data is drawn from the same distribution as the labeled data, nor that the unlabeled examples have the same underlying class labels as the labeled training data. For instance, consider an image-classification task whose goal is to classify images of elephants vs. rhinos. If there is a lack of labeled training data for this task, it has been shown that even unlabeled random images (i.e., not necessarily of elephants or rhinos) downloaded from the internet can improve the accuracy of the supervised learning task at hand (classifying elephants vs. rhinos).
More formally, in self-taught learning we are given a labeled training set <math>\{(\textbf{x}_l^{(1)},y^{(1)}),\ldots,(\textbf{x}_l^{(N)},y^{(N)})\}</math> drawn from some distribution <math>\mathcal{D}</math>. Here each <math>\textbf{x}_l^{(i)} \in R^D</math> is a feature vector and <math>y^{(i)}</math> is the corresponding class label. In addition, we are given unlabeled data <math>\{\textbf{x}_u^{(1)},\ldots,\textbf{x}_u^{(M)}\}</math>, which, importantly, need not come from the same distribution <math>\mathcal{D}</math>. All that is assumed is that the labeled and unlabeled data come from the same modality (i.e., share the same feature space).
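Concretely, the only structural requirement is that labeled and unlabeled examples live in the same feature space; their distributions and label sets may differ. Below is a toy illustration of this setup (a sketch assuming numpy; names and data are purely illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                                     # shared feature dimensionality

# N labeled examples (x_l, y) drawn from the task's own distribution.
N = 40
X_l = rng.normal(loc=0.0, scale=1.0, size=(N, D))
y = rng.integers(0, 2, size=N)

# M unlabeled examples x_u from a *different* distribution -- fine in
# self-taught learning, as long as the feature space R^D is shared.
M = 200
X_u = rng.normal(loc=2.0, scale=3.0, size=(M, D))
```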
Why might unrelated unlabeled data help? In semi-supervised learning it is reasonable to expect unlabeled data from the same domain to help. But in the self-taught learning setting, one might wonder why unlabeled data from a potentially unrelated domain should help. The answer lies in the fact that even though the labeled and unlabeled data may not come from exactly the same domain, they may still share feature characteristics that help in building a better higher-level feature representation. For example, for the task of classifying images, high-level information about general image characteristics (such as the presence or absence of edges at certain places in the image) can provide cues that cannot be obtained from a low-level representation (such as raw pixel intensities). Once learned, these feature representations can help new learning tasks even if those tasks have very little labeled data, as long as the domain of interest is images (or, in general, the same domain the feature representations were learned from).
How does self-taught learning work? Self-taught learning is based on an unsupervised feature construction algorithm. This algorithm first solves an optimization problem (similar to [sparse coding](http://www.stanford.edu/~hllee/nips06-sparsecoding.pdf)) to learn high-level basis representations from the unlabeled data:

<math>\min_{b,a} \sum_i ||\textbf{x}_u^{(i)} - \sum_j a_j^{(i)} b_j||_2^2 + \beta ||a^{(i)}||_1</math>

s.t. <math>||b_j||_2 \le 1, \forall j \in 1,\ldots,s</math>

In the above, <math>b = \{b_1,\ldots,b_s\}</math> with each <math>b_j \in R^D</math> are the basis vectors, and <math>a = \{a^{(1)},\ldots,a^{(M)}\}</math> are called activations, with each <math>a^{(i)} \in R^s</math>. Here <math>a_j^{(i)}</math> is the activation of basis vector <math>b_j</math> for example <math>\textbf{x}_u^{(i)}</math>, and <math>||\cdot||_2</math> and <math>||\cdot||_1</math> denote the <math>L_2</math> and <math>L_1</math> norms, respectively.
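The paper uses efficient specialized solvers for this problem; the snippet below is only a minimal numpy sketch of the same objective using generic first-order updates: iterative soft-thresholding (ISTA) for the activations with the bases fixed, and projected gradient descent for the bases with the activations fixed. The toy data and all variable names are illustrative stand-ins, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_codes(X, B, beta, n_iter=100):
    """ISTA for min_A ||X - A B||_F^2 + beta * sum(|A|); rows of A are the a^(i)."""
    L = 2.0 * np.linalg.norm(B @ B.T, 2) + 1e-12      # Lipschitz constant of the smooth part
    A = np.zeros((X.shape[0], B.shape[0]))
    for _ in range(n_iter):
        grad = 2.0 * (A @ B - X) @ B.T
        A = soft_threshold(A - grad / L, beta / L)
    return A

def update_bases(X, A, B, n_iter=50):
    """Projected gradient for min_B ||X - A B||_F^2 s.t. ||b_j||_2 <= 1 (rows of B are the b_j)."""
    step = 1.0 / (2.0 * np.linalg.norm(A.T @ A, 2) + 1e-12)
    for _ in range(n_iter):
        B = B - step * 2.0 * A.T @ (A @ B - X)
        norms = np.linalg.norm(B, axis=1, keepdims=True)
        B = B / np.maximum(norms, 1.0)                # project each basis vector onto the unit ball
    return B

# Toy unlabeled data: M examples in R^D (stand-ins for the x_u^(i)); s basis vectors.
M, D, s, beta = 200, 64, 32, 0.5
X_u = rng.standard_normal((M, D))
B = rng.standard_normal((s, D))
B /= np.linalg.norm(B, axis=1, keepdims=True)

for _ in range(20):                                   # alternate between activations and bases
    A = sparse_codes(X_u, B, beta)
    B = update_bases(X_u, A, B)
```

Alternating the two convex subproblems does not increase the (jointly non-convex) objective; the learned bases in B are what get carried over to the labeled task.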
These basis vectors can now be used to extract new, more useful, higher-level representations of the labeled training data by solving the following optimization problem:

<math>\hat{a}(\textbf{x}_l^{(i)}) = \arg\min_{a^{(i)}} ||\textbf{x}_l^{(i)} - \sum_j a_j^{(i)} b_j||_2^2 + \beta ||a^{(i)}||_1</math>

The above mapping yields a new feature representation for each labeled example <math>\textbf{x}_l^{(i)}</math>. Any learning algorithm can then be applied to the labeled examples under this new representation. Experimental results on image, audio, and text classification tasks have shown that this approach yields better classification performance.
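To complete the picture, here is a similarly hedged sketch of the labeled-data side: featurize each labeled example by its activations and train any off-the-shelf classifier on them. It uses scikit-learn's SparseCoder for the L1-regularized projection and logistic regression as the classifier; B, X_l, and y are illustrative stand-ins (random here, only so the snippet runs on its own rather than depending on the sketch above).

```python
import numpy as np
from sklearn.decomposition import SparseCoder
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for the (s x D) basis matrix learned from the unlabeled data above.
s, D, N = 32, 64, 40
B = rng.standard_normal((s, D))
B /= np.linalg.norm(B, axis=1, keepdims=True)

X_l = rng.standard_normal((N, D))     # labeled inputs x_l^(i) (toy stand-ins)
y = rng.integers(0, 2, size=N)        # their class labels y^(i)

# a_hat(x_l^(i)): L1-regularized projection of each labeled example onto the bases.
# transform_alpha plays the role of the sparsity penalty beta (up to sklearn's internal scaling).
coder = SparseCoder(dictionary=B, transform_algorithm='lasso_lars', transform_alpha=0.5)
A_l = coder.transform(X_l)            # new feature representation, shape (N, s)

# Any standard classifier can now be trained on the activations instead of the raw features.
clf = LogisticRegression(max_iter=1000).fit(A_l, y)
```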
Note that the self-taught learning framework still assumes that the unlabeled data shares the same feature space as the labeled data. Another recent paper, [Translated Learning](http://books.nips.cc/papers/files/nips21/NIPS2008_0098.pdf), addresses the setting where we want to transfer knowledge from a source domain to a target domain whose feature spaces differ (e.g., text classification as the source task and image classification as the target task). The paper proposes a framework that links the two feature spaces through a "translator", which plays a role similar to a language model. Experimental results on text-aided image classification and cross-lingual classification show improved performance on the target domains.
Past Semesters
- Spring 2009 (Learning Theory)
- Fall 2008 (Graphical Models)
- Summer 2008 (Multitask Learning)
- Spring 2008 (Kernel Methods)
- Fall 2007 (Unsupervised Bayes)
- Summer 2007 (Manifolds)