Talk:MLRG/summer08
Discussion
07 May:
Piyush
- How is multitask learning different from transfer learning? Is it the case that the former considers a setting where the focus is on learning multiple tasks "simultaneously" (exploiting relatedness of tasks), and that the latter learns them sequentially (by relying on some adaptation strategy from one task to the other)?
- My sense is that in both transfer learning and domain adaptation, the assumption is that the training (source) and test (target) data distributions are different. If that is true, then is there indeed a difference between transfer learning and domain adaptation?
- Can algorithms such as co-clustering and multiview learning be termed multitask learning?
14 May:
Piyush
- How feasible would it be (at least in the logistic regression case) to follow a true Bayesian approach and *learn* the hyperparameters (the covariance matrix in this case) of the prior, assuming that the related tasks share the same prior distribution?
Nathan
- How do we know that the auxiliary problems are well suited to our target problem? Can we say anything about the similarity of the auxiliary problems and the target problem?
21 May:
Nathan
- Is there any indication of how good the approximation of K(x_i, x_j) (eq. 18) actually is?
- (Piyush) I think it would depend on how "similar" the points (x_i, x_j) for the new task are to the finite set of points (X) already seen. We recover the exact case (K = κCακ) when they actually belong to X.
- What is the significance of pointing out that X contains only the *distinguished* points? This seems intuitive, but perhaps I'm missing something...
- (Piyush) Probably there isn't much to it, except for the fact that X denotes the set of *distinguished* points in the training data. If all the training points are unique, then X is just the whole training set.
Piyush
- I'm not sure about the role of τ (the sample size of the base kernel) in equation 19 (learning new functions). Also, is there a way to interpret the trade-off between m (the number of already-learned tasks) and τ in that equation?
- How susceptible can MTL algorithms be (especially ones using GPs, for example) to noisy or outlier tasks? BTW, there is this paper which uses t-processes (a generalization of GPs) for robust MTL. I wonder if there are alternative ways to ensure robustness?
Seth
- This might have been answered in the paper (I might have missed it). What is the difference in complexity between the multi-task linear model and the multi-task GPs?
- (Piyush) In the multi-task linear model case, the dimensionality of the parameter space equals the dimensionality (d) of the data, whereas for the multi-task GP case it is the number of points (n). In the former case the complexity is O(d^3), whereas in the latter it is O(n^3). Thus, for high- (or infinite-) dimensional spaces, GPs would be advantageous; see the sketch below. Also, GPs let you handle nonlinear functions through a nonlinear mapping using kernels.
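- A minimal sketch of where the two cubic costs come from (my own illustration, not code from the paper), using ridge regression as a stand-in for the parameter-space vs. function-space view:

```python
import numpy as np

# Illustration (not from the paper): where O(d^3) vs O(n^3) comes from,
# using ridge regression as a stand-in for the linear vs kernel/GP views.
n, d, lam = 500, 20, 1e-2
X = np.random.randn(n, d)
y = np.random.randn(n)

# Parameter space (linear model): solve a d x d system -> O(d^3)
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Function space (linear kernel K = X X^T): solve an n x n system -> O(n^3)
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# With a linear kernel the two predictions coincide; the kernel route pays
# off when d is huge (or infinite) or the kernel is nonlinear.
x_new = np.random.randn(d)
assert np.allclose(x_new @ w, (X @ x_new) @ alpha)
```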
- This is more of a newbie question: What is the overall difference between transductive learning and inductive learning?
- (Piyush) In transductive learning, the goal is to complete the labeling of the test data, but without having to learn an explicit mapping function. This is (usually) possible, and there are ways to do so, because we also have the test data available during training. On the other hand, inductive learning works in a setting which doesn't assume the a priori availability of test data during training. Thus the goal there is to learn an explicit mapping function which can be used to make predictions when we finally get the test data.
Arvind
- I do not understand the paragraph just above Eq. 2. How come the sufficient statistics of the prior completely specify the properties of the function f(x)? I can see it for the posterior.
- (Piyush) Note that the function space can be completely characterized by a mean function and a covariance function. The sufficient statistics of the prior p(w) are its mean (0) and covariance matrix (I). For the linear model f(x) = w^T x, this describes a GP prior on the function space, with zero mean function and covariance function K(x_i, x_j) = x_i^T C_w x_j, for the particular choice of C_w = I.
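- A quick Monte Carlo check of this (my own sketch, assuming the linear model f(x) = w^T x): sampling w from the prior N(0, I) and looking at the empirical covariance of the induced function values recovers K(x_i, x_j) = x_i^T x_j:

```python
import numpy as np

# Sketch: w ~ N(0, I) on f(x) = w^T x induces a GP prior with zero mean
# function and covariance K(xi, xj) = xi^T Cw xj (here Cw = I).
rng = np.random.default_rng(0)
d, S = 5, 200_000
xi, xj = rng.standard_normal(d), rng.standard_normal(d)

W = rng.standard_normal((S, d))    # S draws from the prior p(w) = N(0, I)
fi, fj = W @ xi, W @ xj            # induced function values f(xi), f(xj)

print(np.mean(fi))                 # ~ 0: zero mean function
print(np.mean(fi * fj), xi @ xj)   # empirical covariance vs analytic xi^T xj
```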
28 May:
Nathan
- What is the airspeed velocity of a coconut laden swallow?
- Maybe, would depend on how many coconuts. In the extreme case, it may rather be a free-fall. :)
- Ah, but you've missed the obvious question lurking inside of this one...African or European swallow?
- Actually, I *knew* that it had already been asked. :P
- Oh my, I didn't know thaaaAAAAAAAEEEEEEEugh!!!!!
Piyush
- Given a set of related tasks, is it possible to assess which hierarchical Bayes MTL algorithm would best capture the relatedness? For example, one based on learning shared parameters (last week's paper on GPs), one based on task clustering (the DP paper this week), or something like this paper, which uses a latent (ICA-based) representation of tasks for MTL?
- (Nathan) Depends on what you mean by relatedness, I suppose. The authors of today's paper have a definition that we will discuss; the authors of last week's paper didn't really have a measure of relatedness between tasks, but instead give the idea of sharing information by learning a common prior. I haven't read the last paper you cite yet.
Seth
- Question 1:
  - The paper uses several distributions when sampling the model variables:
    - SMTL model:
      - v_k ~ Beta distribution
      - c_m,k ~ multinomial distribution
      - y_m,n ~ multinomial distribution
    - SMTL-1 model:
      - α ~ gamma distribution
      - etc...
  - They do not mention, however, the intuition behind choosing those particular distributions when doing the sampling. My question therefore becomes: why did they choose those distributions to sample those particular variables?
- (Nathan) Well, for v_k being drawn from a Beta distribution, this is a necessity for the stick-breaking formulation of the Dirichlet process (see the sketch after this thread). As for the multinomial distributions, the c_m,k vector is going to be full of binary values, which the multinomial distribution can supply easily. I'm less sure about why y_m,n is drawn from a multinomial; it could be just because the Dirichlet is the conjugate prior of the multinomial, which makes some computations easier. As for α, the only thing I can think of is that the gamma is the conjugate prior for the precision of a Normal with known mean.
- (Piyush) In particular, since only one of the c_m,k is one and all the rest are zero, they use a single draw of a multinomial. y_m,n is drawn from a binomial (not from a multinomial), since it's a binary variable and that's how it is defined (drawn from a binomial with probability given by a sigmoid; see equation 4). In general, continuous distributions (e.g. normal, beta, gamma) are used as priors for random variables taking real values, and discrete distributions (e.g. binomial/multinomial, with beta/Dirichlet as conjugate priors) for RVs with discrete values. The exact choice (which distribution, what hyperparameters) depends on other characteristics (such as the expected value or variance) of the random variable.
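- For reference, a minimal stick-breaking sketch (my own, truncated at K atoms for illustration) showing why the v_k are Beta draws: breaking a unit stick with Beta(1, α) proportions yields valid DP mixing weights π_k:

```python
import numpy as np

# Truncated stick-breaking construction of DP mixing weights:
#   v_k ~ Beta(1, alpha),   pi_k = v_k * prod_{l<k} (1 - v_l)
rng = np.random.default_rng(0)
alpha, K = 2.0, 20

v = rng.beta(1.0, alpha, size=K)
stick_left = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
pi = v * stick_left                # mixing weights over task clusters

print(np.round(pi, 3))             # weights decay; most mass on few clusters
print(pi.sum())                    # approaches 1 as K grows
```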
- Question 2:
- For the SMTL-2 model, what exactly do the hyper-parameters (τ_10, τ_20, γ_10, γ_20, β_0) tune?
- (Nathan) The hyper-parameters τ_10, τ_20 tune the scaling parameter of the DP; γ_10, γ_20, β_0 tune the mean and variance of the Normal distribution used to generate the task clusters.
- (Piyush) They just set the hyperparameters to some reasonable values (see the experiments section). The algorithms aren't sensitive to the hyperparameters, as they use diffuse priors. A diffuse prior on a random variable means that all the values the RV can take are (roughly) equally probable. Diffuse priors are generally used when you have very little clue about the values the RV can take; given some intuition, though, in practice they work reasonably well.
04 June:
Piyush
- In sections 3.1.2 and 3.1.3, the authors give examples of task-clustering regularization and graph regularization as special cases of their kernel-based MTL approach. Under what kinds of settings would these respective regularizers be most appropriate to use?
11 June:
Nathan
- I'm having a hard time understanding exactly what definition 2.1 is saying in terms of relatedness.
- Is there any general statement about what kind of problems can best use F-similarity?
25 June:
Nathan
- In the shifting multitask problem, how is the forecaster going to know that the tasks have shifted? (My take was that this was just an assumption of the problem.)
- If it knows the labels ahead of time, could the forecaster re-order the tasks to its advantage? (This wouldn't be possible in the online setting...)
- (Piyush) It could if it were a batch setting, but they assumed an online setting where the task labels and the comparators for the tasks change on the fly as the tasks arrive.
Piyush
- In terms of the regret bound for the shifting MTL problem, algo 2 is shown to be better than algo 1. Can we say something about the relative computational efficiency of the two algorithms, for given values of K, N, and m?
- Can we apply the random walk algorithm (algo 2) to the sequential MTL problem too (in case it is indeed more efficient computationally)?
02 July:
Piyush
- The paper says that Kolmogorov complexity gives the absolute information content of an individual object, whereas information-theoretic entropy measures the information content of an ensemble of objects w.r.t. a distribution over ensembles. I was wondering whether the entropy-based framework could still be useful for analyzing certain MTL settings (the Kolmogorov-complexity-based approach in this paper considers a sequential transfer setting).
- Can the Bayes mixture (eq 3.2) be seen as the "averaged prediction" over a set of hypotheses?
- I'm not sure what "t" stands for in eq 3.3. Is it a typo?
- They resort to an approximation scheme, saying that "K is computable only in the limit". What do they mean by computability in the limit?
17 July:
Piyush
- Section 4.1 in the paper says that the hypothesis space HΘ (obtained through joint ERM on the auxiliary problems) improves the average performance of the auxiliary predictors and is expected to help the target problem. Can we hope that a different strategy, based on improving the performance of each auxiliary predictor individually (instead of the average across all predictors), could do even better on the target problem?
- (Seth) My sense about it is that there is always hope. However, in this paper, their interest is not necessarily doing well on the auxiliary problems, but only on finding a common structure that will improve performance on the target problem.
- In section 3.3, the paper says that, given the locally optimal structural parameter Θ, the solution {w_l, v_l} will still be globally optimal. Why so?
- (Seth) My opinion: since Θ is a structure that is subsequently used to learn the appropriate weight vectors, any good value of Θ would provide sufficient structure to appropriately get a globally optimal {w_l, v_l}.
- How is the parameter 'h' (the dimensionality of low-dim predictor representation) chosen?
- (Seth) It appears that the parameter was chosen at random. Later empirical results showed that increasing h beyond 60 did not do much to improve overall performance.
Nathan
- In section 4.2.1, the authors detail an unsupervised method for predicting sub-structures by learning the values of masked features. I think this is great for learning relationships amongst features, but how does this help predict the target class?
- (Piyush) The goal is to learn "good structures" in the input space and, by trying to predict the sub-structures, we are essentially doing that - we are learning good predictors that obey "smoothness" of the input space X. The alternating structural optimization procedure then ensures that this learning helps the target problem.
- (Seth) The idea is that there is a common low dimensional structure that ALL problems have in common. Finding this structure Θ could then be used to assist in the prediction of the target class.
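- A tiny sketch of the masked-feature trick (the code and names are mine, not the paper's): each auxiliary problem hides one feature of the unlabeled data and learns to predict it from the remaining features; jointly fitting all of these predictors with a shared low-dimensional parameter Θ is what transfers to the target task:

```python
import numpy as np

# Create ASO-style auxiliary problems from unlabeled data by masking features.
def make_auxiliary_problem(X, j):
    """Auxiliary problem j: hide feature j, predict it from the rest."""
    X_masked = X.copy()
    X_masked[:, j] = 0.0    # mask out the j-th feature
    y_aux = X[:, j]         # its observed value becomes the (free) label
    return X_masked, y_aux

X_unlabeled = np.random.rand(1000, 10)
aux_problems = [make_auxiliary_problem(X_unlabeled, j) for j in range(10)]
```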
28 July:
Piyush
- How are the m pivot features chosen from the large set of possibilities? The paper says that they should be "diverse" enough and gives the example of not choosing only determiners in the PoS-tagging case. But can there be a more general (and automated?) strategy for pivot-feature selection?
- (Nathan) I think it is going to depend on the problem. The three basic types of pivot features these authors suggest are right-token, left-token, and middle-token. This seems like a good idea for PoS tagging, but it probably wouldn't be a good choice for other problem areas. In short, I think this would be hard to automate fully, though one simple heuristic is sketched below.
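- For what it's worth, one automated heuristic (sketched from memory of Blitzer et al.'s structural correspondence learning; the helper names are mine) is to pick as pivots the features that occur frequently in both the source and the target domain:

```python
from collections import Counter

# Heuristic pivot selection: keep features frequent in BOTH domains.
def choose_pivots(source_docs, target_docs, m=100, min_count=5):
    src, tgt = Counter(), Counter()
    for doc in source_docs:
        src.update(doc)
    for doc in target_docs:
        tgt.update(doc)
    common = [f for f in src if src[f] >= min_count and tgt[f] >= min_count]
    return sorted(common, key=lambda f: src[f] + tgt[f], reverse=True)[:m]

# docs are token lists; with m pivots chosen, one predictor is then learned
# per pivot, just as with the left/right/middle-token features above.
```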
31 July:
8 August:
Piyush
- Is there any rationale behind choosing a linear model, w(x) = Σ_l α_l φ_l(x), for the importance weights (or for any other choice, for that matter)?
- For the linear model, looking at the form of w(x), doesn't it seem like something RKHS-ish is going on implicitly, since the expression looks like a solution admitting the representer theorem? It might be related to the KMM approach to domain adaptation we will be talking about in 2 weeks.
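- For concreteness, here is a sketch of what such a linear-in-parameters weight model might look like (the notation and the Gaussian basis choice are mine; basis functions centered at reference points are exactly what makes the expression look representer-theorem-like):

```python
import numpy as np

# Linear-in-parameters importance weights: w_hat(x) = sum_l alpha_l * phi_l(x),
# with phi_l(x) = K(x, c_l) a Gaussian basis centered at reference point c_l.
def gaussian_kernel(x, c, sigma=1.0):
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

def w_hat(x, alphas, centers):
    return sum(a * gaussian_kernel(x, c) for a, c in zip(alphas, centers))

rng = np.random.default_rng(0)
centers = rng.standard_normal((5, 3))    # reference points (e.g. test samples)
alphas = np.abs(rng.standard_normal(5))  # learned coefficients, kept nonnegative
print(w_hat(rng.standard_normal(3), alphas, centers))
```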
13 August:
Piyush
- The weighted-likelihood argument that is used to get eq 11 from eq 10 is a bit unclear to me. They exponentiate each conditional likelihood p(y_i | x_i; θ) by the importance weight w(x_i), claiming that w(x_i) is the number of times x_i should occur in the training data if the data were instead generated from the test distribution. To me, weighting is just like multiplying each example by a weight and then training using the weighted examples. Then what's wrong with just weighting each p(y_i | x_i; θ) in the conditional likelihood term as w(x_i) p(y_i | x_i; θ), instead of exponentiating it?
- (Piyush) Because, during training, we actually weight the loss term corresponding to each example, and not the example (or its likelihood) itself; see below.
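- Taking logs makes the equivalence a one-liner (my notation, with w(x_i) the importance weight):

```latex
\log \prod_{i=1}^{n} p(y_i \mid x_i;\theta)^{\,w(x_i)}
  \;=\; \sum_{i=1}^{n} w(x_i)\,\log p(y_i \mid x_i;\theta)
```

So exponentiating each likelihood (eq 11) is exactly multiplying each example's log-likelihood loss by its weight. Multiplying the likelihood by w(x_i) instead would only add the constant Σ_i log w(x_i) to the log-likelihood, leaving the maximizer over θ unchanged, i.e., it would have no effect at all.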