User:Arvind385
Arvind: Nearest Neighbor search (09/07)
We talked about the problem of finding the k nearest neighbors of a query point in a high-dimensional space. There has been a good amount of work on algorithms for this problem in both the approximate and exact settings. Most of these algorithms achieve a complexity that is logarithmic in n but exponential in the dimension d. This paper presents two approximate algorithms that achieve much better complexity. The first algorithm has a pre-processing cost of O((n log d)^{2d}) and a query time of O((d log^2 d)(d + log n)), while the second algorithm achieves near-linear pre-processing time and a query time that improves asymptotically on brute-force search in all dimensions.
Here we will talk about the first algorithm. In this algorithm, query processing is deterministic while pre-processing involves random sampling. The basic idea is to choose a set of L vectors uniformly at random from the unit sphere S^{d-1}, project all points onto these randomly chosen one-dimensional directions, and then use Lemma 2.1 to find the nearest neighbors. Informally, Lemma 2.1 says that if x and y are two vectors with ||x|| < ||y||, and v is drawn uniformly at random from S^{d-1}, then |x · v| < |y · v| with probability strictly greater than 1/2 (by a margin that depends on the ratio of the norms). In other words, a point that is closer to the query in the original high-dimensional space will, with high probability, also look closer when the comparison is done using a small number of random one-dimensional projections rather than the full d-dimensional distance. The algorithm uses this lemma to compare the distances of two points from the query point.
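To make the idea concrete, here is a minimal sketch (not from the paper) of comparing two candidate points to a query by a majority vote over random one-dimensional projections; the function name, the choice of L, and the Gaussian-then-normalize sampling of directions are assumptions made for illustration.

```python
# Compare which of two points is closer to a query using only random
# one-dimensional projections, instead of full d-dimensional distances.
import numpy as np

def closer_by_projections(p1, p2, q, L=50, rng=None):
    """Return 1 if p1 is judged closer to q than p2, else 2."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(q)
    votes_p1 = 0
    for _ in range(L):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)                  # uniform direction on S^{d-1}
        # Compare the projected offsets from the query point.
        if abs(np.dot(p1 - q, v)) < abs(np.dot(p2 - q, v)):
            votes_p1 += 1
    return 1 if votes_p1 > L / 2 else 2

# Example: the vote usually agrees with the true distances.
rng = np.random.default_rng(0)
q = rng.normal(size=100)
p1 = q + 0.5 * rng.normal(size=100)             # closer point
p2 = q + 2.0 * rng.normal(size=100)             # farther point
print(closer_by_projections(p1, p2, q, L=50, rng=rng))   # usually prints 1
```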
Preprocessing: Here are the main steps of the preprocessing stage.
1) Choose a set of L vectors v_1, ..., v_L uniformly at random from S^{d-1}.
2) For each vector v_l, project the centers of all pairs of points onto it (compute v_l · p_{ij}, where p_{ij} is the center of the pair (i, j)) and sort the projections. Call this sorted list S_l.
3) Once we have S_l for each random vector, select the realizable traces from these lists. A trace σ = (σ_1, ..., σ_L) is defined as a sequence of primitive intervals σ_l (an interval between two adjacent points in the sorted list S_l). A trace is realizable if there exists a point q such that for every l, v_l · q lies in the primitive interval σ_l of S_l.
4) For each realizable trace, build a complete digraph with a directed edge from i to j > i if i σ-dominates j on more than half of the L sorted lists, and an edge from j to i otherwise. Finally, construct an apex ordering of this digraph and store the apex orderings together with their corresponding realizable traces.
Query Processing: Once we have preprocessed the data, finding the k nearest neighbors of a query point q is fairly simple. For every random vector v_l, we do a binary search to insert the value v_l · q into the sorted list S_l. Let σ_l be the primitive interval into which it falls. From these we get a trace σ = (σ_1, ..., σ_L) for the L vectors. Next, we look up the apex ordering of the digraph corresponding to σ and return its first k entries, which give the k nearest neighbors. A simplified sketch of this query-time lookup is given below.
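Below is a simplified, hedged sketch of the query-time step: it builds the sorted projection lists S_l during preprocessing and, at query time, binary-searches v_l · q into each list to recover the trace. The trace-to-apex-ordering lookup is omitted; only the O(L log n) binary-search part is illustrated, and all names are assumptions rather than the paper's code.

```python
# Project pair centers onto L random unit vectors (preprocessing) and locate
# the query's primitive interval in each sorted list (query time).
import bisect
import numpy as np

def preprocess_sorted_lists(points, L=10, rng=None):
    """Project pair centers onto L random unit vectors and sort the projections."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = points.shape
    vectors = rng.normal(size=(L, d))
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    centers = np.array([(points[i] + points[j]) / 2.0
                        for i in range(n) for j in range(i + 1, n)])
    sorted_lists = [np.sort(centers @ v) for v in vectors]
    return vectors, sorted_lists

def query_trace(q, vectors, sorted_lists):
    """Return, for each l, the index of the primitive interval containing v_l . q."""
    return [bisect.bisect_left(S_l.tolist(), float(v @ q))
            for v, S_l in zip(vectors, sorted_lists)]

points = np.random.default_rng(1).normal(size=(20, 8))
vectors, sorted_lists = preprocess_sorted_lists(points, L=5)
print(query_trace(np.zeros(8), vectors, sorted_lists))
```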
21 Sep 07: Gibbs Likelihoods for Bayesian Tracking (Arvind)
This paper addresses the problems of modeling the appearance of humans and distinguishing human appearance from the appearance of general scenes. Given a 3D model of the person projected into an image, we model the likelihood of observing various image cues (filter responses such as edge, ridge, and motion) given the locations and orientations of the limbs. More specifically, we want to sample some pixels from the image (from both the foreground and background regions) that contain the human appearance and compute the likelihood p(f | x) at these points.
There has been work in the past on modeling this likelihood using Bayesian methods in which all features (filter responses) are assumed to be conditionally independent. Given that we have three different kinds of filter responses (edge, ridge, and motion), the joint conditional density under this assumption factorizes as p(f | φ) = p(f_e | φ) p(f_r | φ) p(f_m | φ). It is well known, and easy to guess, that these filter responses are not conditionally independent. Besides this direct violation of the assumption, modeling the joint conditional density (the probability of observing all filter responses) in this way has other difficulties as well: many features or measurements are required for robust tracking, which makes the dimensionality of the joint data space high, and the training data may be too limited to fully populate the probability space.
To address these problems, we model the likelihood using a Gibbs model
p(f | x) = (1/Z) exp( Σ_i λ^(i)(φ^(i)(f, x)) ),
where the λ^(i) are weight functions and the φ^(i) can be thought of as "selector functions" that choose bins of marginal histograms along various directions. This basically means that if feature f_i takes values in some interval and we choose to discretize that interval into 32 bins, then the selector function φ^(i)(f, x) equals j if f_i falls into the j-th bin. The nice thing about selector functions is the freedom to choose them along different directions. We can choose them along the marginals, which recovers the original Bayesian model because it does not account for any correlation among features, or we can choose them along other directions, e.g. the diagonal of f_1, f_2 (selector function φ(f, x) = j if (f_1 + f_2)/2 falls into the j-th bin), which models some of the correlation between features f_1 and f_2. (Refer to Section 3 of the paper for more technical details.)
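As a rough illustration (not the paper's code), the sketch below evaluates an unnormalized Gibbs log-likelihood by summing the weights of the bins chosen by two marginal selectors and one diagonal selector; the bin ranges, number of features, and weight values are placeholders.

```python
# Selector functions map the feature vector to histogram bins; the Gibbs
# log-likelihood adds up the learned weights of the selected bins.
import numpy as np

NUM_BINS = 32

def marginal_selector(f, idx, lo=0.0, hi=1.0):
    """Bin index of feature f[idx] along its marginal."""
    j = int((f[idx] - lo) / (hi - lo) * NUM_BINS)
    return min(max(j, 0), NUM_BINS - 1)

def diagonal_selector(f, i1, i2, lo=0.0, hi=1.0):
    """Bin index of (f[i1] + f[i2]) / 2, modeling correlation between two features."""
    j = int(((f[i1] + f[i2]) / 2.0 - lo) / (hi - lo) * NUM_BINS)
    return min(max(j, 0), NUM_BINS - 1)

def gibbs_log_likelihood(f, weights):
    """Unnormalized log p(f | x) = sum_i lambda_i[phi_i(f, x)]."""
    selectors = [
        lambda f: marginal_selector(f, 0),      # first marginal direction
        lambda f: marginal_selector(f, 1),      # second marginal direction
        lambda f: diagonal_selector(f, 0, 1),   # diagonal direction (correlation)
    ]
    return sum(w[sel(f)] for w, sel in zip(weights, selectors))

weights = [np.zeros(NUM_BINS) for _ in range(3)]   # placeholder lambda^(i)
print(gibbs_log_likelihood(np.array([0.3, 0.7, 0.1]), weights))
```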
Experiments show that modeling the joint conditional density with Gibbs likelihoods, which let us model the dependencies between features, captures the distribution underlying the data well while not suffering from overfitting.
Random Graphs II
Today we will discuss web graph models and see how random graphs (slightly different from those we discussed in the previous lecture) can be used to model them. There are many applications of web graphs; one of the most visible examples is the modeling of social networks.
Web graphs
Web graphs W are graphs in which each node represents a web page and an edge represents a link between web pages. These graphs are massive in size (they contain billions of nodes and edges) and are dynamic or evolving, with nodes and edges appearing and disappearing over time. In this paper, we will discuss the properties of these web graphs and some of the recent stochastic models used to model them.
Properties of web graphs
- On-line property: The number of nodes and edges changes with time. Most of the time, this property is satisfied by the construction of the model.
- Power law degree distribution: The degree distribution follows a power law with an exponent greater than 2. More specifically, if P(k) is the proportion of nodes of degree k in the graph, then P(k) = c · k^(−β), where c > 0 and β > 0 are real constants. Intuitively, this means that most pages have only a few links, while a small number of pages have very many links (a quick empirical check is sketched after this list).
- Small world property: The average distance is much smaller than the order of the graph; that is, the average path length is much smaller than the longest path in the graph. Some authors have observed that for web graphs, while the average path length is around 20, the maximum path length can be as high as 1000.
- Many dense bipartite subgraphs: The number of distinct bipartite cliques or cores is large when compared to a random graph with the same number of nodes and edges.
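As a quick empirical illustration of the power law property mentioned above, the following sketch estimates the exponent β by a least-squares fit on the log-log degree distribution; the synthetic degree sequence and the simple fitting procedure are assumptions for illustration (more careful estimators exist).

```python
# For P(k) = c * k^(-beta), log P(k) vs. log k is a line with slope -beta.
import numpy as np

def estimate_power_law_exponent(degrees):
    """Estimate beta by least squares on the log-log degree distribution."""
    degrees = np.asarray(degrees)
    ks, counts = np.unique(degrees[degrees > 0], return_counts=True)
    pk = counts / counts.sum()
    slope, _ = np.polyfit(np.log(ks), np.log(pk), 1)
    return -slope

# Synthetic degrees drawn from a heavy-tailed distribution (beta roughly 2.5).
rng = np.random.default_rng(0)
degrees = np.round(rng.pareto(1.5, size=10000) + 1).astype(int)
print(estimate_power_law_exponent(degrees))
```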
Stochastic models
Having defined the properties of web graphs, we now move on to some of the recent stochastic models used to model them. Before we do so, we present a very general mathematical framework that will serve as the basis for all of the models described later. A model in this framework possesses a set of real-number parameters and, as an additional parameter, a fixed finite graph (used as the base graph, to which nodes and edges are later added). The model generates, by some stochastic process, a sequence of finite graphs G_t indexed by the time steps t = 0, 1, 2, .... The model is defined so that:
- G_0 is the fixed base graph given as a parameter.
- For every t, G_t is an induced subgraph of G_{t+1}; nodes and edges may be added at each step, but the existing graph is preserved.
The idea of this framework is very simple: choose an initial graph and add nodes and edges, chosen by some stochastic process (this process is what distinguishes one model from another), to the existing graph in such a way that the existing graph is a subgraph of the new graph. Now we will look at some of the important models of web graphs based on this framework.
Preferential Attachment Model
As the name suggests, in this model the graph is formed by adding nodes and edges one by one to the existing graph. We start with a graph consisting of a single node and then, at each step, add one node and one edge (or more edges, depending on the variant of the model) to the existing graph, giving a new graph. The key point of this model is how to select the node (or nodes) to which the new edge(s) are attached. In a very simple preferential attachment model known as the Linearized Chord Diagram (LCD), at time step t an edge is added from the new node v_t to an existing node v_i, where v_i is chosen at random with probability proportional to its degree, i.e. Pr[v_i is chosen] = deg(v_i) / Σ_j deg(v_j). One can show that web graphs built using the LCD model possess the power law property and the small world property. Variations of the LCD model exist, but the idea remains the same. A minimal sketch of degree-proportional attachment appears below.
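Here is a minimal sketch of degree-proportional (preferential) attachment in the spirit of the LCD model. It uses the standard trick of keeping a list in which each node appears once per incident edge endpoint, so that a uniform draw from the list is a degree-biased draw; the starting configuration and names are illustrative assumptions, not the exact LCD definition.

```python
# Grow a graph by attaching each new node to an existing node chosen with
# probability proportional to its degree.
import random

def preferential_attachment_graph(n, seed=None):
    random.seed(seed)
    edges = [(0, 1)]                 # start from a single edge
    endpoints = [0, 1]               # each node repeated once per incident endpoint
    for new_node in range(2, n):
        target = random.choice(endpoints)    # degree-biased choice
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges

print(preferential_attachment_graph(10, seed=0))
```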
Off-line web graph models
These models are a generalized version of the ER model (the random graph G(n, p) discussed in the previous lecture). Let w = (w_1, ..., w_n) be a degree sequence of some graph of order n; we can build a web graph using this w. The edge between two nodes v_i and v_j is included independently with probability p_ij, where p_ij is proportional to w_i w_j. Our original random graph is a special case of this model in which the expected degree of every node is the same and equals pn, i.e. w_i = pn for all i. Note that this graph is not dynamic, which means that the graph is not built over time; rather, the whole graph is built at once. Although these off-line graphs violate a key property of web graphs (the on-line property), we study them for their simplicity. These off-line graphs have also been shown to have the power law property and the small world property. A sketch of this construction is given below.
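A small sketch of this off-line construction (in the Chung-Lu style), assuming the probabilities w_i w_j / Σ_k w_k are capped at 1; the weight sequence at the bottom is purely illustrative.

```python
# Include each edge {v_i, v_j} independently with probability proportional
# to w_i * w_j; with w_i = p*n for all i this reduces to G(n, p).
import random

def offline_web_graph(w, seed=None):
    random.seed(seed)
    n, total = len(w), sum(w)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p_ij = min(1.0, w[i] * w[j] / total)
            if random.random() < p_ij:
                edges.append((i, j))
    return edges

# A rough power-law-like weight sequence.
w = [50.0 / (k + 1) for k in range(100)]
print(len(offline_web_graph(w, seed=0)))
```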
Growth Deletion Model
In all of the models presented above, we only add nodes and edges and never delete them, which is unrealistic. An evolving graph model that incorporates both the addition and deletion of nodes and edges may more accurately model the evolution of the web graph. There have been models that take deletion into account. We present one such model (other models are based on the same idea), G(p_1, p_2, p_3, p_4, m), with parameter m and probabilities p_i satisfying p_1 + p_2 + p_3 + p_4 = 1, p_3 < p_1 and p_4 < p_2. We follow the same framework as in LCD (add nodes and edges to the graph at time t to get a new graph G_{t+1}), except that here we also delete nodes and edges. The idea is as follows: with probability p_1, add m edges from the new node v_t to existing nodes chosen by preferential attachment; with probability p_2, add m new edges whose endpoints are chosen among the existing nodes by preferential attachment; with probability p_3, delete a node chosen uniformly at random; and with probability p_4, delete m edges chosen uniformly at random. This model is again shown to have the small world property and the power law property. A compact sketch is given below.
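The following compact sketch implements a growth-deletion process of this flavor; the fallback to a uniform node when no edges exist, the bootstrap from a single node, and the parameter values in the example are assumptions made so the sketch runs, not details from the model's definition.

```python
# At each step, choose one of four operations according to p_1..p_4; edges are
# attached preferentially via a degree-weighted endpoint list.
import random

def pick_pa(endpoints, nodes):
    """Degree-biased choice; fall back to a uniform node when there are no edges."""
    return random.choice(endpoints) if endpoints else random.choice(nodes)

def growth_deletion_graph(steps, p1, p2, p3, p4, m, seed=None):
    random.seed(seed)
    nodes, edges, endpoints = [0], [], []
    next_id = 1
    for _ in range(steps):
        r = random.random()
        if r < p1:                                   # add node v_t plus m PA edges
            v = next_id; next_id += 1
            nodes.append(v)
            for _ in range(m):
                u = pick_pa(endpoints, nodes)
                edges.append((v, u)); endpoints += [v, u]
        elif r < p1 + p2:                            # add m PA edges among existing nodes
            for _ in range(m):
                u, v = pick_pa(endpoints, nodes), pick_pa(endpoints, nodes)
                edges.append((u, v)); endpoints += [u, v]
        elif r < p1 + p2 + p3 and len(nodes) > 1:    # delete a node chosen u.a.r.
            v = random.choice(nodes); nodes.remove(v)
            edges = [e for e in edges if v not in e]
            endpoints = [x for e in edges for x in e]
        elif edges:                                  # delete m edges chosen u.a.r.
            for _ in range(min(m, len(edges))):
                edges.remove(random.choice(edges))
            endpoints = [x for e in edges for x in e]
    return nodes, edges

nodes, edges = growth_deletion_graph(200, 0.5, 0.3, 0.1, 0.1, m=2, seed=0)
print(len(nodes), len(edges))
```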
Boosting: Strong and Weak Learning / 12 March (Arvind Agarwal)
Recall that in PAC learning we are interested in an algorithm that outputs a good hypothesis (one with small error with respect to the data distribution) with high confidence. Mathematically, the algorithm outputs a hypothesis h such that Pr[error(h) ≤ ε] ≥ 1 − δ. Note that there are two parameters that qualify and control the performance of a PAC learning algorithm: the error parameter ε and the confidence parameter δ. The smaller the values of these parameters, the stronger the guarantee on the hypothesis output by the learning algorithm. In our definition of PAC learning, we demanded that a learning algorithm be able to achieve arbitrarily small values of these parameters, and that its running time be polynomial in 1/ε and 1/δ.
Let us say that we do not have such an algorithm, but instead have a weaker though still useful algorithm L that achieves the PAC criteria not for arbitrary values of ε and δ but only for some fixed ε_0 and δ_0. Now the question is: is there any way we can use L as a subroutine to get a new algorithm L' that achieves the PAC criteria for any values of ε and δ? The answer is positive in the strongest possible sense: even an efficient algorithm whose hypotheses are only slightly better than random guessing can be used to obtain an efficient algorithm meeting the PAC criteria for any values of ε and δ. The algorithm L is known as a weak PAC learning algorithm and L' is known as a strong PAC learning algorithm.
We are using the weak learning algorithm L to obtain a strong learning algorithm. In other words, we want to use L, whose hypotheses originally were only slightly better than random guessing, to obtain an algorithm whose hypotheses have small error with high confidence: we want to use L to reduce the error and at the same time increase the confidence. There are two problems that we have to deal with (a rough sketch of the first one follows the list below):
Boosting the Confidence
Boosting the Accuracy
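As a rough illustration of the first problem, a standard trick for boosting the confidence is to run the weak learner several times on fresh samples and keep the hypothesis with the smallest error on a held-out validation set; the weak_learner and sample_data interfaces below, as well as the toy data, are assumptions for illustration and not from a specific source.

```python
# Run the weak learner k times and keep the best hypothesis on held-out data,
# which roughly drives the failure probability down from delta_0 toward delta_0**k.
import random

def boost_confidence(weak_learner, sample_data, k, validation_size):
    """Return the hypothesis with the lowest error on a held-out validation set."""
    X_val, y_val = sample_data(validation_size)
    best_h, best_err = None, float("inf")
    for _ in range(k):
        X, y = sample_data(1000)                   # fresh training sample for each run
        h = weak_learner(X, y)
        err = sum(h(x) != label for x, label in zip(X_val, y_val)) / len(y_val)
        if err < best_err:
            best_h, best_err = h, err
    return best_h

# Toy usage with a deliberately noisy threshold learner on 1-D data.
def sample_data(n):
    xs = [random.uniform(0, 1) for _ in range(n)]
    return xs, [int(x > 0.5) for x in xs]

def weak_learner(X, y):
    t = random.uniform(0.3, 0.7)                   # noisy threshold; poor on bad draws
    return lambda x: int(x > t)

h = boost_confidence(weak_learner, sample_data, k=10, validation_size=500)
X_test, y_test = sample_data(1000)
print(sum(h(x) != label for x, label in zip(X_test, y_test)) / len(y_test))
```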
More to come...