Tonight, I went to attend a local meetup event. The talk was given by a faculty from my own department, Jeff Phillips, an assistant professor working in the area of big data analysis. The topic of his talk is about sketch data structure - data structures which uses a much smaller amount of memory to summarize the represented data.This is a cool technique and I am very interested to learn more.

Sketch data structures are commonly used in the following three use cases.

- frequent items, like IP addresses
- distinct items (harder to do)
- approximate distribution of items

And it is now also becoming popular to use sketches to represent matrixes and graphs

- matrix sketch (hot)
- graph sketch (new)

Here is a list of sketches Jeff talked in his talk.

reservoir sampling: keep k samples over a streaming items

keep the first k items;

for j_th item where j>k, keep it at probability of k/j.

Frequent items sketches

Assume we want to find top k frequent items

Misra-Gries sketch (under-count)

initialize k counters = 0

for i_th item

if (we have a counter that is not 0 for this item)

increase the counter by 1

elsif (we have a counter that is 0)

increase the counter by 1 and assign this counter to this item

else

decrease all counters by 1spacesaving sketch (over-count)

initialize k counters = 0

for i_th item

if (we have a counter that is not 0 for this item)

increase the counter by 1

else

find the counter with minimal value;

increase the counter by 1 and assign this counter to this item

Two generic sketches to estimate frequency of any item

count-min sketch (over-count, report minimal value, support substraction)

initialize a two-dimensional t*k array of counters

pick t independent hash fuctions

for i_th item

foreach hash function h

increase the counter of h(item)for any item, report the minimal value of h(item) as the estimation of its frequency

count sketch

initialize a two-dimensional t*k array of counters

pick t independent hash fuctions h_1,…,h_t, each maps an item to {1,…,k} pick t independent hash functions s_1,…,s_t, each maps an item to {+1, -1}

for i_th item

foreach hash function h_i

h_i(item) += s_i(item)for any item, report the median value of h(item) as the estimation of its frequency

Count sketch: “Finding Frequent Items in Data Streams”, Moses Charikar, Kevin Chen and Martin Farach-Colton, ICALP ‘02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming.