Sketch Data Structures
Jun 9, 2015
Xing Lin
2 minute read

Tonight, I went to attend a local meetup event. The talk was given by a faculty from my own department, Jeff Phillips, an assistant professor working in the area of big data analysis. The topic of his talk is about sketch data structure - data structures which uses a much smaller amount of memory to summarize the represented data.This is a cool technique and I am very interested to learn more.

Sketch data structures are commonly used in the following three use cases.

  • frequent items, like IP addresses
  • distinct items (harder to do)
  • approximate distribution of items

And it is now also becoming popular to use sketches to represent matrixes and graphs

  • matrix sketch (hot)
  • graph sketch (new)

Here is a list of sketches Jeff talked in his talk.

  • reservoir sampling: keep k samples over a streaming items

    keep the first k items;
    for j_th item where j>k, keep it at probability of k/j.

Frequent items sketches
Assume we want to find top k frequent items

  • Misra-Gries sketch (under-count)

    initialize k counters = 0
    for i_th item
    if (we have a counter that is not 0 for this item)
    increase the counter by 1
    elsif (we have a counter that is 0)
    increase the counter by 1 and assign this counter to this item
    else
    decrease all counters by 1

  • spacesaving sketch (over-count)

    initialize k counters = 0
    for i_th item
    if (we have a counter that is not 0 for this item)
    increase the counter by 1
    else
    find the counter with minimal value;
    increase the counter by 1 and assign this counter to this item

Two generic sketches to estimate frequency of any item

  • count-min sketch (over-count, report minimal value, support substraction)

    initialize a two-dimensional t*k array of counters
    pick t independent hash fuctions
    for i_th item
    foreach hash function h
    increase the counter of h(item)

    for any item, report the minimal value of h(item) as the estimation of its frequency

  • count sketch

    initialize a two-dimensional t*k array of counters
    pick t independent hash fuctions h_1,…,h_t, each maps an item to {1,…,k} pick t independent hash functions s_1,…,s_t, each maps an item to {+1, -1}
    for i_th item
    foreach hash function h_i
    h_i(item) += s_i(item)

    for any item, report the median value of h(item) as the estimation of its frequency

Count sketch: “Finding Frequent Items in Data Streams”, Moses Charikar, Kevin Chen and Martin Farach-Colton, ICALP ‘02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming.