http://www.cs.utah.edu/~suresh
suresh at cs utah edu
Ph: 801 581 8233
Room 3404, School of Computing
50 S. Central Campus Drive,
Salt Lake City, UT 84112.
Sketching Techniques for Large-Scale NLP
Saturday April 10th 2010, 12:25 am
Filed under: Papers

[author]Amit Goyal, Jagadeesh Jagarlamudi, Hal Daume and Suresh Venkatasubramanian[/author]
6th Web as Corpus Workshop (in conjunction with NAACL-HLT 2010)

Abstract:
In this paper, we address the challenges posed by large amounts of text data by exploiting the power of hashing in context of streaming data. We explore sketch techniques, especially Count-Min Sketch, which approximates the frequency of a word-pair in the corpus without explicitly storing the word-pairs themselves. We further use the idea of a conservative update with Count-Min Sketch to reduce the average relative error of its approximate counts by a factor of two. We show that it is possible to store all words and word-pairs counts computed from 37 GB of web data in just 2 billion counters (8 GB main memory). The number of these counters is upto 30 times less than the stream size which is really a big memory and space gain. In Semantic Orientation experiments, the PMI scores computed from 2 billion counters are as effective as exact PMI scores.

Links: PDF



No Comments so far



Leave a comment
Line and paragraph breaks automatic, e-mail address never displayed, HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

(required)

(required)