[author]Amit Goyal, Jagadeesh Jagarlamudi, Hal Daume and Suresh Venkatasubramanian[/author]
6th Web as Corpus Workshop (in conjunction with NAACL-HLT 2010)
Abstract:
In this paper, we address the challenges posed by large amounts of text data by exploiting the power of hashing in the context of streaming data. We explore sketch techniques, especially Count-Min Sketch, which approximates the frequency of a word pair in the corpus without explicitly storing the word pairs themselves. We further use the idea of a conservative update with Count-Min Sketch to reduce the average relative error of its approximate counts by a factor of two. We show that it is possible to store all word and word-pair counts computed from 37 GB of web data in just 2 billion counters (8 GB of main memory). The number of counters is up to 30 times smaller than the stream size, a substantial saving in memory and space. In Semantic Orientation experiments, the PMI scores computed from 2 billion counters are as effective as exact PMI scores.
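To make the technique in the abstract concrete, here is a minimal Python sketch of a Count-Min Sketch with conservative update, plus a PMI estimate computed from its approximate counts. It is an illustration under stated assumptions, not the authors' implementation: the class name `CountMinSketch`, the helpers `pair_key` and `pmi_from_counts`, the salted BLAKE2b hashing, the tab-separated pair key, and the use of a single total `n` for both word and pair probabilities are all choices made for this example.

```python
import hashlib
import math


class CountMinSketch:
    """Illustrative Count-Min Sketch; names and hashing are assumptions."""

    def __init__(self, width, depth, conservative=True):
        self.width = width            # counters per row
        self.depth = depth            # number of rows (hash functions)
        self.conservative = conservative
        self.total = 0                # number of items seen in the stream
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One salted BLAKE2b hash per row (an assumed hash family).
        cols = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            cols.append(int.from_bytes(digest, "little") % self.width)
        return cols

    def estimate(self, item):
        # The minimum over rows never underestimates the true count.
        cols = self._columns(item)
        return min(self.table[r][c] for r, c in enumerate(cols))

    def add(self, item, count=1):
        cols = self._columns(item)
        self.total += count
        if self.conservative:
            # Conservative update: raise only the counters that fall
            # below (current estimate + count), leaving the rest alone.
            target = min(self.table[r][c] for r, c in enumerate(cols)) + count
            for r, c in enumerate(cols):
                if self.table[r][c] < target:
                    self.table[r][c] = target
        else:
            # Plain Count-Min update: increment every row's counter.
            for r, c in enumerate(cols):
                self.table[r][c] += count


def pair_key(w1, w2):
    # Hypothetical encoding of a word pair as a single sketch key.
    return w1 + "\t" + w2


def pmi_from_counts(pair_count, count1, count2, n):
    # PMI = log2( c(w1, w2) * n / (c(w1) * c(w2)) ), using one total n
    # for all probabilities (a simplifying assumption for this sketch).
    if pair_count == 0 or count1 == 0 or count2 == 0:
        return float("-inf")
    return math.log2(pair_count * n / (count1 * count2))


def pmi(sketch, w1, w2, n):
    # Approximate PMI from sketch estimates instead of exact counts.
    return pmi_from_counts(
        sketch.estimate(pair_key(w1, w2)),
        sketch.estimate(w1),
        sketch.estimate(w2),
        n,
    )
```

Words and word pairs share the same table here, so memory is fixed by `width * depth` regardless of how many distinct keys the stream contains. If each counter takes 4 bytes (an assumption, not stated in the abstract), 2 billion counters occupy about 8 GB, which matches the figure quoted above.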
Links: PDF