http://www.cs.utah.edu/~suresh
suresh at cs utah edu
Ph: 801 581 8233
Room 3404, School of Computing
50 S. Central Campus Drive,
Salt Lake City, UT 84112.
Streaming for large scale NLP: Language Modelling
Monday January 19th 2009, 6:54 pm
Filed under: Papers

[author]Amit Goyal, Hal Daume and Suresh Venkatasubramanian[/author]
North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT) 2009 (to appear)

In this paper, we explore a streaming algorithm paradigm to handle large amounts of data for NLP problems. We present an efficient low-memory method for constructing high-order approximate n-gram frequency counts. The method is based on a deterministic streaming algorithm which efficiently computes approximate frequency counts over a stream of data while employing a small memory footprint. We show that this method easily scales to billion-word monolingual corpora using a conventional (4 GB RAM) desktop machine. Statistical machine translation experimental results corroborate that the resulting high-n approximate small language model is as effective as models obtained from other count pruning methods.

Links: PDF



No Comments so far



Leave a comment
Line and paragraph breaks automatic, e-mail address never displayed, HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

(required)

(required)