publications Hal Daumé III
about me
cv + bio
publications
research
teaching
students
software
photos
calendar
contact me
links
All of my publications are available here. My Erdös number is at most 3 (me to Suresh Venkatasubramanian to Andy Yao to Paul Erdös).

Click here for a listing without summaries. The SRC files are gzipped tars containing the tex, figures and required style files.

New Papers:

Search-based Structured Prediction (Machine Learning J. 2009)
A Bayesian Statistics Approach to Multiscale Coarse Graining (J. of Chemical Physics 2009)
Multi-Label Prediction via Sparse Infinite CCA (NIPS 2009)
Markov Random Topic Fields (ACL 2009)
Bayesian Multitask Learning with Latent Hierarchies (UAI 2009)
Unsupervised Search-based Structured Prediction (ICML 2009)
Exponential Family Hybrid Semi-Supervised Learning (IJCAI 2009)
Streamed Learning: One-Pass SVMs (IJCAI 2009)
Non-Parametric Bayesian Areal Linguistics (NAACL 2009)
Streaming for Large Scale NLP: Language Modeling (NAACL 2009)
The Infinite Hierarchical Factor Regression Model (NIPS 2008)
Cross-Task Knowledge-Constrained Self Training (EMNLP 2008)
Structure Compilation: Trading Structure for Features (ICML 2008)
Name Translation in Statistical Machine Translation: Learning When to Transliterate (ACL 2008)
Unsupervised Part of Speech Tagging Without a Lexicon (NIPS Workshop on Grammar Induction, Representation of Language and Language Learning)
Fast Search for Infinite latent Feature Models (NIPS Workshop on Non-parametric Bayes)
Semi-supervised or Semi-unsupervised? (NAACL Workshop on Semi-supervised Learning for NLP, 2009)
Perceptron-based Coherence Prediction (Chip Multiprocessor Memory Systems and Interconnects, ICSA 2008)

Thesis

bib
ps
pdf
html
Practical Structured Learning for Natural Language Processing #
Ph.D. Thesis
Committee: D. Marcu, K. Knight, E. Hovy, S. Schaal, G. James, A. McCallum.
This thesis describes an algorithm for solving many of the complex prediction problems encountered in NLP applications. The algorithm comes with strong theoretical guarentees, is empirically effective in applications such as IE and summarization, is efficient and is easy to implement.

Journal Papers

new
bib
ps
pdf
html
Search-based Structured Prediction #
Machine Learning Journal (2009)
With: John Langford and Daniel Marcu.
We describe an algorithm, Searn for solving structured prediction problems for cases where neither the loss nor the features decompose over the structure. This algorithm comes with reduction guarantees about performance, and strong empirical results in both standard sequence labeling tasks and a novel document summarization task.
new
bib
html
A Bayesian Statistics Approach to Multiscale Coarse Graining #
Journal of Chemical Physics (2009)
With: Pu Liu, Qiang Shi and Gregory Voth.
We apply Bayesian analysis to model a multiscale coarse-grained force field in a thermodynamically consistent way. The robustness and accuracy of the Bayesian MS-CG algorithm is demonstrated for three different systems, including simple liquid methanol, polyalanine peptide solvated in explicit water, and a much more complicated peptide assembly with 32 NNQQNY hexapeptides.
bib
ps
pdf
tgz
Domain Adaptation for Statistical Classifiers #
Journal of Artificial Intelligence Research (2006)
With: Daniel Marcu.
We address the problem of moving classifiers from one domain to another (compromising the second "i" in i.i.d.) by treating the data distribution as a mixture of three sources: in-domain, out-of-domain and general. Inference is based on conditional EM for conditional models (maximum entropy) and we get quite good performance on three natural language tasks.
Note: There are errors in some of the equation derivations; I will put out an official errata soon!
bib
ps
pdf
tgz
Induction of Word and Phrase Alignments for Automatic Document Summarization #
Computational Linguistics (2005)
With: Daniel Marcu.
We develop a model for automatically building word and phrase alignments for the task of automatic document summarization, based on the segment HMM (or semi-HMM). This explicates and extends the EMNLP paper below, with more experiments and some syntactically motived transition models.
bib
ps
pdf
tgz
A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior #
Journal of Machine Learning Research (2005)
With: Daniel Marcu.
We develop three models for the supervised clustering problem based on the Dirichlet process as a prior distribution over the potentially infinite sets encountered in this problem. On three of four data sets, we achieve quite impressive results. This extends and completes the NIPS workshop paper below.

Conference Papers

new
bib
pdf
Multi-Label Prediction via Sparse Infinite CCA #
Neural Information Processing Systems 2009
With: Piyush Rai.
Using an infinte variant of CCA, we propose a non-parametric Bayesian framework for semi-supervised dimensionality reduction for multitask learning and multi-label prediction.
new
bib
pdf
Markov Random Topic Fields #
Association for Computational Linguistics 2009
We show how to integrate topic models in an undirected graph for topic mining in -- for instance -- scientific publications. We explore several different model parameterizations.
new
bib
pdf
Bayesian Multitask Learning with Latent Hierarchies #
Uncertainty in Artificial Intelligence 2009
We show that hierarchical multitask learning can be accomplished even when the hierarchical structure is not known in advance. We use the coalescent to make this happen.
new
bib
pdf
Unsupervised Search-based Structured Prediction #
International Conference on Machine Learning 2009
We extend the Searn algorithm to handle unsupervised learning; experiments on dependency parsing are promising and easy to implement.
new
bib
pdf
Exponential Family Hybrid Semi-Supervised Learning #
International Joint Conference on Artificial Intelligence 2009
With: Arvind Agarwal.
We generalize the coupled prior approach to hybrid models probabilistic models to arbitrary exponential family distributions. Experiments show that this is a good idea.
new
bib
pdf
Streamed Learning: One-Pass SVMs #
International Joint Conference on Artificial Intelligence 2009
With: Piyush Rai and Suresh Venkatasubramanian.
We present a one-pass approach to learning in the support vector machine framework. Our algorithm leverages the computational geometry view of SVMs as minimum enclosing balls, plus results on streaming MEBs.
new
bib
pdf
Non-Parametric Bayesian Areal Linguistics #
North American Association for Computational Linguistics 2009
We present a model of linguistic phylogenetics that takes into account areas affects in a non-parametric way. Our model simultaneously discovers phylogenetic trees, areal clusters and areal features.
new
bib
pdf
Streaming for Large Scale NLP: Language Modeling #
North American Association for Computational Linguistics 2009
With: Amit Goyal and Suresh Venkatasubramanian.
We describe an approach to language modeling based on the streaming model of algorithms. We show that a single-pass, high efficiency approximate counting method can lead to tiny language models that perform as well as forever-to-build entropy-pruned language models.
new
bib
pdf
The Infinite Hierarchical Factor Regression Model #
Neural Information Processing Systems 2008
With: Piyush Rai.
We address shortcomings in factor analysis by: (1) we do not assume a known number of factors; (2) we do not assume factors are independent; (3) we do not assume all features are relevant to the factor analysis. We apply this to microarray analysis data.
new
bib
pdf
Cross-Task Knowledge-Constrained Self Training #
Empirical Methods in NLP 2008
We present a simple algorithm for self-training with knowledge that tells us when outputs for multiple tasks are "compatible." We can learn to do NER on a tiny amount of labeled data by this method.
new
bib
pdf
Structure Compilation: Trading Structure for Features #
International Conference on Machine Learning 2008
With: Percy Liang and Dan Klein.
We investigate the trade-off between structured models and feature-rich models, empirically and theoretically. We also present a way to trade structure for features using unlabeled data.
new
bib
pdf
Name Translation in Statistical Machine Translation: Learning When to Transliterate #
Association for Computational Linguistics 2008
With: Ulf Hermjakob and Kevin Knight.
We transliterate names in an end-to-end MT system, where names are explicitly tagged with there to transliterate or not. Our transliterator outperforms 3 of 4 human translators.
bib
ps
pdf
Bayesian Agglomerative Clustering with Coalescents #
Neural Information Processing Systems 2007
With: Yee Whye Teh and Daniel Roy.
We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman's coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over others, and demonstrate our approach in document clustering and phylolinguistics.
bib
ps
pdf
tgz
Frustratingly Easy Domain Adaptation #
Association for Computational Linguistics 2007
When both source and target labeled data are available, a very simple "merge" can lead to state-of-the-art results on the target task. This is essentially a reduction for domain adaptation and can be implimented in 10 lines of Perl. Slides available as OpenOffice and PDF.
Note: There is related work in multitask learning by Evgeniou and Pontil.
bib
ps
pdf
html
tgz
A Bayesian Model for Discovering Typological Implications #
Association for Computational Linguistics 2007
With: Lyle Campbell.
Based on the WALS data, we automatically discover typological implications of the form "verb-object implies noun-adjective." Many of our discovered implications are well known; some are not. We introduce a hierarchical prior to cope with the sampling problem. Slides available as OpenOffice and PDF.
bib
ps
pdf
tgz
Fast search for Dirichlet process mixture models #
Conference on AI and Statistics (2007)
When using DPs for mixture modeling, one often only cares about getting the MAP cluster assignment. In such cases, we show that it is a good idea just to do an efficient search, rather than performing (expensive) sampling. A surprising result emerges: nearly greedy search actually works incredibly well! Slides available as OpenOffice and PDF.
bib
ps
pdf
tgz
Bayesian Query-Focused Summarization #
Association for Computational Linguistics 2006
With: Daniel Marcu.
We describe the BayeSum system (previously called BQFS in our DUC and MSE papers) for query-focused sentence extraction in a Bayesian framework (the model looks like a topic model if you squint a little). Achieves competitive performance on white-box experiments and also leads to good systems in DUC. Also, from an IR perspective, provides a formalism for statistically grounded query expansion. Slides available as OpenOffice and PDF.
bib
ps
pdf
tgz
A Large-Scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model #
Human Language Technologies/Empirical Methods in NLP 2005
With: Daniel Marcu.
We apply the LaSO technique to the EDT problem (entity identification and coreference); doing so allows us to use many more interesting features than have been previously available in models for this task, and we get very good results on the ACE benchmark data (with very efficient algorithms).
bib
ps
pdf
tgz
Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction #
International Conference on Machine Learning 2005
With: Daniel Marcu.
We describe a machine learning framework for producing structured outputs (such as sequences, parse trees, sentences, etc.) when search is intractable. The framework is based on combining online updates with a generic search algorithm, and leads to very efficient, very effective models. Slides in OpenOffice or PDF format.
Note: See Alan Fern's web page for follow-up work.
bib
ps
pdf
A Phrase-Based HMM Approach to Document/Abstract Alignment #
Empirical Methods in NLP 2004
With: Daniel Marcu.
We present a model called the phrase-based HMM (see under unpublished work below) for producing word-to-word and phrase-to-phrase alignments between documents and abstracts. This is in some sense my seminal accomplishment in summarization, and one of the pieces of work I am currently most proud of. I have also made available the annotation software and guide. The slides from an extended version of the talk are available in OpenOffice format, or for view in your browser (warning: animations don't work in the HTML version, which makes some slides impossible to read).
bib
ps
pdf
NP Bracketing by Maximum Entropy Tagging and SVM Reranking #
Empirical Methods in NLP 2004
With: Daniel Marcu.
NP Bracketing is the task of identifying all base- and embedded-NPs. We do this task first by ME tagging using an underspecified tagset, and then SVM-based reranking of hypotheses. We get best performance to date, and offer advantages over full parsers.
bib
ps
pdf
Web Search Intent Induction via Automatic Query Reformulation #
North American Association for Computational Linguistics 2004 Short Paper
With: Eric Brill.
We describe an algorithm that uses queries made by other people to help nail down intents for underspecified queries, and show its effectiveness in a real system. This is a short version of an unpublished paper available in Postscript or PDF format.
bib
ps
pdf
The Importance of Lexicalized Syntax Models for Natural Language Generation Tasks #
International Conference on Natural Language Generation 2002
With: Kevin Knight, Irene Langkilde-Geary, Daniel Marcu and Kenji Yamada.
We compare the performance difference of n-gram language models and syntax-based langauge models on pure natural language generation, summarization and machine translation. Our results show that lexicalized syntax-based models greatly improve readability.
bib
ps
pdf
A Noisy-Channel Model for Document Compression #
Association for Computational Linguistics 2002
With: Daniel Marcu.
We present a document compression system that uses a hierarchical noisy-channel model of text production based on combining discourse trees and syntax trees. We apply this system to summarization.
bib
ps
pdf
Integrated Information Management: An Interactive, Extensible Architecture for Information Retrieval #
Human Language Technologies 2001
With: Eric Nyberg.
We present the Integrated Information Management (IIM) architecture for component-based development of IR applications, which is general enough to model different types of IR tasks, beyond indexing and retrieval.

Workshop Papers

new
bib
pdf
Unsupervised Part of Speech Tagging Without a Lexicon #
NIPS Workshop on Grammar Induction, Representation of Language and Language Learning
With: Adam Teichert.
We show how to use simple typological knowledge to improve unsupervised part of speech tagging. The basic result is that we can do without seed lexicons if we know a little linguistics.
new
bib
Fast Search for Infinite latent Feature Models #
NIPS Workshop on Non-parametric Bayes
With: Piyush Rai.
We show that, like A* and beam search can solve MAP for Dirichlet Processes, it can also be applied to the Indian Buffet Process.
new
bib
pdf
Semi-supervised or Semi-unsupervised? #
NAACL Workshop on Semi-supervised Learning for NLP, 2009
Are you doing learning with labeled and unlabeled data in such a way that it would work with only unlabeled data or such that it would work with only labeled data? Why not both? This is an invited position paper for the SSLNLP workshop.
new
bib
pdf
Perceptron-based Coherence Prediction #
Chip Multiprocessor Memory Systems and Interconnects, ICSA 2008
With: Devyani Ghosh and John Carter.
We present a learning solution to the coherence problem in multicore systems, achieve very high precision and recall on the "will you need this data" prediction task.
bib
ps
pdf
tgz
Search-Based Structured Prediction as Classification #
NIPS 2005 Workshop on Advances in Structured Learning for Text and Speech Processing
With: John Langford and Daniel Marcu.
We extend and formalize the LaSO framework developed in the ICML 2005 paper and obtain significantly stronger theoretical guarantees, even for non-0/1 loss. Relationships to reinforcement learning are also discussed.
bib
ps
pdf
tgz
Bayesian Summarization and DUC and a Suggestion for Extrinsic Evaluation #
Document Underanding Conference (DUC) 2005
With: Daniel Marcu.
Since we submitted essentially the same system to DUC as to MSE, we briefly describe the minor differences, then spend the rest of the paper advocating a new evaluation technique, called Filtering by Summary. This is an extrinsic evaluation that is easy to deploy, requires little training, and isn't costly.
bib
ps
pdf
tgz
Bayesian Multi-Document Summarization at MSE #
ACL 2005 Workshop on Multilingual Summarization Evaluation (MSE)
With: Daniel Marcu.
We describe our entry into the MSE competition; it is based on our Bayesian Query-Focused Summarization model (under review), adapted to the multidocument setting. The model performed very well in the evaluation, coming in first according to the human evaluation and fifth or third according to the two automatic evaluations.
bib
ps
pdf
Supervised clustering with the Dirichlet process #
NIPS 2004 Learning With Structured Outputs Workshop
With: Daniel Marcu.
In problems such as reference matching, identity uncertainty and coreference resolution, you need to be able to learn to partition sets based on supervised examples. We present a Bayesian model for learning how to do this based on the semi-parametric Dirichlet process prior. A more complete version of this paper is in preparation.
bib
ps
pdf
Generic Sentence Fusion is an Ill-Defined Summarization Task #
Text Summarization Branches Out Workshop (ACL 2004)
With: Daniel Marcu.
We perform a series of human experiments to assess the legitimacy of the task of fusing two sentences without context or task. We find that there is no agreement between humans on this task, even in the limited case of 2 to 1 sentence fusion.
bib
ps
pdf
A Tree-Position Kernel for Document Compression #
Document Underanding Conference (DUC) 2004
With: Daniel Marcu.
We introduct the Tree-Position kernel for performing document compression, in a model which is a generalization of the document compression model from ACL 2002 (below). Works well by itself, but not in a full system. The slides from the talk are available as Postscript or PDF.
bib
pdf
GLEANS: A Generator of Logical Extracts and Abstracts for Nice Summaries #
Document Underanding Conference (DUC) 2002
With: Abdesammad Echihabi, Daniel Marcu, Dragos Stefan Munteanu and Radu Soricut.
We describe a summarization system that functions by mapping documents into a canonical database-like representation, categorizes the documents, and generates abstracts and extracts based on generic templates.

Book Review

ps
pdf
Book Review: Automatic Summarization (by Inderjeet Mani) #
Machine Translation
I reviewed the book Automatic Summarization by Inderjeet Mani for the MT journal. The version which appears here is slightly different than the one finally published, but the content is the same.

Unpublished Papers

The following papers are not published anywhere, nor have they been peer reviewed. I put them up because I think (hope!) people might find them useful.
bib
ps
pdf
html
Searn in Practice #
With: John Langford and Daniel Marcu.
We recently introduced an algorithm, Searn, for solving hard structured prediction problems. This report is designed to showcase how Searn can be applied to a wide variety of techniques and what really goes on behind the scenes. We show how to apply Searn to three common NLP problems: (1) sequence labeling, (2) parsing and (3) machine translation.
bib
ps
pdf
Carefully Approximated Bayes Factors for Feature Selection in MaxEnt Models #
I describe a method of feature selection in maximum entropy (logistic regression) models based on the Bayes factor. I approximate data evidence using the Laplace approximation, and using two reasonable assumptions manage to get a very efficient procedure that has many benefits, including automatic stopping.
bib
ps
pdf
Notes on CG and LM-BFGS Optimization of Logistic Regression #
These are notes on the implementation I have written of conjugate gradient and limited memory BFGS optimization for logistic regression (aka maximum entropy) classifiers. The notes were created because it is actually quite difficult to find good references on efficient implementation of these algorithms, though discussion of them exists everywhere.
ps
pdf
Support Vector Machines for Natural Language Processing #
These are notes to accompany a lecture I gave on the use of SVMs in an intro graduate level NLP class. The slides I used can be found here. Both sources contain approximately the same information, but the notes are much more in-depth and can be understood without a soundtrack.
bib
ps
pdf
From Zero to Reproducing Kernel Hilbert Spaces in Twelve Pages or Less #
This is a tutorial on the mathematics behind RKHSs and their use in machine learning (eg., support vector machines and gaussian processes).
bib
ps
pdf
html
Yet Another Haskell Tutorial #
A tutorial on the Haskell programming language, designed for people who have experience programming (just not in a functional language).
bib
ps
pdf
A Phrase-Based HMM #
This describes the derivation of the inference algorithms used in the EMNLP 2004 document/summary alignment paper above. Unfortunately, there was no room in that document to explain these in detail, so they are relegated to this unpublished version.
html
Some notes on binning for Good-Turing smoothing #
This reports a small experiment done to explain binning in Good-Turing smoothing for a lecture on that topic in a graduate intro to NLP class.
bib
ps
pdf
Asymmetry of Coordination #
A final paper for a graduate linguistics syntax class, examines the issue of coordination in English and Japanese and proposes a unified structure for this phenomena. Also reviews several proposals to date.

My Co-authors and the percentage of my papers they are on:

Daniel Marcu (36%)
Piyush Rai (7%)
John Langford (5%)
Suresh Venkatasubramanian (3%)
Kevin Knight (3%)
Amit Goyal (1%)
Arvind Agarwal (1%)
Gregory Voth (1%)
Pu Liu (1%)
Qiang Shi (1%)
Dan Klein (1%)
Devyani Ghosh (1%)
John Carter (1%)
Percy Liang (1%)
Ulf Hermjakob (1%)
Daniel Roy (1%)
Lyle Campbell (1%)
Yee Whye Teh (1%)
Eric Brill (1%)
Abdesammad Echihabi (1%)
Dragos Stefan Munteanu (1%)
Irene Langkilde-Geary (1%)
Kenji Yamada (1%)
Radu Soricut (1%)
Adam Teichert (1%)
Eric Nyberg (1%)

My publishing productivity by year:

2009: 13; 2008: 5; 2007: 4; 2006: 4; 2005: 7; 2004: 10; 2002: 4; 2001: 1

My dozen most popular papers (by download count):

  1. Yet Another Haskell Tutorial (119048)
  2. Practical Structured Learning for Natural Language Processing (4448)
  3. Frustratingly Easy Domain Adaptation (3168)
  4. Notes on CG and LM-BFGS Optimization of Logistic Regression (2943)
  5. Fast search for Dirichlet process mixture models (2119)
  6. Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction (2094)
  7. Bayesian Query-Focused Summarization (1853)
  8. Bayesian Agglomerative Clustering with Coalescents (1820)
  9. A Bayesian Model for Discovering Typological Implications (1714)
  10. Searn in Practice (1561)
  11. Domain Adaptation for Statistical Classifiers (1230)
  12. A Large-Scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model (1182)
quick links
   nlp blog
   searn
   nlp/ml meeting
   ml (cs5350)
   ai (cs5300)
   anlp (cs5964)
   mlrg (cs7941)
   algo (cs7936)
   whattosee
   thesis
   jmlr
   haskell tutorial
conferences
   nips 09
   psb 10
   soda 10
   aistats 10
   recomb 10
   naacl-hlt 10
   cvpr 10
   icml 10
   colt 10
   ismb 10
   aaai 10
   acl 10
   conll 10
   coling 10
   sigir 10
   kdd 10
   emnlp 10
   uai 10
last updated on eight november, two thousand nine; contact me AT hal3 DOT name