publications Hal Daumé III
about me
cv + bio
publications
research
teaching
students
software
photos
calendar
contact me
links
All of my publications are available here. My Erdös number is at most 4 (me to J. Langford to A. Blum to J. Spencer to P. Erdös).

Click here for a listing without summaries. The SRC files are gzipped tars containing the tex, figures and required style files.

New Papers:

Search-based Structured Prediction (Machine Learning Journal Submitted)
Structure Compilation: Trading Structure for Features (ICML 2008)
Name Translation in Statistical Machine Translation: Learning When to Transliterate (ACL 2008)
Bayesian Agglomerative Clustering with Coalescents (NIPS 2007)
Frustratingly Easy Domain Adaptation (ACL 2007)
A Bayesian Model for Discovering Typological Implications (ACL 2007)

Thesis

bib
ps
pdf
html
Practical Structured Learning for Natural Language Processing #
Ph.D. Thesis
Committee: D. Marcu, K. Knight, E. Hovy, S. Schaal, G. James, A. McCallum.
This thesis describes an algorithm for solving many of the complex prediction problems encountered in NLP applications. The algorithm comes with strong theoretical guarentees, is empirically effective in applications such as IE and summarization, is efficient and is easy to implement.

Journal Papers

new
bib
ps
pdf
html
Search-based Structured Prediction #
Machine Learning Journal (Submitted)
With: John Langford and Daniel Marcu.
We describe an algorithm, Searn for solving structured prediction problems for cases where neither the loss nor the features decompose over the structure. This algorithm comes with reduction guarantees about performance, and strong empirical results in both standard sequence labeling tasks and a novel document summarization task.
bib
ps
pdf
tgz
Domain Adaptation for Statistical Classifiers #
Journal of Artificial Intelligence Research (2006)
With: Daniel Marcu.
We address the problem of moving classifiers from one domain to another (compromising the second "i" in i.i.d.) by treating the data distribution as a mixture of three sources: in-domain, out-of-domain and general. Inference is based on conditional EM for conditional models (maximum entropy) and we get quite good performance on three natural language tasks.
Note: There are errors in some of the equation derivations; I will put out an official errata soon!
bib
ps
pdf
tgz
Induction of Word and Phrase Alignments for Automatic Document Summarization #
Computational Linguistics (2005)
With: Daniel Marcu.
We develop a model for automatically building word and phrase alignments for the task of automatic document summarization, based on the segment HMM (or semi-HMM). This explicates and extends the EMNLP paper below, with more experiments and some syntactically motived transition models.
bib
ps
pdf
tgz
A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior #
Journal of Machine Learning Research (2005)
With: Daniel Marcu.
We develop three models for the supervised clustering problem based on the Dirichlet process as a prior distribution over the potentially infinite sets encountered in this problem. On three of four data sets, we achieve quite impressive results. This extends and completes the NIPS workshop paper below.

Conference Papers

new
bib
pdf
Structure Compilation: Trading Structure for Features #
International Conference on Machine Learning 2008
With: Percy Liang and Dan Klein.
We investigate the trade-off between structured models and feature-rich models, empirically and theoretically. We also present a way to trade structure for features using unlabeled data.
new
bib
pdf
Name Translation in Statistical Machine Translation: Learning When to Transliterate #
Association for Computational Linguistics 2008
With: Ulf Hermjakob and Kevin Knight.
We transliterate names in an end-to-end MT system, where names are explicitly tagged with there to transliterate or not. Our transliterator outperforms 3 of 4 human translators.
new
bib
ps
pdf
Bayesian Agglomerative Clustering with Coalescents #
Neural Information Processing Systems 2007
With: Yee Whye Teh and Daniel Roy.
We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman's coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over others, and demonstrate our approach in document clustering and phylolinguistics.
new
bib
ps
pdf
tgz
Frustratingly Easy Domain Adaptation #
Association for Computational Linguistics 2007
When both source and target labeled data are available, a very simple "merge" can lead to state-of-the-art results on the target task. This is essentially a reduction for domain adaptation and can be implimented in 10 lines of Perl. Slides available as OpenOffice and PDF.
Note: There is related work in multitask learning by Evgeniou and Pontil.
new
bib
ps
pdf
html
tgz
A Bayesian Model for Discovering Typological Implications #
Association for Computational Linguistics 2007
With: Lyle Campbell.
Based on the WALS data, we automatically discover typological implications of the form "verb-object implies noun-adjective." Many of our discovered implications are well known; some are not. We introduce a hierarchical prior to cope with the sampling problem. Slides available as OpenOffice and PDF.
bib
ps
pdf
tgz
Fast search for Dirichlet process mixture models #
Conference on AI and Statistics (2007)
When using DPs for mixture modeling, one often only cares about getting the MAP cluster assignment. In such cases, we show that it is a good idea just to do an efficient search, rather than performing (expensive) sampling. A surprising result emerges: nearly greedy search actually works incredibly well! Slides available as OpenOffice and PDF.
bib
ps
pdf
tgz
Bayesian Query-Focused Summarization #
Association for Computational Linguistics 2006
With: Daniel Marcu.
We describe the BayeSum system (previously called BQFS in our DUC and MSE papers) for query-focused sentence extraction in a Bayesian framework (the model looks like a topic model if you squint a little). Achieves competitive performance on white-box experiments and also leads to good systems in DUC. Also, from an IR perspective, provides a formalism for statistically grounded query expansion. Slides available as OpenOffice and PDF.
bib
ps
pdf
tgz
A Large-Scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model #
Human Language Technologies/Empirical Methods in NLP 2005
With: Daniel Marcu.
We apply the LaSO technique to the EDT problem (entity identification and coreference); doing so allows us to use many more interesting features than have been previously available in models for this task, and we get very good results on the ACE benchmark data (with very efficient algorithms).
bib
ps
pdf
tgz
Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction #
International Conference on Machine Learning 2005
With: Daniel Marcu.
We describe a machine learning framework for producing structured outputs (such as sequences, parse trees, sentences, etc.) when search is intractable. The framework is based on combining online updates with a generic search algorithm, and leads to very efficient, very effective models. Slides in OpenOffice or PDF format.
Note: See Alan Fern's web page for follow-up work.
bib
ps
pdf
A Phrase-Based HMM Approach to Document/Abstract Alignment #
Empirical Methods in NLP 2004
With: Daniel Marcu.
We present a model called the phrase-based HMM (see under unpublished work below) for producing word-to-word and phrase-to-phrase alignments between documents and abstracts. This is in some sense my seminal accomplishment in summarization, and one of the pieces of work I am currently most proud of. I have also made available the annotation software and guide. The slides from an extended version of the talk are available in OpenOffice format, or for view in your browser (warning: animations don't work in the HTML version, which makes some slides impossible to read).
bib
ps
pdf
NP Bracketing by Maximum Entropy Tagging and SVM Reranking #
Empirical Methods in NLP 2004
With: Daniel Marcu.
NP Bracketing is the task of identifying all base- and embedded-NPs. We do this task first by ME tagging using an underspecified tagset, and then SVM-based reranking of hypotheses. We get best performance to date, and offer advantages over full parsers.
bib
ps
pdf
Web Search Intent Induction via Automatic Query Reformulation #
North American Association for Computational Linguistics 2004 Short Paper
With: Eric Brill.
We describe an algorithm that uses queries made by other people to help nail down intents for underspecified queries, and show its effectiveness in a real system. This is a short version of an unpublished paper available in Postscript or PDF format.
bib
ps
pdf
The Importance of Lexicalized Syntax Models for Natural Language Generation Tasks #
International Conference on Natural Language Generation 2002
With: Kevin Knight, Irene Langkilde-Geary, Daniel Marcu and Kenji Yamada.
We compare the performance difference of n-gram language models and syntax-based langauge models on pure natural language generation, summarization and machine translation. Our results show that lexicalized syntax-based models greatly improve readability.
bib
ps
pdf
A Noisy-Channel Model for Document Compression #
Association for Computational Linguistics 2002
With: Daniel Marcu.
We present a document compression system that uses a hierarchical noisy-channel model of text production based on combining discourse trees and syntax trees. We apply this system to summarization.
bib
ps
pdf
Integrated Information Management: An Interactive, Extensible Architecture for Information Retrieval #
Human Language Technologies 2001
With: Eric Nyberg.
We present the Integrated Information Management (IIM) architecture for component-based development of IR applications, which is general enough to model different types of IR tasks, beyond indexing and retrieval.

Workshop Papers

bib
ps
pdf
tgz
Search-Based Structured Prediction as Classification #
NIPS 2005 Workshop on Advances in Structured Learning for Text and Speech Processing
With: John Langford and Daniel Marcu.
We extend and formalize the LaSO framework developed in the ICML 2005 paper and obtain significantly stronger theoretical guarantees, even for non-0/1 loss. Relationships to reinforcement learning are also discussed.
bib
ps
pdf
tgz
Bayesian Summarization and DUC and a Suggestion for Extrinsic Evaluation #
Document Underanding Conference (DUC) 2005
With: Daniel Marcu.
Since we submitted essentially the same system to DUC as to MSE, we briefly describe the minor differences, then spend the rest of the paper advocating a new evaluation technique, called Filtering by Summary. This is an extrinsic evaluation that is easy to deploy, requires little training, and isn't costly.
bib
ps
pdf
tgz
Bayesian Multi-Document Summarization at MSE #
ACL 2005 Workshop on Multilingual Summarization Evaluation (MSE)
With: Daniel Marcu.
We describe our entry into the MSE competition; it is based on our Bayesian Query-Focused Summarization model (under review), adapted to the multidocument setting. The model performed very well in the evaluation, coming in first according to the human evaluation and fifth or third according to the two automatic evaluations.
bib
ps
pdf
Supervised clustering with the Dirichlet process #
NIPS 2004 Learning With Structured Outputs Workshop
With: Daniel Marcu.
In problems such as reference matching, identity uncertainty and coreference resolution, you need to be able to learn to partition sets based on supervised examples. We present a Bayesian model for learning how to do this based on the semi-parametric Dirichlet process prior. A more complete version of this paper is in preparation.
bib
ps
pdf
Generic Sentence Fusion is an Ill-Defined Summarization Task #
Text Summarization Branches Out Workshop (ACL 2004)
With: Daniel Marcu.
We perform a series of human experiments to assess the legitimacy of the task of fusing two sentences without context or task. We find that there is no agreement between humans on this task, even in the limited case of 2 to 1 sentence fusion.
bib
ps
pdf
A Tree-Position Kernel for Document Compression #
Document Underanding Conference (DUC) 2004
With: Daniel Marcu.
We introduct the Tree-Position kernel for performing document compression, in a model which is a generalization of the document compression model from ACL 2002 (below). Works well by itself, but not in a full system. The slides from the talk are available as Postscript or PDF.
bib
pdf
GLEANS: A Generator of Logical Extracts and Abstracts for Nice Summaries #
Document Underanding Conference (DUC) 2002
With: Abdesammad Echihabi, Daniel Marcu, Dragos Stefan Munteanu and Radu Soricut.
We describe a summarization system that functions by mapping documents into a canonical database-like representation, categorizes the documents, and generates abstracts and extracts based on generic templates.

Book Review

ps
pdf
Book Review: Automatic Summarization (by Inderjeet Mani) #
Machine Translation
I reviewed the book Automatic Summarization by Inderjeet Mani for the MT journal. The version which appears here is slightly different than the one finally published, but the content is the same.

Unpublished Papers

The following papers are not published anywhere, nor have they been peer reviewed. I put them up because I think (hope!) people might find them useful.
bib
ps
pdf
html
Searn in Practice #
With: John Langford and Daniel Marcu.
We recently introduced an algorithm, Searn, for solving hard structured prediction problems. This report is designed to showcase how Searn can be applied to a wide variety of techniques and what really goes on behind the scenes. We show how to apply Searn to three common NLP problems: (1) sequence labeling, (2) parsing and (3) machine translation.
bib
ps
pdf
Carefully Approximated Bayes Factors for Feature Selection in MaxEnt Models #
I describe a method of feature selection in maximum entropy (logistic regression) models based on the Bayes factor. I approximate data evidence using the Laplace approximation, and using two reasonable assumptions manage to get a very efficient procedure that has many benefits, including automatic stopping.
bib
ps
pdf
Notes on CG and LM-BFGS Optimization of Logistic Regression #
These are notes on the implementation I have written of conjugate gradient and limited memory BFGS optimization for logistic regression (aka maximum entropy) classifiers. The notes were created because it is actually quite difficult to find good references on efficient implementation of these algorithms, though discussion of them exists everywhere.
ps
pdf
Support Vector Machines for Natural Language Processing #
These are notes to accompany a lecture I gave on the use of SVMs in an intro graduate level NLP class. The slides I used can be found here. Both sources contain approximately the same information, but the notes are much more in-depth and can be understood without a soundtrack.
bib
ps
pdf
From Zero to Reproducing Kernel Hilbert Spaces in Twelve Pages or Less #
This is a tutorial on the mathematics behind RKHSs and their use in machine learning (eg., support vector machines and gaussian processes).
bib
ps
pdf
html
Yet Another Haskell Tutorial #
A tutorial on the Haskell programming language, designed for people who have experience programming (just not in a functional language).
bib
ps
pdf
A Phrase-Based HMM #
This describes the derivation of the inference algorithms used in the EMNLP 2004 document/summary alignment paper above. Unfortunately, there was no room in that document to explain these in detail, so they are relegated to this unpublished version.
html
Some notes on binning for Good-Turing smoothing #
This reports a small experiment done to explain binning in Good-Turing smoothing for a lecture on that topic in a graduate intro to NLP class.
bib
ps
pdf
Asymmetry of Coordination #
A final paper for a graduate linguistics syntax class, examines the issue of coordination in English and Japanese and proposes a unified structure for this phenomena. Also reviews several proposals to date.

My Co-authors and the percentage of my papers they are on:

Daniel Marcu (51%)
John Langford (8%)
Kevin Knight (5%)
Dan Klein (2%)
Percy Liang (2%)
Ulf Hermjakob (2%)
Daniel Roy (2%)
Lyle Campbell (2%)
Yee Whye Teh (2%)
Eric Brill (2%)
Abdesammad Echihabi (2%)
Dragos Stefan Munteanu (2%)
Irene Langkilde-Geary (2%)
Kenji Yamada (2%)
Radu Soricut (2%)
Eric Nyberg (2%)

My publishing productivity by year:

2008: 2; 2007: 4; 2006: 5; 2005: 7; 2004: 10; 2002: 4; 2001: 1

My dozen most popular papers (by download count):

  1. Yet Another Haskell Tutorial (97603)
  2. Search-based Structured Prediction (3353)
  3. Practical Structured Learning for Natural Language Processing (3348)
  4. Frustratingly Easy Domain Adaptation (2280)
  5. Notes on CG and LM-BFGS Optimization of Logistic Regression (2004)
  6. Fast search for Dirichlet process mixture models (1826)
  7. Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction (1533)
  8. Bayesian Query-Focused Summarization (1477)
  9. A Bayesian Model for Discovering Typological Implications (1261)
  10. Searn in Practice (1213)
  11. Bayesian Agglomerative Clustering with Coalescents (1004)
  12. Domain Adaptation for Statistical Classifiers (936)
quick links
   nlp blog
   searn
   nlp/ml meeting
   ml (cs5350)
   ai (cs5300)
   anlp (cs5964)
   mlrg (cs7941)
   algo (cs7936)
   whattosee
   thesis
   jmlr
   haskell tutorial
conferences
   coling 08
   kdd 08
   emnlp 08
   cikm 08
   nips 08
   soda 09
   eacl 09
   aistats 09
   recomb 09
   naacl-hlt 09
   icml 09
   colt 09
   sigir 09
   cvpr 09
   ismb 09
   kdd 09
   uai 09
   ijcai 09
   acl 09
   iccv 09
last updated on twenty one august, two thousand eight; contact me AT hal3 DOT name