publications Hal Daumé III
about me
cv + bio
publications
research
teaching
students
software
photos
calendar
contact me
links
All of my publications are available here. My Erdös number is at most 3 (me to Suresh Venkatasubramanian to Andy Yao to Paul Erdös).

Click here for a listing without summaries. The SRC files are gzipped tars containing the tex, figures and required style files.

New Papers:

Infinite Predictor Subspace Models for Multitask Learning (Conference on Artificial Intelligence and Statistics 2010)
Kernelized Sorting for Natural Language Processing (Conference on Artificial Intelligence 2010)
Extracting Multilingual Topics from Unaligned Corpora (European Conference on Information Retrieval 2010)
Multi-Label Prediction via Sparse Infinite CCA (NIPS 2009)
Toward Plot Units: Automatic Affect State Analysis (NAACL Workshop on Computational Approaches to Analysis and Generation of Emotion in Text 2010)
Sketching Techniques for Large Scale NLP (NAACL Workshop on Web as a Corpus 2010)
Domain Adaptation meets Active Learning (NAACL Workshop on Active Learning for NLP 2010)
Unsupervised Part of Speech Tagging Without a Lexicon (NIPS Workshop on Grammar Induction, Representation of Language and Language Learning 2010)
Fast Search for Infinite latent Feature Models (NIPS Workshop on Non-parametric Bayes 2010)
Factor Regression Combining Heterogeneous Sources of Information (NIPS Workshop on Learning from Multiple Sources 2010)

Thesis

bib
ps
pdf
html
Practical Structured Learning for Natural Language Processing #
Ph.D. Thesis
Committee: D. Marcu, K. Knight, E. Hovy, S. Schaal, G. James, A. McCallum.
This thesis describes an algorithm for solving many of the complex prediction problems encountered in NLP applications. The algorithm comes with strong theoretical guarentees, is empirically effective in applications such as IE and summarization, is efficient and is easy to implement.

Journal Papers

bib
ps
pdf
html
Search-based Structured Prediction #
Machine Learning Journal (2009)
With: John Langford and Daniel Marcu.
We describe an algorithm, Searn for solving structured prediction problems for cases where neither the loss nor the features decompose over the structure. This algorithm comes with reduction guarantees about performance, and strong empirical results in both standard sequence labeling tasks and a novel document summarization task.
bib
html
A Bayesian Statistics Approach to Multiscale Coarse Graining #
Journal of Chemical Physics (2009)
With: Pu Liu, Qiang Shi and Gregory Voth.
We apply Bayesian analysis to model a multiscale coarse-grained force field in a thermodynamically consistent way. The robustness and accuracy of the Bayesian MS-CG algorithm is demonstrated for three different systems, including simple liquid methanol, polyalanine peptide solvated in explicit water, and a much more complicated peptide assembly with 32 NNQQNY hexapeptides.
bib
ps
pdf
tgz
Domain Adaptation for Statistical Classifiers #
Journal of Artificial Intelligence Research (2006)
With: Daniel Marcu.
We address the problem of moving classifiers from one domain to another (compromising the second "i" in i.i.d.) by treating the data distribution as a mixture of three sources: in-domain, out-of-domain and general. Inference is based on conditional EM for conditional models (maximum entropy) and we get quite good performance on three natural language tasks.
Note: There are errors in some of the equation derivations; I will put out an official errata soon!
bib
ps
pdf
tgz
Induction of Word and Phrase Alignments for Automatic Document Summarization #
Computational Linguistics (2005)
With: Daniel Marcu.
We develop a model for automatically building word and phrase alignments for the task of automatic document summarization, based on the segment HMM (or semi-HMM). This explicates and extends the EMNLP paper below, with more experiments and some syntactically motived transition models.
bib
ps
pdf
tgz
A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior #
Journal of Machine Learning Research (2005)
With: Daniel Marcu.
We develop three models for the supervised clustering problem based on the Dirichlet process as a prior distribution over the potentially infinite sets encountered in this problem. On three of four data sets, we achieve quite impressive results. This extends and completes the NIPS workshop paper below.

Conference Papers

new
bib
pdf
Infinite Predictor Subspace Models for Multitask Learning #
Conference on Artificial Intelligence and Statistics 2010
With: Piyush Rai.
We present a non-parametric Bayesian model for multitask learning based on the assumption that task parameters live in a common, latent subspace.
new
bib
pdf
Kernelized Sorting for Natural Language Processing #
Conference on Artificial Intelligence 2010
With: Jagadeesh Jagarlamudi and Seth Juarez.
We adapt the kernelized sorting algorithm for NLP tasks, and introduce a semi-supervised variant. We demonstrate its performance on document matching, transliteration and image processing.
new
bib
pdf
Extracting Multilingual Topics from Unaligned Corpora #
European Conference on Information Retrieval 2010
With: Jagadeesh Jagarlamudi.
We show how topic models can be used to model unaligned multilingual corpora, such as Wikipedia.
new
bib
pdf
Multi-Label Prediction via Sparse Infinite CCA #
Neural Information Processing Systems 2009
With: Piyush Rai.
Using an infinte variant of CCA, we propose a non-parametric Bayesian framework for semi-supervised dimensionality reduction for multitask learning and multi-label prediction.
bib
pdf
Markov Random Topic Fields #
Association for Computational Linguistics 2009
We show how to integrate topic models in an undirected graph for topic mining in -- for instance -- scientific publications. We explore several different model parameterizations.
bib
pdf
Bayesian Multitask Learning with Latent Hierarchies #
Uncertainty in Artificial Intelligence 2009
We show that hierarchical multitask learning can be accomplished even when the hierarchical structure is not known in advance. We use the coalescent to make this happen.
bib
pdf
Unsupervised Search-based Structured Prediction #
International Conference on Machine Learning 2009
We extend the Searn algorithm to handle unsupervised learning; experiments on dependency parsing are promising and easy to implement.
bib
pdf
Exponential Family Hybrid Semi-Supervised Learning #
International Joint Conference on Artificial Intelligence 2009
With: Arvind Agarwal.
We generalize the coupled prior approach to hybrid models probabilistic models to arbitrary exponential family distributions. Experiments show that this is a good idea.
bib
pdf
Streamed Learning: One-Pass SVMs #
International Joint Conference on Artificial Intelligence 2009
With: Piyush Rai and Suresh Venkatasubramanian.
We present a one-pass approach to learning in the support vector machine framework. Our algorithm leverages the computational geometry view of SVMs as minimum enclosing balls, plus results on streaming MEBs.
bib
pdf
Non-Parametric Bayesian Areal Linguistics #
North American Association for Computational Linguistics 2009
We present a model of linguistic phylogenetics that takes into account areas affects in a non-parametric way. Our model simultaneously discovers phylogenetic trees, areal clusters and areal features.
bib
pdf
Streaming for Large Scale NLP: Language Modeling #
North American Association for Computational Linguistics 2009
With: Amit Goyal and Suresh Venkatasubramanian.
We describe an approach to language modeling based on the streaming model of algorithms. We show that a single-pass, high efficiency approximate counting method can lead to tiny language models that perform as well as forever-to-build entropy-pruned language models.
bib
pdf
The Infinite Hierarchical Factor Regression Model #
Neural Information Processing Systems 2008
With: Piyush Rai.
We address shortcomings in factor analysis by: (1) we do not assume a known number of factors; (2) we do not assume factors are independent; (3) we do not assume all features are relevant to the factor analysis. We apply this to microarray analysis data.
bib
pdf
Cross-Task Knowledge-Constrained Self Training #
Empirical Methods in NLP 2008
We present a simple algorithm for self-training with knowledge that tells us when outputs for multiple tasks are "compatible." We can learn to do NER on a tiny amount of labeled data by this method.
bib
pdf
Structure Compilation: Trading Structure for Features #
International Conference on Machine Learning 2008
With: Percy Liang and Dan Klein.
We investigate the trade-off between structured models and feature-rich models, empirically and theoretically. We also present a way to trade structure for features using unlabeled data.
bib
pdf
Name Translation in Statistical Machine Translation: Learning When to Transliterate #
Association for Computational Linguistics 2008
With: Ulf Hermjakob and Kevin Knight.
We transliterate names in an end-to-end MT system, where names are explicitly tagged with there to transliterate or not. Our transliterator outperforms 3 of 4 human translators.
bib
ps
pdf
Bayesian Agglomerative Clustering with Coalescents #
Neural Information Processing Systems 2007
With: Yee Whye Teh and Daniel Roy.
We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman's coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over others, and demonstrate our approach in document clustering and phylolinguistics.
bib
ps
pdf
tgz
Frustratingly Easy Domain Adaptation #
Association for Computational Linguistics 2007
When both source and target labeled data are available, a very simple "merge" can lead to state-of-the-art results on the target task. This is essentially a reduction for domain adaptation and can be implimented in 10 lines of Perl. Slides available as OpenOffice and PDF.
Note: There is related work in multitask learning by Evgeniou and Pontil.
bib
ps
pdf
html
tgz
A Bayesian Model for Discovering Typological Implications #
Association for Computational Linguistics 2007
With: Lyle Campbell.
Based on the WALS data, we automatically discover typological implications of the form "verb-object implies noun-adjective." Many of our discovered implications are well known; some are not. We introduce a hierarchical prior to cope with the sampling problem. Slides available as OpenOffice and PDF.
bib
ps
pdf
tgz
Fast search for Dirichlet process mixture models #
Conference on AI and Statistics (2007)
When using DPs for mixture modeling, one often only cares about getting the MAP cluster assignment. In such cases, we show that it is a good idea just to do an efficient search, rather than performing (expensive) sampling. A surprising result emerges: nearly greedy search actually works incredibly well! Slides available as OpenOffice and PDF.
bib
ps
pdf
tgz
Bayesian Query-Focused Summarization #
Association for Computational Linguistics 2006
With: Daniel Marcu.
We describe the BayeSum system (previously called BQFS in our DUC and MSE papers) for query-focused sentence extraction in a Bayesian framework (the model looks like a topic model if you squint a little). Achieves competitive performance on white-box experiments and also leads to good systems in DUC. Also, from an IR perspective, provides a formalism for statistically grounded query expansion. Slides available as OpenOffice and PDF.
bib
ps
pdf
tgz
A Large-Scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model #
Human Language Technologies/Empirical Methods in NLP 2005
With: Daniel Marcu.
We apply the LaSO technique to the EDT problem (entity identification and coreference); doing so allows us to use many more interesting features than have been previously available in models for this task, and we get very good results on the ACE benchmark data (with very efficient algorithms).
bib
ps
pdf
tgz
Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction #
International Conference on Machine Learning 2005
With: Daniel Marcu.
We describe a machine learning framework for producing structured outputs (such as sequences, parse trees, sentences, etc.) when search is intractable. The framework is based on combining online updates with a generic search algorithm, and leads to very efficient, very effective models. Slides in OpenOffice or PDF format.
Note: See Alan Fern's web page for follow-up work.
bib
ps
pdf
A Phrase-Based HMM Approach to Document/Abstract Alignment #
Empirical Methods in NLP 2004
With: Daniel Marcu.
We present a model called the phrase-based HMM (see under unpublished work below) for producing word-to-word and phrase-to-phrase alignments between documents and abstracts. This is in some sense my seminal accomplishment in summarization, and one of the pieces of work I am currently most proud of. I have also made available the annotation software and guide. The slides from an extended version of the talk are available in OpenOffice format, or for view in your browser (warning: animations don't work in the HTML version, which makes some slides impossible to read).
bib
ps
pdf
NP Bracketing by Maximum Entropy Tagging and SVM Reranking #
Empirical Methods in NLP 2004
With: Daniel Marcu.
NP Bracketing is the task of identifying all base- and embedded-NPs. We do this task first by ME tagging using an underspecified tagset, and then SVM-based reranking of hypotheses. We get best performance to date, and offer advantages over full parsers.
bib
ps
pdf
Web Search Intent Induction via Automatic Query Reformulation #
North American Association for Computational Linguistics 2004 Short Paper
With: Eric Brill.
We describe an algorithm that uses queries made by other people to help nail down intents for underspecified queries, and show its effectiveness in a real system. This is a short version of an unpublished paper available in Postscript or PDF format.
bib
ps
pdf
The Importance of Lexicalized Syntax Models for Natural Language Generation Tasks #
International Conference on Natural Language Generation 2002
With: Kevin Knight, Irene Langkilde-Geary, Daniel Marcu and Kenji Yamada.
We compare the performance difference of n-gram language models and syntax-based langauge models on pure natural language generation, summarization and machine translation. Our results show that lexicalized syntax-based models greatly improve readability.
bib
ps
pdf
A Noisy-Channel Model for Document Compression #
Association for Computational Linguistics 2002
With: Daniel Marcu.
We present a document compression system that uses a hierarchical noisy-channel model of text production based on combining discourse trees and syntax trees. We apply this system to summarization.
bib
ps
pdf
Integrated Information Management: An Interactive, Extensible Architecture for Information Retrieval #
Human Language Technologies 2001
With: Eric Nyberg.
We present the Integrated Information Management (IIM) architecture for component-based development of IR applications, which is general enough to model different types of IR tasks, beyond indexing and retrieval.

Workshop Papers

new
bib
pdf
Toward Plot Units: Automatic Affect State Analysis #
NAACL Workshop on Computational Approaches to Analysis and Generation of Emotion in Text 2010
With: Amit Goyal, Ellen Riloff and Nathan Gilbert.
We discuss the task of affect state analysis in the context of the Plot Units formalism, and how it differs from sentiment analysis. We show how existing resources, together with linguistically-inspired projection rules, and affective verbs extracted from large corpora can be put together to build a system for affect state identification in fables.
new
bib
pdf
Sketching Techniques for Large Scale NLP #
NAACL Workshop on Web as a Corpus 2010
With: Amit Goyal, Jagadeesh Jagarlamudi and Suresh Venkatasubramanian.
We demonstrate the power of hashing techniques (including the count-min sketch) for counting word pairs. This leads to small, efficient models for computing pairwise mutual information of any two words.
new
bib
pdf
Domain Adaptation meets Active Learning #
NAACL Workshop on Active Learning for NLP 2010
With: Piyush Rai, Avishek Saha and Suresh Venkatasubramanian.
We present a simple algorithm for active domain adaptation that achieves significant reductions in sample complexity.
new
bib
pdf
Unsupervised Part of Speech Tagging Without a Lexicon #
NIPS Workshop on Grammar Induction, Representation of Language and Language Learning 2010
With: Adam Teichert.
We show how to use simple typological knowledge to improve unsupervised part of speech tagging. The basic result is that we can do without seed lexicons if we know a little linguistics.
new
bib
pdf
Fast Search for Infinite latent Feature Models #
NIPS Workshop on Non-parametric Bayes 2010
With: Piyush Rai.
We show that, like A* and beam search can solve MAP for Dirichlet Processes, it can also be applied to the Indian Buffet Process.
new
bib
pdf
Factor Regression Combining Heterogeneous Sources of Information #
NIPS Workshop on Learning from Multiple Sources 2010
With: Amrish Kapoor and Piyush Rai.
We combine text and microarray data in a factor regression model to achieve higher prediction accuracy.
bib
pdf
Semi-supervised or Semi-unsupervised? #
NAACL Workshop on Semi-supervised Learning for NLP, 2009
Are you doing learning with labeled and unlabeled data in such a way that it would work with only unlabeled data or such that it would work with only labeled data? Why not both? This is an invited position paper for the SSLNLP workshop.
bib
pdf
Perceptron-based Coherence Prediction #
Chip Multiprocessor Memory Systems and Interconnects, ICSA 2008
With: Devyani Ghosh and John Carter.
We present a learning solution to the coherence problem in multicore systems, achieve very high precision and recall on the "will you need this data" prediction task.
bib
ps
pdf
tgz
Search-Based Structured Prediction as Classification #
NIPS 2005 Workshop on Advances in Structured Learning for Text and Speech Processing
With: John Langford and Daniel Marcu.
We extend and formalize the LaSO framework developed in the ICML 2005 paper and obtain significantly stronger theoretical guarantees, even for non-0/1 loss. Relationships to reinforcement learning are also discussed.
bib
ps
pdf
tgz
Bayesian Summarization and DUC and a Suggestion for Extrinsic Evaluation #
Document Underanding Conference (DUC) 2005
With: Daniel Marcu.
Since we submitted essentially the same system to DUC as to MSE, we briefly describe the minor differences, then spend the rest of the paper advocating a new evaluation technique, called Filtering by Summary. This is an extrinsic evaluation that is easy to deploy, requires little training, and isn't costly.
bib
ps
pdf
tgz
Bayesian Multi-Document Summarization at MSE #
ACL 2005 Workshop on Multilingual Summarization Evaluation (MSE)
With: Daniel Marcu.
We describe our entry into the MSE competition; it is based on our Bayesian Query-Focused Summarization model (under review), adapted to the multidocument setting. The model performed very well in the evaluation, coming in first according to the human evaluation and fifth or third according to the two automatic evaluations.
bib
ps
pdf
Supervised clustering with the Dirichlet process #
NIPS 2004 Learning With Structured Outputs Workshop
With: Daniel Marcu.
In problems such as reference matching, identity uncertainty and coreference resolution, you need to be able to learn to partition sets based on supervised examples. We present a Bayesian model for learning how to do this based on the semi-parametric Dirichlet process prior. A more complete version of this paper is in preparation.
bib
ps
pdf
Generic Sentence Fusion is an Ill-Defined Summarization Task #
Text Summarization Branches Out Workshop (ACL 2004)
With: Daniel Marcu.
We perform a series of human experiments to assess the legitimacy of the task of fusing two sentences without context or task. We find that there is no agreement between humans on this task, even in the limited case of 2 to 1 sentence fusion.
bib
ps
pdf
A Tree-Position Kernel for Document Compression #
Document Underanding Conference (DUC) 2004
With: Daniel Marcu.
We introduct the Tree-Position kernel for performing document compression, in a model which is a generalization of the document compression model from ACL 2002 (below). Works well by itself, but not in a full system. The slides from the talk are available as Postscript or PDF.
bib
pdf
GLEANS: A Generator of Logical Extracts and Abstracts for Nice Summaries #
Document Underanding Conference (DUC) 2002
With: Abdesammad Echihabi, Daniel Marcu, Dragos Stefan Munteanu and Radu Soricut.
We describe a summarization system that functions by mapping documents into a canonical database-like representation, categorizes the documents, and generates abstracts and extracts based on generic templates.

Book Review

ps
pdf
Book Review: Automatic Summarization (by Inderjeet Mani) #
Machine Translation
I reviewed the book Automatic Summarization by Inderjeet Mani for the MT journal. The version which appears here is slightly different than the one finally published, but the content is the same.

Unpublished Papers

The following papers are not published anywhere, nor have they been peer reviewed. I put them up because I think (hope!) people might find them useful.
bib
ps
pdf
html
Searn in Practice #
With: John Langford and Daniel Marcu.
We recently introduced an algorithm, Searn, for solving hard structured prediction problems. This report is designed to showcase how Searn can be applied to a wide variety of techniques and what really goes on behind the scenes. We show how to apply Searn to three common NLP problems: (1) sequence labeling, (2) parsing and (3) machine translation.
bib
ps
pdf
Carefully Approximated Bayes Factors for Feature Selection in MaxEnt Models #
I describe a method of feature selection in maximum entropy (logistic regression) models based on the Bayes factor. I approximate data evidence using the Laplace approximation, and using two reasonable assumptions manage to get a very efficient procedure that has many benefits, including automatic stopping.
bib
ps
pdf
Notes on CG and LM-BFGS Optimization of Logistic Regression #
These are notes on the implementation I have written of conjugate gradient and limited memory BFGS optimization for logistic regression (aka maximum entropy) classifiers. The notes were created because it is actually quite difficult to find good references on efficient implementation of these algorithms, though discussion of them exists everywhere.
ps
pdf
Support Vector Machines for Natural Language Processing #
These are notes to accompany a lecture I gave on the use of SVMs in an intro graduate level NLP class. The slides I used can be found here. Both sources contain approximately the same information, but the notes are much more in-depth and can be understood without a soundtrack.
bib
ps
pdf
From Zero to Reproducing Kernel Hilbert Spaces in Twelve Pages or Less #
This is a tutorial on the mathematics behind RKHSs and their use in machine learning (eg., support vector machines and gaussian processes).
bib
ps
pdf
html
Yet Another Haskell Tutorial #
A tutorial on the Haskell programming language, designed for people who have experience programming (just not in a functional language).
bib
ps
pdf
A Phrase-Based HMM #
This describes the derivation of the inference algorithms used in the EMNLP 2004 document/summary alignment paper above. Unfortunately, there was no room in that document to explain these in detail, so they are relegated to this unpublished version.
html
Some notes on binning for Good-Turing smoothing #
This reports a small experiment done to explain binning in Good-Turing smoothing for a lecture on that topic in a graduate intro to NLP class.
bib
ps
pdf
Asymmetry of Coordination #
A final paper for a graduate linguistics syntax class, examines the issue of coordination in English and Japanese and proposes a unified structure for this phenomena. Also reviews several proposals to date.

My Co-authors and the percentage of my papers they are on:

Daniel Marcu (32%)
Piyush Rai (11%)
Suresh Venkatasubramanian (6%)
Jagadeesh Jagarlamudi (5%)
Amit Goyal (5%)
John Langford (5%)
Kevin Knight (3%)
Adam Teichert (1%)
Amrish Kapoor (1%)
Avishek Saha (1%)
Ellen Riloff (1%)
Nathan Gilbert (1%)
Seth Juarez (1%)
Arvind Agarwal (1%)
Gregory Voth (1%)
Pu Liu (1%)
Qiang Shi (1%)
Dan Klein (1%)
Devyani Ghosh (1%)
John Carter (1%)
Percy Liang (1%)
Ulf Hermjakob (1%)
Daniel Roy (1%)
Lyle Campbell (1%)
Yee Whye Teh (1%)
Eric Brill (1%)
Abdesammad Echihabi (1%)
Dragos Stefan Munteanu (1%)
Irene Langkilde-Geary (1%)
Kenji Yamada (1%)
Radu Soricut (1%)
Eric Nyberg (1%)

My publishing productivity by year:

2010: 6; 2009: 14; 2008: 5; 2007: 4; 2006: 4; 2005: 7; 2004: 10; 2002: 4; 2001: 1

My dozen most popular papers (by download count):

  1. Yet Another Haskell Tutorial (119048)
  2. Practical Structured Learning for Natural Language Processing (4448)
  3. Frustratingly Easy Domain Adaptation (3168)
  4. Notes on CG and LM-BFGS Optimization of Logistic Regression (2943)
  5. Fast search for Dirichlet process mixture models (2119)
  6. Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction (2094)
  7. Bayesian Query-Focused Summarization (1853)
  8. Bayesian Agglomerative Clustering with Coalescents (1820)
  9. A Bayesian Model for Discovering Typological Implications (1714)
  10. Searn in Practice (1561)
  11. Domain Adaptation for Statistical Classifiers (1230)
  12. A Large-Scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model (1182)
quick links
   nlp blog
   searn
   nlp/ml meeting
   ml (cs5350)
   ai (cs5300)
   anlp (cs5964)
   mlrg (cs7941)
   algo (cs7936)
   whattosee
   thesis
   jmlr
   haskell tutorial
conferences
   psb 10
   soda 10
   aistats 10
   recomb 10
   naacl-hlt 10
   cvpr 10
   icml 10
   colt 10
   ismb 10
   aaai 10
   acl 10
   conll 10
   coling 10
   sigir 10
   kdd 10
   emnlp 10
   uai 10
last updated on eleven july, two thousand ten; contact me AT hal3 DOT name