Applications of NLP
CS 5964/6964
Fall 2007
Instructor: Hal Daumé III: me AT hal3 DOT name
Office Hours: MEB 3126; TBA (or by appointment)
Schedule: Monday/Wednesday, 9:10 - 10:30am
Location: MEB 3147
Mailing list: Cs5964 -- PLEASE subscribe (but don't post)!
Teach-Cs5964 -- send questions here!
TA: Scott Alfeld (office hours T 11-2, R 11-1 in Cade Lab)


 Background and Description

Natural language processing (NLP) is a diverse field that blends computer science with linguistics. Systems that can make use of the vast amounts of language data (text, speech, etc.) out in the world are becoming increasingly important. Applications such as machine translation, question answering and automatic document summarization are coming to be used by average Joes on their desktop computers without them really even knowing it.

This course is about developing systems for solving high-level natural language processing problems. It complements the Natural Language Processing course that has been offered in the past -- students are encouraged to take both, in either order (there is no prerequisite structure). At the end of the course, you will have built three substantial NLP applications by combining learned linguistic- and data-processing skills with useful tools.

Grading: This course is entirely homework- and project-based. There are three small homework assignments and three projects, one of each for each of the three segments. Homeworks are worth one point each; projects are worth three points each. There are no exams.

Readings: There are no required textbooks for this class. Occasionally we will make use of handouts and online tutorials. (Result: you save money!)

Prerequisites: There are none! Well, you have to know how to program. Since this course is largely project-based, you're expected to write a fair amount of code. However, we'll also make heavy use of existing tools (such as finite state toolkits), which means that the majority of the programming will be creating input files from text. I recommend Python, Perl, or a similar language that deals well with text, but I don't care which one you use. There will be a small amount of math, but nothing beyond basic probabilities: the chain rule and Bayes' rule (which we'll review).
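To give a rough sense of the level of math and programming involved, here is a minimal Python sketch of the chain rule and Bayes' rule applied to counts from text. The tiny tagged word list and the function names are invented for illustration only; they are not course data or course code.

```python
from collections import Counter
from fractions import Fraction

# Toy (word, tag) pairs, made up for this example (not course data).
pairs = [
    ("the", "DT"), ("can", "NN"), ("can", "MD"), ("rust", "VB"),
    ("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("red", "JJ"),
]

joint = Counter(pairs)                  # count(word, tag)
words = Counter(w for w, _ in pairs)    # count(word)
tags = Counter(t for _, t in pairs)     # count(tag)
n = len(pairs)


def p_tag(t):
    """Relative-frequency estimate of P(tag)."""
    return Fraction(tags[t], n)


def p_word(w):
    """Relative-frequency estimate of P(word)."""
    return Fraction(words[w], n)


def p_word_given_tag(w, t):
    """Relative-frequency estimate of P(word | tag)."""
    return Fraction(joint[(w, t)], tags[t])


def p_joint(w, t):
    """Chain rule: P(word, tag) = P(tag) * P(word | tag)."""
    return p_tag(t) * p_word_given_tag(w, t)


def p_tag_given_word(t, w):
    """Bayes' rule: P(tag | word) = P(word | tag) * P(tag) / P(word)."""
    return p_word_given_tag(w, t) * p_tag(t) / p_word(w)


if __name__ == "__main__":
    print(p_joint("can", "NN"))
    print(p_tag_given_word("NN", "can"))
```

Exact fractions are used so that the value from Bayes' rule can be compared directly against the count ratio count(word, tag) / count(word); with real data you would more likely use floats or log probabilities.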

 Topics Covered

The course is broken into three segments: language as a sequence, machine translation, and NLP on the web.

 Syllabus (tentative)

The following syllabus is subject to change, but likely not by very much. Homework assignments and projects are due by 11:59pm on the date listed on the syllabus. Readings should be completed by the date listed on the syllabus (e.g., you should have read NLTK 1 by the beginning of class on 22 Aug).

Date | Topics | Readings | HW | Notes
20 Aug | Introduction to natural language processing; Overview of class | - | HW1 out | -

LANGUAGE AS A SEQUENCE
22 Aug | Basic linguistic theory; Words, sentences, morphology, tagging; Corpora and tools | Unix for Poets; POS tag list | - | -
27 Aug | String processing techniques; Text-to-sound conversion; Finite state machines for language | NLTK 1 | HW1A due | -
29 Aug | Probability 101; Conditional, Bayes' rule, chain rule; Estimating probabilities from data | Carmel | - | -
5 Sep | Probability in strings; Noisy-channel framework; Probabilistic automata | - | P1 out | -
10 Sep | Language modeling; Distinguishing good strings from bad | - | HW1B due | -
12 Sep | Language modeling II; Sparse data problem, smoothing | Goodman | - | -
17 Sep | Probabilistic string transformations; Entity tagging | - | - | -
19 Sep | Probabilistic string transformations II; Automatic speech recognition | - | - | -
24 Sep | Catch-up | - | HW1C due | -

MACHINE TRANSLATION
26 Sep | Incomplete data; Cryptanalysis, transliteration | - | HW2 out | -
1 Oct | Incomplete data II; EM algorithm | EM notes | - | -
3 Oct | Word-based alignment models; IBM models 1 and 2 | SMT (pp.11-26) | HW2A due | -
15 Oct | Word-based alignment models II; HMM model; IBM models 3 and 4 | SMT (pp.30-45) | P1 due | -
17 Oct | Machine translation decoding; Integration with language models | - | P2 out | -
22 Oct | Catch-up | - | - | -
24 Oct | Toward phrase-based translation; Combination of alignments | SMT (pp.61-71) | HW2B due | -
29 Oct | Phrase-based translation; Beam search; Discriminative training | SMT (pp.89-99) | - | -
31 Oct | Evaluation; BLEU score | SMT (pp.157-175) | - | -
5 Nov | Syntax-based translation; Current research directions | - | - | -
7 Nov | Catch-up | - | HW2C due | -

NLP ON THE WEB
12 Nov | Information Retrieval; Inverted indices, TF-IDF | - | HW3 out; P3 out | -
14 Nov | Single-document summarization; Vector space model; Sentence extraction | - | P2 due | -
19 Nov | Headline generation; Keyword extraction using automata | - | HW3A due | -
21 Nov | Single-document summarization; Discourse and coherence; Responding to queries | - | - | -
26 Nov | Question answering I; Knowledge-lean approaches | - | - | -
28 Nov | Question answering II; Knowledge-rich approaches | - | - | -
3 Dec | Tree transducers and syntactic transformations | - | - | -
5 Dec | CLASS CANCELLED | - | P3, HW3B due | -

 Homework Assignments

See the syllabus above for due dates. There are three small homeworks and three programming projects. Please see the handin instructions.

Segment | Assignment | Topic
1 | Homework 1 | Basic linguistics and probability (Solution; hw1c-wordbigram.pl and hw1c-cblm.pl -- rename .txt to .pl)
1 | Project 1 | Language modeling and tagging (Solution and code)
2 | Homework 2 | Incomplete data (Solution and hw2b-make-ej.txt)
2 | Project 2 | Machine translation (Solution and my outputs and code)
3 | Homework 3 | NLP on the Web (Solution)
3 | Project 3 | Headline generation (data) (Solution and my outputs and code)

 Useful Links and Software

This course is reasonably different from existing courses at other universities. The closest are probably:

We will make use of the following software:

 Policies

The university document concerning adding, dropping, etc. is available here.

Cheating: Any assignment that is handed in must be your own work. However, talking with one another to understand the material better is strongly encouraged. Recognizing the distinction between cheating and cooperation is very important. If you copy someone else's solution, you are cheating. If you let someone else copy your solution, you are cheating. If someone dictates a solution to you, you are cheating. Everything you hand in must be in your own words, and based on your own understanding of the solution. If someone helps you understand the problem during a high-level discussion, you are not cheating. We strongly encourage students to help one another understand the material presented in class, in the book, and general issues relevant to the assignments. Any student who is caught cheating will be given an E in the course and referred to the University Student Behavior Committee. Please don't take that chance - if you're having trouble understanding the material, please let us know and we will be more than happy to help.

ADA: The University of Utah conforms to all standards of the Americans with Disabilities Act (ADA). If you wish to qualify for exemptions under this act, notify the Center for Disabled Students Services, 160 Union.