Utah High School ACM Programming Contest

2003 Take-Home Problem

Question Answering (QA) System

Project Details


The project for the take-home problem is to design and build a question answering system. For this project, we will use actual reading comprehension tests that are given to grade school children in the United States. These reading comprehension tests consist of short stories with 5 questions following each story. The materials come from levels 2-5 of ``The 5 W's'' books written by Linda Miller and obtained from Remedia Publications for research purposes. Here is a sample reading comprehension test problem (with question 5 missing, as explained in The Input section).

IMPORTANT: these on-line materials cannot be distributed to anyone else or used for any purpose other than this take-home problem. If you wish to use this data for other purposes, please contact us and we will tell you what you need to do.

The Remedia exams include an answer key that gives the correct answer for each question. Creating a Q/A system that can identify exact answers is difficult, so for this project we will focus on answer sentence identification. Your Q/A system should identify the sentence in the story that best answers each question. This is a much easier task and, from a practical perspective, nearly as useful for most real-world applications!

The Remedia books contain a key with the exact answers to each question, but we will use an answer key created by the MITRE corporation for answer sentence identification. The MITRE folks marked the sentence(s) in each story that they thought best answered each question. The MITRE answer key sometimes lists more than one correct sentence for a question — we have removed these stories from the data sets. The MITRE answer key also contains no correct sentences for about 11% of the questions! This occurs when the answer really spans two (or more) sentences and either one alone is insufficient. Consider the following example:

Question: What is the name of our national library?
Excerpts from the original story:
Sentence A: But the Library of Congress was built for all the people.
Sentence B: From the start, it was our national library.

The exact answer is ``Library of Congress''. But neither Sentence A nor Sentence B is sufficient by itself to answer the question. The pronoun ``it'' must be resolved across these sentences to determine the correct answer. This example illustrates a problem with trying to identify answer sentences instead of exact answers. Because of this problem with the answer sentence identification task, we have removed all stories with unanswerable questions to make your task easier. Be aware that this will inflate your results somewhat compared to those reported in the papers we have made available.


The Answer Key
Here is the answer key for the Library of Congress story. The sentence that best answers each question is surrounded by sgml tags labeled with the question number. For example, the tags <ANSQ3> and </ANSQ3> surround the sentence that best answers question #3.

Note that in some cases the answer to a question may come from the by-line (e.g., ANSQ4 in Figure 2. Answers to WHEN and WHERE questions are often found in the by-line, so you should strip off the by-line and treat it as a separate sentence in the text. For example, in The Output section you will notice that (ABERDEEN, S.D., September 14, 1963) is a legal answer to Question 4.

Judging answers is subjective in nature, so you may sometimes disagree with MITRE's decisions in the answer key. But people will never completely agree on these things, and it is necessary to choose some set of answers for evaluation purposes, so we will use MITRE's judgements as "The Truth".


The Data Sets
You will be using three sets of data at different points in the project.

  1. Training Set: ~20 stories and answer key
  2. Test Set #1: ~20 stories and answer key
  3. Test Set #2: ~45 stories and answer key

A recommendation - build your project in at least two phases:

  1. Development Phase: You will be given the Training Set and Test Set 1 to use in developing your Q/A systems. You may use these stories and the answer keys in any way that you wish. The training data can be found in
    /<somepath>/training_set/stories/
    where <somepath> corresponds to the location you put the training set and test set 1 when you unzipped the data files.
  2. Preliminary Evaluation: A week after starting the take-home problem, or with 2 weeks left to the deadline, conduct a preliminary evaluation of your Q/A system, using both the Training Set and Test Set 1. Doing so should be useful for assessing your progress and seeing how well your system works.
  3. Final Evaluation: Based on the results you achieve in your preliminary evaluation, fine-tune your system. On March 17, each team will hand in the final code for their Q/A system. We will run the Q/A systems on all the stories in the Training Set, Test Set #1, and Test Set #2. Your final project score will be based on the performance of your Q/A system on all of the test sets.

The purpose of evaluating your systems on all test sets is to balance specificity with generality. You will have approximately three weeks to try to get your Q/A systems to perform well on the Training Set and Test Set #1. Hopefully, everyone will be able to do fairly well on those data sets. Test Set #2 will be a blind test set that we will use in evaluating the systems on March 17, but will not be released. A system that uses general techniques should work just as well on Test Set #2 as Test Set #1. But a system that has lots of hacks and tweaks based on Test Set #1 probably will perform very poorly on Test Set #2. WARNING: You will be given the answer keys for the Training Set and Test Set #1, but your system is not allowed to use them when answering questions! The answer keys are being distributed only to show you what the correct answers should be, and to allow you to evaluate your Q/A systems automatically if you wish. Your system should use general techniques that can apply to a wide variety of texts.


The Input
Your Q/A system should accept a single input file which will contain one story and its associated questions. Each story file will be formatted like this. The first sentence is the headline of the story, which in some cases may contain the answer to a question. After that, the main story begins. Each story begins with a by-line that contains the date and/or location of the story. The answers to many WHEN and WHERE questions come from the headers, so you should strip off the header and treat it as a separate sentence.

At the end of each story is a set of 4 questions. There will always be 4 questions per story, numbered 1 through 4. You can identify the first question by looking for a line that begins with ``1.''. We have eliminated the "Why" questions (question number 5 in all stories) as they tend to be extremely difficult to get right.


The Output
As its output, your program should output to the screen the answer sentences for each question. It should be formatted as follows:

<question 1 answer>
<question 2 answer>
<question 3 answer>
<question 4 answer>

For example, real output might look like this:

Elvis loves french fries.
Daffy is a duck.
The natural language processing course rocks!
(ABERDEEN, S.D., September 14, 1963)

Make sure that each answer is printed as a single line. And make sure that you print each sentence EXACTLY as it appears in the story! Otherwise, your output may not be scored correctly. Each answer should be an exact sentence from the story, or the complete by-line from the story (as illustrated by the fourth answer sentence above). Refer to the basic system for a working sample of how your output should be done.


SUBMISSION
The take home problem should be submitted on the day of the contest when your team checks in. The following must be submitted:

  1. Source code, documenation, and an executable on floppy or CD.
  2. An organized print out of all your source code and documentation files. With your school and team members clearly identified.

So that we can better understand your program, include comments and use good programming style. Your comments should explain your algorithms and design. Also, write a "README" file explaining what you did and why, and anything else that you tried that you didn't use in your final solution. Explain what you learned while working on the take-home problem. Finally, so that we can continue to improve the contest, please give us some feedback about what you thought about the problem: how difficult or easy it was, what the hardest and easiest parts were, and other topics you would like to see in a take-home problem.

We will recompile your program from source if necessary. Please make sure that your source code compiles with one of the following compilers: Microsoft Visual C++, GCC/G++ (free), or Java JDK 1.3 (free).


SCORING
The performance of each Q/A system will be scored based on the percentage of questions answered correctly. An answer will be scored as correct if it exactly matches one of the sentences marked as correct for the corresponding question in the answer key.

Each project will be graded according to the following criteria:

  1. 10% of the score will be based on the performance of your Q/A system on the training set during the final evaluation.
  2. 20% of the score will be based on the performance of your Q/A system on Test Set #1 during the final evaluation.
  3. 50% of the score will be based on the performance of your Q/A system on Test Set #2 during the final evaluation.
  4. 20% of the score will be based on the quality of your code, which includes comments, modularity, and readability. Also included in this is the quality of your README file.

To compute the final score for the take-home problem, each Q/A system will be ranked on the criteria listed above relative to the other Q/A systems in the competetion. This score will be scaled so that the team with the highest average receives 200 points.


CAVEAT AND ENCOURAGEMENT
Building an effective question answering system is hard! Question answering is not even close to being a solved problem in NLP, and the performance of the very best research systems is still quite poor. To date, the best performance that we have seen on the Remedia exams is about 40% accuracy.

Therefore, do not expect your systems to achieve high accuracy! Performance around 40% would be terrific, but even that might not be attainable. Keep in mind, however, that randomly choosing a sentence would produce only about 5% accuracy, if there are 20 sentences per story on average. So anything higher than the baseline means that your system is doing something right. :)

We encourage everyone to have fun with this project and experiment with lots of ideas. Creativity is highly encouraged! We chose this project because question answering is an extremely important NLP application. Right now, it is an extremely hot research area, but hopefully in the next 10 years the technology will be good enough to start incorporating into real products. This project will give you exposure to a cutting edge research area, understanding of a crucial application area for NLP, and valuable experience building a real NLP system.


THE BASIC SYSTEM
To get you started we have provided you with a basic question answering system implemented in both C++ and Java. The basic system works by performing key word matching. It reads the story into a vector of sentences where each sentence is an instance of the "Sentence" object. Then for each question, the vector of sentences is traversed to find the sentence that has the most words in common with the current question. That sentence is determined to be the answer sentence and is output to the screen.

To facilitate this method, the "Sentence" class strips all punctuation from each sentence and divides the sentence into a vector of words. The original form of the sentence is also maintained for easy output if it is determined to be an answer to a question. Vectors are used in this implementation rather than arrays because they are easier to add and remove from and you don't have to know what size they should be before you create them. You can access vector elements similarly to how you access array elements.

You can use this code as a basis for your solution or you can start completely from scratch. Either way, your program should behave the same as the basic system. It should take the name of the story as a command line parameter and output one sentence per line answering each question in the order that they appear in the story. Note that the output sentences must match the sentence in the story exactly.


POSSIBLE TECHNIQUES TO IMPLEMENT
You may be wondering, "Okay, so now what do we do?". The following brief overviews, along with these papers are intended to help you answer such questions. You may also find it useful to type these techniques into your favorite search engine and dig deeper. Note that different methods can be used to answer each type of question (Who, What, When, Where). For example, the Where questions are more likely to have a place in the answer sentence so this information could be utilized to develop a specific technique for Where questions.

Pattern Matching
Pattern matching works well for questions whose answers have a small set of regular forms. For example, a question might ask, "When did they reach the shore?" A reasonable approach for applying pattern matching to answer this question might be to determine what words in the correct answer sentence could indicate that there is a time or a date present, e.g. today, Monday, am, morning, etc. Make a data structure containing these types of expressions, and look for them in sentences to find potential answer sentences. For more information on pattern matching, you may want to look at this paper which discusses patterns in QA.

Morphological Analysis
Morphological analysis boils down to this: figure out what the roots of words in the question sentences are, and compare them to the roots of words in the story sentences. Suppose a question asks, "Where did young Chris live?" You might determine that the word live is as simplified as it can be, and then in the story, you may find a sentence that says "Christopher Robin is alive and well." Here you can determine that alive has live as its root, and might consider the sentence as an answer. You may also find the sentence "He lives in England." Applying morphological analysis to this sentence may tell you that the root of lives is live, so this sentence could also be an answer.

Stop Words
Stop Words are words that don't help much in determining the answer. Suppose a question asks, "Where is the whitehouse?" and a story sentence that says, "The state of Virginia is where the third president of the United States is from." Depending on how you do your word matching, you can get a match on up to 6 words in the story sentence. Suppose the correct answer sentence is, "Another nice place to visit while in Washington, D.C. would be the whitehouse." The correct sentence will only match 2 words. You can make a list of words that are common to many sentences and are not likely to help find answer sentences. This list is called a stopword list. A stopword list is used to somehow discount the words in it, allowing you to concentrate on matching words that are important.

Named Entity Recognition
Named Entity Recognition provides a means of recognizing proper nouns. Suppose you have a question that asks, "Where is the capitol of the U.S.?" Perhaps the corresponding story has a sentence stating, "The President lives in Washington, D.C., the nation's capitol." Named Entity Recognition can help determine that Washington, D.C. is a proper noun, or even that it is a place, depending on how you implement it. This information can then help you look for sentences with the right kind of named entity — a person, a place, etc.

Part-of-Speech (POS) Tagging
Part-of-Speech tagging alone will not help much. But combine it with other techniques, and it can certainly help. For example, if you apply POS tagging to the story and the question, you may find that nouns in a story sentence that match nouns in a question sentence are a good indicator that the story sentence could be an answer. Or you may find that prepositions are helpful in finding answers to certain types of questions.

There is a POS tagger developed by Eric Brill (bottom of the page), formerly a faculty member of the Department of Computer Science at Johns Hopkins University, available for free if you want to do POS tagging without spending a lot of time learning about POS tagging. Feel free to use this tagger, but please do read the copyright and readme documents included.

Important: Note that use of Eric Brill's POS tagger is at your discretion. We assume no responsibility for your decision to use it, problems you may encounter using it, or violations or abuses of the copyright.


This page best viewed with a browser that supports CSS.

Utah High School ACM Programming Contest