Information extraction systems typically rely on a dictionary of
extraction patterns to identify relevant information. In most
IE systems, these dictionaries are constructed manually, which is
extremely tedious and time-consuming. To address this
knowledge-engineering bottleneck, we have developed a system called
AutoSlog that automatically builds dictionaries of extraction
patterns
for new domains. AutoSlog uses an annotated corpus and simple
linguistic rules. A training corpus for AutoSlog must be annotated by
a person to indicate which noun phrases need to be extracted from a
text. For example, given the sentence ``The mayor was kidnapped by
armed men'', a person would mark the ``mayor'' as a kidnapping victim
and the ``armed men'' as perpetrators. AutoSlog then proposes patterns
that are capable of extracting these noun phrases. For the example
sentence, AutoSlog would create one pattern, ``X was kidnapped'', to
extract the mayor, and a second pattern, ``was kidnapped by Y'', to
extract the armed men. Because the patterns are general,
they will extract similar information from new texts as well.
AutoSlog has been used to build dictionaries for three different
domains: terrorism, joint ventures, and microelectronics. Given a
training corpus for the terrorism domain, AutoSlog produced a
dictionary that required only 5 person-hours of effort and achieved
98% of the performance of a hand-crafted dictionary that took
approximately 1500 person-hours to build. We are currently working with a new
version of AutoSlog, called AutoSlog-TS, that generates
dictionaries of extraction patterns using only preclassified texts
and does not require the detailed text annotations that AutoSlog needs.
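
The kidnapping example above can be illustrated with a minimal Python sketch. It is not AutoSlog's implementation: the single regular expression below is a hypothetical stand-in for the system's linguistic rules and handles only passive clauses of the form ``subject was verb by agent''. The names propose_patterns and apply_pattern are assumptions made for illustration.

```python
import re

# Hypothetical stand-in for AutoSlog's linguistic rules (an assumption):
# matches only clauses shaped like "<subject> was <verb> [by <agent>]".
PASSIVE = re.compile(
    r"^(?P<subj>.+?) was (?P<verb>\w+)(?: by (?P<agent>.+?))?\.?$"
)


def propose_patterns(sentence, annotations):
    """Map each annotated role to a pattern that would extract its noun phrase.

    `annotations` maps a noun phrase (as marked by a person) to its role.
    """
    match = PASSIVE.match(sentence)
    if match is None:
        return {}
    subj = match.group("subj").lower()
    agent = (match.group("agent") or "").lower()
    verb = match.group("verb")
    patterns = {}
    for phrase, role in annotations.items():
        if subj.endswith(phrase.lower()):
            patterns[role] = f"X was {verb}"
        elif agent.endswith(phrase.lower()):
            patterns[role] = f"was {verb} by Y"
    return patterns


def apply_pattern(pattern, sentence):
    """Return the text filling the X or Y slot in a new sentence, or None."""
    if pattern.startswith("X "):
        regex = r"^(.+?) " + re.escape(pattern[2:]) + r"\b"
    elif pattern.endswith(" Y"):
        regex = r"\b" + re.escape(pattern[:-2]) + r" (.+?)\.?$"
    else:
        return None
    found = re.search(regex, sentence)
    return found.group(1) if found else None


patterns = propose_patterns(
    "The mayor was kidnapped by armed men",
    {"mayor": "victim", "armed men": "perpetrator"},
)
new_text = "The ambassador was kidnapped by guerrillas."
for role, pattern in patterns.items():
    print(role, "|", pattern, "|", apply_pattern(pattern, new_text))
```

Because each proposed pattern keeps only the verb and the slot position, the same two patterns extract a victim and a perpetrator from the new sentence, which is the generalization the abstract describes.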