Research

Current Research

My research is primarily in the area of Natural Language Processing. Specifically, I develop methods to accurately identify specific events and associated information in free text (Information Extraction). I'm working with Dr. Ellen Riloff on pattern-based approaches for event-oriented Information Extraction.

I am a member of the NLP research group at the University of Utah. We meet regularly to discuss interesting research from the field of Natural Language Processing. Slides from my presentations at past meetings can be found here:

Table 1: Links to my past presentations.
Presentation | Date
Incorporating Dictionary and Corpus Information into a Context Vector Measure of Semantic Relatedness | 2003-09-10
Measuring Semantic Relatedness Using a Medical Taxonomy | 2003-10-08
Identifying Subjective Agents in Text | 2004-10-04
Improving Extraction Recall by Caseframe Extrapolation | 2005-03-08
Extracting Sources of Opinions from the World News | 2005-04-12
Detecting Overlap in Features Using a Subsumption Hierarchy (Poster) | 2006-03-31
Learning Domain-Specific Information Extraction Patterns from the Web | 2006-07-22
Feature Subsumption for Opinion Analysis (Poster) | 2006-07-23
Feature Subsumption for Opinion Analysis | 2006-10-23
Effective Information Extraction with Semantic Affinity Patterns and Relevant Regions (Poster) | 2007-03-30
Effective Information Extraction with Semantic Affinity Patterns and Relevant Regions | 2007-06-29

Previous Research

I completed a Master's degree in Computer Science at the University of Minnesota Duluth. I wrote my thesis in the area of Natural Language Processing, under the guidance of Dr. Ted Pedersen.

As part of my research, I studied the use of a number of WordNet-based measures of semantic relatedness in Natural Language Processing. Semantic Relatedness refers to the notion of similarity of words, or of the concepts they refer to. Humans are able to judge the relatedness of words (concepts) relatively easily, and are often in general agreement as to how related two words are. For example, few would disagree that "pencil" is more related to "paper" than it is to "boat". Miller and Charles (1991) confirmed this fact in a cognitive study, and attributed this human perception of relatedness to the overlap of contextual representations of concepts in the human mind. However, it remains an open question as to how we can create automatic computational methods that assign relatedness values or scores to pairs of concepts.

Researchers have proposed a number of measures of relatedness, many of which rely on information taken from the lexical database WordNet, possibly augmented with corpus-based statistics. In my research, I evaluated several of these measures, including those proposed by Resnik (1995), Jiang and Conrath (1997) and Lin (1998). I compared these measures, along with three others, in a human relatedness study and in Word Sense Disambiguation experiments.
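
As a rough illustration of how the three measures named above are commonly defined, the following Python sketch computes them from information content (IC) over a tiny taxonomy. The taxonomy, the counts, and all function names are hypothetical and chosen only for this example; none of it is taken from WordNet or from the released code.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "pencil": "writing_implement",
    "pen": "writing_implement",
    "writing_implement": "artifact",
    "boat": "vehicle",
    "vehicle": "artifact",
    "artifact": None,
}

# Hypothetical corpus counts of direct mentions of each concept.
COUNTS = {
    "pencil": 10, "pen": 30, "writing_implement": 0,
    "boat": 20, "vehicle": 20, "artifact": 20,
}


def ancestors(concept):
    """Return the concept followed by its ancestors, up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def information_content(concept):
    """IC(c) = -log P(c); a concept's count includes all of its descendants."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return math.log(total / subsumed)  # same value as -log(subsumed / total)


def lcs(c1, c2):
    """Lowest common subsumer: first ancestor of c1 that also subsumes c2."""
    others = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in others)


def resnik(c1, c2):
    """Resnik (1995): IC of the lowest common subsumer."""
    return information_content(lcs(c1, c2))


def lin(c1, c2):
    """Lin (1998): shared IC scaled by the IC of both concepts."""
    return 2 * resnik(c1, c2) / (information_content(c1) + information_content(c2))


def jiang_conrath_distance(c1, c2):
    """Jiang and Conrath (1997): a distance, so smaller means more related."""
    return information_content(c1) + information_content(c2) - 2 * resnik(c1, c2)
```

With these toy counts, "pencil" scores as more related to "pen" than to "boat" under all three measures. Note that the Lin score is undefined when both concepts have zero information content (for example, the root compared with itself); the sketch does not handle that case.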

Word Sense Disambiguation is the task of identifying the intended meaning of a word in a given context. Most words have multiple meanings. For example, the word "board" could mean a "board of directors" or an object one can write on. In WordNet, the word "line" has 29 different meanings as a noun! However, in a given context the speaker or writer intends only one meaning, or sense, of the word. We humans perform this task every day, effortlessly, and are barely aware of it. Surprisingly, it is an extremely difficult task for a computer. Banerjee and Pedersen (2003) built a system that selects, as the intended sense of a word, the sense that is most related to the meanings of the context words. They adapted Lesk's (1985) algorithm to WordNet to measure the relatedness of word senses.
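
The general idea of overlap-based disambiguation can be sketched in a few lines of Python. This is a deliberately simplified Lesk-style variant, not the adapted algorithm of Banerjee and Pedersen: it only counts the words shared between each candidate sense's definition and the surrounding context. The glosses, the stopword list, and the function names are hypothetical and written for this example.

```python
# Hypothetical glosses written for illustration; not copied from WordNet.
SENSES = {
    "board": {
        "committee": "a group of people who direct and manage a company",
        "plank": "a flat piece of wood or slate used for writing or building",
    },
}

STOPWORDS = {"a", "an", "the", "of", "or", "and", "for", "who", "to", "on", "in"}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def disambiguate(target, context):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context) - {target}

    def overlap(sense):
        return len(content_words(SENSES[target][sense]) & context_words)

    # On a tie, max() returns the first sense listed for the target word.
    return max(SENSES[target], key=overlap)


sense = disambiguate("board", "the teacher was writing on the board with a piece of chalk")
```

Here the context shares "writing" and "piece" with the second gloss and nothing with the first, so the "plank" sense is selected. A relatedness-based system replaces this raw word overlap with a score from a measure of semantic relatedness.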

In my research, I compared a number of measures of semantic relatedness in the Word Sense Disambiguation algorithm. I also developed and evaluated a new measure based on context vectors that combines the content of dictionary definitions with statistical information derived from large corpora. This measure is unusually flexible and robust, in that it does not depend on the structure of any particular dictionary, and it can incorporate information derived from any given corpus of text.
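
One way to picture a measure of this kind is sketched below in Python. It is an assumption-laden illustration of the general idea, not the implementation evaluated in the thesis: each word gets a co-occurrence vector from a corpus, each concept is represented by the sum of the vectors of the words in its definition, and relatedness is the cosine between two such vectors. The corpus, the definitions, the stopword list, and every function name are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical miniature corpus and definitions, for illustration only.
CORPUS = [
    "she wrote a note on paper with a pencil",
    "he sharpened the pencil and drew on the paper",
    "the boat sailed across the water",
    "they rowed the boat over the lake water",
]
DEFINITIONS = {
    "pencil": "tool used to write or draw on paper",
    "paper": "material for writing or drawing with a pencil",
    "boat": "small vessel for travel on water",
}
STOPWORDS = {"a", "the", "on", "with", "and", "to", "or", "for",
             "he", "she", "they", "over", "across", "used"}


def tokens(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def build_word_vectors(corpus):
    """Map each word to counts of the words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        words = tokens(sentence)
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def definition_vector(definition, vectors):
    """Sum the corpus vectors of the definition's words seen in the corpus."""
    total = Counter()
    for word in tokens(definition):
        if word in vectors:
            total.update(vectors[word])
    return total


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def relatedness(concept1, concept2, vectors):
    return cosine(definition_vector(DEFINITIONS[concept1], vectors),
                  definition_vector(DEFINITIONS[concept2], vectors))


word_vectors = build_word_vectors(CORPUS)
```

In this toy setting, relatedness("pencil", "paper", word_vectors) comes out higher than relatedness("pencil", "boat", word_vectors). The sketch also shows why such a measure needs no particular dictionary structure: it uses only the definition text and whatever corpus is supplied.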

You can read about all of this work in my thesis, and in our paper presented at CICLing-03. My thesis defense slides can be seen here. All the code written during the research was released under the GPL and can be found below.

Released Code

Table 2: Released code.
Module | Description
WordNet::Similarity | WordNet::Similarity is a set of Perl modules that use WordNet to compute the semantic similarity of words or concepts.
WordNet::SenseRelate::TargetWord | WordNet::SenseRelate is a set of Perl modules that attempt to detect the correct meaning of a word in a given context using measures of semantic relatedness.

Some of this code is hosted on the Comprehensive Perl Archive Network (CPAN) and SourceForge.net.

This site is maintained by Siddharth Patwardhan

Last updated: 2009-02-15
