Research
Managing the Scientific Discovery Process
The Information Management group has been building new cyberinfrastructure that streamlines the creation, execution and sharing of complex visualizations, data mining and other large-scale data analysis applications. We developed VisTrails (www.vistrails.org), a new open-source scientific workflow and provenance management system designed to manage the rapidly evolving workflows common in exploratory applications. VisTrails provides novel mechanisms for capturing and interacting with provenance that greatly simplify the data exploration process. The system has been downloaded over 8,000 times since its beta release in January 2007. VisTrails has been adopted as part of the cyberinfrastructure in large scientific projects, and as a teaching and learning tool in graduate and undergraduate courses, both in the U.S. and abroad.
Large-Scale Web Information Integration
There has been explosive growth in the volume of structured information on the Web. This information often resides in the hidden (or deep) Web, stored in databases and exposed only through queries over Web forms. A recent study by Google estimates that there are several million such form interfaces. However, the high-quality information in online databases can be hard to find: it is out of reach for traditional search engines, whose indexes include only content in the surface Web.
Our group is combining techniques from machine learning, information retrieval and databases to build infrastructure that automates, to a large extent, the process of discovering and organizing hidden-Web data sources, a necessary step toward large-scale retrieval and integration of Web information. This infrastructure will enable people and applications to more easily find the right databases and, consequently, the hidden information they are seeking on the Web. We have used our hidden-Web infrastructure to build DeepPeep (www.deeppeep.org), a new search engine for Web forms.