

List of Tables

  1. Features used for segmenting English and Hindi
  2. Features used for segmenting Tamil and English
  3. Features used for segmenting Gujarati and Hindi

Abstract:

Documents are mainly paper-based media that carry information of various kinds, such as text, graphics, pictures, mathematical formulae, and tables. Today, most documents are stored in electronic form because this makes them easy to store, search, retrieve, modify, and transmit. Even so, a large number of documents (government files, books, magazines, newspapers, etc.) still exist only in print. Such documents lack the long-term persistence of electronic documents, as well as their ease of storage and retrieval. Hence, there is a need for efficient systems that convert document images to electronic form. In a multi-lingual country like India, a single document page may contain more than one script. Before Optical Character Recognition can be applied to such a page, the different scripts must first be separated, so multi-script document segmentation has a direct application in India. Here, we present an algorithm for segmenting a multi-script document into regions of the same script. Analyzing multi-script documents is a considerable challenge; solving it would be very useful in the Indian context, where it could help convert the vast number of available books, maps, etc. into electronic form and make them accessible to the masses. It could also be used for indexing document pages or web images, and for document understanding.


2002-06-03