
Related Work

The first problem is to segment a document page into text, figures, tables, and so on; this task is known as page segmentation. A great deal of research has gone into page segmentation, and a number of algorithms have been proposed for it. Page segmentation algorithms can be categorized into three classes: top-down approaches, bottom-up approaches and hybrid approaches [3].
Top-down algorithms start from the whole document image and iteratively split it into smaller images; the splitting stops when some criterion is met. Examples of top-down approaches are the X-Y cut [4] and the shape-directed-covers-based algorithm [5]. Bottom-up algorithms start from the document image pixels and cluster them into connected components, which are then clustered into words, lines, or final zone segmentations. Examples of bottom-up approaches are the Docstrum algorithm [6], the Voronoi-diagram-based algorithm [7], the run-length smearing algorithm [8] and the text string separation algorithm [9]. Hybrid approaches are a mixture of the two; the split-and-merge algorithm [10] is one such algorithm.
All the above approaches work in the image (spatial) domain. However, these algorithms work only on a particular layout and do not perform well when the document images contain sparse characters.
The problem of page segmentation can instead be thought of as segmenting a document image into its different texture regions. This can be performed efficiently in an appropriate transform domain by treating the different components of the page as different textures. A lot of work has been done on multi-channel filtering techniques [11,12] and on the design of Gabor filters for texture segmentation [12]. A multi-channel filtering technique for locating the address block on envelopes was proposed by Jain [13]; his approach is presented in more detail in Section 4. However, not much work has been done on the segmentation of multi-script documents.
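To make the top-down idea concrete, the following is a minimal sketch of a recursive X-Y cut on a binary image. It is an illustration under stated assumptions, not the algorithm of [4] as published: the image is assumed to be a list of rows with 1 for ink and 0 for background, and the names `xy_cut`, `widest_gap` and the threshold `min_gap` are hypothetical. The stopping criterion used here is simply that no blank gap of at least `min_gap` rows or columns remains.

```python
def widest_gap(profile):
    """Return (start, length) of the longest run of zeros in a projection profile."""
    best_start, best_len, run_start = 0, 0, None
    for i, value in enumerate(profile + [1]):  # sentinel closes a trailing run
        if value == 0:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return best_start, best_len


def xy_cut(img, box=None, min_gap=2):
    """Split a binary image into zones; boxes are (top, left, bottom, right), half-open."""
    if box is None:
        box = (0, 0, len(img), len(img[0]))
    top, left, bottom, right = box

    # Shrink the box to the bounding box of its ink; an empty box yields no zone.
    ink_rows = [r for r in range(top, bottom) if any(img[r][left:right])]
    if not ink_rows:
        return []
    ink_cols = [c for c in range(left, right)
                if any(img[r][c] for r in range(top, bottom))]
    top, bottom = ink_rows[0], ink_rows[-1] + 1
    left, right = ink_cols[0], ink_cols[-1] + 1

    # Horizontal and vertical projection profiles (ink count per row / column).
    row_profile = [sum(img[r][left:right]) for r in range(top, bottom)]
    col_profile = [sum(img[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
    r_start, r_len = widest_gap(row_profile)
    c_start, c_len = widest_gap(col_profile)

    # Stopping criterion (assumed): no blank gap wide enough to cut.
    if max(r_len, c_len) < min_gap:
        return [(top, left, bottom, right)]

    if r_len >= c_len:  # cut along the wider gap; here a horizontal cut
        cut = top + r_start
        return (xy_cut(img, (top, left, cut, right), min_gap)
                + xy_cut(img, (cut + r_len, left, bottom, right), min_gap))
    cut = left + c_start  # otherwise a vertical cut
    return (xy_cut(img, (top, left, bottom, cut), min_gap)
            + xy_cut(img, (top, cut + c_len, bottom, right), min_gap))


# Toy page: two blocks side by side above one full-width block.
page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
]
zones = xy_cut(page)
```

On the toy page the sketch first cuts along the blank band of rows and then splits the upper part along the blank band of columns. Because the cuts follow fixed horizontal and vertical gaps, a scheme like this depends on the layout being rectangular, which is the kind of layout dependence noted above.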
Spitz suggested a method for separating Han-based and Latin-based scripts [14]. He used the optical density distribution of characters and the shape characteristics of frequently occurring words for segmentation. An automatic technique for separating regions containing Indian language scripts has been proposed by Pal and Chaudhuri [15]. Their system is able to identify Bangla, Devanagari, Roman and Urdu scripts. The approach builds a tree structure in which Roman and Urdu script lines are first separated from Bangla and Devanagari lines using the headline feature. Next, Roman and Urdu scripts are identified by a combined analysis of topological and statistical features, whereas Bangla and Devanagari words are separated by a stroke-feature-based approach. Chaudhary et al. have proposed a combination of two trainable classification schemes for the identification of Indian scripts. Both schemes use connected components extracted from the textural regions. The first classifier uses a Gabor-filter-based feature extraction scheme for the connected components. The second uses the shape of the connected components, taking into account the distribution of black pixels around every pixel of a connected component. The decisions of the two classifiers are then combined using a linear regression method.
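The headline feature mentioned above can be illustrated with a small sketch. Bangla and Devanagari words carry a horizontal line along the top of the characters, which Roman and Urdu text lacks. The test below is only a rough stand-in for the feature used in [15], whose exact definition is not given here: it assumes a binarized word image (list of rows, 1 for ink), and the function name `has_headline` and the threshold `min_fraction` are hypothetical.

```python
def has_headline(word_img, min_fraction=0.7):
    """Return True if a row in the upper half of the image contains a
    horizontal ink run spanning at least min_fraction of the image width."""

    def longest_run(row):
        # Length of the longest consecutive run of ink pixels in one row.
        best = current = 0
        for pixel in row:
            current = current + 1 if pixel else 0
            best = max(best, current)
        return best

    width = len(word_img[0])
    upper_half = word_img[: max(1, len(word_img) // 2)]
    return any(longest_run(row) >= min_fraction * width for row in upper_half)


# Toy examples: a word with a top bar, and one made of separate vertical strokes.
with_headline = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
without_headline = [
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
```

A practical system would need to cope with skew, broken headlines and noise, so a fixed run-length threshold like this one should be read as a starting point only.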
2002-06-03