
Document Segmentation

When text is printed or written on a plain background, it can be extracted by simple binarization of the image (i.e., by thresholding the image into two levels). Such documents can be easily converted to electronic form using an OCR system. Often, however, text is not printed on a plain background, as in maps, official certificates, advertisements, etc. Such images cannot be fed directly to the OCR system. In these cases, the text must first be extracted from the image and then fed separately to the OCR system. The image is first split into text and non-text regions. Next, the text regions are further split into pure text regions, tables, mathematical equations, labels, etc. In multi-script documents, even the pure text regions cannot be fed directly to the OCR system and must be split into different regions depending on the script. Similarly, the non-text regions must be split into images, graphs, charts, background regions, etc.
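The simple binarization mentioned above, for the plain-background case, can be illustrated with a minimal sketch. The text does not say how the threshold is chosen; Otsu's method is used here purely as an assumed, common choice. The function names `otsu_threshold` and `binarize` are hypothetical, and the sketch assumes an 8-bit grayscale image with dark text on a light background.

```python
def otsu_threshold(pixels):
    """Return the gray level (0..255) that maximizes between-class variance.

    Assumption: Otsu's method is one common way to pick the threshold;
    the report itself does not name a method.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with gray level <= t
    sum0 = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break             # no pixels left in the light class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image, threshold=None):
    """Map a grayscale image (list of rows of ints) to two levels.

    Pixels at or below the threshold become 1 (assumed to be text),
    all others become 0 (assumed to be background).
    """
    if threshold is None:
        threshold = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= threshold else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Toy 3x3 image: a dark vertical stroke on a light background.
    page = [
        [250, 250, 250],
        [250,  20, 250],
        [250,  30, 250],
    ]
    for row in binarize(page):
        print(row)
```

A single global threshold of this kind is only adequate for the plain-background case; for maps, certificates, and similar documents, the region-level segmentation described above is needed before OCR.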


2002-06-03