When text is printed or written on a plain background, it can be extracted by simple binarization of the image (i.e., by thresholding the image into two levels). Such documents can be easily converted to electronic form using an OCR system. Often, however, text is not printed on a plain background, as in maps, official certificates, and advertisements. Such images cannot be fed directly to the OCR; the text must first be extracted from the image and fed to the OCR separately. The image is first split into text and non-text regions. Next, the text region is further split into pure text regions, tables, mathematical equations, labels, etc. In multi-script documents, even the pure text regions cannot be fed directly to the OCR and must be split into different regions depending on the script. Similarly, the non-text regions must be split into images, graphs, charts, background regions, etc.
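The plain-background case can be illustrated with a minimal sketch of global thresholding. The report does not say how the threshold is chosen, so the use of Otsu's method here is an assumption, as are the function names `otsu_threshold` and `binarize` and the convention that text is darker than the background. The image is taken to be a list of rows of 8-bit grey levels.

```python
def otsu_threshold(pixels):
    """Pick a grey level by Otsu's method (an assumed choice, not the report's)."""
    hist = [0] * 256
    for row in pixels:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with grey level <= t
    sum0 = 0    # sum of grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue        # no pixels in the dark class yet
        if w1 == 0:
            break           # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 1 (text, assumed dark) or 0 (background)."""
    return [[1 if v <= threshold else 0 for v in row] for row in pixels]


# Hypothetical 3x4 image: dark strokes (20) on a light page (220).
image = [
    [220, 220, 220, 220],
    [220,  20,  20, 220],
    [220, 220,  20, 220],
]
t = otsu_threshold(image)
mask = binarize(image, t)
```

A single global threshold of this kind is only adequate when the background is roughly uniform. On maps, certificates, or advertisements the background grey levels overlap with those of the text, which is why those images need the segmentation steps described above before OCR.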
2002-06-03