Indian languages share essentially a common alphabet, though each script uses different forms to express it. The alphabet has about a dozen vowels and about three dozen consonants, and the many ways of combining them yield a wide variety of character forms. Indian scripts have no concept of upper- and lower-case characters. Most are written from left to right; Urdu, which is written from right to left, is the exception.
Figure 1: Script samples: Roman, Devanagari, Tamil and Kannada
The three most popular languages in the Indian subcontinent are Hindi, Bangla and Urdu [15]. Hindi and Bangla together are used by about 500 million people. Hindi is written in the Devanagari script (which is also used to write Nepali, Marathi and Sindhi), while Bangla is written in the script of the same name (also used to write Assamese and Manipuri). There is a strong structural similarity between Urdu and Arabic, which is the third most popular language in the world. Hindi and Bangla are the fourth and fifth most popular languages in the world respectively.
Indian scripts differ significantly from one another. The scripts of some languages, such as Hindi, Bangla and Assamese, are dominated by horizontal and vertical linear features, while those of others, such as Telugu, Tamil and Malayalam, consist of complicated curves. Many characters in the Bangla and Devanagari scripts have a horizontal line at the upper part. Figure 1 shows the Roman script and the scripts of some of the popular Indian languages.
Different Indian scripts also have different textural properties. Analysing the texture content of a multi-script document is therefore a promising way to segment it into its different script regions.
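As a rough illustration of how one of the structural differences noted above can become a measurable cue, the sketch below counts ink pixels per row of a small binary word image and looks for a nearly solid row in the upper half, such as the horizontal line found on many Devanagari and Bangla characters. It is not the method of this report: the function names, the toy images and the 0.8 fill threshold are all assumptions made for the example.

```python
def row_profile(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def headline_row(image, min_fill=0.8):
    """Return the index of a headline-like row, or None if there is none.

    A row qualifies if it lies in the upper half of the image and at least
    `min_fill` of its pixels are ink. The 0.8 default is an illustrative
    assumption, not a tuned value.
    """
    if not image or not image[0]:
        return None
    width = len(image[0])
    profile = row_profile(image)
    upper = profile[: max(1, len(image) // 2)]
    # Pick the densest row in the upper half (first one on ties).
    best = max(range(len(upper)), key=lambda r: upper[r])
    return best if upper[best] >= min_fill * width else None


# Toy 6x8 word image with a solid bar on row 1.
with_headline = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

# Toy word image without such a bar, standing in for a curved script.
without_headline = [
    [0, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

print(headline_row(with_headline))     # 1
print(headline_row(without_headline))  # None
```

A single projection profile like this only captures one linear feature; it is meant to show the kind of signal that distinguishes scripts, not to stand in for a full texture analysis.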
2002-06-03