
Presenter Gurpreet Singh Lehal, Department of Computer Science, Punjabi University, Patiala


Presentation Transcript


  1. Gurmukhi to Shahmukhi Transliteration System Presenter Gurpreet Singh Lehal, Department of Computer Science, Punjabi University, Patiala

  2. Introduction • The Punjabi language is written in two mutually incomprehensible scripts: • Gurmukhi (ਪੰਜਾਬੀ) • Shahmukhi (پنجابی) • Nearly 2 crore people in India use the Gurmukhi script, while 6 crore people in Pakistan use the Shahmukhi script. • Most medieval Punjabi literature is in Shahmukhi, while modern literature is in Gurmukhi. • It is necessary to break this script barrier, as very few Punjabi readers can read both scripts.

  3. Issues in Shahmukhi-Gurmukhi Transliteration • Multiple mappings for some Gurmukhi characters in the Shahmukhi character set • No exact equivalent mappings in Gurmukhi for some Shahmukhi characters • Missing nukta symbols in Gurmukhi text • Difference between pronunciation and orthography • Transliteration of proper nouns

  4. Multiple Mappings for Gurmukhi Characters

  5. Multiple Mappings for Gurmukhi Characters • These less frequent, similar-sounding Shahmukhi characters have a 1.45% frequency of occurrence. • The average length of a Shahmukhi word is 3.55 characters. • Any rule-based system that does not handle these characters will have an error rate of about 1.45 × 3.55 = 5.15% at the word level due to multiple mappings.
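The word-level estimate on this slide can be checked with a few lines of Python. The sketch below reproduces the slide's linear estimate and, as a comparison that is not in the slides, a variant that assumes characters occur independently of one another. The variable names are illustrative only.

```python
# Figures quoted on the slide.
char_freq = 0.0145     # frequency of the ambiguous Shahmukhi characters
avg_word_len = 3.55    # average Shahmukhi word length, in characters

# The slide's linear estimate: expected number of ambiguous characters per word.
linear_estimate = char_freq * avg_word_len

# Assumption (not from the slides): characters occur independently, so a word
# is at risk if at least one of its characters is ambiguous.
independent_estimate = 1 - (1 - char_freq) ** avg_word_len

print(f"linear estimate:      {linear_estimate:.4f}")
print(f"independent estimate: {independent_estimate:.4f}")
```

Both estimates land close to 5%, so the linear shortcut used on the slide is a reasonable approximation at these frequencies.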

  6. Zero Mapping for Some Shahmukhi Characters • There are certain Shahmukhi characters, such as ع (ain) and ء (hamza), for which there are no exact equivalent Gurmukhi characters. Hamza can be generated using character combination rules, but there are no specific rules for ain. For example, consider the following cases: • ਔਰਤ -> عورت • ਸ਼ੁਰੂ -> شُرُوع • ਰਫ਼ੀ -> رفیع • ਅਲਵਿਦਾ -> الوِداع • In the above examples it is not possible to specify rules for generating ain in the Shahmukhi words; it varies from case to case. From the statistics, we find that ain has a 0.42% frequency of occurrence.

  7. Missing Nukta Symbols in Gurmukhi Text • Five consonants (ਸ਼ ਖ਼ ਗ਼ ਜ਼ ਫ਼) were added to the original 35 characters of the Gurmukhi script to accommodate sounds from Arabic and Persian. • Punjabi speakers no longer distinguish between ਖ and ਖ਼, ਗ and ਗ਼, or ਫ and ਫ਼, and drop the nukta symbol. • ਖ਼, ਗ਼, ਜ਼ and ਫ਼ have a combined character frequency of only 0.17% in Gurmukhi, compared to 1.35% for their counterparts in Shahmukhi. • For example, the word ਫ਼ਕੀਰ occurs 69 times in the Gurmukhi corpus, while the nukta-less spelling ਫਕੀਰ occurs 147 times. ਫਕੀਰ will be transliterated to پھکِیر in Shahmukhi, which is wrong; the correct transliteration, فقیر, is obtained only if the correct Gurmukhi spelling ਫ਼ਕੀਰ is used.

  8. Difference Between Pronunciation and Orthography • In certain cases, Gurmukhi words are written with short vowels but pronounced with long vowels. The equivalent words in Shahmukhi are written with long vowels, so a rule-based mapping system that converts the short vowels from Gurmukhi to Shahmukhi gives wrong results in such cases. • Some examples of such words are ਗੁਰੂ, ਬਿਮਾਰ and ਖ਼ੁਰਾਕ. They are pronounced as ਗੂਰੂ, ਬੀਮਾਰ and ਖ਼ੂਰਾਕ but written with short vowels, while the corresponding words in Shahmukhi are written with long vowels as گورو, بیمار and خوراک respectively.

  9. Transliteration of Proper Nouns • Proper nouns such as names of persons and places frequently have typical spellings in Shahmukhi, and it is not possible to formulate transliteration rules that generate such spellings. • For example: • ਅਬਦੁੱਲਾ -> عبداللہ • ਬੁਸ਼ਰਾ ਰਹਿਮਾਨ -> بُشریٰ رحمٰن • ਹੈਦਰਾਬਾਦ -> حیدرآباد

  10. System Architecture • Pre-Processing: Gurmukhi Word -> Spell Check (using the Gurmukhi Spell Checker) -> Word Normalization -> Normalized Gurmukhi Word • Processing: Lexicon Lookup (using the Gurmukhi-Shahmukhi Dictionary) -> Rule Based Conversion to Shahmukhi -> Transliterated Shahmukhi Word • Post-Processing: Check Shahmukhi Spellings (using the Shahmukhi Corpus) -> Spell Corrected Shahmukhi Word
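The three-stage architecture on this slide can be sketched as a simple function pipeline. This is only an illustration of the data flow: `make_pipeline`, `transliterate_sentence` and the `Stage` type are hypothetical names, and the real stages would be backed by the spell checker, dictionary and corpus listed among the lexical resources.

```python
from typing import Callable

# A stage takes one word and returns one word.
Stage = Callable[[str], str]


def make_pipeline(pre_process: Stage, process: Stage, post_process: Stage) -> Stage:
    """Chain the stages in the order shown in the architecture slide."""
    def transliterate(word: str) -> str:
        normalized = pre_process(word)    # spell check + normalization (Gurmukhi)
        shahmukhi = process(normalized)   # dictionary lookup, then mapping rules
        return post_process(shahmukhi)    # corpus-based spelling correction
    return transliterate


def transliterate_sentence(sentence: str, transliterate: Stage) -> str:
    """Apply the word-level pipeline to every whitespace-separated word."""
    return " ".join(transliterate(word) for word in sentence.split())


# Placeholder stages that only tag the word, to make the call order visible.
demo = make_pipeline(lambda w: w + "1", lambda w: w + "2", lambda w: w + "3")
print(transliterate_sentence("a b", demo))
```

Keeping each stage as a plain word-to-word function makes it easy to test the stages separately or swap one out.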

  11. Lexical Resources Used • Gurmukhi spell checker • Root words : 41,253 • Gurmukhi Normalised Spellings • Words : 1193 • Gurmukhi-Shahmukhi dictionary • Terms: 10,254 • Shahmukhi Corpus • Total Words : 97,63,294 • Unique words : 1,93,679

  12. Pre-Processing Stage • The Gurmukhi word is cleaned and prepared for transliteration by passing it through the Gurmukhi spell checker and normalizing it according to the Shahmukhi spellings and pronunciation. • ਗਜਲ -> ਗ਼ਜ਼ਲ • ਗੁਰੂਆਂ -> ਗੂਰੂਆਂ • ਖੁਸ਼ੀ -> ਖ਼ੁਸ਼ੀ -> ਖ਼ੂਸ਼ੀ • Resources Used • Gurmukhi Spell Checker • Gurmukhi Normalised Spellings
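A minimal sketch of this stage, assuming both resources can be treated as lookup tables from a word to its corrected form. The two tiny tables below hold only the examples from the slide and stand in for the real spell checker and normalised-spellings list. Code points are written as escapes (with the glyphs in comments) because Gurmukhi nukta letters have both precomposed and decomposed encodings; the sketch applies Unicode NFC normalization so either form matches.

```python
import unicodedata


def nfc(text: str) -> str:
    """Bring text to one canonical form so nukta variants compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical stand-in for the Gurmukhi spell checker (restores nuktas).
SPELL_FIXES = {
    nfc("\u0A17\u0A1C\u0A32"): nfc("\u0A17\u0A3C\u0A1C\u0A3C\u0A32"),  # ਗਜਲ -> ਗ਼ਜ਼ਲ
    nfc("\u0A16\u0A41\u0A38\u0A3C\u0A40"):
        nfc("\u0A16\u0A3C\u0A41\u0A38\u0A3C\u0A40"),                    # ਖੁਸ਼ੀ -> ਖ਼ੁਸ਼ੀ
}

# Hypothetical stand-in for the normalised-spellings list (short -> long vowels).
NORMALISED = {
    nfc("\u0A17\u0A41\u0A30\u0A42\u0A06\u0A02"):
        nfc("\u0A17\u0A42\u0A30\u0A42\u0A06\u0A02"),                    # ਗੁਰੂਆਂ -> ਗੂਰੂਆਂ
    nfc("\u0A16\u0A3C\u0A41\u0A38\u0A3C\u0A40"):
        nfc("\u0A16\u0A3C\u0A42\u0A38\u0A3C\u0A40"),                    # ਖ਼ੁਸ਼ੀ -> ਖ਼ੂਸ਼ੀ
}


def pre_process(word: str) -> str:
    """Spell-correct first, then normalize; unknown words pass through."""
    word = nfc(word)
    word = SPELL_FIXES.get(word, word)
    return NORMALISED.get(word, word)
```

The order matters: the normalization table is keyed on correctly spelled words, so the spell check has to run first, as in the ਖੁਸ਼ੀ -> ਖ਼ੁਸ਼ੀ -> ਖ਼ੂਸ਼ੀ example.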

  13. Processing Stage • The normalised Gurmukhi word is transliterated to Shahmukhi by using: • Dictionary Lookup • Mapping rules • The Gurmukhi-Shahmukhi dictionary is used for directly transliterating frequently occurring Gurmukhi words and Gurmukhi words with typical Shahmukhi spellings.
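The lookup-then-rules order can be sketched as follows. The one-entry dictionary uses the ਮੂਸਾ example that appears later in the presentation; the `process` function and its `rule_based` parameter are hypothetical names for illustration, and code points are spelled out as escapes to avoid ambiguity between look-alike letters.

```python
from typing import Callable

# Hypothetical one-entry stand-in for the 10,254-term Gurmukhi-Shahmukhi dictionary.
DICTIONARY = {
    "\u0A2E\u0A42\u0A38\u0A3E": "\u0645\u064F\u0648\u0633\u06CC\u0670",  # ਮੂਸਾ -> مُوسیٰ
}


def process(word: str, rule_based: Callable[[str], str]) -> str:
    """Prefer the dictionary entry; fall back to rule-based mapping."""
    hit = DICTIONARY.get(word)
    return hit if hit is not None else rule_based(word)


# A placeholder rule engine that only tags the word, to show the fallback path.
print(process("\u0A2E\u0A42\u0A38\u0A3E", lambda w: "RULES:" + w))
print(process("\u0A15", lambda w: "RULES:" + w))
```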

  14. Processing Stage • Mapping Rules • Gurmukhi letters are directly mapped to similar-sounding Shahmukhi characters. For example: • ਆ -> آ, ਇ -> اِ, ਈ -> ای, ਉ -> اُ, ਊ -> اُو, ਏ -> اے • ਕ -> ک, ਖ -> کھ, ਗ -> گ, ਘ -> گھ • In case of multiple equivalent Shahmukhi characters, the most frequently occurring Shahmukhi character is selected. • Thus ਹ is mapped to ہ and ਜ਼ is mapped to ز. • For example, the word ਕਮਰੇ is converted to Shahmukhi as follows: • ਕ + ਮ + ਰ + ੇ -> ک + م + ر + ے = کمرے • If two vowels come together in Gurmukhi, the character hamza is placed between them in Shahmukhi. • ਕੋ + ਈ -> کو + ئی • ਆ + ਓ -> آ + ئو
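The direct mapping and the hamza rule can be sketched as a single left-to-right pass. The table below covers only the characters needed for the slide's examples, and the way the hamza rule is encoded (a separate table of "after a vowel" forms) is an assumption made for this sketch rather than the system's actual rule format. Code points are written as escapes, with glyphs in comments, to keep the mixed-direction text unambiguous.

```python
# Direct character mappings taken from the slide's examples.
CHAR_MAP = {
    "\u0A15": "\u06A9",        # ਕ -> ک
    "\u0A2E": "\u0645",        # ਮ -> م
    "\u0A30": "\u0631",        # ਰ -> ر
    "\u0A47": "\u06D2",        # ੇ -> ے
    "\u0A4B": "\u0648",        # ੋ -> و (as it appears inside کو)
    "\u0A06": "\u0622",        # ਆ -> آ
    "\u0A08": "\u0627\u06CC",  # ਈ -> ای
}

# Forms used when an independent vowel directly follows another vowel.
HAMZA_FORMS = {
    "\u0A08": "\u0626\u06CC",  # ਈ -> ئی
    "\u0A13": "\u0626\u0648",  # ਓ -> ئو
}


def is_vowel(ch: str) -> bool:
    """True for Gurmukhi independent vowels and dependent vowel signs."""
    return "\u0A05" <= ch <= "\u0A14" or "\u0A3E" <= ch <= "\u0A4C"


def map_word(word: str) -> str:
    out = []
    prev = ""
    for ch in word:
        if ch in HAMZA_FORMS and prev and is_vowel(prev):
            out.append(HAMZA_FORMS[ch])       # two vowels meet: insert hamza
        else:
            out.append(CHAR_MAP.get(ch, ch))  # unmapped characters pass through
        prev = ch
    return "".join(out)
```

With these tables, `map_word` turns ਕਮਰੇ into کمرے, ਕੋਈ into کوئی and ਆਓ into آئو, matching the three examples on the slide.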

  15. Processing Stage • Mapping Rules • Besides these simple mapping rules, some special pronunciation-based rules have also been developed. Some of these rules are: • ਇ + ਆ -> یا (ਲਾਇਆ -> لایا) • ਿ + ਓ -> یو (ਵਾਲਿਓ -> والیو) • ੰ + ਪ -> مپ (ਪੰਪ -> پمپ) • ੰ + ਨ -> نّ (ਸੁੰਨ -> سُنّ)
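One way to sketch these context-sensitive rules is to try a two-character rule at each position before falling back to the single-character mapping. The pair table holds the four rules from the slide; the single-character table is limited to the letters that occur in the slide's examples. The function name and table layout are assumptions made for illustration.

```python
# Two-character rules from the slide (Gurmukhi pair -> Shahmukhi string).
PAIR_RULES = {
    "\u0A07\u0A06": "\u06CC\u0627",  # ਇ + ਆ -> یا
    "\u0A3F\u0A13": "\u06CC\u0648",  # ਿ + ਓ -> یو
    "\u0A70\u0A2A": "\u0645\u067E",  # ੰ + ਪ -> مپ
    "\u0A70\u0A28": "\u0646\u0651",  # ੰ + ਨ -> نّ
}

# Single-character mappings needed for the example words.
SINGLE = {
    "\u0A2A": "\u067E",  # ਪ -> پ
    "\u0A38": "\u0633",  # ਸ -> س
    "\u0A41": "\u064F",  # ੁ -> ُ
    "\u0A32": "\u0644",  # ਲ -> ل
    "\u0A3E": "\u0627",  # ਾ -> ا
    "\u0A35": "\u0648",  # ਵ -> و
}


def apply_rules(word: str) -> str:
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in PAIR_RULES:        # a special rule takes priority
            out.append(PAIR_RULES[pair])
            i += 2
        else:                         # otherwise map one character
            out.append(SINGLE.get(word[i], word[i]))
            i += 1
    return "".join(out)
```

Checking pairs first is what lets ੰ become م before ਪ but ن with a shadda before ਨ, which a character-by-character table cannot express.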

  16. Post-Processing Stage • The spellings of the Shahmukhi word generated in the processing stage are checked and corrected. The major source of spelling errors in the transliterated Shahmukhi words is the multiple character mappings in Shahmukhi. • For example, the words ਸਲਾਹ, ਮਜ਼ਬੂਤ and ਮਤਲਬ will be transliterated as سلاہ, مزبوت and متلب, while the actual spellings are صلاح, مضبوط and مطلب respectively. • Resources Used • Shahmukhi word frequency list

  17. Post-Processing Stage • For Gurmukhi characters with multiple Shahmukhi mappings, word forms using all possible mappings are generated, and the form with the highest frequency of occurrence in the Shahmukhi word frequency list is selected. • For example, consider the word ਤਾਕਤ. Both ਤ and ਕ have multiple Shahmukhi mappings. The Shahmukhi word generated in the processing stage is تاکت. From this word, all its forms are generated. • A search for each of these forms in the Shahmukhi corpus reveals that the word طاقت has 2045 occurrences, while none of the other forms occurs even once. Thus the word ਤਾਕਤ is transliterated to طاقت.

  18. Candidate forms of ਤਾਕਤ and their frequencies in the Shahmukhi corpus (ਤ -> ت or ط, ਾ -> ا, ਕ -> ک or ق) • تاکت: 0 • تاکط: 0 • تاقت: 0 • تاقط: 0 • طاکت: 0 • طاکط: 0 • طاقت: 2045 • طاقط: 0
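The candidate search for ਤਾਕਤ can be sketched with a Cartesian product over the alternatives for each character. The alternatives table and the one-entry frequency list below contain only what this example needs; a real run would use the full frequency list built from the Shahmukhi corpus. The function names are hypothetical, and code points are given as escapes with the glyphs in comments.

```python
from itertools import product

# Shahmukhi characters that share a Gurmukhi source letter (example subset).
ALTERNATIVES = {
    "\u062A": ["\u062A", "\u0637"],  # ت or ط (both from ਤ)
    "\u06A9": ["\u06A9", "\u0642"],  # ک or ق (both from ਕ)
}

# Stand-in for the Shahmukhi word frequency list.
FREQ = {"\u0637\u0627\u0642\u062A": 2045}  # طاقت


def candidates(word: str) -> list[str]:
    """All spellings obtained by swapping in every alternative character."""
    options = [ALTERNATIVES.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*options)]


def best_spelling(word: str, freq: dict[str, int]) -> str:
    """Pick the most frequent candidate; ties keep the earliest candidate."""
    return max(candidates(word), key=lambda w: freq.get(w, 0))


rule_output = "\u062A\u0627\u06A9\u062A"  # تاکت, the output of the processing stage
print(len(candidates(rule_output)))
print(best_spelling(rule_output, FREQ))
```

Because each alternatives list starts with the rule-based character, the first candidate is always the unmodified word, so a word with no corpus hits is returned unchanged. Note that the number of candidates grows exponentially with the number of ambiguous characters in a word.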

  19. An Example • To illustrate how a Gurmukhi word is transformed in each of the three stages, we take the following sample sentence in Gurmukhi: • ਪੁਲਿਸ ਨੂੰ ਮੂਸਾ ਖਾਨ ਬਿਮਾਰ ਸਿਹਤ ਅਤੇ ਜਖਮੀ ਹਾਲਤ ਵਿਚ ਮਿਲਿਆ • After Gurmukhi spell checking in the pre-processing stage, the sentence becomes: • ਪੁਲਿਸ ਨੂੰ ਮੂਸਾ ਖ਼ਾਨ ਬਿਮਾਰ ਸਿਹਤ ਅਤੇ ਜ਼ਖ਼ਮੀ ਹਾਲਤ ਵਿਚ ਮਿਲਿਆ • After normalization, the text becomes: • ਪੂਲੀਸ ਨੂੰ ਮੂਸਾ ਖ਼ਾਨ ਬੀਮਾਰ ਸਿਹਤ ਅਤੇ ਜ਼ਖ਼ਮੀ ਹਾਲਤ ਵਿਚ ਮਿਲਿਆ

  20. An Example • Since the word ਮੂਸਾ has typical spellings, the output after dictionary lookup in the processing stage is: • ਪੂਲੀਸ ਨੂੰ مُوسیٰ ਖ਼ਾਨ ਬੀਮਾਰ ਸਿਹਤ ਅਤੇ ਜ਼ਖ਼ਮੀ ਹਾਲਤ ਵਿਚ ਮਿਲਿਆ • The output from the processing stage after rule-based transliteration is: • پولیس نوں مُوسیٰ خان بیمار سِہت اتے زخمی ہالت وِچ مِلیا • The final output after running the spell checker in the post-processing stage is: • پولیس نوں مُوسیٰ خان بیمار صِحت اَتے زخمی حالت وِچ مِلیا • The output we got from another existing system is:

  21. Failure Cases The system fails in the following cases: • Multiple spellings for Gurmukhi words • For example, ਕਤਰਾ can be mapped to both قطرہ and کترا; similarly, the word ਅਰਬ can be mapped to both عرب and ارب. The correct spelling can be selected only after context analysis. • Words with typical spellings that are not present in the Gurmukhi-Shahmukhi dictionary or the Shahmukhi corpus

  22. Experimental Results • We have tested our system on 121 pages of text compiled from newspapers, books and poetry. The results are compared with Puran and Utrans, the two transliteration systems available on the net. • Transliteration Accuracy • We observed that the main sources of improvement in transliteration accuracy over the existing systems have been: • The pre-processing stage, in which wrong Gurmukhi spellings are corrected and the spellings of some words are modified according to their pronunciation. • Development of transliteration rules for special cases • Usage of the Gurmukhi-Shahmukhi dictionary • Correction of Shahmukhi spellings with the help of the Shahmukhi corpus.

  23. THANK YOU
