
Work at TACOLA Lab


Presentation Transcript


  1. Work at TACOLA Lab
     Team Members: T.V.Geetha, Ranjani Parthasarathi, Madhan Karky, E.UmaMaheswari, J.Balaji, Subalalitha, Elanchezhiyan.K, Karthika, Thenmalar, Radhakrishnan, Kandasamy, Padmavathi, Aruna, Vijayavani

  2. Tamil Language Processing
     • Tamil Language Processing
       • Morphological analyser: Normal Words, Compound Words, Colloquial Words
       • Parser: Simple, Complex and Compound Sentences
       • Semantic analysis based on UNL
     • Language Technology
       • Blog Mining
       • Ontology Based Information Extraction
       • Personalized Search
       • Parallelization for NLP Processing
       • Emotion detection from text
     • Carnatic Music Processing
       • Raga Modelling
       • Singer, Genre Identification
       • Music Emotion Recognition
     • Tamil Language Oriented Tools
       • Dictionary
       • Text Compaction
     • UNL Based Work
       • UNL for semantic representation
       • Nested UNL
       • Concept based Search
       • Bi-lingual Search
       • Event Processing
       • Discourse Analysis
       • Summarization
       • Question answering
       • Thirukural Search
     • Lyric Oriented Processing
       • Lyric Mining
       • Lyrics for Tunes
       • Pleasantness
     Dr.T.V.Geetha, Anna University

  3. Papers for TIC 2011
     • Tamil Language Oriented Tools
       • Agaraadhi: A Novel Online Dictionary Framework
       • An Efficient Tamil Text Compaction System (Surukkupai)
       • Kuralagam, A Concept Relation Based Search Framework for Thirukural
       • Popularity Based Scoring Model for Tamil Word Games
     • Tamil Language Processing
       • Template based Multilingual Summary Generation
       • On Emotion detection from Tamil Text
       • Tamil Summary Generation for Cricket Match
     • Lyric Oriented Processing
       • Lyric Mining: Word, Rhyme & Concept Co-occurrence Analysis
       • Special Indices for LaaLaLaa Lyric Analysis & Generation Framework

  4. AGARAADHI: A NOVEL ONLINE DICTIONARY FRAMEWORK
     Elanchezhiyan.K, Karthikeyan.S, T.V.Geetha, Ranjani Parthasarathi, Madhan Karky

  5. OBJECTIVES
     • Agaraadhi is a dictionary framework for indexing and retrieving Tamil words, their meanings, analysis and related information.
     • The framework incorporates various unique features designed to give the user additional information about the word they query.

  6. INTRODUCTION
     • The Agaraadhi dictionary has more than 3 lakh (300,000) words in various domains such as:
       • General
       • Literature
       • Medical
       • Engineering
       • Computer Science
       • Bird Names, and more
     • Agaraadhi is a Tamil-English bilingual dictionary.

  7. INTRODUCTION CONT…
     • Agaraadhi is a Tamil-English bilingual dictionary with 20 features, such as:
       • morphological analysis
       • morphological generation
       • word usage statistics
       • word pleasantness analysis
       • spell checking
       • similar word finder
       • word usage in literature
       • picture dictionary
       • number to text conversion
       • phonetic transliteration
       • live usage analysis from micro blogs, and more

  8. AGARAADHI FRAMEWORK CONT…

  9. AGARAADHI FEATURES
     • Morphological Analyser
       • Gives the morphological features of the query word, such as root word, part of speech, gender, tense and count.
       • For the query word padithaan, the Morphological Analyser gives padi as the root and reports that the word is masculine, in the past tense, and so on.
     • Morphological Generator
       • The Tamil morphological generator handles different syntactic categories such as nouns, verbs, postpositions, adjectives and adverbs.
       • It generates the possible morphological variations of the query word.
     • Spell Checker
       • Checks the spelling of Tamil words and provides alternative suggestions for wrongly spelt words.
       • If the root word is not in the dictionary, it generates all possible suggestions with minimum variations from the given word.
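
One way to read "suggestions with minimum variations" is as a nearest-neighbour lookup by edit distance. The sketch below assumes Levenshtein distance and a tiny transliterated word list; the slides do not say which measure or lexicon Agaraadhi uses, so `LEXICON`, `max_distance` and `limit` are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, lexicon, max_distance=2, limit=5):
    """Return lexicon entries within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, entry), entry) for entry in lexicon)
    return [entry for distance, entry in scored if distance <= max_distance][:limit]


# Hypothetical transliterated lexicon, for illustration only.
LEXICON = ["padi", "padam", "paadal", "mazhai", "pookkal"]

print(suggest("pado", LEXICON))
```

For real Tamil script the comparison would probably need to work on grapheme clusters rather than code points, since one written letter can span several Unicode characters.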

  10. AGARAADHI FEATURES
     • Word Suggestions
       • Gives the list of equivalent or related words for the given query word.
     • Word Pleasantness
       • The pleasantness score generator indicates how easy the word is to pronounce.
     • Word Popularity Score
       • Shows the usage of the word on the web, based on the frequency distribution of the word across popular blogs, news articles, social networks, etc.
     • Word Usage Statistics
       • Shows the usage of the word in the social network over the past week.
     • Word Usage in Literature
       • Finds the usage of words in popular literature such as Thirukural, Bharathiyar Padalgal and Avvai songs, and also in the lyrics of Tamil movie songs.

  11. AGARAADHI FEATURES
     • Word of the Day
       • A rare word is chosen at random and displayed on the opening page so that users can learn a new word every day.
     • Number to Text Converter
       • Converts a number to its Tamil word equivalent as well as to English text. For example, in Tamil we represent 100 million as oru Arpputham (அற்புதம்), 10 billion as Kumbam (கும்பம்), and finally up to Anniyan (அந்நியம்) for one zilli
     • Picture Dictionary
       • Pictures, photos or line drawings depicting popular words are included in the dictionary so that children can learn efficiently with this tool.

  12. RESULTS
     • Query word: pookkal (பூக்கள்)
       http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AA%E0%AF%82%E0%AE%95%E0%AF%8D%E0%AE%95%E0%AE%B3%E0%AF%8D+&ln=ta&Submit.x=8&Submit.y=7
     • Query word: mazhai (மழை)
       http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AE%E0%AE%B4%E0%AF%88+&ln=ta&Submit.x=21&Submit.y=4
     • Query word: fruit
       http://www.agaraadhi.com/dict/OD.jsp?w=fruit&ln=en

  13. FUTURE WORK
     Providing APIs for programmers and developing mobile apps for the Agaraadhi framework will open a good platform for researchers and developers working in the Tamil computing area.

  14. REFERENCE
     • Anandan, R. Parthasarathi, and Geetha. Morphological Analyser for Tamil. ICON 2002, 2002.
     • Anandan, R. Parthasarathi, and Geetha. Morphological Generator for Tamil. Tamil Inayam, Malaysia, 2001.
     • J. Jai Hari Raju, P. IndhuReka, Dr. Madhan Karky. Statistical Analysis and Visualization of Tamil Usage in Live Text Streams. Tamil Internet Conference, Coimbatore, 2010.

  15. An Efficient Tamil Text Compaction System
     N.M.Revathi, G.P.Shanthi, Elanchezhiyan.K, T V Geetha, Ranjani Parthasarathi, Madhan Karky

  16. OBJECTIVES
     • Why compact text?
       • Message length is limited on blog sites, and mobile phones have tiny user interfaces.
       • Compaction saves online storage space and hence reduces cost.
     • The paper proposes
       • a text compaction system for Tamil, the first of its kind.
     • Idea of compaction
       • There is no specific rule for obtaining the shortest form of a word; the main aim is that it remains understandable.
       • A compact form can be obtained by omitting letters and by replacing prefixes and suffixes with suitable symbols and numbers.

  17. FRAMEWORK ARCHITECTURE

  18. FRAMEWORK CONT..
     • Input Processing
       • The morphological analyzer removes the suffix (if present) added to the word and delivers the root word (RW).

  19. FRAMEWORK CONT..
     • Identification of the category and extraction of the compact word
       • There are three categories of words: common Tamil words, abbreviations/acronyms, and numbers.
       • Abbreviations/acronyms are identified by comparing the word with the keys of the hashmap.
       • With the help of the hash key and a mapping algorithm, the compact word is retrieved.
       • Otherwise the word is either a common Tamil word or a number.
       • If it is a number, the Numerical Analyser performs text-to-number conversion.
     • Output Processing
       • The Tamil Morphological Generator adds the suitable suffix to satisfy the rules of the language.
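
The three stages on the framework slides (strip the suffix, look up the compact form by category, re-attach the suffix) can be sketched roughly as follows. This is only an illustration of the control flow: the lookup tables use hypothetical English stand-ins rather than the system's Tamil data, and `analyse` and `generate` are trivial placeholders for the real morphological analyser and generator.

```python
# All tables are hypothetical English stand-ins, not data from the paper.
ABBREVIATIONS = {"doctor": "dr", "university": "univ"}
NUMBER_WORDS = {"two": "2", "ten": "10"}
COMMON_WORDS = {"tomorrow": "2moro", "please": "pls"}
KNOWN_ROOTS = set(ABBREVIATIONS) | set(NUMBER_WORDS) | set(COMMON_WORDS)
SUFFIXES = ("s",)


def analyse(word):
    """Placeholder morphological analyser: split a word into (root, suffix)."""
    for suffix in SUFFIXES:
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in KNOWN_ROOTS:
            return root, suffix
    return word, ""


def compact_root(root):
    """Pick the compact form by category: abbreviation, number, common word."""
    if root in ABBREVIATIONS:            # hashmap key match
        return ABBREVIATIONS[root]
    if root in NUMBER_WORDS:             # text-to-number conversion
        return NUMBER_WORDS[root]
    return COMMON_WORDS.get(root, root)  # unknown words pass through unchanged


def generate(root, suffix):
    """Placeholder morphological generator: plain concatenation."""
    return root + suffix


def compact_text(text):
    parts = (analyse(word) for word in text.split())
    return " ".join(generate(compact_root(root), suffix) for root, suffix in parts)


print(compact_text("doctors meet tomorrow at ten"))
```

Keeping the suffix outside the lookup means each table only has to store root words, which appears to be the reason the framework places the analyser and generator around the mapping step.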

  20. RESULT AND ANALYSIS
     • Tested with over 10,000 words.
     • The final result is reduced to 40% of the original text.

  21. REFERENCES
     • Anandan, R. Parthasarathi, and Geetha. Morphological Analyser for Tamil. ICON 2002, 2002.
     • Fung, L. M. (2005). SMS short form identification and codec. Unpublished master's thesis, National University of Singapore, Singapore.
     • Acrophile (L S Larkey, P Ogilvie, M A Price, B Tamilio, 2000), a system that automatically searches acronym expansion pairs.
     • Robert E. Beasley, Franklin College. Short Message Service (SMS) Texting Symbols: A Functional Analysis of 10,000 Cellular Phone Text Messages.

  22. Kuralagam - Concept Relation based Search Engine for Thirukkural
     Elanchezhiyan.K, T.V.Geetha, Ranjani Parthasarathi, Madhan Karky

  23. Objectives
     • Kuralagam is a conceptual search framework for Thirukkural, based on the UNL framework.
     • Keyword search in kurals and interpretations
     • Concept based search using CoReX, a conceptual indexing technique based on UNL
     • Bilingual search in English and Tamil
     • Showing relationships between the concepts

  24. Kuralagam Framework

  25. Offline Processing
     • Web Crawler
       • A Thirukkural statistics crawler
       • Crawls news and blog documents to find the usage of each individual Thirukkural.
       • The usage is recorded to compute the popularity score of each Thirukkural.
     • Enconversion: based on UNL
     • Indexing: based on the CoReX framework

  26. UNL & Enconversion
     • UNL is an intermediate language that processes knowledge across language barriers.
     • It captures semantics by converting the natural language terms present in the document to concepts.
     • Concepts are connected to other concepts through UNL relations. There are 46 UNL relations, such as plf (Place From), plt (Place To), tmf (Time From) and tmt (Time To).
     • The process of converting a natural language text to a UNL graph is known as Enconversion; the reverse process is known as Deconversion.

  27. An Example speaks more...
     • Ex: John was playing in the garden
     • UNL graph: john(iof>person) <-agt- play(icl>action) -plc-> garden(icl>place)
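
As a rough illustration of how such a graph might be held in memory, the example sentence can be stored as labelled edges between concepts. The `Relation` type and the helper functions are assumptions made for this sketch, not part of any UNL toolkit; only the three concepts and the `agt` and `plc` labels come from the slide.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One labelled edge of a UNL graph (hypothetical representation)."""
    label: str   # UNL relation, e.g. agt (agent) or plc (place)
    source: str  # concept the relation starts from
    target: str  # concept the relation points to


# "John was playing in the garden"
GRAPH = [
    Relation("agt", "play(icl>action)", "john(iof>person)"),
    Relation("plc", "play(icl>action)", "garden(icl>place)"),
]


def targets(graph, source, label):
    """Concepts reached from source through a relation with this label."""
    return [r.target for r in graph if r.source == source and r.label == label]


def concepts(graph):
    """All distinct concepts in the graph, in sorted order."""
    return sorted({c for r in graph for c in (r.source, r.target)})


print(targets(GRAPH, "play(icl>action)", "agt"))
print(concepts(GRAPH))
```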

  28. Indexer
     • The Kuralagam Indexer is designed based on CoReX techniques.
     • The Indexer stores and manages the UNL graphs in two different indices:
       • Concept only index (C index), and
       • Concept-Relation-Concept index (CRC index)
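
The two indices can be pictured as inverted indices keyed on a single concept and on a concept-relation-concept triple. The layout below is a minimal sketch under that assumption; the slides do not describe CoReX's actual storage format, and the document IDs and triples are made up for illustration.

```python
from collections import defaultdict


def build_indices(graphs):
    """Build a C index and a CRC index from {doc_id: [(source, relation, target)]}."""
    c_index = defaultdict(set)    # concept -> documents containing it
    crc_index = defaultdict(set)  # (concept, relation, concept) -> documents
    for doc_id, triples in graphs.items():
        for source, relation, target in triples:
            c_index[source].add(doc_id)
            c_index[target].add(doc_id)
            crc_index[(source, relation, target)].add(doc_id)
    return c_index, crc_index


# Hypothetical enconverted documents, keyed by document number.
GRAPHS = {
    1: [("play", "agt", "john"), ("play", "plc", "garden")],
    2: [("play", "plc", "school")],
}

c_index, crc_index = build_indices(GRAPHS)
print(sorted(c_index["play"]))
print(sorted(crc_index[("play", "plc", "garden")]))
```

A CRC match is stricter than a C match because both concepts and the relation between them must agree, which fits the slides' later choice to rank CRC hits above C hits.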

  29. Online Processing
     • Query Translation and Expansion
       • Converts the user query to a UNL graph.
       • Uses the CRC (Concept Relation Concept) CoReX indices to fetch the similarity thesaurus and co-occurrence list that populate the Multi list Data Structure.
     • Search and Ranking
       • Fetches the Thirukkural number and its details.
       • Thirukkurals for a given query are fetched using the two types of concept relation indices, namely CRC and C.
       • The query concept is expanded using related CRC indices pointing to the query concept.
       • This helps in retrieving many Thirukkurals conceptually related to the query, which is not possible with keyword Thirukkural search engines.
       • The ranking is based on
         • priority to the indices in the order CRC > C
         • usage score
         • frequency of occurrence of the query concept
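
The three ranking criteria could be combined in several ways; the slides do not say how. The sketch below assumes the simplest option, a lexicographic sort in the order listed (index type first, then usage score, then concept frequency). The `Hit` record and all values are hypothetical.

```python
from dataclasses import dataclass

# Higher value means higher priority: CRC matches outrank C matches.
INDEX_PRIORITY = {"CRC": 2, "C": 1}


@dataclass(frozen=True)
class Hit:
    """One retrieved Thirukkural (hypothetical record layout)."""
    kural_number: int
    index: str          # "CRC" or "C": which index produced the hit
    usage_score: float  # popularity from the offline crawler
    frequency: int      # occurrences of the query concept


def rank(hits):
    """Sort by index priority, then usage score, then frequency, best first."""
    return sorted(
        hits,
        key=lambda h: (INDEX_PRIORITY[h.index], h.usage_score, h.frequency),
        reverse=True,
    )


hits = [
    Hit(10, "C", 0.9, 3),
    Hit(20, "CRC", 0.2, 1),
    Hit(30, "CRC", 0.2, 4),
    Hit(40, "C", 0.1, 5),
]
print([h.kural_number for h in rank(hits)])
```

With this ordering a CRC hit always precedes a C hit regardless of usage score; a weighted sum would be an alternative if the criteria should trade off against each other.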

  30. Tab Layout

  31. Performance Evaluation
     • The accuracy of the Thirukkural search engine was measured using average precision and mean average precision.
     • Concept based search and keyword based search were compared using the Average Precision methodology.
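
For reference, the standard definitions of these two measures can be written in a few lines. This sketch uses the usual textbook formulation (precision averaged over the ranks of the relevant results, then averaged over queries); the slides do not state which variant the authors used, and the document IDs are invented.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    if not relevant:
        return 0.0
    found = 0
    total = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            found += 1
            total += found / k  # precision at rank k
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked results, relevant set) pairs."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical results for two queries.
runs = [
    ([1, 2, 3, 4], {1, 3}),  # relevant at ranks 1 and 3
    ([5, 6], {6}),           # relevant at rank 2
]
print(mean_average_precision(runs))
```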

  32. Average Precision

  33. Reference
     1. Subalalitha, T V Geetha, Ranjani Parthasarathy and Madhan Karky Vairamuthu. CoReX: A Concept Based Semantic Indexing Technique. In SWM-08, 2008, India.
     2. The Universal Networking Language (UNL) Specifications, Version 3, Edition 3. UNL Center, UNDL Foundation, December 2004.
     3. Anandan, R. Parthasarathi, and Geetha. Morphological Analyser for Tamil. ICON 2002, 2002.
     4. T.Dhanabalan, K.Saravanan, and T.V.Geetha. Tamil to UNL Enconverter. ICUKL, Goa, India, 2002.
     5. Turpin, A. and F. Scholer. User performance versus precision measures for simple search tasks. In 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, Seattle, Washington, USA.

  34. Template Based MultiLingual Summary Generation
     Subalalitha C.N, E.Umamaheswari, T V Geetha, Ranjani Parthasarathi, Madhan Karky

  35. Aim
     To generate a multilingual summary based on the Universal Networking Language (UNL) framework.

  36. The Architecture

  37. Multi Lingual Summary Generation using UNL
     • Template based Information Extraction
       • Seven tourism-specific templates have been designed and used.
       • Templates are filled using the semantic information inherent in the UNL input graphs.
       • Template information is language independent and can be used with any desired language.

  38. Example Templates for Tourism Domain

  39. Summary Generation
     • The template information is converted to the target language using the respective UNL-target language dictionaries.
     • UNL-target language dictionaries contain root words.
     • The natural language term is obtained from the root word using target language information such as case suffixes and language technology tools such as a morphological generator, e.g. சென்னை + இல் = சென்னையில்.
     • When this converted template information is fitted into target-language-specific dynamic sentence patterns, a summary is generated.
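
The generation steps might be sketched as: look up each template slot in a per-language dictionary, inflect the root where the sentence pattern calls for a case suffix, then fill the pattern. Everything named here is a hypothetical stand-in: the slot names, the dictionary entries, the two sentence patterns, and `generate`, which only knows the single inflected form shown on the slide instead of calling a real morphological generator.

```python
# Hypothetical UNL-to-target-language dictionary holding root words only.
UNL_DICTIONARY = {
    "chennai": {"en": "Chennai", "ta": "சென்னை"},
    "beach": {"en": "beach", "ta": "கடற்கரை"},
}

# Placeholder for a morphological generator: (root, suffix) -> surface form.
# The only entry is the example given on the slide.
GENERATED_FORMS = {("சென்னை", "இல்"): "சென்னையில்"}

# Hypothetical sentence patterns, with the case suffix each slot needs.
PATTERNS = {
    "en": ("There is a {attraction} in {place}.", {}),
    "ta": ("{place} {attraction} உள்ளது.", {"place": "இல்"}),
}


def generate(root, suffix):
    """Attach a case suffix, falling back to plain concatenation."""
    if not suffix:
        return root
    return GENERATED_FORMS.get((root, suffix), root + suffix)


def summarise(template, language):
    """Fill a language-independent template into a target-language pattern."""
    pattern, suffixes = PATTERNS[language]
    words = {
        slot: generate(UNL_DICTIONARY[concept][language], suffixes.get(slot, ""))
        for slot, concept in template.items()
    }
    return pattern.format(**words)


template = {"place": "chennai", "attraction": "beach"}
print(summarise(template, "en"))
print(summarise(template, "ta"))
```

Because the template stores concepts rather than words, the same filled template yields both sentences; only the dictionary column and the pattern change per language.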

  40. Performance Evaluation
     • Tested with 33,000 Tamil and English text documents enconverted to UNL graphs.
     • The performance of the proposed methodology has been evaluated using human judgement.
     • The generated summaries achieved 90% accuracy.
     • Further Enhancements
       • Query specific summary
       • Comparing the performance with human generated summaries

  41. References
     [1] Elanchezhiyan K, T V Geetha, Ranjani Parthasarathi & Madhan Karky. CoRe: Concept Based Query Expansion. Tamil Internet Conference, Coimbatore, 2010.
     [2] Alkesh Patel, Tanveer Siddiqui, U. S. Tiwary. "A language independent approach to multilingual text summarization". Conference RIAO2007, Pittsburgh PA, U.S.A., May 30-June 1, 2007.
     [3] David Kirk Evans. "Identifying Similarity in Text: Multi-Lingual Analysis for Summarization". Doctor of Philosophy thesis, Graduate School of Arts and Sciences, Columbia University, 2005.
     [4] Radev, Allison, Blair-Goldensohn et al. (2004). MEAD: a platform for multidocument multilingual text summarization.
     [5] The Universal Networking Language (UNL) Specifications, Version 3, Edition 3. UNL Center, UNDL Foundation, December 2004.
     [6] Jagadeesh J, Prasad Pingali, Vasudeva Varma. "Sentence Extraction Based Single Document Summarization". Workshop on Document Summarization, March 2005, IIIT Allahabad.
     [7] Naresh Kumar Nagwani, Dr. Shrish Verma. "A Frequent Term and Semantic Similarity based Single Document Text Summarization Algorithm". International Journal of Computer Applications (0975-8887), Volume 17, No. 2, March 2011.
     [8] Prof. R. Nedunchelian. "Centroid Based Summarization of Multiple Documents Implemented using Timestamps". First International Conference on Emerging Trends in Engineering and Technology, IEEE, 2008.
