
Issues in Multilingual Thesauri



Presentation Transcript


  1. Issues in Multilingual Thesauri

  2. Managing Content • Managing Content relevant and related to an organization • Documentary Resources • Internally generated reports and other resources • Web Resources • CMS combine a variety of tools & technologies

  3. Managing Content • Involves • Capturing • Storing • Managing • Preserving; and • Delivering Information

  4. Managing Content • Document management • Collaboration • Web content management • Records management (long-term storage) • Need for vocabulary management: • Consistency in content representation • By creators (authors) • By indexers • By searchers • Thesauri are important tools for this purpose

  5. LINGUISTIC DIVERSITY IN GLOBAL INFORMATION NETWORKS AND UNIVERSAL ACCESS TO INFORMATION IN CYBERSPACE ARE AT THE CORE OF CONTEMPORARY DEBATES AND CAN BE A DETERMINING FACTOR IN THE DEVELOPMENT OF A KNOWLEDGE-BASED SOCIETY • UNESCO

  6. “… multilingual tools are getting importance as increasingly diverse groups from different cultural and linguistic backgrounds seek access to equally diverse pieces of information…” • Jorna & Davies • J.Doc.

  7. Multilingual Thesauri • Multilingual Thesauri support, among other things: • Cross-walk between KO tools • Cross-cultural communication (including comparative studies) • Navigation between semantically related concepts (Terms) • Semantic navigation between concepts in a domain and related knowledge resources (bibliographical metadata, etc)

  8. Multilingual Thesauri [Contd.] • Intelligent query expansion • Linguistic Research Future • Improved natural language processing • Language recognition • Improved parsing • Concept resolution • Inferencing / Reasoning - Ontology

  9. Background • Early DRTC interest in Thesaurus Building • F-Thes • OM Information System • The Present Project • Digital Library of Tamil Classics Characteristics: • More than one language • Culture-Specific Domains

  10. F-THES • Subject Coverage: Religious Mysticism • Time / Period: No period restriction • Structure & Presentation: Structure defined to generate independent language thesauri, if required; context-specifying elements used only occasionally • TAMTH • Subject Coverage: Entire universe of subjects • Time / Period: Sangam Period • Structure & Presentation: Structure based on Tamil terms as the base / source (descriptor) with corresponding terms in English; context-specifying elements used for every descriptor

  11. Background [Contd.] • The objective: • To employ the new thesaurus for vocabulary management: • In Indexing • User Interfaces • formulating search expressions and search strategies • Facilitating navigation between related terms (Narrower, Broader and other Related terms) • Value addition via links to relevant lexical tools

  12. Issues • Humanities vis-à-vis Sciences • The Approach • Candidate Concepts & Terms • Issues Related to Script & Transliteration • Semantic Issues • Structural Issues • Management Issues • Handling NTs, BTs, RTs

  13. Issues • Focus on: • Vocabulary management in bilingual and multilingual thesauri in culture-specific domains; • Special aspects of the Tamil language in this regard; • Alternative ways of linking descriptors to lengthy lists of NTs and RTs; • Advantages of integrated use of two or more knowledge organization tools • Many of the issues discussed here are unique to Thesauri in the domains of Humanities

  14. Issues • Humanities vis-à-vis Sciences • The Approach • Candidate Concepts & Terms • Issues Related to Script & Transliteration • Semantic Issues • Structural Issues • Management Issues • Handling NTs, BTs, RTs

  15. The Approach • Combining existing thesauri • Merging two or more existing thesauri • Linking existing thesauri to each other • Translating an existing thesaurus into one or more other languages • Building a new thesaurus ‘bottom up’ • Starting with one language and adding another language or languages • Starting with more than one language simultaneously

  16. The Approach • The candidate terms: • The corpus: both print-on-paper and electronic sources, e.g.:
      1) Cologne online Tamil lexicon. [Based on Tamil Lexicon and supplement, 1924-1939]. http://webapps.uni-koeln.de/tamil/ (COTL)
      2) Commemorative bibliography of the first 1008 books published by the South India Saiva Siddhanta Works Publishing Society / By S.R. Ranganathan and R. Muthukumaraswamy. Tirunelveli: The Society; 1961.
      3) Periya puranam: a Tamil classic on the great Saiva Saints of South India / By Sekkizhaar. Condensed English version by G. Vanmikanathan and N. Mahalingam. Madras: Sri Ramakrishna Math; [1985].
      4) Sub-forms of Tamil poetry and their classification / By S.R. Ranganathan and V. Thillainayagam. Annals of Library Science, 10(3); 1963; 175-185.
      5) WordNet 2.1 (online).
      6) Murugan, V. (200). Tolkappiam in English: Translation with the Tamil text transliteration in the Roman script, introduction, glossary and illustrations / Project Director: Dr. G. John Samuel. Chennai: Institute of Asian Studies. ISBN 81-87892-05-6.
      7) Tamil lexicon (1924-1939). Published under the authority of the University of Madras. Reprint 1982. v.I-VI + Supplement.
      8) Thillainayagam, V. (1978). The cultural heritage of the Tamils: Library studies. Madras Institute of Tamil Studies, Seminar on Cultural Heritage of Tamils, 25-27 February 1978; p. 292-333. Also published in Pulamai, v.4, No.3-4; July-September 1978; p. 253-299.

  17. The Approach [Contd.] • Creating records in alphabetical order (from a to z) was found to be tedious • Instead, the terms in the corpus were grouped into broad categories based on the Basic Classes of Colon Classification (C.C.) • The thesaurus is being maintained as a database (using WINISIS)

  18. The Approach [Contd.] • Candidate concepts • Titles of Classics • Treated as quasi classes, since they have attracted other works upon themselves

  19. The Approach • Candidate Concepts & Terms • Issues Related to Script & Transliteration • Semantic Issues • Structural Issues • Management Issues • Handling NTs, BTs, RTs

  20. Script & Transliteration • Terms entered in the Roman script using the COTL scheme for transliteration (This is used by the Tamil Lexicon) • Supports automatic conversion to Tamil script • Records will eventually be in Tamil script
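
The automatic conversion from Roman transliteration to Tamil script mentioned on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the mapping table covers just the few letters needed for the term tAmarai, it is not the COTL transliteration table, and the function names (`match_vowel`, `to_tamil`) are hypothetical. It assumes, as the slide examples suggest, that a capital vowel marks the long form.

```python
# Illustrative subset only; NOT the full COTL transliteration table.
CONSONANTS = {
    "t": "\u0ba4",  # Tamil letter TA
    "m": "\u0bae",  # Tamil letter MA
    "r": "\u0bb0",  # Tamil letter RA
}

# Roman vowel -> (independent letter, dependent vowel sign).
VOWELS = {
    "ai": ("\u0b90", "\u0bc8"),
    "A": ("\u0b86", "\u0bbe"),   # assumed: capital marks the long vowel
    "a": ("\u0b85", ""),         # inherent vowel: no sign is written
}

PULLI = "\u0bcd"  # marks a consonant that carries no vowel


def match_vowel(text, pos):
    """Return the longest Roman vowel starting at pos, or None."""
    for roman in sorted(VOWELS, key=len, reverse=True):
        if text.startswith(roman, pos):
            return roman
    return None


def to_tamil(roman):
    """Convert a Roman-script term to Tamil script (subset mapping)."""
    out = []
    pos = 0
    while pos < len(roman):
        ch = roman[pos]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            pos += 1
            vowel = match_vowel(roman, pos)
            if vowel is None:
                out.append(PULLI)             # bare consonant
            else:
                out.append(VOWELS[vowel][1])  # dependent vowel sign
                pos += len(vowel)
        else:
            vowel = match_vowel(roman, pos)
            if vowel is None:
                raise ValueError(f"unmapped character {ch!r} at {pos}")
            out.append(VOWELS[vowel][0])      # independent vowel letter
            pos += len(vowel)
    return "".join(out)


print(to_tamil("tAmarai"))
```

Matching the longest vowel first keeps "ai" from being read as "a" followed by an unmapped "i". A full converter would need the complete table and the scheme's rules for every consonant and vowel.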

  21. Issues • The Approach • Candidate Concepts & Terms • Issues Related to Script & Transliteration • Semantic Issues • Structural Issues • Management Issues • Handling NTs, BTs, RTs

  22. Semantic Issues • Equivalence • Within the Language • A large number of synonyms in Tamil • Across Languages • Concepts unique to a culture (and so to the language); Non-Availability of terms in English for a large number of concepts • Near equivalent concepts • Use the original term

  23. Semantic Issues [Contd.] Example • tAmarai (lotus) • mirunALam (Stalk of the Lotus) • tAmaraimuL (thorny portion of the stalk of the lotus)

  24. Multiplicity of Synonyms • tAmarai – 82 synonyms in Tamil • Search term and no. of records: • tAmarai: 327 entries with tAmarai as entry word or in the explanation • kamalam: 36 entries with kamalam as entry word or in the explanation • Lotus: 309 entries with Lotus as entry word or in the explanation

  25. Semantic Issues [Contd.] • cAttunARRu = Young plants planted in place of the dead ones • aSTAgkaputti = Eight Kinds of Knowledge • cARvAkam = cAruvAka’s materialistic philosophy which says perception is the only source of knowledge

  26. Semantic Issues [Contd.] • Homographs • tAmarai = Lotus plant; Lotus flower; Lotus as a shape (entities in the shape of a lotus); Lotus-like properties (e.g., soft like lotus petals) • appu = Thigh; Father; Loan; Debt; Domestic male servant; Water; Trumpet tree; Sixth division of day • May also have to do with the evolution in the meaning and connotation of terms in Tamil • kurinchi, mullai, marutam, neital, and palai

  27. Semantic Issues [Contd.] • Homographs • Elam (spice) • SN Cardamom plant, elettaria cardamomum; cardamom • UF ilAjncali (spice) • UF ilAjnci (spice) • UF kALintam (spice) • UF kaNmali (spice) • BT tAparavastu (plant) • BT2 tAparanUl (botany) • IlAjncali (spice) • Use Elam (spice)
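
The entry on this slide (SN, UF, BT, BT2, Use) could be held in a structure along the following lines. This is a minimal sketch, not the WINISIS record layout used in the project: the class name `Descriptor`, its field names, and the helper functions are assumptions made for illustration. The terms are taken from the slide, and the BT2 level is derived here by following BT links.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Descriptor:
    term: str                                           # preferred term with qualifier
    scope_note: str = ""                                # SN
    used_for: List[str] = field(default_factory=list)   # UF (non-preferred terms)
    broader: Optional[str] = None                       # BT


RECORDS = {
    d.term: d
    for d in [
        Descriptor(
            term="Elam (spice)",
            scope_note="Cardamom plant, elettaria cardamomum; cardamom",
            used_for=[
                "ilAjncali (spice)",
                "ilAjnci (spice)",
                "kALintam (spice)",
                "kaNmali (spice)",
            ],
            broader="tAparavastu (plant)",
        ),
        Descriptor(term="tAparavastu (plant)", broader="tAparanUl (botany)"),
        Descriptor(term="tAparanUl (botany)"),
    ]
}


def build_use_index(records: Dict[str, Descriptor]) -> Dict[str, str]:
    """Map every non-preferred term to its descriptor (the 'Use' reference)."""
    return {uf: d.term for d in records.values() for uf in d.used_for}


USE_INDEX = build_use_index(RECORDS)


def resolve(term: str) -> str:
    """Return the preferred term for a descriptor or a non-preferred term."""
    return USE_INDEX.get(term, term)


def broader_chain(term: str) -> List[str]:
    """Follow BT links upward: first item is BT, second is BT2, and so on."""
    chain = []
    current = RECORDS[resolve(term)].broader
    while current is not None and current not in chain:
        chain.append(current)
        current = RECORDS[current].broader if current in RECORDS else None
    return chain


print(resolve("ilAjncali (spice)"))
print(broader_chain("ilAjncali (spice)"))
```

Generating the Use references from the UF lists keeps the two directions of the equivalence relation consistent, since only the descriptor record is edited by hand.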

  28. Homographs • The real meaning is to be understood in the context; Extensive use of Role Operators. Examples: • iTimpam (baby); iTimpam (castor); iTimpam (egg); iTimpam (misery); iTimpam (spleen) • Inverted Index will help users to select appropriate search term
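
One way to picture how an inverted index helps users choose among homographs: index each qualified term under its bare entry word, so that a search on the word lists every qualified sense. The sketch below assumes the "word (qualifier)" convention shown on the slide; the function names `split_term` and `build_inverted_index` are hypothetical. Case is preserved because the transliteration scheme uses capitals meaningfully.

```python
import re
from collections import defaultdict

# "word (qualifier)" as written on the slides.
QUALIFIED = re.compile(r"^(?P<word>.+?)\s*\((?P<qualifier>[^()]*)\)$")


def split_term(term):
    """Split 'iTimpam (baby)' into ('iTimpam', 'baby')."""
    match = QUALIFIED.match(term.strip())
    if match:
        return match["word"], match["qualifier"]
    return term.strip(), ""  # term without a qualifier


def build_inverted_index(terms):
    """Map each bare entry word to the list of qualified terms using it."""
    index = defaultdict(list)
    for term in terms:
        word, _qualifier = split_term(term)
        index[word].append(term)
    return dict(index)


TERMS = [
    "iTimpam (baby)",
    "iTimpam (castor)",
    "iTimpam (egg)",
    "iTimpam (misery)",
    "iTimpam (spleen)",
    "Elam (spice)",
]

INDEX = build_inverted_index(TERMS)

# A user who types the bare word sees every sense and picks one.
for candidate in INDEX["iTimpam"]:
    print(candidate)
```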

  29. Issues • The Approach • Candidate Concepts & Terms • Issues Related to Script & Transliteration • Semantic Issues • Structural Issues • Management Issues • Handling NTs, BTs, RTs

  30. Structural Issues • Hierarchy • Difficulties in developing corresponding hierarchies in two languages • Large Number NTs • Alternative Ways of Managing • Associative Relations • Links to Online lexical tools

  31. tAmarai (Lotus) • mirunALam (Stalk of the Lotus) • tAmaraimuL (thorny portion of the stalk of the lotus)
