Language Identification and IT


Presentation Transcript


  1. Language Identification and IT Peter Constable and Gary Simons SIL International peter_constable@sil.org gary_simons@sil.org www.sil.org

  2. Language identification • The use of identifying codes to tag information objects, indicating the language in which the information is expressed, e.g. <body xml:lang="en"> • 17th International Unicode Conference, San Jose, CA, September 2000
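
As a rough illustration of the tagging described on this slide, the sketch below reads `xml:lang` attributes with Python's standard `xml.etree.ElementTree` and lets a tag apply to nested content until a descendant overrides it. The sample markup and the helper name `iter_text_with_lang` are assumptions made for this example, not part of the presentation.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined "xml" prefix under its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_text_with_lang(elem, inherited=None):
    """Yield (language_tag, text) pairs, letting xml:lang inherit downward."""
    lang = elem.get(XML_LANG, inherited)
    if elem.text and elem.text.strip():
        yield lang, elem.text.strip()
    for child in elem:
        yield from iter_text_with_lang(child, lang)
        # Tail text follows the child but belongs to the parent element,
        # so it takes the parent's language tag.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()


# Hypothetical sample: an English sentence quoting a French word.
SAMPLE = '<body xml:lang="en">She said, <q xml:lang="fr">chat</q>.</body>'
pairs = list(iter_text_with_lang(ET.fromstring(SAMPLE)))
print(pairs)
```

Each text run comes back paired with the tag in force at that point, which is the information a language-specific process needs.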

  3. Language identification • Not considering automated language detection • Considering only language identifiers, not identifiers for paralinguistic notions such as writing system or locale

  4. About the Ethnologue • SIL Ethnologue • catalogue of all modern languages in the world • lists over 6,800 living languages • result of decades of research • system of three-letter codes • http://www.sil.org/ethnologue

  5. About the Ethnologue

  6. About the Ethnologue

  7. About the Ethnologue • Existing user base for Ethnologue codes: • SIL • UNESCO • Linguistic Data Consortium (850+ agencies) • The Linguist List (12,500 individual linguists) • The Endangered Language Fund • others

  8. Linguistic diversity • Number of languages by region: • Europe: 237 • Asia: 2,202 • Africa: 2,062 • Americas: 1,020 • Pacific: 1,312

  9. Motivation for this paper • Languages covered by standards • ISO 639-x covers approx. 400 languages • existing needs go much further: over 6,800 languages • immediate need among linguists and other researchers for use in XML

  10. Five issues • Change • Categorization • Inadequate definition • Scale • Documentation

  11. The need for language identifiers • Language-specific processing • spell-checking • sorting • morphological parsing • speech recognition/synthesis • language-specific typographic behaviour • etc.

  12. The need for language identifiers • Language-specific processing • choosing appropriate resources Los eventos deportivos pra la juventud Los eventos deportivos pra la juventud ህ ጩቦአፈ ዸድ ማዽ ጸመቂትወይቴ። ህ ጩቦአፈ ዸድ ማዽ ጸመቂትወይቴ።

  13. The need for language identifiers • Two distinct issues: • identify the language • apply the specific processing for that language
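
The separation between identifying the language and applying language-specific processing can be sketched as a lookup keyed on the language tag. In the sketch below the tiny word lists, the `LEXICONS` table and the `unknown_words` function are hypothetical stand-ins for real spelling resources; the only point is that the tag, not the text, selects the resource.

```python
# Hypothetical miniature lexicons, keyed by language tag.
LEXICONS = {
    "en": {"the", "sporting", "events", "for", "youth"},
    "es": {"los", "eventos", "deportivos", "para", "la", "juventud"},
}


def unknown_words(text, lang):
    """Return the words of `text` not found in the lexicon selected by `lang`."""
    lexicon = LEXICONS.get(lang)
    if lexicon is None:
        # Without a resource for this tag, no language-specific processing is possible.
        raise LookupError(f"no spelling resource for language tag {lang!r}")
    return [word for word in text.split() if word.casefold() not in lexicon]


sentence = "Los eventos deportivos pra la juventud"
print(unknown_words(sentence, "es"))  # checked against the Spanish word list
print(unknown_words(sentence, "en"))  # checked against the English word list
```

With the Spanish tag only the misspelled word is reported; with the English tag every word is, which is why the tag has to be right before the processing can be.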

  14. The need for language identifiers • Language detection • identify language by inspection of data itself • available only for a few languages • not practical for searching large corpora (e.g. the Internet) • doesn’t work on short text segments, e.g. She said, “chat”.

  15. The need for language identifiers • Language-specific processing • in general, must tag information objects to indicate language • identifiers are needed to distinguish every language

  16. Issue #1: change • Languages are constantly changing • Implications: • systems of language tags cannot be static • the speech variety (varieties) denoted by a tag is time-bound, e.g. “English” c. 1700 A.D. ≠ “English” c. 2000 A.D.

  17. Issue #2: categorization • Typical question: Are Serbian and Croatian the same language, or different languages? • Operational definitions of language • many different ways to formulate a definition • different definitions create different categorizations • different categorizations serve different purposes

  18. Issue #3: inadequate definition • Existing systems do not consistently employ a single operational definition • ISO 639-2: codes for “languages” and for groups of languages, e.g. nav = Navajo, ath = Athapascan languages • ISO 639-2: some “languages” are groups of languages, e.g. que = “Quechua” (47 distinct languages)
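
One way to picture the problem on this slide is a code table in which every entry states what kind of thing it denotes. The sketch below is illustrative only: the `Scope` and `CodeEntry` types are hypothetical, and the classification of each code follows the slide's own reading (for instance, treating que as a group) rather than any official registry data.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    INDIVIDUAL = "individual language"
    GROUP = "group of languages"


@dataclass(frozen=True)
class CodeEntry:
    code: str
    name: str
    scope: Scope


# Classification as described on the slide, not taken from a registry.
ENTRIES = [
    CodeEntry("nav", "Navajo", Scope.INDIVIDUAL),
    CodeEntry("ath", "Athapascan languages", Scope.GROUP),
    CodeEntry("que", "Quechua", Scope.GROUP),  # slide: 47 distinct languages
]


def codes_with_scope(entries, scope):
    """Return the codes whose declared scope matches `scope`."""
    return [entry.code for entry in entries if entry.scope is scope]


print(codes_with_scope(ENTRIES, Scope.INDIVIDUAL))
print(codes_with_scope(ENTRIES, Scope.GROUP))
```

When a single namespace mixes both scopes without recording which is which, a consumer cannot tell whether a tag picks out one language or many.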

  19. Issue #3: inadequate definition • Consistent use of a single definition in a given namespace is beneficial • “Requiring a single definition imposes too much constraint on users” • users may legitimately have different requirements • but no control results in confusion, especially when thousands of identifiers are added

  20. Issue #4: Scale • The number of languages exceeds the coverage of existing systems by an order of magnitude (6,800 vs. 400) • Existing systems do not scale well

  21. Issue #4: Scale • ISO 639-x • slow process unable to cope with large volume of requests • minimal attestation (50 documents) not appropriate for lesser-known languages • mnemonic codes (impossible for thousands of languages) • confusion due to inconsistent definition

  22. Issue #4: Scale • RFC 1766 • process unable to cope with large volume of requests • confusion due to inconsistent definition • unclear how to create tags

  23. Issue #5: documentation • Existing systems: can’t tell what codes denote • ISO 639-x: language, or group of languages? e.g. ara, “Arabic”: Standard only? all variants? • ISO 639-x: which of several alternate possibilities? e.g. bin, “Bini”: dialect of Yoruba (Nigeria; 20,000,000), dialect of Anyin (Côte d'Ivoire; 810,000), alternate name for Edo (Nigeria; 1,000,000), or alternate name for Pini (Australia; dying)?
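
The “Bini” ambiguity can be made concrete with a small name index. The sketch below is a hypothetical structure built only from the four readings listed on the slide; `NAME_INDEX`, `candidates_for` and `resolve` are illustrative names, not an existing API.

```python
# Readings of the name "Bini" as listed on the slide: (description, country).
NAME_INDEX = {
    "bini": [
        ("dialect of Yoruba", "Nigeria"),
        ("dialect of Anyin", "Côte d'Ivoire"),
        ("alternate name for Edo", "Nigeria"),
        ("alternate name for Pini", "Australia"),
    ],
}


def candidates_for(name):
    """Return every documented reading of a language name (possibly none)."""
    return NAME_INDEX.get(name.casefold(), [])


def resolve(name):
    """Return the single reading of `name`, or None when it is missing or ambiguous."""
    found = candidates_for(name)
    return found[0] if len(found) == 1 else None


print(len(candidates_for("Bini")))
print(resolve("Bini"))
```

A name alone cannot settle which variety a code denotes; some further documentation attached to the code is needed to pick one reading.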

  24. Issue #5: documentation • ISO 639-x: 2- vs. 3-letter codes • st, “Sesotho”: nso, “Sotho, Northern”? sot, “Sotho, Southern”? or both? • to, “Tonga”: tog, “Tonga (Nyasa)”? or ton, “Tonga (Tonga Islands)”?

  25. Solving these problems • Requirements of an adequate system: • able to scale • able to deal with change, track history of change • use a single operational definition for a given namespace • apply definition consistently within a namespace • complete, maintained, online documentation

  26. What the Ethnologue offers • Scale: already there • enumeration of languages • set of three-letter codes • Change: careful management • no re-use of codes • have begun recording revision history
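
The two change-management rules on this slide (no re-use of codes, and a recorded revision history) can be sketched as a small registry. This is an illustrative model under assumed names (`CodeRegistry`, `assign`, `retire`) and a placeholder code `zzz`; it does not describe how the Ethnologue database is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class CodeRegistry:
    active: dict = field(default_factory=dict)   # code -> language name
    retired: set = field(default_factory=set)    # codes that may never return
    history: list = field(default_factory=list)  # append-only revision log

    def assign(self, code, name):
        """Assign a new code; refuse codes that are in use or were ever retired."""
        if code in self.active or code in self.retired:
            raise ValueError(f"code {code!r} has already been used")
        self.active[code] = name
        self.history.append(("assign", code, name))

    def retire(self, code, reason):
        """Withdraw a code while keeping a record of why it was withdrawn."""
        if code not in self.active:
            raise KeyError(code)
        del self.active[code]
        self.retired.add(code)
        self.history.append(("retire", code, reason))


registry = CodeRegistry()
registry.assign("zzz", "Example language")  # placeholder code and name
registry.retire("zzz", "merged with another entry")
print(registry.history)
```

Because retired codes stay in the `retired` set, data tagged with an old code can never silently come to mean a different language.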

  27. What the Ethnologue offers • Definition: single definition, applied quite consistently • definition: primary criterion of mutual non-intelligibility as a basis for identifying candidates for separate literacy and literature • all categories are of the same type; no language families, groups, writing systems

  28. What the Ethnologue offers • Documentation • extensive information maintained for every language • new site will provide various reports • alternate names, location, population, etc. • related ISO codes, relationship • return Ethnologue data given an ISO code • evaluating possibilities for returning results as XML

  29. Integration with RFC 1766, XML • Ethnologue codes immediately available using “x-”, e.g. “Hopi”: <body xml:lang="x-hop"> or <body xml:lang="x-sil-hop"> • private-use tags not ultimately satisfactory
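
The private-use forms shown on this slide can be generated mechanically. The sketch below joins subtags behind the `x` primary tag and checks each against the RFC 1766 limit of one to eight ASCII letters per subtag; the function name `private_use_tag` is an assumption made for this example.

```python
import re

# RFC 1766 restricts each subtag to 1-8 ASCII letters.
_SUBTAG = re.compile(r"[A-Za-z]{1,8}")


def private_use_tag(*subtags):
    """Build an 'x-' private-use language tag from one or more subtags."""
    if not subtags:
        raise ValueError("at least one subtag is required after 'x'")
    for subtag in subtags:
        if not _SUBTAG.fullmatch(subtag):
            raise ValueError(f"invalid subtag: {subtag!r}")
    return "-".join(("x",) + subtags)


print(private_use_tag("hop"))
print(private_use_tag("sil", "hop"))
```

Such tags carry meaning only by private agreement between sender and receiver, which is the limitation the slide points to.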

  30. Integration with RFC 1766, XML • Register thousands of new tags with IANA • process would not be able to cope • problems devising that many tags • create considerable confusion in the single namespace

  31. Integration with RFC 1766, XML • Register “i-sil-” to specify a namespace maintained by a particular agency • <body xml:lang="i-sil-hop"> • deals with scale • creates a namespace with a particular definition that is consistently applied • avoids confusion of having a single namespace for all needs • allows alternate namespaces

  32. Integration with RFC 1766, XML • Possible refinement: define primary tag “n-”, e.g. <body xml:lang="n-sil-hop"> • first sub-tag identifies a registered namespace of identifiers • each namespace provides its own operational definition(s) • “i-” usage more consistent (languages only) • “i-” specifies a privileged namespace (doesn’t require “n-”)
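
To show how a consumer might read tags under the proposed “n-” refinement, the sketch below splits a tag into a namespace and an identifier. It models only the proposal on this slide, not an adopted standard; `ParsedTag` and `parse_namespaced_tag` are hypothetical names, and tags without the “n” primary tag are simply passed through with no namespace.

```python
from typing import NamedTuple, Optional


class ParsedTag(NamedTuple):
    namespace: Optional[str]  # registered namespace, e.g. "sil"; None if absent
    identifier: str           # identifier within that namespace


def parse_namespaced_tag(tag):
    """Split a tag of the proposed form n-<namespace>-<identifier>."""
    # RFC 1766 tags are case-insensitive, so normalise before comparing.
    parts = tag.lower().split("-")
    if parts[0] == "n":
        if len(parts) < 3 or not all(parts[1:]):
            raise ValueError(f"expected n-<namespace>-<identifier>, got {tag!r}")
        return ParsedTag(parts[1], "-".join(parts[2:]))
    # Any other tag is returned whole, with no namespace attached.
    return ParsedTag(None, "-".join(parts))


print(parse_namespaced_tag("n-sil-hop"))
print(parse_namespaced_tag("en"))
```

Keeping the namespace as a separate field lets each registered namespace apply its own operational definition without colliding with identifiers from another.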

  33. Conclusions • Language identifiers required for language-specific processing • Immediate need for thousands of new language identifiers; in particular, for use in XML • Five problem areas need to be considered in any system • SIL Ethnologue codes address all five problems • Revising RFC 1766 to add a namespace mechanism can support this and would offer many benefits
