
A Field Linguist’s Guide to Unicode

A Field Linguist’s Guide to Unicode. Deborah Anderson Script Encoding Initiative (Universal Scripts Project) Dept. of Lings., UC-Berkeley LSA Panel: A Field Linguist’s Guide to Making Long-Lasting Texts and Databases January 4, 2007. Working with Text Representation.


Presentation Transcript


  1. A Field Linguist’s Guide to Unicode Deborah Anderson Script Encoding Initiative (Universal Scripts Project) Dept. of Lings., UC-Berkeley LSA Panel: A Field Linguist’s Guide to Making Long-Lasting Texts and Databases January 4, 2007

  2. Working with Text Representation • “Use Unicode” (ISO/IEC 10646)

  3. Working with Text Representation • “Use Unicode” (ISO/IEC 10646) • Practical issues to consider: * which Unicode characters? * what about fonts? * how about keyboards? * will the language be supported in off-the-shelf software?

  4. Working with Text Representation • Goal today is to discuss the whole process of enabling a language to be used on a computer: • identifying letters/symbols in Unicode • fonts • keyboards • how to get support for the characters and scripts in software

  5. Step 1. Identify the characters used in a language • List all letters, symbols, digits, and marks of punctuation used in a language

  6. Step 1. Identify the characters used in a language One proposal for the Kazym Khanty alphabet

  7. Step 1. Identify the characters used in a language • List all letters, symbols, digits, marks of punctuation used in a language • Assign Unicode codepoints http://www.tlg.uci.edu/quickbeta.pdf

  8. Step 1. Identify the characters used in a language • List all letters, symbols, digits, marks of punctuation used in a language • Assign Unicode codepoints • Post a plain text version on a publicly accessible website • Circulate this list for comment
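A minimal sketch of how such a character list might be produced from a text sample, using Python's standard `unicodedata` module. The sample string and the function names `character_inventory` and `as_plain_text` are illustrative assumptions, not part of the presentation; a real inventory should still be checked by hand against the code charts.

```python
import unicodedata
from collections import Counter


def character_inventory(text):
    """Return one (codepoint, character, name, count) row per distinct character."""
    # Whitespace is skipped; everything else (letters, digits, punctuation) is counted.
    counts = Counter(ch for ch in text if not ch.isspace())
    rows = []
    for ch in sorted(counts):
        # Characters unknown to this Python's Unicode database get a placeholder name.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", ch, name, counts[ch]))
    return rows


def as_plain_text(rows):
    """Render the inventory as tab-separated plain text, suitable for posting."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)


# Hypothetical sample text, made up for illustration only.
sample = "ŋa ɓa ŋa"
print(as_plain_text(character_inventory(sample)))
```

The tab-separated output can be saved as a plain text file and circulated for comment.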

  9. Step 1. Identify the characters used in a language • Questions on which Unicode characters to use? • Check codecharts on the Unicode website

  10. Step 1. Identify the characters used in a language • Questions on which Unicode characters? • Check codecharts on the Unicode website • Check nameslist and annotations

  11. Step 1. Identify the characters used in a language • Questions on which Unicode characters? • Check codecharts on the Unicode website • Check nameslist and annotations • Not in Unicode charts? See if it is on the “Pipeline” page on the website for new characters

  12. Step 1. Identify the characters used in a language http://www.unicode.org/alloc/Pipeline.html

  13. Step 1. Identify the characters used in a language • Questions on which Unicode characters? • Check codecharts on the Unicode website • Check nameslist and annotations • Not in Unicode charts? See if it is on the “Pipeline” page on the website for new characters • Unsure? Ask on Unicode email list

  14. Step 1. Identify the characters used in a language • Propose any missing characters for inclusion into the Unicode Standard

  15. Step 1. Identify the characters used in a language • Propose any missing characters for inclusion into the Unicode Standard • TIP: Apply for funding to write a Unicode proposal or to conduct research

  16. Step 1. Identify the characters used in a language • Propose any missing characters for inclusion into the Unicode Standard • TIP: Apply for funding to write a Unicode proposal or to conduct research • TIP: Allow enough time for writing and review of proposal

  17. Step 1. Identify the characters used in a language • Propose any missing characters for inclusion into the Unicode Standard • TIP: Apply for funding to write a proposal or to conduct research • TIP: Allow enough time for writing and review of proposal • Note: Once written, the proposal will take 2-5 years to get through standards bodies

  18. Step 1. Identify the characters used in a language • For languages without an orthography, consult Unicode Technical Note #19 : • http://www.unicode.org/notes/tn19/

  19. Step 1. Identify the characters used in a language • From Unicode Technical Note #19: • If at all possible, use an already encoded character, abiding by the following tips: • If the script is right-to-left, select a character that is from a script that is right-to-left • Avoid “presentation forms” or “letterlike characters” • For a punctuation mark, select a character from the general punctuation block. • http://www.unicode.org/notes/tn19/
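The tips from Technical Note #19 can be partly screened by machine. The sketch below is a rough heuristic built on Python's standard `unicodedata` module; the function names `review_candidate` and `in_general_punctuation` are assumptions made for illustration. It flags characters with a compatibility decomposition, which covers many (not all) presentation forms and letterlike symbols, and letters whose directionality does not match a right-to-left script.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidirectional classes


def review_candidate(ch, script_is_rtl=False):
    """Return a list of warnings for one candidate character (heuristic only)."""
    warnings = []
    is_letter = unicodedata.category(ch).startswith("L")
    is_rtl = unicodedata.bidirectional(ch) in RTL_CLASSES
    if script_is_rtl and is_letter and not is_rtl:
        warnings.append("letter is not right-to-left")
    # Compatibility decompositions carry a tag such as "<isolated>" or "<font>".
    if unicodedata.decomposition(ch).startswith("<"):
        warnings.append("compatibility character (possible presentation form or letterlike symbol)")
    return warnings


def in_general_punctuation(ch):
    """True if the character lies in the General Punctuation block (U+2000..U+206F)."""
    return 0x2000 <= ord(ch) <= 0x206F


print(review_candidate("\u0628", script_is_rtl=True))  # ARABIC LETTER BEH
print(review_candidate("\ufe8f", script_is_rtl=True))  # ARABIC LETTER BEH ISOLATED FORM
print(in_general_punctuation("\u2020"))                # DAGGER
```

A clean result does not mean the character is the right choice; it only means none of these automatic checks fired.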

  20. Step 2: Send locale data to CLDR project • Locales: local conventions used to create software that is tailored to a specific language and location • Currency ($, £, etc.) • Time/date formats, measurement systems, number formats (e.g., France: 902 300, Germany: 902.300, U.S.: 902,300) • Sorting order
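To make the idea of a locale concrete, here is a small sketch that formats the same number three ways, following the France / Germany / U.S. examples on the slide. The table `GROUP_SEPARATORS` and the function `format_integer` are hypothetical names for illustration; real software would take these conventions from locale data rather than a hand-written table.

```python
# Hypothetical mini-table of digit-grouping separators, based on the slide's examples.
# (Real locale data may specify a no-break space rather than a plain space.)
GROUP_SEPARATORS = {"France": " ", "Germany": ".", "U.S.": ","}


def format_integer(n, region):
    """Group the digits of n in threes using the region's separator."""
    separator = GROUP_SEPARATORS[region]
    return f"{n:,}".replace(",", separator)


for region in GROUP_SEPARATORS:
    print(region, format_integer(902300, region))
```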

  21. Step 2: Send locale data to CLDR project • Common Locale Data Repository (CLDR): project hosted by Unicode that makes locale information freely available to software developers and others. http://www.unicode.org/cldr/
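Sorting order is one of the conventions a locale records. As a rough illustration of why it has to be stated explicitly, the sketch below sorts words by a language-specific alphabet in which a digraph counts as a single letter. The alphabet, the word list, and the function `make_sort_key` are all made up for the example; this is neither the CLDR data format nor the Unicode Collation Algorithm.

```python
def make_sort_key(alphabet):
    """Build a sort key from an ordered list of letters (multi-character letters allowed)."""
    rank = {letter: position for position, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def key(word):
        ranks, i = [], 0
        while i < len(word):
            # Try the longest possible letter first, so a digraph wins over its first character.
            for size in range(longest, 0, -1):
                chunk = word[i:i + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    i += len(chunk)
                    break
            else:
                # Characters outside the alphabet sort after all known letters.
                ranks.append(len(rank) + ord(word[i]))
                i += 1
        return ranks

    return key


# Hypothetical alphabet in which "ng" is its own letter, ordered after "n".
alphabet = ["a", "g", "n", "ng", "o"]
words = ["ngo", "no", "na", "ga"]
print(sorted(words))                               # plain code point order
print(sorted(words, key=make_sort_key(alphabet)))  # alphabet-aware order
```

In plain code point order "ngo" falls between "na" and "no"; with the tailored key it moves to the end, after every word that begins with "n".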

  22. Step 2: Send locale data to CLDR project

  23. Step 2: Send locale data to CLDR project • TIP: Involve a member of the user community to submit locale data

  24. Step 3: Create a font • Once a list of all the letters and symbols has been created with Unicode values, work can begin on a font • If any characters are being proposed, wait until they are far along in the standards process • Tip: Apply for funding to create a freely available font; costs can run $100/glyph

  25. Step 3: Create a font • Work with someone familiar with both the script and computer typography (esp. for complex scripts) • Use FontLab

  26. Step 4: Rendering Engines for complex scripts need upgrade • For new complex scripts (e.g., bidi issues, complex ligatures), upgrades to the rendering engine are often needed in order to properly draw the glyphs. • Early contact with companies (Microsoft and Adobe), the Linux community, and SIL is advised so the rendering engine can support the script properly

  27. Examples of Complex Scripts N’Ko Javanese

  28. Step 4: Rendering Engines for complex scripts need upgrade • SIL’s Graphite rendering engine offers a good test environment • Generally Apple does not require upgrades to its rendering engine • Microsoft prioritizes which scripts are included in its next rendering engine; governmental support is helpful in making a case to MS

  29. Step 5: Create a Keyboard • There are a number of keyboard creation programs available, including: • Keyman (for Windows) • Microsoft Keyboard Layout Creator (“MSKLC”) • Ukelele (for the Mac) • Keyboard Mapping for Linux

  30. Step 5: Create a Keyboard • Make the keyboard layout practical and have the user community test it out. • Make the keyboard layout freely available online (such as on Tavultesoft’s website)
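Before building a layout in one of the tools above, it can help to prototype the mapping and let community members try it on sample text. The sketch below simulates a layout as a table from typed key sequences to Unicode output. The `LAYOUT` table, its key sequences, and `apply_layout` are hypothetical; this is not the file format of Keyman or any other keyboard tool.

```python
# Hypothetical layout: key sequences typed by the user -> Unicode output.
LAYOUT = {
    ";n": "\u014b",  # LATIN SMALL LETTER ENG
    ";b": "\u0253",  # LATIN SMALL LETTER B WITH HOOK
}


def apply_layout(keystrokes, layout):
    """Convert a string of keystrokes to output text, matching the longest sequence first."""
    longest = max(len(sequence) for sequence in layout)
    output, i = [], 0
    while i < len(keystrokes):
        for size in range(longest, 0, -1):
            chunk = keystrokes[i:i + size]
            if chunk in layout:
                output.append(layout[chunk])
                i += len(chunk)
                break
        else:
            # Keys without a rule pass through unchanged.
            output.append(keystrokes[i])
            i += 1
    return "".join(output)


print(apply_layout(";na ;ba", LAYOUT))
```

Running sample sentences through a table like this is a cheap way to find awkward key sequences before committing to a layout.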

  31. Conclusion • Getting support for a language on the computer can be a long process, especially for new complex scripts, but the payoff is significant. Patience and persistence are key. • Avoid promising immediate access to a given language on the computer (unless all the characters are already encoded and available in widely used fonts) • Raising funding to cover all parts of the process from encoding to fonts is still an issue: Balinese needs fonts, N’Ko needs rendering engine support.

  32. Unicode website: http://www.unicode.org • Script Encoding Initiative: http://linguistics.berkeley.edu/sei
