A Field Linguist’s Guide to Unicode

Deborah Anderson
Script Encoding Initiative (Universal Scripts Project), Dept. of Linguistics, UC-Berkeley
LSA Panel: A Field Linguist’s Guide to Making Long-Lasting Texts and Databases
January 4, 2007

Working with Text Representation
• “Use Unicode” (ISO/IEC 10646)
• Practical issues to consider:
  * which Unicode characters?
  * what about fonts?
  * how about keyboards?
  * will the language be supported in off-the-shelf software?

Working with Text Representation
• Goal today is to discuss the whole process of enabling a language to be used on a computer:
  * identifying letters/symbols in Unicode
  * fonts
  * keyboards
  * how to get support for the characters and scripts in software

Step 1. Identify the characters used in a language
• List all letters, symbols, digits, and marks of punctuation used in a language
[Figure: one proposal for the Kazym Khanty alphabet]
• Assign Unicode codepoints (see http://www.tlg.uci.edu/quickbeta.pdf)
• Post a plain text version on a publicly accessible website
• Circulate this list for comment
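The listing and codepoint-assignment steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample string is hypothetical (three Cyrillic letters picked for the example, not an attested orthography), and the function names are illustrative rather than part of any tool mentioned in the talk.

```python
import unicodedata

def character_inventory(sample_text):
    """Return (codepoint, character, Unicode name) rows for every
    distinct non-whitespace character in sample_text, in codepoint order."""
    rows = []
    for ch in sorted(set(sample_text)):
        if ch.isspace():
            continue
        # Fall back to a marker if this Python's Unicode data has no name.
        name = unicodedata.name(ch, "<no name in this Unicode version>")
        rows.append((f"U+{ord(ch):04X}", ch, name))
    return rows

def format_inventory(rows):
    """Render the inventory as tab-separated plain text, one character per line."""
    return "\n".join("\t".join(row) for row in rows)

# Hypothetical sample text; replace with real text in the language.
sample = "\u04d1\u04a3 \u04d9\u04a3"
inventory = character_inventory(sample)
print(format_inventory(inventory))
```

Saving the output of `format_inventory` to a UTF-8 text file gives a plain text list that can be posted and circulated for comment.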

Step 1. Identify the characters used in a language
• Questions on which Unicode characters to use?
• Check the code charts on the Unicode website
• Check the names list and annotations
• Not in the Unicode charts? See if it is on the “Pipeline” page on the website for new characters: http://www.unicode.org/alloc/Pipeline.html
• Unsure? Ask on the Unicode email list
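As a first pass before consulting the charts, a script can report whether a codepoint is assigned in the Unicode data that ships with Python. This is a rough sketch, not a substitute for the code charts or the Pipeline page: the bundled data may lag behind the current standard, and `encoding_status` is an illustrative name.

```python
import unicodedata

def encoding_status(codepoint):
    """Describe a codepoint using Python's bundled Unicode Character Database."""
    ch = chr(codepoint)
    category = unicodedata.category(ch)
    if category == "Cn":
        # Unassigned in this data version: check the charts and the Pipeline page.
        return "unassigned"
    if category == "Co":
        # Private use: no standard meaning, so not suitable for interchange.
        return "private use"
    return unicodedata.name(ch, "assigned (no individual name)")

print("Unicode data version:", unicodedata.unidata_version)
for cp in (0x0041, 0xE000, 0x0378):
    print(f"U+{cp:04X}", encoding_status(cp))
```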

Step 1. Identify the characters used in a language
• Propose any missing characters for inclusion in the Unicode Standard
• TIP: Apply for funding to write a Unicode proposal or to conduct research
• TIP: Allow enough time for writing and review of the proposal
• Note: Once written, the proposal will take 2-5 years to get through the standards bodies

Step 1. Identify the characters used in a language
• For languages without an orthography, consult Unicode Technical Note #19: http://www.unicode.org/notes/tn19/
• From Unicode Technical Note #19: if at all possible, use an already encoded character, abiding by the following tips:
  * If the script is right-to-left, select a character from a script that is right-to-left
  * Avoid “presentation forms” or “letterlike characters”
  * For a punctuation mark, select a character from the General Punctuation block
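Two of the tips above can be approximated mechanically. The sketch below rests on a heuristic that is an assumption of this example, not something stated in the Technical Note: a character whose decomposition carries a compatibility tag such as `<font>` or `<isolated>` is treated as a likely presentation form or letterlike character. The function name and warning texts are illustrative.

```python
import unicodedata

# Bidirectional classes of strongly right-to-left characters.
RTL_CLASSES = {"R", "AL"}

def tn19_warnings(ch, script_is_rtl):
    """Return a list of warnings for a candidate orthography character."""
    warnings = []
    decomposition = unicodedata.decomposition(ch)
    # Compatibility decompositions start with a tag in angle brackets.
    if decomposition.startswith("<"):
        tag = decomposition.split(">")[0] + ">"
        warnings.append(f"compatibility character {tag}")
    # Only letters are checked for direction; punctuation is often neutral.
    if unicodedata.category(ch).startswith("L"):
        is_rtl = unicodedata.bidirectional(ch) in RTL_CLASSES
        if is_rtl != script_is_rtl:
            warnings.append("direction does not match the script")
    return warnings

print(tn19_warnings("\ufb01", script_is_rtl=False))  # LATIN SMALL LIGATURE FI
print(tn19_warnings("\u05d0", script_is_rtl=True))   # HEBREW LETTER ALEF
```

A character that produces no warnings is not automatically a good choice; the check only flags candidates that deserve a second look.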

Step 2: Send locale data to CLDR project
• Locales: local conventions used to create software that is tailored to a specific language and location
• Currency ($, £, etc.)
• Time/date formats, measurement systems
• Number formats (e.g., France: 902 300, Germany: 902.300, U.S.: 902,300)
• Sorting order
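To make the idea of locale conventions concrete, here is a small sketch of two of them: digit grouping and sorting order. The tables are hand-written for illustration and are assumptions of this example; real software would draw such values from CLDR. The French separator is shown as a plain space, as on the slide, although actual locale data may specify a non-breaking space.

```python
# Hypothetical, hand-written locale table (illustration only, not CLDR data).
GROUP_SEPARATORS = {"fr_FR": " ", "de_DE": ".", "en_US": ","}

def format_integer(value, locale_id):
    """Group digits in threes using the separator for the given locale."""
    separator = GROUP_SEPARATORS[locale_id]
    return f"{value:,}".replace(",", separator)

# Alphabet order for the sorting example: Spanish places n-with-tilde
# (written here as the escape \u00f1) directly after n.
SPANISH_ALPHABET = "abcdefghijklmn\u00f1opqrstuvwxyz"

def sort_key(word, alphabet):
    """Map each letter to its position in the language's alphabet.
    Raises ValueError for letters missing from the alphabet."""
    return [alphabet.index(ch) for ch in word]

for locale_id in ("fr_FR", "de_DE", "en_US"):
    print(locale_id, format_integer(902300, locale_id))

words = ["\u00f1u", "oso", "nube"]
print(sorted(words))                                               # plain codepoint order
print(sorted(words, key=lambda w: sort_key(w, SPANISH_ALPHABET)))  # alphabet order
```

Plain codepoint sorting places the word beginning with n-with-tilde last, while the alphabet-based key places it between "nube" and "oso", which is the kind of difference that locale sorting data captures.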

Step 2: Send locale data to CLDR project
• Common Locale Data Repository (CLDR): project hosted by Unicode that makes locale info freely available to software developers and others. http://www.unicode.org/cldr/
• TIP: Involve a member of the user community in submitting locale data

Step 3: Create a font
• Once a list of all the letters and symbols has been created with Unicode values, work can begin on a font
• If any characters are being proposed, wait until they are far along in the standards process
• TIP: Apply for funding to create a freely available font; costs can run $100/glyph
• Work with someone familiar with both the script and computer typography (esp. for complex scripts)
• Use FontLab

Step 4: Rendering engines for complex scripts need upgrades
• For new complex scripts (e.g., bidi issues, complex ligatures), upgrades to the rendering engine are often needed in order to draw the glyphs properly
• Early contact with companies (Microsoft and Adobe), the Linux community, and SIL is advised so the rendering engine can support the script properly
• Examples of complex scripts: N’Ko, Javanese
• SIL’s Graphite rendering engine offers a good test environment
• Generally, Apple does not require upgrades to its rendering engine
• Microsoft prioritizes which scripts are included in its next rendering engine; governmental support is helpful in making a case to Microsoft

Step 5: Create a Keyboard
• There are a number of keyboard creation programs available, including:
  * Keyman (for Windows)
  * Microsoft Keyboard Layout Creator (“MSKLC”)
  * Ukelele (for the Mac)
  * Keyboard Mapping for Linux
• Make the keyboard layout practical and have the user community test it out
• Make the keyboard layout freely available (such as on Tavultesoft’s website)
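Before handing a layout to the user community, it can help to confirm that every character in the language's inventory can actually be typed. The sketch below models a layout as a plain mapping from key sequences to output characters. That mapping, the key sequences, and the inventory are all hypothetical and do not reflect the file format of Keyman, MSKLC, or Ukelele.

```python
# Hypothetical layout: key sequence -> output character.
LAYOUT = {
    "a": "a",
    ";a": "\u04d1",  # CYRILLIC SMALL LETTER A WITH BREVE
    ";n": "\u04a3",  # CYRILLIC SMALL LETTER EN WITH DESCENDER
}

def unreachable_characters(inventory, layout):
    """Return the inventory characters that no key sequence produces."""
    produced = set(layout.values())
    return sorted(ch for ch in set(inventory) if ch not in produced)

# Hypothetical inventory; the third letter has no key sequence in LAYOUT.
inventory = "\u04d1\u04a3\u04d9"
missing = unreachable_characters(inventory, LAYOUT)
print([f"U+{ord(ch):04X}" for ch in missing])
```

A check like this only covers completeness; whether the layout is practical still has to come from testing by the people who will use it.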

Conclusion
• Getting support for a language on the computer can be a long process, especially for new complex scripts, but the payoff is significant. Patience and persistence are key.
• Avoid promising immediate access to a given language on the computer (unless all the characters are already encoded and available in widely used fonts)
• Raising funding to cover all parts of the process from encoding to fonts is still an issue: Balinese needs fonts, N’Ko needs rendering engine support.

Unicode website: http://www.unicode.org
Script Encoding Initiative: http://linguistics.berkeley.edu/sei
