CLDR: The Common Locale Data Repository Locales for the World

Presentation Transcript


    1. CLDR: The Common Locale Data Repository
       Locales for the World
       Lisa Moore, George Rhoten, Mark Davis, Steven Loomis

    2. LRC – XI The Localisation Factory
       Agenda
          Why CLDR?
          CLDR data
          Tools and vetting
          Today and the future

    3. Agenda
          Why CLDR?
          CLDR data
          Tools and vetting
          Today and the future

    4. Locales – does anything stay the same?
       "Theatre Center News: The date of the last version of this document was 2003?3?20?. A copy can be obtained for $50,0 or 1.234,57 ???. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."

    5. Locales – the many differences
       Locales specify user preferences
       Linguistic and cultural differences
          Languages, scripts, writing systems, ordering, directionality, formatting, numbers, sizes
       Even in the same locale, interoperability issues across platforms
       Global economics has increased the need for greater globalization support in computer systems
       Everyone expects more!

    6. Add the Universal Character Encoding
       Unicode: Unique character codes for all languages

    7. The Need for Common Locale Data
       Computing environments often contain a variety of operating systems and software.
       Historically, research on locale-sensitive data has been done by individuals and/or companies.
       Because of political changes, it is easy for locale data to become out of date.
       It is difficult to get complete agreement on correctness.

    8. Common Locale Data Project
       Began as Common XML Locale Repository (CXLR), developed by OpenI18N in 2003
       CLDR project began in 2004
       Hosted by Unicode Consortium: http://www.unicode.org/cldr/
       Goals:
          Common, necessary software locale data for all world languages
          Collect and maintain locale data
          XML format for effective interchange
          Freely available

       The Common Locale Data Repository (CLDR) was developed in response to the need for standardized locales based on Unicode. CLDR provides key building blocks for software to support the world’s languages. This data is used by a wide spectrum of companies for their software internationalization and localization: adapting software to the conventions of different languages and locations for such common tasks as formatting dates, times, time zones, numbers, and currency values; sorting text; and choosing languages or countries by name, among others. The CLDR project collects and maintains locale data and uses the Locale Data Markup Language (LDML) to describe the data.

    9. CLDR in use (partial list)
       Libraries and Environments
          ICU – International Components for Unicode
          JDK – Java Development Kit
       Operating Systems
          Solaris
          AIX
          MacOS X
       Applications
          OpenOffice.org
          Acrobat
          ModernBill

    10. Agenda
           Why CLDR?
           CLDR data
           Tools and vetting
           The future

    11. What is a Locale?
        A locale is an identifier referring to linguistic and cultural preferences
           en_US, en_GB, ja_JP
        These preferences can change over time due to cultural and political reasons
           Introduction of new currencies, like the Euro
           Standard sorting of Spanish changes
        Many of these preferences have varying degrees of standardization
           12 and 24 hour format in the United States
        This is a very broad topic
           Scope of data limited to common system applications

        A locale is a string identifier that refers to specific linguistic and cultural preferences. These preferences can include date/time formatting, number formatting, spelling of certain names and many other items. These preferences can change over time due to cultural and political reasons. For example, modern Spanish sorts differently from older Spanish from the 1990s. In another example, some countries mandate how specific regions are referred to (this can happen when ownership of a region is in dispute).

        Of course, these types of preferences are not absolute. For example, most people in the United States use 12 hour time, but some people in the US use 24 hour time. Some languages, like French and Japanese, have published standards for how to sort them. Other languages may not have enough exposure to other cultures to have names for certain places or concepts.

        There are many things that locale data can cover. It could cover industry-specific topics, like shoe size. CLDR limits its scope to a few specific topics.
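
The identifiers on the slide (en_US, en_GB, ja_JP) combine a language code with a territory code. As a minimal sketch, assuming only the two forms "language" and "language_territory", the hypothetical helper below splits such an identifier. Full LDML identifiers can also carry script, variant and keyword parts, which this sketch deliberately rejects rather than guessing at.

```python
def split_locale_id(locale_id):
    """Split a simple locale identifier into (language, territory).

    Hypothetical helper for illustration: it only understands the
    "language" and "language_territory" forms shown on the slide.
    """
    parts = locale_id.replace("-", "_").split("_")
    if len(parts) > 2 or not parts[0]:
        raise ValueError(f"unsupported locale identifier: {locale_id!r}")
    language = parts[0].lower()
    territory = parts[1].upper() if len(parts) == 2 else None
    return language, territory


for example in ("en_US", "en_GB", "ja_JP"):
    print(example, "->", split_locale_id(example))
```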

    12. Types of Locale Data
        Dates/time/calendar formats
        Number/currency formats
        Measurement system
        Collation specification
           Sorting
           Searching
           Matching
        Translated names for language, territory, script, timezones, currencies, …
        Script and characters used by a language

        This is a list of some of the topics for which CLDR has translations and formats.

    13. Locale Data Markup Language
        Locale data described using XML
        CLDR data uses LDML
        Structure of CLDR controlled by Locale Data Markup Language (LDML) specification: http://unicode.org/reports/tr35

    14. LDML Data Categories
        <ldml>
           <identity>
           <localeDisplayNames>
           <layout>
           <characters>
           <delimiters>
           <measurement>
           <dates>
           <numbers>
           <posix>
           <collations>

    15. Names <localeDisplayNames>
        Provides translated display names for languages, territories, scripts, variants and keywords used in CLDR.
        Most of this information is at the language level, since it typically does not vary by territory, only language.
        An example: ICU Locale Explorer

    16. Names Examples
        From ga.xml (Irish):
        <localeDisplayNames>
           <languages>
              <language type="aa">Afar</language>
              <language type="ab">Abcáisis</language>…
           <scripts>
              <script type="Arab">Araibis</script>…
           <territories>
              <territory type="AD">Andóra</territory>
              <territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>…

        Here is an example of what CLDR looks like. In this snippet of CLDR data, translations are provided for some language, script and country display names. The keys come from other standards, such as ISO 639 and ISO 3166. As you can see, CLDR is written in XML. This data can be used for web site preferences.
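
To illustrate how software might consume display names like these, here is a minimal sketch using Python's standard xml.etree.ElementTree. The fragment is the slide's Irish data with the elided parts closed off so that it is well-formed. The wrapping <ldml> element and the helper name display_names are assumptions for illustration, and a real ga.xml contains far more entries.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the ga.xml excerpt (assumption: elided
# entries are simply omitted and every element is closed).
LDML_FRAGMENT = """
<ldml>
  <localeDisplayNames>
    <languages>
      <language type="aa">Afar</language>
      <language type="ab">Abcáisis</language>
    </languages>
    <scripts>
      <script type="Arab">Araibis</script>
    </scripts>
    <territories>
      <territory type="AD">Andóra</territory>
      <territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""


def display_names(ldml_text, group, item):
    """Return {code: translated name} for one group of display names."""
    root = ET.fromstring(ldml_text)
    path = f"localeDisplayNames/{group}/{item}"
    return {el.get("type"): (el.text or "").strip() for el in root.findall(path)}


territories = display_names(LDML_FRAGMENT, "territories", "territory")
print(territories["AE"])
```

A web site could use such a mapping to show a country picker in the user's own language, keyed by the standard codes.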

    17. Characters <characters>
        Allows for creation of exemplar character sets.
        An exemplar set specifies the set of characters that must be present in order to properly render the language.
        Auxiliary exemplar set defines additional characters that may appear in foreign words or phrases.
        Lower case only

    18. Date Formats <dates>
        Defines representation of calendars using various calendaring systems (Gregorian, Buddhist, Islamic, Japanese, etc.)
        Defines formatting for dates, times, eras and time zones
           wide, abbreviated, or narrow
        Date and time formats use patterns of letters to define proper formatting
        Week information
        Relative day/time translations (for example, yesterday, tomorrow, etc.)
        An example: ICU Locale Explorer
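
The idea of letter patterns can be sketched in a few lines. The function below is a hypothetical, heavily simplified formatter that understands only the year (y), month (M) and day (d) letters, with "MMMM" meaning the wide month name. The English month list is an assumption standing in for locale data, and quoted literal text, other fields and other calendars are not handled.

```python
import re
from datetime import date

# Assumption: wide month names as they might be supplied by locale data.
ENGLISH_MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def format_date(value, pattern, month_names):
    """Format a date using runs of the pattern letters y, M and d."""

    def render(match):
        field = match.group(0)
        letter, count = field[0], len(field)
        if letter == "y":
            if count == 2:
                return f"{value.year % 100:02d}"
            return f"{value.year:0{count}d}"
        if letter == "M":
            if count == 4:
                return month_names[value.month - 1]
            if count <= 2:
                return f"{value.month:0{count}d}"
            raise ValueError(f"unsupported month field: {field}")
        return f"{value.day:0{count}d}"  # letter == "d"

    return re.sub(r"y+|M+|d+", render, pattern)


day = date(2003, 3, 20)
print(format_date(day, "yyyy-MM-dd", ENGLISH_MONTHS))
print(format_date(day, "d MMMM yyyy", ENGLISH_MONTHS))
```

Keeping the pattern separate from the names is the point of the design: a translator can supply month names without touching the pattern, and the reverse.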

    19. Characters / Dates Examples
        From ga.xml (Irish):
        <characters>
           <exemplarCharacters>[a á b-e é f-i í j-o ó p-u ú v-z]</exemplarCharacters>
           <exemplarCharacters type="auxiliary">[? c ? ? g ? ? ? ?]</exemplarCharacters>
        </characters>…
        <dayContext type="format">
           <dayWidth type="abbreviated">
              <day type="sun">Domh</day>
              <day type="mon">Luan</day>…
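
One use of an exemplar set is checking whether a piece of text stays within the characters a language normally needs. As a rough sketch, the hypothetical helper below expands the main Irish set from the slide, assuming only space-separated single characters and simple "x-y" ranges. The full set syntax used by CLDR is richer than this, so treat it as an illustration only.

```python
def expand_exemplars(pattern):
    """Expand a simple exemplar set such as "[a á b-e]" into a set of characters."""
    body = pattern.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    chars = set()
    for token in body.split():
        if len(token) == 3 and token[1] == "-":
            # A range like "b-e" covers every code point from b to e.
            chars.update(chr(cp) for cp in range(ord(token[0]), ord(token[2]) + 1))
        elif len(token) == 1:
            chars.add(token)
        else:
            raise ValueError(f"unsupported exemplar token: {token!r}")
    return chars


irish = expand_exemplars("[a á b-e é f-i í j-o ó p-u ú v-z]")
missing = set("fáilte") - irish
print(len(irish), sorted(missing))
```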

    20. Time Zone Names <timeZoneNames>
        Based on Olson time zone database
        Localized display names for standard, daylight, and generic representations of time zones.
        Short and long display names.

    21. Numbers <numbers>
        Specifies proper localized formatting of numeric quantities
           Decimal
           Scientific
           Currency
           Percentages
        Includes localized decimal, thousands separators, currency symbols, etc.
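
The effect of localized separators can be shown with a small sketch. The hypothetical function below takes the decimal and grouping separators as plain arguments and assumes groups of three digits. Real locale data also specifies number patterns and other symbols, which this illustration ignores.

```python
def format_decimal(value, decimal_sep=".", group_sep=",", fraction_digits=2):
    """Format a number with caller-supplied decimal and grouping separators."""
    # Format with Python's built-in "," grouping and "." decimal point,
    # then swap in the separators supplied by the caller.
    text = f"{value:,.{fraction_digits}f}"
    integer_part, _, fraction_part = text.partition(".")
    integer_part = integer_part.replace(",", group_sep)
    if fraction_part:
        return integer_part + decimal_sep + fraction_part
    return integer_part


print(format_decimal(1234.57))
print(format_decimal(1234.57, decimal_sep=",", group_sep="."))
```

The second call produces the "1.234,57" style quoted in the Theatre Center example earlier in the deck.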

    22. Time Zones / Currencies
        From ga.xml (Irish) and root.xml:
        <timeZoneNames>
           <zone type="Europe/Dublin">
              <long>
                 <standard>Meán-Am Greenwich</standard>
                 <daylight>Am Samhraidh na hÉireann</daylight>
              </long>…
        <numbers>
           <currencies>
              <currency type="EUR">
                 <displayName>Euro</displayName>
                 <symbol>€</symbol>…

    23. Delimiters <delimiters>
        Specifies primary and secondary pairs of delimiter characters to be used for bracketing quotations in text

    24. Delimiters Example
        From fr.xml (French):
        <delimiters>
           <quotationStart>«</quotationStart>
           <quotationEnd>»</quotationEnd>
           <alternateQuotationStart>“</alternateQuotationStart>
           <alternateQuotationEnd>”</alternateQuotationEnd>
        </delimiters>
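
As a small sketch of how an application might apply these four values, the hypothetical quote function below wraps text in the primary delimiters, or in the alternate pair for a quotation nested inside another one. The dictionary simply mirrors the French example. Any spacing conventions around the marks are outside what the data shown here covers and are not handled.

```python
# Values copied from the fr.xml example on the slide.
FRENCH_DELIMITERS = {
    "quotationStart": "«",
    "quotationEnd": "»",
    "alternateQuotationStart": "“",
    "alternateQuotationEnd": "”",
}


def quote(text, delimiters, nested=False):
    """Wrap text in the primary delimiters, or the alternate pair when nested."""
    if nested:
        start = delimiters["alternateQuotationStart"]
        end = delimiters["alternateQuotationEnd"]
    else:
        start = delimiters["quotationStart"]
        end = delimiters["quotationEnd"]
    return start + text + end


inner = quote("oui", FRENCH_DELIMITERS, nested=True)
print(quote("Elle a dit " + inner, FRENCH_DELIMITERS))
```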

    25. Collation <collations>
        Information in collation directory, not main
        XML version of Java/ICU collation syntax
        Unicode collation algorithm is the base: http://unicode.org/reports/tr10
        Allows tailoring of the UCA on a per locale basis.

    26. Collation Example
        From collations/root.xml:
        <collations validSubLocales="ga ga_IE id id_ID ms ms_BN ms_MY nl nl_BE nl_NL pt pt_BR pt_PT">
           <collation type="standard">
              <rules> ...
                 <s>a</s> <t>A</t>
                 <s>á</s> <t>Á</t>
                 <s>a</s> <t>A</t>
                 <s>à</s> <t>À</t>…

    27. Agenda
           Why CLDR?
           CLDR data
           Tools and vetting
           Today and the future

    28. CLDR Tools
        Export
           ICU resource bundle generation
           POSIX locale generator
           OpenOffice.org format export
        Survey tool: http://www.unicode.org/cgi-bin/cldr-survey

    29. Vetting Process for Data
        Collect from different platforms, experts, submissions: new or revised
        References to external sources strongly encouraged
        Must be before freeze date for release
        Use Survey Tool to Collect Data

        Will show a demo of Survey Tool

    30. Causes of Conflicting Data
        Typographical errors
           Canda instead of Canada
        Regional differences
           German spelling is different between countries
        Parts of speech
           “Март 2004” versus “3 марта” when the Russian word for March is used in a date
        Context of usage
           Normal German sorting versus German phonebook sorting
        Standards versus common use
           “Republic of Laos” versus “Laos”
        Individual preferences
           24 hour time format versus 12 hour time format

        Now we will look at some examples of conflicting data. These are items which turn up when data comparisons are made. Not everything is an either-or case. Sometimes we find that the data must be restructured to accommodate both the old and the new data, because both could be correct.

        Typographical errors: Sometimes this is due to data being typed incorrectly. Other times it is due to using one locale’s translations as a template for another locale’s data.

        Regional differences: Regional and sub-regional differences may require keeping both sets of data in different locales rather than choosing one over the other. For example, German as written in Germany and in Switzerland frequently differs in spelling, and American English sometimes differs from British English.

        Context of usage: There is more than one way to sort German text. There is normal German sorting, and there is German phonebook sorting. For example, “öf” and “of” sort differently between normal German sorting and German phonebook sorting.

        Parts of speech: Some languages make a distinction between the way month names are written when cited independently and when written as part of a date. For example, “March 2004” at the heading of a calendar would be written as just the name March, but the date “3rd March, 2004” would require a different form meaning “of March”. CLDR accommodates such languages using a type value of “standalone” or “format”, respectively.

        Standards vs. common use: CLDR uses the commonly used translation or format as the default. However, alternates are allowed in CLDR. Sometimes there is more than one right answer.

        Misunderstanding: Sometimes translators don’t have enough knowledge about how CLDR works. Sometimes a translator will try to translate both the format and the characters of a date format instead of just the format. The localized characters of a date format are in a separate field of CLDR.

        Uncommon cases: There are some items and concepts in CLDR that are not commonly known by all translators. For example, how does a translator translate the word “Interlingua” (a language) when the translator has never heard of the Interlingua language? Sometimes translators guess, and these guesses will appear during the vetting process.

        Individual preferences: Some people have different preferences, and this can vary between translators. For example, the US military usually uses 24 hour time, but the rest of the United States uses 12 hour time.
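
The German sorting case can be made concrete with a sketch. This is not the Unicode Collation Algorithm and does not read CLDR collation rules. It is a simplified illustration that assumes normal sorting treats an umlauted vowel like its base letter (with the accent only breaking ties) and phonebook sorting treats ä, ö, ü as ae, oe, ue. Both key functions are hypothetical names.

```python
import unicodedata

# Assumption for illustration: phonebook sorting expands umlauts.
PHONEBOOK_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def standard_key(word):
    """Compare by base letters first; accents only break ties."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, word)


def phonebook_key(word):
    """Compare with umlauts expanded to vowel + e."""
    lowered = unicodedata.normalize("NFC", word.lower())
    expanded = "".join(PHONEBOOK_EXPANSIONS.get(ch, ch) for ch in lowered)
    return (expanded, word)


words = ["öf", "of"]
print(sorted(words, key=standard_key))
print(sorted(words, key=phonebook_key))
```

Under these assumptions the two orderings disagree: the normal key places “of” before “öf”, while the phonebook key compares “oef” with “of” and places “öf” first. Both orders can be correct for their own context, which is why the data has to carry more than one collation type.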

    31. Agenda
           Why CLDR?
           CLDR data
           Tools and vetting
           Today and the future

    32. Latest Release: CLDR 1.4
        Released: July 17, 2006
        360 locales: 121 languages, 142 territories
        25% more data
        17,000 new or modified data items
        Over 100 different contributors

        Here is a summary of the latest CLDR release.
           Complete POSIX-format data with POSIX conversion tool
           More timezone translations
           Data for UN M.49 regions, including continents and regions
           Addition of ISO 4217 currency code changeovers
           Additional number and data tests to verify CLDR implementations
           Mappings from language to script and territory
           Various other fixes, additions, and extensions
           Survey tool for improved collection of data (read only to non-members)

    33. Challenges
        Complex Formats
        Experts knowledgeable both in technology and a specific language
           Collation
           Exemplar characters
           Etc…
        Require close interaction of CLDR experts with language experts

        There are some challenges for creating data for CLDR. Some of the information can be complex. Some items in CLDR have a very specific purpose and meaning, but a language expert may be unfamiliar with these purposes and meanings. Sometimes close interaction between experts can be difficult over the phone or face to face. Interacting over e-mail is easier.

    34. Getting Involved
        Simplest – anyone!
           Use CLDR
           Bug report / feature request
        More Involved
           Vetting, Assessment, Tools, Policies, Decisions, …
           Any Unicode member eligible to name representatives, including country liaison members

        Who can participate in CLDR? Anyone can get involved! It can be as simple as suggesting a fix for a translation that is misspelled, or it can be as big as submitting data for a whole new locale. We also welcome vetters who can verify that data is correct, tool writers and many other people interested in the topic of locale data. When submitting data to the CLDR project, references to standards, dictionaries or actual examples of everyday use frequently help to get the locale data vetted correctly. Please see the CLDR project web site for how to submit locale data and how to participate in the project.

        Designed for most effective participation from people around the world
        Meetings
           By phone, never face to face
           Short, frequent
           Allows preparation between meetings
           Resolves conflicts and new feature requests
        Written
           Email
           Bug database submissions

    35. Example Country Process (Finland)
        Finnish Ministry of Education made CLDR data a major goal, 2004-06
        Research Institute for the Languages of Finland (“RILF” aka “Kotus”) designated agency
        Two official languages (Finnish and Swedish) & four regional / minority languages (three Sámi & Romani as spoken in Finland) to be covered
        Over 30 different parties represented: commercial, non-commercial, individuals
        Results expected to lead to new/revised national standards

    36. For More Information
        Unicode: http://www.unicode.org/
        CLDR: http://www.unicode.org/cldr/
        LDML specification: http://unicode.org/reports/tr35
        lisam@us.ibm.com
