Unicode & W3C Jataayu Software

Unicode & W3C Jataayu Software

C. Kumar

January 2007



Agenda

  • About Jataayu

  • Unicode & Encoding

  • W3C Specification for multi-lingual authoring

  • Multilingual WEB Address

  • Indian WEB Sites an Overview

  • W3C Activity

About Jataayu

  • Jataayu was formed with a clear focus on delivering solutions for wireless data services

  • Over 60% of the data traffic in Indian mobile networks for WAP, Mobile WEB and MMS is handled by Jataayu products

  • The Mobile Device Solution Division focuses on wireless data applications like WAP, MMS, SyncML, IMPS, Email, Web Browsing and Download

  • Active participants in OMA, W3C and MWI

  • Over 350 people strong with offices in UK, Singapore, Korea, Taiwan and the US; headquartered in India with major development center in Bangalore

Localization - Internationalization

  • Localization (l10n)

    • Adaptation of the content to meet the language, cultural and other requirements of a specific target market

  • Internationalization (i18n)

    • Design & Development of the content that enables easy localization for target audiences that vary in culture, region or language.

    • The mission of the W3C i18n Activity is to ensure that W3C’s formats and protocols are usable worldwide in all languages and in all writing systems.

Need for Unicode

  • Early character sets were based on 7 bits, which gave 2^7 (i.e. 128) possible characters

  • Adding the 8th bit gave a total of 256 possible characters. Still not enough for all the European languages.

  • The code page mechanism helped a little by changing the upper cells (0xA0 to 0xFF), but was very complex.

  • Addressing the needs of other languages requires thousands of ideographic characters at a time.

Unicode & Encoding

  • Unicode, a universal character set, contains all the characters needed for writing the majority of living languages in use on computers.

    • Allows for simple display and storage of multilingual content

  • An encoding refers to the way that characters from the character set (Unicode code points) are mapped to actual byte values.

    • Different encodings yield different byte sequences.
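
To make the point about byte sequences concrete, here is a minimal Python sketch. The choice of character (Devanagari letter KA, U+0915) and the helper name `hex_bytes` are illustrative assumptions, not part of the original slides.

```python
# One code point, three encodings, three different byte sequences.
ch = "\u0915"  # Devanagari letter KA

def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as space-separated hex (Python 3.8+)."""
    return text.encode(encoding).hex(" ")

for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(f"{enc:10} {hex_bytes(ch, enc)}")
```

The code point stays the same in every case; only its serialized form changes.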

Unicode & Encoding

  • UTF-8 (Unicode Transformation Format)

    • Variable length 8-bit character encoding for Unicode

    • Able to represent any universal character in the Unicode Standard

    • Uses one to four bytes to encode a Unicode symbol

    • Only one byte is needed to encode the US-ASCII characters
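
The one-to-four-byte behaviour can be checked directly. The sample characters and the helper `utf8_length` below are assumptions picked for illustration.

```python
# Sample characters chosen to hit each UTF-8 length class.
samples = {
    "A": "US-ASCII letter (U+0041)",
    "\u00e9": "Latin small letter e with acute (U+00E9)",
    "\u0915": "Devanagari letter KA (U+0915)",
    "\U0001F600": "character outside the Basic Multilingual Plane (U+1F600)",
}

def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

for ch, label in samples.items():
    print(f"{label}: {utf8_length(ch)} byte(s)")
```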

Unicode & Encoding

  • UTF-16 (16-bit Unicode Transformation Format)

    • Variable length 16-bit character encoding for Unicode

    • Uses a two- or four-byte sequence to encode a Unicode symbol

    • Two bytes are required to encode a US-ASCII character

  • UCS-2 (2-byte Universal Character Set)

    • Fixed length encoding that always encodes characters into a single 16-bit value

    • It can encode characters in the range 0x0000 to 0xFFFF
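
The difference between UTF-16 and UCS-2 shows up for code points above 0xFFFF, which UTF-16 represents as a pair of 16-bit units (a surrogate pair). A small sketch, with helper names that are illustrative assumptions:

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit code units of `ch` in big-endian UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def fits_in_ucs2(ch: str) -> bool:
    """UCS-2 can only hold code points in the range 0x0000 to 0xFFFF."""
    return ord(ch) <= 0xFFFF

for ch in ("A", "\u0915", "\U0001F600"):
    units = " ".join(f"{u:04X}" for u in utf16_units(ch))
    print(f"U+{ord(ch):04X}: units={units} fits_in_ucs2={fits_in_ucs2(ch)}")
```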

Unicode & Encoding

  • UCS-4 / UTF-32 (32-bit Unicode Transformation Format)

    • Fixed length 32-bit character encoding for Unicode

    • Every character uses 4 bytes, which makes it very space inefficient

      • Little used in practice with UTF-8 and UTF-16 being the normal ways of encoding Unicode Text

  • http://www.unicode.org/
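
As a rough illustration of the space trade-off, the following sketch compares encoded sizes for a short Hindi word and for plain ASCII text. The sample strings and the function name `encoded_sizes` are assumptions.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in each encoding (no byte order mark)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"  # "नमस्ते", six code points
print(encoded_sizes(hindi))
print(encoded_sizes("hello"))
```

For Devanagari text UTF-16 is the most compact of the three, while for ASCII text UTF-8 is; UTF-32 is the largest in both cases.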

Unicode & Encoding

  • Devanagari (0x0900 – 0x097F)

  • Bengali (0x0980 – 0x09FF)

  • Tamil (0x0B80 – 0x0BFF)

  • Kannada (0x0C80 – 0x0CFF)
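
Because each script occupies a contiguous range, a simple range lookup is enough to tell which script a character belongs to. This sketch only covers the four ranges listed above; the names `INDIC_BLOCKS` and `block_of` are illustrative assumptions.

```python
from typing import Optional

# Code point ranges as listed on the slide.
INDIC_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
    "Kannada": (0x0C80, 0x0CFF),
}

def block_of(ch: str) -> Optional[str]:
    """Return the name of the listed block containing `ch`, or None."""
    code_point = ord(ch)
    for name, (low, high) in INDIC_BLOCKS.items():
        if low <= code_point <= high:
            return name
    return None

for ch in ("\u0915", "\u0995", "\u0b95", "\u0c95", "A"):
    print(f"U+{ord(ch):04X} -> {block_of(ch)}")
```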

Unicode & Encoding

  • An alternate way to represent a character is to use an escape value, for example &#x05D0; for א.

  • Not all documents have to be encoded as Unicode

  • But documents can only contain characters defined by the Unicode Standard

  • Any encoding can be used as long as it is properly declared and its character set is a subset of Unicode

  • Unicode encoding also allows many more languages to be mixed on a single page
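
The escape mechanism can be sketched as follows: every non-ASCII character is replaced by a hexadecimal numeric character reference, and the standard library's `html.unescape` turns it back. The helper name `to_ncr` is an assumption for illustration.

```python
import html

def to_ncr(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal numeric character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in text)

escaped = to_ncr("Alef: \u05d0")
print(escaped)                 # pure ASCII markup
print(html.unescape(escaped))  # back to the original text
```

This lets a document stored in a limited encoding still reference any Unicode character.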

Other Encoding formats …

  • Shift_JIS (SJIS), character encoding for the Japanese Language

    • Single byte character encoding for the lower-ASCII characters (0x00 – 0x7F)

    • Double-byte character encoding for characters whose first byte is above 0x7F (apart from the single-byte half-width katakana in 0xA1 – 0xDF)

  • GB2312, character encoding for simplified Chinese characters
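
Python ships codecs for both encodings, so the single-byte and double-byte behaviour can be observed directly. The sample characters and the helper `encoded_length` are assumptions chosen for illustration.

```python
def encoded_length(ch: str, encoding: str) -> int:
    """Number of bytes `ch` occupies in the given encoding."""
    return len(ch.encode(encoding))

samples = [
    ("A", "shift_jis"),       # ASCII-range letter
    ("\u3042", "shift_jis"),  # Hiragana letter A
    ("\u4e2d", "gb2312"),     # simplified Chinese character for "middle"
]

for ch, enc in samples:
    print(f"U+{ord(ch):04X} in {enc}: {encoded_length(ch, enc)} byte(s), "
          f"UTF-8: {encoded_length(ch, 'utf-8')} byte(s)")
```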

W3C Specification - Encoding

  • W3C specification for multi-lingual authoring

    • The encoding of the document needs to be declared, so that the consuming application can interpret it.

  • Meta Tag

    • <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

  • XML

    • <?xml version="1.0" encoding="UTF-8"?>

  • The Content-Type header returned from the WEB server should also contain the character encoding of the document

    • Content-Type: text/html; charset=utf-8
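
On the consuming side, the declared encoding has to be read from the header before the body bytes can be decoded. A minimal sketch using Python's standard `email.message.Message` to parse the header value; the function name and the fallback to UTF-8 when no charset is declared are assumptions, not something the slides prescribe.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the value lower-cased, or None if absent.
    return msg.get_content_charset() or default

body = "\u0928\u092e\u0938\u094d\u0924\u0947".encode("utf-8")  # bytes as sent on the wire
charset = charset_from_content_type("text/html; charset=UTF-8")
print(charset, body.decode(charset))
```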

W3C Specification - Language

  • The author needs to specify the language of the document (web page content)

    • Browsers can choose an appropriate font using the lang attribute

    • Search engines can group or filter results based on the user’s linguistic preferences (using meta)

    • Translation tools use it to recognize sections of text in a particular language

W3C Specification - Language

  • HTTP Content Language Header

    • Content-Language: hi

  • Language Attribute on html tag

    • <html lang="hi">

    • <html xml:lang="hi">

  • Content Language in meta tag

    • <meta http-equiv="Content-Language" content="hi" />

  • Language attribute on embedded content

    • <div lang="en" xml:lang="en"> Some English Content </div>
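
A tool that wants to use these declarations (a search engine or translation tool, say) has to collect the `lang` attributes from the markup. A rough sketch with the standard `html.parser` module; the class and function names and the sample page are illustrative assumptions.

```python
from html.parser import HTMLParser

class LangCollector(HTMLParser):
    """Collect (tag, lang) pairs for every element that declares a language."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "lang" and value:
                self.langs.append((tag, value))

def collect_langs(markup: str) -> list:
    parser = LangCollector()
    parser.feed(markup)
    parser.close()
    return parser.langs

page = ('<html lang="hi"><body><p>...</p>'
        '<div lang="en" xml:lang="en">Some English Content</div></body></html>')
print(collect_langs(page))
```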

What value to use for lang?

  • IANA (Internet Assigned Numbers Authority)

    • Provides a unique value for each language

    • It is available as the Subtag value in the new IANA Language Subtag Registry

      • http://www.iana.org/assignments/language-subtag-registry

      • Hindi – hi, Kannada – kn, Tamil – ta
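
The registry is a plain-text file of records separated by `%%` lines, each with fields such as `Type`, `Subtag` and `Description`. The sketch below parses an abbreviated, hand-written excerpt in that layout; the excerpt omits fields present in the real file, and the parser is a simplification that ignores folded continuation lines and repeated fields.

```python
# Abbreviated excerpt in the registry's record layout (assumption: trimmed for illustration).
SAMPLE = """\
Type: language
Subtag: hi
Description: Hindi
%%
Type: language
Subtag: kn
Description: Kannada
%%
Type: language
Subtag: ta
Description: Tamil
"""

def parse_language_subtags(registry_text: str) -> dict:
    """Map language subtags to their descriptions (simplified parser)."""
    subtags = {}
    for record in registry_text.split("%%"):
        fields = {}
        for line in record.strip().splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value
        if fields.get("Type") == "language" and "Subtag" in fields:
            subtags[fields["Subtag"]] = fields.get("Description", "")
    return subtags

print(parse_language_subtags(SAMPLE))
```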

Bi-directional text

  • Information beyond the language attribute is required to support right-to-left scripts (like Arabic, Hebrew, Urdu)

  • In HTML, the dir attribute is used to specify the direction of the text

    • The title says “<span dir="rtl">פעילות הבינאום, W3C</span>” in Hebrew.
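
An authoring tool could choose the `dir` value automatically by looking at the first strongly directional character of the text. The sketch below uses the standard `unicodedata.bidirectional` lookup; the function name and the default of "ltr" for text without any strong character are assumptions, and this is a simplified heuristic rather than the full Unicode bidirectional algorithm.

```python
import unicodedata

def base_direction(text: str) -> str:
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-like and Arabic-like letters
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return "ltr"  # assumption: default when no strong character is found

for sample in ("\u05e9\u05dc\u05d5\u05dd", "\u0645\u0631\u062d\u0628\u0627", "W3C"):
    print(f'<span dir="{base_direction(sample)}">{sample}</span>')
```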

Multilingual WEB Address

  • A Web address is used to point to a resource on the WEB

    • Web addresses are typically expressed using URIs (Uniform Resource Identifiers)

    • URIs are restricted to a small number of characters (upper & lower case letters of the English alphabet, numerals and a few symbols).

  • Users’ expectations and use of the Internet have changed these restrictions.

    • There is a growing need to use characters from any language in WEB Addresses.
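
One common way to carry non-ASCII characters in the path of a URI, within the restricted character set, is to percent-encode their UTF-8 bytes. This technique is not named on the slide; the sketch below uses the standard `urllib.parse` functions, and the sample path is an assumption.

```python
from urllib.parse import quote, unquote

path = "/\u0916\u094b\u091c"  # "/खोज"
encoded = quote(path)         # UTF-8 bytes, percent-encoded; "/" is left as-is

print(encoded)
print(unquote(encoded))
```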

Multilingual WEB Address …

  • A Web address in your own language and alphabet is easier to create, memorize, interpret and relate to. (Ex: http://खोज.com)

  • Punycode is a way of representing Unicode code points using only ASCII characters. (Ex: http://xn--21bm4l.com)
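
The conversion between the two forms of the host name can be tried with Python's built-in `idna` codec, which implements the older IDNA 2003 rules and applies Punycode label by label. The function names below are illustrative assumptions.

```python
def to_ascii_host(host: str) -> str:
    """Convert an internationalized host name to its ASCII (Punycode) form."""
    return host.encode("idna").decode("ascii")

def to_unicode_host(ascii_host: str) -> str:
    """Convert an ASCII (Punycode) host name back to Unicode."""
    return ascii_host.encode("ascii").decode("idna")

host = "\u0916\u094b\u091c.com"  # "खोज.com"
ascii_host = to_ascii_host(host)
print(ascii_host)
print(to_unicode_host(ascii_host))
```

Labels that are already ASCII, such as `com`, pass through unchanged; converted labels carry the `xn--` prefix.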

Indian Content an Overview

  • Most Indian Websites are not using Unicode

    • Content is generated within the ASCII range and displayed with proprietary fonts that map the ASCII character set to Indian-language glyphs.

    • Visually it will be fine, but no other entities will be able to interpret it

    • For each site, the user may need to download the proprietary fonts, which is not user friendly

    • Search engines will not be able to interpret the content as the author intended, because it does not follow a standard encoding.
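
The search problem can be illustrated with a toy example. The mapping below is entirely hypothetical and invented for this sketch; real proprietary fonts each use their own, mutually incompatible tables.

```python
# HYPOTHETICAL font table: which glyph the font draws for each stored ASCII letter.
TOY_FONT_MAP = {"k": "\u0915", "m": "\u092e", "l": "\u0932"}

def as_displayed(stored: str) -> str:
    """What a reader sees when the stored ASCII text is drawn with the toy font."""
    return "".join(TOY_FONT_MAP.get(ch, ch) for ch in stored)

stored_text = "kml"            # what is actually in the page source
query = "\u0915\u092e\u0932"   # "कमल", what a user would type into a search box

print(as_displayed(stored_text))             # looks right on screen
print(query in stored_text)                  # but a search over the stored text fails
print(query in as_displayed(stored_text))    # it only matches after undoing the font mapping
```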

Unicode & W3C Importance

  • The WEB is also moving towards mobile

    • W3C Mobile Web Initiative (MWI) defines the best practices for Mobile Browsing

  • Mobile devices cannot install the required fonts at run time, as is commonly done on the desktop

  • If Unicode characters are used, the required font may already be available on the device



  • Firefox (http://www.getfirefox.com)

    • Provides extensive support for Unicode and related fonts

    • Provides add-ons for typing in Indian languages in web pages on Linux. (Such tools are already available to Windows XP users through the language packs)

      • https://addons.mozilla.org/firefox/5484/author/

W3C i18n activity

  • Core Working group

    • Enable universal access to the World Wide Web by providing adequate support to other W3C Working Groups

  • GEO (Guidelines, Education & Outreach)

    • Make the internationalization aspects of W3C technology better understood and more widely and consistently used

  • ITS (Internationalization Tag Set)

    • Develop a set of elements and attributes that can be used with new DTDs/Schemas to support the internationalization and localization of documents



