Unicode & W3C Jataayu Software

C. Kumar

January 2007


Agenda

  • About Jataayu

  • Unicode & Encoding

  • W3C Specification for multi-lingual authoring

  • Multilingual WEB Address

  • Indian WEB Sites an Overview

  • W3C Activity

About Jataayu

  • Jataayu was formed with a clear focus on delivering solutions for wireless data services

  • Over 60% of the data traffic in Indian mobile networks for WAP, Mobile Web and MMS is handled by Jataayu products

  • The Mobile Device Solution Division focuses on wireless data applications such as WAP, MMS, SyncML, IMPS, Email, Web Browsing and Download

  • Active participants in OMA, W3C and MWI

  • Over 350 people strong with offices in UK, Singapore, Korea, Taiwan and the US; headquartered in India with major development center in Bangalore

Localization - Internationalization

  • Localization (l10n)

    • Adaptation of the content to meet the language, cultural and other requirements of a specific target market

  • Internationalization (i18n)

    • Design and development of content in a way that enables easy localization for target audiences that vary in culture, region or language.

    • The mission of the W3C i18n Activity is to ensure that W3C's formats and protocols are usable worldwide in all languages and in all writing systems.

Need for Unicode

  • Early character sets were based on 7 bits, which gave 2^7 (i.e. 128) possible characters

  • Adding the 8th bit gave a total of 256 possible characters, which was still not enough for all the European languages.

  • The code page mechanism helped a little by remapping the upper cells (0xA0 to 0xFF), but was very complex.

  • Other languages need thousands of ideographic characters at the same time, far more than a single byte can address.
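
The code page problem can be illustrated with a minimal Python sketch. The two codecs used here (`latin-1` and `iso8859_5`) are assumptions chosen for illustration; the slides do not name a specific code page.

```python
# Number of characters addressable with 7 and with 8 bits.
SEVEN_BIT = 2 ** 7   # 128
EIGHT_BIT = 2 ** 8   # 256

# One byte from the upper range, read under two different code pages.
# Both codecs are illustrative choices, not ones named in the slides.
raw = bytes([0xE9])
as_western = raw.decode("latin-1")      # Western European code page
as_cyrillic = raw.decode("iso8859_5")   # Cyrillic code page

print(SEVEN_BIT, EIGHT_BIT)
print(as_western, as_cyrillic)  # same byte, two different characters
```

The same byte value yields a different character under each code page, which is why text exchanged without the right code page information becomes unreadable.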

Unicode & Encoding

  • Unicode is a universal character set that contains all the characters needed for writing the majority of living languages in use on computers.

    • Allows for simple display and storage of multilingual content

  • An encoding refers to the way that characters in the character set (Unicode code points) are mapped to actual byte values.

    • Different encodings yield different byte sequences.
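
As a small illustration of one character producing different byte sequences, the sketch below encodes the Devanagari letter KA (U+0915) with three Python codecs. The choice of character and of the big-endian codec variants is an assumption made for the example.

```python
ch = "\u0915"  # DEVANAGARI LETTER KA

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark
utf32 = ch.encode("utf-32-be")   # big-endian, no byte order mark

# Print each encoding's bytes as hexadecimal pairs.
for name, data in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(name, data.hex())
```

The code point is the same in all three cases; only the byte representation changes.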

Unicode & Encoding

  • UTF-8 (8-bit Unicode Transformation Format)

    • Variable length 8-bit character encoding for Unicode

    • Able to represent any character in the Unicode Standard

    • Uses one to four bytes to encode a Unicode symbol

    • Only one byte is needed to encode the US-ASCII characters
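
The one-to-four byte range can be checked with a short Python sketch. The four sample characters are assumptions picked so that each falls into a different length class.

```python
# Sample characters (illustrative choices), one per UTF-8 length class.
samples = {
    "A": "US-ASCII letter",
    "\u00e9": "Latin small letter e with acute",
    "\u0915": "Devanagari letter KA",
    "\U0001F600": "character outside the 16-bit range",
}

# Map each character to the number of bytes UTF-8 needs for it.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_lengths[ch]} byte(s)")
```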

Unicode & Encoding

  • UTF-16 (16-bit Unicode Transformation Format)

    • Variable length 16-bit character encoding for Unicode

    • Uses a two- or four-byte sequence to encode a Unicode symbol

    • Two bytes are required to encode a US-ASCII character

  • UCS-2 (2-byte Universal Character Set)

    • Fixed length encoding that always encodes characters into a single 16-bit value

    • It can encode characters in the range 0x0000 to 0xFFFF
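
The difference between UTF-16 and UCS-2 comes down to what happens above 0xFFFF. The sketch below shows the surrogate pair arithmetic that UTF-16 uses; `utf16_units` is a hypothetical helper name introduced for this example, not part of any library.

```python
from typing import List


def utf16_units(code_point: int) -> List[int]:
    """Return the 16-bit UTF-16 code units for one Unicode code point."""
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point <= 0xFFFF:
        # Fits in a single 16-bit unit; this is also all UCS-2 can express.
        return [code_point]
    # Above 0xFFFF: split the 20-bit offset into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


print([hex(u) for u in utf16_units(0x0041)])    # one unit
print([hex(u) for u in utf16_units(0x1F600)])   # two units (surrogate pair)
```

A code point that needs two units here has no representation in UCS-2, which is limited to a single 16-bit value per character.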

Unicode & Encoding

  • UCS-4 / UTF-32 (32-bit Unicode Transformation Format)

    • Fixed length 32-bit character encoding for Unicode

    • Every character uses 4 bytes, which makes it very space-inefficient

      • Little used in practice; UTF-8 and UTF-16 are the normal ways of encoding Unicode text

  • http://www.unicode.org/

Unicode & Encoding

  • Devanagari (0x0900 – 0x097F)

  • Bengali (0x0980 – 0x09FF)

  • Tamil (0x0B80 – 0x0BFF)

  • Kannada (0x0C80 – 0x0CFF)
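
Because each script occupies a fixed block, the script of a character can be found from its code point alone. The following is a minimal sketch using only the four ranges listed above; `script_block` and `INDIC_BLOCKS` are hypothetical names introduced for the example.

```python
from typing import Optional

# Block ranges as listed above (inclusive).
INDIC_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
    "Kannada": (0x0C80, 0x0CFF),
}


def script_block(ch: str) -> Optional[str]:
    """Return the name of the listed block containing ch, or None."""
    code_point = ord(ch)
    for name, (first, last) in INDIC_BLOCKS.items():
        if first <= code_point <= last:
            return name
    return None


for ch in ("\u0915", "\u0995", "\u0BA4", "\u0C95", "A"):
    print(f"U+{ord(ch):04X}", script_block(ch))
```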

Unicode & Encoding

  • An alternative way to represent a character is to use an escape value, such as the numeric character reference &#x5D0; for א

  • Not all documents have to be encoded as Unicode

  • But documents can only contain characters defined by the Unicode Standard

  • Any encoding can be used as long as it is properly declared and its character repertoire is a subset of Unicode

  • Unicode encoding also allows many more languages to be mixed on a single page
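
Escapes make it possible to carry any Unicode character in a document whose encoding cannot represent it directly. Here is a minimal Python sketch of producing and resolving numeric character references; `to_ncr` is a hypothetical helper written for this example, while `html.unescape` is the standard library function.

```python
import html


def to_ncr(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal escape."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text
    )


escaped = to_ncr("\u05D0 is alef")    # Hebrew letter alef plus ASCII text
restored = html.unescape(escaped)     # resolve the escape back to the character

print(escaped)
print(restored == "\u05D0 is alef")
```

The escaped form is pure ASCII, so it survives in any ASCII-compatible encoding, yet it still refers to a Unicode code point.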

Other Encoding formats …

  • Shift_JIS (SJIS), character encoding for the Japanese Language

    • Single-byte encoding for characters in the lower, ASCII-compatible range (0x00 – 0x7F)

    • Double-byte encoding for most other characters, signalled by a lead byte in the upper range (above 0x7F)

  • GB2312, character encoding for simplified Chinese characters
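
The byte layouts of these legacy encodings can be inspected with Python's built-in `shift_jis` and `gb2312` codecs. The sample characters are assumptions chosen for illustration.

```python
# An ASCII letter and a hiragana letter in Shift_JIS.
sjis_ascii = "A".encode("shift_jis")        # one byte, same value as ASCII
sjis_kana = "\u3042".encode("shift_jis")    # HIRAGANA LETTER A, two bytes

# A simplified Chinese character in GB2312.
gb_han = "\u4E2D".encode("gb2312")          # two bytes

print(sjis_ascii.hex(), sjis_kana.hex(), gb_han.hex())

# The same characters in UTF-8 use different bytes entirely.
print("\u3042".encode("utf-8").hex(), "\u4E2D".encode("utf-8").hex())
```

Since the byte values differ from those of UTF-8, a consumer has to be told which encoding was used before it can decode the text.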

W3C Specification - Encoding

  • W3C specification for multi-lingual authoring

    • The encoding of the document must be declared so that the application that consumes it can interpret it correctly.

  • Meta Tag

    • <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />

  • XML

    • <?xml version="1.0" encoding="UTF-8"?>

  • The Content-Type header returned by the Web server should also state the character encoding of the document

    • Content-Type: text/html; charset=utf-8
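
On the consuming side, the declared encoding has to be read back out of the header. The sketch below does this with the standard library's `email.message.Message` parser; `charset_from_header` is a hypothetical helper name, and using the e-mail parser for an HTTP header is a convenience assumed for this example.

```python
from email.message import Message
from typing import Optional, Tuple


def charset_from_header(content_type: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into its media type and charset."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Parameter names are matched case-insensitively; the charset is
    # returned in lower case, or None when it is not declared.
    return msg.get_content_type(), msg.get_content_charset()


print(charset_from_header("text/html; charset=utf-8"))
print(charset_from_header("text/html; Charset=UTF-8"))
print(charset_from_header("text/html"))
```

When the charset is missing, the consumer has to fall back on the in-document declaration or on guessing, which is why the header should carry it.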

W3C Specification - Language

  • The author needs to specify the language of the document (web page content)

    • Browsers can choose appropriate fonts using the lang attribute

    • Search engines can group or filter results based on the user's linguistic preferences (using meta)

    • Translation tools use it to recognize sections of text in a particular language

W3C Specification - Language

  • HTTP Content Language Header

    • Content-Language: hi

  • Language Attribute on html tag

    • <html lang="hi">

    • <html xml:lang="hi">

  • Content Language in meta tag

    • <meta http-equiv="Content-Language" content="hi" />

  • Language attribute on embedded content

    • <div lang="en" xml:lang="en"> Some English Content </div>
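
A tool that consumes these declarations, such as a search engine or translation tool, has to read the language attributes back out of the markup. This is a minimal sketch using the standard library's `html.parser`; the `LangCollector` class and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Record (tag, language) for every element that declares a language."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Prefer lang, fall back to xml:lang when only that is present.
        lang = attributes.get("lang") or attributes.get("xml:lang")
        if lang:
            self.found.append((tag, lang))


# Hypothetical page: Hindi by default with an English section.
page = (
    '<html lang="hi"><body><p>...</p>'
    '<div lang="en" xml:lang="en"> Some English Content </div>'
    "</body></html>"
)

collector = LangCollector()
collector.feed(page)
collector.close()
print(collector.found)
```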

What value to use for lang?

  • IANA (Internet Assigned Numbers Authority)

    • Provides a unique value for each language

    • The value is listed as a subtag in the new IANA Language Subtag Registry

      • http://www.iana.org/assignments/language-subtag-registry

      • Hindi – hi, Kannada – kn, Tamil – ta

Bi-directional text

  • Beyond the language attribute, additional information is required to support right-to-left scripts (used for languages such as Arabic, Hebrew and Urdu)

  • In HTML, the dir attribute is used to specify the direction of the text

    • The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" in Hebrew.
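
The direction of a run of text follows from the bidirectional class that Unicode assigns to each character. The sketch below guesses a value for the dir attribute from the first strongly directional character; `base_direction` is a hypothetical helper and a deliberate simplification, not an implementation of the full Unicode Bidirectional Algorithm.

```python
import unicodedata
from typing import Optional


def base_direction(text: str) -> Optional[str]:
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew-type or Arabic-type letter
            return "rtl"
        if bidi_class == "L":           # left-to-right letter
            return "ltr"
    return None                         # digits and punctuation only


hebrew_title = "\u05E4\u05E2\u05D9\u05DC\u05D5\u05EA, W3C"  # Hebrew word, then Latin
print(base_direction(hebrew_title))
print(base_direction("W3C"))
print(base_direction("123"))
```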

Multilingual WEB Address

  • A Web address is used to point to a resource on the Web

    • Web addresses are typically expressed using URIs (Uniform Resource Identifiers)

    • URIs are restricted to a small set of characters (upper- and lower-case letters of the English alphabet, numerals and a few symbols).

  • Users' expectations and use of the Internet have put pressure on this restriction.

    • There is a growing need to use characters from any language in Web addresses.

Multilingual WEB Address …

  • A Web address in your own language and alphabet is easier to create, memorize, interpret and relate to. (Ex: http://खोज.com)

  • Punycode is a way of representing Unicode code points using only ASCII characters. (Ex: http://xn--21bm4l.com)
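
The conversion between the two forms of the example host name can be reproduced with Python's built-in `idna` codec. Note that this codec implements the older IDNA 2003 rules, so treat it as a sketch of the idea and not as a reference for current registry practice.

```python
# Host name from the example above, written with escapes for clarity.
host = "\u0916\u094B\u091C.com"

ascii_host = host.encode("idna")          # ASCII-only form with the xn-- prefix
round_trip = ascii_host.decode("idna")    # back to the Unicode form

print(ascii_host)
print(round_trip == host)
```

Only the non-ASCII label is converted; the `.com` label is already ASCII and passes through unchanged.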

Indian Content an Overview

  • Most Indian Websites are not using Unicode

    • Content is generated within the ASCII range, and sites supply proprietary fonts that map ASCII characters to glyphs of Indian languages.

    • Visually it looks fine, but no other software is able to interpret it

    • For each site, the user may need to download the proprietary fonts, which is not user friendly

    • Search engines cannot interpret the content as the author intended, because it does not follow a standard encoding.

Unicode & W3C Importance

  • The Web is also moving towards mobile devices

    • W3C Mobile Web Initiative (MWI) defines the best practices for Mobile Browsing

  • Mobile devices cannot install required fonts at run time, as is commonly done on the desktop

  • If Unicode characters are used, the required font may already be available on the device


  • Firefox (http://www.getfirefox.com)

    • Provides extensive support for Unicode and related fonts

    • Provides add-ons for typing in Indian languages in web pages on Linux. (Such tools are already available to Windows XP users through the language packs)

      • https://addons.mozilla.org/firefox/5484/author/

W3C i18n activity

  • Core Working group

    • Enable universal access to the World Wide Web by providing adequate support to other W3C Working Groups

  • GEO (Guidelines, Education & Outreach)

    • Make the internationalization aspects of W3C technology better understood and more widely and consistently used

  • ITS (Internationalization Tag Set)

    • Develop a set of elements and attributes that can be used with new DTDs/Schemas to support the internationalization and localization of documents