Unicode Support Status in Various Platforms

Unicode support status in various platforms (Microsoft Windows) • Windows 9x / ME • Do not support Unicode internally • Limited Unicode APIs are supported. • Unicode applications compiled with Microsoft Layer for Unicode can be run on Win9x • Use code page to support different encodings • Windows NT / 2000 / XP • Support Unicode • Use of wide char (fixed 2 bytes) • Use UCS-2

Unicode support status in various platforms (Linux & Mac OS) • Linux • Newer Kernel supports Unicode • Requires glibc 2.2.2 and XFree86 4.0.3 or newer • Use UTF-8 in most case, e.g. filesystem • Set locale to <lang>_<place>.<encoding>, e.g. zh_TW.utf8 • Enable UTF-8 support in console by executing unicode_start • Mac OS • Mac OS 9.1, Mac OS X support Unicode • 16-bit for Unicode character

What is a code page • There are a lot of different encodings, e.g. EUC-TW, Big5, Latin-1 etc. • A code page (code page identifier) is a number to identify a codeset. • e.g. 950 – Traditional Chinese (Big5) • e.g. 1252 – Windows Latin-1 • Other code page identifiers can be found in: http://msdn.microsoft.com/library/en-us/intl/unicode_81rn.asp • In Windows NT/2000/XP, code page conversion table provides information to convert between different encodings.

Java • Java is in Unicode internally. The supported encoding sets are provided by Java library packages rt.jar and i18n.jar • The supported encoding sets for java.io.*, java.lang.* and java.nio.* API can be found in: http://java.sun.com/j2se/1.4/docs/guide/intl/encoding.doc.html • User Input/Output will be automatically convert between Unicode and System code page • Specify the encoding of the source files when compiling. • javac –encoding <encoding> <source files> • Convert to other supported encoding: e.g. byte[] utf8Bytes = str.getBytes(“UTF-8”); • Convert from other supported encoding: e.g. String str = new String(utf8Bytes, “UTF-8”);

Code Conversion • Generally codeset conversion cannot provide one-to-one mapping(unless the two character sets are exactly the same) • Unicode is a superset of every existing national standard => guaranteed round-trip conversion • Round-trip conversion: Suppose a file file1 in codeset A is converted to a file file2 in codeset B and then converted back to codeset A with a file name file3. If file3=file1, we say that codeset B guarantees round-trip conversion for A.

Java Code conversion Conversion from multibyte to Unicode Byte[ ]my_data = { 0xA4, 0x40} String my_unicode_data = newString(my_data,”big5”) Where “big5” is the name of the multibyte code name. Unicode needs this to do code conversion to: Conversion from Unicode to multibyte String my_unicode_data =“\u4E00” (一) Byte[]my_b5_data=my_unicode_data.getBytes(“Big5”) My_b5_data will have the value of 0xA440 Byte[]my_gb_data= my_unicode_data.getBytes(“GBK”) My_gb_data will have the value of D2BB

Text stream import FileI= newFile(“input”); FileInputStreamtmpin = newFileInputStream(I); BufferedReaderin = newBufferedReader( newInputStreamReader(tmpin, “Big5”)); Once the BufferedReaderin is established, data can be read using the readLine() method. inputStr = in.readLine();

Text Stream Export Fileo = newFile(“output.big5”); FileOutputStreamtmpout = newFileOutputStream(o); BufferedWriter out = new BufferedWriter(new OutputStreamWriter(tmpout, “Big5”)); • …. • Out.println(“\u6CB3\u8C5A”); (“河豚”) • Out.close(); • 0xAA65 0xB362

Multilingual applications • Software teaching Chinese for English people • Software teaching English for Chinese • Conceptually separate two types of data in a multilingual application: • Data related to display of menu/instructions, • Data related to the processing in the program • Multilingual application vs. I18n applications • I18N: data related to display and processing are the same and it is for the same language/convention • Multilingual applications: Data related to display is for one language(and can be internationalized). Data related to the processing can be multilingual and not necessarily related to the display language. • Unicode is the most convenient encoding for multilingual applications, but not absolutely necessary

The Ideographic Composition Scheme Used in ISO 10646 • Introduction to Ideograph Description Characters(IDCs) • The ideographic composition scheme • Application using IDCs

What are Ideograph Description Characters • 12 structure symbols used to describe the formation of characters using some smaller ideograph functional units such as character components⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻

Characteristics of Ideographs • Ideograph characters are often formed by smaller ideographic elements such as Radicals, ideographs proper, and other ideographic components which we generally call ideograph components • Natural in the formation of characters • Examples: 2 components => Chinese uses components has long been using components to describe characters, especially characters with the same pronunciation

Problems with ideograph Character Encoding • Each character is treated as a different symbol, and thus given a codepoint • Codepoint assignment in a block does try to follow radical order, but codepoint assignment does not consider the substructures(components). Thus such information is not revealed. • When new character is created, codepoint allocation is needed in new blocks, thus radical order cannot be globally maintained. • Also there is a potentially endless standardization process • Encoding of rarely used ideograph characters is a waste of resource both in terms of code space and also standardization effort

Introduction of IDCs • Work started in 1995 by ISO/IEC SC2/WG2/IRG in 1995 • Objective of the Original proposal: use coded ideographs and “structure symbols” to describe not yet coded ideographs. • Original proposal has 15 “Ideograph Structure Symbols” base on study on Han characters, three of them didn’t make it to ISO 10646/Unicode: • Ideograph_Proper(日): Every coded character is considered ideograph proper, thus not needed • Left_Up_Encompass: no un-encoded example • Mirror_Symmetry(非): left being mirrored to the right, but can be describe by Left_to_Right • Renames the 12 symbols as Ideograph Description Characters

Ideographic Composition Scheme • IDS describes a character using its components and indicating the relative positions of the components. • IDCs are considered operators to the components. • IDSs can be expressed by a context free grammar through the Backus Naur Form(BNF). The grammar G has four components: • Let G = {, N, P, S}, where • : the set of terminal symbols— coded radicals, coded ideographs, and the 12 IDCs. • N:the set of 5 non-terminal symbols N={IDS, IDS1, Binary_Symbol, Ternary_Symbol, Ideograph_Component} • S = {IDS}, which is the start symbol of the grammar • P: a set of rewrite rules

The following is the set of rewriting rules P: • IDS::=<Binary_Symbol><IDS1><IDS1>|<Ternary_Symbol> <IDS1><IDS1><IDS1> • <IDS1> ::= <IDS> | <Ideograph_Component> • <Ideograph_Component>::= coded_ideograph | coded_radical | coded_component • <Binary-Symbol> ::=⿰|⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹ | ⿺ | ⿻ • <Ternary_Symbol> ::= ⿲ | ⿳ • Note that even though the IDCs are terminal symbols, they are not part of the ideograph components.

Examples

IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID (IDC-OLD , ⿻): • The IDS introduced by IDC-OLD describes the abstract form of the ideograph with D1 and D2 overlaying each other. • ⿻从工is an example of an IDS which represents the abstract from of 巫 • IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT（IDC-SUR, ⿹): • The IDS introduced by IDC-SUR describes the abstract form of the ideograph with D1 on the right top corner of D2, and D2 is encompassed by D1. • ⿹is an example of an IDS which represents the abstract from of

IDS allows a character to be described by different sequences • One IDS should describe only one character, yet one character can be described by different IDSs.

IDS describes ideographic character composition at the abstract level. It indicates the relative positions of the components, but does not indicate the proportions. • Not intended for rendering. • Nesting is natural in ideographs and they are reflected in in the IDS scheme

Components • Ideographic Components(IRG definition): units which can be used to represent ideographs. These components consist of ideographs proper coded in ISO 10646 (BMP) and some basic elements used to form ideographs. • Radicals(IRG definition): those ideographic components listed in index pages of KX1, DKW, DJW, HYD. • ISO extensions: • Radicals • Components

28 from GBK and more from IRG ISO IRG component sample

Extending the Objectives of IDCs • Using coded characters to describe not yet code ideographs both for representation and exchange • Limit standardization to only modern characters, and not some rarely used characters • Learning of character composition(education) • Revealing substructures of ideograph characters • Description of ideograph variants

Examples • Given characters => IDS? • 忂䑑䔄䔴蠿 • Given a IDS => what are the characters • 莫言 • 艹旲言 • Is the following a legal IDS? • 莫言艹旲

Conclusion • IDCs are introduced in Unicode 3.0 • The use is going beyond the original objective • Applications based on the IDCs were already developed such as in the the Hong Kong Glyph Specification. • IDCs should also useful in ideograph variant specifications • Additional search site: http://glyph.iso10646hk.net/ccs/ccs.jsp?lang=zh_TW

Unicode Support Status in Various Platforms

Unicode Support Status in Various Platforms

Presentation Transcript

Deploying Windows XP Professional Robert Cameron Enterprise Platforms Support Microsoft Corporation

Using Internet Connection Sharing in Windows® 2000 Tod Edwards Support Engineer Microsoft Platforms Support Microsoft Co

Terminal Servers on 64-bit Windows Platforms

Unicode Support for Mathematics

Unicode and Windows XP

Introduction to Microsoft Windows Movie Maker Steve Stoker Support Professional MPSD Microsoft Corporation

Windows 2000 The Fruits of Unicode

Microsoft EMEA Retail Technology Conference 2004 Microsoft Point Of Service Platforms

Microsoft Support ending April 8, 2014

Windows Language Support

Supplementary Character Support in Microsoft Products

Unicode and Windows XP

A Case Study on Unicode Integration: Microsoft Visual Basic, v 3.0 to v. 7.0

In this session, you will learn to: Install Microsoft Windows.

Counting Letters in an Unicode String

Unicode and Keyboards on Windows

Unicode and Collation Support in Microsoft SQL Server

Unicode and Windows XP

Windows

Windows 10 support and windows 8 assistance

Reliable Outlook Support Service Center in Dubai

Unicode Support for Mathematics