1 / 25

Unicode support status in various platforms (Microsoft Windows)

Unicode support status in various platforms (Microsoft Windows). Windows 9x / ME Do not support Unicode internally Limited Unicode APIs are supported. Unicode applications compiled with Microsoft Layer for Unicode can be run on Win9x Use code page to support different encodings

maryawilson
Download Presentation

Unicode support status in various platforms (Microsoft Windows)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Unicode support status in various platforms (Microsoft Windows) • Windows 9x / ME • Do not support Unicode internally • Limited Unicode APIs are supported. • Unicode applications compiled with Microsoft Layer for Unicode can be run on Win9x • Use code page to support different encodings • Windows NT / 2000 / XP • Support Unicode • Use of wide char (fixed 2 bytes) • Use UCS-2

  2. Unicode support status in various platforms (Linux & Mac OS) • Linux • Newer Kernel supports Unicode • Requires glibc 2.2.2 and XFree86 4.0.3 or newer • Use UTF-8 in most case, e.g. filesystem • Set locale to <lang>_<place>.<encoding>, e.g. zh_TW.utf8 • Enable UTF-8 support in console by executing unicode_start • Mac OS • Mac OS 9.1, Mac OS X support Unicode • 16-bit for Unicode character

  3. What is a code page • There are a lot of different encodings, e.g. EUC-TW, Big5, Latin-1 etc. • A code page (code page identifier) is a number to identify a codeset. • e.g. 950 – Traditional Chinese (Big5) • e.g. 1252 – Windows Latin-1 • Other code page identifiers can be found in: http://msdn.microsoft.com/library/en-us/intl/unicode_81rn.asp • In Windows NT/2000/XP, code page conversion table provides information to convert between different encodings.

  4. Java • Java is in Unicode internally. The supported encoding sets are provided by Java library packages rt.jar and i18n.jar • The supported encoding sets for java.io.*, java.lang.* and java.nio.* API can be found in: http://java.sun.com/j2se/1.4/docs/guide/intl/encoding.doc.html • User Input/Output will be automatically convert between Unicode and System code page • Specify the encoding of the source files when compiling. • javac –encoding <encoding> <source files> • Convert to other supported encoding: e.g. byte[] utf8Bytes = str.getBytes(“UTF-8”); • Convert from other supported encoding: e.g. String str = new String(utf8Bytes, “UTF-8”);

  5. Code Conversion • Generally codeset conversion cannot provide one-to-one mapping(unless the two character sets are exactly the same) • Unicode is a superset of every existing national standard => guaranteed round-trip conversion • Round-trip conversion: Suppose a file file1 in codeset A is converted to a file file2 in codeset B and then converted back to codeset A with a file name file3. If file3=file1, we say that codeset B guarantees round-trip conversion for A.

  6. Java Code conversion Conversion from multibyte to Unicode Byte[ ]my_data = { 0xA4, 0x40} String my_unicode_data = newString(my_data,”big5”) Where “big5” is the name of the multibyte code name. Unicode needs this to do code conversion to: Conversion from Unicode to multibyte String my_unicode_data =“\u4E00” (一) Byte[]my_b5_data=my_unicode_data.getBytes(“Big5”) My_b5_data will have the value of 0xA440 Byte[]my_gb_data= my_unicode_data.getBytes(“GBK”) My_gb_data will have the value of D2BB

  7. Text stream import FileI= newFile(“input”); FileInputStreamtmpin = newFileInputStream(I); BufferedReaderin = newBufferedReader( newInputStreamReader(tmpin, “Big5”)); Once the BufferedReaderin is established, data can be read using the readLine() method. inputStr = in.readLine();

  8. Text Stream Export Fileo = newFile(“output.big5”); FileOutputStreamtmpout = newFileOutputStream(o); BufferedWriter out = new BufferedWriter(new OutputStreamWriter(tmpout, “Big5”)); • …. • Out.println(“\u6CB3\u8C5A”); (“河豚”) • Out.close(); • 0xAA65 0xB362

  9. Multilingual applications • Software teaching Chinese for English people • Software teaching English for Chinese • Conceptually separate two types of data in a multilingual application: • Data related to display of menu/instructions, • Data related to the processing in the program • Multilingual application vs. I18n applications • I18N: data related to display and processing are the same and it is for the same language/convention • Multilingual applications: Data related to display is for one language(and can be internationalized). Data related to the processing can be multilingual and not necessarily related to the display language. • Unicode is the most convenient encoding for multilingual applications, but not absolutely necessary

  10. The Ideographic Composition Scheme Used in ISO 10646 • Introduction to Ideograph Description Characters(IDCs) • The ideographic composition scheme • Application using IDCs

  11. What are Ideograph Description Characters • 12 structure symbols used to describe the formation of characters using some smaller ideograph functional units such as character components⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻

  12. Characteristics of Ideographs • Ideograph characters are often formed by smaller ideographic elements such as Radicals, ideographs proper, and other ideographic components which we generally call ideograph components • Natural in the formation of characters • Examples: 2 components => Chinese uses components has long been using components to describe characters, especially characters with the same pronunciation

  13. Problems with ideograph Character Encoding • Each character is treated as a different symbol, and thus given a codepoint • Codepoint assignment in a block does try to follow radical order, but codepoint assignment does not consider the substructures(components). Thus such information is not revealed. • When new character is created, codepoint allocation is needed in new blocks, thus radical order cannot be globally maintained. • Also there is a potentially endless standardization process • Encoding of rarely used ideograph characters is a waste of resource both in terms of code space and also standardization effort

  14. Introduction of IDCs • Work started in 1995 by ISO/IEC SC2/WG2/IRG in 1995 • Objective of the Original proposal: use coded ideographs and “structure symbols” to describe not yet coded ideographs. • Original proposal has 15 “Ideograph Structure Symbols” base on study on Han characters, three of them didn’t make it to ISO 10646/Unicode: • Ideograph_Proper(日): Every coded character is considered ideograph proper, thus not needed • Left_Up_Encompass: no un-encoded example • Mirror_Symmetry(非): left being mirrored to the right, but can be describe by Left_to_Right • Renames the 12 symbols as Ideograph Description Characters

  15. Ideographic Composition Scheme • IDS describes a character using its components and indicating the relative positions of the components. • IDCs are considered operators to the components. • IDSs can be expressed by a context free grammar through the Backus Naur Form(BNF). The grammar G has four components: • Let G = {, N, P, S}, where • : the set of terminal symbols— coded radicals, coded ideographs, and the 12 IDCs. • N:the set of 5 non-terminal symbols N={IDS, IDS1, Binary_Symbol, Ternary_Symbol, Ideograph_Component} • S = {IDS}, which is the start symbol of the grammar • P: a set of rewrite rules

  16. The following is the set of rewriting rules P: • IDS::=<Binary_Symbol><IDS1><IDS1>|<Ternary_Symbol> <IDS1><IDS1><IDS1> • <IDS1> ::= <IDS> | <Ideograph_Component> • <Ideograph_Component>::= coded_ideograph | coded_radical | coded_component • <Binary-Symbol> ::=⿰|⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹ | ⿺ | ⿻ • <Ternary_Symbol> ::= ⿲ | ⿳ • Note that even though the IDCs are terminal symbols, they are not part of the ideograph components.

  17. Examples

  18. IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID (IDC-OLD , ⿻): • The IDS introduced by IDC-OLD describes the abstract form of the ideograph with D1 and D2 overlaying each other. • ⿻从工is an example of an IDS which represents the abstract from of 巫 • IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT(IDC-SUR, ⿹): • The IDS introduced by IDC-SUR describes the abstract form of the ideograph with D1 on the right top corner of D2, and D2 is encompassed by D1. • ⿹is an example of an IDS which represents the abstract from of

  19. IDS allows a character to be described by different sequences • One IDS should describe only one character, yet one character can be described by different IDSs.

  20. IDS describes ideographic character composition at the abstract level. It indicates the relative positions of the components, but does not indicate the proportions. • Not intended for rendering. • Nesting is natural in ideographs and they are reflected in in the IDS scheme

  21. Components • Ideographic Components(IRG definition): units which can be used to represent ideographs. These components consist of ideographs proper coded in ISO 10646 (BMP) and some basic elements used to form ideographs. • Radicals(IRG definition): those ideographic components listed in index pages of KX1, DKW, DJW, HYD. • ISO extensions: • Radicals • Components

  22. 28 from GBK and more from IRG ISO IRG component sample

  23. Extending the Objectives of IDCs • Using coded characters to describe not yet code ideographs both for representation and exchange • Limit standardization to only modern characters, and not some rarely used characters • Learning of character composition(education) • Revealing substructures of ideograph characters • Description of ideograph variants

  24. Examples • Given characters => IDS? • 忂 䑑 䔄 䔴 蠿 • Given a IDS => what are the characters • 莫言 • 艹旲言 • Is the following a legal IDS? • 莫言 艹旲

  25. Conclusion • IDCs are introduced in Unicode 3.0 • The use is going beyond the original objective • Applications based on the IDCs were already developed such as in the the Hong Kong Glyph Specification. • IDCs should also useful in ideograph variant specifications • Additional search site: http://glyph.iso10646hk.net/ccs/ccs.jsp?lang=zh_TW

More Related