Must Know about Unicode

Vinson Hsieh

If you don't know what encoding the string you've been handed is in, you really shouldn't be writing code until you do.
How the world evolved
  • When Unix was being invented and K&R (Brian Kernighan and Dennis Ritchie) were writing The C Programming Language, everything was very simple. 
  • The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII, which was able to represent every character using a number between 32 and 127. This could conveniently be stored in 7 bits.
  • Codes below 32 were called unprintable. They were used for control characters, like 7, which made your computer beep, and 12, which caused the current page of paper to go flying out of the printer and a new one to be fed in.
ASCII

The lower 128 (codes 0-127) are the most often used codes. Early email systems in fact would only allow you to transmit characters 0-127 (i.e. "7-bit text")
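As a small illustration of the 7-bit range and the control codes mentioned above, here is a sketch that assumes Python 3 (where `str` is Unicode text and `bytes` is raw data). The helper name `is_seven_bit` is made up for this example.

```python
# Assumes Python 3: str holds Unicode text, bytes holds raw bytes.
text = "Hello"
data = text.encode("ascii")

# Every ASCII code fits in 7 bits (0-127).
codes = list(data)
all_seven_bit = all(code < 128 for code in codes)

# Codes below 32 are control characters: 7 is BEL (beep), 12 is form feed.
bell = chr(7)
form_feed = chr(12)

def is_seven_bit(raw: bytes) -> bool:
    """Return True if every byte is in the ASCII range 0-127 (hypothetical helper)."""
    return all(b < 128 for b in raw)

print(codes, all_seven_bit)
```

A check like `is_seven_bit` is roughly what a "7-bit text" mail system would have needed to enforce.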

Plain text = ASCII = Characters 8 bits
  • Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare.
  • Lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.

The IBM-PC had something that came to be known as the OEM character set, which provided some accented characters for European languages and a bunch of line-drawing characters (used to draw tables in the DOS era).

IBM PC Code Page 850

Buying PCs outside of America
  • For example, on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג). In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.
ANSI standard
  • Eventually this OEM free-for-all got codified in the ANSI standard.
  • Everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived.
  • These different systems (one per country or organization) were called code pages.
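The code-page problem can be reproduced directly. The sketch below assumes Python 3 and uses two DOS code pages that ship with Python's standard codecs: `cp437` (the original IBM PC set) and `cp862` (Hebrew). The choice of these two pages is an assumption made for illustration; the slides do not name them here.

```python
raw = bytes([130])  # the single byte 0x82

as_cp437 = raw.decode("cp437")   # original IBM PC OEM code page
as_cp862 = raw.decode("cp862")   # Hebrew DOS code page

# Same byte, two different characters depending on the active code page.
print(repr(as_cp437), repr(as_cp862))

# Below 128, both code pages agree with ASCII.
same_low = b"A".decode("cp437") == b"A".decode("cp862") == "A"
```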
DBCS
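  • Asian alphabets have thousands of letters, which could never fit into 8 bits.
  • This was usually solved by the messy system called DBCS, the "double byte character set".
  • In Visual C++, MBCS always means DBCS.
  • Two bytes give 65,536 possible values, enough for more than sixty thousand characters.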


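The three Qin-dynasty primers 《倉頡》, 《博學》, and 《爰歷》 together contained 3,300 characters. In the Han dynasty, Yang Xiong's 《訓纂篇》 had 5,340, and Xu Shen's 《説文解字》 had 9,353. After the Jin and Song periods the number of characters kept growing. According to the chapter on writing in Feng Yan's Tang-dynasty 《聞見記》, Lü Chen's 《字林》 (Jin) had 12,824 characters, Yang Chengqing's 《字統》 (Later Wei) had 13,734, and Gu Yewang's 《玉篇》 (Liang) had 16,917. Sun Qiang's enlarged edition of 《玉篇》 in the Tang dynasty had 22,561. Sima Guang's 《類篇》 in the Song dynasty reached 31,319, and the Qing-dynasty 《康熙字典》 had more than 47,000.

In modern times, 《中華大字典》 (1915, Ouyang Bocun et al.) has more than 48,000 characters; 《大漢和辭典》 (1959, Morohashi Tetsuji, Japan) has 49,964; 《中文大辭典》 (1971, edited by Zhang Qiyun) has 49,888; 《漢語大字典》 (1990, edited by Xu Zhongshu) has 54,678; and 《中華字海》 (1994, Leng Yulong et al.) has an astonishing 85,000.

Fortunately, the vast majority of the characters collected in dictionaries like 《中華字海》 are "dead characters": they existed historically but are no longer used in today's written language.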

Shift-JIS Kanji Table

Multibyte character sets take advantage of the fact that only the first 128 characters of the ASCII set are commonly used (codes 0-127 in decimal, or 0x00-0x7f in hex). When parsing Shift-JIS, if you get a lead byte in the upper half (0x81-0x9f or 0xe0-0xfc in Microsoft's code page 932 variant), you know it is the first byte of a two-byte sequence. Bytes 0xa1-0xdf are single-byte half-width katakana, and anything else is a single byte of regular ASCII.

http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
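The lead-byte rule can be sketched as a small splitter. This assumes Python 3 and assumes the lead-byte ranges 0x81-0x9F and 0xE0-0xFC (the ranges used by Microsoft's code page 932 variant of Shift-JIS); `split_sjis` is a hypothetical helper written for this example, not part of any library.

```python
def split_sjis(raw: bytes) -> list:
    """Split Shift-JIS bytes into one chunk per character (hypothetical helper)."""
    chunks = []
    i = 0
    while i < len(raw):
        b = raw[i]
        # Assumed lead-byte ranges for two-byte characters; all other
        # bytes are treated as single-byte characters.
        width = 2 if (0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC) else 1
        chunks.append(raw[i:i + width])
        i += width
    return chunks

text = "ABC日本語"
raw = text.encode("shift_jis")
chunks = split_sjis(raw)
print([c.hex() for c in chunks])
```

Note that the splitter has to start at the beginning of the data: a byte picked from the middle cannot be classified without knowing what came before it.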


Character based applications use whichever code page is set as the active "OEM" (aka "MS-DOS") code page and Win32 applications use whichever code page is set as the active "ANSI" code page. (Note that Windows "ANSI" code pages do not necessarily map to official ANSI standard character sets.)

cp437

http://www.sqlsnippets.com/en/topic-13410.html

Python Win32 Console (DBCS)

Big 5 Code Table

我愛你 should be \xa7\xda\xb7\x52\xa7\x41

Row A7D0, columns 0-a: 役 忘 忌 志 忍 忱 快 忸 忪 戒 我 (so 我 = A7DA)

Row B750, columns 0-2: 感 想 愛 (so 愛 = B752)

Row A740, columns 0-1: 作 你 (so 你 = A741)

In ASCII, 0x52 = R and 0x41 = A, so the console shows \xa7\xda\xb7R\xa7A

(Byte values up to 7F map to ASCII 0-127.)

It looks as if \x combines the two hex digits that follow it into a single byte.
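The console behaviour described on this slide can be reproduced as follows. This is a sketch assuming Python 3 and its built-in `big5` codec; the original slide was presumably showing a Python 2 console, where the byte string is displayed the same way.

```python
text = "我愛你"
raw = text.encode("big5")

# repr() shows printable ASCII bytes as characters, so 0x52 appears as R
# and 0x41 as A, while the other bytes stay as \x escapes.
print(raw)

# Group the bytes back into two-byte Big5 codes for comparison with the table.
pairs = [raw[i:i + 2].hex().upper() for i in range(0, len(raw), 2)]
print(pairs)
```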

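The "許功蓋" problem (DBCS)

Most commonly affected characters: 功 餐 許 蓋 閱. Next most common: 擺 珮 豹 枯 淚 穀 愧

http://www.khngai.com/chinese/charmap/tblbig.php

ASCII 0x5C == "\". These Big5 characters have 0x5C as their second byte:

A45C 么, AE5C 娉, B85C 稞, C25C 擺, A55C 功, AF5C 珮, B95C 鈾, C35C 黠, A65C 吒, B05C 豹

BA5C 暝, C45C 孀, A75C 吭, B15C 崤, BB5C 蓋, C55C 髏, A85C 沔, B25C 淚, BC5C 墦, C65C 躡

A95C 坼, B35C 許, BD5C 穀, AA5C 歿, B45C 廄, BE5C 閱, AB5C 俞, B55C 琵, BF5C 璞, AC5C 枯

B65C 跚, C05C 餐, AD5C 苒, B75C 愧, C15C 縷

ASCII 0x7C == "|". These Big5 characters have 0x7C as their second byte:

AA7C 泜, B47C 揉, A87C 育, BE7C 魯, B27C 琍, BC7C 慝, C67C 鸛, A97C 尚, B37C 逖, BD7C 罵

A77C 坑, B17C 悴, BB7C 誡, C57C 疊, A67C 帆, B07C 院, BA7C 漏, C47C 辮, AB7C 咽, B57C 稅

BF7C 糕, AC7C 洱, B67C 閏, C07C 嚐, AD7C 迢, B77C 會, C17C 舉, A47C 弋, AE7C 徑, B87C 腮

C27C 甕, A57C 四, AF7C 砝, B97C 頌, C37C 牘

Python turns '\' into '\\', which is nice, because it can be converted back to 5C.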
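To see why these characters cause trouble, the sketch below (assuming Python 3 and its `big5` codec) prints the trail byte of each one and then shows how a byte-level operation that treats 0x5C as a backslash damages the text. The "strip backslashes" step is only a stand-in for whatever byte-oriented tool mishandles the data; it is not taken from the slides.

```python
for ch in "許功蓋":
    lead, trail = ch.encode("big5")
    print(ch, f"{lead:02X}{trail:02X}", repr(chr(trail)))

trails = [ch.encode("big5")[1] for ch in "許功蓋"]

raw = "功能".encode("big5")

# Byte-level processing sees the second byte of 功 as a backslash and removes it.
naive = raw.replace(b"\\", b"")

# Decoding to Unicode first keeps the character intact.
safe = raw.decode("big5").replace("\\", "").encode("big5")
```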

Shift-JIS Kanji Table 5C/7C

http://www.chi2ko.com/jingyan/shiftjis2uni.htm

What about moving strings to another PC?
  • Of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down.
  • The Win95/98 era
Windows 98

It has a 16-bit Windows heritage

– Almost everything uses ANSI strings

Unicode
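  • Unicode is only a standard for characters and their code points; it does not define how they are actually stored on a computer. The Unicode Consortium therefore defined a set of transformation formats for storing Unicode code points, designed with compatibility with other encodings in mind. These are called UTF (Unicode/UCS Transformation Format): UTF-8, UTF-16, and UTF-32.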
Unicode Code Point Chart
  • U+0000 to U+007F: Basic Latin
  • U+0080 to U+00FF: Latin-1 Supplement
  • U+0100 to U+017F: Latin Extended-A
  • U+0180 to U+024F: Latin Extended-B
  • U+0250 to U+02AF: IPA Extensions
  • U+02B0 to U+02FF: Spacing Modifier Letters
  • U+0300 to U+036F: Combining Diacritical Marks
  • U+0370 to U+03FF: Greek and Coptic
  • U+0400 to U+04FF: Cyrillic
  • U+0500 to U+052F: Cyrillic Supplement
  • U+0530 to U+058F: Armenian
  • U+0590 to U+05FF: Hebrew
  • U+0600 to U+06FF: Arabic
  • U+0700 to U+074F: Syriac
  • U+0750 to U+077F: Arabic Supplement
  • U+0780 to U+07BF: Thaana
  • U+0900 to U+097F: Devanagari

http://inamidst.com/stuff/unidata/

Unicode terminology

notation U+NNNN

uni = {U+03A0} + {U+03A3} + {U+03A9} 

(ΠΣΩ)
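The U+NNNN notation maps directly onto Python's `ord` and `chr`. A minimal sketch, assuming Python 3; `u_notation` is a hypothetical helper name.

```python
def u_notation(ch: str) -> str:
    """Format one character as U+NNNN (hypothetical helper)."""
    return f"U+{ord(ch):04X}"

# Build the string from its code points, as in the pseudocode above.
uni = chr(0x03A0) + chr(0x03A3) + chr(0x03A9)
labels = [u_notation(ch) for ch in uni]
print(uni, labels)
```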

Now, even though we know exactly what 'uni' represents (ΠΣΩ), note that there is no way to:

  • Print uni to the screen.
  • Save uni to a file.
  • Add uni to another piece of text.
  • Tell me how many bytes it takes to store uni.

Valid Coding of Ω

You should think of Unicode as symbols (Ω), not as bytes.
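One symbol can have several valid byte encodings. The sketch below, assuming Python 3, encodes Ω with four standard codecs; the particular list of codecs is chosen for illustration.

```python
omega = "\u03a9"  # Ω

encodings = ["utf-8", "utf-16-be", "utf-32-be", "iso-8859-7"]
encoded = {name: omega.encode(name) for name in encodings}

# Same symbol, different bytes and different lengths.
for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}  ({len(raw)} bytes)")
```

All four byte sequences are equally correct ways to write Ω; which one is right depends only on what the receiving side expects.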

Converting Unicode symbols to Python literals

Pseudocode:

uni = 'abc_' + {U+03A0} + {U+03A3} + {U+03A9} + '.txt'

Here is how you make that string in Python:

uni = u"abc_\u03a0\u03a3\u03a9.txt"

Pseudocode:

uni = {U+1A} + {U+B3C} + {U+1451} + {U+1D10C}

Python:

uni = u'\u001a\u0b3c\u1451\U0001d10c'
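The escape forms can be checked in Python 3, where every string literal is already Unicode and the `u` prefix is optional. This sketch also uses `\N{...}` names, which the slides do not mention; they are included only as an alternative spelling of the same code points.

```python
import unicodedata

uni = "abc_\u03a0\u03a3\u03a9.txt"
same = (
    "abc_"
    "\N{GREEK CAPITAL LETTER PI}"
    "\N{GREEK CAPITAL LETTER SIGMA}"
    "\N{GREEK CAPITAL LETTER OMEGA}"
    ".txt"
)

# \u takes exactly four hex digits; code points above U+FFFF need \U with eight.
astral = "\U0001d10c"

print(len(uni), len(astral), unicodedata.name(uni[4]))
```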
Codecs
  • Unicode objects have no fixed computer representation.
  • Before a Unicode object can be printed, stored to disk, or sent across a network, it must be encoded into a fixed computer representation. This is done using a codec. Some popular codecs you may have come across in day-to-day work: ASCII, ISO-8859-7, UTF-8, UTF-16.
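A round trip through a few of these codecs might look like the sketch below (Python 3 assumed). It also shows what happens when a codec cannot represent the text: ASCII has no Greek letters, so encoding raises `UnicodeEncodeError`.

```python
uni = "abc_\u03a0\u03a3\u03a9.txt"

as_utf8 = uni.encode("utf-8")
as_greek = uni.encode("iso-8859-7")

# ASCII cannot represent the Greek letters.
try:
    uni.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with the same codec recovers the original text.
round_trip = as_utf8.decode("utf-8") == uni == as_greek.decode("iso-8859-7")
print(len(as_utf8), len(as_greek), ascii_ok, round_trip)
```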
The right way to think about conversion
  • Converting between ANSI and Unicode
  • Big5 → Unicode → UTF-8/16/32
  • UTF-8/16/32 → Unicode → Big5
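The conversion path above always goes through Unicode in the middle. A minimal sketch in Python 3, where `transcode` is a hypothetical helper and the sample bytes are the Big5 encoding of 我愛你:

```python
def transcode(raw: bytes, source: str, target: str) -> bytes:
    """Decode to Unicode first, then encode to the target (hypothetical helper)."""
    return raw.decode(source).encode(target)

big5_bytes = b"\xa7\xda\xb7\x52\xa7\x41"   # 我愛你 in Big5

utf8_bytes = transcode(big5_bytes, "big5", "utf-8")
back = transcode(utf8_bytes, "utf-8", "big5")
print(utf8_bytes, back == big5_bytes)
```

There is deliberately no direct Big5-to-UTF-8 step: each legacy encoding only needs a mapping to and from Unicode.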
Unicode character plane mapping

http://zh.wikipedia.org/wiki/%E5%9F%BA%E6%9C%AC%E5%A4%9A%E6%96%87%E7%A8%AE%E5%B9%B3%E9%9D%A2#.E5.9F.BA.E6.9C.AC.E5.A4.9A.E6.96.87.E7.A7.8D.E5.B9.B3.E9.9D.A2

UTF-32 (always 4 bytes)
  • UTF-32: each Unicode code point is represented directly by a single 32-bit code unit.
  • UTF-32 is restricted to representation of code points in the range 0..10FFFF (hexadecimal), that is, the Unicode codespace.
  • UTF-32 may be a preferred encoding form where memory or disk storage space for characters is no particular concern, but where fixed-width, single code unit access to characters is desired. UTF-32 is also a preferred encoding form for processing characters on most Unix platforms.
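Fixed width is what makes indexing trivial in UTF-32. The sketch below assumes Python 3 and uses the `utf-32-be` codec so that no byte order mark is added; `code_point_at` is a hypothetical helper.

```python
text = "A\u03a9\u6211\U0001d10c"
raw = text.encode("utf-32-be")   # -be avoids the byte order mark

def code_point_at(raw32: bytes, index: int) -> int:
    """Read the index-th code point directly from big-endian UTF-32 bytes."""
    return int.from_bytes(raw32[4 * index:4 * index + 4], "big")

# Four bytes per code point, regardless of which character it is.
print(len(raw), hex(code_point_at(raw, 2)))
```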
UTF-16 (2 or 4 bytes)

Code points in the range U+0000..U+FFFF are represented as a single 16-bit code unit; code points in the supplementary planes, in the range U+10000..U+10FFFF, are instead represented as pairs of 16-bit code units. These pairs of special code units are known as surrogate pairs.
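The surrogate-pair arithmetic can be worked through by hand. This sketch, assuming Python 3, follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); `surrogate_pair` is a hypothetical helper name.

```python
def surrogate_pair(code_point: int):
    """Return the (high, low) UTF-16 code units for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1D10C)
print(hex(high), hex(low))

# Compare with what the built-in codec produces.
from_codec = "\U0001d10c".encode("utf-16-be")
```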

UTF-8 (1 to 4 bytes)
  • The UTF-8 encoding form maintains transparency for all of the ASCII code points (0x00..0x7F). That means Unicode code points U+0000..U+007F are converted to single bytes 0x00..0x7F in UTF-8.
  • Code points U+0080..U+07FF are represented by two bytes. All non-surrogate code points between U+0800 and U+FFFF are represented by three bytes, and supplementary code points above U+FFFF require four bytes.
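The byte counts for each range can be confirmed with one sample character per range. A short sketch assuming Python 3; the sample characters are chosen for illustration.

```python
samples = ["A", "\u03a9", "\u6211", "\U0001d10c"]

# Map each code point to the number of bytes UTF-8 needs for it.
lengths = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}
print(lengths)

# ASCII text is byte-for-byte identical in UTF-8.
ascii_transparent = "plain text".encode("utf-8") == "plain text".encode("ascii")
```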

Unihan (Han unification) merges Chinese, Japanese, and Korean ideographs and places them in the ranges U+3400 to U+9FFF and U+F900 to U+FAFF.

Windows 2000 and Unicode

All of the core functions (creating windows, displaying text, string manipulation) require Unicode strings

If you don't use Unicode from the start, your application uses more memory and runs slower, because ANSI strings must first be converted to Unicode

Windows CE and Unicode

The machines were going to be sold all over the world

– Windows CE is natively Unicode

A machine with little memory and no disk storage

– The ANSI Windows APIs are not supported

Since Windows XP, it is recommended that developers build all their applications using the Unicode versions of the APIs. But you may say, "If I do that, my application will not run under Windows 95, 98, and ME, because those Windows versions do not support the Unicode APIs." This is where the Microsoft Layer for Unicode (MSLU) comes in. MSLU is contained in a DLL called "unicows.dll". It is redistributable, so the intention is that you ship it with your executable and place it in the same folder.

MultiByteToWideChar

http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx

Glyph Rendering
  • Automatic context analysis: There is only one key for Arabic "b". The system automatically selects whether the isolate, initial, medial or final form of "b" is appropriate, and changes this if you e.g. add another character afterwards. Notice that only the letter value "b" is stored on disk, not the form: this is only selected dynamically on display.

http://www.smi.uib.no/ksv/ArabicMac.html#uni

Writing Direction (bidirectional)

letters, punctuation, symbols, and diacritics

In Hebrew and Arabic, characters are arranged from right to left into lines, although digits run the other way, making the scripts inherently bidirectional. Left-to-right and right-to-left scripts are frequently used together. In such a case, arranging characters into lines becomes more complex. The Unicode Standard defines an algorithm to determine the layout of a line. See Unicode Standard Annex #9, “The Bidirectional Algorithm,” for more information.
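The bidirectional algorithm works from a direction class assigned to every character. The sketch below only looks up those classes with Python 3's standard `unicodedata` module; it does not implement the layout algorithm itself, and the sample characters are chosen for illustration.

```python
import unicodedata

samples = {
    "a": "Latin letter",
    "\u05d0": "Hebrew alef",
    "\u0628": "Arabic beh",
    "1": "European digit",
}

# L = left-to-right, R = right-to-left, AL = Arabic letter, EN = European number.
classes = {ch: unicodedata.bidirectional(ch) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {classes[ch]}")
```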

http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G18421

Sequence of Base Characters and Diacritics

The sequence of Unicode characters U+0061 “a” + U+0308 (combining diaeresis) + U+0075 “u” unambiguously encodes “äu”, not “aü”, because a combining mark applies to the base character that precedes it.
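Normalization makes the attachment rule visible. In this sketch (Python 3, standard `unicodedata` module), composing the sequence with NFC merges the diaeresis into the preceding "a", never into the following "u".

```python
import unicodedata

decomposed = "a\u0308u"   # a + combining diaeresis + u
composed = unicodedata.normalize("NFC", decomposed)

# The mark combined with the base character before it.
print([f"U+{ord(ch):04X}" for ch in composed])

# NFD splits the precomposed letter back into base + mark.
again = unicodedata.normalize("NFD", composed)
```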


Unicode Bidirectional Algorithm

http://unicode.org/reports/tr9/


我: \u6211

愛: \u611b

你: \u4f60

http://blog.163.com/guoo1230@126/blog/static/321155112011328102542586/

Why?

  • U+0000 to U+007F: Basic Latin
  • U+0370 to U+03FF: Greek and Coptic
  • U+1400 to U+167F: Unified Canadian Aboriginal Syllabics
  • U+4E00 to U+9FFF: CJK Unified Ideographs
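The code points listed for 我愛你 can be checked against the CJK Unified Ideographs range. A sketch assuming Python 3; the `unicode_escape` codec is used here only to reproduce the `\uNNNN` form a console would display.

```python
text = "我愛你"

points = [f"U+{ord(ch):04X}" for ch in text]

# All three characters fall inside the CJK Unified Ideographs block.
in_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in text)

# The escaped form of the string, as a console would show it.
escaped = text.encode("unicode_escape").decode("ascii")
print(points, in_cjk, escaped)
```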

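UTF encodings have one advantage: although the number of bytes per character varies, you do not have to scan from the start of the text to locate characters correctly, as you do with GB2312/GBK. With a UTF encoding, a fairly fixed rule tells you from the current position whether the current byte starts or ends a code point, so locating characters is relatively simple. UTF-32 makes this simplest of all, since no character location is needed at all, but the size grows considerably in exchange.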
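For UTF-8, the "fixed rule" is that continuation bytes always have the bit pattern 10xxxxxx, so a reader can back up from any position to the start of the current code point. A minimal sketch in Python 3; `code_point_start` is a hypothetical helper.

```python
def code_point_start(raw: bytes, pos: int) -> int:
    """Back up from pos to the first byte of the code point containing it."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    while pos > 0 and (raw[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

raw = "A我愛".encode("utf-8")   # 1 + 3 + 3 bytes

# For every byte offset, find where its code point begins.
starts = [code_point_start(raw, i) for i in range(len(raw))]
print(starts)
```

No such local rule exists for Big5 or GBK, where a byte such as 0x5C can be either a whole character or the second half of one.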