
Multilingual Computing




  1. Multilingual Computing
     • Dr. Lu Qin (陸勤), csluqin@comp, Rm PQ 814, ext. 7247
     • Course material online: www.comp.polyu.edu.hk/~csluqin/comp341
     • Lecture notes available: Friday 14:30 of the previous week
     • Lab/tutorial handouts: Friday 14:30 of the previous week
     • Schedule and announcements online
     • Office hours: 2:30 – 3:30 Tues, 2:30 – 3:30 Thurs
     • Labs: Mr. Joe Lam, Tel. 2766 7330, Rm QT406, email: cscwlam@comp.polyu.edu.hk
     • Textbook: CJKV Information Processing, by Ken Lunde, O'Reilly, 1999

  2. Teaching and Assessment
     • Lectures (fundamentals)
       • Introduction
       • Characteristics of different languages (scripts)
       • Computer representations
       • Input processing and output processing
       • Information processing techniques:
         • Open systems
         • Internationalization and localization
         • Algorithms
         • Software development for multilingual environments
         • Introduction to natural language processing
     • Tutorials/labs (gain experience in using some common Chinese operating systems and programming), http://www4.comp.polyu.edu.hk/~cscwlam/cc/
       • MS Chinese Windows
       • Different programming environments
     • Assessment: 60% final, 15% midterm, 20% project and homework (15% + 5%), 3% class participation and 2% punctuality

  3. What is Multilingual Computing
     • Computer processing of data in more than one language or script, including any human-computer interaction in which communication takes place
     • Bilingual and trilingual vs. multilingual
     • Fundamental issues:
       • Each language has its own characteristics, and handling them requires expert knowledge of that language. Example: count the number of words in “Multilingual Computing” vs. “多語言文字處理技術”
       • Ways to distinguish different scripts
       • How can a system be designed so that it can be used with different languages with minimal changes?
       • How can a system be designed so that it can be used for multiple languages?
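The word-counting example on this slide can be tried directly. The following is a minimal sketch, assuming a naive whitespace-based counter (the function name `count_words_by_whitespace` is illustrative, not from the course material). It shows why a rule that works for English fails for Chinese, where words are not delimited by spaces.

```python
def count_words_by_whitespace(text: str) -> int:
    """Count words by splitting on whitespace (adequate for English only)."""
    return len(text.split())


english = "Multilingual Computing"
chinese = "多語言文字處理技術"

# English words are delimited by spaces, so the naive count is right.
print(count_words_by_whitespace(english))  # 2

# The Chinese phrase contains no spaces, so it is reported as one "word",
# although it consists of several words written with 9 characters.
print(count_words_by_whitespace(chinese))  # 1
print(len(chinese))                        # 9
```

Segmenting the Chinese phrase into actual words needs language-specific knowledge (for example a dictionary), which is exactly the "expert knowledge" issue raised above.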

  4. Different Scripts (Written Languages)
     • English: fixed alphabet; words are naturally delimited by SPACE; more morphological changes, but very regular; more of a token-based language than a phonetic-based language; written from left to right. Example: auto, automatic, autonomous, automation, auto-movement; spelling is easy to do
     • Phonetic transcription systems: Pinyin, Jyutping (粵拼), International Phonetic Alphabet (IPA)
     • Korean: Hanja (漢字) are similar to Chinese characters; Hangul is a two-dimensional Pinyin-like system. In other words, Hangul is a phonetic script or phonetic transcription system.

  5. Korean Hangul • KA KEU NGOA SAN NUN KOAEN • Romanization: Using Roman letters to denote the phonetic transcriptions

  6. Japanese Kana
     • Hiragana (phonetic): can be used entirely without any Han characters, but is often used together with Han characters (Kanji); used for native Japanese and Chinese-derived words
     • Katakana (phonetic): used mainly for foreign words
     • Hiragana, Katakana and Han characters are all written either from left to right or from top to bottom
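One possible way to distinguish the scripts described in slides 4 to 6 is to look at each character's Unicode name. The sketch below assumes Unicode text and uses Python's standard `unicodedata` module; the helper name `script_of` and the short prefix table are illustrative assumptions, and the table covers only the scripts mentioned in these slides.

```python
import unicodedata

# Hypothetical prefix table: Unicode character names start with the script name.
_SCRIPT_PREFIXES = (
    ("CJK UNIFIED IDEOGRAPH", "Han"),
    ("HIRAGANA", "Hiragana"),
    ("KATAKANA", "Katakana"),
    ("HANGUL", "Hangul"),
    ("LATIN", "Latin"),
)


def script_of(ch: str) -> str:
    """Return a coarse script label for one character, or 'Other'."""
    name = unicodedata.name(ch, "")  # "" if the character has no name
    for prefix, label in _SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return label
    return "Other"


for ch in "a漢ひカ한":
    print(ch, script_of(ch))
```

A character-name lookup is only a rough heuristic; it says which script a character belongs to, not which language the text is in (Han characters, for instance, are shared by Chinese, Japanese and Korean).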

  7. The Chinese Language
     • General characteristics
       • Sino-Tibetan language family (漢藏語系)
       • Ideographic in nature (表意文字)
       • 50+ languages in the PRC
       • Hanyu is the official language
       • 7 major Hanyu dialects
     • Hanyu dialect similarities
       • Relatively unified writing system
       • Some dialect-specific characters and variant character writing
     • Hanyu dialect differences
       • Different pronunciation across dialects
       • Different words (e.g. 係 and 是)
       • Word-order reversal (e.g. 找尋 and 尋找)
       • Different expressions / grammar (e.g. 先坐 and 坐先)

  8. Chinese Characters
     • Graphemics (the look, 形)
       • Strokes (distribution 1-30+), radicals (214+), components (500+), characters (65,000+)
       • Stroke sequence order
       • Variant writing (e.g. 教 都)
     • Character formation
       • Bounded radicals and components, but an unbounded alphabet / character set (charset)
       • 6 principles: ideographic 象形 (火), objective 指事 (一 二), meaning 會意 (炎 旦), ideo-phonetic 形聲 (訪), borrowed 假借 (孰 熟), transitive 轉注 (考 老)

  9. Character Decomposition
     • The most basic elements of characters are strokes (筆畫). The basic strokes are 一 (橫, horizontal), 丨 (豎, vertical), 丿 (撇, left-falling), 丶 (點, dot) and 乛 (折, turning).
     • Chinese components (部件) are composed of strokes. A component can be considered a functional unit, and it can reflect the meaning, pronunciation and origin of a character
     • See http://glyph.iso10646hk.net
     • Chinese character variants (異體字): … and 鳥 for "bird", thus … and …

  10. Phonetics (the sound, 音)
     • Phoneme (音素, 單音): a contrastive unit of speech (e.g. bag and tag)
       • Vowels (元音) and consonants (輔音)
       • Putonghua: consonants are single, vowels can be double: b, p, m, f, a, o, e, ai (two phonemes)
       • Cantonese: kwok, cheung, ng
     • One character, one syllable: mono-syllabic
     • Tonal language: tone differentiates meaning
       • Putonghua: 5 tones
       • Cantonese: 9 tones(?)
     • Semantics (the meaning, 義)
       • Meaning may derive from the components of a character (e.g. 廳)
       • Single-character words have multiple meanings (樂)
       • Multi-character words usually have less ambiguity (快樂, 音樂)
     • Written from left to right and also from top to bottom
     • Pinyin system, Zhuyin system (only for learning characters, not a general reading tool)

  11. Character Set
     • A character set is a collection of characters. The set usually has a name, such as the KangXi character set. Usually, each character in a character set is unique: $C = \{c_i \mid 1 \le i \le n,\ c_i \text{ is a character}\}$
     • Computer processing of a character set requires that each character in the set is assigned a unique binary value
     • Encoding: the process of mapping a character to a numeric value
     • A coded character set, normally referred to as a codeset $CC$, can be considered a set of tuples: $CC = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE\}$
     • Here $code_i \ne code_j$ if $c_i \ne c_j$, and $CODE$ is normally a set of integers in binary form. $CODE$ is also called the code space
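The definition of a coded character set can be illustrated with a dictionary from characters to codes. This is a minimal sketch; the function `make_codeset`, its `first_code` parameter and the three-character set are hypothetical and chosen only for illustration.

```python
def make_codeset(chars, first_code=0):
    """Build a codeset CC = {(c_i, code_i)} as a dict of char -> integer code.

    Codes are assigned consecutively from first_code, so distinct characters
    always receive distinct codes (code_i != code_j when c_i != c_j).
    """
    codeset = {}
    for offset, ch in enumerate(chars):
        if ch in codeset:
            # A character set holds each character once.
            raise ValueError(f"duplicate character: {ch!r}")
        codeset[ch] = first_code + offset
    return codeset


# A tiny, hypothetical character set C and its codeset.
cc = make_codeset("一二三", first_code=0x10)
for ch, code in cc.items():
    print(ch, format(code, "08b"))  # character and its code in binary form
```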

  12. Note that $CODE$ is a set of numbers, usually in consecutive order.
     • Examples: suppose
       $CODE_1 = \{00, 01, 10, 11\}$, $CODE_2 = \{0000, 0001, 0010, 0011\}$, $CODE_3 = \{1000, 1001, 1010, 1011\}$,
       $CC_1 = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE_1\}$
       $CC_2 = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE_2\}$
       $CC_3 = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE_3\}$
       Then $CC_1$, $CC_2$ and $CC_3$ are different codesets!
     • Conceptually, a codeset can also be considered a character set with a predetermined order, where the order is determined by the numerical values in $CODE$
     • The length of the binary code depends on the size of $C$ or on some predetermined number
     • Codepoint: a value in the code space
     • For Chinese, since there are more than 256 characters in the set, at least 2 bytes (at most 64k codepoints) are necessary to represent all the Chinese characters.
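The two points on this slide, that the same character set paired with different code spaces gives different codesets and that the code length follows from the size of the set, can be checked with a short sketch. The four-character set `"ABCD"` and the helper `min_bytes` are assumptions made for illustration only.

```python
def min_bytes(n_chars: int) -> int:
    """Smallest whole number of bytes that offers at least n_chars codepoints."""
    if n_chars < 1:
        raise ValueError("a character set needs at least one character")
    bits = max(1, (n_chars - 1).bit_length())  # bits needed for codes 0..n_chars-1
    return (bits + 7) // 8                     # round up to whole bytes


chars = "ABCD"  # hypothetical 4-character set C
CODE1 = ["00", "01", "10", "11"]
CODE2 = ["0000", "0001", "0010", "0011"]
CODE3 = ["1000", "1001", "1010", "1011"]

# Same characters, same order, but three different codesets.
CC1, CC2, CC3 = (dict(zip(chars, code)) for code in (CODE1, CODE2, CODE3))
print(CC1 != CC2 and CC2 != CC3 and CC1 != CC3)  # True

# 256 characters fit in one byte; one more character forces a second byte.
print(min_bytes(256), min_bytes(257), min_bytes(65000))  # 1 2 2
```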

  13. Numerical Notations
     • Decimal notation (10 distinct values): no prefix
     • Binary notation (2 distinct values):
     • Hexadecimal notation: 0xHH, where H is one of 0..9, A..F
     • Hexadecimal notation is normally used in place of binary notation for better readability
     • 1 to 4 binary digits -> 1 hex digit
     • Scalar value: the actual numeric value of a number written with any fixed number of digits: $\mathrm{scalar}(0001) = 1_{2}$, $\mathrm{scalar}(0111) = 7_{16}$, $\mathrm{scalar}(01111) = F_{16} = 15_{10} = 1111_{2}$
     • In a computer, 00AF and AF represent different things, but they have the same scalar value.
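The difference between a code and its scalar value can be shown with Python's built-in conversions. This is a small sketch; the helper name `scalar` mirrors the slide's notation and is not a library function.

```python
def scalar(digits: str, base: int = 2) -> int:
    """Return the numeric (scalar) value of a digit string in the given base."""
    return int(digits, base)


# Leading zeros do not change the scalar value.
print(scalar("0001"), scalar("0111"), scalar("01111"))  # 1 7 15
print(hex(scalar("01111")))                             # 0xf

# 00AF and AF have the same scalar value ...
print(scalar("00AF", 16) == scalar("AF", 16))           # True

# ... but as stored codes they differ: two bytes versus one byte.
print(bytes.fromhex("00AF"), bytes.fromhex("AF"))
```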

  14. ASCII code table
     • 0x00 - 0x1F and 0x7F: control characters
     • 0x20 - 0x7E: graphic characters (printable chars)
     • Code range: the range of values used in code-point assignment
     • The code range is 0x00 to 0x7F for ASCII
     • A code range may not start from scalar value zero
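The ASCII ranges listed on this slide translate directly into a range check. The following sketch uses a hypothetical helper, `ascii_class`, that follows the slide's split into control and graphic characters.

```python
def ascii_class(code: int) -> str:
    """Classify an ASCII codepoint as 'control' or 'graphic'."""
    if not 0x00 <= code <= 0x7F:
        raise ValueError("outside the ASCII code range 0x00-0x7F")
    if code <= 0x1F or code == 0x7F:
        return "control"
    return "graphic"  # 0x20-0x7E


counts = {"control": 0, "graphic": 0}
for code in range(0x80):
    counts[ascii_class(code)] += 1

print(counts)                      # {'control': 33, 'graphic': 95}
print(ascii_class(ord("A")))       # graphic
print(ascii_class(0x0A))           # control (line feed)
```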

  15. Row-Cell notation
     • A matrix in which a row number and a column number define a cell, and thus the order of the characters. It also avoids binary notation, and is particularly useful when the code range is not consecutive.
     • Character subsets: characters of a similar nature are placed next to each other, with different subsets in different rows
     • Some codepoints in the code space may not have any character assigned; they are called empty codepoints.
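As a concrete illustration of row-cell notation, the sketch below converts a row and cell number into a two-byte code. It assumes the GB2312 layout (94 rows by 94 cells) and its common EUC-CN byte form, in which each byte is the row or cell number plus 0xA0; this slide does not name a specific codeset, so that choice and the helper name `row_cell_to_bytes` are assumptions.

```python
def row_cell_to_bytes(row: int, cell: int, offset: int = 0xA0) -> bytes:
    """Convert a row-cell position (1-94 each) into a two-byte code.

    Assumes the EUC-CN form of GB2312: byte = offset + row / offset + cell.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    return bytes([offset + row, offset + cell])


# Row 16, cell 1 is the first Chinese character in GB2312.
code = row_cell_to_bytes(16, 1)
print(code.hex())             # b0a1
print(code.decode("gb2312"))  # 啊
```

Note that the resulting code range starts at 0xA1A1 rather than at zero, and that the two-byte values are not consecutive integers across rows, which is why the row-cell view is convenient.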

  16. Codeset Compatibility
     • For two character sets $C_1$ and $C_2$:
       • equivalence: $C_1 = C_2$
       • subset: $C_1 \subset C_2$
       • superset: $C_1 \supset C_2$
       • intersect: $C_1 \cap C_2 \ne \emptyset$, or disjoint: $C_1 \cap C_2 = \emptyset$
     • Examples: GB & B5 -> ? GB & GBK -> ?
     • For two coded character sets:
       $CC_1 = \{(c_{1i}, code_{1i}) \mid c_{1i} \in C_1 \text{ and } code_{1i} \in CODE_1\}$
       $CC_2 = \{(c_{2i}, code_{2i}) \mid c_{2i} \in C_2 \text{ and } code_{2i} \in CODE_2\}$
       If for every $(c_{1i}, code_{1i}) \in CC_1$ it is true that $(c_{1i}, code_{1i}) \in CC_2$, then $CC_2$ is said to be fully compatible with $CC_1$
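The full-compatibility condition can be written as a one-line check over two codesets stored as dictionaries. In this sketch the function name `is_fully_compatible` is hypothetical, and ASCII and ISO 8859-1 (Latin-1) are used as the example pair because their relationship is well known; they are not the GB/B5/GBK cases asked about on the slide.

```python
def is_fully_compatible(cc2: dict, cc1: dict) -> bool:
    """True if every (character, code) pair of cc1 also appears in cc2."""
    return all(cc2.get(ch) == code for ch, code in cc1.items())


# ASCII assigns codes 0x00-0x7F; Latin-1 keeps those and adds 0x80-0xFF.
ascii_cc = {chr(code): code for code in range(0x80)}
latin1_cc = {chr(code): code for code in range(0x100)}

# Same characters as ASCII, but every code shifted: a different codeset.
shifted_cc = {ch: code + 0x80 for ch, code in ascii_cc.items()}

print(is_fully_compatible(latin1_cc, ascii_cc))   # True
print(is_fully_compatible(ascii_cc, latin1_cc))   # False: ASCII lacks 0x80-0xFF
print(is_fully_compatible(shifted_cc, ascii_cc))  # False: same chars, other codes
```

The last case shows that a subset relation between the character sets alone is not enough; the codes must match as well.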
