
Wide character vs. Multi-byte characters


earl

Presentation Transcript


  1. Wide character vs. Multi-byte characters • Text information needs to be represented by the right data types. • Multi-byte characters: data are processed byte by byte: Big5, GB, EUC, even UTF-8 • Wide characters: fixed-width encoding, so no testing of the high bit is needed. • Processing representation for wide characters: • Big Endian vs. Little Endian • Data type dependent: byte order matters only for wide characters • System architecture dependent • Distinction: the byte order mark U+FEFF at the start of the data is stored as the bytes FE FF in Big Endian and FF FE in Little Endian
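
The byte order distinction above can be sketched in Python. This is a minimal illustration that assumes UTF-16 as the wide-character encoding (the slide does not name one); the helper name `detect_utf16_byte_order` is hypothetical, while `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are standard library constants.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"


text = "中"  # U+4E2D

# Wide (fixed 16-bit unit) representation: the same code unit in both byte orders.
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Multi-byte representation: the byte count varies from character to character.
utf8_cjk = text.encode("utf-8")   # 3 bytes
utf8_ascii = "A".encode("utf-8")  # 1 byte

print(detect_utf16_byte_order(big), big.hex())
print(detect_utf16_byte_order(little), little.hex())
print(len(utf8_cjk), len(utf8_ascii))
```

Byte order is a concern only for the 16-bit representation; the UTF-8 bytes are the same on every architecture.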

  2. Character Input • Input method: A scheme of mapping characters from their external representations to the internal codepoints used in computer systems. • Classification of input methods: • Images: • Off-line character recognition (Optical character recognition) • On-line character recognition • Speech: voice recognition • Character features: Keyboard input based on glyph shapes and pronunciations.

  3. Character Input Based on Images • Optical Character Recognition (via image, off-line): • Written material --> scanner --> bitmap image file (e.g. TIFF, JPEG) --> characters (represented by an internal code) • very difficult for unrestricted handwritten characters; commercially viable for printed materials, where accuracy depends on printing quality • Degree of difficulty increases as the total number of characters to be recognized increases • On-line Character Recognition (by pen writing devices): • Handwriting information capture (pen-in, pen-out, pen-movement, on-line) --> stroke information (preprocessing with noise reduction) --> searching for the character based on the sequence of strokes. • commercially viable

  4. Speech Recognition (by voice input): • Capture speech by microphones --> speech signal segmentation --> speech signal converted to phonetic transcription --> phonetic spelling converted to internal code. • becoming commercially viable; problems with non-native speakers and with conversion from colloquial to written text • expected to become more affordable and more common in the next 5-10 years

  5. Keyboard-based input method: an encoding method which maps a sequence of keystrokes (with a predefined keyboard layout) to the internal code of a character. • Conceptually, an input method can be considered as a mapping table with two columns: the 1st column X is a sequence of keys, the 2nd column Y is the corresponding internal code. • Uniqueness requirement: for any two internal codepoints Yi and Yj, if Yi≠Yj then Xi≠Xj. • Input methods are normally language (script) dependent: • Input of Chinese characters and of Greek letters in GB uses two different input methods, which are invoked separately.
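
The two-column mapping table and its uniqueness requirement can be sketched as follows. This is only an illustration: the key sequences "ab" and "ac" are made up, Unicode code points stand in for the internal code, and the function names are hypothetical.

```python
def is_unique(table):
    """Check the uniqueness requirement: one key sequence never maps to two codes."""
    seen = {}
    for keys, code in table:
        if keys in seen and seen[keys] != code:
            return False
        seen[keys] = code
    return True


def lookup(table, keys):
    """Return the internal code for a key sequence, or None if it is not in the table."""
    for row_keys, code in table:
        if row_keys == keys:
            return code
    return None


# Column X: key sequence (hypothetical). Column Y: internal code (Unicode here).
TABLE = [
    ("ab", ord("憤")),
    ("ac", ord("憔")),
]

print(is_unique(TABLE))                         # True
print(is_unique(TABLE + [("ab", ord("憔"))]))   # False: "ab" would be ambiguous
print(chr(lookup(TABLE, "ac")))                 # 憔
```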

  6. Typing in the internal code is straightforward, easiest to implement, and accurate, but it requires labour-intensive training and is only good for professionals • Why do we need to design input methods: • People cannot relate characters to internal codes • 憤 => (BCAB₁₆), 憔 => (BCAC₁₆) • The number of characters is much larger than the number of keys on the keyboard => a sequence of keystrokes maps to one character • What is the restriction: a limited number of keys (people cannot remember too many different keys with unrelated numbers)

  7. What information do we have? All input methods must use some features associated with the characters: pronunciation, radicals, components, strokes, writing sequence, etc., or combinations of them. • Different mapping methods lead to different input methods • Users: professional typists, casual users, daily users • Different modes of input: • Typing while looking at printed material • Typing while thinking

  8. Design considerations: • Ease of learning • Shorter learning time: easy to pick up (perhaps easy to forget), but slow input speed • Longer learning time: difficult to learn, but once trained, not easy to forget and faster input speed • Mapping of features to keys on the keyboard: • Physical control of the different fingers and access to different key positions on the keyboard • Frequency analysis of the features • Uniqueness: one-to-one mapping and user friendliness • Equal-length vs. uneven-length keystroke sequences

  9. Input methods based on glyphs • Problems: • What are the fundamental units? • How to put the units together (or how to form sequences)? Need to translate 2-D spatial relations into 1-D ordering. Example: 夵 (U+5935) and 尖 (U+5C16) • How difficult is it to learn? Trade-off between ease of learning and speed • Features related to glyphs: • Strokes (筆劃): 點 橫 豎 撇 捺 • Radicals (偏旁): for indexing mostly, not unique • Components (部件): 女 and 且 in 姐, 組 • Character (整字): 甘 • Spatial relations (方位關係): left-right, upper-lower,

  10. Principles of input method design • Design example: using strokes only • Suppose we assign the strokes to keys 1, 2, 3, 4, 5, respectively, using only 5 keys • Example: 哲, 23144233232, a very long sequence • What problems do we have for characters like these: 岭 岺 => At least one extra key must be used to distinguish them • As there are more keys available, some keys can be assigned to multiple strokes:

  11. 2-stroke keys: if the first stroke is x and the second stroke is y, how many different 2-stroke keys are there? • Example: • Total No. of keys now? • With these additional keys the number of key presses is reduced to: 23 14 42 33 23 2 • With 3-stroke keys xyz, additional keys: • Total No. of keys:
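
The arithmetic behind multi-stroke keys can be sketched as below. It assumes that every ordered combination of the 5 basic strokes gets its own key, which the slides do not state explicitly; the function names are hypothetical.

```python
def chunk_strokes(seq: str, size: int) -> list[str]:
    """Split a stroke-code string into keys that each cover up to `size` strokes."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def key_count(strokes: int, max_len: int) -> int:
    """Number of keys if every ordered combination of 1..max_len strokes has a key."""
    return sum(strokes ** n for n in range(1, max_len + 1))


code = "23144233232"  # stroke sequence given for 哲 with single-stroke keys

print(len(code))               # 11 key presses with single-stroke keys
print(chunk_strokes(code, 2))  # ['23', '14', '42', '33', '23', '2']
print(chunk_strokes(code, 3))  # ['231', '442', '332', '32']
print(key_count(5, 2))         # 5 + 25 = 30 keys
print(key_count(5, 3))         # 5 + 25 + 125 = 155 keys
```

Under this assumption, shorter key sequences are paid for with a rapidly growing number of keys, which is why only the most frequent combinations would be given keys in practice.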

  12. Study of character features and use patterns • Study of character frequency (based on 50,000 characters) • 2,000 most frequently used characters: 97% • out of that: first 100 characters: 45% • the first 10 characters: 12% • Example: 有 的 口 是 我 不 女 日: assign keys • 2-stroke keys: • 3-stroke keys, etc.: use the most frequently used • Other considerations are • easily identifiable • reducing the length of the key sequence

  13. Keyboard Arrangements • Some fingers are easier to control; assign priority L: use only the index finger (2nd finger) to the 5th finger for typing. • General principle: assign more frequently used feature keys to the positions on the keyboard which are easier to reach • One simple method: • Some keyboard rows are easier to press; assign priority R • Keys are ranked according to L×R • All the selected strokes (characters and combined strokes) are ranked according to frequency of use, K • Then map the feature keys according to rank.
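
The ranking method above can be sketched as follows. All numbers are hypothetical: the finger priorities L, row priorities R, and stroke frequencies K are made-up values chosen only to show the mechanics, and the function name is an assumption.

```python
def assign_features(finger_score, row_score, key_pos, feature_freq):
    """Map the most frequent features to the keys with the highest L x R score."""
    def key_score(key):
        finger, row = key_pos[key]
        return finger_score[finger] * row_score[row]

    keys_by_ease = sorted(key_pos, key=key_score, reverse=True)
    features_by_freq = sorted(feature_freq, key=feature_freq.get, reverse=True)
    return dict(zip(features_by_freq, keys_by_ease))


# Hypothetical priorities: a higher value means easier to control or reach.
L = {"index": 4, "middle": 3, "ring": 2, "little": 1}
R = {"home": 3, "top": 2, "bottom": 1}

# Key -> (finger, row) for a few keys of a QWERTY layout.
KEY_POS = {
    "f": ("index", "home"),   # 4 * 3 = 12
    "d": ("middle", "home"),  # 3 * 3 = 9
    "r": ("index", "top"),    # 4 * 2 = 8
    "q": ("little", "top"),   # 1 * 2 = 2
}

# Hypothetical frequencies of use (K) for four strokes.
K = {"一": 50, "丨": 30, "丿": 15, "丶": 5}

layout = assign_features(L, R, KEY_POS, K)
print(layout)  # {'一': 'f', '丨': 'd', '丿': 'r', '丶': 'q'}
```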

  14. Phonetic-based IM: 拼音 (Pinyin) • Romanized input method vs. native phonetic symbol-based input method • Romanized letter strings (usually 1-2 characters) can use the English keyboard readily • Native phonetic symbols are easier for people to relate to • Design problems and solutions: • Homonyms (同音字) in GB: • No tone: only 18 characters have no homonyms. The largest set, yi, has 114. • With tone: 262 have no homonyms; the largest set is reduced to 60. • Solutions: (1) specification of tone is optional (1-4 for Putonghua and 1-9 for Cantonese), (2) use a window to show all the candidates, (3) word/bigram input. • Multiple pronunciations of the same character: enter all possible pronunciations into the phonetic spelling database (e.g. che and kui for 車 in Cantonese). • Quantitatively not a significant problem • May slow input down if fault tolerance (fuzzy input) is supported
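
Optional tone specification and multiple pronunciations can be sketched with a tiny lookup table. The six-entry database below is a hypothetical stand-in for a real phonetic spelling database, and the function name is an assumption.

```python
from collections import defaultdict

# (syllable, tone, character). 行 is entered under both of its readings.
READINGS = [
    ("ma", 1, "媽"),
    ("ma", 2, "麻"),
    ("ma", 3, "馬"),
    ("ma", 4, "罵"),
    ("xing", 2, "行"),
    ("hang", 2, "行"),
]

INDEX = defaultdict(list)
for syllable, tone, char in READINGS:
    INDEX[syllable].append((tone, char))


def candidates(typed: str) -> list[str]:
    """Return candidate characters; a trailing digit narrows the list by tone."""
    if typed and typed[-1].isdigit():
        syllable, tone = typed[:-1], int(typed[-1])
        return [char for t, char in INDEX.get(syllable, []) if t == tone]
    return [char for _, char in INDEX.get(typed, [])]


print(candidates("ma"))     # ['媽', '麻', '馬', '罵']  (shown in a candidate window)
print(candidates("ma3"))    # ['馬']
print(candidates("hang2"))  # ['行']
print(candidates("xing"))   # ['行']
```

Typing the tone shrinks the candidate set, which is the effect the homonym counts above describe; leaving it out keeps the keystroke sequence shorter at the cost of a longer candidate list.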

  15. User problems: • Some sounds are difficult to analyze: • similar consonants: /b/ vs /p/, /t/ vs /d/, /g/ vs /k/ • tone interacts with the vowel: the way we say things differs from the standard pinyin, e.g. 普洱 pu3 er3 is pronounced pu2 er3 (Putonghua) • Difficult to analyze the behaviour of non-native speakers because accent interferes with phonetic analysis • Tedious to find the correct character from a set of candidates that have no apparent relationship • When the user cannot use shape-based keystroke input, try phonetic spelling!

  16. Other IMs for Chinese • Zhuyin (注音) [also called bopomofo] • Chinese phonetic symbols (similar to Japanese Katakana or Hiragana) • Includes the use of numeral keystrokes • Similar English sounds: bpmfdtnlgkhjsaor • tone: . (tone 0), <space> (tone 1), 2 (tone 2), 3 (tone 3), 4 (tone 4) • One-to-one mapping to Pinyin (pages 218-219): ㄅㄆㄇㄈ to bo, po, mo, fo • 九方: maps onto number keys; good for small appliances: mobile phone, PDA, etc.
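
The Zhuyin to Pinyin correspondence can be sketched for a handful of symbols. The table below is deliberately partial (four initials and two finals), the tone keys follow the list on this slide, and the function name is hypothetical. Real conversion needs the full table and spelling rules that this sketch does not attempt.

```python
# Partial symbol table: four initials and two finals only.
SYMBOLS = {
    "ㄅ": "b", "ㄆ": "p", "ㄇ": "m", "ㄈ": "f",
    "ㄚ": "a", "ㄛ": "o",
}

# Tone keys as listed on the slide: "." tone 0, space tone 1, then 2, 3, 4.
TONE_KEYS = {".": 0, " ": 1, "2": 2, "3": 3, "4": 4}


def zhuyin_to_pinyin(keys: str) -> str:
    """Convert a simple Zhuyin syllable, with an optional trailing tone key."""
    tone = ""
    if keys and keys[-1] in TONE_KEYS:
        tone = str(TONE_KEYS[keys[-1]])
        keys = keys[:-1]
    letters = []
    for symbol in keys:
        if symbol not in SYMBOLS:
            raise ValueError(f"symbol not in the partial table: {symbol!r}")
        letters.append(SYMBOLS[symbol])
    return "".join(letters) + tone


print(zhuyin_to_pinyin("ㄅㄛ"))   # bo
print(zhuyin_to_pinyin("ㄇㄚ3"))  # ma3
print(zhuyin_to_pinyin("ㄇㄚ."))  # ma0
```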

  17. Japanese and Korean • Since hiragana and katakana are both phonetic, they have a unique Romanized mapping • Example: a i u e o, ha hi hu he ho • But a separate key mapping (native symbols) is also provided (p. 248) • Romanized input and native symbol-based direct mapping are different input methods • Similar for Korean Hangul
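
Romanized kana input can be sketched with the ten syllables from the example. The table covers only those syllables and uses the spelling "hu" for ふ as on the slide; the function name and the pass-through behaviour for unknown letters are assumptions.

```python
# Romanized spelling -> hiragana, limited to the syllables in the example.
KANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ha": "は", "hi": "ひ", "hu": "ふ", "he": "へ", "ho": "ほ",
}


def romaji_to_hiragana(typed: str) -> str:
    """Greedy longest-match conversion; unknown letters are passed through as-is."""
    out = []
    i = 0
    while i < len(typed):
        for size in (2, 1):  # try the longer spelling first
            piece = typed[i:i + size]
            if piece in KANA:
                out.append(KANA[piece])
                i += len(piece)
                break
        else:
            out.append(typed[i])
            i += 1
    return "".join(out)


print(romaji_to_hiragana("aoi"))   # あおい
print(romaji_to_hiragana("hihu"))  # ひふ
print(romaji_to_hiragana("hoho"))  # ほほ
```

A native-symbol layout would skip this table entirely and map each key directly to one kana, which is why the two are treated as different input methods.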
