
Writing

dfrost

Presentation Transcript


  1. Writing • Character sets • Unicode • Input methods

  2. Character sets • What’s the problem? • Computer should handle your language’s writing system in a natural way • “Handle” means input and output (and some other things, eg sorting) • “Natural” means like you are used to • Input method • Output (it should look right) • English is straightforward (why?), but not other languages • Distinguish: storage and handling of text within the computer vs. input/output

  3. Why the fuss? • Typing characters on a computer may appear deceptively simple: you press a key labelled “A”, and the character “A” appears on the screen. Well, you actually get uppercase “A” or lowercase “a” depending on whether you used the shift key or not, but that’s common knowledge. You also expect “A” to be included into a disk file when you save what you are typing, you expect “A” to appear on paper if you print your text, and you expect “A” to be sent if you send your product by e-mail or something like that. And you expect the recipient to see an “A”. • No big deal, but does the same happen for “Ä”? Or “ ” • Depends on keyboard settings, display settings, and degree of standardization Adapted from: http://www.cs.tut.fi/~jkorpela/chars.html
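
A minimal Python sketch of why “A” is no big deal but “Ä” depends on standardization. The two encodings (UTF-8 and Latin-1) are chosen only for illustration; the slide does not name any. Both agree on the byte for “A”, but they disagree on “Ä”, so a recipient who assumes the wrong encoding sees something else.

```python
# "A" is stored identically in both encodings; "Ä" is not.
plain = "A"
accented = "\u00c4"  # Ä, LATIN CAPITAL LETTER A WITH DIAERESIS

same_for_a = plain.encode("utf-8") == plain.encode("latin-1")

utf8_bytes = accented.encode("utf-8")      # two bytes
latin1_bytes = accented.encode("latin-1")  # one byte

# Sender writes UTF-8, recipient wrongly assumes Latin-1:
garbled = utf8_bytes.decode("latin-1")

print(same_for_a, utf8_bytes, latin1_bytes, repr(garbled))
```

The garbled result has two characters where the sender typed one, which is the kind of mismatch the slide's "degree of standardization" refers to.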

  4. Character sets • Size of character set has to do with storage as bits and bytes • Early character codes used 5 or 6 bits, giving only 32 or 64 characters – upper case “English” plus numerals and a few other symbols • ASCII uses 7 bits, giving space for 128 characters (95 printable plus control codes) • most alphabetic writing systems can be covered by 128 characters • Internal storage is independent of i/o • Leads to need for standardization of encoding
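
The link between storage size and character-set size is plain arithmetic: a code that uses n bits per character can distinguish 2^n characters. A small sketch (the function name `code_space` is made up for this illustration):

```python
def code_space(bits: int) -> int:
    """Number of distinct characters a fixed-width code of `bits` bits can hold."""
    return 2 ** bits

# Print the capacity of a few fixed-width codes.
for bits in (5, 6, 7, 8, 16):
    print(f"{bits:2d} bits -> {code_space(bits):6d} characters")
```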

  5. Writing systems • Alphabetic • Many languages use Roman alphabet • Often with diacritics (accents) • many are common to lots of languages • but some are quite unusual • and some languages use multiple diacritics • There are other alphabetic writing systems • Conventionally, a range of other symbols (numerals, currency signs, fractions, math symbols) are included • Syllabic • Ideographic

  6. Input method • Individual key • Key combination • Menu • Must be available in all fonts • Accented characters

  7. Characters and glyphs • A single character might have a variety of appearances (glyphs) depending on size, font, etc. • a a a a a a a a a a a (one character; the font variation shown on the slide is lost in this transcript) • A a à å α are all different characters • Appearance is a matter of rendering • In some writing systems, the same character is rendered differently depending on its context
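
One way to see that A, a, à, å and α are different characters, whatever font they are drawn in, is to look at their Unicode code points and names. A short sketch using Python's standard `unicodedata` module (escape sequences are used so the example does not depend on how this page itself is encoded):

```python
import unicodedata

# A, a, a-grave, a-ring, Greek alpha
characters = ["A", "a", "\u00e0", "\u00e5", "\u03b1"]

code_points = [ord(ch) for ch in characters]
names = [unicodedata.name(ch) for ch in characters]

for ch, cp, name in zip(characters, code_points, names):
    print(f"{ch}  U+{cp:04X}  {name}")
```

Changing the font changes none of these values; only the glyph chosen at rendering time differs.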

  8. Output: text direction • Note mixed LR and RL in Arabic, and orientation of Roman script in Chinese
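
Mixed direction is handled from per-character properties: Unicode gives every character a bidirectional class, and the display order is computed from those classes. The sketch below only looks up the classes with Python's `unicodedata.bidirectional`; it does not run the full bidirectional algorithm, and the sample string is invented for the illustration.

```python
import unicodedata

# Latin letters, three Arabic letters (alef, beh, jeem), then digits
sample = "abc \u0627\u0628\u062c 123"

classes = [unicodedata.bidirectional(ch) for ch in sample]

for ch, cls in zip(sample, classes):
    print(f"U+{ord(ch):04X} {cls}")
```

Here `L` is left-to-right, `AL` is an Arabic (right-to-left) letter, `EN` is a European number and `WS` is whitespace. Digits keep their left-to-right order even inside right-to-left text, which is one source of the mixing noted on the slide.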

  9. Unicode • Problem of many (competing) standards, especially for Arabic, CJK and Indian scripts • Industry-agreed standard aiming to cover “all” the world’s writing systems • “Unicode consists of a repertoire of about 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for text normalization, decomposition, collation, rendering and bidirectional display order” (Wikipedia)

  10. Unicode – some issues • 30+ writing systems encoded, but many more still to do • Non-alphabetic symbols should be included (eg music notation, currency symbols) • Should invented alphabets (eg Klingon, Tolkien) and/or ancient systems (hieroglyphics, Mayan) be included?

  11. Unicode – some issues • Ready-made vs composite characters, e.g. é = e+´; Hangul and Chinese/Japanese characters made up of identifiable components • Ligatures: many writing systems have special forms for character combinations • Is this a matter of representation or rendering? • Some disputed characters: ligature or separate character? (e.g. Dutch ij) • Unicode also defines ordering conventions, not always uncontroversial
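
The ready-made vs. composite distinction can be observed directly with Unicode normalization. A sketch using Python's standard `unicodedata.normalize`; note that the combining accent is U+0301, not the free-standing “´” printed on the slide, and that the Hangul syllable and the Dutch ij ligature character (U+0133) are example choices made here.

```python
import unicodedata

ready_made = "\u00e9"   # é as a single code point
composite = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings look the same when rendered but compare unequal as stored.
equal_as_stored = ready_made == composite

# NFC composes, NFD decomposes; after normalization they compare equal.
composed = unicodedata.normalize("NFC", composite)
decomposed = unicodedata.normalize("NFD", ready_made)

# A Hangul syllable decomposes into its component jamo.
hangul_parts = unicodedata.normalize("NFD", "\ud55c")

# The ij ligature character maps to the two letters i + j under
# compatibility normalization (NFKC).
ij_letters = unicodedata.normalize("NFKC", "\u0133")

print(equal_as_stored, composed == ready_made, decomposed == composite)
print([f"U+{ord(c):04X}" for c in hangul_parts], ij_letters)
```

Software that compares or sorts text usually normalizes both sides first, so that the choice between ready-made and composite storage does not change the result.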

  12. Input methods • Typing • Keyboard layout • Key combinations • Inputting ideographs • Handwriting pad • OCR

  13. Typing • We are used to conventional keyboard which has (roughly) one key-stroke per character • We quickly learn key-stroke combinations (eg for capitals, accented characters) • Fluent typists rely on the key layout being familiar

  14. Typing • Recent emergence of MSN on telephones has required input using just ten keys • Shows that software can map key-stroke combination to appropriate character sequence • For some users, bilingual keyboards are commonplace
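
A minimal sketch of how software can map key-stroke combinations on a numeric keypad to characters, using the common "multi-tap" scheme (press 4 twice for “h”). The keypad layout is the usual telephone one; the function name and the convention that a space separates successive letters are assumptions made for this example.

```python
# Usual letter assignment on a telephone keypad; 0 is used for a space here.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
    "0": " ",
}

def decode_multitap(presses: str) -> str:
    """Decode groups of repeated digits, e.g. "44 33" -> "he".

    Assumes each whitespace-separated group repeats a single digit;
    the number of repeats selects the letter on that key.
    """
    output = []
    for group in presses.split():
        letters = KEYPAD[group[0]]
        # Extra presses wrap around to the first letter again.
        output.append(letters[(len(group) - 1) % len(letters)])
    return "".join(output)

print(decode_multitap("44 33 555 555 666"))
```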

  15. Non-alphabetic writing systems • Syllabic system may require multiple key-strokes per character • Ideographic system (Chinese, Japanese) typically has input based on pronunciation, plus conversion to character, which may require contextual analysis • Alternate method: composition by radical + stroke count

  16. Graphic input • Using stylus, eg on PDA • Also using finger on the touchpad of a laptop • Depends on recognizing stroke direction and order • Shorthand method invented • Recent systems recognize conventional letter shapes ... • ... in all their varieties

  17. Graphic input • Also found for Chinese/Japanese • Important to get stroke order correct

  18. OCR • Optical character recognition • “Scanning” • Essentially a pattern recognition task: how similar is a given image to the expected image • Divide image into regions • Measure blackness of each region • Compare resulting matrix with template
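
The three steps above (divide into regions, measure blackness, compare with template) can be sketched in a few lines of plain Python. This is a toy illustration, not a description of any real OCR product: the 6×6 "images", the 3×3 grid, the three templates and the sum-of-absolute-differences distance are all assumptions chosen to keep the example small.

```python
def blackness_grid(image, rows=3, cols=3):
    """Split the image into rows x cols regions; return the fraction of black pixels in each."""
    height, width = len(image), len(image[0])
    grid = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        row = []
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(p == "#" for p in pixels) / len(pixels))
        grid.append(row)
    return grid

def distance(grid_a, grid_b):
    """Sum of absolute differences between two blackness matrices."""
    return sum(abs(a - b)
               for row_a, row_b in zip(grid_a, grid_b)
               for a, b in zip(row_a, row_b))

def classify(image, templates):
    """Return the label of the template whose blackness matrix is closest."""
    grid = blackness_grid(image)
    return min(templates,
               key=lambda label: distance(grid, blackness_grid(templates[label])))

# Hypothetical templates: '#' is a black pixel, '.' is white.
TEMPLATES = {
    "I": ["..##..", "..##..", "..##..", "..##..", "..##..", "..##.."],
    "L": ["##....", "##....", "##....", "##....", "######", "######"],
    "O": ["######", "##..##", "##..##", "##..##", "##..##", "######"],
}

# An "L" with two pixels missing, as a scanner might produce.
noisy_l = ["##....", "##....", "#.....", "##....", "######", "#####."]

print(classify(noisy_l, TEMPLATES))
```

Averaging over regions rather than comparing pixel by pixel is what makes the match tolerant of small defects: the damaged image still lands closest to the "L" template.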

  19. OCR • Originally developed with special OCR font which maximized the differences between characters • For Latin scripts, works very well with almost any font • Can include orientation detection • Errors are predictable and could be eradicated with more sophisticated (linguistic) processing, but is it worth it?

  20. OCR for handwriting • Neat printed handwriting not much harder than some fonts • Joined-up cursive handwriting still a research problem • Related problem of handwriting recognition – a bit like speech understanding and voice recognition

  21. OCR for other scripts • Correspondingly more difficult, depending on • Complexity of writing system in general • Complexity and similarity of individual characters

  22. Not always easy • Handwriting is even harder

  23. Need for OCR • Input of (all sorts of) texts for various purposes • Rapid input to save (re)typing • For further processing • For study • Two typical (hard) cases • Study of ancient manuscripts • Intelligence gathered in Iraq