Digital Text Primer
Prepared for: AIEA Roundtable on Digitization of Armenian Documents, Saturday 7 October 2006, University of Geneva, Switzerland

Roland Telfeyan <roland@telf.com>

Robert Coffin <rdcoffin@earthlink.net>

October 2, 2006 • Charlotte, North Carolina




Contents

Text encoding

ASCII Problem

Unicode Solution

OCR

ABBYY FineReader

Sample scans



1963: ASCII

  • Telegraph machines

  • American Standard Code for Information Interchange (ASCII)

  • 128 numbers representing

    • Printed characters, like ‘A’, ‘B’, ‘+’, ‘=’, etc.

    • Commands to control the print head of the teletype, like “carriage return”, “line feed”, “tab”, “backspace”, etc.


ASCII: Cont’d

  • No indication of type appearance

  • Only numbers representing letters

0 nul 1 soh 2 stx 3 etx 4 eot 5 enq 6 ack 7 bel

8 bs 9 ht 10 nl 11 vt 12 np 13 cr 14 so 15 si

16 dle 17 dc1 18 dc2 19 dc3 20 dc4 21 nak 22 syn 23 etb

24 can 25 em 26 sub 27 esc 28 fs 29 gs 30 rs 31 us

32 sp 33 ! 34 " 35 # 36 $ 37 % 38 & 39 '

40 ( 41 ) 42 * 43 + 44 , 45 - 46 . 47 /

48 0 49 1 50 2 51 3 52 4 53 5 54 6 55 7

56 8 57 9 58 : 59 ; 60 < 61 = 62 > 63 ?

64 @ 65 A 66 B 67 C 68 D 69 E 70 F 71 G

72 H 73 I 74 J 75 K 76 L 77 M 78 N 79 O

80 P 81 Q 82 R 83 S 84 T 85 U 86 V 87 W

88 X 89 Y 90 Z 91 [ 92 \ 93 ] 94 ^ 95 _

96 ` 97 a 98 b 99 c 100 d 101 e 102 f 103 g

104 h 105 i 106 j 107 k 108 l 109 m 110 n 111 o

112 p 113 q 114 r 115 s 116 t 117 u 118 v 119 w

120 x 121 y 122 z 123 { 124 | 125 } 126 ~ 127 del
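
The table above can be explored directly in code. The following is a minimal Python sketch (the sample string is an arbitrary choice, not from the slides) showing that ASCII text is nothing more than a sequence of numbers, and that the 128 codes split into control codes and printed characters.

```python
# Every ASCII character is just a number from 0 to 127.
text = "A+b"
codes = [ord(ch) for ch in text]         # character -> code
print(codes)                             # [65, 43, 98]
print("".join(chr(c) for c in codes))    # code -> character: A+b

# Codes 0-31 and 127 are control codes (cr, nl, ht, bs, del, ...),
# the rest are printed characters (including the space, code 32).
control_codes = [c for c in range(128) if c < 32 or c == 127]
printable_codes = [c for c in range(128) if c not in control_codes]
print(len(control_codes), len(printable_codes))  # 33 95
```

Nothing in these numbers says how a letter should look; that is left entirely to whatever displays them.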


Early Keyboards

  • Keyboards were “hard-wired.”

  • To get a lowercase ‘b’, you press the [B] key, making the keyboard emit code 98.


Mid-1970s: Computer Fonts

  • An array of glyphs, one per ASCII code

  • Character code 97 (‘a’) can be rendered variously: as ‘a’ in different typefaces, as ‘ա’, ...


Significance of Fonts

  • Fonts were the first flexible mapping interposed between the hard-wired keyboard and the printed glyphs.

  • This technology made the Macintosh famous.


Font Design Dilemma

  • Now 97 can mean not only ‘a’ but ‘ա’.

  • However, should 98 mean ‘բ’ or ‘պ’?


Font Design Dilemma (Cont'd)

  • Font designers assigned glyphs to specific character codes that satisfied their own personal keyboard layout preferences.

  • An Armenian text file could not be viewed reliably in the absence of the font used to create it.
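
The dilemma can be sketched in a few lines of Python. The two font tables below are hypothetical (the names FONT_A and FONT_B and their glyph assignments are made up for illustration, not taken from any real Armenian font); the point is only that the same stored codes come out as different letters depending on the font.

```python
# Two hypothetical legacy fonts: each is a table from character code to glyph.
# These assignments are illustrative only, not those of any real font.
FONT_A = {97: "ա", 98: "բ", 99: "գ"}
FONT_B = {97: "ա", 98: "պ", 99: "կ"}

def render(codes, font):
    """Draw each stored code with the glyph the given font assigns to it."""
    return "".join(font.get(code, "?") for code in codes)

stored = [98, 97]  # what the text file actually contains

print(render(stored, FONT_A))  # բա
print(render(stored, FONT_B))  # պա
```

The file holds only the numbers 98 and 97, so a reader without the original font has no way to know which letter was meant.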


1986 to 2006: NeXT to Mac

  • Steve Jobs (whose adoptive mother was born a Hagopian) founded NeXT, which built the NeXT computer

  • It had user-definable Keyboard Layouts

  • Today’s Mac OS X is largely derived from NeXT’s operating system, NeXTSTEP

  • Today, the placement of letters on a keyboard is a user preference, like the location of windows on a screen.


Unicode

  • The character set has been extended to allow for more than 95,000 characters.

  • The goal is a set of standard character codes for every known language.

  • For the first time, Armenian (and other) characters have their own codes, defined by a de facto international standard.


Unicode (Cont'd)

  • The Unicode Character Set is a standard definition of character codes for the characters of most known writing systems.

  • Armenian codes range from 1328 to 1423 (hexadecimal 0530 to 058F), a block of 96 code positions.
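
As a quick check of the Armenian range, here is a small Python sketch using the standard `unicodedata` module. The helper name `is_armenian` is hypothetical. Counting both endpoints, the range 1328 to 1423 spans 96 code positions, not all of which are assigned to a character.

```python
import unicodedata

ARMENIAN_FIRST, ARMENIAN_LAST = 0x0530, 0x058F  # 1328 to 1423 in decimal

def is_armenian(ch):
    """True if the character's code falls inside the Armenian block."""
    return ARMENIAN_FIRST <= ord(ch) <= ARMENIAN_LAST

print(ARMENIAN_LAST - ARMENIAN_FIRST + 1)  # 96 code positions in the block
print(ord("Ա"), hex(ord("Ա")))             # 1329 0x531
print(unicodedata.name("Գ"))               # ARMENIAN CAPITAL LETTER GIM
print([is_armenian(ch) for ch in "Աa"])    # [True, False]
```

Because each letter has its own code, a program can tell an Armenian letter from a Latin one without knowing anything about fonts.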


But I like my old system

  • If you want Armenian, Georgian, Greek, Hebrew, Arabic, Chinese, and more all on the same page using one font with a consistent look, …

  • If you want to type using your own key layout, …

  • If you want others to be able to read your text in the absence of the font or keyboard layout or computer system you used, …

  • … use Unicode.


But I have a lot of ASCII

  • Unicode conversion tools at: http://www.telf.com/
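
To give a feel for what such a conversion involves, here is a minimal Python sketch. It is not the tool offered at the address above; the table `LEGACY_TO_UNICODE` and the function `convert` are hypothetical, and a real conversion needs the actual code-to-letter table of the specific legacy font that was used to type the text.

```python
# Hypothetical table for one legacy font: ASCII code -> Unicode character.
# A real converter needs the actual table for the font that was used.
LEGACY_TO_UNICODE = {ord("a"): "ա", ord("b"): "բ", ord("g"): "գ"}

def convert(legacy_text, table=LEGACY_TO_UNICODE):
    """Replace each legacy code with the Unicode letter it stood for."""
    # str.translate takes a dict from code point to replacement string;
    # characters missing from the table pass through unchanged.
    return legacy_text.translate(table)

print(convert("bag 1"))  # բագ 1
```

Each legacy font needs its own table, which is why the conversion has to know which font a file was created with.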


95,000 Glyphs?

  • With more than 95,000 potential glyphs in a Unicode font, any one font can represent multiple language scripts.

  • How can a computer keyboard address all these characters?

  • User-defined keyboard layouts map selected characters in the Unicode font to the physical keyboard.


Review: Two Main Points

  • Keyboard layouts are user preferences that have nothing to do with legibility of text on another system.

  • Unicode text is legible in the absence of the fonts or keyboard mappings, and possibly even the application, used to create it.


1985

Different fonts had different glyphs for the same character.

  • Physical keyboard: emits ASCII 67, the code saved in the file

  • Kevork font: displays the code as “K”

  • Tigran font: displays the code as “G”


1995

Unicode characters are saved in the text file: the same Unicode character code for the same glyph, regardless of font.

  • Physical keyboard: emits key code 67

  • Virtual keyboard (user selected): maps the key code according to the keyboard preference

    • Armenian: “Գ”, Armenian letter “Gim”, Unicode 0533

    • Hebrew: “ג”, Hebrew letter “Gimel”, Unicode 05D2

    • Georgian: “Ⴂ”, Georgian letter “Gan”, Unicode 10A2

  • Any standard Unicode font (multilingual): ABCD…, ΑΒΓΔ…, …ܐܒܓܕ, …אבגד, ԱԲԳԴ…, ႠႡႢႣ…
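
The role of the user-selected keyboard layout can be sketched in Python. The `LAYOUTS` table and the `type_key` function are hypothetical and cover only the single key from the diagram; real layouts are defined by the operating system.

```python
# Hypothetical layouts: each maps a physical key code to a Unicode character.
# Only the one key from the diagram (key code 67) is filled in.
LAYOUTS = {
    "Armenian": {67: "\u0533"},  # Armenian letter Gim
    "Hebrew":   {67: "\u05D2"},  # Hebrew letter Gimel
    "Georgian": {67: "\u10A2"},  # Georgian letter Gan
}

def type_key(key_code, layout_name):
    """Return the character the chosen layout assigns to a key press."""
    return LAYOUTS[layout_name][key_code]

for name in LAYOUTS:
    ch = type_key(67, name)
    print(name, ch, f"U+{ord(ch):04X}")
```

The layout decides which character a key press produces, but what is saved in the file is always the Unicode code, so the reader's layout does not matter.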


OCR

  • ABBYY FineReader is a commercial multilingual OCR (optical character recognition) program that recognizes Armenian and many other languages.

  • Built-in dictionaries assist in checking accuracy, and all text is handled through Unicode.


FineReader

  • The program is simple yet powerful.

  • The program links each letter of text with its location in the scanned image, for fast proofreading.


FineReader (Cont’d)

  • Ample control over page layout

  • Tools to automate large batches

  • Outputs Word, PDF, HTML, XML, …


OCR: Results

  • Armenian accuracy depends on typeface and richness of the internal dictionary.

  • Arial Armenian: ~99.9%

  • Times, Aramian, Nork: ~96%

  • Երկաթագիր (erkatagir) and Գրաբար (Grabar) manuscripts: poor, ~70%
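
To put these percentages in proofreading terms, the following Python sketch estimates the number of wrong characters per page. It assumes that the accuracy figures are per-character rates and that a page holds about 2,000 characters; both are assumptions made for illustration, not figures from the slides.

```python
CHARS_PER_PAGE = 2000  # assumed page size, not a figure from the slides

# Approximate accuracy figures from the list above, read as per-character rates.
accuracy = {
    "Arial Armenian": 0.999,
    "Times, Aramian, Nork": 0.96,
    "Manuscripts": 0.70,
}

def expected_errors(acc, chars=CHARS_PER_PAGE):
    """Estimated number of misrecognized characters on one page."""
    return round((1 - acc) * chars)

for name, acc in accuracy.items():
    print(f"{name}: about {expected_errors(acc)} errors per page")
```

Under these assumptions the estimates are about 2, 80, and 600 errors per page, which shows how quickly the proofreading effort grows as accuracy falls.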


FineReader: Conclusion

  • Tuned for modern, Arial-like letters.

  • We are working with ABBYY to improve recognition rates on old manuscripts and books.


Screen Shots

  • On the next slides are:

    • A screenshot of FineReader

    • A scanned image

    • MS Word output


FineReader Screen

[Screenshot of the FineReader window, with panels labeled “Scan” and “Recognized Text”]



Further Information

  • Questions, suggestions, and corrections are welcome.

  • Updates will be posted to www.telf.com