Understanding Character Encoding and Textual Documents in Unicode

This lecture explores character sets and encoding systems, focusing on Unicode, ISO standards, and the significance of textual documents. We discuss the Universal Character Set (UCS), its representation in the Basic Multilingual Plane (BMP), and how Unicode enhances text semantics. Learn about the creation and storage methodologies of textual documents, particularly the use of PDF, and approaches for encoding in word processing applications. This presentation serves as a valuable resource for understanding the intersection of digital character representation and document formatting.


Presentation Transcript


  1. lis508 lecture 2: characters to textual documents Thomas Krichel 2002-09-30

  2. Structure • Character sets • Coded character set • Character encoding

  3. Literature • Norton “new inside the PC” chapter 4 • http://www.danbbs.dk/~erikoest/bb_terms.htm • http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html • http://www.cl.cam.ac.uk/~mgk25/unicode.html

  4. Recall from last lecture • UCS is a character set defined by ISO • The most important characters are in the Basic Multilingual Plane (BMP). It has 2^16 = 65536 code positions. • UCS characters in the BMP can be represented by two bytes. • Other characters need more space.
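
The two-byte claim above can be checked with a minimal Python sketch. It assumes UTF-16 (big-endian, no byte-order mark) as the concrete encoding, which the slide does not name; the helper names `in_bmp` and `utf16_byte_length` are hypothetical and chosen only for illustration.

```python
# Sketch only: UTF-16 is an assumed stand-in for "two bytes per BMP character".

BMP_SIZE = 2 ** 16  # 65536 code positions in the Basic Multilingual Plane


def in_bmp(ch: str) -> bool:
    """Return True if the character's code position lies inside the BMP."""
    return ord(ch) < BMP_SIZE


def utf16_byte_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-16 (no BOM)."""
    return len(ch.encode("utf-16-be"))


# A Latin letter, an accented letter, the euro sign, and a musical symbol
# (U+1D11E) that lies outside the BMP.
for ch in ["A", "\u00e9", "\u20ac", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}  in BMP: {in_bmp(ch)}  bytes: {utf16_byte_length(ch)}")
```

Characters inside the BMP come out at two bytes each, while the character outside it needs four.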

  5. Unicode • The Unicode Consortium is an industry consortium. • The Unicode Standard published by the Unicode Consortium corresponds to the BMP of ISO 10646. All characters are at the same positions and have the same names in both standards. • In addition, the Unicode Standard defines much more of the semantics associated with some of the characters. There is a free online book at http://www.unicode.org/unicode/uni2book/u2.html
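
As an illustration of positions, names, and the extra semantics, here is a small sketch using Python's standard `unicodedata` module. The choice of properties (general category and bidirectional class) is an assumption about what "semantics" covers here, and `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect the position, name, and two semantic properties of a character."""
    return {
        "code_point": f"U+{ord(ch):04X}",              # position in the standard
        "name": unicodedata.name(ch, "<unnamed>"),      # official character name
        "category": unicodedata.category(ch),           # e.g. Lu = uppercase letter
        "bidirectional": unicodedata.bidirectional(ch), # e.g. L = left-to-right
    }


# A Latin capital letter and a Hebrew letter (written right-to-left).
for ch in ["A", "\u05d0"]:
    print(describe(ch))
```

The position and name are shared with ISO 10646; the category and directionality are examples of the additional properties that the Unicode Standard attaches to characters.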

  6. Application • Word and WordPad give the option to input Unicode characters • Insert symbol • Hex sequence followed by ALT-X • You may not see the character if you do not have a font for it. • WordPad and Notepad allow you to save the file in various Unicode encodings. When in doubt, use UTF-8. • likely to be the most widely supported • leaves ASCII text unchanged
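
A rough Python sketch of the two ideas on this slide: turning a hex sequence into a character (what ALT-X does inside Word) and comparing encodings of the same text. The helper names `from_hex` and `encoded_sizes` are hypothetical, and the three encodings compared are an assumed selection, not the exact list the editors offer.

```python
def from_hex(hex_sequence: str) -> str:
    """Turn a hex code position such as '20AC' into the character it names."""
    return chr(int(hex_sequence, 16))


def encoded_sizes(text: str) -> dict:
    """Return the byte length of the text in a few Unicode encodings."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


euro = from_hex("20AC")  # the euro sign

# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
ascii_text = "plain text"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))

# The same text takes more room in the wider encodings.
print(encoded_sizes(ascii_text))
print(encoded_sizes(euro))
```

This is the sense in which UTF-8 is a safe default: existing ASCII files are already valid UTF-8, and only non-ASCII characters take more than one byte.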

  7. Textual documents

  8. What is a textual document? • A text is a sequence of characters. • A textual document is a text with some formatting • Font • Font shape (e.g. italics) • Spacing and other “lay-out” issues • Why are librarians concerned about textual documents?

  9. Creation of textual documents • Pure text editors only create text. • Usually text is created with word processing software. This surrounds text with digital gibberish that explains the formatting. • Formatting instructions depend on the word processing software. • Why is this bad?

  10. Storage of textual documents • The most widely used format is PDF • It is based on a language called PostScript that describes documents. • Support for fonts • Support for inclusion of non-textual files • PDF compresses PostScript files • Proprietary format owned by Adobe Inc. • Requires special software • Also bad for digital preservation

  11. http://openlib.org/home/krichel Thank you for your attention!
