
Informatics I101



Presentation Transcript


  1. Informatics I101 February 25, 2003 John C. Paolillo, Instructor

  2. Electronic Text • ASCII — American Standard Code for Information Interchange • EBCDIC (IBM Mainframes, not standard) • Extended ASCII (8-bit, not standard) • DOS Extended ASCII • Windows Extended ASCII • Macintosh Extended ASCII • UNICODE (16-bit, standard-in-progress)

  3. Alphabet letter "A" → encoded as ASCII 01000001 → displayed as the screen representation A

  4. The ASCII Code (column = high 3 bits 0–7, row = low 4 bits 0–F)
         0    1    2       3  4  5  6  7
      0  NUL  DLE  blank   0  @  P  `  p
      1  SOH  DC1  !       1  A  Q  a  q
      2  STX  DC2  "       2  B  R  b  r
      3  ETX  DC3  #       3  C  S  c  s
      4  EOT  DC4  $       4  D  T  d  t
      5  ENQ  NAK  %       5  E  U  e  u
      6  ACK  SYN  &       6  F  V  f  v
      7  BEL  ETB  '       7  G  W  g  w
      8  BS   CAN  (       8  H  X  h  x
      9  HT   EM   )       9  I  Y  i  y
      A  LF   SUB  *       :  J  Z  j  z
      B  VT   ESC  +       ;  K  [  k  {
      C  FF   FS   ,       <  L  \  l  |
      D  CR   GS   -       =  M  ]  m  }
      E  SO   RS   .       >  N  ^  n  ~
      F  SI   US   /       ?  O  _  o  DEL

  5. An Example Text
     T   h   i   s   _   i   s   _   a   n   _   e    x   a   m   p   l   e
     84  104 105 115 32  105 115 32  97  110 32  101  120 97  109 112 108 101
     Note that each ASCII character corresponds to a number, including spaces, carriage returns, etc. Everything must be represented somehow; otherwise the computer couldn't do anything with it.
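The character-to-number correspondence on this slide can be checked directly in Python, whose built-in ord() and chr() functions convert between characters and their codes (a small sketch, not part of the original slides):

```python
# Convert between characters and their ASCII codes.
text = "This is an example"
codes = [ord(c) for c in text]
print(codes)      # [84, 104, 105, 115, 32, ...] as on the slide
restored = "".join(chr(n) for n in codes)
print(restored)   # "This is an example"
```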

  6. Representation in Memory (address → contents)
     01101010   32   _ (space)
     01101001   101  e
     01101000   108  l
     01100111   112  p
     01100110   109  m
     01100101   97   a
     01100100   120  x
     01100011   101  e
     01100010   32   _ (space)
     01100001   110  n
     01100000   97   a
     Read from the lowest address upward, the bytes spell "an example".

  7. Features of ASCII • 7-bit fixed-length code • all codes have the same number of bits • Sorting: A precedes B, B precedes C, etc. • Caps + 32 = lower case (A + space = a, since space is code 32) • Word divisions, etc. must be parsed ASCII is very widespread and almost universally supported.
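The sorting and "caps + 32" properties above can be verified with a few lines of Python (an illustrative sketch, not from the slides):

```python
# Two ASCII properties from the slide, checked directly:
assert ord("A") < ord("B") < ord("C")    # codes sort like the alphabet
assert ord("a") - ord("A") == 32         # caps + 32 = lower case
assert chr(ord("A") + ord(" ")) == "a"   # "A + space = a", since ord(" ") == 32
print("all ASCII properties hold")
```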

  8. Variable-Length Codes • Some symbols (e.g. letters) have shorter codes than others • E.g. Morse code: e = dot, j = dot-dash-dash-dash • Use frequency of symbols to assign code lengths • Why? Space efficiency • compression tools such as gzip and zip use variable-length codes (based on words)

  9. Requirements Starting and ending points of symbols must be clear. A (simplistic) example: four symbols must be encoded: 0, 10, 110, 1110 • All symbols end with a zero • Any zero ends a symbol • Any one continues a symbol • Average number of bits per symbol = 2
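The "any zero ends a symbol, any one continues it" rule makes this code decodable with a single left-to-right pass. A minimal decoder sketch in Python (the symbol names A–D are my labels; the slide does not name the four symbols):

```python
# Decoder for the slide's four-symbol code: 0, 10, 110, 1110.
# A zero ends a symbol; a one continues it.
CODE = {"0": "A", "10": "B", "110": "C", "1110": "D"}  # hypothetical symbol names

def decode(bits):
    symbols, current = [], ""
    for b in bits:
        current += b
        if b == "0":                 # any zero ends a symbol
            symbols.append(CODE[current])
            current = ""
    return symbols

print(decode("0101100"))  # ['A', 'B', 'C', 'A']
```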

  10. Example • 12 symbols • digits 0–9 • decimal point and space (end of number)
      0 → 00       5 → 011111   _ → 111110
      1 → 010      6 → 10       . → 111111
      2 → 0110     7 → 110
      3 → 01110    8 → 1110
      4 → 011110   9 → 11110

  11. Efficient Coding Huffman coding (gzip) • count the number of times each symbol occurs • start with the two least frequent symbols • combine them using a tree • put 0 on one branch, 1 on the other • combine counts and treat as a single symbol • continue combining in the same way until every symbol is assigned a place in the tree • read the codes from the top of the tree down to each symbol
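The steps above can be sketched in Python using a priority queue to find the two least frequent symbols at each stage (a sketch of the general technique, not the exact gzip implementation; function and variable names are mine):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code: count symbols, repeatedly merge the two
    least frequent, put 0 on one branch and 1 on the other, and read
    each symbol's code from the root of the tree down."""
    counts = Counter(text)
    # Each heap entry: (count, tiebreaker, {symbol: code-so-far})
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)   # least frequent
        n2, _, c2 = heapq.heappop(heap)   # second least frequent
        merged = {s: "0" + c for s, c in c1.items()}   # 0 on one branch
        merged.update({s: "1" + c for s, c in c2.items()})  # 1 on the other
        heapq.heappush(heap, (n1 + n2, tie, merged))  # combined counts
        tie += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
print(codes)  # the most frequent symbol, 'a', gets the shortest code
```

The result is prefix-free: no code is the beginning of another, so, as on slide 9, symbol boundaries are always unambiguous.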

  12. Information Theory • Mathematical theory of communication • How many bits in an efficient variable-length encoding? • How much information is in a chunk of data? • How can the capacity of an information medium be measured? • Probabilistic model of information • "Noisy channel" model • less frequent ≈ more surprising ≈ more informative • Measures information using the notion of entropy

  13. Noisy Channel A source sends bits (1, 0) to a destination; each bit may arrive correctly or flipped. We measure the probability of each possible path (correct reception and errors).

  14. Entropy • Entropy of a symbol is calculated from its probability of occurrence • Number of bits required: h(s) = –log2 p(s) • Average entropy: H(p) = –Σ pi log2 pi • Related to variance • Measured in bits (log2)
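The average-entropy formula can be computed directly; a small sketch (the function name is mine):

```python
import math

def entropy(probs):
    """Average entropy H(p) = -sum(p_i * log2 p_i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin
print(entropy([0.25] * 4))    # 2.0 bits: four equally likely symbols
print(entropy([0.9, 0.1]))    # ~0.47 bits: a predictable source carries less information
```

The third case illustrates the "less frequent ≈ more informative" idea from slide 12: a heavily skewed source is less surprising on average, so its entropy is lower.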

  15. Base 2 Logarithms 2^(log2 x) = x; e.g. log2 2 = 1, log2 4 = 2, log2 8 = 3, etc. Often we round up to the nearest power of two (= minimum number of bits)
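Rounding the logarithm up gives the minimum number of bits for a fixed-length code, which can be checked in Python (the helper name is mine):

```python
import math

def min_bits(n):
    """Minimum bits needed to distinguish n symbols: log2(n), rounded up."""
    return math.ceil(math.log2(n))

print(min_bits(2))    # 1
print(min_bits(12))   # 4: the 12-symbol example of slide 10 needs 4 fixed-length bits
print(min_bits(128))  # 7: ASCII's 128 codes fit exactly in 7 bits
```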

  16. Unicode • Administered by the Unicode Consortium • Assigns a unique code to every written symbol (21 bits: 2,097,152 codes) • UTF-32: four-byte fixed-length code • UTF-16: two- or four-byte variable-length code • UTF-8: one- to four-byte variable-length code • ASCII block (one byte) + Basic Multilingual Plane (2–3 bytes) + supplementary planes (4 bytes)
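The differing byte lengths of the three encoding forms can be observed with Python's str.encode (a sketch; the sample characters are my choices for the ASCII, BMP, and supplementary ranges):

```python
# Byte lengths of the same character under UTF-8, UTF-16, and UTF-32.
for ch in ["A", "é", "€", "𝄞"]:   # ASCII, Latin-1, BMP, supplementary plane
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes
          len(ch.encode("utf-16-be")),  # 2, 2, 2, 4 bytes
          len(ch.encode("utf-32-be")))  # always 4 bytes
```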
