Chapter 3 Data Representation Text Characters


Representing Text • To represent a text document in digital form, we need to be able to represent every possible character that may appear. • There is a finite number of characters to represent, so the general approach is to list them all and assign each a binary string.

Representing Text • A character set is a list of characters and the codes used to represent each one. • In 1960, a survey revealed 60 different character sets in use. • At IBM alone there were 9 different sets. • By agreeing to use ONE particular character set, computer manufacturers have made the processing of text data easier.

The ASCII Character Set • ASCII stands for American Standard Code for Information Interchange. • The ASCII character set originally used seven bits to represent each character, allowing for 128 unique characters. • Wikipedia has an excellent entry on ASCII.

The ASCII Character Set (7 bit)

The ASCII Character Set • Notice the organisation of the ASCII table. • The table divides in half according to the MSB. • Letters are all in the second half so all codes for alphabetic characters start with 1. • This second half of the table divides in half again according to the next bit: • UPPERCASE letters start 10. • lowercase letters start 11. • The first half of the table also divides in half according to the next bit: • Control characters start 00. • Numerals and punctuation start 01.
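The quarter-by-quarter layout described above can be checked with a short Python sketch. The function name ascii_group and the region labels are illustrative assumptions, not part of the slides. Note that each "letter" quarter also holds a few non-letters (for example '@' and '['), so the labels name regions of the table rather than guaranteeing a letter.

```python
def ascii_group(ch: str) -> str:
    """Classify a 7-bit ASCII character by its two most significant bits.

    Hypothetical helper for illustration; labels follow the slide's
    four-way split of the table.
    """
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    top_two = code >> 5  # drop the low 5 bits, keeping bits 6 and 5
    regions = {
        0b00: "control",
        0b01: "numerals and punctuation",
        0b10: "uppercase region",
        0b11: "lowercase region",
    }
    return regions[top_two]


for ch in ["\n", "7", "A", "a"]:
    print(repr(ch), format(ord(ch), "07b"), ascii_group(ch))
```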

The ASCII Character Set • Note that control characters (the first 32 in the ASCII character set) do not have simple character representations that you could print to the screen. • Many, however, perform actions with which you are familiar. • Some have their own keys, others need to be constructed.

The ASCII Character Set: Control Characters Control sequences are created by holding the Ctrl key (control) and pressing a letter. This has the effect of subtracting 64 from the ASCII value of the uppercase letter pressed. For example: • ‘M’ has ASCII value 77 (1001101 in binary), • Ctrl-M has ASCII value 13 (0001101 in binary). Alternatively, we can see this as “masking bit 6,” that is, clearing the most significant bit of the 7-bit code.
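As a minimal sketch of the Ctrl arithmetic, the Python below clears bit 6 of an uppercase letter's code and confirms that this matches subtracting 64. The helper name ctrl is an assumption made for illustration. It models only the arithmetic, not how any particular keyboard or terminal behaves.

```python
def ctrl(letter: str) -> int:
    """Return the ASCII value produced by Ctrl plus the given letter.

    Hypothetical helper: the letter is treated as uppercase, then
    bit 6 (value 64) is cleared with a mask.
    """
    code = ord(letter.upper())
    return code & 0b0111111  # keep bits 0-5, clear bit 6


print(ord("M"), format(ord("M"), "07b"))    # 77 1001101
print(ctrl("M"), format(ctrl("M"), "07b"))  # 13 0001101
```

Ctrl-M gives 13, the carriage return, and the same mask maps ‘I’ to 9 (horizontal tab) and ‘J’ to 10 (line feed).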

The ASCII Character Set: Common Control Characters

The ASCII Character Set Coding letters in ASCII is easy. Let’s look at ‘j’ as an example: Since ‘j’ is a letter, its code starts with a 1. Since it’s lowercase, the next bit is also a 1. Since it’s the tenth letter of the alphabet the rest of the code is 01010. The complete ASCII code for ‘j’ is 1101010.
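The three steps used for ‘j’ can be written as a small Python sketch and compared against the code that Python itself reports. The function name letter_code is a hypothetical helper, not something from the slides.

```python
def letter_code(ch: str) -> str:
    """Build the 7-bit ASCII code of a letter from the slide's three rules."""
    if len(ch) != 1 or not (ch.isascii() and ch.isalpha()):
        raise ValueError("expected a single ASCII letter")
    letter_bit = "1"                         # all letters start with 1
    case_bit = "1" if ch.islower() else "0"  # 11 = lowercase, 10 = uppercase
    position = ord(ch.lower()) - ord("a") + 1  # 'a' is 1, 'j' is 10
    return letter_bit + case_bit + format(position, "05b")


print(letter_code("j"))          # 1101010
print(format(ord("j"), "07b"))   # 1101010, the same code from Python's ord()
```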

The ASCII Character Set • ASCII evolved so that eight bits are used. • The 7-bit codes were simply prefixed with another bit, giving another natural doubling. • The original 7-bit codes were padded with 0. • So the code for ‘j’ became 01101010. • 128 new characters were added. The codes for this alternate character set start with 1.
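A brief Python sketch of the padding step: the 7-bit code is written in eight bits with a leading 0, and the leading bit then tells the original characters apart from the 128 added ones. The names to_8bit and is_original_ascii are assumptions made for illustration.

```python
def to_8bit(ch: str) -> str:
    """Return the 8-bit form of a 7-bit ASCII character (padded with 0)."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    return format(code, "08b")


def is_original_ascii(byte: int) -> bool:
    """True when the top bit of an 8-bit code is 0 (original 7-bit set)."""
    return byte & 0b10000000 == 0


print(to_8bit("j"))            # 01101010
print(is_original_ascii(106))  # True
print(is_original_ascii(233))  # False: a code from the added half
```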

The Unicode character set • Even the extended version of the ASCII character set is not enough for international use. • The Unicode character set, as originally designed, uses 16 bits per character. With 16 bits it can represent 2^16, or 65,536, characters. • Unicode was designed to be a superset of ASCII. That is, the first 256 characters in the Unicode character set correspond exactly to the extended ASCII character set (specifically the Latin-1 version, ISO 8859-1).
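The superset relationship can be illustrated with a short Python sketch. It assumes the extended ASCII set in question is Latin-1 (ISO 8859-1), which is the set whose 256 codes match the first 256 Unicode code points. The function name as_16_bits is a hypothetical helper.

```python
def as_16_bits(ch: str) -> str:
    """Return a character's Unicode code point as a 16-bit binary string."""
    code = ord(ch)
    if code >= 2 ** 16:
        raise ValueError("code point does not fit in 16 bits")
    return format(code, "016b")


print(2 ** 16)           # 65536 distinct 16-bit codes
print(as_16_bits("j"))   # 0000000001101010: the ASCII code, zero-padded

# Every 8-bit Latin-1 code decodes to the Unicode character with the same number.
same = all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
print(same)              # True
```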

Examples of Unicode Characters Figure 3.6 A few characters in the Unicode character set
