
Computer Structure



Presentation Transcript


  1. Computer Structure, Recitation 1: Character Representation in Hardware

  2. A programmer who doesn't care about character encoding is not much better than a medical doctor who doesn't believe in germs. Joel Spolsky

  3. Introduction • Computers are considered "number crunchers". • Humans work with characters. • Character data is not just alphabetic characters; it also includes digits, punctuation, spaces, etc. Most keys on the central part of the keyboard (except modifier keys such as Shift and Caps Lock) produce characters. • Everything a computer represents is stored as binary sequences. • Standard encodings assign a binary sequence to each character. Tamar Shrot, Noam Hazon

  4. Introduction (2) • The two's complement method is used to represent integers because it has convenient mathematical properties. • Character data has no such properties, so the assignment of binary codes to characters is somewhat arbitrary. • The most common character representation is ASCII, which stands for American Standard Code for Information Interchange. • The ASCII code defines which character each binary sequence represents.
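The mapping from characters to binary sequences can be inspected directly. The following is a minimal sketch; the helper name `ascii_bits` is made up for illustration and is not part of the slides.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit ASCII pattern of a single ASCII character.

    Hypothetical helper for illustration only.
    """
    code = ord(ch)  # the number assigned to the character
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "07b")  # zero-padded 7-bit binary string


print(ascii_bits("A"))  # 1000001 (decimal 65)
print(ascii_bits(" "))  # 0100000 (decimal 32, the space character)
```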

  5. The ASCII code

  6. The ASCII code (2) • There are two reasons to use ASCII: • It is a way to represent characters. • It is an accepted standard. • Each character that needs to be represented gets a different bit pattern. • A nice property: the lowercase letters are contiguous, and so are the uppercase letters and the digits. Applications: • 'a' < 'b'; 'A' < 'B'; '0' < '1'. • 'a' - 'A' = 'b' - 'B' = ... = 'z' - 'Z' = 32. • '1' - '0' = 1 - 0.
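Because the letter and digit ranges are contiguous, case conversion and digit parsing reduce to simple arithmetic on character codes. A small sketch follows; the function names `to_upper` and `digit_value` are illustrative, not taken from the slides.

```python
def to_upper(ch: str) -> str:
    """Convert an ASCII lowercase letter to uppercase by subtracting 32."""
    if "a" <= ch <= "z":           # valid because 'a'..'z' are contiguous
        return chr(ord(ch) - 32)   # 'a' - 'A' == 32
    return ch


def digit_value(ch: str) -> int:
    """Numeric value of an ASCII digit character: '7' - '0' == 7."""
    if not "0" <= ch <= "9":
        raise ValueError("not an ASCII digit")
    return ord(ch) - ord("0")


print(to_upper("q"), digit_value("7"))
```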

  7. The ASCII code (3) • Note: • 'a' ≠ 'A'. • 0 ≠ '0' ('0' = 48). • Codes 0 through 31 are generally not printable (they are control characters that affect how text is processed). Code 32 is the space character. • There are 128 (= 2^7) ASCII characters. • The eighth bit was often used as a parity bit to detect transmission errors.
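A parity bit in the eighth position can be sketched as follows. The slide does not say whether even or odd parity was used; even parity is assumed here, and both function names are hypothetical.

```python
def add_even_parity(code: int) -> int:
    """Set bit 7 so that the whole byte has an even number of 1 bits.

    Assumes even parity; the slides do not specify the convention.
    """
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit ASCII code")
    parity = bin(code).count("1") % 2  # 1 if the count of ones is odd
    return code | (parity << 7)


def parity_ok(byte: int) -> bool:
    """A received byte passes the check if its count of 1 bits is even."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity(ord("C"))   # 'C' = 1000011 has three 1 bits
corrupted = sent ^ 0b00000100      # flip a single bit "in transit"
print(parity_ok(sent), parity_ok(corrupted))
```

A single flipped bit is detected, but two flipped bits cancel out, so parity only catches an odd number of bit errors.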

  8. ASCII's disadvantage • The greatest disadvantage: it is biased toward the English character set. • Missing: • Mathematical symbols. • Letters of other European languages (and of Hebrew). • Solution: use the 8th bit as well (Extended ASCII). This raises the limit to 256 characters, which is plenty for most alphabet-based languages.

  9. Extended ASCII • Problems: • Not enough for East Asian languages, whose writing systems use thousands of characters. • Cannot combine more than one language: the same byte is é under one code page and ג under another, which garbles e-mail sent from France to Israel and vice versa. • Code pages: different character encodings that are identical only in the first 128 codes (the ASCII part). • Works reasonably well in small networks where everyone uses the same code page. • Problem: the Internet!
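The é = ג clash can be reproduced with two DOS-era code pages. The slide does not name the code pages it has in mind; CP437 (Western) and CP862 (Hebrew) are assumed here because they map byte 0x82 to exactly these two characters.

```python
raw = bytes([0x82])  # one byte above the ASCII range

as_western = raw.decode("cp437")  # assumed Western code page
as_hebrew = raw.decode("cp862")   # assumed Hebrew code page
print(as_western, as_hebrew)      # same byte, two different characters

# The lower 128 codes (the ASCII part) decode identically in both.
ascii_part = bytes(range(128))
same_low_half = ascii_part.decode("cp437") == ascii_part.decode("cp862")
print(same_low_half)
```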

  10. Unicode • An effort to create a single character set that includes every reasonable writing system. • The original encoding uses 2 bytes to represent a character. • ASCII characters: one byte holds the ASCII code and the other byte is zero. • Other characters: both bytes are used. • This encoding is UCS-2 (2-byte Universal Character Set), the predecessor of UTF-16, which extends it with 4-byte sequences for characters that do not fit in 2 bytes. Its disadvantages: • Endianness. • Doubles the size of English text files. • Not compatible with existing ASCII files.
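The size doubling is easy to observe. Python has no codec named UCS-2, so this sketch uses `utf-16-be`, which produces the same bytes as big-endian UCS-2 for the characters shown; that substitution is an assumption of the example.

```python
text = "Hello"

ascii_bytes = text.encode("ascii")
two_byte = text.encode("utf-16-be")  # stands in for big-endian UCS-2 here

print(ascii_bytes)  # one byte per character
print(two_byte)     # a zero byte precedes every ASCII code
print(len(ascii_bytes), len(two_byte))
```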

  11. Endianness • Once characters are stored in more than one byte, the byte order (big-endian / little-endian) matters! • This causes problems when transferring files between different computers. • Solution: the Unicode Byte Order Mark (BOM), 0xFEFF in 16-bit Unicode. • Always place the mark at the beginning of the character stream. • If the input starts with 0xFFFE, the reader knows it must swap the two bytes of every 16-bit unit.
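Byte order mark handling can be sketched as below. The function name `decode_utf16_with_bom` is hypothetical, and the sketch simply rejects input without a mark rather than guessing.

```python
def decode_utf16_with_bom(data: bytes) -> str:
    """Pick the byte order from the leading byte order mark, then decode."""
    if data[:2] == b"\xfe\xff":   # mark arrives as FE FF: big-endian
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":   # mark arrives swapped: little-endian
        return data[2:].decode("utf-16-le")
    raise ValueError("no byte order mark; byte order is unknown")


big = b"\xfe\xff\x00H\x00i"
little = b"\xff\xfeH\x00i\x00"
print(decode_utf16_with_bom(big), decode_utf16_with_bom(little))
```

Python's own `utf-16` codec performs the same detection when decoding, so the explicit version is only meant to show the mechanism.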

  12. Unicode: cons • Yet: • Not every Unicode string has a byte order mark at the beginning. • Pure English files double in size for no reason. • Old files must be converted. • As a result, Unicode was largely ignored for several years (until 1992). • Solution: UTF-8 (8-bit Unicode Transformation Format).

  13. UTF-8 • A variable-length character encoding. • Every code point from 0 to 127 (ASCII's original codes) is stored in a single byte. • Code points 128 and above are stored in 2 to 4 bytes, depending on the code point (the original design allowed up to 6 bytes). • Outcomes: • Pure English files are identical to ASCII files. • No needless doubling of file size. • No need to convert old ASCII files. • The extra bytes allow a much richer character set. • Frequently used characters get shorter encodings.
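The variable length is visible when encoding a few sample characters. The helper `utf8_length` and the choice of samples are illustrative only.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


for ch in ["A", "\u00e9", "\u05d2", "\u20ac", "\U0001F600"]:
    # U+0041 A, U+00E9 e-acute, U+05D2 gimel, U+20AC euro sign, U+1F600 emoji
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))
```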

  14. UTF-8: how does it work? • For an ASCII character: • It is stored in one byte whose MSB is zero. • Otherwise, more than one byte is needed: • The first byte tells us how many bytes encode the character. • The first byte starts (from the MSB) with a run of ones followed by a single zero. The length of the run equals the number of bytes used to encode the character. • Each additional byte starts with the bits 10. • The remaining bits of all the bytes hold the character's code point.
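The bit layout described above can be written out by hand. This is a teaching sketch under stated limits: `utf8_encode` is a made-up name, it covers the 1 to 4 byte forms only, and it does not reject surrogate code points the way a production encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the prefix rules on the slide."""
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    print(" ".join(format(b, "08b") for b in encoded))
```

Printing the bytes in binary makes the leading run of ones in the first byte and the `10` prefix of each continuation byte easy to see.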

  15. Other encodings • There are hundreds of different encodings. • UTF-7, UTF-8, UTF-16 and UTF-32 can represent every Unicode code point, so they are the most reliable choices for languages other than English. • When passing a sequence of characters (strings, files, etc.), one must state which encoding is used. Otherwise the result is: • Gibberish. • Question marks. • Wrong representation of some characters.
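Both failure modes are easy to reproduce. In this sketch the "wrong" encodings, Latin-1 and ASCII, are picked for illustration; any mismatch between writer and reader behaves similarly.

```python
original = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word "shalom"
data = original.encode("utf-8")

# Reader assumes the wrong encoding: every byte becomes an unrelated character.
gibberish = data.decode("latin-1")

# Writer targets an encoding that lacks the characters: they turn into '?'.
question_marks = original.encode("ascii", errors="replace")

print(repr(gibberish))
print(question_marks)

# Decoding with the encoding that was actually used recovers the text.
print(data.decode("utf-8") == original)
```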

  16. Standards • E-mail header: • Content-Type: text/plain; charset="UTF-8" • Web page: a meta tag inside the head element: • <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> …
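As one way to see the e-mail header being produced, Python's standard `email` package declares the charset when text content is set. This is a sketch of that library's behaviour, not something the slides prescribe.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Encoding test"
msg.set_content("\u05e9\u05dc\u05d5\u05dd, world")  # non-ASCII body text

# The library adds a Content-Type header that names the charset.
print(msg["Content-Type"])
print(msg.get_content_type(), msg.get_content_charset())
```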

  17. Libraries for managing encodings • Many libraries support different character encodings. For example: • iconv, or libiconv, a more stable implementation (mostly Unix). • The codecs module (Python). • International Components for Unicode (ICU), with libraries for C/C++ and Java. • UTF8-CPP (C++).
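A brief look at the Python `codecs` module mentioned above. The sketch uses only `codecs.lookup`, `codecs.encode`, `codecs.decode`, and `codecs.getincrementaldecoder`; the sample text is arbitrary.

```python
import codecs

# Look up a codec by any of its aliases; the canonical name is reported.
info = codecs.lookup("utf8")
print(info.name)

# One-shot conversion between text and bytes.
encoded = codecs.encode("\u00e9", "utf-8")
decoded = codecs.decode(encoded, "utf-8")

# Incremental decoding copes with a character split across two chunks,
# as happens when reading from a network stream.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(encoded[:1])   # incomplete sequence: nothing yet
second = decoder.decode(encoded[1:])  # sequence completed
print(repr(first), repr(second))
```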
