
Character Sets Logins & Terminal Types

Character Sets Logins & Terminal Types. Jarret Raim 2004-04-20. Definitions. Character Repertoire A set of characters where no internal presentation in computers or data transfer is assumed. Does not define an ordering for the characters.


Presentation Transcript


  1. Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

  2. Definitions • Character Repertoire • A set of characters where no internal presentation in computers or data transfer is assumed. • Does not define an ordering for the characters. • Usually defined by specifying names of characters and a sample (or reference) presentation of characters in visible form.

  3. Definitions • Character Code • Defines a one-to-one correspondence between characters in a repertoire and a set of nonnegative integers. Each such integer is called a code position. • Aka: code number, code value, code element, code point, code set value - and just code. • Note: The set of nonnegative integers corresponding to characters need not consist of consecutive numbers.

  4. Definitions • Character Encoding • A method (algorithm) for presenting characters in digital form by mapping sequences of code numbers of characters into sequences of octets. • In the simplest case, each character is mapped to an integer in the range 0 - 255 according to a character code, and these integers are used directly as octets. • E.g.: 7-bit ASCII, 8-bit ASCII, UCS, Unicode, UTF-8, UTF-16, etc.

  5. ASCII & Friends • The original ASCII is a 7-bit encoding using 0-127 to define basic US characters. • ISO Latin 1 (ISO 8859-1) is an 8-bit extension of ASCII that adds Western European characters. • Both contain control codes as well as text.

  6. More ASCII Love • Even basic 7-bit ASCII is not safe • Many “national variants” of ASCII replace some characters with national letters and symbols. • Safe ASCII Characters • 0-9 • A-Z and a-z • ! " % & ' ( ) * + , - . / : ; < = > ? • Space
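
The safe set listed on this slide can be checked mechanically. The following is a minimal sketch in C, assuming the slide's list is complete and that the program itself is compiled with an ASCII-based execution character set. The name `is_safe_ascii` is hypothetical, not a library function.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if c belongs to the "safe" ASCII set from the slide:
 * digits, unaccented Latin letters, space, and the listed punctuation.
 * The range comparisons assume an ASCII-based execution character set
 * (they would not hold on EBCDIC, where letters are not contiguous). */
bool is_safe_ascii(unsigned char c)
{
    if (c >= '0' && c <= '9') return true;
    if (c >= 'A' && c <= 'Z') return true;
    if (c >= 'a' && c <= 'z') return true;
    if (c == ' ') return true;
    /* strchr() also matches the terminating NUL, so reject 0 explicitly. */
    return c != '\0' && strchr("!\"%&'()*+,-./:;<=>?", c) != NULL;
}
```

With this definition, characters such as `#`, `$`, `@`, `[` and `~` are reported as unsafe, since they are not in the slide's list.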

  7. Other Ridiculousness • Other 8-Bit ASCII Extensions • DOS Code Pages • Macintosh Character Codes • IBM’s EBCDIC (Mainframes) • Windows did not conform to any known standards until NT switched over to using Unicode encoded as UTF-16.

  8. The Solution: Unicode • Unicode is a practical description of the ISO 10646 standard, known as UCS or the Universal Character Set. • The code space runs from U+0000 to U+10FFFF, which gives 1,114,112 code points. • As of Feb 2000, there were 49,194 characters. • Unicode does not prescribe a single encoding • Several encodings exist

  9. Encodings For Unicode • Most Common: UTF-8 • Character codes less than 128 (effectively, the ASCII repertoire) are presented "as such", using one octet for each code. • All other codes are encoded as sequences of 2 to 4 octets, each with the high bit set to 1 (the original UTF-8 design allowed up to 6 octets; RFC 3629 limits it to 4). • Allows space savings and compatibility at the cost of implementation complexity.
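
To make the multi-octet mechanism concrete, here is a sketch of a UTF-8 encoder for a single code point. It follows the RFC 3629 form of UTF-8, which limits sequences to 4 octets and excludes surrogates. The function name `utf8_encode` is illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out (room for 4 octets).
 * Returns the number of octets written, or 0 if cp cannot be encoded
 * (a surrogate, or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, `utf8_encode(0x20AC, buf)` (the euro sign) writes the three octets E2 82 AC, while `utf8_encode(0x41, buf)` writes the single octet 41, identical to ASCII.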

  10. Unicode Complexity • The same visible character can be encoded in multiple ways. • Π can be: • GREEK CAPITAL LETTER PI (U+03A0) • N-ARY PRODUCT (U+220F) • Ä can be encoded as: • LATIN CAPITAL LETTER A WITH DIAERESIS (U+00C4) • The letter A (U+0041) followed by COMBINING DIAERESIS (U+0308) • All characters can be represented by the U+nnnn notation. • Other Implementations • UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE
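
The practical consequence of the two spellings of Ä is that a byte-wise comparison treats them as different strings. A small sketch, with the UTF-8 octets written out by hand; the identifiers are made up for illustration:

```c
#include <string.h>

/* U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS, as UTF-8 (2 octets). */
const char A_UMLAUT_PRECOMPOSED[] = "\xC3\x84";

/* U+0041 LATIN CAPITAL LETTER A followed by
 * U+0308 COMBINING DIAERESIS, as UTF-8 (1 + 2 octets). */
const char A_UMLAUT_COMBINING[] = "A\xCC\x88";

/* Byte-wise equality, which is all that strcmp() can see. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

`same_bytes(A_UMLAUT_PRECOMPOSED, A_UMLAUT_COMBINING)` returns 0 even though both render as Ä. Treating them as equal requires Unicode normalization, which the C standard library does not provide.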

  11. Unicode Implementation • Level 1 • Combining characters and Hangul Jamo characters are not supported. [Hangul Jamo are required to fully support the Korean script including Middle Korean.] • Level 2 • Like level 1, except limited combining characters are supported for some languages. • Level 3 • All UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any character.

  12. Programming Languages • Special Data Types for Unicode • Ada95, Java, TCL, Perl, Python, C# and others. • ISO C 90 • Specifies mechanisms to handle multi-byte encoding and wide characters. • The type wchar_t, a signed 32-bit integer on Linux with glibc (its width is implementation-defined), can be used to hold Unicode characters. • ISO C 99 • Some problems with backwards compatibility. • The C compiler can signal to an application, by defining the macro __STDC_ISO_10646__, that wchar_t is guaranteed to hold UCS values in all locales.
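
A program can test the C99 guarantee at compile time. The sketch below assumes only the standard macro `__STDC_ISO_10646__`; the two function names are hypothetical helpers.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t values are
 * ISO 10646 (UCS) code points in every locale, 0 otherwise.
 * C99 implementations signal the promise by defining
 * __STDC_ISO_10646__. */
int wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Width of wchar_t in bits. The value is implementation-defined,
 * so code should query it rather than assume 32. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Code that stores arbitrary Unicode code points in `wchar_t` is only portable where `wchar_is_ucs()` returns 1 and `wchar_bits()` is at least 21.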

  13. Using Unicode in Linux • Most newer distributions have standardized on UTF-8. • RedHat 8.0, SuSE 8.1, etc. • Most applications only have a Level 1 implementation. • Terminals, output, etc. • Requires hand tuning for UTF-8 • Grep without hand tuning was 100x slower in multi-byte mode than in single-byte mode.

  14. Using Unicode • Libraries must support Unicode formats. • New strlen() definitions: • Number of bytes • Number of characters • Display width (# of cursor positions) • Applications must pay attention to the locale setting for UTF-8 activation. • Do NOT use command switches (e.g. -u8)
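
The difference between the first two notions of length can be shown in a few lines. This sketch counts characters in a UTF-8 string by skipping continuation octets; it assumes the input is valid UTF-8, and `utf8_char_count` is an illustrative name. Display width is left out because it depends on the locale and on functions such as POSIX `wcwidth()`.

```c
#include <stddef.h>
#include <string.h>

/* Number of code points in a UTF-8 string.
 * Every octet that is NOT a continuation octet (bit pattern 10xxxxxx)
 * starts a new character. Assumes s holds valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "Grüß" encoded as UTF-8 (`"Gr\xC3\xBC\xC3\x9F"`), `strlen()` reports 6 bytes while `utf8_char_count()` reports 4 characters.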

  15. Unicode Functions • Setting the locale • setlocale(LC_NUMERIC, "de_DE.UTF-8"); • Makes libc format and parse numbers using German notation (locale names are system-specific; this is the usual Linux spelling). • gettext() returns the translations of typical strings. • printf(gettext("Hello, world\n")); /* get the translation for the "Hello, world\n" string */ • No more reliance on the underlying numerical representation of ASCII. • BAD: l = c - 'A' + 'a'; • GOOD: l = tolower(c);
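
The two habits from this slide, taking the locale from the environment and using the ctype functions instead of arithmetic on character codes, might look like this in practice. It is a sketch only: both function names are made up, and the case conversion handles single-byte characters only.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Adopts the locale named by the environment (LANG, LC_ALL, ...).
 * Returns 1 on success, 0 if the requested locale is not installed. */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Lower-cases a string in place with the locale-aware tolower()
 * rather than "c - 'A' + 'a'". Only single-byte characters are
 * converted; multi-byte text would need towlower() on wide characters. */
void lowercase_in_place(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)tolower((unsigned char)*s);
}
```

The cast to `unsigned char` matters: passing a negative `char` value to `tolower()` is undefined behavior, which is easy to trigger with 8-bit text.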

  16. Remaining Unicode Issues • Determining Implementation Levels • Font Support • No system will be able to display all Unicode characters. • Printing • Some conversions for UTF-8 to PS have been written. • Hard Coded Conventions • Masking a string to convert to upper-case, etc.

  17. Unicode on the Web • The character encoding should be specified in a MIME header for ALL communications, internal and external. • The headers themselves are sent in ASCII.
      X-Mailer: Mozilla 4.0 [en] (Win95; I)
      MIME-Version: 1.0
      To: jkorpela@cs.tut.fi
      Subject: Test
      X-Priority: 3 (Normal)
      Content-Type: text/plain; charset=x-UNICODE-2-0-UTF-7
      Content-Transfer-Encoding: 7bit
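
A receiver has to pull the charset out of the Content-Type header before it can decode the body. The sketch below is deliberately simplified: it assumes a lowercase, unquoted `charset=` parameter, whereas real MIME parsing must also handle case-insensitive names and quoted values. `mime_charset` is a hypothetical helper.

```c
#include <stddef.h>
#include <string.h>

/* Copies the value of the charset parameter of a Content-Type header
 * value into out. Returns 1 if a charset was found and fits, else 0.
 * Simplification: expects lowercase "charset=" and no quoting. */
int mime_charset(const char *content_type, char *out, size_t out_size)
{
    const char *p = strstr(content_type, "charset=");
    size_t n;

    if (p == NULL || out_size == 0)
        return 0;
    p += strlen("charset=");
    n = strcspn(p, "; \t\r\n");     /* value ends at ';' or whitespace */
    if (n == 0 || n >= out_size)
        return 0;
    memcpy(out, p, n);
    out[n] = '\0';
    return 1;
}
```

Applied to the header value from the slide, `text/plain; charset=x-UNICODE-2-0-UTF-7`, the function yields `x-UNICODE-2-0-UTF-7`.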

  18. Unicode in DNS • Current DNS only supports ASCII domain names. • iDNS is a program that allows international characters to be used in domain names. • Converts native-language names into roman (ASCII) characters. • iDNS complies with the Row-based ASCII Compatible Encoding (RACE) and fully supports UTF-5, UTF-8 and other local encodings.

  19. Terminal Emulation • Terminal emulation allows clients on differing operating systems to talk to a central server via a defined language for control characters and text. • Almost all terminal emulators speak ASCII • Some still use IBM’s EBCDIC. • As we know, there are major problems with ASCII internationalization.

  20. Serial Devices • Serial ports are one of the most useful ports on a Linux system. • Uses: • Terminals • Printers • Custom Connections • Media Changers, Temperature Sensors, Sewing Machines, etc.

  21. Serial Protocols • RS-232 (EIA-232-E) • Specification that defines the meaning of the signals on each wire. • Normally DB-9 or DB-25. • Two Interfaces • DCE (Data Communications Equipment) • DTE (Data Terminal Equipment) • Many alternative connectors • DIN-8, RJ-45, etc.

  22. Serial In Linux • Character Device Files • /dev/ttyS# • /dev/cua# (for historical reasons, deprecated) • setserial • Sets the serial parameters • E.g. setserial /dev/ttyS1 port 0x02f8 irq 3 • In RedHat • /etc/rc.d/rc.sysinit reads /etc/rc.serial
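
setserial deals with the hardware side of a port (I/O address, IRQ). Line settings such as speed and framing are normally chosen by the application through the POSIX termios interface. The following sketch prepares a 9600 baud, 8N1 configuration; the choice of 9600 baud and the function name `serial_9600_8n1` are assumptions made for illustration.

```c
#include <string.h>
#include <termios.h>

/* Fills *tio with a 9600 baud, 8 data bits, no parity, 1 stop bit
 * configuration. Returns 0 on success, -1 on failure.
 * The caller applies it to an open port, e.g. a descriptor for
 * /dev/ttyS1, with tcsetattr(fd, TCSANOW, tio). */
int serial_9600_8n1(struct termios *tio)
{
    memset(tio, 0, sizeof *tio);
    /* 8 data bits, receiver enabled, modem control lines ignored.
     * Parity and the second stop bit stay off because PARENB and
     * CSTOPB are not set. */
    tio->c_cflag = CS8 | CREAD | CLOCAL;
    tio->c_cc[VMIN] = 1;    /* read() blocks until one byte arrives */
    tio->c_cc[VTIME] = 0;   /* no inter-byte timeout */
    if (cfsetispeed(tio, B9600) != 0 || cfsetospeed(tio, B9600) != 0)
        return -1;
    return 0;
}
```

Keeping the configuration separate from the `open()` and `tcsetattr()` calls makes it possible to inspect the settings without access to a real serial device.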

  23. Hardwired Terminals • The init process spawns a terminal process (a getty) for each terminal port in /etc/inittab.
      # Run gettys in standard runlevels
      1:2345:respawn:/sbin/mingetty tty1
      2:2345:respawn:/sbin/mingetty tty2
      3:2345:respawn:/sbin/mingetty tty3
      4:2345:respawn:/sbin/mingetty tty4
      5:2345:respawn:/sbin/mingetty tty5
      6:2345:respawn:/sbin/mingetty tty6
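
Each inittab entry has four colon-separated fields: id, runlevels, action and process. A sketch of how such a line could be split is shown below. The struct, the field sizes and the function name are assumptions for illustration, and the parser rejects entries with an empty field, which real inittab files do allow.

```c
#include <stdio.h>
#include <string.h>

/* One /etc/inittab entry: id:runlevels:action:process */
struct inittab_entry {
    char id[8];
    char runlevels[16];
    char action[32];
    char process[128];
};

/* Splits a line into its four fields. Returns 0 on success and -1 for
 * comments, blank lines, or anything that does not have four non-empty
 * fields (a limitation of this sketch). */
int parse_inittab_line(const char *line, struct inittab_entry *e)
{
    /* %[^:] reads up to the next colon; the last field runs to the
     * end of the line and may itself contain spaces. */
    int n = sscanf(line, "%7[^:]:%15[^:]:%31[^:]:%127[^\n]",
                   e->id, e->runlevels, e->action, e->process);
    return n == 4 ? 0 : -1;
}
```

For the first getty line on the slide, the fields come out as id `1`, runlevels `2345`, action `respawn` and process `/sbin/mingetty tty1`: init keeps restarting mingetty on tty1 in runlevels 2 through 5.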

  24. Login Sequence • getty prints the contents of /etc/issue and a login prompt. • getty reads the username and executes the login program with it. • login prompts for the password, verifies it against /etc/shadow, prints the motd and starts the user's shell. • The shell runs startup scripts and waits for input.

  25. Terminal Support • Linux supports many terminal types through a database of terminal capabilities. • termcap - /etc/termcap • terminfo - /usr/share/terminfo • Programs look at the TERM environment variable to determine the terminal type. • Special Characters • CTRL-? vs. CTRL-H (which code the Backspace key sends) • stty command • tset command
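
Reading TERM is a one-line job, but programs should cope with it being unset. A minimal sketch, assuming a fallback to the `dumb` terminal type; the fallback choice and the name `terminal_type` are illustrative:

```c
#include <stdlib.h>

/* Returns the terminal type from the TERM environment variable,
 * or "dumb" (a minimal terminal description) when TERM is unset
 * or empty. The returned pointer must not be modified or freed. */
const char *terminal_type(void)
{
    const char *term = getenv("TERM");
    return (term != NULL && term[0] != '\0') ? term : "dumb";
}
```

Libraries such as curses do this lookup internally and then load the matching entry from the termcap or terminfo database.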

  26. More Terminals • Terminals can be left in a badly broken state • cat’ing a binary file • vi or emacs crashing and not restoring the terminal state • Use reset or stty sane to correct. • Modems can be used to transmit terminal information. • USB is the successor to the serial port.
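
Full-screen programs avoid leaving the terminal broken by saving its settings on startup and restoring them on exit. A sketch using the POSIX termios calls follows; the wrapper names `term_save` and `term_restore` are hypothetical.

```c
#include <termios.h>

/* Saves the current settings of the terminal behind fd.
 * Returns 0 on success, -1 if fd is not a terminal or is invalid. */
int term_save(int fd, struct termios *saved)
{
    return tcgetattr(fd, saved);
}

/* Restores settings previously captured by term_save().
 * TCSANOW applies the change immediately.
 * Returns 0 on success, -1 on failure. */
int term_restore(int fd, const struct termios *saved)
{
    return tcsetattr(fd, TCSANOW, saved);
}
```

A program would call `term_save(STDIN_FILENO, &saved)` before switching to raw mode and `term_restore(STDIN_FILENO, &saved)` from its exit path. When a program crashes before it can restore, `reset` or `stty sane` remain the fallback.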

  27. Conclusions • Unicode is complex, but using UTF-8 allows a programmer to get many of the benefits of internationalization for free. • Using STL data structures and other Unicode aware libraries will significantly reduce the pain of using Unicode (kiss char* goodbye). • Assume that there will be problems with internationalization.

  28. Sources • Quick Overview • http://www.linuxjournal.com/article.php?sid=3327 • Longer Overview • http://turnbull.sk.tsukuba.ac.jp/Tools/I18N/LJ-I18N.html • Sun’s Internationalization Reference • http://developers.sun.com/dev/gadc/educationtutorial/creference/sampfiles/sampfiles.html • Programming for Internationalization FAQ • http://www.cs.uu.nl/wais/html/na-dir/internationalization/programming-faq.html • Plus many more.
