Introduction to Character Encodings, Java and You


Agenda

  • Defining the problem

    • Where webMethods products encounter character set problems.

    • What the symptoms look like.

  • Understand core concepts

    • What is a character set? What’s an encoding?

    • What is Unicode, really?

  • Code Examples to avoid problems

Private and Confidential


Confusion Reigns

  • Generally, the most confusing aspect of internationalization.

    • Many, many standards to choose from.

    • Arcane terminology

    • American programmers rarely seem to encounter it head-on.

  • We’re presenting this because many of our products are encountering this problem now.

Problem Domain

  • webMethods products interface with:

    • non-Java systems (for example, in the adapters)

    • non-Java environments (file systems, databases, libraries, email, ftp, http, etc.).

Java’s Text Representation

  • Java provides a convenient text processing architecture centered on the Java String object.

    • A Java String is basically an array of Java characters (char values).

Java Characters

  • Each Java char (wrapped by the Character class) represents a Unicode character.

    • (Currently) a 16-bit unsigned integer value between 0 and 65,535.

    • Character class provides access to character properties.

      • UPPER, lower, and Titlecase mapping

      • Comparison

      • Directionality

      • Compatibility

      • C-TYPE values such as ‘alpha-ness’, ‘digit-ness’, ‘alphanumeric-ness’

Non-Java Text

  • Non-Java files, applications, filesystems, databases, et al. typically do not use Unicode. Java sees their text as an array of bytes (byte[]).

Three Problems

Bad Conversion

  • The target character set doesn’t have this character in it. Java replaces each such character with a “?”.

  • Input String: 日本語

  • Output String: ???

  • Typically:

    • Using the default encoding when we meant to specify one.

    • Writing on a device (such as System.out) whose legacy encoding doesn’t support the characters.
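The “???” symptom is easy to reproduce. The following is a minimal sketch, not code from the original deck: the class and method names are made up for illustration, and ISO-8859-1 stands in for any target encoding that lacks the Japanese characters.

```java
import java.nio.charset.StandardCharsets;

class BadConversionDemo {
    // Encode to a charset that lacks the characters, then decode the bytes back.
    static String roundTripLatin1(String text) {
        // ISO-8859-1 has no Japanese characters, so getBytes() substitutes
        // the charset's replacement byte, '?' (0x3F), for each one.
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u65E5\u672C\u8A9E is 日本語, written with escapes so the source file's
        // own encoding cannot interfere.
        System.out.println(roundTripLatin1("\u65E5\u672C\u8A9E")); // prints ???
    }
}
```

The original characters are gone after the conversion. No later step can recover them from the “?” bytes.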

“No Glyph”

  • Java knows what the character is and is handling it properly, but doesn’t have a picture of it to show you (in the current Font selected).

  • Input String: 日本語

  • Output String: □□□ (empty boxes where the glyphs should be)

  • Typically:

    • Nothing is wrong, just using the wrong Font.

Random Trash

  • A byte[] was converted using the wrong character encoding. Bytes were mapped to the wrong characters.

  • Input String: 日本語

  • Output String: “ú–{Œê

  • Typically:

    • Using the wrong encoding, the underlying bytes are mapped to different, random-seeming characters.
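The output string on this slide can be reproduced by writing the text as Shift-JIS and reading it back as Windows-1252. This is a sketch under assumptions: the class and method names are illustrative, and the charset names "Shift_JIS" and "windows-1252" are assumed to be available in the running JDK (they ship with standard JDK builds but are not guaranteed on every runtime).

```java
import java.nio.charset.Charset;

class RandomTrashDemo {
    // Encode with one charset, then (wrongly) decode with another.
    static String misdecode(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        String nihongo = "\u65E5\u672C\u8A9E"; // 日本語
        // Shift-JIS bytes 93 FA 96 7B 8C EA, read as Windows-1252 characters.
        System.out.println(misdecode(nihongo, "Shift_JIS", "windows-1252"));
        // Decoding with the charset that wrote the bytes gives the text back.
        System.out.println(misdecode(nihongo, "Shift_JIS", "Shift_JIS"));
    }
}
```

Unlike the “???” case, the bytes here are still intact. Decoding them again with the right encoding recovers the text.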

Examples

  • Same byte sequences, different results:

    Shift JIS byte[] = 0xE0, 0x41, 0x83, 0x70 = “漓パ”

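    Latin-1 byte[] = 0xE0, 0x41, 0x83, 0x70 = “àAƒp” (the ƒ is the Windows-1252 reading; in strict ISO-8859-1, 0x83 is a control code)

    Java String (bytes read as UTF-16) = 0xE041, 0x8370 = “荰” (U+E041 is a private-use character with no glyph)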

    Java String = “漓パ” = U+6F13 U+30D1
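The same comparison can be run directly: one byte array, several decoders. This is an illustrative sketch, not part of the original deck. The class name is invented, and "Shift_JIS" and "windows-1252" are assumed to be installed charsets.

```java
import java.nio.charset.Charset;

class SameBytesDemo {
    // The four bytes from the slide.
    static final byte[] BYTES = {(byte) 0xE0, 0x41, (byte) 0x83, 0x70};

    // Interpret the same bytes under the named character encoding.
    static String decode(String charsetName) {
        return new String(BYTES, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Shift_JIS", "windows-1252", "UTF-16BE"}) {
            String s = decode(name);
            System.out.println(name + ": " + s.length() + " chars: " + s);
        }
    }
}
```

The bytes never change. Only the mapping applied to them does, which is why a byte[] is meaningless until you know its encoding.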

Character Set Terminology


What is a Character?

  • A character is a single, atomic unit of text.

  • The definition has a different meaning according to the writing system and context.

Abstract characters

  • Some abstract characters include:

    A Roman Letter Capital A

    ` Combining Accent Grave

    に Hiragana character “ni”

    語 CJK Ideograph

    ي Arabic letter

    앚 Hangul syllable

    Ａ Fullwidth compatibility letter A

What is a Character Set?

  • A character set is a “set”: a collection of characters, usually organized in some fashion.

  • You’re probably most familiar with ASCII:

    • 0x41 ‘A’

    • 0x42 ‘B’

    • Etc.

What is a Character Encoding?

  • Character set: a collection of characters, basically, a bucket.

  • Character encoding: the specific ones and zeroes assigned to each character in a character set.

    Character Set: ‘A’ == 0x41

    Character Encoding: ‘A’ == 0x41
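For ‘A’ the two values coincide, which is what makes the distinction easy to miss. A character outside ASCII shows the difference. The sketch below is illustrative only: the class and the hex helper are hypothetical names, and é (U+00E9) is just a convenient example character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointVsBytes {
    // Hex dump helper, e.g. "C3 A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // The bytes a given encoding assigns to the text.
    static String encodedAs(String text, Charset cs) {
        return hex(text.getBytes(cs));
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // one character, number 0xE9 in the character set
        System.out.println("ISO-8859-1: " + encodedAs(eAcute, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + encodedAs(eAcute, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:   " + encodedAs(eAcute, StandardCharsets.UTF_16BE));
    }
}
```

The character's number in the set stays 0xE9 throughout. The encodings turn it into E9, C3 A9, and 00 E9 respectively.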

Eight Bit Encodings

  • 8-bit encodings allow for 256 characters.

    • 128 ASCII

    • 32 ‘C1’ controls

    • 96 extended

Latin-1

  • The standard for Western Europe is generally ISO-8859-1

  • AKA “Latin-1”

  • Used by UNIX systems and the Web.

  • Extended version used by Microsoft for Windows.

Let a Thousand Encodings Bloom…

  • Each language has its own character set…

    • Everywhere: ASCII*

    • Western European (like German or French): Latin-1

    • Eastern European (like Polish or Slovak): Latin-2

    • Simplified Chinese: GB2312

Actually, many for each language…

Other Writing Systems

  • Writing systems vary around the world (in order of increasing complexity, more or less):

    • Latin-based alphabets

      • (ABCDEFG…) English

    • Cyrillic and Greek-based alphabets

      • (АБВГДЕЖЩ...) Russian

    • Ideographic writing systems have thousands of characters

      • (一丁勺両亀困...) Japanese

    • Bi-directional (RTL) languages go right to left

      • (...זוהדגבא) Hebrew

    • Complex scripts (everything else):

      • (ऋऌऍऎ) Devanagari

Expanded Character Sets

  • Most languages have alphabetic or phonetic writing systems:

    • Russian, Greek, Slavic, (many) Native American, Bahasa, Hebrew, Arabic, Semitic, etc.: alphabetic

    • Indian (subcontinent), Thai, Japanese kana, Korean: phonetic writing systems

    • 8 bits is enough for all of the above (with some tricks)

  • Some languages use scripts based on Chinese ideographic writing (“Han” or “Hanja”):

    • Chinese

    • Korean

    • Vietnamese (traditional)

    • Japanese Kanji

“Double-Byte”

  • 8-bit character encodings use eight bits per character.

    • 2^8 = 256 characters

  • “Double-byte” character sets must be 2 bytes per character?

    • 2^16 = 65,536 characters

  • Should actually be called “multi-byte” (MBCS).

    • Each character can be ONE, TWO, THREE and sometimes FOUR bytes in length.

    • MAY involve shift states.

Multibyte Encodings

A typical Japanese Character Set:

JIS X 0208 (漢字)

Character Encodings of JIS X 0208:

Shift-JIS (CP932): 0x8A 0xBF 0x8E 0x9A

EUC-JP: 0xB4 0xC1 0xBB 0xFA

ISO 2022-JP: 0x1B 0x24 0x42 0x34 0x41 0x3B 0x7A 0x1B 0x28 0x4A

Non-Legacy:

UTF-16: 0x6F22 0x5B57
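These byte sequences can be checked from Java. The following is a rough sketch with invented class and helper names. It assumes the running JDK includes the "Shift_JIS", "EUC-JP" and "ISO-2022-JP" charsets. Java's ISO-2022-JP encoder may choose a different closing escape sequence than the one listed, so treat that line of output as informational.

```java
import java.nio.charset.Charset;

class KanjiBytesDemo {
    // Hex dump helper, e.g. "8A BF 8E 9A".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode the text with the named charset and return the bytes as hex.
    static String encode(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // 漢字
        for (String name : new String[] {"Shift_JIS", "EUC-JP", "ISO-2022-JP", "UTF-16BE"}) {
            System.out.println(name + ": " + encode(kanji, name));
        }
    }
}
```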

An MBCS Example: Shift-JIS

  • Character set used by DOS, Windows, Macs, and a few UNIX-like systems for Japanese.

    • Code Page 932

    • JIS X 0208:1997

Shift-JIS

  • In order to reach more characters, double byte values start with a limited range of “lead bytes”

  • These can be followed by a “trail byte” with a value of 0x40 or above.

Shift-JIS

  • Each “lead byte” provides a “window” onto additional characters.

Shift-JIS

  • Problems:

    • Lead byte values are also valid as trail bytes.

    • Common special characters (“\”!!) are valid trail bytes.
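The backslash problem is worth seeing once. Katakana ソ (U+30BD) encodes in Shift-JIS as 0x83 0x5C, and 0x5C is the ASCII backslash, so byte-oriented code that scans for path separators or escape characters will trip over it. This is a minimal sketch with a hypothetical class name, assuming the "Shift_JIS" charset is available.

```java
import java.nio.charset.Charset;

class ShiftJisTrailByte {
    // True if the Shift-JIS encoding of the text contains the byte 0x5C,
    // which a naive byte scanner would take for a backslash.
    static boolean containsBackslashByte(String text) {
        for (byte b : text.getBytes(Charset.forName("Shift_JIS"))) {
            if (b == 0x5C) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsBackslashByte("\u30BD")); // ソ: trail byte is 0x5C
        System.out.println(containsBackslashByte("\u30D1")); // パ: bytes are 0x83 0x70
    }
}
```

The safe approach is to decode to a String first and search for characters, never for raw byte values.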

Han

  • CJK scripts require up to 100,000 unique characters for complete representation.

    • Four major variants:

      • Traditional Chinese

      • Simplified Chinese

      • Japanese Kanji

      • Korean (non-Hangul)

“Kanji”

  • Sometimes you hear Japanese called “kanji”

    • Kanji is actually one of four writing systems used in Japan.

    • Kanji should be avoided as a generic term for DBCS.

  • Kanji (“Han” or Chinese writing): 日本語

  • Hiragana (phonetic for Japanese words): にほんご

  • Katakana (phonetic for “foreign” words): ニホンゴ

  • Romaji (“Roman script”): nihongo

Chinese

  • Upper two are Traditional.

  • Lower character is the Simplified variant.

Hangul

  • Korean Hangul is a syllabic phonetic system, which has thousands of combinations.

    • Hangul is not related to Han ideographic writing.

Code Page Hell

  • With hundreds of encodings and character sets to choose from, making internationalized code work in the late 1980s and early 1990s was “hellish”.

  • Internationalization folks referred to this as “code page hell”

Unicode and Java

To the Rescue


Unicode (ISO/IEC 10646)

  • Unicode is a character set that supports all of the world’s languages and writing systems.*

    • Originally designed as a “wide character set”: every character was represented by 16 bits. This allowed for 65,536 potential characters.

    • Extended to allow 1.1 million characters.

    • Unicode is maintained by an industry consortium. ISO/IEC 10646 is maintained by WG2. The two are kept in sync, with identical character assignments.

It’s a character set?

  • Unicode is a character set. It has these encodings:

    • UTF-32. (BE/LE)

      • A 32-bit encoding. All characters are 32 bits.

    • UTF-16. (BE/LE)

      • A 16-bit encoding. Most characters are 16 bits.

      • Characters above U+FFFF (that is, outside the “Basic Multilingual Plane”) require a pair of special “surrogate” code units.

    • UTF-8.

      • An 8-bit variable-width encoding. Characters are 1, 2, 3 or 4 bytes long. Byte order never matters, so there are no BE/LE variants.

      • ASCII == ASCII

      • All other characters have a special bit pattern
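The width differences show up directly in Java. The sketch below is illustrative: the class and method names are invented, and the three sample code points were picked only to land in different size classes. It reports how many Java chars, UTF-8 bytes and UTF-16 bytes each one occupies.

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // Returns {number of Java chars, UTF-8 byte count, UTF-16 byte count}.
    static int[] widths(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return new int[] {
            s.length(),
            s.getBytes(StandardCharsets.UTF_8).length,
            s.getBytes(StandardCharsets.UTF_16BE).length
        };
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x8A9E, 0x10348}; // 'A', 語, and a character above U+FFFF
        for (int cp : samples) {
            int[] w = widths(cp);
            System.out.printf("U+%04X: %d char(s), %d UTF-8 byte(s), %d UTF-16 byte(s)%n",
                    cp, w[0], w[1], w[2]);
        }
        // A character above U+FFFF is stored as a surrogate pair.
        String s = new String(Character.toChars(0x10348));
        System.out.println(Character.isHighSurrogate(s.charAt(0))
                && Character.isLowSurrogate(s.charAt(1)));
    }
}
```

Because of surrogate pairs, String.length() counts UTF-16 code units, not characters. Use codePointCount when the distinction matters.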

UTF-8 Bit Pattern

  • ASCII == ASCII

    • 0x41 == ‘A’

  • All other characters are multibyte.

    • 110xxxxx == two bytes

    • 1110xxxx == three bytes

    • 11110xxx == four bytes

    • 10xxxxxx == trail byte

    • U+00C0 == À == 0xC3 0x80 (11000011 10000000)
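The bit patterns above translate into a few shifts and masks. The following hand-rolled encoder is a teaching sketch only (the class name is made up, and it does no validation of surrogates or out-of-range values). In real code, use String.getBytes with StandardCharsets.UTF_8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Encoder {
    // Encode one Unicode code point using the lead/trail byte patterns.
    // Assumes a valid code point in the range 0..0x10FFFF.
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // 0xxxxxxx: ASCII stays ASCII
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        byte[] mine = encode(0x00C0); // À, expected C3 80
        byte[] jdk = "\u00C0".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(mine, jdk));
    }
}
```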

Convenience Method for UTF8

  • Almost true: readUTF and writeUTF (on DataInput/DataOutput streams) give direct access to UTF-8.

    • This is not really UTF-8, but a Sun-specific variant (“modified UTF-8”).

    • Use InputStreamReader/OutputStreamWriter to do proper conversions.
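The difference is visible with two inputs: U+0000 and any character above U+FFFF. A sketch follows, with an invented class name. It strips the two-byte length prefix that writeUTF emits so the payload can be compared with real UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Bytes produced by DataOutputStream.writeUTF, minus its 2-byte length prefix.
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String supplementary = new String(Character.toChars(0x10348));
        // U+0000: two bytes (C0 80) from writeUTF, one byte (00) in real UTF-8.
        System.out.println(viaWriteUTF(nul).length + " vs "
                + nul.getBytes(StandardCharsets.UTF_8).length);
        // Above U+FFFF: each surrogate is written as 3 bytes (6 total), versus 4 in real UTF-8.
        System.out.println(viaWriteUTF(supplementary).length + " vs "
                + supplementary.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Data written with writeUTF should therefore only be read back with readUTF, not handed to a non-Java consumer as UTF-8.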

Java Uses Unicode

  • Every character in every Java String object is encoded as UTF-16 Unicode.

    • Every string is converted from a legacy encoding, either by the compiler or by the String class.

    • This is the reason for the native2ascii tool and the javac -encoding switch.

  • Once you have a String object, everything is Unicode UTF-16.

“Special” encodings

  • There are two encodings that the system treats as special:

    • file.encoding

    • ISO-8859-1

  • All basic conversion functions use your system default encoding.

  • Most servlet conversion functions use ISO-8859-1 as the default.
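The practical consequence is that any conversion call without an encoding argument behaves differently from machine to machine. The sketch below (class and method names are illustrative) prints what the current JVM treats as its default and contrasts that with an explicit conversion.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Explicit decode: the result is the same on every platform.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0x80}; // UTF-8 for À (U+00C0)

        // What this JVM considers its default; varies by OS, locale and JDK version.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());

        // Implicit: new String(bytes) uses the default, so the result is platform-dependent.
        System.out.println("implicit: " + new String(bytes));
        // Explicit: always À.
        System.out.println("explicit: " + decodeUtf8(bytes));
    }
}
```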

Two File Encodings

  • Windows systems generally have two different file encodings:

    • “ANSI” encoding is the Windows default code page for GUI applications.

    • “OEM” encoding is the code page used by the ‘cmd’ or ‘command’ interpreter shells.

Stream Readers and Writers

  • InputStreamReader and OutputStreamWriter classes perform controlled conversion between byte[] and String.

    • Always pass the encoding explicitly (as a variable) rather than relying on the default.

    • Use the IANA preferred name for the encoding, if possible (see ftp://ftp.isi.edu/in-notes/iana/assignments/)

    • Prefer UTF-8 for on-the-wire transport.

Code Sample

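// use with any type of InputStream class
// (needs: import java.io.*; 'file' and 'encoding' are supplied by the caller)
InputStream is = new FileInputStream(file);
InputStreamReader isr =
    new InputStreamReader(is, encoding);
// use BufferedReader for efficiency
BufferedReader br =
    new BufferedReader(isr);
StringBuilder sb = new StringBuilder();
int chr;
// read() returns an int: -1 at end of stream, otherwise a char value
while ((chr = br.read()) > -1) {
    sb.append((char) chr); // cast, or the number's digits are appended instead
}
br.close();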

* Note: Try blocks eliminated for clarity.

OutputStreamWriter Code Sample

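// use with any type of OutputStream class
// (needs: import java.io.*; 'file', 'encoding' and 'myString' are supplied by the caller)
OutputStream os =
    new FileOutputStream(file); // ByteArrayOutputStream cannot wrap a file
OutputStreamWriter osw =
    new OutputStreamWriter(os, encoding);
osw.write(myString, 0, myString.length());
osw.flush();
osw.close();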

* Note: Try blocks eliminated for clarity.
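For completeness, here is one way the two samples might look with the try blocks restored. It is a sketch, not the presenter's code: the class name and the roundTrip helper are invented, and in-memory streams stand in for files so the example runs anywhere.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class StreamRoundTrip {
    // Write text out in the given encoding, then read the bytes back with the same one.
    static String roundTrip(String text, String encoding) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (Writer w = new OutputStreamWriter(bytes, encoding)) {
                w.write(text);
            } // close() flushes the encoder

            StringBuilder sb = new StringBuilder();
            try (Reader r = new BufferedReader(new InputStreamReader(
                    new ByteArrayInputStream(bytes.toByteArray()), encoding))) {
                int chr;
                while ((chr = r.read()) > -1) {
                    sb.append((char) chr);
                }
            }
            return sb.toString();
        } catch (IOException e) { // includes UnsupportedEncodingException
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String nihongo = "\u65E5\u672C\u8A9E"; // 日本語
        System.out.println(roundTrip(nihongo, "UTF-8").equals(nihongo)); // survives
        System.out.println(roundTrip(nihongo, "ISO-8859-1"));            // ???
    }
}
```

The encoding is passed as a variable in both directions, so the same code serves any encoding the caller names.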

Character Class

  • Provides access to Unicode character properties.

    • Character.UnicodeBlock nested class

    • Character.getType (defined types)

    • isDigit

    • isLetter

    • isLetterOrDigit

    • isUpperCase/isLowerCase/isTitleCase

    • toUpperCase/toLowerCase/toTitleCase

    • isWhitespace (isSpace is deprecated)

    • isISOControl/isJavaIdentifierStart/isJavaIdentifierPart
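A few of these methods in action, on characters that are not ASCII. The class name and the describe helper are illustrative, and the sample characters are arbitrary choices.

```java
class CharacterPropsDemo {
    // Summarize a handful of Unicode properties for one code point.
    static String describe(int cp) {
        return String.format("U+%04X block=%s letter=%b digit=%b upper=%b",
                cp,
                Character.UnicodeBlock.of(cp),
                Character.isLetter(cp),
                Character.isDigit(cp),
                Character.isUpperCase(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));
        System.out.println(describe(0x0663)); // ARABIC-INDIC DIGIT THREE
        System.out.println(describe(0x8A9E)); // 語
        // Case mapping works beyond ASCII: à (U+00E0) maps to À (U+00C0).
        System.out.println(Character.toUpperCase(0x00E0) == 0x00C0);
        // Digits from other scripts carry their numeric value.
        System.out.println(Character.getNumericValue(0x0663));
    }
}
```

Testing with c >= '0' && c <= '9' style range checks misses all of this, which is the reason to go through the Character class.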

Normalization

  • Many characters have two (or more) representations in Unicode.

    • Normalization makes the sequences the same.

    • Simplifies user input parsing and validation.

ICUj Normalizer Class

  • Four forms of Normalization:

    • Form C (canonical composition)

    • Form D (canonical decomposition)

    • Form KC (compatibility composition)

    • Form KD (compatibility decomposition)

    • Special handling for Hangul characters!

    • Note that the JDK also has a java.text.Normalizer class (a public API since Java 6).
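The four forms can be tried without ICU. The sketch below uses the JDK's java.text.Normalizer (available as a public API from Java 6 onward) rather than the ICU4J class this slide describes, so treat it as an approximation of the same idea. The class and method names are invented.

```java
import java.text.Normalizer;

class NormalizationDemo {
    // Two strings are canonically equivalent if they match after NFC.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00C0"; // À as one character
        String combining = "A\u0300";  // A followed by a combining grave accent

        System.out.println(precomposed.equals(combining));       // differ as raw strings
        System.out.println(sameAfterNfc(precomposed, combining)); // equal once normalized

        // Form KC also folds compatibility characters: fullwidth Ａ (U+FF21) becomes A.
        System.out.println(Normalizer.normalize("\uFF21", Normalizer.Form.NFKC));

        // Hangul: Form D splits the syllable 한 (U+D55C) into its three jamo.
        System.out.println(Normalizer.normalize("\uD55C", Normalizer.Form.NFD).length());
    }
}
```

Normalizing user input once, at the boundary, means later comparisons and lookups can use plain String.equals.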

Demo Programs

  • UnicodeDemo – a Java program that demonstrates the byte sequences of different encodings and also provides some code that shows ISR and OSW in action.

  • Charsets – a Windows program by my buddy Bill Hall for playing with encodings.

  • http://www.inter-locale.com -- my personal website, with examples and demos of certain Java I18n things.

Questions?

Addison Phillips

[email protected]