Introduction to Character Encodings, Java and You

Agenda
  • Defining the problem
    • Where webMethods products encounter character set problems.
    • What the symptoms look like.
  • Understand core concepts
    • What is a character set? What’s an encoding?
    • What is Unicode, really?
  • Code Examples to avoid problems

Private and Confidential

Confusion Reigns
  • Generally, the most confusing aspect of internationalization.
    • Many, many standards to choose from.
    • Arcane terminology
    • American programmers rarely seem to encounter it head-on.
  • We’re presenting this because many of our products are encountering this problem now.

Problem Domain
  • webMethods products interface with:
    • non-Java systems (for example, in the adapters)
    • non-Java environments (file systems, databases, libraries, email, ftp, http, etc.).

Java’s Text Representation
  • Java provides a convenient text processing architecture centered on the Java String object.
    • A Java String is basically a sequence of Java char values (16-bit UTF-16 code units).

Java Characters
  • Each Java Character object represents a Unicode character.
    • (Currently) a 16-bit unsigned integer value between 0 and 65,535.
    • Character class provides access to character properties.
      • UPPER, lower, and Titlecase mapping
      • Comparison
      • Directionality
      • Compatibility
      • C-TYPE values such as ‘alpha-ness’, ‘digit-ness’, ‘alphanumeric-ness’

Non-Java Text
  • Non-Java files, applications, filesystems, databases, etc. typically do not use Unicode. Java sees them as an array of bytes (byte[]).

Three Problems

Bad Conversion
  • Target character set doesn’t have this character in it. Java replaces each character with a “?”
  • Input String: 日本語
  • Output String: ???
  • Typically:
    • Using the default encoding when we meant to specify one.
    • Writing on a device (such as System.out) whose legacy encoding doesn’t support the characters.
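
The "???" symptom can be reproduced in a few lines. The following is a minimal sketch, not part of the original deck: the class and method names are made up for illustration, and ISO-8859-1 stands in for whatever legacy target encoding is involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BadConversionDemo {
    // Encode to the target charset, then decode with the same charset,
    // the way text survives (or not) a trip through a legacy system.
    static String roundTrip(String text, Charset target) {
        // String.getBytes substitutes the charset's replacement byte
        // for every character the charset cannot represent.
        byte[] encoded = text.getBytes(target);
        return new String(encoded, target);
    }

    public static void main(String[] args) {
        // The three kanji from the slide, written as Unicode escapes so
        // the encoding of this source file cannot interfere.
        String nihongo = "\u65E5\u672C\u8A9E";
        System.out.println(roundTrip(nihongo, StandardCharsets.ISO_8859_1)); // ???
        System.out.println(roundTrip(nihongo, StandardCharsets.UTF_8).equals(nihongo)); // true
    }
}
```

The damage happens at encoding time: once the bytes are 0x3F 0x3F 0x3F, no later step can recover the original characters.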

“No Glyph”
  • Java knows what the character is and is handling it properly, but doesn’t have a picture of it to show you (in the current Font selected).
  • Input String: 日本語
  • Output String:
  • Typically:
    • Nothing is wrong, just using the wrong Font.

Random Trash
  • A byte[] was converted using the wrong character encoding. Bytes were mapped to the wrong characters.
  • Input String: 日本語
  • Output String: “ú–{Œê
  • Typically:
    • Using the wrong encoding, the underlying bytes are mapped to different, random-seeming characters.

Examples
  • Same byte sequences, different results:

Shift JIS byte[] = 0xE0, 0x41, 0x83, 0x70 = “漓パ”

Windows-1252 (Microsoft’s extended Latin-1) byte[] = 0xE0, 0x41, 0x83, 0x70 = “àAƒp”

Java String (UTF-16 code units) = 0xE041, 0x8370 = “荰” (U+E041 is a private-use character with no glyph)

Java String = “漓パ” = U+6F13 U+30D1
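
A rough sketch of the same experiment in code, with illustrative class and method names. It prints code points rather than glyphs so the result does not depend on fonts or console encoding. Only ISO-8859-1 and UTF-16BE are guaranteed on every JVM, so Shift_JIS is tried only when available. Note that strict ISO-8859-1 maps 0x83 to a C1 control character; the “ƒ” seen in practice comes from Windows-1252.

```java
import java.nio.charset.Charset;

class SameBytesDemo {
    // The byte sequence from the slide.
    static final byte[] BYTES = { (byte) 0xE0, 0x41, (byte) 0x83, 0x70 };

    // Render each UTF-16 code unit as U+XXXX.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Decode the fixed bytes with the named charset.
    static String decode(String charsetName) {
        return codeUnits(new String(BYTES, Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1: " + decode("ISO-8859-1")); // U+00E0 U+0041 U+0083 U+0070
        System.out.println("UTF-16BE:   " + decode("UTF-16BE"));   // U+E041 U+8370
        // Shift_JIS is not one of the charsets every JVM must provide.
        if (Charset.isSupported("Shift_JIS")) {
            System.out.println("Shift_JIS:  " + decode("Shift_JIS"));
        }
    }
}
```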

What is a Character?
  • A character is a single, atomic unit of text.
  • The definition has a different meaning according to the writing system and context.

Abstract characters
  • Some abstract characters include:

A Roman Letter Capital A

` Combining Accent Grave

に Hiragana character “ni”

語 CJK Ideograph

ي Arabic letter

앚 Hangul syllable

Ａ Fullwidth compatibility letter A

What is a Character Set?
  • A character set is a “set”: a collection of characters, usually organized in some fashion.
  • You’re probably most familiar with ASCII:
    • 0x41 ‘A’
    • 0x42 ‘B’
    • Etc.

What is a Character Encoding?
  • Character set: a collection of characters, basically, a bucket.
  • Character encoding: the specific ones and zeroes (bytes) assigned to each character in a character set.

Character Set: ‘A’ == 0x41

Character Encoding: ‘A’ == 0x41
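
For ASCII the character number and the encoded byte happen to coincide, which hides the distinction. A small sketch (helper names are illustrative) shows one character, U+0041, producing different byte sequences under different encodings:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingsOfA {
    // Format bytes as space-separated two-digit hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode the text with the given charset and show the bytes.
    static String encode(String text, Charset cs) {
        return hex(text.getBytes(cs));
    }

    public static void main(String[] args) {
        System.out.println(encode("A", StandardCharsets.US_ASCII)); // 41
        System.out.println(encode("A", StandardCharsets.UTF_8));    // 41
        System.out.println(encode("A", StandardCharsets.UTF_16BE)); // 00 41
        System.out.println(encode("A", StandardCharsets.UTF_16LE)); // 41 00
    }
}
```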

Eight Bit Encodings
  • 8-bit encodings allow for 256 characters.

128 ASCII

32 ‘C1’ controls

96 extended

Latin-1
  • The standard for Western Europe is generally ISO-8859-1
  • AKA “Latin-1”
  • Used by UNIX systems and the Web.
  • Extended version (code page 1252) used by Microsoft for Windows.

Let a Thousand Encodings Bloom…
  • Each language has its own character set…
    • Everywhere: ASCII*
    • Western European (like German or French): Latin-1
    • Eastern European (like Polish or Slovak): Latin-2
    • Simplified Chinese: GB2312

Actually, many for each language…

Other Writing Systems
  • Writing systems vary around the world (in order of increasing complexity, more or less):
    • Latin-based alphabets
      • (ABCDEFG…) English
    • Cyrillic and Greek-based alphabets
      • (АБВГДЕЖЩ...) Russian
    • Ideographic writing systems have thousands of characters
      • (一丁勺両亀困...) Japanese
    • Bi-directional (RTL) languages go right to left
      • (...זוהדגבא) Hebrew
    • Complex scripts (everything else):
      • (ऋऌऍऎ )Devanagari

Expanded Character Sets
  • Most languages have alphabetic or phonetic writing systems:
    • Russian, Greek, Slavic, (many) Native American, Bahasa, Hebrew, Arabic, Semitic, etc.: alphabetic
    • Indian (subcontinent), Thai, Japanese kana, Korean: phonetic writing systems
    • 8 bits is enough for all of the above (with some tricks)
  • Some languages use scripts based on Chinese ideographic writing (“Han” or “Hanja”):
    • Chinese
    • Korean
    • Vietnamese (traditional)
    • Japanese Kanji

“Double-Byte”
  • 8-bit character encodings use eight bits per character.
    • 2^8 = 256 characters
  • “Double-byte” character sets must be 2 bytes per character?
    • 2^16 = 65,536 characters
  • Should actually be called “multi-byte” (MBCS).
    • Each character can be ONE, TWO, THREE and sometimes FOUR bytes in length.
    • MAY involve shift states.

Multibyte Encodings

A typical Japanese Character Set:

JIS X 0208 (漢字)

Character Encodings of JIS X 0208:

Shift-JIS (CP932): 0x8A 0xBF 0x8E 0x9A

EUC-JP: 0xB4 0xC1 0xBB 0xFA

ISO 2022-JP: 0x1B, 0x24, 0x42, 0x34 0x41 0x3B 0x7A 0x1B 0x28 0x4A

Non-Legacy:

UTF-16: (0x6F22 0x5B57)
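
The byte sequences above can be checked on a JVM that ships the Japanese charsets. This is a hedged sketch: the class and method names are invented, and the legacy charsets are optional in a Java runtime, so the helper returns null when one is missing instead of failing.

```java
import java.nio.charset.Charset;

class KanjiBytes {
    // Hex bytes of the text in the named encoding, or null when this
    // JVM does not provide that charset.
    static String hexIn(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // the two characters of the word "kanji"
        String[] names = { "Shift_JIS", "EUC-JP", "ISO-2022-JP", "UTF-16BE" };
        for (String name : names) {
            String hex = hexIn(kanji, name);
            System.out.println(name + ": " + (hex == null ? "(not available on this JVM)" : hex));
        }
    }
}
```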

An MBCS Example: Shift-JIS
  • Character set used by DOS, Windows, Macs, and a few UNIX-like systems for Japanese.
    • Code Page 932
    • JIS X 0208:1997

Shift-JIS
  • In order to reach more characters, double byte values start with a limited range of “lead bytes”
  • These can be followed by any byte value of 0x40 or higher (“trail byte”)

Shift-JIS
  • Each “lead byte” provides a “window” onto additional characters.

Shift-JIS
  • Problems:
    • Lead byte values are also valid as trail bytes.
    • Common special characters (“\”!!) are valid trail bytes.
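
The backslash problem is easy to demonstrate on raw bytes. In the sketch below, the lead-byte ranges (0x81-0x9F and 0xE0-0xFC, the code page 932 ranges) are an assumption of this example, and the class is purely illustrative. The test data is "C:\" followed by katakana SO, which Shift-JIS encodes as 0x83 0x5C.

```java
class ShiftJisBackslash {
    // Assumed lead-byte ranges for code page 932.
    static boolean isLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // Naive scan: treats every 0x5C byte as a backslash.
    static int naiveCount(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if ((b & 0xFF) == 0x5C) count++;
        }
        return count;
    }

    // Lead-byte-aware scan: skips the trail byte that follows a lead byte.
    static int awareCount(byte[] data) {
        int count = 0;
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;
            if (isLeadByte(b)) {
                i++; // the next byte is a trail byte, not a character of its own
            } else if (b == 0x5C) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = { 0x43, 0x3A, 0x5C, (byte) 0x83, 0x5C };
        System.out.println(naiveCount(data)); // 2
        System.out.println(awareCount(data)); // 1
    }
}
```

Any byte-oriented code that splits paths or unescapes strings without tracking lead bytes makes the naive mistake.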

Han
  • CJK scripts require up to 100,000 unique characters for complete representation.
    • Four major variants:
      • Traditional Chinese
      • Simplified Chinese
      • Japanese Kanji
      • Korean (non-Hangul)

“Kanji”
  • Sometimes you hear Japanese called “kanji”
    • Kanji is actually one of four writing systems used in Japan.
    • Kanji should be avoided as a generic term for DBCS.
  • Kanji (“Han” or Chinese writing): 日本語
  • Hiragana (phonetic for Japanese words): にほんご
  • Katakana (phonetic for “foreign” words): ニホンゴ
  • Romaji (“Roman script”): nihongo

Chinese
  • Upper two are Traditional.
  • Lower character is the Simplified variant.

Hangul
  • Korean Hangul is a syllabic phonetic system, which has thousands of combinations.
    • Hangul is not related to Han ideographic writing.

Code Page Hell
  • With hundreds of encodings and character sets to choose from, making internationalized code work in the late 1980s and early 1990s was “hellish”.
  • Internationalization folks referred to this as “code page hell”


Unicode and Java

To the Rescue

Unicode (ISO/IEC 10646)
  • Unicode is a character set that supports all of the world’s languages and writing systems.*
    • Originally designed as a “wide character set”: every character was represented by 16 bits. This allowed for 65,536 potential characters.
    • Extended to allow about 1.1 million characters.
    • Unicode is maintained by an industry consortium. ISO/IEC 10646 is maintained by WG2. The two are kept synchronized, code point for code point.

It’s a character set?
  • Unicode is a character set. It has these encodings:
    • UTF-32. (BE/LE)
      • A 32-bit encoding. All characters 32 bits.
    • UTF-16. (BE/LE)
      • A 16-bit encoding. Characters in the “Basic Multilingual Plane” (U+0000 to U+FFFF) take one 16-bit unit.
      • Characters above U+FFFF (outside the Basic Multilingual Plane) require a pair of special “surrogate” code units.
    • UTF-8.
      • An 8-bit variable width encoding. Characters are 1, 2, 3 or 4 bytes long. Has no byte-order variants.
      • ASCII == ASCII
      • All other characters have a special bit pattern
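
The effect of surrogates on Java strings can be seen directly. A minimal sketch (class and method names are illustrative) compares a Latin letter, a BMP ideograph, and a supplementary-plane character:

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // Summarize how one code point is stored in a String and in UTF-8.
    static String describe(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.length() + " char(s), "
             + s.codePointCount(0, s.length()) + " code point(s), "
             + s.getBytes(StandardCharsets.UTF_8).length + " UTF-8 byte(s)";
    }

    public static void main(String[] args) {
        System.out.println(describe(0x41));    // 1 char(s), 1 code point(s), 1 UTF-8 byte(s)
        System.out.println(describe(0x65E5));  // 1 char(s), 1 code point(s), 3 UTF-8 byte(s)
        System.out.println(describe(0x20000)); // 2 char(s), 1 code point(s), 4 UTF-8 byte(s)
        // The two UTF-16 code units that make up U+20000.
        System.out.println(String.format("%04X %04X",
                (int) Character.highSurrogate(0x20000),
                (int) Character.lowSurrogate(0x20000))); // D840 DC00
    }
}
```

Code that indexes with charAt or measures with length() counts code units, not characters, whenever surrogates are present.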

UTF-8 Bit Pattern
  • ASCII == ASCII
    • 0x41 == ‘A’
  • All other characters are multibyte.
    • 110xxxxx == two bytes
    • 1110xxxx == three bytes
    • 11110xxx == four bytes
    • 10xxxxxx == trail byte
    • U+00C0 == À == 0xC3 0x80 (11000011 10000000)
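
The bit patterns above translate almost mechanically into code. The following hand-rolled encoder is a teaching sketch only: it assumes a valid code point, performs no surrogate or range checks, and real code should use the JDK's built-in UTF-8 support.

```java
class Utf8Bits {
    // Encode one code point by filling the x bits of the patterns above.
    static byte[] encode(int cp) {
        if (cp < 0x80) {              // 0xxxxxxx: ASCII stays one byte
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {             // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {           // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {           // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    // Show each byte as eight binary digits.
    static String bits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toBinaryString((b & 0xFF) | 0x100).substring(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bits(encode(0x41))); // 01000001
        System.out.println(bits(encode(0xC0))); // 11000011 10000000
    }
}
```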

Convenience Method for UTF8
  • Almost True: readUTF and writeUTF (on DataInput/DataOutput streams) read and write UTF-8 directly.
    • This is not really UTF-8, but a Java-specific variant (“modified UTF-8”).
    • Use InputStreamReader/OutputStreamWriter to do proper conversions.
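
The difference is visible in the bytes. In this sketch (names are illustrative), writeUTF adds a two-byte length prefix and encodes U+0000 as two bytes, whereas standard UTF-8 uses a single zero byte. Supplementary characters also differ: writeUTF encodes each surrogate separately.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Bytes produced by DataOutputStream.writeUTF (length prefix included).
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Bytes produced by the standard UTF-8 charset.
    static byte[] viaUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        System.out.println(hex(viaWriteUTF(nul))); // 00 02 C0 80
        System.out.println(hex(viaUtf8(nul)));     // 00
    }
}
```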

Java Uses Unicode
  • Every character in every Java String object is encoded as UTF-16 Unicode.
    • Every string is converted from a legacy encoding, either by the compiler or by the String class.
    • This is the reason for the native2ascii tool and the compiler’s -encoding switch.
  • Once you have a String object, everything is Unicode UTF-16.

“Special” encodings
  • There are two encodings that the system treats as special:
    • file.encoding
    • ISO-8859-1
  • All basic conversion functions use your system default encoding.
  • Most servlet conversion functions use ISO-8859-1 as the default.
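
To see which default a given JVM is using, and to confirm that the no-argument conversions depend on it, something like the following can be run. It is a small sketch with an invented class name; the printed values vary by platform and JVM settings, so no output is shown.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultEncodingDemo {
    // True when getBytes() with no argument matches an explicit request
    // for the JVM's default charset.
    static boolean usesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println("getBytes() uses default: " + usesDefault("\u00C0 la carte"));
    }
}
```

Because the default differs from machine to machine, code that relies on it can pass on a developer's desktop and fail on a server.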

Two File Encodings
  • Windows systems generally have two different file encodings:
    • “ANSI” encoding is the Windows default code page for GUI applications.
    • “OEM” encoding is the code page used by the ‘cmd’ or ‘command’ interpreter shells.

Stream Readers and Writers
  • InputStreamReader and OutputStreamWriter classes perform controlled conversion between byte[] and String.
    • Always pass the encoding as a variable.
    • Use the IANA preferred name for the encoding, if possible (see ftp://ftp.isi.edu/in-notes/iana/assignments/)
    • Prefer UTF-8 for on-the-wire transport.

Code Sample

// needs: import java.io.*;
// use with any type of InputStream class
InputStream is = new FileInputStream(file);
InputStreamReader isr = new InputStreamReader(is, encoding);

// use BufferedReader for efficiency
BufferedReader br = new BufferedReader(isr);
StringBuffer sb = new StringBuffer();
int chr;

// read() returns -1 at end of stream
while ((chr = br.read()) > -1) {
    // cast to char, otherwise the numeric value is appended as digits
    sb.append((char) chr);
}
br.close();

* Note: Try blocks eliminated for clarity.

OutputStreamWriter Code Sample

// needs: import java.io.*;
// use with any type of OutputStream class
OutputStream os = new FileOutputStream(file);
OutputStreamWriter osw = new OutputStreamWriter(os, encoding);

// characters are converted to bytes in the named encoding
osw.write(myString, 0, myString.length());
osw.flush();
osw.close();

* Note: Try blocks eliminated for clarity.
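
For a version that runs as-is, here is a hedged sketch that combines both classes in a round trip through an in-memory byte array. The class and method names are invented for the example; checked exceptions are wrapped so the helpers are easy to call.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class StreamRoundTrip {
    // String -> bytes in the named encoding.
    static byte[] write(String text, String encoding) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer osw = new OutputStreamWriter(os, encoding)) {
            osw.write(text, 0, text.length());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return os.toByteArray();
    }

    // Bytes -> String, decoding with the named encoding.
    static String read(byte[] data, String encoding) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), encoding))) {
            int chr;
            while ((chr = br.read()) > -1) {
                sb.append((char) chr);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nihongo = "\u65E5\u672C\u8A9E";
        byte[] utf8 = write(nihongo, "UTF-8");
        System.out.println(utf8.length);                          // 9
        System.out.println(read(utf8, "UTF-8").equals(nihongo));  // true
        // Decoding with the wrong encoding yields nine unrelated characters.
        System.out.println(read(utf8, "ISO-8859-1").length());    // 9
    }
}
```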

Character Class
  • Provides access to Unicode character properties.
    • UnicodeBlock inside class
    • Character getType (defined types)
    • isDigit
    • isLetter
    • isLetterOrDigit
    • isUpperCase/isLowerCase/isTitleCase
    • toUpperCase/toLowerCase/toTitleCase
    • isSpace (deprecated)/isWhitespace
    • isISOControl/isJavaIdentifierStart/isJavaIdentifierPart
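
A few of these properties in use, as a minimal sketch with an illustrative class name. The sample characters are Latin capital A, ARABIC-INDIC DIGIT THREE (U+0663), and a CJK ideograph (U+65E5).

```java
class CharacterProps {
    // One-line summary of selected Unicode properties of a char.
    static String describe(char c) {
        return String.format("U+%04X letter=%b digit=%b upper=%b block=%s",
                (int) c,
                Character.isLetter(c),
                Character.isDigit(c),
                Character.isUpperCase(c),
                Character.UnicodeBlock.of(c));
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));
        System.out.println(describe('\u0663'));
        System.out.println(describe('\u65E5'));
        // isDigit is not limited to ASCII: U+0663 has the numeric value 3.
        System.out.println(Character.digit('\u0663', 10)); // 3
    }
}
```

Validation written as c >= '0' && c <= '9' and validation written with Character.isDigit accept different input, so the choice should be deliberate.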

Normalization
  • Many characters have two (or more) representations in Unicode.
    • Normalization makes the sequences the same.
    • Simplifies user input parsing and validation.

ICUj Normalizer Class
  • Four forms of Normalization:
    • Form C (composed)
    • Form D (decomposed)
    • Form KC (compatibility composed)
    • Form KD (compatibility decomposed)
    • Special handling for Hangul characters!
    • Note that the JDK has its own java.text.Normalizer class (private in older JDKs, public API since Java 6).
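
The forms can be tried without ICU4J on Java 6 or later. This sketch uses the JDK's java.text.Normalizer rather than the ICU4J class named on the slide, and the wrapper class is illustrative.

```java
import java.text.Normalizer;

class NormalizeDemo {
    static String nfc(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFC); }
    static String nfd(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFD); }
    static String nfkc(String s) { return Normalizer.normalize(s, Normalizer.Form.NFKC); }

    public static void main(String[] args) {
        String decomposed = "A\u0300"; // capital A followed by combining grave accent
        String composed = "\u00C0";    // precomposed capital A with grave

        // The two spellings look the same but are different strings...
        System.out.println(decomposed.equals(composed));      // false
        // ...until both are brought to the same normalization form.
        System.out.println(nfc(decomposed).equals(composed)); // true
        System.out.println(nfd(composed).equals(decomposed)); // true
        // Compatibility forms also fold variants such as fullwidth A (U+FF21).
        System.out.println(nfkc("\uFF21"));                   // A
    }
}
```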

Demo Programs
  • UnicodeDemo – a Java program that demonstrates the byte sequences of different encodings and also provides some code that shows InputStreamReader and OutputStreamWriter in action.
  • Charsets – a Windows program by my buddy Bill Hall for playing with encodings.
  • http://www.inter-locale.com -- my personal website, with examples and demos of certain Java I18n things.
