Internationalization: An Introduction

Presenter and Presentation
  • Addison Phillips
    • Globalization Architect
  • This Presentation
    • “Internationalization and Unicode Conference” Tutorial
    • Covers Internationalization and basic concepts, such as character encodings
Who is this guy?

Globalization Architect, Lab126 (you know us as “Amazon Kindle”)

Chair, W3C Internationalization Core WG

Editor, IETF LTRU WG

Internationalization is:

the design and development of a product that is enabled for target audiences that vary in culture, region, or language. [W3C]

a fundamental architectural approach to software development

Related Concepts

Localization: creation of a product tailored to a particular target market

Translation: process of converting text from one language to another

Globalization: unified approach to creating global products, especially those that support multiple geographies simultaneously

Mystic Numbering (M4C N7G)

I N T E R N A T I O N A L I Z A T I O N

I 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 N

I18N

Localization = L10N

Globalization = G11N

Canonicalization = C14N

Opinions differ on capitalization (C12N); choose from:

i18n

I18n

i18N

I18N

Very geeky; not very internationalized (I19G?)
A Global Approach
  • Internationalization turns technical problems into business decisions
  • Balance priorities based on real user distribution/requirements
    • Consider global user population as a whole
    • Consider specific market requirements on an equal footing
    • Potential markets for the product
Buy In: The Key to Success
  • For internationalization to be a success over time, there must be commitment:
    • Management
    • Product Team
    • Development Team
      • All developers, not a splinter group
Globalized Product Development

Internationalization turns technical problems into business decisions.

  • Localization: Choose which markets to translate the user interface or documentation for, without further engineering work.
  • Deployment: Choose whether to serve applications from a single site, a cluster of sites, or in each target market.
  • Development: Add content and features to products as necessary in each target market.
  • Integration and Interoperability: Servers and products can work together around the world, so customers can truly create “Enterprise” solutions.
Aspects of Internationalization

Enabling—the same code supports multiple regions or cultures. Sometimes called a “global binary”.

Externalization—plan for localizability by separating “content” from code. This makes localization for specific languages, regions, or cultures easy, fast, and cheap.

Customization—add culturally specific functionality, presentation, or content to an application.

What, me worry?

We (wrote it in Java/C#, used Unicode, etc.), so it is internationalized.

We made the assumption that the product would only ever have English screens: all our users understand it anyway.

A localized product is internationalized.

An internationalized product is slow/slower.

It takes longer to write internationalized code.

We can’t read the screens/it is too hard to test.

We have no intention of localizing, so no need to internationalize.

We don’t have any customers there.

The users in (some country) never complained, so it must work.

This product is 100% fully internationalized.

Development Methodologies
  • Independent of development methodology
    • Agile? Waterfall? You make the choice.
  • Encompasses the full development cycle:
    • Design
    • Development
    • QC
    • Release
    • Support
The Customization Approach
  • “Internationalization is something remedial”
    • “Didn’t we do internationalization in the last release?!?”
    • Internationalization involves a lot of arcane knowledge (“we don’t know what to do”)
    • “It will interrupt or slow down development.”
    • “International features are not important to our U.S. customers—and they represent our largest market.”
    • “The guys in-country have always figured it out before.”
    • “Let’s outsource it”
    • “We’ll get to it next time”
How That Model Really Looks

[Diagram: a timeline showing the Main Line (releases 1.0, 1.0a, 2.0, with bug fixes and sexy new features) and a forked International Branch (International Release 1.0, 1.0i). Annotations: merges and fixes; lots more people and cost; lost $ and opportunity, lots of cost to get there; functionality gaps: intl users waiting for 2.0i now.]

The Problem with Customization

Code forks. (double, triple coding)

Lag time for international releases.

Non-adoption of localized release.

Full regression of every language.

Quality or commitment perception.

Lack of data exchange between language versions.

Difficult to repeat (every version is a repeat)

Proliferation of bugs and of support problems.

International features are cancelled.

Core product still doesn’t work/can’t address similar markets.

Loss of market share.

The Internationalization Approach

Gather requirements globally

Enable

Externalize

Customize

Test and support globally

Localize

Large Animal Pictures

[Diagram: a Software Component made of Global Code and Resources, with Input, Output, and I/O.]

Enterprise Animal Pictures

[Diagram: clients and a Front End reach Business Logic and Data Stores through APIs, each running in an Operating Env.; a data feed connects to a partner or provider that has its own API, Business Logic, Data Store, and Operating Env.]

Internationalization Issues
  • Text Processing
    • Character encodings, including Unicode, spelling, word breaks, collation, and so on
  • Language
    • Of the software (localization)
    • Of solutions built using the software (localizability, data)
  • Locale-affected formats
    • dates, numbers and the like
  • Regionally-affected formats
    • names, addresses, currency, and the like
  • Time-related issues
    • time zone, calendar, holidays, work rules and the like
  • Cultural adaptation
    • presentation, style, position, color use, and the like
  • Legal requirements
    • accessibility, SOX, DRM, moderation, security, content, and the like
“Well, it depends…”

Making Good Design Decisions

  • Generalize designs
    • Locale independent data structures
    • Locale sensitive display
  • Externalize cultural or linguistic variations
  • Customize as a last resort
Levels of Enablement
  • Not Enabled
  • Single-Language-at-a-Time (SLAAT)
    • All components run in the same language and encoding environment correctly.
  • Multi-Locale
    • Unicode support; components run in different locales, languages, encodings, and time zones

Test Your Assumptions
  • Gender:
    • Male
    • Female
Enabling

Making Code Aware of Culture

What is “enabling”?
  • Enabled software:

adapts the display, processing, validation, storage, and transmission of data according to the cultural, linguistic, and regional needs of the users

    • Text, Characters, and Encodings
    • Locale Awareness
    • Times and Time Zones

A “global binary” is a single object-code version that is used in all markets, regardless of localization.

The Biggest Source of Woe

“Character encodings consume more than 80% of my work day. They are the source of more mis-information and confusion than any other single thing. And developers aren’t getting any better educated.”

~Glen Perkins, Globalization Architect

A lot of jargon

Real and bogus jargon you might encounter:

Real Jargon

Multibyte

Variable width

Wide character

Character encoding

Coded character set

Bidi or bidirectional

Glyph, character, code unit

Unicode

Potentially Bogus Jargon

kanji

double-byte language

extended ASCII

ANSI

encoding agnostic

How the computer sees the world

“bits”: 010000010101101101101000

“byte” or “octet”: 01000001 (0x41)

  • code unit: a unit of physical storage and information interchange
    • represent numbers
    • come in various sizes (e.g. 7, 8, 16, 32, 64 bits)
  • how do we map text to the numbers used by computers?
From text to bits

Glyphs: À

  • A “glyph” is a screen unit of text: it’s a picture of what users think of as a character.
  • A “grapheme” is a single visual unit of text.

Characters: U+00C0

  • A “character” is a single logical unit of text.
  • A “character set” is a set of characters.
  • A “code point” is a number assigned to a character in a character set.
  • A “coded character set” is a character set where each character has a code point.

Bytes: … 0xC3 0x80 …

  • A “character encoding” maps a sequence of code points (“characters”) to a sequence of code units (such as bytes).
  • A “code unit” is a single logical unit of storage.
Coded Character Set
  • Collection (repertoire) of characters, that is: a set.
  • Organized so that each character has a unique numeric (typically integer) value (code point).
  • Examples:
    • Unicode
    • ASCII (ANSI X3.4)
    • ISO 646
    • JIS X 0208
    • Latin-1 (ISO 8859-1)

Character sets are often associated with a particular language or writing system.

Character Encoding

U+00C0 → 0xC3 0x80
  • Maps a sequence of code points (characters) to a sequence of code units (e.g. bytes).
    • Some encodings use another unit instead of the byte. For example, some encodings use a 16-bit, 32-bit, or 64-bit code unit.
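
As a small illustration of the mapping described above, the following sketch encodes the single character À (U+00C0) with three standard Java charsets and prints the resulting bytes. The class name `EncodingBytes` and the helper `hex` are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    /** Returns the bytes as space-separated hex, for example "0xC3 0x80". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00C0"; // À: one code point, U+00C0
        // Same code point, three different byte sequences.
        System.out.println("UTF-8:      " + hex(text.getBytes(StandardCharsets.UTF_8)));      // 0xC3 0x80
        System.out.println("ISO 8859-1: " + hex(text.getBytes(StandardCharsets.ISO_8859_1))); // 0xC0
        System.out.println("UTF-16BE:   " + hex(text.getBytes(StandardCharsets.UTF_16BE)));   // 0x00 0xC0
    }
}
```

The code point never changes; only the encoding decides which bytes end up in memory, on disk, or on the wire.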
All text has a character encoding
(usually the most important slide in this entire presentation)

In memory, on disk, on the network, etc.

When things go wrong, start by asking what the encoding is, what encoding you expected it to be, and whether the bytes match the encoding.

Common Encoding Problems

Mojibake: garbage characters

Question Marks: conversion not supported

Tofu: hollow boxes

Tofu

Can appear as either hollow boxes (empty glyph) or as question marks (Firefox, for example)

Not usually a bug: it’s a display problem

Can mask or masquerade as character corruption.

Sources of Mojibake
  • View text using the wrong encoding
  • Apply a transfer encoding and forget to remove it
  • Convert to an encoding twice

Convert to or from the wrong encoding

Overzealous escaping

Conversion to entities (“entitization”)

Multiple conversions
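
The first source listed above, viewing text using the wrong encoding, is easy to reproduce. This is a minimal sketch, assuming UTF-8 bytes that are mistakenly decoded as ISO 8859-1; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Encodes text as UTF-8, then wrongly decodes those bytes as ISO 8859-1. */
    static String viewAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Reverses that specific mistake: re-creates the bytes and decodes them as UTF-8. */
    static String repair(String mojibake) {
        byte[] original = mojibake.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = viewAsLatin1("\u00C0"); // À becomes two characters: U+00C3 U+0080
        System.out.println("characters after wrong decode: " + garbled.length());
        System.out.println("repaired correctly: " + repair(garbled).equals("\u00C0"));
    }
}
```

The repair only works here because ISO 8859-1 maps every byte to a character, so no information is lost. Once a lossy conversion has put question marks into the data, the original text cannot be recovered.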

ASCII
  • 7 bits = 2^7 = 128 characters
  • Enough for “U.S. English”
Latin-1 (ISO 8859-1)

ASCII for characters 0x00 through 0x7F

Accented letters and other symbols 0x80 through 0xFF

One character—many encodings!

char Cp1252 Cp437 Cp850

È 0xC8 ? 0xD4

Windows Code Pages

Windows’s encodings (called “code pages”) are generally based on standard encodings—plus some additional characters.

Example:

  • CP 1252 is based on ISO 8859-1, but includes 27 “extra” characters in the C1 control range (0x80-0x9F)
Code Page

Originally an IBM character encoding term.

IBM numbered their character sets with “CCSIDs” (coded character set ids) and numbered the corresponding character encodings as “code pages”.

Microsoft borrowed code pages to create PC-DOS.

Microsoft defines two kinds of code pages:

  • “ANSI” code pages are the ones used by Windows GUI programs.
  • “OEM” code pages are the ones used by command shell/command line programs.

Neither “ANSI” nor “OEM” refers to a particular encoding standard or standards body in this context.

Avoid the use of ANSI and OEM when referring to encodings.
Beyond Single Byte Encodings
  • So far we’ve been looking at single-byte encodings:
    • one byte per character
    • 1 byte = 1 character (= 1 glyph?)
    • 256 character maximum
    • Good enough for most alphabetic languages
  • Some languages need more characters.
  • What about the “double-byte” languages?
  • Don’t those take two bytes per character?

丏丣並

À

Methods of reaching beyond single-byte
  • Escape sequences to select another character set
    • Example: ISO 2022 uses escape sequences to select various encodings
  • Use a larger code unit (“wide” character encoding)
    • Example: IBM DBCS code pages or Unicode UTF-16
    • 2^16 = 64K characters
    • 2^32 = 4.2 billion characters
  • Use a variable-width encoding
    • Variable width encodings use different numbers of code units to represent different types of characters within the same encoding
Multibyte Encodings

Multibyte Encoding: Any “variable-width” encoding that uses the byte as its code unit.

  • One or more bytes per character
    • 1 byte != 1 character
  • May use 1, 2, 3, or 4 bytes per character
  • May use shift or escape sequences
  • May encode more than one character set

In fact, single-byte encodings are a special case of multibyte!
Simple Multibyte Encodings

Specific byte ranges encode characters that take more than one byte.

  • A “lead byte”
  • One or more “trailing bytes”

Code point != code unit

[Diagram: code point 1-4-1 is stored as 0x82 0xA0 (lead byte, trail byte); “A”, code point 1-3-33, is shown with 0x61 (single byte).]

JIS X 0213: A “Multibyte” Character Set

JIS X 0213

11,233 characters

(2) 94x94 character planes
Shift_JIS: A Multibyte Encoding
  • In order to reach more characters, Shift_JIS characters start with a limited range of “lead bytes”
  • These can be followed by a larger range of byte values (“trail byte”)
Shift-JIS
  • Lead bytes can be trail byte values
  • Trail bytes include ASCII values
  • Trail bytes include special values such as 0x5C (“\”)

char *pos = strchr(mybuf, '@'); /* byte search: can match a trail byte inside a Shift_JIS character */
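
To show why a plain byte search is risky here, the sketch below compares a naive search with one that skips trail bytes. It works on raw bytes and assumes the lead-byte ranges 0x81-0x9F and 0xE0-0xFC; the class and method names are hypothetical, and the sample buffer holds the katakana ァ (0x83 0x40 in Shift_JIS) followed by "a@b".

```java
class ShiftJisSearch {
    /** True if b falls in the lead-byte ranges assumed here (0x81-0x9F, 0xE0-0xFC). */
    static boolean isLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    /** Naive search, like strchr: looks at every byte and ignores character boundaries. */
    static int naiveIndexOf(byte[] buf, int target) {
        for (int i = 0; i < buf.length; i++) {
            if ((buf[i] & 0xFF) == target) {
                return i;
            }
        }
        return -1;
    }

    /** Boundary-aware search: steps over the trail byte that follows each lead byte. */
    static int indexOf(byte[] buf, int target) {
        int i = 0;
        while (i < buf.length) {
            int b = buf[i] & 0xFF;
            if (isLeadByte(b)) {
                i += 2; // two-byte character: skip lead and trail together
            } else if (b == target) {
                return i;
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] buf = { (byte) 0x83, 0x40, 'a', '@', 'b' };
        System.out.println("naive search finds '@' at " + naiveIndexOf(buf, '@')); // 1, inside a character
        System.out.println("aware search finds '@' at " + indexOf(buf, '@'));      // 3, the real '@'
    }
}
```

The same trap applies to 0x5C: a search for a backslash can land in the middle of a two-byte character.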

More Complex Multibyte Systems

Stateful Encodings

  • ex. IBM “MBCS” code pages [SI/SO shift between 1-byte and 2-byte characters]
  • ISO 2022 [escape sequence changes character set being encoded]
Encoding Conversion

[Diagram: Templates (ISO 8859-1), Content (UTF-8), and Data (Shift_JIS) feed a Process that produces Output (HTML, XML, etc.).]

Document formats often require a single character encoding be used for all parts of the document.

  • Common Encoding Conversion Tools and Libraries
    • iconv (Unix)
    • ICU (C, C++, Java)
    • perl Encode
    • Java (native2ascii, IO/NIO)
    • (etc.)
  • When data is merged, the encodings must be merged also (or some of the data will be “mojibake”).
Encoding Conversion as Filter

Encoding conversion acts as a “filter”

  • Replacement characters (“question marks”) replace characters from the source character set that are not present in the target character set.

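[Diagram: sample strings (“ÀàС£”, “детски»èçينس文字”, “文字化け”) converted between ISO 8859-1, UTF-8, and Shift_JIS; characters that are not present in the target encoding come out as “?”, for example “??????»èç?????” and “????”.]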

? (0x3F) is the replacement character for ISO 8859-1
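
The filtering effect can be reproduced with the JDK's built-in converters. This sketch (hypothetical class and method names) pushes a string through ISO 8859-1 and back, and also shows how to check in advance whether a string fits the target encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConversionFilter {
    /** Converts text to ISO 8859-1 bytes and back; unmappable characters become '?' (0x3F). */
    static String throughLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    /** True if every character in text can be represented in the given charset. */
    static boolean fits(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String mixed = "\u00C0\u6587\u5B57"; // À followed by 文字
        // À survives the round trip; the two kanji are filtered out.
        System.out.println(throughLatin1(mixed).equals("\u00C0??"));       // true
        System.out.println(fits(mixed, StandardCharsets.ISO_8859_1));      // false
        System.out.println(fits(mixed, StandardCharsets.UTF_8));           // true
    }
}
```

Checking with `canEncode` before converting lets a program report the problem instead of silently writing question marks.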

Too Many Fish in the Sea

Too many character sets…

  • Need for more converters and conversion maps
  • Difficulty of passing, storing, and processing data in multiple encodings

…leads to what we call “code page hell”
The Idea Behind Unicode

Basic Principles

  • Universal repertoire
  • Logical order
  • Efficiency
  • Unification
  • Characters, not glyphs
  • Dynamic composition
  • Semantics
  • Stability
  • Plain Text
  • Convertibility

Fights mojibake because:

  • characters are from the common repertoire;
  • characters are encoded according to one of the encoding forms;
  • characters are interpreted with Unicode semantics;
  • unknown characters are not corrupted
Unicode (ISO 10646)

Unicode is a character set that supports all of the world’s languages and writing systems.

  • Code space of up to 0x10FFFF characters (about 1.1 million)
  • Unicode and ISO 10646 are maintained in sync.
    • Unicode is maintained by an industry consortium.
    • ISO 10646 is maintained by the ISO.
What are “planes”?

Planes divide Unicode into equal-sized regions of code points.

17 planes (0 through 0x10), each with 65,536 code points.

Plane 0 is called the Basic Multilingual Plane (BMP).

> 99% of text in the wild lives in the BMP

Planes 1 through 0x10 are called supplementary planes.
Unicode as the Universal Character Set

An organized collection of characters.

Each character has a code point

aka Unicode Scalar Value (USV)

U+0041 <= hex notation
Compatibility Characters


includes presentation forms

legacy encoding: a term for non-Unicode character encodings.

Many characters were included in Unicode for round-trip conversion compatibility with legacy encodings:

①②③45Ⅵ

¾Lj¼Nj½dž

︴︷︻︽﹁﹄

ヲィゥォェュ゙

ﺲ ﺳ ﻫ ﺽ ﵬ ﷺ

fi fl ffi ffl ſt ﬔ

Unicode Encodings
  • UTF-32
    • Uses 32-bit code units.
    • All characters are the same width.
  • UTF-16
    • Uses 16-bit code units.
    • BMP characters use one 16-bit code unit.
    • Supplementary characters use two special 16-bit code units: a “surrogate pair”.
  • UTF-8
    • Uses 8-bit code units (bytes!)
    • It’s a multibyte encoding!
    • Characters use between 1 and 4 bytes.
    • ASCII is ASCII in UTF-8
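
The three encoding forms listed above can be compared directly in Java. The sketch below encodes one code point per row with UTF-8, UTF-16 (big-endian), and UTF-32 (big-endian); the class name and helpers are hypothetical, and `UTF-32BE` is looked up by name because `StandardCharsets` has no constant for it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnicodeEncodingForms {
    /** Returns the bytes as space-separated hex pairs, for example "F0 90 8C B8". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes a single code point with the given charset and returns the bytes as hex. */
    static String encode(int codePoint, Charset charset) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(charset));
    }

    public static void main(String[] args) {
        Charset utf32 = Charset.forName("UTF-32BE");
        int[] samples = { 0x41, 0xC0, 0x1251, 0x10338 };
        for (int cp : samples) {
            System.out.printf("U+%04X  UTF-8: %-12s UTF-16: %-12s UTF-32: %s%n",
                    cp,
                    encode(cp, StandardCharsets.UTF_8),
                    encode(cp, StandardCharsets.UTF_16BE),
                    encode(cp, utf32));
        }
    }
}
```

For U+10338 the UTF-16 column shows two code units (a surrogate pair), while UTF-8 needs four bytes and UTF-32 still needs exactly one 32-bit unit.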
Unicode Encodings Compared

A (U+0041)

UTF-32: 0x00000041

UTF-16: 0x0041

UTF-8: 0x41

À (U+00C0)

UTF-32: 0x000000C0

UTF-16: 0x00C0

UTF-8: 0xC3 0x80

ቑ (U+1251)

UTF-32: 0x00001251

UTF-16: 0x1251

UTF-8: 0xE1 0x89 0x91

𐌸 (U+10338)

UTF-32: 0x00010338

UTF-16: 0xD800 0xDF38

UTF-8: 0xF0 0x90 0x8C 0xB8

UTF-32

Uses 32-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”)

Each character takes exactly one code unit.

U+1251 ቑ 0x00001251

U+10338 𐌸 0x00010338

Advantages and Disadvantages of UTF-32
  • Easy to process
    • each logical character takes one code unit
    • can use pointer arithmetic
  • Not commonly used
    • Not efficient for storage
      • 11 bits are never used
      • BMP characters are the most common—16 bits wasted for each of these
    • Affected by processor architecture (Big-Endian vs. Little-Endian)
UTF-16
  • Uses 16-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”)
    • BMP characters use one unit
    • Supplementary characters use a “surrogate pair”, special code points that don’t do anything else.

0x1251 = U+1251 ቑ

0xD800 0xDF38 = U+10338 𐌸 (High Surrogate, Low Surrogate)

Unique Ranges! High surrogates: 0xD800-DBFF; low surrogates: 0xDC00-DFFF

Advantages and Disadvantages of UTF-16
  • Most common languages and scripts are encoded in the BMP.
    • Less wasteful than UTF-32
    • Simpler to process (excepting surrogates)
    • Commonly supported in major operating environments, programming languages, and libraries
  • May not be suitable for all applications
    • Affected by processor architecture (Big-Endian vs. Little-Endian)
    • Requires more storage, on average, for Western European scripts, ASCII, HTML/XML markup.
UTF-8
  • 7-bit ASCII is itself
  • All other characters take 2, 3, or 4 bytes each
    • lead bytes have a special pattern
    • trailing bytes range from 0x80->0xBF
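
The lead-byte patterns make it possible to find character boundaries by looking at one byte at a time. A minimal sketch, with hypothetical names, that classifies a byte by its bit pattern only (it does not validate the sequence, so overlong or out-of-range forms are not rejected):

```java
import java.nio.charset.StandardCharsets;

class Utf8LeadByte {
    /**
     * Returns the sequence length announced by a byte's bit pattern:
     * 1-4 for a lead byte, 0 for a trail byte (10xxxxxx), -1 if no pattern matches.
     */
    static int sequenceLength(int b) {
        b &= 0xFF;
        if (b < 0x80) return 1;  // 0xxxxxxx: ASCII
        if (b < 0xC0) return 0;  // 10xxxxxx: trail byte
        if (b < 0xE0) return 2;  // 110xxxxx
        if (b < 0xF0) return 3;  // 1110xxxx
        if (b < 0xF8) return 4;  // 11110xxx
        return -1;               // 11111xxx: not used by UTF-8
    }

    /** Counts code points in well-formed UTF-8 by counting the bytes that are not trail bytes. */
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if (sequenceLength(b) != 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A, À, ቑ, and 𐌸 take 1, 2, 3, and 4 bytes respectively.
        byte[] bytes = "A\u00C0\u1251\uD800\uDF38".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length + " bytes, " + countCodePoints(bytes) + " code points"); // 10 bytes, 4 code points
    }
}
```

Because trail bytes never look like lead bytes, a reader can resynchronize from any position by skipping forward to the next byte that is not of the form 10xxxxxx.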

Code Points      Lead Byte    Trail Bytes
< 0x80           0xxxxxxx
< 0x800          110xxxxx     10xxxxxx
< 0x10000        1110xxxx     10xxxxxx 10xxxxxx
Supplementary    11110xxx     10xxxxxx 10xxxxxx 10xxxxxx

Advantages and Disadvantages of UTF-8

ASCII-compatible

Default or recommended encoding for many Internet standards

Bit pattern highly detectable (over longer runs)

Non-endian

Streaming

C char* friendly

Easy to navigate

Multibyte encoding requires additional processing awareness

Non-shortest form checking needed

Less efficient than UTF-16 for large runs of Asian text

Byte Order Mark (BOM)

U+FEFF

  • Used to indicate the “byte-order” of UTF-16 code units
    • 0xFE FF; 0xFF FE
  • Also used as a Unicode signature by some software (Windows’s Notepad editor, for example) for UTF-8
    • 0xEF BB BF

Appears as a character or renders as junk in some formats or on some systems. For example, older browsers render it as three bits of mojibake.

The Replacement Character

U+FFFD

  • Indicates a bad byte sequence or a character that could not be converted.
  • Equivalent to “question marks” in legacy encoding conversions

there was a character here, but it is gone now

Composing Characters Using Combining Marks

Composition can create “new” characters

Base + non-spacing (“combining”) characters

A + ˚ = Å

U+0041 + U+030A = U+00C5

a + ˆ + . = ậ

U+0061 + U+0302 + U+0323 = U+1EAD

a + . + ˆ = ậ

U+0061 + U+0323 + U+0302 = U+1EAD
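
These compositions can be checked with the JDK's `java.text.Normalizer`. The sketch below composes the sequences shown above (normalization form C) and decomposes the result again (form D); the class name and the `nfc`/`nfd` helpers are hypothetical wrappers.

```java
import java.text.Normalizer;

class ComposeDemo {
    /** Canonical decomposition followed by composition (form C). */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Canonical decomposition (form D). */
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // A + combining ring above composes to U+00C5.
        System.out.println(nfc("A\u030A").equals("\u00C5"));
        // Either order of the two combining marks composes to U+1EAD.
        System.out.println(nfc("a\u0302\u0323").equals("\u1EAD"));
        System.out.println(nfc("a\u0323\u0302").equals("\u1EAD"));
        // Decomposition puts the marks into one canonical order.
        System.out.println(nfd("\u1EAD").equals("a\u0323\u0302"));
    }
}
```

Comparing strings only after normalizing both sides to the same form avoids treating these visually identical sequences as different text.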

Complex Scripts

ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท

ญั=ญ+ั

glyph = consonant + vowel

ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท(word boundaries)

Hindi

What is Unicode?

यूनिकोड क्या है?

यूनिकोड

यूनिकोड

न + ि = नि

Tamil

கொ ‘ko’

U+0B95 U+0BBE U+0BC6

Combining mark drawn to the left of the base character

Normalization

abc: Abc, ABC, abc, abC, aBc

Ǻ: U+01FA; U+00C5 U+0301; U+00C1 U+030A; U+212B U+0301; U+0041 U+0301 U+030A; U+0041 U+030A U+0301

  • Unicode Normalization has to deal with more issues:
    • single or multiple combining marks
    • compatibility characters
    • presentation forms

Four Normalization Forms

ways to represent:

U+01FA

U+00C5 U+0301

U+00C1 U+030A

U+212B U+0301

U+0041 U+0301 U+030A

U+0041 U+030A U+0301

Ǻ

  • Form D
    • canonical decomposition
  • Form C
    • canonical decomposition followed by composition
  • Form KC
    • kompatibility decomposition followed by composition
  • Form KD
    • kompatibility decomposition

Unicode Defines Character Properties

Unicode provides additional information:

  • Character name
  • Character class
  • “ctype” information, such as if it’s a digit, number, alphabetic, etc.
  • Directionality (LTR, RTL, etc.) and the Bidi Algorithm
  • Case mappings (UPPER, lower, and Titlecase)
  • Default Collation and the Unicode Collation Algorithm (UCA)
  • Identifier names
  • Regular Expression syntaxes
  • Normalization
  • Compatibility information

Many of these items are in the form of Unicode Technical Reports

Unicode Character Database

code point

name

character class

combining level

bidi class

case mappings

canonical decomposition

mirroring

default grapheme clustering

ӑ (U+04D1)

CYRILLIC SMALL LETTER A WITH BREVE

  • letter
  • non-combining
  • left-to-right
  • decomposes to U+0430 U+0306
  • Ӑ (U+04D0) is uppercase (and titlecase)
Bi-directional Scripts

Some languages are written predominantly from left-to-right (LTR).

Some languages are written predominantly from right-to-left (RTL).

(A few can be written top-to-bottom or using other schemes)

Unicode defines character “directionality” and a “Bidi” algorithm for rendering text.

  • Uses logical, not visual, order.
  • Uses levels of “embedding”.
  • Requires markup changes in some HTML for full support.
Embedding and “Logical Order”

Characters are encoded in logical order.

Visual order is determined by the layout.

  • Override and bidi control characters
  • “Indeterminate” characters
Transfer Encodings

A transfer encoding syntax is a reversible transform of encoded data which may (or may not) include textual data represented in one or more character encoding schemes.

Abcソース

=?UTF-8?B?QWJj44K944O844K5?=

Abcソース

Email headers

URIs

IDN (domain names)
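
The email-header form shown above (a Base64 "encoded word") can be produced with the JDK's `java.util.Base64`. This is a simplified sketch with hypothetical names: it handles only the `=?UTF-8?B?...?=` shape and ignores the line-length limits that real mail software has to respect.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class EncodedWord {
    private static final String PREFIX = "=?UTF-8?B?";
    private static final String SUFFIX = "?=";

    /** Character encoding first (text to UTF-8 bytes), then transfer encoding (bytes to Base64). */
    static String encode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return PREFIX + Base64.getEncoder().encodeToString(utf8) + SUFFIX;
    }

    /** Reverses encode(); assumes the input has exactly the prefix and suffix used above. */
    static String decode(String word) {
        String base64 = word.substring(PREFIX.length(), word.length() - SUFFIX.length());
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Abc\u30BD\u30FC\u30B9"; // Abcソース
        String wire = encode(text);
        System.out.println(wire); // =?UTF-8?B?QWJj44K944O844K5?=
        System.out.println(decode(wire).equals(text)); // true
    }
}
```

The transfer encoding is a separate layer on top of the character encoding, which is why forgetting to remove it (or removing it twice) is one of the sources of mojibake.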

“That’s great: I’ll just use Unicode”
  • Remember “all text has an encoding”?
    • user input via forms
    • email
    • data feeds
    • existing, legacy data
    • database instances
    • uploads
  • Use UTF-8 for HTML and Web forms
  • Use UTF-8 in your APIs
  • Check that data really is UTF-8
  • Control encoding via code; avoid hardcoding the encoding
  • Watch out for legacy encodings
    • Convert to Unicode as soon as practical.
    • Convert from Unicode as late as possible.
    • Wrap Unicode-unfriendly technologies
Map Your System

APIs

  • use Unicode encoding
  • hide internal storage encoding

Data Stores, Local I/O

  • use Unicode encoding
  • consider an encoding conversion plan

Front Ends

  • use Unicode encoding

Back Ends, External Data

  • Uses Unicode?
  • If not, what encoding?
  • Store the encoding!

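[Diagram: Your System drawn as a “Unicode Cloud” behind a Unicode Interface (API). Input passes through Capture Encoding and Detect / Convert on its way to Unicode; data in a Legacy Encoding passes through Detect / Convert on the way in and Convert to Legacy on the way out.]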

HTML

Set Web server to declare UTF-8 in HTTP Content-Type header

Declare UTF-8 in META tag header

Actually use UTF-8 as the encoding!!

<?php
header("Content-type: text/html; charset=UTF-8");
?>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Fight 文字化け!</title>
</head>

Counting Things

varchar(110)

यूनिकोड (4 glyphs)

यूनिकोड (7 characters)

E0-A4-AF E0-A5-82 E0-A4-A8 E0-A4-BF E0-A4-95 E0-A5-8B E0-A4-A1 (21 bytes)

Be aware of whether you need to count glyphs, characters, or bytes:

  • Is the limit “screen positions”, “characters”, or “bytes of storage”?
  • Should you be using a different limit? Which one are you actually counting?
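
The different counts for the sample word above can be computed as follows. This is a sketch with hypothetical names; the grapheme count uses `BreakIterator`, whose result for Indic scripts can vary between JDK versions, so treat it as an approximation of what users see.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.Locale;

class CountingThings {
    /** Bytes of storage when the text is stored as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** UTF-16 code units: what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Unicode code points ("characters" in the slide's sense). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Approximate user-perceived characters, using character break boundaries. */
    static int graphemes(String s, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String word = "\u092F\u0942\u0928\u093F\u0915\u094B\u0921"; // यूनिकोड
        System.out.println("UTF-8 bytes: " + utf8Bytes(word));   // 21
        System.out.println("code units:  " + codeUnits(word));   // 7
        System.out.println("code points: " + codePoints(word));  // 7
        System.out.println("graphemes:   " + graphemes(word, Locale.ROOT));
    }
}
```

A `varchar` limit is usually expressed in one of these units, and which one depends on the database and its configuration, so it is worth checking before sizing a column.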
Enabling Code for formats and presentation

Adapting code to language, regional, and cultural variation

Don’t Code What You Think You Know

5/2/7: sometime in February? sometime in May? sometime in 2005?

1.234: more than 1000? less than 2?

4.32.MD: number, time, currency? morning or afternoon?

Time Formats

5:00 AM

5:00 PM

10:00 PM

Don’t forget to identify time zone!

U.S.A.: 4:00 p.m.

France: 16.00

Japan: 1600

Japan: ごご4:00

Korea: 오후 4:32

Thai: 16:32 น.

Albanian: 4.32.MD

Arabic: 04:32 م

More Examples

Assumptions about date tokens:

USA: Sun, Mon, Tue (3 positions, titlecase)

French: lun. mar. mer. (four positions, lowercase)

Russian: Пн Вт Ср (two positions, Cyrillic)

USA: Jan, Feb, Mar (3 positions, titlecase)

French: janv. févr. mars avr. (variable: 4 or 5 positions, lowercase)

Spanish (Spain): ene, feb, mar (not titlecase)

Spanish (Americas): Ene, Feb, Mar (titlecase)

Calendars: What Year Is It?
  • Legal, ceremonial, or popular requirement
    • Gregorian: 2007
    • Japan (Emperor): Heisei 19 (平成19年)
    • Thailand (Buddhist): 2550 (Gregorian + 543)
    • Chinese (traditional): 4704 (lunar)
    • Hebrew: 5767 (תשסו) (lunar)
    • Hijri (Islamic): 1428 (lunar)
    • Armenian: 1456 (ԹՎ ՌՆԾԶ)
    • etc. etc. etc.
Weekends and Holidays
  • When is the weekend?
    • Friday is part of the weekend in some countries.
  • Both official and unofficial holidays vary widely in number. Here are a few to watch for:
    • USA: July 4, MLK Day, Presidents’ Day, Veterans Day, Flag Day, Columbus Day, Thanksgiving…
    • Japan: Golden Week
    • China: New Year’s
    • Britain: Guy Fawkes Day, Boxing Day
    • France: Bastille Day
    • Spain: Reyes Magos
Number and List Formats

Grouping and decimal separators:

England: 12,345.67

Germany: 12.345,67

Switzerland: 12’345,67

Swiss money: 12’345.67

France: 12 345,67

India: 12,34,567.89

France uses a non-breaking space!

India: number of digits in groupings changes!

List delimiters & separators can conflict

French example:

2 345,67, 1 012,34, 45,67 (hard to read)

2 345,67; 1 012,34; 45,67 (easier to read)
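
In Java, these separators come from the locale rather than from the code. A minimal sketch (hypothetical class and method names) using `java.text.NumberFormat`; the exact output for some locales, such as the kind of space used for French grouping or the Indian grouping pattern, depends on the locale data shipped with the JDK.

```java
import java.text.NumberFormat;
import java.util.Locale;

class GroupingDemo {
    /** Formats value with exactly two fraction digits, using the locale's separators. */
    static String format(double value, Locale locale) {
        NumberFormat nf = NumberFormat.getNumberInstance(locale);
        nf.setMinimumFractionDigits(2);
        nf.setMaximumFractionDigits(2);
        return nf.format(value);
    }

    public static void main(String[] args) {
        System.out.println("US:      " + format(12345.67, Locale.US));      // 12,345.67
        System.out.println("Germany: " + format(12345.67, Locale.GERMANY)); // 12.345,67
        // Output for the next two varies with the JDK's locale data.
        System.out.println("France:  " + format(12345.67, Locale.FRANCE));
        System.out.println("India:   " + format(1234567.89, Locale.forLanguageTag("en-IN")));
    }
}
```

Because the grouping character may be a non-breaking space, code that parses or compares formatted numbers should go through `NumberFormat.parse` instead of matching literal separators.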

Collation (A FANCY WORD FOR “SORTING”)

English: ABC...RSTUVWXYZ

German: AÄB...NOÖ...SßTUÜV…YZ

Swedish/Finnish: AB...STUVWXYZÅÄÖ

Norwegian: AB...VWXYÜZÆØÅ

Organizing Information
  • “Alphabet” differences
  • Additional information
    • for example: yomi
  • ASCII vs. the world
  • Mixed information sets
“Should I be writing all of this down…”

Wide range of variation

Obscure formats

Difficult to obtain reliable information on formats

Lots of work to implement and maintain

Enabling means not having to know (m)any of the details

Supporting International Formats
  • Use neutral data structures
    • Makes code independent of locale
    • Most data types are locale-neutral:
      • Boolean
      • String, char
      • Number classes
      • Date, Calendar
  • Encapsulate formatting/validation in a function
    • Format style chosen dynamically at runtime
    • Format details don’t have to be specified or researched
    • APIs know the gory details
Essence of Enabling
  • Object to Presentation, Presentation to Object
    • Integers
    • Floats
    • Percents
    • Currencies
    • Dates
    • Times
    • Durations
    • Collation (lists)
    • Weights/measures/sizes
    • Resources (user interface strings)

java.util.Locale

user presentation

Locale

an identifier or data structure that allows programmers to access culturally and linguistically affected functionality in a system.

NumberFormat Demo Code

import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

// Formats n for display in the given column, using the conventions of locale l.
public String formatNumber(int column, Number n, Locale l) {
    NumberFormat format;
    Currency c;
    switch (column) {
    default:
    case 1: // general-purpose number
        format = NumberFormat.getInstance(l);
        break;
    case 2: // integer
        format = NumberFormat.getIntegerInstance(l);
        break;
    case 3: // percent
        format = NumberFormat.getPercentInstance(l);
        break;
    case 4: // currency
        format = NumberFormat.getCurrencyInstance(l);
        try {
            c = Currency.getInstance(l);
        } catch (IllegalArgumentException e) {
            // can get here if you specify a locale with no
            // country or one with a country code that isn't supported
            c = null;
        }
        if (c == null) {
            // null is also returned for a territory with no currency
            // (like my favorite territory 'AQ'),
            // in which case we use the Almighty Buck
            c = Currency.getInstance("USD");
        }
        format.setCurrency(c);
        break;
    case 0: // unformatted
        return n.toString();
    }
    return format.format(n);
}
break iterator

BreakIterator iter = BreakIterator.getWordInstance(b_locale);
iter.setText(str);
int last = iter.first(); // points to the start of the string
int pos = iter.next(); // so move to next break
int longest = 0;
while (pos != BreakIterator.DONE) {
    String sub = str.substring(last, pos).trim();
    // …
    last = pos;
    pos = iter.next();
}

Break Iterator

Break iterators allow you to break text into characters, words, lines, and sentences.

In the demo, we use a word break iterator to find word-breaks. We also use a character break iterator to find approximate glyph breaks.
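The character break iterator mentioned above can be sketched as follows. This is not the demo's code, only an assumed helper that counts user-perceived characters, which can differ from String.length() when combining marks are present.

```java
import java.text.BreakIterator;
import java.util.Locale;

class GraphemeCount {
    // Counts the boundaries a character break iterator reports for the text.
    static int countUserCharacters(String text, Locale locale) {
        BreakIterator iter = BreakIterator.getCharacterInstance(locale);
        iter.setText(text);
        int count = 0;
        while (iter.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String word = "cafe\u0301"; // "cafe" followed by a combining acute accent
        System.out.println("chars: " + word.length());
        System.out.println("user characters: " + countUserCharacters(word, Locale.FRENCH));
    }
}
```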

collator

Collator nativeCol = Collator.getInstance(b_locale);

bMap = new TreeSet(nativeCol);

Collator

Collator is the class that does linguistically correct sorting. In the demo, it’s really easy to use: Java Collections can take a comparator and do all the work internally. All we have to do is provide the right one.
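A short sketch of the difference a Collator makes, assuming a hypothetical word list. Natural String ordering compares code units, so an accented capital lands after "z"; the German collator puts it next to the other "a" words.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollationDemo {
    // Returns a copy of the words sorted with the locale's collation rules.
    static List<String> sortFor(Locale locale, List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Collator.getInstance(locale));
        return sorted;
    }

    public static void main(String[] args) {
        // "\u00C4pfel" is "Apfel" with A-umlaut as its first letter.
        List<String> words = Arrays.asList("zebra", "\u00C4pfel", "apple", "Banane");
        List<String> binary = new ArrayList<>(words);
        binary.sort(null); // natural (code unit) order
        System.out.println("binary:   " + binary);
        System.out.println("collated: " + sortFor(Locale.GERMAN, words));
    }
}
```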

complex types
Complex Types
  • Data structures, APIs, or classes built from basic types must include similar capabilities.
    • Store data in a locale-neutral or independent format.
    • Display in a linguistically, regionally, and culturally sensitive manner.
    • Convert from locale format to locale-neutral or locale-independent storage format.
design time and data structures
Design Time and Data Structures
  • Identify your own “locale bias”
    • Field names matter!
      • “Postal Code”, not “ZIP code”.
      • Family Name/Given Name, not First Name/Last Name
    • Avoid problematic fields
      • Postal address parsing? Area code? Etc.
currency
Currency
  • Currency formatting is usually similar to number formatting. But things can vary widely here, too:
    • $1,100.00 [USA]
    • €1 100,00 [France-Euro]
    • ¥1,100 [Japan]
    • 1.100$00 Esc. [Portugal, obsolete]
    • SFr. 1’000.00 [Switzerland]
  • Currency associated with the locale doesn’t always apply. Store the currency type with value.
    • Use ISO 4217 standard codes (USD, JPY, EUR, RUB)
  • Not always one symbol.
  • Not always two decimal places.
  • $100 + ¥100 = $101
  • Consider neutral displays!
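One way to "store the currency type with the value" is a small value class like the sketch below. The Money class is hypothetical; it uses java.util.Currency for the ISO 4217 code and the default number of decimal places, and it refuses to add amounts in different currencies.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Currency;

final class Money {
    private final BigDecimal amount;
    private final Currency currency;

    Money(String amount, String isoCode) {
        this(new BigDecimal(amount), Currency.getInstance(isoCode)); // rejects unknown codes
    }

    private Money(BigDecimal amount, Currency currency) {
        this.amount = amount;
        this.currency = currency;
    }

    // Adding values only makes sense when the currencies match.
    Money plus(Money other) {
        if (!currency.equals(other.currency)) {
            throw new IllegalArgumentException("Cannot add " + other.currency + " to " + currency);
        }
        return new Money(amount.add(other.amount), currency);
    }

    // Neutral display: plain digits at the currency's default scale, then the ISO code.
    String toNeutralString() {
        int digits = Math.max(currency.getDefaultFractionDigits(), 0);
        return amount.setScale(digits, RoundingMode.HALF_EVEN).toPlainString()
                + " " + currency.getCurrencyCode();
    }

    public static void main(String[] args) {
        System.out.println(new Money("1100", "USD").toNeutralString());
        System.out.println(new Money("1100", "JPY").toNeutralString());
    }
}
```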
being locale neutral
Being Locale Neutral
  • Avoid or reduce locale-affected display to increase portability
    • Use unambiguous formats, such as ISO 8601-like dates, especially in log files and the like
      • 2005-04-01 14:17:00 UTC
    • Use consistent formats (‘user locale’), especially in columns or collections of data

Amount Currency

351,234.56 USD

102,556.78 EUR

65,336.00 JPY

212,345.00 INR

Amount Currency

351,234.56 USD

102 556,78 EUR

65336 JPY

2,12,345.00 INR
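For the log-file case, a sketch of an unambiguous timestamp using java.time. The LogTimestamp class and the message text are assumptions; DateTimeFormatter.ISO_INSTANT always renders the instant in UTC.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class LogTimestamp {
    // Prefixes a message with an ISO 8601 UTC timestamp, independent of the default locale.
    static String logLine(Instant when, String message) {
        return DateTimeFormatter.ISO_INSTANT.format(when) + " " + message;
    }

    public static void main(String[] args) {
        Instant when = LocalDateTime.of(2005, 4, 1, 14, 17, 0).toInstant(ZoneOffset.UTC);
        System.out.println(logLine(when, "service started"));
    }
}
```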

string is the thing
“String is the Thing”
  • Text doesn’t get translated on the fly.
  • Don’t use text as an identifier or foreign key.
    • Use ID Numbers or not-human-readable values instead of requiring text fields to match.
    • “Intrinsic” data value versus “display” data value.
  • Enumerated values displayed as strings.
  • Use display strings.

Enumerated

ACCOUNTS_PAYABLE

Displayed

“Accounts Payable”

“pagável de clientes”
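A minimal sketch of keeping the intrinsic value separate from the display value. The enum, the second constant, and the in-memory bundle are hypothetical; a real product would load the strings from externalized resources.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

class DisplayStringDemo {
    // Intrinsic values: stored, compared, and used as keys. Never shown to users.
    enum AccountType { ACCOUNTS_PAYABLE, ACCOUNTS_RECEIVABLE }

    // Display values for one language, keyed by the enum name.
    static final ResourceBundle ENGLISH = new ListResourceBundle() {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"ACCOUNTS_PAYABLE", "Accounts Payable"},
                {"ACCOUNTS_RECEIVABLE", "Accounts Receivable"},
            };
        }
    };

    static String displayName(AccountType type, ResourceBundle strings) {
        return strings.getString(type.name());
    }

    public static void main(String[] args) {
        System.out.println(displayName(AccountType.ACCOUNTS_PAYABLE, ENGLISH));
    }
}
```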

english like construction
English-like Construction
  • Concatenation
    • String1 + string2
  • Pluralization
    • Dog + “s” = “dogs” (sheeps??)
  • Lists
    • 1.23, 2.23, 3.36
    • 1,23, 2,23, 3,36?

This topic will be covered in greater depth in the section on localization.

databases
Databases
  • Most databases can only handle one collation sequence per instance or one collation per index.
    • Remove reliance on alphalists.
    • Self-collate short lists.
    • Pre-collate long lists?
  • Example: NLS_SORT controls the way Oracle returns data (collation sequence).
    • Global environment variable.
    • Not necessarily under your control.
    • Indices are built on a predetermined or binary sort.
enabling summary
Enabling Summary
  • Understand Encodings and Unicode
    • All text has an encoding!
  • Be Locale-Aware
    • Create locale-neutral data structures
    • Separate display from storage
it s about time
It’s About Time

Dates, Times, Durations, Calendars and Time Zones

computer vs wall time
Computer vs. Wall Time
  • Incremental Time
    • Clock ticks since epoch (the ticks and epoch vary)
    • Usually UTC-based
  • Field-based Time
    • Zone independent
      • birth date, start date, end date
    • Zone dependent
      • recurring meeting schedule
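The two kinds of time can be sketched with java.time: an Instant is incremental, while a LocalDate is field-based and only exists once a zone is chosen. The class name is hypothetical and the timestamp is an arbitrary sample value.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

class WallTimeDemo {
    // Converts an incremental time to the calendar date seen on a wall in the given zone.
    static LocalDate localDateOf(Instant instant, String zoneId) {
        return instant.atZone(ZoneId.of(zoneId)).toLocalDate();
    }

    public static void main(String[] args) {
        Instant tick = Instant.ofEpochMilli(1034197545321L);
        System.out.println("Tokyo:       " + localDateOf(tick, "Asia/Tokyo"));
        System.out.println("Los Angeles: " + localDateOf(tick, "America/Los_Angeles"));
    }
}
```

The same tick falls on different calendar dates in the two zones, so a field-based value cannot be derived from an incremental one without a time zone.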
time zone
Time Zone
  • a geographical region that has common rules for determining local time.
  • These include:
    • Offset from UTC
    • Daylight Saving (Summer Time) behavior
    • Historic changes in offset or DST behavior
    • Political control
time zone affected scenarios
Time Zone Affected Scenarios
  • Zone independent
    • only “incremental” times are necessary
  • Local time, past only
    • future changes to time zone rules not applicable
    • example: logging system
  • Local time, both past and future
    • time zone rule changes may affect some time values
    • example: calendar program
  • Floating times
    • events not tied to a specific time zone
    • example: birthdate, start date, definition of “night” for phone usage
  • Recurring events
    • events that recur, sometimes during daylight saving time and sometimes not.
    • example: weekly status meeting
time zone identifiers
Time Zone Identifiers
  • Offset
    • Etc/UTC
    • Etc/GMT+1
  • Ocean/Island(City)
    • Atlantic/Canary
    • Pacific/Auckland
    • Pacific/Pago_Pago
  • Continent/City
    • America/Los_Angeles
    • Europe/Paris
    • Asia/Tokyo
    • Antarctica/DumontDUrville
  • Continent/Region/City
    • America/Indiana/Indianapolis

Often based on the time zone information database (tzinfo). These identifiers are sometimes called Olson IDs.

locale neutral formats
Locale-Neutral Formats
  • Use locale-neutral formats for interchange:
    • ISO 8601
    • Incremental time values (e.g. time_t)
    • Distinguish time zone if necessary for interpretation
      • Offset is not the same as time zone
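To illustrate why an offset is not a time zone, the sketch below asks one zone for its offset at two instants in the same year. The class name and sample dates are assumptions.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

class OffsetVsZone {
    // Looks up the UTC offset that a time zone's rules give for one particular instant.
    static ZoneOffset offsetAt(String zoneId, String isoInstant) {
        return ZoneId.of(zoneId).getRules().getOffset(Instant.parse(isoInstant));
    }

    public static void main(String[] args) {
        System.out.println(offsetAt("America/Los_Angeles", "2006-01-20T12:00:00Z"));
        System.out.println(offsetAt("America/Los_Angeles", "2006-07-20T12:00:00Z"));
    }
}
```

A stored offset records what the rules said at one moment; the zone identifier is what lets you recompute the offset for any other moment.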

SQL data types and XML formats are often field-based, while programming languages are usually incremental.

At any given time, in UTC, it is the same time everywhere that time is measured.

durations and repeating events
Durations and Repeating Events

Wall-time: this meeting is at 2 PM Pacific time every Tuesday

  • interval between meetings may vary in number of seconds
    • Daylight time transitions
    • Changes in DST rules

Fixed-duration: run the virus scanner every 57 minutes

  • interval is always 3,420,000 milliseconds
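A sketch of both kinds of repetition with java.time, under the assumption that the weekly meeting is held in America/Los_Angeles. Adding a week keeps the wall-clock time, so the elapsed time changes across a daylight saving transition; adding a Duration always advances by the same amount.

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

class RecurrenceDemo {
    // Wall-time repetition: same local time one week later, however long that takes.
    static long hoursUntilNextWeeklyMeeting(ZonedDateTime meeting) {
        return Duration.between(meeting, meeting.plusWeeks(1)).toHours();
    }

    // Fixed-duration repetition: always exactly 57 minutes after the last run.
    static ZonedDateTime nextScan(ZonedDateTime lastRun) {
        return lastRun.plus(Duration.ofMinutes(57));
    }

    public static void main(String[] args) {
        ZoneId pacific = ZoneId.of("America/Los_Angeles");
        ZonedDateTime beforeDst = ZonedDateTime.of(2006, 3, 28, 14, 0, 0, 0, pacific);
        System.out.println("hours across the transition: " + hoursUntilNextWeeklyMeeting(beforeDst));
        System.out.println("hours in a normal week: " + hoursUntilNextWeeklyMeeting(beforeDst.plusWeeks(1)));
        System.out.println("scan interval in ms: " + Duration.ofMinutes(57).toMillis());
    }
}
```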
calendars
Calendars

Gregorian

Japanese Imperial

Hijri

Thai Buddhist

Chinese Traditional

Jewish

Astronomy

Friday, January 20, 2006

الجمعة، 20 ذو الحجة، 1426

2006年1月20日星期五

二○○六年一月二十日星期五

平成18年1月20日

平成十八年一月二十日

วันศุกร์ที่ 20 มกราคม พ.ศ. 2549

วันศุกร์ที่ ๒๐ มกราคม พ.ศ. ๒๕๔๙

Calendars affect the field values calculated for a given event. The “roll” of values such as month, week, and day depends on such relationships. Calendar code then converts to incremental times.

formatting dates and times
Formatting Dates and Times

October 10, 14H 6:05:45 AM JST

value being formatted

1034197545321L

defines relation to “wall time”

Asia/Tokyo

defines rules for calculating field values

Japanese Imperial

Requires more than just a locale!

  • date
  • time zone
  • calendar
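The three inputs can be sketched with java.time: the incremental value, a time zone to relate it to wall time, and a calendar (here java.time.chrono.JapaneseDate) to compute the fields. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.chrono.JapaneseDate;
import java.time.temporal.ChronoField;

class CalendarFieldsDemo {
    // Year within the Japanese Imperial era for an instant as seen in the given zone.
    static int imperialYear(long epochMillis, String zoneId) {
        ZonedDateTime wallTime = Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zoneId));
        return JapaneseDate.from(wallTime.toLocalDate()).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        long value = 1034197545321L;
        ZonedDateTime tokyo = Instant.ofEpochMilli(value).atZone(ZoneId.of("Asia/Tokyo"));
        JapaneseDate imperial = JapaneseDate.from(tokyo.toLocalDate());
        System.out.println("wall time: " + tokyo);
        System.out.println("era: " + imperial.getEra() + ", year of era: " + imperialYear(value, "Asia/Tokyo"));
    }
}
```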
example java date formatting
Example: Java Date Formatting

Computer Time (Data Structure)

java.util.Date: long integer, milliseconds since “epoch” of January 1, 1970, 00:00 UTC

externalization

Externalization

Moving language and culturally affected data and components out of code.

what is localization
What is Localization?
  • The process of tailoring a product to a specific target market.
    • Translation of messages
    • Adaptation to local preferences
    • Addition (or subtraction) of content or features
localization is obvious
Localization is obvious…
  • “Localization” is not “Internationalization”!
  • Localizability is internationalization.
    • Externalize text
    • Externalize presentation
    • Dynamic composition
    • Distribution of language content
    • “Plug-in” features
avoiding forks

Global Binary

Resources

Resources

Resources

Resources

Avoiding Forks

English Version

version française

Deutsche Version

日本語版

forked code woes
Forked Code Woes

Hard to fix and maintain

Different versions in the field

Delays in releasing localized product

Different functionality by region

Confusing for customers/users

Versions are not interoperable and might not be able to exchange data!

other benefits
Other Benefits

Rename or re-brand product

Fix spelling or grammar mistakes

Fix usability

Make terminology consistent

… all without a rebuild!

what is a resource
What is a ‘Resource’?

any application component loaded dynamically at runtime, rather than compiled into the application

  • in Localization: source code files containing language, region, or culturally-affected materials

$SET 1 Prompts

1 ENTER FIRST NAME

2 ENTER LAST NAME

$

$set 2 Error Messages

1 NAME NOT ON DATA BASE

2 ILLEGAL INPUT

a gencat message catalog file

  • Text
  • Error messages
  • Icons
  • Pictures
  • Fonts
  • Colors
  • Graphics
  • Sizes
  • Positions
  • Magic Numbers
  • Mnemonics (“Alt+G”, “F4”, etc.)
  • File Locations
  • Dictionaries
  • Glossaries
  • Grammar Rules
  • Code
non translatable resources
Non-Translatable Resources
  • Some content should be externalized but not translated
    • Sometimes referred to as “DNT” for “do not translate”
  • Externalize? Yes…
    • Segregate DNT material from translated material if possible (by using separate resource files or separate resource blocks within a file).
    • Developers can’t always tell when something should or should not be DNT… and neither can translators (context is missing)
the locale in localization

Global Binary

Resources

The “Locale” in “Localization”

Resources “fall back” to find the best match

zh-Hans-SG (Chinese, Simplified script, Singapore)

zh-Hans (Chinese, Simplified script)

zh (Chinese)

(root)

Falling back
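The fallback chain shown above can be approximated by truncating the language tag one subtag at a time. This is a simplified sketch with hypothetical names; real resource loaders apply their own, sometimes longer, candidate lists.

```java
import java.util.ArrayList;
import java.util.List;

class FallbackChain {
    // Lists the tag, then each shorter prefix, ending with "" for the root.
    static List<String> chain(String languageTag) {
        List<String> chain = new ArrayList<>();
        String tag = languageTag;
        while (!tag.isEmpty()) {
            chain.add(tag);
            int cut = tag.lastIndexOf('-');
            tag = cut < 0 ? "" : tag.substring(0, cut);
        }
        chain.add(""); // root
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(chain("zh-Hans-SG"));
    }
}
```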

sparse population
Sparse Population
  • A given language resource may not contain a complete set of resources.
    • Some resource systems fall back separately for each sub-resource (such as a particular value)
“appName” “Démo”

“dialogTitle” “Bonjour monde”

“appName” “Demo”

“maxRows” 57

“dialogTitle” “Hello World”
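Per-value fallback can be sketched with java.util.ResourceBundle parent chaining. The sketch assumes the first two entries above belong to a French resource and the remaining three to the root; the helper and class names are hypothetical, and the bundles are built in memory rather than loaded from files.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

class SparseBundleDemo {
    // Creates an in-memory bundle whose missing keys are looked up in the parent.
    static ResourceBundle bundle(final ResourceBundle parent, final Object[][] contents) {
        return new ListResourceBundle() {
            { setParent(parent); } // a null parent means there is nothing to fall back to

            @Override
            protected Object[][] getContents() {
                return contents;
            }
        };
    }

    static final ResourceBundle ROOT = bundle(null, new Object[][] {
        {"appName", "Demo"}, {"maxRows", 57}, {"dialogTitle", "Hello World"}});

    // The French resource is sparse: it has no "maxRows" entry of its own.
    static final ResourceBundle FRENCH = bundle(ROOT, new Object[][] {
        {"appName", "D\u00e9mo"}, {"dialogTitle", "Bonjour monde"}});

    public static void main(String[] args) {
        System.out.println(FRENCH.getString("dialogTitle"));
        System.out.println(FRENCH.getObject("maxRows")); // resolved from the root
    }
}
```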
getting the right locale
Getting the Right Locale

Client Locale

One request might serve multiple purposes or be seen in multiple contexts

Server Locale

client

API Request Locale

System Mgmt Locale

Front End

API

Business Logic

Business Logic

Data Store

Data Store

Operating Env.

Operating Env.

resources and translation

“key”, “ðìsplàÿ stríñg”

“dialogTitle”, “Ðîálòg Tïtlè”

“aMessage”, “Thìß ís â Mésßãgê.”

Resources and Translation

“key”, “display string”

“dialogTitle”, “Dialog Title”

“aMessage”, “This is a message.”

Pseudo-Translation
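A pseudo-translator can be very small. The sketch below is one possible approach, not the tool used for the slide: it swaps ASCII vowels for accented ones (the mapping is an arbitrary assumption) and leaves anything inside {braces} alone so replacement variables keep working.

```java
class PseudoTranslator {
    private static final String PLAIN = "aeiouAEIOU";
    // Accented counterparts, written as escapes so the source file stays ASCII.
    private static final String ACCENTED =
            "\u00e0\u00e9\u00ee\u00f2\u00fc\u00c0\u00c9\u00ce\u00d2\u00dc";

    static String pseudo(String source) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // greater than zero while inside a {placeholder}
        for (char c : source.toCharArray()) {
            if (c == '{') {
                depth++;
            }
            int i = PLAIN.indexOf(c);
            out.append(depth == 0 && i >= 0 ? ACCENTED.charAt(i) : c);
            if (c == '}' && depth > 0) {
                depth--;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(pseudo("Dialog Title"));
        System.out.println(pseudo("Page {0} of {1}"));
    }
}
```

Running the product with pseudo-translated resources makes hard-coded strings stand out, because they are the only text that still looks like plain English.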

don t build from text fragments
Don’t Build From Text Fragments
  • Text fragments are hard to translate
    • Fragments may not follow grammar rules
      • Cannot know which parts go together
      • Parts can be reused in incompatible ways

String1 = There are

String2 = no

String3 = tables in

String4 = files.

[] files out of [] were deleted.

An error occurred at [] on [].

Page [] of []

Processing: []% complete.

There are files.

There are no files.

There are 50 files.

There are tables in files.

There are no tables in files.

issues with text composition
Issues With Text Composition
  • Count:
    • There were one errors found.
    • You have earned your 22th set of bonus points.
  • Gender:
    • “Documenti del Chris”
    • “Documenti della Chris”
    • “Documenti - Chris”
  • Case
  • Grammatical Structure
    • SOV, SVO, etc.
  • Word Order and Inter-word Dependency
sentence parts must agree
Sentence Parts Must Agree
  • Endings, Gender, Plurality, Case
    • e.g. Japanese counting uses different words for different kinds of objects
    • e.g. Slavic languages use different endings for singular, few, many…
message format apis
Message Format APIs

There were {0} tables on {1}.

There were {0,number,integer} tables on {1,date,short}.

{1,date}に{0,number,integer}のテーブルがあった。

Number replacement variables.

Provide typing and formatting information where possible.

Externalize as a single unitary string.
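A sketch of the patterns above with java.text.MessageFormat. The class name and argument values are assumptions. The Japanese pattern is the one shown on the slide, written with Unicode escapes; note that it places the date before the count, which concatenated fragments could not do.

```java
import java.text.MessageFormat;
import java.util.Date;
import java.util.Locale;

class MessageDemo {
    // Formats one externalized pattern with a count and a date for the given locale.
    static String render(String pattern, Locale locale, int tables, Date when) {
        return new MessageFormat(pattern, locale).format(new Object[] {tables, when});
    }

    public static void main(String[] args) {
        String english = "There were {0,number,integer} tables on {1,date,short}.";
        String japanese = "{1,date}\u306b{0,number,integer}"
                + "\u306e\u30c6\u30fc\u30d6\u30eb\u304c\u3042\u3063\u305f\u3002";
        Date now = new Date();
        System.out.println(render(english, Locale.US, 1234, now));
        System.out.println(render(japanese, Locale.JAPAN, 1234, now));
    }
}
```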

complex message formatting
Complex Message Formatting

Examples:

  • ordinal numbers (1st, 2nd, 3rd, 4th, etc.)
  • complex messages, such as “27 seconds ago” vs. “10 minutes ago”

0:There were no errors.

1:There was {0} error.

2:There were {0} errors.

0:не было ошибок

1:была {0} ошибка

2:были {0} ошибки

5:были {0} ошибок

number of resources may need to vary by locale or language

There were no errors.

There was 1 error.

There were 2 errors.

“choice format” APIs allow for different resources to be used based on runtime values.
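In Java, one such API is the choice element of java.text.MessageFormat. The sketch below encodes the three English resources in a single pattern; the class name is hypothetical. A choice selects on numeric ranges, so a language with more plural forms, like the Russian example, would need more branches in its own pattern.

```java
import java.text.MessageFormat;
import java.util.Locale;

class ErrorCountMessage {
    // One externalized string; the branch is picked at runtime from the value of {0}.
    private static final String PATTERN =
            "{0,choice,0#There were no errors."
            + "|1#There was {0,number,integer} error."
            + "|1<There were {0,number,integer} errors.}";

    static String describe(int errors) {
        return new MessageFormat(PATTERN, Locale.US).format(new Object[] {errors});
    }

    public static void main(String[] args) {
        for (int errors = 0; errors <= 2; errors++) {
            System.out.println(describe(errors));
        }
    }
}
```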

images and icons
Images and Icons

Avoid metaphors

Avoid cultural sensitivities

Avoid body parts

Replace as necessary

Avoid putting text into graphics

Graphic: $20

Text: $0.06

images and culture
Images and Culture

Beware your biases—even “good” ones.

Check out our new website for India!

isn t it swell
Isn’t it Swell?

English is very succinct.

  • Words in other languages are often longer
  • Sentences may be longer
  • Characters may be larger (taller, wider, or require a bigger point size)
more swollen text
More Swollen Text
  • Allow roughly 30% more length (alphabets, abjads, etc.)
  • Allow roughly 30% more height (ideographs)
  • But… a rule of thumb, not a “fact”
    • Measure your results with care.
managing english text
Managing English Text

String Building??

Abbrev. Eng.

string is the thing156
String is the Thing

String is the Thing?

dereferencing
Dereferencing

Minimize sentence building

Minimize arguments per string

Use subject:predicate wherever possible

Don’t do this:

Your balance is $100.00.

When you can do this:

Balance: $100.00

dynamic vs static layout
Dynamic vs. Static Layout

Magic numbers

Externalized layouts

Mnemonics

Colors

localizing styles
Localizing Styles
  • Bolding is not universal for emphasis
    • Italicization, Capitalization, etc. are also not universal (some scripts don’t have these attributes)
  • Use Logical not Presentational names
    • Describe the function not the appearance. For example, use “emphasis” instead of “italics”.

中国

Amikake

Wakiten

use of color
Use of Color

“Going Down”

“Going Up”

input method editors
Some languages require software to assemble keystrokes into characters
  • Asian languages with very large character sets
  • Complex scripts with vowel-killers and other contextual editing requirements

Applications that interact directly with key-pressed events can disable or disrupt IME input.

  • On- and over-the-spot editing
Input Method Editors
when is it okay
When is it okay?
  • Content should be highly localized or have locale-specific requirements:
    • customization lets you address this requirement in the most localized possible manner
externalization redux

My Application

Language and Locale Independent Code

Language and Locale Dependent Code

currencies, dates, numbers, times, images, address formats, colors, titles, legal rules, accounting rules, sounds, text

Externalization Redux
large animal pictures166

Global Code

Resources

Large Animal Pictures

Software Component

Output

Input

I/O

Code can be a resource!

customization examples
Customization Examples

Generic API

Generic Implementation

US Implementation

DE Implementation

?? Implementation

  • Postal address validation
  • Postal code validation
  • Telephone number formatter
  • “Personality” questions
    • blood type vs. sun sign
  • Personal name formatter
    • first/last position, space, highlighting, formality, etc.
  • Tax codes and shipping schedules
example postal addresses
Example: Postal Addresses

address1 varchar(32)

address2 varchar(32)

city varchar(16)

state char(2)

zip char(5)

country char(2)

address1 varchar(64)

address2 varchar(64)

city varchar(64)

province varchar(64)

postcode varchar(64)

i18n

public interface Address {

public class genericAddress implements Address {

public class USAddress extends genericAddress {

public class UKAddress extends genericAddress {

country=US, postcode=‘WC2 1GH’ // error

country=UK, postcode=‘95111’ // error

country=DE, postcode=‘1A4喪’ // okay?
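The customization idea can be sketched as country-specific rules behind one generic entry point, with a permissive fallback when no rule exists. The class name is hypothetical and the two regular expressions are deliberately simplified assumptions, not complete postal code grammars. The country codes follow the slide.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PostcodeCheck {
    private static final Map<String, Pattern> RULES = new HashMap<>();
    static {
        // Simplified shapes only: five digits with optional +4, and a UK-style outward/inward code.
        RULES.put("US", Pattern.compile("\\d{5}(-\\d{4})?"));
        RULES.put("UK", Pattern.compile("[A-Z]{1,2}\\d[A-Z\\d]? ?\\d[A-Z]{2}"));
    }

    static boolean isPlausible(String country, String postcode) {
        Pattern rule = RULES.get(country);
        if (rule == null) {
            // Generic implementation: no country rule, so accept any non-blank value.
            return !postcode.trim().isEmpty();
        }
        return rule.matcher(postcode).matches();
    }

    public static void main(String[] args) {
        System.out.println("US, WC2 1GH: " + isPlausible("US", "WC2 1GH"));
        System.out.println("UK, 95111:   " + isPlausible("UK", "95111"));
        System.out.println("DE, generic: " + isPlausible("DE", "1A4\u55aa"));
    }
}
```

The last case shows the cost of the generic fallback: without a German rule, an implausible value is accepted.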

building global software

Building Global Software

Beyond Just Coding: Localization, QA, and all that

the internationalization cycle
The internationalization cycle
  • Encompasses the full development cycle:
    • Requirements
    • Design
    • Development
    • QC
    • Release
    • Support
what is internationalization qa
What is “internationalization QA”?
  • Does the enabled product work correctly?
    • Non-English configurations
    • Non-ASCII data and encoding support
    • Cross time zone support
    • Market specific features or customizations
  • Does localization appear correctly?
    • Is the product localizable?

What makes this different from “regular” QA?

growing and pruning the matrix
Growing (and Pruning) the Matrix

Include non-English configurations in your test matrix; include non-ASCII data in your tests.

Be prepared to prune the test matrix.

what to test with
What to Test With
  • Test Non-English configurations
    • Non-English locales (lying to your machine)
    • Native configurations (when does it make sense?)
  • Test Non-ASCII data
    • Encodings, encodings, everywhere
    • Non-ASCII character values
  • Test Across Time Zones
    • Two or more time zones; consider international date line (“it’s tomorrow in Japan”) and DST issues
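A minimal sketch of a non-ASCII data check: encode a test string, decode it again, and see whether it survives. The class name and sample text are assumptions. The same pattern extends to any layer that reads or writes bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // True when the text is unchanged after encoding to bytes and decoding again.
    static boolean survives(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset).equals(text);
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // the word for "Japanese language"
        System.out.println("UTF-8:      " + survives(japanese, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + survives(japanese, StandardCharsets.ISO_8859_1));
    }
}
```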
planning testing
Planning Testing

Initially

  • Get tools that are enabled!
    • Automation allows greater coverage, but only if it works.
  • Plan encodings and locales as part of the test matrix.
  • Acquire third-party products as necessary.

Increasing Maturity

Use test driven development practices.

Get developers to write unit tests that are internationalized.

Put the ‘i18n’ bugs into the regression suite.

configuring machines
Configuring Machines

Create both native and simulated environments:

  • Native operating systems may have minor but sometimes critical differences (folder names, keywords, localized registry entries)
  • Most features don’t run into native differences (easier to work with English-localized machines)
  • Don’t buy physical keyboards (use software keyboards) unless your application relies on scan codes from keys
incorporate
Incorporate

Localization is part of the release process too.

    • Changes to the user interface cost the localization team time and money.
    • (Changes to the product cost the documentation and QA folks too)
  • May need to institute change control or a UI freeze
simultaneous shipment simship
Simultaneous Shipment (Simship)

Ideally, to maximize opportunity, ship the target languages the same day as the source language.

  • It might not make sense for your product.
  • But it might not be as difficult as you think it is. It might even be good for you.
distribution of content
Distribution of Content
  • How does the localized text get into the running product?
    • Satellite assemblies, DLLs, shared libraries
    • Message catalogs
    • Special directory
    • Database
    • Etc.
more distribution
More Distribution

“Specific Language” (per-language)

“Language Included” (one or more languages)

“Language Pack” (product plus something)

English

English

English

Global Binary

+

German

German

German

French

French

French

completing the product
Completing the Product
  • Static content is often under source control and can be localized “normally”
  • Dynamic content may include the initial set of data or other items which need to be localized beyond software.
    • Demos and Demo Data
    • Dictionary, Language add-ons
    • Local offers, links to Web store, etc.
    • Packaging
    • Regulatory
quality checking and development methodologies
Quality Checking and Development Methodologies
  • Translation is a human-oriented task.
    • Translation time lines are linear with volume.
  • Localized product should be tested for functionality
    • translation can break things
    • usually the first language finds most of the bugs
  • Translations should be checked for quality
  • Development cycle has to include time for translators and quality assurance to catch up.
    • This does not mean “no agile” or “no changes”
    • Do pilot language(s) or moving-target translation; do better UI design and usability reviews; etc.
internationalization
Internationalization

… is a fundamental architectural approach: it is how software is built.

  • Design
  • Enabling
  • Externalization
  • Customization
  • Testing and Support
  • Lifecycle
slide185
“Would you please write the code for I18N on the whiteboard before you go?”

#import i18n.h

#define UNICODE

Q&A