Compact Encodings of Unicode

Markus W. Scherer

Unicode/G11N Software Engineer

IBM Globalization Center of Competency

Agenda

  • Encodings in files and protocols

    • Not: Processing encoding forms

  • Unicode “is too big”

    • Issues and non-issues

  • How to reduce size of Unicode text

    • Choice of encoding

    • Optional compression

  • Examples and comparisons

22nd International Unicode Conference

What is ICU?

  • Internationalization libraries for C, C++, Java*

    • Open source – non-viral

    • Sponsored by IBM

    • *Sun’s Java licenses an earlier ICU version; ICU4J updates it.

  • Unicode standard compliant

    • full supplementary support

  • Cross-platform; extensible and customizable

  • High performance and thread-safe

    • Multiple locales in same thread – simultaneously

  • Converters for all Unicode charsets & hundreds of legacy codepages


Encodings of Unicode

  • Common Unicode character set

  • External encodings

    • Files and protocols

    • Almost always byte-serialized

    • Character Encoding Schemes/charsets

  • Processing encodings

    • Character Encoding Forms, often 16/32-bit

    • Different requirements

    • Topic for different presentation…

Unicode “is too big”?

  • Perceived large size of Unicode text

    • Compared with legacy codepages

  • Size matters

    • Low-speed connections (dial-up, mobile)

    • Little memory (PDA, cell phone, embedded)

  • Size does not matter when…

    • Images & other binaries swamp text size

    • High-speed network

    • Temporary documents

    • Large amounts of memory

How big is it?

  • Size depends on language/script

  • Bytes/char for some language groups:
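
The dependence of size on script can be measured directly. The sketch below is only an illustration: the sample strings are made up for this example, and real averages depend on the text being measured.

```python
# Illustrative samples only; they are not taken from the presentation.
samples = {
    "English": "Unicode text size",
    "German": "Größe des Textes",
    "Russian": "Размер текста",
    "Greek": "Μέγεθος κειμένου",
    "Japanese": "テキストのサイズ",
}


def bytes_per_char(text, encoding):
    """Average number of encoded bytes per code point."""
    return len(text.encode(encoding)) / len(text)


for language, text in samples.items():
    utf8 = bytes_per_char(text, "utf-8")
    utf16 = bytes_per_char(text, "utf-16-be")
    print(f"{language:9} UTF-8: {utf8:.2f}  UTF-16: {utf16:.2f}")
```

Spaces and other ASCII characters pull the UTF-8 average below 2 for Cyrillic or Greek text, which is why averages rather than fixed values are the useful measure.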

Legacy codepages

  • Compact because

    • Designed for single/few languages

    • Few characters compared with Unicode

  • Conversion problems

    • Fallback/substitution of unmappable chars

    • Mapping table differences

    • Loss of parts of text common

  • Large number/size of mapping tables
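
The conversion problems listed above can be reproduced with Python's built-in codecs. This is a small sketch, not part of the original slides; the sample text and the choice of codepages are assumptions made for illustration.

```python
text = "Price: 10 € (Ω)"

# ISO-8859-1 has neither the euro sign nor Greek omega, so the
# converter substitutes "?" for each unmappable character.
lossy = text.encode("iso-8859-1", errors="replace")
restored = lossy.decode("iso-8859-1")
print(lossy)
print("round trip preserved text:", restored == text)

# Mapping table differences: two tables for "Shift-JIS" disagree
# on which Unicode character the same byte pair stands for.
wave_dash_bytes = b"\x81\x60"
as_jis = wave_dash_bytes.decode("shift_jis")
as_ms = wave_dash_bytes.decode("cp932")
print(f"shift_jis: U+{ord(as_jis):04X}  cp932: U+{ord(as_ms):04X}")
```

Once a character has been replaced by a substitution byte, no later conversion can recover it, which is the "loss of parts of text" the slide refers to.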

Reduce Unicode text size

  • Choice of encoding

    • Encodings designed for different purposes

    • Compactness vs. direct applicability vs. software support etc.

  • General-purpose compression

    • Best on top of compact encoding

    • Not available in all applications
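
As a rough illustration of general-purpose compression applied on top of an encoding, the sketch below uses Python's `zlib` (Deflate). The sample text is invented and highly repetitive, so the ratios it prints say nothing about typical documents; it only shows where compression helps and where its fixed overhead hurts.

```python
import zlib

# Invented, repetitive sample; real documents compress less well.
long_text = "Это пример русского текста для сравнения размеров. " * 40

for encoding in ("utf-8", "utf-16-le"):
    raw = long_text.encode(encoding)
    packed = zlib.compress(raw, 9)
    print(f"{encoding:10} raw: {len(raw):6}  deflate: {len(packed):5}")

# For a very short string the container overhead outweighs any gain.
short_raw = "Привет".encode("utf-8")
short_packed = zlib.compress(short_raw, 9)
print(f"short string raw: {len(short_raw)}  deflate: {len(short_packed)}")
```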


UTF-8/16

  • Designed for processing but all-purpose

  • UTF-8:

    • Byte-based, ASCII-compatible

    • BMP: up to 3 bytes/char

  • UTF-16 (BE/LE):

    • Byte-serialization of 16-bit form, not ASCII-compatible

    • BE/LE forms or Byte Order Mark

    • BMP: always 2 bytes/char
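
The per-character sizes and the byte-order variants above can be checked with Python's standard codecs. The sample characters are chosen for this sketch only; the last one lies outside the BMP to show the contrast.

```python
import codecs

# One character per UTF-8 length class, plus one supplementary character.
samples = ["A", "é", "€", "\uFFFD", "\U0001F600"]

for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-be"))
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_len}  UTF-16: {utf16_len}")

# UTF-16 needs either an explicit BE/LE label or a Byte Order Mark.
big = "A".encode("utf-16-be")
little = "A".encode("utf-16-le")
with_bom = "A".encode("utf-16")  # BOM followed by platform byte order
print(big, little, with_bom)
```

The leading zero byte in the UTF-16 form of "A" is what makes UTF-16 incompatible with ASCII-based protocols and parsers.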


UTF-7

  • 7-bit encoding designed for email

    • Obsolete: email now 8-bit-safe

  • Partially ASCII-compatible

  • BMP: 2.67 bytes/char plus overhead

    • Base64-encoded UTF-16BE

  • Stateful
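
The 2.67 bytes/char figure follows from the Base64 step: each Base64 character carries 6 bits, so a 16-bit UTF-16 code unit needs 16/6 ≈ 2.67 of them. The sketch below uses Python's built-in `utf-7` codec to show this; the sample word is an arbitrary choice.

```python
import base64

text = "Привет"  # arbitrary non-ASCII sample

encoded = text.encode("utf-7")
payload = base64.b64encode(text.encode("utf-16-be"))

print(encoded)
print(payload)
print(f"Base64 chars per UTF-16 code unit: {16 / 6:.2f}")
print(f"Measured bytes/char incl. shift overhead: {len(encoded) / len(text):.2f}")

# ASCII letters pass through unchanged (the partial ASCII compatibility).
print("Hello".encode("utf-7"))
```

The extra bytes beyond the Base64 payload are the shift characters that switch into and out of the Base64 state, which is also why the encoding is stateful.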


SCSU & BOCU-1

  • About as compact as legacy codepages

    • 1 byte/char for small scripts, 2 for CJK; stateful

    • Compress short strings better than general-purpose algorithms (LZW, zip/Deflate, etc.)

  • SCSU:

    • Limited* ASCII compatibility (initial state)

    • Complex state, many encoding choices

    • Non-deterministic (encoders may produce different output for the same text); arbitrary byte values

    • Established encoding, supported in

      • Various tools & editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)
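
To give a feel for how SCSU reaches 1 byte/char for small scripts, here is a deliberately tiny sketch of its window idea. It is not a conforming SCSU encoder: it assumes only printable ASCII and Cyrillic input, and it uses a single tag byte (`0x12`, which per UTS #6 selects the predefined dynamic window starting at U+0400). The function name is hypothetical.

```python
SC2 = 0x12                 # tag byte: switch to dynamic window 2
CYRILLIC_WINDOW = 0x0400   # initial offset of window 2 per UTS #6


def toy_scsu_cyrillic(text):
    """Toy single-window encoder; NOT a full SCSU implementation."""
    out = bytearray([SC2])
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F:
            out.append(cp)  # printable ASCII passes through
        elif CYRILLIC_WINDOW <= cp < CYRILLIC_WINDOW + 0x80:
            out.append(0x80 + cp - CYRILLIC_WINDOW)  # offset in window
        else:
            raise ValueError(f"U+{cp:04X} is outside this toy encoder")
    return bytes(out)


encoded = toy_scsu_cyrillic("Москва")
print(encoded.hex(), len(encoded), "bytes")
print(len("Москва".encode("utf-8")), "bytes in UTF-8")
```

A real encoder must also choose among several windows, define new ones, and switch to a 2-byte mode for CJK; those choices are the source of the complex state noted above.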


BOCU-1

  • BOCU-1:

    • Delta-encoding; avoids control codes

    • MIME text-compatible but not ASCII

    • Deterministic

    • Preserves binary order (for sorting, databases)

    • New encoding; supported by ICU
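
The delta-encoding idea can be illustrated without implementing BOCU-1 itself. The sketch below only computes differences between consecutive code points; the actual BOCU-1 algorithm differs in how it picks the reference value and how it maps differences to byte sequences. The function names and the `start` parameter are assumptions made for this example.

```python
def code_point_deltas(text, start=0x40):
    """Differences between consecutive code points (illustration only)."""
    previous = start
    deltas = []
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - previous)
        previous = cp
    return deltas


def from_deltas(deltas, start=0x40):
    """Inverse of code_point_deltas."""
    previous = start
    chars = []
    for delta in deltas:
        previous += delta
        chars.append(chr(previous))
    return "".join(chars)


deltas = code_point_deltas("Москва")
print(deltas)
print(from_deltas(deltas))
```

After the first jump into the Cyrillic block, every difference is small, and small differences are what a delta encoding can store in a single byte.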

SCSU & BOCU-1 text sizes

  • Average bytes/char relative to UTF-8

Encoding vs. compression

  • For example: BOCU-1 with WinZip


Performance

  • Converter performance

    • Roundtrip to/from UTF-16 with ICU:

      • SCSU: 45%..125% of UTF-8 roundtrip time

      • BOCU-1: 40%..160% of UTF-8 roundtrip time

  • Depends on encoding ratio

    • Fast for small scripts, 1 byte/char

  • Separate compression adds to I/O time

  • Conversion time typically swamped by

    • Transmission (low-bandwidth connections)

      • Shorter texts transmit faster!

    • Parsing/processing

Further considerations

  • In-document encoding declarations require ASCII readability (XML, HTML)

  • Protocol may limit byte values (SMTP)

    • TES required for some encodings

      • base64 for SCSU or UTF-16 in emails

      • Increases text size

  • Compression removes ASCII readability and uses arbitrary byte values
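
The size cost of a transfer encoding syntax (TES) is easy to quantify: Base64 turns every 3 bytes into 4 ASCII characters. The sketch below wraps UTF-16 text in Base64 using Python's standard library; the sample string is arbitrary.

```python
import base64

text = "Grüße aus Köln"  # arbitrary sample

raw = text.encode("utf-16-be")
wrapped = base64.b64encode(raw)

print(f"UTF-16 bytes: {len(raw)}")
print(f"Base64 bytes: {len(wrapped)}")
print(f"growth factor: {len(wrapped) / len(raw):.2f}")

# The wrapped form is 7-bit safe and decodes back to the same text.
restored = base64.b64decode(wrapped).decode("utf-16-be")
print(restored == text)
```

Line breaks added by a mail transport would increase the size slightly further; they are left out of this sketch.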


Conclusion

  • UTF-8 and/or UTF-16 work in most cases

  • Size of text often not critical

  • When small text size needed:

    • Use SCSU or BOCU-1

    • Consider compression

    • Make sure receiver can handle it


References

  • Forms of Unicode:

  • Character Encoding Model: UTR #17

  • SCSU: UTS #6

  • BOCU-1:

  • ICU homepage:

  • Unicode Consortium:

  • IBM developerWorks:
