
Compact Encodings of Unicode

Markus W. Scherer

Unicode/G11N Software Engineer

IBM Globalization Center of Competency



Agenda

  • Encodings in files and protocols

    • Not: Processing encoding forms

  • Unicode “is too big”

    • Issues and non-issues

  • How to reduce size of Unicode text

    • Choice of encoding

    • Optional compression

  • Examples and comparisons

22nd International Unicode Conference



What is ICU?

  • Internationalization libraries for C, C++, Java*

    • Open source – non-viral

    • Sponsored by IBM

    • Sun’s Java licenses an earlier ICU version; ICU4J updates it.

  • Unicode standard compliant

    • full supplementary support

  • Cross-platform; extensible and customizable

  • High performance and thread-safe

    • Multiple locales in same thread – simultaneously

  • Converters for all Unicode charsets & hundreds of legacy codepages

  • http://oss.software.ibm.com/icu/


Encodings of Unicode

  • Common Unicode character set

  • External encodings

    • Files and protocols

    • Almost always byte-serialized

    • Character Encoding Schemes/charsets

  • Processing encodings

    • Character Encoding Forms, often 16/32-bit

    • Different requirements

    • Topic for different presentation…


Unicode “is too big”?

  • Perceived large size of Unicode text

    • Compared with legacy codepages

  • Size matters

    • Low-speed connections (dial-up, mobile)

    • Little memory (PDA, cell phone, embedded)

  • Size does not matter when…

    • Images & other binaries swamp text size

    • High-speed network

    • Temporary documents

    • Large amounts of memory


How big is it?

  • Size depends on language/script

  • Bytes/char for some language groups:
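
The per-language figures for this slide are not reproduced in the transcript. As a rough way to measure them yourself, the following sketch counts bytes per code point for UTF-8 and UTF-16. The sample strings are illustrative assumptions, not data from the presentation, and short phrases only approximate the averages for real text.

```python
# Sample phrases are assumptions chosen for illustration only.
samples = {
    "English": "Unicode text",
    "Russian": "Текст Юникода",
    "Hindi": "यूनिकोड पाठ",
    "Japanese": "ユニコードのテキスト",
}


def bytes_per_char(text, encoding):
    """Average number of encoded bytes per Unicode code point."""
    return len(text.encode(encoding)) / len(text)


for language, text in samples.items():
    utf8 = bytes_per_char(text, "utf-8")
    utf16 = bytes_per_char(text, "utf-16-be")  # explicit byte order, no BOM
    print(f"{language:9} UTF-8: {utf8:.2f}  UTF-16: {utf16:.2f}")
```

Here "char" means a code point, which is how Python measures `len()` of a string.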


Legacy codepages

  • Compact because

    • Designed for single/few languages

    • Few characters compared with Unicode

  • Conversion problems

    • Fallback/substitution of unmappable chars

    • Mapping table differences

    • Loss of parts of the text is common

  • Large number/size of mapping tables
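
The conversion problems listed above can be reproduced with Python's built-in codecs. This is a small sketch, not an example from the presentation; the sample string and the choice of ISO 8859-1 ("latin-1") and windows-1252 ("cp1252") are assumptions made for illustration.

```python
text = "Ωmega €"

# Substitution: characters missing from the target codepage are replaced
# with "?", so part of the text is lost.
lossy = text.encode("latin-1", errors="replace")
print(lossy)

# Without a fallback, the conversion fails outright.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print("strict conversion failed:", strict_failed)

# Mapping table differences: the same byte means different characters
# in two closely related codepages.
byte = b"\x80"
as_cp1252 = byte.decode("cp1252")    # Euro sign
as_latin1 = byte.decode("latin-1")   # C1 control character U+0080
print(repr(as_cp1252), repr(as_latin1))
```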


Reduce Unicode text size

  • Choice of encoding

    • Encodings designed for different purposes

    • Compactness vs. direct applicability vs. software support etc.

  • General-purpose compression

    • Best on top of compact encoding

    • Not available in all applications


UTF-8/16

  • Designed for processing, but used for all purposes

  • UTF-8:

    • Byte-based, ASCII-compatible

    • BMP: up to 3 bytes/char

  • UTF-16 (BE/LE):

    • Byte-serialization of 16-bit form, not ASCII-compatible

    • BE/LE forms or Byte Order Mark

    • BMP: always 2 bytes/char
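
The size and byte-order properties above can be checked with Python's standard codecs. The sketch below is illustrative only; the sample characters are assumptions.

```python
import codecs


def utf8_len(ch):
    """Number of bytes UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


# BMP characters take 1 to 3 bytes in UTF-8 and always 2 bytes in UTF-16.
for ch in ["A", "é", "€", "\uffff"]:
    utf16_len = len(ch.encode("utf-16-be"))
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len(ch)} bytes, UTF-16 {utf16_len} bytes")

text = "A€"
be = text.encode("utf-16-be")  # bytes 00 41 20 AC
le = text.encode("utf-16-le")  # bytes 41 00 AC 20

# With a Byte Order Mark, the generic "utf-16" decoder detects the byte
# order and strips the mark.
with_bom = codecs.BOM_UTF16_BE + be
decoded = with_bom.decode("utf-16")
print(decoded == text)
```

The leading 00 byte in the big-endian form of "A" is why UTF-16 is not ASCII-compatible.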


UTF-7

  • 7-bit encoding designed for email

    • Obsolete: email now 8-bit-safe

  • Partially ASCII-compatible

  • BMP: 2.67 bytes/char plus overhead

    • Base64-encoded UTF-16BE

  • Stateful
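
The 2.67 figure follows from base64 carrying 6 bits per output byte while each BMP character needs 16 bits. A short sketch using Python's built-in "utf-7" codec shows this; the sample strings are assumptions chosen for illustration.

```python
# 16 bits per UTF-16 code unit, 6 bits per base64 character.
base64_bytes_per_char = 16 / 6
print(round(base64_bytes_per_char, 2))

# Characters in the directly encoded ASCII subset pass through unchanged.
ascii_encoded = "Hello".encode("utf-7")
print(ascii_encoded)

# Other characters are wrapped in a shifted base64 run. The shift
# characters are the "plus overhead" mentioned above.
text = "日本語"
encoded = text.encode("utf-7")
print(encoded, len(encoded) / len(text))

roundtrip = encoded.decode("utf-7")
```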


SCSU & BOCU-1

  • About as compact as legacy codepages

    • 1 byte/char for small scripts, 2 for CJK; stateful

    • Compress short strings better than general-purpose compression (LZW, zip, etc.)

  • SCSU:

    • Limited* ASCII compatibility (initial state)

    • Complex state, many encoding choices

    • Non-deterministic (different encoders may produce different bytes for the same text); arbitrary byte values

    • Established encoding, supported in

      • Various tools & editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)


BOCU-1

  • BOCU-1:

    • Delta-encoding; avoids control codes

    • MIME "text" compatible, but not ASCII-compatible

    • Deterministic

    • Preserves binary order (for sorting, databases)

    • New encoding; supported by ICU
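
To show why delta-encoding gives about 1 byte/char for small scripts, here is a toy sketch. It is not BOCU-1: the function names, the zigzag/varint byte layout, and the start value are all assumptions made for illustration, and the toy neither avoids control codes nor preserves binary order. It only demonstrates that differences between neighbouring code points in one script are small enough to fit in a single byte.

```python
def _zigzag(n):
    """Map signed deltas to unsigned ints: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def _unzigzag(z):
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


def toy_delta_encode(text, start=0x40):
    """Encode each code point as a variable-length delta from the previous one."""
    out = bytearray()
    prev = start
    for ch in text:
        cp = ord(ch)
        z = _zigzag(cp - prev)
        while True:  # 7 payload bits per byte, high bit = "more follows"
            low = z & 0x7F
            z >>= 7
            if z:
                out.append(low | 0x80)
            else:
                out.append(low)
                break
        prev = cp
    return bytes(out)


def toy_delta_decode(data, start=0x40):
    chars = []
    prev = start
    z = 0
    shift = 0
    for b in data:
        z |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += _unzigzag(z)
            chars.append(chr(prev))
            z = 0
            shift = 0
    return "".join(chars)


word = "Привет"
encoded = toy_delta_encode(word)
print(len(encoded), len(word.encode("utf-8")))
```

In this toy, the first character costs two bytes to reach the Cyrillic block and each following letter costs one. For real use, rely on an existing implementation of BOCU-1 or SCSU such as the one in ICU.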


SCSU & BOCU-1 text sizes

  • Average bytes/char relative to UTF-8


Encoding vs. compression

  • For example: BOCU-1 with WinZip
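
The chart for this slide is not reproduced in the transcript. As a rough stand-in, the sketch below applies general-purpose compression to UTF-8 text with Python's zlib module (Deflate, the method commonly used in zip files). It uses UTF-8 rather than BOCU-1 because the standard library has no BOCU-1 codec, and the sample texts are assumptions. It shows that compression pays off on long text but adds overhead to very short strings.

```python
import zlib

short_text = "Привет".encode("utf-8")
long_text = ("Привет, мир! " * 200).encode("utf-8")

short_compressed = zlib.compress(short_text)
long_compressed = zlib.compress(long_text)

# For a very short string, header and checksum overhead outweigh any gain.
print(len(short_text), len(short_compressed))

# Long, repetitive text compresses well.
print(len(long_text), len(long_compressed))
```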


Performance

  • Converter performance

    • Roundtrip to/from UTF-16 with ICU:

      • SCSU: 45%..125% of UTF-8 roundtrip time

      • BOCU-1: 40%..160% of UTF-8 roundtrip time

  • Depends on encoding ratio

    • Fast for small scripts, 1 byte/char

  • Separate compression adds to I/O time

  • Conversion time typically swamped by

    • Transmission (low-bandwidth connections)

      • Shorter texts transmit faster!

    • Parsing/processing


Further considerations

  • In-document encoding declarations require ASCII readability (XML, HTML)

  • Protocol may limit byte values (SMTP)

    • Transfer encoding syntax (TES) required for some encodings

      • base64 for SCSU or UTF-16 in emails

      • Increases text size

  • Compression removes ASCII readability and uses arbitrary byte values
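
The size increase from a transfer encoding is easy to quantify: base64 turns every 3 input bytes into 4 output bytes. The sketch below wraps UTF-16 text in base64 using Python's standard library; the sample word is an assumption chosen for illustration.

```python
import base64

text = "Юникод"
raw = text.encode("utf-16-be")   # 2 bytes per BMP character
wrapped = base64.b64encode(raw)  # 7-bit-safe, one third larger

print(len(raw), len(wrapped))

restored = base64.b64decode(wrapped).decode("utf-16-be")
```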


Conclusion

  • UTF-8 and/or UTF-16 work in most cases

  • Size of text often not critical

  • When small text size needed:

    • Use SCSU or BOCU-1

    • Consider compression

    • Make sure receiver can handle it


References

  • Forms of Unicode: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/

  • Character Encoding Model: UTR #17 http://www.unicode.org/reports/tr17/

  • SCSU: UTS #6 http://www.unicode.org/reports/tr6/

  • BOCU-1: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html

  • ICU homepage: http://oss.software.ibm.com/icu/

  • Unicode Consortium: http://www.unicode.org/

  • IBM developerWorks: http://www.ibm.com/developerworks/unicode/
