
Compact Encodings of Unicode

Markus W. Scherer

Unicode/G11N Software Engineer

IBM Globalization Center of Competency

Agenda
  • Encodings in files and protocols
    • Not: Processing encoding forms
  • Unicode “is too big”
    • Issues and non-issues
  • How to reduce size of Unicode text
    • Choice of encoding
    • Optional compression
  • Examples and comparisons

22nd International Unicode Conference

What is ICU?
  • Internationalization libraries for C, C++, Java*
    • Open source – non-viral
    • Sponsored by IBM
    • Sun’s Java licenses an earlier ICU version; ICU4J updates it.
  • Unicode standard compliant
    • full supplementary support
  • Cross-platform; extensible and customizable
  • High performance and thread-safe
    • Multiple locales in same thread – simultaneously
  • Converters for all Unicode charsets & hundreds of legacy codepages
  • http://oss.software.ibm.com/icu/

Encodings of Unicode
  • Common Unicode character set
  • External encodings
    • Files and protocols
    • Almost always byte-serialized
    • Character Encoding Schemes/charsets
  • Processing encodings
    • Character Encoding Forms, often 16/32-bit
    • Different requirements
    • Topic for a different presentation…

Unicode “is too big”?
  • Perceived large size of Unicode text
    • Compared with legacy codepages
  • Size matters
    • Low-speed connections (dial-up, mobile)
    • Little memory (PDA, cell phone, embedded)
  • Size does not matter when…
    • Images & other binaries swamp text size
    • High-speed network
    • Temporary documents
    • Large amounts of memory

How big is it?
  • Size depends on language/script
  • Bytes/char for some language groups:
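
Bytes per character can be measured directly for any sample text. The following is a minimal sketch, assuming Python's built-in codecs; the sample strings and the helper name `bytes_per_char` are illustrative choices, not taken from the presentation.

```python
def bytes_per_char(text: str, encoding: str) -> float:
    """Average number of encoded bytes per Unicode code point."""
    return len(text.encode(encoding)) / len(text)


# Hypothetical sample strings, one per script (not from the slides).
SAMPLES = {
    "English": "Hello world",
    "Russian": "Привет мир",
    "Japanese": "日本語のテキスト",
}

for name, text in SAMPLES.items():
    u8 = bytes_per_char(text, "utf-8")
    # "utf-16-be" writes no Byte Order Mark, so only the text is counted.
    u16 = bytes_per_char(text, "utf-16-be")
    print(f"{name:10} UTF-8 {u8:.2f}  UTF-16 {u16:.2f}")
```

Real figures depend on the mix of letters, spaces, and punctuation in the text, so short samples like these give only a rough indication.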

Legacy codepages
  • Compact because
    • Designed for single/few languages
    • Few characters compared with Unicode
  • Conversion problems
    • Fallback/substitution of unmappable chars
    • Mapping table differences
    • Partial loss of text is common
  • Large number/size of mapping tables
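
Both the compactness and the conversion problems are easy to reproduce. A small sketch, assuming Python's built-in `cp1251` (Cyrillic) and `cp1252` (Western European) codecs; the sample word is an illustrative choice.

```python
text = "Привет"  # six Cyrillic letters

# A codepage designed for Cyrillic stores each letter in one byte.
cp1251_bytes = text.encode("cp1251")
utf8_bytes = text.encode("utf-8")
print(len(cp1251_bytes), len(utf8_bytes))

# A codepage without Cyrillic cannot represent the text at all.
# With errors="replace", every unmappable character becomes "?".
lossy = text.encode("cp1252", errors="replace")
print(lossy)

# The default (strict) mode refuses to convert instead of losing text.
try:
    text.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

Whether a converter substitutes, falls back to a similar character, or fails is a per-converter decision, which is one source of the mapping differences listed above.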

Reduce Unicode text size
  • Choice of encoding
    • Encodings designed for different purposes
    • Compactness vs. direct applicability vs. software support etc.
  • General-purpose compression
    • Best on top of compact encoding
    • Not available in all applications
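
The two approaches can be combined and measured. The sketch below assumes Python's `zlib` module as a stand-in for a general-purpose compressor; the helper name `sizes` and the sample strings are hypothetical.

```python
import zlib


def sizes(text: str) -> dict:
    """Map encoding name -> (raw byte length, zlib-compressed byte length)."""
    result = {}
    for enc in ("utf-8", "utf-16-be"):
        raw = text.encode(enc)
        result[enc] = (len(raw), len(zlib.compress(raw, 9)))
    return result


short_text = "Привет"                 # a single short string
long_text = "Привет, мир! " * 200     # long and highly repetitive

print(sizes(short_text))
print(sizes(long_text))
```

For the short string the compressed form comes out larger than the raw bytes, because the compressor's fixed header and checksum outweigh any savings. General-purpose compression pays off only on longer texts.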

UTF-8/16
  • Designed for processing but all-purpose
  • UTF-8:
    • Byte-based, ASCII-compatible
    • BMP: up to 3 bytes/char
  • UTF-16 (BE/LE):
    • Byte-serialization of 16-bit form, not ASCII-compatible
    • BE/LE forms or Byte Order Mark
    • BMP: always 2 bytes/char
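
These properties can be checked with Python's built-in codecs. A minimal sketch; the sample characters are illustrative, and the last one (U+1F600) lies outside the BMP, so it is shown only for contrast.

```python
import codecs

# Encoded length per character: UTF-8 varies, UTF-16 is fixed within the BMP.
for ch in ("A", "é", "€", "\U0001F600"):
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-be"))
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_len}  UTF-16: {utf16_len}")

# UTF-8 leaves ASCII bytes unchanged; UTF-16 does not.
ascii_in_utf8 = "A".encode("utf-8")
be = "A".encode("utf-16-be")   # big-endian, no Byte Order Mark
le = "A".encode("utf-16-le")   # little-endian, no Byte Order Mark

# The plain "utf-16" codec prepends a BOM in the platform's byte order.
with_bom = "A".encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
print(ascii_in_utf8, be, le, has_bom)
```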

UTF-7
  • 7-bit encoding designed for email
    • Obsolete: email now 8-bit-safe
  • Partially ASCII-compatible
  • BMP: 2.67 bytes/char plus overhead
    • Base64-encoded UTF-16BE
  • Stateful
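
The 2.67 figure follows from the base64 step: each BMP character is 16 bits in UTF-16, and each base64 character carries 6 bits, so 16 / 6 ≈ 2.67 bytes per character, before the shift characters that enter and leave the base64 run. A short sketch, assuming Python's built-in `utf-7` codec and an illustrative sample string:

```python
text = "日本語"

utf16 = text.encode("utf-16-be")   # 2 bytes per BMP character
utf7 = text.encode("utf-7")        # base64 run introduced by "+"

print(len(utf16), len(utf7), utf7)

# ASCII letters are encoded directly, hence "partially ASCII-compatible".
ascii_part = "Hello".encode("utf-7")
print(ascii_part)

# Every output byte stays within the 7-bit range.
seven_bit = all(b < 0x80 for b in utf7)
print(seven_bit)
```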

SCSU & BOCU-1
  • About as compact as legacy codepages
    • 1 byte/char for small scripts, 2 for CJK; stateful
    • Compress short strings better than general-purpose compression (zip etc.)
  • SCSU:
    • Limited* ASCII compatibility (initial state)
    • Complex state, many encoding choices
    • Non-deterministic (different encoders may produce different output); uses arbitrary byte values
    • Established encoding, supported in
      • Various tools & editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)

BOCU-1
  • BOCU-1:
    • Delta-encoding; avoids control codes
    • MIME text-compatible but not ASCII-compatible
    • Deterministic
    • Preserves binary order (for sorting, databases)
    • New encoding; supported by ICU
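
The idea behind delta-encoding can be illustrated with a toy coder. This is NOT BOCU-1: it does not avoid control codes, does not preserve binary order, and uses a generic variable-length integer format. All function names are hypothetical; only the starting value 0x40 is borrowed from BOCU-1's initial state. It shows only why text within one small script costs about one byte per character when each code point is stored as its difference from the previous one.

```python
def zigzag(n: int) -> int:
    """Map a signed delta to an unsigned int so small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)


def toy_delta_encode(text: str, start: int = 0x40) -> bytes:
    """Encode each code point as a varint of its difference from the previous one."""
    out = bytearray()
    prev = start
    for ch in text:
        cp = ord(ch)
        z = zigzag(cp - prev)
        while True:
            low7 = z & 0x7F
            z >>= 7
            if z:
                out.append(low7 | 0x80)  # high bit set: more bytes follow
            else:
                out.append(low7)
                break
        prev = cp
    return bytes(out)


def toy_delta_decode(data: bytes, start: int = 0x40) -> str:
    """Reverse toy_delta_encode()."""
    chars = []
    prev = start
    z = 0
    shift = 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += unzigzag(z)
            chars.append(chr(prev))
            z = 0
            shift = 0
    return "".join(chars)


encoded = toy_delta_encode("Привет")
print(len(encoded), len("Привет".encode("utf-8")))
```

For this six-letter word the toy coder needs 7 bytes (two for the initial jump into the Cyrillic block, then one per letter) against 12 in UTF-8. A real encoding also has to handle spaces, punctuation, and large scripts such as CJK gracefully, which this sketch makes no attempt to do.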

SCSU & BOCU-1 text sizes
  • Average bytes/char relative to UTF-8

Encoding vs. compression
  • For example: BOCU-1 with WinZip

Performance
  • Converter performance
    • Roundtrip to/from UTF-16 with ICU:
      • SCSU: 45%..125% of UTF-8 roundtrip time
      • BOCU-1: 40%..160% of UTF-8 roundtrip time
  • Depends on encoding ratio
    • Fast for small scripts, 1 byte/char
  • Separate compression adds to I/O time
  • Conversion time typically swamped by
    • Transmission (low-bandwidth connections)
      • Shorter texts transmit faster!
    • Parsing/processing

Further considerations
  • In-document encoding declarations require ASCII readability (XML, HTML)
  • Protocol may limit byte values (SMTP)
    • TES required for some encodings
      • base64 for SCSU or UTF-16 in emails
      • Increases text size
  • Compression removes ASCII readability and uses arbitrary byte values
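
The size cost of a transfer encoding is easy to quantify for base64, which emits 4 output characters for every 3 input bytes. A minimal sketch using Python's `base64` module on UTF-16 text; the sample string is an illustrative choice.

```python
import base64
import math

text = "Привет, мир"

raw = text.encode("utf-16-be")     # contains zero bytes and bytes >= 0x80
wrapped = base64.b64encode(raw)    # restricted to a 7-bit-safe alphabet

print(len(raw), len(wrapped))

# base64 output length: 4 characters per 3-byte group, rounded up.
expected = 4 * math.ceil(len(raw) / 3)
print(expected)
```

Line breaks required by the mail format add a little more on top of the 4/3 expansion.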

Conclusion
  • UTF-8 and/or UTF-16 work in most cases
  • Size of text often not critical
  • When small text size needed:
    • Use SCSU or BOCU-1
    • Consider compression
    • Make sure receiver can handle it

References
  • Forms of Unicode: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/
  • Character Encoding Model: UTR #17 http://www.unicode.org/reports/tr17/
  • SCSU: UTS #6 http://www.unicode.org/reports/tr6/
  • BOCU-1: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html
  • ICU homepage: http://oss.software.ibm.com/icu/
  • Unicode Consortium: http://www.unicode.org/
  • IBM developerWorks: http://www.ibm.com/developerworks/unicode/
