Getting Started with ICU

Vladimir Weinstein

Eric Mader

26th Internationalization and Unicode Conference


Agenda

  • Getting & setting up ICU4C

  • Using conversion engine

  • Using break iterator engine

  • Getting & setting up ICU4J

  • Using collation engine

  • Using message formats

Getting ICU4C

  • http://oss.software.ibm.com/icu/

  • Get the latest release

  • Get the binary package

  • Source download for modifying build options

  • CVS for bleeding edge:

  • :pserver:anoncvs@oss.software.ibm.com:/usr/cvs/icu

Setting up ICU4C

  • Unpack binaries

  • If you need to build from source

    • Windows:

      • MSVC .Net 2003 Project,

      • CygWin + MSVC 6,

      • just CygWin

    • Unix: runConfigureICU

      • make install

      • make check

Testing ICU4C

  • Windows - run: cintltst, intltest, iotest

  • Unix - make check (again)

  • See it for yourself:

#include <stdio.h>
#include "unicode/utypes.h"
#include "unicode/ures.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* opening the root bundle forces ICU to load its data */
    UResourceBundle *res = ures_open(NULL, "", &status);
    if(U_SUCCESS(status)) {
        printf("everything is OK\n");
    } else {
        printf("error %s opening resource\n", u_errorName(status));
    }
    ures_close(res);
    return 0;
}

Conversion Engine - Opening

  • ICU4C uses open/use/close paradigm

  • Open a converter:

UErrorCode status = U_ZERO_ERROR;
UConverter *cnv = ucnv_open(encoding, &status);
if(U_FAILURE(status)) {
    /* process the error situation, die gracefully */
}

  • Almost all APIs use UErrorCode for status

  • Check the error code!

What Converters are Available

  • ucnv_countAvailable() – get the number of available converters

  • ucnv_getAvailableName() – get the name of a particular converter

  • Many frameworks allow this kind of examination
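
As an illustration of the same kind of examination in another framework, here is a minimal Java sketch. It uses the JDK's own java.nio.charset package rather than ICU, so it runs without the ICU jar; the class and method names (AvailableCharsets, count, isAvailable) are made up for this example.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

class AvailableCharsets {
    // Number of charsets this JVM can convert to and from.
    static int count() {
        return Charset.availableCharsets().size();
    }

    // True if a charset with the given canonical name is listed.
    static boolean isAvailable(String name) {
        return Charset.availableCharsets().containsKey(name);
    }

    public static void main(String[] args) {
        SortedMap<String, Charset> all = Charset.availableCharsets();
        System.out.println(all.size() + " charsets available");
        for (String name : all.keySet()) {
            System.out.println(name);
        }
    }
}
```

The list depends on the JVM and platform, so the output will differ from machine to machine.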

Converting Text Chunk by Chunk

char buffer[DEFAULT_BUFFER_SIZE];
char *bufP = buffer;
int32_t len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE,
                              source, sourceLen, &status);
if(U_FAILURE(status)) {
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        /* len now holds the required size: allocate and retry */
        status = U_ZERO_ERROR;
        bufP = (char *)malloc((len + 1) * sizeof(char));
        len = ucnv_fromUChars(cnv, bufP, len + 1,
                              source, sourceLen, &status);
    } else {
        /* other error, die gracefully */
    }
}
/* do interesting stuff with the converted text */
/* (free bufP afterwards if it was allocated with malloc) */

Converting Text Character by Character

UChar32 result;
const char *source = start;   /* ucnv_getNextUChar takes const char ** */
const char *sourceLimit = start + len;
while(source < sourceLimit) {
    result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
    if(U_FAILURE(status)) {
        /* die gracefully */
    }
    /* do interesting stuff with the converted text */
}

  • Works only from code page to Unicode

Converting Text Piece by Piece

while((!feof(f)) && ((count = fread(inBuf, 1, BUFFER_SIZE, f)) > 0)) {
    source = inBuf;
    sourceLimit = inBuf + count;
    do {
        target = uBuf;
        targetLimit = uBuf + uBufSize;
        ucnv_toUnicode(conv, &target, targetLimit,
                       &source, sourceLimit, NULL,
                       feof(f) ? TRUE : FALSE, /* pass 'flush' when eof is true */
                                               /* (when no more data will come) */
                       &status);
        if(status == U_BUFFER_OVERFLOW_ERROR) {
            // simply ran out of space - we'll reset the
            // target ptr the next time through the loop.
            status = U_ZERO_ERROR;
        } else {
            // Check other errors here and act appropriately
        }
        text.append(uBuf, target - uBuf);
        // total is a separate running counter, declared elsewhere;
        // count itself is overwritten by the next fread
        total += target - uBuf;
    } while (source < sourceLimit); // while simply out of space
}
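
The same loop shape (convert into a small reused buffer, treat overflow as "call again", signal the end of input) can be sketched in Java. This is only an analogy: it uses the JDK's java.nio.charset.CharsetDecoder instead of an ICU converter, hard-codes UTF-8, and the names PieceByPiece, decode and drain are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class PieceByPiece {
    // Decodes UTF-8 bytes through a small, reused output buffer.
    // outSize should be at least 2 so a surrogate pair always fits.
    static String decode(byte[] input, int outSize) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer source = ByteBuffer.wrap(input);
        CharBuffer target = CharBuffer.allocate(outSize);
        StringBuilder text = new StringBuilder();
        CoderResult result;
        do {
            // 'true' plays the role of the flush flag: no more input will come
            result = decoder.decode(source, target, true);
            if (result.isError()) {
                throw new IllegalArgumentException("bad input: " + result);
            }
            drain(target, text);
        } while (result.isOverflow()); // overflow only means: call again
        do {
            // emit anything the decoder still holds internally
            result = decoder.flush(target);
            drain(target, text);
        } while (result.isOverflow());
        return text.toString();
    }

    // Moves the converted characters into 'text' and empties the buffer.
    static void drain(CharBuffer target, StringBuilder text) {
        target.flip();
        text.append(target);
        target.clear();
    }

    public static void main(String[] args) {
        byte[] bytes = "Getting Started with ICU".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(bytes, 4));
    }
}
```

With a four-character buffer the 24-byte input needs several passes through the first loop, which is the situation the overflow branch exists for.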

Clean up!

  • Whatever is opened needs to be closed

  • Converters use ucnv_close

  • Sample uses conversion to convert code page data from a file

Break Iteration - Introduction

  • Four types of boundaries:

    • Character, word, line, sentence

  • Points to a boundary between two characters

  • Index of character following the boundary

  • Use current() to get the boundary

  • Use first() to set iterator to start of text

  • Use last() to set iterator to end of text

Break Iteration - Navigation

  • Use next() to move to next boundary

  • Use previous() to move to previous boundary

  • Both return DONE when there is no boundary left in that direction

Break Iteration – Checking a position

  • Use isBoundary() to see if a position is a boundary

  • Use preceding() to find the last boundary before a position

  • Use following() to find the first boundary after a position
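
The navigation and position-checking calls can be tried out with a short Java sketch. It assumes the JDK's java.text.BreakIterator as a stand-in (the ICU4J class in com.ibm.icu.text offers methods with the same names); BoundaryDemo and its helper are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryDemo {
    // Collects every word boundary, walking forward with first() and next().
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    // Creates a word iterator already positioned on the given text.
    static BreakIterator wordsOf(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        return it;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(boundaries(text));              // [0, 5, 6, 11]
        System.out.println(wordsOf(text).isBoundary(3));   // false, inside "Hello"
        System.out.println(wordsOf(text).following(3));    // 5
        System.out.println(wordsOf(text).preceding(3));    // 0
    }
}
```

Note that following() and preceding() are strict: asked about offset 5, which is itself a boundary, they return 6 and 0.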

Break Iteration - Opening

  • Use the factory methods:

Locale locale = …; // locale to use for break iterators
UErrorCode status = U_ZERO_ERROR;

BreakIterator *characterIterator =
    BreakIterator::createCharacterInstance(locale, status);
BreakIterator *wordIterator =
    BreakIterator::createWordInstance(locale, status);
BreakIterator *lineIterator =
    BreakIterator::createLineInstance(locale, status);
BreakIterator *sentenceIterator =
    BreakIterator::createSentenceInstance(locale, status);

  • Don’t forget to check the status!

Set the text

  • We need to tell the iterator what text to use:

UnicodeString text;
readFile(file, text);
wordIterator->setText(text);

  • Reuse iterators by calling setText() again.

Break Iteration - Counting words in a file:

int32_t countWords(BreakIterator *wordIterator, UnicodeString &text)
{
    UErrorCode status = U_ZERO_ERROR;
    UnicodeString word;
    UnicodeSet letters(UnicodeString("[:letter:]"), status);
    int32_t wordCount = 0;
    int32_t start = wordIterator->first();
    for(int32_t end = wordIterator->next();
        end != BreakIterator::DONE;
        start = end, end = wordIterator->next())
    {
        /* text is a reference, so use '.' rather than '->' */
        text.extractBetween(start, end, word);
        /* count only segments that contain at least one letter */
        if(letters.containsSome(word)) {
            wordCount += 1;
        }
    }
    return wordCount;
}
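
For comparison, a rough Java counterpart of countWords. It is a sketch under assumptions: it uses the JDK's java.text.BreakIterator in place of ICU, and Character.isLetter in place of the [:letter:] UnicodeSet; WordCount and containsLetter are names chosen for the example.

```java
import java.text.BreakIterator;
import java.util.Locale;

class WordCount {
    // Counts the segments between word boundaries that contain a letter.
    static int countWords(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        int count = 0;
        int start = words.first();
        for (int end = words.next();
             end != BreakIterator.DONE;
             start = end, end = words.next()) {
            if (containsLetter(text, start, end)) {
                count += 1;
            }
        }
        return count;
    }

    // True if text[start, end) holds at least one letter code point.
    static boolean containsLetter(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(countWords("The quick, brown fox.")); // 4
    }
}
```

The letter test matters because the word iterator also reports the spaces and punctuation between words as segments of their own.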

Break Iteration – Breaking lines

int32_t previousBreak(BreakIterator *breakIterator, UnicodeString &text,
                      int32_t location)
{
    int32_t len = text.length();
    /* skip over whitespace and control characters at the break point */
    while(location < len) {
        UChar c = text[location];
        if(!u_isWhitespace(c) && !u_iscntrl(c)) {
            break;
        }
        location += 1;
    }
    /* preceding() takes an offset; previous() takes no argument */
    return breakIterator->preceding(location + 1);
}

Break Iteration – Cleaning up

  • Use delete to delete the iterators

delete characterIterator;
delete wordIterator;
delete lineIterator;
delete sentenceIterator;

Useful Links

  • Homepage: http://oss.software.ibm.com/icu/

  • API documents: http://oss.software.ibm.com/icu/apiref/index.html

  • User guide: http://oss.software.ibm.com/icu/userguide/

Getting ICU4J

  • Easiest – get a .jar file from the download section at http://oss.software.ibm.com/icu4j

  • Use the latest version if possible

  • For sources, download the source .jar

  • For bleeding edge, use the latest CVS

  • :pserver:anoncvs@oss.software.ibm.com:/usr/cvs/icu4j

Setting up ICU4J

  • Check that you have the appropriate JDK version

  • Try the test code (ICU4J 3.0 or later):

import com.ibm.icu.util.ULocale;
import com.ibm.icu.util.UResourceBundle;

public class TestICU {
    public static void main(String[] args) {
        UResourceBundle resourceBundle =
            UResourceBundle.getBundleInstance(null,
                                              ULocale.getDefault());
    }
}

  • Add ICU’s jar to classpath on command line

  • Run the test suite

Building ICU4J

  • Need ant in addition to JDK

  • Use ant to build

  • We also like Eclipse

Collation Engine

  • More on collation in a couple of hours!

  • Used for comparing strings

  • Instantiation:

ULocale locale = new ULocale("fr");
Collator coll = Collator.getInstance(locale);
// do useful things with the collator

  • Lives in com.ibm.icu.text.Collator

String Comparison

  • Works fast

  • You get the result as soon as it is ready

  • Use when you don’t need to compare the same strings many times

int compare(String source, String target);
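
A minimal sketch of direct comparison follows. To stay runnable without the ICU jar it assumes the JDK's java.text.Collator (which takes a java.util.Locale, not a ULocale); the ICU4J Collator offers the same compare method. CompareDemo and sorted are hypothetical names.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CompareDemo {
    // Sorts a copy of the words using the collator for the given locale.
    static String[] sorted(Locale locale, String... words) {
        Collator coll = Collator.getInstance(locale);
        String[] copy = words.clone();
        Arrays.sort(copy, coll); // a Collator is also a Comparator
        return copy;
    }

    public static void main(String[] args) {
        Collator coll = Collator.getInstance(Locale.FRENCH);
        // negative, zero or positive, like String.compareTo
        System.out.println(coll.compare("banane", "Cerise") < 0); // true
        // plain binary comparison would put the capital letter first
        System.out.println("banane".compareTo("Cerise") < 0);     // false
        System.out.println(String.join(", ",
            sorted(Locale.FRENCH, "pomme", "Cerise", "banane")));
    }
}
```

Sorting a short list once, as here, is the case where direct comparison is the right tool.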

Sort Keys

  • Used when multiple comparisons are required

  • Indexes in databases

  • ICU4J has two sort key classes: CollationKey and RawCollationKey

  • Compare only sort keys generated by the same type of collator

CollationKey class

  • JDK compatible

  • Saves the original string

  • Compare keys with compareTo method

  • Get the bytes with toByteArray method

  • We used CollationKey as a key for a TreeMap structure
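
The TreeMap idea can be sketched as below. This is not the sample from the talk: it assumes the JDK's java.text.CollationKey, which has the compareTo, toByteArray and getSourceString methods described here, and the names SortKeyDemo and sortedByKey are invented.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class SortKeyDemo {
    // Counts words in a TreeMap ordered by sort key, then lists them in order.
    static List<String> sortedByKey(Locale locale, String... words) {
        Collator coll = Collator.getInstance(locale);
        Map<CollationKey, Integer> counts = new TreeMap<>();
        for (String w : words) {
            // each key is computed once; the map then compares keys only
            counts.merge(coll.getCollationKey(w), 1, Integer::sum);
        }
        List<String> result = new ArrayList<>();
        for (CollationKey key : counts.keySet()) {
            result.add(key.getSourceString()); // the key saved the original
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
            sortedByKey(Locale.ENGLISH, "cherry", "apple", "Banana", "apple"));
    }
}
```

Every insertion into the map compares the new key against several existing ones, which is the repeated-comparison case sort keys are meant for.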

RawCollationKey class

  • Does not store the original string

  • Get it by using getRawCollationKey method

  • Mutable class, can be reused

  • Simple and lightweight

Message Format - Introduction

  • Assembles a user message from parts

  • Some parts fixed, some supplied at runtime

  • Order different for different languages:

    • English: My Aunt’s pen is on the table.

    • French: The pen of my Aunt is on the table.

  • Pattern string defines how to assemble parts:

    • English: {0}''s {2} is {1}.

    • French: {2} of {0} is {1}.

  • Get pattern string from resource bundle
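
To see the reordering at work, the two patterns above can be fed the same arguments. The sketch assumes the JDK's java.text.MessageFormat (the ICU4J class accepts these patterns too) and puts the patterns inline instead of in a resource bundle; ReorderDemo and assemble are hypothetical names.

```java
import java.text.MessageFormat;

class ReorderDemo {
    // Argument 0 is the person, 1 the place, 2 the thing.
    static String assemble(String pattern, String person, String place, String thing) {
        MessageFormat fmt = new MessageFormat(pattern);
        return fmt.format(new Object[] {person, place, thing});
    }

    public static void main(String[] args) {
        String english = "{0}''s {2} is {1}.";
        String french = "{2} of {0} is {1}.";
        // prints: My Aunt's pen is on the table.
        System.out.println(assemble(english, "My Aunt", "on the table", "pen"));
        // prints: pen of My Aunt is on the table.
        System.out.println(assemble(french, "My Aunt", "on the table", "pen"));
    }
}
```

The calling code is identical for both languages; only the pattern string changes.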

Message Format - Example

String person = …; // e.g. "My Aunt"
String place = …; // e.g. "on the table"
String thing = …; // e.g. "pen"

String pattern = resourceBundle.getString("personPlaceThing");
MessageFormat msgFmt = new MessageFormat(pattern);
Object arguments[] = {person, place, thing};
String message = msgFmt.format(arguments);
System.out.println(message);

Message Format – Different data types

  • We can also format other data types, like dates

  • We do this by adding a format type:

String pattern = "On {0, date} at {0, time} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object args[] = {new Date(System.currentTimeMillis()), // 0
                 "a power failure"                     // 1
                };
System.out.println(fmt.format(args));

  • This will output:

On Jul 17, 2004 at 2:15:08 PM there was a power failure.

Message Format – Format styles

  • Add a format style:

String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object args[] = {new Date(System.currentTimeMillis()), // 0
                 "a power failure"                     // 1
                };
System.out.println(fmt.format(args));

  • This will output:

On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.

Message Format – Format style details

Message Format – No format type

  • If no format type, data formatted like this:

Message Format – Counting files

  • Pattern to display number of files:

There are {1, number, integer} files in {0}.

  • Code to use the pattern:

String pattern = resourceBundle.getString("fileCount");
MessageFormat fmt = new MessageFormat(pattern);
String directoryName = … ;
int fileCount = … ;
Object args[] = {directoryName, new Integer(fileCount)};
System.out.println(fmt.format(args));

  • This will output messages like:

There are 1,234 files in myDirectory.

Message Format – Problems counting files

  • If there’s only one file, we get:

There are 1 files in myDirectory.

  • Could fix by testing for special case of one file

  • But, some languages need other special cases:

    • Dual forms

    • Different form for no files

    • Etc.

Message Format – Choice format

  • Choice format handles all of this

  • Use special format element:

There {1, choice, 0#are no files|

1#is one file|

1<are {1, number, integer} files} in {0}.

  • Using this pattern with the same code we get:

There are no files in thisDirectory.

There is one file in thatDirectory.

There are 1,234 files in myDirectory.
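
A self-contained sketch of the choice pattern in use. It assumes the JDK's java.text.MessageFormat, writes the pattern on one line without the extra spaces, and fixes the locale to Locale.US so the grouping separator is predictable; FileCountDemo and describe are made-up names.

```java
import java.text.MessageFormat;
import java.util.Locale;

class FileCountDemo {
    // Argument 0 is the directory name, argument 1 the number of files.
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|"
        + "1<are {1,number,integer} files} in {0}.";

    static String describe(String directoryName, int fileCount) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        return fmt.format(new Object[] {directoryName, fileCount});
    }

    public static void main(String[] args) {
        System.out.println(describe("thisDirectory", 0));
        System.out.println(describe("thatDirectory", 1));
        System.out.println(describe("myDirectory", 1234));
    }
}
```

The third choice contains a format element of its own, which is processed after the choice has been made; that is how the number still gets its grouping separator.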

Message Format – Choice format patterns

  • Selects a string based on number

  • If string is a format element, process it

  • Splits real line into two or more ranges

  • Range specifiers separated by vertical bar (“|”)

  • Lower limit, separator, string

  • Separator indicates type of lower limit: “#” makes the limit inclusive, “<” makes it exclusive

Message Format – Choice pattern details

  • Here’s our pattern again:

There {1, choice, 0#are no files|

1#is one file|

1<are {1, number, integer} files} in {0}.

  • First range is [0..1)

    • Really (-∞..1)

  • Second range is [1..1]

  • Third range is (1..∞)
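
The ranges can be probed directly with a bare choice format. The sketch below assumes the JDK's java.text.ChoiceFormat and simplifies the third string to plain text so that no message format is needed; ChoiceRanges and pick are hypothetical names.

```java
import java.text.ChoiceFormat;

class ChoiceRanges {
    // Returns the string selected for n by the three-range pattern.
    static String pick(double n) {
        ChoiceFormat choice =
            new ChoiceFormat("0#are no files|1#is one file|1<are many files");
        return choice.format(n);
    }

    public static void main(String[] args) {
        System.out.println(pick(-5));   // below the first limit: first string
        System.out.println(pick(0.5));  // still in the first range
        System.out.println(pick(1));    // exactly 1: second range
        System.out.println(pick(1.5));  // above 1: third range
    }
}
```

A number below the first limit falls back to the first string, which is why the first range effectively has no lower bound.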

Message Format – Other details

  • Format style can be a pattern string

    • Format type number: use DecimalFormat pattern

    • Format type date, time: use SimpleDateFormat pattern

  • Quoting in patterns

    • Enclose special characters in single quotes

    • Use two consecutive single quotes to represent one

The '{' character, the '#' character and the '' character.
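
Running that pattern through a message format shows what the quoting does. The sketch assumes the JDK's java.text.MessageFormat; QuotingDemo and render are names chosen for the example.

```java
import java.text.MessageFormat;

class QuotingDemo {
    // Formats a pattern that needs no arguments.
    static String render(String pattern) {
        return MessageFormat.format(pattern);
    }

    public static void main(String[] args) {
        String pattern =
            "The '{' character, the '#' character and the '' character.";
        // prints: The { character, the # character and the ' character.
        System.out.println(render(pattern));
    }
}
```

The quotes around { and # are consumed, and the doubled quote comes out as a single one.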

Useful Links

  • Homepage: http://oss.software.ibm.com/icu4j/

  • API documents: http://oss.software.ibm.com/icu4j/doc/

  • User guide: http://oss.software.ibm.com/icu/userguide/
