Getting Started with ICU

Vladimir Weinstein

Eric Mader

Steven R. Loomis



Agenda

  • Getting & setting up ICU4C

  • Using conversion engine

  • Using break iterator engine

  • Getting & setting up ICU4J

  • Using collation engine

  • Using message formats

  • Example analysis

27th Internationalization and Unicode Conference



Getting ICU4C

  • http://ibm.com/software/globalization/icu

  • Get the latest release

  • Get the binary package

  • Source download for modifying build options

  • CVS for bleeding edge – read instructions


Setting up ICU4C

  • Unpack binaries

  • If you need to build from source

    • Windows:

      • MSVC .Net 2003 Project,

      • CygWin + MSVC 6,

      • just CygWin

    • Unix: runConfigureICU

      • make install

      • make check


Testing ICU4C

  • Windows - run: cintltst, intltest, iotest

  • Unix - make check (again)

  • See it for yourself:

#include <stdio.h>
#include "unicode/utypes.h"
#include "unicode/ures.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* opening the root bundle fails if ICU cannot find its data */
    UResourceBundle *res = ures_open(NULL, "", &status);
    if(U_SUCCESS(status)) {
        printf("everything is OK\n");
    } else {
        printf("error %s opening resource\n", u_errorName(status));
    }
    ures_close(res);
    return 0;
}


Conversion Engine - Opening

  • ICU4C uses open/use/close paradigm

  • Open a converter:

UErrorCode status = U_ZERO_ERROR;
UConverter *cnv = ucnv_open(encoding, &status);
if(U_FAILURE(status)) {
    /* process the error situation, die gracefully */
}

  • Almost all APIs use UErrorCode for status

  • Check the error code!


What Converters are Available

  • ucnv_countAvailable() – get the number of available converters

  • ucnv_getAvailableName() – get the name of a particular converter

  • Many frameworks allow this kind of examination
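As a point of comparison, here is a minimal sketch of the same examination in another framework. It is not ICU code: it uses the JDK's java.nio.charset.Charset registry as a stand-in for the ICU converter list, and the class name AvailableCharsets is made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Stand-in for ICU's converter enumeration, using the JDK's charset registry.
class AvailableCharsets {
    // Canonical names of every charset known to this JVM, in sorted order.
    static List<String> names() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    public static void main(String[] args) {
        List<String> names = names();
        System.out.println(names.size() + " charsets available");
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```

The count and the names vary between JVMs, much as the ICU converter list depends on how ICU's data was built.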


Converting Text Chunk by Chunk

char buffer[DEFAULT_BUFFER_SIZE];
char *bufP = buffer;
int32_t len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE,
                              source, sourceLen, &status);
if(U_FAILURE(status)) {
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        /* len now holds the required size: allocate it and convert again */
        status = U_ZERO_ERROR;
        bufP = (char *)malloc((len + 1) * sizeof(char));
        len = ucnv_fromUChars(cnv, bufP, len + 1,
                              source, sourceLen, &status);
    } else {
        /* other error, die gracefully */
    }
}
/* do interesting stuff with the converted text */
/* (free bufP afterwards if it no longer points at buffer) */


Converting Text Character by Character

UChar32 result;
const char *source = start;
const char *sourceLimit = start + len;
while(source < sourceLimit) {
    /* source is advanced past the bytes that were consumed */
    result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
    if(U_FAILURE(status)) {
        /* die gracefully */
    }
    /* do interesting stuff with the converted text */
}

  • Works only from code page to Unicode


Converting Text Piece by Piece

while((!feof(f)) && ((count=fread(inBuf, 1, BUFFER_SIZE , f)) > 0) ) {
    source = inBuf;
    sourceLimit = inBuf + count;
    do {
        target = uBuf;
        targetLimit = uBuf + uBufSize;
        ucnv_toUnicode(conv, &target, targetLimit,
                       &source, sourceLimit, NULL,
                       feof(f)?TRUE:FALSE, /* pass 'flush' when eof */
                                           /* is true (when no more data will come) */
                       &status);
        if(status == U_BUFFER_OVERFLOW_ERROR) {
            // simply ran out of space – we'll reset the
            // target ptr the next time through the loop.
            status = U_ZERO_ERROR;
        } else {
            // Check other errors here and act appropriately
        }
        text.append(uBuf, target-uBuf);
        count += target-uBuf;
    } while (source < sourceLimit); // while simply out of space
}
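The drain-and-retry idea in the loop above can be sketched in Java as well. This is not ICU code: it assumes the JDK's java.nio.charset.CharsetDecoder as a stand-in for ucnv_toUnicode(), it decodes UTF-8 only, and the names ChunkedDecode, decodeUtf8 and drain are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// JDK stand-in for the ucnv_toUnicode() loop: decode through a small,
// fixed-size output buffer and drain it whenever it fills up.
class ChunkedDecode {
    static String decodeUtf8(byte[] input, int outCapacity) {
        if (outCapacity < 2) {
            // A supplementary character needs two UTF-16 code units.
            throw new IllegalArgumentException("outCapacity must be at least 2");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer source = ByteBuffer.wrap(input);
        CharBuffer target = CharBuffer.allocate(outCapacity);
        StringBuilder text = new StringBuilder();

        CoderResult result;
        do {
            // 'true' plays the role of ICU's flush flag: no more input follows.
            result = decoder.decode(source, target, true);
            if (result.isError()) {
                throw new IllegalArgumentException("bad input: " + result);
            }
            drain(target, text);
        } while (result.isOverflow()); // simply out of space: go around again

        decoder.flush(target);         // the UTF-8 decoder holds no pending output
        drain(target, text);
        return text.toString();
    }

    // Append what the decoder wrote, then reset the buffer for reuse.
    private static void drain(CharBuffer target, StringBuilder text) {
        target.flip();
        text.append(target);
        target.clear();
    }
}
```

As in the C version, an overflow result is not treated as a failure. It only means the target buffer has to be emptied before conversion continues.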


Clean up!

  • Whatever is opened needs to be closed

  • Converters use ucnv_close

  • The sample uses the conversion engine to convert code page data read from a file


Text Boundary Analysis

  • Process of locating linguistic boundaries while formatting and processing text

  • Many uses

  • Relatively straightforward for English

  • Hard for some other languages:

    • Chinese and Japanese

    • Thai

    • Hindi


Break Iteration - Introduction

  • Character boundaries: grapheme clusters

  • Word boundaries: word counting, double click selection

  • Line break boundaries: where to break a line

  • Sentence break boundaries: sentence counting, triple click selection

  • ICU class - BreakIterator


Break Iteration – starting states

  • Points to a boundary between two characters

  • Index of character following the boundary

  • Use current() to get the boundary

  • Use first() to set iterator to start of text

  • Use last() to set iterator to end of text


Break Iteration - Navigation

  • Use next() to move to next boundary

  • Use previous() to move to previous boundary

  • Both return DONE if there is no further boundary in that direction


Break Iteration – Checking a position

  • Use isBoundary() to see if a position is a boundary

  • Use preceding() to find the last boundary before a position

  • Use following() to find the first boundary after a position
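The navigation and position-checking calls can be tried out with a short Java sketch. It assumes java.text.BreakIterator from the JDK as a stand-in for the ICU class (the method names used here are the same), and the class name BoundaryDemo and its helper methods are invented for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Uses java.text.BreakIterator as a stand-in for the ICU BreakIterator.
class BoundaryDemo {
    static BreakIterator wordIterator(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        return it;
    }

    // Walk forward from first() until next() reports DONE.
    static List<Integer> wordBoundaries(String text) {
        BreakIterator it = wordIterator(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(wordBoundaries(text)); // [0, 5, 6, 11]
        BreakIterator it = wordIterator(text);
        System.out.println(it.isBoundary(5));     // true: between "Hello" and the space
        System.out.println(it.following(2));      // 5: first boundary after offset 2
        System.out.println(it.preceding(8));      // 6: last boundary before offset 8
    }
}
```

Note that following() and preceding() never return the offset they were given, even when that offset is itself a boundary. Use isBoundary() for that check.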


Break Iteration - Opening

  • Use the factory methods:

Locale locale = …; // locale to use for break iterators
UErrorCode status = U_ZERO_ERROR;

BreakIterator *characterIterator =
    BreakIterator::createCharacterInstance(locale, status);
BreakIterator *wordIterator =
    BreakIterator::createWordInstance(locale, status);
BreakIterator *lineIterator =
    BreakIterator::createLineInstance(locale, status);
BreakIterator *sentenceIterator =
    BreakIterator::createSentenceInstance(locale, status);

  • Don’t forget to check the status!


Set the text

  • We need to tell the iterator what text to use:

UnicodeString text;
readFile(file, text);
wordIterator->setText(text);

  • Reuse iterators by calling setText() again.


Break Iteration - Counting words in a file:

int32_t countWords(BreakIterator *wordIterator, UnicodeString &text)
{
    UErrorCode status = U_ZERO_ERROR;
    UnicodeString word;
    UnicodeSet letters(UnicodeString("[:letter:]"), status);
    int32_t wordCount = 0;
    int32_t start = wordIterator->first();

    for(int32_t end = wordIterator->next();
        end != BreakIterator::DONE;
        start = end, end = wordIterator->next())
    {
        /* count only the segments that contain at least one letter */
        text.extractBetween(start, end, word);
        if(letters.containsSome(word)) {
            wordCount += 1;
        }
    }
    return wordCount;
}


Break Iteration – Breaking lines

int32_t previousBreak(BreakIterator *breakIterator, UnicodeString &text,
                      int32_t location)
{
    int32_t len = text.length();

    /* skip forward over whitespace and control characters */
    while(location < len) {
        UChar c = text[location];
        if(!u_isWhitespace(c) && !u_iscntrl(c)) {
            break;
        }
        location += 1;
    }
    /* preceding() takes an offset; previous() takes no argument */
    return breakIterator->preceding(location + 1);
}


Break Iteration – Cleaning up

  • Use delete to delete the iterators

delete characterIterator;
delete wordIterator;
delete lineIterator;
delete sentenceIterator;


Useful Links

  • Homepage: http://ibm.com/software/globalization/icu

  • API documents and User guide: http://ibm.com/software/globalization/icu/documents.jsp


Getting ICU4J

  • Easiest – pick a .jar file from the download section of http://ibm.com/software/globalization/icu

  • Use the latest version if possible

  • For sources, download the source .jar

  • For bleeding edge, use the latest CVS – see site for instructions


Setting up ICU4J

  • Check that you have the appropriate JDK version

  • Try the test code (ICU4J 3.0 or later):

import com.ibm.icu.util.ULocale;
import com.ibm.icu.util.UResourceBundle;

public class TestICU {
    public static void main(String[] args) {
        UResourceBundle resourceBundle =
            UResourceBundle.getBundleInstance(null,
                                              ULocale.getDefault());
    }
}

  • Add ICU’s jar to classpath on command line

  • Run the test suite


Building ICU4J

  • Need ant in addition to JDK

  • Use ant to build

  • We also like Eclipse


Collation Engine

  • More on collation tomorrow!

  • Used for comparing strings

  • Instantiation:

ULocale locale = new ULocale("fr");
Collator coll = Collator.getInstance(locale);
// do useful things with the collator

  • Lives in com.ibm.icu.text.Collator


String Comparison

  • Works fast

  • You get the result as soon as it is ready

  • Use when you don’t need to compare the same strings many times

int compare(String source, String target);
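A minimal sketch of direct comparison follows. It assumes java.text.Collator from the JDK as a stand-in for com.ibm.icu.text.Collator (both offer compare(String, String)), and the class name CompareDemo and the method sorted are invented for the example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Uses java.text.Collator as a stand-in for com.ibm.icu.text.Collator.
class CompareDemo {
    // Sort a copy of the input with locale-aware comparison.
    static List<String> sorted(List<String> words, Locale locale) {
        Collator coll = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(words);
        copy.sort(coll); // a Collator is also a Comparator
        return copy;
    }

    public static void main(String[] args) {
        Collator coll = Collator.getInstance(Locale.FRENCH);
        // Collation puts "apple" before "Banana"; code point order does not.
        System.out.println(coll.compare("apple", "Banana") < 0);  // true
        System.out.println("apple".compareTo("Banana") < 0);      // false
        System.out.println(sorted(Arrays.asList("cherry", "Banana", "apple"),
                                  Locale.FRENCH));
    }
}
```

Each call to compare() works on the two strings directly, which suits one-off comparisons and small sorts.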


Sort Keys

  • Used when multiple comparisons are required

  • Indexes in databases

  • ICU4J has two sort key classes: CollationKey and RawCollationKey

  • Only compare sort keys generated by the same type of collator


CollationKey class

  • JDK compatible

  • Saves the original string

  • Compare keys with compareTo method

  • Get the bytes with toByteArray method

  • We used CollationKey as a key for a TreeMap structure
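The TreeMap usage mentioned above might look roughly like the following sketch. It is not the sample from the talk: it assumes the JDK's java.text.CollationKey as a stand-in for the ICU class, and the names SortKeyDemo and sortedByKey are invented for the example.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Uses java.text.CollationKey as a stand-in for com.ibm.icu.text.CollationKey.
class SortKeyDemo {
    static List<String> sortedByKey(List<String> words, Locale locale) {
        Collator coll = Collator.getInstance(locale);
        // Each key is computed once; the TreeMap orders entries with compareTo().
        TreeMap<CollationKey, String> byKey = new TreeMap<>();
        for (String word : words) {
            byKey.put(coll.getCollationKey(word), word);
        }
        List<String> result = new ArrayList<>();
        for (CollationKey key : byKey.keySet()) {
            result.add(key.getSourceString()); // the key saves the original string
        }
        return result;
    }
}
```

Because every key comes from the same collator, comparing keys gives the same order as calling compare() on the original strings, without repeating the collation work on each comparison.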


RawCollationKey class

  • Does not store the original string

  • Get it by using getRawCollationKey method

  • Mutable class, can be reused

  • Simple and lightweight


Message Format - Introduction

  • Assembles a user message from parts

  • Some parts fixed, some supplied at runtime

  • Order different for different languages:

    • English: My Aunt’s pen is on the table.

    • French: The pen of my Aunt is on the table.

  • Pattern string defines how to assemble parts:

    • English: {0}''s {2} is {1}.

    • French: {2} of {0} is {1}.

  • Get pattern string from resource bundle
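The two pattern strings above can be exercised with a short sketch. It assumes java.text.MessageFormat from the JDK as a stand-in for the ICU class, hard-codes the patterns instead of reading them from a resource bundle, and uses the invented names AuntPenDemo and assemble.

```java
import java.text.MessageFormat;

// Uses java.text.MessageFormat as a stand-in for com.ibm.icu.text.MessageFormat.
class AuntPenDemo {
    static final String ENGLISH = "{0}''s {2} is {1}.";
    static final String FRENCH_ORDER = "{2} of {0} is {1}.";

    // The arguments keep the same positions; only the pattern changes.
    static String assemble(String pattern, String person, String place, String thing) {
        Object[] arguments = {person, place, thing};
        return new MessageFormat(pattern).format(arguments);
    }

    public static void main(String[] args) {
        // prints: My Aunt's pen is on the table.
        System.out.println(assemble(ENGLISH, "My Aunt", "on the table", "pen"));
        // prints: pen of My Aunt is on the table.
        System.out.println(assemble(FRENCH_ORDER, "My Aunt", "on the table", "pen"));
    }
}
```

The calling code never changes with the language. Only the pattern string does, which is why it belongs in a resource bundle.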


Message Format - Example

String person = …; // e.g. "My Aunt"
String place = …; // e.g. "on the table"
String thing = …; // e.g. "pen"
String pattern = resourceBundle.getString("personPlaceThing");
MessageFormat msgFmt = new MessageFormat(pattern);
Object arguments[] = {person, place, thing};
String message = msgFmt.format(arguments);
System.out.println(message);


Message Format – Different data types

  • We can also format other data types, like dates

  • We do this by adding a format type:

String pattern = "On {0, date} at {0, time} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object args[] = {new Date(System.currentTimeMillis()), // 0
                 "a power failure"                     // 1
                };
System.out.println(fmt.format(args));

  • This will output:

On Jul 17, 2004 at 2:15:08 PM there was a power failure.


Message Format – Format styles

  • Add a format style:

String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object args[] = {new Date(System.currentTimeMillis()), // 0
                 "a power failure"                     // 1
                };
System.out.println(fmt.format(args));

  • This will output:

On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.


Message Format – Format style details


Message Format – No format type

  • If no format type, data formatted like this:


Message Format – Counting files

  • Pattern to display number of files:

There are {1, number, integer} files in {0}.

  • Code to use the pattern:

String pattern = resourceBundle.getString("fileCount");
MessageFormat fmt = new MessageFormat(pattern);
String directoryName = … ;
int fileCount = … ;
Object args[] = {directoryName, Integer.valueOf(fileCount)};
System.out.println(fmt.format(args));

  • This will output messages like:

There are 1,234 files in myDirectory.


Message Format – Problems counting files

  • If there’s only one file, we get:

There are 1 files in myDirectory.

  • Could fix by testing for special case of one file

  • But, some languages need other special cases:

    • Dual forms

    • Different form for no files

    • Etc.


Message Format – Choice format

  • Choice format handles all of this

  • Use special format element:

There {1, choice, 0#are no files|

1#is one file|

1<are {1, number, integer} files} in {0}.

  • Using this pattern with the same code we get:

There are no files in thisDirectory.

There is one file in thatDirectory.

There are 1,234 files in myDirectory.
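The three outputs above can be reproduced with a small sketch. It assumes java.text.MessageFormat from the JDK as a stand-in for the ICU class, fixes the locale to Locale.US so that the grouping separator is a comma, writes the pattern on one line, and uses the invented names FileCountDemo and describe.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Uses java.text.MessageFormat as a stand-in for com.ibm.icu.text.MessageFormat.
class FileCountDemo {
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.";

    static String describe(String directoryName, int fileCount) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        Object[] args = {directoryName, fileCount};
        return fmt.format(args);
    }

    public static void main(String[] args) {
        System.out.println(describe("thisDirectory", 0));
        System.out.println(describe("thatDirectory", 1));
        System.out.println(describe("myDirectory", 1234));
    }
}
```

The selected string for the third range contains a nested format element, so it is formatted again with the same arguments. That is how the count ends up in the message.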


Message Format – Choice format patterns

  • Selects a string based on number

  • If string is a format element, process it

  • Splits real line into two or more ranges

  • Range specifiers separated by vertical bar (“|”)

  • Lower limit, separator, string

  • Separator indicates type of lower limit:

    • # – the lower limit is included in the range

    • < – the lower limit is excluded from the range


Message Format – Choice pattern details

  • Here’s our pattern again:

There {1, choice, 0#are no files|

1#is one file|

1<are {1, number, integer} files} in {0}.

  • First range is [0..1)

    • Really (-∞..1)

  • Second range is [1..1]

  • Third range is (1..∞)


Message Format – Other details

  • Format style can be a pattern string

    • Format type number: use DecimalFormat pattern

    • Format type date, time: use SimpleDateFormat pattern

  • Quoting in patterns

    • Enclose special characters in single quotes

    • Use two consecutive single quotes to represent one

The '{' character, the '#' character and the '' character.
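A quick way to check the quoting rules is to format the pattern above with no arguments. The sketch assumes java.text.MessageFormat from the JDK as a stand-in for the ICU class, and the names QuotingDemo and render are invented for the example.

```java
import java.text.MessageFormat;

// Uses java.text.MessageFormat as a stand-in for com.ibm.icu.text.MessageFormat.
class QuotingDemo {
    static final String PATTERN =
        "The '{' character, the '#' character and the '' character.";

    // No arguments are needed: the pattern contains only literal text.
    static String render() {
        return new MessageFormat(PATTERN).format(new Object[0]);
    }

    public static void main(String[] args) {
        // prints: The { character, the # character and the ' character.
        System.out.println(render());
    }
}
```

Apostrophe handling is one area where message format implementations and versions can differ, so it is worth running a check like this against the library version actually in use.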


Useful Links

  • Homepage: http://ibm.com/software/globalization/icu

  • API documents and User guide: http://ibm.com/software/globalization/icu/documents.jsp
