Getting Started with ICU

Vladimir Weinstein

Eric Mader

26th Internationalization and Unicode Conference

Agenda
  • Getting & setting up ICU4C
  • Using conversion engine
  • Using break iterator engine
  • Getting & setting up ICU4J
  • Using collation engine
  • Using message formats

Getting ICU4C
  • http://oss.software.ibm.com/icu/
  • Get the latest release
  • Get the binary package
  • Source download for modifying build options
  • CVS for bleeding edge:
  • :pserver:[email protected]:/usr/cvs/icu

Setting up ICU4C
  • Unpack binaries
  • If you need to build from source
    • Windows:
      • MSVC .NET 2003 project,
      • Cygwin + MSVC 6,
      • Cygwin alone
    • Unix: runConfigureICU
      • make install
      • make check

Testing ICU4C
  • Windows - run: cintltst, intltest, iotest
  • Unix - make check (again)
  • See it for yourself:

#include <stdio.h>
#include "unicode/utypes.h"
#include "unicode/ures.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* open the root bundle from ICU's own data to confirm it can be loaded */
    UResourceBundle *res = ures_open(NULL, "", &status);
    if(U_SUCCESS(status)) {
        printf("everything is OK\n");
    } else {
        printf("error %s opening resource\n", u_errorName(status));
    }
    ures_close(res);
    return 0;
}

Conversion Engine - Opening
  • ICU4C uses open/use/close paradigm
  • Open a converter:

UErrorCode status = U_ZERO_ERROR;
UConverter *cnv = ucnv_open(encoding, &status);
if(U_FAILURE(status)) {
    /* process the error situation, die gracefully */
}

  • Almost all APIs use UErrorCode for status
  • Check the error code!

What Converters are Available
  • ucnv_countAvailable() – get the number of available converters
  • ucnv_getAvailableName() – get the name of a particular converter
  • Many frameworks allow this kind of enumeration

Converting Text Chunk by Chunk

char buffer[DEFAULT_BUFFER_SIZE];
char *bufP = buffer;
int32_t len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE,
                              source, sourceLen, &status);
if(U_FAILURE(status)) {
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        /* len holds the required size: allocate it and convert again */
        status = U_ZERO_ERROR;
        bufP = (char *)malloc((len + 1) * sizeof(char));
        len = ucnv_fromUChars(cnv, bufP, len + 1,
                              source, sourceLen, &status);
    } else {
        /* other error, die gracefully */
    }
}
/* do interesting stuff with the converted text */

Converting Text Character by Character

UChar32 result;
const char *source = start;
const char *sourceLimit = start + len;
while(source < sourceLimit) {
    /* converts one code point and advances source past its bytes */
    result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
    if(U_FAILURE(status)) {
        /* die gracefully */
    }
    /* do interesting stuff with the converted text */
}

  • Works only from code page to Unicode

Converting Text Piece by Piece

while((!feof(f)) && ((count=fread(inBuf, 1, BUFFER_SIZE , f)) > 0) ) {
    source = inBuf;
    sourceLimit = inBuf + count;
    do {
        target = uBuf;
        targetLimit = uBuf + uBufSize;
        ucnv_toUnicode(conv, &target, targetLimit,
                       &source, sourceLimit, NULL,
                       feof(f)?TRUE:FALSE, /* pass 'flush' as TRUE at eof, */
                                           /* when no more data will come  */
                       &status);
        if(status == U_BUFFER_OVERFLOW_ERROR) {
            // simply ran out of space - we'll reset the
            // target ptr the next time through the loop.
            status = U_ZERO_ERROR;
        } else {
            // Check other errors here and act appropriately
        }
        text.append(uBuf, target-uBuf);
        // total counts the UChars produced; count holds the bytes read
        total += target-uBuf;
    } while (source < sourceLimit); // while simply out of space
}

Clean up!
  • Whatever is opened needs to be closed
  • Converters use ucnv_close
  • Sample uses conversion to convert code page data from a file

Break Iteration - Introduction
  • Four types of boundaries:
    • Character, word, line, sentence
  • An iterator points to a boundary between two characters
  • A boundary is reported as the index of the character that follows it
  • Use current() to get the boundary
  • Use first() to set iterator to start of text
  • Use last() to set iterator to end of text

Break Iteration - Navigation
  • Use next() to move to next boundary
  • Use previous() to move to previous boundary
  • Both return DONE if there is no further boundary in that direction

Break Iteration – Checking a position
  • Use isBoundary() to see if a position is a boundary
  • Use preceding() to find the last boundary before a position
  • Use following() to find the first boundary after a position
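
The navigation and position-checking calls can be tried without building ICU4C. The following is a minimal Java sketch that assumes the JDK's java.text.BreakIterator as a stand-in: ICU4J's com.ibm.icu.text.BreakIterator offers methods with the same names, but the exact boundary rules may differ between the two. The class name WordBoundaries, its helper method and the sample text are illustrative only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the JDK BreakIterator stands in for ICU's.
class WordBoundaries {

    // Collect every word boundary in the text, walking forward with next().
    static List<Integer> boundaries(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = words.first(); pos != BreakIterator.DONE; pos = words.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(boundaries(text));

        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        System.out.println(words.isBoundary(5)); // is offset 5 a boundary?
        System.out.println(words.following(6));  // first boundary after offset 6
        System.out.println(words.preceding(6));  // last boundary before offset 6
    }
}
```

Note that following() and preceding() never return the offset they were given, even when that offset is itself a boundary; use isBoundary() for that check.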

Break Iteration - Opening
  • Use the factory methods:

Locale locale = …; // locale to use for break iterators
UErrorCode status = U_ZERO_ERROR;

BreakIterator *characterIterator =
    BreakIterator::createCharacterInstance(locale, status);
BreakIterator *wordIterator =
    BreakIterator::createWordInstance(locale, status);
BreakIterator *lineIterator =
    BreakIterator::createLineInstance(locale, status);
BreakIterator *sentenceIterator =
    BreakIterator::createSentenceInstance(locale, status);

  • Don’t forget to check the status!

Set the text
  • We need to tell the iterator what text to use:

UnicodeString text;
readFile(file, text);
wordIterator->setText(text);

  • Reuse iterators by calling setText() again.

Break Iteration - Counting words in a file:

int32_t countWords(BreakIterator *wordIterator, UnicodeString &text)
{
    UErrorCode status = U_ZERO_ERROR;
    UnicodeString word;
    UnicodeSet letters(UnicodeString("[:letter:]"), status);
    int32_t wordCount = 0;
    int32_t start = wordIterator->first();

    for(int32_t end = wordIterator->next();
        end != BreakIterator::DONE;
        start = end, end = wordIterator->next())
    {
        // a segment counts as a word only if it contains a letter,
        // which skips runs of spaces and punctuation
        text.extractBetween(start, end, word);
        if(letters.containsSome(word)) {
            wordCount += 1;
        }
    }
    return wordCount;
}

Break Iteration – Breaking lines

int32_t previousBreak(BreakIterator *breakIterator, UnicodeString &text,
                      int32_t location)
{
    int32_t len = text.length();

    // skip forward over whitespace and control characters
    while(location < len) {
        UChar c = text[location];
        if(!u_isWhitespace(c) && !u_iscntrl(c)) {
            break;
        }
        location += 1;
    }
    // preceding() takes an offset; previous() takes no argument
    return breakIterator->preceding(location + 1);
}

Break Iteration – Cleaning up
  • Use delete to delete the iterators

delete characterIterator;
delete wordIterator;
delete lineIterator;
delete sentenceIterator;

Useful Links
  • Homepage: http://oss.software.ibm.com/icu/
  • API documents: http://oss.software.ibm.com/icu/apiref/index.html
  • User guide: http://oss.software.ibm.com/icu/userguide/

Getting ICU4J
  • Easiest: pick up a .jar file from the download section at http://oss.software.ibm.com/icu4j
  • Use the latest version if possible
  • For sources, download the source .jar
  • For bleeding edge, use the latest CVS
  • :pserver:[email protected]:/usr/cvs/icu4j

Setting up ICU4J
  • Check that you have the appropriate JDK version
  • Try the test code (ICU4J 3.0 or later):

import com.ibm.icu.util.ULocale;
import com.ibm.icu.util.UResourceBundle;

public class TestICU {
    public static void main(String[] args) {
        UResourceBundle resourceBundle =
            UResourceBundle.getBundleInstance(null,
                                              ULocale.getDefault());
    }
}

  • Add ICU’s jar to classpath on command line
  • Run the test suite

Building ICU4J
  • Need ant in addition to JDK
  • Use ant to build
  • We also like Eclipse

Collation Engine
  • More on collation in a couple of hours!
  • Used for comparing strings
  • Instantiation:

ULocale locale = new ULocale("fr");
Collator coll = Collator.getInstance(locale);
// do useful things with the collator

  • Fully qualified name: com.ibm.icu.text.Collator

String Comparison
  • Works fast
  • Returns as soon as the result is known
  • Use when you don’t need to compare the same strings many times

int compare(String source, String target);
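
As a rough illustration of how compare() differs from plain code point comparison, here is a minimal sketch. It assumes the JDK's java.text.Collator as a stand-in for com.ibm.icu.text.Collator (both provide compare(), setStrength() and the PRIMARY and TERTIARY constants); the class name CompareDemo and the collate() helper are made up for this example.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; java.text.Collator stands in for ICU4J's Collator.
class CompareDemo {

    // Locale-sensitive comparison: negative, zero or positive, like String.compareTo.
    static int collate(String source, String target, int strength) {
        Collator coll = Collator.getInstance(Locale.ENGLISH);
        coll.setStrength(strength);
        return coll.compare(source, target);
    }

    public static void main(String[] args) {
        // Code point order puts "Banana" first because 'B' < 'a'.
        System.out.println("apple".compareTo("Banana") > 0);
        // Collation order puts "apple" first, as a dictionary would.
        System.out.println(collate("apple", "Banana", Collator.TERTIARY) < 0);
        // At primary strength, case differences are ignored.
        System.out.println(collate("apple", "APPLE", Collator.PRIMARY) == 0);
    }
}
```

Only the sign of the result is meaningful; callers should not rely on particular magnitudes.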

Sort Keys
  • Used when multiple comparisons are required
  • Example: indexes in databases
  • ICU4J has two sort key classes: CollationKey and RawCollationKey
  • Compare only sort keys generated by the same kind of collator

CollationKey class
  • JDK compatible
  • Saves the original string
  • Compare keys with compareTo method
  • Get the bytes with toByteArray method
  • We used CollationKey as a key for a TreeMap structure
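
The TreeMap idea can be sketched as follows. This is not the presenters' sample: it is a minimal illustration that assumes the JDK's java.text.CollationKey, which the slide describes the ICU4J class as being compatible with. The class name SortKeyDemo and the sorted() helper are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Hypothetical demo class; java.text classes stand in for ICU4J's.
class SortKeyDemo {

    // Sort strings by storing them in a TreeMap keyed on their collation keys.
    static List<String> sorted(List<String> input) {
        Collator coll = Collator.getInstance(Locale.ENGLISH);
        TreeMap<CollationKey, String> map = new TreeMap<>();
        for (String s : input) {
            // Each key is computed once; the map then compares keys via compareTo.
            map.put(coll.getCollationKey(s), s);
        }
        return new ArrayList<>(map.values());
    }

    public static void main(String[] args) {
        System.out.println(sorted(List.of("pear", "Apple", "banana")));

        CollationKey key = Collator.getInstance(Locale.ENGLISH).getCollationKey("pear");
        System.out.println(key.getSourceString());    // the key keeps the original string
        System.out.println(key.toByteArray().length); // size of the raw sort key in bytes
    }
}
```

Because a map holds one value per key, strings that collate as equal would overwrite each other here; a real index would need to allow for that.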

RawCollationKey class
  • Does not store the original string
  • Get it by using getRawCollationKey method
  • Mutable class, can be reused
  • Simple and lightweight

Message Format - Introduction
  • Assembles a user message from parts
  • Some parts fixed, some supplied at runtime
  • Order different for different languages:
    • English: My Aunt’s pen is on the table.
    • French: The pen of my Aunt is on the table.
  • Pattern string defines how to assemble parts:
    • English: {0}''s {2} is {1}.
    • French: {2} of {0} is {1}.
  • Get pattern string from resource bundle

Message Format - Example

String person = …; // e.g. "My Aunt"
String place = …; // e.g. "on the table"
String thing = …; // e.g. "pen"

String pattern = resourceBundle.getString("personPlaceThing");
MessageFormat msgFmt = new MessageFormat(pattern);
Object arguments[] = {person, place, thing};
String message = msgFmt.format(arguments);

System.out.println(message);

Message Format – Different data types
  • We can also format other data types, like dates
  • We do this by adding a format type:

String pattern = "On {0, date} at {0, time} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object args[] = {new Date(System.currentTimeMillis()), // 0
                 "a power failure"                      // 1
                };
System.out.println(fmt.format(args));

  • This will output:

On Jul 17, 2004 at 2:15:08 PM there was a power failure.

Message Format – Format styles
  • Add a format style:

String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object args[] = {new Date(System.currentTimeMillis()), // 0
                 "a power failure"                      // 1
                };
System.out.println(fmt.format(args));

  • This will output:

On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.

Message Format – Format style details

Message Format – No format type
  • If no format type, data formatted like this:

Message Format – Counting files
  • Pattern to display number of files:

There are {1, number, integer} files in {0}.

  • Code to use the pattern:

String pattern = resourceBundle.getString("fileCount");
MessageFormat fmt = new MessageFormat(pattern);
String directoryName = … ;
int fileCount = … ;
Object args[] = {directoryName, Integer.valueOf(fileCount)};
System.out.println(fmt.format(args));

  • This will output messages like:

There are 1,234 files in myDirectory.

Message Format – Problems counting files
  • If there’s only one file, we get:

There are 1 files in myDirectory.

  • Could fix by testing for special case of one file
  • But, some languages need other special cases:
    • Dual forms
    • Different form for no files
    • Etc.

Message Format – Choice format
  • Choice format handles all of this
  • Use special format element:

There {1, choice, 0#are no files|
                  1#is one file|
                  1<are {1, number, integer} files} in {0}.

  • Using this pattern with the same code we get:

There are no files in thisDirectory.

There is one file in thatDirectory.

There are 1,234 files in myDirectory.
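
To try these three cases end to end, here is a minimal runnable sketch. It assumes the JDK's java.text.MessageFormat, which accepts the same choice syntax, rather than ICU4J's class. The pattern is written on one line and hard-coded instead of being read from a resource bundle, and the class name FileCountDemo and the describe() helper are invented for the example.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical demo class; java.text.MessageFormat stands in for ICU4J's.
class FileCountDemo {

    // The slide's pattern on a single line: {0} is the directory, {1} the count.
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.";

    static String describe(String directoryName, int fileCount) {
        // Locale.US is pinned so the grouping separator is predictable.
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        Object[] args = {directoryName, Integer.valueOf(fileCount)};
        return fmt.format(args);
    }

    public static void main(String[] args) {
        System.out.println(describe("thisDirectory", 0));
        System.out.println(describe("thatDirectory", 1));
        System.out.println(describe("myDirectory", 1234));
    }
}
```

The calling code never branches on the count; all of the special-casing lives in the pattern, which is what lets a translator supply different cases for another language.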

Message Format – Choice format patterns
  • Selects a string based on number
  • If string is a format element, process it
  • Splits real line into two or more ranges
  • Range specifiers separated by vertical bar (“|”)
  • Lower limit, separator, string
  • Separator indicates type of lower limit: # includes the limit in the range, < excludes it

Message Format – Choice pattern details
  • Here’s our pattern again:

There {1, choice, 0#are no files|
                  1#is one file|
                  1<are {1, number, integer} files} in {0}.

  • First range is [0..1)
    • Really [-∞..1)
  • Second range is [1..1]
  • Third range is (1..∞]
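
The three ranges can be probed directly with a standalone choice format. This minimal sketch assumes the JDK's java.text.ChoiceFormat and uses shortened strings; the class name ChoiceRangeDemo and the pick() helper are hypothetical.

```java
import java.text.ChoiceFormat;

// Hypothetical demo class showing which range each number falls into.
class ChoiceRangeDemo {

    static String pick(double number) {
        // '#' makes the limit inclusive, '<' makes it exclusive.
        ChoiceFormat choice = new ChoiceFormat("0#no files|1#one file|1<many files");
        return choice.format(number);
    }

    public static void main(String[] args) {
        double[] samples = {-5, 0, 0.5, 1, 1.5, 1234};
        for (double n : samples) {
            System.out.println(n + " -> " + pick(n));
        }
    }
}
```

Values below the first limit, such as -5, still select the first string, which is why the first range effectively starts at -∞.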

Message Format – Other details
  • Format style can be a pattern string
    • Format type number: use DecimalFormat pattern
    • Format type date, time: use SimpleDateFormat pattern
  • Quoting in patterns
    • Enclose special characters in single quotes
    • Use two consecutive single quotes to represent one

The '{' character, the '#' character and the '' character.
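
Both points, pattern strings as format styles and quoting, can be checked with a short sketch. It assumes the JDK's java.text.MessageFormat as a stand-in for ICU4J's class; the class name QuotingDemo, the format() helper and the sample values are illustrative only.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical demo class; java.text.MessageFormat stands in for ICU4J's.
class QuotingDemo {

    // Format a pattern with the given arguments in a fixed locale.
    static String format(String pattern, Object... args) {
        return new MessageFormat(pattern, Locale.US).format(args);
    }

    public static void main(String[] args) {
        // Quoted special characters are emitted literally; '' yields one quote.
        System.out.println(format("The '{' character, the '#' character and the '' character."));
        // The doubled quote is what makes the possessive work.
        System.out.println(format("{0}''s {2} is {1}.", "My Aunt", "on the table", "pen"));
        // A DecimalFormat pattern used as the style of a number element.
        System.out.println(format("{0,number,#,##0.00}", 1234.5));
    }
}
```

A single unpaired quote in a pattern is a common source of surprises, since everything after it is treated as quoted text.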

Useful Links
  • Homepage: http://oss.software.ibm.com/icu4j/
  • API documents: http://oss.software.ibm.com/icu4j/doc/
  • User guide: http://oss.software.ibm.com/icu/userguide/
