Text Boundary Analysis

Eric Mader

Advisory Software Engineer

IBM


Where do I break lines?

The rain in Spain stays mainly on the plain.

您有坦率和誠實的聲譽。

ด่ๅแรงฃนึ๓อัตราลูกจ้างใหม่ให้๓๕


Even in English, this can be hard

You owe me $1,234.56... I think.


Word wrapping vs word selection

Word wrapping:

Some characters’ behavior is context-dependent.

Searching by words:

Some characters’ behavior is context-dependent.


Analysis by pairs

[Table, built up over several slides: rows give the class of the first character of a pair, columns the class of the second. Classes: ltr, dgt, sp, pun, -, nbs, kji. Cells are marked with an X.]


Where pairs break down

A break position can depend on more than two characters:

You owe me $1,234.56... I think.

4.5

6..
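A minimal Python sketch of pair-based analysis may make the problem concrete. The character classes and the no-break table below are assumptions chosen for illustration, not the table from the slides. To keep "4.5" together the table has to forbid a break between a digit and a punctuation mark, and that same entry then glues the first period of the ellipsis onto the number.

```python
# Hypothetical classes and pair table, chosen only to illustrate the idea.
def char_class(ch):
    if ch.isalpha():
        return "ltr"
    if ch.isdigit():
        return "dgt"
    if ch.isspace():
        return "sp"
    return "pun"

# Pairs (first, second) between which a break is NOT allowed.
NO_BREAK = {
    ("ltr", "ltr"), ("dgt", "dgt"),
    ("ltr", "dgt"), ("dgt", "ltr"),
    ("dgt", "pun"), ("pun", "dgt"),  # needed so that "4.5" stays together
}

def pair_breaks(text):
    """Return the offsets at which the pair table allows a break."""
    return [i for i in range(1, len(text))
            if (char_class(text[i - 1]), char_class(text[i])) not in NO_BREAK]

def split_at(text, breaks):
    """Cut text at the given offsets."""
    pieces, start = [], 0
    for b in breaks + [len(text)]:
        pieces.append(text[start:b])
        start = b
    return pieces

tokens = split_at("4.56... I", pair_breaks("4.56... I"))
print(tokens)  # ['4.56.', '.', '.', ' ', 'I']
```

The first piece comes out as "4.56." rather than "4.56": looking only at the pair "6." there is no way to tell whether the period is a decimal point or the start of an ellipsis. That depends on the character after it.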


Where pairs break down

Sentence boundaries require even more lookahead:

He asked, “How tall are you?” I’m about 6 ft. tall. “Wow!”


An example

  • If not otherwise mentioned, each character is a “word” unto itself.

  • A run of letters constitutes a “word” and is kept together. Certain punctuation marks may appear inside a word, but only if they have a letter on each side.

  • A run of digits constitutes a “number” and is kept together. Certain punctuation marks may appear inside a number, but only if they have a digit on each side. In addition, a number may have certain optional prefix and suffix characters.

  • If a “word” and a “number” appear in succession with nothing between them, they’re kept together.


The state-machine approach

[Diagram, animated over several slides: a state machine with a start state and transitions labeled $, A, 0, %, and ., stepped through one character at a time on the input $1,234.56...]
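The walk through "$1,234.56..." can be sketched as a small table-driven state machine in Python. The state names, character classes, and transition table here are assumptions made for illustration, and they cover only a subset of the rules from the example. The key idea is that the machine remembers the last position at which it was in an accepting state and falls back to it when it gets stuck.

```python
def char_class(ch):
    # Hypothetical classes: prefix, digit, letter, mid-number mark, suffix.
    if ch == "$":
        return "pre"
    if ch.isdigit():
        return "dgt"
    if ch.isalpha():
        return "ltr"
    if ch in ".,":
        return "mid"
    if ch == "%":
        return "post"
    return "other"

# state -> {character class -> next state}; a missing entry stops the machine.
TRANSITIONS = {
    "start": {"pre": "prefix", "dgt": "number", "ltr": "word",
              "mid": "single", "post": "single", "other": "single"},
    "prefix": {"dgt": "number"},
    "number": {"dgt": "number", "mid": "number_mid",
               "post": "suffix", "ltr": "word"},
    "number_mid": {"dgt": "number"},  # a mark inside a number needs a digit next
    "word": {"ltr": "word", "dgt": "number"},
    "suffix": {},
    "single": {},
}
ACCEPTING = {"prefix", "number", "word", "suffix", "single"}

def next_boundary(text, pos):
    """Run the machine from pos and return the last accepting position."""
    state, last_accept, i = "start", pos, pos
    while i < len(text):
        nxt = TRANSITIONS[state].get(char_class(text[i]))
        if nxt is None:
            break
        state = nxt
        i += 1
        if state in ACCEPTING:
            last_accept = i
    return last_accept

def segments(text):
    out, pos = [], 0
    while pos < len(text):
        end = next_boundary(text, pos)
        out.append(text[pos:end])
        pos = end
    return out

print(segments("$1,234.56..."))  # ['$1,234.56', '.', '.', '.']
```

After "56" the machine consumes the first period of the ellipsis and enters number_mid, which is not accepting. The next character is not a digit, so the machine stops and reports the boundary after "56", the last accepting position.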


Limitations

1992–1996

–1996


Automatic table building

  • If not otherwise mentioned, each character is a “word” unto itself.

  • A run of letters constitutes a “word” and is kept together. Certain punctuation marks may appear inside a word, but only if they have a letter on each side.

  • A run of digits constitutes a “number” and is kept together. Certain punctuation marks may appear inside a number, but only if they have a digit on each side. In addition, a number may have certain optional prefix and suffix characters.

  • If a “word” and a “number” appear in succession with nothing between them, they’re kept together.


Automatic table building

let=[:L:];

dgt=[:N:];

mid-word=[[:Pd:]\”\’\.];

mid-num=[\”\’\.\,];

pre-num=[[[:Sc:]\#\.]-[¢]];

post-num=[\%\&¢];

word=({let}+({mid-word}{let}+)*);

number=({dgt}+({mid-num}{dgt}+)*);

{word}?({number}{word})*({number}{post-num}?)?;

{pre-num}({number}{word})*({number}{post-num}?)?;
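As a rough illustration, the rules above can be approximated with Python's re module. This is a sketch under assumptions, not the rule compiler the slides describe: the Unicode categories are replaced by simple stand-ins ([:L:] by `[^\W\d_]`, [:N:] by `\d`, [:Pd:] by an ASCII hyphen, [:Sc:] by a handful of currency signs), and the tokenizer simply tries both top-level rules at each position and keeps the longer match.

```python
import re

# Stand-ins for the character classes in the rules (approximations).
LET = r"[^\W\d_]"
DGT = r"\d"
MID_WORD = r"""[-"'’.]"""
MID_NUM = r"""["'’.,]"""
PRE_NUM = r"[$£€¥#.]"
POST_NUM = r"[%&¢]"

WORD = f"{LET}+(?:{MID_WORD}{LET}+)*"
NUMBER = f"{DGT}+(?:{MID_NUM}{DGT}+)*"

RULES = [
    re.compile(f"(?:{WORD})?(?:{NUMBER}{WORD})*(?:{NUMBER}{POST_NUM}?)?"),
    re.compile(f"{PRE_NUM}(?:{NUMBER}{WORD})*(?:{NUMBER}{POST_NUM}?)?"),
]

def words(text):
    """Split text into 'words' using the longest match among the rules."""
    out, pos = [], 0
    while pos < len(text):
        end = pos + 1  # default: a single character is a word unto itself
        for rule in RULES:
            m = rule.match(text, pos)
            if m and m.end() > end:
                end = m.end()
        out.append(text[pos:end])
        pos = end
    return out

print(words("You owe me $1,234.56... I think."))
```

On the running example this keeps "$1,234.56" together and leaves each period of the ellipsis as its own piece, because a mid-number mark is only absorbed when a digit follows it.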


Automatic table building

  • All regular-expression rules have equal precedence

  • The “winning” rule is decided using a longest-possible-match algorithm (except in certain well-defined cases)

  • Our build algorithm parses the regular expressions, builds the state table, and makes sure it’s deterministic in a single pass


Sentence-break rules

.*?{term}[{term}{period}{end}]*{space}*;

.*?{period}[{period}{end}]*{space}*/{start}*{sent-start};
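The two rules can be tried out with a rough Python translation. The slides do not show the definitions of {term}, {period}, {end}, {space}, {start}, or {sent-start}, so the classes below are guesses, and the `/` is read here as "break at this point, but only if what follows matches", which is expressed as a regex lookahead. Because both rules begin with a non-greedy `.*?`, this sketch picks the earliest boundary that either rule proposes; that selection policy is also an assumption.

```python
import re

# Assumed class definitions (not given in the slides).
TERM = r"[?!]"
PERIOD = r"\."
END = r"""["'”’)\]]"""      # closing quotes and brackets
SPACE = r"\s"
START = r"""["'“‘(\[]"""    # opening quotes and brackets
SENT_START = r"[A-Z]"       # stand-in for "can start a sentence"

RULE_TERM = re.compile(
    rf"(?s).*?{TERM}(?:{TERM}|{PERIOD}|{END})*{SPACE}*")
RULE_PERIOD = re.compile(
    rf"(?s).*?{PERIOD}(?:{PERIOD}|{END})*{SPACE}*(?={START}*{SENT_START})")

def sentences(text):
    """Split text at the earliest boundary proposed by either rule."""
    out, pos = [], 0
    while pos < len(text):
        end = len(text)  # no rule matches: the rest is one sentence
        for rule in (RULE_TERM, RULE_PERIOD):
            m = rule.match(text, pos)
            if m and m.end() < end:
                end = m.end()
        out.append(text[pos:end])
        pos = end
    return out

for s in sentences('He asked, “How tall are you?” I’m about 6 ft. tall. “Wow!”'):
    print(repr(s))
```

The period after "ft" is followed by a lowercase letter, so the lookahead fails and no break is placed there. The period after "tall" is followed by an opening quote and a capital letter, so it ends the sentence.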


Ignore characters

$ignore=[[:Mn:][:Me:][:Cf:]];
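One way to picture the effect of $ignore, sketched in Python with the standard unicodedata module: characters in the Mn, Me, and Cf categories are hidden from the state machine, while the offsets of the remaining characters still refer to the original text. The helper name `significant` is made up for this example.

```python
import unicodedata

IGNORE = {"Mn", "Me", "Cf"}  # non-spacing marks, enclosing marks, format chars

def significant(text):
    """Return (offset, char) pairs for the characters the rules should see."""
    return [(i, ch) for i, ch in enumerate(text)
            if unicodedata.category(ch) not in IGNORE]

# "e" followed by a combining acute accent (Mn), then "a":
print(significant("e\u0301a"))  # [(0, 'e'), (2, 'a')]
```

Since the accent is skipped rather than deleted, a boundary can never be reported between a base character and its combining mark.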


Surrogate support

kanji=[\u4e00-\u9fff\udb80-\udb83];

$ignore=[[:Mn:][:Me:][:Cf:]\udc00-\udcff];


Random-access iteration

You owe me $1,234.56... I think.


Random-access iteration

!{sent-start}{start}*{space}*{end}*{period};

![{sent-start}{lc}{digit}]{start}*{space}*{end}*{term};
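The general shape of random access can be sketched in Python: back up from the requested offset to a position that is certain to be a boundary, then iterate forward until the offset has been passed. The rules above do the backing-up step for sentences. The sketch below is a simplified word-level stand-in, and both the tokenizer pattern and the "previous whitespace" safe point are assumptions made for the example.

```python
import re

# Simplified forward rule: a number with optional "$", a run of letters,
# or any single character.
TOKEN = re.compile(r"\$?\d+(?:[.,]\d+)*|[^\W\d_]+|.", re.S)

def safe_start(text, offset):
    """Back up to just after the previous whitespace (a sure boundary here)."""
    i = offset
    while i > 0 and not text[i - 1].isspace():
        i -= 1
    return i

def following(text, offset):
    """Return the first boundary strictly after offset."""
    pos = safe_start(text, offset)
    while pos <= offset and pos < len(text):
        pos = TOKEN.match(text, pos).end()
    return pos

text = "You owe me $1,234.56... I think."
print(following(text, 15))  # 20: the end of "$1,234.56"
```

Starting the forward scan at offset 15 itself would land in the middle of the number and could report a boundary that a scan from the beginning of the text would never produce. Backing up first avoids that.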


Dictionary-based iteration

We hold these truths to be self-evident: that all men are created equal, that they are endowed by their Creator with certain unalienable rights, that among these are Life, Liberty, and the Pursuit of Happiness.


Dictionary-based iteration

Weholdthesetruthstobeself-evident:thatallmenarecreatedequal,thattheyareendowedbytheirCreatorwithcertainunalienablerights,thatamongtheseareLife,Liberty,andthePursuitofHappiness.


Dictionary-based iteration

$dictionary=[A-Za-z\-\’];


Dictionary-based iteration

themendinetonight
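A small Python sketch of dictionary-based division with backtracking, using this string. The word list is hypothetical and the strategy (try the longest dictionary word first, back up to a shorter one when the rest cannot be divided) is one plausible approach, not necessarily the exact algorithm shown in the animation.

```python
# Hypothetical word list for the example.
DICTIONARY = {
    "the", "them", "me", "men", "mend", "end", "in",
    "din", "dine", "net", "to", "tonight", "night",
}

def segment(text, start=0):
    """Return a list of dictionary words covering text[start:], or None."""
    if start == len(text):
        return []
    # Try the longest candidate first; fall back to shorter ones on failure.
    for end in range(len(text), start, -1):
        word = text[start:end]
        if word in DICTIONARY:
            rest = segment(text, end)
            if rest is not None:
                return [word] + rest
    return None

print(segment("themendinetonight"))  # ['the', 'men', 'dine', 'tonight']
```

The first attempt, "them" followed by "end" and "in", gets stuck at "etonight", so the search backs up and tries "the". Then "mend" fails the same way, and "men" finally lets the rest divide cleanly.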

