
Corpus-based evaluation of prosodic phrase break prediction

Claire Brierley and Eric Atwell

School of Computing, University of Leeds

Corpus Linguistics 2007, University of Birmingham

Prosody and prosodic phrase breaks

In the popular mythology the computer is a mathematics machine: it is designed to do numerical calculations. Yet it is really a language machine: its fundamental power lies in its ability to manipulate linguistic tokens - symbols to which meaning has been assigned.

Terry Winograd, 1984

Punctuation is a way of ‘annotating’ phrase breaks in text..

In the popular mythology the computer is a mathematics machine: it is designed to do numerical calculations. Yet it is really a language machine: its fundamental power lies in its ability to manipulate linguistic tokens - symbols to which meaning has been assigned.

Terry Winograd, 1984

..and is therefore one text-based feature used in automatic phrase break prediction

In the popular mythology the computer is a mathematics machine | it is designed to do numerical calculations | Yet it is really a language machine | its fundamental power lies in its ability to manipulate linguistic tokens | symbols to which meaning has been assigned |

Terry Winograd, 1984
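
The substitution shown on this slide can be sketched in a few lines of Python. This is only an illustration of treating punctuation as a break marker; the function name `punctuation_to_breaks` and the choice of punctuation characters are assumptions, not part of the original talk.

```python
# Characters treated as break-signalling punctuation (an assumed, minimal set).
PUNCTUATION = ".:;,!?"

def punctuation_to_breaks(text):
    """Replace punctuation (and free-standing dashes) with the '|' break marker."""
    out = []
    for token in text.split():
        if token == "-":
            out.append("|")              # a spaced dash counts as a break
            continue
        word = token.rstrip(PUNCTUATION)
        if word:
            out.append(word)
        if word != token:
            out.append("|")              # trailing punctuation becomes a break
    return " ".join(out)

sample = "a mathematics machine: it is designed. Yet tokens - symbols"
print(punctuation_to_breaks(sample))
```

Applied to the Winograd quotation, this yields the same five break positions marked on the slide.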

Positional syntactic features: n-grams

<NN><VBN><NP>

Once upon a time | there will be a little girl called Uncumber. | Uncumber will have a younger brother called Sulpice | and they will live with their parents | in a house in the middle of the woods. |

upon a time = trigram where we expect a boundary next

the middle of = trigram which might include a boundary

live with = bigram which might include a boundary

girl called = bigram where we might have a boundary next, and which might also include a boundary…
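
As a rough sketch of how such positional n-grams might be collected from boundary-annotated text, the function below gathers the n items that immediately precede each break marker. The function name and the word-level input are illustrative assumptions; the same code works unchanged on a sequence of PoS tags.

```python
def ngrams_before_breaks(tokens, n, marker="|"):
    """Collect the n-grams that immediately precede each break marker."""
    words, found = [], []
    for token in tokens:
        if token == marker:
            if len(words) >= n:
                found.append(tuple(words[-n:]))
        else:
            words.append(token)
    return found

text = "Once upon a time | there will be a little girl called Uncumber |"
trigrams = ngrams_before_breaks(text.split(), 3)
print(trigrams)
```

Counting how often each n-gram occurs before a break, compared with how often it occurs overall, would be one way to turn these observations into features.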

Some top class phrase break models

There are 2 generic approaches:

Deterministic or rule-based: chink chunk or CFP

(Liberman & Church, 1992)

They will live | with their parents | in a house | in the middle | of the woods |

Probabilistic or statistical: e.g. as used in Festival (CSTR)

(Taylor & Black, 1998)

79% breaks-correct on MARSEC (Roach, P. et al, 1993)
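
The chink chunk rule above can be sketched as follows: a break is placed wherever a content word (chunk) is followed by a function word (chink), and after a sentence-final content word. The `CHINKS` set is a hypothetical word list chosen only to cover the example sentence; a real system would decide chink or chunk from PoS tags.

```python
# Hypothetical function-word ("chink") list, for illustration only.
CHINKS = {"they", "will", "with", "their", "in", "a", "the", "of"}

def chink_chunk_breaks(words):
    """Insert '|' after each chunk word that is followed by a chink (or ends the input)."""
    out = []
    for i, word in enumerate(words):
        out.append(word)
        is_chunk = word.lower() not in CHINKS
        is_last = i == len(words) - 1
        next_is_chink = not is_last and words[i + 1].lower() in CHINKS
        if is_chunk and (next_is_chink or is_last):
            out.append("|")
    return " ".join(out)

sentence = "They will live with their parents in a house in the middle of the woods"
print(chink_chunk_breaks(sentence.split()))
```

With this word list the output reproduces the phrasing shown for the rule-based approach.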

Shallow or chunk parsing

Source: http://ironcreek.net/phpsyntaxtree/

[S [PP [IN In] [NP [AT the] [JJ popular] [NN mythology]]][NP [AT the] [NN computer]] [VP [BEZ is] [NP [AT a] [NN mathematics] [NN machine.]]]]

In the popular mythology | the computer is a mathematics machine.

Chunk parse rule - using NLTK version 0.6:

parse.ChunkRule('<IN><IN|DT|DTI|AT|AP|CD|OD|PPO|PN|POSS|JJ|JJT|JJS|NP|NN|NNS>+', "<Chunk a preposition> <with sequences of other prepositions, determiners, numbers, certain pronouns, adjectives and nouns - and these can be in any order>")
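
To show what a tag pattern of this kind matches without depending on any particular NLTK release, here is a standard-library sketch that encodes the tag sequence as a string and applies an ordinary regular expression. It is an emulation under stated assumptions, not the NLTK implementation; `chunk_spans` and the example sentence with its tags are hypothetical.

```python
import re

# Tag alternatives copied from the chunk rule on the slide.
FOLLOWERS = "IN|DT|DTI|AT|AP|CD|OD|PPO|PN|POSS|JJ|JJT|JJS|NP|NN|NNS"
# Each tag is written as "<TAG>", so the rule becomes a plain regex over that string.
RULE = re.compile(r"<IN>(?:<(?:%s)>)+" % FOLLOWERS)

def chunk_spans(tagged):
    """Return (start, end) token index pairs for every match of the rule."""
    encoded = "".join("<%s>" % tag for _, tag in tagged)
    spans = []
    for match in RULE.finditer(encoded):
        start = encoded[:match.start()].count("<")   # tokens before the match
        end = start + match.group().count("<")       # tokens inside the match
        spans.append((start, end))
    return spans

tagged = [("She", "PPS"), ("walked", "VBD"), ("to", "IN"),
          ("the", "AT"), ("old", "JJ"), ("house", "NN")]
for start, end in chunk_spans(tagged):
    print(" ".join(word for word, _ in tagged[start:end]))
```

Because the `+` is greedy and the follower set itself contains `IN`, a match runs on until it meets a tag outside the set.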

The classification task

Task: to classify junctures between words

Train the model on “gold standard” speech corpus:

  • training data: PoS tags + boundary tags

Test the model:

  • unseen test set
  • quantitative metrics
  • % boundaries correct?
  • % insertion & deletion errors?

Model type: deterministic or probabilistic?

  • break or non-break?
  • rules or features?
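
The slide does not define the metrics, so the following is a minimal sketch under one assumed convention: each juncture is labelled 'B' (break) or 'N' (non-break), a deletion is a gold break the model missed, and an insertion is a break predicted where the gold standard has none. The function name and the choice of denominators are assumptions.

```python
def juncture_scores(gold, predicted):
    """Score predicted juncture labels ('B' or 'N') against gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same junctures")
    n = len(gold)
    breaks = sum(1 for g in gold if g == "B")
    if breaks == 0:
        raise ValueError("gold standard contains no breaks")
    pairs = list(zip(gold, predicted))
    deletions = sum(1 for g, p in pairs if g == "B" and p == "N")
    insertions = sum(1 for g, p in pairs if g == "N" and p == "B")
    return {
        "breaks_correct": (breaks - deletions) / breaks,
        "junctures_correct": (n - deletions - insertions) / n,
        "insertion_rate": insertions / n,       # assumed: relative to all junctures
        "deletion_rate": deletions / breaks,    # assumed: relative to gold breaks
    }

scores = juncture_scores(list("NNBNNB"), list("NBBNNN"))
print(scores)
```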

Variant phrasing strategies and templates

Gold standard corpus version has lots of major boundaries

Given the state of lawlessness | that exists in Lebanon || the uninformed outsider might reasonably expect security | at Beirut airport || to be amongst the tightest in the world || but the opposite is true ||

Rule-based variant

Given the state | of lawlessness | that exists | in Lebanon the uninformed outsider | might reasonably expect security | at Beirut airport | to be | amongst the tightest in the world | but the opposite is true |

Score on this sentence: Recall = 83.33%; Precision = 55.55%

Aix-MARSEC Corpus: annotated transcript of 1980s BBC news commentary
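
The recall and precision figures for this sentence can be checked with a short script. It is a sketch that assumes minor (|) and major (||) boundaries both count simply as a break, which is the reading that reproduces the slide's figures; the function names are illustrative.

```python
def break_positions(text):
    """Return the set of word counts after which a break marker occurs."""
    positions, n_words = set(), 0
    for token in text.split():
        if token in ("|", "||"):
            positions.add(n_words)      # break follows the n_words-th word
        else:
            n_words += 1
    return positions

def precision_recall(gold, predicted):
    g, p = break_positions(gold), break_positions(predicted)
    correct = len(g & p)
    return correct / len(p), correct / len(g)

GOLD = ("Given the state of lawlessness | that exists in Lebanon || the "
        "uninformed outsider might reasonably expect security | at "
        "Beirut airport || to be amongst the tightest in the world || but the "
        "opposite is true ||")
RULE_BASED = ("Given the state | of lawlessness | that exists | in Lebanon "
              "the uninformed outsider | might reasonably expect security | at "
              "Beirut airport | to be | amongst the tightest in the world | but the "
              "opposite is true |")

precision, recall = precision_recall(GOLD, RULE_BASED)
print(f"Recall = {recall:.2%}; Precision = {precision:.2%}")
```

The rule-based variant finds 5 of the 6 gold breaks (recall 5/6) and 5 of its 9 predicted breaks are correct (precision 5/9), which agrees with the quoted scores apart from the rounding of 5/9.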

Variant phrasing strategies and templates

Gold standard corpus version has lots of major boundaries

Given the state of lawlessness | that exists in Lebanon || the uninformed outsider might reasonably expect security | at Beirut airport || to be amongst the tightest in the world || but the opposite is true ||

Intuitive prosodic phrasing

Given the state of lawlessness that exists in Lebanon | the uninformed outsider | might reasonably expect | security | at Beirut airport | to be amongst the tightest in the world | but the opposite is true |

Score on this sentence: Recall = 83.33%; Precision = 71.43%

“..the very notion of evaluating a phrase-break model against a gold standard is problematic as long as the gold standard only represents one out of the space of all acceptable phrasings..” (Atterer and Klein, 2002)

Current work: developing a prosody lexicon

incoming corpus text

  • already PoS-tagged
  • format: list of tuples
  • [..('gone', 'VBN'),..]

intersection with Python dictionary

  • get some more tags
  • e.g. CFP, stress pattern
  • [..('gone', 'VBN', 'C', '1'),..]
  • these tags are text-based features

Sources used:

  • Computer-usable dictionary CUVPlus (Pedler, 2002) - incorporates C5 PoS tags
  • Lexical stress patterns derived from CELEX2 database (Baayen et al, 1995) and Carnegie-Mellon Pronouncing dictionary (CMU, 1998)
Lexicon fields - and lookup
  • Python dictionary syntax stores the above information as (key, value) pairs

{ ('cascades', 'NN2') : ['0', "k&'skeIdz", 'Kj%', 'NN2:1', '2', '01', 'C'],
  ('cascades', 'VVZ') : ['0', "k&'skeIdz", 'Ia%', 'VVZ:-1', '2', '01', 'C'] }

  • Incoming corpus text - also in the form of (token, tag) tuples - can be matched against dictionary keys
  • Thus intersection enables corpus text to accumulate additional values which have the potential to become features for machine learning tasks
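
A minimal sketch of the lookup described above, using the two 'cascades' entries from the slide. Which position in each value list holds which field is an assumption here: the last item is taken to be the CFP tag and the one before it the stress pattern. The function name `enrich` is hypothetical.

```python
LEXICON = {
    ("cascades", "NN2"): ["0", "k&'skeIdz", "Kj%", "NN2:1", "2", "01", "C"],
    ("cascades", "VVZ"): ["0", "k&'skeIdz", "Ia%", "VVZ:-1", "2", "01", "C"],
}

def enrich(tagged_tokens, lexicon):
    """Extend each (token, tag) tuple with lexicon values when the key is found."""
    enriched = []
    for token, tag in tagged_tokens:
        entry = lexicon.get((token.lower(), tag))
        if entry is None:
            enriched.append((token, tag))             # no match: leave unchanged
        else:
            cfp, stress = entry[-1], entry[-2]        # assumed field positions
            enriched.append((token, tag, cfp, stress))
    return enriched

result = enrich([("cascades", "VVZ"), ("gone", "VBN")], LEXICON)
print(result)
```

Keying the dictionary on the (token, tag) pair is what lets the noun and verb readings of the same word form carry different values.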
What I’d like to achieve
  • Develop phrase break predictors representative of two generic approaches - rule-based and probabilistic - and compare their performance.
  • Use the WEKA toolkit plus training data from the Aix-MARSEC corpus (Auran et al, 2004), which has linguistically sophisticated prosodic annotations, to explore a new mix of features for machine learning of phrase break prediction. This is where the prosody lexicon comes in.
  • Develop a purpose-built corpus of different text genres and different annotation schemes to moderate the process of evaluating these phrase break models against one prosodic template.
  • If I can develop a good model, then a possible contribution to the Aix-MARSEC project may be to enrich this gold standard by generating alternative prosodic markup to the corpus linguists’ analysis. Outputs from the model would potentially represent legitimate, variant phrasing strategies to those already uncovered and provide new prosodic templates for the evaluation of phrase break models.
Example problem - still working on it!
Input text: list of (token, tag) tuples

[.., ('that', 'CS'), ('individual', 'JJ'), ('willingness', 'NN'), ('to', 'TO'), ('pay', 'VB'), ('should', 'MD'), ('be', 'BE'), ('the', 'ATI'), ('main', 'JJB'), ('test', 'NN'), ('of', 'IN'), ('how', 'WRB'), ('resources', 'NNS'), ('are', 'BER'), ('used', 'VBN'), ('.', '.'), ..]

SEC: annotated transcript of Reith Lecture

Input text is temporarily tagged with C5 for lexicon lookup

Mapping C5 → LOB is usually a case of one-to-many

However, C5 has separate tags for ‘that’ and ‘of’ - a case of many-to-one

CJS (subordinating conjunction) or CJT (that) → CS and PRP (preposition) or PRF (of) → IN

Need to resolve this to accomplish Python dictionary lookup (preferred option) or use different lookup mechanism (hopefully not!)

Problem compounded with introduction of different PoS tag sets as consequence of planned composite test corpus
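
One possible way round the many-to-one cases, offered only as a sketch and not as the solution adopted in this work: keep a table of candidate C5 tags for each ambiguous LOB tag and try each candidate as a dictionary key. The table uses the two cases named above; the lexicon entries and their values are placeholders, and falling back to the LOB tag itself is an assumption.

```python
# Candidate C5 tags for each ambiguous LOB tag (the two cases named above).
LOB_TO_C5 = {"CS": ["CJS", "CJT"], "IN": ["PRP", "PRF"]}

# Placeholder lexicon entries; the values are not real lexicon data.
LEXICON = {("that", "CJT"): ["placeholder"], ("of", "PRF"): ["placeholder"]}

def lookup(token, lob_tag, lexicon):
    """Try each candidate C5 tag in turn; return (key, value) or None."""
    for c5_tag in LOB_TO_C5.get(lob_tag, [lob_tag]):   # assumed fallback: tag as-is
        key = (token.lower(), c5_tag)
        if key in lexicon:
            return key, lexicon[key]
    return None

print(lookup("that", "CS", LEXICON))
print(lookup("of", "IN", LEXICON))
```

This keeps the plain dictionary lookup, at the cost of maintaining one candidate table per source tag set.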
