
Computational Tools for Linguists

Inderjeet Mani

Georgetown University

[email protected]



Topics

  • Computational tools for

    • manual and automatic annotation of linguistic data

    • exploration of linguistic hypotheses

  • Case studies

  • Demonstrations and training

  • Inter-annotator reliability

  • Effectiveness of annotation scheme

  • Costs and tradeoffs in corpus preparation


Outline

Topics

Concordances

Data sparseness

Chomsky’s Critique

Ngrams

Mutual Information

Part-of-speech tagging

Annotation Issues

Inter-Annotator Reliability

Named Entity Tagging

Relationship Tagging

Case Studies

metonymy

adjective ordering

Discourse markers: then

TimeML




Corpus Linguistics

  • Use of linguistic data from corpora to test linguistic hypotheses => emphasizes language use

  • Uses computers to do the searching and counting from on-line material

    • Faster than doing it by hand! Check?

  • Most typical tool is a concordancer, but there are many others!

  • Tools can analyze a certain amount; the rest is left to the human!

  • Corpus Linguistics is also a particular approach to linguistics, namely an empiricist approach

    • Sometimes (extreme view) opposed to the rationalist approach, at other times (more moderate view) viewed as complementary to it

    • Cf. Theoretical vs. Applied Linguistics



Empirical Approaches in Computational Linguistics

  • Empiricism – the doctrine that knowledge is derived from experience

  • Rationalism: the doctrine that knowledge is derived from reason

  • Computational Linguistics is, by necessity, focused on ‘performance’, in that naturally occurring linguistic data has to be processed

    • Naturally occurring data is messy! This means we have to process data characterized by false starts, hesitations, elliptical sentences, long and complex sentences, input that is in a complex format, etc.

  • The methodology used is corpus-based

    • linguistic analysis (phonological, morphological, syntactic, semantic, etc.) carried out on a fairly large scale

    • rules are derived by humans or machines from looking at phenomena in situ (with statistics playing an important role)



Example: metonymy

  • Metonymy: substituting the name of one referent for another

    • George W. Bush invaded Iraq

    • A Mercedes rear-ended me

  • Is metonymy involving institutions as agents more common in print news than in fiction?

    • “The X V-reporting” (the X followed by a reporting verb)

      • Let’s start with: “The X said”

        • This pattern will provide a “handle” to identify the data
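
To make the “handle” idea concrete, here is a minimal Python sketch of such a pattern search; the reporting-verb list and the sample text are illustrative assumptions, not part of the original case study.

    import re

    # Hypothetical reporting verbs; "said" is the one used on the slide.
    REPORTING_VERBS = r"(said|announced|reported|denied)"

    # Matches e.g. "The government said", capturing the X in "The X V_reporting".
    pattern = re.compile(r"\bThe ([A-Z]?[a-z]+) " + REPORTING_VERBS)

    def find_candidates(text):
        """Return (X, verb) pairs that may instantiate 'The X V_reporting'."""
        return pattern.findall(text)

    sample = "The company said it would appeal. The children said hello."
    print(find_candidates(sample))
    # Each hit still has to be judged by a human: 'company' is a metonymic
    # institutional agent, 'children' is a literal one.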



Exploring Corpora

  • Datasets

    http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/DataSets.cgi

  • Metonymy Test using Corpora

    http://complingtwo.georgetown.edu/~gwilson/Tools/Metonymy/TheXSaid_MST.html



‘The X said’ from Concordance data

The preference for metonymy in print news arises because of the need to communicate information from companies and governments.



Chomsky’s Critique of Corpus-Based Methods

1. Corpora model performance, while linguistics is aimed at the explanation of competence

If you define linguistics that way, linguistic theories will never be able to deal with actual, messy data

Many linguists don’t find the competence-performance distinction to be clear-cut. Sociolinguists have argued that the variability of linguistic performance is systematic, predictable, and meaningful to speakers of a language.

Grammatical theories vary in where they draw the line between competence and performance, with some grammars (such as Halliday’s Systemic Grammar) organized as systems of functionally-oriented choices.



Chomsky’s Critique (concluded)

2. Natural language is in principle infinite, whereas corpora are finite, so many examples will be missed

Excellent point, which needs to be understood by anyone working with a corpus.

But does that mean corpora are useless?

  • Introspection is unreliable (prone to performance factors, cf. only short sentences), and pretty useless with child data.

  • Also, insights from a corpus might lead to generalization/induction beyond the corpus, if the corpus is a good sample of the “text population”

    3. Ungrammatical examples won’t be available in a corpus

    Depends on the corpus, e.g., spontaneous speech, language learners, etc.

    The notion of grammaticality is not that clear

    • Who did you see [pictures/?a picture/??his picture/*John’s picture] of?

    • ARG/ADJUNCT example



Which Words are the Most Frequent?

Common Words in Tom Sawyer (71,370 words), from

Manning & Schutze p.21

Will these counts hold in a different corpus (and genre, cf. Tom)?

What happens if you have 8-9M words? (check usage demo!)


Data Sparseness

Many low-frequency words, fewer high-frequency words.

Only a few words will have lots of examples.

About 50% of word types occur only once; over 90% occur 10 times or less.

So, there is merit to Chomsky’s 2nd objection.

Frequency of word types in Tom Sawyer, from M&S p. 22.



Zipf’s Law: Frequency is inversely proportional to rank

Empirical evaluation of Zipf’s Law on Tom Sawyer, from M&S 23.
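
Zipf’s law can be checked on any plain-text corpus with a few lines of code. The sketch below is a rough illustration (the file name is a placeholder); the product f * r will only be approximately constant, as the empirical plots show.

    from collections import Counter
    import re

    def zipf_table(path, top=10):
        """Print rank, frequency, and rank*frequency for the most common words."""
        with open(path, encoding="utf-8") as f:
            words = re.findall(r"[a-z']+", f.read().lower())
        counts = Counter(words)
        for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
            print(f"{rank:4d}  {word:12s} {freq:6d}  f*r={rank * freq}")

    # zipf_table("tom_sawyer.txt")   # hypothetical file name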



Illustration of Zipf’s Law (Brown Corpus, from M&S p. 30)

(both axes are on a logarithmic scale)

  • See also http://www.georgetown.edu/faculty/wilsong/IR/WordDist.html



Tokenizing words for corpus analysis

  • 1. Break on

    • Spaces? 犬をぶった男の子は弟だ。

      inu o butta otokonoko wa otooto da (Japanese is written without spaces)

    • Periods? (U.K. Products)

    • Hyphens? data-base = database = data base

    • Apostrophes? won’t, couldn’t, O’Riley, car’s

  • 2. should different word forms be counted as distinct?

    • Lemma: a set of lexical forms having the same stem, the same pos, and the same word-sense. So, cat and cats are the same lemma.

    • Sometimes, words are lemmatized by stemming, other times by morphological analysis, using a dictionary and/or morphological rules

  • 3. fold case or not (usually folded)?

    • The, the, THE; Mark versus mark

    • One may need, however, to regenerate the original case when presenting it to the user
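
A minimal sketch of a tokenizer that makes some of the decisions above explicit (splitting clitics, keeping abbreviations like U.K. together, optional case folding). It is only an illustration, not a production tokenizer.

    import re

    def tokenize(text, fold_case=True):
        """Very rough English tokenizer: splits off n't and 's clitics,
        keeps abbreviations like U.K. together, keeps hyphenated words
        as single tokens, and optionally folds case."""
        text = re.sub(r"(\w)n't\b", r"\1 n't", text)     # won't -> wo n't
        text = re.sub(r"(\w)'s\b", r"\1 's", text)       # car's -> car 's
        tokens = re.findall(r"[A-Za-z]\.(?:[A-Za-z]\.)+|n't|'s|[\w-]+|[^\w\s]", text)
        return [t.lower() for t in tokens] if fold_case else tokens

    print(tokenize("The U.K. products won't fit in the data-base, Mark said."))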



Counting: Word Tokens vs Word Types

  • Word tokens in Tom Sawyer: 71,370

  • Word types: (i.e., how many different words) 8,018

  • In newswire text with that number of tokens, you would have about 11,000 word types, perhaps because Tom Sawyer is written in a simple style.



Inspecting word frequencies in a corpus

  • http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/DataSets.cgi

  • Usage demo:

    • http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/Usage.cgi



Ngrams

  • Sequences of linguistic items of length n

  • See count.pl
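
count.pl presumably refers to an n-gram counting script; a roughly equivalent minimal sketch in Python:

    from collections import Counter

    def ngrams(tokens, n):
        """Return a frequency count of all n-grams (as tuples) in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    tokens = "the cat sat on the mat the cat slept".split()
    print(ngrams(tokens, 2).most_common(3))
    # [(('the', 'cat'), 2), ...] -- longer n-grams get rarer quickly (data sparseness)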



A test for association strength: Mutual Information

Data from (Church et al. 1991)

1988 AP corpus; N=44.3M



Interpreting Mutual Information

  • High scores, e.g., strong supporter (8.85), indicate that the pair is strongly associated in the corpus

    MI is a logarithmic score. To convert it, recall that X = 2^(log2 X),

    so 2^8.85 ≈ 461.44. So this is about 461 times chance.

  • Low scores – powerful support (1.74): this is about 3 times chance, since 2^1.74 ≈ 3.3

    I       f(x,y)   f(x)    f(y)     x          y
    1.74    2        1,984   13,428   powerful   support

    I = log2 ( N * f(x,y) / (f(x) * f(y)) ) = log2 ( 2N / (1984 * 13428) ) = 1.74

  • So, a low score doesn’t necessarily mean weakly associated – it could be due to data sparseness
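
The arithmetic above follows directly from the definition I(x, y) = log2( N * f(x,y) / (f(x) * f(y)) ). The sketch below recomputes the powerful/support row, using N = 44.3M as on the earlier slide:

    import math

    def mutual_information(f_xy, f_x, f_y, n):
        """Pointwise mutual information, in bits, of a word pair."""
        return math.log2(n * f_xy / (f_x * f_y))

    N = 44_300_000                      # 1988 AP corpus size from the slide
    mi = mutual_information(2, 1984, 13428, N)
    print(round(mi, 2), "->", round(2 ** mi, 1), "times chance")
    # about 1.73 (the slide rounds to 1.74) -> roughly 3.3 times chance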



Mutual Information over Grammatical Relations

  • Parse a corpus

  • Determine subject-verb-object triples

  • Identify head nouns of subject and object NPs

  • Score subj-verb and verb-obj associations using MI



Demo of Verb-Subj, Verb-Obj Parses

  • Who devours, or what gets devoured?

  • Demo: http://www.cs.ualberta.ca/~lindek/demos/depindex.htm



MI over verb-obj relations

  • Data from (Church et al. 1991)



A Subj-Verb MI Example: Who does what in news?

executive            police                politician

reprimand 16.36      shoot 17.37           clamor 16.94
conceal 17.46        raid 17.65            jockey 17.53
bank 18.27           arrest 17.96          wrangle 17.59
foresee 18.85        detain 18.04          woo 18.92
conspire 18.91       disperse 18.14        exploit 19.57
convene 19.69        interrogate 18.36     brand 19.65
plead 19.83          swoop 18.44           behave 19.72
sue 19.85            evict 18.46           dare 19.73
answer 20.02         bundle 18.50          sway 19.77
commit 20.04         manhandle 18.59       criticize 19.78
worry 20.04          search 18.60          flank 19.87
accompany 20.11      confiscate 18.63      proclaim 19.91
own 20.22            apprehend 18.71       annul 19.91
witness 20.28        round 18.78           favor 19.92

Data from (Schiffman et al. 2001)



‘Famous’ Corpora

  • Must see: http://www.ldc.upenn.edu/Catalog/

  • Brown Corpus

  • British National Corpus

  • International Corpus of English

  • Penn Treebank

  • Lancaster-Oslo-Bergen Corpus

  • Canadian Hansard Corpus

  • U.N. Parallel Corpus

  • TREC Corpora

  • MUC Corpora

  • English, Arabic, Chinese Gigawords

  • Chinese and Arabic Treebanks

  • North American News Text Corpus

  • Multext East Corpus – ‘1984’ in multiple Eastern/Central European languages



Links to Corpora

  • Corpora:

    • Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/

    • Oxford Text Archive http://sable.ox.ac.uk/ota/

    • Project Gutenberg http://www.promo.net/pg/

    • CORPORA list http://www.hd.uib.no/corpora/archive.html

  • Other:

    • Chris Manning’s Corpora Page

    • http://www-nlp.stanford.edu/links/statnlp.html#Corpora

    • Michael Barlow’s Corpus Linguistics page http://www.ruf.rice.edu/~barlow/corpus.html

    • Cathy Ball’s Corpora tutorial http://www.georgetown.edu/faculty/ballc/corpora/tutorial.html



Summary: Introduction

  • Concordances and corpora are widely used and available, to help one to develop empirically-based linguistic theories and computer implementations

  • The linguistic items that can be counted are many, but “words” (defined appropriately) are basic items

  • The frequency distribution of words in any natural language is Zipfian

    • Data sparseness is a basic problem when using observations in a corpus sample of language

  • Sequences of linguistic items (e.g., word sequences – n-grams) can also be counted, but the counts will be very rare for longer items

  • Associations between items can be easily computed

    • e.g., associations between verbs and parser-discovered subjs or objs





Using POS in Concordances

deal is more often a verb in Fiction 2000

deal is more often a noun in English Gigaword

deal is more prevalent in Fiction 2000 than in Gigaword



POS Tagging – What is it?

  • Given a sentence and a tagset of lexical categories, find the most likely tag for each word in the sentence

  • Tagset – e.g., Penn Treebank (45 tags, derived from the 87-tag Brown corpus tagset)

  • Note that many of the words may have unambiguous tags

  • Example

    Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

    People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN



More details of POS problem

  • How ambiguous?

    • Most words in English have only one Brown Corpus tag

      • Unambiguous (1 tag) 35,340 word types

      • Ambiguous (2- 7 tags) 4,100 word types = 11.5%

        • 7 tags: 1 word type “still”

    • But many of the most common words are ambiguous

      • Over 40% of Brown corpus tokens are ambiguous

  • Obvious strategies may be suggested based on intuition

    • to/TO race/VB

    • the/DT race/NN

    • will/MD race/NN

  • Sentences can also contain unknown words for which tags have to be guessed: Secretariat/NNP is/VBZ



    Different English Part-of-Speech Tagsets

    • Brown corpus - 87 tags

      • Allows compound tags

        • “I'm” tagged as PPSS+BEM

          • PPSS for "non-3rd person nominative personal pronoun" and BEM for "am, 'm“

    • Others have derived their work from Brown Corpus

      • LOB Corpus: 135 tags

      • Lancaster UCREL Group: 165 tags

      • London-Lund Corpus: 197 tags.

      • BNC – 61 tags (C5)

      • PTB – 45 tags

    • To see comparisons and mappings of tagsets, go to www.comp.leeds.ac.uk/amalgam/tagsets/tagmenu.html



    PTB Tagset (36 main tags + 9 punctuation tags)



    PTB Tagset Development

    • Several changes were made to Brown Corpus tagset:

      • Recoverability

        • Lexical: Same treatment of Be, do, have, whereas BC gave each its own symbol

          • Do/VB does/VBZ did/VBD doing/VBG done/VBN

        • Syntactic: Since parse trees were used as part of Treebank, conflated certain categories under the assumption that they would be recoverable from syntax

          • subject vs. object pronouns (both PP)

          • subordinating conjunctions vs. prepositions: on being informed vs. on the table (both IN)

          • Preposition “to” vs. infinitive marker (both TO)

      • Syntactic Function

        • BC: the/DT one/CD vs. PTB: the/DT one/NN

        • BC: both/ABX vs. PTB: both/PDT the boys, the boys both/RB, both/NNS of the boys, both/CC boys and girls



    PTB Tagging Process

    • Tagset developed

    • Automatic tagging by rule-based and statistical pos taggers

    • Human correction using an editor embedded in Gnu Emacs

    • Takes under a month for humans to learn this (at 15 hours a week), and annotation speeds after a month exceed 3,000 words/hour

    • Inter-annotator disagreement (4 annotators, eight 2000-word docs) was 7.2% for the tagging task and 4.1% for the correcting task

    • Manual tagging took about 2X as long as correcting, with about 2X the inter-annotator disagreement rate and an error rate that was about 50% higher.

    • So, for certain problems, having a linguist correct automatically tagged output is far more efficient and leads to better reliability among linguists compared to having them annotate the text from scratch!



    Automatic POS tagging

    • http://complingone.georgetown.edu/~linguist/


    A Baseline Strategy

    • Choose the most likely tag for each ambiguous word, independent of previous words

      • i.e., assign each token the pos-category it occurred in most often in the training set

    • E.g., race – which pos is more likely in a corpus?

    • This strategy gives you about 90% accuracy in controlled tests

    • So, this “unigram baseline” must always be compared against (a minimal sketch follows below)
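
A minimal sketch of such a unigram baseline tagger, trained on a tiny, made-up list of (word, tag) pairs:

    from collections import Counter, defaultdict

    def train_unigram_tagger(tagged_tokens):
        """Map each word to the tag it received most often in training."""
        counts = defaultdict(Counter)
        for word, tag in tagged_tokens:
            counts[word][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(tokens, model, default="NN"):
        """Tag every token with its most frequent training tag (NN if unseen)."""
        return [(t, model.get(t, default)) for t in tokens]

    train = [("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
             ("the", "DT"), ("race", "NN")]
    model = train_unigram_tagger(train)
    print(tag("the race".split(), model))   # race -> NN, its most frequent tag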



    Beyond the Baseline

    • Hand-coded rules

    • Sub-symbolic machine learning

    • Symbolic machine learning



    Machine Learning

    • Machines can learn from examples

    • Learning can be supervised or unsupervised

    • Given training data, machines analyze the data, and learn rules which generalize to new examples

    • Can be sub-symbolic (rule may be a mathematical function) –e.g. neural nets

    • Or it can be symbolic (rules are in a representation that is similar to representation used for hand-coded rules)

    • In general, machine learning approaches allow for more tuning to the needs of a corpus, and can be reused across corpora


    A Probabilistic Approach to POS Tagging

    What you want to do is find the “best sequence” of pos-tags C = C1..Cn for a sentence W = W1..Wn

    (Here Ci is pos_tag(Wi)).

    In other words, find a sequence of pos tags C that maximizes P(C | W)

    Using Bayes’ Rule, we can say

    P(C | W) = P(W | C) * P(C) / P(W)

    Since we are interested in finding the value of C which maximizes the RHS, the denominator can be discarded, since it will be the same for every C

    So, the problem is: find C which maximizes

    P(W | C) * P(C)

    Example: He will race

    W = W1 W2 W3 = He will race

    C = C1 C2 C3

    Possible sequences (choices for C):

    He/PP will/MD race/NN    (C = PP MD NN)

    He/PP will/NN race/NN    (C = PP NN NN)

    He/PP will/MD race/VB    (C = PP MD VB)

    He/PP will/NN race/VB    (C = PP NN VB)



    Independence Assumptions

    • P(C1…Cn) ≈ ∏ i=1..n P(Ci | Ci-1)

      • assumes that the event of a pos-tag occurring is independent of the event of any other pos-tag occurring, except for the immediately previous pos tag

        • From a linguistic standpoint, this seems an unreasonable assumption, due to long-distance dependencies

    • P(W1…Wn | C1…Cn) ≈ ∏ i=1..n P(Wi | Ci)

      • assumes that the event of a word appearing in a category is independent of the event of any other word appearing in a category

        • Ditto

    • However, the proof of the pudding is in the eating!

      • N-gram models work well for part-of-speech tagging


    A Statistical Method for POS Tagging

    Find the value of C1..Cn which maximizes:

    ∏ i=1..n P(Wi | Ci) * P(Ci | Ci-1)

    i.e., lexical generation probabilities times pos bigram probabilities.

    Lexical generation probabilities P(word | tag):

             MD    NN    VB    PRP
    he       0     0     0     .3
    will     .8    .2    0     0
    race     0     .4    .6    0

    POS bigram probabilities P(tag | previous tag):

    prev \ next   MD    NN    VB    PRP
    <s>           –     –     –     1
    PRP (PP)      .8    .2    –     –
    MD            –     .4    .6    –
    NN            –     .3    .7    –

    (The slide shows the same model as an HMM state diagram over the states <s>, he|PP, will|MD, will|NN, race|NN and race|VB.)

    Finding the Best Path through an HMM (Viterbi algorithm)

    Score(I) = max over predecessors J of I [ Score(J) * transition(I | J) ] * lex(I)

    Score(B) = P(PP | <s>) * P(he | PP) = 1 * .3 = .3

    Score(C) = Score(B) * P(MD | PP) * P(will | MD) = .3 * .8 * .8 = .192

    Score(D) = Score(B) * P(NN | PP) * P(will | NN) = .3 * .2 * .2 = .012

    Score(E) = max [ Score(C) * P(NN | MD), Score(D) * P(NN | NN) ] * P(race | NN)

    Score(F) = max [ Score(C) * P(VB | MD), Score(D) * P(VB | NN) ] * P(race | VB)

    (The slide shows the corresponding trellis, with states A = <s>, B = he|PP, C = will|MD, D = will|NN, E = race|NN, F = race|VB.)


    But Data Sparseness Bites Again!

    • Lexical generation probabilities will lack observations for low-frequency and unknown words

    • Most systems do one of the following

      • Smooth the counts

        • E.g., add a small number to unseen data (to zero counts). For example, assume a bigram not seen in the data has a very small probability, e.g., .0001.

        • Backoff bigrams with unigrams, etc.

      • Use lots more data (you’ll still lose, thanks to Zipf!)

      • Group items into classes, thus increasing class frequency

        • e.g., group words into ambiguity classes, based on their set of tags. For counting, all words in an ambiguity class are treated as variants of the same ‘word’
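
As a minimal illustration of the first option, an add-one (Laplace) smoothed bigram estimate, with made-up toy counts:

    def add_one_prob(bigram_counts, unigram_counts, vocab_size, w1, w2):
        """P(w2 | w1) with add-one smoothing: unseen bigrams get a small,
        non-zero probability instead of zero."""
        return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts.get(w1, 0) + vocab_size)

    bigrams = {("to", "race"): 3}
    unigrams = {"to": 40, "race": 5}
    print(add_one_prob(bigrams, unigrams, vocab_size=1000, w1="to", w2="race"))
    print(add_one_prob(bigrams, unigrams, vocab_size=1000, w1="to", w2="walk"))  # unseen, but > 0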



    A Symbolic Learning Method

    • HMMs are subsymbolic – they don’t give you rules that you can inspect

    • A method called Transformational Rule Sequence learning (Brill algorithm) can be used for symbolic learning (among other approaches)

    • The rules (actually, a sequence of rules) are learnt from an annotated corpus

    • Performs at least as accurately as other statistical approaches

    • Has better treatment of context compared to HMMs

      • rules which use the next (or previous) pos

        • HMMs just use P(Ci | Ci-1) or P(Ci | Ci-2, Ci-1)

      • rules which use the previous (next) word

        • HMMs just use P(Wi|Ci)


    Brill Algorithm (Overview)

    Assume you are given a training corpus G (for gold standard). First, create a tag-free version V of it.

    1. First label every word token in V with the most likely tag for that word type from G. If this ‘initial state annotator’ is perfect, you’re done!

    2. Then consider every possible transformational rule, selecting the one that leads to the most improvement in V, using G to measure the error

    3. Retag V based on this rule

    4. Go back to 2, until there is no significant improvement in accuracy over the previous iteration

    Notes:

    As the algorithm proceeds, each successive rule becomes narrower (covering fewer examples, i.e., changing fewer tags), but also potentially more accurate

    Some later rules may change tags changed by earlier rules


    Brill Algorithm (Detailed)

    1. Label every word token with its most likely tag (based on lexical generation probabilities).

    2. List the positions of tagging errors and their counts, by comparing with ground-truth (GT)

    3. For each error position, consider each instantiation I of X, Y, and Z in the rule template. If Y = GT, increment improvements[I], else increment errors[I].

    4. Pick the I which results in the greatest error reduction, and add it to the output

    e.g., VB NN PREV1OR2TAG DT improves 98 errors, but produces 18 new errors, so a net decrease of 80 errors

    5. Apply that I to the corpus

    6. Go to 2, unless the stopping criterion is reached

    Worked example (see the sketch after this slide):

    Most likely tag: P(NN|race) = .98, P(VB|race) = .02

    Is/VBZ expected/VBN to/TO race/NN tomorrow/NN

    Rule template: change a word from tag X to tag Y when the previous tag is Z

    Rule instantiation for the above example: NN VB PREV1OR2TAG TO

    Applying this rule yields:

    Is/VBZ expected/VBN to/TO race/VB tomorrow/NN
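
A minimal sketch of applying one learned transformation. For simplicity it uses the PREVTAG template (change X to Y when the immediately preceding tag is Z), the variant that appears as rule 1 on the next slide, rather than PREV1OR2TAG.

    def apply_prevtag_rule(tagged, from_tag, to_tag, trigger_tag):
        """Change from_tag to to_tag whenever the immediately preceding tag
        (in the input tagging) is trigger_tag."""
        old_tags = [t for _, t in tagged]
        return [(w, to_tag) if t == from_tag and i > 0 and old_tags[i - 1] == trigger_tag
                else (w, t)
                for i, (w, t) in enumerate(tagged)]

    sent = [("Is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]
    print(apply_prevtag_rule(sent, "NN", "VB", "TO"))
    # race is retagged NN -> VB; tomorrow keeps NN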



    Example of Error Reduction

    From Eric Brill (1995):

    Computational Linguistics, 21, 4, p. 7



    Example of Learnt Rule Sequence

    • 1. NN VB PREVTAG TO

      • to/TO race/NN->VB

    • 2. VBP VB PREV1OR2OR3TAG MD

      • might/MD vanish/VBP -> VB

    • 3. NN VB PREV1OR2TAG MD

      • might/MD not/MD reply/NN -> VB

    • 4. VB NN PREV1OR2TAG DT

      • the/DT great/JJ feast/VB->NN

    • 5. VBD VBN PREV1OR2OR3TAG VBZ

      • He/PP was/VBZ killed/VBD->VBN by/IN Chapman/NNP



    Handling Unknown Words

    • Can also use the Brill method

    • Guess NNP if capitalized, NN otherwise.

    • Or use the tag most common for words ending in the last 3 letters.

    • etc.

    (The slide shows an example learnt rule sequence for unknown words.)


    POS Tagging using Unsupervised Methods

    Reason: annotated data isn’t always available!

    Example: the can

    Let’s take unambiguous words from the dictionary, and count their occurrences after the:

    the .. elephant

    the .. guardian

    Conclusion: immediately after the, nouns are more common than verbs or modals

    Initial state annotator: for each word, list all tags in the dictionary

    Transformation template:

    Change the tag χ of a word to tag Y if the previous (next) tag (word) is Z, where χ is a set of 2 or more tags

    Don’t change any other tags


    Error Reduction in Unsupervised Method

    • Let a rule changing χ to Y in context C be represented as Rule(χ, Y, C).

      • Rule1: {VB, MD, NN} → NN PREVWORD the

      • Rule2: {VB, MD, NN} → VB PREVWORD the

    • Idea:

      • since annotated data isn’t available, score rules so as to prefer those where Y appears much more frequently in the context C than all the other tags in χ

        • frequency is measured by counting unambiguously tagged words

        • so, prefer {VB, MD, NN} → NN PREVWORD the

          to {VB, MD, NN} → VB PREVWORD the

          since dict-unambiguous nouns are more common in a corpus after the than dict-unambiguous verbs



    Summary: POS tagging

    • A variety of POS tagging schemes exist, even for a single language

    • Preparing a POS-tagged corpus requires, for efficiency, a combination of automatic tagging and human correction

    • Automatic part-of-speech tagging can use

      • Hand-crafted rules based on inspecting a corpus

      • Machine Learning-based approaches based on corpus statistics

        • e.g., HMM: lexical generation probability table, pos transition probability table

      • Machine Learning-based approaches using rules derived automatically from a corpus

    • Combinations of different methods often improve performance





    Adjective Ordering

    • *A political serious problem

    • *A social extravagant life

    • *red lovely hair

    • *old little lady

    • *green little men

    • Adjectives have been grouped into various classes to explain ordering phenomena



    Collins COBUILD L2 Grammar

    • qualitative < color < classifying

    • Qualitative – expresses a quality that someone or something has, e.g., sad, pretty, small, etc.

      • Qualitative adjectives are gradable, i.e., the person or thing can have more or less of the quality

    • Classifying – used to identify the class something belongs to, i.e., distinguishing

      • financial help, American citizens.

      • Classifying adjectives aren’t gradable.

    • So, the ordering reduces to

      • Gradable < color < non-gradable

        • A serious political problem

        • Lovely red hair

        • Big rectangular green Chinese carpet



    Vendler 68

    • A9 < A8 < … < A2 < A1x < A1m < … < A1a

    • A9: probably, likely, certain

    • A8: useful, profitable, necessary

    • A7: possible, impossible

    • A6: clever, stupid, reasonable, nice, kind, thoughtful, considerate

    • A5: ready, willing, anxious

    • A4: easy

    • A3: slow, fast, good, bad, weak, careful, beautiful

    • A2: contrastive/polar adjectives: long-short, thick-thin, big-little, wide-narrow

    • A1j: verb-derivatives: washed

    • A1i: verb-derivatives: washing

    • A1h: luminous

    • A1g: rectangular

    • A1f: color adjectives

    • A1a: iron, steel, metal

      big rectangular green Chinese carpet



    Other Adjective Ordering Theories

    Collins COBUILD: gradable < color < non-gradable

    Goyvaerts, Q&G, Dixon: size < age < color

    Goyvaerts, Q&G: color < denominal

    Goyvaerts, Dixon: shape < color



    Testing the Theories on Large Corpora

    • The theories offer selective coverage of a particular language or (small) set of languages

    • They are based on categories that aren’t defined precisely; here we need categories that are computable

    • They are based on small numbers of examples; here we test on large numbers of examples

    • Test: gradable < color < non-gradable



    Computable Tests for Gradable Adjectives

    • Submodifiers expressing gradation

      • very|rather|somewhat|extremely A

        • But what about “very British”?

          http://complingtwo.georgetown.edu/~gwilson/Tools/Adj/GW_Grad.txt

    • Periphrastic comparatives

      • “more A than” | “the most A”

    • Inflectional comparatives

      • -er|-est

        http://complingtwo.georgetown.edu/~gwilson/Tools/Adj/BothLists.txt
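
A rough sketch of the first test (submodifiers expressing gradation) as a corpus search; the sample text is invented, and as the “very British” caveat shows, the hits still need human filtering.

    import re
    from collections import Counter

    GRADATION_PATTERN = re.compile(r"\b(?:very|rather|somewhat|extremely)\s+(\w+)\b",
                                   re.IGNORECASE)

    def gradable_candidates(text):
        """Count words that appear right after a gradation submodifier."""
        return Counter(w.lower() for w in GRADATION_PATTERN.findall(text))

    sample = "a very serious problem, an extremely lovely view, very British humour"
    print(gradable_candidates(sample))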



    Challenges: Data Sparseness

    • Data sparseness

      • Only some pairs will be present in a given corpus

        • few adjectives on the gradable list may be present

      • Even fewer longer sequences will be present in a corpus

        • Use transitivity?

          • small < red, red < wooden --> small < red < wooden?



    Challenges: Tool Incompleteness

    • Search pattern will return many non-examples

      • Collocations

        • common or marked ones

          • American “green card”

          • national Blue Cross

      • Adjective Modification

        • bright blue

      • POS-tagging errors

      • May also miss many examples



    Results from Corpus Analysis

    • G < C < not G generally holds

    • However, there are exceptions

      • Classifying/Non-Gradable < Color

        After all, the maple leaf replaced the British red ensign as Canada's flag almost 30 years ago.

        http://complingtwo.georgetown.edu/~gwilson/Tools/Adj/Color2.html

        where he stood on a stage festooned with balloons displaying the Palestinian green, white and red flag

        http://complingtwo.georgetown.edu/~gwilson/Tools/Adj/Color4.html

      • Color < Shape

        paintings in which pink, roundish shapes, enriched with flocking, gel, lentils and thread, suggest the insides of the female body.

        http://complingtwo.georgetown.edu/~gwilson/Tools/Adj/Color4.html



    Summary: Adjective Ordering

    • It is possible to test concrete predictions of a linguistic theory in a corpus-based setting

    • The testing means that the machine searches for examples satisfying patterns that the human specifies

    • The patterns can pre-suppose a certain/high degree of automatic tagging, with attendant loss of accuracy

    • The patterns should be chosen so that they provide “handles” to identify the phenomena of interest

    • The patterns should be restricted enough that the number of examples the human has to judge is not infeasible

    • This is usually an iterative process





    The Art of Annotation 101

    • Define Goal

    • Eyeball Data (with the help of Computers)

    • Design Annotation Scheme

    • Develop Example-based Guidelines

    • Unless satisfied/exhausted, goto 1

    • Write Training Manuals

    • Initiate Human Training Sessions

    • Annotate Data / Train Computers

      • Computers can also help with the annotation

  • Evaluate Humans and Computers

  • Unless satisfied/exhausted, goto 1


    Annotation Methodology Picture

    (Flow diagram: a Raw Corpus is fed to an Initial Tagger; its output goes to an Annotation Editor, guided by Annotation Guidelines, producing an Annotated Corpus; the Annotated Corpus feeds a Machine Learning Program that produces learned rules; the learned rules are applied to new Raw Corpus text to produce more Annotated Corpus, and possibly a Knowledge Base.)



    Goals of an Annotation Scheme

    • Simplicity – simple enough for a human to carry out

    • Precision – precise enough to be useful in CLI applications

    • Text-based – annotation of an item should be based on information conveyed by the text, rather than information conveyed by background information

    • Human-centered – should be based on what a human can infer from the text, rather than what a machine can currently do or not do

    • Reproducible – your annotation should be reproducible by other humans (i.e., inter-annotator agreement should be high)

      • obviously, these other humans may have to have particular expertise and training



    What Should An Annotation Contain

    • Additional Information about the text being annotated – e.g., EAGLES external and internal criteria

    • Information about the annotator – who, when, what version of tool, etc. (usually in meta-tags associated with the text)

    • The tagged text itself

    • Example:

    • http://www.emille.lancs.ac.uk/spoken.htm



    External and Internal Criteria (EAGLES)

    • External: participants, occasion, social setting, communicative function

      • origin: Aspects of the origin of the text that are thought to affect its structure or content.

      • state: the appearance of the text, its layout and relation to non-textual matter, at the point when it is selected for the corpus.

      • aims: the reason for making the text and the intended effect it is expected to have.

    • Internal: patterns of language use

      • Topic (economics, sports, etc.)

      • Style (formal/informal, etc.)



    External Criteria – state (EAGLES)

    • Mode

      • spoken

        • participant awareness: surreptitious/warned/aware

        • venue: studio/on location/telephone

      • written

    • Relation to the medium

      • written: how it is laid out, the paper, print, etc.

      • spoken: the acoustic conditions, etc.

    • Relation to non-linguistic communicative matter

      • diagrams, illustrations, other media that are coupled with the language in a communicative event.

    • Appearance

      • e.g., advertising leaflets, aspects of presentation that are unique in design and are important enough to have an effect on the language.



    Examples of annotation schemes (changing the way we do business!)

    • POS tagging annotation – Penn Treebank Scheme

    • Named entity annotation – ACE Scheme

    • Phrase Structure annotation – Penn Treebank scheme

    • Time Expression annotation – TIMEX2 Scheme

    • Protein Name Annotation – GU Scheme

    • Event Annotation – TimeML Scheme

    • Rhetorical Structure Annotation - RST Scheme

    • Coreference Annotation, Subjectivity Annotation, Gesture Annotation, Intonation Annotation, Metonymy Annotation, etc., etc.

    • Etc.

    • Several hundred schemes exist, for different problems in different languages



POS Tag Formats: Non-SGML to SGML

    • CLAWS tagger: non-SGML

      • What_DTQ can_VM0 CLAWS_NN2 do_VDI to_PRP Inderjeet_NP0 's_POS noonsense_NN1 text_NN1 ?_?

    • Brill tagger: non-SGML

      • What/WP can/MD CLAWS/NNP do/VB to/TO Inderjeet/NNP 's/POS noonsense/NN text/NN ?/.

    • Alembic POS tagger:

      • <s><lex pos=WP>What</lex> <lex pos=MD>can</lex> <lex pos=NNP>CLAWS</lex> <lex pos=VB>do</lex> <lex pos=TO>to</lex> <lex pos=NNP>Inderjeet</lex> <lex pos=POS>'</lex><lex pos=PRP>s</lex> <lex pos=VBP>noonsense</lex> <lex pos=NN>text</lex> <lex pos=".">?</lex></s>

    • Conversion to SGML is pretty trivial in such cases


    SGML (Standard Generalized Markup Language)

    A general markup language for text

    HTML is an instance of an SGML encoding

    Text Encoding Initiative (TEI): defines SGML schemes for marking up humanities text resources as well as dictionaries

    Examples:

    <p><s>I’m really hungry right now.</s><s>Oh, yeah?</s>

    <utt speak=“Fred” date=“10-Feb-1998”>That is an ugly couch.</utt>

    Note: some elements (e.g., <p>) can consist just of a single tag

    Character references: ways of referring to non-ASCII characters using a numeric code

    &#229; (in decimal) or &#xE5; (in hexadecimal) both encode å

    Entity references: used to encode a special character or sequence of characters via a symbolic name

    r&eacute;sum&eacute; encodes résumé

    &docdate;


    DTDs

    A document type definition, or DTD, is used to define a grammar of legal SGML structures for a document

    e.g., a para should consist of one or more sentences and nothing else

    An SGML parser verifies that the document is compliant with the DTD

    DTDs can therefore be used for XML as well

    DTDs can specify what attributes are required, in what order, what their legitimate values are, etc.

    The DTDs are often ignored in practice!

    DTD:

    <!ENTITY writer SYSTEM "http://www.mysite.com/all-entities.dtd">

    <!ATTLIST payment type (check|cash) "cash">

    XML:

    <author>&writer;</author>

    <payment type="check">



    XML

    • “Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML.

    • Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.” www.w3.org/XML/

    • Defines a simplified subset of SGML, designed especially for Web applications

    • Unlike HTML, separates out display (e.g., XSL) from content (XML)

    • Example

      <p/><s><lex pos=“WP”>What</lex> <lex pos=“MD”>can</lex></s>

    • Makes use of DTDs, but also RDF Schemas



    RDF Schemas

    • Example of Real RDF Schema:

    • http://www.cs.brandeis.edu/~jamesp/arda/time/documentation/TimeML.xsd (see EVENT tag and attributes)



    Inline versus Standoff Annotation

    • Usually, when tags are added, an annotation tool is used, to avoid spurious insertions or deletions

    • The annotation tool may use inline or standoff annotation

    • Inline – tags are stored internally in (a copy of) the source text.

      • Tagged text can be substantially larger than original text

      • Web pages are a good example – i.e., HTML tags

    • Standoff – tags are stored internally in separate files, with information as to what positions in the source text the tags occupy

      • e.g., PERSON 335 337

      • However, the annotation tool displays the text as if the tags were in-line
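
A minimal sketch of converting standoff annotations of the kind shown above into inline tags, assuming each annotation is a (tag, start offset, end offset) triple of character offsets into the source text; the exact offset convention is an assumption.

    def standoff_to_inline(text, annotations):
        """annotations: list of (tag, start, end) character offsets into text.
        Inserts <TAG>...</TAG> around each span, working right-to-left so
        that earlier offsets stay valid."""
        for tag, start, end in sorted(annotations, key=lambda a: a[1], reverse=True):
            text = text[:start] + f"<{tag}>" + text[start:end] + f"</{tag}>" + text[end:]
        return text

    doc = "Bill Gates founded Microsoft."
    print(standoff_to_inline(doc, [("PERSON", 0, 10), ("ORGANIZATION", 19, 28)]))
    # <PERSON>Bill Gates</PERSON> founded <ORGANIZATION>Microsoft</ORGANIZATION>.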



    Summary: Annotation Issues

    • A ‘best-practices’ methodology is widely used for annotating corpora

    • The annotation process involves computational tools at all stages

    • Standard guidelines are available for use

    • To share annotated corpora (and to ensure their survivability), it is crucial that the data be represented in a standard rather than ad hoc format

    • XML provides a well-established, Web-compliant standard for markup languages

    • DTDs and RDF provide mechanisms for checking well-formedness of annotation





    Background

    • Deborah Schiffrin. Anaphoric then: aspectual, textual, and epistemic meaning. Linguistics 30 (1992), 753-792

    • Schiffrin examines uses of then in data elicited via 20 sociolinguistic interviews, each an hour long

    • Distinguishes two anaphoric temporal senses, showing that they are differentiated by clause position

    • Shows that they have systematic effects on aspectual interpretation

    • A parallel argument is made for two epistemic temporal senses



    Schiffrin: Temporal and Non-Temporal Senses

    • Anaphoric Senses

      • ‘Narrative’ temporal sense (shifts reference time)

        • And then I uh lived there until I was sixteen

      • Continuing Temporal sense (continues a previous reference time)

        • I was only a little boy then.

    • Epistemic senses

      • Conditional ‘sentences’ (rare, but often have temporal antecedents in her data)

        • But if I think about it for a few days -- well, then I seem to remember a great deal

        • …if I’m still in the situation where I am now… I’m not gonna have no more then

      • Initiation-response-evaluation sequences (‘in that case’?)

        • Freda: Do y’ still need the light?

        • Debby: Um.

        • Freda: W’ll have t’ go in then. Because the bugs are out.



    Schiffrin’s Argument (Simplified) and Its Test

    • Shifting RT thens (call these Narrative) & then in if-then conditionals

      • similar semantic function

      • mainly clause-initial

    • Continuing RT thens (call these Temporal) & IRE thens

      • similar semantic function

      • mainly clause final

      • stative verb more likely (since RT overlaps, verbs conveying duration are expected)

    • Call the rest Other

      • isn’t differentiated into if-then versus IRE

      • So, only part of her claims tested



    So, What do we do Then?

    • Define environments of interest, each one defined by a pattern

    • For each environment

      • Find examples matching the pattern

      • If classifying the examples is manageable, carry it out and stop

      • Otherwise restrict the environment by adding new elements to the pattern, and go back to 1

    • So, for each final environment, we claim that X% of the examples in that environment are of a particular class

    • Initial ‘then’ Pattern: (^|_CC|_RB)\s*then\w+\s+\w

    • Final ‘then’ Pattern: [^\,]\s+then[\.\?\'\;\!\:]
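
The two patterns can be applied roughly as follows. The sketch assumes the initial-position pattern is run over POS-tagged text in word_TAG format and the final-position pattern over plain text; the sample sentences are adapted from Schiffrin's examples on the earlier slide.

    import re

    # Patterns copied from the slide.
    initial_then = re.compile(r"(^|_CC|_RB)\s*then\w+\s+\w")
    final_then = re.compile(r"[^\,]\s+then[\.\?\'\;\!\:]")

    tagged = "And_CC then_RB I_PRP lived_VBD there_RB until_IN I_PRP was_VBD sixteen_CD"
    plain = "I was only a little boy then."

    print(bool(initial_then.search(tagged)))   # True  -> candidate narrative 'then'
    print(bool(final_then.search(plain)))      # True  -> candidate temporal 'then'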


    Exceptions

    Non-Narrative Initial ‘then’:

    then there [be]

    then come

    then again

    then and now

    only then

    even then

    so then

    Non-Temporal Final ‘then’:

    What then?

    All right/OK [,] then

    And then?



    Results

    • Other is a presence in final position in fiction and broadcast news, and in initial position in print news. Is this real or artifact of catch-all class?

    • Conclusion: only part of her claims tested. But those claims are borne out across three different genres and much more data!





    Considerations in Inter-Annotator Agreement

    • Size of tagset

    • Structure of tagset

    • Clarity of Guidelines

    • Number of raters

    • Experience of raters

    • Training of raters

      • Independent ratings (preferred)

      • Consensus (not preferred)

    • Exact, partial, and equivalent matches

    • Metrics

    • Lessons Learned: Disagreement patterns suggest guideline revisions


    Protein Names

    Considerable variability in the forms of the names

    Multiple naming conventions

    Researchers may name a newly discovered protein based on function, sequence features, gene name, cellular location, molecular weight, discoverer, or other properties

    Prolific use of abbreviations and acronyms

    Examples:

    fushi tarazu 1 factor homolog

    Fushi tarazu factor (Drosophila) homolog 1

    FTZ-F1 homolog ELP

    steroid/thyroid/retinoic nuclear hormone receptor homolog nhr-35

    V-INT 2 murine mammary tumor virus integration site oncogene homolog

    fibroblast growth factor 1 (acidic) isoform 1 precursor

    nuclear hormone receptor subfamily 5, Group A, member 1



    Guidelines v1 TOC



    Agreement Metrics


    Example for F-measure: Scorer Output (Protein Name Tagging)

              REFERENCE               CANDIDATE

    CORR      FTZ-F1 homolog ELP      FTZ-F1 homolog ELP
    INCO      M2-LHX3                 M2
    SPUR                              -
    SPUR                              LHX3

    Precision = 1/4 = 0.25

    Recall = 1/2 = 0.5

    F-measure = 2 * 1/4 * 1/2 / (1/4 + 1/2) = 0.33
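
The same numbers fall out of a generic precision/recall/F scorer sketch (not the actual scorer used for the protein task):

    def prf(reference, candidate):
        """Precision, recall, and balanced F-measure over two sets of extracted names."""
        correct = len(reference & candidate)
        precision = correct / len(candidate)
        recall = correct / len(reference)
        f = 2 * precision * recall / (precision + recall) if correct else 0.0
        return precision, recall, f

    reference = {"FTZ-F1 homolog ELP", "M2-LHX3"}
    candidate = {"FTZ-F1 homolog ELP", "M2", "-", "LHX3"}
    print(prf(reference, candidate))   # (0.25, 0.5, 0.333...)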



    The importance of disagreement

    • Measuring inter-annotator agreement is very useful in “debugging” the annotation scheme

    • Disagreement can lead to improvements in the annotation scheme

    • Extreme disagreement can lead to abandonment of the scheme


    V2 Assessment (ABS2)

    Old Guidelines:

    protein          0.71 F

    acronym          0.85 F

    array-protein    0.15 F

    New Guidelines:

    protein          0.86 F

    long-form        0.71 F

    (these are only ~4% of tags)



    TIMEX2 Annotation Scheme

    Time Points <TIMEX2 VAL="2000-W42">the third week of October</TIMEX2>

    Durations <TIMEX2 VAL=“PT30M”>half an hour long</TIMEX2>

    Indexicality <TIMEX2 VAL=“2000-10-04”>tomorrow</TIMEX2>

    Sets <TIMEX2 VAL=”XXXX-WXX-2" SET="YES” PERIODICITY="F1W" GRANULARITY=“G1D”>every Tuesday</TIMEX2>

    Fuzziness <TIMEX2 VAL=“1990-SU”>Summer of 1990 </TIMEX2>

    <TIMEX2 VAL=“1999-07-15TMO”>This morning</TIMEX2>

    Non-specificity <TIMEX2 VAL="XXXX-04" NON_SPECIFIC=”YES”>April</TIMEX2> is usually wet.

    For guidelines, tools, and corpora, please see timex2.mitre.org



    TIMEX2 Inter-Annotator Agreement

    193 NYT news docs

    5 annotators

    10 pairs of annotators

    • Human annotation quality is ‘acceptable’ on EXTENT and VAL

    • Poor performance on Granularity and Non-Specific

      • But there are only a small number of instances of these (about 200 out of ~6000)

    • Annotators deviate from guidelines, and produce systematic errors (fatigue?)

      • several years ago: PXY instead of PAST_REF

      • all day: P1D instead of YYYY-MM-DD



    TempEx in Qanda



    Summary: Inter-Annotator Reliability

    • There’s no point going on with an annotation scheme if it can’t be reproduced

    • There are standard methods for measuring inter-annotator reliability

    • An analysis of inter-annotator disagreements is critical for “debugging” an annotation scheme





    Information Extraction

    • Types

      • Flag names of people, organizations, places,…

    • Flag and normalize phrases such as time expressions, measure phrases, currency expressions, etc.

      • Group coreferring expressions together

      • Find relations between named entities (works for, located at, etc.)

      • Find events mentioned in the text

      • Find relations between events and entities

      • A hot commercial technology!

    • Example patterns:

      • Mr. ---,

      • , Ill.



    Message Understanding Conferences (MUCs)

    • Idea: precise tasks to measure success, rather than test suite of input and logical forms.

    • MUC-1 1987 and MUC-2 1989 - messages about navy operations

    • MUC-3 1991 and MUC-4 1992 - news articles and transcripts of radio broadcasts about terrorist activity

    • MUC-5 1993 - news articles about joint ventures and microelectronics

    • MUC-6 1995 - news articles about management changes, + additional tasks of named entity recognition, coreference, and template element

    • MUC-7 1998 – mostly multilingual information extraction

    • Has also been applied to hundreds of other domains - scientific articles, etc., etc.



    Historical Perspective

    • Until MUC-3 (1991), many IE systems used a Knowledge Engineering approach

      • They did something like full chart parsing with a unification-based grammar with full logical forms, a rich lexicon and KB

      • E.g., SRI’s Tacitus

    • Then, they discovered that things could work much faster using finite-state methods and partial parsing

    • And that using domain-specific rather than general purpose lexicons simplified parsing (less ambiguity due to fewer irrelevant senses)

    • And that these methods worked even better for the IE tasks

      • E.g., SRI’s Fastus, SRA’s Nametag

    • Meanwhile, people also started using statistical learning methods from annotated corpora

      • Including CFG parsing



    An instantiated scenario template

    Source: Wall Street Journal, 06/15/88

    MAXICARE HEALTH PLANS INC and UNIVERSAL HEALTH SERVICES INC have dissolved a joint venture which provided health services.



    Templates Can get Complex! (MUC-5)



    2002 Automatic Content Extraction (ACE) Program: Entity Types

    • Person

    • Organization

    • (Place)

      • Location – e.g., geographical areas, landmasses, bodies of water, geological formations

      • Geo-Political Entity – e.g., nations, states, cities

        • Created due to metonymies involving this class of places

        • The riots in Miami

        • Miami imposed a curfew

        • Miami railed against a curfew

    • Facility – buildings, streets, airports, etc.



    ACE Entity Attributes and Relations

    • Attributes

      • Name: An entity mentioned by name

      • Pronoun

      • Nominal

    • Relations

      • AT: based-in, located, residence

      • NEAR: relative-location

      • PART: part-of, subsidiary, other

      • ROLE: affiliate-partner, citizen-of, client, founder, general-staff, manager, member, owner, other

      • SOCIAL: associate, grandparent, parent, sibling, spouse, other-relative, other-personal, other-professional



    Designing an Information Extraction Task

    • Define the overall task

    • Collect a corpus

    • Design an Annotation Scheme

      • linguistic theories help

    • Use Annotation Tools

      • authoring tools

      • automatic extraction tools

    • Apply the annotation to the corpus, assessing reliability

    • Use training portion of corpus to train information extraction (IE) systems

    • Use test portion to test IE systems, using a scoring program



    Annotation Tools

    • Specialized authoring tools used for marking up text without damaging it

    • Some tools are tied to particular annotation schemes



    Annotation Tool Example: Alembic Workbench



    Callisto (Java successor to Alembic Workbench)



    Relationship Annotation: Callisto



    Steps in Information Extraction

    • Tokenization

      • Language Identification

      • Document Zoning

      • Sentence and Word Tokenization

    • Morphological and Lexical Processing

      • Tagging entities of interest

      • Specific trigger lexicons

      • Dealing with unknown words

      • Part-of-Speech Tagging

      • Word-Sense Tagging

      • Morphological Analysis

    • Parsing

      • Finite-State Parsing (usually just chunking)

    • Domain Semantics

      • Coreference

      • Merging Partial Results



    Morphological Analysis

    • Inflectional morphology, mostly

    • For simple languages (English, Japanese) – simple inflectional module suffices

    • For more complex languages (Spanish) – a finite-state transducer is used

    • For morphologically very complex languages (Arabic, Hebrew) – complex finite state transducer architectures

    • For languages with productive noun compounding (German) – specialized module needed


    Finite-State Parsing for IE

    [A.C. Nielsen Co.]NG [said]VG [George Garrick]NG, 40 years old, [president]NG of [Information Resources Inc.]NG ’s [London-based European Information Services operation]NG, [will become]VG [president]NG and [chief operating officer]NG of [Nielsen Marketing Research USA]NG, [a unit]NG of [Dun & Bradstreet Corp.]NG

    First find NGs, VGs, particles; ignore PP attachment; ignore clause boundaries; maybe ignore modifiers that aren’t domain-relevant

    Later transducers handle more complex phenomena:

    relative clauses (e.g., look for the second verb to mark the end of the relative clause; subject relatives: associate the subject with the first and second verb; object relatives: associate the object with the head noun before the relative modifier)

    general clause segmentation

    coordination

    appositives

    PP argument attachment (only for verbs important in the domain whose subcat info is provided – the rest are adverbial adjuncts)


    Example Text Processing

    KEY: trigger word tagging, named entity tagging, chunk parsing (NGs, VGs, preps, conjunctions)

    Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.

    [Company]NG [Set-Up]VG [Joint-Venture]NG with [Company]NG … [Produce]VG [Product]NG

    The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.


    Merging structures l.jpg

    Activity:

    Type: PRODUCTION

    Company:

    Product: golf clubs

    Start-date:

    Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.

    Merging Structures

    The joint venture, Bridgestone Sports Taiwan Cp., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

    Activity:

    Type: PRODUCTION

    Company: Bridgestone Sports Taiwan Co

    Product: iron and “metal wood” clubs

    Start-date: DURING 1990
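A minimal sketch of the merging step shown above, treating partial templates as dictionaries of slots (the slot names follow the example; the "keep the longer, more specific filler" rule is a crude illustrative heuristic, not the method described in the slides):

    # Merge partial extraction templates: fill empty slots; when both templates
    # fill a slot differently, keep the longer (more specific) description.
    def merge(template_a, template_b):
        merged = dict(template_a)
        for slot, value in template_b.items():
            if not value:
                continue                              # nothing to add for this slot
            if not merged.get(slot):
                merged[slot] = value                  # fill a previously empty slot
            elif merged[slot] != value and len(str(value)) > len(str(merged[slot])):
                merged[slot] = value                  # crude specificity heuristic
        return merged

    t1 = {"Type": "PRODUCTION", "Company": "", "Product": "golf clubs", "Start-date": ""}
    t2 = {"Type": "PRODUCTION", "Company": "Bridgestone Sports Taiwan Co",
          "Product": "iron and 'metal wood' clubs", "Start-date": "DURING 1990"}
    print(merge(t1, t2))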


    Coreference l.jpg

    Coreference

    • Coreference means establishing referential relations between expressions.

      • Pronouns ..Mr. Gates …he, the testimony….it

      • Definite NPs Microsoft….the company

      • Indefinite NPs the building…an apartment

      • Proper Names Bill Gates…William Gates…. Mr. Gates

      • Temporal Expressions today, three weeks from Monday

      • Headless Determiners all, the one, five

      • Prenominals aluminum siding …the price of aluminum

      • Events they attacked at dawn…the attack

    • Types of relationships:

      • Identity, Part-whole

      • Set-subset the jurors…five ….

      • Set-member the jurors…one


    Statistical named entity tagging l.jpg

    Statistical Named Entity Tagging

    • Typically, treat it as a word-level tagging problem

      • To get phrase-level tags, one could greedily concatenate adjacent tags

        • this will fail to separate adjacent names that have the same tag (see the BIO decoding sketch below)

    • Approaches can separately model words at start, end, or middle of name

      • BBN Identifinder does that

        P(C|W) = P(W, C) / P(W), so the best class sequence is argmax_C P(W, C)

        P(W, C) is modeled (roughly) as a product, over the words, of:

        P(Ci | Ci-1, w_i-1) (name-class transition, conditioned on the previous word)

        * P(<w,f>_i = first | Ci, Ci-1) (first word in a name)

        * P(<w,f>_i | <w,f>_i-1, Ci) (all but the first word in a name)

        Word features f include information about capitalization, initials, etc.
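The usual remedy for the phrase-grouping problem noted above is BIO (begin/inside/outside) tagging; a minimal sketch of decoding word-level BIO tags into phrase-level spans (tag names are illustrative):

    # Convert word-level BIO tags into phrase-level (type, start, end) spans.
    # B-X starts a new name of type X, I-X continues it, O is outside any name.
    def bio_to_spans(tags):
        spans, start, current = [], None, None
        for i, tag in enumerate(tags + ["O"]):            # sentinel flushes the last span
            if tag.startswith("B-") or tag == "O" or (current and tag[2:] != current):
                if current is not None:
                    spans.append((current, start, i))     # end index is exclusive
                current, start = (tag[2:], i) if tag.startswith(("B-", "I-")) else (None, None)
            elif tag.startswith("I-") and current is None:
                current, start = tag[2:], i               # tolerate I- without a preceding B-
        return spans

    # "George Garrick , president of Nielsen Marketing Research USA"
    print(bio_to_spans(["B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG"]))
    # [('PER', 0, 2), ('ORG', 5, 9)]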


    Information extraction metrics l.jpg

    Precision: Correct Answers/Answers Produced

    Recall: Correct Answers /Total Possible Correct

F-measure: uses a parameter β to weight precision versus recall (β=1 for balance)

F = (β²+1) P R / (β² P + R)

F ≈ .6 for the relationship/event extraction task (ceiling) in MUC

F ≈ .95+ for the named entity task in MUC

F ≈ .8 or so for the coreference task

    Information Extraction Metrics
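A minimal sketch of these metrics as code, treating system and gold answers as sets of key/filler pairs (the set representation is an assumption for illustration):

    # Precision, recall, and F-measure over sets of system answers vs. gold answers.
    def prf(system, gold, beta=1.0):
        correct = len(system & gold)
        precision = correct / len(system) if system else 0.0
        recall = correct / len(gold) if gold else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
        return precision, recall, f

    system = {("Company", "Bridgestone Sports Taiwan Co"), ("Product", "golf clubs")}
    gold   = {("Company", "Bridgestone Sports Taiwan Co"), ("Product", "iron clubs"),
              ("Start-date", "DURING 1990")}
    print(prf(system, gold))    # precision 0.5, recall 0.33, F1 = 0.4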


    Ie and qa evaluations l.jpg

    IE and QA Evaluations

[Chart: current status of various information extraction and question-answering components: names in English; names from audio at 0-15% word error; names in Japanese; names in Chinese; relations; question answering; event extraction]


    Summary information extraction l.jpg

    Summary: Information Extraction

    • A variety of IE tasks and methods are available

    • Named entities, relations, and event templates can be filled, as well as coreference relations

    • Linguistic information used can be hand-crafted or corpus-based

    • Domain knowledge, where needed, is hand-crafted

    • Performance on names is better than on relations, while “deep” templates have shown a 60% ceiling effect


    Outline122 l.jpg

    Topics

    Concordances

    Data sparseness

    Chomsky’s Critique

    Ngrams

    Mutual Information

    Part-of-speech tagging

    Annotation Issues

    Inter-Annotator Reliability

    Named Entity Tagging

    Relationship Tagging

    Case Studies

    metonymy

    adjective ordering

    Discourse markers: then

    TimeML

    Outline


    Motivation for temporal information extraction l.jpg

    Motivation for Temporal Information Extraction

    • Story Understanding

      • Question-answering

      • Summarization

    • Focus on temporal aspects of narrative


    Chronology of the marathon mini story l.jpg

[Diagram: chronology graph for the mini-story: the run occurs during 02/17/2004; the ankle twist occurs during the run; the push occurs before the twist; the run finishes during 02/17/2004 or 02/18/2004]

    Chronology of ‘The Marathon’ (mini-story)

Yesterday Holly was running a marathon when she twisted her ankle. David had pushed her.

    1. When did the running occur?

    Yesterday.

    2. When did the twisting occur?

    Yesterday, during the running.

    3. Did the pushing occur before the twisting?

    Yes.

    4. Did Holly keep running after twisting her ankle?

Maybe, maybe not: the text doesn't say.


    Factors influencing event ordering l.jpg

    Factors influencing Event Ordering

    (1) Max entered the room. He had drunk a lot of wine.

    TENSE: Past perfect indicates drinking precedes entering.

    (2) Max entered the room. Mary was seated behind the desk.

    ASPECT: State of ‘being seated’ overlaps with ‘entering’.

    (3) He had borrowed some shirts from local villagers after his backpack went down.

    TEMPORAL MODIFIER: Going down precedes borrowing, based on temporal adverbial after

    (4) Iraq was defeated during the Gulf War. In ancient times it was the cradle of civilization.

    TIMEX: Being the cradle precedes being defeated, based on explicit time expression.

    (5) Max stood up. John greeted him.

    NARR_CONVENTION: Narrative convention applies, with ‘standing up’ preceding ‘greeting’

    (6) Max fell. John pushed him.

    DISCOURSE_REL: Narrative convention overridden, based on Explanation relation

    (7) A drunken man died in the central Philippines when he put a firecracker under his armpit.

    DISCOURSE_REL: dying after putting, with temporal modifier used to instantiate Explanation relation

(8) U.N. Secretary-General Boutros Boutros-Ghali Sunday opened a meeting of .... Boutros-Ghali arrived in Nairobi from South Africa, accompanied by Michel...

    WORLD KNOWLEDGE: arrival at the place of a meeting precedes opening a meeting


    What s needed for computing chronologies l.jpg

    What’s Needed for Computing Chronologies?

    Representation of tense and aspect

    Representation of events and time

    Linking of events and time

    Result: a temporal constraint network

    Here, both events and times are represented as pairs of points (nodes)

    Ordering relations (edges) are <, =

[Diagram: a chronology links Event and Time objects (each with participants). For "Yesterday, Holly was running ...", the run is represented by points x1 < x2 and the day 02/17/2004 by points y1 < y2, with the "during" relation expressed as point constraints between x1, x2 and y1, y2 (Verhagen 2004)]
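A minimal sketch of such a point-based constraint network, with each event or time represented as a (start, end) pair of points and edges labeled with < (the names and relations follow the marathon example; the data structure itself is an illustrative assumption):

    # Point-based temporal constraint network: each interval (event or time)
    # is a pair of points; constraints are < edges between points.
    class Network:
        def __init__(self):
            self.points = {}          # interval name -> (start_point, end_point)
            self.edges = set()        # (point_a, "<", point_b)

        def add_interval(self, name):
            start, end = name + ".start", name + ".end"
            self.points[name] = (start, end)
            self.edges.add((start, "<", end))        # every interval is well-formed

        def before(self, a, b):                      # a entirely precedes b
            self.edges.add((self.points[a][1], "<", self.points[b][0]))

        def during(self, a, b):                      # a is included in b (strict, for simplicity)
            self.edges.add((self.points[b][0], "<", self.points[a][0]))
            self.edges.add((self.points[a][1], "<", self.points[b][1]))

    net = Network()
    for name in ("run", "twist", "push", "day-02172004"):
        net.add_interval(name)
    net.during("run", "day-02172004")
    net.during("twist", "run")
    net.before("push", "twist")
    print(sorted(net.edges))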


    Timeml annotation l.jpg

    TimeML Annotation

    • TimeML is a proposed metadata standard for markup of events and their temporal anchoring and ordering

    • Consists of EVENT tags, TIMEX3 tags, and LINK tags

      • EVENTS are grouped into classes and have tense and aspect features

      • LINKS include overt and covert links

        • Can be within or across sentences


    How timeml differs from previous markups l.jpg

    How TimeML Differs from Previous Markups

    • Extends TIMEX2 annotation to TIMEX3

      • Temporal Functions: three years ago

      • Anchors to events and other temporal expressions: three years after the Gulf War

      • Addresses problem with Granularity/Periodicity: three days every month

      • Inserts start/end points for Durations: two weeks from June 7

    • Identifies signals determining interpretation of temporal expressions;

      • Temporal Prepositions: for, during, on, at;

      • Temporal Connectives: before, after, while.

    • Identifies event expressions;

      • tensed verbs; has left, was captured, will resign;

      • stative adjectives; sunken, stalled, on board;

      • event nominals; merger, Military Operation, Gulf War;

    • Creates dependencies between events and times:

      • Anchoring; John left on Monday.

      • Orderings; The party happened after graduation.

      • Embedding; John said Mary left.


    Tlink l.jpg

    TLINK

    • TLINK or Temporal Link represents the temporal relationship holding between events or between an event and a time, and establishes a link between the involved entities, making explicit if they are:

    • Simultaneous (happening at the same time)

    • Identical: (referring to the same event)

    • John drove to Boston. During his drive he ate a donut.

    • One before the other:

      • The police looked into the slayings of 14 women.In six of the cases suspects have already been arrested.

    • One immediately before the other:

    • All passengers died when the plane crashed into the mountain.

    • One including the other:

    • John arrived in Boston last Thursday.

    • One holding during the duration of the other:

    • One being the beginning of the other:

    • John was in the gym between 6:00 p.m. and 7:00 p.m.

    • One being the ending of the other:

    • John was in the gym between 6:00 p.m. and 7:00 p.m.


    Slink l.jpg

    SLINK

    SLINK or Subordination Link is used for contexts introducing relations between two events, or an event and a signal, of the following sort:

    Modal: Relation introduced mostly by modal verbs (should, could, would, etc.) and events that introduce a reference to a possible world --mainly I_STATEs:

    John should have bought some wine.

    Mary wanted John to buy some wine.

    Factive: Certain verbs introduce an entailment (or presupposition) of the argument's veracity. They include forget in the tensed complement, regret, manage:

    John forgot that he was in Boston last year.

    Mary regrets that she didn't marry John.

    Counterfactive: The event introduces a presupposition about the non-veracity of its argument: forget (to), unable to (in past tense), prevent, cancel, avoid, decline, etc.

    John forgot to buy some wine.

    John prevented the divorce.

    Evidential: Evidential relations are introduced by REPORTING or PERCEPTION:

    John said he bought some wine.

    Mary saw John carrying only beer.

    Negative evidential: Introduced by REPORTING (and PERCEPTION?) events conveying negative polarity:

    John denied he bought only beer.

    Negative: Introduced only by negative particles (not, nor, neither, etc.), which will be marked as SIGNALs, with respect to the events they are modifying:

    John didn't forget to buy some wine.

    John did not want to marry Mary.


    Role of the machine in human annotation l.jpg

    Role of the machine in human annotation

    • In cases of dense annotation (events, pos tags, word-sense tags, etc.), it can be too tedious for a human to annotate everything

    • In such cases, it’s helpful to have a computer program pre-annotate the data that the human then corrects

    • The machine can also interact to flag invalid entries

    • The machine can also provide visualization

    • The machine can also augment the annotation with information that can be inferred


    Annotating chronology in the marathon l.jpg

    Annotating Chronology in The Marathon


    Pre closure l.jpg

    Pre-Closure


    Post closure l.jpg

    Post-Closure
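The pre-/post-closure figures are not reproduced here; the idea, presumably, is that the annotator supplies a sparse set of links and the machine expands it by temporal reasoning (cf. the earlier point that the machine can augment the annotation with inferred information). A minimal sketch of such a closure step over BEFORE and INCLUDES links (the composition table is a small illustrative subset; compositions that are not determinate, such as INCLUDES followed by BEFORE, are simply omitted):

    # Expand a sparse set of temporal links by computing a (partial) closure.
    COMPOSE = {
        ("BEFORE", "BEFORE"): "BEFORE",        # a before b, b before c  =>  a before c
        ("INCLUDES", "INCLUDES"): "INCLUDES",  # a includes b, b includes c  =>  a includes c
        ("BEFORE", "INCLUDES"): "BEFORE",      # a before b, b includes c  =>  a before c
    }

    def closure(links):
        links = set(links)
        changed = True
        while changed:
            changed = False
            for (a, r1, b) in list(links):
                for (b2, r2, c) in list(links):
                    if b == b2 and (r1, r2) in COMPOSE and a != c:
                        inferred = (a, COMPOSE[(r1, r2)], c)
                        if inferred not in links:
                            links.add(inferred)
                            changed = True
        return links

    # The marathon example: push BEFORE twist, run INCLUDES twist, day INCLUDES run.
    annotated = {("push", "BEFORE", "twist"),
                 ("run", "INCLUDES", "twist"),
                 ("day-02172004", "INCLUDES", "run")}
    print(sorted(closure(annotated)))   # adds ('day-02172004', 'INCLUDES', 'twist')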


    Automatic timex2 tagging l.jpg

    Automatic TIMEX2 tagging

    • http://complingone.georgetown.edu/~linguist/


    Timeml annotation issues l.jpg

    Problems

    Weaknesses in guidelines

    Links between subordinate clause and main clause of same/diff sentence

    Difficulties in annotating states

    Granularity of temporal relations (72% agreement on temporal relations on common links)

    Density of links. Number of links is quadratic in the number of events, but less than half the eventualities are linked.

    So, inter-annotator agreement on links likely to be low.

    Solutions

    Adding more annotation conventions

    Lightening the annotation.

    Expanding annotation using temporal reasoning.

    Using heavily mixed-initiative approach

    Providing user with visualization tools during annotation.

    Note: such problems are characteristic of semantic and discourse-level annotations!

    TimeML Annotation Issues


    Timebank browser and timeml tools l.jpg

    TimeBank Browser and TimeML tools

    • http://corpora.dutchboy.net/timebank/

    • http://complingone.georgetown.edu/~linguist/


    Strategy for automatically inferring linguistic information l.jpg

    Strategy for Automatically Inferring Linguistic Information

    • Develop a corpus of TimeML annotated documents

      • TimeML represents temporal adverbials, tense, grammatical aspect, temporal relations

      • Takes into account subordination and (to an extent) vagueness

      • Work on metric constraints for durations of states is ongoing (Hobbs)

    • Develop initial computer taggers to tag Events, Times, and Links in the corpus

    • Correct the corpus using a human

    • Ensure that the annotations can be reproduced accurately

      • Inter-annotator reliability

    • Use the corpus to train improved computer taggers


    At the florist s mini story l.jpg

    At the Florist’s (mini-story)

    • a. John went into the florist shop.

    • b. He had promised Mary some flowers.

    • c. She said she wouldn’t forgive him if he forgot.

    • d. So he picked out three red roses.

    • From (Webber 1988)


    Chronology of at the florist s l.jpg

    Chronology of At the Florist’s


    At the florist s a rhetorical structure theory account l.jpg

    Assumes abstract nodes which are Rhetorical Relations

    Rhetorical relation annotations are not easily reproduced

    question of inter-annotator reliability

[Tree diagram of the rhetorical structure: an abstract Narration node dominates an Explanation subtree and Ed; the Explanation subtree contains Ea and an Elaboration subtree over Eb and Ec]

    At the Florist’s: A Rhetorical Structure Theory account


    Temporal relations as surrogates for rhetorical relations l.jpg

    When E1 is left-sibling of E2 and E1 < E2, then typically, Narration(E1, E2)

    When E1 is right-sibling of E2 and E1 < E2, then typically Explanation(E2, E1)

    When E2 is a child node of E1, then typically Elaboration(E1, E2)

    Temporal Relations as Surrogates for Rhetorical Relations

    a. John went into the florist shop.

    b. He had promised Mary some flowers.

    c. She said she wouldn’t forgive him if he forgot.

    d. So he picked out three red roses.

[Diagram: Expl, Elab, and Narr arcs linking clauses (a)-(d) above]

    constraints: {Eb < Ec, Ec < Ea, Ea < Ed}
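A minimal sketch of the mapping just described, taking a structural position and a temporal order and returning a default rhetorical relation (the string encoding of tree positions is an illustrative assumption):

    # Map tree position plus temporal order to a default rhetorical relation,
    # following the three heuristics stated above.
    def rhetorical_relation(structure, e1_precedes_e2):
        # structure: 'E1-left-sibling-of-E2', 'E1-right-sibling-of-E2', or 'E2-child-of-E1'
        if structure == "E1-left-sibling-of-E2" and e1_precedes_e2:
            return "Narration(E1, E2)"
        if structure == "E1-right-sibling-of-E2" and e1_precedes_e2:
            return "Explanation(E2, E1)"
        if structure == "E2-child-of-E1":
            return "Elaboration(E1, E2)"
        return "Unknown"

    print(rhetorical_relation("E1-left-sibling-of-E2", True))   # Narration(E1, E2)
    print(rhetorical_relation("E2-child-of-E1", False))         # Elaboration(E1, E2)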


    Temporal discourse model annotation conventions l.jpg

    Temporal Discourse Model Annotation Conventions

    • Each tree is rooted in an abstract node.

    • In the absence of any temporal adverbials or discourse markers, a tense shift will license the creation of an abstract node, with the tense shifted event being the leftmost daughter of the abstract node. The abstract node will then be inserted as the child of the immediately preceding text node.

    • In the absence of temporal adverbials and discourse markers, a stative event will always be placed as a child of the immediately preceding text event when the latter is non-stative, and as a sibling of the previous event when the latter is stative (as in a scene-setting fragment of discourse).


    Representing states l.jpg

    Approach: Minimality

    A tensed stative predicate is represented as a node in the tree (progressives are treated as stative).

    John walked home. He was feeling great.

    We represent the state of feeling great as being minimally a part of the event of walking, without committing to whether it extends before or after the event

    A constraint is added to C indicating that this inclusion is minimal.

    Problem: Incompleteness

Max entered the room. He was wearing a black shirt.

    The system will not know whether the shirt was worn after he entered the room.

    Representing States


    Tdms and drt l.jpg

    TDMs and DRT

    EaEbEcxyzt1t2t3 [

    enter(Ea, x, theWhiteHart) & man (x) & PROG(wear(Eb, x, y) & black-jacket(y) &serve(Ec, Bill, x, z) & beer(z) & t1 < n & Ea t1 & t2 < n & Eb o t2 & Eb Ea & t3 < n & Ec t3 & Ea < Ec]


    What s needed for computing tdms l.jpg

    What’s Needed for Computing TDMs?

    • A Corpus of TDMs, annotated with high inter-annotator reliability

    • ‘Syntactic’ parsers for TDMs, trained on the corpus


    Conclusion l.jpg

    Conclusion

    • There are lots of computational tools for manual and automatic annotation of linguistic data and exploration of linguistic hypotheses

    • The automatic tools aren’t perfect, but neither are humans!

    • An annotation scheme must be tested using guidelines and inter-annotator reliability

    • Annotations must be prepared and used within standard XML-based frameworks

    • There are many costs and tradeoffs in corpus preparation

    • The resulting annotated corpora and tools can considerably speed up the pace of linguistic research


    Desiderata for indian language work l.jpg

    Desiderata for Indian Language Work

    • The data needs to be encoded using standard character encoding schemes – UNICODE, or else ISCII

    • Annotation needs to follow the best-practices methodology, including proof of replicability, and XML representation

    • Experience has shown that linguists and computer scientists can work in synergy on this

    • Once corpora are prepared according to these guidelines, automatic tools can be developed in India and abroad and used to improve linguistic processing of Indian languages

      • Morphological analyzers, stemmers, etc.

      • Part-of-speech taggers

      • Syntactic Parsers

      • Word-Sense Disambiguators

      • Temporal Taggers

      • Information Extraction Systems

      • Text Summarizers

      • Statistical MT Systems

      • etc.


    Free resources contact me l.jpg

    Free Resources (contact me)

    • TIMEX2 corpora and tools: timex2.mitre.org (English, Korean, Spanish)

    • TimeML and annotation tools: www.timeml.org

    • AQUAINT corpus, and TimeML software: watch this space

    • PRONTO and iprolink corpora, guidelines, tagsets

    • (see my web site)

    Thank You!


    The changing environment l.jpg

    The Changing Environment

    • If statistical rules induced from examples perform just as well as rules derived from intuition, then this suggests that probabilistic statistical linguistic rules might help explain or model human linguistic behavior.

    • It also suggests that humans might learn from experience by means of induction using statistical regularities.

    • For many years, corpus linguistic research rarely examined statistics above the level of words, due to the lack of availability of broad-coverage parsers and statistical models that could handle syntax and other levels of ‘hidden structure’ (Manning 2003).

    • The present climate, with plenty of tools and statistical models, should allow corpus linguistics to extend its descriptive and explanatory scope dramatically.


    Ngrams details l.jpg

    Ngrams Details

    • Consider a sequence of words W1…Wn, “I saw a rabbit”.

    • What’s P(W1…Wn)? Note that we can’t find sequences of length n, and count them - there won’t be enough data.

    • Chain Rule of probability:

      P(W1, .. ,Wn ) =

      P(W1)P(W2|W1) P(W3|W1,W2)..P(Wn|W1,W2, ..,Wn-1 )

      • But you still have the problem of lacking enough data

    • Bigram model

      • Approximates P(Wn|W1…Wn-1) by P(Wn|Wn-1)

      • Assumes the probability of a word depends just on the previous word. This means, that you don’t have to look back more than one word.

      • P(I saw a rabbit) = P(I|<s>)*P(saw|I)*P(a|saw)*P(rabbit|a)

      • More generally: P(W1…Wn) ≈ Π i=1..n P(Wi | Wi-1)

    • A trigram model, would look 2 words back into the past

      • P(I saw a rabbit) = P(I|<s> <s>) * P(saw|<s> I) * P(a|I saw) * P(rabbit|saw a)
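A minimal sketch of estimating and applying such a bigram model from a toy corpus (maximum-likelihood estimates with no smoothing; real models need smoothing to handle unseen bigrams, i.e. the data-sparseness problem discussed earlier):

    # Toy bigram language model: relative-frequency estimates from a tiny corpus.
    from collections import Counter

    corpus = [["<s>", "I", "saw", "a", "rabbit"],
              ["<s>", "I", "saw", "a", "cat"]]

    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        for w1, w2 in zip(sent, sent[1:]):
            unigrams[w1] += 1
            bigrams[(w1, w2)] += 1

    def p_bigram(w2, w1):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

    def p_sentence(words):
        prob = 1.0
        for w1, w2 in zip(["<s>"] + words, words):
            prob *= p_bigram(w2, w1)          # P(w2 | w1)
        return prob

    print(p_sentence(["I", "saw", "a", "rabbit"]))   # 1 * 1 * 1 * 0.5 = 0.5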


    Pos tagging based on n grams l.jpg

    POS Tagging Based on N-grams

    • Problem: Find C which maximizes P(W | C) * P(C)

    • Here W=W1..Wn and C=C1..Cn (these were sequences, remember?)

      P(W1, .. ,Wn ) =

      P(W1)P(W2|W1) P(W3|W1,W2)..P(Wn|W1,W2, ..,Wn-1 )

      • Using the bigram model, we get:

        P(W1…Wn | C1…Cn) ≈ Π i=1..n P(Wi | Ci)

        P(C1…Cn) ≈ Π i=1..n P(Ci | Ci-1)

    • So, we want to find the value of C1..Cn which maximizes:

      Π i=1..n P(Wi | Ci) * P(Ci | Ci-1)

P(Ci | Ci-1): POS bigram probabilities, estimated from training data. P(Wi | Ci): lexical generation probabilities, estimated from training data.
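A minimal sketch of finding the maximizing tag sequence with a Viterbi-style search, given toy lexical-generation and tag-bigram probabilities (the probability tables below are made up for illustration):

    # Viterbi search for argmax over tag sequences of
    # product_i P(w_i | c_i) * P(c_i | c_{i-1}), with toy probability tables.
    lexical = {                       # P(word | tag)
        ("I", "PRON"): 0.5, ("saw", "VERB"): 0.3, ("saw", "NOUN"): 0.1,
        ("a", "DET"): 0.6, ("rabbit", "NOUN"): 0.2,
    }
    transition = {                    # P(tag | previous tag)
        ("<s>", "PRON"): 0.4, ("PRON", "VERB"): 0.5, ("PRON", "NOUN"): 0.1,
        ("VERB", "DET"): 0.4, ("NOUN", "DET"): 0.2, ("DET", "NOUN"): 0.5,
    }
    TAGS = ["PRON", "VERB", "NOUN", "DET"]

    def viterbi(words):
        best = {"<s>": (1.0, [])}                     # previous tag -> (prob, tag sequence)
        for word in words:
            new_best = {}
            for tag in TAGS:
                p_word = lexical.get((word, tag), 0.0)
                for prev, (prob, path) in best.items():
                    p = prob * transition.get((prev, tag), 0.0) * p_word
                    if p > new_best.get(tag, (0.0, []))[0]:
                        new_best[tag] = (p, path + [tag])
            best = new_best
        return max(best.values(), key=lambda pair: pair[0]) if best else (0.0, [])

    print(viterbi(["I", "saw", "a", "rabbit"]))   # best tag sequence and its probability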


    Problems in event anchoring l.jpg

    Problems in Event Anchoring

    • States

      • John walked home. He was feeling great.

        • How long does “feeling great” last?

          • => We need a “minimal” duration for states

      • a. Mary entered the President’s Office. b. A copy of the budget was on the president’s desk. c. The president’s financial advisor stood beside it. d. The president sat regarding both admiringly. e. The advisor spoke. (Dowty 1986)

        • Was the budget on the desk before she entered the office?

          • => “perceived scene” presents an imperfective view of states, not indicating their true onsets

    • Vagueness

      The attack lasted 2-3 weeks.

      Recently, Holly turned 16.

      Next summer, Holly may run

      Three days later, David pushed her

      • => temporal reasoning has to deal with vagueness


    Problems in event anchoring contd l.jpg

    Problems in Event Anchoring (contd)

    • Vagueness (contd)

      • John hurried to Mary’s house after work. But Mary had already left for dinner.

      • => we need to track ‘reference time’ and decide when reference times coincide

    • Modality

      • John should have brought some wine.

        • Did he bring wine? No.

      • John prevented the divorce.

        • Did the divorce happen? No.

        • => we need to know about subordination

    • Implicit Information

      Yesterday, Holly fell. (implicit “on”)

      Holly fell. David pushed her. (implicit “because”)

      • => we need discourse modeling

