
Finite-State Methods in Natural Language Processing

Lauri Karttunen

LSA 2005 Summer Institute

July 18, 2005



Course Outline

  • July 18:

    • Intro to computational morphology

    • XFST

  • Readings

    • Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.

    • Karttunen and Beesley, “25 Years of Finite-State Morphology”

    • Chapter 1: “Gentle Introduction” (B&K)

  • July 20:

    • Regular expressions

    • More on XFST

  • Readings

    • Chapter 2: “Systematic Introduction”

    • Chapter 3: “The XFST interface”


  • July 25

    • Concatenative morphotactics

    • Constraining non-local dependencies

  • Readings

    • Chapter 4. “The LEXC Language”

    • Chapter 5. “Flag Diacritics”

  • July 27

    • Non-concatenative morphotactics

      • Reduplication, interdigitation

  • Readings

    • Chapter 8. “Non-Concatenative Morphotactics”


  • August 1

    • Realizational morphology

  • Readings

    • Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)

    • Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.

  • August 3

    • Optimality theory

  • Readings

    • Paul Kiparsky “Finnish Noun Inflection” Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp.109-161, CSLI Publications, 2003.

    • Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint". Phonology 16. 273-329.



    Getting credit for LSA 207

    • There will be three assignments, one given each Wednesday. The first two are to be turned in by the following Monday, the last one by the following Friday.

    • You will get credit for the course if you solve at least two of the three assignments. The solutions will involve programming in the xfst scripting language. The problems will be easy to solve if you have attended the class.

    • If you have any problems in doing the assignments, Michael Wagner and I will be happy to help you.


    Textbook

    Copies will arrive in the Linguistics Department tomorrow afternoon. You can purchase a copy there tomorrow as soon as the books have arrived. Starting Wednesday, books can be purchased from our TA, Michael Wagner.

    The price is $35. With the book comes a software CD for Solaris, Linux, MacOSX and Windows operating systems.



    LSA 207 Web site

    • http://lsa.dlp.mit.edu/Class/207

    • You can use this username and password to access materials:

    • Username: LSA207

    • Password: seunsehi207

    • You are free to copy, modify and use the slides for whatever purpose provided that you give appropriate credit to the original source.

    • The readings for Wednesday’s class (“Finite-State Constraints”, “25 Years of Finite-State Morphology” and “Gentle Introduction”, Chapter 1 of the B&K book) are posted on the web site.



    Software

    • The software on the Book CD dates back to the Spring of 2003. For an update, point your browser to

      • http://www.stanford.edu/~laurik/.lsa207/

    • Please read the README file and the License Agreement before downloading the software.

    • The updated software supports UTF-8 encoded Unicode input/output. The Book version supports only Latin-1 (ISO-8859-1).

    • The XFST application will be available locally on some computers (ask Michael).

    • Check out the web site for the Book:

      • http://www.fsmbook.com/



    Finite-State Methods in NLP

    • Domains of Application

      • Tokenization

      • Sentence breaking

      • Spelling correction

      • Morphology (analysis/generation)

      • Phonological disambiguation (Speech Recognition)

      • Morphological disambiguation (“Tagging”)

      • Pattern matching (“Named Entity Recognition”)

      • Shallow Parsing

    • Types of Finite-State Systems

      • Classical (non-weighted) automata

      • Weighted automata (arcs associated with weights in a semiring)

      • Binary relations (simple transducers)

      • N-ary relations (multi-tape transducers)


    Computational morphology

    • Analysis: leaves --> leaf N Pl, leave N Pl, leave V Sg3

    • Generation: hang V Past --> hanged, hung



    Two challenges

    • Morphotactics

      • Words are composed of smaller elements that must be combined in a certain order:

        • piti-less-ness is English

        • piti-ness-less is not English

    • Phonological alternations

      • The shape of an element may vary depending on the context

        • pity is realized as piti in pitilessness

        • die becomes dy in dying


    Morphology is regular (=rational)

    • The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation.

    • A regular relation consists of ordered pairs of strings.

      • leaf+N+Pl : leaves

      • hang+V+Past : hung

    • Any finite collection of such pairs is a regular relation.

    • Regular relations are closed under operations such as concatenation, iteration, union, and composition.

      • Complex regular relations can be derived from simple relations.
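The closure properties can be tried out on finite relations. The Python sketch below is not from the lecture: it models a relation as a set of (lexical, surface) string pairs, and the helper names `union`, `concat` and `compose`, as well as the toy alternation pair leafs : leaves, are assumptions made for illustration.

```python
# A finite relation is modeled as a set of (lexical, surface) string pairs.
# All names here are illustrative; none of them is an xfst command.

def union(r, s):
    """Pairs that belong to either relation."""
    return r | s

def concat(r, s):
    """Concatenate every pair of r with every pair of s, side by side."""
    return {(x1 + x2, y1 + y2) for (x1, y1) in r for (x2, y2) in s}

def compose(r, s):
    """Map x to z whenever r maps x to y and s maps y to z."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}

# Simple relations ...
stem = {("leaf", "leaf")}
plural = {("+N+Pl", "s")}
alternation = {("leafs", "leaves")}      # toy stand-in for a phonological rule
irregular = {("hang+V+Past", "hung")}

# ... combined into a more complex one.
noun = concat(stem, plural)              # leaf+N+Pl : leafs
lexicon = union(compose(noun, alternation), irregular)
print(sorted(lexicon))
```

Real lexical relations are usually infinite, which is why they are encoded as networks rather than enumerated as sets; the sketch only shows what each operation means.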


    Morphology is finite-state

    • A regular relation can be defined using the metalanguage of regular expressions.

      • [{talk} | {walk} | {work}] [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];

    • A regular expression can be compiled into a finite-state transducer that implements the relation computationally.


    Compilation

    Regular expression

    • [{talk} | {walk} | {work}] [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];

    Finite-state transducer

    [Figure: the compiled network. Arcs labeled t, w, a, o, l, r and k lead from the initial state through the stems talk, walk and work; arcs labeled +Base:, +3rdSg:s, +Progr:i :n :g and +Past:e :d lead to the final state.]


    Generation

    work+3rdSg --> works

    [Figure: the same transducer with the stem arcs written as identity pairs (t:t, w:w, a:a, o:o, l:l, r:r, k:k) and the suffix arcs +Base:, +3rdSg:s, +Progr:i :n :g and +Past:e :d.]


    Analysis

    talked --> talk+Past

    [Figure: the same transducer with the stem arcs written as identity pairs (t:t, w:w, a:a, o:o, l:l, r:r, k:k) and the suffix arcs +Base:, +3rdSg:s, +Progr:i :n :g and +Past:e :d.]
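Generation and analysis are two ways of walking the same network. The Python sketch below is an illustration rather than the xfst implementation: it builds an unminimized network for the regular expression on the slides (using the +SgGen3 tag from that expression) and reads it on either side. The names `build_network`, `apply_down` and `apply_up` and the arc representation are assumptions.

```python
import itertools

EPS = ""  # epsilon: nothing is read or written on this side of the arc

def build_network():
    """Return (arcs, start, final) for the stem + suffix expression."""
    start, middle, final = 0, 1, 2
    fresh = itertools.count(3)      # supplies new intermediate states
    arcs = {}                       # state -> [(upper, lower, target), ...]

    def add_path(source, pairs, target):
        for i, (upper, lower) in enumerate(pairs):
            nxt = target if i == len(pairs) - 1 else next(fresh)
            arcs.setdefault(source, []).append((upper, lower, nxt))
            source = nxt

    for stem in ("talk", "walk", "work"):
        add_path(start, [(c, c) for c in stem], middle)
    add_path(middle, [("+Base", EPS)], final)
    add_path(middle, [("+SgGen3", "s")], final)
    add_path(middle, [("+Progr", "i"), (EPS, "n"), (EPS, "g")], final)
    add_path(middle, [("+Past", "e"), (EPS, "d")], final)
    return arcs, start, final

def _apply(net, text, side):
    """Read `text` on one side of the arcs (0 = upper, 1 = lower), collect the other."""
    arcs, start, final = net
    results, stack = set(), [(start, text, "")]
    while stack:
        state, rest, out = stack.pop()
        if state == final and not rest:
            results.add(out)
        for arc in arcs.get(state, []):
            read, write, target = arc[side], arc[1 - side], arc[2]
            if rest.startswith(read):   # an epsilon label matches without consuming
                stack.append((target, rest[len(read):], out + write))
    return results

def apply_down(net, lexical):   # generation
    return _apply(net, lexical, 0)

def apply_up(net, surface):     # analysis
    return _apply(net, surface, 1)

NET = build_network()
print(apply_down(NET, "talk+SgGen3"))   # {'talks'}
print(apply_up(NET, "walked"))          # {'walk+Past'}
```

The sketch does not share the a-l-k arcs between stems the way the minimized network on the slide does; the relation it encodes is the same.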


    XFST Demo 1

    Start xfst:

    % xfst
    xfst[0]:

    Compile a regular expression:

    xfst[0]: regex
    [{talk} | {walk} | {work}]
    [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];

    Apply the result:

    xfst[1]: apply up walked
    walk+Past
    xfst[1]: apply down talk+SgGen3
    talks


    Lexical transducer

    [Figure: a finite-state transducer whose path pairs the lexical string v o u l o i r +IndP +SG +P3 (citation form followed by inflection codes) with the inflected form v e u t.]

    • Bidirectional: generation or analysis

    • Compact and fast

    • Comprehensive systems have been built for over 40 languages:

      • English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, …


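    How lexical transducers are made

    [Figure: Morphotactics are described by a lexicon (a regular expression) that the compiler turns into a lexicon FST. Alternations are described by rules (regular expressions) that the compiler turns into rule FSTs. Composition combines the lexicon FST and the rule FSTs into the lexical transducer (a single FST). Example path: f a t +Adj +Comp on the lexical side, f a t t e r on the surface side.]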


    Sequential Model

    Lexical form --> fst 1 --> Intermediate form --> fst 2 --> ... --> fst n --> Surface form

    An ordered sequence of rewrite rules (Chomsky & Halle ‘68) can be modeled by a cascade of finite-state transducers (Johnson ‘72, Kaplan & Kay ‘81).



    Discovery and Rediscovery

    • C. Douglas Johnson (1972) showed that

      • phonological rewrite rules are interpreted in a way that makes them less powerful than they appear

      • rewrite rules can be modeled by finite transducers

      • for any two finite transducers applied in a sequence there exists an equivalent single transducer (Schützenberger 1961).

    • Johnson’s result was ignored and forgotten; it was rediscovered by Ronald M. Kaplan and Martin Kay at Xerox around 1980.



    Application constraint

    • Phonological rewrite rules are not as powerful as they appear because of the constraint that a rule does not apply to its own output. (Johnson 1972, Kaplan & Kay 1980).


    Sequential application

    k a N p a n
        N -> m / _ p
    k a m p a n
        p -> m / m _
    k a m m a n
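The derivation can be replayed with ordinary string rewriting. The Python sketch below is an illustration that uses the standard `re` module in place of transducers; the function names are made up for the example. Lookahead and lookbehind test the context without consuming it, and `re.sub` checks contexts against its input string, so a rule never feeds itself.

```python
import re

def nasal_to_m(s):
    # N -> m / _ p : rewrite N as m when the next symbol is p
    return re.sub(r"N(?=p)", "m", s)

def p_to_m(s):
    # p -> m / m _ : rewrite p as m when the previous symbol is m
    return re.sub(r"(?<=m)p", "m", s)

lexical = "kaNpan"
intermediate = nasal_to_m(lexical)
surface = p_to_m(intermediate)
print(lexical, "->", intermediate, "->", surface)   # kaNpan -> kampan -> kamman

# Applying the rules in the opposite order gives a different result,
# which is why the cascade is an ordered sequence.
reordered = nasal_to_m(p_to_m(lexical))
print(reordered)                                    # kampan
```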


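    Sequential application in detail

    [Figure: two transducers. The first, for N -> m / _ p, has states 0, 1 and 2 and arcs labeled ?, m, p, N and N:m. The second, for p -> m / m _, has states 0 and 1 and arcs labeled ?, m, p and p:m.]

    k a N p a n
    0 0 0 2 0 0 0
    k a m p a n
    0 0 0 1 0 0 0
    k a m m a n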


    Composition

    [Figure: the single transducer obtained by composing the two rule transducers. It has states 0, 1, 2 and 3 and arcs labeled ?, m, p, N, N:m and p:m.]

    k a N p a n
    0 0 0 3 0 0 0
    k a m m a n

    Parallel Model

    [Figure: fst 1, fst 2, ..., fst n applied in parallel between the lexical form and the surface form.]

    A set of parallel two-level rules (constraints) compiled into finite-state automata interpreted as transducers (Koskenniemi ‘83).
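In the parallel model every constraint inspects the same lexical:surface alignment, and a pairing is accepted only if all constraints accept it. The Python sketch below illustrates that idea on the kaNpan example. The two constraints are a restatement of the rewrite rules in two-level terms made up for this illustration (they do not appear on the slides), and the functions are plain predicates rather than compiled automata.

```python
from itertools import product

# Feasible surface realizations; any other symbol is realized as itself.
ALTERNATIVES = {"N": "Nm", "p": "pm"}

def nasal_rule(pairs):
    """N:m <=> _ p:  (N is realized as m exactly when a lexical p follows)."""
    for i, (lex, surf) in enumerate(pairs):
        if lex == "N":
            before_p = i + 1 < len(pairs) and pairs[i + 1][0] == "p"
            if (surf == "m") != before_p:
                return False
    return True

def p_rule(pairs):
    """p:m <=> :m _  (p is realized as m exactly when a surface m precedes)."""
    for i, (lex, surf) in enumerate(pairs):
        if lex == "p":
            after_m = i > 0 and pairs[i - 1][1] == "m"
            if (surf == "m") != after_m:
                return False
    return True

def generate(lexical, rules):
    """Keep the surface candidates that satisfy all rules at once."""
    choices = [ALTERNATIVES.get(c, c) for c in lexical]
    results = set()
    for surface in product(*choices):
        pairs = list(zip(lexical, surface))
        if all(rule(pairs) for rule in rules):   # intersection of the constraints
            results.add("".join(surface))
    return results

print(generate("kaNpan", [nasal_rule, p_rule]))   # {'kamman'}
```

The second constraint refers to the surface side of the preceding pair, so no intermediate form is needed.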


    Sequential vs. parallel rules

    • Chomsky & Halle 1968: Lexical form --> rule 1 --> Intermediate form --> rule 2 --> ... --> rule n --> Surface form. The rules are composed into a single FST.

    • Koskenniemi 1983: rule 1, rule 2, ..., rule n apply in parallel between the lexical form and the surface form. The rules are intersected into a single FST.


    Rewrite rules

    Yawelmani Vowel Harmony (Kisseberth 1969)

    ? u: t y ? A s
        Epenthesis
    ? u: t I y ? A s
        Harmony
    ? u: t u y ? a s
        Lowering
    ? o: t u y ? a s


    Two-level constraints

    ? u: t 0 y ? A s

    ? o: t u y ? a s

    Epenthesis: Insert u or i (underspecification)

    Harmony: Rounding next to a round V of the same height.

    Lowering: Long u always realized as long o.

    Underlying representation controls all three alternations.



    Rewrite Rules vs. Constraints

    • Two different ways of decomposing the complex relation between lexical and surface forms into a set of simpler relations that can be more easily understood and manipulated.

    • One approach may be more convenient than the other for particular applications.


    The Big Picture

    [Figure: a regular expression (example: a) describes a language or relation (example: {a}) and compiles into a finite-state network (example: a single arc labeled a) that encodes the language or relation.]



    XFST Demo 2

    xfst[0]: define Cat {cat} | {tiger} | {lion};

    defined Cat: 640 bytes. 11 states, 12 arcs, 3 paths. ...

    xfst[0]:

    xfst[0]: set verbose off

    xfst[0]: define Dog {dog} | {spaniel} | {poodle};

    xfst[0]: regex Cat | Dog ;

    xfst[1]: apply up

    apply up> dog

    dog

    apply up> panther

    apply up>

    apply up> END;

    xfst[1]: define Animal

    xfst[0]:



    xfst[0]: regex Cat & Dog;

    xfst[1]: print net

    Sigma: a c d e g i l n o p r s t

    Size: 13, Label Map: Default

    Net:

    Flags: deterministic, pruned, minimized, epsilon_free, ...

    s0: (no arcs)

    xfst[1]:

    xfst[1]: pop

    xfst[0]:

    xfst[0]: regex Animal - Dog;

    xfst[1]: push Cat

    xfst[2]: test equivalent

    1, (0=NO,1=YES)

    xfst[2]: clear

    xfst[0]:
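For finite languages like these, the operations in the demo behave exactly like set operations. The Python sketch below mirrors the session with built-in sets; it is an analogy, not a description of how xfst computes the results on networks.

```python
cat = {"cat", "tiger", "lion"}       # define Cat {cat} | {tiger} | {lion};
dog = {"dog", "spaniel", "poodle"}   # define Dog {dog} | {spaniel} | {poodle};

animal = cat | dog                   # regex Cat | Dog ;
print("dog" in animal)               # apply up> dog      (found)
print("panther" in animal)           # apply up> panther  (no result)

print(cat & dog)                     # regex Cat & Dog ;  (empty: the net has no arcs)
print(animal - dog == cat)           # regex Animal - Dog; push Cat; test equivalent
```

Networks extend the same operations to infinite languages, which sets of strings cannot enumerate.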


    Compiling networks from words

    xfst[0]: read text
    clear
    clever
    ear
    ever
    fat
    father
    ^D
    432 bytes. 10 states, 12 arcs, 6 paths.

    [Figure: the resulting network, with arcs labeled c, l, e, a, r, e, v, e, f, a, t, h.]

    The same network can be built with:

    read text < file
    read regex {clear}|{clever}|{ear}|{ever}|{fat}|{father} ;
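The reported size (10 states, 12 arcs, 6 paths) can be checked independently. The Python sketch below is not how xfst builds the network. It relies on a standard fact about minimal deterministic automata: states correspond to distinct sets of possible continuations, so two prefixes share a state exactly when the same suffixes can follow them. The function names are assumptions.

```python
def minimal_network(words):
    """Return (start, states, arcs) of the minimal deterministic network."""
    words = set(words)
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}

    def residual(prefix):
        # The set of suffixes that can follow this prefix.
        return frozenset(w[len(prefix):] for w in words if w.startswith(prefix))

    states = {residual(p) for p in prefixes}
    arcs = set()
    for state in states:
        for symbol in {s[0] for s in state if s}:
            target = frozenset(s[1:] for s in state if s.startswith(symbol))
            arcs.add((state, symbol, target))
    return residual(""), states, arcs

def count_paths(state, arcs):
    """Number of accepting paths that start in `state`."""
    here = 1 if "" in state else 0     # the empty suffix marks a final state
    return here + sum(count_paths(t, arcs) for (s, _, t) in arcs if s == state)

WORDS = ["clear", "clever", "ear", "ever", "fat", "father"]
start, states, arcs = minimal_network(WORDS)
print(len(states), "states,", len(arcs), "arcs,", count_paths(start, arcs), "paths")
```

For example, clear and ear end in a shared tail, as do clever, ever and father, which is why six words need only ten states.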



    Regular Expression Calculus

    • Symbols

      • Simple symbols vs. symbol pairs

      • Special symbols: ANY, EPSILON

    • Common regular expression operators

      • concatenation, union, intersection, negation, composition

    • Xerox operators

      • contains, restriction, replacement



    Symbols and Labels

    • Single and multicharacter symbols

      • a, b, c, … , +Adj, +SG, ^Fin

    • Special symbols

      • 0 EPSILON

      • ? ANY

    • Symbols vs. symbol pairs

      • In general, no distinction is made between

        • a, the language {“a”}

        • a:a, the identity relation {<“a”, “a”>}


    Common RE Operators

    • concatenation

    • * + iteration

    • | union

    • & intersection*

    • ~ \ - complementation*, minus*

    • .x. : crossproduct

    • .o. composition

      • * = not applicable to regular relations because the result may not be encodable by a finite-state network.


    Iteration

    • A* zero or more concatenations of A

    • A+ one or more concatenations of A

    • ?* the universal language / the universal identity relation

    [Figure: looping-arc networks for ?* (arc ?) and for [a:A | b:B | c:C | d:D | … ]* (arcs a:A, b:B, c:C, d:D).]


    Negation

    • \A any single symbol that is not in A

    • \? the null language

    • ~A any string that is not in A

    [Figure: the networks for a, \a and ~a, each with Sigma: a, ?]


    Crossproduct

    • A .x. B The relation that maps every string in A to every string in B, and vice versa

    • A:B Same as [A .x. B].

    a b c .x. x y

    [a b c] : [x y]

    {abc}:{xy}

    [Figure: the network for these expressions, with arcs a:x, b:y and c:0.]
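The crossproduct pairs whole strings, and the network realizes each pair symbol by symbol, padding the shorter string with epsilon. The Python sketch below illustrates both views; the function names `crossproduct` and `arc_labels` are hypothetical, and padding at the right end is an assumption chosen to match the a:x, b:y, c:0 arcs on the slide.

```python
from itertools import zip_longest

def crossproduct(a, b):
    """Relate every string of language a to every string of language b."""
    return {(x, y) for x in a for y in b}

def arc_labels(upper, lower):
    """Pair symbols left to right, padding the shorter string with 0 (epsilon)."""
    return [f"{u}:{l}" for u, l in zip_longest(upper, lower, fillvalue="0")]

relation = crossproduct({"abc"}, {"xy"})
print(relation)                    # {('abc', 'xy')}
print(arc_labels("abc", "xy"))     # ['a:x', 'b:y', 'c:0']
```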


    Composition

    • A .o. B The relation C such that if A maps x to y and B maps y to z, C maps x to z.

    {abc} .o. [a:A | b:B | c:C | d:D]*

    [Figure: the network for {abc} (arcs a, b, c), the looping network for [a:A | b:B | c:C | d:D]*, and their composition (arcs a:A, b:B, c:C).]


    Xerox RE Operators

    • $ containment

    • => restriction

    • -> @-> replacement

      • Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.


    Containment

    $a is equivalent to [?* a ?*]

    [Figure: the network for $a, with arcs labeled ? and a.]


    Restriction

    a => b _ c

    “Any a must be preceded by b and followed by c.”

    Equivalent expression:

    ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]

    [Figure: the network for the restriction, with arcs labeled ?, a, b and c.]
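The equivalence between the restriction and the longer expression can be checked by brute force on short strings. The Python sketch below is an illustration under stated assumptions: the restriction is approximated by a `re` pattern with lookaround, the longer expression is spelled out as two violation tests, and the alphabet a, b, c, x and the length bound are arbitrary choices.

```python
import re
from itertools import product

# a => b _ c : every symbol is either not a, or an a between b and c.
RESTRICTION = re.compile(r"(?:[^a]|(?<=b)a(?=c))*")

def satisfies_restriction(s):
    return RESTRICTION.fullmatch(s) is not None

def equivalent_expression(s):
    """~[~[?* b] a ?*] & ~[?* a ~[c ?*]] spelled out as two violation tests."""
    positions = [i for i, ch in enumerate(s) if ch == "a"]
    # [~[?* b] a ?*]: some a whose left side does not end in b
    left_violation = any(not s[:i].endswith("b") for i in positions)
    # [?* a ~[c ?*]]: some a whose right side does not start with c
    right_violation = any(not s[i + 1:].startswith("c") for i in positions)
    return not left_violation and not right_violation

# Compare the two definitions on every string of length 0 to 5.
strings = ["".join(t) for n in range(6) for t in product("abcx", repeat=n)]
agree = all(satisfies_restriction(s) == equivalent_expression(s) for s in strings)
print(agree)   # True
```

An exhaustive check up to a fixed length is evidence, not a proof; the identity itself follows from reading each negated term as "no a in the wrong context".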

