Introduction to Computational Linguistics

Words and Finite State Machinery

CLINT-CS Finite State

Acknowledgement

Material derived from/copied from

  • Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
  • Richard Sproat, Lecture notes

Finite State Methods
  • Word-Oriented Application Areas
    • Tokenization
    • Sentence breaking
    • Spelling correction
    • Morphology (analysis/generation)
    • Phonological disambiguation (Speech Recognition)
    • Morphological disambiguation (“Tagging”)
    • Pattern matching (“Named Entity Recognition”)
    • Shallow Parsing

Outline

Words

Regular Languages

Regular Expressions

Finite State Automata

What is a Word?

Some Distinctions

  • Written
  • Spoken
  • Word Type
  • Word Token
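
The type/token distinction can be made concrete with a small count. This is only a sketch: it assumes whitespace delimitation and uses one of the example sentences from later in these slides as sample text.

```python
from collections import Counter

# Sample text (borrowed from a later slide); whitespace is assumed to delimit words.
text = "the cat the dog bit slept"

tokens = text.split()      # word tokens: every occurrence counts
types = set(tokens)        # word types: each distinct form counts once
counts = Counter(tokens)   # how many tokens realise each type

print(len(tokens), "tokens,", len(types), "types")
print(counts.most_common(1))
```

Here "the" is one type realised by two tokens.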

Information Associated with Words
  • Spelling
    • orthographic
    • phonological
  • Syntax
    • POS
    • Valency
  • Semantics
    • Meaning
    • Relationship to other words

Properties of Words
  • Sequence
    • characters
    • phonemes
  • Delimitation
    • whitespace
    • other?
  • Structure
    • simple ("atomic") words
    • complex ("molecular") words

Complex Words
  • Complex words have subparts:
  • e.g. "enlargement": en + large + ment
  • Some subparts are valid words: large
  • Others are prefixes and suffixes: en, ment
  • N.B. The complex word can be built in different ways: (en + large) + ment, or en + (large + ment)

Morphological Processes
  • affixation
    • prefix
    • suffix
    • circumfix: għandi - mgħandix
    • infix: phenidine - phenetidine
  • other morphological processes
    • redoubling (mexa; mexxa)
    • vowel change (swim; swam)

Affixation uses Concatenation

  • prefixes: dis, re, un, en
  • roots: large, charge, infect, code, decide
  • suffixes: ed, ing, ee, er, ly
  • prefixes + roots + suffixes

The Language of Words
  • What kind of formal language is the language of words?
  • One which can be constructed out of
    • A characteristic set of basic symbols (alphabet)
    • A characteristic set of combining operations
      • Union (disjunction)
      • Concatenation
      • Iteration
  • Regular Language; Regular Sets

Characterising Classes of Set

[Diagram relating three labels: MACHINE, CLASS OF SETS or LANGUAGES, NOTATION]

Regular Languages
  • A regular language is a language over a finite alphabet that can be built up from finite sets of strings using one or more of the following operations:
    • Set union
    • Concatenation
    • Iteration, i.e. zero or more repetitions (Kleene star)
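
The three operations can be tried out on small finite languages represented as Python sets of strings. This is an illustrative sketch only: the function names are made up here, and because the true Kleene star is an infinite set, `star` is cut off at an assumed maximum number of repetitions.

```python
def union(l1, l2):
    """Set union of two languages."""
    return l1 | l2

def concat(l1, l2):
    """Every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}

def star(lang, max_reps):
    """Bounded approximation of the Kleene star: 0..max_reps repetitions."""
    result = {""}          # zero repetitions gives the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, lang)
        result |= layer
    return result

a, b = {"a"}, {"b"}
print(sorted(union(a, b)))
print(sorted(concat(union(a, b), {"c"})))
print(sorted(star(a, 3)))
```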

Some things that are regular languages
  • Zero or more a’s followed by zero or more b’s
  • The set of words in an English dictionary
  • Dates
  • URLs
  • English?
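
Two of these can be written down directly as patterns. The sketch below uses Python's `re` module; the date format (YYYY-MM-DD, with no check that the day or month is valid) is an assumption made for illustration, not something the slides specify.

```python
import re

# Zero or more a's followed by zero or more b's.
AB = re.compile(r"a*b*")

# Assumed date format: four digits, two digits, two digits, hyphen-separated.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def matches(pattern, s):
    """True if the whole string s belongs to the language of the pattern."""
    return pattern.fullmatch(s) is not None

print(matches(AB, "aabbb"), matches(AB, "aba"))
print(matches(DATE, "2000-01-31"), matches(DATE, "31/01/2000"))
```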

Some things that are not regular languages
  • Zero or more a’s followed by exactly the same number of b’s
  • The set of all English palindromes (e.g. Madam I'm Adam)
  • The set that includes all sentences of the form
    • the cat slept
    • the cat the dog bit slept
    • the cat the dog the man fed bit slept
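
A recogniser for the first example shows what goes beyond finite-state power. The sketch below (function name invented here) has to remember how many a's it has seen, and that count can grow without bound, whereas a finite automaton has only a fixed number of states to remember things with.

```python
def is_anbn(s):
    """True if s is some number of a's followed by exactly as many b's."""
    i = 0
    n_a = 0
    while i < len(s) and s[i] == "a":   # count the leading a's
        n_a += 1
        i += 1
    n_b = 0
    while i < len(s) and s[i] == "b":   # count the b's that follow
        n_b += 1
        i += 1
    # Nothing may be left over, and the two counts must agree.
    return i == len(s) and n_a == n_b

print(is_anbn("aabb"), is_anbn("aab"))
```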

Some special regular languages
  • The universal language (Σ*)
  • The empty language (Ø)

Note: the empty language is not the same as the empty string

Some closure properties of regular languages
  • Intersection
  • Complementation
  • Difference
  • Reversal
  • Power
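
Two of these closure properties can be sketched on automata. The code below assumes complete deterministic automata stored as plain dictionaries (a representation chosen here for illustration): complementation swaps final and non-final states, and intersection runs both machines in parallel on pairs of states. Difference then follows as intersection with a complement.

```python
from itertools import product

def run(dfa, s):
    """Follow the transitions for s and report whether we end in a final state."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"][(state, ch)]
    return state in dfa["finals"]

def complement(dfa):
    """Same machine with final and non-final states swapped (needs a total delta)."""
    return {**dfa, "finals": dfa["states"] - dfa["finals"]}

def intersect(d1, d2):
    """Product construction: a pair is final only if both components are final."""
    states = set(product(d1["states"], d2["states"]))
    delta = {((p, q), ch): (d1["delta"][(p, ch)], d2["delta"][(q, ch)])
             for (p, q) in states for ch in d1["alphabet"]}
    finals = {(p, q) for (p, q) in states
              if p in d1["finals"] and q in d2["finals"]}
    return {"states": states, "alphabet": d1["alphabet"],
            "start": (d1["start"], d2["start"]), "delta": delta, "finals": finals}

# Example machines over {a, b}: an even number of a's; strings ending in b.
EVEN_A = {"states": {0, 1}, "alphabet": {"a", "b"}, "start": 0, "finals": {0},
          "delta": {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}}
ENDS_B = {"states": {0, 1}, "alphabet": {"a", "b"}, "start": 0, "finals": {1},
          "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}}

both = intersect(EVEN_A, ENDS_B)
print(run(both, "aab"), run(both, "ab"))
```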

Regular Expressions
  • Notation for describing regular sets
  • Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
  • Xerox Finite State tools use a somewhat different notation, but with a similar function.

Regular Expressions

a        a simple symbol
A B      concatenation
A | B    alternation operator
A & B    intersection operator
A*       Kleene star

CLINT-CS Finite State

caveats
Caveats
  • Perl and other languages (see J&M, Chapter 2) have lots of stuff in their “regular expression” syntax. Strictly speaking, not all of these correspond to regular expressions in the formal sense since they don’t describe regular languages.
  • For example, arbitrary substring copying is not expressible as a regular language, though one can do this in Perl (or Python …)

/(…+)\1/
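
The copying behaviour can be observed in Python with a backreference. The pattern `(.+)\1` below is an assumed stand-in chosen for this sketch: it accepts any non-empty string immediately followed by a copy of itself, which is not a regular language.

```python
import re

# \1 must match exactly the text captured by the first group.
COPY = re.compile(r"(.+)\1")

def is_doubled(s):
    """True if s consists of some non-empty string written twice in a row."""
    return COPY.fullmatch(s) is not None

print(is_doubled("abcabc"), is_doubled("abcab"))
```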

Finite Automaton
  • A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:
  • Q is a finite set of states
  • Σ is an alphabet of symbols
  • q0 ∈ Q is a start state
  • F ⊆ Q is the set of final states
  • δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q

State Table

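[Diagram: automaton with states 1 (initial), 2, 3 and 4 (final), and arcs 1 -h-> 2, 2 -a-> 3, 3 -h-> 2, 3 -!-> 4]

Prolog

% arc(From, To, Symbol)
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).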
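
The same machine can be run on input strings. The following is a sketch in Python rather than Prolog (the slides give only the facts, not a recogniser); it keeps a set of current states, so it would also cope with a non-deterministic set of arcs.

```python
INITIAL = 1
FINALS = {4}
# (source state, symbol) -> set of target states, mirroring the arc/3 facts.
ARCS = {
    (1, "h"): {2},
    (2, "a"): {3},
    (3, "!"): {4},
    (3, "h"): {2},
}

def accepts(s):
    """True if some path labelled s leads from the initial to a final state."""
    current = {INITIAL}
    for ch in s:
        current = {t for q in current for t in ARCS.get((q, ch), set())}
    return bool(current & FINALS)

print(accepts("ha!"), accepts("haha!"), accepts("ha"))
```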

Mr. S.K.

Kleene’s theorem
  • The languages accepted by NFAs are exactly the languages described by regular expressions.
  • Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.
  • Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.

Converting a Regular Expression to an NFA
  • The NFA representing the empty string is: 1 -ε-> 2
  • The NFA representing a single character is: 1 -a-> 2

Converting a Regular Expression to an NFA
  • The union operator is represented by a choice of paths from a node, e.g. a|b

[Diagram: two arcs from state 1 to state 2, one labelled a and one labelled b]

Converting a Regular Expression to an NFA
  • Concatenation simply involves connecting one NFA to the other, so that ab is represented by: 1 -a-> 2 -b-> 3

Converting a Regular Expression to an NFA
  • The Kleene star must allow for zero or more occurrences. So a* is represented by

[Diagram: NFA for a*, with one arc labelled a and four ε-arcs that allow the a arc to be skipped or repeated]
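
The four building blocks can be put together in code. The sketch below follows the textbook Thompson-style construction, which is an assumption on my part: it glues fragments together with ε-arcs throughout, so its union and concatenation use more states than the drawings on these slides. All names are invented for the example.

```python
from itertools import count

EPS = None            # label used for ε-arcs in this sketch
_ids = count(1)       # supplies fresh state numbers

def _new():
    return next(_ids)

# A fragment is a triple (start, final, arcs); arcs is a list of (src, label, dst).
def symbol(a):
    s, f = _new(), _new()
    return s, f, [(s, a, f)]

def union(x, y):
    s, f = _new(), _new()
    return s, f, x[2] + y[2] + [(s, EPS, x[0]), (s, EPS, y[0]),
                                (x[1], EPS, f), (y[1], EPS, f)]

def concat(x, y):
    return x[0], y[1], x[2] + y[2] + [(x[1], EPS, y[0])]

def star(x):
    s, f = _new(), _new()
    return s, f, x[2] + [(s, EPS, x[0]), (x[1], EPS, f),
                         (s, EPS, f),        # skip: zero occurrences
                         (x[1], EPS, x[0])]  # loop: one more occurrence

def closure(states, arcs):
    """All states reachable from `states` using ε-arcs only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in arcs:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(frag, s):
    start, final, arcs = frag
    current = closure({start}, arcs)
    for ch in s:
        moved = {dst for src, label, dst in arcs if src in current and label == ch}
        current = closure(moved, arcs)
    return final in current

nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("c"))   # (a|b)*c
print(accepts(nfa, "abbac"), accepts(nfa, "ab"))
```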

Deterministic versus non-deterministic finite automata
  • The definition of finite-state automata given above was for non-deterministic finite automata (NFA):
  • δ is a relation, meaning that from any state and given any symbol, one can in principle transition to any number of states.
  • In deterministic finite automata (DFA), every state/symbol pair maps to a unique state
  • In other words, δ is a function

A deterministic automaton

NFAs vs DFAs
  • NFAs are typically smaller and simpler than their equivalent DFAs
  • Why do we care about DFAs?
  • EFFICIENCY!

Subset Construction for Determinisation
  • Any two states that are connected by an ε-transition may as well be the same, since we can move from one to the other without consuming any character.
  • Thus states which are connected by an ε-transition will be represented by the same state in the DFA.
  • If there are multiple transitions based on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol).
  • Thus these states will be combined into a single DFA state.
  • More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
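
The steps above can be sketched as a short program. It is a minimal version under assumptions of my own: the NFA is a dictionary from (state, symbol) to a set of states, ε is represented by `None`, and the resulting DFA is left partial (a missing transition means rejection). The example machine accepts strings over {a, b} that end in "ab".

```python
EPS = None   # label for ε-transitions in this sketch

def eps_closure(states, arcs):
    """The given states plus everything reachable through ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for dst in arcs.get((q, EPS), ()):
            if dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def determinise(start, finals, arcs, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    d_start = eps_closure({start}, arcs)
    d_delta, d_finals = {}, set()
    todo, seen = [d_start], {d_start}
    while todo:
        subset = todo.pop()
        if subset & finals:                 # contains an NFA final state
            d_finals.add(subset)
        for ch in alphabet:
            moved = {t for q in subset for t in arcs.get((q, ch), ())}
            if not moved:
                continue                    # no transition on ch from this subset
            target = eps_closure(moved, arcs)
            d_delta[(subset, ch)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return d_start, d_finals, d_delta

def run(dfa, s):
    state, finals, delta = dfa
    for ch in s:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in finals

# NFA for (a|b)*ab: state 0 loops on a and b, then an ε-arc leads into "ab".
NFA_ARCS = {(0, "a"): {0}, (0, "b"): {0}, (0, EPS): {1},
            (1, "a"): {2}, (2, "b"): {3}}
dfa = determinise(0, {3}, NFA_ARCS, ["a", "b"])
print(run(dfa, "aab"), run(dfa, "abb"))
```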

Xerox Tools

Finite State Machinery

The Xerox Approach
  • Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo Koskenniemi.
  • Meta-languages for describing regular languages and regular relations.
  • Compiler for mapping meta-language "programs" into efficient FS machinery
  • Several tools and applications

Xerox Tools
  • xfst: Xerox Finite-State Tool
  • lexc: Finite-State Lexicon Compiler
  • twolc: Two-Level Rule Compiler

xfst
  • xfst is a general tool for creating and manipulating finite state networks, both simple automata and transducers.
  • xfst and other Xerox tools employ a special "xfst notation" (more powerful than that used in Unix, Perl, C# etc.)

Simple Regular Expressions
  • Atomic Expressions
    • Simple Symbols
    • Multicharacter Symbols
  • Complex Expressions
    • Union
    • Intersection
    • Concatenation

xfst Notation Examples

A|B    Union
A&B    Intersection
A B    Concatenation
A*     Closure (Kleene Star)
(A)    Optional Element
?      Any symbol
\b     Any symbol other than b
~A     Complement (= [?* - A])
0      Empty string language
$A     Contains A (= [ ?* A ?* ])

Concatenation over Reg. Expression and Language

Regular Expression
E1 = [a|b]
E2 = [c|d]
E1 E2 = [a|b] [c|d]

Language
L1 = {"a", "b"}
L2 = {"c", "d"}
L1 L2 = {"ac", "ad", "bc", "bd"}

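Concatenation over FS Automata

[Diagram: an automaton with arcs a and b, plus an automaton with arcs c and d, equals an automaton with arcs a and b followed by arcs c and d]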

Simple Commands
  • In addition to the notation there are also commands, e.g.
    • define: give a name to an RE
    • print: print information
    • read: read information
    • various stack operations
    • file interaction
    • various command line options

define command
  • define name regexp

xfst[0]: define foo [d o g] | [c a t];
xfst[0]: define R1 [a | b | c | d];
xfst[0]: define R2 [d | e | f | g];
xfst[0]: define R3 [f | g | h | i | j];

print command
  • print words name - see the words in the language called name
  • print net name - see detailed information about the network name.

xfst[0]: print words foo;
xfst[0]: define baz R1 & R2;
xfst[0]: print net baz;
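
Since these languages are finite, what the definitions denote can be checked with ordinary Python sets. This is a model of the languages only; it does not reproduce xfst itself or the format of its printed output.

```python
# [d o g] | [c a t]: each bracket concatenates three symbols into one word.
foo = {"dog", "cat"}

# R1, R2, R3 are unions of single symbols.
R1 = set("abcd")
R2 = set("defg")
R3 = set("fghij")

# define baz R1 & R2: intersection keeps only the shared strings.
baz = R1 & R2

print(sorted(foo))
print(sorted(baz))
```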

Stack Example

xfst[0]: clear stack;
xfst[0]: read regex [e d | i n g | s |[]]
xfst[1]: read regex [t a l k | k i c k]
xfst[2]: print stack
xfst[2]: print net
xfst[2]: print words
xfst[2]: concatenate net
xfst[1]: print words
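
The effect of this session can be modelled with a Python list used as a stack of word sets. The sketch assumes that concatenation takes the networks from the top of the stack downwards, which is consistent with the suffixes being pushed before the stems here; the function names are invented and this is not xfst's implementation.

```python
stack = []

def read_regex(language):
    """Push a (finite) language onto the stack."""
    stack.append(set(language))

def concatenate_net():
    """Replace the whole stack by one language, concatenating from the top down."""
    result = stack.pop()
    while stack:
        lower = stack.pop()
        result = {x + y for x in result for y in lower}
    stack.append(result)

read_regex({"ed", "ing", "s", ""})   # [e d | i n g | s | []], [] being the empty string
read_regex({"talk", "kick"})         # [t a l k | k i c k]
concatenate_net()

words = sorted(stack[-1])
print(len(stack), words)
```

The stack depth drops from 2 to 1, matching the prompt changing from xfst[2] to xfst[1].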

lexc

[Diagram: Source File -> lexc -> Compiled Network]

  • lexc is a high-level programming language and compiler that is well suited for defining NL lexicons.
  • The output is a compiled FS network in the same format as that used by the other Xerox tools (xfst, twolc).

lexc source file

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex0-lex.txt
LEXICON Root
dine #;
dines #;
dined #;
line #;
lines #;
lined #;
END
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Lexc Sublexicons

! ex1-lex.txt

LEXICON Root
Noun;
Verb;

LEXICON Noun
line NounSuffix;

LEXICON Verb
dine VerbSuffix;
line VerbSuffix;

LEXICON NounSuffix
s #;
#;

LEXICON VerbSuffix
s #;
d #;
#;
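
How the continuation classes chain together can be traced with a small recursive expansion. This is a sketch of the idea only: the dictionary below transcribes ex1-lex.txt by hand, and the code enumerates paths rather than building a network as lexc does.

```python
# Each sublexicon maps to a list of (form, continuation class) entries.
# An empty form means the entry contributes no characters; "#" ends the word.
LEXICONS = {
    "Root": [("", "Noun"), ("", "Verb")],
    "Noun": [("line", "NounSuffix")],
    "Verb": [("dine", "VerbSuffix"), ("line", "VerbSuffix")],
    "NounSuffix": [("s", "#"), ("", "#")],
    "VerbSuffix": [("s", "#"), ("d", "#"), ("", "#")],
}

def expand(name="Root", prefix=""):
    """Yield one string per path from `name` to the end-of-word marker."""
    if name == "#":
        yield prefix
        return
    for form, continuation in LEXICONS[name]:
        yield from expand(continuation, prefix + form)

paths = list(expand())       # one entry per path, duplicates included
words = sorted(set(paths))   # the distinct word forms

print(len(paths), "paths,", len(words), "words:", words)
```

Forms reachable both as a noun and as a verb appear more than once among the paths but only once among the words.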

lexc
  • The resulting lexicon contains the same six words
  • The form lines actually gets constructed twice, once as a verb, once as a noun.
  • After minimization, only one of them remains.
  • The compiler first processes each sublexicon separately, keeping track of continuation pointers, and then joins the structures to a single network which is determinized and minimized.

Resulting FSA

[Diagram: the resulting network, whose seven arcs are labelled d, l, i, n, e, s and d]

Running lexc

lexc> compile-source ex1-lex.txt
Opening 'ex1-lex.txt'...
Root...2, Noun...1, Verb...2, NounSuffix...2, VerbSuffix...3
Building lexicon...Minimizing...Done!
SOURCE: 6 states, 7 arcs, 6 words
lexc>


Conclusion

[Diagram relating three labels: MACHINE, CLASS OF SETS or LANGUAGES, NOTATION, with xfst and lexc added]
