## Introduction to Computational Linguistics

Acknowledgement

Material derived from/copied from

- Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
- Richard Sproat, Lecture notes

CLINT-CS Finite State

Finite State Methods

- Word-Oriented Application Areas
- Tokenization
- Sentence breaking
- Spelling correction
- Morphology (analysis/generation)
- Phonological disambiguation (Speech Recognition)
- Morphological disambiguation (“Tagging”)
- Pattern matching (“Named Entity Recognition”)
- Shallow Parsing

Information Associated with Words

- Spelling
- orthographic
- phonological
- Syntax
- POS
- Valency
- Semantics
- Meaning
- Relationship to other words

Properties of Words

- Sequence
- characters
- phonemes
- Delimitation
- whitespace
- other?
- Structure
- simple ("atomic") words
- complex ("molecular") words

Complex Words

- Complex words have subparts:
- e.g. "enlargement": en + large + ment
- Some subparts are valid words: large
- Others are prefixes and suffixes: en, ment
- N.B. The complex word can be built in different ways: (en + large) + ment, or en + (large + ment)

Morphological Processes

- affixation
- prefix
- suffix
- circumfix: għandi - mgħandix
- infix: phenidine; phenetidine
- other morphological processes
- redoubling (mexa; mexxa)
- vowel change (swim; swam)

Affixation uses Concatenation

- prefixes: dis, re, un, en
- roots: large, charge, infect, code, decide
- suffixes: ed, ing, ee, er, ly

prefix + root + suffix
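The prefix + root + suffix scheme above can be sketched in a few lines of Python. This is only an illustration: the empty string is added here (an assumption, not on the slide) to stand for "no prefix" or "no suffix", and the function name `candidates` is hypothetical. Plain concatenation over-generates and ignores spelling changes, so it also produces non-words such as "largeed".

```python
from itertools import product

# Affix and root lists taken from the slide; "" means "no affix" (assumption).
prefixes = ["", "dis", "re", "un", "en"]
roots = ["large", "charge", "infect", "code", "decide"]
suffixes = ["", "ed", "ing", "ee", "er", "ly"]

def candidates(prefixes, roots, suffixes):
    """Concatenate every prefix, root and suffix as plain strings."""
    return {p + r + s for p, r, s in product(prefixes, roots, suffixes)}

forms = candidates(prefixes, roots, suffixes)
print("enlarge" in forms, "disinfect" in forms)  # real words are generated
print("largeed" in forms)                        # but so are non-words
```

A real morphological analyser restricts which affixes combine with which roots and handles spelling changes at the boundaries.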

The Language of Words

- What kind of formal language is the language of words?
- One which can be constructed out of
- A characteristic set of basic symbols (alphabet)
- A characteristic set of combining operations
- Union (disjunction)
- Concatenation
- Iteration
- Regular Language; Regular Sets

Regular Languages

- A regular language is a language over a finite alphabet that can be built from the empty language and single-symbol languages by finitely many applications of the following operations:
- Set union
- Concatenation
- Kleene star (closure under concatenation: zero or more repetitions)
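The three operations can be tried out on small finite languages represented as Python sets of strings. This is a minimal sketch: the function names are illustrative, and because the Kleene star of a non-empty language is infinite, `star` takes a hypothetical `max_reps` bound and returns only a finite approximation.

```python
def union(l1, l2):
    """Set union of two languages."""
    return l1 | l2

def concat(l1, l2):
    """Every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}

def star(lang, max_reps):
    """Finite approximation of Kleene star: 0 to max_reps repetitions."""
    result = {""}
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, lang)
        result |= layer
    return result

A = {"a"}
B = {"b"}
print(sorted(union(A, B)))   # ['a', 'b']
print(sorted(concat(A, B)))  # ['ab']
print(sorted(star(A, 3)))    # ['', 'a', 'aa', 'aaa']
```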

Some things that are regular languages

- Zero or more a’s followed by zero or more b’s
- The set of words in an English dictionary
- Dates
- URLs
- English?

Some things that are not regular languages

- Zero or more a’s followed by exactly the same number of b’s
- The set of all English palindromes (e.g. Madam I'm Adam)
- The set of all center-embedded sentences of the form
- the cat slept
- the cat the dog bit slept
- the cat the dog the man fed bit slept
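The contrast between a regular and a non-regular language can be made concrete in Python. In this sketch (function names are illustrative) the first language, zero or more a's followed by zero or more b's, is checked with an ordinary regular expression, while the second, equal numbers of a's and b's, is checked by counting, which needs memory that grows with the input and so lies beyond any fixed finite set of states.

```python
import re

A_STAR_B_STAR = re.compile(r"a*b*")

def in_a_star_b_star(s):
    """Regular: zero or more a's followed by zero or more b's."""
    return A_STAR_B_STAR.fullmatch(s) is not None

def in_an_bn(s):
    """Not regular: n a's followed by exactly n b's, checked by counting."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

print(in_a_star_b_star("aab"), in_an_bn("aab"))    # True False
print(in_a_star_b_star("aabb"), in_an_bn("aabb"))  # True True
```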

Some special regular languages

- The universal language (Σ*)
- The empty language (Ø)

Note: the empty language is not the same as the empty string

Some closure properties of regular languages

- Intersection
- Complementation
- Difference
- Reversal
- Power
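Closure under intersection can be illustrated with the standard product construction, which runs two deterministic automata in lockstep. The sketch below is an assumption-laden illustration, not part of the lecture: automata are stored as plain dictionaries, and the names `dfa_accepts`, `intersect`, `even_as` and `ends_in_b` are hypothetical.

```python
def dfa_accepts(dfa, s):
    """Follow one transition per symbol; reject if a transition is missing."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"].get((state, ch))
        if state is None:
            return False
    return state in dfa["finals"]

def intersect(d1, d2):
    """Product construction: states are pairs (state of d1, state of d2)."""
    start = (d1["start"], d2["start"])
    delta, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        p, q = todo.pop()
        if p in d1["finals"] and q in d2["finals"]:
            finals.add((p, q))
        for (src, ch), dst1 in d1["delta"].items():
            if src != p or (q, ch) not in d2["delta"]:
                continue
            nxt = (dst1, d2["delta"][(q, ch)])
            delta[((p, q), ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return {"start": start, "delta": delta, "finals": finals}

# Example machines over {a, b}: an even number of a's, and strings ending in b.
even_as = {"start": 0, "finals": {0},
           "delta": {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}}
ends_in_b = {"start": 0, "finals": {1},
             "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}}

both = intersect(even_as, ends_in_b)
print(dfa_accepts(both, "aab"))  # True: two a's, ends in b
print(dfa_accepts(both, "ab"))   # False: odd number of a's
```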

Regular Expressions

- Notation for describing regular sets
- Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
- Xerox finite-state tools use a somewhat different notation with similar function.

Regular Expressions

- a: a simple symbol
- A B: concatenation
- A | B: alternation operator
- A & B: intersection operator
- A*: Kleene star

Caveats

- Perl and other languages (see J&M, Chapter 2) have many extensions in their “regular expression” syntax. Strictly speaking, not all of these correspond to regular expressions in the formal sense, since some of them describe languages that are not regular.
- For example, arbitrary substring copying is not expressible as a regular language, though one can do this in Perl (or Python …)

/(…+)\1/
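The same effect can be seen with Python's `re` module, which also supports backreferences. The pattern below is a sketch written for this illustration rather than a transcription of the slide's pattern: `(.+)\1` matches any non-empty string immediately followed by a copy of itself, a language that no finite automaton can recognise. The function name `is_copy` is hypothetical.

```python
import re

# \1 must match exactly the text captured by group 1.
COPY = re.compile(r"(.+)\1")

def is_copy(s):
    """True if s is some non-empty string written twice in a row."""
    return COPY.fullmatch(s) is not None

print(is_copy("abab"))    # True
print(is_copy("abcabc"))  # True
print(is_copy("aba"))     # False
```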

Finite Automaton

- A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:
- Q is a finite set of states
- Σ is a finite alphabet of symbols
- q0 ∈ Q is the start state
- F ⊆ Q is the set of final states
- δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q
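The quintuple maps directly onto a small data structure. The following Python sketch stores δ as a set of (q, σ, q') triples and decides acceptance by tracking every state the automaton could be in; ε-transitions are left out for brevity. The dictionary layout, the function name `nfa_accepts` and the example machine `ends_in_ab` are assumptions made for illustration.

```python
def nfa_accepts(nfa, s):
    """Track the set of states the automaton could be in after each symbol."""
    current = {nfa["start"]}
    for ch in s:
        current = {q2 for (q, sym, q2) in nfa["delta"]
                   if q in current and sym == ch}
    return bool(current & nfa["finals"])

# Hypothetical example: strings over {a, b} that end in "ab".
ends_in_ab = {
    "states": {0, 1, 2},                       # Q
    "alphabet": {"a", "b"},                    # Σ
    "start": 0,                                # q0
    "finals": {2},                             # F
    "delta": {(0, "a", 0), (0, "b", 0),        # δ as a relation:
              (0, "a", 1), (1, "b", 2)},       # state 0 has two a-arcs
}

print(nfa_accepts(ends_in_ab, "bab"))  # True
print(nfa_accepts(ends_in_ab, "aba"))  # False
```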

Representation of FSA’s: State Diagram

State Table

Mr. S.K.

Kleene’s theorem

- The languages accepted by NFAs are exactly the languages described by regular expressions.
- Kleene’s Theorem, part 1: To each regular expression there corresponds an NFA.
- Kleene’s Theorem, part 2: To each NFA there corresponds a regular expression.

Converting a Regular Expression to an NFA

- The NFA representing the empty string is: 1 --ε--> 2
- The NFA representing a single character a is: 1 --a--> 2

Converting a Regular Expression to an NFA

- The union operator is represented by a choice of paths from a node, e.g. a|b: two arcs from state 1 to state 2, one labelled a and one labelled b

Converting a Regular Expression to an NFA

- Concatenation simply involves connecting one NFA to the other, so that ab is represented by 1 --a--> 2 --b--> 3

Converting a Regular Expression to an NFA

- The Kleene star must allow for zero or more occurrences. So a* is represented by an arc labelled a plus ε-transitions that let the automaton skip it or loop back and take it again.

Deterministic versus non-deterministic finite automata

- The definition of finite-state automata given above was for non-deterministic finite automata (NFA):
- δ is a relation, meaning that from any state and given any symbol, one can in principle transition to any number of states.
- In deterministic finite automata (DFA), every state/symbol pair maps to a unique state
- In other words, δ is a function

A deterministic automaton

NFAs vs DFAs

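- NFAs are typically smaller and simpler than their equivalent DFAs
- Why do we care about DFAs?
- Efficiency! A DFA follows exactly one path through its states, so recognition takes time linear in the length of the input.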

Equivalence of NFA’s and DFA’s

Subset Construction for Determinisation

- Any two states that are connected by an ε-transition may as well be the same, since we can move from one to the other without consuming any character.
- Thus states which are connected by an ε-transition will be represented by the same state in the DFA.
- If there are multiple transitions on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol).
- Thus these states will be combined into a single DFA state.
- More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html
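The two ideas above, following ε-transitions for free and merging all states reachable on the same symbol, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the NFA's δ is stored as a dictionary from (state, symbol) to a set of states, the empty string is used as the label for ε, and all function and variable names are hypothetical.

```python
EPS = ""  # label used in this sketch for ε-transitions (an assumption)

def eps_closure(states, delta):
    """All states reachable from `states` using only ε-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for q2 in delta.get((q, EPS), ()):
            if q2 not in closure:
                closure.add(q2)
                stack.append(q2)
    return frozenset(closure)

def determinize(start, finals, delta, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    d_start = eps_closure({start}, delta)
    d_delta, d_finals = {}, set()
    todo, seen = [d_start], {d_start}
    while todo:
        S = todo.pop()
        if S & finals:
            d_finals.add(S)
        for sym in alphabet:
            moved = {q2 for q in S for q2 in delta.get((q, sym), ())}
            if not moved:
                continue
            T = eps_closure(moved, delta)
            d_delta[(S, sym)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return d_start, d_finals, d_delta

def dfa_accepts(d_start, d_finals, d_delta, s):
    """Deterministic run: one dictionary lookup per input symbol."""
    state = d_start
    for ch in s:
        state = d_delta.get((state, ch))
        if state is None:
            return False
    return state in d_finals

# Example NFA for "ends in ab", with one ε-transition into the final state.
nfa_delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, EPS): {3}}
d_start, d_finals, d_delta = determinize(0, {3}, nfa_delta, {"a", "b"})

print(dfa_accepts(d_start, d_finals, d_delta, "bab"))  # True
print(dfa_accepts(d_start, d_finals, d_delta, "abb"))  # False
```

In this example the four NFA states collapse into three DFA states: {0}, {0, 1} and {0, 2, 3}.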

The Xerox Approach

- Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo Koskenniemi.
- Meta-languages for describing regular languages and regular relations.
- Compiler for mapping meta-language "programs" into efficient FS machinery
- Several tools and applications

Xerox tools

- xfst: Xerox Finite-State Tool
- lexc: Finite-State Lexicon Compiler
- twolc: Two-Level Rule Compiler

xfst

- xfst is a general tool for creating and manipulating finite-state networks, both simple automata and transducers.
- xfst and other Xerox tools employ a special "xfst notation" (more powerful than that used in Unix, Perl, C# etc.)

Simple Regular Expressions

- Atomic Expressions
- Simple Symbols
- Multicharacter Symbols
- Complex Expressions
- Union
- Intersection
- Concatenation

xfst Notation Examples

- A|B: Union
- A&B: Intersection
- A B: Concatenation
- A*: Closure (Kleene star)
- (A): Optional element
- ?: Any symbol
- \b: Any symbol other than b
- ~A: Complement (= [?* - A])
- 0: Empty string language
- $A: Contains A (= [?* A ?*])

Concatenation over Regular Expression and Language

Regular Expression

E1 = [a|b]

E2 = [c|d]

E1 E2 = [a|b] [c|d]

Language

L1 = {"a", "b"}

L2 = {"c", "d"}

L1 L2 = {"ac", "ad", "bc", "bd"}

Simple Commands

- In addition to the notation there are also commands, e.g.
- define: give a name to an RE
- print: print information
- read: read information
- various stack operations
- file interaction
- various command line options

define command

- define name regexp

xfst[0]: define foo [d o g] | [c a t];

xfst[0]: define R1 [a | b | c | d];

xfst[0]: define R2 [d | e | f | g];

xfst[0]: define R3 [f | g | h | i | j];

print command

- print words name - see the words in the language called name
- print net name - see detailed information about the network name.

xfst[0]: print words foo;

xfst[0]: define baz R1 & R2;

xfst[0]: print net baz;

Stack Example

xfst[0]: clear stack;

xfst[0]: read regex [e d | i n g | s | []];

xfst[1]: read regex [t a l k | k i c k];

xfst[2]: print stack

xfst[2]: print net

xfst[2]: print words

xfst[2]: concatenate net

xfst[1]: print words
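What the session above does to the stack can be modelled in Python with sets of strings standing in for networks. This is only a rough model, assuming that `concatenate net` pops every network and uses the topmost one as the leftmost operand, which is why the suffixes are pushed first and the stems second. The function name `concatenate_net` is hypothetical and is not the xfst implementation.

```python
def concatenate_net(stack):
    """Pop every language, concatenate topmost-first, push the result."""
    result = {""}
    while stack:
        top = stack.pop()  # the most recently pushed language
        result = {x + y for x in result for y in top}
    stack.append(result)

stack = []
stack.append({"ed", "ing", "s", ""})  # pushed first, ends up at the bottom
stack.append({"talk", "kick"})        # pushed second, ends up on top
concatenate_net(stack)

print(len(stack))        # 1: a single network is left on the stack
print(sorted(stack[0]))
```

Under this assumption the remaining language contains each stem on its own and with each of the three suffixes, eight words in all.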

lexc?

Source File → lexc → Compiled Network

- lexc is a high-level programming language and compiler that is well suited for defining natural-language (NL) lexicons.
- The output is a compiled FS network in a format identical to that of other Xerox tools (xfst, twolc).

lexc source file

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

! ex0-lex.txt

LEXICON Root

dine #;

dines #;

dined #;

line #;

lines #;

lined #;

END

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Lexc Sublexicons

! ex1-lex.txt

LEXICON Root

Noun;

Verb;

LEXICON Noun

line NounSuffix;

LEXICON Verb

dine VerbSuffix;

line VerbSuffix;

LEXICON NounSuffix

s #;

#;

LEXICON VerbSuffix

s #;

d #;

#;

lexc

- The resulting lexicon contains the same six words
- The forms line and lines actually get constructed twice, once as a verb, once as a noun.
- After minimization, only one copy of each remains.
- The compiler first processes each sublexicon separately, keeping track of continuation pointers, and then joins the structures to a single network which is determinized and minimized.
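The effect of sublexicons and continuation classes can be imitated in Python by walking every path from Root to the end marker. This is a sketch for illustration, not a description of how the lexc compiler works internally: the dictionary `LEXICON` mirrors ex1-lex.txt, and the function name `paths` is hypothetical.

```python
# Each sublexicon maps to a list of (form, continuation class) entries.
LEXICON = {
    "Root": [("", "Noun"), ("", "Verb")],
    "Noun": [("line", "NounSuffix")],
    "Verb": [("dine", "VerbSuffix"), ("line", "VerbSuffix")],
    "NounSuffix": [("s", "#"), ("", "#")],
    "VerbSuffix": [("s", "#"), ("d", "#"), ("", "#")],
}

def paths(lexicon, name="Root", prefix=""):
    """Yield one surface string per path from `name` to the end marker '#'."""
    if name == "#":
        yield prefix
        return
    for form, continuation in lexicon[name]:
        yield from paths(lexicon, continuation, prefix + form)

all_paths = list(paths(LEXICON))
words = sorted(set(all_paths))
print(len(all_paths), len(words))  # 8 6
print(words)  # ['dine', 'dined', 'dines', 'line', 'lined', 'lines']
```

There are eight paths but only six distinct words, because some forms are reachable through both Noun and Verb; collapsing such duplicates is what the set does here and what determinization and minimization do in the compiled network.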

Running lexc

lexc> compile-source ex1-lex.txt

Opening 'ex1-lex.txt'...

Root...2, Noun...1, Verb...2, NounSuffix...2, VerbSuffix...3

Building lexicon...Minimizing...Done!

SOURCE: 6 states, 7 arcs, 6 words

lexc>
