## Introduction to Computational Linguistics


**Introduction to Computational Linguistics**
Words and Finite State Machinery

CLINT-CS Finite State

**Acknowledgement**
Material derived from/copied from
• Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
• Richard Sproat, Lecture notes

**Finite State Methods**
• Word-Oriented Application Areas
  • Tokenization
  • Sentence breaking
  • Spelling correction
  • Morphology (analysis/generation)
  • Phonological disambiguation (Speech Recognition)
  • Morphological disambiguation ("Tagging")
  • Pattern matching ("Named Entity Recognition")
  • Shallow Parsing

**Outline**
• Words
• Regular Languages
• Regular Expressions
• Finite State Automata

**What is a Word?**
Some Distinctions
• Written
• Spoken
• Word Type
• Word Token

**Information Associated with Words**
• Spelling
  • orthographic
  • phonological
• Syntax
  • POS
  • Valency
• Semantics
  • Meaning
  • Relationship to other words

**Properties of Words**
• Sequence
  • characters: pollution
  • phonemes
• Delimitation
  • whitespace
  • other?
• Structure
  • simple ("atomic") words
  • complex ("molecular") words

**Complex Words**
• Complex words have subparts:
  • e.g. "enlargement": en + large + ment
• Some subparts are valid words: large
• Others are prefixes and suffixes: en, ment
• N.B. The complex word can be built in different ways:
  • (en + large) + ment
  • en + (large + ment)

**Morphological Processes**
• affixation
  • prefix
  • suffix
  • circumfix: għandi - mgħandix
  • infix: phenidine - phenetidine
• other morphological processes
  • redoubling (mexa; mexxa)
  • vowel change (swim; swam)

**Affixation uses Concatenation**
prefixes + roots + suffixes
• prefixes: dis, re, un, en
• roots: large, charge, infect, code, decide
• suffixes: ed, ing, ee, er, ly

**The Language of Words**
• What kind of formal language is the language of words?
• One which can be constructed out of
  • A characteristic set of basic symbols (alphabet)
  • A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Iteration
• Regular Language; Regular Sets

**Characterising Classes of Set**
[Diagram relating three boxes: MACHINE, CLASS OF SETS or LANGUAGES, NOTATION]

**Outline**
• Words
• Regular Languages
• Regular Expressions
• Finite State Automata

**Regular Languages**
• A regular language is a language with a finite alphabet that can be constructed out of one or more of the following operations:
  • Set union
  • Concatenation
  • Transitive closure (Kleene star)

**Some things that are regular languages**
• Zero or more a's followed by zero or more b's
• The set of words in an English dictionary
• Dates
• URLs
• English?

**Some things that are not regular languages**
• Zero or more a's followed by exactly the same number of b's
• The set of all English palindromes (e.g. Madam I'm Adam)
• The set that includes all noun phrases of the form
  • the cat slept
  • the cat the dog bit slept
  • the cat the dog the man fed bit slept

**Some special regular languages**
• The universal language (Σ*)
• The empty language (Ø)
Note: the empty language is not the same as the empty string

**Some closure properties of regular languages**
• Intersection
• Complementation
• Difference
• Reversal
• Power

**Outline**
• Words
• Regular Languages
• Regular Expressions
• Finite Automata

**Regular Expressions**
• Notation for describing regular sets
• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
• Xerox Finite State tools use a somewhat different notation, but with similar function.
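As a small illustration of the regular / non-regular distinction drawn in the slides above, the following sketch uses Python's standard `re` module (Python is our choice here, not a tool named in the lecture). It tests membership in the language "zero or more a's followed by zero or more b's" with a regular expression, and contrasts it with "a's followed by exactly the same number of b's", which is checked by explicit counting because no regular expression in the formal sense describes it. The function names are illustrative only.

```python
import re

# a*b* : zero or more a's followed by zero or more b's (a regular language).
A_STAR_B_STAR = re.compile(r"a*b*")


def in_a_star_b_star(s):
    """True if the whole string s belongs to the language a*b*."""
    return A_STAR_B_STAR.fullmatch(s) is not None


def in_an_bn(s):
    """True if s is n a's followed by exactly n b's (n >= 0).

    This language is not regular, so the check counts symbols
    instead of relying on a formal regular expression.
    """
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n


for s in ["", "aab", "abab", "aabb"]:
    print(repr(s), in_a_star_b_star(s), in_an_bn(s))
```

Every string of the second language also belongs to the first, but not the other way round: "aab" is in a*b* yet has unequal counts.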
**Regular Expressions**
• a : a simple symbol
• A B : concatenation
• A | B : alternation operator
• A & B : intersection operator
• A* : Kleene star

**Caveats**
• Perl and other languages (see J&M, Chapter 2) have many extensions in their "regular expression" syntax. Strictly speaking, not all of these correspond to regular expressions in the formal sense, since they do not describe regular languages.
• For example, arbitrary substring copying is not expressible as a regular language, though one can do this in Perl (or Python …): /(…+)\1/

**Outline**
• Words
• Regular Languages
• Regular Expressions
• Finite Automata

**Finite Automaton**
• A finite automaton is a quintuple (Q, Σ, q0, F, δ) where:
  • Q is a finite set of states
  • Σ is an alphabet of symbols
  • q0 ∈ Q is a start state
  • F ⊆ Q are final states
  • δ is a transition relation δ(q, σ, q') between a state q ∈ Q, a symbol σ ∈ Σ and a state q' ∈ Q

**Representation of FSA's: State Diagram**

**State Table**

**Prolog**
[State diagram: 1 -h-> 2 -a-> 3 -!-> 4, with an additional arc 3 -h-> 2; state 1 is initial, state 4 is final]
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).

**Mr. S.K.**

**Kleene's theorem**
• The languages accepted by NFAs are exactly the languages described by regular expressions.
• Kleene's Theorem, part 1: To each regular expression there corresponds an NFA.
• Kleene's Theorem, part 2: To each NFA there corresponds a regular expression.

**Converting a Regular Expression to an NFA**
• The NFA representing the empty string is: 1 -ε-> 2
• The NFA representing a single character is: 1 -a-> 2

**Converting a Regular Expression to an NFA**
• The union operator is represented by a choice of paths from a node, e.g.
a|b: [Diagram: two arcs from state 1 to state 2, one labelled a and one labelled b]

**Converting a Regular Expression to an NFA**
• Concatenation simply involves connecting one NFA to the other, so that ab is represented by 1 -a-> 2 -b-> 3

**Converting a Regular Expression to an NFA**
• The Kleene star must allow for zero or more occurrences. So a* is represented by
[Diagram: an arc labelled a between two inner states, with ε arcs that allow the a arc to be skipped or repeated]

**Deterministic versus non-deterministic finite automata**
• The definition of finite-state automata given above was for non-deterministic finite automata (NFA):
  • δ is a relation, meaning that from any state and given any symbol, one can in principle transition to any number of states.
• In deterministic finite automata (DFA), every state/symbol pair maps to a unique state
  • In other words, δ is a function

**A deterministic automaton**

**NFAs vs DFAs**
• NFAs are typically smaller and simpler than their equivalent DFAs
• Why do we care about DFAs?
• EFFICIENCY!

**Equivalence of NFA's and DFA's**

**Subset Construction for Determinisation**
• Any two states that are connected by an ε-transition may as well be the same, since we can move from one to the other without consuming any character.
• Thus states which are connected by an ε-transition will be represented by the same state in the DFA.
• If there are multiple transitions based on the same symbol, then we can regard a transition as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol).
• Thus these states will be combined into a single DFA state.
• More details: http://www.cs.may.ie/~jpower/Courses/parsing/node9.html

**Xerox Tools**
Finite State Machinery

**The Xerox Approach**
• Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo Koskenniemi.
• Meta-languages for describing regular languages and regular relations.
• Compiler for mapping meta-language "programs" into efficient FS machinery
• Several tools and applications

**xerox tools**
• xfst: Xerox Finite-State Tool
• lexc: Finite-State Lexicon Compiler
• twolc: Two-Level Rule Compiler

**xfst**
• xfst is a general tool for creating and manipulating finite state networks, both simple automata and transducers.
• xfst and other Xerox tools employ a special "xfst notation" (more powerful than that used in Unix, Perl, C# etc.)

**Simple Regular Expressions**
• Atomic Expressions
  • Simple Symbols
  • Multicharacter Symbols
• Complex Expressions
  • Union
  • Intersection
  • Concatenation

**xfst Notation Examples**
• A|B : Union
• A&B : Intersection
• A B : Concatenation
• A* : Closure (Kleene Star)
• (A) : Optional Element
• ? : Any symbol
• \b : Any symbol other than b
• ~A : Complement (= [?* - A ])
• 0 : Empty string language
• $A : [ ?* A ?* ]

**Concatenation over Reg. Expression and Language**
Regular Expression
• E1 = [a|b]
• E2 = [c|d]
• E1 E2 = [a|b] [c|d]
Language
• L1 = {"a", "b"}
• L2 = {"c", "d"}
• L1 L2 = {"ac", "ad", "bc", "bd"}

**Concatenation over FS Automata**
[Diagram: (automaton with arcs a, b) + (automaton with arcs c, d) = automaton with arcs a, b followed by arcs c, d]

**Simple Commands**
• In addition to the notation there are also commands, e.g.
  • define: give a name to an RE
  • print: print information
  • read: read information
  • various stack operations
  • file interaction
  • various command line options
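The concatenation slides above (L1 L2 = {"ac", "ad", "bc", "bd"}) can be sketched in Python, both at the level of languages and at the level of automata. This is a minimal illustration under our own assumptions, not the way xfst works internally: an automaton is a plain dictionary, an arc is a `(source, target, symbol)` triple in the spirit of the `arc/3` Prolog facts, `None` stands for ε, and all function names are hypothetical. The two automata are assumed to use disjoint state numbers.

```python
from itertools import product


def concat_languages(l1, l2):
    """Concatenate two finite languages given as sets of strings."""
    return {x + y for x, y in product(l1, l2)}


def make_choice_fsa(symbols, first_state):
    """Two-state automaton accepting exactly one of the given symbols."""
    start, end = first_state, first_state + 1
    return {"start": start, "finals": {end},
            "arcs": [(start, end, s) for s in symbols]}


def concat_fsa(m1, m2):
    """Join m1 and m2 by an epsilon arc (symbol None) from each final
    state of m1 to the start state of m2. State sets must be disjoint."""
    links = [(f, m2["start"], None) for f in m1["finals"]]
    return {"start": m1["start"], "finals": set(m2["finals"]),
            "arcs": m1["arcs"] + m2["arcs"] + links}


def eps_closure(m, states):
    """All states reachable from `states` through epsilon arcs only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, dst, sym in m["arcs"]:
            if src == q and sym is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(m, string):
    """Simulate the (possibly non-deterministic) automaton on `string`."""
    current = eps_closure(m, {m["start"]})
    for ch in string:
        moved = {dst for src, dst, sym in m["arcs"]
                 if src in current and sym == ch}
        current = eps_closure(m, moved)
    return bool(current & m["finals"])


L1, L2 = {"a", "b"}, {"c", "d"}
M = concat_fsa(make_choice_fsa("ab", 0), make_choice_fsa("cd", 2))
language = concat_languages(L1, L2)
print(sorted(language))
print(all(accepts(M, w) for w in language), accepts(M, "a"), accepts(M, "ca"))
```

Tracking a set of current states in `accepts` is the same idea that the subset construction applies ahead of time: each reachable set of NFA states becomes one DFA state.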