Explore text-searching techniques such as FSMs, the Boyer-Moore algorithm, and regular expressions, along with morphological parsing. Understand how transducers are composed for language processing. Study how ambiguity is resolved in language rules and how search algorithms are used in linguistics.
The Simplest NL Applications: Text Searching and Pattern Matching. Read J & M Chapter 2.
Searching for a Single String Using a Nondeterministic FSM [Figure: states 1 through 8 connected in sequence by transitions labeled c, o, c, o, n, u, t]
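A minimal sketch of how such a machine can be simulated, tracking the set of active states. It assumes the start state (state 1) carries a self-loop on every character, which is the usual source of nondeterminism in a substring searcher. The function name `nfa_search` is hypothetical.

```python
def nfa_search(pattern: str, text: str) -> list[int]:
    """Return the end index (exclusive) of every occurrence of pattern in text.

    States are numbered 1..len(pattern)+1. State 1 is assumed to loop on
    every character, so it stays active for the whole scan.
    """
    final = len(pattern) + 1
    active = {1}
    hits = []
    for pos, ch in enumerate(text):
        nxt = {1}  # assumed self-loop on the start state
        for state in active:
            # State k advances to k+1 on the k-th pattern character.
            if state < final and pattern[state - 1] == ch:
                nxt.add(state + 1)
        if final in nxt:
            hits.append(pos + 1)
        active = nxt
    return hits
```

For example, `nfa_search("coconut", "cococonut")` keeps several partial matches alive at once and reports the single match that ends at index 9.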
Searching for a Single String Using the Boyer-Moore Algorithm
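As a rough illustration, here is a sketch of the Horspool simplification of Boyer-Moore, which keeps only the bad-character shift and omits the good-suffix rule of the full algorithm. It shows the two key ideas: compare right to left, and skip ahead by more than one position on a mismatch. The function name `horspool_search` is hypothetical.

```python
def horspool_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Bad-character table: for each character in pattern[:-1], the distance
    # from its last occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    i = 0  # current alignment of pattern[0] within text
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1  # compare right to left
        if j < 0:
            return i
        # Shift according to the text character under the pattern's last
        # position; a character absent from the table allows a full shift.
        i += shift.get(text[i + m - 1], m)
    return -1
```

On `horspool_search("coconut", "lococonut")` the first alignment fails on the rightmost comparison and the table allows a jump of two positions, landing directly on the match.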
Searching for Multiple Strings [Figure: nondeterministic FSM with two branches, one spelling l, o, c, o, s and one spelling c, o, c, o, n, u, t] Example: lococonut
Converting to a Deterministic FSM [Figure: the FSM with branches spelling l, o, c, o, s and c, o, c, o, n, u, t, to be made deterministic]
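One way to picture the conversion is the subset construction: each deterministic state stands for the set of nondeterministic states that could be active at once. The sketch below builds those sets lazily while scanning, assuming the two patterns in the figures are "locos" and "coconut" and that all patterns are non-empty. The name `lazy_dfa_search` and the `(pattern_index, chars_matched)` state encoding are hypothetical.

```python
def lazy_dfa_search(patterns, text):
    """Scan text once; return (end_index, pattern) for every match found."""
    dfa = {}  # (dfa_state, char) -> dfa_state, filled in on demand

    def step(state, ch):
        key = (state, ch)
        if key not in dfa:
            nxt = set()
            # The NFA start state is always active, so any pattern may begin.
            for p, pat in enumerate(patterns):
                if pat[0] == ch:
                    nxt.add((p, 1))
            # Advance every partially matched pattern that expects ch.
            for p, k in state:
                pat = patterns[p]
                if k < len(pat) and pat[k] == ch:
                    nxt.add((p, k + 1))
            dfa[key] = frozenset(nxt)
        return dfa[key]

    state = frozenset()  # no partial matches yet
    matches = []
    for pos, ch in enumerate(text):
        state = step(state, ch)
        for p, k in sorted(state):
            if k == len(patterns[p]):
                matches.append((pos + 1, patterns[p]))
    return matches
```

On the input "lococonut", the deterministic state reached after "loco" contains both a partial match of "locos" and a partial match of "coconut", which is why the scan still finds "coconut" after the "locos" branch dies.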
Regular Expressions
• Two different (but related) uses of the term:
  • Expressions that define all and only the regular languages
    (aa ∪ ab ∪ ba ∪ bb)*
  • Expressions in a useful pattern language
    Matching IP addresses: s!<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>!<inet>$1</inet>!
    Finding doubled words: \<([A-Za-z]+)\s+\1\>
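The two pattern-language examples can be tried out with Python's `re` module. This is a sketch under two assumptions: the IP example is a Perl-style substitution that rewraps the matched address in `<inet>` tags, and the word-boundary markers `\<` and `\>` are replaced with Python's `\b`, since `re` does not support them. The function names are hypothetical.

```python
import re

# Four dot-separated runs of digits inside <emphasis> tags.
IP_IN_EMPHASIS = re.compile(r"<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>")

# A word, whitespace, then the same word again (backreference \1).
DOUBLED_WORD = re.compile(r"\b([A-Za-z]+)\s+\1\b")


def tag_ip_addresses(text: str) -> str:
    """Rewrap emphasized IP addresses in <inet> tags."""
    return IP_IN_EMPHASIS.sub(r"<inet>\1</inet>", text)


def find_doubled_words(text: str) -> list[str]:
    """Return each word that is immediately repeated."""
    return [m.group(1) for m in DOUBLED_WORD.finditer(text)]
```

The backreference `\1` is what takes the second pattern beyond the regular languages in the formal sense: it requires the matcher to remember an arbitrary earlier substring.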
REs: Syntax and Semantics
Syntax: The regular expressions over an alphabet Σ are all strings over the alphabet Σ ∪ {(, ), ∅, ∪, *} that can be obtained as follows:
1. ∅ and each member of Σ is a regular expression.
2. If α, β are regular expressions, then so is αβ.
3. If α, β are regular expressions, then so is α ∪ β.
4. If α is a regular expression, then so is α*.
5. If α is a regular expression, then so is (α).
6. Nothing else is a regular expression.
REs: Syntax and Semantics
Regular expressions define languages via a semantic interpretation function we'll call L:
1. L(∅) = ∅ and L(a) = {a} for each a ∈ Σ.
2. If α, β are regular expressions, then L(αβ) = L(α) L(β) = all strings that can be formed by concatenating to some string from L(α) some string from L(β).
3. If α, β are regular expressions, then L(α ∪ β) = L(α) ∪ L(β).
4. If α is a regular expression, then L(α*) = L(α)*.
5. If (α) is a regular expression, then L((α)) = L(α).
A language is regular if and only if it can be described by a regular expression.
Note: L is compositional.
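To make the compositionality concrete, the interpretation function can be sketched as a recursive evaluator: the language of each expression is computed only from the languages of its parts. Because a Kleene star denotes an infinite set, this sketch returns only the strings up to a length bound. The tuple encoding of expressions (`"empty"`, `"sym"`, `"cat"`, `"union"`, `"star"`) and the name `lang` are assumptions made for illustration. Parentheses need no case of their own, since grouping is already captured by the tree structure.

```python
def lang(regex, max_len):
    """Return the strings of length <= max_len in the language of regex."""
    kind = regex[0]
    if kind == "empty":   # the empty-set expression denotes no strings
        return set()
    if kind == "sym":     # a single alphabet symbol denotes itself
        return {regex[1]} if max_len >= 1 else set()
    if kind == "union":   # union of the two sub-languages
        return lang(regex[1], max_len) | lang(regex[2], max_len)
    if kind == "cat":     # every left string followed by every right string
        left, right = lang(regex[1], max_len), lang(regex[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":    # zero or more concatenations of the sub-language
        base = lang(regex[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind!r}")
```

Encoding the earlier example of all two-letter blocks over a and b under a star, and evaluating it with a bound of 4, yields exactly the even-length strings over {a, b} up to that length.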
The Importance of Compositionality What is the meaning of: Mary cooked the yujutes. Mary tyroked the yujutes.
Morphological Analysis • Read J & M Chapter 3 • Recognize words • Parse words
Morphological Parsing Goal: to represent the facts declaratively so that a single representation can be used for both recognition and generation. Note: ^ marks morpheme boundaries. # marks word boundaries.
From Lexical to Intermediate Note: All the transducers in the book are described as lexical:intermediate, but they can also be run in the other direction.
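From Intermediate to Surface For text, we need spelling rules. ε → e / {x, s, z} ^ ___ s # Read this as "Replace ε with e in the context after the /." That is, insert an e after a morpheme boundary that follows x, s, or z and precedes a word-final s.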
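The effect of this e-insertion rule can be approximated with a regular-expression substitution. This is only a sketch of the rule's behavior, not a finite-state transducer. It assumes that the boundary symbols ^ and # are erased once the rule has applied, and the function name `intermediate_to_surface` is hypothetical.

```python
import re


def intermediate_to_surface(form: str) -> str:
    """Apply e-insertion, then erase the boundary symbols ^ and #."""
    # Insert e right after a morpheme boundary that follows x, s, or z
    # and is itself followed by a word-final s.
    form = re.sub(r"(?<=[xsz])\^(?=s#)", "^e", form)
    # Assumed final step: drop the morpheme and word boundary markers.
    return form.replace("^", "").replace("#", "")
```

For instance, the intermediate form `fox^s#` becomes `foxes`, while `cat^s#` is unaffected by the rule and becomes `cats`.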
Turning the Rule into a Transducer Examples: foxes, xerox, fox#sat
Disambiguation - Local Local ambiguities: # s# asses luxury
Disambiguation - Harder Sometimes additional knowledge is necessary: foxes: fox +N +PL or fox +V +SG. Can we think of nouns that cannot also be verbs?
Search
• For FSMs, we can build a deterministic machine.
• In other cases, we will have to search:
  • Depth-first
  • Breadth-first – chart parsing
[Figure: two parse trees for the ambiguous sentence "I hit the boy with a bat.", built from the categories S, VP, NP, PP, V, PR, N, DET, PREP]
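A depth-first search over a grammar can be sketched with Python generators, which backtrack automatically when a rule fails. The grammar and lexicon below are hypothetical: they reuse the category names from the slide but were written for this example, and the rules avoid left recursion so that a top-down depth-first parser terminates. The function names are also assumptions.

```python
# Hypothetical toy grammar; not taken from the slides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "NP": [["PR"], ["DET", "N"], ["DET", "N", "PP"]],
    "PP": [["PREP", "NP"]],
}
LEXICON = {"I": "PR", "hit": "V", "the": "DET", "a": "DET",
           "boy": "N", "bat": "N", "with": "PREP"}


def parse(symbol, words, start):
    """Yield (tree, end) for every way symbol can span words[start:end]."""
    if symbol not in GRAMMAR:  # preterminal: look the word up in the lexicon
        if start < len(words) and LEXICON.get(words[start]) == symbol:
            yield (symbol, words[start]), start + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each rule in turn, depth-first
        yield from expand(symbol, rhs, words, start, [])


def expand(symbol, rhs, words, pos, children):
    """Match the remaining right-hand-side symbols starting at pos."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, nxt in parse(rhs[0], words, pos):
        yield from expand(symbol, rhs[1:], words, nxt, children + [child])


def all_parses(sentence):
    """Return every tree for S that covers the whole sentence."""
    words = sentence.split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]
```

Under this toy grammar, `all_parses("I hit the boy with a bat")` returns two trees: one attaches the prepositional phrase to the verb phrase and the other attaches it to the noun phrase "the boy". A breadth-first or chart-based strategy would explore the same space but keep all partial analyses alive in parallel.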