380 likes | 386 Views
Natural Language Processing. Lecture Notes 2. Assignment 1. Go over Assignment 1. Cosine similarity. Cosine of angle between two vectors The denominator involves the lengths of the documents. Normalization. Next Lectures: Words. Three related topics (Ch 2-3): Finite state automata
E N D
Natural Language Processing Lecture Notes 2
Assignment 1 • Go over Assignment 1
Cosine similarity • Cosine of angle between two vectors • The denominator involves the lengths of the documents. Normalization
Next Lectures: Words • Three related topics (Ch 2-3): • Finite state automata • Finite state transducers • English morphology • We’ll start by reviewing regular expressions and finite state automata, and see where they can get us in NLP
Chapter 2: Regular Expressions and Text Searching • Pervasive • Emacs, vi, perl, python, awk, grep, sed, Word, netscape, etc. • Views of Regular Expressions: • Practical way to specify textual search strings (bells and whistles to make things easier) • Formal specifications of regular languages (only three operations are strictly needed: U, o, and *)
Example • /$[0-9]+(\.[0-9][0-9])?/ • Matches: • It costs $765.03 • It costs $871 • Let D = {0,1,2,3,4,5,6,7,8,9} • $(DD* U DD*.DD)
Example • Find me all instances of the word “the” in a text. • /the/ • /[tT]he/ • /\b[tT]he\b/ • The second RE catches “the” when it begins a sentence (increasing Recall), while the third RE does not match “the” when embedded in another word such as “other” (increasing Precision).
Errors • The process we just went through was based on fixing two kinds of errors • Not matching things that we should have matched (The) • False negatives (decreasing Recall) • Matching strings that we should not have matched (there, then, other) • False positives (decreasing Precision)
Errors cont. • We’ll be telling the same story for many tasks, all semester. Reducing the error rate for an application often involves two antagonistic efforts: • Increasing Precision (minimizing false positives) • Increasing Recall/Coverage (minimizing false negatives).
Text Searches using REs • General: Ch 2 of the text • If regular expressions aren’t in your programming repertoire: decide which language you’ll use, and find a tutorial • Assignment 1
Finite State Automata • Remember: Regular expressions and finite-state automata are equivalent (given any RE, an equivalent FSA can be constructed, and vice-versa) • FSAs and their close relatives are a theoretical foundation of much of the field of NLP
FSAs as Graphs • Let’s start with the sheep language from the text • /baa+!/
Sheep FSA • We can say the following things about this machine • It has 5 states • At least b,a, and ! are in its alphabet • q0 is the start state • q4 is an accept state • It has 5 transitions (in this short-hand version; needs a sink state, since this is meant to be deterministic)
But note • There are other machines that recognize this language
And Multiple REs… • /baaa*!/ • /baa+!/ • We’ll return to this point later…
More Formally • You can specify an FSA by enumerating the following things (precisely: a DFSA). • The set of states: Q • A finite alphabet: Σ • A start state • A set of accept/final states • A transition function that maps QxΣ to Q
About Alphabets • An alphbet is just a finite set of symbols in the input. • These symbols can and will stand for bigger objects that can have internal structure.
3 3 Another View
Recognition • Recognition is the process of determining if a string should be accepted by a machine • Or… it’s the process of determining if a string is in the language we’re defining with the machine • Or… it’s the process of determining if a regular expression matches a string
Recognition • Traditionally, (Turing’s idea) this process is depicted with a tape.
Recognition • Simply a process of starting in the start state • Examining the current input • Consulting the table • Going to a new state and updating the tape pointer. • Until you run out of tape.
Key Points • Deterministic means that at each point in processing there is always one unique thing to do (no choices). • D-recognize is a simple table-driven interpreter • The algorithm is universal for all DFSAs. To recognize a new regular language, you merely change the alphabet and table. • So, searching for a string using a RE involves compiling the RE into a table, and passing the table to an interpreter.
Generative Formalisms • FSAs can be viewed from two perspectives: • Acceptors that can tell you if a string is in the language • Generators to produce all and only the strings in the language
A non-deterministic machine for talking sheep: In q2, input a: move to q3 or stay in q2 Non-Determinism
ε Non-Determinism cont. • Epsilon transitions • They do not examine or advance the tape during recognition
Equivalence • NDFAs, DFAs are equivalent (with REs too) • So, one way to do recognition with a non-deterministic machine is to turn it into a deterministic one.
Non-Deterministic Recognition • Given NDFSA N, a string S is in the language of N iff there exists at least one path through N for S that ends in an accept state • It’s ok if there are paths through N for S that do not end in an accept state • If string T is not in the language of N, then there is no path through N for T that ends in an accept state
Non-Deterministic Recognition • So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. • Failure occurs when none of the possible paths lead to an accept state.
There is also a path which gets blocked; even so, baaa! is in the language b a a ! a q0 q2 q1 q2 q3 q4
Key Points • States in the search space are pairings of tape positions and states in the machine. • By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.
Search Strategies (Review) • Depth-first-search • Most recently created states considered first • Agenda is a stack • LIFO • Can get stuck in an infinite loop • More efficient in memory • Breadth-first-search • States considered in creation order • Agenda is a queue • FIFO • Cannot get stuck in an infinite loop • Very inefficient in memory • Dynamic programming, etc: later in the course
ND-Recognize Code First(agenda) DFS: GEN-NEW-STATES(current-search-state) + agenda BFS: agenda + GEN-NEW-STATES(current-search-state)
Why Bother? • Non-determinism doesn’t get us more formal power and it requires search, so why bother? • Often more natural solutions • Often much smaller
Summing Up • Regular expressions, NDFSAs, and DFSAs can represent a subset of natural languages (regular languages) • Relatively easy to use • Can be difficult to scale up • Cannot represent all NL phenomena