Natural Language Processing

Natural Language Processing Lecture Notes 2

Assignment 1 • Go over Assignment 1

Cosine similarity • Cosine of angle between two vectors • The denominator involves the lengths of the documents. Normalization

Next Lectures: Words • Three related topics (Ch 2-3): • Finite state automata • Finite state transducers • English morphology • We’ll start by reviewing regular expressions and finite state automata, and see where they can get us in NLP

Chapter 2: Regular Expressions and Text Searching • Pervasive • Emacs, vi, perl, python, awk, grep, sed, Word, netscape, etc. • Views of Regular Expressions: • Practical way to specify textual search strings (bells and whistles to make things easier) • Formal specifications of regular languages (only three operations are strictly needed: U, o, and *)

Example • /$[0-9]+(\.[0-9][0-9])?/ • Matches: • It costs $765.03 • It costs $871 • Let D = {0,1,2,3,4,5,6,7,8,9} • $(DD* U DD*.DD)

Example • Find me all instances of the word “the” in a text. • /the/ • /[tT]he/ • /\b[tT]he\b/ • The second RE catches “the” when it begins a sentence (increasing Recall), while the third RE does not match “the” when embedded in another word such as “other” (increasing Precision).

Errors • The process we just went through was based on fixing two kinds of errors • Not matching things that we should have matched (The) • False negatives (decreasing Recall) • Matching strings that we should not have matched (there, then, other) • False positives (decreasing Precision)

Errors cont. • We’ll be telling the same story for many tasks, all semester. Reducing the error rate for an application often involves two antagonistic efforts: • Increasing Precision (minimizing false positives) • Increasing Recall/Coverage (minimizing false negatives).

Text Searches using REs • General: Ch 2 of the text • If regular expressions aren’t in your programming repertoire: decide which language you’ll use, and find a tutorial • Assignment 1

Finite State Automata • Remember: Regular expressions and finite-state automata are equivalent (given any RE, an equivalent FSA can be constructed, and vice-versa) • FSAs and their close relatives are a theoretical foundation of much of the field of NLP

FSAs as Graphs • Let’s start with the sheep language from the text • /baa+!/

Sheep FSA • We can say the following things about this machine • It has 5 states • At least b,a, and ! are in its alphabet • q0 is the start state • q4 is an accept state • It has 5 transitions (in this short-hand version; needs a sink state, since this is meant to be deterministic)

But note • There are other machines that recognize this language

And Multiple REs… • /baaa*!/ • /baa+!/ • We’ll return to this point later…

More Formally • You can specify an FSA by enumerating the following things (precisely: a DFSA). • The set of states: Q • A finite alphabet: Σ • A start state • A set of accept/final states • A transition function that maps QxΣ to Q

About Alphabets • An alphbet is just a finite set of symbols in the input. • These symbols can and will stand for bigger objects that can have internal structure.

Dollars and Cents

3 3 Another View

Recognition • Recognition is the process of determining if a string should be accepted by a machine • Or… it’s the process of determining if a string is in the language we’re defining with the machine • Or… it’s the process of determining if a regular expression matches a string

Recognition • Traditionally, (Turing’s idea) this process is depicted with a tape.

Recognition • Simply a process of starting in the start state • Examining the current input • Consulting the table • Going to a new state and updating the tape pointer. • Until you run out of tape.

D-Recognize

D-Recognize (with technically correct DFSA)

Key Points • Deterministic means that at each point in processing there is always one unique thing to do (no choices). • D-recognize is a simple table-driven interpreter • The algorithm is universal for all DFSAs. To recognize a new regular language, you merely change the alphabet and table. • So, searching for a string using a RE involves compiling the RE into a table, and passing the table to an interpreter.

Generative Formalisms • FSAs can be viewed from two perspectives: • Acceptors that can tell you if a string is in the language • Generators to produce all and only the strings in the language

A non-deterministic machine for talking sheep: In q2, input a: move to q3 or stay in q2 Non-Determinism

ε Non-Determinism cont. • Epsilon transitions • They do not examine or advance the tape during recognition

Equivalence • NDFAs, DFAs are equivalent (with REs too) • So, one way to do recognition with a non-deterministic machine is to turn it into a deterministic one.

Non-Deterministic Recognition • Given NDFSA N, a string S is in the language of N iff there exists at least one path through N for S that ends in an accept state • It’s ok if there are paths through N for S that do not end in an accept state • If string T is not in the language of N, then there is no path through N for T that ends in an accept state

Non-Deterministic Recognition • So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. • Failure occurs when none of the possible paths lead to an accept state.

There is also a path which gets blocked; even so, baaa! is in the language b a a ! a q0 q2 q1 q2 q3 q4

Key Points • States in the search space are pairings of tape positions and states in the machine. • By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.

ND-Recognize Code

Search Strategies (Review) • Depth-first-search • Most recently created states considered first • Agenda is a stack • LIFO • Can get stuck in an infinite loop • More efficient in memory • Breadth-first-search • States considered in creation order • Agenda is a queue • FIFO • Cannot get stuck in an infinite loop • Very inefficient in memory • Dynamic programming, etc: later in the course

ND-Recognize Code First(agenda) DFS: GEN-NEW-STATES(current-search-state) + agenda BFS: agenda + GEN-NEW-STATES(current-search-state)

Why Bother? • Non-determinism doesn’t get us more formal power and it requires search, so why bother? • Often more natural solutions • Often much smaller

Summing Up • Regular expressions, NDFSAs, and DFSAs can represent a subset of natural languages (regular languages) • Relatively easy to use • Can be difficult to scale up • Cannot represent all NL phenomena

Natural Language Processing