
Natural Language Processing


Presentation Transcript


  1. Natural Language Processing Lecture Notes 2

  2. Assignment 1 • Go over Assignment 1

  3. Cosine similarity • Cosine of the angle between two vectors • The denominator involves the lengths (norms) of the document vectors, which normalizes for document length
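A minimal sketch of the computation (a toy example; the term vectors and counts below are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc1 = [1, 2, 0]   # counts for terms t1, t2, t3 in document 1
doc2 = [2, 4, 0]   # same direction as doc1, but twice as long
doc3 = [0, 0, 3]   # no terms in common with doc1

print(cosine(doc1, doc2))  # 1.0 -- length normalization: same direction, same score
print(cosine(doc1, doc3))  # 0.0 -- orthogonal: no shared terms
```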

  4. Next Lectures: Words • Three related topics (Ch 2-3): • Finite state automata • Finite state transducers • English morphology • We’ll start by reviewing regular expressions and finite state automata, and see where they can get us in NLP

  5. Chapter 2: Regular Expressions and Text Searching • Pervasive • Emacs, vi, perl, python, awk, grep, sed, Word, netscape, etc. • Views of Regular Expressions: • Practical way to specify textual search strings (bells and whistles to make things easier) • Formal specifications of regular languages (only three operations are strictly needed: union (∪), concatenation (∘), and Kleene star (*))

  6. Example • /\$[0-9]+(\.[0-9][0-9])?/ (note the escaped \$; a bare $ is the end-of-line anchor) • Matches: • It costs $765.03 • It costs $871 • Let D = {0,1,2,3,4,5,6,7,8,9} • $(DD* ∪ DD*.DD)
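The pattern can be tried directly, for instance in Python's `re` dialect (the escaped `\$` is required, since an unescaped `$` anchors to end of line):

```python
import re

# Dollar amounts with optional cents, per the slide's pattern.
price = re.compile(r'\$[0-9]+(\.[0-9][0-9])?')

print(price.search('It costs $765.03').group())  # $765.03
print(price.search('It costs $871').group())     # $871
```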

  7. Example • Find me all instances of the word “the” in a text. • /the/ • /[tT]he/ • /\b[tT]he\b/ • The second RE catches “the” when it begins a sentence (increasing Recall), while the third RE does not match “the” when embedded in another word such as “other” (increasing Precision).
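The three refinements can be compared on a small test sentence (the sentence is made up for illustration):

```python
import re

text = 'The other day, the cat saw them there.'

print(re.findall(r'the', text))         # misses "The"; also matches inside "other", "them", "there"
print(re.findall(r'[tT]he', text))      # now catches "The", but still matches the embedded occurrences
print(re.findall(r'\b[tT]he\b', text))  # ['The', 'the'] -- only the standalone words
```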

  8. Errors • The process we just went through was based on fixing two kinds of errors • Not matching things that we should have matched (The) • False negatives (decreasing Recall) • Matching strings that we should not have matched (there, then, other) • False positives (decreasing Precision)

  9. Errors cont. • We’ll be telling the same story for many tasks, all semester. Reducing the error rate for an application often involves two antagonistic efforts: • Increasing Precision (minimizing false positives) • Increasing Recall/Coverage (minimizing false negatives).

  10. Text Searches using REs • General: Ch 2 of the text • If regular expressions aren’t in your programming repertoire: decide which language you’ll use, and find a tutorial • Assignment 1

  11. Finite State Automata • Remember: Regular expressions and finite-state automata are equivalent (given any RE, an equivalent FSA can be constructed, and vice-versa) • FSAs and their close relatives are a theoretical foundation of much of the field of NLP

  12. FSAs as Graphs • Let’s start with the sheep language from the text • /baa+!/

  13. Sheep FSA • We can say the following things about this machine • It has 5 states • At least b,a, and ! are in its alphabet • q0 is the start state • q4 is an accept state • It has 5 transitions (in this short-hand version; needs a sink state, since this is meant to be deterministic)

  14. But note • There are other machines that recognize this language

  15. And Multiple REs… • /baaa*!/ • /baa+!/ • We’ll return to this point later…

  16. More Formally • You can specify an FSA by enumerating the following things (precisely: a DFSA). • The set of states: Q • A finite alphabet: Σ • A start state • A set of accept/final states • A transition function that maps Q × Σ to Q

  17. About Alphabets • An alphabet is just a finite set of symbols in the input. • These symbols can and will stand for bigger objects that can have internal structure.

  18. Dollars and Cents

  19. Another View

  20. Recognition • Recognition is the process of determining if a string should be accepted by a machine • Or… it’s the process of determining if a string is in the language we’re defining with the machine • Or… it’s the process of determining if a regular expression matches a string

  21. Recognition • Traditionally, (Turing’s idea) this process is depicted with a tape.

  22. Recognition • Simply a process of starting in the start state • Examining the current input • Consulting the table • Going to a new state and updating the tape pointer. • Until you run out of tape.

  23. D-Recognize

  24. D-Recognize (with technically correct DFSA)

  25. Key Points • Deterministic means that at each point in processing there is always one unique thing to do (no choices). • D-recognize is a simple table-driven interpreter • The algorithm is universal for all DFSAs. To recognize a new regular language, you merely change the alphabet and table. • So, searching for a string using a RE involves compiling the RE into a table, and passing the table to an interpreter.
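The table-driven interpreter can be sketched as below; the state names and transition table are an assumed encoding of the sheep DFSA for /baa+!/, with missing table entries falling through to the sink state:

```python
# Minimal table-driven D-recognize sketch for the sheep language /baa+!/.
SINK = 'sink'
TABLE = {
    ('q0', 'b'): 'q1',
    ('q1', 'a'): 'q2',
    ('q2', 'a'): 'q3',
    ('q3', 'a'): 'q3',   # one or more extra a's
    ('q3', '!'): 'q4',
}
ACCEPT = {'q4'}

def d_recognize(tape, table=TABLE, start='q0', accept=ACCEPT):
    state = start
    for symbol in tape:                           # examine current input, advance tape pointer
        state = table.get((state, symbol), SINK)  # consult the table; undefined -> sink
    return state in accept                        # accept iff we end in a final state

print(d_recognize('baa!'))    # True
print(d_recognize('baaaa!'))  # True
print(d_recognize('ba!'))     # False
```

Recognizing a different regular language only means swapping in a different table and accept set; the interpreter itself does not change.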

  26. Generative Formalisms • FSAs can be viewed from two perspectives: • Acceptors that can tell you if a string is in the language • Generators to produce all and only the strings in the language

  27. Non-Determinism • A non-deterministic machine for talking sheep • In q2, on input a: move to q3 or stay in q2

  28. Non-Determinism cont. • ε (epsilon) transitions • They do not examine or advance the tape during recognition

  29. Equivalence • NDFSAs and DFSAs are equivalent (and equivalent to REs as well) • So, one way to do recognition with a non-deterministic machine is to turn it into a deterministic one.

  30. Non-Deterministic Recognition • Given NDFSA N, a string S is in the language of N iff there exists at least one path through N for S that ends in an accept state • It’s ok if there are paths through N for S that do not end in an accept state • If string T is not in the language of N, then there is no path through N for T that ends in an accept state

  31. Non-Deterministic Recognition • So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. • Failure occurs when none of the possible paths lead to an accept state.

  32. There is also a path which gets blocked; even so, baaa! is in the language • [Figure: one accepting path: q0 →b q1 →a q2 →a q2 →a q3 →! q4]

  33. Key Points • States in the search space are pairings of tape positions and states in the machine. • By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.

  34. ND-Recognize Code

  35. Search Strategies (Review) • Depth-first-search • Most recently created states considered first • Agenda is a stack • LIFO • Can get stuck in an infinite loop • More efficient in memory • Breadth-first-search • States considered in creation order • Agenda is a queue • FIFO • Cannot get stuck in an infinite loop • Very inefficient in memory • Dynamic programming, etc: later in the course

  36. ND-Recognize Code • First(agenda) • DFS: agenda ← GEN-NEW-STATES(current-search-state) + agenda • BFS: agenda ← agenda + GEN-NEW-STATES(current-search-state)
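A sketch of the agenda-based search, with a flag switching between the stack (DFS) and queue (BFS) disciplines; the transition table is an assumed encoding of the non-deterministic sheep machine (in q2, input a: stay in q2 or move to q3):

```python
# Agenda-based ND-recognize sketch for the non-deterministic /baa+!/ machine.
NTABLE = {
    ('q0', 'b'): ['q1'],
    ('q1', 'a'): ['q2'],
    ('q2', 'a'): ['q2', 'q3'],  # the non-deterministic choice
    ('q3', '!'): ['q4'],
}
ACCEPT = {'q4'}

def nd_recognize(tape, depth_first=True):
    agenda = [('q0', 0)]            # search states pair machine state with tape position
    while agenda:
        state, pos = agenda.pop(0)  # First(agenda)
        if pos == len(tape):
            if state in ACCEPT:
                return True         # one accepting path is enough
            continue
        new = [(nxt, pos + 1) for nxt in NTABLE.get((state, tape[pos]), [])]
        agenda = new + agenda if depth_first else agenda + new  # stack vs. queue
    return False                    # every path blocked or rejected

print(nd_recognize('baaa!'))  # True: one path blocks in q3, another accepts
print(nd_recognize('ba!'))    # False
```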

  37. Why Bother? • Non-determinism doesn’t get us more formal power and it requires search, so why bother? • Often more natural solutions • Often much smaller

  38. Summing Up • Regular expressions, NDFSAs, and DFSAs all define the same class of languages, the regular languages, which capture only a subset of natural language phenomena • Relatively easy to use • Can be difficult to scale up • Cannot represent all NL phenomena
