1 / 51

Regular Expressions and Automata

Regular Expressions and Automata. Lecture #2-2. September 10 2009. Finite State Automata. Regular Expressions (REs) can be viewed as a way to describe machines called Finite State Automata (FSA, also known as automata, finite automata).

geraldc
Download Presentation

Regular Expressions and Automata

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Regular Expressions and Automata Lecture #2-2 September 10 2009

  2. Finite State Automata • Regular Expressions (REs) can be viewed as a way to describe machines called Finite State Automata (FSA, also known as automata, finite automata). • FSAs and their close variants are a theoretical foundation of much of the field of NLP.

  3. a b a a ! q0 q1 q2 q3 q4 Finite State Automata • FSAs recognize the regular languages represented by regular expressions • SheepTalk: /baa+!/ • Directed graph with labeled nodes and arc transitions • Five states: q0 the start state, q4 the final state, 5 transitions

  4. a b a a ! q0 q1 q2 q3 q4 Formally • FSA is a 5-tuple consisting of • Q: set of states {q0,q1,q2,q3,q4} • : a finite alphabet of symbols {a,b,!} • q0: a start state • F: a set of accept/final states in Q {q4} • (q,i): a transition function mapping Q x  to Q

  5. State Transition Table for SheepTalk

  6. Recognition • Recognition (or acceptance) is the process of determining whether or not a given input should be accepted by a given machine. • Or… it’s the process of determining if as string is in the language we’re defining with the machine • In terms of REs, it’s the process of determining whether or not a given input matches a particular regular expression. • Traditionally, recognition is viewed as processing an input written on a tape consisting of cells containing elements from the alphabet.

  7. FSA recognizes (accepts) strings of a regular language • baa! • baaa! • baaa! • … • Tape metaphor: a rejected input q0 a b a ! b

  8. Recognition • Simply a process of starting in the start state • Examining the current input • Consulting the table • Going to a new state and updating the tape pointer. • Until you run out of tape.

  9. D-Recognize

  10. q0 q3 q3 q4 q1 q2 q3

  11. Key Points • Deterministic means that at each point in processing there is always one unique thing to do (no choices). • D-recognize is a simple table-driven interpreter • The algorithm is universal for all unambiguous languages. • To change the machine, you change the table.

  12. Key Points • Crudely therefore… matching strings with regular expressions (ala Perl) is a matter of • translating the expression into a machine (table) and • passing the table to an interpreter

  13. Recognition as Search • You can view this algorithm as a degenerate kind of state-space search. • States are pairings of tape positions and state numbers. • Operators are compiled into the table • Goal state is a pairing with the end of tape position and a final accept state • Its degenerate because?

  14. Formal Languages • Formal Languages are sets of strings composed of symbols from a finite set of symbols. • Finite-state automate define formal languages (without having to enumerate all the strings in the language) • Given a machine m (such as a particular FSA) L(m) means the formal language characterized by m. • L(Sheeptalk FSA) = {baa!, baaa!, baaaa!, …} (an infinite set)

  15. Generative Formalisms • The term Generative is based on the view that you can run the machine as a generator to get strings from the language. • FSAs can be viewed from two perspectives: • Acceptors that can tell you if a string is in the language • Generators to produce all and only the strings in the language

  16. Three Views • Three equivalent formal ways to look at what we’re up to (not including tables – and we’ll find more…) Regular Expressions Finite State Automata Regular Languages

  17. Determinism • Let’s take another look at what is going on with d-recognize. • In particular, let’s look at what it means to be deterministic here and see if we can relax that notion. • How would our recognition algorithm change? • What would it mean for the accepted language?

  18. Determinism and Non-Determinism • Deterministic: There is at most one transition that can be taken given a current state and input symbol. • Non-deterministic: There is a choice of several transitions that can be taken given a current state and input symbol. (The machine doesn’t specify how to make the choice.)

  19. Non-Deterministic FSAs for SheepTalk a b a a ! q0 q1 q2 q3 q4 b a a ! q0 q1 q2 q3 q4 

  20. FSAs as Grammars for Natural Language dr the rev mr pat l. robinson q0 q1 q2 q3 q4 q5 q6 ms hon  mrs  Can you use a regexpr to capture this too?

  21. Equivalence • Non-deterministic machines can be converted to deterministic ones with a fairly simple construction (essentially building “set states” that are reached by following all possible states in parallel) • That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones • It also means that one way to do recognition with a non-deterministic machine is to turn it into a deterministic one. • Problems: translating gives us a not very intuitive machine, and this machine has LOTS of states

  22. Non-Deterministic Recognition • In a ND FSA there exists at least one path directed through the machine by a string that is in the language defined by the machine that leads to an accept condition.. • But not all paths directed through the machine by an accept string lead to an accept state. It is OK for some paths to lead to a reject condition. • In a ND FSA no path directed through the machine by a string outside the language leads to an accept condition.

  23. Non-Deterministic Recognition • So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. • However, being driven to a reject condition by an input does not imply it should be rejected. • Failure occurs only when none of the possible paths lead to an accept state. • This means that the problem of non-deterministic recognition can be thought of as a standard search problem.

  24. The Problem of Choice • Choice in non-deterministic models comes up again and again in NLP. Several Standard Solutions • Backup (search, this chapter) • Save input/state of machine at choice points • If wrong choice, use this saved state to back up and try another choice • Lookahead • Look ahead in the input to help make a choice • Parallelism • Look at all choices in parallel

  25. Backup • After a wrong choice leads to a dead-end (either no input left in a non-accept state, or no legal transitions), return to a previous choice point to pursue another unexplored choice. • Thus, at each choice point, the search process needs to remember the (unexplored) choices. • Standard State Space Search. • State = (FSA node or machine state, tape-position)

  26. Example b a a ! \ a q0 q2 q1 q2 q3 q4

  27. ND-Recognize Code

  28. Example Agenda:

  29. Example

  30. Example Agenda:

  31. Example

  32. Example Agenda:

  33. Example

  34. Agenda: Example

  35. Example Agenda:

  36. Example Agenda:

  37. Example

  38. Example Agenda:

  39. Example Agenda:

  40. Example Agenda:

  41. Example Agenda:

  42. Key Points • States in the search space are pairings of tape positions and states in the machine. • By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.

  43. Infinite Search • If you’re not careful such searches can go into an infinite loop. • How?

  44. Why Bother? • Non-determinism doesn’t get us more formal power and it causes headaches so why bother? • More natural solutions • Machines based on construction are too big

  45. Compositional Machines • Formal languages are just sets of strings • Therefore, we can talk about various set operations (intersection, union, concatenation) • This turns out to be a useful exercise

  46. Union • Accept a string in either of two languages

  47. Concatenation • Accept a string consisting of a string from language L1 followed by a string from language L2.

  48. Negation • Construct a machine M2 to accept all strings not accepted by machine M1 and reject all the strings accepted by M1 • Invert all the accept and not accept states in M1 • Does that work for non-deterministic machines?

  49. Intersection • Accept a string that is in both of two specified languages • An indirect construction… • A^B = ~(~A or ~B)

  50. Why Bother? • ‘FSAs can be useful tools for recognizing – and generating – subsets of natural language • But they cannot represent all NL phenomena (Center Embedding: The mouse the cat ... chased died.)

More Related