Digital State Machines Regular Expressions & Languages
Chapter Outline
• Regular Expressions
• Basic Regular Expression Patterns
• Disjunction, Grouping and Precedence
• Examples
• Advanced Operators
• Regular Expression Substitution, Memory and ELIZA
• Summary
Veton Këpuska
Regular Expressions (RE)
• An algebraic description of finite state automata (FSA).
• Regular expressions can define exactly the same languages that the various forms of automata describe: the regular languages.
• Regular expressions offer a declarative way to express the strings we want to accept; FSA do not.
• REs serve as the input language for many systems that process strings:
  • Search commands such as UNIX grep (egrep, etc.) for finding strings, as found in:
    • WWW browsers,
    • text-formatting systems, etc.
    • Search systems convert REs into FSA(s) (D-FSA or N-FSA).
  • Lexical-analyzer generators, such as LEX or FLEX:
    • compilers,
    • the language modeling system in a speech recognizer,
    • grammar and spell checkers.
FSA, RE and Regular Languages
• [Figure: the relationship between regular expressions, regular languages and finite automata.]
The Operators of Regular Expressions
• Regular expressions denote languages.
  • 01*+10* denotes the language of all strings that are either a single 0 followed by any number of 1’s, {0, 01, 011, 0111, 01111, …}, or a single 1 followed by any number of 0’s, {1, 10, 100, 1000, 10000, …}.
• Operations on regular languages that regular expressions represent. Let L, L₁ and L₂ be regular languages with L = {0, 1}, L₁ = {10, 001, 111} and L₂ = {ε, 001}. Then:
  • Union (disjunction) of L₁ and L₂: L₁ ∪ L₂ = {ε, 10, 001, 111}
  • Concatenation: L₁L₂ = {xy | x ∈ L₁, y ∈ L₂} = {10, 001, 111, 10001, 001001, 111001}
  • Closure (star, *, or Kleene closure): L* = L⁰ ∪ L¹ ∪ L² ∪ … ∪ Lⁱ ∪ …
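
The three operations can be tried out directly on finite languages. The following is a minimal Python sketch, assuming a language is represented as a set of strings and the empty string stands for ε. The helper names (`union`, `concat`, `power`, `closure_up_to`) are made up for this illustration, and the closure is necessarily truncated at a chosen power because L* is usually infinite.

```python
EPS = ""  # the empty string plays the role of ε

def union(a, b):
    """L1 ∪ L2: strings that are in either language."""
    return a | b

def concat(a, b):
    """L1L2: every string of a followed by every string of b."""
    return {x + y for x in a for y in b}

def power(lang, i):
    """L^i: concatenation of i strings chosen from lang; L^0 is {ε}."""
    result = {EPS}
    for _ in range(i):
        result = concat(result, lang)
    return result

def closure_up_to(lang, max_power):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(lang, i)
    return result

L = {"0", "1"}
L1 = {"10", "001", "111"}
L2 = {EPS, "001"}

by_length = lambda s: (len(s), s)
print(sorted(union(L1, L2), key=by_length))
print(sorted(concat(L1, L2), key=by_length))
print(sorted(closure_up_to(L, 2), key=by_length))
```

Representing ε as `""` makes concatenation with ε a no-op, which is why the three strings of L₁ reappear unchanged in L₁L₂.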
Example
• L = {0, 11}
• L⁰ = {ε}, independent of what language L is.
• L¹ = L, which represents the choice of one string from L.
• L⁰ ∪ L¹ = {ε, 0, 11}
• L² = {00, 011, 110, 1111}
• L³ = {000, 0011, 0110, 01111, 1100, 11011, 11110, 111111}
• To compute L* we must compute Lⁱ for each i ≥ 0 and take the union of all these languages.
• Lⁱ has 2ⁱ members.
• The union of an infinite number of terms Lⁱ is generally an infinite language (L*), as it is in this example.
Example
• Let L = {ε, 0, 00, 000, …}, the set of strings consisting of all zeros. L is an infinite language.
• L⁰ = {ε}, independent of what language L is.
• L¹ = L, which represents the choice of one string from L.
• L⁰ ∪ L¹ = {ε, 0, 00, 000, 0000, …}
• L² = {ε, 0, 00, 000, 0000, …} = L
• L³ = L
• L* = L⁰ ∪ L¹ ∪ L² ∪ … = L
• ∅ is the empty set, one of only two languages whose closure is not infinite (the other is {ε}).
  • ∅⁰ = {ε}
  • ∅¹ = ∅
  • ∅ⁱ = ∅ for every i ≥ 1
  • ∅* = {ε}
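
The powers used in the two examples can be computed mechanically for finite languages. This is a small Python sketch under the assumption that a language is a set of strings and `""` stands for ε; `concat` and `power` are illustrative helper names, not part of the slides.

```python
def concat(a, b):
    """Concatenation of two finite languages given as sets of strings."""
    return {x + y for x in a for y in b}

def power(lang, i):
    """L^i: concatenation of i strings chosen from lang (L^0 is {ε})."""
    result = {""}  # "" stands for ε
    for _ in range(i):
        result = concat(result, lang)
    return result

L = {"0", "11"}
print([len(power(L, i)) for i in range(5)])  # [1, 2, 4, 8, 16]
print(sorted(power(L, 2)))                   # ['00', '011', '110', '1111']

empty = set()                                # the language ∅
print(power(empty, 0))                       # {''}
print(power(empty, 1))                       # set()
```

The sizes double at each step for L = {0, 11} because no two different sequences of choices produce the same string. For ∅ only the zeroth power is non-empty, so the union of all powers is {ε}.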
Distinction of Star (*) and Closure (*) Operator
• Star: Σ* forms all strings whose symbols are chosen from the alphabet Σ.
• The closure operator * is essentially the same, with a subtle difference. Let L be a language containing strings of length 1, such that for each symbol a in Σ there is a string a in L. Thus:
  • Σ is a set of symbols, while
  • L is a set of strings.
• Σ* and L* denote the same language.
Building Regular Expressions
• The algebra of regular expressions follows the pattern of classical algebra.
  • Constants and variables denote languages.
  • Operators ⇒ {Union, Product, Star/Closure}
• Regular expressions are defined recursively; the language that an expression E represents is denoted by L(E).
BASIS:
• The constants ε and ∅ are regular expressions, denoting the languages L(ε) = {ε} and L(∅) = ∅, respectively.
• If a is any symbol, then a is a regular expression, with L(a) = {a}.
• A variable, e.g. L, typically capitalized and italic, represents any language.
Building Regular Expressions
INDUCTION: If E and F are regular expressions, then
• E+F is a regular expression denoting their union: L(E+F) = L(E) ∪ L(F).
• EF is a regular expression denoting their concatenation: L(EF) = L(E)L(F).
  • A dot can optionally be used to denote the concatenation operator on languages or in a regular expression. The regular expression 0.1 is the same as 01, which represents the language {01}.
• E* is a regular expression denoting the closure of L(E): L(E*) = (L(E))*.
• (E) is also a regular expression, denoting the same language as E: L((E)) = L(E).
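
The recursive definition maps naturally onto a small data type with one case per rule. The sketch below is an illustration only: the class names (`Empty`, `Eps`, `Sym`, `Union`, `Concat`, `Star`) and the function `lang` are hypothetical, and because L(E) may be infinite, `lang(E, n)` returns only the strings of L(E) whose length is at most n.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty:            # the constant ∅
    pass

@dataclass(frozen=True)
class Eps:              # the constant ε
    pass

@dataclass(frozen=True)
class Sym:              # a single symbol a
    a: str

@dataclass(frozen=True)
class Union:            # E + F
    e: object
    f: object

@dataclass(frozen=True)
class Concat:           # EF
    e: object
    f: object

@dataclass(frozen=True)
class Star:             # E*
    e: object

def lang(E, n):
    """Return the strings of L(E) that have length <= n."""
    if isinstance(E, Empty):
        return set()
    if isinstance(E, Eps):
        return {""}
    if isinstance(E, Sym):
        return {E.a} if len(E.a) <= n else set()
    if isinstance(E, Union):
        return lang(E.e, n) | lang(E.f, n)
    if isinstance(E, Concat):
        return {x + y for x in lang(E.e, n) for y in lang(E.f, n)
                if len(x + y) <= n}
    if isinstance(E, Star):
        base = lang(E.e, n)
        result, frontier = {""}, {""}
        while frontier:  # keep appending until no new short string appears
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {E!r}")

zero_one = Concat(Sym("0"), Sym("1"))
print(lang(zero_one, 3))                # {'01'}
print(sorted(lang(Star(zero_one), 4)))  # ['', '01', '0101']
```

Each branch of `lang` mirrors one basis or induction rule, so the function is a direct reading of the definition rather than an efficient matcher.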
Example
• Develop a regular expression for the language consisting of the single string 01.
  • 0 and 1 are expressions denoting the languages {0} and {1}.
  • Concatenating the two expressions gives the regular expression 01 for the language {01}.
  • As a general rule, if we want a regular expression for the language consisting of only the string w, we use w itself as the regular expression.
• Write a regular expression for the set of strings that consist of alternating 0’s and 1’s. From the above we get (01)*.
  • Note 1: 01* ≠ (01)*
  • Note 2: L((01)*) is not exactly what we want: what about strings with 1 at the beginning and/or 0 at the end?
  • (01)*+(10)*+1(01)*+0(10)*
  • The “+” operator indicates the union of the corresponding languages.
Example
• Alternate solution: (ε+1)(01)*(ε+0)
• Note: L(ε+1) = L(ε) ∪ L(1) = {ε} ∪ {1} = {ε, 1}
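
Both expressions for alternating 0’s and 1’s can be checked by brute force against a direct definition of the language. The sketch below assumes Python’s `re` syntax as a stand-in for the slide notation: union `+` is written `|`, and `(ε+x)` is written `x?`. The function names `alternating` and `agree` are made up for the illustration.

```python
import re
from itertools import product

def alternating(s):
    """True if no two adjacent symbols of s are equal."""
    return all(a != b for a, b in zip(s, s[1:]))

# (01)*+(10)*+1(01)*+0(10)*
union_form = re.compile(r"(01)*|(10)*|1(01)*|0(10)*")
# (ε+1)(01)*(ε+0)
compact_form = re.compile(r"1?(01)*0?")

def agree(max_len):
    """Compare both expressions with the definition on all short strings."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            s = "".join(chars)
            expected = alternating(s)
            if bool(union_form.fullmatch(s)) != expected:
                return False
            if bool(compact_form.fullmatch(s)) != expected:
                return False
    return True

print(agree(10))
```

A bounded check like this is not a proof of equivalence, but it catches mistakes such as forgetting the strings that start with 1 or end with 0.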
Precedence of Regular Expression Operators
• The * operator has the highest precedence.
• The concatenation (dot) operator comes next.
• The union (+) operator has the lowest precedence.
• The grouping operator “()” controls the order of operations.
• Example: 01*+1
  • (0(1*))+1 is the default grouping.
  • (01)*+1 and 0(1*+1) are other groupings, which denote different languages.
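
The effect of grouping can be observed with Python’s `re` module, which uses the same precedence order (star, then concatenation, then union), with union written as `|`. This is only an illustrative sketch; the variable names and the sample strings are chosen for the example.

```python
import re

default_grouping = re.compile(r"01*|1")        # parsed like (0(1*))+1
explicit_grouping = re.compile(r"(0(1*))|1")   # the same language, spelled out
star_on_pair = re.compile(r"(01)*|1")          # (01)*+1
union_inside = re.compile(r"0(1*|1)")          # 0(1*+1)

def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return pattern.fullmatch(s) is not None

for s in ["0111", "0101", "1", ""]:
    print(repr(s),
          matches(default_grouping, s),
          matches(star_on_pair, s),
          matches(union_inside, s))
```

The string 0111 belongs only to the groupings where the star applies to 1 alone, 0101 only to (01)*+1, and 1 is rejected by 0(1*+1) because every string in that language starts with 0.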
Exercise Examples
Exercise 3.1.1:
• Write regular expressions for the following languages:
  • The set of strings over alphabet {a, b, c} containing at least one a and at least one b.
    • First attempt: (aba*b*c*). What about other combinations?
    • (a+b+c)*a(a+b+c)*b(a+b+c)* + (a+b+c)*b(a+b+c)*a(a+b+c)*
  • The set of strings of 0’s and 1’s whose tenth symbol from the right end is 1.
    • (0+1)*1(0+1)(0+1)…(0+1)(0+1), with nine copies of (0+1) after the 1
  • The set of strings of 0’s and 1’s with at most one pair of consecutive 1’s.
    • (0+10)*(ε+1+11)(0+01)*
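
Answers to exercises of this kind can be sanity-checked by comparing a candidate expression with a direct description of the language on every short string. The following Python sketch does that for one candidate per language; the candidate expressions, written in Python `re` syntax, are assumptions of this illustration rather than official solutions, and `check` is a hypothetical helper.

```python
import re
from itertools import product

def check(pattern, alphabet, max_len, in_language):
    """True if pattern and in_language agree on all strings up to max_len."""
    compiled = re.compile(pattern)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (compiled.fullmatch(s) is not None) != in_language(s):
                return False
    return True

# At least one a and at least one b, in either order, not necessarily adjacent.
has_a_and_b = check(r"[abc]*a[abc]*b[abc]*|[abc]*b[abc]*a[abc]*",
                    "abc", 7, lambda s: "a" in s and "b" in s)

# Tenth symbol from the right end is 1.
tenth_is_one = check(r"[01]*1[01]{9}",
                     "01", 12, lambda s: len(s) >= 10 and s[-10] == "1")

def pairs_of_ones(s):
    """Number of positions where two consecutive 1's start."""
    return sum(1 for i in range(len(s) - 1) if s[i:i + 2] == "11")

# At most one pair of consecutive 1's.
at_most_one_pair = check(r"(0|10)*(11|1)?(0|01)*",
                         "01", 12, lambda s: pairs_of_ones(s) <= 1)

print(has_a_and_b, tenth_is_one, at_most_one_pair)
```

The same helper also shows why an expression that requires ab or ba to appear as adjacent symbols is too narrow: it rejects strings such as acb.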
Finite Automata and Regular Expressions
• Regular expressions describe languages in a fundamentally different form from finite automata.
• However, both describe the same set of languages: the regular languages. To show this one must prove that:
  • Every language defined by one of these automata is also defined by a regular expression. For this direction we may assume that the language is accepted by some D-FSA.
  • Every language defined by a regular expression is defined by one of these automata. For this direction we show that there is an N-FSA with ε-transitions accepting the same language.
Finite Automata and Regular Expressions
• Plan for showing the equivalence of four different notations for regular languages.
• [Figure: diagram relating the four notations: NFSA, ε-NFSA, RE and DFSA.]
Converting Regular Expressions to Automata
• We can show that every language L that is L(R) for some regular expression R is also L(E) for some ε-NFSA E.
  • Start by showing how to construct automata for the basis expressions: single symbols, ε and ∅.
  • Show how to combine these automata into larger automata that accept the union, concatenation, or closure.
Converting Regular Expressions to Automata
• Theorem:
  • Every language defined by a regular expression is also defined by a finite automaton.
• Proof:
  • Suppose L = L(R) for a regular expression R. We will show that L = L(E) for some ε-NFSA E with:
    • exactly one accepting state,
    • no arcs into the initial state,
    • no arcs out of the accepting state.
  • The proof is by structural induction on R, following the recursive definition of regular expressions.
Converting Regular Expressions to Automata
BASIS:
• [Figure: automaton for ε.] The language of the automaton is {ε}.
• [Figure: automaton for ∅.] There is no path from the start state to the accepting state, so ∅ is the language of the automaton.
• [Figure: automaton for a symbol a.] The language of the automaton is L(a), which consists of the one string a.
Converting Regular Expressions to Automata
INDUCTION: It is assumed that the statement of the theorem is true for the immediate sub-expressions of a given regular expression.
• R+S: the automaton accepts L(R) ∪ L(S)
• RS: the automaton accepts L(R)L(S)
• R*: the automaton accepts L(R*)
Example
• Convert (0+1)*1(0+1) to an ε-NFSA, building it up from the sub-expressions:
  • (0+1)
  • (0+1)*
  • (0+1)*1(0+1)
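
The basis and induction steps can be turned into a small program that assembles an ε-NFSA from sub-automata and then simulates it. The sketch below follows the usual textbook-style construction (one start state, one accepting state, ε-arcs to glue the parts together). The class names `Frag` and `Builder` and all method names are hypothetical, and `None` is used as the label of an ε-arc.

```python
import itertools

class Frag:
    """An ε-NFSA fragment with one start state and one accepting state."""
    def __init__(self, start, accept):
        self.start, self.accept = start, accept

class Builder:
    def __init__(self):
        self.ids = itertools.count()
        self.arcs = {}  # state -> list of (label, target); label None is ε

    def state(self):
        s = next(self.ids)
        self.arcs[s] = []
        return s

    def arc(self, src, label, dst):
        self.arcs[src].append((label, dst))

    def symbol(self, a):            # basis: a single symbol
        f = Frag(self.state(), self.state())
        self.arc(f.start, a, f.accept)
        return f

    def union(self, r, s):          # R+S
        f = Frag(self.state(), self.state())
        for sub in (r, s):
            self.arc(f.start, None, sub.start)
            self.arc(sub.accept, None, f.accept)
        return f

    def concat(self, r, s):         # RS
        self.arc(r.accept, None, s.start)
        return Frag(r.start, s.accept)

    def star(self, r):              # R*
        f = Frag(self.state(), self.state())
        self.arc(f.start, None, r.start)
        self.arc(f.start, None, f.accept)   # zero repetitions
        self.arc(r.accept, None, r.start)   # repeat
        self.arc(r.accept, None, f.accept)
        return f

    def eclose(self, states):
        """All states reachable from the given states by ε-arcs only."""
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for label, t in self.arcs[q]:
                if label is None and t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def accepts(self, frag, text):
        current = self.eclose({frag.start})
        for ch in text:
            moved = {t for q in current
                     for label, t in self.arcs[q] if label == ch}
            current = self.eclose(moved)
        return frag.accept in current

builder = Builder()
zero_or_one = lambda: builder.union(builder.symbol("0"), builder.symbol("1"))
# (0+1)*1(0+1); each occurrence of (0+1) gets its own fresh states
nfa = builder.concat(
    builder.concat(builder.star(zero_or_one()), builder.symbol("1")),
    zero_or_one())

for s in ["10", "0110", "01", "1", ""]:
    print(repr(s), builder.accepts(nfa, s))
```

The automaton accepts exactly the strings whose second symbol from the end is 1, which is what (0+1)*1(0+1) describes.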
Converting D-FSA’s to Regular Expressions by Eliminating States
• When a state s is eliminated from a D-FSA, all the paths that went through s no longer exist in the automaton. If the language of the automaton is not to change, we must include, on an arc that goes directly from state q to state p, the labels of the paths that went from q to p through the eliminated state s.
Converting D-FSA’s to Regular Expressions by Eliminating States
• [Figure: eliminating a state s; the direct arc from q₁ to p₁ receives the label R₁₁ + Q₁S*P₁.]
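Strategy from D-FSA to RE
• For each accepting state q of the D-FSA, apply the reduction process to produce an automaton with regular-expression labels on the arcs. Eliminate all states except q and the start state q₀.
• If q ≠ q₀, then we are left with a two-state automaton. With R and U labeling the loops on the start and accepting states, S the arc from the start state to the accepting state, and T the arc back, the accepted strings are described by: (R+SU*T)*SU*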
Strategy from D-FSA to RE
• If the start state is also an accepting state, then we must also perform a state elimination from the original automaton that gets rid of every state but the start state. What is left is a one-state automaton with a single loop labeled R, which accepts R*.
• The desired regular expression is the sum (union) of all the expressions derived from the reduced automata for each accepting state, using the two-state and one-state rules.
Example for: D-FSA to RE
• Consider the N-FSA that accepts all strings of 0’s and 1’s such that either the second or the third position from the end has a 1. Derive an equivalent regular expression for the language of this N-FSA.
• Solution:
  • Replace the labels with regular expressions.
Example for: D-FSA to RE
• Eliminate state B:
  • Predecessor states: A
  • Successor states: C
  • Equivalent expression A → C: 1(0+1)
Example for: D-FSA to RE
• Branch by eliminating states C and D in separate reductions. Elimination of state C:
  • Predecessor states: A
  • Successor states: D
  • Equivalent expression A → D: 1(0+1)(0+1)
Example for: D-FSA to RE
• Applying the generic two-state form (R+SU*T)*SU* to the automaton with states A and D gives: (0+1)*1(0+1)(0+1)
• Eliminating D instead leaves a two-state automaton with states A and C.
  • Corresponding RE: (0+1)*1(0+1)
Example for: D-FSA to RE
• Combine the two expressions into one for the entire automaton by summing them:
  • ((0+1)*1(0+1)(0+1)) + ((0+1)*1(0+1))
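
The two-state formula and the final sum can be checked mechanically. The sketch below builds a Python `re` pattern for (R+SU*T)*SU*, treating a missing arc as ∅, and compares the combined expression with the stated language on all short strings. The function names `two_state_regex` and `check` are hypothetical, `None` stands for ∅, and the arc labels passed in are the ones read off the example (loop 0+1 on A, no loop on the accepting state, no arc back).

```python
import re
from itertools import product

def two_state_regex(R, S, U, T):
    """Python regex for (R + S U* T)* S U*; None stands for the label ∅."""
    if S is None:
        return None  # accepting state unreachable: the language is ∅
    u_star = "" if U is None else f"(?:{U})*"       # ∅* = ε
    back = None if T is None else f"(?:{S}){u_star}(?:{T})"
    loops = [f"(?:{x})" for x in (R, back) if x is not None]
    prefix = f"(?:{'|'.join(loops)})*" if loops else ""
    return f"{prefix}(?:{S}){u_star}"

# Reduced automaton with accepting state D, and with accepting state C.
via_d = two_state_regex("[01]", "1[01][01]", None, None)
via_c = two_state_regex("[01]", "1[01]", None, None)
combined = re.compile(f"(?:{via_d})|(?:{via_c})")

def check(max_len):
    """Compare the combined expression with the language definition."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            s = "".join(chars)
            expected = ((n >= 2 and s[-2] == "1")
                        or (n >= 3 and s[-3] == "1"))
            if (combined.fullmatch(s) is not None) != expected:
                return False
    return True

print(check(10))
```

Handling ∅ explicitly matters here: with U = ∅ the factor U* collapses to ε, and with T = ∅ the term SU*T drops out of the union, which is how the general formula reduces to (0+1)*1(0+1)(0+1).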
Algebraic Laws for Regular Expressions 7 October 2008 Veton Këpuska 31
Algebraic Laws for Regular Expressions
• A collection of laws that define when two regular expressions are equivalent.
• Arithmetic:
  • Commutativity: x+y = y+x. Switching the order of the operands does not change the result.
  • Associativity: (xy)z = x(yz). The operands can be regrouped when the operator is applied twice.
• Regular expressions have a number of laws similar to the laws for arithmetic.
Associativity and Commutativity
For languages L, M and N (defined by regular expressions or, equivalently, by FSA):
• Commutative law for union: L+M = M+L
• Associative law for union: (L+M)+N = L+(M+N)
• Associative law for concatenation: (LM)N = L(MN)
Identities and Annihilators
Arithmetic
• Identity:
  • 0 is the identity for addition: 0+x = x+0 = x
  • 1 is the identity for multiplication: 1x = x1 = x
• Annihilator:
  • 0 is the annihilator for multiplication: 0x = x0 = 0
Regular Expressions
• Identity for union and concatenation:
  • ∅+L = L+∅ = L
  • εL = Lε = L
• Annihilator for concatenation:
  • ∅L = L∅ = ∅
• These laws are important in the simplification of regular expressions.
Distributive Laws
Arithmetic
• A distributive law involves two operators. The most common is the distributive law of multiplication over addition:
  • x(y+z) = xy + xz
Regular Expressions
• Left distributive law of concatenation over union: L(M+N) = LM + LN
• Right distributive law of concatenation over union: (M+N)L = ML + NL
Distributive Laws
Theorem:
• If L, M, and N are any languages, then L(M ∪ N) = LM ∪ LN.
Proof:
• Show that a string w is in L(M ∪ N) if and only if it is in LM ∪ LN.
• (Only-if) If w is in L(M ∪ N), then w = xy, where x is in L and y is in M ∪ N, so y is in M or in N.
  • If y is in M, then w = xy is in LM, and hence in LM ∪ LN.
  • If y is in N, then w = xy is in LN, and hence in LM ∪ LN.
• (If) If w is in LM ∪ LN, then w is in LM or in LN.
  • If w is in LM, then w = xy with x in L and y in M. Since y ∈ M, y is in M ∪ N, thus w is in L(M ∪ N).
  • If w is in LN, then w = xy with x in L and y in N. Since y ∈ N, y is in M ∪ N, thus w is in L(M ∪ N).
The Idempotent Law
Arithmetic:
• Common arithmetic operators are not idempotent; in general
  • x+x ≠ x and
  • xx ≠ x
Regular Expressions:
• Idempotent law for union:
  • L+L = L
Laws Involving Closures
• (L*)* = L*: closing an expression that is already closed does not change the language.
• ∅* = ε: the closure of ∅ contains only the string ε.
• ε* = ε
• L⁺ = LL* = L*L
  • L⁺ = L + LL + LLL + …
  • L* = ε + L + LL + LLL + …
  • LL* = Lε + LL + LLL + … = L + LL + LLL + LLLL + …
  • Lε = εL = L
• L* = L⁺ + ε
• L? = ε + L
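
The algebraic laws of this section can be spot-checked on concrete finite languages. The sketch below is an illustration under stated assumptions: languages are Python sets of strings, `""` stands for ε, only strings up to `LIMIT` symbols are kept so that closures stay finite, and the sample languages `L`, `M`, `N` are arbitrary choices. Agreement on one example does not prove a law, but a disagreement would disprove it.

```python
LIMIT = 6  # only strings of at most this length are compared

def concat(a, b):
    """Concatenation, truncated to strings of length <= LIMIT."""
    return {x + y for x in a for y in b if len(x + y) <= LIMIT}

def star(a):
    """Closure, truncated to strings of length <= LIMIT."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = concat(frontier, a) - result
        result |= frontier
    return result

EMPTY, EPS = set(), {""}          # ∅ and {ε}
L, M, N = {"0", "11"}, {"1", "01"}, {"", "10"}

laws = {
    "L+M = M+L": (L | M) == (M | L),
    "(L+M)+N = L+(M+N)": ((L | M) | N) == (L | (M | N)),
    "(LM)N = L(MN)": concat(concat(L, M), N) == concat(L, concat(M, N)),
    "∅+L = L+∅ = L": (EMPTY | L) == L == (L | EMPTY),
    "εL = Lε = L": concat(EPS, L) == L == concat(L, EPS),
    "∅L = L∅ = ∅": concat(EMPTY, L) == EMPTY == concat(L, EMPTY),
    "L(M+N) = LM+LN": concat(L, M | N) == concat(L, M) | concat(L, N),
    "(M+N)L = ML+NL": concat(M | N, L) == concat(M, L) | concat(N, L),
    "L+L = L": (L | L) == L,
    "(L*)* = L*": star(star(L)) == star(L),
    "∅* = ε": star(EMPTY) == EPS,
    "ε* = ε": star(EPS) == EPS,
    "LL* = L*L": concat(L, star(L)) == concat(star(L), L),
    "L* = LL* + ε": star(L) == concat(L, star(L)) | EPS,
}

for name, holds in laws.items():
    print(f"{name}: {holds}")
```

Truncating at a fixed length is safe for these comparisons because a string of length at most `LIMIT` can only be built from pieces that are themselves no longer than `LIMIT`.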
Discovering Laws for Regular Expressions
• There is an infinite variety of laws about regular expressions that might be proposed.
• Is there a general methodology that will make proofs of the correct laws easy?
  • The truth of a law reduces to a question of the equality of two specific languages.
  • The technique is closely tied to the regular-expression operators.
  • It cannot be extended to expressions involving some other operators (e.g., intersection).
Discovering Laws for Regular Expressions
• Consider a proposed law: (L+M)* = (L*M*)*. Given two languages L and M:
  • The closure of the union of the languages, (L+M)*, is identical to the closure of the concatenation of the individually closed languages, (L*M*)*.
• Proof:
  • Suppose w is in the language of (L+M)*. Then we can write w = w₁w₂w₃…wₖ for some k, where each wᵢ is in either L or M.
  • If wᵢ is in L, it is also in L*. Since ε is in M*, we can pick ε from M*, so wᵢ = wᵢε is in L*M*.
  • Similarly, if wᵢ is in M, we can pick ε from L*, so wᵢ is in L*M*.
  • Since each wᵢ of w = w₁w₂w₃…wₖ is in L*M*, w is in (L*M*)*.
  • To complete the proof, one must also show that strings in (L*M*)* are in (L+M)*.
  • Exercise problem.
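
Reducing a proposed law to the equality of two specific languages can be tried out by replacing each variable with a distinct symbol. In the sketch below, L becomes {a} and M becomes {b}, and the two sides are compared on all strings up to a fixed length. The helper names and the bound `LIMIT` are assumptions of this illustration, and the second comparison (L+ML versus (L+M)L) is a hypothetical non-law added only for contrast.

```python
LIMIT = 6  # compare only strings of at most this length

def concat(a, b):
    """Concatenation, truncated to strings of length <= LIMIT."""
    return {x + y for x in a for y in b if len(x + y) <= LIMIT}

def star(a):
    """Closure, truncated to strings of length <= LIMIT."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = concat(frontier, a) - result
        result |= frontier
    return result

L, M = {"a"}, {"b"}   # each variable replaced by its own symbol

# Proposed law: (L+M)* = (L*M*)*
lhs = star(L | M)
rhs = star(concat(star(L), star(M)))
print("(L+M)* = (L*M*)* :", lhs == rhs)

# A proposal that is not a law: L + ML = (L+M)L
lhs_bad = L | concat(M, L)
rhs_bad = concat(L | M, L)
print("L+ML = (L+M)L :", lhs_bad == rhs_bad)
print("counterexample strings:", sorted(lhs_bad ^ rhs_bad))
```

The bounded comparison only examines short strings, so a `True` result is evidence rather than a proof, while a `False` result comes with explicit counterexample strings.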
Regular Expressions
• Formally, a regular expression is an algebraic notation for characterizing a set of strings.
  • Thus regular expressions can be used to specify search strings as well as to define a language in a formal way.
• Regular expression search requires:
  • a pattern that we want to search for, and
  • a corpus of text to search through.
• When we give a search pattern, we will assume that the search engine returns the line of the document in which the pattern is found. This is what the UNIX grep command does.
• We will underline the exact part of the text that matches the regular expression.
• A search can be designed to return all matches to a regular expression or only the first match. We will show only the first match.
Basic Regular Expression Patterns
• The simplest kind of regular expression is a sequence of simple characters:
  • /woodchuck/
  • /Buttercup/
  • /!/
Basic Regular Expression Patterns
• Regular expressions are case sensitive:
  • /s/
  • /S/
  • /woodchucks/ will not match “Woodchucks”
• Disjunction: “[” and “]”.
Basic Regular Expression Patterns
• Specifying a range in regular expressions: “-”
Basic Regular Expression Patterns
• Negative specification, i.e. what a pattern cannot be: “^”
  • If the first symbol after the open square brace “[” is “^”, the resulting pattern is negated.
  • Example: /[^a]/ matches any single character (including special characters) except a.
Basic Regular Expression Patterns
• How do we specify both woodchuck and woodchucks?
  • Optional character specification: /?/
  • /?/ means “the preceding character or nothing”.
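
The basic patterns introduced so far (literal characters, square-bracket disjunction, ranges, negation and the optional “?”) behave the same way in Python’s `re` module, which can serve as a quick test bench. The helper `first_match` and the sample texts below are made up for this illustration; like the convention above, only the first match is reported.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matching pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

examples = [
    (r"woodchucks", "Woodchucks are busy"),       # case sensitive: no match
    (r"[wW]oodchucks", "Woodchucks are busy"),    # disjunction with [ ]
    (r"[A-Z]", "we saw Buttercup today"),         # range: first capital letter
    (r"[0-9]", "Chapter 1: Down the Hole"),       # range: first digit
    (r"[^a]", "aab"),                             # negation: first non-a
    (r"woodchucks?", "a woodchuck"),              # optional s absent
    (r"woodchucks?", "two woodchucks"),           # optional s present
]

for pattern, text in examples:
    print(f"/{pattern}/ on {text!r} -> {first_match(pattern, text)!r}")
```

`re.search` scans for the leftmost match, which corresponds to showing only the first match on a line.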
Basic Regular Expression Patterns
• The question mark “?” can be thought of as “zero or one instances of the previous character”.
  • It is a way to specify how many of something we want.
• Sometimes we need to specify regular expressions that allow repetitions of things.
• For example, consider the language of (certain) sheep, which consists of strings that look like the following:
  • baa!
  • baaa!
  • baaaa!
  • baaaaa!
  • baaaaaa!
  • …
Basic Regular Expression Patterns
• Any number of repetitions is specified by “*”, which means “any string of 0 or more”.
• Examples:
  • /aa*/: a followed by zero or more a’s
  • /[ab]*/: zero or more a’s or b’s. This will match aaaa or abababa or bbbb.
Basic Regular Expression Patterns
• We know enough to specify part of our regular expression for prices: multiple digits.
  • Regular expression for an individual digit: /[0-9]/
  • Regular expression for an integer: /[0-9][0-9]*/
  • Why is it not just /[0-9]*/? Because that pattern also matches the empty string.
• Because it is annoying to specify “at least once” by repeating the same pattern, there is a special character for “at least once”: “+”.
  • The regular expression for an integer then becomes: /[0-9]+/
  • Regular expression for the sheep language: /baaa*!/, or /baa+!/
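
The repetition operators can be compared side by side in Python’s `re` module. The sketch below treats the sheep language as the strings listed earlier, i.e. a b, at least two a’s, then an exclamation mark, and uses `re.fullmatch` so that a whole string is tested rather than a substring; the helper name `accepts` is made up for the illustration.

```python
import re

def accepts(pattern, s):
    """True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# Integers: "one digit, then zero or more digits" equals "one or more digits".
for s in ["7", "2008", "", "12a"]:
    print(repr(s),
          accepts(r"[0-9][0-9]*", s),
          accepts(r"[0-9]+", s),
          accepts(r"[0-9]*", s))   # the last one also accepts ""

# Sheep language: baa!, baaa!, baaaa!, ...
for s in ["ba!", "baa!", "baaaa!", "baa"]:
    print(repr(s),
          accepts(r"baaa*!", s),
          accepts(r"baa+!", s),
          accepts(r"baa*!", s))    # one a too few: also accepts "ba!"
```

The last column in each loop shows the cost of an off-by-one repetition: /[0-9]*/ lets the empty string through, and /baa*!/ lets ba! through.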