Regular Expression

While studying formal languages, we often express languages in set notation, such as {a^i b^i | i > 0}. Set notation is practical only when the defining property of the language is simple enough to describe. All regular languages, however, can be expressed succinctly by a regular expression.

Regular expressions over an alphabet Σ are defined recursively as follows.

(1) Ø is a regular expression and denotes the empty set.
(2) ε is a regular expression and denotes the set {ε}.
(3) For every a ∈ Σ, a is a regular expression and denotes the set {a}.
(4) If r and s are regular expressions that denote the sets R and S, respectively, then (r + s), (rs), and (r*) are regular expressions that denote, respectively, the sets R ∪ S, RS, and R*.

We may omit parentheses from a regular expression if it still expresses the same set under the assumption that the star * has higher precedence than concatenation and +, and that concatenation has higher precedence than +. For a regular expression r, L(r) denotes the set of strings expressed by r.
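The recursive definition can be mirrored directly in code. The following is a minimal sketch, not part of the original slides: the nested-tuple encoding and the function name `language` are assumptions made for illustration. Since R* is infinite in general, the function only returns the strings of L(r) up to a given length.

```python
# Hypothetical encoding of the definition as nested tuples:
#   ("empty",)        the expression Ø
#   ("eps",)          the expression ε
#   ("sym", "a")      a single symbol a
#   ("union", r, s)   (r + s)
#   ("concat", r, s)  (rs)
#   ("star", r)       (r*)

def language(r, max_len):
    """Return the strings of L(r) whose length is at most max_len."""
    kind = r[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {r[1]} if max_len >= 1 else set()
    if kind == "union":
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "concat":
        left = language(r[1], max_len)
        right = language(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":
        base = language(r[1], max_len)
        result = {""}          # zero repetitions
        frontier = {""}
        while frontier:        # append one more factor until nothing new fits
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node {kind!r}")


# a*b* written with explicit parentheses: ((a*)(b*))
A_STAR_B_STAR = ("concat", ("star", ("sym", "a")), ("star", ("sym", "b")))
```

For instance, `language(A_STAR_B_STAR, 2)` collects every string of the form a^i b^j with i + j ≤ 2.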

Regular Expression (cont'd)

For example, the regular language {a^i b^j | i, j ≥ 0} can be expressed by the regular expression a*b*, and the regular language {xaaybbz | x, y, z ∈ {a, b}*} can be expressed by (a+b)*aa(a+b)*bb(a+b)*.

We can easily prove that a*b* is a regular expression according to the definition. By part (3), a and b are regular expressions denoting the sets {a} and {b}, respectively. By part (4), a* and b* are then regular expressions denoting the sets {a}* and {b}*. Since a* and b* are regular expressions, the concatenation a*b* is also a regular expression by part (4); it denotes the set {a}*{b}*, which equals {a^i b^j | i, j ≥ 0}. By a similar argument we can show that (a+b)*aa(a+b)*bb(a+b)* is a regular expression which denotes the second language above.

Later we will see that every regular language can be expressed by a regular expression, and that if a language is expressible by a regular expression, then that language is regular.
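The two examples can be checked mechanically on short strings. This is a sketch under stated assumptions: it uses Python's `re` module, where union is written `|` rather than `+`, and the helper names `in_ai_bj`, `in_xaaybbz`, and `agree` are made up for this illustration. A brute-force comparison over short strings is evidence, not a proof.

```python
import re
from itertools import product

A_STAR_B_STAR = re.compile(r"a*b*")
AA_THEN_BB = re.compile(r"(a|b)*aa(a|b)*bb(a|b)*")


def all_strings(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def in_ai_bj(w):
    """Membership in {a^i b^j | i, j >= 0}: no b is ever followed by an a."""
    return set(w) <= {"a", "b"} and "ba" not in w


def in_xaaybbz(w):
    """Membership in {x aa y bb z}: some 'aa' occurs, with a 'bb' after it."""
    i = w.find("aa")
    return i != -1 and w.find("bb", i + 2) != -1


def agree(max_len=8):
    """Compare both regular expressions with the set definitions."""
    return all(
        (A_STAR_B_STAR.fullmatch(w) is not None) == in_ai_bj(w)
        and (AA_THEN_BB.fullmatch(w) is not None) == in_xaaybbz(w)
        for w in all_strings("ab", max_len)
    )
```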

Chomsky Hierarchy of Languages and Related Models

We have studied four types of formal grammars and their languages, and four computational models that recognize those languages, together with other related models such as L-systems, syntax flow graphs, and regular expressions. Now we study their relationships more closely. The table below summarizes the relationships among these models. This relationship, called the Chomsky hierarchy (after Noam Chomsky, who defined the classes of languages), is one of the most significant achievements in computer science.

In the table, the vertical relationship denotes proper containment and the horizontal relationship denotes characterization. For example, the class of context-free languages properly contains the regular languages, finite state machines can only recognize regular languages, and the languages recognized by finite state machines can be expressed by regular expressions. Many powerful models have been introduced (for example, the ones shown at the upper right corner) that turned out to be computationally equivalent to Turing machines. The languages of Turing machines are also called recursively enumerable sets.

The Chomsky Hierarchy

Languages (grammars)                   Machines                  Other Models
-------------------------------------  ------------------------  -----------------------------------------------------------
Recursively Enumerable Sets (type 0)   Turing Machines           Post Systems, Markov Algorithms, μ-recursive Functions, ...
Context-sensitive Languages (type 1)   Linear-bounded Automata
Context-free Languages (type 2)        Pushdown Automata
Regular Languages (type 3)             Finite State Automata     Regular Expressions

Characterization Theorem among Regular Grammars, FA's and Regular Expressions

We only prove the characterization (i.e., the horizontal relationship) at the level of regular languages, and later prove the vertical relations for the lower two levels only.

Theorem.
(1) A language L is regular if and only if it is accepted by an FA M.
(2) A language L can be expressed by a regular expression if and only if L is accepted by an FA M.

Proof of (1-a): If L is regular, then there is an FA M which accepts L.

Given any regular grammar G whose language is L, we construct an FA M. Without loss of generality, assume that every production rule of G has the form A → xB or A → x, where x is ε or a single terminal symbol, i.e., |x| ≤ 1. Otherwise, we can easily convert the rules into these restricted forms without affecting the language of the grammar. For example, if the grammar has the rule A → abbB, this rule can be replaced by the following set of rules without changing the language, where the Bi are new non-terminal symbols:

A → aB1
B1 → bB2
B2 → bB
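The conversion into the restricted form can be sketched in a few lines. This is an illustration only: the rule encoding `(head, terminals, tail)` and the function name `split_rule` are assumptions, and the fresh names B1, B2, ... are assumed not to occur elsewhere in the grammar.

```python
def split_rule(head, terminals, tail, fresh_prefix="B"):
    """Split head -> terminals tail into rules carrying one terminal each.

    tail is a non-terminal name, or None for a rule of the form A -> x.
    Fresh non-terminals are named fresh_prefix + index; the caller must make
    sure these names are unused (an assumption of this sketch).
    """
    if len(terminals) <= 1:
        return [(head, terminals, tail)]   # already in restricted form
    rules = []
    current = head
    for i, symbol in enumerate(terminals[:-1], start=1):
        fresh = f"{fresh_prefix}{i}"
        rules.append((current, symbol, fresh))
        current = fresh
    rules.append((current, terminals[-1], tail))  # last terminal keeps the tail
    return rules
```

Calling `split_rule("A", "abb", "B")` reproduces the three rules of the example, encoded as tuples.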

Proof of Characterization Theorem (cont'd)

Suppose the grammar is given as G = (VT, VN, P, S). We construct an FA M from G using the rules shown below. Let A, B ∈ VN and a ∈ VT ∪ {ε}.

For each production rule of the following type    Construct a state transition in M as follows
------------------------------------------------  ---------------------------------------------------------------
A → aB                                            a transition labeled a from state A to state B
A → aA                                            a self-loop labeled a on state A
A → a                                             a transition labeled a from state A to F, a new accepting state
A → ε                                             A is an accepting state
A is the start symbol                             let A be the start state

We can prove that L(G) = L(M), i.e., the language accepted by M is exactly the language generated by the grammar G.
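The construction rules translate almost line by line into code. The sketch below is an illustration under assumptions: rules are encoded as tuples `(head, x, tail)` with `x` a terminal or `""` for ε and `tail` a non-terminal or `None`, and the new accepting state is named `"F"`, which is assumed not to clash with a non-terminal.

```python
def grammar_to_nfa(rules, start_symbol):
    """Build (delta, start, accepting) from restricted-form grammar rules."""
    final = "F"          # the new accepting state used for rules A -> a
    delta = {}           # (state, symbol) -> set of states; "" is an ε-move
    accepting = {final}
    for head, x, tail in rules:
        if tail is None and x == "":
            accepting.add(head)                       # A -> ε
        else:
            target = final if tail is None else tail  # A -> a  or  A -> aB
            delta.setdefault((head, x), set()).add(target)
    return delta, start_symbol, accepting


def eps_closure(delta, states):
    """All states reachable from states using ε-moves only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for p in delta.get((q, ""), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def accepts(nfa, word):
    """Simulate the nondeterministic FA on word."""
    delta, start, accepting = nfa
    current = eps_closure(delta, {start})
    for c in word:
        moved = set()
        for q in current:
            moved |= delta.get((q, c), set())
        current = eps_closure(delta, moved)
    return bool(current & accepting)


# Example grammar (hypothetical): S -> aS | bB | ε,  B -> bB | ε
RULES = [("S", "a", "S"), ("S", "b", "B"), ("S", "", None),
         ("B", "b", "B"), ("B", "", None)]
M = grammar_to_nfa(RULES, "S")
```

The example grammar generates a*b*, so `accepts(M, w)` should hold exactly for strings of that shape.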

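Proof of Characterization Theorem (cont'd)

Proof of (1-b): If L is the language accepted by an FA M, then there is a regular grammar G which generates L.

Let M = (Q, Σ, δ, q0, F). Construct a regular grammar G from M according to the rules shown below, where A, B ∈ Q and a, b ∈ Σ ∪ {ε}.

State transition of M                                   Production rule of G
------------------------------------------------------  ----------------------------
a transition labeled b from state A to state B          A → bB
a self-loop labeled a on state A                        A → aA
A is the start state                                    define A as the start symbol
A is an accepting state                                 A → ε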
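The reverse direction is equally mechanical. The following sketch assumes the FA is given as a dictionary `delta` mapping `(state, symbol)` to a set of next states, with `""` standing for ε; the names `fa_to_grammar` and `generates` are illustrative, not from the slides. The second function searches for a derivation so that the grammar can be compared with the automaton on sample strings.

```python
def fa_to_grammar(delta, start, accepting):
    """Return (rules, start_symbol); rules are tuples (head, x, tail)."""
    rules = []
    for (state, symbol), targets in sorted(delta.items()):
        for target in sorted(targets):
            rules.append((state, symbol, target))  # transition A --a--> B gives A -> aB
    for state in sorted(accepting):
        rules.append((state, "", None))            # accepting state A gives A -> ε
    return rules, start


def generates(rules, start, word):
    """Decide whether the grammar derives word, by searching (non-terminal, position) pairs."""
    seen, stack = set(), [(start, 0)]
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        head, i = item
        for h, x, tail in rules:
            if h != head or not word.startswith(x, i):
                continue
            j = i + len(x)
            if tail is None:
                if j == len(word):
                    return True     # derivation ended exactly at the end of word
            else:
                stack.append((tail, j))
    return False


# Example FA (hypothetical) accepting a*b*
DELTA = {("A", "a"): {"A"}, ("A", "b"): {"B"}, ("B", "b"): {"B"}}
G_RULES, G_START = fa_to_grammar(DELTA, "A", {"A", "B"})
```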

Proof of Characterization Theorem (cont'd)

Proof of (2)-(a): If a language L can be expressed by a regular expression, then L is accepted by an FA M.

Following the definition of regular expressions, we show how to construct an FA for a given regular expression. (This is a proof by induction.) Assume that the alphabet is Σ.

1. If the regular expression is Ø, ε, or a ∈ Σ, which respectively denote the empty set, {ε}, and {a}, then for each case we construct the following FA.

[Figure: three FA, one for each of the base cases Ø, ε, and a.]

2. Suppose that for regular expressions r1 and r2 we have constructed FA M1 and M2, which recognize the languages expressed by r1 and r2, respectively. Then we can construct FA M1+2, M12, and M1*, which respectively recognize the languages expressed by the regular expressions r1 + r2, r1r2, and (r1)*, as follows:

Proof of Characterization Theorem (cont'd)

[Figure: construction of M1+2, M12, and M1* from M1 and M2, using ε-transitions and new start states.]

If L(M1) = L(r1) and L(M2) = L(r2), then L(M1+2) = L(r1 + r2), L(M12) = L(r1r2), and L(M1*) = L((r1)*).
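The inductive construction can be sketched as a recursive function. The version below is one standard way of wiring the machines together with ε-transitions (it keeps a single accepting state per machine) and may differ in detail from the figures on the slide; the tuple encoding of regular expressions and the names `build` and `accepts` are assumptions made for this illustration.

```python
from itertools import count

_ids = count()  # supplies fresh state numbers
_KINDS = {"empty", "eps", "sym", "union", "concat", "star"}


def build(r):
    """Return (start, accept, edges) for the regular expression r.

    r uses a hypothetical tuple encoding: ("empty",), ("eps",), ("sym", a),
    ("union", r1, r2), ("concat", r1, r2), ("star", r1).
    edges is a list of (p, label, q); the label "" marks an ε-transition.
    """
    kind = r[0]
    if kind not in _KINDS:
        raise ValueError(f"unknown node {kind!r}")
    if kind in ("empty", "eps", "sym"):
        s, f = next(_ids), next(_ids)
        if kind == "empty":
            return s, f, []                    # no path from s to f
        return s, f, [(s, "" if kind == "eps" else r[1], f)]
    s1, f1, e1 = build(r[1])
    if kind == "star":
        s, f = next(_ids), next(_ids)
        extra = [(s, "", s1), (f1, "", f), (s, "", f), (f1, "", s1)]
        return s, f, e1 + extra
    s2, f2, e2 = build(r[2])
    if kind == "concat":
        return s1, f2, e1 + e2 + [(f1, "", s2)]
    # union: a new start state branches into both machines
    s, f = next(_ids), next(_ids)
    extra = [(s, "", s1), (s, "", s2), (f1, "", f), (f2, "", f)]
    return s, f, e1 + e2 + extra


def accepts(r, word):
    """Build the FA for r and simulate it on word."""
    start, accept, edges = build(r)

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            p = stack.pop()
            for src, label, dst in edges:
                if src == p and label == "" and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({start})
    for c in word:
        current = closure({dst for src, label, dst in edges
                           if src in current and label == c})
    return accept in current


# (ab+c)* in the tuple encoding
AB_OR_C_STAR = ("star", ("union", ("concat", ("sym", "a"), ("sym", "b")), ("sym", "c")))
```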

Proof of Characterization Theorem (cont'd)

Proof of (2)-(b): If L is a language accepted by an FA M, then L can be expressed by a regular expression.

Definition: Generalized state transition graph. If an FA M goes from a state p to a state q on every string expressed by a regular expression r, we write δ(p, r) = q and draw the transition as an edge from p to q labeled r, as in figure (a). Figure (b) is an example.

[Figure (a): an edge from p to q labeled r. Figure (b): an edge from p to q labeled (ab+c)*.]

The state transition graph of M can be considered a special case of a generalized state transition graph, in which each edge label is a regular expression expressing one string of length 1 or 0 (the latter for an ε-transition). Generalizing δ further, for a path label w = r1r2…ri (i.e., a concatenated sequence of regular expressions), let δ(p, w) = q denote the sequence of transitions along a path whose edges are labeled with the regular expressions r1, r2, …, ri.

For a generalized state transition graph G, let L(G) be the set of strings defined as follows, where q0 is the start state and F is the set of accepting states. Clearly L(G) = L(M).

L(G) = {x | x ∈ L(w), w is a path label such that δ(q0, w) = qf ∈ F}

Given a generalized state transition graph G of an FA, we can eliminate a state from G and transform it into another generalized state transition graph G' such that L(G) = L(G'). Suppose that q is a non-accepting state in G. Suppose q has a self-loop and is on a path between its two neighboring states r and s, as shown in figure (a). (Dotted arrows indicate other possible transitions.) State q can be eliminated and generalized transitions can be added without changing the language of the automaton, as figure (b) shows.

[Figure (a), G: state q has a self-loop labeled f, incoming edges labeled a and d, and outgoing edges labeled b and c, connecting it to r and s. Figure (b), G': q is removed and the edges among r and s are labeled af*b, af*c, df*b, and df*c.]
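A single elimination step can be sketched as follows. This is an illustration under assumptions: the graph is a dictionary from `(origin, destination)` to a label written in the notation of these slides, the helper names are made up, and parentheses are added conservatively. In the usage example, the assignment of the labels a, d, b, c to particular edges between r, s, and q is also an assumption, chosen to be consistent with the labels af*b, af*c, df*b, and df*c.

```python
EPS = "ε"


def _atom(r):
    """Parenthesize r so that a following * applies to all of it."""
    return r if len(r) == 1 else f"({r})"


def _factor(r):
    """Parenthesize r if it contains a union, so it can be concatenated safely."""
    return f"({r})" if "+" in r else r


def _bypass(r_in, loop, r_out):
    """Label of the new edge: r_in (loop)* r_out, with ε factors dropped."""
    pieces = []
    if r_in != EPS:
        pieces.append(_factor(r_in))
    if loop is not None:
        pieces.append(_atom(loop) + "*")
    if r_out != EPS:
        pieces.append(_factor(r_out))
    return "".join(pieces) or EPS


def eliminate(edges, q):
    """Return a new edge dictionary with state q removed."""
    loop = edges.get((q, q))
    incoming = [(p, r) for (p, t), r in edges.items() if t == q and p != q]
    outgoing = [(s, r) for (t, s), r in edges.items() if t == q and s != q]
    result = {pair: r for pair, r in edges.items() if q not in pair}
    for p, r_in in incoming:
        for s, r_out in outgoing:
            label = _bypass(r_in, loop, r_out)
            if (p, s) in result:
                # merge with an existing parallel edge using +
                result[(p, s)] = result[(p, s)] + "+" + label
            else:
                result[(p, s)] = label
    return result


# Assumed edge assignment for the graph G described above
G = {("r", "q"): "a", ("s", "q"): "d", ("q", "q"): "f",
     ("q", "r"): "b", ("q", "s"): "c"}
G_PRIME = eliminate(G, "q")
```

With this assignment, `G_PRIME` contains the four labels af*b, af*c, df*b, and df*c on the edges among r and s.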

Now we give an example of transforming a state transition graph G into a regular expression using the above technique. Consider an FA whose state transition graph is shown in figure (a).

Clearly, if an automaton has k ≥ 1 accepting states, then the language of the automaton is the union of the languages accepted by each of the k accepting states. So we compute a regular expression ri for the language Li accepted by each accepting state, and obtain the regular expression for the language of the automaton as

r = r1 + r2 + ... + rk

For example, the language accepted by the automaton in figure (a) is the union of the languages accepted by states 0 and 4.

[Figure (a): state transition graph of the example FA, with states 0 to 4. Figure (b): the graph after eliminating state 2.]

For this example, we first compute the regular expression for the language accepted by state 4, by changing state 0 to a non-accepting state. Keeping the start state and the accepting state, we eliminate all other states, one at a time. Eliminating state 2 gives the generalized state transition graph shown in figure (b). We could have eliminated state 1 or 3 first. In general it is better to choose a state whose elimination does not induce too many new links. Before eliminating state 3, we merge links which have the same origin and destination using the + operator, and get figure (c).

[Figure (b): the graph after eliminating state 2. Figure (c): the graph after merging parallel links, with merged labels ba+a and b+ε.]

Eliminating state 3 gives the graph shown in figure (d).

[Figure (c): the graph before eliminating state 3. Figure (d): the graph after eliminating state 3, with states 0, 1, and 4.]

Finally, eliminating state 1, we get the graph in figure (e). Notice that the regular expression b+bb on the self-loop of state 4 has been simplified to b, because looping on b or bb is equivalent to looping on b, i.e., (b+bb)* = b*.

[Figure (d): the graph before eliminating state 1. Figure (e): the graph on states 0 and 4 after eliminating state 1, with edge labels a(ba+a)*b, a(ba+a)*(b+ε), bba(ba+a)*b, bba(ba+a)*(b+ε), b, and ba.]

By merging edges which have the same origin and destination, we get the final transition graph (f), from which we can construct a regular expression r4 whose language is exactly the language accepted by state 4.

[Figure (e): the graph before merging. Figure (f): the graph on states 0 and 4 after merging, with edge labels bba(ba+a)*(b+ε)+b, a(ba+a)*b+b, a(ba+a)*(b+ε), and bba(ba+a)*b+ba.]

In general, suppose a generalized transition graph with a start state and an accepting state is given, with each edge labeled with a regular expression, as shown in figure (g). Then the regular expression r2 below expresses the language accepted by the automaton:

r2 = (r11)* r12 (r22 + r21 (r11)* r12)*

Here (r11)* loops on the start state, r12 moves to the accepting state, and every return to the accepting state is either a loop r22 or a round trip r21 (r11)* r12 through the start state. By substituting each rij in this expression with the corresponding regular expression from figure (f), we get the regular expression r4 for the language accepted by state 4.

[Figure (g): start state 1 and accepting state 2, with self-loops labeled r11 and r22, an edge from 1 to 2 labeled r12, and an edge from 2 to 1 labeled r21.]

Now, to construct a regular expression for the language accepted by the other accepting state, which is the start state, we can start with figure (f) and change the start state back to an accepting state and state 4 to a non-accepting state, as shown in figure (h). This is the general case shown in figure (i), whose regular expression is

r1 = (r11 + r12 (r22)* r21)*

Substituting the corresponding regular expressions from figure (h), we get a regular expression r0 which denotes the language accepted by state 0. Finally we get a regular expression r = r0 + r4 which denotes the language accepted by the automaton M.

[Figure (h): the graph of figure (f) with state 0 accepting and state 4 non-accepting. Figure (i): state 1 is both the start state and the accepting state, state 2 is non-accepting, with edges labeled r11, r12, r21, and r22.]
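Both two-state formulas can be sanity-checked on a small instance. The sketch below is an illustration only: it takes r11 = a, r12 = b, r21 = c, r22 = d (an assumed example, not the automaton of the slides), builds the two expressions as strings, translates them to Python's `re` syntax, and compares them with a direct simulation of the two-state automaton on all short strings.

```python
import re
from itertools import product


def accept_other(r11, r12, r21, r22):
    """Start state 1, accepting state 2: (r11)* r12 (r22 + r21 (r11)* r12)*."""
    return f"({r11})*({r12})(({r22})+({r21})({r11})*({r12}))*"


def accept_start(r11, r12, r21, r22):
    """State 1 is both start and accepting: (r11 + r12 (r22)* r21)*."""
    return f"(({r11})+({r12})({r22})*({r21}))*"


def to_python(r):
    """Translate the notation of these slides to Python re syntax (+ becomes |, ε becomes empty)."""
    return r.replace("+", "|").replace("ε", "")


def run_two_state(word):
    """Simulate the example automaton; return the final state, or None if stuck."""
    moves = {(1, "a"): 1, (1, "b"): 2, (2, "c"): 1, (2, "d"): 2}
    state = 1
    for c in word:
        state = moves.get((state, c))
        if state is None:
            return None
    return state


def formulas_agree(max_len=5):
    """Compare both formulas with the simulation on all strings up to max_len."""
    r2 = re.compile(to_python(accept_other("a", "b", "c", "d")))
    r1 = re.compile(to_python(accept_start("a", "b", "c", "d")))
    for n in range(max_len + 1):
        for letters in product("abcd", repeat=n):
            w = "".join(letters)
            final = run_two_state(w)
            if (r2.fullmatch(w) is not None) != (final == 2):
                return False
            if (r1.fullmatch(w) is not None) != (final == 1):
                return False
    return True
```

The extra parentheses around every rij are deliberate: they keep the formulas correct when a label is itself a union such as a(ba+a)*b+b.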
