Regular Expressions

Regular Expressions
• Regular languages and regular expressions are used to describe the patterns that describe lexemes.
• Regular expressions are composed of empty-string, concatenation, union, and closure.
• Examples:
    A(A | D)*          where A is alphabetic and D is a digit
    (+ | - | ε) D D*
• In these examples, * is closure, | is union, ε is the empty string, and concatenation is implicit.

Meaning of Regular Expressions
Let A, B be sets of strings.
• The empty string: ""
    ε = { "" }   (sometimes written <empty>)
• Concatenation by juxtaposition:
    AB = { a^b | a in A and b in B }
    If A = {"x", "qw"} and B = {"v", "A"} then AB = {"xv", "xA", "qwv", "qwA"}
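The concatenation of two string sets can be sketched in Standard ML. Modelling sets as lists and the names concatSets, setA, and setB are assumptions made for this illustration; they do not come from the slides.

```sml
(* Sets of strings are modelled as lists (an assumption of this sketch). *)
(* concatSets (xs, ys) pairs every a in xs with every b in ys and joins them with ^ *)
fun concatSets (xs : string list, ys : string list) : string list =
  List.concat (map (fn a => map (fn b => a ^ b) ys) xs);

val setA = ["x", "qw"];
val setB = ["v", "A"];
val setAB = concatSets (setA, setB);   (* ["xv", "xA", "qwv", "qwA"] *)
```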

Meaning of Regular Expressions (cont.)
• Union by | (or other symbols such as U):
    If A = {"x", "qw"} and B = {"v", "A"} then A|B = {"x", "qw", "v", "A"}
• Closure by *:
    A* = {""} | A | AA | AAA | ... = A^0 | A^1 | A^2 | A^3 | ...
    If A = {"x", "qw"} then A* = {""} | {"x", "qw"} | {"xqw", "qwx", "xx", "qwqw"} | ...
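Since A* is an infinite union, code can only enumerate a finite prefix of it. The following is a rough sketch under the assumption that sets are lists of strings; the helper names member, union, concatSets, power, and starUpTo are hypothetical.

```sml
(* Lists stand in for sets (an assumption); union drops duplicates from ys. *)
fun member (x : string, ys) = List.exists (fn y => y = x) ys;
fun union (xs, ys) = xs @ List.filter (fn y => not (member (y, xs))) ys;

fun concatSets (xs, ys) =
  List.concat (map (fn a => map (fn b => a ^ b) ys) xs);

(* power (a, n) is A^n: A^0 = {""} and A^n = A A^(n-1) *)
fun power (a, 0) = [""]
  | power (a, n) = concatSets (a, power (a, n - 1));

(* starUpTo (a, n) = A^0 | A^1 | ... | A^n, a finite approximation of A* *)
fun starUpTo (a, 0) = power (a, 0)
  | starUpTo (a, n) = union (starUpTo (a, n - 1), power (a, n));

val approx = starUpTo (["x", "qw"], 2);
(* ["", "x", "qw", "xx", "xqw", "qwx", "qwqw"] *)
```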

Regular Expressions as a language
• We can treat regular expressions as a programming language.
• Each expression is a new program.
• Programs can be compiled.
• How do we represent the regular expression language? By using a datatype.

    datatype RE
      = Empty
      | Union of RE * RE
      | Concat of RE * RE
      | Star of RE
      | C of char;

Example RE program
    (+ | - | ε) D D*

    val re1 = Concat(Union(C #"+", Union(C #"-", Empty)),
                     Concat(C #"D", Star (C #"D")))
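To see an RE value behave like a program, one option is to interpret it directly. The sketch below is a backtracking matcher written in continuation-passing style; it is not the NFA construction that the slides go on to describe, and the names match and accepts are hypothetical. Note that re1 uses the literal character D to stand for a digit, so the test strings do the same.

```sml
datatype RE = Empty | Union of RE * RE | Concat of RE * RE | Star of RE | C of char;

(* match r cs k: can r consume a prefix of cs so that k accepts what is left? *)
fun match Empty cs k = k cs
  | match (C _) [] k = false
  | match (C c) (x :: xs) k = (c = x) andalso k xs
  | match (Union (a, b)) cs k = match a cs k orelse match b cs k
  | match (Concat (a, b)) cs k = match a cs (fn rest => match b rest k)
  | match (Star r) cs k =
      k cs orelse
      (* require progress before looping, so Star of a nullable RE terminates *)
      match r cs (fn rest => length rest < length cs andalso match (Star r) rest k);

(* accepts r s: does r match the whole string s? *)
fun accepts r s = match r (explode s) null;

val re1 = Concat (Union (C #"+", Union (C #"-", Empty)),
                  Concat (C #"D", Star (C #"D")));

val t1 = accepts re1 "+DD";   (* true *)
val t2 = accepts re1 "D";     (* true *)
val t3 = accepts re1 "+";     (* false *)
```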

R.E.’s and FSA’s
• Algorithm that constructs a FSA from a regular expression.
• FSA
    • alphabet, A
    • set of states, S
    • a transition function, A x S -> S
    • a start state, S0
    • a set of accepting states, SF subset of S
• Defined by cases over the structure of regular expressions
• Let A, B be R.E.’s and “x” a symbol of the alphabet; then
    • ε is a R.E.
    • “x” is a R.E.
    • AB is a R.E.
    • A|B is a R.E.
    • A* is a R.E.
• 1 rule for each case
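The FSA components listed above can be written down as a record. This is a minimal sketch for the deterministic case only, assuming characters as the alphabet (left implicit) and integers as states; the names State, Fsa, run, digitsDelta, and digits are hypothetical.

```sml
type State = int;
type Fsa = { delta : char * State -> State,   (* transition function, A x S -> S *)
             start : State,                   (* start state, S0 *)
             accepting : State list };        (* accepting states, SF *)

(* Feed the input through delta one character at a time, then test the final state. *)
fun run ({delta, start, accepting} : Fsa) (input : string) : bool =
  let
    val final = foldl delta start (explode input)
  in
    List.exists (fn s => s = final) accepting
  end;

(* Example machine for one or more digits: 0 = start, 1 = accepting, 2 = dead state. *)
fun digitsDelta (c, 0) = if Char.isDigit c then 1 else 2
  | digitsDelta (c, 1) = if Char.isDigit c then 1 else 2
  | digitsDelta (_, _) = 2;

val digits : Fsa = { delta = digitsDelta, start = 0, accepting = [1] };

val ok = run digits "123";    (* true *)
val bad = run digits "12a";   (* false *)
```

The dead state keeps delta total, matching the A x S -> S signature on the slide; an option-valued delta would be another reasonable choice.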

Rules
• ε
• “x”
• AB
• A|B
• A*
[Figure: the NFA fragment for each rule, with edges labelled ε and x]

Example: (a|b)*abb
[Figure: NFA for (a|b)*abb with states 0 through 10 and edges labelled a, b, and ε]
• Note the many ε transitions
• Loops caused by the *
• Non-determinism: many paths out of a state on “a”

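Building an NFA from a RE

    datatype Label = Epsilon | Char of char;
    type Start = int;
    type Finish = int;
    datatype Edge = Edge of Start * Label * Finish;

    (* Not shown on the slide but used by nfa; inferred from the triples nfa returns *)
    type Nfa = Start * Finish * Edge list;

    (* counter that supplies fresh state numbers *)
    val next = ref 0;
    fun new () = let val ref n = next
                 in (next := n+1; n) end;

• ref makes a mutable variable
• A semicolon separates commands (inside parentheses)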

[Figure: NFA fragments for ε, “x”, and A|B]

    fun nfa Empty =
          let val s = new()
              val f = new()
          in (s,f,[Edge(s,Epsilon,f)]):Nfa end
      | nfa (C x) =
          let val s = new()
              val f = new()
          in (s,f,[Edge(s,Char x,f)]) end
      | nfa (Union(x,y)) =
          let val (sx,fx,xes) = nfa x
              val (sy,fy,yes) = nfa y
              val s = new()
              val f = new()
              val newes = [Edge(s,Epsilon,sx)
                          ,Edge(s,Epsilon,sy)
                          ,Edge(fx,Epsilon,f)
                          ,Edge(fy,Epsilon,f)]
          in (s,f,newes @ xes @ yes) end

[Figure: NFA fragments for AB and A*]

      | nfa (Concat(x,y)) =
          let val (sx,fx,xes) = nfa x
              val (sy,fy,yes) = nfa y
          in (sx,fy,(Edge(fx,Epsilon,sy)) :: (xes @ yes)) end
      | nfa (Star r) =
          let val (sr,fr,res) = nfa r
              val s = new()
              val f = new()
              val newes = [Edge(s,Epsilon,sr)
                          ,Edge(fr,Epsilon,f)
                          ,Edge(s,Epsilon,f)
                          ,Edge(f,Epsilon,s)]
          in (s,f,newes @ res) end

Example use

    val re1 = Concat(Union(C #"+", Union(C #"-", Empty)),
                     Concat(C #"D", Star (C #"D")))

    val ex6 = nfa re1;

    val ex6 =
      (8,15,
       [Edge (9,Epsilon,10),Edge (8,Epsilon,0)
       ,Edge (8,Epsilon,6),Edge (1,Epsilon,9)
       ,Edge (7,Epsilon,9),Edge (0,Char #"+",1)
       ,Edge (6,Epsilon,2),Edge (6,Epsilon,4)
       ,Edge (3,Epsilon,7),Edge (5,Epsilon,7)
       ,Edge (2,Char #"-",3),Edge (4,Epsilon,5),...]) : Nfa
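A triple like ex6 can be run against an input by tracking the set of states the NFA could be in. The following is a rough sketch, not part of the slides: it assumes the (start, finish, edges) shape produced by nfa, and the helper names member, closure, step, and accepts are hypothetical. The sample value aStar is written out by hand following the Star rule for a*.

```sml
datatype Label = Epsilon | Char of char;
datatype Edge = Edge of int * Label * int;

fun member (x : int, ys) = List.exists (fn y => y = x) ys;

(* closure edges (worklist, seen): add every state reachable via ε edges *)
fun closure edges ([], seen) = seen
  | closure edges (s :: rest, seen) =
      if member (s, seen) then closure edges (rest, seen)
      else
        let
          val succs = List.mapPartial
                        (fn Edge (a, Epsilon, b) => if a = s then SOME b else NONE
                          | _ => NONE) edges
        in
          closure edges (succs @ rest, s :: seen)
        end;

(* step edges (ss, c): states reached from any state in ss by one edge labelled c *)
fun step edges (ss, c) =
  List.mapPartial
    (fn Edge (a, Char x, b) => if x = c andalso member (a, ss) then SOME b else NONE
      | _ => NONE) edges;

(* accepts: after consuming all input, is the finish state among the current states? *)
fun accepts (start, finish, edges) input =
  let
    fun go (ss, []) = member (finish, ss)
      | go (ss, c :: cs) = go (closure edges (step edges (ss, c), []), cs)
  in
    go (closure edges ([start], []), explode input)
  end;

(* Hand-written NFA for a*, laid out as the Star rule would build it. *)
val aStar = (2, 3, [Edge (2, Epsilon, 0), Edge (1, Epsilon, 3),
                    Edge (2, Epsilon, 3), Edge (3, Epsilon, 2),
                    Edge (0, Char #"a", 1)]);

val r1 = accepts aStar "";     (* true *)
val r2 = accepts aStar "aa";   (* true *)
val r3 = accepts aStar "ab";   (* false *)
```

Tracking a set of states sidesteps the non-determinism: every path is followed at once, at the cost of recomputing ε-closures on each step.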

Assignment #3
CS321 Prog Lang & Compilers
Assigned: Jan 22, 2007    Due: Wed. Jan 24, 2007

Turn in a listing, and a transcript that shows you have tested your code. A minimum of 3 tests is necessary. Some functions may require more than 3 tests to receive full credit.

1) Write the following functions over lists. You must use pattern matching and recursion.
   A. Reverse a list so that its elements appear in the opposite order.
        reverse [1,2,3,4] ----> [4,3,2,1]
   B. Count the number of occurrences of an element in a list.
        count 4 [1,2,3,4,5,4] ----> 2
        count 4 [1,2,3,2,1] ----> 0
   C. Concatenate together a list of lists.
        concat [[1,2],[],[5,6]] ----> [1,2,5,6]

2) Using the datatype for Regular Expressions we defined in class

        datatype RE
          = Empty
          | Union of RE * RE
          | Concat of RE * RE
          | Star of RE
          | C of char;

   write a function that turns a RE into a string, so that it can be printed. Minimize the number of parentheses, but keep the string unambiguous by using the following rules.
   1) Star has highest precedence, so ab* means a(b*)
   2) Concat has the next highest precedence, so a+bc means a+(bc)
   3) Union has lowest precedence, so a+bc+c* means a+(bc)+(c*)
   4) Use the hash mark (#) as the empty string.
   5) Special characters *+()\ should be escaped by using a preceding backslash. So Concat (C #"+", C #"a") should be "\+a"

Hints:
   1) The string concatenation operator is useful: "abc" ^ "zx" ----> "abczx"
   2) Write this in two steps. First, fully parenthesize every RE. Second, change the function to not add the parentheses which the rules don't require.
