
LEXICAL ANALYSIS

LEXICAL ANALYSIS. Phung Hua Nguyen, University of Technology, 2006. Outline: Introduction to Lexical Analysis; Token specification (Language, Regular Expressions (REs)); Token recognition: REs → NFA (Thompson’s construction, Algorithm 3.3), NFA → DFA (subset construction, Algorithm 3.2)



  1. LEXICAL ANALYSIS Phung Hua Nguyen University of Technology 2006

  2. Outline • Introduction to Lexical Analysis • Token specification • Language • Regular Expressions (REs) • Token recognition • REs → NFA (Thompson’s construction, Algorithm 3.3) • NFA → DFA (subset construction, Algorithm 3.2) • DFA → minimal DFA (Algorithm 3.6) • Programming Lexical Analysis

  3. Introduction • Read the input characters • Produce as output a sequence of tokens • Eliminate white space and comments • [Figure: source program → lexical analyzer → token → parser; the parser sends "get next token" to the lexical analyzer; both use the symbol table] Lexical Analysis

  4. Why? • Simplify design • Improve compiler efficiency • Enhance compiler portability

  5. Tokens, Patterns, Lexemes

  6. Outline • Introduction ✓ • Token specification • Language • Regular Expressions (REs) • Token recognition • REs → NFA (Thompson’s construction, Algorithm 3.3) • NFA → DFA (subset construction, Algorithm 3.2) • DFA → minimal DFA (Algorithm 3.6) • Programming

  7. Alphabet, Strings and Languages • Alphabet ∑: any finite set of symbols • The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ,…} • The binary alphabet {0,1} • The ASCII alphabet • String: a finite sequence of symbols drawn from ∑ • Length |s| of a string s: the number of symbols in s • The empty string, denoted ε, |ε| = 0 • Language: any set of strings over ∑; its two special cases: • ∅: the empty set • {ε}: the set containing only the empty string

  8. Examples of Languages • ∑ = {a, á, à, ả, ã, ạ, b, c, d, đ,…} • Vietnamese language • ∑ = {0,1} • A string is an instruction • The set of Pentium instructions • ∑ = the ASCII set • A string is a program • The set of C programs

  9. Terms (Fig. 3.7)

  10. String operations • String concatenation • If x and y are strings, xy is the string formed by appending y to x. E.g.: x = hom, y = nay ⇒ xy = homnay • ε is the identity: εy = y; xε = x • String exponentiation • s⁰ = ε • sⁱ = sⁱ⁻¹s E.g. s = 01: s⁰ = ε, s² = 0101, s³ = 010101
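The two string operations above can be tried out directly. The following is a minimal Python sketch, not part of the original slides; the function name `power` is an illustrative choice, and the empty Python string stands in for ε.

```python
def power(s: str, i: int) -> str:
    """Return s^i using the inductive definition: s^0 = epsilon, s^i = s^(i-1) s."""
    if i == 0:
        return ""              # the empty string plays the role of epsilon
    return power(s, i - 1) + s


x, y = "hom", "nay"
xy = x + y                     # concatenation: append y to x

print(xy)                      # homnay
print(power("01", 2))          # 0101
print(power("01", 3))          # 010101
```

Because ε is the identity for concatenation, `"" + y == y` and `x + "" == x` hold for any strings.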

  11. Language Operations (Fig. 3.8)

  12. Examples • L = {A,B,…,Z,a,b,…,z} • D = {0,1,…,9}
      • L ∪ D: letters and digits
      • LD: strings consisting of a letter followed by a digit
      • L⁴: all four-letter strings
      • L*: all strings of letters, including ε
      • L(L ∪ D)*: all strings of letters and digits beginning with a letter
      • D⁺: all strings of one or more digits
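To make the language operations concrete, here is a small Python sketch that models languages as sets of strings. It is an illustration only: the two-symbol sets `L` and `D` are stand-ins for the full letter and digit sets (which would make L⁴ very large), and `star_up_to` is a hypothetical helper that approximates the infinite closure by bounding the number of concatenated pieces.

```python
def concat(A, B):
    """Concatenation of languages: every string of A followed by every string of B."""
    return {a + b for a in A for b in B}


def power(A, i):
    """A^i: A^0 = {epsilon}, A^i = A^(i-1) A."""
    result = {""}
    for _ in range(i):
        result = concat(result, A)
    return result


def star_up_to(A, n):
    """Finite approximation of the Kleene closure: A^0 ∪ A^1 ∪ ... ∪ A^n."""
    result = set()
    for i in range(n + 1):
        result |= power(A, i)
    return result


L = {"a", "b"}          # stand-in for the letters
D = {"0", "1"}          # stand-in for the digits

union = L | D                                       # letters and digits
letter_digit = concat(L, D)                         # a letter followed by a digit
four_letters = power(L, 4)                          # all four-letter strings
identifiers = concat(L, star_up_to(L | D, 2))       # letter, then up to 2 letters/digits
```

With these stand-ins, `letter_digit` is `{"a0", "a1", "b0", "b1"}` and `four_letters` has 2⁴ = 16 elements.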

  13. Regular Expressions (REs) over Alphabet ∑
      • Inductive base:
        • ε is a RE, denoting the RL {ε}
        • a ∈ ∑ is a RE, denoting the RL {a}
      • Inductive step: Suppose r and s are REs, denoting the languages L(r) and L(s). Then
        • (r)|(s) is a RE, denoting the RL L(r) ∪ L(s)
        • (r)(s) is a RE, denoting the RL L(r)L(s)
        • (r)* is a RE, denoting the RL (L(r))*
        • (r) is a RE, denoting the RL L(r)

  14. Precedence and Associativity • Precedence: • “*” has the highest precedence • “concatenation” has the second highest precedence • “|” has the lowest precedence • Associativity: • all are left-associative E.g.: (a)|((b)*(c)) ≡ a|b*c ⇒ Unnecessary parentheses can be removed

  15. Example • ∑ = {a, b} • a|b denotes {a,b} • (a|b)(a|b) denotes {aa,ab,ba,bb} • a* denotes {ε,a,aa,aaa,aaaa,…} • (a|b)* denotes ? • a|a*b denotes ?
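One way to explore the two open questions on this slide is to enumerate short strings over ∑ = {a, b} and keep those that a regular expression matches in full. The sketch below uses Python's `re` module as a convenient stand-in for the formal definition (its syntax for `|`, `*` and parentheses agrees with the slides on these examples); `language_up_to` is a hypothetical helper name.

```python
import re
from itertools import product


def language_up_to(pattern, alphabet, max_len):
    """List every string over `alphabet` of length <= max_len matched entirely by `pattern`."""
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                found.append(s)
    return found


print(language_up_to("(a|b)*", "ab", 2))   # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(language_up_to("a|a*b", "ab", 3))    # ['a', 'b', 'ab', 'aab']
```

The output suggests that (a|b)* denotes all strings of a's and b's including ε, and that a|a*b denotes the string a together with every string of zero or more a's followed by a single b.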

  16. Notational Shorthands • One or more instances +: r+ = rr* • denotes the language (L(r))⁺ • has the same precedence and associativity as * • Zero or one instance ?: r? = r|ε • denotes the language L(r) ∪ {ε} • Character classes • [abc] denotes a|b|c • [A-Z] denotes A|B|…|Z • [a-zA-Z_][a-zA-Z0-9_]* denotes ?
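The shorthand identities can be spot-checked on sample strings. This is a hedged sketch using Python's `re` module, which supports `+`, `?` and character classes with the same meaning as on the slide; `same_language` and `is_identifier` are illustrative names, and agreement on a handful of samples is evidence, not a proof of equivalence.

```python
import re


def same_language(p, q, samples):
    """True if patterns p and q accept exactly the same strings among `samples`."""
    return all(
        (re.fullmatch(p, s) is None) == (re.fullmatch(q, s) is None)
        for s in samples
    )


samples = ["", "a", "aa", "aaa", "b", "ab", "c", "d"]

plus_ok = same_language("a+", "aa*", samples)         # r+  = rr*
opt_ok = same_language("a?", "(a|)", samples)         # r?  = r|epsilon (empty alternative)
class_ok = same_language("[abc]", "a|b|c", samples)   # [abc] = a|b|c

IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"


def is_identifier(s):
    """Check a string against the last character-class pattern on the slide."""
    return re.fullmatch(IDENT, s) is not None
```

The last pattern accepts strings such as `_tmp1` and rejects `9lives`, which is the shape of identifiers in C-like languages.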

  17. Outline • Introduction ✓ • Token specification ✓ • Language • Regular Expressions (REs) • Token recognition • REs → NFA (Thompson’s construction, Algorithm 3.3) • NFA → DFA (subset construction, Algorithm 3.2) • DFA → minimal DFA (Algorithm 3.6) • Programming

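  18. Overview • RE → NFA (Algorithm 3.3) • NFA → DFA (Algorithm 3.2) • DFA → mDFA (minimal DFA, Algorithm 3.6) • [Figure: conversion diagram over RE, NFA, DFA and mDFA; it also carries an edge labelled 3.5]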

  19. Nondeterministic finite automata • A nondeterministic finite automaton (NFA) is a mathematical model that consists of • a finite set of states S • a set of input symbols ∑ • a transition function move: S × ∑ → 2^S (state-symbol pairs map to sets of states) • a start state s0 • a finite set of final or accepting states F

  20. Transition graph [Figure: example transition graph with states A and B; not reproduced in the transcript]

  21. Transition table [Table: one row per state, one column per input symbol; entries not reproduced in the transcript]

  22. Acceptance • An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.
      • Input 01010: A → B → A → B → A → B
      • Input 01011: A → B → A → B → A → ? (error)
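The acceptance condition ("there is some path") can be checked without backtracking by tracking the set of all states reachable after each input symbol. The sketch below is an illustration under stated assumptions: the slide's own graph is not reproduced in the transcript, so `MOVE` is a hypothetical NFA for (a|b)*abb, and `EPS = None` is this sketch's convention for labelling ε-moves.

```python
EPS = None  # label used for epsilon-moves in this sketch (an assumption)

# Hypothetical NFA for (a|b)*abb: state 0 is the start state, state 3 is accepting.
MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}


def eps_closure(states, move):
    """All states reachable from `states` using epsilon-moves only."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in move.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def accepts(text, move=MOVE, start=START, accepting=ACCEPTING):
    """True iff some path from the start state to an accepting state spells `text`."""
    current = eps_closure({start}, move)
    for ch in text:
        reached = set()
        for s in current:
            reached |= move.get((s, ch), set())
        current = eps_closure(reached, move)
    return bool(current & accepting)


print(accepts("aabb"))   # True
print(accepts("abab"))   # False
```

If the set of current states becomes empty, every path has got stuck, which corresponds to the "error" case on the slide.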

  23. Deterministic finite automata • A deterministic finite automaton (DFA) is a special case of NFA in which • no state has an ε-transition, and • for each state s and input symbol a, there is at most one edge labeled a leaving s.

  24. Thompson’s construction of NFA from REs • guided by the syntactic structure of the RE r • For ε: a start state i with an ε-edge to an accepting state f • For a in ∑: a start state i with an a-edge to an accepting state f

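  25. Thompson’s construction (cont’d) • Suppose N(s) and N(t) are NFAs for REs s and t • For s|t: a new start state i has ε-edges to the start states of N(s) and N(t), whose accepting states have ε-edges to a new accepting state f • For st: N(s) followed by N(t), with the accepting state of N(s) merged with the start state of N(t) • For s*: a new start state i and a new accepting state f, with ε-edges from i to the start of N(s), from the end of N(s) back to its start, from the end of N(s) to f, and from i directly to f • For (s), use N(s) itself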
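The construction rules can be sketched as a recursive function over the syntax tree of the RE. This is a minimal illustration, not the slides' own code: the tuple encoding of REs (`("sym", a)`, `("eps",)`, `("alt", r, s)`, `("cat", r, s)`, `("star", r)`) and all function names are assumptions of this sketch, and for concatenation it links the two fragments with an ε-edge instead of merging states, which accepts the same language.

```python
from itertools import count

EPS = None  # label for epsilon-edges in this sketch (an assumption)


def thompson(node, edges, fresh):
    """Build an NFA fragment for `node`; return its (start, accept) states.

    `edges` collects (source, label, target) triples; `fresh` yields new state ids.
    """
    kind = node[0]
    if kind in ("eps", "sym"):
        i, f = next(fresh), next(fresh)
        edges.append((i, EPS if kind == "eps" else node[1], f))
        return i, f
    if kind == "alt":
        i1, f1 = thompson(node[1], edges, fresh)
        i2, f2 = thompson(node[2], edges, fresh)
        i, f = next(fresh), next(fresh)
        edges.extend([(i, EPS, i1), (i, EPS, i2), (f1, EPS, f), (f2, EPS, f)])
        return i, f
    if kind == "cat":
        i1, f1 = thompson(node[1], edges, fresh)
        i2, f2 = thompson(node[2], edges, fresh)
        edges.append((f1, EPS, i2))   # epsilon link instead of merging f1 with i2
        return i1, f2
    if kind == "star":
        i1, f1 = thompson(node[1], edges, fresh)
        i, f = next(fresh), next(fresh)
        edges.extend([(i, EPS, i1), (f1, EPS, i1), (f1, EPS, f), (i, EPS, f)])
        return i, f
    raise ValueError(f"unknown node kind: {kind}")


def closure(states, edges):
    """Epsilon-closure of a set of states."""
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is EPS and dst not in result:
                result.add(dst)
                stack.append(dst)
    return result


def accepts(regex, text):
    """Build the NFA for `regex` and simulate it on `text`."""
    edges = []
    start, accept = thompson(regex, edges, count())
    current = closure({start}, edges)
    for ch in text:
        moved = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(moved, edges)
    return accept in current


# a|b*c, the example used on the precedence slide
A_OR_BSTAR_C = ("alt", ("sym", "a"), ("cat", ("star", ("sym", "b")), ("sym", "c")))
```

For example, `accepts(A_OR_BSTAR_C, "bbc")` is true while `accepts(A_OR_BSTAR_C, "ab")` is false.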

  26. Outline • Introduction ✓ • Token specification ✓ • Language • Regular Expressions (REs) • Token recognition • REs → NFA (Thompson’s construction) ✓ • NFA → DFA (subset construction) • DFA → minimal DFA (Algorithm 3.6) • Programming

  27. Subset construction • s : an NFA state • T : a set of NFA states

  28. Subset construction (cont’d)
      Let s0 be the start state of the NFA;
      Dstates contains the only unmarked state ε-closure(s0);
      while there is an unmarked state T in Dstates do begin
          mark T;
          for each input symbol a do begin
              U := ε-closure(move(T, a));
              if U is not in Dstates then
                  add U as an unmarked state to Dstates;
              DTran[T, a] := U;
          end;
      end;

  29. DFA • Let (∑, S, T, F, s0) be the original NFA. The DFA is: • The alphabet: ∑ • The states: all states in Dstates • The transitions: DTran • The accepting states: all states in Dstates containing at least one accepting state in F of the NFA • The start state: ε-closure(s0)
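The subset construction and the resulting DFA can be sketched in Python as follows. This is an illustration under assumptions: the NFA in `MOVE` is a hypothetical one for (a|b)*abb with an extra ε-move from a new start state 4, `EPS = None` labels ε-moves, and a queue of unmarked states replaces the explicit marking of the pseudocode. As in the pseudocode, an empty set U would also be added as a (dead) DFA state if it arose.

```python
from collections import deque

EPS = None  # label for epsilon-moves in this sketch (an assumption)

# Hypothetical NFA: start state 4 --eps--> 0, then (a|b)*abb on states 0..3.
MOVE = {
    (4, EPS): {0},
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}


def eps_closure(states, move):
    """Epsilon-closure of a set T of NFA states."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in move.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move_set(T, a, move):
    """move(T, a): NFA states reachable from some state in T on symbol a."""
    reached = set()
    for s in T:
        reached |= move.get((s, a), set())
    return reached


def subset_construction(move, start, accepting, alphabet):
    """Return (start state, Dstates, DTran, accepting states) of the DFA."""
    d_start = frozenset(eps_closure({start}, move))
    dstates = {d_start}
    unmarked = deque([d_start])
    dtran = {}
    while unmarked:
        T = unmarked.popleft()              # taking T off the queue marks it
        for a in alphabet:
            U = frozenset(eps_closure(move_set(T, a, move), move))
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[T, a] = U
    d_accepting = {T for T in dstates if T & accepting}
    return d_start, dstates, dtran, d_accepting


def dfa_accepts(dfa, text):
    """Run the DFA on `text` (every character must be in the alphabet)."""
    state, _, dtran, d_accepting = dfa
    for ch in text:
        state = dtran[state, ch]
    return state in d_accepting


DFA = subset_construction(MOVE, start=4, accepting={3}, alphabet="ab")
```

For this NFA the construction yields five DFA states: {0,4}, {0}, {0,1}, {0,2} and {0,3}, of which only {0,3} is accepting.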

  30. Outline • Introduction ✓ • Token specification ✓ • Language • Regular Expressions (REs) • Token recognition • REs → NFA (Thompson’s construction) ✓ • NFA → DFA (subset construction) ✓ • DFA → minimal DFA (Algorithm 3.6) • Programming

  31. Minimise a DFA
      Initially, create two states:
      • one is the set of all final states: F
      • the other is the set of all non-final states: S - F
      while (more splits are possible) {
          Let S = {s1,…, sn} be a state and c be any char in ∑
          Let t1,…, tn be the successor states to s1,…, sn under c
          if (t1,…, tn don't all belong to the same state) {
              Split S into new states so that si and sj remain in the same state
              iff ti and tj are in the same state
          }
      }

  32. Example
      Step 1: {A,B,C,D} {E}. For a: {B,B,B,B}. For b: {C,D,C,E}. Split: {A,B,C} {D} {E}
      Step 2: {A,B,C}. For b: {C,D,C}. Split: {A,C} {B} {D} {E}
      Step 3: {A,C}. For a: {B,B}. For b: {C,C}. Terminate
      [Figure: the DFA over states A to E before and after minimisation, with edges labelled a and b]
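The splitting loop can be sketched as a partition refinement in Python. This is an illustration, not the slides' code: it splits each group on all input symbols at once per pass, which reaches the same final partition as splitting on one character at a time. The transitions of A to D are read off the worked example; the transitions of E are an assumption (they do not affect the partition here, since E is alone in its group from the start).

```python
def minimize(states, alphabet, delta, finals):
    """Return the groups of equivalent states of a complete DFA."""
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    changed = True
    while changed:
        changed = False
        block_of = {s: i for i, group in enumerate(partition) for s in group}
        new_partition = []
        for group in partition:
            # States stay together iff their successors fall in the same groups.
            buckets = {}
            for s in group:
                key = tuple(block_of[delta[s, c]] for c in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
            if len(buckets) > 1:
                changed = True
        partition = new_partition
    return partition


DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",   # assumed: not stated in the transcript
}

groups = minimize("ABCDE", "ab", DELTA, finals={"E"})
result = {frozenset(g) for g in groups}
```

On this DFA the refinement ends with the groups {A,C}, {B}, {D} and {E}, matching the worked example.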

  33. Outline • Introduction ✓ • Token specification ✓ • Language • Regular Expressions (REs) • Token recognition • REs → NFA (Thompson’s construction) ✓ • NFA → DFA (subset construction) ✓ • DFA → minimal DFA (Algorithm 3.6) ✓ • Programming

  34. Input Buffering
      [Figure: a two-half input buffer holding "begin…", read by the Scanner; eof marks the end of the input]
      if (forward at end of first half) {
          reload second half
          forward++
      } else if (forward at end of second half) {
          reload first half
          forward = 0
      } else
          forward++

  35. Input Buffering
      [Figure: the same two-half buffer holding "begin…", now with an eof sentinel at the end of each half as well as at the end of the input]
      forward = forward + 1
      if (forward↑ = eof) {
          if (forward at end of first half) {
              reload second half
              forward++
          } else if (forward at end of second half) {
              reload first half
              forward = 0
          } else
              terminate the analysis
      }
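The sentinel scheme can be simulated in Python to see why a single comparison per character suffices in the common case. This is a rough sketch under assumptions: `TwoHalfBuffer`, `next_char` and `read_all` are hypothetical names, the NUL character is assumed never to occur in the source text so it can serve as the eof sentinel, and the lexeme-beginning pointer is left out.

```python
import io

EOF = "\0"  # sentinel character; assumed not to occur in the source text


class TwoHalfBuffer:
    """Two halves of n characters, each followed by one sentinel slot."""

    def __init__(self, source, n=4):
        self.source = source
        self.n = n
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = -1

    def _load(self, start):
        """Reload the half beginning at `start`; pad a short read with EOF."""
        chunk = self.source.read(self.n)
        for k in range(self.n):
            self.buf[start + k] = chunk[k] if k < len(chunk) else EOF
        self.buf[start + self.n] = EOF          # sentinel closing this half

    def next_char(self):
        """Advance `forward`; return the next character, or None at end of input."""
        self.forward += 1
        if self.buf[self.forward] == EOF:       # the only test on the fast path
            if self.forward == self.n:                  # end of first half
                self._load(self.n + 1)
                self.forward += 1
            elif self.forward == 2 * self.n + 1:        # end of second half
                self._load(0)
                self.forward = 0
            else:
                return None                             # eof inside a half: real end
            if self.buf[self.forward] == EOF:
                return None                             # input ended exactly at a boundary
        return self.buf[self.forward]


def read_all(text, n=4):
    """Read `text` through the buffer and return everything that came out."""
    buf = TwoHalfBuffer(io.StringIO(text), n)
    return "".join(iter(buf.next_char, None))
```

Reading `"begin x := 1"` through a buffer with halves of four characters exercises both reload branches and returns the text unchanged.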

  36. Transition Diagrams
      • relop → <= | < | <>: state 0 → 1 on "<"; state 1 → 2 on "=", return(relop,LE); state 1 → 3 on ">", return(relop,NE); state 1 → 4 on other, return(relop,LT)
      • id → letter(letter|digit)*: state 5 → 6 on letter; state 6 → 6 on letter or digit; state 6 → 7 on other, return(id,lexeme)
      • A transition diagram is a DFA in which no edge leaves a final state

  37. Implementation
      Token nexttoken() {
          while (true) {
              switch (state) {
              case 0: c = nextchar();                       // start of the relop diagram
                      if (c == '<') state = 1;
                      else state = fail(0);                 // try the next diagram
                      break;
              case 1: c = nextchar();
                      if (c == '=') state = 2;
                      else if (c == '>') state = 3;
                      else state = 4;
                      break;
              case 2: retract(0); return new Token(relop, "<=");
              case 3: retract(0); return new Token(relop, "<>");   // missing on the slide; follows the diagram
              case 4: retract(1); return new Token(relop, "<");    // give back the "other" character
              case 5: c = nextchar();                       // start of the id diagram
                      if (Character.isLetter(c)) state = 6;
                      else state = fail(5);
                      break;
              case 6: c = nextchar();
                      if (Character.isLetter(c) || Character.isDigit(c)) continue;
                      else state = 7;
                      break;
              case 7: retract(1); return new Token(id, getLexeme());
              }
          }
      }

  38. Implementation (cont’d)
      int fail(int current_state) {
          forward = beginning;            // rescan the lexeme with the next diagram
          switch (current_state) {
          case 0: return 5;               // relop failed: try the id diagram
          case 5: error();                // no diagram left
          }
          return -1;                      // only reached if error() returns
      }
      void retract(int flag) {
          if (flag == 1) move forward back
          get lexeme from beginning to forward
          move forward onward
          beginning = forward
          state = 0
      }
      Input buffer: b│e│g│i│n│:│=│ │ │…
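For comparison, the same two transition diagrams can be written as a small runnable scanner. This is a Python sketch under assumptions, not a translation of the slides' code: it works on an in-memory string instead of the input buffer, `next_token` and `tokenize` are illustrative names, retraction is modelled by simply not advancing past the "other" character, and `str.isalpha`/`str.isalnum` stand in for the letter and digit tests.

```python
def next_token(text, begin):
    """Scan one token starting at `begin`; return ((kind, value), next position)."""
    state, forward = 0, begin
    while True:
        c = text[forward] if forward < len(text) else ""   # "" stands for end of input
        if state == 0:                      # start of the relop diagram
            if c == "<":
                state, forward = 1, forward + 1
            else:
                state = 5                   # fail(0): try the id diagram
        elif state == 1:
            if c == "=":
                return ("relop", "LE"), forward + 1         # state 2
            if c == ">":
                return ("relop", "NE"), forward + 1         # state 3
            return ("relop", "LT"), forward                 # state 4: "other" not consumed
        elif state == 5:                    # start of the id diagram
            if c.isalpha():
                state, forward = 6, forward + 1
            else:
                raise ValueError(f"unexpected character {c!r} at position {forward}")
        elif state == 6:
            if c.isalnum():
                forward += 1                # stay in state 6
            else:
                return ("id", text[begin:forward]), forward  # state 7


def tokenize(text):
    """Skip white space and collect tokens until the end of the text."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        token, pos = next_token(text, pos)
        tokens.append(token)
    return tokens


print(tokenize("a1 <= b <> c < d"))
```

Each returned position doubles as the new lexeme beginning, which plays the role of `beginning = forward` in `retract`.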

  39. Outline • Introduction ✓ • Token specification ✓ • Language • Regular Expressions (REs) • Token recognition • REs → NFA (Thompson’s construction) ✓ • NFA → DFA (subset construction) ✓ • DFA → minimal DFA (Algorithm 3.6) ✓ • Programming ✓
