Lecture 2 Lexical Analysis

Lecture 2 Lexical Analysis. CSCE 531 Compiler Construction. Topics Sample Simple Compiler Operations on strings Regular expressions Finite Automata Readings:. January 11, 2006. Overview. Last Time A little History Compilers vs Interpreter Data-Flow View of Compilers


Presentation Transcript


  1. Lecture 2 Lexical Analysis CSCE 531 Compiler Construction • Topics • Sample Simple Compiler • Operations on strings • Regular expressions • Finite Automata • Readings: January 11, 2006

  2. Overview • Last Time • A little History • Compilers vs Interpreter • Data-Flow View of Compilers • Regular Languages • Course Pragmatics • Today’s Lecture • Why Study Compilers? • xx • References • Chapter 2, Chapter 3 • Assignment Due Wednesday Jan 18 • 3.3a; 3.5a,b; 3.6a,b,c; 3.7a; 3.8b

  3. A Simple Compiler for Expressions • Chapter Two Overview • Structure of the simple compiler, really just a translator for infix expressions → postfix • Grammars • Parse Trees • Syntax directed Translation • Predictive Parsing • Translator for Simple Expressions • Grammar • Rewritten grammar (equivalent one better for predictive parsing) • Parsing modules fig 2.24 • Specification of Translator fig 2.35 • Structure of translator fig 2.36

  4. Grammars • Grammar (or a context free grammar more correctly) has • A set of tokens also known as terminals • A set of nonterminals • A set of productions of the form nonterminal → sequence of tokens and/or nonterminals • A special nonterminal, the start symbol. • Example • E → E + E • E → E * E • E → digit

  5. Derivations • A derivation is a sequence of rewritings of a string of grammar symbols using the productions of a grammar. • We use the symbol ⇒ to denote that one string of grammar symbols is obtained by rewriting another using a production • X ⇒ Y if there is a production N → β where • The nonterminal N occurs in the sequence X of grammar symbols • And Y is the same as X except that β replaces one occurrence of N • Example (d stands for digit) • E ⇒ E+E ⇒ d+E ⇒ d+E*E ⇒ d+E+E*E ⇒ d+d+E*E ⇒ d+d+d*E ⇒ d+d+d*d

  6. Parse Trees • A graphical presentation of a derivation, satisfying • Root is the start symbol • Each leaf is a token or ε (note different font from text) • Each interior node is a nonterminal • If A is a parent with children X1, X2, …, Xn then A → X1 X2 … Xn is a production

  7. Syntax directed Translation • Frequently the rewriting by a production will be called a reduction, or reducing by the particular production. • Syntax directed translation attaches actions (code) that are executed when the reductions are performed • Example • E → E + T {print(‘+’);} • E → E - T {print(‘-’);} • E → T • T → 0 {print(‘0’);} • T → 1 {print(‘1’);} • … • T → 9 {print(‘9’);}

  8. Equivalent Grammars

  9. Specification of the translator (figure 2.38) • S → L eof • L → E ; L • L → ε • E → T E’ • E’ → + T { print(‘+’); } E’ • E’ → - T { print(‘-’); } E’ • E’ → ε • T → F T’ • T’ → * F { print(‘*’); } T’ • T’ → / F { print(‘/’); } T’ • T’ → ε • F → ( E ) • F → id { print(id.lexeme); } • F → num { print(num.value); }

  10. Translating to code • E → T E’ • E’ → + T { print(‘+’); } E’ • E’ → - T { print(‘-’); } E’ • E’ → ε • Expr() { int t; term(); while(1) switch(lookahead){ case '+': case '-': t = lookahead; match(lookahead); term(); emit(t, NONE); continue; …
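
The translation scheme on slides 9 and 10 can be sketched as a small recursive-descent translator. This is a minimal illustration, not the course code from figure 2.36: the `Translator` struct and the names `emit`, `factor`, `term`, `expr` and `to_postfix` are assumptions made for the example, identifiers (`id`) are left out, and the tail-recursive E’ and T’ productions are written as loops, as the `while(1)` on slide 10 suggests.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical parser state: input cursor plus an output buffer. */
typedef struct {
    const char *in;     /* next unread input character */
    char out[256];      /* postfix output, tokens separated by spaces */
    size_t len;         /* number of characters written to out */
    int error;          /* set to 1 on a syntax error or overflow */
} Translator;

static void expr(Translator *t);

static void skip_blanks(Translator *t) {
    while (*t->in == ' ') t->in++;
}

/* Append one token followed by a space, guarding against overflow. */
static void emit(Translator *t, const char *tok, size_t n) {
    if (t->len + n + 2 > sizeof t->out) { t->error = 1; return; }
    memcpy(t->out + t->len, tok, n);
    t->len += n;
    t->out[t->len++] = ' ';
    t->out[t->len] = '\0';
}

/* F -> ( E ) | num { print(num.value) } */
static void factor(Translator *t) {
    skip_blanks(t);
    if (*t->in == '(') {
        t->in++;
        expr(t);
        skip_blanks(t);
        if (*t->in == ')') t->in++; else t->error = 1;
    } else if (isdigit((unsigned char)*t->in)) {
        const char *start = t->in;
        while (isdigit((unsigned char)*t->in)) t->in++;
        emit(t, start, (size_t)(t->in - start));
    } else {
        t->error = 1;
    }
}

/* T -> F T' ; T' -> * F { print('*') } T' | / F { print('/') } T' | epsilon */
static void term(Translator *t) {
    factor(t);
    for (;;) {
        skip_blanks(t);
        char op = *t->in;
        if (t->error || (op != '*' && op != '/')) return;
        t->in++;
        factor(t);
        emit(t, &op, 1);    /* the action runs after the right operand */
    }
}

/* E -> T E' ; E' -> + T { print('+') } E' | - T { print('-') } E' | epsilon */
static void expr(Translator *t) {
    term(t);
    for (;;) {
        skip_blanks(t);
        char op = *t->in;
        if (t->error || (op != '+' && op != '-')) return;
        t->in++;
        term(t);
        emit(t, &op, 1);
    }
}

/* Translate infix to postfix; returns 0 on success, -1 on error. */
int to_postfix(const char *infix, char *dst, size_t dstsize) {
    assert(infix != NULL && dst != NULL);
    Translator t = { .in = infix, .out = "", .len = 0, .error = 0 };
    expr(&t);
    skip_blanks(&t);
    if (t.error || *t.in != '\0' || t.len + 1 > dstsize) return -1;
    if (t.len > 0) t.out[t.len - 1] = '\0';   /* drop the trailing space */
    strcpy(dst, t.out);
    return 0;
}
```

Calling `to_postfix("9-5+2", buf, sizeof buf)` fills `buf` with `9 5 - 2 +`. Emitting the operator only after the second operand has been parsed is what turns infix into postfix.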

  11. Overview of the Code Figure 2.36 • /class/csce531-001

  12. Operations on Strings • A language over an alphabet is a set of strings of characters from the alphabet. • Operations on strings: • let x=x1x2…xn and t=t1t2…tm then • Concatenation: xt =x1x2…xnt1t2…tm • Alternation: x|t = either x1x2…xn or t1t2…tm

  13. Operations on Sets of Strings • Operations on sets of strings: • For these let S = {s1, s2, …, sm} and R = {r1, r2, …, rn} • Alternation: S | R = S ∪ R = {s1, s2, …, sm, r1, r2, …, rn} • Concatenation: • SR = {sr | s ∈ S and r ∈ R} • = {s1r1, s1r2, …, s1rn, s2r1, …, s2rn, …, smr1, …, smrn} • Power: $S^2 = SS$, $S^3 = S^2 S$, $S^n = S^{n-1} S$ • What is $S^0$? • Kleene Closure: $S^* = \bigcup_{i=0}^{\infty} S^i$; note $S^0 = \{\varepsilon\}$ is contained in $S^*$

  14. Operations cont. Kleene Closure • Powers: • $S^2 = SS$ • $S^3 = S^2 S$ • … • $S^n = S^{n-1} S$ • What is $S^0$? • Kleene Closure: $S^* = \bigcup_{i=0}^{\infty} S^i$; note $S^0 = \{\varepsilon\}$ is contained in $S^*$

  15. Examples of Operations on Sets of Strings • Operations on sets of strings: • For these let S = {a,b,c} and R = {t,u} • Alternation: S | R = S ∪ R = {a,b,c,t,u} • Concatenation: • SR = {sr | s ∈ S and r ∈ R} • = {at, au, bt, bu, ct, cu} • Power: $S^2$ = {aa, ab, ac, ba, bb, bc, ca, cb, cc} • $S^3$ = {aaa, aab, aac, …, ccc}, 27 elements • Kleene closure: $S^*$ = {any string of any length of a’s, b’s and c’s, including the empty string ε}
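
The examples above can be checked mechanically. The following is a rough sketch, assuming small fixed-capacity sets; the type `StringSet`, the limits `MAX_STRINGS` and `MAX_LEN`, and the function names are invented for this illustration. The output set must not be the same object as either input set.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_STRINGS 128   /* assumed capacity, enough for the slide's examples */
#define MAX_LEN 16        /* assumed maximum string length including '\0' */

/* A finite set of short strings, stored in insertion order. */
typedef struct {
    char items[MAX_STRINGS][MAX_LEN];
    int count;
} StringSet;

/* Return 1 if str is already a member of set. */
static int set_contains(const StringSet *set, const char *str) {
    for (int i = 0; i < set->count; i++)
        if (strcmp(set->items[i], str) == 0) return 1;
    return 0;
}

/* Add str unless it is already present, so the result stays a set. */
void set_add(StringSet *set, const char *str) {
    assert(strlen(str) < MAX_LEN);
    if (set_contains(set, str)) return;
    assert(set->count < MAX_STRINGS);
    strcpy(set->items[set->count++], str);
}

/* Alternation: out = a | b = a U b. */
void set_union(const StringSet *a, const StringSet *b, StringSet *out) {
    out->count = 0;
    for (int i = 0; i < a->count; i++) set_add(out, a->items[i]);
    for (int j = 0; j < b->count; j++) set_add(out, b->items[j]);
}

/* Concatenation: out = { st | s in a and t in b }. */
void set_concat(const StringSet *a, const StringSet *b, StringSet *out) {
    char buf[2 * MAX_LEN];
    out->count = 0;
    for (int i = 0; i < a->count; i++)
        for (int j = 0; j < b->count; j++) {
            snprintf(buf, sizeof buf, "%s%s", a->items[i], b->items[j]);
            set_add(out, buf);
        }
}

/* Power: out = a^n, starting from a^0 = { "" } (only the empty string). */
void set_power(const StringSet *a, int n, StringSet *out) {
    StringSet tmp;
    out->count = 0;
    set_add(out, "");
    for (int k = 0; k < n; k++) {
        set_concat(out, a, &tmp);   /* a^(k+1) = a^k a */
        *out = tmp;
    }
}
```

With S = {a, b, c} and R = {t, u}, `set_concat` yields the six strings at, au, bt, bu, ct, cu, and `set_power(&S, 3, &out)` yields 27 strings. The Kleene closure itself is infinite, so a program can only enumerate it up to some chosen power.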

  16. Examples of Operations on Sets of Strings

  17. Regular Expressions • For a given alphabet Σ the following are regular expressions: • If a ∈ Σ then a is a regular expression and L(a) = { a } • ε is a regular expression and L(ε) = { ε } • ∅ is a regular expression and L(∅) = ∅, the empty language • And if s and t are regular expressions denoting languages L(s) and L(t) respectively then • st is a regular expression and L(st) = L(s) L(t) • s | t is a regular expression and L(s | t) = L(s) ∪ L(t) • s* is a regular expression and L(s*) = L(s)*

  18. Why Regular Expressions? • We use regular expressions to describe the tokens • Examples: • Reg expr for C identifiers • C identifiers? Any string of letters, underscores and digits that starts with a letter or underscore • ID reg expr = (letter | underscore) (letter | underscore | digit)* • Or more explicitly • ID reg expr = (a|b|…|z|A|B|…|Z|_)(a|b|…|z|A|B|…|Z|_|0|1|…|9)*
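
As a rough illustration, the identifier expression can be checked by hand-written code that mirrors its two parts. The function names here are made up for the example, and the sketch assumes the default "C" locale, where `isalpha` and `isdigit` match only the ASCII letters and digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* letter | underscore */
static int is_id_start(int c) {
    return isalpha((unsigned char)c) || c == '_';
}

/* letter | underscore | digit */
static int is_id_part(int c) {
    return is_id_start(c) || isdigit((unsigned char)c);
}

/* Return 1 if s matches (letter | underscore)(letter | underscore | digit)*. */
int is_identifier(const char *s) {
    assert(s != NULL);
    if (!is_id_start(s[0])) return 0;        /* first symbol; rejects "" too */
    for (size_t i = 1; s[i] != '\0'; i++)    /* zero or more further symbols */
        if (!is_id_part(s[i])) return 0;
    return 1;
}
```

Note that `is_identifier("for")` returns 1: the expression describes the shape of identifiers only, so keywords have to be separated out by other means.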

  19. Pop Quiz • Given that r and s are regular expressions: • What is rε? r | ε? • Describe the language denoted by 0*110* • Describe the language denoted by (0|1)*110* • Give a regular expression for the language of strings of 0’s and 1’s that end in a 1 • Give a regular expression for the language of strings of 0’s and 1’s in which every 0 is followed by a 1

  20. Recognizers of Regular Languages • To develop efficient lexical analyzers (scanners) we will rely on a mathematical model called finite automata, similar to the state machines that you have probably seen. In particular we will use deterministic finite automata, DFAs. • The construction of a lexical analyzer will then proceed as: • 1. Identify all tokens • 2. Develop regular expressions for each • 3. Convert the regular expressions to finite automata • 4. Use the transition table for the finite automata as the basis for the scanner • We will actually use the tools lex and/or flex for steps 3 and 4.

  21. Transition Diagram for a DFA • Start in state s0; then if the input is “f” make a transition to state s1. • Then from state s1 if the input is “o” make a transition to state s2. • And from state s2 if the input is “r” make a transition to state s3. • The double circle denotes an “accepting state”, which means we recognized the token. • Actually there is a missing state and transition • [Diagram: s0 →(f) s1 →(o) s2 →(r) s3]

  22. Now what about “fort” • The string “fort” is an identifier, not the keyword “for” followed by “t.” • Thus we can’t really recognize the token until we see a terminator: whitespace or a special symbol (one of , ; ( ) { } [ ])
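
One common way to handle the “fort” problem is sketched below: keep consuming identifier characters until a terminator is reached, and only then decide whether the word is the keyword. This is an illustration under assumptions rather than the scanner built in the course; the `TokenKind` values and the function `scan_word` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum { TOK_FOR, TOK_ID, TOK_NONE } TokenKind;   /* hypothetical */

/* Classify the word at the start of s; *len receives its length. */
TokenKind scan_word(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    size_t n = 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        n = 1;
        /* Stop only at a terminator: anything that cannot continue a word. */
        while (isalnum((unsigned char)s[n]) || s[n] == '_') n++;
    }
    *len = n;
    if (n == 0) return TOK_NONE;                         /* no word here */
    if (n == 3 && strncmp(s, "for", 3) == 0) return TOK_FOR;
    return TOK_ID;
}
```

On the input `fort = 1` the scan stops at the space, sees a four-character word, and reports an identifier; on `for(` it stops at the parenthesis and reports the keyword.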

  23. Deterministic Finite Automata • A deterministic finite automaton (DFA) is a mathematical model that consists of • 1. a set of states S • 2. a set of input symbols Σ, the input alphabet • 3. a transition function δ: S × Σ → S that for each state and each input maps to the next state • 4. a state s0 that is distinguished as the start state • 5. a set of states SF distinguished as accepting (or final) states

  24. DFA to recognize keyword “for” • Σ = {a, b, c, …, z, A, B, …, Z, 0, …, 9, ‘,’, ‘;’, …} • S = {s0, s1, s2, s3, sdead} • s0 is the start state • SF = {s3} • δ given by the table below

  25. Language Accepted by a DFA • A string $x_0 x_1 \ldots x_n$ is accepted by a DFA $M = (\Sigma, S, s_0, \delta, S_F)$ if $s_{i+1} = \delta(s_i, x_i)$ for $i = 0, 1, \ldots, n$ and $s_{n+1} \in S_F$ • i.e. if $x_0 x_1 \ldots x_n$ determines a path through the state diagram for the DFA that ends in an accepting state. • Then the language accepted by the DFA M, denoted L(M), is the set of all strings accepted by M.
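
The acceptance condition can be sketched as a table-driven loop, here for the keyword “for” DFA of slide 24. The transition table from that slide did not survive in this transcript, so the entries below are an assumption reconstructed from the diagram on slide 21: the three labelled moves, with every other move going to sdead. The names `delta_table`, `build_table` and `dfa_accepts` are likewise invented for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum { S0, S1, S2, S3, SDEAD, NUM_STATES };

/* Transition table: delta_table[state][input character] = next state. */
static int delta_table[NUM_STATES][UCHAR_MAX + 1];
static int table_ready = 0;

static void build_table(void) {
    for (int s = 0; s < NUM_STATES; s++)
        for (int c = 0; c <= UCHAR_MAX; c++)
            delta_table[s][c] = SDEAD;   /* assumed default: move to sdead */
    delta_table[S0]['f'] = S1;
    delta_table[S1]['o'] = S2;
    delta_table[S2]['r'] = S3;
    table_ready = 1;
}

/* SF = { s3 } */
static int is_accepting(int state) {
    return state == S3;
}

/* Apply s_{i+1} = delta(s_i, x_i) to each character, then test the last state. */
int dfa_accepts(const char *x) {
    assert(x != NULL);
    if (!table_ready) build_table();
    int state = S0;
    for (size_t i = 0; x[i] != '\0'; i++)
        state = delta_table[state][(unsigned char)x[i]];
    return is_accepting(state);
}
```

Because sdead has no way out, `dfa_accepts("fort")` returns 0 even though the prefix “for” reached s3. Swapping in a different table and accepting set gives a recognizer for a different language without touching the loop.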

  26. What is the Language Accepted by…

  27. DFA1.c

      /*
       * Deterministic Finite Automaton Simulation
       *
       * One line of input is read and then processed character by character.
       * Thus '\n' (EOL) is treated as the end of input.
       * The major functions are:
       *   delta(s,c) - that implements the transition function, and
       *   accept(s)  - that tells whether state s is an accepting state or not.
       * The particular DFA recognizes strings of digits that end in 000.
       * The DFA has:
       *   S = {0, 1, 2, 3, DEAD_STATE}
       *   Transitions on 0: S0=>S1, S1=>S2, S2=>S3, S3=>S3
       *   Transitions on non-zero digits: S0=>S0, S1=>S0, S2=>S0, S3=>S0
       *   Transitions on non-digits: Si=> DEAD_STATE
       *
       */

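  28. DFA1.c (continued)

      #include <stdio.h>

      #define DEAD_STATE -1
      #define ACCEPT 1
      #define DO_NOT 0
      #define EOL '\n'

      /* Prototypes: delta() and accept() are defined after main(). */
      int delta(int s, int c);
      int accept(int state);

      int main(void){
          int c;
          int state;

          state = 0;
          /* Stop at end of line, at end of file, or once the DFA is dead. */
          while((c = getchar()) != EOL && c != EOF && state != DEAD_STATE){
              state = delta(state, c);
          }
          if(accept(state)){
              printf("Accept!\n");
          }else{
              printf("Do not accept!\n");
          }
          return 0;
      }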

  29. DFA1.c (continued)

      /* DFA Transition function delta */
      /* delta(s,c) = transition from state s on input c */
      int delta(int s, int c){
          switch (s){
          case 0:
              if (c == '0') return 1;
              else if((c > '0') && (c <= '9')) return 0;
              else return(DEAD_STATE);
          case 1:
              if (c == '0') return 2;
              else if((c > '0') && (c <= '9')) return 0;
              else return(DEAD_STATE);
          case 2:
              if (c == '0') return 3;
              else if((c > '0') && (c <= '9')) return 0;
              else return(DEAD_STATE);
          case 3:
              if (c == '0') return 3;
              else if((c > '0') && (c <= '9')) return 0;
              else return(DEAD_STATE);
          case DEAD_STATE:
              return DEAD_STATE;
          default:
              printf("Bad State\n");
              return(DEAD_STATE);
          }
      }

  30. DFA1.c (continued)

      /* accept(state) = ACCEPT if state is the accepting state 3 */
      int accept(int state){
          if (state == 3) return ACCEPT;
          else return DO_NOT;
      }

  31. Non-Deterministic Finite Automata • What does deterministic mean? • In a Non-Deterministic Finite Automaton (NFA) we relax the restriction that the transition function δ maps every state and every element of the alphabet to a unique state, i.e. δ: S × Σ → S • An NFA can: • Have multiple transitions from a state for the same input • Have ε-transitions, where a transition from one state to another can be accomplished without consuming an input character • Not have transitions defined for every state and every input • Note for NFAs δ: S × Σ → $2^S$, where $2^S$ is the power set of S

  32. Language Accepted by an NFA • A string $x_0 x_1 \ldots x_n$ is accepted by an NFA $M = (\Sigma, S, s_0, \delta, S_F)$ if there are states $s_1, \ldots, s_{n+1}$ with $s_{i+1} \in \delta(s_i, x_i)$ for $i = 0, 1, \ldots, n$ and $s_{n+1} \in S_F$ • i.e. if $x_0 x_1 \ldots x_n$ can determine a path through the state diagram for the NFA that ends in an accepting state, taking ε-transitions wherever necessary. • Then the language accepted by the NFA M, denoted L(M), is the set of all strings accepted by M.
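
NFA acceptance can be simulated by tracking the set of all states the machine could be in, which matches δ returning an element of the power set. The sketch below is a minimal example under assumptions: the four-state NFA for (a|b)*ab, its single ε-move from state 0 to state 1, and all names are invented for illustration and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_STATES 4
typedef unsigned StateSet;              /* bit i set <=> state i is in the set */
#define BIT(i) (1u << (i))

/* Hypothetical example NFA for (a|b)*ab, with one epsilon move 0 -> 1. */
static const StateSet on_a[NUM_STATES]   = { 0,      BIT(1) | BIT(2), 0,      0 };
static const StateSet on_b[NUM_STATES]   = { 0,      BIT(1),          BIT(3), 0 };
static const StateSet on_eps[NUM_STATES] = { BIT(1), 0,               0,      0 };
static const StateSet accepting = BIT(3);

/* Add every state reachable through epsilon moves alone. */
static StateSet eps_closure(StateSet set) {
    StateSet before;
    do {
        before = set;
        for (int s = 0; s < NUM_STATES; s++)
            if (set & BIT(s)) set |= on_eps[s];
    } while (set != before);
    return set;
}

/* One input step: union of delta(s, c) over all current states s. */
static StateSet step(StateSet set, char c) {
    StateSet next = 0;
    for (int s = 0; s < NUM_STATES; s++) {
        if (!(set & BIT(s))) continue;
        if (c == 'a') next |= on_a[s];
        else if (c == 'b') next |= on_b[s];
        /* any other character has no transition, so it adds nothing */
    }
    return eps_closure(next);
}

/* Accept if some path ends in an accepting state after the whole input. */
int nfa_accepts(const char *x) {
    assert(x != NULL);
    StateSet current = eps_closure(BIT(0));
    for (size_t i = 0; x[i] != '\0' && current != 0; i++)
        current = step(current, x[i]);
    return (current & accepting) != 0;
}
```

An empty state set plays the role of the dead state: once no path survives, the string is rejected. Tracking sets of states this way is also the idea behind converting an NFA into an equivalent DFA.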

  33. Language Accepted by an NFA

  34. Thompson Construction • For any regular expression R construct an NFA, M, that accepts the language denoted by R, i.e., L(M) = L(R).
