CS375. Compilers
Lexical Analysis
4th February, 2010

Outline
Overview of a compiler
What is lexical analysis?
Writing a Lexer
Specifying tokens: regular expressions
Converting regular expressions to NFA, DFA
Optimizations

How It Works
Program representation
Source code
Lexical Analysis
Syntax Analysis
Semantic Analysis
We want some way to describe tokens, and have our Lexer take that description as input and decide if a string is a token or not.
Token readIdentifier() {
    String id = "";
    while (true) {
        char c = input.read();
        // Note: the first non-identifier character is consumed here and lost.
        if (!identifierChar(c))
            return new Token(ID, id, lineNumber);
        id = id + c;
    }
}
class Lexer {
    InputStream s;
    char next;   // one character of lookahead
    Lexer(InputStream _s) { s = _s; next = s.read(); }
    Token nextToken() {
        if (identifierFirstChar(next))   // starts with a char
            return readIdentifier();     // is an identifier
        if (numericFirstChar(next))      // starts with a num
            return readNumber();         // is a number
        if (next == '"') return readStringConst();
        …
    }
}
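The lookahead idea above can be sketched as a small runnable lexer. This is a minimal illustration, not the course's implementation: the class name SketchLexer, the token kinds ID, NUM and OTHER, and the choice to read from a String instead of an InputStream are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and token kinds are illustrative only.
class SketchLexer {
    private final String src;
    private int pos = 0;                 // index of the lookahead character

    SketchLexer(String src) { this.src = src; }

    // The lookahead character, or '\0' once the input is exhausted.
    private char next() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    // Returns "KIND:text" for the next token, or null at end of input.
    String nextToken() {
        while (Character.isWhitespace(next())) pos++;        // skip blanks
        if (pos >= src.length()) return null;
        int start = pos;
        if (Character.isLetter(next()) || next() == '_') {   // identifier
            while (Character.isLetterOrDigit(next()) || next() == '_') pos++;
            return "ID:" + src.substring(start, pos);
        }
        if (Character.isDigit(next())) {                     // number
            while (Character.isDigit(next())) pos++;
            return "NUM:" + src.substring(start, pos);
        }
        pos++;                                               // any other single character
        return "OTHER:" + src.substring(start, pos);
    }

    static List<String> tokenize(String s) {
        SketchLexer lx = new SketchLexer(s);
        List<String> out = new ArrayList<>();
        for (String t = lx.nextToken(); t != null; t = lx.nextToken()) out.add(t);
        return out;
    }
}
```

Because the lookahead character is only inspected and never consumed until it is known to belong to the current token, the character that ends an identifier is still available to start the next token.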
2.e0 20.e01 2.0000
"" "x" "\\" "\"\'"
if (x == 0) a = x<<1;
if (x == 0) a = x<1;
[Figure: a regular expression R describing tokens goes through RE to NFA conversion, then NFA to DFA conversion; DFA simulation on an input string w answers "yes" if w is a valid token and "no" if not]
R*       Kleene closure: zero or more strings from L(R)
R+       one or more strings from L(R): R(R*)
R?       optional R: (R|ε)
[abce]   one of the listed characters: (a|b|c|e)
[a-z]    one character from this range: (a|b|c|d|e|…|y|z)
[^ab]    anything but one of the listed chars
[^a-z]   one character not from this range
"abc"    the string "abc"
\(       the character '('
. . .
id=R     named non-recursive regular expressions
Regular Expression R                  Strings in L(R)
digit = [0-9]                         "0" "1" "2" "3" …
posint = digit+                       "8" "412" …
int = -? posint                       "-42" "1024" …
real = int ((. posint)?)              "-1.56" "12" "1.0"
     = (-|ε)([0-9]+)((. [0-9]+)|ε)
[a-zA-Z_][a-zA-Z0-9_]*                C identifiers
else                                  the keyword "else"
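The token definitions in this table can be tried out with java.util.regex. This is only a sketch: the class name TokenPatterns and its helper method are hypothetical, and Java's regex engine is a backtracking matcher, not the automaton-based matcher developed in this lecture. Named sub-expressions are spliced together as strings to mirror the "id = R" shorthand.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the patterns follow the table above.
class TokenPatterns {
    static final String DIGIT  = "[0-9]";
    static final String POSINT = DIGIT + "+";
    static final String INT    = "-?" + POSINT;
    static final String REAL   = INT + "(\\." + POSINT + ")?";
    static final String IDENT  = "[a-zA-Z_][a-zA-Z0-9_]*";

    // True if the whole string s is in the language of the regular expression.
    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }
}
```

For instance, TokenPatterns.matches(TokenPatterns.REAL, "-1.56") holds, while "1." is rejected because the fractional part requires at least one digit. The string "else" matches IDENT as well as the keyword pattern, so a lexer needs a rule for choosing between them.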
transition_table[NumSTATES][NumCHARS]
accept_states[NumSTATES]
state = INITIAL
while (state != Error) {
    c = input.read();
    if (c == EOF) break;
    state = transition_table[state][c];
}
return (state != Error) && accept_states[state];
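The table-driven loop above can be made concrete for one small language. The sketch below hard-codes a DFA for the pattern -?[0-9]+; the class name TableDfa, the state names, and the restriction to ASCII input are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical table-driven DFA for the pattern -?[0-9]+ (ASCII input assumed).
class TableDfa {
    static final int ERROR = -1, START = 0, SIGN = 1, DIGITS = 2;
    static final int NUM_STATES = 3, NUM_CHARS = 128;
    static final int[][] TRANS = new int[NUM_STATES][NUM_CHARS];
    static final boolean[] ACCEPT = { false, false, true };   // only DIGITS accepts

    static {
        for (int[] row : TRANS) Arrays.fill(row, ERROR);      // default: no transition
        TRANS[START]['-'] = SIGN;
        for (char d = '0'; d <= '9'; d++) {
            TRANS[START][d] = DIGITS;
            TRANS[SIGN][d] = DIGITS;
            TRANS[DIGITS][d] = DIGITS;
        }
    }

    // Runs the DFA over w; stops early once the error state is reached.
    static boolean accepts(String w) {
        int state = START;
        for (int i = 0; i < w.length() && state != ERROR; i++) {
            char c = w.charAt(i);
            state = (c < NUM_CHARS) ? TRANS[state][c] : ERROR;
        }
        return state != ERROR && ACCEPT[state];
    }
}
```

Each input character costs one table lookup, so matching takes time linear in the length of the input regardless of how complicated the regular expression was.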
[Figure: NFA constructions that combine the R automaton and the S automaton]
Example:
“strings accepted are those for which there is some corresponding path from start state to an accept state”

T = S
Repeat T = T ∪ { s' | s ∈ T, (s, s') is an ε-transition }
Until T remains unchanged
ε-closure(S) = T
DFAedge(S, c) = ε-closure( { s' | s ∈ S, (s, s') is a c-transition } )
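The two definitions above translate fairly directly into code. The sketch below is one possible encoding, not the one used in the course: the class NfaSketch, its map-based NFA representation, and the method names are hypothetical. It uses a worklist instead of the repeat-until-unchanged loop, which reaches the same fixed point.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical NFA encoding: states are ints, transitions live in maps.
class NfaSketch {
    // eps.get(s) holds the ε-successors of s; step.get(s).get(c) the c-successors.
    final Map<Integer, Set<Integer>> eps = new HashMap<>();
    final Map<Integer, Map<Character, Set<Integer>>> step = new HashMap<>();

    void addEps(int from, int to) {
        eps.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
    }

    void addEdge(int from, char c, int to) {
        step.computeIfAbsent(from, k -> new HashMap<>())
            .computeIfAbsent(c, k -> new TreeSet<>()).add(to);
    }

    // ε-closure(S): every state reachable from S using only ε-transitions.
    Set<Integer> closure(Set<Integer> s) {
        Set<Integer> t = new TreeSet<>(s);
        Deque<Integer> work = new ArrayDeque<>(s);
        while (!work.isEmpty()) {
            int u = work.pop();
            for (int v : eps.getOrDefault(u, Set.of())) {
                if (t.add(v)) work.push(v);       // newly discovered state
            }
        }
        return t;
    }

    // DFAedge(S, c): follow c-transitions out of S, then take the ε-closure.
    Set<Integer> dfaEdge(Set<Integer> s, char c) {
        Set<Integer> moved = new TreeSet<>();
        for (int u : s) {
            moved.addAll(step.getOrDefault(u, Map.of()).getOrDefault(c, Set.of()));
        }
        return closure(moved);
    }
}
```

For an NFA for (a|b) with ε-edges 0→1, 0→3, 2→5, 4→5 and labeled edges 1 -a→ 2 and 3 -b→ 4, closure({0}) is {0, 1, 3} and dfaEdge({0, 1, 3}, 'a') is {2, 5}.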
DFA-initial-state = ε-closure(NFA-initial-state)
Worklist = { DFA-initial-state }
While ( Worklist not empty )
    Pick and remove state S from Worklist
    For each character c
        S' = DFAedge(S, c)
        if (S' not in DFA states)
            Add S' to DFA states and worklist
        Add an edge (S, S') labeled c in DFA
For each DFA-state S
    If S contains an NFA-final state
        Mark S as DFA-final-state
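The worklist algorithm above can be sketched end to end as follows. The class SubsetConstruction, its edge-list NFA encoding (label -1 standing for ε), and the decision to leave out the empty set instead of adding it as an explicit dead state are all assumptions of this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical encoding: NFA edges are {from, label, to}; label -1 means ε.
class SubsetConstruction {
    final List<int[]> edges = new ArrayList<>();
    final int nfaStart;
    final Set<Integer> nfaFinal;

    // Result of build(): DFA state i is the NFA state set dfaStates.get(i).
    final List<Set<Integer>> dfaStates = new ArrayList<>();
    final Map<Integer, Map<Character, Integer>> dfaTrans = new HashMap<>();
    final Set<Integer> dfaFinal = new HashSet<>();

    SubsetConstruction(int start, Set<Integer> fin) { nfaStart = start; nfaFinal = fin; }

    void edge(int from, int label, int to) { edges.add(new int[] { from, label, to }); }

    Set<Integer> closure(Set<Integer> s) {
        Set<Integer> t = new TreeSet<>(s);
        boolean changed = true;
        while (changed) {                          // repeat until T remains unchanged
            changed = false;
            for (int[] e : edges)
                if (e[1] == -1 && t.contains(e[0]) && t.add(e[2])) changed = true;
        }
        return t;
    }

    Set<Integer> dfaEdge(Set<Integer> s, char c) {
        Set<Integer> moved = new TreeSet<>();
        for (int[] e : edges)
            if (e[1] == c && s.contains(e[0])) moved.add(e[2]);
        return closure(moved);
    }

    void build(String alphabet) {
        dfaStates.add(closure(Set.of(nfaStart)));  // DFA state 0 is the initial state
        Deque<Integer> worklist = new ArrayDeque<>(List.of(0));
        while (!worklist.isEmpty()) {
            int i = worklist.pop();
            for (char c : alphabet.toCharArray()) {
                Set<Integer> target = dfaEdge(dfaStates.get(i), c);
                if (target.isEmpty()) continue;    // empty set = error state, left implicit
                int j = dfaStates.indexOf(target);
                if (j < 0) { j = dfaStates.size(); dfaStates.add(target); worklist.push(j); }
                dfaTrans.computeIfAbsent(i, k -> new HashMap<>()).put(c, j);
            }
        }
        for (int i = 0; i < dfaStates.size(); i++)
            if (!Collections.disjoint(dfaStates.get(i), nfaFinal)) dfaFinal.add(i);
    }

    boolean accepts(String w) {
        int state = 0;
        for (char c : w.toCharArray()) {
            Integer nextState = dfaTrans.getOrDefault(state, Map.of()).get(c);
            if (nextState == null) return false;   // missing edge = error state
            state = nextState;
        }
        return dfaFinal.contains(state);
    }
}
```

As a usage example, an NFA for a(b|c)* with edges 0 -a→ 1, 1 -ε→ 2, 2 -b→ 1, 2 -c→ 1 and final state 1 yields a two-state DFA over the alphabet "abc": {0} and {1, 2}, with the second one final.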
[Figure: a regular expression R goes through RE to NFA conversion, then NFA to DFA conversion; DFA simulation on an input string w answers "yes" if w ∈ L(R) and "no" if w ∉ L(R)]
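[Figure: the same pipeline with a "Minimize DFA" step added: a regular expression R goes through RE to NFA, NFA to DFA, and Minimize DFA; DFA simulation on an input string w answers "yes" if w ∈ L(R) and "no" if w ∉ L(R)]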
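[Figure: lexer generation pipeline. The REs for all valid tokens R1…Rn (keywords, whitespace, identifier, number) go through RE to NFA, NFA to DFA, and Minimize DFA, producing NFAs and then a minimized DFA. DFA simulation turns the character stream of the program into a token stream (and errors)]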
%%
digits = 0|[1-9][0-9]*
letter = [A-Za-z]
identifier = {letter}({letter}|[0-9_])*
whitespace = [\ \t\n\r]+
%%
{whitespace} { /* discard */ }
{digits} { return new Token(INT, Integer.parseInt(yytext())); }
"if" { return new Token(IF, yytext()); }
"while" { return new Token(WHILE, yytext()); }
…
{identifier} { return new Token(ID, yytext()); }
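In a specification like this one, "if" matches both the keyword rule and the identifier rule. Generated lexers conventionally resolve such conflicts by taking the longest match and, on a tie, the rule listed first. The sketch below imitates that policy with java.util.regex; the class RuleLexer and its "KIND:text" output are hypothetical, and because java.util.regex is a backtracking engine, the per-rule match is only guaranteed to be the longest one for simple patterns like these.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical emulation of "longest match, then earliest rule".
class RuleLexer {
    // Rules in priority order, mirroring the specification above.
    static final String[][] RULES = {
        { "WS",    "[ \\t\\n\\r]+" },
        { "INT",   "0|[1-9][0-9]*" },
        { "IF",    "if" },
        { "WHILE", "while" },
        { "ID",    "[A-Za-z]([A-Za-z]|[0-9_])*" },
    };

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestKind = null;
            int bestLen = 0;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[1]).matcher(input);
                m.region(pos, input.length());
                // Strict '>' keeps the earlier rule when two matches tie in length.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestKind = rule[0];
                }
            }
            if (bestKind == null)
                throw new IllegalArgumentException("no rule matches at position " + pos);
            if (!bestKind.equals("WS"))            // whitespace is discarded
                out.add(bestKind + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return out;
    }
}
```

With these rules, "if" becomes an IF token because the keyword rule comes before the identifier rule, while "ifx" becomes an ID token because the identifier rule matches a longer string.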
case 5: { return new Token(WHILE, yytext()); }
case 6: break;
case 2: { return new Token(ID, yytext()); }
case 7: break;
case 4: { return new Token(IF, yytext()); }
case 8: break;
case 1: { return new Token(INT, Integer.parseInt(yytext())); }
A mechanism that specifies the state in which the DFA starts executing
Declare states in the second section:
    %state STATE
Use states as prefixes of regular expressions in the third section:
    <STATE> regex {action}
Set the current state in the actions:
    yybegin(STATE)
There is a predefined initial state: YYINITIAL
%%
%state STRING
%%
<YYINITIAL> "if" { return new Token(IF, null); }
<YYINITIAL> "\"" { yybegin(STRING); … }
<STRING> "\"" { yybegin(YYINITIAL); … }
<STRING> . { … }
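[Figure: state diagram with states INITIAL and STRING. A " moves from INITIAL to STRING and another " moves back; "if" loops on INITIAL and . loops on STRING]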
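The effect of the two lexical states can be imitated by hand. In the sketch below, the class StateLexer and its output format are hypothetical, and the bodies of the elided actions (the "…" in the specification) are filled in with guesses purely for illustration: the STRING state collects characters and emits them when the closing quote is seen.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written emulation of the YYINITIAL / STRING states.
class StateLexer {
    enum State { YYINITIAL, STRING }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        State state = State.YYINITIAL;                 // predefined initial state
        StringBuilder text = new StringBuilder();      // contents of the current string
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (state == State.YYINITIAL) {
                if (input.startsWith("if", pos)) {     // <YYINITIAL> "if"
                    out.add("IF");
                    pos += 2;
                } else if (c == '"') {                 // <YYINITIAL> "\"" : yybegin(STRING)
                    state = State.STRING;
                    text.setLength(0);
                    pos++;
                } else {
                    pos++;                             // other rules are not modeled; skip
                }
            } else {
                if (c == '"') {                        // <STRING> "\"" : yybegin(YYINITIAL)
                    state = State.YYINITIAL;
                    out.add("STR:" + text);
                } else {                               // <STRING> .  (also accepts newlines here)
                    text.append(c);
                }
                pos++;
            }
        }
        return out;
    }
}
```

The point of the states is visible in the input `if "if" if`: the middle "if" is read while the lexer is in the STRING state, so it ends up inside a string token and is not reported as a keyword.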
Acknowledgement
The slides are based on similar content by Tim Teitelbaum, Cornell.