CS375. Compilers Lexical Analysis 4 th February, 2010. Outline. Overview of a compiler. What is lexical analysis? Writing a Lexer Specifying tokens: regular expressions Converting regular expressions to NFA, DFA Optimizations. How It Works. Program representation. Source code
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Program representation
Source code
(character stream)
if (b == 0) a = b;
Lexical Analysis
Tokenstream
if
(
b
==
0
)
a
=
b
;
Syntax Analysis
(Parsing)
if
==
=
Abstract syntaxtree (AST)
b
0
a
b
Semantic Analysis
if
boolean
int
==
=
Decorated
AST
intb
int0
inta
lvalue
intb
Source code
(character stream)
if (b == 0) a = b;
Lexical Analysis
Token stream
if
(
b
==
0
)
a
=
b
;
Syntax Analysis
Semantic Analysis
Token
Description
?
Input
String
Token? {Yes/No}
w
We want some way to describe tokens, and have our Lexer take that description as input and decide if a string is a token or not.
Token readIdentifier( ) {
String id = “”;
while (true) {
char c = input.read();
if (!identifierChar(c))
return new Token(ID, id, lineNumber);
id = id + String(c);
}
}
char next;
…
while (identifierChar(next)) {
id = id + String(next);
next = input.read ();
}
e
l
s
e
n
next
(lookahead)
class Lexer {
InputStream s;
char next;
Lexer(InputStream _s) { s = _s; next = s.read(); }
Token nextToken( ) {
if (identifierFirstChar(next))//starts with a char
return readIdentifier(); //is an identifier
if (numericFirstChar(next)) //starts with a num
return readNumber(); //is a number
if (next == ‘\”’) return readStringConst();
…
}
}
2.e0 20.e01 2.0000
“” “x” “\\” “\”\’”
if (x == 0) a = x<<1;
if (x == 0) a = x<1;
16
17
18

09
0
3
1
2
09
19
{2,3}

{0,1}
{1}
09
09
09
20
21
Regular
Expression
describing
tokens
RE NFA
Conversion
R
NFA DFA
Conversion
Yes, if w is valid token
DFA
Simulation
Input
String
w
No, if not
22
Kleene closure
R+ one or more strings from L(R): R(R*)
R? optional R: (Rε)
[abce] one of the listed characters: (abce)
[az] one character from this range: (abcde…yz)
[^ab] anything but one of the listed chars
[^az] one character notfrom this range
”abc” the string “abc”
\( the character ’(’
. . .
id=R named nonrecursive regular expressions
Regular Expression RStrings in L(R)
digit = [09] “0” “1” “2” “3” …
posint = digit+ “8” “412” …
int = ? posint “42” “1024” …
real = int ((. posint)?) “1.56” “12” “1.0”
= (ε)([09]+)((. [09]+)ε)
[azAZ_][azAZ09_]* C identifiers
else the keyword “else”
R RE
(that describes a
token family)
Yes, if w is a token
?
No, if w not a token
Input string w
(from the program)
L
Description
of language
Finite Automaton
Yes, if w L
Acceptor
No, if w L
Input
String
w
b
a
a
2
0
1
a b
0 1 Error
1 2 1
2 ErrorError
transition_table[NumSTATES][NumCHARS]
accept_states[NumSTATES]
state = INITIAL
while (state != Error)
{c = input.read();if (c == EOF) break;state = trans_table[state][c];
}
return (state!=Error) && accept_states[state];
b
a
a
2
0
1
ε
a
a
?
?
?
R automaton
S automaton
R automaton
S automaton
b
a
e
a
a
b
e
Example:
“strings accepted are those for which there is some corresponding path from start state to an accept state”

09
0
3
1
2
09
{2,3}


{0,1}
{1}
09
0
3
1
2
09
09
09
09
T = S
Repeat T = T U {s’  sT, (s,s’) is εtransition}
Until T remains unchanged
εclosure(S) = T
DFAedge(S,c) = εclosure( { s’  sS, (s,s’) is ctransition} )
DFAinitialstate = εclosure(NFAinitialstate)
Worklist = { DFAinitialstate }
While ( Worklist not empty )
Pick state S from Worklist
For each character c
S’ = DFAedge(S,c)
if (S’ not in DFA states)
Add S’ to DFA states and worklist
Add an edge (S, S’) labeled c in DFA
For each DFAstate S
If S contains an NFAfinal state
Mark S as DFAfinalstate
Regular
Expression
RE NFA
Conversion
R
NFA DFA
Conversion
Yes, if w L(R)
DFA
Simulation
Input
String
w
No, if w L(R)
Gi
c
p
Gj
Gk
(or Error)
b
a
q
b
a
c
r
a
b
c
b
1
a
b
b
2
3
0
4
a
b
a
a
b
b
a
a
Gi
Gj
Gk
(or Error)
a
j ≠ k
p
q
a
Gi
p
Gj
Gk
(or Error)
a
j ≠ k
q
a
Gi’
Regular
Expression
RE NFA
R
NFA DFA
Minimize DFA
Yes, if w L(R)
DFA
Simulation
Input
String
w
No, if w L(R)
REs for all valid Tokens
RE NFA
NFA DFA
Minimize DFA
R1…Rn
Character
Stream
DFA
Simulation
Token stream
(and errors)
program
NFAs
Minimized DFA
keywords
whitespace
identifier
number
a
b
a
b
1
3
2
0
4
b
a
5
6
66
%%
digits = 0[19][09]*
letter = [AZaz]
identifier = {letter}({letter}[09_])*
whitespace = [\ \t\n\r]+
%%
{whitespace} {/* discard */}
{digits} { return new Token(INT, Integer.parseInt(yytext()); }
”if” { return new Token(IF, yytext()); }
”while” { return new Token(WHILE, yytext()); }
…
{identifier} { return new Token(ID, yytext()); }
68
case 5: { return new Token(WHILE, yytext()); }
case 6: break;
case 2: { return new Token(ID, yytext()); }
case 7: break;
case 4: { return new Token(IF, yytext());}
case 8: break;
case 1: { return new Token(INT, Integer.parseInt(yytext());}
69
Mechanism that specifies state in which to start the execution of the DFA
Declare states in the second section
%state STATE
Use states as prefixes of regular expressions in the third section:
<STATE> regex {action}
Set current state in the actions
yybegin(STATE)
There is a predefined initial state: YYINITIAL
70
%%
%state STRING
%%
<YYINITIAL> “if” { return new Token(IF, null); }
<YYINITIAL> “\”” { yybegin(STRING); … }
<STRING> “\”” { yybegin(YYINITIAL); … }
<STRING> . { … }
”
.
STRING
INITIAL
if
”
71
Acknowledgement
The slides are based on similar content by Tim Teitelbaum, Cornell.