Presentation Transcript
CS375

Compilers: Lexical Analysis

4th February, 2010

Outline
  • Overview of a compiler
  • What is lexical analysis?
  • Writing a lexer
    • Specifying tokens: regular expressions
    • Converting regular expressions to NFAs and DFAs
  • Optimizations
How It Works

[Figure: program representations through the compiler front end]

  Source code (character stream): if (b == 0) a = b;
    ↓ Lexical Analysis
  Token stream: if ( b == 0 ) a = b ;
    ↓ Syntax Analysis (Parsing)
  Abstract syntax tree (AST): if node with children == (b, 0) and = (a, b)
    ↓ Semantic Analysis
  Decorated AST: the == node has type boolean and int operands b and 0; the = node assigns int b to a, an int lvalue
What is a lexical analyzer?
  • What?
    • Reads in a stream of characters and groups them into “tokens” or “lexemes”.
    • The language definition describes which tokens are valid.
  • Why?
    • Makes writing the parser much easier; the parser operates on tokens.
    • Isolates input-dependent functionality such as character codes, EOF, and newline characters.
First Step: Lexical Analysis

[Figure: the front-end pipeline again]

  Source code (character stream): if (b == 0) a = b;
    ↓ Lexical Analysis
  Token stream: if ( b == 0 ) a = b ;
    ↓ Syntax Analysis → Semantic Analysis
What should it do?

[Figure: a token description and an input string w feed a box that answers “Token? Yes/No”]
We want some way to describe tokens, and have our Lexer take that description as input and decide if a string is a token or not.

Tokens
  • Logical grouping of characters.
  • Identifiers: x y11 elsen _i00
  • Keywords: if else while break
  • Constants:
    • Integer: 2 1000 -500 5L 0x777
    • Floating-point: 2.0 0.00020 .02 1. 1e5 0.e-10
    • String: ”x” ”He said, \”Are you?\”\n”
    • Character: ’c’ ’\000’
  • Symbols: + * { } ++ < << [ ] >=
  • Whitespace (typically recognized and discarded):
    • Comment: /** don’t change this **/
    • Space: <space>
    • Format characters: <newline> <return>
Ad-hoc Lexer
  • Hand-write code to generate tokens
  • How to read identifier tokens?

    Token readIdentifier() {
      String id = "";
      while (true) {
        char c = input.read();
        if (!identifierChar(c))
          return new Token(ID, id, lineNumber);
        id = id + String(c);
      }
    }

  • Problems
    • How to start?
    • What to do with following character?
    • How to avoid quadratic complexity of repeated concatenation?
    • How to recognize keywords?
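The last two problems have standard answers: accumulate characters in a StringBuilder (linear, not quadratic) and look finished identifiers up in a keyword table. A minimal sketch, assuming hypothetical names (AdHocSketch, Kind) not taken from the slides:

```java
import java.util.Map;

// Sketch of an identifier reader: StringBuilder avoids quadratic
// concatenation, and a keyword map turns an identifier into a keyword
// token after it has been read in full.
public class AdHocSketch {
    enum Kind { ID, IF, ELSE, WHILE }

    static final Map<String, Kind> KEYWORDS =
        Map.of("if", Kind.IF, "else", Kind.ELSE, "while", Kind.WHILE);

    static boolean identifierChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Reads an identifier starting at pos; returns its kind (keyword or ID).
    static Kind readIdentifier(String input, int pos) {
        StringBuilder id = new StringBuilder();   // O(n) accumulation
        while (pos < input.length() && identifierChar(input.charAt(pos))) {
            id.append(input.charAt(pos++));
        }
        return KEYWORDS.getOrDefault(id.toString(), Kind.ID);
    }

    public static void main(String[] args) {
        System.out.println(readIdentifier("if (b == 0)", 0));  // IF
        System.out.println(readIdentifier("ifx (b)", 0));      // ID
    }
}
```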
Look-ahead Character
  • Scan text one character at a time
  • Use a look-ahead character (next) to determine what kind of token to read and when the current token ends

    char next;
    while (identifierChar(next)) {
      id = id + String(next);
      next = input.read();
    }

[Figure: input e l s e n; next points at the look-ahead character]

Ad-hoc Lexer: Top-level Loop

    class Lexer {
      InputStream s;
      char next;
      Lexer(InputStream _s) { s = _s; next = s.read(); }
      Token nextToken() {
        if (identifierFirstChar(next)) // starts with a letter
          return readIdentifier();     // is an identifier
        if (numericFirstChar(next))    // starts with a digit
          return readNumber();         // is a number
        if (next == '"') return readStringConst();
      }
    }

Problems
  • Might not know what kind of token we are going to read from seeing the first character
    • if a token begins with “i”, is it an identifier? (what about int, if?)
    • if a token begins with “2”, is it an integer constant?
    • interleaved tokenizer code is hard to write correctly, harder to maintain
    • in general, unbounded look-ahead may be needed
Problems (cont.)
  • How to specify tokens unambiguously?
  • Once specified, how to implement them in a systematic way?
  • How to implement them efficiently?
Problems (cont.)
  • For instance, consider:
    • How to describe tokens unambiguously

      2.e0 20.e-01 2.0000
      “” “x” “\\” “\”\’”

    • How to break up text into tokens

      if (x == 0) a = x<<1;
      if (x == 0) a = x<1;

    • How to tokenize efficiently
      • tokens may have similar prefixes
      • want to look at each character ~1 time
Principled Approach
  • Need a principled approach. Two options:
  • Lexer generators
    • a lexer generator (a.k.a. scanner generator) generates an efficient tokenizer automatically (e.g., lex, flex, JLex)
  • Your own lexer
    • Describe the programming language’s tokens with a set of regular expressions
    • Generate the scanning automaton from that set of regular expressions
Top-level idea…
  • Have a formal language to describe tokens.
    • Use regular expressions.
  • Have a mechanical way of converting this formal description to code.
    • Convert regular expressions to finite automata (acceptors/state machines).
  • Run the code on actual inputs.
    • Simulate the finite automaton.

An Example: Integers
  • Consider integers.
  • We can describe integers using the following grammar:
      • Num → ‘-’ Pos
      • Num → Pos
      • Pos → 0 | 1 | … | 9
      • Pos → 0 | 1 | … | 9 Pos
  • Or, in more compact notation: Num → -? [0-9]+

An Example: Integers
  • Using Num → -? [0-9]+ we can generate integers such as -12, 23, 0.
  • We can also represent this regular expression as a state machine.
    • This will be useful for simulating the regular expression.

An Example: Integers

[Figure: NFA for -?[0-9]+ with states 0–3: 0 →‘-’→ 1, 0 →ε→ 1, 1 →[0-9]→ 2, 2 →[0-9]→ 2 (loop), 2 →ε→ 3 (accepting)]

  • The non-deterministic finite automaton is as above.
  • We can verify that -123, 65, 0 are accepted by the state machine.
  • But which path to take? When to follow the ε-paths?

An Example: Integers

[Figure: equivalent DFA with states {0,1}, {1}, {2,3}: {0,1} →‘-’→ {1}; {0,1} →[0-9]→ {2,3}; {1} →[0-9]→ {2,3}; {2,3} →[0-9]→ {2,3} (accepting)]

  • The NFA can be converted to an equivalent deterministic FA, as above.
    • We shall see later how.
  • It accepts the same tokens: -123, 65, 0.

An Example: Integers
  • The deterministic finite automaton makes implementation much easier, as we shall see later.
  • So all we have to do is:
    • Express tokens as regular expressions
    • Convert the RE to an NFA
    • Convert the NFA to a DFA
    • Simulate the DFA on inputs

The larger picture…

[Figure: regular expression R describing tokens → RE→NFA conversion → NFA→DFA conversion → DFA simulation; input string w feeds the DFA simulation, which answers “Yes” if w is a valid token, “No” if not]

Language Theory Review
  • Let Σ be a finite set
    • Σ is called an alphabet
    • a ∈ Σ is called a symbol
  • Σ* is the set of all finite strings consisting of symbols from Σ
  • A subset L ⊆ Σ* is called a language
  • If L1 and L2 are languages, then L1 L2 is the concatenation of L1 and L2, i.e., the set of all pair-wise concatenations of strings from L1 and L2, respectively
Language Theory Review, ctd.
  • Let L ⊆ Σ* be a language
  • Then
    • L^0 = {ε}
    • L^(n+1) = L L^n for all n ≥ 0
  • Examples
    • if L = {a, b} then
      • L^1 = L = {a, b}
      • L^2 = {aa, ab, ba, bb}
      • L^3 = {aaa, aab, aba, abb, baa, bab, bba, bbb}
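The L^n definition is easy to check mechanically as pairwise concatenation; a small sketch in Java (the class and method names are ours):

```java
import java.util.Set;
import java.util.TreeSet;

// Pairwise concatenation of two languages, as in the definition of L1 L2.
public class LangConcat {
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> out = new TreeSet<>();     // sorted, for readable output
        for (String x : l1)
            for (String y : l2)
                out.add(x + y);
        return out;
    }

    public static void main(String[] args) {
        Set<String> l = Set.of("a", "b");
        System.out.println(concat(l, l));            // L^2 = [aa, ab, ba, bb]
        System.out.println(concat(l, concat(l, l))); // L^3: all 8 strings of length 3
    }
}
```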
Syntax of Regular Expressions
  • The set of regular expressions (RE) over alphabet Σ is defined inductively by
    • Let a ∈ Σ and R, S ∈ RE. Then:
      • a ∈ RE
      • ε ∈ RE
      • ∅ ∈ RE
      • R|S ∈ RE
      • RS ∈ RE
      • R* ∈ RE
  • In concrete syntactic form: precedence rules, parentheses, and abbreviations
Semantics of Regular Expressions
  • A regular expression R ∈ RE denotes the language L(R) ⊆ Σ*, given according to the inductive structure of R:
    • L(a) = {a}  the string “a”
    • L(ε) = {“”}  the empty string
    • L(∅) = {}  the empty set
    • L(R|S) = L(R) ∪ L(S)  alternation
    • L(RS) = L(R) L(S)  concatenation
    • L(R*) = {“”} ∪ L(R) ∪ L(R)^2 ∪ L(R)^3 ∪ …  Kleene closure
Simple Examples
  • L(R) = the “language” defined by R
    • L( abc ) = { abc }
    • L( hello|goodbye ) = {hello, goodbye}
      • | is the OR operator, so L(a|b) contains the string a and the string b.
    • L( 1(0|1)* ) = all non-zero binary numerals beginning with 1
      • Kleene star: zero or more repetitions of the enclosed expression.
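These example languages can be tried directly with java.util.regex, whose syntax covers the operators defined here (String.matches tests the whole string):

```java
// The example REs above, checked with Java's built-in regex engine.
public class ReDemo {
    public static void main(String[] args) {
        System.out.println("hello".matches("hello|goodbye")); // true: alternation
        System.out.println("1101".matches("1(0|1)*"));        // true: 1 then any bits
        System.out.println("0101".matches("1(0|1)*"));        // false: must start with 1
        System.out.println("1".matches("1(0|1)*"));           // true: star allows zero reps
    }
}
```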
Convenient RE Shorthand

  R+        one or more strings from L(R): R(R*)
  R?        optional R: (R|ε)
  [abce]    one of the listed characters: (a|b|c|e)
  [a-z]     one character from this range: (a|b|c|d|e|…|y|z)
  [^ab]     anything but one of the listed characters
  [^a-z]    one character not from this range
  ”abc”     the string “abc”
  \(        the character ’(’
  . . .
  id=R      named non-recursive regular expressions

More Examples

  Regular Expression R              Strings in L(R)
  digit = [0-9]                     “0” “1” “2” “3” …
  posint = digit+                   “8” “412” …
  int = -? posint                   “-42” “1024” …
  real = int (. posint)?            “-1.56” “12” “1.0”
       = (-|ε)([0-9]+)((. [0-9]+)|ε)
  [a-zA-Z_][a-zA-Z0-9_]*            C identifiers
  else                              the keyword “else”
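Two rows of the table can be sanity-checked with java.util.regex (the class name is ours):

```java
// Checking the int and C-identifier patterns from the table above.
public class ShorthandDemo {
    static final String IDENT = "[a-zA-Z_][a-zA-Z0-9_]*";
    static final String INT   = "-?[0-9]+";

    public static void main(String[] args) {
        System.out.println("_i00".matches(IDENT)); // true
        System.out.println("9x".matches(IDENT));   // false: cannot start with a digit
        System.out.println("-42".matches(INT));    // true
        System.out.println("-".matches(INT));      // false: needs at least one digit
    }
}
```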

Historical Anomalies
  • PL/I
    • Keywords not reserved
      • IF IF THEN THEN ELSE ELSE;
  • FORTRAN
    • Whitespace stripped out prior to scanning, so these differ only at one character:
      • DO 123 I = 1 . 2  (an assignment to the variable DO123I)
      • DO 123 I = 1 , 2  (a DO loop)
  • By and large, modern language design intentionally makes scanning easier
Writing a Lexer
  • Regular expressions are very useful for describing languages (tokens).
    • Use an automatic lexer generator (lex, flex, JLex) to generate a lexer from the language specification, or
    • Have a systematic way of writing a lexer from a specification such as regular expressions.
How To Use Regular Expressions
  • Given R ∈ RE and an input string w, we need a mechanism to determine if w ∈ L(R)
  • Such a mechanism is called an acceptor

[Figure: R ∈ RE (describing a token family) and input string w (from the program) feed a box “?” that answers “Yes” if w is a token, “No” if not]

Acceptors
  • An acceptor determines if an input string belongs to a language L
  • Finite automata are acceptors for languages described by regular expressions

[Figure: a description of language L → Finite Automaton (acceptor) ← input string w; answers “Yes” if w ∈ L, “No” if w ∉ L]

Finite Automata
  • Informally, a finite automaton consists of:
    • A finite set of states
    • Transitions between states
    • An initial state (start state)
    • A set of final states (accepting states)
  • Two kinds of finite automata:
    • Deterministic finite automata (DFA): the transition from each state is uniquely determined by the current input character
    • Non-deterministic finite automata (NFA): there may be multiple possible choices, and some “spontaneous” transitions without input
DFA Example
  • A finite automaton that accepts the strings in the language denoted by regular expression ab*a
  • Can be represented as a graph or a transition table.
    • As a graph:
      • Read a symbol
      • Follow the outgoing edge

[Figure: states 0, 1, 2; 0 →a→ 1, 1 →b→ 1 (loop), 1 →a→ 2 (accepting)]

DFA Example (cont.)
  • Representing the FA as a transition table makes the implementation very easy.
  • The above FA can be represented as:

      state   a       b
      0       1       Error
      1       2       1
      2       Error   Error

  • The current state and current symbol determine the next state.
  • Continue until:
    • the error state is reached, or
    • end of input.
Simulating the DFA
  • Determine if the DFA accepts an input string

    transition_table[NumSTATES][NumCHARS]
    accept_states[NumSTATES]
    state = INITIAL

    while (state != Error) {
      c = input.read();
      if (c == EOF) break;
      state = trans_table[state][c];
    }
    return (state != Error) && accept_states[state];
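A runnable Java version of this table-driven simulation for the ab*a DFA, with -1 standing in for the Error state (the class name is ours):

```java
// Table-driven simulation of the DFA for ab*a (states 0, 1, 2).
public class DfaSim {
    static final int ERROR = -1;
    // rows: states 0..2; columns: symbol a (index 0) and b (index 1)
    static final int[][] TRANS = {
        { 1, ERROR },    // state 0: a -> 1
        { 2, 1 },        // state 1: a -> 2, b -> 1 (loop on b)
        { ERROR, ERROR } // state 2: accepting, no outgoing moves
    };
    static final boolean[] ACCEPT = { false, false, true };

    static boolean accepts(String w) {
        int state = 0;
        for (char c : w.toCharArray()) {
            if (state == ERROR) return false;
            if (c != 'a' && c != 'b') return false; // not in the alphabet
            state = TRANS[state][c == 'a' ? 0 : 1];
        }
        return state != ERROR && ACCEPT[state];
    }

    public static void main(String[] args) {
        System.out.println(accepts("abba")); // true
        System.out.println(accepts("aa"));   // true
        System.out.println(accepts("ab"));   // false: ends before the final a
    }
}
```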

RE → Finite automaton?
  • Can we build a finite automaton for every regular expression?
  • Strategy: build the finite automaton inductively, based on the definition of regular expressions

[Figure: base cases: an automaton for ε (a single accepting state) and an automaton for a symbol a (two states joined by an a-edge)]

RE → Finite automaton?
  • Alternation R|S: branch into either the R automaton or the S automaton
  • Concatenation RS: the R automaton followed by the S automaton
  • Recall that “?” marks an optional move.

[Figure: R|S as a start state with optional moves into the R automaton and the S automaton; RS as the R automaton joined to the S automaton]
NFA Definition
  • A non-deterministic finite automaton (NFA) is an automaton where:
    • There may be ε-transitions (transitions that do not consume input characters)
    • There may be multiple transitions from the same state on the same input character

[Figure: example NFA with ε-edges and two a-edges leaving the same state]
RE → NFA intuition

    -?[0-9]+

[Figure: NFA taking either a ‘-’ edge or an ε-edge, then [0-9] edges]

  • When to take the ε-path?

NFA construction (Thompson)
  • An NFA only needs one stop state (why?)
  • Canonical NFA form: a single start state and a single accepting state
  • Use this canonical form to inductively construct NFAs for regular expressions
Inductive NFA Construction

[Figure: Thompson construction.
  R|S: a new start state with ε-edges into the R and S automata, and ε-edges from their accepting states into a new accepting state.
  RS: the accepting state of R connected by an ε-edge to the start state of S.
  R*: a new start state with an ε-edge into R and an ε-edge bypassing it to a new accepting state, plus an ε-edge from R’s accepting state back around.]

DFA vs NFA
  • DFA: the action of the automaton on each input symbol is fully determined
    • obvious table-driven implementation
  • NFA:
    • the automaton may have a choice on each step
    • the automaton accepts a string if there is any way to make choices that arrives at an accepting state
    • every path from the start state to an accept state spells a string accepted by the automaton
    • not obvious how to implement!
Simulating an NFA
  • Problem: how to execute an NFA?

    “strings accepted are those for which there is some corresponding path from the start state to an accept state”

  • Solution: search all paths in the graph consistent with the string, in parallel
    • Keep track of the subset of NFA states that the search could be in after seeing a string prefix
    • “Multiple fingers” pointing into the graph
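The multiple-fingers idea in runnable form, hard-coding the -?[0-9]+ NFA from the integer example (states 0–3; ε: 0→1 and 2→3; ‘-’: 0→1; digit: 1→2 and 2→2; accepting state 3). The helper names are ours:

```java
import java.util.HashSet;
import java.util.Set;

// Simulate the -?[0-9]+ NFA by tracking the set of reachable states.
public class NfaSim {
    // ε-closure for this fixed NFA: ε-edges are 0->1 and 2->3.
    static Set<Integer> closure(Set<Integer> s) {
        Set<Integer> t = new HashSet<>(s);
        boolean changed = true;
        while (changed) {
            changed = false;
            if (t.contains(0) && t.add(1)) changed = true;
            if (t.contains(2) && t.add(3)) changed = true;
        }
        return t;
    }

    // All states reachable on character c, followed by ε-closure.
    static Set<Integer> step(Set<Integer> s, char c) {
        Set<Integer> t = new HashSet<>();
        for (int q : s) {
            if (q == 0 && c == '-') t.add(1);
            if ((q == 1 || q == 2) && Character.isDigit(c)) t.add(2);
        }
        return closure(t);
    }

    static boolean accepts(String w) {
        Set<Integer> s = closure(Set.of(0)); // start: {0,1}
        for (char c : w.toCharArray()) s = step(s, c);
        return s.contains(3);                // 3 is the accepting state
    }

    public static void main(String[] args) {
        System.out.println(accepts("-23")); // true: {0,1} -> {1} -> {2,3} -> {2,3}
        System.out.println(accepts("65"));  // true
        System.out.println(accepts("-"));   // false: no digit seen
    }
}
```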
Example

[Figure: the -?[0-9]+ NFA from before, states 0–3]

  • Input string: -23
  • NFA states:
    • Start: {0,1}
    • “-”: {1}
    • “2”: {2,3}
    • “3”: {2,3}
  • But this is very difficult to implement directly.
NFA → DFA conversion
  • Can convert the NFA directly to a DFA by the same approach
  • Create one DFA state for each distinct subset of NFA states that could arise
  • States: {0,1}, {1}, {2,3}
  • Called the “subset construction”

[Figure: the NFA (states 0–3) alongside the resulting DFA: {0,1} →‘-’→ {1}; {0,1} and {1} →[0-9]→ {2,3}; {2,3} →[0-9]→ {2,3}]

Algorithm
  • For a set S of states in the NFA, compute ε-closure(S) = the set of states reachable from states in S by zero or more ε-transitions

    T = S
    Repeat  T = T ∪ {s’ | s ∈ T, (s,s’) is an ε-transition}
    Until  T remains unchanged
    ε-closure(S) = T

  • For a set S of ε-closed states in the NFA, compute DFAedge(S,c) = the set of states reachable from states in S by transitions on symbol c, followed by ε-transitions

    DFAedge(S,c) = ε-closure( {s’ | s ∈ S, (s,s’) is a c-transition} )

Algorithm

    DFA-initial-state = ε-closure(NFA-initial-state)
    Worklist = { DFA-initial-state }
    While ( Worklist not empty )
      Pick state S from Worklist
      For each character c
        S’ = DFAedge(S,c)
        if (S’ not in DFA states)
          Add S’ to DFA states and worklist
        Add an edge (S, S’) labeled c in DFA
    For each DFA-state S
      If S contains an NFA-final state
        Mark S as DFA-final-state
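The two algorithms above can be sketched together in Java. The NFA is the -?[0-9]+ example, with ‘d’ standing in for any digit; the edge-map encoding is an illustrative choice, not from the slides:

```java
import java.util.*;

// Worklist subset construction for the -?[0-9]+ NFA (ε: 0->1, 2->3).
public class SubsetConstruction {
    // eps.get(s) = states reachable from s by one ε-edge
    static final Map<Integer, Set<Integer>> eps = Map.of(0, Set.of(1), 2, Set.of(3));
    // delta.get(s).get(c) = states reachable from s on character c
    static final Map<Integer, Map<Character, Set<Integer>>> delta = Map.of(
        0, Map.of('-', Set.of(1)),
        1, Map.of('d', Set.of(2)),   // 'd' stands for any digit
        2, Map.of('d', Set.of(2)));

    // ε-closure(S): grow S along ε-edges until nothing changes.
    static Set<Integer> closure(Set<Integer> s) {
        Set<Integer> t = new TreeSet<>(s);
        Deque<Integer> work = new ArrayDeque<>(s);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (int r : eps.getOrDefault(q, Set.of()))
                if (t.add(r)) work.push(r);
        }
        return t;
    }

    // DFAedge(S,c): move on c, then ε-close.
    static Set<Integer> dfaEdge(Set<Integer> s, char c) {
        Set<Integer> t = new TreeSet<>();
        for (int q : s)
            t.addAll(delta.getOrDefault(q, Map.of()).getOrDefault(c, Set.of()));
        return closure(t);
    }

    public static void main(String[] args) {
        Set<Set<Integer>> dfaStates = new LinkedHashSet<>();
        Deque<Set<Integer>> worklist = new ArrayDeque<>();
        Set<Integer> start = closure(Set.of(0));
        dfaStates.add(start);
        worklist.push(start);
        while (!worklist.isEmpty()) {
            Set<Integer> s = worklist.pop();
            for (char c : new char[]{'-', 'd'}) {
                Set<Integer> s2 = dfaEdge(s, c);
                if (!s2.isEmpty() && dfaStates.add(s2)) worklist.push(s2);
            }
        }
        System.out.println(dfaStates); // [[0, 1], [1], [2, 3]]
    }
}
```

The three DFA states printed are exactly the subsets {0,1}, {1}, {2,3} from the slides.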

Putting the Pieces Together

[Figure: regular expression R → RE→NFA conversion → NFA→DFA conversion → DFA simulation; input string w feeds the DFA simulation, which answers “Yes” if w ∈ L(R), “No” if w ∉ L(R)]

State Minimization
  • State minimization is an optimization that converts a DFA to another DFA that recognizes the same language and has a minimum number of states.
    • Divide all states into “equivalence” groups.
      • Two states p and q are equivalent if, for all symbols, the outgoing edges either both lead to Error or lead to the same destination group.
      • Collapse the states of a group into a single state (instead of p and q, have a single state).
State Minimization (Equivalence)
  • More formally, all states in group Gi are equivalent iff for any two states p and q in Gi, and for every symbol σ, transition(p,σ) and transition(q,σ) are either both Error, or are states in the same group Gj (possibly Gi itself).
  • For example:

[Figure: states p, q, r in group Gi; for each symbol a, b, c, all three move to states in the same group (Gj, Gk, or Error), so p, q, r are equivalent]

State Minimization
  • Step 1. Partition the states of the original DFA into maximal-sized groups of “equivalent” states S = {G1, … , Gn}
  • Step 2. Construct the minimized DFA with one state per group Gi

[Figure: an example DFA with states 0–4 over symbols a, b, and the minimized DFA obtained by merging equivalent states]

DFA Minimization
  • Step 1. Partition the states of the original DFA into maximal-sized groups of equivalent states
    • Step 1a. Discard states not reachable from the start state
    • Step 1b. The initial partition is S = {Final, Non-final}
    • Step 1c. Repeatedly refine the partition {G1,…,Gn} while some group Gi contains states p and q such that, for some symbol σ, the transitions from p and q on σ go to different groups

[Figure: p and q in Gi; on a, p goes to Gj and q goes to Gk (or Error), with j ≠ k, so Gi must be split]
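Steps 1b/1c can be sketched as naive signature-based refinement: give each state a signature of (its current group, the groups its transitions reach) and split groups until nothing changes. The example DFA in main is hypothetical (it accepts a|b), and all names are ours:

```java
import java.util.*;

// Naive DFA minimization: refine {final, non-final} by transition signatures.
public class Minimize {
    static int[][] trans;      // trans[state][symbol] = next state, -1 = Error
    static boolean[] accept;

    static int[] refine() {
        int n = trans.length;
        int[] group = new int[n];
        for (int s = 0; s < n; s++) group[s] = accept[s] ? 1 : 0; // Step 1b
        boolean changed = true;
        while (changed) {                                         // Step 1c
            changed = false;
            Map<List<Integer>, Integer> sig2group = new HashMap<>();
            int[] next = new int[n];
            for (int s = 0; s < n; s++) {
                List<Integer> sig = new ArrayList<>();
                sig.add(group[s]);                          // current group
                for (int c = 0; c < trans[s].length; c++)   // target groups
                    sig.add(trans[s][c] < 0 ? -1 : group[trans[s][c]]);
                next[s] = sig2group.computeIfAbsent(sig, k -> sig2group.size());
            }
            if (!Arrays.equals(group, next)) { changed = true; group = next; }
        }
        return group; // states with equal entries are equivalent
    }

    public static void main(String[] args) {
        // a|b over states 0..2: 0 -a-> 1, 0 -b-> 2; 1 and 2 are accepting dead ends
        trans = new int[][]{ {1, 2}, {-1, -1}, {-1, -1} };
        accept = new boolean[]{ false, true, true };
        System.out.println(Arrays.toString(refine())); // states 1 and 2 merge
    }
}
```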

[Figure: the refinement step splits Gi into Gi (containing p) and Gi’ (containing q)]

After State Minimization
  • We have an optimized acceptor.

[Figure: regular expression R → RE→NFA → NFA→DFA → Minimize DFA → DFA simulation; input string w in, “Yes” if w ∈ L(R), “No” if w ∉ L(R)]

Lexical Analyzers vs Acceptors
  • We really need a lexer, not an acceptor.
  • Lexical analyzers use the same mechanism, but they:
    • Have multiple RE descriptions for multiple tokens
    • Output a sequence of matching tokens (or an error)
    • Always return the longest matching token
    • For multiple longest matching tokens, use rule priorities
Lexical Analyzers

[Figure: REs for all valid tokens, R1…Rn → RE→NFA → NFA→DFA → Minimize DFA; the character stream of the program feeds the DFA simulation, which outputs the token stream (and errors)]

Handling Multiple REs
  • Construct one NFA for each RE
  • Associate the final state of each NFA with the given RE
  • Combine the NFAs for all the REs into one NFA
  • Convert the NFA to a minimized DFA, associating each final DFA state with the highest-priority RE of the corresponding NFA states

[Figure: separate NFAs for keywords, whitespace, identifier, and number combined into one minimized DFA]

Using Roll Back

[Figure: combined DFA: 0 →a→ 1 →a→ 2 (accepts aa), 2 →b→ 3 →b→ 4 (accepts aabb), 0 →b→ 5 →a→ 6 (accepts ba)]

  • Consider three REs: {aa, ba, aabb} and input: aaba
  • Reach state 3 with no transition on the next character a
  • Roll the input back to its position on entering state 2 (i.e., having read aa)
  • Emit the token for aa
  • On the next call to the scanner, start in state 0 again with input ba
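The roll-back strategy in runnable form, on the combined DFA for {aa, ba, aabb} described above: remember the last accepting position, and when the DFA gets stuck, emit that token and resume after it. The table encoding and class name are ours:

```java
import java.util.ArrayList;
import java.util.List;

// Maximal-munch scanning with roll back over the {aa, ba, aabb} DFA.
public class MaxMunch {
    // states 0..6; columns: a (0), b (1); -1 = Error
    static final int[][] TRANS = {
        {1, 5}, {2, -1}, {-1, 3}, {-1, 4}, {-1, -1}, {6, -1}, {-1, -1}
    };
    // token accepted in each state (null if not an accepting state)
    static final String[] TOKEN = { null, null, "aa", null, "aabb", null, "ba" };

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int state = 0, lastAccept = -1, i = pos;
            while (i < input.length() && state != -1) {
                state = TRANS[state][input.charAt(i) == 'a' ? 0 : 1];
                i++;
                if (state != -1 && TOKEN[state] != null)
                    lastAccept = i;              // remember last accepting position
            }
            if (lastAccept < 0) throw new IllegalArgumentException("lexical error");
            tokens.add(input.substring(pos, lastAccept));
            pos = lastAccept;                    // roll back and restart in state 0
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("aaba")); // [aa, ba], as in the slide
    }
}
```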
Automatic Lexer Generators
  • Input: token specification
    • a list of regular expressions in priority order
    • an associated action for each RE (generates the appropriate kind of token, other bookkeeping)
  • Output: lexer program
    • a program that reads an input stream and breaks it up into tokens according to the REs (or reports a lexical error: “Unexpected character”)


Example: JLex (Java)

    %%
    digits = 0|[1-9][0-9]*
    letter = [A-Za-z]
    identifier = {letter}({letter}|[0-9_])*
    whitespace = [\ \t\n\r]+
    %%
    {whitespace} { /* discard */ }
    {digits}     { return new Token(INT, Integer.parseInt(yytext())); }
    "if"         { return new Token(IF, yytext()); }
    "while"      { return new Token(WHILE, yytext()); }
    {identifier} { return new Token(ID, yytext()); }

Example Output (Java)
  • A Java lexer which implements the functionality described in the language specification.
  • For instance:

    case 5: { return new Token(WHILE, yytext()); }
    case 6: break;
    case 2: { return new Token(ID, yytext()); }
    case 7: break;
    case 4: { return new Token(IF, yytext()); }
    case 8: break;
    case 1: { return new Token(INT, Integer.parseInt(yytext())); }

Start States
  • A mechanism that specifies the state in which to start the execution of the DFA
  • Declare states in the second section:

    %state STATE

  • Use states as prefixes of regular expressions in the third section:

    <STATE> regex { action }

  • Set the current state in the actions:

    yybegin(STATE)

  • There is a pre-defined initial state: YYINITIAL

Example

    %%
    %state STRING
    %%
    <YYINITIAL> "if"  { return new Token(IF, null); }
    <YYINITIAL> "\""  { yybegin(STRING); … }
    <STRING> "\""     { yybegin(YYINITIAL); … }
    <STRING> .        { … }

[Figure: two start states, YYINITIAL and STRING; a double quote switches between them, ‘.’ loops in STRING, and “if” is matched in YYINITIAL]

Summary
  • A lexical analyzer converts a text stream to tokens
  • Ad-hoc lexers are hard to get right and hard to maintain
  • For most languages, legal tokens are conveniently and precisely defined using regular expressions
  • Lexer generators generate the lexer automaton automatically from the token REs and their prioritization
Summary
  • To write your own lexer:
    • Describe the tokens using regular expressions.
    • Construct NFAs for those tokens.
      • If you have no ambiguities in the NFA, or you obtain a DFA directly from the regular expressions, you are done.
    • Construct a DFA from the NFA using the algorithm described.
    • Systematically implement the DFA using transition tables.
Reading
  • IC language spec
  • JLex manual
  • CVS manual
  • Links on the course web home page
  • “Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)”, Russ Cox, January 2007

Acknowledgement

The slides are based on similar content by Tim Teitelbaum, Cornell.