Lexical Analysis (2 Lectures)

PowerPoint Slideshow about 'Lexical Analysis 2 Lectures' - temple


Overview
  • Basic Concepts
  • Regular Expressions
    • Language
  • Lexical analysis by hand
  • Regular Languages Tools
    • NFA
    • DFA
  • Scanning tools
    • Lex / Flex / JFlex / ANTLR
Scanning Perspective
  • Purpose
    • Transform a stream of symbols
    • Into a stream of tokens
Lexical Analyzer Responsibilities
  • Lexical analyzer [Scanner]
    • Scan input
    • Remove white spaces
    • Remove comments
    • Manufacture tokens
    • Generate lexical errors
    • Pass token to parser
Modular design
  • Rationale
    • Separate the two analyses (lexical and syntactic)
      • High cohesion / Low coupling
    • Improve efficiency
    • Improve portability / maintainability
    • Enable integration of third-party lexers
      • [lexer = lexical analysis tool]
Terminology
  • Token
    • A classification for a common set of strings
    • Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen, …
  • Pattern
    • The rules that characterize the set of strings for a token
    • Examples: [0-9]+
  • Lexeme
    • Actual sequence of characters that matches a pattern and has a given Token class.
    • Examples:
      • Identifier: Name,Data,x
      • Integer: 345, 2, 0, 629, …
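
The three terms can be illustrated with a minimal sketch. The class name `TerminologyDemo`, the enum `TokenKind`, and the identifier pattern are assumptions made for illustration; only the pattern `[0-9]+` comes from the slide.

```java
import java.util.regex.Pattern;

// Hypothetical names for illustration; none of them come from the slides.
class TerminologyDemo {
    // Tokens: the classifications.
    enum TokenKind { IDENTIFIER, INTEGER, UNKNOWN }

    // Patterns: the rules that characterize the set of strings of each token.
    static final Pattern INTEGER = Pattern.compile("[0-9]+");
    // Assumed identifier rule: a letter followed by letters or digits.
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z][A-Za-z0-9]*");

    // Lexeme: the actual character sequence being classified.
    static TokenKind classify(String lexeme) {
        if (INTEGER.matcher(lexeme).matches()) return TokenKind.INTEGER;
        if (IDENTIFIER.matcher(lexeme).matches()) return TokenKind.IDENTIFIER;
        return TokenKind.UNKNOWN;
    }
}
```

Here `classify("345")` yields `INTEGER` and `classify("Name")` yields `IDENTIFIER`: two different lexemes, each matching the pattern of its token.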
Examples
Lexical Errors
  • Error handling is very localized with respect to the input source
  • Example:

fi(a==f(x)) … generates no lexical error in C: fi is scanned as a valid identifier, so the mistake only surfaces later, in the parser

  • In what situations do errors occur?
    • Prefix of remaining input doesn’t match any defined token
  • Possible error recovery actions:
    • Deleting or Inserting Input Characters
    • Replacing or Transposing Characters
  • Or, skip over to next separator to ignore problem
Basic Scanning technique
  • Use 1 character of look-ahead
    • Obtain char with getc()
  • Do a case analysis
    • Based on lookahead char
    • Based on current lexeme
  • Outcome
    • If char can extend lexeme, all is well, go on.
    • If char cannot extend lexeme:
      • Figure out what the complete lexeme is and return its token
      • Put the lookahead back into the symbol stream
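
The technique above can be sketched for a single token class, using Java's `PushbackReader` in place of `getc()`. The class `LookaheadDemo` and its method names are hypothetical, and the sketch only handles integer lexemes.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not from the slides: scans one integer lexeme.
class LookaheadDemo {
    static String scanInteger(PushbackReader in) throws IOException {
        StringBuilder lexeme = new StringBuilder();
        while (true) {
            int la = in.read();                 // one character of lookahead
            if (la >= '0' && la <= '9') {
                lexeme.append((char) la);       // char extends the lexeme: go on
            } else {
                if (la != -1) in.unread(la);    // put the lookahead back
                return lexeme.toString();       // the complete lexeme
            }
        }
    }

    // Returns { integer lexeme, remaining input } to show nothing was lost.
    static String[] splitLeadingInteger(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            String lexeme = scanInteger(in);
            StringBuilder rest = new StringBuilder();
            for (int c = in.read(); c != -1; c = in.read()) rest.append((char) c);
            return new String[] { lexeme, rest.toString() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the input `629+x`, the lexeme is `629` and the `+` that ended it is still available to the next call, because it was pushed back.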
Language Concepts
  • A language, L, is simply any set of strings over a fixed alphabet.

Alphabet                        Language
{0,1}                           {0,10,100,1000,10000,…}
                                {0,1,100,000,111,…}
{a,b,c}                         {abc,aabbcc,aaabbbccc,…}
{A…Z}                           {TEE,FORE,BALL,…}
                                {FOR,WHILE,GOTO,…}
{A…Z,a…z,0…9,+,-,…,<,>,…}       {All legal PASCAL progs}
                                {All grammatically correct English sentences}

Special languages:
  • ∅ – the empty language
  • {ε} – contains the empty string ε only

Regular Languages
  • The examples above are
    • Quite expressive
    • Simple languages
  • But also...
    • Many of them belong to a special class: regular languages
    • Not all do: {abc,aabbcc,aaabbbccc,…} and the set of all legal PASCAL programs are not regular, because recognizing them requires unbounded counting or nesting
  • A regular expression is a set of rules for constructing sequences of symbols (strings) from an alphabet.
  • Let Σ be an alphabet and r a regular expression. Then L(r) is the language characterized by the rules of r.
Rules
  • Fix an alphabet Σ
  • ε is a regular expression denoting {ε}
  • If a is in Σ, a is a regular expression that denotes {a}
  • Let r and s be R.E. for L(r) and L(s). Then
  • (a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
  • (b) (r)(s) is a regular expression denoting L(r) L(s)
  • (c) (r)* is a regular expression denoting (L(r))*
  • (d) (r) is a regular expression denoting L(r)
  • All are left-associative.
  • Parentheses are dropped as allowed by precedences.

Precedence (highest to lowest): *, concatenation, |

More Examples
  • All strings that start with “tab” or end with “bat”:

tab{A,…,Z,a,…,z}* | {A,…,Z,a,…,z}*bat

  • All strings in which 1, 2, 3 appear in ascending order:

{A,…,Z}* 1 {A,…,Z}* 2 {A,…,Z}* 3 {A,…,Z}*
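
As a rough check, the two expressions above can be transcribed into `java.util.regex` syntax, where a set such as {A,…,Z} becomes the character class `[A-Z]`. The class and method names below are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical transcription of the two slide examples into Java regex syntax.
class RegexExamplesDemo {
    // Starts with "tab" or ends with "bat", letters only.
    static final Pattern TAB_OR_BAT = Pattern.compile("tab[A-Za-z]*|[A-Za-z]*bat");
    // 1, 2, 3 in ascending order, separated by runs of upper-case letters.
    static final Pattern ASCENDING_123 = Pattern.compile("[A-Z]*1[A-Z]*2[A-Z]*3[A-Z]*");

    static boolean tabOrBat(String s) {
        return TAB_OR_BAT.matcher(s).matches();   // matches() tests the whole string
    }

    static boolean ascending123(String s) {
        return ASCENDING_123.matcher(s).matches();
    }
}
```

With these definitions, `table` and `combat` are in the first language while `batch` is not, and `A1B2C3` is in the second language while `A2B1C3` is not.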

Tokens as R.E.

[Figure not preserved: token definitions written as regular expressions, annotated with the “+” and “?” operators]
Tokens as Patterns
  • Patterns are ???
  • Tokens are ???
Throw Away Tokens
  • Fact
    • Some languages define tokens that carry no meaning for the parser
    • Example: C
      • Whitespace, tabs, carriage returns, and comments can be discarded without affecting the program’s meaning.
Automaton
  • A tool to specify a token
What about keywords?
  • Easy!
    • Use the “Identifier” token
    • After a match, look up the lexeme in the keyword table
      • If found, return a token for the matched keyword
      • If not, return a token for the true identifier
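
A minimal sketch of the keyword lookup, assuming a small made-up keyword table and tokens represented as plain strings. `KeywordDemo`, `KEYWORDS`, and `tokenFor` are illustrative names, not part of the lecture's scanner.

```java
import java.util.Map;

// Hypothetical keyword table; a real language defines its own reserved words.
class KeywordDemo {
    static final Map<String, String> KEYWORDS = Map.of(
        "if", "IF",
        "while", "WHILE",
        "return", "RETURN");

    // Called after the scanner has matched an identifier-shaped lexeme.
    static String tokenFor(String lexeme) {
        // Found in the table: keyword token. Not found: a true identifier.
        return KEYWORDS.getOrDefault(lexeme, "IDENTIFIER");
    }
}
```

This is also why `fi` raises no lexical error: `tokenFor("fi")` simply falls through to `IDENTIFIER`, whereas `tokenFor("if")` returns the keyword token.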
Yes... But how to scan?
  • Remember the algorithm?
    • Acquire 1 character of lookahead
    • Case analysis based
      • On lookahead
      • On state of automaton
Scanner code

class Scanner {
    InputStream _in;
    char _la;          // the lookahead character
    char[] _window;    // lexeme window
    int _state;        // current state of the automaton

    Token nextToken() {
        startLexeme();             // reset window at start
        _state = 0;                // every token starts in state 0
        while (true) {
            switch (_state) {
                case 0:
                    _la = getChar();
                    if (_la == '<') _state = 1;
                    else if (_la == '=') _state = 5;
                    else if (_la == '>') _state = 6;
                    else failure(_state);
                    break;
                // cases for the remaining states are not shown on the slide
                case 6:
                    _la = getChar();
                    if (_la == '=') _state = 7;
                    else _state = 8;
                    break;
                case 7:
                    return new Token(GEQUAL);
                case 8:
                    pushBack(_la);   // the lookahead belongs to the next token
                    return new Token(GREATER);
            }
        }
    }
}


Handling Failures
  • Meaning
    • The automaton for this token failed
  • Solution
    • If another automaton is available
      • “rewind” the input to the beginning of last lexeme
      • Jump to start state of next automaton
      • Start recognizing again
    • If no other automaton
      • This is a true lexical error.
      • Discard lexeme (or at least first char of lexeme)
      • Start from state 0 again
Overview
  • Basic Concepts
  • Regular Expressions
    • Language
  • Lexical analysis by hand
  • Regular Languages Tools
    • NFA / DFA
  • Scanning with DFAs
  • Scanning tools
    • Lex / Flex / JFlex
Automata & Language Theory
  • Terminology
    • FSA
      • A recognizer that takes an input string and determines whether it’s a valid string of the language.
    • Non-Deterministic FSA (NFA)
      • Has several alternative actions for the same input symbol
    • Deterministic FSA (DFA)
      • Has at most 1 action for any given input symbol
  • Bottom Line
    • expressive power(NFA) == expressive power(DFA)
    • Conversion can be automated
NFA
  • An NFA is a mathematical model that consists of:
    • S, a set of states
    • Σ, the symbols of the input alphabet
    • move, a transition function
      • move(state, symbol) → set of states
      • move : S × (Σ ∪ {ε}) → Pow(S)
    • s0 ∈ S, the start state
    • F ⊆ S, a set of final or accepting states

Representing NFA
  • Transition diagrams: numbered states (circles), arcs, final states, …
  • Transition tables: more suitable for representation within a computer
  • We’ll see examples of both!

Example NFA

S = { 0, 1, 2, 3 }
s0 = 0
F = { 3 }
Σ = { a, b }

[Transition diagram: start → 0; 0 loops on a and b; 0 → 1 on a; 1 → 2 on b; 2 → 3 on b]

What language is defined?
What is the transition table?

          input
state     a           b
0         { 0, 1 }    { 0 }
1         --          { 2 }
2         --          { 3 }

ε (null) moves are possible: a move from state i to state j on ε switches state but does not use any input symbol.

Epsilon-Transitions
  • Given the regular expression : (a (b*c)) | (a (b | c+)?)
    • Find a transition diagram NFA that recognizes it.
  • Solution ?
NFA Construction
  • Automatic construction example
  • a(b*c)
  • a(b|c+)?

Build a Disjunction

Working NFA

[Transition diagram: the same example NFA, states 0 to 3]

  • Given an input string, we trace moves
  • If no more input & in final state, ACCEPT

EXAMPLE: Input: ababb

move(0, a) = 0
move(0, b) = 0
move(0, a) = 1
move(1, b) = 2
move(2, b) = 3
ACCEPT !

-OR-

move(0, a) = 1
move(1, b) = 2
move(2, a) = ? (undefined)
REJECT !
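
One way to avoid depending on a lucky choice of path is to follow all paths at once by tracking the set of states the NFA could be in. The sketch below hard-codes the example NFA's transitions; `NfaDemo` and its method names are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical simulation of the example NFA (S = {0,1,2,3}, s0 = 0, F = {3}).
class NfaDemo {
    // move(state, symbol): the set of possible next states (empty if undefined).
    static Set<Integer> move(int state, char symbol) {
        Set<Integer> next = new HashSet<>();
        if (state == 0 && (symbol == 'a' || symbol == 'b')) next.add(0);
        if (state == 0 && symbol == 'a') next.add(1);
        if (state == 1 && symbol == 'b') next.add(2);
        if (state == 2 && symbol == 'b') next.add(3);
        return next;
    }

    static boolean accepts(String input) {
        Set<Integer> current = new HashSet<>();
        current.add(0);                                // start state s0 = 0
        for (char symbol : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int state : current) {
                next.addAll(move(state, symbol));      // follow every alternative
            }
            current = next;
        }
        return current.contains(3);                    // some path ended in F = { 3 }
    }
}
```

With this simulation `accepts("ababb")` is true regardless of which alternative a single trace would have picked, and `accepts("abab")` is false because no path ends in state 3.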

Handling Undefined Transitions

[Transition diagram: the example NFA plus an extra state 4 that loops on every symbol of Σ]

  • We can handle undefined transitions by defining one more state, a “death” state, and routing all previously undefined transitions to this death state.
Worse still...

[Transition diagram: the same example NFA, states 0 to 3]

  • Not all paths result in acceptance!

aabb is accepted along the path:

0 → 0 → 1 → 2 → 3

BUT… it is not accepted along this equally valid path:

0 → 0 → 0 → 0 → 0


The NFA “Problem”
  • Two problems
    • Valid input may not be accepted
    • Non-deterministic behavior from run to run...
  • Solution?
The DFA Saves the Day
  • A DFA is an NFA with a few restrictions
    • No epsilon transitions
    • For every state s, there is only one transition (s,x) from s for any symbol x in Σ
  • Corollaries
    • Easy to implement a DFA with an algorithm!
    • Deterministic behavior
NFA vs. DFA
  • NFA
    • Smaller number of states, |Qnfa|
    • Simulating it requires work proportional to |Qnfa| for each input symbol.
  • DFA
    • Larger number of states, |Qdfa|
    • Simulating it requires constant work for each input symbol.
  • Caveat: with the generic NFA ⇒ DFA construction, |Qdfa| can grow to about 2^|Qnfa| in the worst case
  • But: DFAs can be minimized (i.e., you can find the smallest possible Qdfa)
One catch...
  • NFA-DFA comparison
NFA to DFA Conversion
  • Idea
    • Look at the state reachable without consuming any input
    • Aggregate them in macro states
Final Result
  • A macro state is final
    • If and only if at least one of the NFA states it contains is final
Preliminary Definitions
  • NFA N = ( S, Σ, s0, F, MOVE )
  • ε-Closure(s) : s ∈ S
    • Set of states in S that are reachable from s via ε-moves of N that originate from s.
  • ε-Closure(T) : T ⊆ S
    • NFA states reachable from some t ∈ T on ε-moves only.
  • move(T,a) : T ⊆ S, a ∈ Σ
    • Set of states to which there is a transition on input a from some t ∈ T
Algorithm

Computing the ε-closure:

forall (t in T) push(t);
initialize ε-closure(T) to T;
while stack is not empty do begin
    t = pop();
    for each u ∈ S with edge t→u labeled ε
        if u is not in ε-closure(T) then begin
            add u to ε-closure(T);
            push u onto stack
        end
end
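
The pseudocode above translates fairly directly into Java. In this sketch the ε-edges are passed in as a map from a state to its ε-successors; that representation, and the names `EpsilonClosureDemo` and `epsilonEdges`, are assumptions rather than anything prescribed by the slides.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: states are integers, ε-edges are given as adjacency lists.
class EpsilonClosureDemo {
    // epsilonEdges maps a state t to the states u with an edge t -> u labeled ε.
    static Set<Integer> epsilonClosure(Set<Integer> T,
                                       Map<Integer, List<Integer>> epsilonEdges) {
        Deque<Integer> stack = new ArrayDeque<>(T);   // push every t in T
        Set<Integer> closure = new HashSet<>(T);      // initialize the closure to T
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int u : epsilonEdges.getOrDefault(t, List.of())) {
                if (closure.add(u)) {                 // add() is true only for a new state
                    stack.push(u);
                }
            }
        }
        return closure;
    }
}
```

Because a state is pushed only the first time it enters the closure, the loop terminates even when the ε-edges form a cycle.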

DFA construction

Computing the set of states (D) and the transitions (T):

let Q = ε-closure({s0});
D = { Q };
enQueue(Q);
while queue not empty do
    X = deQueue();
    for each a ∈ Σ do
        Y := ε-closure(move(X,a));
        T[X,a] := Y;
        if Y is not in D then
            D = D ∪ { Y };
            enQueue(Y);
        end
    end
end

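
The construction can be tried on the example NFA from the earlier slides. That NFA has no ε-moves, so this sketch makes a simplifying assumption: ε-closure(Y) is just Y and the closure step is omitted. `SubsetConstructionDemo` and its members are illustrative names.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch for an ε-free NFA; a general version would wrap
// every move(...) result in an ε-closure computation.
class SubsetConstructionDemo {
    static final char[] SIGMA = { 'a', 'b' };

    // move(X, a) for the example NFA:
    // 0 -a-> {0,1}, 0 -b-> {0}, 1 -b-> {2}, 2 -b-> {3}.
    static Set<Integer> move(Set<Integer> X, char a) {
        Set<Integer> result = new TreeSet<>();
        for (int s : X) {
            if (s == 0) result.add(0);              // 0 loops on both symbols of SIGMA
            if (s == 0 && a == 'a') result.add(1);
            if (s == 1 && a == 'b') result.add(2);
            if (s == 2 && a == 'b') result.add(3);
        }
        return result;
    }

    // Returns the DFA transition table T[X, a] = Y; its keys are the DFA states D.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build() {
        Set<Integer> Q = new TreeSet<>(Set.of(0));  // ε-closure({s0}) = {0} here
        Map<Set<Integer>, Map<Character, Set<Integer>>> T = new HashMap<>();
        Set<Set<Integer>> D = new HashSet<>();
        Queue<Set<Integer>> queue = new ArrayDeque<>();
        D.add(Q);
        queue.add(Q);
        while (!queue.isEmpty()) {
            Set<Integer> X = queue.remove();
            for (char a : SIGMA) {
                Set<Integer> Y = move(X, a);
                T.computeIfAbsent(X, k -> new HashMap<>()).put(a, Y);
                if (D.add(Y)) queue.add(Y);         // Y is a newly discovered DFA state
            }
        }
        return T;
    }
}
```

For this NFA the queue discovers four macro states, {0}, {0,1}, {0,2} and {0,3}; only {0,3} contains the NFA's final state 3, so it is the only final DFA state.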

Summary
  • We can
    • Specify tokens with R.E.
    • Use DFA to scan an input and recognize token
    • Transform an NFA into a DFA automatically
  • What we are missing
    • A way to transform an R.E. into an NFA
  • Then, we will have a complete solution
    • Build a big R.E.
    • Turn the R.E. into an NFA
    • Turn the NFA into a DFA
    • Scan with the obtained DFA
R.E. To NFA
  • Process
    • Inductive definition
      • Use the structure of the R.E.
      • Use atomic automata for atomic R.E.
      • Use composition rules for each R.E. expression
  • Recall
    • RE ::= ε
         ::= s in Σ
         ::= rs
         ::= r | s
         ::= r*


Symbol Construction
  • RE ::= x in Σ
NFA Construction Example
  • R.E.
    • (ab*c) | (a(b|c*))
  • Parse Tree:

[Figure not preserved: parse tree of the R.E., with subexpressions labeled r0 to r13]
NFA Construction Example 2

[Figure not preserved: NFA fragments for r0 to r3 and their compositions]

r4 : r1 r2
r5 : r3 r4
NFA Construction Example 3

[Figure not preserved: NFA fragments for r6, r7, r8 and r11 and their compositions]

r9 : r7 | r8
r10 : r9
r12 : r11 r10


NFA Construction Example 4

[Figure not preserved: the complete NFA, with states numbered 1 to 17]

r13 : r5 | r12


Overall Summary
  • How does this all fit together?
    • Reg. Expr. → NFA construction
    • NFA → DFA conversion
    • DFA simulation for lexical analyzer
  • Recall Lex Structure
    • Pattern Action
    • Pattern Action
    • ……
      • Each pattern recognizes lexemes
      • Each pattern described by regular expression, e.g. (a | b)*abb, (abc)*ab, etc.
  • Together, the patterns feed a single recognizer!
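
The pattern/action structure can be imitated in a few lines of plain Java. This is only a sketch of the idea, not how Lex is implemented: it tries each pattern with `java.util.regex` instead of compiling everything into one DFA, and the rule set, the longest-match policy (the convention Lex follows) and the `PatternActionDemo` name are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pattern/action table; each action just names the token to emit.
class PatternActionDemo {
    static final String[][] RULES = {
        { "[ \\t\\n]+", null },                    // whitespace: matched, then thrown away
        { "[0-9]+", "INTEGER" },
        { "[A-Za-z][A-Za-z0-9]*", "IDENTIFIER" },
        { ">=", "GEQUAL" },
        { ">", "GREATER" },
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestToken = null;
            int bestEnd = pos;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[0]).matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {   // keep the longest match
                    bestEnd = m.end();
                    bestToken = rule[1];
                }
            }
            if (bestEnd == pos) {
                throw new IllegalArgumentException("lexical error at position " + pos);
            }
            if (bestToken != null) {
                tokens.add(bestToken + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }
}
```

For the input `x >= 42` the rules produce an identifier, a GEQUAL and an integer; the longest-match policy is what keeps `>=` from being split into `>` followed by an error on `=`.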

Moral?
  • All of this can be automated with a tool!
    • LEX: the first lexical analyzer tool for C
    • FLEX: a newer/faster implementation, C / C++ friendly
    • JFLEX: a lexer generator for Java, based on the same principles
    • JavaCC
    • ANTLR
Ahead...
  • Grammars
  • Parsing
    • Bottom Up
    • Top Down