Lexical Analysis

(2 Lectures)


Overview

  • Basic Concepts

  • Regular Expressions

    • Language

  • Lexical analysis by hand

  • Regular Languages Tools

    • NFA

    • DFA

  • Scanning tools

    • Lex / Flex / JFlex / ANTLR


Scanning Perspective

  • Purpose

    • Transform a stream of symbols

    • Into a stream of tokens


Lexical Analyzer Responsibilities

  • Lexical analyzer [Scanner]

    • Scan input

    • Remove white spaces

    • Remove comments

    • Manufacture tokens

    • Generate lexical errors

    • Pass token to parser


Modular design

  • Rationale

    • Separate the two analyses (lexical and syntactic)

      • High cohesion / Low coupling

    • Improve efficiency

    • Improve portability / maintainability

    • Enable integration of third-party lexers

      • [lexer = lexical analysis tool]


Terminology

  • Token

    • A classification for a common set of strings

    • Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen, …

  • Pattern

    • The rules that characterize the set of strings for a token

    • Examples: [0-9]+

  • Lexeme

    • Actual sequence of characters that matches a pattern and has a given Token class.

    • Examples:

      • Identifier: Name, Data, x

      • Integer: 345, 2, 0, 629, …


Examples


Lexical Errors

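  • Error handling is very localized with respect to the input source: the scanner only sees the characters at the current position

  • Example:

    fi(a==f(x)) … generates no lexical error in C, because fi is a valid identifier; the mistake only surfaces in a later phase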

  • In what situations do errors occur?

    • Prefix of remaining input doesn’t match any defined token

  • Possible error recovery actions:

    • Deleting or Inserting Input Characters

    • Replacing or Transposing Characters

    • Or skip ahead to the next separator and ignore the problem


Basic Scanning technique

  • Use 1 character of look-ahead

    • Obtain char with getc()

  • Do a case analysis

    • Based on lookahead char

    • Based on current lexeme

  • Outcome

    • If char can extend lexeme, all is well, go on.

    • If char cannot extend lexeme:

      • Figure out what the complete lexeme is and return its token

      • Put the lookahead back into the symbol stream
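
The technique above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name LookaheadDemo, the two lexeme kinds (identifiers and integers) and the "ID:" / "INT:" / "OTHER:" labels are made up for the example, and java.io.PushbackReader plays the role of getc() plus "put the lookahead back".

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class LookaheadDemo {
    // Splits the input into lexemes using one character of lookahead.
    static List<String> scan(String input) {
        List<String> lexemes = new ArrayList<>();
        try (PushbackReader in = new PushbackReader(new StringReader(input), 1)) {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) continue;       // skip separators
                StringBuilder lexeme = new StringBuilder().append((char) c);
                boolean isNumber = Character.isDigit(c);
                if (!isNumber && !Character.isLetter(c)) {
                    lexemes.add("OTHER:" + lexeme);            // single-character lexeme
                    continue;
                }
                int la;                                        // the lookahead character
                while ((la = in.read()) != -1) {
                    boolean canExtend = isNumber ? Character.isDigit(la)
                                                 : Character.isLetterOrDigit(la);
                    if (!canExtend) {
                        in.unread(la);                         // put the lookahead back
                        break;
                    }
                    lexeme.append((char) la);                  // lookahead extends the lexeme
                }
                lexemes.add((isNumber ? "INT:" : "ID:") + lexeme);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lexemes;
    }

    public static void main(String[] args) {
        System.out.println(scan("x1 = 42+y"));
    }
}
```

The case analysis here depends on both the lookahead character and the kind of lexeme being built, which matches the two criteria listed on the slide.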


Language Concepts

  • A language, L, is simply any set of strings over a fixed alphabet.

Alphabet                         Language
{0,1}                            {0,10,100,1000,10000,…}
                                 {0,1,100,000,111,…}
{a,b,c}                          {abc,aabbcc,aaabbbccc,…}
{A…Z}                            {TEE,FORE,BALL,…}
                                 {FOR,WHILE,GOTO,…}
{A…Z,a…z,0…9,+,-,…,<,>,…}        {All legal PASCAL progs}
                                 {All grammatically correct English sentences}

Special Languages:
  Φ   : the EMPTY LANGUAGE
  {ε} : contains the empty string ε only


Formal Language Operations


Examples


Regular Languages

  • All examples above are

    • Quite expressive

    • Simple languages

  • But also...

    • Belong to a special class: regular languages

  • A regular expression is a set of rules / techniques for constructing sequences of symbols (strings) from an alphabet.

  • Let Σ be an alphabet and r a regular expression. Then L(r) is the language characterized by the rules of r.


Rules

  • Fix an alphabet Σ

  • ε is a regular expression denoting {ε}

  • If a is in Σ, a is a regular expression that denotes {a}

  • Let r and s be R.E. for L(r) and L(s). Then

  • (a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)

  • (b) (r)(s) is a regular expression denoting L(r) L(s)

  • (c) (r)* is a regular expression denoting (L(r))*

  • (d) (r) is a regular expression denoting L(r)

  • All are left-associative.

  • Parentheses are dropped as allowed by precedence.

Precedence (highest to lowest): *, concatenation, |


Example revisited


Algebraic Properties


More Examples

  • All strings that start with “tab” or end with “bat”:

    tab{A,…,Z,a,…,z}* | {A,…,Z,a,…,z}*bat

  • All strings in which 1, 2 and 3 occur in ascending order:

    {A,…,Z}*1 {A,…,Z}*2 {A,…,Z}*3 {A,…,Z}*
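
These two expressions can be tried out with java.util.regex. The sketch below assumes that the character classes [A-Za-z] and [A-Z] are a fair translation of the sets {A,…,Z,a,…,z} and {A,…,Z}; the class and constant names are hypothetical.

```java
import java.util.regex.Pattern;

final class RegexExamples {
    // Starts with "tab" or ends with "bat", over letters only.
    static final Pattern TAB_OR_BAT = Pattern.compile("tab[A-Za-z]*|[A-Za-z]*bat");

    // 1, 2 and 3 occur in ascending order, separated by upper-case letters.
    static final Pattern ASCENDING = Pattern.compile("[A-Z]*1[A-Z]*2[A-Z]*3[A-Z]*");

    // Matcher.matches() succeeds only if the whole string belongs to the language.
    static boolean inLanguage(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inLanguage(TAB_OR_BAT, "table"));   // starts with "tab"
        System.out.println(inLanguage(TAB_OR_BAT, "combat"));  // ends with "bat"
        System.out.println(inLanguage(ASCENDING, "A1BC2D3"));
    }
}
```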


Tokens as R.E.

[Figure: token definitions written as regular expressions, using the “+” and “?” operators]


Tokens as Patterns

  • Patterns are ???

  • Tokens are ???


Throw Away Tokens

  • Fact

    • Some languages define tokens that carry no meaning and can be thrown away

    • Example: C

      • Whitespace, tabs, carriage returns, and comments can be discarded without affecting the program’s meaning.


Automaton

  • A tool to specify a token


A More Complex Automaton


Two More...


What about keywords ?

  • Easy!

    • Use the “Identifier” token

    • After a match, lookup the keyword table

      • If found, return a token for the matched keyword

      • If not, return a token for the true identifier
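
A minimal sketch of the keyword lookup, assuming token classes are represented by plain strings and that the table holds only three C-like keywords (both are choices made for this example):

```java
import java.util.HashMap;
import java.util.Map;

final class KeywordTable {
    // Hypothetical keyword table: lexeme -> token class.
    private static final Map<String, String> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", "IF");
        KEYWORDS.put("while", "WHILE");
        KEYWORDS.put("return", "RETURN");
    }

    // Called after the "Identifier" automaton has matched a lexeme.
    static String classify(String lexeme) {
        String keyword = KEYWORDS.get(lexeme);
        return keyword != null ? keyword : "IDENTIFIER";
    }

    public static void main(String[] args) {
        System.out.println(classify("while"));  // a keyword
        System.out.println(classify("fi"));     // a true identifier
    }
}
```

With this design the automaton stays small, because keywords need no states of their own.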


Yes... But how to scan?

  • Remember the algorithm?

    • Acquire 1 character of lookahead

    • Case analysis based

      • On lookahead

      • On state of automaton


Scanner code

class Scanner {
    InputStream _in;
    char _la;        // The lookahead character
    char[] _window;  // lexeme window
    int _state;      // current state of the automaton

    Token nextToken() {
        startLexeme();   // reset window at start
        _state = 0;      // every token starts in state 0
        while (true) {
            switch (_state) {
                case 0:
                    _la = getChar();
                    if (_la == '<') _state = 1;
                    else if (_la == '=') _state = 5;
                    else if (_la == '>') _state = 6;
                    else failure(_state);
                    break;
                // states 1 to 5 (tokens starting with '<' or '=') are not shown on the slide
                case 6:
                    _la = getChar();
                    if (_la == '=') _state = 7;
                    else _state = 8;
                    break;
                case 7:
                    return new Token(GEQUAL);
                case 8:
                    pushBack(_la);   // the lookahead is not part of this lexeme
                    return new Token(GREATER);
            }
        }
    }
}


Handling Failures

  • Meaning

    • The automaton for this token failed

  • Solution

    • If another automaton is available

      • “rewind” the input to the beginning of last lexeme

      • Jump to start state of next automaton

      • Start recognizing again

    • If no other automaton

      • This is a true lexical error.

      • Discard lexeme (or at least first char of lexeme)

      • Start from state 0 again


Overview

  • Basic Concepts

  • Regular Expressions

    • Language

  • Lexical analysis by hand

  • Regular Languages Tools

    • NFA / DFA

  • Scanning with DFAs

  • Scanning tools

    • Lex / Flex / JFlex


Automata & Language Theory

  • Terminology

    • FSA

      • A recognizer that takes an input string and determines whether it’s a valid string of the language.

    • Non-Deterministic FSA (NFA)

      • Has several alternative actions for the same input symbol

    • Deterministic FSA (DFA)

      • Has at most 1 action for any given input symbol

  • Bottom Line

    • expressive power(NFA) == expressive power(DFA)

    • Conversion can be automated


NFA

An NFA is a mathematical model that consists of:

  • S, a set of states

  • Σ, the symbols of the input alphabet

  • move, a transition function

    • move(state, symbol) → set of states

    • move : S × (Σ ∪ {ε}) → Pow(S)

  • A state s0 ∈ S, the start state

  • F ⊆ S, a set of final or accepting states


Representing NFA

  • Transition Diagrams:

    • Number states (circles), arcs, final states, …

  • Transition Tables:

    • More suitable to representation within a computer

  • We’ll see examples of both!


Example NFA

[Transition diagram: start → 0; state 0 loops on a and b; 0 → 1 on a; 1 → 2 on b; 2 → 3 on b; state 3 is final]

S = { 0, 1, 2, 3 }
s0 = 0
F = { 3 }
Σ = { a, b }

What Language is defined ?

What is the Transition Table ?

           input
state      a           b
0          { 0, 1 }    { 0 }
1          --          { 2 }
2          --          { 3 }

ε (null) moves are possible: an edge i → j labeled ε switches state but does not use any input symbol.


Epsilon-Transitions

  • Given the regular expression : (a (b*c)) | (a (b | c+)?)

    • Find a transition diagram NFA that recognizes it.

  • Solution ?


NFA Construction

  • Automatic construction example

  • a(b*c)

  • a(b|c+)?

Build a Disjunction


Resulting NFA


Working NFA

[Figure: the example NFA, states 0 to 3]

  • Given an input string, we trace moves

  • If no more input & in final state, ACCEPT

EXAMPLE: Input: ababb

  One path:
    move(0, a) = 0
    move(0, b) = 0
    move(0, a) = 1
    move(1, b) = 2
    move(2, b) = 3
    ACCEPT !

  -OR- another path:
    move(0, a) = 1
    move(1, b) = 2
    move(2, a) = ? (undefined)
    REJECT !
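
One way to avoid guessing a path is to track the set of all states the NFA could be in after each symbol. The sketch below does this for the example NFA; the MOVE array is a transcription of the transition table shown earlier (with no moves out of state 3, which is an assumption), and the class name is hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

final class NfaSimulation {
    // MOVE[state][symbol] lists the successor states; symbol 0 is 'a', symbol 1 is 'b'.
    static final int[][][] MOVE = {
        { {0, 1}, {0} },   // state 0
        { {},     {2} },   // state 1
        { {},     {3} },   // state 2
        { {},     {}  },   // state 3 (final), assumed to have no outgoing moves
    };
    static final int FINAL = 3;

    // Accepts iff at least one path through the NFA ends in the final state.
    static boolean accepts(String input) {
        Set<Integer> current = new TreeSet<>();
        current.add(0);                                // start state
        for (char ch : input.toCharArray()) {
            if (ch != 'a' && ch != 'b') return false;  // symbol outside Σ
            Set<Integer> next = new TreeSet<>();
            for (int s : current) {
                for (int t : MOVE[s][ch - 'a']) next.add(t);
            }
            current = next;
        }
        return current.contains(FINAL);
    }

    public static void main(String[] args) {
        System.out.println(accepts("ababb"));
        System.out.println(accepts("aba"));
    }
}
```

For the input ababb the tracked sets are {0}, {0,1}, {0,2}, {0,1}, {0,2}, {0,3}, so the accepting path is found without backtracking.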


Handling Undefined Transitions

[Figure: the example NFA extended with a death state 4; the previously undefined transitions lead to state 4, which loops on every symbol of Σ]

  • We can handle undefined transitions by defining one more state, a “death” state, and redirecting every previously undefined transition to this death state.


Worse still...

  • Not all paths result in acceptance!

    • aabb is accepted along the path: 0 → 0 → 1 → 2 → 3

    • BUT… it is not accepted along the equally valid path: 0 → 0 → 0 → 0 → 0


The NFA “Problem”

  • Two problems

    • Valid input may not be accepted

    • Non-deterministic behavior from run to run...

  • Solution?


The DFA Saves the Day

  • A DFA is an NFA with a few restrictions

    • No epsilon transitions

    • For every state s, there is only one transition (s,x) from s for any symbol x in Σ

  • Corollaries

    • Easy to implement a DFA with an algorithm!

    • Deterministic behavior
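
To illustrate how little code a DFA needs, here is a table-driven sketch. It assumes the DFA obtained from the example NFA, whose four states stand for the NFA state sets {0}, {0,1}, {0,2} and {0,3}; that table was derived by hand for this example and does not appear on the slides.

```java
final class DfaDemo {
    // NEXT[state][symbol]: rows are DFA states 0..3, columns are 'a' and 'b'.
    static final int[][] NEXT = {
        {1, 0},   // state 0 = {0}
        {1, 2},   // state 1 = {0,1}
        {1, 3},   // state 2 = {0,2}
        {1, 0},   // state 3 = {0,3}, accepting
    };
    static final int ACCEPTING = 3;

    static boolean accepts(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch != 'a' && ch != 'b') return false;  // symbol outside Σ
            state = NEXT[state][ch - 'a'];             // exactly one table lookup per symbol
        }
        return state == ACCEPTING;
    }

    public static void main(String[] args) {
        System.out.println(accepts("ababb"));
        System.out.println(accepts("abba"));
    }
}
```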


NFA vs. DFA

  • NFA

    • Smaller number of states |Qnfa|

    • Simulating it requires a computation proportional to |Qnfa| for each input symbol.

  • DFA

    • Larger number of states |Qdfa|

    • Simulating it requires a constant computation for each input symbol.

  • Caveat: with the generic NFA => DFA construction, |Qdfa| can grow to about 2^|Qnfa|

  • But: DFAs can be minimized (i.e., you can find the smallest possible |Qdfa|)


One catch...

  • NFA-DFA comparison


NFA to DFA Conversion

  • Idea

    • Look at the states reachable without consuming any input

    • Aggregate them in macro states


Final Result

  • A macro state is final

    • IFF one of its NFA states is final


Preliminary Definitions

  • NFA N = ( S, Σ, s0, F, MOVE )

  • ε-Closure(s) : s ∈ S

    • Set of states in S that are reachable from s via ε-moves of N alone (including s itself).

  • ε-Closure(T) : T ⊆ S

    • NFA states reachable from some t ∈ T on ε-moves only.

  • move(T,a) : T ⊆ S, a ∈ Σ

    • Set of states to which there is a transition on input a from some t ∈ T


Algorithm

computing the ε-closure

    forall (t in T) push(t);
    initialize ε-closure(T) to T;
    while stack is not empty do begin
        t = pop();
        for each u ∈ S with edge t→u labeled ε
            if u is not in ε-closure(T) then begin
                add u to ε-closure(T);
                push u onto stack
            end
    end
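
The ε-closure algorithm translates almost line for line into Java. In this sketch the ε-edges are assumed to be stored as a map from a state to its ε-successors, and the small NFA in demo() is invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class EpsilonClosure {
    static Set<Integer> closure(Set<Integer> T, Map<Integer, List<Integer>> epsilonEdges) {
        Set<Integer> result = new TreeSet<>(T);       // initialize ε-closure(T) to T
        Deque<Integer> stack = new ArrayDeque<>(T);   // push every t in T
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int u : epsilonEdges.getOrDefault(t, Collections.emptyList())) {
                if (result.add(u)) {                  // add() returns true only if u was new
                    stack.push(u);
                }
            }
        }
        return result;
    }

    // Hypothetical NFA with ε-edges 0→1, 0→2, 2→3 and 3→0; state 4 is unreachable.
    static Set<Integer> demo() {
        Map<Integer, List<Integer>> eps = new HashMap<>();
        eps.put(0, Arrays.asList(1, 2));
        eps.put(2, Arrays.asList(3));
        eps.put(3, Arrays.asList(0));
        eps.put(4, Arrays.asList(0));
        return closure(new TreeSet<>(Arrays.asList(0)), eps);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because a state is pushed only when it is first added to the result, the loop terminates even when the ε-edges form a cycle, as they do in the demo.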


DFA construction

Computing:

  • The set of states (D)

  • The transitions (T)

    let Q = ε-closure(s0);
    D = { Q };
    enQueue(Q);
    while queue not empty do
        X = deQueue();
        for each a ∈ Σ do
            Y = ε-closure(move(X,a));
            T[X,a] = Y;
            if Y is not in D then
                D = D ∪ { Y };
                enQueue(Y);
            end
        end
    end
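
A sketch of this construction in Java follows. The data layout is an assumption made for the example: NFA edges live in nested maps, the character '\0' stands for ε, and each DFA macro state is a sorted set of NFA states used directly as a map key. The demo NFA is the four-state example used earlier, which has no ε-edges.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubsetConstruction {
    static final char EPS = '\0';   // assumed marker for ε-edges
    final Map<Integer, Map<Character, Set<Integer>>> edges = new HashMap<>();

    void addEdge(int from, char symbol, int to) {
        edges.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(symbol, k -> new TreeSet<>())
             .add(to);
    }

    // move(T, a): states reachable from some t in T on symbol a.
    Set<Integer> move(Set<Integer> T, char symbol) {
        Set<Integer> out = new TreeSet<>();
        for (int t : T) {
            out.addAll(edges.getOrDefault(t, new HashMap<>())
                            .getOrDefault(symbol, new TreeSet<>()));
        }
        return out;
    }

    Set<Integer> closure(Set<Integer> T) {
        Set<Integer> result = new TreeSet<>(T);
        Deque<Integer> stack = new ArrayDeque<>(T);
        while (!stack.isEmpty()) {
            Set<Integer> one = new TreeSet<>();
            one.add(stack.pop());
            for (int u : move(one, EPS)) {
                if (result.add(u)) stack.push(u);
            }
        }
        return result;
    }

    // Returns the table T[X, a] = Y; its key set is D, the set of DFA states.
    Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(int start, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> table = new LinkedHashMap<>();
        Set<Integer> startSet = new TreeSet<>();
        startSet.add(start);
        Set<Integer> q = closure(startSet);
        Deque<Set<Integer>> queue = new ArrayDeque<>();
        table.put(q, new HashMap<>());                 // D = { Q }
        queue.add(q);
        while (!queue.isEmpty()) {
            Set<Integer> x = queue.remove();
            for (char a : alphabet) {
                Set<Integer> y = closure(move(x, a));
                table.get(x).put(a, y);                // T[X,a] = Y
                if (!table.containsKey(y)) {           // Y is not in D yet
                    table.put(y, new HashMap<>());
                    queue.add(y);
                }
            }
        }
        return table;
    }

    static Map<Set<Integer>, Map<Character, Set<Integer>>> demo() {
        SubsetConstruction nfa = new SubsetConstruction();
        nfa.addEdge(0, 'a', 0);
        nfa.addEdge(0, 'b', 0);
        nfa.addEdge(0, 'a', 1);
        nfa.addEdge(1, 'b', 2);
        nfa.addEdge(2, 'b', 3);
        return nfa.toDfa(0, new char[] {'a', 'b'});
    }

    public static void main(String[] args) {
        System.out.println(demo().keySet());
    }
}
```

Note that if some macro state has no successor on a symbol, the empty set becomes a DFA state of its own, which plays the role of the death state.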


Summary

  • We can

    • Specify tokens with R.E.

    • Use a DFA to scan an input and recognize tokens

    • Transform an NFA into a DFA automatically

  • What we are missing

    • A way to transform an R.E. into an NFA

  • Then, we will have a complete solution

    • Build a big R.E.

    • Turn the R.E. into an NFA

    • Turn the NFA into a DFA

    • Scan with the obtained DFA


R.E. To NFA

  • Process

    • Inductive definition

      • Use the structure of the R.E.

      • Use atomic automata for atomic R.E.

      • Use composition rules for each R.E. operator

  • Recall

    • RE::= ε

      ::= s in Σ

      ::= rs

      ::= r | s

      ::= r*


Epsilon Construction

  • RE::= ε


Symbol Construction

  • RE::= x in Σ


Chaining Construction

  • RE::= rs


Branching Construction

  • RE::= r | s


Kleene-Closure Construction

  • RE::= r*
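
The five constructions can be sketched as one small Java class. This follows the usual Thompson-style construction, in which every fragment has one start and one accepting state joined by ε-edges; the diagrams on the original slides are not available, so details such as how chaining glues two fragments may differ from them. All names are hypothetical, and '\0' is an assumed marker for ε.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class Thompson {
    static final char EPS = '\0';                      // assumed marker for ε-edges

    static final class Edge {
        final int from; final char label; final int to;
        Edge(int from, char label, int to) { this.from = from; this.label = label; this.to = to; }
    }

    // An NFA fragment with a single start state and a single accepting state.
    static final class Frag {
        final int start; final int accept;
        Frag(int start, int accept) { this.start = start; this.accept = accept; }
    }

    final List<Edge> edges = new ArrayList<>();
    int stateCount = 0;

    Frag fresh() { return new Frag(stateCount++, stateCount++); }
    void edge(int from, char label, int to) { edges.add(new Edge(from, label, to)); }

    Frag epsilon() {                                   // RE ::= ε
        Frag f = fresh(); edge(f.start, EPS, f.accept); return f;
    }
    Frag symbol(char x) {                              // RE ::= x in Σ
        Frag f = fresh(); edge(f.start, x, f.accept); return f;
    }
    Frag concat(Frag r, Frag s) {                      // RE ::= rs
        edge(r.accept, EPS, s.start);
        return new Frag(r.start, s.accept);
    }
    Frag union(Frag r, Frag s) {                       // RE ::= r | s
        Frag f = fresh();
        edge(f.start, EPS, r.start);   edge(f.start, EPS, s.start);
        edge(r.accept, EPS, f.accept); edge(s.accept, EPS, f.accept);
        return f;
    }
    Frag star(Frag r) {                                // RE ::= r*
        Frag f = fresh();
        edge(f.start, EPS, r.start);   edge(r.accept, EPS, f.accept);
        edge(f.start, EPS, f.accept);  edge(r.accept, EPS, r.start);
        return f;
    }

    Set<Integer> closure(Set<Integer> T) {
        Set<Integer> result = new TreeSet<>(T);
        boolean changed = true;
        while (changed) {                              // repeat until no ε-edge adds a state
            changed = false;
            for (Edge e : edges) {
                if (e.label == EPS && result.contains(e.from) && result.add(e.to)) changed = true;
            }
        }
        return result;
    }

    boolean accepts(Frag nfa, String input) {          // input must not contain '\0'
        Set<Integer> current = new TreeSet<>();
        current.add(nfa.start);
        current = closure(current);
        for (char ch : input.toCharArray()) {
            Set<Integer> next = new TreeSet<>();
            for (Edge e : edges) {
                if (e.label == ch && current.contains(e.from)) next.add(e.to);
            }
            current = closure(next);
        }
        return current.contains(nfa.accept);
    }

    // Builds the NFA for (ab*c) | (a(b|c*)) and tests one input string.
    static boolean demo(String input) {
        Thompson t = new Thompson();
        Frag left = t.concat(t.symbol('a'), t.concat(t.star(t.symbol('b')), t.symbol('c')));
        Frag right = t.concat(t.symbol('a'), t.union(t.symbol('b'), t.star(t.symbol('c'))));
        return t.accepts(t.union(left, right), input);
    }

    public static void main(String[] args) {
        System.out.println(demo("abbc"));
        System.out.println(demo("abb"));
    }
}
```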


NFA Construction Example

  • R.E.

    • (ab*c) | (a(b|c*))

  • Parse Tree:

[Figure: parse tree of the R.E., with subexpressions labeled r0 to r13; the root r13 is the | of r5 and r12]


NFA Construction Example 2

[Figure: NFA fragments for r0, r1, r2 and r3 over the symbols a, b and c, combined as r4 : r1 r2 and r5 : r3 r4]


NFA Construction Example 3

[Figure: NFA fragments for r6, r7, r8 and r11, combined as r9 : r7 | r8, r10 : r9 and r12 : r11 r10]


NFA Construction Example 4

[Figure: the complete NFA, with states numbered 1 to 17, for r13 : r5 | r12]


Overall Summary

  • How does this all fit together ?

    • Reg. Expr. → NFA construction

    • NFA → DFA conversion

    • DFA simulation for lexical analyzer

  • Recall Lex Structure

    • Pattern Action

    • Pattern Action

    • ……

      • Each pattern recognizes lexemes

      • Each pattern described by regular expression, e.g. (a | b)*abb, (abc)*ab, etc.

  • Together, the patterns give us a

Recognizer!
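
The pattern/action structure can be imitated in a few lines of Java. This is a sketch, not how Lex works internally: it uses java.util.regex instead of a generated DFA, the five rules are invented for the example, and the "action" is reduced to emitting the token name with its lexeme. It does follow the usual Lex conventions of preferring the longest match and, on a tie, the rule listed first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MiniLex {
    // Hypothetical rule table: token name and pattern, in priority order.
    static final String[] NAMES = {"WS", "GEQUAL", "GREATER", "INTEGER", "IDENT"};
    static final Pattern[] PATTERNS = {
        Pattern.compile("[ \\t\\n]+"),
        Pattern.compile(">="),
        Pattern.compile(">"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("[A-Za-z][A-Za-z0-9]*"),
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (int i = 0; i < PATTERNS.length; i++) {
                Matcher m = PATTERNS[i].matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the best one, so earlier rules win ties.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = NAMES[i];
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {                    // no pattern matches a prefix: lexical error
                tokens.add("ERROR(" + input.charAt(pos) + ")");
                pos++;                                 // discard one character and resume
                continue;
            }
            if (!bestName.equals("WS")) {              // whitespace is a throw-away token
                tokens.add(bestName + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x >= 42 $"));
    }
}
```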


Moral?

  • All of this can be automated with a tool!

    • LEX: the first lexical analyzer tool for C

    • FLEX: a newer/faster implementation, C / C++ friendly

    • JFLEX: a lexer generator for Java. Based on the same principles.

    • JavaCC

    • ANTLR


Ahead...

  • Grammars

  • Parsing

    • Bottom Up

    • Top Down

