## PowerPoint Slideshow about 'Lexical Analysis (2 Lectures)' - temple


(2 Lectures)

Overview

- Basic Concepts
- Regular Expressions
- Language
- Lexical analysis by hand
- Regular Languages Tools
- NFA
- DFA
- Scanning tools
- Lex / Flex / JFlex / ANTLR

Scanning Perspective

- Purpose
- Transform a stream of symbols
- Into a stream of tokens

Lexical Analyzer Responsibilities

- Lexical analyzer [Scanner]
- Scan input
- Remove white spaces
- Remove comments
- Manufacture tokens
- Generate lexical errors
- Pass token to parser

Modular design

- Rationale
- Separate the two analyses (lexical and syntactic)
- High cohesion / Low coupling
- Improve efficiency
- Improve portability / maintainability
- Enable integration of third-party lexers
- [lexer = lexical analysis tool]

Terminology

- Token
- A classification for a common set of strings
- Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen,....
- Pattern
- The rules that characterize the set of strings for a token
- Examples: [0-9]+
- Lexeme
- Actual sequence of characters that matches a pattern and has a given Token class.
- Examples:
- Identifier: Name,Data,x
- Integer: 345,2,0,629,....

Lexical Errors

- Error Handling is very localized, w.r.t. Input Source
- Example:

fi(a==f(x)) … generates no lexical error in C: fi is a valid identifier, so the scanner cannot tell that it is a misspelled if

- In what situations do errors occur?
- Prefix of remaining input doesn’t match any defined token
- Possible error recovery actions:
- Deleting or Inserting Input Characters
- Replacing or Transposing Characters
- Or, skip over to next separator to ignore problem

Basic Scanning technique

- Use 1 character of look-ahead
- Obtain char with getc()
- Do a case analysis
- Based on lookahead char
- Based on current lexeme
- Outcome
- If char can extend lexeme, all is well, go on.
- If char cannot extend lexeme:
- Figure out what the complete lexeme is and return its token
- Put the lookahead back into the symbol stream
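
The technique above can be sketched in Java as follows. `CharStream` is a hypothetical stand-in for `getc()` plus a push-back operation, and the scanner recognizes only the Integer pattern `[0-9]+`; both are assumptions made for illustration.

```java
// Hypothetical character source standing in for getc() and the push-back operation.
final class CharStream {
    private final String text;
    private int pos = 0;

    CharStream(String text) { this.text = text; }

    // Returns the next character, or -1 at end of input.
    int getc() {
        if (pos < text.length()) return text.charAt(pos++);
        return -1;
    }

    // Puts the most recently read character back into the stream.
    void pushBack() { if (pos > 0) pos--; }
}

final class LookaheadDemo {
    // Scans one integer lexeme and leaves the stream positioned right after it.
    static String scanInteger(CharStream in) {
        StringBuilder lexeme = new StringBuilder();
        while (true) {
            int la = in.getc();                 // acquire one character of lookahead
            if (la >= '0' && la <= '9') {
                lexeme.append((char) la);       // the char extends the lexeme: go on
            } else {
                if (la != -1) in.pushBack();    // put the lookahead back
                return lexeme.toString();       // the lexeme is complete
            }
        }
    }
}
```

After `scanInteger` returns, the character that ended the lexeme is still available to the next call, which is the point of the push-back step.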

Language Concepts

- A language, L, is simply any set of strings over a fixed alphabet.

Alphabets and example languages:

- {0,1}: {0,10,100,1000,10000,…} and {0,1,100,000,111,…}
- {a,b,c}: {abc,aabbcc,aaabbbccc,…}
- {A…Z}: {TEE,FORE,BALL,…} and {FOR,WHILE,GOTO,…}
- {A…Z,a…z,0…9,+,-,…,<,>,…}: {All legal PASCAL progs} and {All grammatically correct English sentences}

Special Languages:

- ∅: the EMPTY LANGUAGE (contains no strings at all)
- {ε}: contains the empty string ε only

Regular Languages

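- The examples above are
- Quite expressive
- Simple languages
- But also...
- Several belong to a special class: regular languages
- Not all of them do: {abc,aabbcc,aaabbbccc,…} and the set of all legal PASCAL programs are not regular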
- A Regular Expression is a Set of Rules / Techniques for Constructing Sequences of Symbols (Strings) From an Alphabet.
- Let Σ Be an Alphabet, r a Regular Expression Then L(r) is the Language That is Characterized by the Rules of r

Rules

- Fix an alphabet Σ
- ε is a regular expression denoting {ε}
- If a is in Σ, a is a regular expression that denotes {a}
- Let r and s be R.E. for L(r) and L(s). Then
- (a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
- (b) (r)(s) is a regular expression denoting L(r) L(s)
- (c) (r)* is a regular expression denoting (L(r))*
- (d) (r) is a regular expression denoting L(r)
- The operators |, concatenation, and * are all left-associative.
- Parentheses are dropped as allowed by precedence.

Precedence

- From highest to lowest: * (closure), concatenation, | (alternation)

More Examples

- All Strings that start with “tab” or end with “bat”:

tab{A,…,Z,a,...,z}*|{A,…,Z,a,....,z}*bat

- All Strings in Which {1,2,3} exist in ascending order:

{A,…,Z}*1 {A,…,Z}*2 {A,…,Z}*3 {A,…,Z}*
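
For readers who want to try these two expressions, here is a sketch using `java.util.regex`. The set notation {A,…,Z,a,…,z} is written as the character class `[A-Za-z]`; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class RegexExamples {
    // tab{A,…,Z,a,…,z}* | {A,…,Z,a,…,z}*bat
    static final Pattern TAB_OR_BAT = Pattern.compile("tab[A-Za-z]*|[A-Za-z]*bat");

    // {A,…,Z}*1{A,…,Z}*2{A,…,Z}*3{A,…,Z}*
    static final Pattern ASCENDING_123 = Pattern.compile("[A-Z]*1[A-Z]*2[A-Z]*3[A-Z]*");

    // True when the whole string starts with "tab" or ends with "bat".
    static boolean startsWithTabOrEndsWithBat(String s) {
        return TAB_OR_BAT.matcher(s).matches();
    }

    // True when 1, 2, 3 each occur once, in ascending order, among upper-case letters.
    static boolean hasOneTwoThreeInOrder(String s) {
        return ASCENDING_123.matcher(s).matches();
    }
}
```

`matches()` requires the entire string to match, which corresponds to membership in L(r) rather than a substring search.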

Tokens as Patterns

- Patterns are ???
- Tokens are ???

Throw Away Tokens

- Fact
- Some languages define tokens that carry no meaning for the parser
- Example: C
- Whitespace, tabs, carriage returns, and comments can be discarded without affecting the program’s meaning.

Automaton

- A tool to specify a token

What about keywords ?

- Easy!
- Use the “Identifier” token
- After a match, lookup the keyword table
- If found, return a token for the matched keyword
- If not, return a token for the true identifier
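
The lookup step can be sketched as below. The keyword set and the token names are hypothetical; a real scanner would use its language's full keyword list and its own token type.

```java
import java.util.Map;

final class KeywordTable {
    // Hypothetical keyword table mapping a lexeme to its keyword token name.
    static final Map<String, String> KEYWORDS = Map.of(
            "for", "FOR",
            "while", "WHILE",
            "goto", "GOTO");

    // Called after the Identifier pattern has matched: keywords win, anything else is an identifier.
    static String tokenFor(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, "IDENTIFIER");
    }
}
```

This keeps the automaton small: it only needs to recognize identifiers, and the table decides afterwards whether the lexeme is reserved.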

Yes... But how to scan?

- Remember the algorithm?
- Acquire 1 character of lookahead
- Case analysis based
- On lookahead
- On state of automaton

Scanner code

class Scanner {
    InputStream _in;
    char _la;         // the lookahead character
    char[] _window;   // lexeme window
    int _state;       // current state of the automaton

    Token nextToken() {
        startLexeme();  // reset window at start
        _state = 0;
        while (true) {
            switch (_state) {
                case 0: {
                    _la = getChar();
                    if (_la == '<') _state = 1;
                    else if (_la == '=') _state = 5;
                    else if (_la == '>') _state = 6;
                    else failure(_state);
                } break;
                // states 1 to 5 (continuations after '<' and '=') are not shown here
                case 6: {
                    _la = getChar();
                    if (_la == '=') _state = 7;
                    else _state = 8;
                } break;
                case 7: {
                    return new Token(GEQUAL);
                }
                case 8: {
                    pushBack(_la);  // the lookahead is not part of '>'
                    return new Token(GREATER);
                }
            }
        }
    }
}

Handling Failures

- Meaning
- The automaton for this token failed
- Solution
- If another automaton is available
- “rewind” the input to the beginning of last lexeme
- Jump to start state of next automaton
- Start recognizing again
- If no other automaton
- This is a true lexical error.
- Discard lexeme (or at least first char of lexeme)
- Start from state 0 again
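
A small sketch of this strategy follows. Each "automaton" is modeled as a hypothetical function that returns the length of the lexeme it matches at a given position, or -1 on failure; the two recognizers and the `ERROR(...)` marker are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class FailureDemo {
    // A recognizer returns the length of the lexeme matched at 'start', or -1 on failure.
    interface Automaton { int match(String input, int start); }

    static final Automaton INTEGER = (text, start) -> {
        int i = start;
        while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
        return i > start ? i - start : -1;
    };

    static final Automaton IDENTIFIER = (text, start) -> {
        int i = start;
        while (i < text.length() && Character.isLetter(text.charAt(i))) i++;
        return i > start ? i - start : -1;
    };

    static List<String> scan(String input) {
        List<Automaton> automata = List.of(INTEGER, IDENTIFIER);
        List<String> lexemes = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int len = -1;
            for (Automaton a : automata) {
                len = a.match(input, pos);   // every attempt restarts at pos: the "rewind"
                if (len > 0) break;
            }
            if (len > 0) {
                lexemes.add(input.substring(pos, pos + len));
                pos += len;
            } else {
                // No automaton matched: a true lexical error. Discard one character and retry.
                lexemes.add("ERROR(" + input.charAt(pos) + ")");
                pos++;
            }
        }
        return lexemes;
    }
}
```

For the input `ab12#x` this yields the lexemes `ab`, `12`, an error for `#`, and `x`.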

Overview

- Basic Concepts
- Regular Expressions
- Language
- Lexical analysis by hand
- Regular Languages Tools
- NFA / DFA
- Scanning with DFAs
- Scanning tools
- Lex / Flex / JFlex

Automata & Language Theory

- Terminology
- FSA
- A recognizer that takes an input string and determines whether it’s a valid string of the language.
- Non-Deterministic FSA (NFA)
- Has several alternative actions for the same input symbol
- Deterministic FSA (DFA)
- Has at most 1 action for any given input symbol
- Bottom Line
- expressive power(NFA) == expressive power(DFA)
- Conversion can be automated

NFA

An NFA is a mathematical model that consists of:

- S, a set of states
- Σ, the symbols of the input alphabet
- move, a transition function
- move(state, symbol) → set of states
- move : S × (Σ ∪ {ε}) → Pow(S)
- A state s0 ∈ S, the start state
- F ⊆ S, a set of final or accepting states

Representing NFA

- Transition Diagrams: number states (circles), arcs, final states, …
- Transition Tables: more suitable for representation within a computer

We’ll see examples of both!

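[Figure: a generic arc from state i to state j labeled a, and the transition diagram of the example NFA with states 0 to 3 and arcs labeled a and b]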

Example NFA

S = { 0, 1, 2, 3 }

s0 = 0

F = { 3 }

Σ = { a, b }

What Language is defined ?

What is the Transition Table ?

| state | input a | input b |
|-------|---------|---------|
| 0 | { 0, 1 } | { 0 } |
| 1 | -- | { 2 } |
| 2 | -- | { 3 } |

ε (null) moves are possible: switch state but do not use any input symbol

Epsilon-Transitions

- Given the regular expression : (a (b*c)) | (a (b | c+)?)
- Find a transition diagram NFA that recognizes it.
- Solution ?

[Figure: transition diagram of the example NFA]

Working NFA

- Given an input string, we trace moves
- If no more input & in final state, ACCEPT

EXAMPLE: Input: ababb

move(0, a) = 1

move(1, b) = 2

move(2, a) = ? (undefined)

REJECT !

-OR-

move(0, a) = 0

move(0, b) = 0

move(0, a) = 1

move(1, b) = 2

move(2, b) = 3

ACCEPT !
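
One way to avoid guessing which trace to follow is to track every state the NFA could be in at once. The sketch below does that for the example NFA (transition table as given earlier, no ε-moves); the class name and the map-based encoding are assumptions, not part of the slides.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class NfaDemo {
    // move(state, symbol) -> set of states, for the example NFA.
    static final Map<Integer, Map<Character, Set<Integer>>> MOVE = new HashMap<>();
    static {
        MOVE.put(0, Map.of('a', Set.of(0, 1), 'b', Set.of(0)));
        MOVE.put(1, Map.of('b', Set.of(2)));
        MOVE.put(2, Map.of('b', Set.of(3)));
        MOVE.put(3, Map.of());
    }
    static final Set<Integer> FINAL = Set.of(3);

    // Accepts when at least one trace ends in a final state with no input left.
    static boolean accepts(String input) {
        Set<Integer> current = new TreeSet<>(Set.of(0));    // start state s0 = 0
        for (char c : input.toCharArray()) {
            Set<Integer> next = new TreeSet<>();
            for (int s : current) {
                // An undefined transition simply contributes no states.
                next.addAll(MOVE.get(s).getOrDefault(c, Set.of()));
            }
            current = next;
        }
        for (int s : current) {
            if (FINAL.contains(s)) return true;
        }
        return false;
    }
}
```

On `ababb` the tracked sets are {0}, {0,1}, {0,2}, {0,1}, {0,2}, {0,3}, so the input is accepted even though one individual trace gets stuck.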

Handling Undefined Transitions

- We can handle undefined transitions by defining one more state, a “death” state, and sending every previously undefined transition to this death state.

[Figure: transition diagram of the example NFA extended with a death state 4]

Worse still...

- Not all paths result in acceptance!

[Figure: transition diagram of the example NFA]

aabb is accepted along the path:

0 → 0 → 1 → 2 → 3

BUT… it is not accepted along the equally valid path:

0 → 0 → 0 → 0 → 0

The NFA “Problem”

- Two problems
- Valid input may not be accepted
- Non-deterministic behavior from run to run...
- Solution?

The DFA Saves the Day

- A DFA is an NFA with a few restrictions
- No epsilon transitions
- For every state s, there is only one transition (s,x) from s for any symbol x in Σ
- Corollaries
- Easy to implement a DFA with an algorithm!
- Deterministic behavior

NFA vs. DFA

- NFA
- Smaller number of states |Qnfa|
- Simulating it requires on the order of |Qnfa| work for each input symbol.
- DFA
- Larger number of states |Qdfa|
- Simulating it requires constant work for each input symbol.
- Caveat: with the generic NFA ⇒ DFA construction, |Qdfa| can grow to 2^|Qnfa|
- But: DFAs can be minimized (i.e., you can find the smallest possible |Qdfa|)

One catch...

- NFA-DFA comparison

NFA to DFA Conversion

- Idea
- Look at the states reachable without consuming any input
- Aggregate them in macro states

Final Result

- A state is final
- IFF at least one of the NFA states it contains is final

Preliminary Definitions

- NFA N = ( S, Σ, s0, F, MOVE )
- ε-Closure(s) : s ∈ S
- Set of states in S reachable from s using only ε-moves of N (including s itself).
- ε-Closure(T) : T ⊆ S
- NFA states reachable from some t ∈ T on ε-moves only.
- move(T,a) : T ⊆ S, a ∈ Σ
- Set of states to which there is a transition on input a from some t ∈ T

Algorithm

Computing the ε-closure of a set of states T:

    initialize ε-closure(T) to T;
    forall (t in T) push(t);
    while stack is not empty do begin
        t = pop();
        for each u ∈ S with edge t→u labeled ε
            if u is not in ε-closure(T) then begin
                add u to ε-closure(T);
                push u onto stack
            end
    end
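
The stack-based algorithm translates almost line for line into Java. In this sketch the ε-edges are passed as a map from a state to the states reachable by one ε-move; that encoding and the names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class EpsilonClosure {
    // Returns ε-closure(t) given the ε-edges of the NFA.
    static Set<Integer> closure(Set<Integer> t, Map<Integer, List<Integer>> epsEdges) {
        Set<Integer> result = new TreeSet<>(t);        // initialize ε-closure(T) to T
        Deque<Integer> stack = new ArrayDeque<>(t);    // push every t in T
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int u : epsEdges.getOrDefault(s, List.of())) {
                // add() returns true only when u was not in the closure yet
                if (result.add(u)) stack.push(u);
            }
        }
        return result;
    }
}
```

Because a state is pushed only the first time it is added, each state is processed at most once, even when the ε-edges form a cycle.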

DFA construction

Computing the set of states (D) and the transitions (T):

    let Q = ε-closure({s0});
    D = { Q };
    enQueue(Q);
    while queue not empty do
        X = deQueue();
        for each a ∈ Σ do
            Y = ε-closure(move(X,a));
            T[X,a] = Y;
            if Y is not in D
                D = D ∪ { Y };
                enQueue(Y);
            end
        end
    end
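
A compact Java sketch of this worklist algorithm is given below. The NFA is encoded as nested maps, and ε-moves are stored under the character `'\0'`; both choices, like all names here, are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubsetConstruction {
    static final char EPS = '\0';   // assumed label for ε-moves

    static Set<Integer> targets(Map<Integer, Map<Character, Set<Integer>>> nfa, int s, char c) {
        return nfa.getOrDefault(s, Map.of()).getOrDefault(c, Set.of());
    }

    // ε-closure(T): everything reachable from T on ε-moves only.
    static Set<Integer> closure(Map<Integer, Map<Character, Set<Integer>>> nfa, Set<Integer> t) {
        Set<Integer> result = new TreeSet<>(t);
        Deque<Integer> stack = new ArrayDeque<>(t);
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int u : targets(nfa, s, EPS)) {
                if (result.add(u)) stack.push(u);
            }
        }
        return result;
    }

    // move(X, a): states reachable from some state of X on input a.
    static Set<Integer> move(Map<Integer, Map<Character, Set<Integer>>> nfa, Set<Integer> x, char a) {
        Set<Integer> result = new TreeSet<>();
        for (int s : x) result.addAll(targets(nfa, s, a));
        return result;
    }

    // Returns the DFA transition table T; its key set is the set of DFA states D.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(
            Map<Integer, Map<Character, Set<Integer>>> nfa, int s0, Set<Character> sigma) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> table = new LinkedHashMap<>();
        Deque<Set<Integer>> queue = new ArrayDeque<>();
        Set<Integer> q = closure(nfa, Set.of(s0));
        table.put(q, new HashMap<>());
        queue.add(q);
        while (!queue.isEmpty()) {
            Set<Integer> x = queue.remove();
            for (char a : sigma) {
                Set<Integer> y = closure(nfa, move(nfa, x, a));
                table.get(x).put(a, y);
                if (!table.containsKey(y)) {    // Y is not in D yet
                    table.put(y, new HashMap<>());
                    queue.add(y);
                }
            }
        }
        return table;
    }
}
```

As in the pseudocode, an empty Y is kept as an ordinary DFA state; it then plays the role of the death state. Applied to the example NFA with Σ = {a, b}, the construction produces the four macro states {0}, {0,1}, {0,2}, and {0,3}.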

Summary

- We can
- Specify tokens with R.E.
- Use DFA to scan an input and recognize token
- Transform an NFA into a DFA automatically
- What we are missing
- A way to transform an R.E. into an NFA
- Then, we will have a complete solution
- Build a big R.E.
- Turn the R.E. into an NFA
- Turn the NFA into a DFA
- Scan with the obtained DFA

R.E. To NFA

- Process
- Inductive definition
- Use the structure of the R.E.
- Use atomic automata for atomic R.E.
- Use composition rules for each R.E. expression
- Recall
- RE ::= ε
- RE ::= s in Σ
- RE ::= rs
- RE ::= r | s
- RE ::= r*

Epsilon Construction

- RE ::= ε

Symbol Construction

- RE ::= x in Σ

Chaining Construction

- RE ::= rs

Branching Construction

- RE ::= r | s

Kleene-Closure Construction

- RE ::= r*
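
The diagrams for these five constructions did not survive in this transcript. The sketch below follows the classic Thompson-style construction, in which every fragment has one start and one accepting state and fragments are glued together with ε-edges. The exact shapes used on the slides may differ, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class Thompson {
    static final char EPS = '\0';   // assumed label for ε-edges

    static final class Edge {
        final int from; final char label; final int to;
        Edge(int from, char label, int to) { this.from = from; this.label = label; this.to = to; }
    }

    // An NFA fragment with a single start state and a single accepting state.
    static final class Frag {
        final int start; final int accept;
        Frag(int start, int accept) { this.start = start; this.accept = accept; }
    }

    final List<Edge> edges = new ArrayList<>();
    private int states = 0;

    private int fresh() { return states++; }
    private void edge(int from, char label, int to) { edges.add(new Edge(from, label, to)); }
    int stateCount() { return states; }

    // RE ::= ε
    Frag epsilon() { return symbol(EPS); }

    // RE ::= x in Σ
    Frag symbol(char x) {
        Frag f = new Frag(fresh(), fresh());
        edge(f.start, x, f.accept);
        return f;
    }

    // RE ::= rs  (chaining)
    Frag concat(Frag r, Frag s) {
        edge(r.accept, EPS, s.start);
        return new Frag(r.start, s.accept);
    }

    // RE ::= r | s  (branching)
    Frag union(Frag r, Frag s) {
        Frag f = new Frag(fresh(), fresh());
        edge(f.start, EPS, r.start);
        edge(f.start, EPS, s.start);
        edge(r.accept, EPS, f.accept);
        edge(s.accept, EPS, f.accept);
        return f;
    }

    // RE ::= r*  (Kleene closure)
    Frag star(Frag r) {
        Frag f = new Frag(fresh(), fresh());
        edge(f.start, EPS, r.start);     // enter r
        edge(f.start, EPS, f.accept);    // or skip r entirely
        edge(r.accept, EPS, r.start);    // repeat r
        edge(r.accept, EPS, f.accept);   // or leave
        return f;
    }
}
```

Building (a | b)*abb is then a matter of nesting the calls in the shape of the expression, for example `t.concat(t.star(t.union(t.symbol('a'), t.symbol('b'))), t.symbol('a'))` for the prefix (a | b)*a.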

Overall Summary

- How does this all fit together ?
- Reg. Expr. → NFA construction
- NFA → DFA conversion
- DFA simulation for lexical analyzer
- Recall Lex Structure
- Pattern Action
- Pattern Action
- ……
- Each pattern recognizes lexemes
- Each pattern described by regular expression

[Figure: the NFAs for the individual patterns, such as (a | b)*abb, (abc)*ab, etc., are joined by ε-moves into a single recognizer]

Moral?

- All of this can be automated with a tool!
- LEX: the first lexical analyzer tool for C
- FLEX: a newer/faster implementation, C / C++ friendly
- JFLEX: a lexer for Java, based on the same principles
- JavaCC
- ANTLR

Ahead...

- Grammars
- Parsing
- Bottom Up
- Top Down
