Lexical analysis


LEXICAL ANALYSIS

Phung Hua Nguyen

University of Technology

2006


Outline

  • Introduction to Lexical Analysis

  • Token specification

    • Language

    • Regular Expressions (REs)

  • Token recognition

    • REs → NFA (Thompson’s construction, Algorithm 3.3)

    • NFA → DFA (subset construction, Algorithm 3.2)

    • DFA → minimal DFA (Algorithm 3.6)

  • Programming

Introduction

  • Read the input characters

  • Produce as output a sequence of tokens

  • Eliminate white space and comments

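[Figure: the source program feeds the lexical analyzer; the parser sends "get next token" requests and receives a token in return; both the lexical analyzer and the parser use the symbol table.]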
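The three duties listed above can be sketched in a few lines of Python. This is only an illustration: the token classes, the `//` comment syntax and the names `TOKEN_SPEC` and `tokenize` are assumptions, not part of the slides.

```python
import re

# Hypothetical token classes; the slides do not fix a token set.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),      # discarded
    ("WS",      r"\s+"),           # discarded
    ("NUM",     r"\d+"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("OP",      r"<=|<>|<|=|\+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Read characters, return (token, lexeme) pairs, drop white space and comments."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup not in ("COMMENT", "WS"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("x <= 42 // note\ny"))
```

A parser would normally pull these tokens one at a time rather than receive a full list; the list form is used here only to keep the sketch short.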

Why separate lexical analysis from parsing?

  • Simplify design

  • Improve compiler efficiency

  • Enhance compiler portability

Tokens, Patterns, Lexemes

Alphabet, Strings and Languages

  • Alphabet ∑: any finite set of symbols

    • The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ,…}

    • The binary alphabet {0,1}

    • The ASCII alphabet

  • String: a finite sequence of symbols drawn from ∑

    • Length |s| of a string s: the number of symbols in s

    • The empty string, denoted ε, has length |ε| = 0

  • Language: any set of strings over ∑

    • its two special cases:

      • ∅: the empty set

      • {ε}: the set containing only the empty string

Examples of Languages

  • ∑ ={a, á, à, ả, ã, ạ, b, c, d, đ,…}

    • Vietnamese language

  • ∑ = {0,1}

    • A string is an instruction

    • The set of Pentium instructions

  • ∑ = the ASCII set

    • A string is a program

    • The set of C programs

Terms (Fig.3.7)

String operations

  • String concatenation

    • If x and y are strings, xy is the string formed by appending y to x.

      E.g.: x = hom, y = nay ⇒ xy = homnay

    • ε is the identity: εy = y; xε = x

  • String exponentiation

    • s⁰ = ε

    • sⁱ = sⁱ⁻¹s for i > 0

      E.g.: s = 01, s⁰ = ε, s² = 0101, s³ = 010101
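Both operations map directly onto Python strings. The sketch below assumes the empty Python string plays the role of ε; `EPSILON` and `power` are illustrative names.

```python
EPSILON = ""   # the empty string ε

def power(s, i):
    """String exponentiation: s^0 = ε, s^i = s^(i-1) s."""
    return EPSILON if i == 0 else power(s, i - 1) + s

x, y = "hom", "nay"
print(x + y)                                  # concatenation xy
print(EPSILON + y == y, x + EPSILON == x)     # ε is the identity
print(power("01", 2), power("01", 3))
```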

Language Operations (Fig 3.8)

Examples

  • L = {A,B,…,Z,a,b,…,z}

  • D = {0,1,…,9}

  • L ∪ D: letters and digits

  • LD: strings consisting of a letter followed by a digit

  • L⁴: all four-letter strings

  • L*: all strings of letters, including ε

  • L(L ∪ D)*: all strings of letters and digits beginning with a letter

  • D⁺: all strings of one or more digits
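For finite languages these operations can be tried out with Python sets. The helper names `concat`, `power` and `closure` are assumptions made for this sketch, and because a Kleene closure is infinite, `closure` only builds strings from a bounded number of factors.

```python
from itertools import product
import string

L = set(string.ascii_letters)   # {A,...,Z,a,...,z}
D = set(string.digits)          # {0,...,9}

def concat(X, Y):
    """Concatenation XY = {xy | x in X, y in Y}."""
    return {x + y for x, y in product(X, Y)}

def power(X, i):
    """X^0 = {ε}, X^i = X^(i-1) X."""
    result = {""}
    for _ in range(i):
        result = concat(result, X)
    return result

def closure(X, max_factors):
    """Finite cut of X*: strings built from at most max_factors members of X."""
    result = set()
    for i in range(max_factors + 1):
        result |= power(X, i)
    return result

print(len(L | D))                 # size of L ∪ D
print("a1" in concat(L, D))       # a letter followed by a digit
print(sorted(closure({"a", "b"}, 2)))
```

Union is the built-in `|` on sets, so only concatenation and the two exponent-style operations need helpers.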

Regular Expressions (REs) over Alphabet ∑

  • Inductive base:

    • ε is an RE, denoting the regular language (RL) {ε}

    • a ∈ ∑ is an RE, denoting the RL {a}

  • Inductive step: Suppose r and s are REs, denoting the languages L(r) and L(s). Then

    • (r)|(s) is an RE, denoting the RL L(r) ∪ L(s)

    • (r)(s) is an RE, denoting the RL L(r)L(s)

    • (r)* is an RE, denoting the RL (L(r))*

    • (r) is an RE, denoting the RL L(r)
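The inductive definition can be read as a recursive function from an RE to the language it denotes. The sketch below assumes a hypothetical nested-tuple encoding of REs and, since L(r) may be infinite, returns only the strings of length at most n.

```python
def lang(r, n):
    """Strings of length <= n in L(r).

    r is a nested tuple (an encoding assumed for this sketch):
    ("eps",), ("sym", a), ("alt", r, s), ("cat", r, s), ("star", r).
    """
    kind = r[0]
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {r[1]} if n >= 1 else set()
    if kind == "alt":                       # L(r) ∪ L(s)
        return lang(r[1], n) | lang(r[2], n)
    if kind == "cat":                       # L(r)L(s)
        return {x + y for x in lang(r[1], n) for y in lang(r[2], n)
                if len(x + y) <= n}
    if kind == "star":                      # (L(r))*
        base = lang(r[1], n) - {""}
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise ValueError(kind)

a, b = ("sym", "a"), ("sym", "b")
print(sorted(lang(("cat", ("alt", a, b), ("alt", a, b)), 2)))
print(sorted(lang(("star", a), 3)))
```

Parenthesised expressions need no case of their own: in a tree encoding, (r) and r are the same node.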

Precedence and Associativity

  • Precedence:

    • “*” has the highest precedence

    • “concatenation” has the second highest precedence

    • “|” has the lowest precedence

  • Associativity:

    • all are left-associative

      E.g.: (a)|((b)*(c)) ≡ a|b*c

      ⇒ Unnecessary parentheses can be removed

Example

  • ∑ = {a, b}

  • a|b denotes {a,b}

  • (a|b)(a|b) denotes {aa,ab,ba,bb}

  • a* denotes {ε, a, aa, aaa, aaaa, …}

  • (a|b)* denotes ?

  • a|a*b denotes ?

Notational Shorthands

  • One or more instances +: r+ = rr*

    • denotes the language (L(r))+

    • has the same precedence and associativity as *

  • Zero or one instance ?: r? = r|ε

    • denotes the language L(r) ∪ {ε}

  • Character classes

    • [abc] denotes a|b|c

    • [A-Z] denotes A|B|…|Z

    • [a-zA-Z_][a-zA-Z0-9_]* denotes ?
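Python's `re` module supports the same shorthands, so the identities above can be spot-checked on a handful of strings. This is a sanity check on samples, not a proof; `same_language` and `IDENT` are names invented for the sketch, and an empty alternative stands in for ε.

```python
import re

def same_language(p, q, samples):
    """True if patterns p and q agree on every sample string."""
    return all(bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s))
               for s in samples)

samples = ["", "a", "aa", "aaa", "b", "ab", "abc", "c"]
print(same_language(r"a+", r"aa*", samples))        # r+ = rr*
print(same_language(r"a?", r"a|", samples))         # r? = r|ε
print(same_language(r"[abc]", r"a|b|c", samples))   # character class

# The last pattern on the slide has the shape commonly used for identifiers.
IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
print(bool(IDENT.fullmatch("_tmp1")), bool(IDENT.fullmatch("1abc")))
```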

Overview

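[Figure: RE → NFA (Algorithm 3.3), NFA → DFA (Algorithm 3.2), DFA → mDFA, the minimal DFA (Algorithm 3.6); the figure also carries an edge labelled 3.5.]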

Nondeterministic finite automata

  • A nondeterministic finite automaton (NFA) is a mathematical model that consists of

    • a finite set of states S

    • a set of input symbols ∑

    • a transition function move: S × (∑ ∪ {ε}) → 2^S, mapping a state and an input symbol (or ε) to a set of states

    • a start state s0

    • a finite set of final or accepting states F

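Transition graph

[Figure: an automaton drawn as a transition graph; states such as A and B are nodes, and each transition is an edge labelled with its input symbol.]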

Transition table

[Table: one row per state and one column per input symbol.]


Acceptance

  • An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.

  • Input 01010: A → B → A → B → A → B

  • Input 01011: A → B → A → B → A → ? (error)
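The acceptance condition can be checked by tracking the set of all states the NFA could be in. In the sketch below the automaton is an assumption inferred from the two traces (A goes to B on 0, B goes to A on 1, start state A, accepting state B); the slide's figure may differ. The empty string `""` labels ε-edges.

```python
def eps_closure(states, move):
    """All states reachable from `states` using ε-edges only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in move.get((s, ""), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def accepts(x, move, start, finals):
    """True iff some path labelled x leads from start to an accepting state."""
    current = eps_closure({start}, move)
    for c in x:
        step = set()
        for s in current:
            step |= move.get((s, c), set())
        current = eps_closure(step, move)
    return bool(current & finals)

# Assumed automaton, reconstructed from the traces on the slide.
MOVE = {("A", "0"): {"B"}, ("B", "1"): {"A"}}

print(accepts("01010", MOVE, "A", {"B"}))
print(accepts("01011", MOVE, "A", {"B"}))
```

On 01011 the state set becomes empty at the last symbol, which corresponds to the "?" in the second trace.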

Deterministic finite automata

  • A deterministic finite automaton (DFA) is a special case of NFA in which

    • no state has an ε-transition, and

    • for each state s and input symbol a, there is at most one edge labeled a leaving s.
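The two conditions translate into a one-line check on a transition relation. The representation is an assumption of this sketch: a dict from (state, symbol) to a set of target states, with `""` standing for ε.

```python
def is_deterministic(move):
    """True iff no ε-edge exists and every (state, symbol) has at most one target."""
    return all(symbol != "" and len(targets) <= 1
               for (state, symbol), targets in move.items())

print(is_deterministic({("A", "0"): {"B"}, ("B", "1"): {"A"}}))
print(is_deterministic({("A", ""): {"B"}}))
print(is_deterministic({("A", "0"): {"A", "B"}}))
```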

Thompson’s construction of NFA from REs

  • guided by the syntactic structure of the RE r

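  • For ε, construct an NFA with a start state i, an accepting state f, and a single ε-transition from i to f

  • For a in ∑, construct an NFA with a start state i, an accepting state f, and a single transition on a from i to f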


Thompson’s construction (cont’d)

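  • Suppose N(s) and N(t) are NFAs for REs s and t

    • For s|t, add a new start state i with ε-transitions to the start states of N(s) and N(t), and a new accepting state f reached by ε-transitions from the accepting states of N(s) and N(t)

    • For st, merge the accepting state of N(s) with the start state of N(t); the start state of N(s) becomes i and the accepting state of N(t) becomes f

    • For s*, add a new start state i and a new accepting state f, with ε-transitions from i to the start state of N(s) and to f, and from the accepting state of N(s) back to its start state and to f

    • For (s), use N(s) itself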
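A compact sketch of the construction follows, under two stated assumptions: REs use a hypothetical nested-tuple encoding, and for st the two sub-automata are joined by an ε-edge rather than by merging states, which accepts the same language at the cost of one extra state. The small `accepts` helper is there only to exercise the result.

```python
import itertools

_ids = itertools.count()          # fresh state numbers

def thompson(r, move):
    """Add the NFA for r to `move`; return its (start, accept) states.

    r is ("eps",), ("sym", a), ("alt", s, t), ("cat", s, t) or ("star", s).
    move maps (state, label) to a set of states; "" labels ε-edges.
    """
    def edge(src, label, dst):
        move.setdefault((src, label), set()).add(dst)

    kind = r[0]
    if kind in ("eps", "sym"):
        i, f = next(_ids), next(_ids)
        edge(i, "" if kind == "eps" else r[1], f)
        return i, f
    if kind == "cat":
        si, sf = thompson(r[1], move)
        ti, tf = thompson(r[2], move)
        edge(sf, "", ti)          # ε-link instead of merging sf with ti
        return si, tf
    if kind in ("alt", "star"):
        si, sf = thompson(r[1], move)
        i, f = next(_ids), next(_ids)
        if kind == "alt":
            ti, tf = thompson(r[2], move)
            links = ((i, si), (i, ti), (sf, f), (tf, f))
        else:
            links = ((i, si), (i, f), (sf, si), (sf, f))
        for src, dst in links:
            edge(src, "", dst)
        return i, f
    raise ValueError(kind)

def accepts(x, move, start, final):
    """Simulate the NFA on x by tracking the set of reachable states."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            for t in move.get((stack.pop(), ""), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    current = closure({start})
    for c in x:
        current = closure({t for s in current for t in move.get((s, c), ())})
    return final in current

a, b = ("sym", "a"), ("sym", "b")
move = {}
start, final = thompson(("cat", ("star", ("alt", a, b)),
                         ("cat", a, ("cat", b, b))), move)   # (a|b)*abb
print(accepts("aabb", move, start, final), accepts("ab", move, start, final))
```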

Subset construction

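  • s : an NFA state

  • T : a set of NFA states

  • ε-closure(s): the set of NFA states reachable from s on ε-transitions alone

  • ε-closure(T): the set of NFA states reachable from some state in T on ε-transitions alone

  • move(T, a): the set of NFA states reachable by a transition on input symbol a from some state in T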

Subset construction (cont’d)

Let s0 be the start state of the NFA;

Initially, ε-closure(s0) is the only state in Dstates, and it is unmarked;

while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        DTran[T, a] := U
    end
end;

DFA

  • Let (∑, S, T, F, s0) be the original NFA. The DFA is:

  • The alphabet: ∑

  • The states: all states in Dstates

  • The transitions: DTran

  • The accepting states: all states in Dstates containing at least one accepting state in F of the NFA

  • The start state: ε-closure(s0)
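The subset construction and the resulting DFA can be sketched as follows. The NFA representation (a dict from (state, symbol) to a set of states, `""` for ε) and the example NFA for (a|b)*abb are assumptions chosen for illustration; DFA states are frozensets of NFA states.

```python
def eps_closure(T, move):
    """ε-closure(T) as a hashable frozenset."""
    stack, closure = list(T), set(T)
    while stack:
        for t in move.get((stack.pop(), ""), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(move, s0, F, alphabet):
    """Return (start, Dstates, DTran, accepting) for the equivalent DFA."""
    start = eps_closure({s0}, move)
    dstates, unmarked, dtran = {start}, [start], {}
    while unmarked:
        T = unmarked.pop()                     # mark T
        for a in alphabet:
            moved = {t for s in T for t in move.get((s, a), ())}
            U = eps_closure(moved, move)
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
            dtran[T, a] = U
    accepting = {T for T in dstates if T & F}  # contains an NFA accepting state
    return start, dstates, dtran, accepting

# Assumed example NFA for (a|b)*abb with states 0..3 and accepting state 3.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
start, dstates, dtran, accepting = subset_construction(NFA, 0, {3}, "ab")
print(len(dstates), sorted(map(sorted, accepting)))
```

If some move(T, a) is empty, the empty set becomes an ordinary DFA state that acts as a dead state; the example NFA never produces one.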

Minimise a DFA

Initially, create two states:

  • one is the set of all final states: F

  • the other is the set of all non-final states: S - F

    while (more splits are possible) {
        Let S = {s1,…, sn} be a state and c be any char in ∑
        Let t1,…, tn be the successor states to s1,…, sn under c
        if (t1,…, tn don't all belong to the same state) {
            Split S into new states so that si and sj remain in the
            same state iff ti and tj are in the same state
        }
    }

Example

[Figure: the DFA to be minimised, with states A, B, C, D, E and edges labelled a and b.]

Step 1: {A,B,C,D} {E}

  For a, {B,B,B,B}

  For b, {C,D,C,E}

  Split {A,B,C} {D} {E}

Step 2:

  For b, {C,D,C}

  Split {A,C} {B} {D} {E}

Step 3:

  For a, {B,B}

  For b, {C,C}

  Terminate

[Figure: the resulting minimal DFA, with states {A,C}, B, D and E and edges labelled a and b.]
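The splitting loop can be sketched as partition refinement. Two assumptions apply: the sketch splits a group on all input characters at once rather than one character per pass, which reaches the same final partition, and the transition table is reconstructed from the step listing above, with E's transitions (a to B, b to C) assumed because the listing does not show them.

```python
def minimise(states, alphabet, delta, finals):
    """Return the coarsest partition of `states` into equivalent groups."""
    partition = {frozenset(finals), frozenset(states - finals)} - {frozenset()}
    while True:
        def group_of(s):
            return next(g for g in partition if s in g)
        new_partition = set()
        for group in partition:
            buckets = {}
            for s in group:
                # states stay together iff their successors share groups
                key = tuple(group_of(delta[s, c]) for c in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition |= {frozenset(b) for b in buckets.values()}
        if new_partition == partition:         # no more splits possible
            return partition
        partition = new_partition

# Transition table inferred from the steps; row E is an assumption.
DELTA = {}
for s, (on_a, on_b) in {"A": "BC", "B": "BD", "C": "BC",
                        "D": "BE", "E": "BC"}.items():
    DELTA[s, "a"], DELTA[s, "b"] = on_a, on_b

groups = minimise(set("ABCDE"), "ab", DELTA, {"E"})
print(sorted("".join(sorted(g)) for g in groups))
```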

Input Buffering

[Figure: the scanner reads a two-half input buffer holding "begin…", terminated by eof.]

if (forward at end of first half) {
    reload second half
    forward++
} else if (forward at end of second half) {
    reload first half
    forward = 0
} else
    forward++

Input Buffering

[Figure: the same buffer with an eof sentinel closing each half; a further eof marks the end of the input.]

forward = forward + 1
if (forward↑ = eof) {
    if (forward at end of first half) {
        reload second half
        forward++
    } else if (forward at end of second half) {
        reload first half
        forward = 0
    } else
        terminate the analysis
}
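A minimal sketch of the sentinel scheme follows. The class name `TwoHalfBuffer`, the tiny half size and the use of `"\0"` as the eof sentinel (assumed never to occur in the source text) are illustrative choices; the sketch tracks only the forward pointer, not the beginning of the current lexeme.

```python
import io

EOF = "\0"   # sentinel character; assumed absent from the source text

class TwoHalfBuffer:
    def __init__(self, stream, n=4):
        self.stream, self.n = stream, n
        self.buf = [EOF] * (2 * n + 2)        # [half 1][EOF][half 2][EOF]
        self._reload(0)
        self.forward = -1                     # nothing read yet

    def _reload(self, base):
        chunk = self.stream.read(self.n)
        for k in range(self.n):
            # a short read leaves an EOF inside the half: end of input
            self.buf[base + k] = chunk[k] if k < len(chunk) else EOF

    def next_char(self):
        """Advance forward; one test per character in the common case."""
        n = self.n
        self.forward += 1
        if self.buf[self.forward] == EOF:
            if self.forward == n:             # end of first half
                self._reload(n + 1)
                self.forward += 1
            elif self.forward == 2 * n + 1:   # end of second half
                self._reload(0)
                self.forward = 0
            else:                             # eof inside a half
                self.forward -= 1
                return None                   # terminate the analysis
        ch = self.buf[self.forward]
        return None if ch == EOF else ch

def read_all(text, n=4):
    buffer = TwoHalfBuffer(io.StringIO(text), n)
    return "".join(iter(buffer.next_char, None))

print(read_all("begin := 1"))
```

The final check in `next_char` covers the case where the input ends exactly on a half boundary, so the freshly reloaded half starts with eof.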


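Transition Diagrams

relop → <= | < | <>

  • state 0, on <: go to state 1

  • state 1, on =: go to state 2, return(relop,LE)

  • state 1, on >: go to state 3, return(relop,NE)

  • state 1, on other: go to state 4, return(relop,LT)

id → letter(letter|digit)*

  • state 5, on letter: go to state 6

  • state 6, on letter or digit: stay in state 6

  • state 6, on other: go to state 7, return(id,lexeme)

A transition diagram is a DFA in which no edge leaves a final state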

Implementation

Token nexttoken() {
    while (true) {
        switch (state) {
        case 0: c = nextchar();
                if (c == '<') state = 1;
                else state = fail(0);          // not a relop: try the id diagram
                break;
        case 1: c = nextchar();
                if (c == '=') state = 2;
                else if (c == '>') state = 3;
                else state = 4;
                break;
        case 2: retract(0);
                return new Token(relop, "<=");
        case 3: retract(0);                    // state 3 of the diagram: <>
                return new Token(relop, "<>");
        case 4: retract(1);                    // give back the lookahead character
                return new Token(relop, "<");
        case 5: c = nextchar();
                if (Character.isLetter(c)) state = 6;
                else state = fail(5);
                break;
        case 6: c = nextchar();
                if (!(Character.isLetter(c) || Character.isDigit(c)))
                    state = 7;                 // otherwise stay in state 6
                break;
        case 7: retract(1);
                return new Token(id, getLexeme());
        }
    }
}

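Implementation (cont’d)

int fail(int current_state) {
    forward = beginning;               // rewind to the start of the lexeme
    switch (current_state) {
    case 0: return 5;                  // relop diagram failed: try the id diagram
    case 5: error();                   // no diagram matches
    }
    return -1;                         // reached only if error() returns
}

void retract(int flag) {
    if (flag == 1)
        move forward back              // the last character is not part of the lexeme
    get lexeme from beginning to forward
    move forward onward
    beginning = forward
    state = 0
}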

b│e│g│i│n│:│=│ │ │…
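For comparison, here is a runnable Python sketch of the same two diagrams. It simplifies the slide's design in ways that are assumptions of this sketch: the input is a plain string instead of a buffer, `"\0"` stands for end of input, retraction is expressed by simply not consuming the lookahead character, and `next_token` and `tokens` are hypothetical names.

```python
def next_token(text, begin):
    """Return (token, attribute, next_begin) for the lexeme starting at begin.

    States 0-4 recognise relop -> <= | < | <>; states 5-7 recognise
    id -> letter (letter | digit)*.
    """
    forward, state = begin, 0
    while True:
        c = text[forward] if forward < len(text) else "\0"
        if state == 0:
            if c != "<":
                state = 5                  # fail(0): restart in the id diagram
                continue                   # re-examine the same character
            state = 1
        elif state == 1:
            if c == "=":
                return "relop", "LE", forward + 1       # state 2
            if c == ">":
                return "relop", "NE", forward + 1       # state 3
            return "relop", "LT", forward               # state 4: retract
        elif state == 5:
            if not c.isalpha():
                raise SyntaxError(f"no token starts with {c!r}")   # fail(5)
            state = 6
        elif state == 6:
            if not (c.isalpha() or c.isdigit()):
                return "id", text[begin:forward], forward   # state 7: retract
        forward += 1

def tokens(text):
    """Tokenise text, skipping white space between lexemes."""
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        tok, attr, pos = next_token(text, pos)
        out.append((tok, attr))
    return out

print(tokens("a1 <= b <> c < d"))
```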
