Chapter 3: Lexical Analysis

Csci 465

Objectives


  • Discuss techniques for specifying and implementing lexical analyzers

  • Examine methods to recognize words in a stream of characters

    • Tokens, Patterns, Lexemes

    • Attributes for Tokens

  • Input Buffering (buffer pairs)

  • Finite automata (intermediate step)

    • DFA: faster but bigger

  • Implementing a Transition Diagram


  • Lex-i-cal: of or relating to words or the vocabulary of a language as distinguished from its grammar and construction

    • Webster’s Dictionary

Lexical analyzer features

  • Reads characters from the input file and reduces them to manageable tokens

  • Main features include

    • Efficiency

    • Correctness

Lexical Analysis vs. Parsing

  • Main reasons for separating lexical analysis from parsing

    • Compiler simplicity of design (separation of concerns)

    • Compiler efficiency (specialized buffering)

      • A large share of compile time is spent reading the source program and tokenizing it

      • Parsing is harder than lexical analysis because the size of the parser grows as the grammar grows

    • Compiler portability

      • Input peculiarities and device-specific anomalies can be confined to the lexical analyzer

      • Special symbols can be isolated in the LA

    • Lexical analysis can be fully automated

    • Tool support

      • Specialized tools have been implemented to automate the construction of lexers and parsers

Some terminology: Token, Pattern, Lexeme

  • Token (syntactic category)?

    • Terminal symbols in the grammar of the source language

    • A pair:

      • token name

      • optional attribute value

        • E.g., ID

  • Lexeme?

    • An actual spelling or a sequence of characters in the source program

      • E.g., MyCounter

  • Pattern?

    • The possible form that the lexemes of a token may take

      • E.g., an identifier can be specified as a regular expression: L(L|D)*, i.e., a letter followed by zero or more letters or digits

Token classes

  • The following classes cover most or all of the tokens:

    • One token for each keyword

      • IF, THEN, WHILE, FOR, etc.

    • Tokens for operators

      • +, -, /, *

    • One token for identifier

      • Mycounter, Myclass, x, y, p234, etc

    • Tokens for punctuation symbols

      • @, #, $, etc.

    • One or more tokens representing constants (numbers) and string literals

      • “mybook”

Lexical: examples of Non-Tokens

  • Examples of non-tokens

    • comment: /* do not change */

    • preprocessor directive: #include <stdio.h>

    • preprocessor directive: #define NUM 5

    • blanks

    • tabs

    • newlines

Attributes and Tokens: 1

  • When more than one lexeme matches a pattern, the LA must give the later phases of the compiler additional information about the particular lexeme that matched

    • E.g.,

      • the pattern num matches both 0 and 1; the code generator needs to know which one was matched

Attributes for Token: 2

  • LA uses attributes to document the needed information because

    • Tokens influence parsing decisions

    • Attributes influence the translation of tokens

Example: tokens and related attributes

  • E = M * C ** 2 is written as

    • <ID, ptr to symbol-table entry for E>

    • <Assignsym>

    • <ID, ptr to symbol-table entry for M>

    • <Multsym>

    • <ID, ptr to symbol-table entry for C>

    • <ExpSym>

    • <num, integer value 2>
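The token stream above can be reproduced with a small scanner. The following is only a sketch: the token names come from the slide, but the regular expressions, the list used as a symbol table, and the function name `tokenize` are assumptions made for illustration.

```python
import re

# Token names follow the slide; the patterns themselves are illustrative assumptions.
TOKEN_SPEC = [
    ("num", r"[0-9]+"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("ExpSym", r"\*\*"),       # listed before Multsym so "**" is not split into two "*"
    ("Multsym", r"\*"),
    ("Assignsym", r"="),
    ("SKIP", r"[ \t]+"),       # blanks and tabs are non-tokens
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table); an ID attribute is an index into symbol_table."""
    symbol_table = []          # hypothetical symbol table: a plain list of lexemes
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"no pattern matches at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue
        if kind == "ID":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))    # operators carry no attribute
    return tokens, symbol_table


tokens, table = tokenize("E = M * C ** 2")
```

Here the "pointer to the symbol-table entry" is modeled as a list index, which keeps the sketch short; a real compiler would store richer entries.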

Lexical Analyzer and source code errors

  • LA cannot detect syntax or semantic errors

  • It leaves them to the parser or the semantic analyzer

    • E.g., LA cannot detect the following error

    • fi (a == f(x))…

      • fi?

        • Could be a call to an undeclared function

        • Could be a misspelled keyword or ID

      • Will be treated as a valid id

Error Recovery and Error Handling by LA

  • Case where no pattern matches the current input

    • Delete successive characters from the input until the LA finds the next well-formed token (panic mode)

    • Delete an extraneous character

    • Insert a missing character

    • Replace an incorrect character with the correct one

    • Transpose two adjacent characters

Input Buffering

  • To find the end of a token, the LA may need to read one or more characters beyond the next lexeme

    • E.g.,

      • to find the end of an ID, or to decide among >, =, and ==

  • Buffer pairs

    • Address efficiency concerns

    • Used with lookahead on the input

Using a pair of input buffers

[Figure, repeated over four slides: two buffer halves of N (4096 bytes) each, with a forward pointer (Forward ptr) moving through them]
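A buffer pair can be sketched as follows. The slides show only the two N-sized halves and the forward pointer; the sentinel character at the end of each half, the class name `BufferPair`, and the helper `read_all` are assumptions added for illustration, and the pointer to the beginning of the lexeme is left out to keep the sketch short.

```python
import io

EOF = "\0"   # sentinel; assumes "\0" never occurs in the source text


class BufferPair:
    """Two n-character halves, each followed by one sentinel slot."""

    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        # layout: [half 0: n chars][sentinel][half 1: n chars][sentinel]
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = 0

    def _load(self, half):
        start = half * (self.n + 1)
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            # a short read leaves sentinels behind, which marks the real end of input
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Return the next character, or None at the end of the input."""
        c = self.buf[self.forward]
        if c != EOF:
            self.forward += 1
            return c
        if self.forward == self.n:              # sentinel closing the first half
            self._load(1)
            self.forward = self.n + 1
            return self.next_char()
        if self.forward == 2 * self.n + 1:      # sentinel closing the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                             # sentinel inside a half: end of input


def read_all(bp):
    """Drain the buffer pair and return everything it delivered."""
    out = []
    while (c := bp.next_char()) is not None:
        out.append(c)
    return "".join(out)


text = "E = M * C ** 2"
small = read_all(BufferPair(io.StringIO(text), n=4))   # forces several reloads
```

With the sentinel, the common case costs one comparison per character; the checks for which half has ended run only when a sentinel is actually seen.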

Specification of Tokens

  • Regular expressions are used to specify forms or patterns

  • Each pattern matches a set of strings

    • Where

      • A string is a finite sequence of symbols over an alphabet, denoted by Σ

      • ASCII and EBCDIC are two examples of computer alphabets

  • Language?

    • Denotes any set of strings over some fixed alphabet

      • Where alphabet denotes any finite set of symbols

      • E.g.

        • The set of binary numbers, over the alphabet {0,1}

        • The set of all well-formed Pascal programs

Operations on Languages

  • Important operations that can be applied to languages are:

    • Union of R and S, written R ∪ S

      • R ∪ S = {x | x ∈ R ∨ x ∈ S}

      • i.e., Language L(R) ∪ L(S)

    • Concatenation of R and S, written RS

      • RS = R.S = {xy | x ∈ R ∧ y ∈ S}

      • i.e., Language L(R)L(S)

    • Kleene closure of R, written R*

      • R* = {ε} | R | RR | RRR | …

      • i.e., (L(R))*

    • Positive closure of R, written R+

      • R+ = R | RR | RRR | …


  • Suppose:

    • L = { A, B,…Z,a,b,…z} and

    • D = {0,1,…,9}

  • New languages can be created from L and D by applying the operators

    • L ∪ D is the set of letters and digits (62 strings, each of length 1)

      • E.g., a, A, 1, b, …

    • LD is the set of strings consisting of a letter followed by a digit

      • E.g., a1, a2, a3, b9, etc.

    • L⁴ is the set of all four-letter strings

      • E.g., aaaa, aadd, axcv, etc.

More examples

  • L* is the set of all strings of letters, including ε, the empty string

  • L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter

    • E.g., a, aa, a1, …,a211111

  • D+ is the set of all strings of one or more digits
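The language operations can be tried out on Python sets. This is a sketch only: the function names are made up for the example, and because L* and D+ are infinite, the closure is cut off at a fixed number of concatenations.

```python
import string
from itertools import product

L = set(string.ascii_letters)    # {A..Z, a..z}
D = set(string.digits)           # {0..9}


def concat(r, s):
    """RS = {xy | x in R and y in S}."""
    return {x + y for x, y in product(r, s)}


def power(r, k):
    """R^k: k-fold concatenation of R with itself; R^0 = {ε} (the empty string)."""
    result = {""}
    for _ in range(k):
        result = concat(result, r)
    return result


def closure_up_to(r, k):
    """Finite approximation of R*: the union of R^0 through R^k."""
    out = set()
    for i in range(k + 1):
        out |= power(r, i)
    return out


letters_or_digits = L | D                           # union
letter_then_digit = concat(L, D)                    # LD
ids_up_to_3 = concat(L, closure_up_to(L | D, 2))    # L(L ∪ D)*, limited to length 3
```

For instance, `len(letters_or_digits)` is 62 and `len(letter_then_digit)` is 52 × 10 = 520, matching the counts implied by the definitions.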

Regular Expression: Formal Definition

  • A regular expression is a formal expression that can be specified according to these rules

    • ε is a RE that denotes {ε}, the set containing only the empty string

    • If a is a symbol in Σ, then a is a regular expression and L(a) = {a}

    • If r and s are REs denoting the languages L(r) and L(s), then

      • (r)|(s) is a RE denoting L(r) ∪ L(s)

      • (r)(s) is a RE denoting L(r)L(s)

      • (r)* is a RE denoting (L(r))*

      • (r) is a RE denoting L(r)

RE: Precedence rules

  • Unnecessary parentheses can be avoided if we adopt the following rules

    • * has the highest precedence and is left associative

    • Concatenation has second highest precedence and is left associative

    • Union has the lowest precedence and is left associative

Some examples

  • Let Σ = {a, b}

    • The RE a|b denotes the set {a, b}

    • The RE (a|b)(a|b) denotes

      • {aa, ab, ba, bb} (i.e., the set of all strings of a’s and b’s of length two)

    • The RE a* denotes the set of all strings of zero or more a’s

      • {ε, a, aa, aaa, …}

    • The RE (a|b)* denotes the set of all strings of zero or more instances of a or b

      • {ε, a, b, aa, ab, ba, bb, …}
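These sets can be checked mechanically by enumerating short strings over Σ and keeping those the RE matches in full. The sketch below uses Python's standard `re` module; the helper name `language` and the length cut-off are assumptions for illustration.

```python
import re
from itertools import product

SIGMA = "ab"


def language(pattern, max_len):
    """All strings over SIGMA of length <= max_len that the pattern matches in full."""
    rx = re.compile(pattern)
    matched = []
    for n in range(max_len + 1):
        for chars in product(SIGMA, repeat=n):
            s = "".join(chars)
            if rx.fullmatch(s):
                matched.append(s)
    return matched


union = language("a|b", 2)
pairs = language("(a|b)(a|b)", 2)
a_star = language("a*", 3)
any_ab = language("(a|b)*", 2)
```

The empty string ε shows up as `""` in the results for `a*` and `(a|b)*`.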

Regular Language

  • A language L is regular iff

    • there exists a regular expression that specifies the strings in L

  • If R and S are regular expressions, then they define the regular languages L(R) and L(S)


  • Examples

    • L(abc) = {abc}

    • L(Hello | Bye) = {Hello, Bye}

    • L([1-9][0-9]*) = all positive integer constants without leading zeros

      • where

        • [1-9] means (1|…|9)

Algebra of RE (see fig. 3.7)

  • Regular set: A language that can be defined by RE

  • If two REs r and s generate the same set, we say they are equivalent and write r = s

  • E.g.,

    • (a|b) = (b|a)

Regular Definitions

  • For notational convenience, we may give names to REs and define further REs using these names, as a sequence of definitions of the form d_i → r_i

    • Where:

      • Each d_i is a new symbol, not in Σ, and not the same as any other of the d’s

      • Each r_i is a RE over the alphabet Σ ∪ {d_1, …, d_{i-1}}

Example 3.5 (pg 123)

  • E.g.,

    • C identifiers are strings of letters, digits, and underscores, and can be defined by the following regular definitions:

      • letter_ → A|B|…|Z|a|b|…|z|_

      • digit → 0|1|…|9

      • id → letter_ (letter_ | digit)*
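The regular definition for id translates almost directly into a Python regular expression, with each named definition becoming a string that later definitions reuse. This is a sketch; the helper `is_identifier` is an assumed name, and reserved words are not excluded.

```python
import re

# Each regular definition becomes a named fragment built from the earlier ones.
LETTER_ = r"[A-Za-z_]"
DIGIT = r"[0-9]"
ID = re.compile(f"{LETTER_}({LETTER_}|{DIGIT})*")


def is_identifier(s):
    """True iff the whole string s matches the id pattern."""
    return ID.fullmatch(s) is not None
```

For example, `is_identifier("MyCounter")` and `is_identifier("_tmp1")` hold, while `is_identifier("9lives")` does not because the first character must be a letter or underscore.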

Example: Unsigned numbers in Pascal

  • Unsigned numbers in Pascal are strings such as

    • 5280

    • 78.90

    • 6.336E4

    • 1.89E-4

  • Regular definitions

    • digit → 0|1|…|9

    • digits → digit digit*

    • optional_fraction → . digits | ε

    • optional_exp → (E (+ | - | ε) digits) | ε

    • number → digits optional_fraction optional_exp

Shorthand Notation

  • Character classes

    • [abc], where a, b, and c are alphabet symbols, is shorthand for the RE a|b|c

    • [a-z] shorthand for a|b|…|z

Limitations of RE

  • REs cannot be used to describe some programming constructs

    • E.g.,

      • Balanced parentheses

      • Repeating strings

        • {wcw| w is a string of a’s and b’s}

        • REs can express only a fixed number of repetitions or an unspecified (arbitrary) number of repetitions

Recognition of Tokens

  • REs are used to specify patterns

    • Used mainly to specify the patterns for ALL possible tokens in a language

  • How to recognize tokens is a separate issue


  • Consider the following grammar

    • stmt → if exp then stmt

    • | if exp then stmt else stmt

    • | ε

    • exp → term relop term

    • | term

    • term → id

    • | num

Quiz 3: 9.20.2013

  • Describe the language denoted by the following RE

    • a(a|b)*a

Goal: Building lex

  • Our goal is to build an LA that identifies the lexeme for the next token in the input buffer and produces as output a pair consisting of the token and its attributes

    • E.g.

      • Id: the RE specifies id, and the LA passes the token id with its attributes to the parser

Transition diagram

  • An intermediate but important step in implementing the LA

  • A transition diagram represents the actions that must take place when the LA is called by the parser

    • Used to keep track of information about characters as they are scanned by the forward pointer and the beginning pointer

Deterministic Finite Automata (DFA)

For every language defined by a RE, there exists a DFA that recognizes the same language

An FSA can be defined as

M = (Σ, Q, T, q0, F)

Σ: alphabet

Q: a finite set of states

T: Q × Σ → Q, a finite set of transition rules (a partial function)

q0: start state

F: set of final/halting states

Simple DFA

[Figure: a simple DFA; input symbols: a, d]


Automata for IF

Automata for >=

Combine Automata for each token

The final automaton can be created by combining the individual automata

Augmenting with action

RE: Review

More and More Example

Error Handling using RE

  • One can add special REs that match erroneous tokens

  • Example:

    • A fixed-point number with no digit after the “.” is a very common error, so an RE is added to match it

    • Integer_Num::= [0-9]+

    • Fixed_point_num::= [0-9]* ’.’ [0-9]+

    • Bad_Fixed_point_num::= [0-9]*’.’

      • The above specification will generate the token bad_fixed_point_num when the erroneous input is detected

      • Action: correcting the lexeme by appending a 0 is very important

        • It lets later routines proceed with the compilation instead of crashing on the incorrect input
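The error-token idea can be sketched with Python's `re` module. The three patterns are the ones given above; the function name `scan_number`, the returned tuple layout, and the warning text are assumptions for illustration.

```python
import re

# Alternatives are tried in order, so the well-formed fixed-point pattern goes first.
TOKEN_RE = re.compile(
    r"(?P<fixed_point_num>[0-9]*\.[0-9]+)"
    r"|(?P<bad_fixed_point_num>[0-9]*\.)"
    r"|(?P<integer_num>[0-9]+)"
)


def scan_number(text):
    """Classify text as (token, lexeme, warning), or return None if nothing matches."""
    m = TOKEN_RE.fullmatch(text)
    if m is None:
        return None
    kind, lexeme = m.lastgroup, m.group()
    if kind == "bad_fixed_point_num":
        # Repair action: append a 0 so later phases receive a valid number.
        return ("fixed_point_num", lexeme + "0", "digit expected after '.'")
    return (kind, lexeme, None)
```

With this sketch, `scan_number("12.")` yields the repaired lexeme `12.0` together with a warning, while `scan_number("3.14")` and `scan_number("42")` pass through unchanged.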

Simple implementation of DFA

  • Input: an input string x terminated by eof, and a DFA D with start state s0 and set of accepting states F

  • Output: Yes if D accepts x, No otherwise

Simple Implementation

s := s0

c := getnextchar()

while c ≠ eof do

    s := move(s, c)

    c := getnextchar()

end

if s ∈ F then

    return “yes”

else return “no”
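The same loop can be written in Python as a sketch. The transition function T is stored as a dictionary, which makes the "partial function" explicit: a missing entry means there is no move. The function name `run_dfa` and the example DFA for a(a|d)* are hypothetical and chosen only to exercise the loop.

```python
def run_dfa(transitions, start, accepting, text):
    """Return True iff the DFA accepts text.

    transitions maps (state, symbol) -> state and may be partial.
    """
    state = start
    for ch in text:                  # the for loop stands in for getnextchar() and eof
        state = transitions.get((state, ch))
        if state is None:            # no transition defined: reject
            return False
    return state in accepting


# Hypothetical DFA for a(a|d)* over the alphabet {a, d}; state 1 is accepting.
T = {(0, "a"): 1, (1, "a"): 1, (1, "d"): 1}
accepted = run_dfa(T, 0, {1}, "add")
rejected = run_dfa(T, 0, {1}, "da")
```

The running time is proportional to the length of the input, since each character triggers exactly one table lookup.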

More example

continue

Code to implement id

Nondeterministic FA (NFA)

  • NFA

    • Differs from the deterministic model in two ways

      • For any given state and input symbol, there may exist more than one transition

      • A state transition can occur without reading an input symbol; this is called an empty (ε) transition

Example

  • NFA for (a|b)*abb

Possible paths: {<0, 0, 1, 2, 3>, <0, 0, 0, 0, 0>}

Table

DFA
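An NFA can be simulated by tracking the set of states it could be in after each symbol. The sketch below assumes the usual four-state NFA for (a|b)*abb (state 0 loops on a and b, then a, b, b lead to states 1, 2, 3); the diagram itself did not survive in this transcript, so that transition table is an assumption, as is the input `aabb` used to enumerate paths.

```python
# Assumed transition table for the NFA of (a|b)*abb; state 3 is accepting.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}


def nfa_accepts(text):
    """Track every state the NFA could be in; accept if any of them is final."""
    current = {START}
    for ch in text:
        nxt = set()
        for s in current:
            nxt |= NFA.get((s, ch), set())
        current = nxt
    return bool(current & ACCEPTING)


def paths(text, state=START):
    """Enumerate every complete state sequence the NFA can follow on text."""
    if not text:
        return [[state]]
    result = []
    for nxt in sorted(NFA.get((state, text[0]), set())):
        for rest in paths(text[1:], nxt):
            result.append([state] + rest)
    return result


all_paths = paths("aabb")
```

Under these assumptions, the input `aabb` has exactly the two paths listed above, and the input is accepted because one of them ends in state 3.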

More example

  • NFA to recognize aa*|bb*

All kinds of Transformation

  • There are transformations in both directions: from automata to REs and from REs to automata

Lexical Analyzer: Implementation Approaches

General approaches to implementing a Lexical Analyzer (LA)

1. Use a tool such as Lex

2. Write the LA in a programming language

3. Write the LA in assembly language (difficult but efficient)

Main Approach

Hand written approach

Option two: Using Tool to generate Lex

Example

More on Lexer generators

From RE to Lexical Analyzer

  • The idea behind FA is to automate the generation of executable scanners from REs

    • RE → NFA → DFA → Code

The cycle of Construction

  • The following mappings are needed

    • RE → NFA mapping (Thompson’s construction)

    • NFA → DFA mapping (subset construction)

    • DFA → DFA mapping (minimization)
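The NFA → DFA step can be sketched as follows. Each DFA state is a set of NFA states, and new sets are explored from a worklist. The function names, the dictionary encoding of the NFA, and the example NFA for (a|b)*abb are assumptions for illustration.

```python
from collections import deque


def epsilon_closure(states, eps):
    """All NFA states reachable from `states` using only empty transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(delta, eps, start, accepting, alphabet):
    """delta: (state, symbol) -> set of states; eps: state -> set of states."""
    d_start = epsilon_closure({start}, eps)
    d_trans, seen, work = {}, {d_start}, deque([d_start])
    while work:
        S = work.popleft()
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= delta.get((s, a), set())
            if not moved:
                continue                     # no move on a: leave it undefined
            T = epsilon_closure(moved, eps)
            d_trans[(S, a)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    d_accepting = {S for S in seen if S & accepting}
    return d_start, d_trans, d_accepting, seen


# Assumed NFA for (a|b)*abb: no empty transitions, accepting state 3.
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
d_start, d_trans, d_accepting, d_states = subset_construction(
    delta, {}, 0, {3}, "ab"
)
```

For this NFA the construction yields four DFA states, {0}, {0,1}, {0,2}, and {0,3}, with {0,3} accepting. In the worst case the number of DFA states is exponential in the number of NFA states.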

More Example

Rules to describe the behavior of an NFA

  • Two models describe the behavior of an NFA

    • The NFA maintains the well-defined accepting mechanism of the DFA

    • Or, for any given input, the NFA clones itself (one configuration per choice) to pursue each possible transition

  • An NFA accepts iff there exists at least one path from the start state to a final state

  • Any NFA can be simulated by a DFA

RE to NFA: Thompson’s Construction

Applying Thompson’s rules to a(b|c)*

The worst-case space and time complexity for RE using FSA

r: length of RE

x: length of input string