
Lexical Analysis Part 1

CMSC 431

Shon Vick

Lexical Analysis – What’s to come
  • A grammar could be written over individual characters, so that parse trees go down to the character level
    • But that is machine specific, obscures the parsing, and is cumbersome
  • Lexical analysis is a firewall between program representation and parsing actions
    • A prior lexical analysis phase produces tokens, each consisting of a type (e.g., ID) and a value (the lexeme matched)
  • In Principle – simple transition diagrams (finite state automata) characterize each of the “things” that can be recognized
  • In Practice – a program combines the multiple automata definitions into an efficient state machine
Lexical Phase
  • Simple (non-recursive)
  • Efficient (special purpose code)
  • Portable (ignore character-set and architecture differences)
  • Use JavaCC, lex, flex, etc.
  • Used in practice with Bison/Yacc, etc.
Lexical Processing
  • Token: terminal symbols in a grammar. At the lexical level this is a symbol constant, and in “print” is represented in bold
  • Pattern: set of matching strings. For a keyword it is a constant. For a variable or value it can be represented by a regular expression
  • Lexeme: character sequence matched by an instance of the token
Lexical Processing
  • Token attributes: pointer to a symbol-table entry, may include the lexeme, scope information, etc.
  • Languages may have special rules (e.g., PL/I does not have reserved words, and Fortran allows spaces within variable names; both are obscure design choices)
Lexical Analysis – sequences
  • Expression
    • base * base - 0x4 * height * width
  • Token sequence
    • name:base operator:times name:base operator:minus hexConstant:4 operator:times name:height operator:times name:width
  • Lexical phase returns token and value (yylval, yytext, etc.)
  • Token attributes: pointer to a symbol-table entry, may include the lexeme, scope information, etc.
  • Tokens are specified formally by regular expressions, which requires defining an alphabet, strings, and languages
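
A minimal sketch of how the expression above could be split into (token, lexeme) pairs, using Python's `re` module. The token names (`name`, `hexConstant`, `operator`) follow the slide; the `TOKEN_SPEC` table, the `skip` entry, and the `tokenize` function are illustrative names chosen for this sketch, not part of lex, flex, or JavaCC.

```python
import re

# (token type, pattern) pairs. Python tries alternatives left to right,
# so hexConstant is listed first to claim "0x4" as a single lexeme.
TOKEN_SPEC = [
    ("hexConstant", r"0[xX][0-9a-fA-F]+"),
    ("name",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator",    r"[*\-+/]"),
    ("skip",        r"[ \t]+"),   # white space: recognized but not returned
]
MASTER = re.compile("|".join(f"(?P<{t}>{p})" for t, p in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token type, lexeme) pairs for text."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

tokens = tokenize("base * base - 0x4 * height * width")
```

Here the value kept for `hexConstant` is the lexeme `0x4`; converting it to the number 4 (for example with `int(lexeme, 16)`) would be a separate step.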
Regular Expression Notation
  • a: an ordinary letter from our alphabet
  • ε: the empty string
  • r1 | r2: choosing from r1 or r2
  • r1r2 : concatenation of r1 and r2
  • r*: zero or more times (Kleene closure)
  • r+: one or more times
  • r?: zero or one occurrence
  • [a-zA-Z] character class (choice)
  • . period stands for any single character except newline
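
Most of this notation can be tried directly in Python's `re` module, which uses the same operators. The sketch below is only a quick check of each operator; the `matches` helper is a name introduced here, and ε has no literal in `re` (the empty pattern plays that role).

```python
import re

def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

examples = [
    ("a|b",      "b",  True),    # r1 | r2: choice
    ("ab",       "ab", True),    # r1r2: concatenation
    ("a*",       "",   True),    # r*: zero or more times
    ("a+",       "",   False),   # r+: needs at least one occurrence
    ("ab?",      "a",  True),    # r?: zero or one occurrence
    ("[a-zA-Z]", "Q",  True),    # character class
    (".",        "\n", False),   # . does not match newline by default
]
for pattern, text, expected in examples:
    assert matches(pattern, text) == expected
```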
Semantics of Regular Expressions
  • L(ε) = {ε}
  • L(a) = {a} for all a in Σ
  • L(r1 | r2) = L(r1) ∪ L(r2)
  • L(r1 r2) = { xy | x in L(r1), y in L(r2) }
  • L(r*) = {ε} ∪ { x | x in L(r) } ∪ { x1 x2 | x1, x2 in L(r) } ∪ … ∪ { x1 … xn | x1, …, xn in L(r) } ∪ …
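
The set definitions above can be turned into a small program almost line by line. The sketch below assumes a regular expression is given as nested tuples (a representation invented for this example) and returns only the strings of L(r) up to a length bound, since L(r*) is usually infinite.

```python
def lang(r, max_len):
    """Strings of L(r) with length <= max_len.

    r is a nested tuple: ("eps",), ("sym", a), ("alt", r1, r2),
    ("cat", r1, r2), or ("star", r1).
    """
    kind = r[0]
    if kind == "eps":
        return {""}                        # L(ε) = {ε}
    if kind == "sym":
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":                      # union
        return lang(r[1], max_len) | lang(r[2], max_len)
    if kind == "cat":                      # pairwise concatenation
        left, right = lang(r[1], max_len), lang(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                     # ε, then 1, 2, ... copies
        base = lang(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regular expression node {kind!r}")

a_or_b_star = ("star", ("alt", ("sym", "a"), ("sym", "b")))   # (a|b)*
short_strings = lang(a_or_b_star, 2)
```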

For Homework
  • Suppose Σ is {a, b}

What is the regular expression for:

      • All strings beginning and ending in a?
      • All strings with an odd number of a’s?
      • All strings without two consecutive a’s?
      • All strings with an odd number of b’s followed by an even number of a’s
  • What’s the description for a Java floating point number?
  • What’s the description of variable name in Java?
Why we care about Regular Expressions
  • For every regular expression, there is a deterministic finite-state machine that defines the same language, and vice versa
  • Diagram labels: Specification of Tokens, Implementation of DFA

Regular Expressions
  • An automaton is a good “visual” aid
    • but is not suitable as a specification (its textual description is too clumsy)
  • However regular expressions are a suitable specification
    • a compact way to define a language that can be accepted by an automaton.
RegExp Use and Construction
  • Used as the input to a scanner generator like lex or flex or JavaCC
    • define each token, and also
    • define white-space, comments, etc
      • these do not correspond to tokens, but must be recognized and ignored.
  • An NFA can be constructed from a RegExp via Thompson’s Construction
Thompson’s Construction
  • There are building blocks for each regular expression operator
  • More complex RegExps are constructed by composing smaller building blocks
  • Assumes that the NFAs at each step of the construction will have a single accepting state
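
A minimal sketch of the building-block idea, assuming an NFA fragment is stored as a start state, a single accepting state, and a list of `(from, label, to)` edges. The names `Fragment`, `symbol`, `concat`, `alt`, and `star` are introduced for this example only. One deviation from the usual textbook drawing: `concat` links the two fragments with an ε-edge instead of merging two states.

```python
import itertools

EPS = None                      # label used for ε-transitions
_ids = itertools.count()        # source of fresh state numbers

class Fragment:
    """An NFA piece with one start state and one accepting state."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def symbol(a):
    s, f = next(_ids), next(_ids)
    return Fragment(s, f, [(s, a, f)])

def concat(m, n):
    # m's accepting state continues into n's start state.
    return Fragment(m.start, n.accept,
                    m.edges + n.edges + [(m.accept, EPS, n.start)])

def alt(m, n):
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, m.start), (s, EPS, n.start),
             (m.accept, EPS, f), (n.accept, EPS, f)]
    return Fragment(s, f, m.edges + n.edges + extra)

def star(m):
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, m.start), (s, EPS, f),            # enter or skip m
             (m.accept, EPS, m.start), (m.accept, EPS, f)]  # repeat or leave
    return Fragment(s, f, m.edges + extra)

# Composing blocks for (a|b)*c:
nfa = concat(star(alt(symbol("a"), symbol("b"))), symbol("c"))
```

Every builder returns a fragment with exactly one accepting state, which is what lets the next builder attach to it without inspecting its insides.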
Regular Expressions to NFA (1)
  • For each kind of rexp, define an NFA
    • Notation: NFA for rexp M
  • For ε
  • For input a
Regular Expressions to NFA (2)
  • For A B
  • For A | B
  • What would be representation for A+?
  • What would be representation for A??
  • What about for [a-z]?
Example of RegExp -> NFA conversion
  • Consider the regular expression [expression not preserved]
  • The NFA is [figure not preserved]
More Homework Problems
  • What is the NFA for the following RE?

(a(b+c))* a

  • What is the NFA for the following RE?

((a|b)*c) | (a b c*)

Lexical Analyzer
  • Can be programmed in a high-level language.
  • Can be generated using tools like LEX/Flex
  • Integrate these tools with C/C++ or Java code
  • For Java there are other tools, JFlex for example
How can a tool like LEX or JAVACC work?
  • Translate regular expressions to Non-deterministic Finite Automata (NFA)
    • An NFA is an easier form to express than a DFA
    • Automata theory tells us how to optimize
  • Run the automata
    • Simulate NFA, or
    • Translate NFA to DFA: a new DFA where each state corresponds to a set of NFA states (see pages 28-29 of Appel for the set construction)
      • The DFA moves between its states to simulate the sets of states the NFA could be in
Non-deterministic FA
  • An NFA is an FA modified to allow zero, one, or more transitions from a state on the same input symbol
  • Easier to express complex patterns as NFA
  • Harder to simulate an NFA mechanically: which transition do we take on an input symbol? (Simulate all of them, then check whether any path ended in an accepting state)
  • DFAs and NFAs are functionally equivalent: they accept the same class of languages.
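
The "simulate all of them" idea can be sketched by tracking the set of states the NFA could be in after each input symbol. The function name `nfa_accepts` and the dictionary encoding of the transitions are assumptions made for this example, and the sketch covers only NFAs without null moves.

```python
def nfa_accepts(delta, start, accepting, text):
    """Simulate an NFA (no ε-moves) by following every possible transition.

    delta maps (state, symbol) to a set of next states; a missing key
    means there is no transition on that symbol.
    """
    current = {start}
    for ch in text:
        current = {t for s in current for t in delta.get((s, ch), set())}
        if not current:          # every alternative has died
            return False
    return bool(current & accepting)

# Example NFA over {a, b} for strings ending in "ab".
# State 0 has two transitions on 'a': stay in 0, or guess the final "ab" began.
ENDS_IN_AB = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
result = nfa_accepts(ENDS_IN_AB, 0, {2}, "bab")
```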
NFA with null moves
  • The model of NFA can be extended to include transitions on null (ε) input.
  • Such a transition changes the state without reading any symbol from the input stream.
  • ε-closure(q): set of all states reachable from q without reading any input symbol (following the null edges)
eClosure Operator
  • The eClosure operator is defined as eClosure(s) = { s } ∪ states reachable from s using ε transitions.
  • Example: eClosure(1) = {1,3} (for the NFA in the slide's figure, which is not preserved here)
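
The definition of eClosure can be sketched as a small graph search. The function name `eps_closure` and the `eps_edges` dictionary are assumptions for this example, and the edge set is made up so that the closure of state 1 comes out as {1, 3}; it is not the NFA from the slide's figure.

```python
def eps_closure(states, eps_edges):
    """Return all states reachable from `states` using only ε-edges.

    eps_edges maps a state to the set of states one ε-move away.
    """
    closure = set(states)        # every state is in its own closure
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps_edges.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

# Hypothetical ε-edges: 1 -> 3 and 2 -> 1.
EPS_EDGES = {1: {3}, 2: {1}}
closure_of_1 = eps_closure({1}, EPS_EDGES)
```

Passing a set of states rather than a single state means the same function also computes the closure of a whole set S.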
RE to FA
  • If we write an expression as an RE (easy for people), how do we turn it into an FA (something a machine can simulate)?
  • Use Thompson’s Construction
    • At most twice as many states as there are symbols and operators in the regular expression.
    • Results in an NFA (which needs a non-deterministic computer to run most efficiently, hmm...)
NFA to DFA
  • Build “super states” in a DFA, where each “super state” represents the set of NFA states that the NFA could be in after reading the same input
  • ε-closure(q): states that can be reached from q with just null (ε) transitions
    • move(S, a): states that can be reached from some state in S on scanning a symbol a (from the input)
    • ε-closure(S): states that can be reached with ε transitions from states in S
NFA to DFA (cont….)
  • Subset Construction (alg 3.2)

FAStates = { ε-closure(q0) }, initially unmarked
while (there is an unmarked S in FAStates) {
    mark S
    for each a in alphabet {
        T = ε-closure( move(S, a) );
        if (T ∉ FAStates)
            FAStates.include( T );    (T is added unmarked)
        FATran[S, a] = T;
    }
}
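
The pseudocode above can be sketched in Python as follows. The encoding of the NFA (a `delta` dictionary for symbol moves and an `eps` dictionary for ε-moves) and the function names are assumptions for this example. One labeled deviation: when move(S, a) is empty the sketch leaves FATran[S, a] undefined instead of creating an explicit dead state.

```python
from collections import deque

def eps_closure(states, eps):
    """States reachable from `states` by ε-moves, as a hashable frozenset."""
    closure, stack = set(states), list(states)
    while stack:
        for t in eps.get(stack.pop(), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, a, delta):
    """States reachable from any state in `states` on symbol a."""
    return {t for s in states for t in delta.get((s, a), ())}

def subset_construction(q0, alphabet, delta, eps):
    """Return (start, FAStates, FATran) of a DFA equivalent to the NFA."""
    start = eps_closure({q0}, eps)
    fa_states, fa_tran = {start}, {}
    unmarked = deque([start])
    while unmarked:
        S = unmarked.popleft()               # "mark S"
        for a in alphabet:
            T = eps_closure(move(S, a, delta), eps)
            if not T:
                continue                     # deviation: no dead state is created
            if T not in fa_states:
                fa_states.add(T)
                unmarked.append(T)
            fa_tran[S, a] = T
    return start, fa_states, fa_tran

# Hand-built NFA for a c* b: 0 -a-> 1, 1 -ε-> 2, 2 -c-> 2, 2 -b-> 3.
DELTA = {(0, "a"): {1}, (2, "c"): {2}, (2, "b"): {3}}
EPS = {1: {2}}
start, states, tran = subset_construction(0, "abc", DELTA, EPS)
```

A DFA state is accepting when its set contains an accepting NFA state; here that is any set containing 3.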

DFA vs. NFA
  • For a regular expression r and an input string x:
  • NFA is smaller, O(|r|) space, but simulation takes more time, O(|r| × |x|), even with the nice properties of Thompson’s construction
  • DFA is faster, O(|x|) time, but is not space efficient: O(2^|r|) space in the worst case
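
The O(|x|) running time of a DFA comes from doing exactly one table lookup per input character, as in the sketch below. The function name `dfa_accepts`, the dictionary used as the transition table, and the state names A, B, C are all choices made for this example.

```python
def dfa_accepts(table, start, accepting, text):
    """Run a DFA: one table lookup per input character."""
    state = start
    for ch in text:
        state = table.get((state, ch))
        if state is None:        # undefined entry acts as a dead state
            return False
    return state in accepting

# Example DFA over {a, b} for strings ending in "ab".
# A: no progress, B: just read 'a', C: just read "ab".
ENDS_IN_AB_DFA = {
    ("A", "a"): "B", ("A", "b"): "A",
    ("B", "a"): "B", ("B", "b"): "C",
    ("C", "a"): "B", ("C", "b"): "A",
}
ok = dfa_accepts(ENDS_IN_AB_DFA, "A", {"C"}, "bab")
```

The loop never holds more than one current state, which is the difference from simulating an NFA, where a whole set of states must be updated at every step.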
NFA to DFA
  • What is the difference between the two?
  • Is there a single DFA for a corresponding NFA?
  • Why do we want to do this anyway?
Subset Construction for NFA-> DFA
  • Compute A = eClosure(start)
  • Compute the set of states reachable from A on transition a, call this new set S’
  • Compute eClosure(S’) – this is the new state and label it with the next available label
  • Continue for all possible transitions from the current state, for each applicable symbol of Σ
  • Repeat steps 2-4 for each new state
Example: a c*b
  • [figure not preserved]
References
  • Compilers: Principles, Techniques, and Tools, Aho, Sethi, Ullman, Chapter 3
  • http://www.cs.columbia.edu/~lerner/CS4115
  • Modern Compiler Implementation in Java, Andrew Appel, Cambridge University Press, 2003