Presentation Transcript

Lexical Analysis Part 1

CMSC 431

Shon Vick

Lexical Analysis – What’s to come
  • Programs could, in principle, be parsed directly from characters, with parse trees going all the way down to the character level
    • Machine specific, obfuscates parsing, cumbersome
  • Lexical analysis is a firewall between the program representation and the parsing actions
    • A prior lexical analysis phase obtains tokens consisting of a type (e.g., ID) and a value (the lexeme matched); a minimal token sketch in Java follows this list
  • In Principle – simple transition diagrams (finite state automata) characterize each of the “things” that can be recognized
  • In Practice – a program combines the multiple automata definitions into an efficient state machine
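
To make the type/value pairing concrete, here is a minimal sketch of a token class in Java; the class and TokenType names are illustrative choices, not taken from the course materials.

    // Minimal sketch: a token is a (type, lexeme) pair.
    // The TokenType names below are illustrative, not from the course.
    public class Token {
        public enum TokenType { ID, NUMBER, OPERATOR, KEYWORD }

        public final TokenType type;   // the token's class, e.g. ID
        public final String lexeme;    // the characters actually matched, e.g. "height"

        public Token(TokenType type, String lexeme) {
            this.type = type;
            this.lexeme = lexeme;
        }

        @Override
        public String toString() {
            return type + ":" + lexeme;
        }
    }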
Lexical Phase
  • Simple (non-recursive)
  • Efficient (special purpose code)
  • Portable (ignore character-set and architecture differences)
  • Use JavaCC, lex, flex, etc.
  • Used in practice with Bison/Yacc, etc.
Lexical Processing
  • Token: a terminal symbol in the grammar; at the lexical level it is a symbolic constant, and in print it is shown in bold
  • Pattern: the set of strings that match the token. For a keyword it is a literal constant; for a variable or value it can be described by a regular expression
  • Lexeme: the character sequence actually matched by an instance of the token (a short example follows this list)
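
As a short, hypothetical illustration of the three terms (using java.util.regex rather than a generated scanner): the token is ID, the pattern is the regular expression, and each lexeme is the particular substring the pattern matched.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PatternVsLexeme {
        public static void main(String[] args) {
            // Pattern for the token ID: a regular expression describing all identifiers.
            Pattern idPattern = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

            // Lexeme: the particular character sequence the pattern matched.
            Matcher m = idPattern.matcher("width = height * 2");
            while (m.find()) {
                System.out.println("token ID, lexeme \"" + m.group() + "\"");
            }
            // prints the lexemes "width" and "height"
        }
    }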
Lexical Processing
  • Token attributes: pointer to a symbol-table entry, may include the lexeme, scope information, etc.
  • Languages may have special rules (e.g., PL/1 has no reserved words and Fortran allows spaces inside variable names; both are awkward design choices for the scanner)
Lexical Analysis – sequences
  • Expression
    • base * base - 0x4 * height * width
  • Token sequence
    • name:base operator:times name:base operator:minus hexConstant:4 operator:times name:height operator:times name:width
  • The lexical phase returns the token and its value (yylval, yytext, etc.); a toy scanner for this expression follows this list
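
A toy version of such a scanner, sketched here with java.util.regex instead of a generated automaton; the token names mirror the sequence above, and the regular expressions are illustrative assumptions, not the course's definitions.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ToyScanner {
        // One named alternative per token class; the first alternative that matches wins.
        private static final Pattern TOKENS = Pattern.compile(
              "(?<HEX>0x[0-9a-fA-F]+)"
            + "|(?<NAME>[a-zA-Z_][a-zA-Z0-9_]*)"
            + "|(?<OP>[*+\\-/])"
            + "|(?<WS>\\s+)");

        public static void main(String[] args) {
            Matcher m = TOKENS.matcher("base * base - 0x4 * height * width");
            while (m.find()) {
                if (m.group("WS") != null) continue;   // white space is recognized but ignored
                if (m.group("HEX") != null)  System.out.println("hexConstant:" + m.group());
                if (m.group("NAME") != null) System.out.println("name:" + m.group());
                if (m.group("OP") != null)   System.out.println("operator:" + m.group());
            }
        }
    }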
Tokens
  • Token attributes: pointer to a symbol-table entry, may include the lexeme, scope information, etc.
  • Formal specification of tokens uses regular expressions; this requires defining alphabets, strings, and languages
Regular Expression Notation
  • a: an ordinary letter from our alphabet
  • ε: the empty string
  • r1 | r2: choosing from r1 or r2
  • r1r2 : concatenation of r1 and r2
  • r*: zero or more times (Kleene closure)
  • r+: one or more times
  • r?: zero or one occurrence
  • [a-zA-Z]: character class (a choice among the listed characters)
  • . : any single character except newline (Java regex examples of these operators follow this list)
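
Java's java.util.regex supports the same operators, so it is a convenient way to experiment with the notation above; this small demo is mine, not part of the slides.

    import java.util.regex.Pattern;

    public class RegexOperators {
        public static void main(String[] args) {
            System.out.println(Pattern.matches("ab|cd", "cd"));    // true  : choice
            System.out.println(Pattern.matches("ab", "ab"));       // true  : concatenation
            System.out.println(Pattern.matches("a*", ""));         // true  : zero or more
            System.out.println(Pattern.matches("a+", ""));         // false : one or more
            System.out.println(Pattern.matches("ab?", "a"));       // true  : zero or one
            System.out.println(Pattern.matches("[a-zA-Z]", "Q"));  // true  : character class
            System.out.println(Pattern.matches(".", "\n"));        // false : . excludes newline
        }
    }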
Semantics of Regular Expressions
  • L(ε) = {ε}
  • L(a) = {a} for each a in Σ
  • L(r1 | r2) = L(r1) ∪ L(r2)
  • L(r1 r2) = { xy | x ∈ L(r1), y ∈ L(r2) }
  • L(r*) = {ε} ∪ L(r) ∪ { x1x2 | x1, x2 ∈ L(r) } ∪ … ∪ { x1 … xn | x1, …, xn ∈ L(r) } ∪ …
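
As a quick worked example of these rules (not from the slides), take r = (a|b)c* over Σ = {a, b, c}:

    L((a|b)c*) = L(a|b) · L(c*) = {a, b} · {ε, c, cc, …} = {a, b, ac, bc, acc, bcc, …}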

For Homework
  • Suppose Σ is {a, b}

What is the regular expression for:

      • All strings beginning and ending in a?
      • All strings with an odd number of a’s?
      • All strings without two consecutive a’s?
      • All strings with an odd number of b’s followed by an even number of a’s?
  • What’s the description for a Java floating point number?
  • What’s the description of a variable name in Java?
Why we care about Regular Expressions

For every regular expression, there is a deterministic finite-state machine that defines the same language, and vice versa.

[Slide diagram: Lexical Specification of Tokens → Regular expressions → NFA → DFA → Table-driven Implementation of DFA]

Regular Expressions
  • An automaton is a good “visual” aid
    • but it is not suitable as a specification (its textual description is too clumsy)
  • However, regular expressions are a suitable specification
    • a compact way to define a language that can be accepted by an automaton.
RegExp Use and Construction
  • Used as the input to a scanner generator like lex or flex or JavaCC
    • define each token, and also
    • define white-space, comments, etc
      • these do not correspond to tokens, but must be recognized and ignored.
  • An NFA can be constructed from a RegExp via Thompson’s Construction
Thompson’s Construction
  • There are building blocks for each regular expression operator
  • More complex RegExps are constructed by composing smaller building blocks
  • Assumes that the NFA at each step of the construction has a single accepting state (a code sketch of these building blocks follows)
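
A compact sketch of these building blocks, assuming a tiny hand-rolled NFA representation; the Edge, State, Nfa, and Thompson names are mine, not the course's. Each helper returns a machine with a single start state and a single accepting state, as the last bullet requires.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical supporting types; the names are illustrative, not the course's.
    class Edge {
        final Character label;   // null stands for an epsilon-edge
        final State to;
        Edge(Character label, State to) { this.label = label; this.to = to; }
    }

    class State {
        final List<Edge> edges = new ArrayList<>();
        void addEdge(Character label, State to) { edges.add(new Edge(label, to)); }
    }

    class Nfa {
        final State start;
        final State accept;      // exactly one accepting state, as the slide assumes
        Nfa(State start, State accept) { this.start = start; this.accept = accept; }
    }

    public class Thompson {
        // Building block for the empty string epsilon.
        static Nfa epsilon() {
            State s = new State(), f = new State();
            s.addEdge(null, f);
            return new Nfa(s, f);
        }

        // Building block for a single input symbol a.
        static Nfa symbol(char a) {
            State s = new State(), f = new State();
            s.addEdge(a, f);
            return new Nfa(s, f);
        }

        // Concatenation A B: link A's accepting state to B's start with an epsilon-edge.
        static Nfa concat(Nfa a, Nfa b) {
            a.accept.addEdge(null, b.start);
            return new Nfa(a.start, b.accept);
        }

        // Alternation A | B: a fresh start state branches into A and B,
        // and both accepting states feed a fresh accepting state.
        static Nfa alt(Nfa a, Nfa b) {
            State s = new State(), f = new State();
            s.addEdge(null, a.start);
            s.addEdge(null, b.start);
            a.accept.addEdge(null, f);
            b.accept.addEdge(null, f);
            return new Nfa(s, f);
        }

        // Kleene star A*: allow skipping A entirely or looping back for another pass.
        static Nfa star(Nfa a) {
            State s = new State(), f = new State();
            s.addEdge(null, a.start);          // enter A
            s.addEdge(null, f);                // or accept zero occurrences
            a.accept.addEdge(null, a.start);   // repeat A
            a.accept.addEdge(null, f);         // or stop
            return new Nfa(s, f);
        }
        // The "Others" slide below: A+ can be built as A A* (using two copies of A's
        // machine), A? as A | epsilon, and [a-z] as a 26-way alternation of symbol NFAs.
    }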
Regular Expressions to NFA (1)
  • For each kind of rexp, define an NFA
    • Notation: the NFA for a rexp M is drawn as a box labeled M
  • For ε: [diagram: a start state and an accepting state joined by an ε-edge]
  • For input a: [diagram: a start state and an accepting state joined by an edge labeled a]
Regular Expressions to NFA (2)
  • For A B: [diagram: the box for A, with its accepting state joined by an ε-edge to the start state of the box for B]
  • For A | B: [diagram: a new start state with ε-edges into the boxes for A and B, and ε-edges from both accepting states into a new accepting state]
Others
  • What would be the representation for A+?
  • What would be the representation for A??
  • What about for [a-z]?
Example of RegExp -> NFA conversion
  • Consider the regular expression

(1|0)*1

  • The NFA is

[Diagram: the NFA for (1|0)*1, with states A through J and transitions labeled 0, 1, and ε]

More Homework Problems
  • What is the NFA for the following RE?

(a(b+c))* a

  • What is the NFA for the following RE?

((a|b)*c) | (a b c*)

Lexical Analyzer
  • Can be programmed in a high-level language.
  • Can be generated using tools like LEX/Flex
  • Integrate these tools with C/C++ or Java code
  • In Java there are other tools as well, JFlex for example
How can a tool like lex or JavaCC work?
  • Translate regular expressions to Non-deterministic Finite Automata (NFA)
    • Easier expressive form than the DFA
    • Automata theory tells us how to optimize
  • Run the automata
    • Simulate NFA, or
    • Translate the NFA to a DFA: a new DFA where each state corresponds to a set of NFA states (see pages 28–29 of Appel for the set construction)
      • Have the DFA move between states in simulation of the NFA’s sets of states
Non-deterministic FA
  • An NFA is modified to allow zero, one, or MORE transitions from a state on the same input symbol
  • Easier to express complex patterns as an NFA
  • Harder to mechanically simulate an NFA: which transition do we make on an input symbol? (Simulate all of them, then check whether any run worked; a sketch follows this list)
  • DFA and NFA are functionally equivalent.
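
One way to "simulate all of them", sketched with a plain map-based NFA over integer states (the representation and names are mine): keep the set of states the NFA could currently be in, and advance that whole set on each input character. The example machine accepts (1|0)*1, the regular expression used earlier.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class NfaSimulation {
        // transitions.get(state).get(symbol) = set of possible next states
        static final Map<Integer, Map<Character, Set<Integer>>> transitions = new HashMap<>();

        static void addEdge(int from, char symbol, int to) {
            transitions.computeIfAbsent(from, s -> new HashMap<>())
                       .computeIfAbsent(symbol, c -> new HashSet<>())
                       .add(to);
        }

        // Run the NFA by tracking every state it could be in after each symbol.
        static boolean accepts(String input, int start, Set<Integer> accepting) {
            Set<Integer> current = new HashSet<>();
            current.add(start);
            for (char c : input.toCharArray()) {
                Set<Integer> next = new HashSet<>();
                for (int state : current) {
                    Map<Character, Set<Integer>> edges = transitions.get(state);
                    if (edges != null && edges.containsKey(c)) {
                        next.addAll(edges.get(c));
                    }
                }
                current = next;   // all states reachable on the input read so far
            }
            for (int s : current) if (accepting.contains(s)) return true;
            return false;
        }

        public static void main(String[] args) {
            // NFA for (1|0)*1 without epsilon edges: state 0 loops on 0 and 1,
            // and can also move to accepting state 1 on a 1.
            addEdge(0, '0', 0);
            addEdge(0, '1', 0);
            addEdge(0, '1', 1);
            System.out.println(accepts("0101", 0, Set.of(1)));  // true
            System.out.println(accepts("10",   0, Set.of(1)));  // false
        }
    }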
NFA with null moves
  • The model of NFA can be extended to include transitions on <null> input.
  • Change the state without reading any symbol from the input stream.
  • e-closure(q): the set of all states reachable from q without reading any input symbol (following the null edges)
eClosure Operator
  • The eClosure operator is defined as eClosure(s) = { s } ∪ { states reachable from s using ε-transitions }
  • Example: eClosure(1) = {1, 3} (a code sketch follows the diagram below)

[Diagram: an NFA with start state 1 and states 2–5; state 1 reaches state 3 on an ε-edge, so eClosure(1) = {1, 3}; the remaining edges are labeled a, b, and a/b]
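
A sketch of the closure computation using a worklist over the ε-edges only; the integer-state representation is an assumption of mine. On the example above it reproduces eClosure(1) = {1, 3}.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class EpsilonClosure {
        // Only the epsilon-edges matter for the closure; labeled edges are ignored here.
        static final Map<Integer, List<Integer>> epsilonEdges = Map.of(
                1, List.of(3),      // 1 --e--> 3, as in the slide's example
                3, List.of());

        static Set<Integer> eClosure(int q) {
            Set<Integer> closure = new HashSet<>();
            Deque<Integer> work = new ArrayDeque<>();
            closure.add(q);
            work.push(q);
            while (!work.isEmpty()) {
                int s = work.pop();
                for (int t : epsilonEdges.getOrDefault(s, List.of())) {
                    if (closure.add(t)) {   // newly reached state: explore it too
                        work.push(t);
                    }
                }
            }
            return closure;
        }

        public static void main(String[] args) {
            System.out.println(eClosure(1));   // prints {1, 3} in some order
        }
    }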

RE to FA
  • If we write an expression as an RE (easy for people), how do we turn it into an FA (something a machine can simulate)?
  • Use Thompson’s Construction
    • At most twice as many states as there are symbols and operators in the regular expression.
    • Results in an NFA (which would need a non-deterministic computer to run most efficiently, hmm…)
NFA to DFA
  • Build “super states” in a DFA where each “super state” represents the set of transitions that the NFA could make from a state on a symbol
  • e-closure(q): states that can be reached from q with just null transitions
    • move(S, a): states that can be reached on scanning a symbol a (from the input)
    • e-closure(S): states that can be reached with ε-transitions from states in S
NFA to DFA (cont.)
  • Subset Construction (Algorithm 3.2 in Aho, Sethi, and Ullman); a Java rendering follows the pseudocode

    Find e-closure(q0)
    while (some S in FAStates is unmarked) {
        mark S
        for each a in alphabet {
            T = e-closure( move(S, a) );
            if (T not in FAStates)
                FAStates.include( T );
            FATran[S, a] = T;
        }
    }
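
A direct Java rendering of the loop above, with integer NFA states and Set<Integer> "super states" as DFA states; the eClosure and move helpers follow the definitions from the earlier slide, and the data-structure names (and the exact wiring of the example NFA) are my own assumptions.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class SubsetConstruction {
        // NFA description: labeled edges and epsilon-edges over integer states.
        static final Map<Integer, Map<Character, Set<Integer>>> edges = new HashMap<>();
        static final Map<Integer, Set<Integer>> epsEdges = new HashMap<>();

        static void edge(int from, char c, int to) {
            edges.computeIfAbsent(from, k -> new HashMap<>())
                 .computeIfAbsent(c, k -> new HashSet<>()).add(to);
        }

        static void epsEdge(int from, int to) {
            epsEdges.computeIfAbsent(from, k -> new HashSet<>()).add(to);
        }

        // e-closure(S): every state reachable from S using only epsilon-edges.
        static Set<Integer> eClosure(Set<Integer> states) {
            Set<Integer> closure = new HashSet<>(states);
            Deque<Integer> work = new ArrayDeque<>(states);
            while (!work.isEmpty()) {
                for (int t : epsEdges.getOrDefault(work.pop(), Set.of())) {
                    if (closure.add(t)) work.push(t);
                }
            }
            return closure;
        }

        // move(S, a): every state reachable from some state in S on one symbol a.
        static Set<Integer> move(Set<Integer> states, char a) {
            Set<Integer> result = new HashSet<>();
            for (int s : states) {
                result.addAll(edges.getOrDefault(s, Map.of()).getOrDefault(a, Set.of()));
            }
            return result;
        }

        public static void main(String[] args) {
            // An NFA for a c* b over states 1..6 -- one plausible wiring, not necessarily
            // the exact machine drawn on the later slide:
            // 1 -a-> 2, 2 -e-> 3, 3 -c-> 4, 4 -e-> 3, 2 -e-> 5, 4 -e-> 5, 5 -b-> 6.
            edge(1, 'a', 2); epsEdge(2, 3); edge(3, 'c', 4);
            epsEdge(4, 3); epsEdge(2, 5); epsEdge(4, 5); edge(5, 'b', 6);
            char[] alphabet = {'a', 'b', 'c'};

            // Subset construction: each DFA state is a set of NFA states.
            Set<Integer> start = eClosure(Set.of(1));
            List<Set<Integer>> dfaStates = new ArrayList<>(List.of(start)); // discovered DFA states
            Map<Set<Integer>, Map<Character, Set<Integer>>> dfaTran = new HashMap<>();
            for (int i = 0; i < dfaStates.size(); i++) {   // states at index >= i are "unmarked"
                Set<Integer> s = dfaStates.get(i);         // mark S by processing it
                for (char a : alphabet) {
                    Set<Integer> t = eClosure(move(s, a));
                    if (t.isEmpty()) continue;                     // no transition on a
                    if (!dfaStates.contains(t)) dfaStates.add(t);  // FAStates.include(T)
                    dfaTran.computeIfAbsent(s, k -> new HashMap<>()).put(a, t);  // FATran[S, a] = T
                }
            }
            dfaTran.forEach((s, m) -> System.out.println(s + " : " + m));
        }
    }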

DFA vs. NFA
  • The NFA is smaller, O(|r|) space, but takes more time to simulate, O(|r|*|x|) time, even with the nice properties of Thompson’s construction
  • The DFA is faster, O(|x|) time, but is not space efficient: O(2^|r|) space in the worst case
NFA to DFA
  • What is the difference between the two?
  • Is there a single DFA for a corresponding NFA?
  • Why do we want to do this anyway?
Subset Construction for NFA → DFA
  • Compute A = eClosure(start)
  • Compute the set of states reachable from A on a transition a; call this new set S’
  • Compute eClosure(S’) – this is the new DFA state; label it with the next available label
  • Continue for all possible transitions out of the current super-state, considering all applicable elements of S
  • Repeat steps 2-4 for each new state
Example: a c*b

[Diagram: the NFA for a c*b, with states 1–6 and edges labeled a, c, and b plus ε-edges] (A construction snippet using the earlier Thompson sketch follows.)
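
Using the hypothetical Thompson helpers sketched earlier (symbol, star, concat), the same machine could be assembled directly from the regular expression; note that Thompson's construction introduces extra ε-states, so the result is equivalent to, but not literally, the six-state diagram above.

    // Assembling a c* b from the hypothetical building blocks sketched earlier.
    Nfa acStarB = Thompson.concat(
            Thompson.symbol('a'),
            Thompson.concat(Thompson.star(Thompson.symbol('c')),
                            Thompson.symbol('b')));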

References
  • Aho, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, Chapter 3
  • http://www.cs.columbia.edu/~lerner/CS4115
  • Andrew Appel, Modern Compiler Implementation in Java, Cambridge University Press, 2003