Compiler construction
1 / 57

Compiler Construction - PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Compiler Construction. 2 주 강의 Lexical Analysis. token. Lexical Analyzer. Parser. Source Program. get next token. Symbol Table. Lexical Analysis. “ get next token ” is a command sent from the parser to the lexical analyzer.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

Compiler Construction

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Compiler Construction

2주 강의

Lexical Analysis





get next token


Lexical Analysis

  • “get next token” is a command sent from the parser to the lexical analyzer.

  • On receipt of the command, the lexical analyzer scans the input until it determines the next token, and returns it.

Other jobs of the lexical analyzer

  • We also want the lexer to

    • Strip out comments and white space from the source code.

    • Correlate parser errors with the source code location (the parser doesn’t know what line of the file it’s at, but the lexer does)

Tokens, patterns, and lexemes

  • A TOKEN is a set of strings over the source alphabet.

  • A PATTERN is a rule that describes that set.

  • A LEXEME is a sequence of characters matching that pattern.

  • E.g. in Pascal, for the statement

    const pi = 3.1416;

  • The substring pi is a lexeme for the token identifier

Example tokens, lexemes, patterns


  • Together, the complete set of tokens form the set of terminal symbols used in the grammar for the parser.

  • In most languages, the tokens fall into these categories:

    • Keywords

    • Operators

    • Identifiers

    • Constants

    • Literal stirings

    • Punctuation

  • Usually the token is represented as an integer.

  • The lexer and parser just agree on which integers are used for each token.

Token attributes

  • If there is more than one lexeme for a token, we have to save additional information about the token.

  • Example: the token number matches lexemes 10 and 20.

  • Code generation needs the actual number, not just the token.

  • With each token, we associate ATTRIBUTES. Normally just a pointer into the symbol table.

Example attributes

  • For C source code

    E = M * C * C

  • We have token/attribute pairs

    <ID, ptr to symbol table entry for E>

    <Assign_op, NULL>

    <ID, ptr to symbol table entry for M>

    <Mult_op, NULL>

    <ID, ptr to symbol table entry for C>

    <Mult_op, NULL>

    <ID, ptr to symbol table entry for C>

Lexical errors

  • When errors occur, we could just crash

  • It is better to print an error message then continue.

  • Possible techniques to continue on error:

    • Delete a character

    • Insert a missing character

    • Replace an incorrect character by a correct character

    • Transpose adjacent characters

Token specification

  • REGULAR EXPRESSIONS (REs) are the most common notation for pattern specification.

  • Every pattern specifies a set of strings, so an RE names a set of strings.

  • Definitions:

    • The ALPHABET (often written ∑) is the set of legal input symbols

    • A STRING over some alphabet ∑ is a finite sequence of symbols from ∑

    • The LENGTH of string s is written |s|

    • The EMPTY STRING is a special 0-length string denoted ε

More definitions: strings and substrings

  • A PREFIX of s is formed by removing 0 or more trailing symbols of s

  • A SUFFIX of s is formed by removing 0 or more leading symbols of s

  • A SUBSTRING of s is formed by deleting a prefix and a suffix from s

  • A PROPER prefix, suffix, or substring is a nonempty string x that is, respectively, a prefix, suffix, or substring of s but with x ≠ s.

More definitions

  • A LANGUAGE is a set of strings over a fixed alphabet ∑.

  • Example languages:

    • Ø (the empty set)

    • { ε }

    • { a, aa, aaa, aaaa }

  • The CONCATENATION of two strings x and y is written xy

  • String EXPONENTIATION is written si, where s0 = ε and si = si-1s for i>0.

Operations on languages

We often want to perform operations on sets of strings (languages). The important ones are:

  • The UNION of L and M: L ∪ M = { s | s is in L OR s is in M }

  • The CONCATENATION of L and M:LM = { st | s is in L and t is in M }



Regular expressions

  • REs let us precisely define a set of strings.

  • For C identifiers, we might use( letter | _ ) ( letter | digit | _ )*

  • Parentheses are for grouping, | means “OR”, and * means Kleene closure.

  • Every RE defines a language L(r).

Regular expressions

  • Here are the rules for writing REs over an alphabet ∑ :

    • ε is an RE denoting { ε }, the language containing only the empty string.

    • If a is in ∑, then a is a RE denoting { a }.

    • If r and s are REs denoting L(r) and L(s), then

      • (r)|(s) is a RE denoting L(r) ∪ L(s)

      • (r)(s) is a RE denoting L(r) L(s)

      • (r)* is a RE denoting (L(r))*

      • (r) is a RE denoting L(r)

Additional conventions

  • To avoid too many parentheses, we assume:

    • * has the highest precedence, and is left associative.

    • Concatenation has the 2nd highest precedence, and is left associative.

    • | has the lowest precedence and is left associative.

Example REs

  • a | b

  • ( a | b ) ( a | b )

  • a*

  • (a | b )*

  • a | a*b

Equivalence of REs

Regular definitions

  • To make our REs simpler, we can give names to subexpressions. A REGULAR DEFINITION is a sequence

    d1 -> r1

    d2 -> r2

    dn -> rn

Regular definitions

  • Example for identifiers in C:

    letter -> A | B | … | Z | a | b | … | z

    digit -> 0 | 1 | … | 9

    id -> ( letter | _ ) ( letter | digit | _ )*

  • Example for numbers in Pascal:

    digit -> 0 | 1 | … | 9

    digits -> digitdigit*

    optional_fraction -> . digits | ε

    optional_exponent -> ( E ( + | - | ε ) digits ) | ε

    num -> digits optional_fraction optional_exponent

Notational shorthand

  • To simplify out REs, we can use a few shortcuts:

    • 1. + means “one or more instances of”a+ (ab)+

    • 2. ? means “zero or one instance of”Optional_fraction -> ( . digits ) ?

    • 3. [] creates a character class[A-Za-z][A-Za-z0-9]*

  • You can prove that these shortcuts do not increase the representational power of REs, but they are convenient.

Token recognition

  • We now know how to specify the tokens for our language. But how do we write a program to recognize them?

    if -> if

    then -> then

    else -> else

    relop -> < | <= | = | <> | > | >=

    id -> letter ( letter | digit )*

    num -> digit ( . digit )? ( E (+|-)? digit )?

Token recognition

  • We also want to strip whitespace, so we need definitions

    delim -> blank | tab | newline

    ws -> delim+

Attribute values

Transition diagrams

  • Transition diagrams are also called finite automata.

  • We have a collection of STATES drawn as nodes in a graph.

  • TRANSITIONS between states are represented by directed edges in the graph.

  • Each transition leaving a state s is labeled with a set of input characters that can occur after state s.

  • For now, the transitions must be DETERMINISTIC.

  • Each transition diagram has a single START state and a set of TERMINAL STATES.

  • The label OTHER on an edge indicates all possible inputs not handled by the other transitions.

  • Usually, when we recognize OTHER, we need to put it back in the source stream since it is part of the next token. This action is denoted with a * next to the corresponding state.

Automated lexical analyzer generation

  • Next time we discuss Lex and how it does its job:

    • Given a set of regular expressions, produce C code to recognize the tokens.

Lexical Analysis

Lexical Analysis Example

Lexical Analysis With Lex

Lexical analysis with Lex

Lex source program format

  • The Lex program has three sections, separated by %%:



    transition rules


    auxiliary code

Declarations section

  • Code between %{ and }% is inserted directly into the lex.yy.c. Should contain:

    • Manifest constants (#define for each token)

    • Global variables, function declarations, typedefs

  • Outside %{ and }%, REGULAR DEFINITIONS are declared.Examples:

    delim [ \t\n]

    ws {delim}+

    letter [A-Za-z]

Each definition is a name followed by a pattern.

Declared names can be used in later patterns, if surrounded by { }.

Translation rules section

Translation rules take the form

p1 { action1 }

p2 { action2 }


pn { actionn }

Where pi is a regular expression and actioni is a C program fragment to be executed whenever pi is recognized in the input stream.

In regular expressions, references to regular definitions must be enclosed in {} to distinguish them from the corresponding character sequences.

Auxiliary procedures

  • Arbitrary C code can be placed in this section, e.g. functions to manipulate the symbol table.

  • 이미 설명했음

Special characters

Some characters have special meaning to Lex.

  • ‘.’ in a RE stands for ANY character

  • ‘*’ stands for Kleene closure

  • ‘+’ stands for positive closure

  • ‘?’ stands for 0-or-1 instance of

  • ‘-’ produces a character range (e.g. in [A-Z])

    When you want to use these characters in a RE, they must be “escaped”

    e.g. in RE {digit}+(\.{digit}+)? ‘.’ is escaped with ‘\’

Lex interface to yacc

  • The yacc parser calls a function yylex() produced by lex.

  • yylex() returns the next token it finds in the input stream.

  • yacc expects the token’s attribute, if any, to be returned via the global variable yylval.

  • The declaration of yylval is up to you (the compiler writer). In our example, we use a union, since we have a few different kinds of attributes.

Lookahead in Lex

Sometimes, we don’t know until looking ahead several characters what the next token is. Recognition of the DO keyword in Fortran is a famous example.

DO5I=1.25 assigns the value 1.25 to DO5I

DO5I=1,25 is a DO loop

Lex handles long-term lookahead with r1/r2:DO/({letter}|{digit})*=({letter}|{digit})*,

(if it’s followed by letters & digits, ‘=’,

more letters & digits, followed by a ‘,’)

Recognize keyword DO

Finite Automata for Lexical Analysis

Automatic lexical analyzer generation

  • How do Lex and similar tools do their job?

    • Lex translates regular expressions into transition diagrams.

    • Then it translates the transition diagrams into C code to recognize tokens in the input stream.

  • There are many possible algorithms.

  • The simplest algorithm is RE -> NFA -> DFA -> C code.

Finite automata (FAs) and regular languages

  • A RECOGNIZER takes language L and string x as input, and responds YES if x∈L, or NO otherwise.

  • The finite automaton (FA) is one class of recognizer.

  • A FA is DETERMINISTIC if there is only one possible transition for each <state,input> pair.

  • A FA is NONDETERMINISTIC if there is more than one possible transition some <state,input> pair.

  • BUT both DFAs and NFAs recognize the same class of languages: REGULAR languages, or the class of languages that can be written as regular expressions.


  • A NFA is a 5-tuple < S, ∑, move, s0, F >

  • S is the set of STATES in the automaton.


  • move( s, c ) = S is the TRANSITION FUNCTIONspecifying which states S the automaton can move to on seeing input c while in state s.

  • s0 is the START STATE.

  • F is the set of FINAL, or ACCEPTING STATES

NFA example

and recognizes the language L = (a|b)*abb

(the set of all strings of a’s and b’s ending with abb)


has move() function:

The language defined by a NFA

  • An NFA ACCEPTS string x iff there exists a path from s0 to an accepting state, such that the edge labels along the path spell out x.

  • The LANGUAGE DEFINED BY a NFA N, written L(N), is the set of strings it accepts.

Another NFA example

This NFA accepts L = aa*|bb*

Deterministic FAs (DFAs)

The DFA is a special case of the NFA except:

  • No state has an ε-transition

  • No state has more than one edge leaving it for the same input character.

    The benefit of DFAs is that they are simple to simulate: there is only one choice for the machine’s state after each input symbol.

Algorithm to simulate a DFA

Inputs: string x terminated by EOF; DFA D = < S, ∑, move, s0, F >

Outputs: YES if D accepts x; NO otherwise


s = s0;

c = nextchar;

while ( c != EOF ) {

s = move( s, c );

c = nextchar;


if ( s ∈ F ) return YES

else return NO

DFA example

This DFA accepts L = (a|b)*abb


  • Now we know how to simulate DFAs.

  • If we can convert our REs into a DFA, we can automatically generate lexical analyzers.

  • BUT it is not easy to convert REs directly into a DFA.

  • Instead, we will convert our REs to a NFA then convert the NFA to a DFA.

Converting a NFA to a DFA


  • NFAs are ambiguous: we don’t know what state a NFA is in after observing each input.

  • The simplest conversion method is to have the DFA track the SUBSET of states the NFA MIGHT be in.

  • We need three functions for the construction:

    • ε-closure(s): the set of NFA states reachable from NFA state s on ε-transitions alone.

    • ε-closure(T): the set of NFA states reachable from some state s ∈ T on ε-transitions alone.

    • move(T,a): the set of NFA states to which there is a transition on input a from some NFA state s ∈ T

Subset construction algorithm

  • Inputs: a NFA N = < SN, ∑, tranN, n0, FN >

  • Outputs: a DFA D = < SD, ∑, tranD, d0, FD >

  • Method:

    add a state d0 to SD corresponding to ε-closure(n0) while there is an unexpanded state di ∈ SD{

    for each input symbol a ∈ ∑ {

    dj = ε-closure(move(di,a))

    if dj ∉SD,

    add dj to SD

    tranN( di, a ) = dj



Examples: convert these NFAs



Converting a RE to a NFA


  • The construction is bottom up.

  • Construct NFAs to recognize ε and each element a ∈ ∑.

  • Recursively expand those NFAs for alternation, concatenation, and Kleene closure.

  • Every step introduces at most two additional NFA states.

  • Therefore the NFA is at most twice as large as the regular expression.

RE -> NFA algorithm

Inputs: A RE r over alphabet ∑

Outputs: A NFA N accepting L(r)

Method: Parse r.

If r = ε, then N is

If r = a ∈ ∑ , then N is

If r = s | t, construct N(s) for s and N(t) for t then N is

RE -> NFA algorithm

If r = st, construct N(s) for s and N(t) for t then N is

If r = s*, construct N(s) for s, then N is

If r = ( s ), construct N(s) then let N be N(s).


Use the NFA construction algorithm to build a NFA forr = (a|b)*abb

  • Login