Compiler principle and technology

Compiler Principle and Technology. Prof. Dongming LU Feb. 28th, 2014. 2. Scanning (Lexical Analysis). Contents. 2.1 The Scanning Process 2.2 Regular Expression 2.3 Finite Automata 2.4 From Regular Expressions to DFAs 2.5 Implementation of a TINY Scanner

Compiler Principle and Technology

Prof. Dongming LU

Feb. 28th, 2014


2. Scanning (Lexical Analysis)


Contents

2.1 The Scanning Process

2.2 Regular Expression

2.3 Finite Automata

2.4 From Regular Expressions to DFAs

2.5 Implementation of a TINY Scanner

2.6 Use of Lex to Generate a Scanner Automatically


2.1 The Scanning Process


The Function of a Scanner

Source code → Scanner → Tokens

  • Pattern recognition

  • How patterns are specified: regular expressions

  • Tools: DFAs and NFAs


Tokens

Tokens are defined as an enumerated type:

typedef enum

{IF, THEN, ELSE, PLUS, MINUS, NUM, ID, …}

TokenType;

Categories of tokens:

  • RESERVED WORDS

    Such as IF and THEN, which represent the strings of characters “if” and “then”

  • SPECIAL SYMBOLS

    Such as PLUS and MINUS, which represent the characters “+” and “-”

  • OTHER TOKENS

    Such as NUM and ID, which represent numbers and identifiers


Some Practical Issues of the Scanner

  • A token record: one structured data type that collects all the attributes of a token

    typedef struct

      {TokenType tokenval;

      char *stringval;

      int numval;

      } TokenRecord;


Some Practical Issues of the Scanner

  • The scanner returns the token value only and places the other attributes in variables

    TokenType getToken(void)

  • As an example of the operation of getToken, consider the following line of C code:

    A[index] = 4+2
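A minimal sketch of how getToken might behave on that line: it returns only the token value and leaves the lexeme and numeric value in variables. The token names LBRACKET, RBRACKET, ASSIGN, ENDFILE, and ERROR, the variables tokenString and numVal, and the helper setInput are assumptions made for this illustration and do not come from the slides.

```c
#include <ctype.h>
#include <stdlib.h>

/* Token kinds; LBRACKET, RBRACKET, ASSIGN, ENDFILE and ERROR are
   illustrative additions to the enumeration shown in the slides. */
typedef enum { ID, NUM, PLUS, MINUS, LBRACKET, RBRACKET, ASSIGN,
               ENDFILE, ERROR } TokenType;

#define MAXTOKENLEN 40

static const char *src = "";          /* hypothetical input cursor */
char tokenString[MAXTOKENLEN + 1];    /* lexeme of the last token */
int numVal;                           /* value of the last NUM token */

/* Hypothetical helper: point the scanner at a string to scan. */
void setInput(const char *s) { src = s; }

TokenType getToken(void)
{
    int len = 0;
    while (*src == ' ' || *src == '\t' || *src == '\n') src++;  /* skip blanks */
    if (*src == '\0') { tokenString[0] = '\0'; return ENDFILE; }

    if (isalpha((unsigned char)*src)) {            /* identifier */
        while (isalnum((unsigned char)*src)) {
            if (len < MAXTOKENLEN) tokenString[len++] = *src;
            src++;
        }
        tokenString[len] = '\0';
        return ID;
    }
    if (isdigit((unsigned char)*src)) {            /* number */
        while (isdigit((unsigned char)*src)) {
            if (len < MAXTOKENLEN) tokenString[len++] = *src;
            src++;
        }
        tokenString[len] = '\0';
        numVal = atoi(tokenString);
        return NUM;
    }
    tokenString[0] = *src; tokenString[1] = '\0';  /* single-character token */
    switch (*src++) {
    case '+': return PLUS;
    case '-': return MINUS;
    case '[': return LBRACKET;
    case ']': return RBRACKET;
    case '=': return ASSIGN;
    default:  return ERROR;
    }
}
```

Calling getToken repeatedly on the example line would yield ID, LBRACKET, ID, RBRACKET, ASSIGN, NUM, PLUS, NUM, with tokenString and numVal updated after each call.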


2.2 Regular Expression


Some Related Basic Concepts

  • Regular expressions: patterns of strings of characters

  • Three elements:

    • r: the regular expression

    • L(r): the language (set of strings) matched by r

    • ∑: the alphabet, a set of elements referred to as symbols

Regular expressions are only a tool for scanning


2.2.1 Definition of Regular Expressions


Basic Regular Expressions

Single characters from the alphabet match themselves. For example:

  • a matches the character a, written L(a) = {a}

  • ε denotes the empty string, with L(ε) = {ε}

  • {} or Φ matches no string at all, with L(Φ) = { }


Regular Expression Operations

  • Choice among alternatives

    • indicated by the meta-character |

  • Concatenation

    • indicated by juxtaposition

  • Repetition or “closure”

    • indicated by the meta-character *


Precedence of Operations and Use of Parentheses

  • The standard convention

    * > concatenation > |

  • A simple example

    a|bc* is interpreted as a|(b(c*))

  • Parentheses are used to indicate a different precedence


Definition of Regular Expression

  • A regular expression is one of the following:

    (1) A basic regular expression: a single legal character a from the alphabet ∑, or the meta-character ε or Φ

    (2) The form r|s, where r and s are regular expressions

    (3) The form rs, where r and s are regular expressions

    (4) The form r*, where r is a regular expression

    (5) The form (r), where r is a regular expression

  • Parentheses do not change the language: L((r)) = L(r)


Examples of Regular Expressions

Example 1:

∑ = {a, b, c}, the set of all strings over this alphabet that contain exactly one b:

(a|c)*b(a|c)*

Example 2:

∑ = {a, b, c}, the set of all strings that contain at most one b:

(a|c)* | (a|c)*b(a|c)*   or   (a|c)*(b|ε)(a|c)*

The same language may be generated by many different regular expressions.


Examples of Regular Expressions

Example 3:

∑ = {a, b}, the set of strings consisting of a single b surrounded by the same number of a’s:

S = {b, aba, aabaa, aaabaaa, …} = { a^n b a^n | n ≥ 0 }

This set cannot be described by a regular expression.

  • “Regular expressions can’t count”

  • Not all sets of strings can be generated by regular expressions.

  • A regular set is a set of strings that is the language of some regular expression; the name distinguishes such sets from other sets.


    Examples of Regular Expressions

    Example 4:

    ∑ = {a, b, c}, the strings that contain no two consecutive b’s:

    First attempts:

    ( (a|c)* | (b(a|c))* )*

    ( (a|c) | (b(a|c)) )*   or   (a | c | ba | bc)*

    These are not yet correct, because they cannot generate a string that ends in b.

    Correct regular expressions:

    • (a | c | ba | bc)* (b|ε)

    • (b|ε) (a | c | ab | cb)*

    • (notb | b notb)* (b|ε), where notb = a|c


    Examples of Regular Expressions

    Example 5:

    ∑ = {a, b, c}. Determine a concise English description of the language generated by

    ((b|c)* a (b|c)* a)* (b|c)*

    Answer: the strings that contain an even number of a’s. With nota = b|c, the expression reads

    (nota* a nota* a)* nota*


    2.2.2 Extensions to Regular Expression


    List of New Operations

    1) One or more repetitions:

    r+

    2) Any character:

    period “.”

    3) A range of characters:

    [0-9], [a-zA-Z]

    4) Any character not in a given set:

    ~(a|b|c): a character that is not a, b, or c

    [^abc] in Lex

    5) Optional sub-expressions:

    r?: the strings matched by r are optional


    2.2.3 Regular Expressions for Programming Language Tokens


    Numbers, Reserved Words, and Identifiers

    Numbers

    • nat = [0-9]+

    • signedNat = (+|-)?nat

    • number = signedNat(“.”nat)?(E signedNat)?

    Reserved Words and Identifiers

    • reserved = if | while | do | …

    • letter = [a-zA-Z]

    • digit = [0-9]

    • identifier = letter(letter|digit)*
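The definition of number above can be tried out directly with a regular-expression library. The sketch below assumes a POSIX system, since regex.h is part of POSIX rather than standard C; the function name isNumber is an assumption made for this illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s matches number = signedNat("."nat)?(E signedNat)?,
   0 if it does not, and -1 if the pattern fails to compile. */
int isNumber(const char *s)
{
    /* ^ and $ anchor the match to the whole string;
       [+-]?[0-9]+ is signedNat written as a POSIX extended regex. */
    const char *pattern = "^[+-]?[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
    regex_t re;
    int result;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    result = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return result;
}
```

Strings such as 42, -3.14, and 6.02E+23 match, while 3. and E5 do not, because nat requires at least one digit on each side.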


    Comments

    Several forms:

    { this is a Pascal comment }      {(~})*}

    ; this is a Scheme comment

    -- this is an Ada comment         --(~newline)*, where newline = \n

    /* this is a C comment */

    A C comment pattern cannot be written in the form ba(~(ab))*ab, because ~ is restricted to single characters.

    One solution for ~(ab): b*(a*~(a|b)b*)*a*

    Because the regular expression is so complex, comments are handled by ad hoc methods in actual scanners.


    Ambiguity

    • Ambiguity: some strings can be matched by several different regular expressions

      • Either an identifier or a keyword: the keyword interpretation is preferred, as for “if”

      • Either a single token or a sequence of several tokens: the single token is preferred (the principle of longest substring), as for “abcdefg”
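One common way to apply the keyword preference is to scan every such string as an identifier first and then look it up in a table of reserved words. The sketch below illustrates that idea; the function name reservedLookup and the particular set of reserved words are assumptions, not taken from the slides.

```c
#include <string.h>

typedef enum { IF, THEN, ELSE, WHILE, DO, ID } TokenType;

/* Table of reserved words; the particular set is illustrative. */
static const struct { const char *str; TokenType tok; } reservedWords[] = {
    { "if", IF }, { "then", THEN }, { "else", ELSE },
    { "while", WHILE }, { "do", DO }
};

/* Classify a lexeme that already matched letter(letter|digit)*:
   a reserved word wins over the identifier interpretation. */
TokenType reservedLookup(const char *lexeme)
{
    size_t i;
    for (i = 0; i < sizeof reservedWords / sizeof reservedWords[0]; i++)
        if (strcmp(lexeme, reservedWords[i].str) == 0)
            return reservedWords[i].tok;
    return ID;
}
```

Because the whole lexeme is compared, a string such as iffy stays an identifier, which is consistent with the longest-substring principle.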


    White Space and Lookahead

    White space:

    • Delimiters: characters that are unambiguously part of other tokens also act as delimiters, such as the = in xtemp=ytemp

    • whitespace = (newline | blank | tab | comment)+

    • Free format or fixed format

    Lookahead:

    • Buffering of input characters and marking places for backtracking, as in the well-known Fortran example

      DO99I=1,10

      DO99I=1.10


    2.3 FINITE AUTOMATA


    Introduction to Finite Automata

    • Finite automata (finite-state machines): a mathematical way of describing particular kinds of algorithms.

    • There is a strong relationship between finite automata and regular expressions

      e.g.: identifier = letter (letter | digit)*

    [Diagram: state 1 goes to state 2 on letter; state 2 loops on letter and on digit]


    Introduction to Finite Automata

    • Transition:

      • Records a change from one state to another upon a match of the character or characters by which it is labeled.

    • Start state:

      • Where the recognition process begins

      • Drawn with an unlabeled arrow coming “from nowhere”

    • Accepting states:

      • Represent the end of the recognition process.

      • Drawn with a double-line border around the state in the diagram


    2.3.1 Definition of Deterministic Finite Automata


    The Concept of DFA

    DFA: an automaton in which the next state is uniquely given by the current state and the current input character.

    Definition of a DFA:

    A DFA (deterministic finite automaton) M consists of

    (1) An alphabet ∑,

    (2) A set of states S,

    (3) A transition function T: S × ∑ → S,

    (4) A start state s0 ∈ S,

    (5) A set of accepting states A ⊂ S


    The Concept of DFA

    The language accepted by a DFA M, written L(M):

    The set of strings of characters c1c2c3…cn with each ci ∈ ∑ such that there exist states s1 = T(s0, c1), s2 = T(s1, c2), …, sn = T(sn-1, cn) with sn an element of A (i.e., an accepting state).

    Accepting state sn means the same thing as the diagram:

    → s0 -c1→ s1 -c2→ s2 … sn-1 -cn→ sn


    Some Differences between the Definition of a DFA and the Diagram

    [Diagram: state start goes to state in_id on letter; in_id loops on letter and on digit]

    The definition is more mathematical:

    1) The definition does not restrict the set of states to numbers.

    2) The transitions are labeled not with characters but with names representing sets of characters.

    3) By the definition T: S × ∑ → S, T(s, c) must have a value for every s and c.

    • In the diagram, T(start, c) is defined only if c is a letter, and T(in_id, c) is defined only if c is a letter or a digit.

    • Error transitions are not drawn in the diagram but are simply assumed to always exist.


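    Examples of DFA

    Example 2.6: accepts strings that contain exactly one b

    [Diagram: the start state loops on “not b” and goes on b to an accepting state that loops on “not b”]

    Example 2.7: accepts strings that contain at most one b

    [Diagram: the same two states and transitions, with both states accepting]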


    Examples of DFA

    Example 2.8:

    digit = [0-9]

    nat = digit+

    signedNat = (+|-)? nat

    number = signedNat(“.”nat)?(E signedNat)?

    A DFA of nat:

    [Diagram: digit transitions]

    A DFA of signedNat:

    [Diagram: an optional sign transition followed by digit transitions]


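    Examples of DFA

    Example 2.9: A DFA for C comments

    (easier than writing down a regular expression)

    [Diagram: states 1 to 5. 1 goes to 2 on /; 2 goes to 3 on *; 3 loops on other and goes to 4 on *; 4 loops on *, goes back to 3 on other, and goes to the accepting state 5 on /]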
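The five-state DFA for C comments translates almost mechanically into code. The sketch below keeps the state numbers 1 to 5 and adds a state 0 for errors; the function name isCComment and the error state are assumptions made for this illustration.

```c
/* Returns 1 if s is exactly one complete C comment, else 0.
   States 1..5 follow the DFA for C comments; 0 is an added error state. */
int isCComment(const char *s)
{
    int state = 1;
    for (; *s != '\0' && state != 0; s++) {
        char c = *s;
        switch (state) {
        case 1: state = (c == '/') ? 2 : 0; break;
        case 2: state = (c == '*') ? 3 : 0; break;
        case 3: state = (c == '*') ? 4 : 3; break;   /* inside the comment */
        case 4: state = (c == '/') ? 5 : (c == '*') ? 4 : 3; break;
        case 5: state = 0; break;   /* text after the closing delimiter */
        }
    }
    return state == 5;
}
```

State 4 is the interesting one: after a * the automaton must remember that a following / would close the comment, while another * keeps it waiting and any other character sends it back to state 3.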


    2.3.2 Lookahead, Backtracking, and Nondeterministic Automata


    Typical Actions of a DFA Algorithm

    [Diagram: start goes to in_id on letter; in_id loops on letter and on digit and goes to finish on [other]; finish returns ID]

    • Making a transition: move the character from the input string to a string that accumulates the characters belonging to a single token (the token string value, or lexeme, of the token)

    • Reaching an accepting state: return the token just recognized, along with any associated attributes.

    • Reaching an error state: either back up in the input (backtracking) or generate an error token.


    Finite Automaton for an Identifier with Delimiter and Return Value

    • The error state represents the fact that either an identifier is not to be recognized (if we came from the start state) or a delimiter has been seen and we should now accept and generate an identifier token.

    • [other]: the brackets indicate that the delimiting character should be considered lookahead; it should be returned to the input string and not consumed.


    Finite Automaton for an Identifier with Delimiter and Return Value

    • This diagram also expresses the principle of longest substring described in Section 2.2.3: the DFA continues to match letters and digits (in state in_id) until a delimiter is found.

    • By contrast, the old diagram (start and in_id only, with in_id accepting) allowed the DFA to accept at any point while reading an identifier string.


    How to arrive at the start state in the first place

    (combine all the tokens into one DFA)


    Each of These Tokens Begins with a Different Character

    • Consider the tokens given by the strings :=, <=, and =

    • Each of these is a fixed string, so a DFA for each can be written separately:

      : then = returns ASSIGN

      < then = returns LE

      = returns EQ

    • Uniting all of their start states into a single start state gives one DFA for all three tokens


    Several Tokens Beginning with the Same Character

    • Consider the tokens <, <=, and <>, which return LT, LE, and NE

    • They cannot simply be drawn as three separate paths from one start state, each beginning with a transition on <, since that is not a DFA

    • The diagram can be rearranged into a DFA: a single transition on <, followed by = (return LE), > (return NE), or [other] (return LT)
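In code, the rearranged DFA amounts to consuming the < and then inspecting one more character. The sketch below shows this for a string cursor; the function name scanRelOp, the ERROR token, and the pointer-to-pointer interface are assumptions made for this illustration.

```c
typedef enum { LT, LE, NE, ERROR } TokenType;

/* Scan one token starting at *p, where the tokens are <, <= and <>.
   On success *p is advanced past the token only; the [other] character
   that ends LT is looked at but not consumed. */
TokenType scanRelOp(const char **p)
{
    const char *s = *p;
    if (*s != '<') return ERROR;   /* nothing consumed on error */
    s++;                           /* consume '<' */
    if (*s == '=') { *p = s + 1; return LE; }
    if (*s == '>') { *p = s + 1; return NE; }
    *p = s;                        /* lookahead character stays in the input */
    return LT;
}
```

The LT case corresponds to the [other] transition: the character after < decides the token but remains available for the next call.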


    Expand the Definition of a Finite Automaton

    • One solution to the problem is to expand the definition of a finite automaton

    • More than one transition from a state may exist for a particular character

      (NFA: non-deterministic finite automaton)

    • An algorithm is then developed for systematically turning these NFAs into DFAs


    ε-Transition

    • A transition that may occur without consulting the input string (and without consuming any characters)

    • It may be viewed as a “match” of the empty string.

      (This should not be confused with a match of the character ε in the input.)


    ε-Transitions Used in Two Ways

    • First: to express a choice of alternatives in a way that does not involve combining states

      Advantage: the original automata are kept intact, and only a new start state is added to connect them (for example, ε-transitions from a new start state to the automata for :=, <=, and =)

    • Second: to explicitly describe a match of the empty string.


    Definition of NFA

    • An NFA (non-deterministic finite automaton) M consists of

      • An alphabet ∑ and a set of states S,

      • A transition function T: S × (∑ ∪ {ε}) → ℘(S),

      • A start state s0 from S, and a set of accepting states A from S

    • The language accepted by M, written L(M), is

      • The set of strings of characters c1c2…cn, with

      • Each ci from ∑ ∪ {ε}, such that

      • There exist states s1 in T(s0, c1), s2 in T(s1, c2), …, sn in T(sn-1, cn) with sn an element of A.


    Some Notes

    • Any of the ci in c1c2…cn may be ε, and the string that is actually accepted is the string c1c2…cn with the ε’s removed (since the concatenation of s with ε is s itself). Thus, the string c1c2…cn may actually have fewer than n characters in it.

    • The sequence of transitions that accepts a particular string is not determined at each step by the state and the next input character.

      (Indeed, arbitrary numbers of ε’s can be introduced into the string at any point, corresponding to any number of ε-transitions in the NFA.)

    • An NFA does not represent an algorithm.

      However, it can be simulated by an algorithm that backtracks through every non-deterministic choice.


    Examples of NFAs

    Example 2.10

    [Diagram: an NFA with states 1 to 4 and transitions labeled a, b, and ε]

    • The string abb can be accepted by either of the following sequences of transitions:

      → 1 -a→ 2 -b→ 4 -ε→ 2 -b→ 4

      → 1 -a→ 3 -ε→ 4 -ε→ 2 -b→ 4 -ε→ 2 -b→ 4

    • This NFA accepts the language generated by the regular expression

      ab+|ab*|b*, or equivalently (a|ε)b*

    • The DFA on the left of the slide accepts the same language.


    Examples of NFAs

    Example 2.11

    [Diagram: an NFA with states 1 to 10 and transitions labeled a, b, c, and ε]

    • It accepts the string acab by making the following transitions (unlabeled arrows are ε-transitions):

      1 → 2 → 3 -a→ 4 → 7 → 2 → 5 -c→ 6 → 7 → 2 → 3 -a→ 4 → 7 → 8 → 9 -b→ 10

    • It accepts the same language as that generated by the regular expression (a | c)*b


    2.3.3 Implementation of Finite Automata in Code


    Ways to Translate a DFA or NFA into Code

    [Diagram: state 1 goes to state 2 on letter; state 2 loops on letter and on digit and goes to the accepting state 3 on [other]]

    The code for the DFA accepting identifiers:

    { starting in state 1 }
    if the next character is a letter then
      advance the input;
      { now in state 2 }
      while the next character is a letter or a digit do
        advance the input; { stay in state 2 }
      end while;
      { go to state 3 without advancing the input }
      accept;
    else
      { error or other cases }
    end if;


    Ways to Translate a DFA or NFA into Code

    Two drawbacks:

    • It is ad hoc: each DFA has to be treated slightly differently, and it is difficult to state an algorithm that will translate every DFA to code in this way.

    • The complexity of the code increases dramatically as the number of states rises or, more specifically, as the number of different states along arbitrary paths rises.


    Ways to Translate a DFA or NFA into Code

    A better method:

    • Use a variable to maintain the current state, and

    • write the transitions as a doubly nested case statement inside a loop,

    • where the first case statement tests the current state and the nested second level tests the input character.

    The code of the DFA for identifiers:

    state := 1; { start }
    while state = 1 or state = 2 do
      case state of
      1: case input character of
           letter: advance the input;
                   state := 2;
           else state := … { error or other };
         end case;
      2: case input character of
           letter, digit: advance the input;
                          state := 2; { actually unnecessary }
           else state := 3;
         end case;
      end case;
    end while;
    if state = 3 then accept else error;
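The state-variable method carries over to C with a switch on the state inside a loop. This is a sketch under a few assumptions: the input is a C string, state 0 stands in for the unspecified error state, and the function name scanIdentifier is made up for the illustration.

```c
#include <ctype.h>

/* Runs the identifier DFA on s with an explicit state variable.
   Returns the number of characters consumed if state 3 (accept) is
   reached, or -1 on error. State 0 is an added error state. */
int scanIdentifier(const char *s)
{
    int state = 1;      /* start */
    int pos = 0;        /* index of the next input character */
    while (state == 1 || state == 2) {
        unsigned char c = (unsigned char)s[pos];
        switch (state) {
        case 1:
            if (isalpha(c)) { pos++; state = 2; }  /* advance the input */
            else state = 0;                        /* error or other token */
            break;
        case 2:
            if (isalnum(c)) pos++;                 /* stay in state 2 */
            else state = 3;                        /* delimiter is not consumed */
            break;
        }
    }
    return state == 3 ? pos : -1;
}
```

For the input xtemp=ytemp the function would consume the five characters of xtemp and stop at the = without consuming it.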


    Ways to Translate a DFA or NFA into Code

    Generic code:

    • Express the DFA as a data structure and then write “generic” code that takes its actions from the data structure.

    • A transition table (a two-dimensional array), indexed by state and input character c from the alphabet, expresses the values of the transition function T.


    Ways to Translate a DFA or NFA into Code

    The transition table of the DFA for identifiers:

    state | letter | digit | other | accepting
    1     | 2      |       |       | no
    2     | 2      | 2     | [3]   | no
    3     |        |       |       | yes

    • The first state listed is assumed to be the start state.

    • The last column indicates accepting states.

    • Brackets indicate “noninput-consuming” transitions.

    • Blank entries represent error transitions.
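A table-driven scanner for the identifier DFA might look like the sketch below. The array names T, advance, and accepting, the character-class helper classOf, and the use of state 0 as an error state are assumptions made for the illustration; only the three-state DFA itself comes from the slides.

```c
#include <ctype.h>

enum { LETTER, DIGIT, OTHER, NCLASSES };   /* input character classes */
enum { NSTATES = 4, ERR = 0 };             /* state 0 is an added error state */

/* T[state][class]: next state of the identifier DFA (states 1..3). */
static const int T[NSTATES][NCLASSES] = {
    /*          letter digit other */
    /* 0 */   { ERR,   ERR,  ERR },
    /* 1 */   { 2,     ERR,  ERR },
    /* 2 */   { 2,     2,    3   },
    /* 3 */   { ERR,   ERR,  ERR }
};

/* advance[state][class]: 0 marks the bracketed, non-consuming transition. */
static const int advance[NSTATES][NCLASSES] = {
    { 0, 0, 0 }, { 1, 0, 0 }, { 1, 1, 0 }, { 0, 0, 0 }
};

static const int accepting[NSTATES] = { 0, 0, 0, 1 };

static int classOf(unsigned char c)
{
    return isalpha(c) ? LETTER : isdigit(c) ? DIGIT : OTHER;
}

/* Generic driver: returns characters consumed on acceptance, -1 on error. */
int runDFA(const char *s)
{
    int state = 1, pos = 0;
    while (state != ERR && !accepting[state]) {
        int cls = classOf((unsigned char)s[pos]);
        if (advance[state][cls]) pos++;   /* consume unless bracketed */
        state = T[state][cls];
    }
    return accepting[state] ? pos : -1;
}
```

The driver loop never mentions letters or digits; changing the three arrays is enough to run a different DFA, which is the point of the generic approach.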


    Features of Table-Driven Method

    Table driven: tables are used to direct the progress of the algorithm.

    The advantages:

    • The size of the code is reduced, the same code works for many different problems, and the code is easier to change (maintain).

    The disadvantages:

    • The tables can become very large.

    • Table-driven methods often rely on table-compression methods such as sparse-array representations, although there is usually a time penalty to be paid for such compression, since table lookup becomes slower. Since scanners must be efficient, these methods are rarely used for them.

    NFAs can be implemented in similar ways to DFAs, except that NFAs are nondeterministic.


    2.4 From Regular Expression To DFAs


    Main Purpose

    Study an algorithm: translating a regular expression into a DFA via an NFA.

    Regular expression → NFA → DFA → Program


    2.4.1 From a Regular Expression to an NFA




    The Idea of Thompson’s Construction

    • Use ε-transitions

      • to “glue together” the machines of each piece of a regular expression

      • to form a machine that corresponds to the whole expression

    • Basic regular expressions

      The NFAs for basic regular expressions of the form a, ε, or Φ


    The Idea of Thompson’s Construction

    • Concatenation: to construct an NFA equivalent to rs

      1) Connect the accepting state of the machine of r to the start state of the machine of s by an ε-transition.

      2) Take the start state of the machine of r as the start state of the new machine and the accepting state of the machine of s as its accepting state.

      This machine accepts L(rs) = L(r)L(s) and so corresponds to the regular expression rs.


    The Idea of Thompson’s Construction

    • Choice among alternatives: to construct an NFA equivalent to r | s

      Add a new start state and a new accepting state, and connect them with ε-transitions: from the new start state to the start states of the machines of r and s, and from their accepting states to the new accepting state.

      Clearly, this machine accepts the language L(r|s) = L(r) ∪ L(s), and so corresponds to the regular expression r|s.


    The Idea of Thompson’s Construction

    • Repetition: given a machine that corresponds to r, construct a machine that corresponds to r*

      1) Add two new states, a start state and an accepting state, connected by ε-transitions to the start state and from the accepting state of the machine of r.

      2) The repetition is afforded by a new ε-transition from the accepting state of the machine of r to its start state.

      3) Draw an ε-transition from the new start state to the new accepting state, so that the empty string is accepted.

      This construction is not unique; simplifications are possible in many cases.
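The three constructions can be sketched in C as functions that combine NFA fragments. Everything below is an illustrative assumption rather than code from the slides: the State and Frag types, the function names, the 64-state limit (there is no overflow check), and the rule that each fragment is used only once. A small simulator is included so the result can be exercised.

```c
#include <stdint.h>

#define MAXSTATES 64            /* state sets are kept in a 64-bit mask */

typedef struct {
    char sym;                   /* label of the symbol edge, 0 if none */
    int  symTo;                 /* target of the symbol edge */
    int  eps[2];                /* up to two ε-edges, -1 if unused */
} State;

typedef struct { int start, accept; } Frag;   /* one NFA fragment */

static State st[MAXSTATES];
static int nstates = 0;

static int newState(void)
{
    st[nstates].sym = 0; st[nstates].symTo = -1;
    st[nstates].eps[0] = st[nstates].eps[1] = -1;
    return nstates++;
}

static void addEps(int from, int to)
{
    st[from].eps[st[from].eps[0] < 0 ? 0 : 1] = to;
}

Frag nfaSymbol(char c)                  /* basic expression a */
{
    Frag f;
    f.start = newState(); f.accept = newState();
    st[f.start].sym = c; st[f.start].symTo = f.accept;
    return f;
}

Frag nfaConcat(Frag r, Frag s)          /* rs */
{
    Frag f;
    addEps(r.accept, s.start);
    f.start = r.start; f.accept = s.accept;
    return f;
}

Frag nfaChoice(Frag r, Frag s)          /* r|s */
{
    Frag f;
    f.start = newState(); f.accept = newState();
    addEps(f.start, r.start);   addEps(f.start, s.start);
    addEps(r.accept, f.accept); addEps(s.accept, f.accept);
    return f;
}

Frag nfaStar(Frag r)                    /* r* */
{
    Frag f;
    f.start = newState(); f.accept = newState();
    addEps(f.start, r.start);   addEps(f.start, f.accept);
    addEps(r.accept, r.start);  addEps(r.accept, f.accept);
    return f;
}

/* Add every state reachable by ε-edges until the set stops growing. */
static uint64_t closure(uint64_t set)
{
    uint64_t old;
    do {
        int i, k;
        old = set;
        for (i = 0; i < nstates; i++)
            if (set >> i & 1)
                for (k = 0; k < 2; k++)
                    if (st[i].eps[k] >= 0)
                        set |= (uint64_t)1 << st[i].eps[k];
    } while (set != old);
    return set;
}

/* Simulate the NFA on s by tracking the set of current states. */
int nfaMatch(Frag f, const char *s)
{
    uint64_t cur = closure((uint64_t)1 << f.start);
    for (; *s; s++) {
        uint64_t next = 0;
        int i;
        for (i = 0; i < nstates; i++)
            if ((cur >> i & 1) && st[i].sym == *s)
                next |= (uint64_t)1 << st[i].symTo;
        cur = closure(next);
    }
    return (cur >> f.accept & 1) != 0;
}
```

For instance, nfaChoice(nfaConcat(nfaSymbol('a'), nfaSymbol('b')), nfaSymbol('a')) builds a machine for ab|a. Each state needs at most two ε-edges because a fragment's accepting state has no outgoing edges until the fragment is combined.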


    Examples of NFAs Construction

    Example 2.12:

    Translate the regular expression ab|a into an NFA

    [Diagrams: the machines for a and for b, then for ab, then for ab|a]


    Examples of NFAs Construction

    Example 2.13:

    Translate the regular expression letter(letter|digit)* into an NFA

    [Diagrams: the machines for letter and for digit, then for letter|digit, then for (letter|digit)*, then for letter(letter|digit)*]


    2.4.2 From an NFA to a DFA




    Goal and Methods

    • Goal:

      Given an arbitrary NFA, construct an equivalent DFA (i.e., one that accepts precisely the same strings).

    • Two processes:

      (1) Eliminating ε-transitions

      ε-closure: the set of all states reachable by ε-transitions from a state or states

      (2) Eliminating multiple transitions from a state on a single input character

      Keeping track of the set of states that are reachable by matching a single character

      Both these processes lead us to consider sets of states instead of single states. Thus, it is not surprising that the DFA we construct has sets of states of the original NFA as its states.


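    Subset Construction

    The ε-closure of a set of states:

    The ε-closure of a single state s is the set of states reachable from s by a series of zero or more ε-transitions, so it always contains s itself. The ε-closure of a set of states is the union of the ε-closures of its members.

    Example 2.14: the regular expression a*

    [Diagram: NFA for a* with states 1 to 4 and one transition labeled a]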
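A small sketch of the ε-closure computation, using a bit mask for a set of states. The edge list for the a* NFA is an assumption about the missing diagram (the usual Thompson layout: 1 -ε→ 2, 1 -ε→ 4, 2 -a→ 3, 3 -ε→ 2, 3 -ε→ 4), and the names epsEdges and epsClosure are made up for the illustration.

```c
/* NFA for a* with states 1..4 (assumed layout of the diagram):
   1 -ε-> 2, 1 -ε-> 4, 2 -a-> 3, 3 -ε-> 2, 3 -ε-> 4.
   A set of states is a bit mask; bit i stands for state i. */
enum { NSTATES = 5 };                  /* index 0 is unused */

static const unsigned epsEdges[NSTATES] = {
    0,
    1u << 2 | 1u << 4,    /* from state 1 */
    0,                    /* from state 2 */
    1u << 2 | 1u << 4,    /* from state 3 */
    0                     /* from state 4 */
};

/* ε-closure of a set: keep adding ε-successors until nothing changes. */
unsigned epsClosure(unsigned set)
{
    unsigned old;
    do {
        int s;
        old = set;
        for (s = 1; s < NSTATES; s++)
            if (set & 1u << s)
                set |= epsEdges[s];
    } while (set != old);
    return set;
}
```

Under the assumed edges, the closure of state 1 is {1,2,4} and the closure of state 3 is {2,3,4}, while states 2 and 4 have only themselves in their closures.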


    The Subset Construction Algorithm


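    Examples of Subset Construction

    For the NFA of a* with states 1 to 4, the subset construction yields a DFA with two states:

    {1,2,4} (the start state) and {2,3,4}

    [Diagram: {1,2,4} goes to {2,3,4} on a; {2,3,4} loops on a]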


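    Examples of Subset Construction

    For the NFA of letter(letter|digit)* with states 1 to 10, the subset construction yields a DFA with four states:

    {1} (the start state), {2,3,4,5,7,10}, {4,5,6,7,9,10}, and {4,5,7,8,9,10}

    [Diagram: {1} goes to {2,3,4,5,7,10} on letter; from each of the other three states, letter leads to {4,5,6,7,9,10} and digit leads to {4,5,7,8,9,10}]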
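The subset construction for this example can be sketched as a worklist loop over sets of NFA states. The NFA edge list below is an assumption about the missing diagram (the Thompson machine for letter(letter|digit)* numbered 1 to 10), chosen so that it reproduces the state sets listed above; all names are made up for the illustration, and the fixed array size has no overflow check.

```c
enum { NNFA = 11, LETTER = 0, DIGIT = 1, NSYM = 2, MAXDFA = 16 };

#define BIT(s) (1u << (s))

/* Assumed ε-edges of the NFA, indexed by state 0..10 (0 unused):
   2->3, 3->4, 3->10, 4->5, 4->7, 6->9, 8->9, 9->4, 9->10. */
static const unsigned eps[NNFA] = {
    0, 0, BIT(3), BIT(4) | BIT(10), BIT(5) | BIT(7), 0,
    BIT(9), 0, BIT(9), BIT(4) | BIT(10), 0
};

/* Assumed symbol edges: 1 -letter-> 2, 5 -letter-> 6, 7 -digit-> 8. */
static const unsigned step[NNFA][NSYM] = {
    {0, 0}, {BIT(2), 0}, {0, 0}, {0, 0}, {0, 0}, {BIT(6), 0},
    {0, 0}, {0, BIT(8)}, {0, 0}, {0, 0}, {0, 0}
};

unsigned dstates[MAXDFA];      /* DFA state i = set of NFA states */
int dtran[MAXDFA][NSYM];       /* -1 means no transition (error) */
int ndfa;

static unsigned closure(unsigned set)
{
    unsigned old;
    do {
        int s;
        old = set;
        for (s = 1; s < NNFA; s++)
            if (set & BIT(s)) set |= eps[s];
    } while (set != old);
    return set;
}

/* States reachable from the set by one transition on sym. */
static unsigned move(unsigned set, int sym)
{
    unsigned out = 0;
    int s;
    for (s = 1; s < NNFA; s++)
        if (set & BIT(s)) out |= step[s][sym];
    return out;
}

void subsetConstruction(void)
{
    int i, j, sym;
    ndfa = 0;
    dstates[ndfa++] = closure(BIT(1));       /* start state of the DFA */
    for (i = 0; i < ndfa; i++)               /* dstates doubles as the worklist */
        for (sym = 0; sym < NSYM; sym++) {
            unsigned target = closure(move(dstates[i], sym));
            dtran[i][sym] = -1;
            if (target == 0) continue;       /* no transition on sym */
            for (j = 0; j < ndfa && dstates[j] != target; j++)
                ;
            if (j == ndfa) dstates[ndfa++] = target;   /* new DFA state */
            dtran[i][sym] = j;
        }
}
```

Each newly discovered set is appended to dstates and processed later by the same loop, so the construction stops once no transition leads to a set that has not been seen.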


    2.4.3 Simulating an NFA using the Subset Construction


    One Way of Simulating an NFA

    • NFAs can be implemented in similar ways to DFAs, except that NFAs are nondeterministic

      • There may be many different sequences of transitions that must be tried.

      • Store up transitions that have not yet been tried and backtrack to them on failure.


    Another Way of Simulating an NFA

    Use the subset construction

    • Instead of constructing all the states of the associated DFA,

    • construct only the state at each point that is indicated by the next input character

      The advantage: there is no need to construct the entire DFA

      Example: in a DFA with states {1,2,6}, {3,4,7,8}, and {5,8}, where a leads from the first state to the second and b from the second to the third, the input is the single character a. We construct the start state {1,2,6} and then the second state {3,4,7,8}, to which we move on matching the a.

      Since no b follows, we accept without ever generating the state {5,8}.


    Another Way of Simulating an NFA

    The disadvantage: a state may be constructed many times if the path contains loops

    Example: given the input string r2d3, the sequence of states is

    {1} -r→ {2,3,4,5,7,10} -2→ {4,5,7,8,9,10} -d→ {4,5,6,7,9,10} -3→ {4,5,7,8,9,10}

    If these states are constructed as the transitions occur, then all the states of the DFA have been constructed, and the state {4,5,7,8,9,10} has even been constructed twice.

    This is less efficient than constructing the entire DFA once.


    2.4.4 Minimizing the Number of States in a DFA




    Why Minimize?

    The process of deriving a DFA algorithmically from a regular expression has the unfortunate property that the resulting DFA may be more complex than necessary.

    Example: the DFA derived for the regular expression a* has two states, while an equivalent DFA needs only one accepting state that loops on a.


    An Important Result from Automata Theory for Minimizing

    • Given any DFA, there is an equivalent DFA containing a minimum number of states, and, that this minimum-state DFA is unique (except for renaming of states)

    • It is also possible to directly obtain this minimum-state DFA from any given DFA.


    Algorithm for Obtaining the Minimum-State DFA

    1. Begin with the most optimistic assumption possible: create two sets, one consisting of all the accepting states and the other consisting of all the non-accepting states.

    2. Given this partition of the states of the original DFA, consider the transitions on each character a of the alphabet.

    (1) If all accepting states have transitions on a to accepting states, then this defines an a-transition from the new accepting state (the set of all the old accepting states) to itself.

    (2) If all accepting states have transitions on a to non-accepting states, then this defines an a-transition from the new accepting state to the new non-accepting state (the set of all the old non-accepting states).


    Algorithm for Obtaining the Minimum-State DFA

    (3) On the other hand, if there are two accepting states s and t that have transitions on a that land in different sets, then no a-transition can be defined for this grouping of the states. We say that a distinguishes the states s and t, and the set must be split according to where its a-transitions land. The same reasoning applies to every other set of states.

    (4) We must also consider error transitions to an error state that is non-accepting. If there are accepting states s and t such that s has an a-transition to another accepting state, while t has no a-transition at all (i.e., an error transition), then a distinguishes s and t.

    3. If any further sets are split, we must return and repeat the process from the beginning. This process continues until either all sets contain only one element (in which case, we have shown the original DFA to be minimal) or until no further splitting of sets occurs.
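The splitting procedure can be sketched as repeated partition refinement. The DFA used below is the four-state identifier DFA produced by the subset construction, renumbered 0 to 3; the names tran, group, target, and minimize are assumptions made for this illustration, and error transitions are represented by -1.

```c
enum { NSTATES = 4, NSYM = 2 };     /* symbols: 0 = letter, 1 = digit */

/* DFA for letter(letter|digit)* from the subset construction;
   -1 is the implicit error transition. */
static const int tran[NSTATES][NSYM] = {
    { 1, -1 }, { 2, 3 }, { 2, 3 }, { 2, 3 }
};
static const int accepting[NSTATES] = { 0, 1, 1, 1 };

int group[NSTATES];     /* group[s] = set of the partition holding s */

/* Set reached from state s on symbol a; -1 for the error state. */
static int target(int s, int a)
{
    return tran[s][a] < 0 ? -1 : group[tran[s][a]];
}

/* Returns the number of states of the minimum-state DFA. */
int minimize(void)
{
    int ngroups = 2, changed, s, t, a;
    for (s = 0; s < NSTATES; s++)
        group[s] = accepting[s];        /* initial split: accepting or not */
    do {
        int newGroup[NSTATES], n = 0;
        changed = 0;
        for (s = 0; s < NSTATES; s++) {
            newGroup[s] = -1;
            for (t = 0; t < s && newGroup[s] < 0; t++) {
                int same = group[s] == group[t];
                for (a = 0; same && a < NSYM; a++)
                    same = target(s, a) == target(t, a);
                if (same) newGroup[s] = newGroup[t];  /* not distinguished */
            }
            if (newGroup[s] < 0) newGroup[s] = n++;   /* starts a new set */
        }
        if (n != ngroups) changed = 1;  /* some set was split: repeat */
        ngroups = n;
        for (s = 0; s < NSTATES; s++) group[s] = newGroup[s];
    } while (changed);
    return ngroups;
}
```

Two states stay together only if they are in the same set and every symbol sends them to the same set (or both to the error state), which is exactly the "distinguishes" test in steps (3) and (4).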


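    Examples of Minimizing DFA

    Example 2.18: the regular expression letter(letter|digit)*

    The DFA from the subset construction has the non-accepting start state {1} and the accepting states {2,3,4,5,7,10}, {4,5,6,7,9,10}, and {4,5,7,8,9,10}.

    The initial partition is therefore

    {1}

    {2,3,4,5,7,10}, {4,5,6,7,9,10}, {4,5,7,8,9,10}

    All three accepting states go to accepting states on both letter and digit, so no character distinguishes them and they merge into one state.

    [Diagram: the minimum-state DFA, in which state 1 goes to state 2 on letter and state 2 loops on letter and on digit]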


    2.5 Implementation of a Tiny Scanner


    2.6 Use of Lex to Generate a Scanner Automatically


    Introduction to Lex

    • Use the Lex scanner generator to generate a scanner from a description of the tokens of TINY as regular expressions. The most popular version of Lex is called flex (for Fast Lex).


    Introduction to Lex

    Lex source code lex.l → Lex compiler → lex.yy.c

    lex.yy.c → C compiler → a.out

    input stream → a.out → tokens


    2.6.1 Lex conventions for regular expression


    The table of Conventions


    Conventions

    • Single characters, or strings of characters, are matched by writing the characters in sequence.

    • Metacharacters are matched as actual characters by surrounding them with quotes. Quotes can also be written around characters that are not metacharacters, where they have no effect.

      i) To match a left parenthesis, we must write "("

      ii) An alternative is to use the backslash metacharacter \

      iii) To match the character sequence (*, we have to write \(\* or "(*"


    Conventions

    • Metacharacters: *, +, (, ), |, ?

      Example: The set of strings of a’s and b’s that begin with either aa or bb and have an optional c at the end:

      (aa|bb)(a|b)*c? or, equivalently, ("aa"|"bb")("a"|"b")*"c"?

      The Lex convention for character classes (sets of characters) is to write them between square brackets. The above example can be written as:

      (aa|bb)[ab]*c?
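
To make the meaning of this expression concrete, here is a small hand-written recognizer in C. It is a sketch only, not what Lex would generate, and the function name `matches_example` is a hypothetical choice for illustration. It accepts exactly the strings described above: aa or bb, then any number of a's and b's, then an optional c.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the whole string s matches (aa|bb)[ab]*c?, otherwise 0. */
int matches_example(const char *s)
{
    assert(s != NULL);

    /* (aa|bb): the first two characters are both a or both b. */
    if ((s[0] != 'a' && s[0] != 'b') || s[1] != s[0])
        return 0;
    s += 2;

    /* [ab]*: skip any run of a's and b's. */
    while (*s == 'a' || *s == 'b')
        s++;

    /* c?: at most one trailing c. */
    if (*s == 'c')
        s++;

    /* Nothing may follow. */
    return *s == '\0';
}
```

For instance, `matches_example("bbabac")` accepts, while `matches_example("ab")` and `matches_example("aacc")` reject.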


    Conventions

    • Ranges of characters are written using a hyphen.

      In Lex, the expression [0-9] means any of the digits zero through nine.

    • A period is a metacharacter that represents a set of characters:

      It represents any character except a newline.

    • Complementary sets are also written in this notation, using the caret ^ as the first character inside the brackets.

      [^0-9abc] means any character that is not a digit and is not one of the letters a, b, or c.
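
One way to picture a character class is as a membership table indexed by character code. The sketch below builds such a table from the text between the brackets, handling ranges and a leading caret. The function name `build_class` is hypothetical, and the sketch deliberately ignores escape sequences and other details that a real Lex implementation handles.

```c
#include <assert.h>
#include <string.h>

/* Sets table[c] to 1 if character code c belongs to the class whose body
   (the text between [ and ]) is given, and to 0 otherwise. Supports ranges
   such as 0-9 and a leading ^ for the complement; escapes are not handled. */
void build_class(const char *body, unsigned char table[256])
{
    size_t i = 0, n;
    int negate = 0;

    assert(body != NULL);
    n = strlen(body);
    memset(table, 0, 256);

    if (n > 0 && body[0] == '^') {      /* complementary set */
        negate = 1;
        i = 1;
    }
    while (i < n) {
        unsigned char lo = (unsigned char)body[i];
        unsigned char hi = lo;
        if (i + 2 < n && body[i + 1] == '-') {   /* range lo-hi */
            hi = (unsigned char)body[i + 2];
            i += 3;
        } else {                                  /* single character */
            i += 1;
        }
        for (unsigned c = lo; c <= hi; c++)
            table[c] = 1;
    }
    if (negate)
        for (int c = 0; c < 256; c++)
            table[c] = !table[c];
}
```

After `build_class("^0-9abc", table)`, the entry `table['x']` is 1 while `table['5']` and `table['a']` are 0, which agrees with the description of [^0-9abc] above.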


    2.6.2 The format of a Lex input file


    The format

    { definitions }

    %%

    { rules }

    %%

    { auxiliary routines}

    • The definition section occurs before the first %%.

      Any C code that must be inserted external to any function should appear in this section between the delimiters %{ and %}.

    • Names for regular expressions must also be defined in this section. A name is defined by writing it on a separate line starting in the first column and following it (after one or more blanks) by the regular expression it represents.


    The format

    • The second section: rules

      These consist of a sequence of regular expressions, each followed by the C code that is to be executed when the corresponding regular expression is matched.


    The format

    • The third section: auxiliary routines

      This section contains C code for any auxiliary routines that are called in the second section and not defined elsewhere.

      This section may also contain a main program, if we want to compile the Lex output as a standalone program.

      This section can also be missing, in which case the second %% need not be written. The first %% is always necessary.


    Examples

    The following Lex input specifies a scanner that adds line numbers to text, sending its output to the screen.

    %{
    /* a Lex program that adds line numbers
    to lines of text, printing the
    new text to the standard output
    */
    #include <stdio.h>
    int lineno = 1;
    %}
    line .*\n
    %%
    {line} { printf("%5d %s", lineno++, yytext); }
    %%
    int main(void)
    { yylex(); return 0; }


    Examples

    Running the program obtained from Lex on this input file itself gives the following output:

        1 %{
        2 /* a Lex program that adds line numbers
        3 to lines of text, printing the
        4 new text to the standard output
        5 */
        6 #include <stdio.h>
        7 int lineno = 1;
        8 %}
        9 line .*\n
       10 %%
       11 {line} { printf("%5d %s", lineno++, yytext); }
       12 %%
       13 int main(void)
       14 { yylex(); return 0; }
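
For comparison, the same behavior can be sketched in plain C without Lex. The function `number_lines` below is a hypothetical helper written for this illustration: it writes into a caller-supplied buffer instead of the screen, and it uses the same "%5d %s" format as the Lex action. A final piece of text with no newline is copied without a number, since the pattern .*\n only matches complete lines and Lex's default action simply echoes unmatched characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies text into out, prefixing every newline-terminated line with its
   line number. Returns the number of lines numbered, or -1 if out (of
   capacity cap) is too small. */
int number_lines(const char *text, char *out, size_t cap)
{
    int lineno = 1;
    size_t used = 0;
    const char *nl;
    int n;

    assert(text != NULL && out != NULL && cap > 0);
    out[0] = '\0';

    while ((nl = strchr(text, '\n')) != NULL) {
        int len = (int)(nl - text) + 1;          /* line length with '\n' */
        n = snprintf(out + used, cap - used, "%5d %.*s", lineno, len, text);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
        lineno++;
        text = nl + 1;
    }

    /* Trailing text without a newline is echoed unchanged. */
    n = snprintf(out + used, cap - used, "%s", text);
    if (n < 0 || (size_t)n >= cap - used)
        return -1;

    return lineno - 1;
}
```

Calling `number_lines("a\nbb\n", buf, sizeof buf)` numbers two lines and leaves each one in `buf` with a right-aligned five-character line number in front of it.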


    Additional features of Lex input

    • When an input string matches more than one rule, Lex uses a priority system to resolve the ambiguity.

      • First, Lex always matches the longest possible substring (so Lex always generates a scanner that follows the longest substring principle).

      • Then, if the longest substring still matches two or more rules, Lex picks the first rule in the order they are listed in the action section.

    • Suppose the rules and actions are as follows:

      .*\n ;

      {ends_with_a} ECHO;

      {begins_with_a} ECHO;

      The program produced by Lex would generate no output at all for any file, since every line of input is matched by the first rule, which takes priority over the later rules.


    Summary

    • Ambiguity resolution

      • Lex's output will always first match the longest possible substring to a rule.

      • If two or more rules cause substrings of equal length to be matched, then Lex's output will pick the rule listed first in the action section.

      • If no rule matches any nonempty substring, then the default action copies the next character to the output and continues.
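
The three resolution rules above can be sketched as a small selection routine in C. Everything here is an assumption made for illustration: the three sample rules (the keyword if, lowercase identifiers, and unsigned numbers), the `matcher` type, and the function `select_rule` are hypothetical and are not part of Lex. Each rule reports how many characters it can match at the current position; the routine keeps the longest match and, on a tie, the rule listed first.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* A rule returns the length of its match at the start of s, or 0 if none. */
typedef size_t (*matcher)(const char *s);

static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s)
{
    size_t n = 0;
    while (islower((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_num(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

#define NRULES 3
static const matcher rules[NRULES] = { match_if, match_id, match_num };

/* Returns the index of the selected rule and stores the match length in
   *len. Returns -1 with *len == 0 when no rule matches a nonempty prefix;
   a caller would then copy one character to the output and continue. */
int select_rule(const char *s, size_t *len)
{
    int best = -1;

    assert(s != NULL && len != NULL);
    *len = 0;
    for (int i = 0; i < NRULES; i++) {
        size_t n = rules[i](s);
        if (n > *len) {     /* strictly longer: earlier rule wins ties */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

On the input "if x" both the keyword rule and the identifier rule match two characters, so the keyword rule wins because it is listed first. On "ifx" the identifier rule matches three characters and is chosen by the longest-substring principle.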


    Summary

    • Insertion of C code

      • Any text written between %{ and %} in the definition section will be copied directly to the output program external to any procedure.

      • Any text in the auxiliary procedures section will be copied directly to the output program at the end of the Lex code.

      • Any code that follows a regular expression (by at least one space) in the action section (after the first %%) will be inserted at the appropriate place in the recognition procedure yylex and will be executed when a match of the corresponding regular expression occurs.

      • The C code representing an action may be either a single C statement or a compound C statement consisting of any declarations and statements surrounded by curly brackets.


    Lex internal names


    End of Chapter Two

    THANKS

