Programming Language Implementation Lexical and Syntax Analysis Part II

Outline
• Overview of parsing
  • Introduction
  • Parsing
• Some more details
  • Lexical analysis
  • Parsing

Reference
• A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley Publishing Company, 1988. Chapters 1, 2, 3, 4, and 5.

Introduction
• A programming language can be defined by describing its
  • Syntax and
  • Semantics
• Grammar-oriented compilation technique
  • Syntax-directed translation
• Example
  • Infix expressions translated to postfix expressions
  • Input to output mapping
    • 9-5+2 to 95-2+

Example
• Syntax
  • e → e + d
  • e → e - d
  • e → d
  • d → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• Tokens
  • + - 0 1 2 3 4 5 6 7 8 9

Parsing
• Parsing is the process of determining whether a string of tokens can be generated by a grammar
• Parsing methods
  • Top-down parsing
  • Bottom-up parsing

Parsing
• Top-down parsing
  • At node n (labeled with nonterminal A), select one of the productions for A and construct children at n for the symbols on the right-hand side (RHS) of the production
  • Find the next node at which a subtree is to be constructed
• Recursive-descent parsing is a top-down syntax analysis method in which a set of recursive procedures is executed to process the input
  • A procedure is associated with each nonterminal of the grammar
  • Left-recursive rules can make a recursive-descent parser loop forever

Parsing
• Bottom-up parsing
  • Bottom-up parsing constructs a parse tree for an input string of tokens, beginning at the leaves and working up towards the root.
  • Shift-reduce parsing
  • Operator precedence parsing
  • LR parsing

Parsing
• Removing left recursion
• Example
  • Left-recursive grammar
    A → A α | β
  • Equivalent grammar without left recursion
    A → β R
    R → α R | ε
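The transformation above can be tried on the expression grammar from the earlier example. The following is a minimal sketch, assuming the rewritten grammar e → d R, R → + d R | - d R | ε; the function name infix_to_postfix and the helper names are hypothetical, and the tail recursion in R is written as a loop.

```python
def infix_to_postfix(text: str) -> str:
    """Translate single-digit + / - expressions to postfix.

    Assumed grammar (left recursion removed):
        e -> d R
        R -> + d R | - d R | epsilon
        d -> 0 | 1 | ... | 9
    """
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0
    out = []

    def digit():
        # d -> 0 | ... | 9 : emit the digit itself
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in "0123456789":
            out.append(tokens[pos])
            pos += 1
        else:
            raise SyntaxError(f"digit expected at token {pos}")

    def rest():
        # R -> + d R | - d R, with the tail call turned into a loop
        nonlocal pos
        while pos < len(tokens) and tokens[pos] in "+-":
            op = tokens[pos]
            pos += 1
            digit()
            out.append(op)  # operator is emitted after its right operand
        # R -> epsilon: nothing to emit

    digit()
    rest()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return "".join(out)


print(infix_to_postfix("9-5+2"))
```

Because no procedure calls itself before consuming a token, the parser always makes progress, which is the point of removing the left recursion.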

Some Important Basic Definitions
lexical: Of or relating to the morphemes of a language.
morpheme: A meaningful linguistic unit that cannot be divided into smaller meaningful parts.
lexical analysis: The task concerned with breaking an input into its smallest meaningful units, called tokens.

Some Important Basic Definitions
syntax: The way in which words are put together to form phrases, clauses, or sentences. The rules governing the formation of statements in a programming language.
syntax analysis: The task concerned with fitting a sequence of tokens into a specified syntax.
parsing: To break a sentence down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.

Some Important Basic Definitions
parsing = lexical analysis + syntax analysis
semantic analysis: The task concerned with calculating the program’s meaning.

Regular Expressions
Symbol: a
  A regular expression formed by a.
Alternation: M | N
  A regular expression formed by M or N.
Concatenation: M · N
  A regular expression formed by M followed by N.
Epsilon: ε
  The empty string.
Repetition: M*
  A regular expression formed by zero or more repetitions of M.

Building a Recognizer for a Language
General approach:
1. Build a deterministic finite automaton (DFA) from the regular expression E
2. Execute the DFA to determine whether an input string belongs to L(E)
Note: The DFA construction is done automatically by a tool such as lex.
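Step 2 can be sketched as a table-driven loop. This is an illustration only: the transition table below is written by hand for the language b*abb rather than generated by a tool, and the names DELTA and dfa_accepts are hypothetical.

```python
# Hand-written DFA for b*abb (an assumption made for this sketch).
# Keys are (state, input symbol); a missing key means "no transition".
DELTA = {
    (0, "b"): 0,
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "b"): 3,
}
START = 0
ACCEPTING = {3}


def dfa_accepts(text: str) -> bool:
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = START
    for ch in text:
        state = DELTA.get((state, ch))
        if state is None:  # dead end: the input cannot be in the language
            return False
    return state in ACCEPTING


for sample in ["abb", "bbabb", "ab", "abba"]:
    print(sample, dfa_accepts(sample))
```

Each input character costs one table lookup, which is why lexical analyzers are usually run as DFAs.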

Finite Automata
A nondeterministic finite automaton (NFA) A = {S, Σ, s0, F, move} consists of:
1. A set of states S
2. A set of input symbols Σ (the input symbol alphabet)
3. A state s0 that is distinguished as the start state
4. A set of states F distinguished as the accepting states
5. A transition function move that maps state-symbol pairs into sets of states.
In a deterministic finite automaton (DFA), the function move maps each state-symbol pair to a unique state.

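Finite Automata
What languages are accepted by these automata?
A Deterministic Finite Automaton (DFA): [Figure: states 0 to 3, start state 0, edges labeled a, b, b from 0 to 3, and a loop labeled b on state 0] b*abb
A Nondeterministic Finite Automaton (NFA): [Figure: states 0 to 3, start state 0, edges labeled a, b, b from 0 to 3, and loops labeled a and b on state 0] (a|b)*abb
(Aho, Sethi, Ullman, p. 114)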
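One way to see why the NFA accepts (a|b)*abb is to simulate it by tracking every state it could be in. The sketch below assumes the transitions of the figure (loops on state 0 for a and b, then a, b, b leading to state 3); the names NFA_MOVE and nfa_accepts are hypothetical.

```python
# Transition relation of the NFA for (a|b)*abb, as assumed from the figure.
# Each (state, symbol) pair maps to a SET of possible next states.
NFA_MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


def nfa_accepts(text: str) -> bool:
    """Simulate the NFA by keeping the set of all states reachable so far."""
    current = {START}
    for ch in text:
        following = set()
        for state in current:
            following |= NFA_MOVE.get((state, ch), set())
        current = following
    # Accept if at least one possible run ended in an accepting state.
    return bool(current & ACCEPTING)


for sample in ["abb", "ababb", "abba"]:
    print(sample, nfa_accepts(sample))
```

The sets of states that appear during the simulation are exactly what the NFA-to-DFA conversion later turns into single DFA states.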

Another NFA
[Figure: NFA whose start state has ε-transitions into two branches, one with edges labeled a, a and one with edges labeled b, b]
An ε-transition is taken without consuming any character from the input.
What does the NFA above accept? aa*|bb*
(Aho, Sethi, Ullman, p. 116)

Constructing NFA
How do we define an NFA that accepts a regular expression?
It is very simple. Remember that a regular expression is formed by the use of alternation, concatenation, and repetition. Thus all we need to know is how to build the NFA for a single symbol, and how to compose NFAs.

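Composing NFAs with Alternation
The NFA for a symbol a is: start → i -a→ f
Given two NFAs N(s) and N(t), the NFA N(s|t) is:
[Figure: a new start state i has ε-transitions to the start states of N(s) and N(t); the accepting states of N(s) and N(t) have ε-transitions to a new accepting state f]
(Aho, Sethi, Ullman, p. 122)

Composing NFAs with Concatenation
Given two NFAs N(s) and N(t), the NFA N(st) is:
[Figure: N(s) followed by N(t); the start state i of N(s) is the start state, the accepting state of N(s) is merged with the start state of N(t), and the accepting state f of N(t) is the accepting state]
(Aho, Sethi, Ullman, p. 123)

Composing NFAs with Repetition
The NFA for N(s*) is:
[Figure: a new start state i has ε-transitions to the start state of N(s) and to a new accepting state f; the accepting state of N(s) has ε-transitions back to the start state of N(s) and to f]
(Aho, Sethi, Ullman, p. 123)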
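The composition rules for a symbol, alternation, concatenation, and repetition can be sketched directly as functions that build NFA fragments. This is a minimal sketch, not the code of any particular tool: the class Frag and the functions symbol, alternate, concat, star, closure, and matches are hypothetical names, states are plain integers, and an edge label of None stands for ε.

```python
import itertools

_ids = itertools.count()  # source of fresh state numbers


class Frag:
    """NFA fragment with exactly one start state and one accepting state."""

    def __init__(self, start, accept, edges):
        self.start, self.accept = start, accept
        self.edges = edges  # list of (source, label, target); label None is epsilon


def symbol(a):
    i, f = next(_ids), next(_ids)
    return Frag(i, f, [(i, a, f)])


def alternate(s, t):
    i, f = next(_ids), next(_ids)
    extra = [(i, None, s.start), (i, None, t.start),
             (s.accept, None, f), (t.accept, None, f)]
    return Frag(i, f, s.edges + t.edges + extra)


def concat(s, t):
    # Merge the accepting state of s with the start state of t.
    def rename(q):
        return s.accept if q == t.start else q
    merged = [(rename(src), lab, rename(dst)) for src, lab, dst in t.edges]
    return Frag(s.start, t.accept, s.edges + merged)


def star(s):
    i, f = next(_ids), next(_ids)
    extra = [(i, None, s.start), (i, None, f),
             (s.accept, None, s.start), (s.accept, None, f)]
    return Frag(i, f, s.edges + extra)


def closure(states, edges):
    """All states reachable from `states` using epsilon edges only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, lab, dst in edges:
            if src == q and lab is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def matches(nfa, text):
    current = closure({nfa.start}, nfa.edges)
    for ch in text:
        moved = {dst for src, lab, dst in nfa.edges if src in current and lab == ch}
        current = closure(moved, nfa.edges)
    return nfa.accept in current


# (a|b)*abb built bottom-up from the four rules
nfa = concat(concat(concat(star(alternate(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
print(matches(nfa, "babb"), matches(nfa, "ab"))
```

Every function returns a fragment with a single start and a single accepting state, which is what lets the rules be nested freely.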

Properties of the NFA
• Following these construction rules, we obtain an NFA N(r) with these properties:
  • N(r) has at most twice as many states as the number of symbols and operators in r;
  • N(r) has exactly one start state and one accepting state;
  • Each state of N(r) has either one outgoing transition on a symbol of the alphabet Σ or at most two outgoing ε-transitions.
(Aho, Sethi, Ullman, p. 124)

How to Parse a Regular Expression?
Given a regular expression, how do we generate an automaton to recognize tokens?
Using the three simple rules presented, it is easy to generate an NFA to recognize a regular expression.
Create an NFA and convert it to a DFA.
Given a DFA, we can generate an automaton that recognizes the longest substring of an input that is a valid token.

Regular expression notation: An Example
a           An ordinary character stands for itself.
ε           The empty string.
(blank)     Another way to write the empty string.
M | N       Alternation, choosing from M or N.
M N         Concatenation, an M followed by an N.
M*          Repetition (zero or more times).
M+          Repetition (one or more times).
M?          Optional, zero or one occurrence of M.
[a-zA-Z]    Character set alternation.
.           Stands for any single character except newline.
"a.+*"      Quotation, a string in quotes stands for itself literally.
(Appel, p. 20)

Regular expressions for some tokens
if                                      {return IF;}
[a-z][a-z0-9]*                          {return ID;}
[0-9]+                                  {return NUM;}
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)     {return REAL;}
("--"[a-z]*"\n")|(" "|"\n"|"\t")+       {/* do nothing */}
.                                       {error();}
(Appel, p. 20)
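To see how these rules interact, the sketch below tries every rule at the current position, keeps the longest match, and breaks ties in favor of the rule listed first. It uses Python's re module instead of a generated DFA, which is a substitution made only for illustration; the names RULES and tokenize are hypothetical, and the error action is modeled by raising ValueError.

```python
import re

# The six rules, in priority order, rewritten in Python regex syntax.
RULES = [
    ("IF", r"if"),
    ("ID", r"[a-z][a-z0-9]*"),
    ("NUM", r"[0-9]+"),
    ("REAL", r"[0-9]+\.[0-9]*|[0-9]*\.[0-9]+"),
    ("SKIP", r"--[a-z]*\n|[ \n\t]+"),  # comments and white space: do nothing
    ("ERROR", r"."),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Return (token name, lexeme) pairs using longest match, then rule order."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in COMPILED:
            m = pattern.match(text, pos)
            # Strict ">" keeps the earlier rule when two matches tie in length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        lexeme = text[pos:pos + best_len]
        if best_name is None or best_name == "ERROR":
            raise ValueError(f"illegal character {text[pos]!r} at offset {pos}")
        if best_name != "SKIP":
            tokens.append((best_name, lexeme))
        pos += best_len
    return tokens


print(tokenize("if x1 42 3.14 iffy"))
```

The input "if" matches both the IF and ID rules with the same length, so rule order decides; "iffy" is longer as an ID, so longest match decides.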

Building Finite Automata for Lexical Tokens
if {return IF;}
The NFA for a symbol i is: start → 1 -i→ 2
The NFA for a symbol f is: start → 1 -f→ 2
The NFA for the regular expression if is: start → 1 -i→ 2 -f→ 3 (IF)
(Appel, p. 21)

Building Finite Automata for Lexical Tokens
[a-z][a-z0-9]* {return ID;}
[Figure: start → 1 -a-z→ 2 (ID), with loops labeled a-z and 0-9 on state 2]
(Appel, p. 21)

Building Finite Automata for Lexical Tokens
[0-9]+ {return NUM;}
[Figure: start → 1 -0-9→ 2 (NUM), with a loop labeled 0-9 on state 2]
(Appel, p. 21)

Building Finite Automata for Lexical Tokens
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+) {return REAL;}
[Figure: automaton for REAL with states 1 to 5 and edges labeled 0-9 and "."]
(Appel, p. 21)

Building Finite Automata for Lexical Tokens
("--"[a-z]*"\n")|(" "|"\n"|"\t")+ {/* do nothing */}
[Figure: automaton for comments and white space with states 1 to 5 and edges labeled -, a-z, \n, \t, and blank]
(Appel, p. 21)

Building Finite Automata for Lexical Tokens
[Figure: the six automata side by side, labeled IF, ID, NUM, REAL, white space, and error]
(Appel, p. 21)

Conversion of NFA into DFA
[Figure: combined NFA with states 1 to 15. State 1 is the start state. It has an edge labeled i into the IF sub-automaton (edges i, f) and ε-transitions to states 4, 9, and 14, which begin the sub-automata for ID (a-z, then a-z or 0-9), NUM (0-9), and error (any character)]
What states can be reached from state 1 without consuming a character?
{1,4,9,14} form the ε-closure of state 1
(Appel, p. 27)

Conversion of NFA into DFA
What are all the state closures in this NFA?
closure(1) = {1,4,9,14}
closure(5) = {5,6,8}
closure(8) = {6,8}
closure(7) = {6,7,8}
closure(10) = {10,11,13}
closure(13) = {11,13}
closure(12) = {11,12,13}
(Appel, p. 27)

Conversion of NFA into DFA
Given a set of NFA states T, ε-closure(T) is the set of states that are reachable through ε-transitions alone from any state s ∈ T (including the states of T themselves).
Given a set of NFA states T, move(T, a) is the set of states that are reachable on input a from any state s ∈ T.
(Aho, Sethi, Ullman, p. 118)
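The two operations can be sketched in a few lines. The edge tables below are an assumption: they are reconstructed from the closures and move results quoted on these slides for the 15-state NFA, not copied from the figure, and the names EPS, SYM, eps_closure, and move are hypothetical.

```python
LOWER = "abcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"

# Epsilon edges of the combined NFA (reconstructed, see the note above).
EPS = {1: [4, 9, 14], 5: [8], 8: [6], 7: [8], 10: [13], 13: [11], 12: [13]}

# Symbol edges: (source, characters allowed, target); None means any character.
SYM = [
    (1, "i", 2), (2, "f", 3),
    (4, LOWER, 5), (6, LOWER + DIGITS, 7),
    (9, DIGITS, 10), (11, DIGITS, 12),
    (14, None, 15),
]


def eps_closure(states):
    """States reachable from `states` through epsilon edges only (T included)."""
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, []):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result


def move(states, ch):
    """States reachable from `states` by consuming the single character ch."""
    return {dst for src, chars, dst in SYM
            if src in states and (chars is None or ch in chars)}


start = eps_closure({1})
print(sorted(start))
print(sorted(move(start, "a")), sorted(eps_closure(move(start, "a"))))
```

Applying eps_closure after every move is what turns a set of NFA states into one candidate DFA state.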

Problem Statement for Conversion of NFA into DFA
Given an NFA, find a DFA that has the same behavior as the NFA for all inputs. The construction that follows does not by itself guarantee the minimum number of states.
If the initial state in the NFA is s0, then the set of states in the DFA, Dstates, is initialized with a state representing ε-closure(s0).
(Aho, Sethi, Ullman, p. 118)

Conversion of NFA into DFA
Dstates = {1-4-9-14}
Now we need to compute:
move(1-4-9-14, a-h) = {5,15}
ε-closure({5,15}) = {5,6,8,15}
[Figure: the NFA, with the new DFA state 5-6-8-15 reached from 1-4-9-14 on a-h]
(Appel, p. 27)

Conversion of NFA into DFA
move(1-4-9-14, i) = {2,5,15}
ε-closure({2,5,15}) = {2,5,6,8,15}
[Figure: the NFA, with the new DFA state 2-5-6-8-15 reached from 1-4-9-14 on i]
(Appel, p. 27)

Conversion of NFA into DFA
move(1-4-9-14, j-z) = {5,15}
ε-closure({5,15}) = {5,6,8,15}
[Figure: the NFA, with the existing DFA state 5-6-8-15 also reached from 1-4-9-14 on j-z]
(Appel, p. 27)

Conversion of NFA into DFA
move(1-4-9-14, 0-9) = {10,15}
ε-closure({10,15}) = {10,11,13,15}
[Figure: the NFA, with the new DFA state 10-11-13-15 reached from 1-4-9-14 on 0-9]
(Appel, p. 27)

Conversion of NFA into DFA
move(1-4-9-14, other) = {15}
ε-closure({15}) = {15}
[Figure: the NFA, with the new DFA state 15 reached from 1-4-9-14 on any other character]
(Appel, p. 27)

Conversion of NFA into DFA
The analysis for 1-4-9-14 is complete. We mark it and pick another state in the DFA to analyze.
(Appel, p. 27)
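Repeating the "mark a state, compute its moves, add any new state" step until nothing is unmarked gives the whole DFA. The following is a sketch under stated assumptions: the NFA edge tables are reconstructed from the move and closure results quoted on these slides, the character "+" stands in for the class "other", and all names (subset_construction, label, and so on) are hypothetical.

```python
LOWER = "abcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"
ALPHABET = LOWER + DIGITS + "+"  # "+" represents the class "other"

EPS = {1: [4, 9, 14], 5: [8], 8: [6], 7: [8], 10: [13], 13: [11], 12: [13]}
SYM = [  # (source, characters allowed, target); None means any character
    (1, "i", 2), (2, "f", 3), (4, LOWER, 5), (6, LOWER + DIGITS, 7),
    (9, DIGITS, 10), (11, DIGITS, 12), (14, None, 15),
]
FINAL = {3: "IF", 8: "ID", 13: "NUM", 15: "error"}
PRIORITY = ["IF", "ID", "NUM", "error"]  # earlier rules win


def eps_closure(states):
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, []):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result


def move(states, ch):
    return {dst for src, chars, dst in SYM
            if src in states and (chars is None or ch in chars)}


def subset_construction(nfa_start=1):
    start = frozenset(eps_closure({nfa_start}))
    dstates, unmarked, dtran = {start}, [start], {}
    while unmarked:
        t = unmarked.pop()  # "mark" t by removing it from the work list
        for ch in ALPHABET:
            u = frozenset(eps_closure(move(t, ch)))
            if not u:
                continue  # no transition on ch from this DFA state
            dtran[(t, ch)] = u
            if u not in dstates:
                dstates.add(u)
                unmarked.append(u)
    return start, dstates, dtran


def label(dstate):
    """Token reported by a DFA state: the highest-priority final NFA state in it."""
    names = {FINAL[s] for s in dstate if s in FINAL}
    return next((n for n in PRIORITY if n in names), None)


start, dstates, dtran = subset_construction()
for d in sorted(dstates, key=sorted):
    print("-".join(map(str, sorted(d))), label(d))
```

With these assumed edges the loop stops after discovering eight DFA states, and the state 3-6-7-8 is labeled IF because the IF rule has priority over ID.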

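The corresponding DFA
States: 1-4-9-14 (start), 2-5-6-8-15 (ID), 5-6-8-15 (ID), 3-6-7-8 (IF), 6-7-8 (ID), 10-11-13-15 (NUM), 11-12-13 (NUM), 15 (error)
Transitions:
1-4-9-14 on a-h → 5-6-8-15
1-4-9-14 on i → 2-5-6-8-15
1-4-9-14 on j-z → 5-6-8-15
1-4-9-14 on 0-9 → 10-11-13-15
1-4-9-14 on other → 15
2-5-6-8-15 on f → 3-6-7-8
2-5-6-8-15 on a-e, g-z, 0-9 → 6-7-8
5-6-8-15 on a-z, 0-9 → 6-7-8
3-6-7-8 on a-z, 0-9 → 6-7-8
6-7-8 on a-z, 0-9 → 6-7-8
10-11-13-15 on 0-9 → 11-12-13
11-12-13 on 0-9 → 11-12-13
See p. 118 of Aho-Sethi-Ullman and p. 29 of Appel.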

Lexical Analyzer and Parser
[Figure: the syntax analyzer sends "get next token" to the lexical analyzer and receives the next token; the lexical analyzer sends "get next char" to the source program and receives the next char; both use the symbol table, which contains a record for each identifier]
token: smallest meaningful sequence of characters of interest in the source program
(Aho, Sethi, Ullman, p. 160)
