Compiler Construction Principles & Implementation Techniques

Compiler Construction Principles & Implementation Techniques Dr. Ying JIN Associate Professor Sept. 2007

What we have already introduced • What this course is about • What a compiler is • The ways to design and implement a compiler • General functional components of a compiler • The general translating process of a compiler

What will be introduced • Scanning • The first phase in a compiler • Functional requirement (input, output, functions) • Data structures • General techniques in developing a scanner • Two formal languages • Regular expression • Finite automata (DFA, NFA) • Three algorithms • From regular expression to NFA • From NFA to DFA • Minimizing DFA • One implementation • Implementing DFA to get a scanner

Functional components of a Compiler • Lexical Analysis (scanning) • Syntax Analysis (parsing) • Semantic Analysis • Intermediate Code Generation • Intermediate Code Optimization • Target Code Generation • The phases divide into analysis (front end) and synthesis (back end)

Presentation Time • What have you read and learned?

Outline 2.1 Overview 2.1.1 General Function of a Scanner 2.1.2 Some Issues about Scanning 2.2 Finite Automata 2.2.1 Definition and Implementation of DFA 2.2.2 Non-Deterministic Finite Automata 2.2.3 Transforming NFA into DFA 2.2.4 Minimizing DFA 2.3 Regular Expressions 2.3.1 Definition of Regular Expressions 2.3.2 Regular Definition 2.3.4 From Regular Expression to DFA 2.4 Design and Implementation of a Scanner 2.4.1 Developing a Scanner from DFA 2.4.2 A Scanner Generator – Lex

Knowledge Relation Graph • Lexical definition, using Regular Expression • Regular Expression, transforming into NFA • NFA, transforming into DFA • DFA, minimizing into minimized DFA • Develop a Scanner: based on the DFA, implement the minimized DFA


2.1 Overview • General function of scanning • Input • Output • Functional description • Some issues about scanning • Tokens • Blank/tab, return, newline, comments • Lexical errors

General Function of Scanning • Input • Source program • Output • Sequence of words (tokens) • Functional description (similar to spelling) • Read the source program; • Recognize words one by one according to the lexical definition of the source language; • Build the internal representation of words (tokens); • Check for lexical errors; • Output the sequence of tokens;

Some Issues about Scanning

Tokens • Source program • Stream of characters • Token • A sequence of characters that can be treated as a unit in the grammar of a programming language; • Token types • Identifier: x, y1, …… • Number: 12, 12.3, …… • Keywords (reserved words): int, real, main, …… • Operator: +, -, *, /, >, <, …… • Delimiter: ;, {, }, ……

Tokens • The content of a token: (token-type, semantic information) • Semantic information: • Identifier: the string • Number: the value • Keywords (reserved words): the number in the keyword table • Operator: itself • Delimiter: itself
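A token as described above can be sketched as a small record holding a token type plus its semantic information. This is a minimal illustration in Python, not the course's own data structure: the names `TokenType`, `Token`, `make_token` and the three-entry `KEYWORD_TABLE` are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Union


class TokenType(Enum):
    IDENTIFIER = auto()
    NUMBER = auto()
    KEYWORD = auto()
    OPERATOR = auto()
    DELIMITER = auto()


# Hypothetical keyword table; a real one lists every keyword of the language.
KEYWORD_TABLE = ["int", "real", "main"]


@dataclass(frozen=True)
class Token:
    type: TokenType
    info: Union[str, int, float]  # the "semantic information" part


def make_token(token_type: TokenType, lexeme: str) -> Token:
    """Build a token, choosing the semantic information by token type."""
    if token_type is TokenType.IDENTIFIER:
        return Token(token_type, lexeme)  # the string itself
    if token_type is TokenType.NUMBER:
        value = float(lexeme) if "." in lexeme else int(lexeme)
        return Token(token_type, value)  # the numeric value
    if token_type is TokenType.KEYWORD:
        # the position of the word in the keyword table
        return Token(token_type, KEYWORD_TABLE.index(lexeme))
    return Token(token_type, lexeme)  # operator or delimiter: itself
```

For instance, `make_token(TokenType.KEYWORD, "real")` stores the index of `real` in the assumed keyword table rather than the spelling of the word.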

Keywords • Keywords • Words that have a special meaning in the language • In some languages they can still be used for other meanings, overriding the predefined meaning; • Reserved words • Words that are reserved by a programming language for a special meaning; • Can not be used for other meanings; • Keyword table • Records all keywords defined by the source programming language

Sample Source Program (△ marks a blank; each line below ends with a newline character)
var
int△x,y;
{
read(x);
read(y);
if△x>y△then△write(1)△else△write(0);
}

Sequence of Tokens

Blank, tab, newline, comments • No semantic meaning • Only for readability • Can be removed by the scanner • Line numbers must still be counted (for example, for error messages);

The End of Scanning • Scanning ends in one of two situations • The character that marks the end of a program has been read; • PASCAL: ‘.’ • The end of the source program file has been reached

Lexical Errors • Only limited types of errors can be found during scanning • Illegal character; • §, ← • The first character of a word is wrong; • “/abc” • Lexical Error Recovery • Once a lexical error has been found, the scanner does not stop; it takes some measure to continue scanning • Ignore the current character and start again from the next character • Example: if § a then x = 12.else ……
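The recovery rule above (ignore the offending character and continue from the next one) can be sketched as follows. This is only an illustration under stated assumptions: the `ALLOWED` character set and the function name `strip_illegal` are made up for the example, and a real scanner would apply the rule while recognizing tokens rather than in a separate pass.

```python
import string

# Assumed character set for illustration; a real language defines its own.
ALLOWED = set(string.ascii_letters + string.digits + " \t\n+-*/<>=;,(){}.")


def strip_illegal(source: str):
    """Return (cleaned_source, errors), skipping each illegal character."""
    kept = []
    errors = []  # list of (line number, offending character)
    line = 1
    for ch in source:
        if ch in ALLOWED:
            kept.append(ch)
        else:
            errors.append((line, ch))  # report the error, then carry on
        if ch == "\n":
            line += 1  # line numbers are still counted
    return "".join(kept), errors
```

Calling `strip_illegal("if § a then x = 12")` reports one error on line 1 and returns the text with `§` dropped, so scanning can continue.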

Scanner • Two forms • Attached scanner: CharList → Scanner; Syntax Analysis calls the scanner, which returns one Token per call • Independent scanner: CharList → Scanner → TokenList → Syntax Analysis

To Develop a Scanner • Now we know what the function of a scanner is; • How to implement a scanner? • Basis: lexical rules of the source programming language • Set of allowed characters; • What kinds of tokens does it have? • The structure of each token-type • The scanner is developed according to the lexical rules;

How to define the lexical structure of a programming language? • Natural language (English, Chinese ……) • - easy to write, but ambiguous and hard to implement • Formal languages • - need some special background knowledge; • - concise, easy to implement; • - can be processed automatically;

Two formal languages for defining lexical structure of a programming language • Finite automata • Non-deterministic finite automata (NFA) • Deterministic finite automata (DFA) • Regular expressions (RE) • Both of them can formally describe lexical structure • The set of allowed words • They are equivalent • FA is easy to implement; RE is easy to use;

Outline √ 2.1 Overview 2.1.1 General Function of a Scanner 2.1.2 Some Issues about Scanning 2.2 Finite Automata 2.2.1 Definition and Implementation of DFA 2.2.2 Non-Deterministic Finite Automata 2.2.3 Transforming NFA into DFA 2.2.4 Minimizing DFA 2.3 Regular Expressions 2.3.1 Definition of Regular Expressions 2.3.2 Regular Definition 2.3.4 From Regular Expression to DFA 2.4 Design and Implementation of a Scanner 2.4.1 Developing a Scanner from DFA 2.4.2 A Scanner Generator – Lex

2.2 Finite Automata • Definition of DFA • Implementation of DFA • Non-Deterministic Finite Automata • Transforming NFA into DFA • Minimizing DFA

Definition of DFA - formal definition - two ways of representations - examples - some concepts

Formal Definition of DFA
One DFA defines a set of strings; each string is a sequence of characters in Σ. The start state gives the starting point for generating strings; the terminal states give the end points; the transforming function gives the rules for generating strings.
• A DFA is a five-tuple (Σ, SS, S0, δ, TS)
• Σ (alphabet): the set of allowed characters; each character is called an input symbol;
• SS = {S0, S1, S2, ……}: a finite set; each element is called a state;
• S0 ∈ SS: the start state
• δ: SS × Σ → SS ∪ {⊥}: the transforming function
• TS ⊆ SS: the set of terminal (accept) states
• Note: δ is a function which accepts a state and a symbol and returns either one unique state or ⊥ (no definition);

Features of a DFA • One start state; • For a state and a symbol, there is at most one edge; • Functions of DFA • It defines a set of strings; • It can be used for defining the lexical structure of a programming language

Two ways of Representations • Table • Convenient for implementation • Graph • easy to read and understand

Two ways of Representations (Table) • Transforming Table • Start state: S0 • Terminal state: marked with * • Row: characters • Column: states • Cell: states or ⊥

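Example of Transforming Table
• Σ: {a, b, c, d}   SS: {S0, S1, S2, S*}   Start state: S0   Set of terminal states: {S*}
• δ: {(S0,a)→S1, (S0,c)→S2, (S0,d)→S*, (S1,b)→S1, (S1,d)→S2, (S2,a)→S*, (S*,c)→S*}
• The same function written as a table (rows: characters; columns: states):
      S0   S1   S2   S*
a     S1   ⊥    S*   ⊥
b     ⊥    S1   ⊥    ⊥
c     S2   ⊥    ⊥    S*
d     S*   S2   ⊥    ⊥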

Two ways of Representations (Graph) • Graph • Start state (e.g. S0) • Terminal state • State (e.g. S) • Edge, labeled with symbols (e.g. a, b)

Example of Graphical DFA (state diagram of the following DFA)
• Σ: {a, b, c, d}   SS: {S0, S1, S2, S*}   Start state: S0   Set of terminal states: {S*}
• δ: {(S0,a)→S1, (S0,c)→S2, (S0,d)→S*, (S1,b)→S1, (S1,d)→S2, (S2,a)→S*, (S*,c)→S*}

Some Concepts • String acceptable by a DFA • Let A be a DFA and a1a2…an a string. If there exists a sequence of states (S0, S1, …, Sn) such that S0 -a1→ S1, S1 -a2→ S2, ……, Sn-1 -an→ Sn, where S0 is the start state and Sn is one of the accept states, then the string a1a2…an is acceptable by the DFA A. • Set of strings defined by a DFA • The set of all the strings that are acceptable by a DFA A is called the set of strings defined by A, denoted L(A)

Some Concepts • Special case: • If a DFA consists of a single state S with no edges, and S is both the start state and the accept state, the set of strings defined by the DFA is {ε}, the set containing only the empty string.

Relating DFA to Lexical Structure of a Programming language • Use a DFA to define the lexical structure of one word-type in a programming language • Example: unsigned real number (figure: a DFA starting at S0 whose edges are labeled 1, 2, …, 9; 0, 1, 2, …, 9; and .) • A DFA can be defined for the lexical structure of all the words in a programming language; • The set of strings defined by the DFA is the set of allowed words in the programming language; • The implementation of the DFA can be used as a scanner for the programming language;

Assignment • For standard C programming language • Find out the token types and their lexical structures • Write down DFA for each token type

Implementation of DFA

Implementation of DFA • Objective (meaning of implementing a DFA) • Given a DFA which defines rules for a set of strings • Develop a program, which • Reads a string • Checks whether this string is accepted by the DFA • Two ways • Based on the transforming table of the DFA • Based on the graphical representation of the DFA

Transforming Table based Implementation • Main idea • Input : a string • Output: true if acceptable, otherwise false • Data structure • Transforming table (two dimensional array T) • Two variables • CurrentState: record current state; • CurrentChar: record current character that is read in the string;

Transforming Table based Implementation • Main idea • General Algorithm
1. CurrentState = S0;
2. Read the first character as CurrentChar;
3. If CurrentChar is not the end of the string: if T(CurrentState, CurrentChar) ≠ error, then CurrentState = T(CurrentState, CurrentChar), read the next character of the string as CurrentChar, and goto 3; otherwise return false;
4. If CurrentChar is the end of the string and CurrentState is one of the terminal states, return true; otherwise, return false.
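The general algorithm can be sketched in Python for the example DFA with states S0, S1, S2, S* defined earlier. This is a minimal illustration, not the course's reference code: the dictionary `T` plays the role of the two-dimensional array, a missing key stands for the error entry, and the names `T`, `START`, `TERMINAL` and `accepts` are assumptions of the example.

```python
# Transforming table of the example DFA; a missing key means "no definition".
T = {
    ("S0", "a"): "S1", ("S0", "c"): "S2", ("S0", "d"): "S*",
    ("S1", "b"): "S1", ("S1", "d"): "S2",
    ("S2", "a"): "S*",
    ("S*", "c"): "S*",
}
START = "S0"
TERMINAL = {"S*"}


def accepts(string: str) -> bool:
    """Return True if the DFA accepts the string, otherwise False."""
    current_state = START                      # step 1
    for current_char in string:                # steps 2 and 3: read char by char
        next_state = T.get((current_state, current_char))
        if next_state is None:                 # table entry is "error"
            return False
        current_state = next_state
    return current_state in TERMINAL           # step 4: end of string reached
```

With this table, `accepts("abdacc")` follows S0, S1, S1, S2, S*, S*, S* and returns true, while `accepts("cab")` fails on `b` because S* has no edge for it.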

Example 1) abbacc true 2) cab false

Transforming table for the DFA • Variables: CurrentChar, CurrentState • Read the string to be checked • Checking process:

Graph based Implementation of DFA
• each state corresponds to a case statement
• each edge corresponds to a goto statement
• for an accept state, add one more branch: if the current char is the end of the string (#), then accept
For a state i with an edge labeled a to state j and an edge labeled b to state k:
Li: case CurrentChar of
      a : goto Lj
      b : goto Lk
      other : Error( )
If i is also an accept state:
Li: case CurrentChar of
      a : goto Lj
      b : goto Lk
      # : return true;
      other : Error( )

Graph based implementation of the example DFA (label LS3 stands for state S*; # marks the end of the string)
LS0: read character to CurrentChar;
     case CurrentChar of
       a: goto LS1;
       c: goto LS2;
       d: goto LS3;
       other: return false;
LS1: read character to CurrentChar;
     case CurrentChar of
       b: goto LS1;
       d: goto LS2;
       other: return false;
LS2: read character to CurrentChar;
     case CurrentChar of
       a: goto LS3;
       other: return false;
LS3: read character to CurrentChar;
     case CurrentChar of
       c: goto LS3;
       #: return true;
       other: return false;
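Python has no goto statement, so the labeled code for the example DFA can only be approximated. The sketch below keeps the same idea, one hard-coded branch per state and one assignment per edge, with a `state` variable standing in for the jump target. The function name `accepts` and the constant `END` are assumptions of this example; `#` is used as the end marker, so input strings must not contain it.

```python
END = "#"  # end-of-string marker; assumed not to be part of the alphabet


def accepts(string: str) -> bool:
    """Graph-style recognizer: one branch per state, one assignment per edge."""
    if END in string:
        return False
    chars = iter(string + END)
    state = "S0"
    while True:
        ch = next(chars)  # "read character to CurrentChar"
        if state == "S0":
            if ch == "a":
                state = "S1"
            elif ch == "c":
                state = "S2"
            elif ch == "d":
                state = "S*"
            else:
                return False
        elif state == "S1":
            if ch == "b":
                state = "S1"
            elif ch == "d":
                state = "S2"
            else:
                return False
        elif state == "S2":
            if ch == "a":
                state = "S*"
            else:
                return False
        else:  # state S*, the accept state
            if ch == "c":
                state = "S*"
            elif ch == END:
                return True  # extra branch for the accept state
            else:
                return False
```

Every state returns when it reads the end marker (true only in S*), so the loop never reads past the end of the input. Compared with the table-based version, the DFA is fixed in the program text, which is fast but must be rewritten whenever the DFA changes.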

Definition of NFA

Formal Definition of NFA
• An NFA is a five-tuple (Σ, SS, SS0, δ, TS)
• Σ (alphabet): the set of allowed characters; each character is called an input symbol;
• SS = {S0, S1, S2, ……}: a finite set; each element is called a state;
• SS0 ⊆ SS: the set of start states
• δ: SS × (Σ ∪ {ε}) → power set of SS ∪ {⊥}: the transforming function
• TS ⊆ SS: the set of terminal (accept) states
• Note: δ is a function which accepts a state and a symbol (or ε) and returns a set of states or ⊥ (no definition);

Differences between DFA & NFA

Example of NFA (state diagram of the following NFA)
• Σ: {a, b, c, d}   SS: {S0, S10, S2, S*}   Set of start states: {S0, S10}   Set of terminal states: {S*}
• δ: {(S0,a)→{S10, S*}, (S0,ε)→{S2}, (S10,b)→{S10}, (S10,ε)→{S2}, (S2,ε)→{S*}, (S*,c)→{S*}}
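To see how such an NFA decides whether a string is acceptable, one can track the set of all states it could be in after each character. The sketch below does this for the example NFA, treating the transitions that carry no input symbol as ε-moves (an assumption of this sketch, as are the names `EPS`, `DELTA`, `STARTS`, `eps_closure` and `accepts`). It illustrates the definition only and is not the NFA-to-DFA transformation itself.

```python
EPS = ""  # stands for ε in this sketch

# Transforming function of the example NFA; a missing key means "no definition".
DELTA = {
    ("S0", "a"): {"S10", "S*"}, ("S0", EPS): {"S2"},
    ("S10", "b"): {"S10"}, ("S10", EPS): {"S2"},
    ("S2", EPS): {"S*"},
    ("S*", "c"): {"S*"},
}
STARTS = {"S0", "S10"}
TERMINAL = {"S*"}


def eps_closure(states):
    """All states reachable from `states` using only ε-moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in DELTA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def accepts(string: str) -> bool:
    """True if some path through the NFA reads the string and ends in TS."""
    current = eps_closure(STARTS)
    for ch in string:
        moved = set()
        for s in current:
            moved |= DELTA.get((s, ch), set())
        current = eps_closure(moved)
    return bool(current & TERMINAL)
```

Under this reading the empty string is already acceptable, because S* is reachable from the start states through ε-moves alone, whereas `d` is not, because no state has an edge labeled d.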
