Course Overview

PART I: overview material
1 Introduction
2 Language processors (tombstone diagrams, bootstrapping)
3 Architecture of a compiler
PART II: inside a compiler
4 Syntax analysis
5 Contextual analysis
6 Runtime organization
7 Code generation
PART III: conclusion
8 Interpretation
9 Review
Supplementary material: Theoretical foundations (Regular expressions)

Regular Expressions • finite state machine is a good “visual” aid • but it is not very suitable as a specification (its textual description is too clumsy) • regular expressions are a suitable specification • a more compact way to define a language that can be accepted by an FSM • used to give the lexical description of a programming language • define each “token” (keywords, identifiers, literals, operators, punctuation, etc) • define white-space, comments, etc • these are not tokens, but must be recognized and ignored

• | means "or" • . means "followed by" (dot may be omitted) • * means "zero or more instances of" • ( ) are used for grouping Example: Pascal identifier • Lexical specification (in English): • a letter, followed by zero or more letters or digits • Lexical specification (as a regular expression): • letter . (letter | digit)*

Operands of a regular expression • Operands are the same as the labels on the edges of an FSM • single characters, or • the special character ε (the empty string) • "letter" is a shorthand for • a | b | c | ... | z | A | B | C | ... | Z • "digit" is a shorthand for • 0 | 1 | 2 | ... | 9 • sometimes we put the characters in quotes • necessary when denoting | . * ( )

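Precedence of | . * operators • * has the highest precedence, then . , then | • Consider the regular expressions: • letter.letter | digit* which groups as (letter.letter) | (digit*) • letter.(letter | digit)* where the parentheses make * apply to the whole group (letter | digit)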
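The effect of precedence on these two expressions can be checked with a small sketch. It assumes Python's re module as a stand-in for the slide notation, with the character classes [A-Za-z] and [0-9] standing for "letter" and "digit"; the helper name matches is hypothetical.

```python
import re

LETTER = "[A-Za-z]"   # stands for the slide's "letter"
DIGIT = "[0-9]"       # stands for the slide's "digit"

# letter.letter | digit*  : two letters, OR any run of digits (possibly empty)
first = re.compile(f"{LETTER}{LETTER}|{DIGIT}*")

# letter.(letter | digit)* : a letter followed by letters or digits
second = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None

for text in ["ab", "123", "", "a1", "abc"]:
    print(repr(text), matches(first, text), matches(second, text))
```

Only the two-letter string "ab" is matched by both expressions; the other samples are matched by exactly one of them, which shows that the two expressions define different languages.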

TEST YOURSELF Question 1: Describe (in English) the language defined by each of the following regular expressions: • letter (letter* | digit*) • (letter | _ ) (letter | digit | _ )* • digit* "." digit* • digit digit* "." digit digit*

TEST YOURSELF Question 2: Write a regular expression for each of these languages: • The set of all C++ reserved words • Examples: if, while, for, class, int, case, char, true, false • C++ string literals that begin with " and end with " and don't contain any other " except possibly in the escape sequence \" • Example: "The escape sequence \" occurs in this string" • C++ comments that begin with /* and end with */ and don't contain any other */ within the string • Example: /* This is a comment * still the same comment */

Example: Integer Literals • An integer literal with an optional sign can be defined in English as: • "(nothing or + or -) followed by one or more digits" • The corresponding regular expression is: • (+|-|ε) (digit.digit*) • A new convenient operator '+' • same precedence as '*' • digit digit* is the same as • digit+ which means "one or more digits"
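As a rough illustration, the integer-literal expression can be written in Python's re syntax, once in the long form and once with the '+' operator. The names INTEGER_LONG, INTEGER_SHORT, and is_integer are hypothetical, and [0-9] is assumed to stand for "digit".

```python
import re

# (+|-|ε) (digit.digit*) : the empty alternative plays the role of ε
INTEGER_LONG = re.compile(r"(\+|-|)[0-9][0-9]*")

# Same language, using the '+' operator and an optional sign
INTEGER_SHORT = re.compile(r"[+-]?[0-9]+")

def is_integer(text, pattern=INTEGER_LONG):
    """Return True if text is an optionally signed integer literal."""
    return pattern.fullmatch(text) is not None

samples = ["42", "+7", "-0", "", "+", "4-2", "12a"]
for s in samples:
    print(repr(s), is_integer(s), is_integer(s, INTEGER_SHORT))
```

Both patterns give the same answer on every sample, as expected from the equivalence of digit digit* and digit+.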

Language Defined by a Regular Expression • Recall: language = set of strings • Language defined by an automaton • the set of strings accepted by the automaton • Language defined by a regular expression • the set of strings that match the expression
Regular Exp.     Corresponding Set of Strings
ε                {""}
a                {"a"}
a.b.c            {"abc"}
a | b | c        {"a", "b", "c"}
(a | b | c)*     {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}

Concept of Reg Exp Generating a String • Rewrite the regular expression until only a sequence of letters (a string) is left • Replacement Rules: 1) r1 | r2 --> r1 2) r1 | r2 --> r2 3) r* --> r r* 4) r* --> ε • Example • (0|1)* 2 (0|1)* • (0|1) (0|1)* 2 (0|1)* • 1 (0|1)* 2 (0|1)* • 1 2 (0|1)* • 1 2 (0|1) (0|1)* • 1 2 (0|1) • 1 2 0

Non-determinism in Generation • Different rule applications may yield different final results • Example 1 • (0|1)* 2 (0|1)* • (0|1) (0|1)* 2 (0|1)* • 1 (0|1)* 2 (0|1)* • 1 2 (0|1)* • 1 2 (0|1) (0|1)* • 1 2 (0|1) • 1 2 0 • Example 2 • (0|1)* 2 (0|1)* • (0|1) (0|1)* 2 (0|1)* • 0 (0|1)* 2 (0|1)* • 0 2 (0|1)* • 0 2 (0|1) (0|1)* • 0 2 (0|1) • 0 2 1
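The rewriting process can be sketched as a small random generator. This is only an illustration under assumptions: the tuple encoding of expressions ("sym", "alt", "cat", "star") and the function name generate are hypothetical, and the choice between the rules is made by a random number generator, with probability 1/2 for each star rule so that generation terminates.

```python
import random
import re

def generate(r, rng):
    """Rewrite the expression r until only a string is left."""
    kind = r[0]
    if kind == "sym":                      # a single character
        return r[1]
    if kind == "cat":                      # r1 followed by r2
        return generate(r[1], rng) + generate(r[2], rng)
    if kind == "alt":                      # rules 1 and 2: pick r1 or r2
        return generate(rng.choice(r[1:]), rng)
    if kind == "star":
        if rng.random() < 0.5:             # rule 4: r* --> empty string
            return ""
        return generate(r[1], rng) + generate(r, rng)   # rule 3: r* --> r r*
    raise ValueError(f"unknown expression kind: {kind}")

# (0|1)* 2 (0|1)*
bit = ("alt", ("sym", "0"), ("sym", "1"))
expr = ("cat", ("star", bit), ("cat", ("sym", "2"), ("star", bit)))

rng = random.Random(0)
strings = [generate(expr, rng) for _ in range(20)]
print(strings)

# Every generated string should belong to the language of the expression.
print(all(re.fullmatch(r"[01]*2[01]*", s) for s in strings))
```

Different runs of the random choices correspond to different rule applications, so the same expression yields different strings, each containing exactly one 2.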

Concept of Language Generated by Reg Exp • The set of all strings generated by a regular expression is the language of the regular expression • In general, the language may be infinite • A string generated by a regular expression is often called a "token"

Examples of Languages and Reg Exp • Σ = { 0, 1, . } • (0 | 1)+ "." (0 | 1)* | (0 | 1)* "." (0 | 1)+ : binary floating point numbers • (0 0)* : even-length all-zero strings • 1* (0 1* 0 1*)* : binary strings with an even number of zeros • Σ = { a, b, c, 0, 1, 2 } • (a|b|c)(a|b|c|0|1|2)* : alphanumeric identifiers • (0|1|2)+ : trinary numbers
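One way to gain confidence in such a description is to compare the expression against a direct check on all short strings. The sketch below does this for the "even number of zeros" example, assuming Python's re syntax; the names EVEN_ZEROS, binary_strings, and regex_agrees_with_count are hypothetical.

```python
import re
from itertools import product

# 1* (0 1* 0 1*)* : binary strings with an even number of zeros
EVEN_ZEROS = re.compile(r"1*(01*01*)*")

def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            yield "".join(chars)

def regex_agrees_with_count(max_len):
    """Check the regex against counting zeros, for all short strings."""
    return all(
        (EVEN_ZEROS.fullmatch(s) is not None) == (s.count("0") % 2 == 0)
        for s in binary_strings(max_len)
    )

print(regex_agrees_with_count(10))
```

An exhaustive check up to a fixed length is not a proof, but a mismatch on any short string would show that the expression or its English description is wrong.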

Reg Exp Notational Shorthand • R+ one or more strings of R: R(R*) • R? optional R: (R|ε) • [abcd] one of the listed characters: (a|b|c|d) • [a-z] one character from this range: (a|b|c|d...|z) • [^abc] any one character except the listed characters • [^a-z] any one character not from this range
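Python's re module happens to support the same shorthands, so the stated expansions can be compared directly. This is a sketch under that assumption; the names PAIRS, SAMPLES, and same_language_on are hypothetical, and the empty alternative in (a|) stands for ε.

```python
import re

# (shorthand, expanded form) pairs taken from the list above
PAIRS = [
    (r"a+", r"a(a*)"),
    (r"a?", r"(a|)"),
    (r"[abcd]", r"(a|b|c|d)"),
    (r"[a-e]", r"(a|b|c|d|e)"),
]

SAMPLES = ["", "a", "aa", "aaa", "b", "d", "e", "f", "ab", "z"]

def same_language_on(short, expanded, samples):
    """True if both patterns accept exactly the same sample strings."""
    return all(
        (re.fullmatch(short, s) is None) == (re.fullmatch(expanded, s) is None)
        for s in samples
    )

for short, expanded in PAIRS:
    print(short, expanded, same_language_on(short, expanded, SAMPLES))

# Negated classes accept any single character outside the set.
print(re.fullmatch(r"[^abc]", "d") is not None)
print(re.fullmatch(r"[^a-z]", "q") is not None)
```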

Equivalence of FSM and Regular Expressions • Theorem: • For each finite state machine M, we can construct a regular expression R such that M and R accept the same language. • [proof omitted] • Theorem: • For each regular expression R, we can construct a finite state machine M such that R and M accept the same language. • [proof outline follows]

Regular Expressions to NFSM (1) • For each kind of reg exp, define an NFSM • Notation: NFSM for reg exp M [diagram: box labeled M] • For ε [diagram: transition labeled ε] • For input a [diagram: transition labeled a]

Regular Expressions to NFSM (2) • For A . B [diagram: NFSMs for A and B linked by an ε-move] • For A | B [diagram: NFSMs for A and B combined using four ε-moves]

Regular Expressions to NFSM (3) • For A* [diagram: NFSM for A combined with four ε-moves]
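The case-by-case construction can be sketched in code. The version below follows one common formulation (usually credited to Thompson) and may differ in detail from the diagrams on the slides; the class name Builder, the use of None as the ε label, and the helpers closure and accepts are all assumptions made for illustration.

```python
from itertools import count

EPS = None  # label used here for ε-moves (a representation chosen for this sketch)

class Builder:
    """Builds NFSM fragments; each fragment is a (start, accept) pair."""
    def __init__(self):
        self.edges = []            # list of (source, label, target)
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def symbol(self, a):           # NFSM for input a (or for ε if a is EPS)
        s, t = self.new_state(), self.new_state()
        self.edges.append((s, a, t))
        return s, t

    def concat(self, A, B):        # NFSM for A . B
        self.edges.append((A[1], EPS, B[0]))
        return A[0], B[1]

    def union(self, A, B):         # NFSM for A | B
        s, t = self.new_state(), self.new_state()
        self.edges += [(s, EPS, A[0]), (s, EPS, B[0]),
                       (A[1], EPS, t), (B[1], EPS, t)]
        return s, t

    def star(self, A):             # NFSM for A*
        s, t = self.new_state(), self.new_state()
        self.edges += [(s, EPS, A[0]), (s, EPS, t),
                       (A[1], EPS, A[0]), (A[1], EPS, t)]
        return s, t

def closure(states, edges):
    """All states reachable from `states` using only ε-moves."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for p, label, q in edges:
            if p == s and label is EPS and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def accepts(nfsm, edges, text):
    """Simulate the NFSM on text by tracking the set of current states."""
    start, accept = nfsm
    current = closure({start}, edges)
    for ch in text:
        moved = {q for p, label, q in edges if p in current and label == ch}
        current = closure(moved, edges)
    return accept in current

# (1|0)*1
b = Builder()
m = b.concat(b.star(b.union(b.symbol("1"), b.symbol("0"))), b.symbol("1"))
for text in ["1", "0101", "10", ""]:
    print(repr(text), accepts(m, b.edges, text))
```

Each operator adds a constant number of states and ε-moves, so the size of the NFSM grows linearly with the size of the regular expression.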

Example of RegExp -> NFSM conversion • Consider the regular expression (1|0)*1 • The NFSM is [diagram: NFSM with states A through J and transitions labeled 1, 0, and ε]

Converting NFSM to DFSM • Simulate the NFSM • Each state of the DFSM is a non-empty subset of states of the NFSM • The start state of the DFSM is the set of NFSM states reachable from the NFSM start state using only ε-moves • Add a transition S --a--> S' to the DFSM iff • S' is the set of NFSM states reachable from any state in S after consuming only the input a, considering ε-moves as well

Remarks on converting NFSM to DFSM • An NFSM may be in many states at any time • How many different states? • If there are N states, the NFSM must be in some subset of those N states • How many subsets are there? • 2^N = finitely many • For example, if N = 5 then 2^N = 32 subsets

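NFSM -> DFSM Example • [diagram: the NFSM for (1|0)*1 with states A through J and transitions labeled 1, 0, and ε] • [diagram: the resulting DFSM with states ABCDHI, FGHIABCD, and EJGHIABCD and transitions labeled 0 and 1]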
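The conversion can be sketched as a short subset construction. Note the assumptions: the edge list below is not taken from the slide (the diagram did not survive) but is a reconstruction chosen to be consistent with the DFSM state names ABCDHI, FGHIABCD, and EJGHIABCD; the names NFSM, eps_closure, subset_construction, and name are hypothetical.

```python
from collections import deque

EPS = "eps"   # label used here for ε-moves

# ASSUMED edges for the NFSM of (1|0)*1 (reconstructed, not from the slide).
NFSM = {
    ("A", EPS): {"B", "H"},
    ("B", EPS): {"C", "D"},
    ("C", "1"): {"E"},
    ("D", "0"): {"F"},
    ("E", EPS): {"G"},
    ("F", EPS): {"G"},
    ("G", EPS): {"A", "H"},
    ("H", EPS): {"I"},
    ("I", "1"): {"J"},
}
START, ACCEPT, ALPHABET = "A", "J", "01"

def eps_closure(states):
    """Set of NFSM states reachable from `states` using only ε-moves."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in NFSM.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction():
    """Return (start, table) where table[S][a] is the DFSM successor of S on a."""
    start = eps_closure({START})
    table, queue = {}, deque([start])
    while queue:
        S = queue.popleft()
        if S in table:
            continue
        table[S] = {}
        for a in ALPHABET:
            moved = {t for s in S for t in NFSM.get((s, a), ())}
            if moved:                       # DFSM states are non-empty subsets
                T = eps_closure(moved)
                table[S][a] = T
                queue.append(T)
    return start, table

def name(S):
    """Readable name for a DFSM state: its NFSM states in sorted order."""
    return "".join(sorted(S))

start, table = subset_construction()
for S, moves in table.items():
    final = " (accepting)" if ACCEPT in S else ""
    print(name(S) + final, {a: name(T) for a, T in moves.items()})
```

Under these assumptions the construction visits only three of the possible subsets. They are the same sets as on the slide, printed with their letters in sorted order (for example ABCDFGHI for FGHIABCD).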

TEST YOURSELF Question 3: First convert each of these regular expressions to an NFSM • (a | b | ε) (a | b) • (ab | ba)* (aa | bb) Question 4: Next convert each resulting NFSM to a DFSM
