1 / 34

1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars

1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars. Short History of Compiler Construction. Formerly "a mystery", today one of the best-known areas of computing. 1957 Fortran first compilers

malisaj
Download Presentation

1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars

  2. Short History of Compiler Construction Formerly "a mystery", today one of the best-known areas of computing 1957Fortranfirst compilers (arithmetic expressions, statements, procedures) 1960Algolfirst formal language definition (grammars in Backus-Naur form, block structure, recursion, ...) 1970Pascaluser-defined types, virtual machines (P-code) 1985C++object-orientation, exceptions, templates 1995Javajust-in-time compilation

  3. Why should I learn about compilers? It's part of the general background of a software engineer • How do compilers work? • How do computers work?(instruction set, registers, addressing modes, run-time data structures, ...) • What machine code is generated for certain language constructs?(efficiency considerations) • What is good language design? • Opportunity for a non-trivial programming project • Also useful for general software development • Reading syntactically structured command-line arguments • Reading structured data (e.g. XML files, part lists, image files, ...) • Searching in hierarchical namespaces • Interpretation of command codes • ...

  4. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars

  5. lexical analysis (scanning) token stream token number 1 (ident) "val" 3 (assign) - 2 (number) 10 4 (times) - 1 (ident) "val" 5 (plus) - 1 (ident) "i" token value syntax analysis (parsing) Statement syntax tree Expression Term ident = number * ident + ident Dynamic Structure of a Compiler character stream v a l = 1 0 * v a l + i

  6. optimization code generation ld.i4.s 10 ldloc.1 mul ... machine code Dynamic Structure of a Compiler Statement syntax tree Expression Term ident = number * ident + ident semantic analysis (type checking, ...) intermediate representation syntax tree, symbol table, ...

  7. Single-Pass Compilers Phases work in an interleaved way scan token parse token check token generate code for token n eof? y The target program is already generated while the source program is read.

  8. Multi-Pass Compilers Phases are separate "programs", which run sequentially sem. analysis scanner parser ... characters tokens tree code Each phase reads from a file and writes to a new file Why multi-pass? • if memory is scarce (irrelevant today) • if the language is complex • if portability is important

  9. language-dependent machine-dependent Java C Pascal Pentium PowerPC SPARC any combination possible Today: Often Two-Pass Compilers Front End Back End scanning parsing sem. analysis code generation intermediate representation • Advantages • better portability • many combinations between front endsand back ends possible • optimizations are easier on the intermediaterepresentation than on source code • Disadvantages • slower • needs more memory

  10. Interpreter executes source code "directly" • statements in a loop arescanned and parsedagain and again scanner parser source code interpretation Variant: interpretation of intermediate code • source code is translated into thecode of a virtual machine (VM) • VM interprets the codesimulating the physical machine ... compiler ... VM source code intermediate code (e.g. Common IntermediateLanguage (CIL)) Compiler versus Interpreter Compiler translates to machine code scanner parser ... code generator loader source code machine code

  11. data flow Static Structure of a Compiler "main program" directs the whole compilation parser & sem. analysis scanner code generation provides tokens from the source code generates machine code symbol table maintains information about declared names and types uses

  12. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars

  13. Four components terminal symbols are atomic "if", ">=", ident, number, ... nonterminal symbols are derived into smaller units Statement, Expr, Type, ... productions rules how to decom- pose nonterminals Statement = Designator "=" Expr ";". Designator = ident ["." ident]. ... start symbol topmost nonterminal CSharp What is a grammar? Example Statement = "if" "(" Condition ")" Statement ["else" Statement].

  14. Regular Grammars • In Computer Science, a right regular grammar is a formal grammar (N, Σ, P, S) such that all the production rules in P are of one of the following forms: A → a - where A is a non-terminal in N and a is a terminal in Σ A → aB - where A and B are in N and a is in Σ A → ε - where A is in N and ε denotes the empty string, i.e. the string of length 0. In a left regular grammar, all rules obey the forms A → a - where A is a non-terminal in N and a is a terminal in Σ A → Ba - where A and B are in N and a is in Σ A → ε - where A is in N and ε is the empty string.

  15. | (...) [...] {...} separates alternatives groups alternatives optional part repetitive part a | b | c  a or b or c a ( b | c )  ab | ac [ a ] b  ab | b { a } b  b | ab | aab | aaab | ... EBNF Notation Extended Backus-Naur form John Backus: developed the first Fortran compiler Peter Naur: edited the Algol60 report symbol meaning examples string name = . denotes itself denotes a T or NT symbol separates the sides of a production terminates a production "=", "while" ident, Statement A = b c d . • Conventions • terminal symbols start with lower-case letters (e.g. ident) • nonterminal symbols start with upper-case letters (e.g. Statement)

  16. Terminal symbols simple TS: terminal classes: "+", "-", "*", "/", "(", ")" (just 1 instance) ident, number (multiple instances) Nonterminal symbols Expr, Term, Factor Start symbol Expr Example: Grammar for Arithmetic Expressions Productions Expr = [ "+" | "-" ] Term { ( "+" | "-" ) Term }. Term = Factor { ( "*" | "/" ) Factor }. Factor = ident | number | "(" Expr ")". Expr Term Factor

  17. - ident * number + ident / number - ident  - Factor * Factor + Factor / Factor - Factor  "*" and "/" have higher priority than "+" and "-" - Term + Term - Term "-" does not refer to a, but to a*3  Expr Operator Priority Grammars can be used to define the priority of operators Expr = [ "+" | "-" ] Term { ( "+" | "-" ) Term }. Term = Factor { ( "*" | "/" ) Factor }. Factor = ident | number | "(" Expr ")". input:- a * 3 + b / 4 - c How must the grammar be transformed, so that "-" refers to a? Expr = Term { ( "+" | "-" ) Term }. Term = Factor { ( "*" | "/" ) Factor }. Factor = [ "+" | "-" ] ( ident | number | "(" Expr ")" ).

  18. Terminal Start Symbols of Nonterminals Which terminal symbols can a nonterminal start with? Expr = ["+" | "-"] Term {("+" | "-") Term}. Term = Factor {("*" | "/") Factor}. Factor = ident | number | "(" Expr ")". First(Factor) = ident, number, "(" First(Term) = First(Factor) = ident, number, "(" First(Expr) = "+", "-", First(Term) = "+", "-", ident, number, "("

  19. Terminal Successors of Nonterminals Which terminal symbols can follow after a nonterminal in the grammar? Expr = [ "+" | "-" ] Term { ( "+" | "-" ) Term }. Term = Factor { ( "*" | "/" ) Factor }. Factor = ident | number | "(" Expr ")". Where does Expr occur on the right-hand side of a production? What terminal symbols can follow there? Follow(Expr) = ")", eof Follow(Term) = "+", "-", Follow(Expr) = "+", "-", ")", eof Follow(Factor) = "*", "/", Follow(Term) = "*", "/", "+", "-", ")", eof

  20. String A finite sequence of symbols from an alphabet. Empty String The string that contains no symbol (denoted by e). Some Terminology Alphabet The set of terminal and nonterminal symbols of a grammar Strings are denoted by greek letters (a, b, g, ...) e.g: a = ident + number b = - Term + Factor * number

  21. a*b (indirect derivation) ag1g2 ... gnb aLb (left-canonical derivation) the leftmost NTS in a is derived first aRb (right-canonical derivation) the rightmost NTS in a is derived first Reduction The converse of a derivation: If the right-hand side of a production occurs in b it is replaced with the corresponding NTS Derivations and Reductions Derivation a b ab (direct derivation)  Term + Factor * Factor Term + ident * Factor NTS right-hand side of a production of NTS

  22. Deletability A string a is called deletable, if it can be derived to the empty string. a* e Example A = B C. B = [ b ]. C = c | d | . B is deletable: B e C is deletable: C e A is deletable: A  B C  C e

  23. Sentential form Any string that can be derived from the start symbol of the grammar. e.g.: Expr Term + Term + Term Term + Factor * ident + Term ... Sentence A sentential form that consists of terminal symbols only. e.g.: ident * number + ident Language (formal language) The set of all sentences of a grammar (usually infinitely large). e.g.: the C# language is the set of all valid C# programs More Terminology Phrase Any string that can be derived from a nonterminal symbol. Term phrases: Factor Factor * Factor ident * Factor ...

  24. Direct recursion A w1 A w2 Left recursion A = b | A a. A  A a  A a a  A a a a  b a a a a a ... Right recursion A = b | a A. A  a A  a a A  a a a A  ... a a a a a b Central recursion A = b | "(" A ")". A  (A)  ((A))  (((A)))  (((... (b)...))) Indirect recursion A * w1 A w2 Example Expr  Term  Factor  "(" Expr ")" Expr = Term { "+" Term }. Term = Factor { "*" Factor }. Factor = id | "(" Expr ")". Recursion A production is recursive if A * w1 A w2 Can be used to represent repetitions and nested structures

  25. Left recursion can be transformed to iteration E = T | E "+" T. What sentences can be derived? T T + T T + T + T ... From this one can deduce the iterative EBNF rule: E = T { "+" T }. How to Remove Left Recursion Left recursion cannot be handled in topdown syntax analysis Both alternatives start with b. The parser cannot decide which one to choose A = b | A a.

  26. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars

  27. BNF grammar for arithmetic expressions • Alternatives are transformed intoseparate productions • Repetition must be expressed by recursion <Expr> ::= <Sign> <Term> <Expr> ::= <Expr> <Addop> <Term> <Sign> ::= + <Sign> ::= - <Sign> ::= <Addop> ::= + <Addop> ::= - <Term> ::= <Factor> <Term> ::= <Term> <Mulop> <Factor> <Mulop> ::= * <Mulop> ::= / <Factor> ::= ident <Factor> ::= number <Factor> ::= ( <Expr> ) Plain BNF Notation terminal symbols are written without quotes (ident, +, -) nonterminal symbols are written in angle brackets (<Expr>, <Term>) sides of a production are separated by ::= • Advantages • fewer meta symbols (no |, (), [], {}) • it is easier to build a syntax tree • Disadvantages • more clumsy

  28. Abstract Syntax Tree(leaves = operands, inner nodes = operators) + often used as an internal program representation; used for optimizations * ident number ident Syntax Tree Shows the structure of a particular sentence e.g. for 10 + 3 * i Concrete Syntax Tree(parse tree) Would not be possible with EBNF because of [...] and {...}, e.g.: Expr = [ Sign ] Term { Addop Term }. Expr Expr Addop Term Sign Term Term Mulop Factor Also reflects operator priorities:operators further down in the tree have a higher priority than operators further up in the tree. Factor Factor number + number * ident e

  29. Two syntax trees can be built for this sentence: T T T T T T T T T T F F F F F F id * id * id id * id * id Ambiguous grammars cause problems in syntax analysis! Ambiguity A grammar is ambiguous, if more than one syntax tree can be built for a given sentence. Example sentence:id * id * id T = F | T "*" T. F = id.

  30. Even better: transformation to EBNF T = F { "*" F }. F = id. Removing Ambiguity Example T = F | T "*" T. F = id. Only the grammar is ambiguous, not the language. The grammar can be transformed to i.e. T has priority over F T T = F | T "*" F. F = id. T only this syntax tree is possible T T T F F F id * id * id

  31. Statement There is no unambiguous grammar for this language! Statement Condition Condition Statement Statement C# solutionAlways recognize the longest possible right-hand side of a production  leads to the lower of the two syntax trees Condition Condition Statement Statement Statement Statement Inherent Ambiguity There are languages which are inherently ambiguous Example: Dangling Else Statement = Assignment | "if" Condition Statement | "if" Condition Statement "else" Statement | ... . if (a < b) if (b < c) x = c; else x = b;

  32. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars

  33. class 0 Unrestricted grammars (a and b arbitrary) e.g: A = a A b | B c B. aBc = d. dB = bb. class 1 Contex-sensitive grammars (|a|  |b|) e.g: a A = a b c. Recognized by linear bounded automata class 2 Context-free grammars (a = NT, be) e.g: A = a b c. Recognized by push-down automata Only these two classes are relevant in compiler construction class 3 Regular grammars (a = NT, b= T | T NT) e.g: A = b | b B. Recognized by finite automata Classification of Grammars Due to Noam Chomsky (1956) Grammars are sets of productions of the form a = b. A  aAb  aBcBb  dBb  bbb Recognized by Turing machines

  34. 1. Overview 1.1 Motivation 1.2 Structure of a Compiler 1.3 Grammars 1.4 Syntax Tree and Ambiguity 1.5 Chomsky's Classification of Grammars

More Related