
Lecture 5 Grammars

This lecture moves on from lexical analysis to grammars, derivations, and context-free languages. It also includes a pop quiz and a discussion of building a lexical analyzer for a subset of C.



  1. Lecture 5 Grammars CSCE 531 Compiler Construction • Topics • Moving on from Lexical Analysis • Grammars • Derivations • CFLs • Readings: 4.1 January 25, 2006

  2. Overview • Last Time • Symbol table - hash table from K&R • DFA review • Simulating DFA figure 3.22 • NFAs • Thompson Construction: re → NFA • Examples • NFA → DFA, the subset construction • ε-closure(s), ε-closure(T), move(T, a) • Today’s Lecture • Flex example Fig 3.28 revisited • References

  3. Pop Quiz • Draw the NFA that recognizes (00 | 11)* (01 | 10). • Given an NFA M_R that recognizes the language denoted by a regular expression R, build a machine that recognizes R_even, i.e., one that matches R an even number of times.

  4. Lexical analyzer for subset of C • int constants: decimal, octal, hex • Float constants • C identifiers • Keywords • for, while, if, else • Relational operators • < > >= <= != == • Arithmetic, Boolean and bit operators • + - * / && || ! ~ & | • Other symbols • ; { } [ ] * ->
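
A minimal sketch of what part of such a flex specification could look like. The token codes and their values here are illustrative assumptions; in the actual assignment they would come from y.tab.h, and install_id()/install_num() would do the symbol-table work. Note that each relational operator gets its own token code, per the note on the next slide.

    %{
    /* Hypothetical token codes -- the real ones come from y.tab.h */
    enum { IF = 258, ELSE, FOR, WHILE, LT, LE, GT, GE, EQ, NE, ID, INT_CONST };
    %}

    digit    [0-9]
    letter   [A-Za-z_]

    %%
    "if"        { return IF; }
    "else"      { return ELSE; }
    "for"       { return FOR; }
    "while"     { return WHILE; }
    "<="        { return LE; }
    ">="        { return GE; }
    "=="        { return EQ; }
    "!="        { return NE; }
    "<"         { return LT; }
    ">"         { return GT; }
    {letter}({letter}|{digit})*                { return ID; }        /* install in symbol table */
    0[xX][0-9a-fA-F]+|0[0-7]*|[1-9]{digit}*    { return INT_CONST; } /* hex, octal, decimal */
    [ \t\n]+    ;                                                    /* skip whitespace */
    %%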

  5. Write core.l Flex Specification • Due Monday Jan 30 • Notes • Install identifiers and constants into the symbol table • Return a separate token code for each relational operator. Not as in the text!! • Homework 02 Due Thursday Jan 26 (now Saturday 28) • Construct NFA for recognizing (a|b|ε)(ab)* • Convert to DFA

  6. Flex example Fig 3.28 revisited • /class/csce531-001/Examples/Flex • Put “e=/class/csce531-001/Examples/” in your .bash_profile in your home directory (note the period makes it hidden.) • Then when you login you can use “cd $e” to move to the Examples directory • Files • ex0.l, ex1.l (note last character is lowercase “L”) • ex3.18.l, Makefile, y.tab.h • Fixed a few things so it would actually compile and run

  7. Building and Running ex3.18 • Preliminary steps • cp $e/Flex/ex3.18.l . // copy lex-spec to current directory • cp $e/Flex/Makefile . • cp $e/Flex/y.tab.h . • flex ex3.18.l // creates the file lex.yy.c • ls • gcc lex.yy.c -lfl • ./a.out • if then else xbar • (output)

  8. Routines section
     %%
     main() {
         int tok;
         /* yylex() returns the next token's code; yytext holds its lexeme.
            Note: by default a flex-generated yylex() returns 0 at end of input,
            so this test may need to be != 0 unless the spec itself returns EOF. */
         while ((tok = yylex()) != EOF) {
             printf("Token code %d\t lexeme %s \n", tok, yytext);
         }
     }
     /* Code for install_id() and install_num(): install the current lexeme
        (yytext) into the symbol table and return the appropriate token code. */
     int install_id() {
     }
     int install_num() {
     }

  9. Regular Languages • Regular Expressions → NFA → DFA • All specify/recognize the same languages; in formal language theory these languages are called regular.

  10. Example of a Non-Regular Language • L = { 0^n 1^n | n > 0 } is non-regular. • Proof: Suppose that L were a regular language; then there would exist some DFA M that accepts L. • Suppose that M has k states. • Consider the collection of strings 0, 00, 000, …, 0^k, 0^(k+1). • Then by the pigeonhole principle, if you start at q0 and follow the paths determined by the k+1 strings above, two of the strings, say 0^i and 0^j with i ≠ j, leave you in the same state q. • But from state q, following the path determined by the string 1^i must leave you in a final state, since 0^i 1^i is in L. • But then 0^j 1^i must be accepted also, even though it is not in L. • This is a contradiction, which proves that L is not regular. QED • Intuitively, a DFA can count only a finite (bounded) number of things. • The language of balanced parentheses is non-regular also.

  11. Moving on Up to the Parsing Side • Lexical analysis can’t do it all. • Syntax analysis recognizes things from context: it is the process of discovering the structure of a sentence or program. • We need a mathematical model of syntax, a grammar G. • We need an algorithm for testing membership in L(G).

  12. The Role of the Parser • Figure 4.1

  13. Context-Free Grammars • A context-free grammar is a formal mathematical model with 4 components, G = (N, T, P, S), where • N is a set of grammar symbols called nonterminals • T is a set of terminals (or tokens) • P is a set of productions or rewrite rules of the form Nonterminal → string of grammar symbols • E.g., N → a b N • Terminology: left-hand side, right-hand side; grammar symbols = N ∪ T • S is the start symbol (a nonterminal) • Generally a grammar is specified by listing the productions.
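
As a concrete data-structure view of this definition, here is a rough sketch of how a grammar G = (N, T, P, S) might be represented in C; the type and field names and the fixed array sizes are illustrative assumptions, not anything from the course code.

    /* One production: lhs -> rhs[0] rhs[1] ... rhs[rhs_len - 1] */
    struct production {
        int lhs;          /* index of a nonterminal                      */
        int rhs[8];       /* right-hand side: nonterminals and terminals */
        int rhs_len;      /* rhs_len == 0 encodes an epsilon production  */
    };

    struct grammar {
        int num_nonterminals;        /* symbol codes 0..num_nonterminals-1 form N */
        int num_terminals;           /* remaining symbol codes form T (tokens)    */
        struct production prods[64]; /* the set P                                 */
        int num_prods;
        int start_symbol;            /* S, one of the nonterminal indices         */
    };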

  14. Example Context-Free Grammars • Example: G = (N, T, P, S) • N = {S, T} • T = {a, b, c} • P = { S → aS, S → bT, T → c } • Notational conventions • Nonterminals are typically represented by capital letters N, T, P, S, … or lower-case strings in italics, e.g., expr • Terminals are typically represented by lower-case letters a, b, … z, punctuation symbols, operators, parentheses, digits • Unless otherwise stated, the nonterminal of the first production is the start symbol • “|” shorthand: S → aS | bT is shorthand for the two S productions S → aS and S → bT • Lower-case Greek symbols (α, β, γ, …) represent strings of grammar symbols

  15. Derivations • The derives (⇒) relation is a binary relation between strings of grammar symbols. • We define derives as follows: • If T → X1 X2 … Xn is a production and α and β are strings of grammar symbols, then we say αTβ derives αX1X2…Xnβ and denote this by αTβ ⇒ αX1X2…Xnβ • Example: using the production S → aS from the previous grammar, bS ⇒ baS (here α = b and β is the empty string).

  16. Review of Properties of Binary Relations • If R is a binary relation on A then R is a subset of A × A • R is symmetric if a R b implies b R a • R is transitive if a R b and b R c implies a R c • The transitive closure of R is the minimal subset of A × A that contains R and is a transitive relation. • Henceforth we will use ⇒* (read “derives”) to denote the transitive closure of “⇒”, the one-step derives relation on the previous slide. • α ⇒* β means α ⇒ α1 ⇒ α2 ⇒ … ⇒ αn = β • Thus α ⇒* β means that one can apply a production to rewrite α as α1, then apply a production to α1 to obtain α2, …, and eventually obtain β

  17. Derivations and Sentential Forms • If α ⇒* β, i.e., α ⇒ α1 ⇒ α2 ⇒ … ⇒ αn = β, then we say the sequence of rewrites forms a derivation of β from α. • The purpose of a grammar is to rewrite strings of grammar symbols until we obtain a string of terminals (tokens). • If G = (N, T, P, S) is a grammar, then α is a sentential form if • the start symbol derives α: S ⇒* α, and • α derives a string of tokens: α ⇒* ω, where ω ∈ T* • Or written more concisely: S ⇒* α ⇒* ω, where ω ∈ T*

  18. Language Generated by a Grammar • If G = (N, T, P, S) is a grammar, then the language generated by G, denoted L(G), is • L(G) = { x ∈ T* | S ⇒* x } • Example • S → 0 S 1 | ε • Here L(G) = { 0^n 1^n | n ≥ 0 }, essentially the non-regular language from slide 10.
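
This example also shows why grammars are more powerful than DFAs: slide 10 proved that no DFA recognizes { 0^n 1^n }, yet a recursive procedure that mirrors the two productions recognizes it directly. A small self-contained sketch in C (the function name is an illustrative choice):

    #include <stdio.h>
    #include <string.h>

    /* Recognize L(G) for S -> 0 S 1 | epsilon on the substring s[lo..hi). */
    static int is_S(const char *s, int lo, int hi) {
        if (lo == hi)                              /* S -> epsilon            */
            return 1;
        if (s[lo] == '0' && s[hi - 1] == '1')      /* S -> 0 S 1              */
            return is_S(s, lo + 1, hi - 1);        /* the middle must be an S */
        return 0;
    }

    int main(void) {
        const char *tests[] = { "", "01", "0011", "001", "0101" };
        for (int i = 0; i < 5; i++)
            printf("%-5s : %s\n", tests[i],
                   is_S(tests[i], 0, (int)strlen(tests[i])) ? "in L(G)" : "not in L(G)");
        return 0;
    }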

  19. Parse Trees • A parse tree is a graphical representation of a derivation, satisfying • The root is the start symbol • Each leaf is a token or ε (note: different font from the text) • Each interior node is a nonterminal • If A is a parent with children X1, X2, …, Xn then A → X1 X2 … Xn is a production
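
In a compiler this definition translates into a recursive data structure; a rough sketch in C (the field names and the fixed child-array size are illustrative assumptions):

    /* A parse-tree node. Interior nodes hold a nonterminal and one child per
       symbol on the right-hand side of the production used; leaves hold a
       token (or epsilon) and, for tokens, the matched lexeme. */
    struct tree_node {
        int symbol;                     /* nonterminal, token code, or epsilon */
        const char *lexeme;             /* set only for token leaves           */
        struct tree_node *children[8];
        int num_children;               /* 0 for leaves                        */
    };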

  20. Top-Down vs. Bottom-Up Construction of Parse Trees • G: • S → (E)*S • S → (E) • E → id + id • [Slide figure: step-by-step top-down construction of a parse tree for this grammar, expanding from the root S down to the id + id leaves.]

  21. Bottom-Up Construction of Parse Trees • G: • S → (E)*S • S → (E) • E → id + id • X * Y + Z * W

  22. Leftmost (Rightmost) Derivations • A derivation S ⇒* ω, where ω ∈ (N ∪ T)*, is called leftmost if at each step you rewrite the leftmost nonterminal in the sentential form. • If we want to emphasize that this is a leftmost derivation we will write S ⇒*lm ω, read “S leftmost derives ω”. • Example • E → E + E • E → E * E • E → id • We will henceforth use the ‘|’ shorthand and write this grammar as • E → E + E | E * E | id • Rightmost derivations are defined in a similar manner.
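
For instance, with this grammar the string id + id has the leftmost derivation E ⇒ E + E ⇒ id + E ⇒ id + id and the rightmost derivation E ⇒ E + E ⇒ E + id ⇒ id + id. Both use the same productions and correspond to the same parse tree; they differ only in the order in which nonterminals are expanded.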

  23. Ambiguity • A grammar is ambiguous if there is a string of terminals that has two distinct parse trees (or, equivalently, two distinct leftmost derivations or two distinct rightmost derivations). • Example: E → E + E | E * E | id
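
To see the ambiguity, consider id + id * id. One leftmost derivation groups it as id + (id * id): E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id. Another groups it as (id + id) * id: E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id. Two distinct leftmost derivations (and two distinct parse trees) for the same string make the grammar ambiguous.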

  24. Eliminating Ambiguity • The approach taken is to rewrite the grammar. • However, there are certain languages for which every grammar that generates them is ambiguous. • These are called inherently ambiguous languages. We will not consider any of them in this class.

  25. Consider the grammar for expressions • E → E + E | E * E | id

  26. Derivations and Precedence • This grammar has no notion of precedence! • To add precedence: • Create a nonterminal for each level of precedence • Isolate the corresponding part of the grammar • Force the parser to recognize high-precedence subexpressions first • For algebraic expressions • Multiplication and division first (level one) • Subtraction and addition next (level two)

  27. Rewriting the Expression Grammar • Add nonterminals for each level of precedence • Term (product) for components of sums • Factor for components of products (terms) • Expr → Expr + Term • Expr → Expr - Term • Expr → Term • Term → Term * Factor • Term → Term / Factor • Term → Factor • Factor → ID • Factor → NUMBER • Factor → ( Expr )
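
A grammar organized this way maps almost directly onto a recursive-descent parser, one function per nonterminal, with the left-recursive Expr and Term productions turned into loops (the left-recursion elimination discussed on slide 39). Below is a rough, self-contained sketch in C; the token codes and the stub token stream stand in for a real scanner and are purely illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    enum { TOK_ID, TOK_NUMBER, TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_SLASH,
           TOK_LPAREN, TOK_RPAREN, TOK_EOF };

    /* Stub token stream standing in for yylex(): the tokens of 5 * X + 3 * Y. */
    static int input[] = { TOK_NUMBER, TOK_STAR, TOK_ID, TOK_PLUS,
                           TOK_NUMBER, TOK_STAR, TOK_ID, TOK_EOF };
    static int pos = 0;
    static int next_token(void) { return input[pos++]; }

    static int lookahead;                          /* current token */

    static void error(const char *msg) {
        fprintf(stderr, "parse error: %s\n", msg);
        exit(1);
    }

    static void match(int tok) {
        if (lookahead == tok) lookahead = next_token();
        else error("unexpected token");
    }

    static void expr(void);

    /* Factor -> ID | NUMBER | ( Expr ) */
    static void factor(void) {
        if (lookahead == TOK_ID)            match(TOK_ID);
        else if (lookahead == TOK_NUMBER)   match(TOK_NUMBER);
        else if (lookahead == TOK_LPAREN) { match(TOK_LPAREN); expr(); match(TOK_RPAREN); }
        else error("expected id, number, or '('");
    }

    /* Term -> Term * Factor | Term / Factor | Factor,
       rewritten without left recursion as: Factor { ('*' | '/') Factor } */
    static void term(void) {
        factor();
        while (lookahead == TOK_STAR || lookahead == TOK_SLASH) {
            match(lookahead);
            factor();
        }
    }

    /* Expr -> Expr + Term | Expr - Term | Term,
       rewritten without left recursion as: Term { ('+' | '-') Term } */
    static void expr(void) {
        term();
        while (lookahead == TOK_PLUS || lookahead == TOK_MINUS) {
            match(lookahead);
            term();
        }
    }

    int main(void) {
        lookahead = next_token();
        expr();
        if (lookahead != TOK_EOF) error("trailing input");
        printf("parsed OK\n");
        return 0;
    }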

  28. Derivation of 5 * X + 3 * Y
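
The derivation figure itself is not reproduced in this transcript. With the rewritten grammar, a leftmost derivation of the token string NUMBER * ID + NUMBER * ID (the tokens of 5 * X + 3 * Y) runs: Expr ⇒ Expr + Term ⇒ Term + Term ⇒ Term * Factor + Term ⇒ Factor * Factor + Term ⇒ NUMBER * Factor + Term ⇒ NUMBER * ID + Term ⇒ NUMBER * ID + Term * Factor ⇒ NUMBER * ID + Factor * Factor ⇒ NUMBER * ID + NUMBER * Factor ⇒ NUMBER * ID + NUMBER * ID. Each product is derived beneath the +, so multiplication effectively binds tighter than addition.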

  29. Notes on the rewritten grammar • It is more complex: more nonterminals, more productions. • It requires more steps in a derivation. • But it does eliminate the ambiguity.

  30. Ambiguous Grammar 2: If-Else • The leftmost and rightmost derivations for a sentential form may differ, even in an unambiguous grammar • Classic example: the if-then-else problem • Stmt → if Expr then Stmt | if Expr then Stmt else Stmt | other stmts

  31. Ambiguity • This sentential form has two derivations: • if Expr1 then if Expr2 then Stmt1 else Stmt2

  32. Removing the ambiguity • To eliminate the ambiguity • We must rewrite the grammar to avoid generating the problem • We must associate each else with the innermost unmatched if • S → withElse

  33. Ambiguity • Removing the ambiguity • Must rewrite the grammar to avoid generating the problem • Match each else to the innermost unmatched if • With this grammar, the example has only one derivation • Intuition: a NoElse always has no else on its last cascaded else-if statement
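
The rewritten grammar survives only as a fragment in this transcript (the "S → withElse" line on slide 32). The standard rewrite, using the WithElse / NoElse split these slides refer to, looks roughly like this: a WithElse is a statement whose trailing if (if any) has a matching else, and a NoElse ends in an unmatched if.

    Stmt     → WithElse | NoElse
    WithElse → if Expr then WithElse else WithElse | OtherStmt
    NoElse   → if Expr then Stmt | if Expr then WithElse else NoElse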

  34. Ambiguity • if Expr1 then if Expr2 then Stmt1 else Stmt2 • This binds the else controlling Stmt2 to the inner if

  35. Deeper Ambiguity • Ambiguity usually refers to confusion in the CFG • Overloading can create deeper ambiguity, e.g., a = f(17) • In many Algol-like languages, f could be either a function or a subscripted variable • Disambiguating this one requires context • Need values of declarations • Really an issue of type, not context-free syntax • Requires an extra-grammatical solution (not in CFG) • Must handle these with a different mechanism • Step outside the grammar rather than use a more complex grammar

  36. Regular Languages and Grammars • A grammar where all productions are of the form A → a or A → aB, where A, B ∈ N and a ∈ T, is called right-linear, or sometimes a regular grammar. • It turns out that the language generated by a right-linear grammar is a regular language. • How would you prove that?

  37. Context-Free Languages • A language L is called a context-free language (CFL) if there exists a context-free grammar that generates it, i.e., L = L(G).

  38. Left Recursion • A grammar is (immediately) left recursive if it has a production whose right-hand side begins with its own left-hand-side nonterminal: • A → Aα | β

  39. Elimination of Immediate Left Recursion
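
The details of this slide are not in the transcript; the standard transformation from the textbook replaces the immediate left recursion of slide 38 with right recursion, using a new nonterminal A':

    A  → Aα | β        becomes        A  → βA'
                                      A' → αA' | ε

Both grammars generate β followed by zero or more α's, but the second can be parsed top-down.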

  40. Error handling • Error handling • Error detection • Error recovery

  41. Fig 3.27 NFA → DFA
