
CSE 505 Lecture 3 February 7, 2017



  1. CSE 505, Lecture 3, February 7, 2017

  2. Examples of Grammars • Pascal Grammar • C (Yacc) Grammar • Java Grammar • Python Grammar CSE 505 / Jayaraman

  3. Brief Remarks on the Syntax of Lisp and C

  4. Remarks on Syntax of PLs • The languages LISP and Scheme use a fully parenthesized notation, called Cambridge Prefix Notation. It is very systematic and has the benefit of avoiding ambiguity altogether. • Sample Program:
(defun f (x y)
  (if (eql x y)
      x
      (g (h x) y)))

  5. Advantages of LISP Syntax Example: The expression 10+20*30+40, with * having higher precedence than +, would be written in LISP as: (+ 10 (+ (* 20 30) 40)) The fully parenthesized notation allows + and * to take more than two arguments without any ambiguity, so the same expression can be written more compactly as: (+ 10 (* 20 30) 40) The LISP notation is good for functional expressions – it eliminates the need for keywords.
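The variadic reading of + and * can be sketched in Java using varargs; the helper names plus and times below are our own illustration, not from the lecture:

```java
public class Prefix {
    // Variadic prefix operators, mirroring Lisp's (+ ...) and (* ...).
    static int plus(int... xs) { int s = 0; for (int x : xs) s += x; return s; }
    static int times(int... xs) { int p = 1; for (int x : xs) p *= x; return p; }

    public static void main(String[] args) {
        // (+ 10 (* 20 30) 40) == 10 + 20*30 + 40
        System.out.println(plus(10, times(20, 30), 40)); // 650
    }
}
```

Because the operator comes first and the arguments are explicitly delimited, no precedence or associativity rules are needed to read the expression.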

  6. A note on C expressions The C programming language does not have a separate boolean type. The integer 0 stands for false and all non-zero integers stand for true. Thus, arithmetic expressions form part of the boolean expression grammar. For example, 5 && 6 evaluates to 1 (true), because 5 and 6 (being non-zero) are both considered true.
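A small contrast in Java (the truthy helper is our own illustration): Java has a separate boolean type and rejects 5 && 6 at compile time, so the C behaviour must be modelled with an explicit conversion:

```java
public class CBool {
    // C-style truthiness: 0 is false, any non-zero int is true.
    static boolean truthy(int x) { return x != 0; }

    public static void main(String[] args) {
        // In C, 5 && 6 evaluates to 1 (true). Java will not compile 5 && 6,
        // so we write the conversion explicitly:
        System.out.println(truthy(5) && truthy(6)); // true
    }
}
```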

  7. Ambiguous Grammar Definition: A grammar G is said to be ambiguous if there is some string s ∈ L(G) with two or more parse trees. The aexp and bexp grammars by themselves are unambiguous, but the rule expr → aexp | bexp is ambiguous.

  8. Expression Grammar
expr → aexp | bexp
aexp → term | aexp + term
term → fact | term * fact
fact → num | id | '(' aexp ')'
bexp → bterm | bexp '||' bterm
bterm → bfact | bterm && bfact
bfact → true | false | id | ! bfact | '(' bexp ')' | '(' aexp relop aexp ')'
relop → = | <= | '>=' | < | '>'

  9. Ambiguity in Expression Grammar
expr → aexp | bexp
expr ==> aexp ==> term ==> fact ==> id
expr ==> bexp ==> bterm ==> bfact ==> id
There are infinitely many strings that can be derived in two ways: id, (id), ((id)), (((id))), …

  10. Aside: Theory on Ambiguity To show that a grammar is ambiguous, we need to show one string with two parse trees. But to show that a grammar is unambiguous we need to reason about all possible strings – this is harder to prove. From the Theory of Computing: “Ambiguity of a Context-Free Grammar is undecidable.” That is, it is impossible to write a computer program that inputs an arbitrary CFG and outputs yes/no indicating whether the input grammar is ambiguous.

  11. Removing Ambiguity in Expression Grammar
assign → var = expr
expr → aexp | bexp
aexp → …
bexp → …
Ambiguous, because both aexp and bexp generate id, (id), ((id)), etc. Let’s merge the aexp and bexp grammars, as follows:
assign → var '=' expr
expr → term | term op1 expr
term → fact | fact op2 term
fact → num | true | false | '(' expr ')'
op1 → '+' | '-' | '||'
op2 → '*' | '/' | '&&'
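As a sketch, the merged grammar can be recognized by a short recursive-descent routine. The class below and its string-token handling are our own illustration, not the course parser:

```java
import java.util.*;

public class MergedExprRecognizer {
    // Recognizer sketch for the merged grammar:
    //   expr -> term [op1 expr]    op1 -> '+' | '-' | '||'
    //   term -> fact [op2 term]    op2 -> '*' | '/' | '&&'
    //   fact -> num | true | false | '(' expr ')'
    private final List<String> toks;
    private int pos = 0;

    MergedExprRecognizer(List<String> toks) { this.toks = toks; }

    private String peek() { return pos < toks.size() ? toks.get(pos) : "$"; }

    private boolean expr() {
        if (!term()) return false;
        if (Set.of("+", "-", "||").contains(peek())) { pos++; return expr(); }
        return true;
    }

    private boolean term() {
        if (!fact()) return false;
        if (Set.of("*", "/", "&&").contains(peek())) { pos++; return term(); }
        return true;
    }

    private boolean fact() {
        String t = peek();
        if (t.matches("\\d+") || t.equals("true") || t.equals("false")) { pos++; return true; }
        if (t.equals("(")) {
            pos++;
            if (!expr() || !peek().equals(")")) return false;
            pos++;
            return true;
        }
        return false;
    }

    public boolean recognize() { return expr() && pos == toks.size(); }

    public static void main(String[] args) {
        System.out.println(new MergedExprRecognizer(List.of("10", "+", "20", "*", "30")).recognize()); // true
        System.out.println(new MergedExprRecognizer(List.of("10", "+")).recognize());                  // false
    }
}
```

Note that the recognizer happily accepts type-incorrect strings such as 10 && 20; that is exactly the over-generation problem the next slide raises.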

  12. Need for Attributes The merged expression grammar is unambiguous, but it generates many incorrectly typed expressions, such as:
10 && 20
true * 101
10*20 || false - 30
…
We need to constrain the grammar through the use of attributes and semantic clauses, to avoid over-generation. This takes us into the subject of Attribute Grammars … which we will examine shortly.

  13. Program Statements
stmt → assign | cond | loop | cmpd
assign → var = expr ;
expr → aexp | bexp
cond → if '(' expr ')' stmt [else stmt]
loop → while '(' expr ')' stmt
cmpd → '{' stmts '}'
stmts → { stmt }

  14. Dangling-else Ambiguity (Java) There are two possible parses for if (e1) if (e2) s1 else s2.
Parse 1:
cond => if (expr) stmt else stmt
     => if (expr) cond else stmt
     => if (expr) if (expr) stmt else stmt
     =>* if (e1) if (e2) s1 else s2
Parse 2:
cond => if (expr) stmt
     => if (expr) cond
     => if (expr) if (expr) stmt else stmt
     =>* if (e1) if (e2) s1 else s2
Programming languages prefer the second parse.

  15. Dangling-else Ambiguity (cont’d) if (e1) if (e2) s1 else s2: the slide draws the two possible parse trees, of which the one on the left, attaching the else to the inner if, is the preferred parse.
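Java’s resolution of the dangling else can be checked directly. This small example (our own, not from the slides) shows the else binding to the nearest unmatched if, i.e., the preferred parse:

```java
public class DanglingElse {
    // Java attaches an else to the nearest unmatched if, so the else
    // below belongs to the inner if, not the outer one.
    static String classify(int x) {
        if (x > 0)
            if (x > 10) return "big";
            else return "small positive";   // binds to the inner if
        return "non-positive";
    }

    public static void main(String[] args) {
        System.out.println(classify(5));   // small positive
        System.out.println(classify(-1));  // non-positive
    }
}
```

If the other parse were used, classify(-1) would hit the else branch; it does not.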

  16. More Ambiguity in Java Stmts if (exp1) while (exp2) if (exp3) stmt1 else stmt2: the slide shows the two possible parse trees; the preferred parse again attaches the else to the innermost if.

  17. How to Resolve Ambiguity For operators, try to rewrite the grammar using precedence and associativity rules. For statements, we can try to rewrite the grammar by other methods:
* Rewriting the grammar to remove ambiguity in cond and loop involves introducing additional nonterminals and rules.
* Clarity of the grammar is lost, hence this approach is not used.
We can also remove ambiguity using “semantic” concepts. This leads to the study of attributed grammars.

  18. Let’s see how Attributes and Semantic Constraints are used in PL Grammars

  19. A Simple Example Consider: L = { a^n b^n c^n | n > 0 }, i.e., L = {abc, aabbcc, aaabbbccc, …} L cannot be defined by any context-free grammar, but can be defined easily by an attribute grammar. L is a context-sensitive language.

  20. Towards a Solution Starting Point:
S → As Bs Cs
As → a | a As
Bs → b | b Bs
Cs → c | c Cs
Problem: Over-generation! This does not ensure that an equal number of a’s, b’s, and c’s are generated. Solution: Count the number of a’s, b’s, and c’s generated and check that they are equal.

  21. Attribute Grammar
S → As(n1) Bs(n2) Cs(n3) {{ n1 == n2 /\ n2 == n3 }}
As(n) → a {{ n ← 1; }}
As(n) → a As(n2) {{ n ← n2 + 1; }}
Bs(n) → b {{ n ← 1; }}
Bs(n) → b Bs(n2) {{ n ← n2 + 1; }}
Cs(n) → c {{ n ← 1; }}
Cs(n) → c Cs(n2) {{ n ← n2 + 1; }}
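A minimal sketch of this attribute grammar as a Java checker, with each nonterminal’s synthesized count returned as a value (the class and method names are our own):

```java
public class AnBnCn {
    // Recursive-descent reading of the attribute grammar: each nonterminal
    // (As, Bs, Cs) returns its synthesized count attribute, and the start
    // rule checks n1 == n2 /\ n2 == n3.
    private final String s;
    private int pos = 0;

    AnBnCn(String s) { this.s = s; }

    // As/Bs/Cs: one or more occurrences of c; the count is the attribute.
    private int run(char c) {
        int n = 0;
        while (pos < s.length() && s.charAt(pos) == c) { pos++; n++; }
        return n;
    }

    public boolean accepts() {
        int n1 = run('a'), n2 = run('b'), n3 = run('c');
        return pos == s.length() && n1 > 0 && n1 == n2 && n2 == n3;
    }

    public static void main(String[] args) {
        System.out.println(new AnBnCn("aabbcc").accepts());   // true
        System.out.println(new AnBnCn("aabbccc").accepts());  // false
    }
}
```

The second call is exactly the aabbccc case worked on the next slide: the counts synthesize to 2, 2, 3 and the equality constraint fails.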

  22. Attribute computation for aabbccc The slide shows the parse tree: As and Bs each synthesize count 2 (n1 = 2, n2 = 2), while Cs synthesizes count 3 (n3 = 3), so the constraint fails: n1 == n2 /\ n2 != n3.

  23. Writing Attribute Grammars • Start with a context-free grammar rule. • Add synthesized and/or inherited attributes as well as semantic constraints over these attributes. • The semantic constraints for a grammar rule must refer only to attributes from that rule. • There is a close connection between attribute grammar rules and the procedures of a PL. • We will explore this connection soon.

  24. Attribute Grammar for Expressions Recall the expr grammar below, which has the problem of over-generating expressions that are not type-correct:
assign → var '=' expr
expr → term | term op1 expr
term → fact | fact op2 term
fact → num | true | false | '(' expr ')'
op1 → '+' | '-' | '||'
op2 → '*' | '/' | '&&'

  25. Use Attributes and Semantic Rules to Specify Type-Correctness
SYNTAX RULE / SEMANTICS
op1(t) → ('+' | '-') {{ t ← "int"; }}
op1(t) → '||' {{ t ← "bool"; }}
op2(t) → ('*' | '/') {{ t ← "int"; }}
op2(t) → '&&' {{ t ← "bool"; }}
fact(t) → num {{ t ← "int"; }}
fact(t) → true {{ t ← "bool"; }}
fact(t) → false {{ t ← "bool"; }}
fact(t) → '(' expr(t2) ')' {{ t ← t2; }}
In these examples, attribute t is a “synthesized attribute”.

  26. Attribute Grammar (cont’d)
SYNTAX RULE / SEMANTICS
term(t) → fact(t1) {{ t ← t1; }}
term(t) → fact(t1) op2(top) term(t2) {{ t ← t1; t1 == t2 /\ t2 == top }}
expr(t) → term(t1) {{ t ← t1; }}
expr(t) → term(t1) op1(top) expr(t2) {{ t ← t1; t1 == t2 /\ t2 == top }}

  27. Remarks on ‘expr’ Attribute Grammar 1. The constraint t1 == t2 /\ t2 == top in the following two rules effectively precludes all incorrectly typed expressions from being defined:
term(t) → fact(t1) op2(top) term(t2)
expr(t) → term(t1) op1(top) expr(t2)
2. All attributes used in the preceding grammar are “synthesized” attributes. To go from Attribute Grammars to Programs, let us look at the compilation process more closely …

  28. Broader Context for Parsing: Compiler Phases The slide’s diagram shows Source Code entering the Compiler and Target Code coming out, with two groups of phases: Analysis (lexical, syntactic, semantic) and Synthesis (intermediate code generation, optimization, code generation).

  29. Compiler Structure 1. Lexical: translates a sequence of characters into a sequence of ‘tokens’. 2. Syntactic: translates the sequence of tokens into a ‘parse tree’; also builds the symbol table. 3. Semantic: traverses the parse tree and performs global checks, e.g. type-checking, actual-parameter correspondence.

  30. Compiler Structure (cont’d) 4. Intermediate Code Generation: traverses the parse tree and generates ‘abstract machine code’, e.g. triples, quadruples. 5. Optimization: performs control- and data-flow analysis; removes redundant operations, moves loop-invariant operations outside loops. 6. Code Generation: translates intermediate code to actual machine code.

  31. A Simple Example The slide traces a source program through the compiler phases. Source Code:
// declarations not shown
f = 1;
i = 1;
while (i < n) {
  i = i + 1;
  f = f * i;
}
print(f);
The lexer produces a token stream (id, op, int, punctuation, keyword codes), the parser builds the parse tree, and the code generator emits Target Code:
     LD R1, #1
     ST R1, Mf
     LD R2, #1
     ST R2, Mi
     LD R3, Mn
L:   CMP R2, R3
     JF Out
     INC R2
     ST R2, Mi
     MUL R1, R2
     ST R1, Mf
     JMP L
Out: Print Mf

  32. Java Bytecodes
public static int fact(int n) { // n >= 0
  int f = 1;
  int i = 1;
  while (i < n) {
    i = i + 1;
    f = f * i;
  }
  return f;
}
To see the bytecodes: cmd> javap -c Factorial

  33. Lexical Analyzer (lex) • Scans the input file character by character, skips over comments and white space (except in Python, where indentation is important). • Two main outputs: token and value. • Token is an integer code for each lexical class: identifiers, numbers, keywords, operators, punctuation. • Value is the actual instance: for identifiers, it is the string; for numbers, it is their numeric value; for keywords, operators, and punctuation, the token code = token value.

  34. Clarifying the Lexical-Syntax Analyzer Interaction • Although the diagram shows the lexical analyzer feeding its output to the syntax analyzer, in practice, the syntax analyzer calls the lexical analyzer repeatedly. • At each call, the lexical analyzer prepares the next token for the syntax analyzer. • The lexical analyzer would not need to create an explicit ‘Lexical Token’ table, as shown in the previous diagram, since the syntax analyzer only needs to work with one token at a time.

  35. Design of a Simple Parser • We will see how to design a top-down parser for a simple language. • The next few slides give the structure of the lexical analyzer; some of the details and terminology are taken from the PL textbook by Robert Sebesta. • After a brief look at the lexical analyzer, we will see how to design the parser.

  36. Token Codes
class Token {
  public static final int SEMICOLON = 0;
  public static final int COMMA = 1;
  public static final int NOT_EQ = 2;
  public static final int ADD_OP = 3;
  public static final int SUB_OP = 4;
  public static final int MULT_OP = 5;
  public static final int DIV_OP = 6;
  public static final int ASSIGN_OP = 7;
  public static final int GREATER_OP = 8;
  public static final int LESSER_OP = 9;
  public static final int LEFT_PAREN = 10;
  public static final int RIGHT_PAREN = 11;
  public static final int LEFT_BRACE = 12;
  public static final int RIGHT_BRACE = 13;
  public static final int ID = 14;
  public static final int INT_LIT = 15;
  public static final int KEY_IF = 16;
  public static final int KEY_INT = 17;
  public static final int KEY_ELSE = 18;
  public static final int KEY_WHILE = 19;
  public static final int KEY_END = 20;
}

  37. Lexer: Lexical Analyzer
public class Lexer {
  static private Buffer buffer = new Buffer(…);
  static public int nextToken; // code
  static public int intValue;  // value
  …
  public static int lex() {
    … sets nextToken and intValue each time it is called …
  }
}

  38. Parsing Strategies There are two broad strategies for parsing:
* top-down parsing (a.k.a. recursive-descent parsing)
* bottom-up parsing
Top-down parsing is less powerful than bottom-up parsing, but it is preferred when manually constructing a parser. Tools such as YACC automatically construct a bottom-up parser from a grammar, but the generated parser is hard to understand. (JavaCC, by contrast, generates a top-down parser.)

  39. Top-down Parsing
Grammar*:
E → E + T | T
T → T * F | F
F → id | ( E )
The slide shows a top-down parse tree being built for an input such as a + b * c. Choosing the correct expansion at each step is the issue.
* This grammar is not suited for top-down parsing; we will discuss this later.

  40. Bottom-up Parsing
Grammar:
E → E + T | T
T → T * F | F
F → id | ( E )
The slide shows a bottom-up parse of the same input, a + b * c. Choosing whether to ‘shift’ or ‘reduce’ and, if the latter, choosing the correct reduction are the issues.

  41. Deterministic Parsing The term ‘deterministic parsing’ means that the parser can, at each step, correctly decide which rule to use without any guesswork. This requires some peeking into (or looking ahead in) the input. For example: stmt → assign | cond | loop | cmpd For a top-down parser to decide which of the above four cases applies, it needs to look into the input to see which is the next symbol, or “token”, in the input: identifier, if, while, {

  42. Constructing a Top-down Parser (one void procedure per nonterminal) Case 1: Alternation on RHS of rule, e.g., stmt → assign | cond | loop | cmpd Parser code:
void stmt() {
  switch (Lexer.nextToken) {
    case Token.ID: { assign(); break; }
    case Token.IF: { cond(); break; }
    case Token.WHILE: { loop(); break; }
    case Token.LBRACE: { cmpd(); break; }
    default: break;
  }
}

  43. Constructing a Top-down Parser (cont’d) Case 2: Sequencing on RHS of a rule, e.g., decl → type idlist Parser code:
void decl() {
  type();
  idlist();
}

  44. Constructing a Top-down Parser (cont’d) Case 3: Terminal Symbols on RHS of a rule: factor → num | '(' expr ')' Parser code:
void factor() {
  switch (Lexer.nextToken) {
    case Token.INT_LIT:
      int i = Lexer.intValue;
      Lexer.lex();
      break;
    case Token.LPAR:
      Lexer.lex();
      expr();
      if (Lexer.nextToken == Token.RPAR)
        Lexer.lex();
      else
        syntaxerror("missing ')'");
      break;
    default: break;
  }
}

  45. Constructing a Top-down Parser (cont’d) Case 4: Left-Factoring the RHS of a rule: expr → term | term + expr Parser code:
void expr() {
  term();
  if (Lexer.nextToken == Token.ADD_OP) {
    Lexer.lex();
    expr();
  }
}

  46. Left-recursion is not compatible with Top-down Parsing Problem: Left-recursive rule expr → term | expr + term We cannot decide which alternative to use even with lookahead. Reason: The recursion in ‘expr’ must eventually end in ‘term’, thus both alternatives have the same set of leading terminal symbols.
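A standard remedy, not shown on the slide, is to rewrite the left recursion as iteration: expr → term { '+' term }. A minimal, self-contained Java sketch of the resulting procedure (the token handling here is our own illustration, simpler than the course Lexer):

```java
import java.util.List;

public class NoLeftRec {
    // expr -> term { '+' term } : the left-recursive rule
    //   expr -> term | expr + term
    // rewritten as a loop, which a top-down parser can execute directly.
    private final List<String> toks;
    private int pos = 0;

    NoLeftRec(List<String> toks) { this.toks = toks; }

    private boolean term() {
        if (pos < toks.size() && toks.get(pos).matches("\\d+")) { pos++; return true; }
        return false;
    }

    public boolean expr() {
        if (!term()) return false;
        while (pos < toks.size() && toks.get(pos).equals("+")) {
            pos++;                    // consume '+'
            if (!term()) return false;
        }
        return pos == toks.size();
    }

    public static void main(String[] args) {
        System.out.println(new NoLeftRec(List.of("1", "+", "2", "+", "3")).expr()); // true
    }
}
```

A direct transcription of expr → expr + term would make expr() call itself immediately and loop forever; the iterative form consumes a token before each repetition.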

  47. Recognizer vs Parser Terminology: A “recognizer” only outputs a yes/no answer indicating whether an input string belongs to L(G), the language defined by a grammar G. A “parser” builds upon the basic structure provided by the recognizer, enhancing it with attributes and semantic actions so as to produce additional output.

  48. Adding attributes to the Parser In a top-down parser, the attribute information is incorporated as follows: • Inherited attributes of a grammar rule become input parameters of the corresponding procedure. • Synthesized attributes become output parameters of the procedure. NOTE: Java does not have output parameters, hence we explain how synthesized attributes are represented in a Java setting.

  49. Adding synthesized attributes to Java OO Parser
expr(t) → term(t1) {{ t ← t1; }}
expr(t) → term(t1) op1(top) expr(t2) {{ t ← t1; t1 == t2 /\ t2 == top }}
class Expr {
  … other fields …
  String t;
  public Expr() { … code for expr … }
}
Synthesized attributes on the LHS of the rule become fields of the corresponding class.

  50. Attributes on RHS of Rule
expr(t) → term(t1) {{ t ← t1; }}
expr(t) → term(t1) op1(top) expr(t2) {{ t ← t1; t1 == t2 /\ t2 == top }}
The attributes t1, top, and t2 refer to the type fields in the objects created for term, op1, and expr respectively. Thus, if we have in class Expr the field declarations
Term v1; Op1 v2; Expr v3;
then, in the constructor Expr(), we would refer to t1, top, and t2 as v1.t, v2.t, and v3.t, respectively, where t is the name of the type field in Term, Op1, and Expr.
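The idea that synthesized type attributes become fields, with the semantic constraint checked in the constructor, can be made runnable in miniature. All names in this sketch are illustrative, not from the course parser:

```java
public class TypedExpr {
    // Each node class carries its synthesized type attribute as field t
    // (slide 49's idea). The BinOp constructor enforces the constraint
    // t1 == t2 /\ t2 == top and then synthesizes t <- t1.
    static abstract class Node { String t; }

    static class Num extends Node { Num() { t = "int"; } }
    static class Bool extends Node { Bool() { t = "bool"; } }

    static class BinOp extends Node {
        BinOp(Node left, String opType, Node right) {
            if (!left.t.equals(right.t) || !left.t.equals(opType))
                throw new IllegalStateException("type error");
            t = left.t; // t <- t1
        }
    }

    public static void main(String[] args) {
        // 10 * 20 : both operands and the operator are "int" -> well typed.
        System.out.println(new BinOp(new Num(), "int", new Num()).t); // int
        // 10 && true : mixed types violate the constraint.
        try {
            new BinOp(new Num(), "bool", new Bool());
        } catch (IllegalStateException e) {
            System.out.println("type error");
        }
    }
}
```

Here the constructor plays the role of the semantic clause: building the node and checking its attribute constraint happen in the same place, which is exactly the attribute-grammar-to-procedure correspondence the lecture points toward.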
