Languages and Compilers (SProg og Oversættere)

Bent Thomsen, Department of Computer Science, Aalborg University
With acknowledgement to Norm Hutchinson, whose slides this lecture is based on.

Algorithm to convert EBNF into a RD parser
• The conversion of an EBNF specification into a Java implementation of a recursive descent (RD) parser is so “mechanical” that it can easily be automated!
• => JavaCC “Java Compiler Compiler”
• We can describe the algorithm by a set of mechanical rewrite rules:

N ::= X
    private void parseN() {
        parse X
    }

Algorithm to convert EBNF into a RD parser

parse t      (where t is a terminal)
    accept(t);
parse N      (where N is a non-terminal)
    parseN();
parse ε
    // a dummy statement
parse X Y
    parse X
    parse Y

Algorithm to convert EBNF into a RD parser

parse X*
    while (currentToken.kind is in starters[X]) {
        parse X
    }
parse X | Y
    switch (currentToken.kind) {
        cases in starters[X]:
            parse X
            break;
        cases in starters[Y]:
            parse Y
            break;
        default:
            report syntax error
    }

Example: “Generation” of parseCommand

Command ::= single-Command ( ; single-Command )*

The body of
    private void parseCommand() { ... }
is refined step by step:

(1) parse single-Command ( ; single-Command )*

(2) parse single-Command
    parse ( ; single-Command )*

(3) parseSingleCommand();
    parse ( ; single-Command )*

(4) parseSingleCommand();
    while (currentToken.kind == Token.SEMICOLON) {
        parse ; single-Command
    }

(5) parseSingleCommand();
    while (currentToken.kind == Token.SEMICOLON) {
        parse ;
        parse single-Command
    }

(6) parseSingleCommand();
    while (currentToken.kind == Token.SEMICOLON) {
        acceptIt();
        parseSingleCommand();
    }
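The final version of parseCommand can be exercised in isolation. The following is a minimal sketch, not the lecture's parser: it assumes tokens are plain integer kinds held in a list that ends with EOT, and it reduces single-Command to a single IDENTIFIER token. The class name CommandRecognizer and the helper recognizes are hypothetical.

```java
import java.util.List;

// Recognizer for:  Command ::= single-Command ( ; single-Command )*
// Assumption: single-Command is a single IDENTIFIER token, to keep the sketch short.
class CommandRecognizer {
    static final int IDENTIFIER = 0, SEMICOLON = 1, EOT = 2;

    private final List<Integer> tokens; // token kinds; the last one must be EOT
    private int pos = 0;

    CommandRecognizer(List<Integer> tokens) {
        this.tokens = tokens;
    }

    private int currentKind() {
        return tokens.get(pos);
    }

    // Consume the current token unconditionally (never moves past the last token).
    private void acceptIt() {
        if (pos < tokens.size() - 1) pos++;
    }

    // Consume the current token only if it has the expected kind.
    private void accept(int expectedKind) {
        if (currentKind() != expectedKind)
            throw new IllegalStateException("syntax error at token " + pos);
        acceptIt();
    }

    private void parseSingleCommand() {
        accept(IDENTIFIER); // stand-in for the real single-Command rule
    }

    private void parseCommand() {
        parseSingleCommand();
        while (currentKind() == SEMICOLON) {
            acceptIt();
            parseSingleCommand();
        }
    }

    // True if the token list is exactly one Command followed by EOT.
    static boolean recognizes(List<Integer> tokens) {
        try {
            CommandRecognizer parser = new CommandRecognizer(tokens);
            parser.parseCommand();
            parser.accept(EOT);
            return true;
        } catch (IllegalStateException syntaxError) {
            return false;
        }
    }
}
```

With these assumptions, the kinds for "a ; b" followed by EOT are accepted, while a trailing semicolon is rejected because the loop then expects another single-Command.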

Example: Generation of parseSingleDeclaration

single-Declaration ::= const Identifier ~ Expression
                     | var Identifier : Type-denoter

The body of
    private void parseSingleDeclaration() { ... }
is refined step by step:

(1) parse const Identifier ~ Expression | var Identifier : Type-denoter

(2) switch (currentToken.kind) {
        case Token.CONST:
            parse const Identifier ~ Expression
            break;
        case Token.VAR:
            parse var Identifier : Type-denoter
            break;
        default:
            report syntax error
    }

(3) The CONST case is split into its parts:
        case Token.CONST:
            parse const
            parse Identifier
            parse ~
            parse Expression
            break;

(4) Each part is replaced by code:
        case Token.CONST:
            acceptIt();
            parseIdentifier();
            accept(Token.IS);
            parseExpression();
            break;

(5) The VAR case is treated the same way, giving the final method:
    private void parseSingleDeclaration() {
        switch (currentToken.kind) {
            case Token.CONST:
                acceptIt();
                parseIdentifier();
                accept(Token.IS);
                parseExpression();
                break;
            case Token.VAR:
                acceptIt();
                parseIdentifier();
                accept(Token.COLON);
                parseTypeDenoter();
                break;
            default:
                report syntax error
        }
    }

LL(1) Grammars
• The presented algorithm to convert EBNF into a parser does not work for all possible grammars.
• It only works for so-called “LL(1)” grammars.
• What grammars are LL(1)?
• Basically, an LL(1) grammar is a grammar which can be parsed with a top-down parser with a lookahead (in the input stream of tokens) of one token.

How can we recognize that a grammar is (or is not) LL(1)?
• There is a formal definition, which we will skip for now.
• We can deduce the necessary conditions from the parser generation algorithm.

LL(1) Grammars

parse X*
    while (currentToken.kind is in starters[X]) {
        parse X
    }
Condition: starters[X] must be disjoint from the set of tokens that can immediately follow X*.

parse X | Y
    switch (currentToken.kind) {
        cases in starters[X]:
            parse X
            break;
        cases in starters[Y]:
            parse Y
            break;
        default:
            report syntax error
    }
Condition: starters[X] and starters[Y] must be disjoint sets.

LL(1) grammars and left factorization
The original Mini-Triangle grammar is not LL(1). For example:

single-Command ::= V-name := Expression
                 | Identifier ( Expression )
                 | ...
V-name ::= Identifier

Starters[V-name := Expression] = Starters[V-name] = Starters[Identifier]
Starters[Identifier ( Expression )] = Starters[Identifier]
NOT DISJOINT!
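The disjointness condition for X | Y can be checked mechanically once the starter sets are known. The sketch below assumes starter sets are represented as sets of token-kind names. The class StarterCheck and its methods are hypothetical helpers, not part of the lecture code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Checks the LL(1) condition for a choice X | Y:
// starters[X] and starters[Y] must have no token kind in common.
class StarterCheck {
    static boolean choiceIsLL1(Set<String> startersX, Set<String> startersY) {
        return Collections.disjoint(startersX, startersY);
    }

    // Convenience constructor for small starter sets.
    static Set<String> setOf(String... tokenKinds) {
        return new TreeSet<>(Arrays.asList(tokenKinds));
    }
}
```

For the two single-Command alternatives above, both starter sets are {IDENTIFIER}, so choiceIsLL1 returns false. Two alternatives starting with different keywords, such as IF and WHILE, pass the check.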

LL(1) grammars: left factorization
What happens when we generate a RD parser from a non-LL(1) grammar?

single-Command ::= V-name := Expression
                 | Identifier ( Expression )
                 | ...

    private void parseSingleCommand() {
        switch (currentToken.kind) {
            case Token.IDENTIFIER:        // wrong: overlapping cases
                parse V-name := Expression
            case Token.IDENTIFIER:        // wrong: overlapping cases
                parse Identifier ( Expression )
            ...other cases...
            default:
                report syntax error
        }
    }

LL(1) grammars: left factorization

single-Command ::= V-name := Expression
                 | Identifier ( Expression )
                 | ...

Left factorization (and substitution of V-name):

single-Command ::= Identifier ( := Expression | ( Expression ) )
                 | ...
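After left factorization, one token of lookahead following the identifier is enough to pick the alternative. A minimal sketch of a recognizer for the factored rule follows, assuming integer token kinds in a list ending with EOT and an Expression reduced to a single INTLITERAL token. The class SingleCommandRecognizer and all names in it are hypothetical.

```java
import java.util.List;

// Recognizer for:  single-Command ::= Identifier ( := Expression | ( Expression ) )
// Assumption: Expression is a single INTLITERAL token, to keep the sketch short.
class SingleCommandRecognizer {
    static final int IDENTIFIER = 0, BECOMES = 1, LPAREN = 2, RPAREN = 3,
                     INTLITERAL = 4, EOT = 5;

    private final List<Integer> tokens; // token kinds; the last one must be EOT
    private int pos = 0;

    SingleCommandRecognizer(List<Integer> tokens) {
        this.tokens = tokens;
    }

    private int currentKind() {
        return tokens.get(pos);
    }

    private void acceptIt() {
        if (pos < tokens.size() - 1) pos++;
    }

    private void accept(int expectedKind) {
        if (currentKind() != expectedKind)
            throw new IllegalStateException("syntax error at token " + pos);
        acceptIt();
    }

    private void parseExpression() {
        accept(INTLITERAL); // stand-in for the real Expression rule
    }

    private void parseSingleCommand() {
        accept(IDENTIFIER);          // the common prefix is parsed once
        switch (currentKind()) {     // the next token selects the alternative
            case BECOMES:
                acceptIt();
                parseExpression();
                break;
            case LPAREN:
                acceptIt();
                parseExpression();
                accept(RPAREN);
                break;
            default:
                throw new IllegalStateException("syntax error at token " + pos);
        }
    }

    // True if the token list is exactly one single-Command followed by EOT.
    static boolean recognizes(List<Integer> tokens) {
        try {
            SingleCommandRecognizer parser = new SingleCommandRecognizer(tokens);
            parser.parseSingleCommand();
            parser.accept(EOT);
            return true;
        } catch (IllegalStateException syntaxError) {
            return false;
        }
    }
}
```

The two case labels are now distinct token kinds, so the overlapping-cases problem of the unfactored grammar does not arise.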

LL(1) Grammars: left recursion elimination

Command ::= single-Command
          | Command ; single-Command

What happens if we don’t perform left-recursion elimination?

    public void parseCommand() {
        switch (currentToken.kind) {
            case in starters[single-Command]:     // wrong: overlapping cases
                parseSingleCommand();
                break;
            case in starters[Command]:            // wrong: overlapping cases
                parseCommand();
                accept(Token.SEMICOLON);
                parseSingleCommand();
                break;
            default:
                report syntax error
        }
    }

Every Command starts with a single-Command, so starters[Command] = starters[single-Command] and the two cases cannot be told apart.

LL(1) Grammars: left recursion elimination

Command ::= single-Command
          | Command ; single-Command

Left recursion elimination:

Command ::= single-Command ( ; single-Command )*

Systematic Development of RD Parser
(1) Express grammar in EBNF
(2) Grammar transformations: left factorization and left recursion elimination
(3) Create a parser class with
    • private variable currentToken
    • methods to call the scanner: accept and acceptIt
(4) Implement private parsing methods:
    • add a private parseN method for each non-terminal N
    • add a public parse method that
        • gets the first token from the scanner
        • calls parseS (S is the start symbol of the grammar)

Abstract Syntax Trees
• So far we have talked about how to build a recursive descent parser which recognizes a given language described by an (LL(1)) EBNF grammar.
• Now we will look at
    • how to represent ASTs as data structures.
    • how to refine a recognizer to construct an AST data structure.

AST Representation: Possible Tree Shapes
The possible form of AST structures is completely determined by an AST grammar (as described before in lecture 1-2).
Example: remember the Mini-Triangle abstract syntax

Command ::= V-name := Expression                          AssignCmd
          | Identifier ( Expression )                     CallCmd
          | if Expression then Command else Command       IfCmd
          | while Expression do Command                   WhileCmd
          | let Declaration in Command                    LetCmd
          | Command ; Command                             SequentialCmd

AST Representation: Possible Tree Shapes
Example: remember the Mini-Triangle AST (excerpt below)

Command ::= V-name := Expression        AssignCmd
          | ...

Tree shape: an AssignCmd node with two children, V and E.

AST Representation: Possible Tree Shapes
Example: remember the Mini-Triangle AST (excerpt below)

Command ::= ...
          | Identifier ( Expression )   CallCmd
          ...

Tree shape: a CallCmd node with two children, Identifier and E. The Identifier node carries its Spelling.

AST Representation: Possible Tree Shapes
Example: remember the Mini-Triangle AST (excerpt below)

Command ::= ...
          | if Expression then Command else Command   IfCmd
          ...

Tree shape: an IfCmd node with three children, E, C1 and C2.

AST Representation: Java Data Structures
Example: Java classes to represent Mini-Triangle ASTs
1) A common (abstract) superclass for all AST nodes:
    public abstract class AST { ... }
2) A Java class for each “type” of node, abstract as well as concrete node types:
    LHS ::= ...   Tag1
          | ...   Tag2

Class hierarchy: AST (abstract) is the superclass of LHS (abstract), which is the superclass of Tag1, Tag2, … (concrete).

Example: Mini-Triangle Command ASTs

Command ::= V-name := Expression                          AssignCmd
          | Identifier ( Expression )                     CallCmd
          | if Expression then Command else Command       IfCmd
          | while Expression do Command                   WhileCmd
          | let Declaration in Command                    LetCmd
          | Command ; Command                             SequentialCmd

    public abstract class Command extends AST { ... }
    public class AssignCommand extends Command { ... }
    public class CallCommand extends Command { ... }
    public class IfCommand extends Command { ... }
    etc.

Example: Mini-Triangle Command ASTs

Command ::= V-name := Expression        AssignCmd
          | Identifier ( Expression )   CallCmd
          | ...

    public class AssignCommand extends Command {
        public Vname V;        // assign to what variable?
        public Expression E;   // what to assign?
        ...
    }

    public class CallCommand extends Command {
        public Identifier I;   // procedure name
        public Expression E;   // actual parameter
        ...
    }
    ...

AST Terminal Nodes

    public abstract class Terminal extends AST {
        public String spelling;
        ...
    }
    public class Identifier extends Terminal { ... }
    public class IntegerLiteral extends Terminal { ... }
    public class Operator extends Terminal { ... }

AST Construction
First, every concrete AST class needs a constructor. Examples:

    public class AssignCommand extends Command {
        public Vname V;        // left side variable
        public Expression E;   // right side expression

        public AssignCommand(Vname V, Expression E) {
            this.V = V;
            this.E = E;
        }
        ...
    }

    public class Identifier extends Terminal {
        public Identifier(String spelling) {
            this.spelling = spelling;
        }
        ...
    }
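The class pattern can be tried out with a small self-contained hierarchy. This is a sketch under assumptions, not the Mini-Triangle class library: the classes are nested in a hypothetical wrapper AstSketch so the example fits in one file, IntegerExpression is a hypothetical expression node, and the spelling is set through a Terminal constructor rather than assigned in each subclass.

```java
// One abstract class per non-terminal, one concrete class per tag.
class AstSketch {
    static abstract class AST { }

    // Terminal nodes carry the spelling of the token they were built from.
    static abstract class Terminal extends AST {
        final String spelling;
        Terminal(String spelling) { this.spelling = spelling; }
    }

    static class Identifier extends Terminal {
        Identifier(String spelling) { super(spelling); }
    }

    static class IntegerLiteral extends Terminal {
        IntegerLiteral(String spelling) { super(spelling); }
    }

    static abstract class Expression extends AST { }

    // Hypothetical expression node: an integer literal used as an expression.
    static class IntegerExpression extends Expression {
        final IntegerLiteral IL;
        IntegerExpression(IntegerLiteral IL) { this.IL = IL; }
    }

    static abstract class Command extends AST { }

    // Identifier ( Expression )    CallCmd
    static class CallCommand extends Command {
        final Identifier I;   // procedure name
        final Expression E;   // actual parameter
        CallCommand(Identifier I, Expression E) { this.I = I; this.E = E; }
    }

    // if Expression then Command else Command    IfCmd
    static class IfCommand extends Command {
        final Expression E;
        final Command C1, C2;
        IfCommand(Expression E, Command C1, Command C2) {
            this.E = E; this.C1 = C1; this.C2 = C2;
        }
    }
}
```

A call such as putint(42) would then be built as new AstSketch.CallCommand(new AstSketch.Identifier("putint"), new AstSketch.IntegerExpression(new AstSketch.IntegerLiteral("42"))).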

AST Construction
We will now show how to refine our recursive descent parser to actually construct an AST.

N ::= X
    private N parseN() {
        N itsAST;
        parse X, at the same time constructing itsAST
        return itsAST;
    }

Example: Constructing Mini-Triangle ASTs

Command ::= single-Command ( ; single-Command )*

    // old (recognizing only) version:
    private void parseCommand() {
        parseSingleCommand();
        while (currentToken.kind == Token.SEMICOLON) {
            acceptIt();
            parseSingleCommand();
        }
    }

    // AST-generating version
    private Command parseCommand() {
        Command itsAST;
        itsAST = parseSingleCommand();
        while (currentToken.kind == Token.SEMICOLON) {
            acceptIt();
            Command extraCmd = parseSingleCommand();
            itsAST = new SequentialCommand(itsAST, extraCmd);
        }
        return itsAST;
    }
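The loop in the AST-generating version builds a left-nested tree: each new single-Command becomes the right child of a new SequentialCommand whose left child is everything parsed so far. The sketch below makes that visible. It assumes tokens are plain strings, with ";" as the separator and any other string standing for a complete single-Command. SimpleCommand and the toString methods are hypothetical additions for printing the tree.

```java
import java.util.List;

class SequenceAstSketch {
    static abstract class Command { }

    // Hypothetical leaf node standing in for any single-Command.
    static final class SimpleCommand extends Command {
        final String name;
        SimpleCommand(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    static final class SequentialCommand extends Command {
        final Command C1, C2;
        SequentialCommand(Command C1, Command C2) { this.C1 = C1; this.C2 = C2; }
        @Override public String toString() { return "Seq(" + C1 + ", " + C2 + ")"; }
    }

    private final List<String> tokens;
    private int pos = 0;

    SequenceAstSketch(List<String> tokens) { this.tokens = tokens; }

    private boolean atSemicolon() {
        return pos < tokens.size() && tokens.get(pos).equals(";");
    }

    private Command parseSingleCommand() {
        if (pos >= tokens.size() || atSemicolon())
            throw new IllegalStateException("syntax error at token " + pos);
        return new SimpleCommand(tokens.get(pos++));
    }

    // Same shape as the AST-generating parseCommand on the slide.
    Command parseCommand() {
        Command itsAST = parseSingleCommand();
        while (atSemicolon()) {
            pos++; // plays the role of acceptIt()
            Command extraCmd = parseSingleCommand();
            itsAST = new SequentialCommand(itsAST, extraCmd);
        }
        return itsAST;
    }
}
```

For the tokens a ; b ; c the resulting tree prints as Seq(Seq(a, b), c), so sequencing groups to the left.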

Example: Constructing Mini-Triangle ASTs

single-Command ::= Identifier ( := Expression | ( Expression ) )
                 | if Expression then single-Command else single-Command
                 | while Expression do single-Command
                 | let Declaration in single-Command
                 | begin Command end

    private Command parseSingleCommand() {
        Command comAST;
        parse it and construct AST
        return comAST;
    }

Example: Constructing Mini-Triangle ASTs

    private Command parseSingleCommand() {
        Command comAST;
        switch (currentToken.kind) {
            case Token.IDENTIFIER:
                parse Identifier ( := Expression | ( Expression ) )
            case Token.IF:
                parse if Expression then single-Command else single-Command
            case Token.WHILE:
                parse while Expression do single-Command
            case Token.LET:
                parse let Declaration in single-Command
            case Token.BEGIN:
                parse begin Command end
        }
        return comAST;
    }

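Example: Constructing Mini-Triangle ASTs

    ...
    case Token.IDENTIFIER:
        // parse Identifier ( := Expression
        //                  | ( Expression ) )
        Identifier iAST = parseIdentifier();
        switch (currentToken.kind) {
            case Token.BECOMES: {
                acceptIt();
                Expression eAST = parseExpression();
                // iAST stands for the V-name here, since V-name ::= Identifier
                comAST = new AssignCommand(iAST, eAST);
                break;
            }
            case Token.LPAREN: {
                acceptIt();
                Expression eAST = parseExpression();
                comAST = new CallCommand(iAST, eAST);
                accept(Token.RPAREN);
                break;
            }
            default:
                report a syntax error;
        }
        break;
    ...

Each case body is wrapped in braces because all cases of a Java switch share one scope, so eAST could not otherwise be declared twice.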

Example: Constructing Mini-Triangle ASTs

        ...
        break;
    case Token.IF: {
        // parse if Expression then single-Command
        //       else single-Command
        acceptIt();
        Expression eAST = parseExpression();
        accept(Token.THEN);
        Command thnAST = parseSingleCommand();
        accept(Token.ELSE);
        Command elsAST = parseSingleCommand();
        comAST = new IfCommand(eAST, thnAST, elsAST);
        break;
    }
    case Token.WHILE:
        ...

Example: Constructing Mini-Triangle ASTs

            ...
            break;
        case Token.BEGIN:
            // parse begin Command end
            acceptIt();
            comAST = parseCommand();
            accept(Token.END);
            break;
        default:
            report a syntax error;
        }
        return comAST;
    }

Syntax Analysis: Scanner
Dataflow chart:
    Source Program
        => Stream of Characters
    Scanner (also produces Error Reports)
        => Stream of “Tokens”
    Parser (also produces Error Reports)
        => Abstract Syntax Tree

Scanner
Remember:

    public class Parser {
        private Token currentToken;

        private void accept(byte expectedKind) {
            if (currentToken.kind == expectedKind)
                currentToken = scanner.scan();
            else
                report syntax error
        }

        private void acceptIt() {
            currentToken = scanner.scan();
        }

        public void parse() { ... }
        ...
    }

We have not yet implemented scanner.scan().

Steps for Developing a Scanner
1) Express the “lexical” grammar in EBNF (do necessary transformations)
2) Implement the scanner based on this grammar (details explained later)
3) Refine the scanner to keep track of the spelling and kind of the currently scanned token.
To save some time we’ll do steps 2 and 3 at once this time.

Developing a Scanner
• Express the “lexical” grammar in EBNF

Token ::= Identifier | Integer-Literal | Operator
        | ; | : | := | ~ | ( | ) | eot
Identifier ::= Letter (Letter | Digit)*
Integer-Literal ::= Digit Digit*
Operator ::= + | - | * | / | < | > | =
Separator ::= Comment | space | eol
Comment ::= ! Graphic* eol

Now perform substitution and left factorization...

Token ::= Letter (Letter | Digit)*
        | Digit Digit*
        | + | - | * | / | < | > | =
        | ; | : (= | ε) | ~ | ( | ) | eot
Separator ::= ! Graphic* eol | space | eol

Developing a Scanner
Implementation of the scanner:

    public class Scanner {
        private char currentChar;
        private StringBuffer currentSpelling;
        private byte currentKind;

        private void take(char expectedChar) { ... }
        private void takeIt() { ... }

        // other private auxiliary methods and scanning
        // methods here.

        public Token scan() { ... }
    }

Developing a Scanner
The scanner will return instances of Token:

    public class Token {
        byte kind;
        String spelling;

        final static byte IDENTIFIER = 0, INTLITERAL = 1, OPERATOR = 2,
                          BEGIN = 3, CONST = 4, ...;

        public Token(byte kind, String spelling) {
            this.kind = kind;
            this.spelling = spelling;
            if spelling matches a keyword, change my kind automatically
        }
        ...
    }
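The “change my kind automatically” step can be sketched as a table lookup in the constructor. The code below is an illustration under assumptions: it is a hypothetical class KeywordToken rather than the lecture's Token, it lists only the two keywords whose kinds appear on the slide, and it assumes keywords are spelled in lower case.

```java
// A token that turns an identifier into a keyword token when its spelling matches.
class KeywordToken {
    static final byte IDENTIFIER = 0, INTLITERAL = 1, OPERATOR = 2, BEGIN = 3, CONST = 4;

    // Parallel tables: spelling of each keyword and the kind it maps to.
    // Only BEGIN and CONST are listed; a full scanner would list every keyword.
    private static final String[] KEYWORD_SPELLINGS = { "begin", "const" };
    private static final byte[] KEYWORD_KINDS = { BEGIN, CONST };

    final byte kind;
    final String spelling;

    KeywordToken(byte kind, String spelling) {
        byte actualKind = kind;
        if (kind == IDENTIFIER) {
            // The scanner cannot tell keywords from identifiers, so check here.
            for (int i = 0; i < KEYWORD_SPELLINGS.length; i++) {
                if (KEYWORD_SPELLINGS[i].equals(spelling)) {
                    actualKind = KEYWORD_KINDS[i];
                    break;
                }
            }
        }
        this.kind = actualKind;
        this.spelling = spelling;
    }
}
```

With this design the scanner only needs the Letter (Letter | Digit)* rule for both identifiers and keywords, and the parser can still switch on kinds such as BEGIN.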

Developing a Scanner

    public class Scanner {
        private char currentChar = get first source char;
        private StringBuffer currentSpelling;
        private byte currentKind;

        private void take(char expectedChar) {
            if (currentChar == expectedChar) {
                currentSpelling.append(currentChar);
                currentChar = get next source char;
            } else
                report lexical error
        }

        private void takeIt() {
            currentSpelling.append(currentChar);
            currentChar = get next source char;
        }
        ...

Developing a Scanner

        ...
        public Token scan() {
            // Get rid of potential separators before
            // scanning a token
            while ((currentChar == '!')
                    || (currentChar == ' ')
                    || (currentChar == '\n'))
                scanSeparator();
            currentSpelling = new StringBuffer();
            currentKind = scanToken();
            return new Token(currentKind, currentSpelling.toString());
        }

        // Developed much in the same way as parsing methods:
        private void scanSeparator() { ... }
        private byte scanToken() { ... }
        ...

Developing a Scanner

Token ::= Letter (Letter | Digit)*
        | Digit Digit*
        | + | - | * | / | < | > | =
        | ; | : (= | ε) | ~ | ( | ) | eot

    private byte scanToken() {
        switch (currentChar) {
            case 'a': case 'b': ... case 'z':
            case 'A': case 'B': ... case 'Z':
                scan Letter (Letter | Digit)*
                return Token.IDENTIFIER;
            case '0': ... case '9':
                scan Digit Digit*
                return Token.INTLITERAL;
            case '+': case '-': ... case '=':
                takeIt();
                return Token.OPERATOR;
            ...etc...
        }
    }

Developing a Scanner
Let’s look at the identifier case in more detail:

    ...
    case 'a': case 'b': ... case 'z':
    case 'A': case 'B': ... case 'Z':
        scan Letter (Letter | Digit)*
        return Token.IDENTIFIER;
    case '0': ... case '9':
        ...

The body of the identifier case is refined step by step (using the scanner's takeIt, the character-level counterpart of the parser's acceptIt):

(1) scan Letter (Letter | Digit)*

(2) scan Letter
    scan (Letter | Digit)*

(3) takeIt();
    scan (Letter | Digit)*

(4) takeIt();
    while (isLetter(currentChar) || isDigit(currentChar))
        scan (Letter | Digit)

(5) takeIt();
    while (isLetter(currentChar) || isDigit(currentChar))
        takeIt();
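Putting the pieces together, a scanner for the factored lexical grammar can be sketched in a few dozen lines. This is an illustration under assumptions, not the lecture's Scanner: the source is a String instead of a character stream, the NUL character marks end of text, keywords are not distinguished from identifiers, and the names MiniScanner, EOT_CHAR and the extra token kinds are hypothetical.

```java
class MiniScanner {
    static final byte IDENTIFIER = 0, INTLITERAL = 1, OPERATOR = 2, SEMICOLON = 3,
                      COLON = 4, BECOMES = 5, IS = 6, LPAREN = 7, RPAREN = 8, EOT = 9;
    static final char EOT_CHAR = '\u0000'; // assumed end-of-text marker

    static final class Token {
        final byte kind;
        final String spelling;
        Token(byte kind, String spelling) { this.kind = kind; this.spelling = spelling; }
    }

    private final String source;
    private int pos = 0;
    private char currentChar;
    private StringBuilder currentSpelling = new StringBuilder();

    MiniScanner(String source) {
        this.source = source;
        this.currentChar = source.isEmpty() ? EOT_CHAR : source.charAt(0);
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Append the current character to the spelling and advance.
    private void takeIt() {
        currentSpelling.append(currentChar);
        pos++;
        currentChar = pos < source.length() ? source.charAt(pos) : EOT_CHAR;
    }

    // Separator ::= ! Graphic* eol | space | eol
    private void scanSeparator() {
        if (currentChar == '!') {
            while (currentChar != '\n' && currentChar != EOT_CHAR) takeIt();
        }
        if (currentChar != EOT_CHAR) takeIt(); // the space or eol itself
    }

    private byte scanToken() {
        if (isLetter(currentChar)) {             // Letter (Letter | Digit)*
            takeIt();
            while (isLetter(currentChar) || isDigit(currentChar)) takeIt();
            return IDENTIFIER;
        }
        if (isDigit(currentChar)) {              // Digit Digit*
            takeIt();
            while (isDigit(currentChar)) takeIt();
            return INTLITERAL;
        }
        switch (currentChar) {
            case '+': case '-': case '*': case '/': case '<': case '>': case '=':
                takeIt(); return OPERATOR;
            case ';': takeIt(); return SEMICOLON;
            case ':':                            // : (= | ε)
                takeIt();
                if (currentChar == '=') { takeIt(); return BECOMES; }
                return COLON;
            case '~': takeIt(); return IS;
            case '(': takeIt(); return LPAREN;
            case ')': takeIt(); return RPAREN;
            case EOT_CHAR: return EOT;
            default:
                throw new IllegalStateException("lexical error at position " + pos);
        }
    }

    Token scan() {
        while (currentChar == '!' || currentChar == ' ' || currentChar == '\n')
            scanSeparator();
        currentSpelling = new StringBuilder(); // discard separator characters
        byte kind = scanToken();
        return new Token(kind, currentSpelling.toString());
    }
}
```

Calling scan() repeatedly on a source such as "x := 42" yields an IDENTIFIER, a BECOMES and an INTLITERAL token, followed by EOT tokens once the input is exhausted. The ':' case shows why left factorization matters at the lexical level too: one character of lookahead after ':' decides between COLON and BECOMES.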

Quick review
• Syntactic analysis
    • Lexical analysis
        • Group letters into words
        • Use regular expressions and DFAs
    • Grammar transformations
        • Left-factoring
        • Left-recursion removal
        • Substitution
    • Parsing - Phrase structure analysis
        • Group words into sentences, paragraphs and complete documents
        • Top-Down and Bottom-Up
