Chapter 4: Syntax Analysis Part 1: Grammar Concepts

Prof. Steven A. Demurjian

Computer Science & Engineering Department

The University of Connecticut

371 Fairfield Way, Unit 2155

Storrs, CT 06269-3155

steve@engr.uconn.edu

http://www.engr.uconn.edu/~steve

(860) 486 - 4818

Material for course thanks to:

Laurent Michel

Aggelos Kiayias

Robert LeBarre

- An overview of parsing : Functions & Responsibilities
- Context Free Grammars: Concepts & Terminology
- Writing and Designing Grammars
- Resolving Grammar Problems / Difficulties
- Top-Down Parsing
- Recursive Descent & Predictive LL

- Bottom-Up Parsing
- LR & LALR

- Key Issues in Error Handling
- Concluding Remarks/Looking Ahead

- Why are grammars that formally describe languages important?
- Precise, easy-to-understand representations
- Compiler-writing tools can take a grammar and generate a parser for the compiler
- Grammars allow a language to evolve (new statements, changes to statements, etc.). Languages are not static, but are constantly upgraded to add new features or fix "old" ones
- Ada → Ada 9x; C++ adds templates, exceptions, ...

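How do grammars relate to the parsing process?

[Figure: position of the parser in the compiler front end. Labels: source program, lexical analyzer (specified by regular expressions), token / get next token, parser, parse tree, rest of front end, intermediate representation, symbol table, errors.]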

Rest of front end:
- also technically part of parsing
- includes augmenting info on tokens in source, type checking, semantic analysis

Parser:
- uses a grammar to check structure of tokens
- produces a parse tree
- syntactic errors and recovery
- recognize correct syntax
- report errors

- Syntax Error Identification / Handling
- Recall typical error types:
- Lexical : Misspellings
- Syntactic : Omission, wrong order of tokens
- Semantic : Incompatible types
- Logical : Infinite loop / recursive call

- Majority of error processing occurs during syntax analysis
- NOTE: Not all errors are identifiable !! Which ones?

- Detection
- Actual Detection
- Finding position at which they occur

- Reporting
- Clear / accurate presentation

- Recovery
- How to skip over input to continue and find later errors
- Cannot impact compilation of correct programs

/* Deliberately erroneous program; the compiler diagnostics follow. */
#include <stdio.h>

int f1(int v)
{ int i,j=0;
  for (i=1;i<5;i++)
    { j=v+f2(i) }
  return j; }

int f2(int u)
{ int j;
  j=u+f1(u*u)
  return j; }

int main()
{ int i,j=0;
  for (i=1;i<10;i++)
    { j=j+i*i; printf("%d\n",i);
  printf("%d\n",f1(j));
  return 0;
}

As reported by MS VC++:

'f2' undefined; assuming extern returning int
syntax error : missing ';' before '}'
syntax error : missing ';' before 'return'
fatal error : unexpected end of file found

Which are “easy” to recover from? Which are “hard” ?

Panic Mode: Discard tokens until a "synchro" token is found (end, ";", "}", etc.)

-- Decision of designer

-- Problems:
   skipped input may contain a declaration, and losing it causes more errors
   errors in the skipped material are missed

-- Advantages:
   simple; suited to 1 error per statement
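The skipping step of panic mode can be sketched in a few lines. This is only an illustration under assumptions: tokens are plain strings, and the synchronizing set `SYNC_TOKENS` and the helper name `panic_mode_skip` are made up for this example rather than taken from any particular compiler.

```python
# Assumed synchronizing tokens; a real compiler designer chooses this set.
SYNC_TOKENS = {";", "}", "end"}


def panic_mode_skip(tokens, pos, sync=SYNC_TOKENS):
    """Discard tokens starting at pos until a synchronizing token is consumed.

    Returns (new_pos, skipped): the position where parsing resumes and the
    list of tokens that were thrown away.
    """
    skipped = []
    while pos < len(tokens) and tokens[pos] not in sync:
        skipped.append(tokens[pos])
        pos += 1
    if pos < len(tokens):
        pos += 1  # consume the synchronizing token itself
    return pos, skipped


# The parser detects an error at index 4 (the second "+").
tokens = ["j", "=", "u", "+", "+", ";", "return", "j", ";"]
new_pos, skipped = panic_mode_skip(tokens, 4)
print(new_pos, skipped, tokens[new_pos])
```

Anything inside `skipped` is never checked, which is exactly the "miss errors in skipped material" problem listed above.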

Phrase Level: Local correction on input

-- e.g., replace "," with ";", delete an extraneous ",", insert a missing ";"

-- Also decision of designer

-- Not suited to all situations

-- Used in conjunction with panic mode to allow less input to be skipped

Error Productions:

-- Augment grammar with rules for common errors

-- Augmented grammar used for parser construction / generation

-- Add a rule for := in C assignment statements: report error but continue compile

-- Self correction + diagnostic messages

-- What are other rules? (supported by yacc)

Global Correction:

-- Adding / deleting / replacing symbols is chancy: may do many changes!

-- Algorithms available to minimize changes: costly - key issues

What do you think of each approach? What approach do compilers you’ve used take ?

[Figure: Reg. Lang. nested inside CFLs]

- Regular Expressions
- Basis of lexical analysis
- Represent regular languages
- Context Free Grammars
- Basis of parsing
- Represent language constructs
- Characterize context free languages

EXAMPLE: a^n b^n, n ≥ 1 : Is it regular?

[Figure: Regular languages (token structure, scanner generation) are contained in the context free languages (sentence structure, parser generation).]

Definition: A Context Free Grammar, CFG, is described by (T, NT, S, PR), where:

T: Terminals / tokens of the language

NT: Non-terminals to denote sets of strings generatable by the grammar & in the language

S: Start symbol, S ∈ NT, which defines all strings of the language

PR: Production rules to indicate how T and NT are combined to generate valid strings of the language.

PR: NT → (T | NT)*

Like a Regular Expression / DFA / NFA, a Context Free Grammar is a mathematical model!

assign_stmt → id := expr ;

expr → term operator term

term → id

term → real

term → integer

operator → +

operator → -

What do “BLUE” symbols represent?

What do “BLACK” symbols represent?

Derivation: A sequence of grammar rule applications and substitutions that transform a starting non-term into a collection of terminals / tokens.

Simply stated: Grammars / production rules allow us to “rewrite” and “identify” correct syntax.

Given the rules on the previous slide, suppose

id := real + int;

is input. Is it syntactically correct? How do we know?

expr is represented as: expr → term operator term

Is this accurate / complete?

expr → expr operator term

expr → term

How does this affect the derivations which are possible?

Let's derive: id := id + real - integer ;

What's its parse tree?

Each line shows the sentential form reached, followed by the production just applied:

assign_stmt
⇒ id := expr ;                               (assign_stmt → id := expr ;)
⇒ id := expr operator term ;                 (expr → expr operator term)
⇒ id := expr operator term operator term ;   (expr → expr operator term)
⇒ id := term operator term operator term ;   (expr → term)
⇒ id := id operator term operator term ;     (term → id)
⇒ id := id + term operator term ;            (operator → +)
⇒ id := id + real operator term ;            (term → real)
⇒ id := id + real - term ;                   (operator → -)
⇒ id := id + real - integer ;                (term → integer)

assign_stmt → id := expr ;

expr → expr operator term

expr → term

term → id

term → real

term → integer

operator → +

operator → -
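The derivation of id := id + real - integer ; can be replayed mechanically. The sketch below is illustrative only: the dictionary encoding of the grammar and the helper names `leftmost_step` and `derive` are assumptions of this example, and each step rewrites the leftmost non-terminal of the current sentential form.

```python
# Grammar from the slide: non-terminal -> list of right-hand sides.
GRAMMAR = {
    "assign_stmt": [["id", ":=", "expr", ";"]],
    "expr": [["expr", "operator", "term"], ["term"]],
    "term": [["id"], ["real"], ["integer"]],
    "operator": [["+"], ["-"]],
}


def leftmost_step(form, lhs, rhs):
    """Apply production lhs -> rhs to the leftmost non-terminal of form."""
    for i, sym in enumerate(form):
        if sym in GRAMMAR:  # first non-terminal from the left
            if sym != lhs or rhs not in GRAMMAR[lhs]:
                raise ValueError(f"cannot apply {lhs} -> {rhs} to {form}")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no non-terminal left to rewrite")


def derive(start, steps):
    """Return every sentential form produced by applying steps in order."""
    form = [start]
    history = [form]
    for lhs, rhs in steps:
        form = leftmost_step(form, lhs, rhs)
        history.append(form)
    return history


STEPS = [
    ("assign_stmt", ["id", ":=", "expr", ";"]),
    ("expr", ["expr", "operator", "term"]),
    ("expr", ["expr", "operator", "term"]),
    ("expr", ["term"]),
    ("term", ["id"]),
    ("operator", ["+"]),
    ("term", ["real"]),
    ("operator", ["-"]),
    ("term", ["integer"]),
]

history = derive("assign_stmt", STEPS)
for form in history:
    print(" ".join(form))
```

A step that names a production not matching the leftmost non-terminal raises `ValueError`, which is one way to see that the order of rule applications matters.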

expr → expr op expr

expr → ( expr )

expr → - expr

expr → id

op → +

op → -

op → *

op → /

op → ↑

Black : NT

Blue : T

expr : S

9 production rules

To simplify / standardize notation, we offer a synopsis of terminology.

Terminals: a,b,c,+,-,punc,0,1,…,9, blue strings

Non Terminals: A,B,C,S, black strings

T or NT: X,Y,Z

Strings of Terminals: u,v,…,z in T*

Strings of T / NT: α, β, γ in (T ∪ NT)*

Alternatives of production rules:

A → α1; A → α2; …; A → αk   is written   A → α1 | α2 | … | αk

First NT on LHS of 1st production rule is designated as start symbol!

E → E A E | ( E ) | -E | id

A → + | - | * | / | ↑

A step in a derivation is zero or one action that replaces a NT with the RHS of a production rule.

EXAMPLE: E ⇒ -E (the ⇒ means "derives" in one step) using the production rule: E → -E

EXAMPLE: E ⇒ E A E ⇒ E * E ⇒ E * ( E )

DEFINITION:

⇒ derives in one step

⇒+ derives in one or more steps

⇒* derives in zero or more steps

EXAMPLES:

αAβ ⇒ αγβ if A → γ is a production rule

α1 ⇒ α2 ⇒ … ⇒ αn implies α1 ⇒* αn ; α ⇒* α for all α

If α ⇒* β and β ⇒ γ then α ⇒* γ

Let G be a CFG with start symbol S. Then S ⇒+ W (where W has no non-terminals) represents the language generated by G, denoted L(G). So W ∈ L(G) ⇔ S ⇒+ W.

W : is a sentence of G

When S ⇒* α (and α may have NTs) it is called a sentential form of G.

EXAMPLE: id * id is a sentence

Here's the derivation:

E ⇒ E A E ⇒ E * E ⇒ id * E ⇒ id * id

E ⇒* id * id

Sentential forms: every string in the derivation above (E, E A E, E * E, id * E, id * id)

Leftmost (⇒lm): Replace the leftmost non-terminal symbol

E ⇒lm E A E ⇒lm id A E ⇒lm id * E ⇒lm id * id

Rightmost (⇒rm): Replace the rightmost non-terminal symbol

E ⇒rm E A E ⇒rm E A id ⇒rm E * id ⇒rm id * id

Important Notes: A → δ

If βAγ ⇒lm βδγ, what's true about β?

If βAγ ⇒rm βδγ, what's true about γ?

Derivations: Actions to parse input can be represented pictorially in a parse tree.

E → E A E | ( E ) | -E | id

A → + | - | * | / | ↑

A leftmost derivation of: id + id * id

A rightmost derivation of: id + id * id

E ⇒ E A E ⇒ E * E ⇒ id * E ⇒ id * id

[Figure: the parse tree grown after each step of this derivation; the final tree has root E with children E, A, E whose leaves read id * id.]

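Consider the expression grammar:

E → E+E | E*E | (E) | -E | id

Leftmost derivations of id + id * id

First derivation:

E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

[Figure: parse tree after each step; the final tree has + at the root, with id as its left operand and the subtree for id * id as its right operand.]

Second derivation:

E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

[Figure: final parse tree with * at the root, with the subtree for id + id as its left operand and id as its right operand.]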

WHAT’S THE ISSUE HERE ?
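One way to make the issue concrete is to count how many distinct parse trees the grammar E → E+E | E*E | (E) | -E | id admits for a token string. The sketch below is a small memoized counter written for this grammar only; the function name `count_parse_trees` and the token-list encoding are assumptions of this example.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E+E | E*E | (E) | -E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways E can derive exactly tokens[i:j].
        if i >= j:
            return 0
        total = 0
        if j - i == 1 and tokens[i] == "id":            # E -> id
            total += 1
        if tokens[i] == "-":                            # E -> - E
            total += count(i + 1, j)
        if tokens[i] == "(" and tokens[j - 1] == ")":   # E -> ( E )
            total += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):                   # E -> E + E | E * E
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parse_trees("id + id * id".split()))      # two trees: ambiguous
print(count_parse_trees("( id + id ) * id".split()))  # parentheses force one
```

A count greater than one for any sentence means the grammar is ambiguous; here the two trees correspond to the two leftmost derivations shown above.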

%term

T_AND T_ARRAY T_BEGIN T_CASE

T_CONST T_DIV T_DO T_RANGE

T_TO T_ELSE T_END T_FILE

T_FOR T_FUNCTION /* T_GOTO */ T_IDENTIFIER

T_IF T_IN T_INTEGER T_LABEL

T_MOD T_NOT T_REAL T_OF

T_OR /* T_PACKED */ T_NIL T_PROCEDURE

T_PROGRAM T_RECORD T_REPEAT T_SET

T_STRING T_THEN T_DOWNTO T_TYPE

T_UNTIL T_VAR T_WHILE /* T_WITH */

T_COMMA T_PERIOD T_ASSIGN T_COLON

T_LPAREN T_RPAREN T_LBRACK T_RBRACK

T_UPARROW T_SEMI

%binary T_LT T_EQ T_GT T_GE T_LE T_NE T_IN

%left T_PLUS T_MINUS T_OR

%left UNARYSIGN

%left T_MULT T_RDIV T_DIV T_MOD T_AND

%left T_NOT

%%

Start : Program_Header Declarations Block T_PERIOD

Program_Header : T_PROGRAM T_IDENTIFIER T_LPAREN Id_List T_RPAREN T_SEMI

Block : T_BEGIN Stt_List T_END

Declarations : Const_Def_Part

Type_Def_Part

Var_Decl_Part

Routine_Decl_Part

Const_Def_Part : T_CONST Const_Def_List

| /* epsilon */ ;

Const_Def_List : Const_Def

| Const_Def_List Const_Def ;

Const_Def : T_IDENTIFIER T_EQ Constant T_SEMI

Type_Def_Part : T_TYPE Type_Def_List

| /* epsilon */

Type_Def_List : Type_Def

| Type_Def_List Type_Def

Type_Def : T_IDENTIFIER T_EQ Type T_SEMI

Var_Decl_Part : T_VAR Var_Decl_List

| /* epsilon */

Var_Decl_List : Var_Decl_List Var_Decl

| Var_Decl

Var_Decl : Id_List T_COLON Type T_SEMI

Routine_Decl_Part : Routine_Decl_Part Routine_Decl

| /* epsilon */

Routine_Decl : Routine_Head T_IDENTIFIER T_SEMI

| Routine_Head Declarations Block T_SEMI

Routine_Head : T_PROCEDURE T_IDENTIFIER Params T_SEMI

| T_FUNCTION T_IDENTIFIER Params Function_Type T_SEMI

Params : T_LPAREN Params_List T_RPAREN

| /* epsilon */

Param_Group : Id_List T_COLON Type

| T_VAR Id_List T_COLON Type

Function_Type : T_COLON Type

| /* epsilon */

Params_List : Param_Group

| Params_List T_SEMI Param_Group

Constant : T_STRING | Number | T_PLUS Number | T_MINUS Number

Number : T_IDENTIFIER | T_INTEGER | T_REAL

Const_List : Constant | Const_List T_COMMA Constant

Type : Simple_Type | T_UPARROW T_IDENTIFIER | Structured_Type

Simple_Type : T_IDENTIFIER

| T_LPAREN Id_List T_RPAREN

| Constant T_RANGE Constant

Structured_Type : T_ARRAY T_LBRACK Simple_Type_List T_RBRACK T_OF Type

| T_SET T_OF Simple_Type

| T_RECORD Field_List T_END

Simple_Type_List : Simple_Type

| Simple_Type_List T_COMMA Simple_Type

Field_List : Fixed_Part /* Variant_part */

Fixed_Part : Field | Fixed_Part T_SEMI Field

Field : /* epsilon */ | Id_List T_COLON Type

Variant_part : /* epsilon */

| T_CASE T_IDENTIFIER T_OF Variant_List

| T_CASE T_IDENTIFIER T_COLON T_IDENTIFIER T_OF Variant_List

Variant_List : Variant | Variant_List T_SEMI Variant

Variant : /* epsilon */

| Const_List T_COLON T_LPAREN Field_List T_RPAREN

Stt_List : Statement | Stt_List T_SEMI Statement

Case_Stt_List : Case_Statement | Case_Stt_List T_SEMI Case_Statement

Case_Statement : Const_List T_COLON Statement | /* epsilon */

Statement : /* epsilon */

| T_IDENTIFIER | T_IDENTIFIER T_LPAREN Expr_List T_RPAREN

| Variable T_ASSIGN Expression | T_BEGIN Stt_List T_END

| T_CASE Expression T_OF Case_Stt_List T_END

| T_WHILE Expression T_DO Statement

| T_REPEAT Stt_List T_UNTIL Expression

| T_FOR Variable T_ASSIGN Expression T_TO Expression T_DO Statement

| T_FOR Variable T_ASSIGN Expression T_DOWNTO Expression T_DO Statement

| T_IF Expression T_THEN Statement

| T_IF Expression T_THEN Statement T_ELSE Statement

Expression : Expression relop Expression %prec T_LT

| T_PLUS Expression %prec UNARYSIGN

| T_MINUS Expression %prec UNARYSIGN

| Expression addop Expression %prec T_PLUS

| Expression divop Expression %prec T_MULT

| T_NIL | T_STRING | T_INTEGER | T_REAL

| Variable

| T_IDENTIFIER T_LPAREN Expr_List T_RPAREN

| T_LPAREN Expression T_RPAREN

| negop Expression %prec T_NOT

| T_LBRACK Element_List T_RBRACK

| T_LBRACK T_RBRACK

Element_List : Element | Element_List T_COMMA Element

Element : Expression | Expression T_RANGE Expression

Variable : T_IDENTIFIER

| Variable T_LBRACK Expr_List T_RBRACK

| Variable T_PERIOD T_IDENTIFIER

| Variable T_UPARROW

Expr_List : Expression | Expr_List T_COMMA Expression

relop : T_EQ | T_LT | T_GT | T_NE | T_LE | T_GE | T_IN

addop : T_PLUS | T_MINUS | T_OR

divop : T_MULT | T_RDIV | T_DIV | T_MOD | T_AND ;

negop : T_NOT

Regular Expressions : Basis of Lexical Analysis

Reg. Expr. generate/represent regular languages

Reg. Languages: smallest, most well defined class of languages

Context Free Grammars: Basis of Parsing

CFGs represent context free languages

CFLs contain more powerful languages

[Figure: Reg. Lang. nested inside CFLs]

a^n b^n : a CFL that's not regular!

Try to write DFA/NFA for it!

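Since Reg. Lang. ⊆ Context Free Lang., it is possible to go from reg. expr. to CFGs via NFA.

Recall: (a | b)*abb

[Figure: NFA with states 0, 1, 2, 3. State 0 is the start state and loops on a and b; 0 goes to 1 on a, 1 to 2 on b, 2 to 3 on b; state 3 is accepting.]

Construct CFG as follows:

- Each state i has non-terminal Ai : A0, A1, A2, A3
- If state i has a transition to state j on symbol a, then Ai → aAj
  - on a: A0 → aA0, A0 → aA1
  - on b: A0 → bA0, A1 → bA2, A2 → bA3
- If state i has a transition to state j on ε, then Ai → Aj
- If i is an accepting state, Ai → ε : A3 → ε
- If i is a starting state, Ai is the start symbol : A0

T = {a, b}, NT = {A0, A1, A2, A3}, S = A0

PR = { A0 → aA0 | aA1 | bA0 ;
       A1 → bA2 ;
       A2 → bA3 ;
       A3 → ε }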

[Figure: the NFA for (a | b)*abb, states 0 to 3 with start state 0 and accepting state 3]

vs.

A0 → aA0, A0 → aA1

A0 → bA0, A1 → bA2

A2 → bA3, A3 → ε

How is abaabb derived in each ?
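The state-to-non-terminal construction can be sketched as code. The dictionary encoding of the NFA for (a | b)*abb and the helper names `nfa_to_cfg` and `derive` are assumptions of this example; ε-transitions are left out because this NFA has none.

```python
# NFA for (a | b)*abb: (state, symbol) -> set of next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}


def nfa_to_cfg(nfa, accepting):
    """One production Ai -> a Aj per transition; Ai -> epsilon if i accepts."""
    productions = {}
    for (i, sym), targets in sorted(nfa.items()):
        for j in sorted(targets):
            productions.setdefault(f"A{i}", []).append((sym, f"A{j}"))
    for i in sorted(accepting):
        productions.setdefault(f"A{i}", []).append(())  # () stands for epsilon
    return productions


def derive(productions, start, word):
    """Return the sentential forms of a derivation of word, or None."""
    def search(nt, rest, forms, prefix):
        for rhs in productions.get(nt, []):
            if rhs == ():
                if rest == "":
                    return forms + [prefix or "ε"]
            elif rest and rest[0] == rhs[0]:
                new_prefix = prefix + rhs[0]
                found = search(rhs[1], rest[1:],
                               forms + [new_prefix + rhs[1]], new_prefix)
                if found:
                    return found
        return None

    return search(start, word, [start], "")


grammar = nfa_to_cfg(NFA, ACCEPTING)
print(" => ".join(derive(grammar, f"A{START}", "abaabb")))
```

Each sentential form ends in exactly one non-terminal, and that non-terminal names the NFA state reached so far, so the derivation and the NFA run mirror each other step for step.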

- Regular expressions for lexical syntax
- CFGs are overkill, lexical rules are quite simple and straightforward
- REs – concise / easy to understand
- More efficient lexical analyzer can be constructed
- RE for lexical analysis and CFGs for parsing promotes modularity, low coupling & high cohesion. Why?

CFGs : Match tokens "(" ")", begin / end, if-then-else, whiles, proc/func calls, …

Intended for structural associations between tokens !

Are tokens in correct order ?

Grammar Problems:

- ambiguity
- ε-moves
- cycles
- left recursion
- left factoring

1. Humans write / develop grammars
2. Different parsing approaches have different needs: Top-Down (LL(k), Recursive Descent) vs. Bottom-Up (LR(k), LALR(k))

- For 1: remove "errors"
- For 2: put / redesign grammar

Consider the following grammar segment:

stmt → if expr then stmt

     | if expr then stmt else stmt

     | other (any other statement)

What’s problem here ?

Consider the program:

if e1
then if e2
     then s1
     else s2

Else must match to previous then.

Structure indicates parse subtree for expression.

Easy case:

if e1
then s1
else if e2
     then s2
     else s3

Else must match to previous then.

if E1 then if E2 then S1 else S2

How is this parsed?

Form 1 (else belongs to the inner if):

if E1 then
    if E2 then
        S1
    else
        S2

vs.

Form 2 (else belongs to the outer if):

if E1 then
    if E2 then
        S1
else
    S2

[Figure: the two parse trees for "if e1 then if e2 then s1 else s2", one per form]

What's the issue here?

Take original grammar:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other (any other statement)

Revise to remove ambiguity:

stmt → matched_stmt | unmatched_stmt

matched_stmt → if expr then matched_stmt else matched_stmt | other

unmatched_stmt → if expr then stmt
               | if expr then matched_stmt else unmatched_stmt

How does this grammar work ?

A left recursive grammar has rules that support the derivation: A ⇒+ Aα, for some α.

Top-down parsing can't reconcile this type of grammar, since it could consistently make a choice which wouldn't allow termination.

A ⇒ Aα ⇒ Aαα ⇒ Aααα … etc., using A → Aα | β

Take left recursive grammar:

A → Aα | β

To the following:

A → βA'

A' → αA' | ε

Consider:

E → E + T | T

T → T * F | F

F → ( E ) | id

Derive: id + id + id

E ⇒ E + T

How can left recursion be removed?

E → E + T | T    What does this generate?

E ⇒ E + T ⇒ T + T

E ⇒ E + T ⇒ E + T + T ⇒ T + T + T

How does this build strings?

What does each string have to start with?

For our example:

E → E + T | T

T → T * F | F

F → ( E ) | id

becomes

E → TE'     E' → + TE' | ε

T → FT'     T' → * FT' | ε

F → ( E ) | id

Informal Discussion:

Take all productions for A and order as:

A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn

where no βi begins with A.

Now apply concepts of previous slide:

A → β1A' | β2A' | … | βnA'

A' → α1A' | α2A' | … | αmA' | ε
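The transformation for immediate left recursion is mechanical enough to sketch. In this illustration a production's right-hand side is a list of symbols, the empty list stands for ε, and the new non-terminal is named by appending a prime; the function name `eliminate_immediate_left_recursion` and this encoding are assumptions of the example.

```python
def eliminate_immediate_left_recursion(nt, alternatives):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn without left recursion.

    alternatives is a list of right-hand sides (lists of symbols).
    Returns a dict mapping each resulting non-terminal to its alternatives.
    """
    # alphas: tails of the left-recursive alternatives (A alpha)
    alphas = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    # betas: alternatives that do not begin with A
    betas = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not alphas:
        return {nt: alternatives}  # nothing to do
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],  # [] is epsilon
    }


print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
print(eliminate_immediate_left_recursion("T", [["T", "*", "F"], ["F"]]))
print(eliminate_immediate_left_recursion("F", [["(", "E", ")"], ["id"]]))
```

For E and T this reproduces E → TE', E' → +TE' | ε and T → FT', T' → *FT' | ε, while F is returned unchanged because none of its alternatives begins with F.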

S → Aa | b

A → Ac | Sd | ε

S ⇒ Aa ⇒ Sda

Problem: If left recursion is two-or-more levels deep, this isn't enough

Algorithm:

- Input: Grammar G with no cycles or ε-productions
- Output: An equivalent grammar with no left recursion
- Arrange the non-terminals in some order A1, A2, …, An
- for i := 1 to n do begin
-   for j := 1 to i - 1 do begin
-     replace each production of the form Ai → Aj γ
-     by the productions Ai → δ1γ | δ2γ | … | δkγ
-     where Aj → δ1 | δ2 | … | δk are all current Aj productions;
-   end
-   eliminate the immediate left recursion among Ai productions
- end
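The algorithm above can be sketched directly. Assumptions of this illustration: a grammar is a dict from non-terminal to a list of right-hand sides (lists of symbols, with the empty list for ε), the caller supplies the ordering A1, …, An, and the function name `eliminate_left_recursion` is made up here. The sample grammar contains an ε-production, which the stated input condition excludes, but it happens to work for this case.

```python
def eliminate_left_recursion(grammar, order):
    """Remove direct and indirect left recursion following the given order."""
    g = {nt: [list(alt) for alt in alts] for nt, alts in grammar.items()}
    for i, ai in enumerate(order):
        # Substitute earlier non-terminals Aj that appear leftmost in Ai.
        for aj in order[:i]:
            expanded = []
            for alt in g[ai]:
                if alt and alt[0] == aj:
                    expanded.extend(delta + alt[1:] for delta in g[aj])
                else:
                    expanded.append(alt)
            g[ai] = expanded
        # Eliminate the immediate left recursion among the Ai productions.
        alphas = [alt[1:] for alt in g[ai] if alt and alt[0] == ai]
        if alphas:
            betas = [alt for alt in g[ai] if not alt or alt[0] != ai]
            new_nt = ai + "'"
            g[ai] = [beta + [new_nt] for beta in betas]
            g[new_nt] = [alpha + [new_nt] for alpha in alphas] + [[]]
    return g


# A1 -> A2 a | b ;  A2 -> A2 c | A1 d | epsilon
sample = {
    "A1": [["A2", "a"], ["b"]],
    "A2": [["A2", "c"], ["A1", "d"], []],
}
result = eliminate_left_recursion(sample, ["A1", "A2"])
for nt, alts in result.items():
    print(nt, "->", " | ".join(" ".join(alt) or "ε" for alt in alts))
```

On the sample this yields A2 → bd A2' | A2' and A2' → c A2' | ad A2' | ε, with A1 unchanged.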

Apply the algorithm to:

A1 → A2a | b

A2 → A2c | A1d | ε

i = 1

For A1 there is no left recursion

i = 2

for j = 1 to 1 do

Take productions A2 → A1γ and replace with A2 → δ1γ | δ2γ | … | δkγ

where A1 → δ1 | δ2 | … | δk are A1 productions

In our case A2 → A1d becomes A2 → A2ad | bd

What's left:

A1 → A2a | b

A2 → A2c | A2ad | bd | ε

Are we done?

No! We must still remove A2 left recursion!

A1 → A2a | b

A2 → A2c | A2ad | bd | ε

Recall:

A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn

A → β1A' | β2A' | … | βnA'

A' → α1A' | α2A' | … | αmA' | ε

Apply to above case. What do you get?

Very Simple: A → ε and B → uAv implies add B → uv to the grammar G.

Why does this work?

Examples:

E → TE'     E' → + TE' | ε

T → FT'     T' → * FT' | ε

F → ( E ) | id

A1 → A2a | b

A2 → bd A2' | A2'

A2' → c A2' | ad A2' | ε

How would cycles be removed?

S → SS | ( S ) | ε

Has: S ⇒ SS ⇒ S and S ⇒ ε

Problem: Uncertain which of 2 rules to choose:

stmt → if expr then stmt else stmt
     | if expr then stmt

When do you know which one is valid?

What's the general form of stmt?

A → αβ1 | αβ2     α : if expr then stmt

β1 : else stmt     β2 : ε

Transform to:

A → αA'

A' → β1 | β2

EXAMPLE:

stmt → if expr then stmt rest

rest → else stmt | ε
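Left factoring can be sketched for the simple case treated here, where every alternative passed in shares the prefix α. The encoding (right-hand sides as lists of symbols, the empty list for ε) and the names `common_prefix` and `left_factor` are assumptions of this example; alternatives that do not share the prefix, such as "other", would have to be kept aside by the caller.

```python
def common_prefix(a, b):
    """Longest common prefix of two symbol lists."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]


def left_factor(nt, alternatives, new_nt):
    """Factor the prefix shared by all alternatives: A -> alpha A'."""
    prefix = alternatives[0]
    for alt in alternatives[1:]:
        prefix = common_prefix(prefix, alt)
    if not prefix:
        return {nt: alternatives}  # nothing in common, leave as is
    return {
        nt: [prefix + [new_nt]],
        new_nt: [alt[len(prefix):] for alt in alternatives],  # [] is epsilon
    }


stmt_alternatives = [
    ["if", "expr", "then", "stmt", "else", "stmt"],
    ["if", "expr", "then", "stmt"],
]
print(left_factor("stmt", stmt_alternatives, "rest"))
```

The result is stmt → if expr then stmt rest with rest → else stmt | ε, so a top-down parser only has to decide after it has already consumed the common part.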

Leads to:

- Forcing Matching Within Grammar
  - if-then-else: else must go with most recent if

stmt → if expr then stmt | if expr then stmt else stmt | other

becomes

stmt → matched_stmt | unmatched_stmt

matched_stmt → if expr then matched_stmt else matched_stmt | other

unmatched_stmt → if expr then stmt
               | if expr then matched_stmt else unmatched_stmt

- Left Factoring - Know correct choice in advance!

Rules given in Assignment:

where_queries → where are_does declarations of symbol
              | where are_does "search_option" appear

or

where_queries → where rest_where

rest_where → are declarations of symbol
           | does "search_option" appear

- Left Recursion - Avoid infinite loop when choosing rules

A → Aα | β   becomes   A → βA' ; A' → αA' | ε

E → E + T | T

T → T * F | F

F → ( E ) | id

becomes

E → TE'     E' → + TE' | ε

T → FT'     T' → * FT' | ε

F → ( E ) | id

EXAMPLE:

If we use the original production rules on input: id + id (with leftmost):

E ⇒ E + T ⇒ E + T + T ⇒ E + T + T + T ⇒* T + T + T + … + T   (termination not assured)

If we use the transformed production rules on input: id + id (with leftmost):

E ⇒ TE' ⇒ T + TE' ⇒* id + TE' ⇒* id + id E' ⇒ id + id
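With left recursion removed, each non-terminal can become one procedure that chooses its rule from the next token. The sketch below is a minimal recursive-descent recognizer for E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → ( E ) | id. The names `parse` and `ParseError`, the token-list input, and the "$" end marker are assumptions of this example, not part of the slides.

```python
class ParseError(Exception):
    pass


def parse(tokens):
    """Return the productions used, in leftmost-derivation order."""
    tokens = list(tokens) + ["$"]  # "$" marks end of input
    pos = 0
    used = []

    def peek():
        return tokens[pos]

    def match(expected):
        nonlocal pos
        if tokens[pos] != expected:
            raise ParseError(
                f"expected {expected!r}, found {tokens[pos]!r} at {pos}")
        pos += 1

    def E():
        used.append("E -> T E'")
        T()
        E_prime()

    def E_prime():
        if peek() == "+":
            used.append("E' -> + T E'")
            match("+")
            T()
            E_prime()
        else:
            used.append("E' -> ε")

    def T():
        used.append("T -> F T'")
        F()
        T_prime()

    def T_prime():
        if peek() == "*":
            used.append("T' -> * F T'")
            match("*")
            F()
            T_prime()
        else:
            used.append("T' -> ε")

    def F():
        if peek() == "(":
            used.append("F -> ( E )")
            match("(")
            E()
            match(")")
        else:
            used.append("F -> id")
            match("id")

    E()
    match("$")
    return used


for rule in parse(["id", "+", "id"]):
    print(rule)
```

Every call consumes input or returns via an ε rule, so the recursion terminates; with the original rule E → E + T the procedure for E would call itself before consuming anything.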

- Removing ε-Moves

E → TE'     E' → + TE' | ε

becomes

E → T

E → TE'

E' → + TE'

E' → + T

Is there still a problem ?

[Figure: Reg. Lang. nested inside CFLs]

- Note: Not all aspects of a programming language can be represented by context free grammars / languages.
- Examples:
- 1. Declaring ID before its use
- 2. Valid typing within expressions (Type Rules)
- 3. Parameters in definition vs. in call
- 4. Method Overloading/hiding

CSLs

- Examples:
- L1 = { wcw | w is in (a | b)* } : Declare before use
- L2 = { a^n b^m c^n d^m | n ≥ 1, m ≥ 1 }
- a^n b^m : formal parameter
- c^n d^m : actual parameter

What does it mean when L is context-sensitive ?

L3 = { w c w^R | w is in (a | b)* }

L4 = { a^n b^m c^m d^n | n ≥ 1, m ≥ 1 }

L5 = { a^n b^n c^m d^m | n ≥ 1, m ≥ 1 }

L6 = { a^n b^n | n ≥ 1 }

L3 = { w c w^R | w is in (a | b)* }

S → a S a | b S b | c

L4 = { a^n b^m c^m d^n | n ≥ 1, m ≥ 1 }

S → a S d | a A d

A → b A c | bc

L5 = { a^n b^n c^m d^m | n ≥ 1, m ≥ 1 }

S → XY

X → a X b | ab

Y → c Y d | cd

L6 = { a^n b^n | n ≥ 1 }

S → a S b | ab