Parsing Tools - PowerPoint PPT Presentation

Parsing tools l.jpg
Download
1 / 54

  • 218 Views
  • Updated On :
  • Presentation posted in: Pets / Animals

Parsing Tools. Introduction to Bison and Flex. Scanning/parsing tools. lex - original UNIX lexics generator (Lesk, 1975) create a C function that will parse input according to a set of regular expressions yacc - " yet another compiler compiler " UNIX parser (Johnson, 1975)

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

Parsing Tools

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Parsing tools l.jpg

Parsing Tools

Introduction to Bison and Flex


Scanning parsing tools l.jpg

Scanning/parsing tools

lex - original UNIX lexics generator (Lesk, 1975)

  • create a C function that will parse input according to a set of regular expressions

    yacc - "yet another compiler compiler" UNIX parser (Johnson, 1975)

  • generate a C program for a parser from BNF rules

    bison and flex ("fast lex") - more powerful, free versions of yacc and lex, from GNU Software Fnd'n.

    Jflex - generates Java code for a scanner

    CUP - generates Java code for a parser


Bison overview l.jpg

Bison Overview

Purpose: automatically write a parser program for a grammar written in BNF.

Usage: you write a bison source file containing rules that look like BNF.

Bison creates a C program that parses according to the rules

term : term '*' factor { $$ = $1 * $3; }

| term '/' factor { $$ = $1 / $3; }

| factor { $$ = $1; }

;

factor : ID { $$ = valueof($1); }

| NUMBER { $$ = $1; }

;


Bison overview 2 l.jpg

Bison Overview (2)

The programmer puts BNF rules and token rules for the parser he wants in a bison source file myparser.y

myparser.y

BNF rules and actions for your grammar.

run bison to create a C program (*.tab.c) containing a parser function.

The programmer must also supply a tokenizer named yylex( )

> bison myparser.y

myparser.tab.c

parser source code

yylex.c

tokenizer function in C

> gcc -o myprog myparser.tab.c yylex.c

myprog

executable program


Bison overview 3 l.jpg

Bison Overview (3)

In operation:

your main program calls yyparse( ).

yyparse( ) calls yylex when it wants a token.

yylex returns the type of the token.

yylex puts the value of the token in a global variable named yylval

input file to be parsed.

yylex( )

tokenizer returns the type of the next token

yylval

yyparse( )

parser created by bison

parse tree or other result


Bison source file l.jpg

Bison source file

The file has 3 sections, separated by "%%" lines.

/* declarations go here*/

%%

/* grammar rules go here */

%%

/* additional C code goes here */

Note: format for "yacc" is the same as for bison.


Bison source file with c declarations l.jpg

Bison source file with C declarations

You usually include C code in the declarations section.

%{

/* C declarations and #define statements go here */

#include <stdio.h>

#define YYSTYPE double

%}

/* bison declarations go here*/

%%

/* grammar rules go here */

%%

/* additional C code goes here */

Declare that yylval will be a "double" type.


Bison example l.jpg

Bison Example

Create a parser for this grammar:

expression => expression + term

| expression - term

| term

term => term * factor

| term / factor

| factor

factor => ( expression )

| NUMBER


Bison yacc file for example 1 l.jpg

Bison/Yacc file for example (1)

Structure of Bison or Yacc input:

%{

/* C declarations and #DEFINE statements go here */

#include <stdio.h>

#define YYSTYPE double

%}

/* Bison/Yacc declarations go here*/

%token NUMBER /* define token type NUMBER */

%left '+' '-' /* + and - are left associative */

%left '*' '/' /* * and / are left associative */

%%

/* grammar rules go here */

%%

/* additional C code goes here */


Bison yacc example 2 l.jpg

Bison/Yacc example (2)

%% /* Bison grammar rules */

input : /* empty production to allow an empty input */

| input line

;

line : expr '\n' { printf("Result is %f\n", $1); }

expr : expr '+' term { $$ = $1 + $3; }

| expr '-' term { $$ = $1 - $3; }

| term { $$ = $1; }

;

term : term '*' factor { $$ = $1 * $3; }

| term '/' factor { $$ = $1 / $3; }

| factor { $$ = $1; }

;

factor : '(' expr ')' { $$ = $2; }

| NUMBER { $$ = $1; }

;


Bison yacc example 3 l.jpg

Bison/Yacc example (3)

  • $1, $2, ... represent the actual values of tokens or non-terminals (rules) that match the production.

  • $$ is the result.

rule

pattern to match

action

expr : expr '+' term { $$ = $1 + $3; }

| expr '-' term { $$ = $1 - $3; }

| term { $$ = $1; }

;

Example:if the input matches expr + term then set the result ($$) equal to the sum of expr plus term ($1 + $3).


Bison yacc example 4 l.jpg

Bison/Yacc example (4)

Q: why can we write "$$ = $1 + $3" ?

A: because we declared "#define YYSTYPE double", so all tokens and results are double.

rule

pattern to match

action

expr : expr '+' term { $$ = $1 + $3; }

| expr '-' term { $$ = $1 - $3; }

| term { $$ = $1; }

;


Scanner function yylex l.jpg

Scanner function: yylex( )

  • You must supply a scanner named yylex.

  • yylex returns the token TYPE (not token value).

int yylex( void ) {

int c = getchar();/* read from stdin */

if (c < 0) return 0;/* end of input */

if ( c == '+' || c == '-' ) return c;

/* for character tokens, TYPE = character itself */

if ( isdigit(c) ) {

yylval = c - '0';/* yylval is a global var */

while( isdigit( c=getchar() ) )

yylval = 10*yylval + (c - '0');

if (c >= 0) ungetc(c,stdin);

return NUMBER;/* token type is NUMBER */

}

...

}


Where is the token value l.jpg

Where is the token value?

  • The value of the token is stored in a global variable named yylval.

int yylex( void ) {

int c = getchar();/* read from stdin */

if (c < 0) return 0;/* end of input */

if ( c == '+' || c == '-' ) return c;

/* read number and store as yylval */

if ( isdigit(c) ) {

yylval = c - '0'; /* get each digit */

while( isdigit( c=getchar() ) )

yylval = 10*yylval + (c - '0');

/* push next character back into input stream */

if (c >= 0) ungetc(c,stdin);

return NUMBER;/* token type is NUMBER */

}

...

}


Useful c functions for yylex l.jpg

Useful C functions for yylex

int c = getchar( );read next char from stdin

returns -1 at end of input

ungetc( c, stdin)put character back in input

#include <ctype.h>

isdigit(c)true if c contains a digit

isalpha(c)true if c contains a letter

isalnum(c)isdigit(c) || isalnum(c)

islower(c)true if c is lowercase

isupper(c)guess?

isspace(c)space tab newline formfeed

scanf("%d", &num)read an integer from input

scanf("%lf", &dnum)read a double from input


Scanner function for double values l.jpg

Scanner function for double values

  • Suppose we want the data type of all tokens to be double.

  • In Bison input: #defineYYSTYPE double.

  • This changes yylval to be a double!

%{

/* C declarations and #DEFINE statements go here */

#include <stdio.h>

#define YYSTYPE double

%}

%%

/* bison definitions go here */


Scanner function for double l.jpg

Scanner function for double

  • Now yylex must know that yylval is "extern double".

  • Here is example of using scanf to parse numbers.

int yylex( void ) {

int c = getchar();/* read from stdin */

if (c < 0) return 0;/* end of the input*/

while ( c == ' ' || c == '\t' ) c = getchar( );

if ( isdigit(c) || c == '.' ) {

ungetc(c, stdin); /* put c back into input */

scanf ("%lf", &yylval); /* get value using scanf */

return NUMBER; /* return the token type */

}

return c; /* anything else... return char itself */

}


Other c functions yyerror l.jpg

Other C functions: yyerror

  • Bison requires an error routine named yyerror.

  • yyerror is called by the parser when there is an error.

  • You can include yyerror( ) in your Bison source file.

/* display error message */

int yyerror( char *errmsg ) {

printf("%s\n", errmsg);

}


Other c functions main l.jpg

Other C functions: main

  • you need a write a main() function that starts the parser.

  • For a simple parser, main() calls yyparse().

/* main method to run the program */

int main( ) {

printf("Type some input. Enter ? for help.\n");

yyparse( );

}


Running bison l.jpg

Running Bison

  • Compile the example file simple.y

    CMD>bison simple.y

  • Output is "simple.tab.c" it is C code for the parser.

  • To compile the parser by itself using gcc:

    CMD> gcc -o simple.exe simple.tab.c

  • To run the program at the command prompt:

    CMD> simple.exe


Simple example definitions l.jpg

Simple example: definitions

/* The Bison declarations section */

%{

/* C declarations and #DEFINE statements go here */

#include <math.h>

#define YYSTYPE double

%}

%token NUMBER /* define token type for numbers */

%token '+' '-' /* + and - are left associative */

File:simple.y

No left/right associativity specified


Simple example grammar rules l.jpg

Simple example: grammar rules

%% /* Simple grammar rules */

input : /* allow empty input */

| input line

;

line : expr '\n' { printf("answer: %d\n", $1); }

expr : expr '+' term { $$ = $1 + $3; }

| expr '-' term { $$ = $1 - $3; }

| term { $$ = $1; }

;

term : NUMBER { $$ = $1; }

;


Yyerror and main l.jpg

yyerror and main

%% /* extra C code */

/* display error message */

int yyerror( char *errmsg ) { printf("%s\n", errmsg); }

/* main */

int main() {

printf("type an expression:\n");

yyparse( );

}


Simple example does it work l.jpg

Simple example: does it work?

cmd> bison simple.y

cmd> gcc -o calc.exe simple.tab.c

cmd> calc

Test the grammar rules:

3 + 5

3 - 4 - 5 + 6

How about this: -3


Simple example exploring bnf l.jpg

Simple example: exploring BNF

%% /* Simple grammar rules */

input : /* empty input */

| input line

;

line : expr '\n' { printf("answer: %d\n", $1); }

expr : expr '+' expr { $$ = $1 + $3; }

| expr '-' expr { $$ = $1 - $3; }

| term { $$ = $1; }

;

term : NUMBER { $$ = $1; }

| '-' NUMBER { $$ = -$2; }

;

%token NUMBER

%right '+' '-'


Exercise l.jpg

Exercise

Expand the grammar to include these operations:

4 * 5multiplication

2 / 3division

10 + 3 * 4 – 1 / 2correct order of operations

2 * ( 3 + 4 )grouping


Complete example definitions l.jpg

Complete example: definitions

/* The Bison declarations section */

%{

/* C declarations and #DEFINE statements go here */

#include <math.h>

#define YYSTYPE double

%}

%token NUMBER /* define token type for numbers */

%left '+' '-' /* + and - are left associative */

%left '*' '/' /* * and / are left associative */

File:simple.y


Complete example grammar rules l.jpg

Complete example: grammar rules

%% /* Bison grammar rules */

input : /* allow empty input */

| input line

;

line : expr '\n' { printf("Result is %f\n", $1); }

expr : expr '+' term { $$ = $1 + $3; }

| expr '-' term { $$ = $1 - $3; }

| term { $$ = $1; }

;

term : term '*' factor { $$ = $1 * $3; }

| term '/' factor { $$ = $1 / $3; }

| factor { $$ = $1; }

;

factor : '(' expr ')' { $$ = $2; }

| NUMBER { $$ = $1; }

| '-' NUMBER { $$ = -$2; }

;


Common errors in bison l.jpg

Common Errors in Bison

  • Forgetting to quote literals: term '+' term

  • Not skipping space where space is allowed

  • Ambiguity in grammar rules

  • Forgetting ';' at the end of rule

%token NUMBER

%token + -

%% /* grammar rules */

expr : term + term{ $$ = $1 + $3; }

| term - term{ $$ = $1 - $3; }

| term{ $$ = $1; }

term: NUMBER{ $$ = $1; }

| - NUMBER{ $$ = -$2; }


Shift reduce and operator order l.jpg

Shift / Reduce and operator order

  • Bison uses token look-ahead and a token stack. It shifts tokens onto the stack until it can choose a rule to reduce (replace) tokens with a non-terminal. Example:

expr ::= expr + expr

| expr - expr

| expr * expr

| expr / expr

| term

term ::= NUMBER | ( expr )

  • Suppose the input read is 10 -

  • shift these onto stack because no matching rule yet.

  • Suppose the next token is 2. What should Bison do?


Shift reduce and operator order31 l.jpg

Shift / Reduce and operator order

STACK: 10 - CURRENT TOKEN: 2

  • This grammar is ambiguous. Bison could match "expr - expr" or it could shift 2 onto the stack and look at the next token... it maybe * such as: 10 - 2 * 3

  • This is called a "shift / reduce conflict".

  • Unless you specify a disambiguating rule (next slide), Bison chooses "shift" over "reduce".

  • Meaning: if it's not clear how to resolve conflicts, wait.

INPUTACTIONSTACK

10shift10

-shift10 -

2shift10 - 2

*shift10 - 2 *

INPUTACTIONSTACK

3reduce10 - 6

reduce4done


The declarations section l.jpg

The declarations section

  • The declarations section can contain Bison directives and C directives.

  • We already saw the use of %left, %right, etc.

%{

/* semantic value of tokens, as C datatype */

#define YYSTYPE double

#include <math.h>

/* why do we have to declare these? */

int yylex (void);

void yyerror (char const *);

%}

%token NUMBER

%left '+' '-'

%left '*' '/'


Specifying operator order l.jpg

Specifying operator order

  • In the declarations section you can write:

%noassoc NEG/* unary minus sign: - 3 */

%left '+' '-'

%left '*' '/' '%'

%right '^'

  • '+' and '-' are left associative and have the same precedence.

  • '*', '/', and '%' left associative and same precedence; they have higher precedence (later declarations are higher).

  • '^' is right associative and higher precedence than + - * / %

  • 'NEG' has no associativity: "- - 3" is an error


Scanner function main points l.jpg

Scanner function main points

  • Tokens have a TYPE and a VALUE.

  • The scanner (yylex) returns the TYPE of the next token

  • For one-character tokens like '+', '=', "(' the character itself can be used as the type.

  • Define symbolic names for token types in Bison using %token NAME

  • Return the VALUE of a token using the global variable yylval.

  • By default, yylval is type "int". Change it using: #defineYYSTYPE double

  • Rather than parse numbers yourself, use scanf.


Scanner function main points 1 l.jpg

Scanner function main points (1)

%{

#include <ctype.h>

#include <math.h>

#define YYSTYPE double

%}

%token NUMBER

%token SQRT

%left '+' '-'

%left '*' '/'

%%

line: expr '\n' { printf("%g\n", $$); }

;

expr: expr * expr { $$ = $1 + $3; }

| SQRT '(' expr ')' { $$ = sqrt( $2 ); }

...

/* more rules */

Token values are double

Token for sqrt function


Scanner function main points 2 l.jpg

Scanner function main points (2)

%%

int yylex(void) {

int c = getchar( );

while ( c == ' ' || c == '\t' ) c = getchar( );

if ( isdigit(c) ) { ungetc(c,stdin);

scanf("%f", &yylval);

return NUMBER;

}

if ( isalpha(c) ) {

char *word = getword(c); /* get next word */

if ( strcmp(word,"sqrt") ) return SQRT;

...

}

Token is a number

Token is sqrt function

For a more general way of handling identifiers, see the Bison User's Guide, "mfcalc" example.


Handling multiple data types l.jpg

Handling Multiple Data Types

  • For a more general grammar, the scanner should be able to return different data types as the value (yylval).

  • In definitions section, define "%union" as union of all data types that yylval (and hence $1, $2, ...) can have.

%union {

double number;

char* string;

}

%token <number> NUMBER

%token <string> IDENT

%type <number> expr

%type <number> term

%left '+' '-'

%left '*' '/'

Define all the data types that token values can have

For each token type, define the data type of its value.

For tokens that represent their own value, don't need to define data type.


Handling multiple data types 2 l.jpg

Handling Multiple Data Types (2)

%%

int yylex(void) {

int c = getchar( );

while ( c == ' ' || c == '\t' ) c = getchar( );

if ( isdigit(c) ) { ungetc(c,stdin);

double x;

scanf("%f", &x);

yylval.number = x;

return NUMBER;

}

if ( isalpha(c) ) {

char *word = getword(c); /* get next word */

yylval.string = word;

return IDENT;

...

}

token value is a double

token value is a string


Using a separate file for yylex l.jpg

Using a Separate File for yylex

  • You can put the scanner (yylex) in a separate file.

  • BUT, yylex needs values that are defined by Bison, such as NUMBER, IDENT, yylval.

  • Solution: add the "%defines" option to your rules

  • Bison will create a header file named "simple.tab.h".

%defines

%{

#include <math.h>

#define YYSTYPE double

%}

token NUMBER

...


Using a makefile l.jpg

Using a Makefile

  • make is an automatic build facility. Just type "make"

  • Reads a Makefile of "rules" for how to make things

# Makefile for simple calculator

calc.exe: simple.tab.c yylex.o

gcc -o calc.exe simple.tab.c yylex.o -lm

# how to make simple.tab.c from simple.y

simple.tab.c: simple.y

bison simple.y

# yylex.o (obj. file) depends on simple.tab.h

yylex.o: yylex.c simple.tab.h

gcc –c yylex.c


Running make l.jpg

Running Make

C:/calculator> make

bison -d simple.y

gcc -c simple.tab.c

gcc -o simple.exe simple.tab.o yylex.c -lm


Advantages of make l.jpg

Advantages of Make

  • Makefile record all the dependencies in a project

  • make only builds the parts that are missing or out of date

  • can manage complex projects with multiple Makefiles in subdirectories


Introduction to flex l.jpg

Introduction to flex

  • Flex is a program that automatically creates a scanner in C, using rules for tokens as regular expressions.

  • Format of the input file is like Bison.

%{

/* C definitions for scanner */

%}

flex definitions

%%

rules

%%

user code (extra C code)


Flex example l.jpg

Flex Example

  • Read console input and describes each token read.

%{

inline int yywrap(void) { return 1; };

%}

/* flex definitions */

DIGIT[0-9]

LETTER[A-Za-z]

%%

{LETTER}+ printf("Word: %s\n", yytext); return 1;

-?{DIGIT}+ printf("Number: %s\n", yytext); return 2;

[[:punct:]] printf("Punct: %s\n", yytext); return 3;

\nprintf("End of line\n"); return 0;

%% /* main method calls yylex to tokenize input */

int main( ) { printf("Type something: ");

while( yylex() > 0 ) { }; }


Flex example 2 l.jpg

Flex Example (2)

  • Run flex to create yylex.c (on Linux: lex.yy.c).

    cmd> flex mylex.fl

  • Compile lex.yy.c and create myscanner.exe

    cmd> gcc -o myscanner.exe yylex.c

  • Run the program:

    cmd> myscanner

Type something: hello, it's 9:00 o'clock.

word: hello

punct: ,

word: it

punct: '


Flex explained definitions l.jpg

Flex Explained: definitions

%{/* C definitions and includes */

#include "myparser.tab.h"

/* this fixes an unknown symbol "yywrap" */

#ifdef yywrap

#undef yywrap

#endif

inline int yywrap(void) { return 0; };

%}

/* flex definitions */

DIGIT[0-9]

LETTER[A-Za-z]

%%


Flex explained parsing rules 1 l.jpg

Flex Explained: Parsing Rules (1)

  • pattern is a regular expression or a literal value.

  • action is zero or more C statements to execute.

    • The current matched input is (char *) yytext

    • The length of matched input is yyleng

{LETTER}+printf("Found a word: %s", yytext);

-?[0-9]+ printf("Found an int: %s", yytext);

pattern

action

yytext is a string containing the current token.


Flex explained parsing rules 2 l.jpg

Flex Explained: Parsing Rules (2)

  • Keep matching input until a return statement.

  • Flex chooses the rule that produces longest match.

  • If there is a tie, choose the first rulethat matches.

  • Use yylval (global variable) for value of token.

%{ /* Bison header file defines INT, IDENT...*/

#include "myparser.tab.h"

%}

%%

[0-9]+ yylval.itype = atoi(yytext); return INT;

{LETTER}+ yylval.ctype = yytext; return IDENT;

"+"|"-" yylval.ctype = yytext; return OP;

"=" return ASSIGN;


Flex yywrap l.jpg

Flex yywrap

Flex can read input from multiple input files.

  • when flex reaches the end of a file, it calls yywrap()

  • if yywrap() returns 0, it means that there is another input file to read.

  • if yywrap() returns 1, it means that there are no more input files.

%{

inline int yywrap(void) { return 1; };

%}

/* same thing */

%option noyywrap


Using flex with bison l.jpg

Using flex with bison

  • Run "bison -d file.y" to create file.tab.h

  • #include "file.tab.h" in your flex source.

  • Set yylval to the token value.

  • Run flex.

  • Compile and link lex.yy.c and file.tab.c

%{

#include <math.h>

#include "file.tab.h"

%}

LETTER[A-Za-z]/* same as [[:alpha:]] */

%%

-?[0-9]+ yylval.itype = atoi(yytext); return INT;

{LETTER}+ yylval.ctype = yytext; return IDENT;

"+"|"-" yylval.ctype = yytext; return OP;

"=" return ASSIGN;


Using flex and bison l.jpg

Using Flex and Bison

mygrammer.y

BNF rules for your grammar.

mylex.fl

Regular expressions that match and return tokens

> bison –d mygrammar.y

> flex mylex.fl

mygrammar.tab.c

parser source code

lex.yy,c

tokenizer function in C

> gcc -o myprog mygrammar.tab.c lex.yy.c

myprog

executable program


Learning bison flex cup l.jpg

Learning Bison, Flex, CUP

  • On Linux or most UNIX systems type:

    info bison

    Using "info": SPACEBAR = next page

    ALT-V = previous page, CTRL-C = quit

  • Documentation on the Web:

    GNU Free Software Foundation

    http://www.gnu.org/software/bison/bison.html

    ... or many tutorials online (search Google)

  • Similar documentation for flex.

  • CUP manual: http://www2.cs.tum.edu/projects/cup

    • includes many easy examples!


Getting the software l.jpg

Getting the Software

  • bison and flex are included with Linux distributions

  • MS Windows version of GNU programs

    http://gnuwin32.sourceforge.net/

    http://gnuwin32.sourceforge.net/packages.html

    • The packages include excellent user's guides with examples.

  • CUP:

    http://www2.cs.tum.edu/projects/cup/

    • software (jar file) and excellent manual; links to resources

  • Jflex (lex that creates java code):

    http://jflex.sourceforge.net/


Gnu tools and c c compiler l.jpg

GNU Tools and C/C++ compiler

  • Cygwin: www.cygwin.com

  • MinGW: www.mingw.org

    • free GNU toolkit.

  • Dev-C++

    • C++ integrated development environment

    • two downloads: with gcc or without gcc


  • Login