GOLD: A Grammar Oriented Parsing System

GOLD: A Grammar Oriented Parsing System Devin Cook and Du Zhang Department of Computer Science California State University Sacramento, CA 95819-6021. Introduction. What is a Parser? Software which breaks a source program into its various grammatical units w.r.t. a formal grammar


Introduction

GOLD: A Grammar Oriented Parsing System
Devin Cook and Du Zhang
Department of Computer Science
California State University
Sacramento, CA 95819-6021

SEKE 2004


Introduction

Introduction

  • What is a Parser?

    • Software which breaks a source program into its various grammatical units w.r.t. a formal grammar

    • Used to convert a source program into an internal representation

  • Parsing Algorithms

    • LL Parsers: top-down, predictive

    • LR / LALR Parsers: bottom-up, shift-reduce


Motivation

Motivation

  • The common approach to creating parsers is to use a compiler-compiler, or parser generator

  • Each parser generator is designed for a specific programming language; there is no consistent parser generator across languages

    • Different grammatical notations

    • Features and interfaces of the tools vary in both look and behavior


Goals

Goals

  • Design and implement a generalized parsing system that supports development of multiple programming languages

  • Offer a consistent development environment for the language developers


Introduction

GOLD

  • Grammar Oriented Language Developer

  • Separates the component that generates parse tables for a target grammar from the component that does the actual parsing

  • Supports the full Unicode character set

  • Includes a set of tools that aid the language development process


System structure

System Structure

Builder

  • Analyzes a target grammar and creates DFA and LALR parse tables

  • These tables are saved to a Compiled Grammar Table file

Compiled Grammar Table File

  • Intermediary between the Builder and the Engine

  • The file format is platform independent

  • Format is designed to be very easy to read and extend in future versions

Engine

  • Reads the tables & parses the source text

  • Can be implemented in different programming languages – as needed


Development flow

Development Flow

  • Grammar is defined and loaded

    • Any text editor can be used

  • Builder

    • Grammar is analyzed and errors reported

    • The parse tables are created and saved to .cgt file

  • Engine

    • Reads the tables, parses the source string, and produces parsing results

    • Can be implemented in different programming languages – as needed


The builder

The Builder

  • GOLD meta-language

  • Compiled grammar table (.cgt) file

  • Skeleton program creation for the Engine from program templates

  • Interactive source string testing

  • Display of various parse table information

  • Export parse tables to a web page, XML file, or formatted text


Gold meta language

GOLD Meta-Language

  • The GOLD Meta-Language is used to define a target grammar

  • It must not contain features that are programming language dependent

  • Its notation is very close to the standard notations (Backus-Naur Form and regular expressions)

  • It supports all language attributes (including those which cannot be specified using BNF or regular expressions)


Gold meta language contd

GOLD Meta-Language (contd.)

  • Format

    • Parameters are used to specify attributes about the grammar

    • Character Sets are used to define the character domain for terminals

    • Terminals are defined using regular expressions

    • Rules are defined using Backus-Naur Form


Defining parameters

Defining Parameters

  • Used for Name, Author, Case Sensitive, Start Symbol, ...

  • Parameter names are delimited by double quotes

  • Parameters

    • "Name", "Author", "Version", "About" are informative

    • "Start Symbol" specifies the initial / start rule in the grammar


Parameters

Parameters


Example parameters

Example Parameters

"Name"    = 'My Programming Language'

"Version" = '1.0 beta'

"Author"  = 'John Q. Public'

"About"   = 'This is a test declaration.'

          | 'Multiple lines are available'

          | 'by using the "pipe" symbol'

"Case Sensitive" = 'False'

"Start Symbol" = <Statement>
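The declarations above follow a regular shape: a name in double quotes, an equals sign, and a value. The following Python fragment is a minimal sketch of reading such lines. It is not the Builder's own parser. The helper names (`parse_parameters`, `unquote`) are hypothetical, and joining "pipe" continuation lines with a newline is an assumption.

```python
import re

# Matches a declaration such as: "Name" = 'My Programming Language'
PARAM_RE = re.compile(r'^"([^"]+)"\s*=\s*(.+)$')


def unquote(value):
    """Strip one pair of surrounding single quotes, if present."""
    value = value.strip()
    if len(value) >= 2 and value[0] == "'" and value[-1] == "'":
        return value[1:-1]
    return value


def parse_parameters(text):
    """Collect parameter declarations into a dict of name -> value."""
    params = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = PARAM_RE.match(line)
        if match:
            current = match.group(1)
            params[current] = [unquote(match.group(2))]
        elif line.startswith("|") and current is not None:
            # Assumption: a "pipe" line continues the previous parameter.
            params[current].append(unquote(line[1:]))
    return {name: "\n".join(parts) for name, parts in params.items()}


EXAMPLE = '''
"Name"    = 'My Programming Language'
"Version" = '1.0 beta'
"About"   = 'This is a test declaration.'
          | 'Multiple lines are available'
"Case Sensitive" = 'False'
"Start Symbol" = <Statement>
'''

params = parse_parameters(EXAMPLE)
print(params["Name"])
print(params["Start Symbol"])
```

Values that are not quoted, such as the start symbol, are kept as written.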


Defining sets

Defining Sets

  • Character sets are used to aid the construction of regular expressions used to define terminals

  • Literal sets of characters are delimited using ‘[’ and ‘]’

  • Names of user-defined sets are delimited by ‘{’ and ‘}’

  • Sets can be defined by adding and subtracting previously declared sets
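Adding and subtracting sets behaves like ordinary set union and difference. The Python sketch below illustrates the idea. The names are hypothetical, and the sets are ASCII-only approximations rather than the actual contents of GOLD's pre-defined sets.

```python
import string

# ASCII-only stand-ins for pre-defined sets (assumed contents).
DIGIT = set(string.digits)
LETTER = set(string.ascii_letters)
PRINTABLE = {chr(code) for code in range(32, 127)}

# "Adding" two sets is a union.
ALPHANUMERIC = LETTER | DIGIT

# "Subtracting" removes members, e.g. every printable character
# except the double quote that would delimit a string literal.
STRING_CHARS = PRINTABLE - {'"'}

# A user-defined set can build on previously declared sets.
ID_TAIL = ALPHANUMERIC | {"_"}

print(len(ALPHANUMERIC), len(STRING_CHARS), len(ID_TAIL))
```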


Example sets

Example Sets


Pre defined character sets

Pre-defined Character Sets

  • Many characters are either not accessible via the keyboard or are so commonly used that redefining them in each grammar would be repetitive and time-consuming

  • GOLD meta-language contains a collection of useful pre-defined sets

  • These include sets often used for defining terminals as well as characters not accessible via keyboard


Individual characters

Individual Characters

  • Some control characters cannot be typed on a standard keyboard


Commonly used character sets

Commonly used Character Sets

{Digit}

{Letter}

{Alphanumeric}

{Printable}

{Whitespace}

{Letter Extended}

{Printable Extended}

{ANSI Mapped}

{ANSI Printable}


Unicode character sets

Unicode Character Sets

  • GOLD meta-language contains 43 pre-defined Unicode character sets

  • The names of those sets are based on standard names of the Unicode Consortium


Comments

Comments

  • GOLD meta-language allows both line comments and block comments


Defining terminals

Defining Terminals

  • Terminals are used to define reserved words, symbols, and recognized patterns (identifiers) in a grammar

  • Each terminal is defined using a regular expression, which is used to construct the Deterministic Finite Automaton (DFA) used by the tokenizer

  • Implicit declaration of frequently used reserved words and symbols
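To make the role of the DFA concrete, here is a minimal Python sketch of a table-driven tokenizer that always takes the longest match. The character sets, states, and terminal names are a hypothetical toy table written by hand for this illustration. They are not tables produced by the Builder, and dropping whitespace inside the tokenizer is a simplification.

```python
import string

# Hypothetical character sets (toy example, ASCII only).
CHARSETS = {
    "ws": set(" \t\r\n"),
    "letter": set(string.ascii_letters),
    "digit": set(string.digits),
}

# DFA edges: state -> list of (character set name, target state).
EDGES = {
    0: [("ws", 1), ("letter", 2), ("digit", 3)],
    1: [("ws", 1)],
    2: [("letter", 2), ("digit", 2)],
    3: [("digit", 3)],
}

# Accepting states and the terminal each one recognizes.
ACCEPT = {1: "Whitespace", 2: "Identifier", 3: "Integer"}


def next_state(state, ch):
    """Follow the edge whose character set contains ch, if any."""
    for name, target in EDGES.get(state, []):
        if ch in CHARSETS[name]:
            return target
    return None


def tokenize(text):
    """Split text into (terminal, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        state, i, last_accept = 0, pos, None
        while i < len(text):
            target = next_state(state, text[i])
            if target is None:
                break
            state = target
            i += 1
            if state in ACCEPT:
                last_accept = (ACCEPT[state], i)
        if last_accept is None:
            # No terminal matches here: emit a one-character error token.
            tokens.append(("Error", text[pos]))
            pos += 1
            continue
        kind, end = last_accept
        if kind != "Whitespace":
            tokens.append((kind, text[pos:end]))
        pos = end
    return tokens


print(tokenize("x1 42"))
```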


Example terminals

Example Terminals


Defining rules

Defining Rules

  • Use Backus-Naur Form

  • Nonterminals are delimited by angle brackets < and >

  • Terminals are delimited by single quotes or not delimited at all


Example lists

Example: Lists

  • Lists are specified using recursive rules

Recursion


Example optional rules

Example: Optional Rules

  • Optional rules are specified with an empty production (one that contains no symbols)

  • This allows the developer to specify both optional constructs and lists containing zero or more members

zero or more

Optional Rule


Example lisp grammar

Example: LISP Grammar


Example lisp grammar1

Example: LISP Grammar

Parameters

Initial Rule


Example lisp grammar2

Example: LISP Grammar

Set Definition

Set Literal


Example lisp grammar3

Example: LISP Grammar

Terminal Definition


Example lisp grammar4

Example: LISP Grammar

Rules

Recursive

Rule

Optional Rule


Compiled grammar table file

Compiled Grammar Table File

  • A file format designed to store parse tables and other information generated by the Builder

  • Design considerations

    • Easy to implement on different platforms

    • Flexibility for data structures to be added or expanded

    • Room for future growth (additional new types of data)


Cgt file structure

.cgt File Structure

  • The file consists of a number of records

  • Each record contains a number of entries


Cgt record

.cgt Record

  • The header contains name and version info

  • A record has the following format


Parameter record

Parameter Record

  • The parameter record occurs only once in the .cgt file. It contains information about the grammar as well as attributes that affect how the grammar functions. The record is preceded by a byte field containing the value 80, the ASCII code for the letter 'P'.


Table size record

Table Size Record

  • The table size record appears before any records containing information about symbols, sets, rules, or state tables. The first field of the record contains a byte with the value 84, the ASCII code for the letter 'T'. Each remaining value gives the total number of objects in one of the listed tables.
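An Engine can use the leading byte to decide how to interpret a record. The Python sketch below covers only the two codes stated on these slides (80 and 84). Codes for the other record types are not given here, so they are reported as "other". The function and dictionary names are hypothetical.

```python
# Record type codes taken from the slides: 'P' = 80, 'T' = 84.
RECORD_KINDS = {
    ord("P"): "parameters",
    ord("T"): "table sizes",
}


def record_kind(code):
    """Name the record type identified by its leading byte value."""
    return RECORD_KINDS.get(code, "other")


print(record_kind(80))
print(record_kind(84))
```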


Other types of records

Other Types of Records

  • Character set table member

  • Symbol table member

  • Initial states (for both DFA and LALR)

  • Rule table member

  • DFA state table member

  • LALR state table member


An example cgt file

An Example .cgt File

  • An example grammar

    "Name" = 'Example'

    "Version" = '1.0'

    "Author" = 'Devin Cook'

    "About" = 'N/A'

    "Start Symbol" = <Stms>

    <Stms> ::= <Stm> <Stms>

    | <Stm>

    <Stm> ::= if <Exp> then <Stms> end

    | Read Id

    | Write <Exp>

    <Exp> ::= Id '+' <Exp>

    | Id '-' <Exp>

    | Id
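To show which token sequences this grammar accepts, here is a small Python recognizer. It uses hand-written recursive descent rather than the LALR tables that GOLD generates, purely because that is shorter to read. It works on terminal names that are assumed to be already tokenized, and all function names are hypothetical.

```python
class ParseError(Exception):
    """Raised when the token sequence does not match the grammar."""


# Terminals that can begin a <Stm>.
STM_STARTS = {"if", "Read", "Write"}


def peek(tokens, pos):
    return tokens[pos] if pos < len(tokens) else "(EOF)"


def expect(tokens, pos, symbol):
    if peek(tokens, pos) != symbol:
        raise ParseError(f"expected {symbol}, found {peek(tokens, pos)}")
    return pos + 1


def parse_stms(tokens, pos):
    # <Stms> ::= <Stm> <Stms> | <Stm>   (a list of one or more statements)
    pos = parse_stm(tokens, pos)
    if peek(tokens, pos) in STM_STARTS:
        return parse_stms(tokens, pos)
    return pos


def parse_stm(tokens, pos):
    head = peek(tokens, pos)
    if head == "if":      # <Stm> ::= if <Exp> then <Stms> end
        pos = parse_exp(tokens, pos + 1)
        pos = expect(tokens, pos, "then")
        pos = parse_stms(tokens, pos)
        return expect(tokens, pos, "end")
    if head == "Read":    # <Stm> ::= Read Id
        return expect(tokens, pos + 1, "Id")
    if head == "Write":   # <Stm> ::= Write <Exp>
        return parse_exp(tokens, pos + 1)
    raise ParseError(f"unexpected {head}")


def parse_exp(tokens, pos):
    # <Exp> ::= Id '+' <Exp> | Id '-' <Exp> | Id
    pos = expect(tokens, pos, "Id")
    if peek(tokens, pos) in ("+", "-"):
        return parse_exp(tokens, pos + 1)
    return pos


def recognize(tokens):
    """Return True if tokens form a valid <Stms>, otherwise False."""
    try:
        return parse_stms(tokens, 0) == len(tokens)
    except ParseError:
        return False


print(recognize("Read Id Write Id + Id".split()))
```

The recursive call in `parse_stms` mirrors the recursive rule for `<Stms>`, which is how the grammar expresses a list.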


Table content

Table Content

  • Symbol Table

    ========================================

    Symbol Table

    ========================================

    Index Name

    ----- ------------

    0 (EOF)

    1 (Error)

    2 (Whitespace)

    3 '-'

    4 '+'

    5 end

    6 Id

    7 if

    8 Read

    9 then

    10 Write

    11 <Exp>

    12 <Stm>

    13 <Stms>


Table content 2

Table Content (2)

  • Rules

    ========================================

    Rules

    ========================================

    Index Name ::= Definition

    ----- ------ --- ------------------------

    0 <Stms> ::= <Stm> <Stms>

    1 <Stms> ::= <Stm>

    2 <Stm> ::= if <Exp> then <Stms> end

    3 <Stm> ::= Read Id

    4 <Stm> ::= Write <Exp>

    5 <Exp> ::= Id '+' <Exp>

    6 <Exp> ::= Id '-' <Exp>

    7 <Exp> ::= Id


Table content 3

Table Content (3)

  • Character Set Table

    ========================================

    Character Set Table

    ========================================

    Index Characters

    ----- ---------------------------------

    0 {HT}{LF}{VT}{FF}{CR}{Space}{NBSP}

    1 +

    2 -

    3 Ee

    4 Ii

    5 Rr

    6 Tt

    7 Ww

    8 Nn

    9 Dd

    10 Ff

    11 Aa

    12 Hh


Table content 31

Table Content (3)

  • DFA states

    ========================================

    DFA States

    ========================================

    Index  Description          Character Set
    -----  -------------------  -------------
    0      Goto 1               0
           Goto 2               1
           Goto 3               2
           Goto 4               3
           Goto 7               4
           Goto 10              5
           Goto 14              6
           Goto 18              7
    1      Goto 1               0
           Accept (Whitespace)
    …………


Table content 4

Table Content (4)

  • LALR states

    ========================================

    LALR States

    ========================================

    Index  Configuration/Action
    -----  ------------------------------------
    0      if      Shift 1
           Read    Shift 9
           Write   Shift 11
           <Stm>   Goto 13
           <Stms>  Goto 17
    1      <Stm> ::= if · <Exp> then <Stms> end
           Id      Shift 2
           <Exp>   Goto 7
    …………
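The Shift and Goto entries in such a table drive a simple stack machine. The Python sketch below shows that loop. Because the listing on this slide is abbreviated, the sketch uses its own hand-built table for a reduced grammar with only two rules, `<Exp> ::= Id '+' <Exp>` and `<Exp> ::= Id`. The state numbers are therefore hypothetical and do not correspond to the states shown here.

```python
# Rules of the reduced grammar: index -> (left-hand side, right-hand length).
RULES = {
    0: ("<Exp>", 3),  # <Exp> ::= Id '+' <Exp>
    1: ("<Exp>", 1),  # <Exp> ::= Id
}

# Hand-built action table: state -> {terminal: (action, argument)}.
ACTION = {
    0: {"Id": ("shift", 1)},
    1: {"+": ("shift", 3), "(EOF)": ("reduce", 1)},
    2: {"(EOF)": ("accept", None)},
    3: {"Id": ("shift", 1)},
    4: {"(EOF)": ("reduce", 0)},
}

# Goto table: state -> {nonterminal: next state}.
GOTO = {0: {"<Exp>": 2}, 3: {"<Exp>": 4}}


def parse(tokens):
    """Return the rule indexes in reduction order, or None on a syntax error."""
    stream = list(tokens) + ["(EOF)"]
    stack = [0]          # stack of state numbers
    pos = 0
    reductions = []
    while True:
        action = ACTION[stack[-1]].get(stream[pos])
        if action is None:
            return None  # no table entry: syntax error
        kind, arg = action
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]
            del stack[len(stack) - length:]       # pop the rule's right-hand side
            stack.append(GOTO[stack[-1]][lhs])    # then follow the goto entry
            reductions.append(arg)
        else:
            return reductions                     # accept


print(parse(["Id", "+", "Id"]))
```

Because the tables are plain data, the same loop works for any grammar, which is what allows the Engine to be separate from the Builder.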


Cgt file for the grammar

.cgt File for the Grammar

  • To illustrate, only one of each record type is included


The remaining builder features

The Remaining Builder Features

  • Besides meta-language and .cgt file,

    • Skeleton program creation for the Engine from program templates

    • Interactive source string testing

    • Display of various parse table information

    • Export parse tables to a web page, XML file, or formatted text


Application layout

Application Layout

Online Help

Toolbar

Grammar Editor

Next Button

Status Message


Program templates

Program Templates

  • When developing an Engine, which interacts with the tables of rules and symbols in the .cgt file, manually typing constant definitions can be tedious and error-prone

  • Program templates are designed to help automate the Engine development

  • The Builder can use a program template to create a “skeleton program” for an implementation of the Engine


Program templates contd

Program Templates (contd.)

  • Skeleton program contains

    • Necessary declarations of constants and variables

    • Function calls

    • Case statements, pre-processor statements

    • Ready-to-use programs

  • Notation designed to not conflict with known languages

  • Program templates are saved in a subfolder
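As an illustration of the constant declarations a skeleton program might contain, the Python sketch below turns the symbol table of the example grammar into named constants. It does not use the Builder's template notation, which is not shown on these slides. The naming scheme (`SYM_` and `NT_` prefixes, `MINUS`, `PLUS`) and the function names are hypothetical.

```python
import re

# Symbol table of the example grammar, in index order.
SYMBOLS = [
    "(EOF)", "(Error)", "(Whitespace)", "'-'", "'+'", "end", "Id",
    "if", "Read", "then", "Write", "<Exp>", "<Stm>", "<Stms>",
]

# Hypothetical names for symbols that are not valid identifiers.
PUNCTUATION = {"-": "MINUS", "+": "PLUS"}


def constant_name(symbol):
    """Derive an identifier such as SYM_EOF or NT_STMS from a symbol name."""
    core = symbol.strip("()<>'")
    core = PUNCTUATION.get(core, core)
    ident = re.sub(r"\W+", "_", core).upper()
    prefix = "NT" if symbol.startswith("<") else "SYM"
    return f"{prefix}_{ident}"


def render_constants(symbols):
    """Produce one 'NAME = index' declaration per symbol."""
    return "\n".join(
        f"{constant_name(name)} = {index}" for index, name in enumerate(symbols)
    )


print(render_constants(SYMBOLS))
```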


Display of symbol table

Display of Symbol Table

  • Symbol table display


Display of rule table

Display of Rule Table

  • Rule table display


Display of log information

Display of Log Information

  • Log info: general information about the number of symbols, which ones were defined implicitly, table counts, and any errors that occur


Display of dfa state table

Display of DFA State Table

  • DFA state table


Display of lalr state table

Display of LALR State Table

  • LALR state table


Export parse tables

Export Parse Tables

  • Parse tables can be exported to a web page, formatted text, or an XML file
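The Python sketch below shows what an XML export of a table could look like in principle, using the symbol table of the example grammar. The element and attribute names (`SymbolTable`, `Symbol`, `Index`, `Name`, `Count`) are assumptions made for this illustration and are not the Builder's actual export schema.

```python
import xml.etree.ElementTree as ET

SYMBOLS = [
    "(EOF)", "(Error)", "(Whitespace)", "'-'", "'+'", "end", "Id",
    "if", "Read", "then", "Write", "<Exp>", "<Stm>", "<Stms>",
]


def symbols_to_xml(symbols):
    """Serialize a symbol table as an XML string (hypothetical schema)."""
    root = ET.Element("SymbolTable", Count=str(len(symbols)))
    for index, name in enumerate(symbols):
        ET.SubElement(root, "Symbol", Index=str(index), Name=name)
    return ET.tostring(root, encoding="unicode")


xml_text = symbols_to_xml(SYMBOLS)
print(xml_text)
```

Names such as `<Exp>` contain angle brackets, so they are escaped automatically when written as attribute values.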


Web page export

Web Page Export

  • An example of webpage export


A short demo

A Short Demo

  • A simple grammar

  • ANSI C


The engine

The Engine

  • Different implementations of the Engine

  • Object-oriented approach

  • The design is centered on the “GOLDParser” object, which performs all the parsing logic

  • The remaining objects are used for storage or to support the GOLDParser object

  • Available in: Visual Basic .NET, ANSI C, C#, C++ (MFC), Delphi 5 & 6, Java, Python, Visual Basic 6


Testing and development

Testing and Development

  • Extensive tests on the Builder’s algorithms to generate the LALR and DFA tables

    • Small grammars

    • Grammars for real-world programming languages (e.g., ANSI C, BASIC, COBOL, LISP, Smalltalk, SQL, Visual Basic .NET, HTML, XML)

  • A Visual Basic 6 version of the Engine was developed as an integral part of the GOLD system and was tested


Comparison

Comparison

  • Yacc: for C or C++ on UNIX platform

  • ANTLR: OO parser generator that works for C++, C#, and Java

  • Bison: Yacc compatible

  • Elkhound: parser generator that is based on the generalized LR (GLR) algorithm

  • GENOA: framework for code analysis tools that has a parsing front end


Free parsing systems

Free Parsing Systems


Benefits of gold

Benefits of GOLD

  • It supports development of multiple programming languages and the full Unicode character set

  • It has a set of development tools

  • Its meta-language is easy to understand and its Builder GUI is easy to use


Contributors to different engines

Contributors to Different Engines


Website

Website

  • The URL for the GOLD website

    http://www.devincook.com/goldparser

  • On average, approximately 3000 copies of the Builder application are downloaded per month

  • Latest news: known bugs, workarounds, new releases

  • Contributor section

  • Online documentation


Future work

Future Work

  • Port the Builder to UNIX and Linux

  • Enhancement to the meta grammar

