
CS 3304 Comparative Languages

CS 3304 Comparative Languages. Lecture 3: Scanning 24 January 2012. Introduction. Syntax: the form or structure of the expressions, statements, and program units. Semantics: the meaning of the expressions, statements, and program units.

dominiquel


Presentation Transcript


  1. CS 3304 Comparative Languages • Lecture 3: Scanning • 24 January 2012

  2. Introduction • Syntax: the form or structure of the expressions, statements, and program units. • Semantics: the meaning of the expressions, statements, and program units. • Syntax and semantics provide a language’s definition. • Users of a language definition: • Other language designers. • Implementers. • Programmers (the users of the language). • Basic terminology: • A sentence is a string of characters over some alphabet. • A language is a set of sentences. • A lexeme is the lowest level syntactic unit of a language. • A token is a category of lexemes (e.g., identifier).
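The lexeme/token distinction on this slide can be illustrated with a small Python sketch. The token names (`identifier`, `int_literal`, `equal_sign`, and so on) and the helper `token_of` are hypothetical, chosen only for illustration, and the input is assumed to be already split into lexemes by whitespace.

```python
# Hypothetical token names for a handful of single-character lexemes.
SYMBOL_TOKENS = {"=": "equal_sign", "+": "plus_op", "*": "mult_op", ";": "semicolon"}

def token_of(lexeme):
    """Map one lexeme (a concrete string) to its token (a category)."""
    if lexeme.isdigit():
        return "int_literal"
    starts_ok = lexeme[0].isalpha() or lexeme[0] == "_"
    if starts_ok and all(c.isalnum() or c == "_" for c in lexeme):
        return "identifier"
    return SYMBOL_TOKENS.get(lexeme, "unknown")

def classify(sentence):
    """Pair every whitespace-separated lexeme with its token."""
    return [(lexeme, token_of(lexeme)) for lexeme in sentence.split()]

for lexeme, token in classify("index = 2 * count + 17 ;"):
    print(f"{lexeme:<6} {token}")
```

Here `index` and `count` are two different lexemes that belong to the same token, `identifier`.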

  3. Defining Languages • Recognizers: • A recognition device reads input strings over the alphabet of the language and decides whether the input strings belong to the language. • Example: the syntax analysis part of a compiler (scanning). • Generators: • A device that generates sentences of a language. • One can determine whether a particular sentence is syntactically correct by comparing it to the structure of the generator.

  4. Regular Expressions • A regular expression is one of the following: • A character. • The empty string, denoted by ε. • Two regular expressions concatenated. • Two regular expressions separated by | (i.e., or). • A regular expression followed by the Kleene star (concatenation of zero or more strings). • Numerical literals in Pascal may be generated by the following:
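The definition that followed on this slide is not part of the transcript. As a hedged sketch only, the Python below assumes a common textbook formulation: an unsigned integer is one or more digits, and an unsigned number is an unsigned integer with an optional fraction and an optional exponent. The names `DIGIT`, `UNSIGNED_INTEGER`, `UNSIGNED_NUMBER`, and `is_numeric_literal` are illustrative, not taken from the lecture.

```python
import re

# Built only from the operations listed above: characters, concatenation,
# alternation, and Kleene star ("?" is shorthand for "x | empty string").
DIGIT = r"[0-9]"                          # 0 | 1 | ... | 9
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"    # digit digit*
UNSIGNED_NUMBER = (
    rf"{UNSIGNED_INTEGER}"
    rf"(?:\.{UNSIGNED_INTEGER})?"         # optional fraction
    rf"(?:[eE][+-]?{UNSIGNED_INTEGER})?"  # optional exponent
)
NUMBER_RE = re.compile(UNSIGNED_NUMBER)

def is_numeric_literal(text):
    """True if the whole string is generated by the assumed expression."""
    return NUMBER_RE.fullmatch(text) is not None

for sample in ["42", "3.14", "6e23", "1.5E-3", "3.", ".5", "3..5"]:
    print(sample, is_numeric_literal(sample))
```

Under this assumed definition a digit is required on both sides of the decimal point, so `3.` and `.5` are rejected.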

  5. Context-Free Grammars • Context-Free Grammars: • Developed by Noam Chomsky in the mid-1950s. • Language generators, meant to describe the syntax of natural languages. • Define a class of languages called context-free languages. • Backus-Naur Form (1959): • Invented by John Backus to describe Algol 58. • BNF is equivalent to context-free grammars (CFGs). • A CFG consists of: • A set of terminals T. • A set of non-terminals N. • A start symbol S (a non-terminal). • A set of productions.

  6. BNF Fundamentals • In BNF, abstractions are used to represent classes of syntactic structures: they act like syntactic variables (also called nonterminal symbols, or just nonterminals). • Terminals are lexemes or tokens. • A rule has a left-hand side (LHS), which is a nonterminal, and a right-hand side (RHS), which is a string of terminals and/or nonterminals. • Nonterminals are often italic or enclosed in angle brackets. • Examples of BNF rules: • <ident_list> → identifier | identifier, <ident_list> • <if_stmt> → if <logic_expr> then <stmt> • Grammar: a finite non-empty set of rules. • A start symbol is a special element of the nonterminals of a grammar.
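The generator and recognizer views of a grammar can both be sketched for the `<ident_list>` rule from slide 6. This is a minimal Python illustration, not the lecture's code: the dictionary `GRAMMAR` and the functions `generate` and `recognize` are hypothetical names, and a sentence is represented as a list of terminal symbols.

```python
# <ident_list> -> identifier | identifier , <ident_list>
GRAMMAR = {
    "<ident_list>": [
        ["identifier"],
        ["identifier", ",", "<ident_list>"],
    ],
}

def generate(symbols, depth):
    """Yield every sentence derivable from `symbols` in at most `depth`
    rule applications, always expanding the leftmost nonterminal."""
    for i, sym in enumerate(symbols):
        if sym in GRAMMAR:
            if depth == 0:
                return  # out of budget: abandon this derivation
            for rhs in GRAMMAR[sym]:
                yield from generate(symbols[:i] + rhs + symbols[i + 1:], depth - 1)
            return
    yield symbols  # no nonterminals left: this is a sentence

def recognize(tokens):
    """Decide whether `tokens` is a sentence of <ident_list>."""
    if not tokens or tokens[0] != "identifier":
        return False
    if len(tokens) == 1:
        return True
    return tokens[1] == "," and recognize(tokens[2:])

for sentence in generate(["<ident_list>"], 3):
    print(" ".join(sentence), recognize(sentence))
```

Every sentence the generator produces is accepted by the recognizer, while a string such as `identifier ,` is rejected.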

  7. Context-Free Grammar Example • Expression grammar with precedence and associativity:

  8. Parse Tree Example 1 • Parse tree for expression grammar (with precedence) for 3 + 4 * 5

  9. Parse Tree Example 2 • Parse tree for expression grammar (with left associativity) for 10 - 4 - 3
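The tree diagrams for slides 8 and 9 are not in the transcript. As a rough stand-in, the Python sketch below parses the same two inputs using the `expr` / `term` / `factor` structure that appears in the ANTLR grammar on slide 11. Two assumptions to note: the left-recursive rules are replaced by loops (a standard hand-written substitute), and the result is a simplified nested-tuple tree rather than a full parse tree. The function name `parse` is illustrative.

```python
import re

def parse(text):
    """Parse an arithmetic expression into nested (op, left, right) tuples."""
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|\S", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():  # expr: term | expr ADD_OP term
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())  # earlier result becomes the left child
        return node

    def term():  # term: factor | term MULT_OP factor
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():  # factor: ID | NUMBER | '-' factor | '(' expr ')'
        tok = advance()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "-":
            return ("neg", factor())
        if tok == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok[0].isalnum() or tok[0] == "_":
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse("3 + 4 * 5"))   # ('+', '3', ('*', '4', '5'))
print(parse("10 - 4 - 3"))  # ('-', ('-', '10', '4'), '3')
```

Multiplication ends up deeper in the tree than addition because `term` is nested below `expr` (precedence), and `10 - 4` is grouped first because each loop folds its earlier result into the left operand (left associativity).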

  10. Using ANTLR • Syntax similar to CFG. • Non-terminal symbols: lower case letters. • Terminal symbols: upper case letters. • An example of rule syntax (parsing): expr: ID | NUMBER | '-' expr | '(' expr ')' | expr OP expr; • An example of rules used for tokens (scanning): OP: '+' | '-' | '*' | '/';

  11. ANTLR Grammar for Example 2.8
      grammar Example2b;
      expr: term | expr ADD_OP term;
      term: factor | term MULT_OP factor;
      factor: ID | NUMBER | '-' factor | '(' expr ')' ;
      ADD_OP: '+' | '-' ;
      MULT_OP: '*' | '/' ;
      ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;
      NUMBER: INTEGER | REAL ;
      fragment INTEGER : '0'..'9'+ ;
      REAL : ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
           | '.' ('0'..'9')+ EXPONENT?
           | ('0'..'9')+ EXPONENT ;
      EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;

  12. Scanner Responsibilities • Tokenizing source. • Removing comments. • (Often) dealing with pragmas (i.e., significant comments). • Saving text of identifiers, numbers, strings. • Saving source locations (file, line, column) for error messages.

  13. Scanning Example I • Suppose we are building an ad-hoc (hand-written) scanner for Pascal: • We read the characters one at a time with look-ahead. • If it is one of the one-character tokens { ( ) [ ] < > , ; = + - etc. } we announce that token. • If it is a ., we look at the next character: • If that is a dot, we announce .. • Otherwise, we announce . and reuse the look-ahead.

  14. Scanning Example II • If it is a <, we look at the next character: • If that is an =, we announce <= • Otherwise, we announce < and reuse the look-ahead, etc. • If it is a letter, we keep reading letters and digits and maybe underscores until we can't anymore: • Then we check to see if it is a reserved word. • If it is a digit, we keep reading until we find a non-digit: • If that is not a . we announce an integer. • Otherwise, we keep looking for a real number. • If the character after the . is not a digit we announce an integer and reuse the . and the look-ahead.
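The procedure on slides 13 and 14 can be sketched as a hand-written scanner. This is a minimal Python illustration covering only the cases the slides walk through, under several assumptions: the token names (`dot`, `dotdot`, `le`, `lt`, and so on), the small `RESERVED` set, and the function name `scan` are all hypothetical, and comments, strings, and exponents are left out.

```python
RESERVED = {"if", "then", "else", "begin", "end", "while", "do"}  # illustrative subset
ONE_CHAR_TOKENS = set("()[]>,;=+-*/")

def scan(source):
    """Return (token, lexeme) pairs for a small Pascal-like subset."""
    tokens = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch == ".":
            # Look at the next character to tell ".." from "."
            if i + 1 < n and source[i + 1] == ".":
                tokens.append(("dotdot", ".."))
                i += 2
            else:
                tokens.append(("dot", "."))
                i += 1
        elif ch == "<":
            if i + 1 < n and source[i + 1] == "=":
                tokens.append(("le", "<="))
                i += 2
            else:
                tokens.append(("lt", "<"))
                i += 1
        elif ch.isalpha():
            start = i
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            word = source[start:i]
            kind = "reserved" if word.lower() in RESERVED else "identifier"
            tokens.append((kind, word))
        elif ch.isdigit():
            start = i
            while i < n and source[i].isdigit():
                i += 1
            # Continue into a real only if the "." is followed by a digit;
            # otherwise the "." is left in the input for the next token.
            if i + 1 < n and source[i] == "." and source[i + 1].isdigit():
                i += 1
                while i < n and source[i].isdigit():
                    i += 1
                tokens.append(("real", source[start:i]))
            else:
                tokens.append(("integer", source[start:i]))
        elif ch in ONE_CHAR_TOKENS:
            tokens.append((ch, ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens

print(scan("if a <= 3.14 then b[1..5]"))
```

"Reusing the look-ahead" shows up here as simply not advancing `i` past a character that turned out to belong to the next token.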

  15. Deterministic Finite Automaton • Pictorial representation of a scanner for calculator tokens, in the form of a finite automaton. • This is a deterministic finite automaton (DFA): • Lex, scangen, ANTLR, etc. build these things automatically from a set of regular expressions. • Specifically, they construct a machine that accepts the language.

  16. The Longest Possible Token Rule • We scan over and over to get one token after another. • Nearly universal rule: always take the longest possible token from the input, thus: foobar is foobar and never f or foo or foob. • The rule means you return only when the next character can't be used to continue the current token: • The next character will generally be saved for the next token. • In some cases, you may need to peek at more than one character of look-ahead in order to know whether to proceed: • In Pascal, for example, when you have a 3 and you see a dot: • Do you proceed (in hopes of getting 3.14)? or • Do you stop (in fear of getting 3..5)? • Regular expressions "generate" a regular language. • DFAs "recognize" a regular language.
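One way to see the longest-possible-token rule in action is to try every token pattern at the current position and keep whichever match ends furthest to the right. The Python sketch below does this with the standard `re` module; the pattern list, token names, and function names are assumptions made for illustration, and real scanner generators achieve the same effect with a single DFA rather than repeated matching.

```python
import re

# Illustrative token patterns; on a tie, the earlier entry wins.
PATTERNS = [
    ("real", re.compile(r"\d+\.\d+")),
    ("integer", re.compile(r"\d+")),
    ("dotdot", re.compile(r"\.\.")),
    ("dot", re.compile(r"\.")),
    ("identifier", re.compile(r"[A-Za-z_]\w*")),
]

def longest_match(text, pos):
    """Return (kind, end) for the longest pattern match at pos, or None."""
    best = None
    for kind, pattern in PATTERNS:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[1]):
            best = (kind, m.end())
    return best

def tokenize(text):
    """Split text into (kind, lexeme) pairs, longest token first."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = longest_match(text, pos)
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        kind, end = best
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens

print(tokenize("foobar"))  # one identifier, never f / foo / foob
print(tokenize("3.14"))    # one real
print(tokenize("3..5"))    # integer, dotdot, integer
```

For `3..5` the `real` pattern fails because the first dot is not followed by a digit, so the scanner stops at `3`; that check is the extra character of look-ahead the slide describes.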

  17. Building Scanners • Scanners tend to be built three ways: • Ad-hoc. • Semi-mechanical pure DFA (usually as nested case statements). • Table-driven DFA. • Ad-hoc generally yields the fastest, most compact code by doing lots of special-purpose things, though good automatically-generated scanners come very close. • Writing a pure DFA as a set of nested case statements is a surprisingly useful programming technique (Figure 12.1): • It is often easier to use perl, awk, sed or similar tools. • Table-driven DFA is what lex and scangen produce: • lex (flex): C code • scangen: numeric tables and a separate driver (Figure 2.12). • ANTLR: Java code.
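The table-driven style can be sketched with a transition table plus a small generic driver. The Python below is an illustration under stated assumptions, not the output of lex or scangen: the four-state DFA recognizes only unsigned integers and simple reals, and the names `TRANSITIONS`, `ACCEPTING`, `char_class`, and `next_token` are hypothetical.

```python
START, INT, DOT, FRAC = range(4)

# Accepting states and the token each one announces.
ACCEPTING = {INT: "integer", FRAC: "real"}

# (state, character class) -> next state; a missing entry means "no move".
TRANSITIONS = {
    (START, "digit"): INT,
    (INT, "digit"): INT,
    (INT, "dot"): DOT,
    (DOT, "digit"): FRAC,
    (FRAC, "digit"): FRAC,
}

def char_class(ch):
    """Collapse characters into the classes used as table columns."""
    if ch.isdigit():
        return "digit"
    if ch == ".":
        return "dot"
    return "other"

def next_token(text, pos):
    """Run the DFA from pos and return (kind, lexeme, new_pos) for the
    longest accepted prefix, or None if no prefix is accepted."""
    state = START
    last_accept = None
    i = pos
    while i < len(text):
        state = TRANSITIONS.get((state, char_class(text[i])))
        if state is None:
            break
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)  # remember where to back up to
    if last_accept is None:
        return None
    kind, end = last_accept
    return kind, text[pos:end], end

print(next_token("3.14+2", 0))  # ('real', '3.14', 4)
print(next_token("3..5", 0))    # ('integer', '3', 1)
```

The driver never mentions digits or dots directly, so changing the language only means changing the tables. Remembering the last accepting position lets it back up over the dot in `3..5`.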

  18. Summary • BNF and context-free grammars are equivalent meta-languages that are well-suited for describing the syntax of programming languages. • Syntax analysis is a common part of language implementation. • Scanners (lexical analyzers) use pattern matching to isolate small-scale parts of a program. • ANTLR provides support for scanners (lexers), parsers, and tree-parsers.
