
Lexical Analysis


Presentation Transcript


  1. Lexical Analysis Cheng-Chia Chen

  2. Outline • The goal and niche of lexical analysis in a compiler • Lexical tokens • Regular expressions (RE) • Use of regular expressions in lexical specification • Finite automata (FA) • DFA and NFA • from RE to NFA • from NFA to DFA • from DFA to optimized DFA • Lexical-analyzer generators

  3. The goal and niche of lexical analysis • The compiler pipeline: Source (char stream) → Lexical analysis → Tokens (token stream) → Parsing → Interm. Language → Optimization → Code Gen. → Machine Code • Goal of lexical analysis: breaking the input into individual words or “tokens”

  4. Lexical Analysis • What do we want to do? Example: if (i == j) z = 0; else z = 1; • The input is just a sequence of characters: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Goal: Partition the input string into substrings • And determine the categories (token types) to which the substrings belong
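
A minimal Python sketch of this partitioning (illustrative only: the token-type names and patterns below are mine, not the lecture's):

```python
import re

KEYWORDS = {"if", "else"}
TOKEN_SPEC = [
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("Relation",   r"=="),          # listed before "=" so "==" is preferred
    ("Assign",     r"="),
    ("OpenPar",    r"\("),
    ("ClosePar",   r"\)"),
    ("Semi",       r";"),
    ("Whitespace", r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(chars):
    """Partition a character stream into (token type, lexeme) pairs."""
    for m in MASTER.finditer(chars):
        ttype, lexeme = m.lastgroup, m.group()
        if ttype == "Identifier" and lexeme in KEYWORDS:
            ttype = lexeme          # keywords get their own token type
        yield ttype, lexeme

for pair in tokenize("\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"):
    print(pair)
```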

  5. 2. Lexical Tokens • What’s a token? • Token attributes • Normal tokens and special tokens • Examples of tokens and special tokens

  6. What’s a token? • A sequence of characters that can be treated as a unit in the grammar of a PL. The output of lexical analysis is a stream of tokens. • Tokens are partitioned into categories called token types. Examples: • In English: • book, students, like, help, strong, … : tokens; noun, verb, adjective, … : token types • In a programming language: • student, var34, 345, if, class, “abc”, … : tokens • ID, Integer, IF, WHILE, Whitespace, … : token types • The parser relies on token types rather than on individual tokens: • var32 and var1 are treated the same, • var32 (ID), 32 (Integer) and if (IF) are treated differently.

  7. Token attributes • token type : • category of the token; used by syntax analysis. • ex: identifier, integer, string, if, plus, … • token value : • semantic value used in semantic analysis. • ex: [integer, 26], [string, “26”] • token lexeme (member, text): • textual content of a token • [while, “while”], [identifier, “var23”], [plus, “+”],… • positional information: • start/end line/position of the textual content in the source program.
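
As a sketch, these four attributes map naturally onto a small record type; the field names below are illustrative, not prescribed by the lecture:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Token:
    type: str    # token type, used by syntax analysis (e.g. "integer")
    value: Any   # semantic value, used by semantic analysis (e.g. 26)
    lexeme: str  # textual content of the token (e.g. "26")
    line: int    # positional information for error handling
    column: int

t = Token(type="integer", value=26, lexeme="26", line=3, column=10)
print(t)
```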

  8. Notes on Token attributes • Token types affect syntax analysis • Token values affect semantic analysis • lexeme and positional information affect error handling • Only token type information must be supplied by the lexical analyzer. • Any program performing lexical analysis is called a scanner (lexer, lexical analyzer).

  9. Aspects of Token types • Language view: A token type is the set of all lexemes of all its token instances. • ID = {a, ab, … } – {if, do,…}. • Integer = { 123, 456, …} • IF = {if}, WHILE={while}; • STRING={“abc”, “if”, “WHILE”,…} • Pattern (regular expression): a rule defining the language of all instances of a token type. • WHILE: w h i l e • ID: letter (letters | digits )* • ArithOp: + | - | * | /
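
In Python's re notation (used here only as a convenient stand-in for the formal regular expressions), a pattern defines the language of its token type via whole-string matching:

```python
import re

ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")   # letter (letters | digits)*
WHILE = re.compile(r"while")

assert WHILE.fullmatch("while")            # "while" is in L(WHILE)
assert ID.fullmatch("var34")               # "var34" is in L(ID)
assert not ID.fullmatch("345")             # digit-leading strings are not
# Note: "while" also matches ID; excluding keywords from ID is usually
# handled by rule priority (see the ambiguity rules later), not by the RE.
assert ID.fullmatch("while")
```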

  10. Lexical Analyzer: Implementation • An implementation must do two things: • Recognize substrings corresponding to lexemes of tokens • Determine token attributes • type is necessary • value depends on the type/application • lexeme/positional information depends on the application (e.g., debugging or not)

  11. Example • Input line: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Token-lexeme pairs returned by the lexer: • [Whitespace, “\t”] • [if, - ] • [OpenPar, “(”] • [Identifier, “i”] • [Relation, “==”] • [Identifier, “j”] • …

  12. Normal Tokens and Special Tokens • Kinds of tokens • normal tokens: needed for later syntax analysis and must be passed to the parser. • special tokens • skipped tokens (or nontokens): • do not contribute to parsing, • discarded by the scanner. • Examples: Whitespace, Comments • Why are they needed at all? • Question: What happens if we remove all whitespace and all comments prior to scanning?

  13. Lexical Analysis in FORTRAN • FORTRAN rule: Whitespace is insignificant • E.g., VAR1 is the same as VA R1 • Footnote: FORTRAN’s whitespace rule was motivated by the inaccuracy of punch card operators

  14. A terrible design! Example • Consider • DO 5 I = 1,25 • DO 5 I = 1.25 • The first is DO 5 I = 1 , 25 (a DO loop) • The second is DO5I = 1.25 (an assignment) • Reading left-to-right, we cannot tell whether DO5I is a variable or a DO statement until the “,” (or “.”) is reached

  15. Lexical Analysis in FORTRAN. Lookahead. • Two important points: • The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time • “Lookahead” may be required to decide where one token ends and the next token begins • Even our simple example has lookahead issues: i vs. if, = vs. ==

  16. Some token types of a typical PL

  17. Some Special Tokens • Of the special tokens in the table: 1 and 5 are skipped; 2 and 3 need preprocessing; 4 needs to be expanded.

  18. 3. Regular expressions and Regular Languages

  19. The geography of lexical tokens • Within the set of all strings, each token type carves out its own region: • ID: var1, last5, … • NUM: 23 56 0 000 • REAL: 12.35 2.4e–10 … • IF: if • LPAREN: ( • RPAREN: ) • special tokens: \t \n /* … */

  20. Issues • Definition problem: • how to define (formally specify) the set of strings (tokens) belonging to a token type? • => regular expressions • Recognition problem: • how to determine which set (token type) an input string belongs to? • => DFA!

  21. Languages • Def. Let Σ be a set of symbols (or characters). • A language over Σ is a set of strings of characters drawn from Σ • (Σ is called the alphabet)

  22. Examples of Languages • Alphabet = English characters; Language = English words • Not every string of English characters is an English word: • likes, school, … are words; beee, yykk, … are not. • Alphabet = ASCII; Language = C programs • Note: the ASCII character set is different from the English character set

  23. Regular Expressions • A language (metalanguage) for representing (or defining) languages (sets of words) • Definition: Let Σ be an alphabet. The set of regular expressions (RegExpr) over Σ is defined recursively as follows: • (Atomic RegExpr): 1. any symbol c ∈ Σ is a RegExpr. • 2. ε (the empty string) is a RegExpr. • (Compound RegExpr): if A and B are RegExpr, then so are 3. (A | B) (alternation) 4. (A • B) (concatenation) 5. A* (repetition)

  24. Semantics (Meaning) of regular expressions • For each regular expression A, we use L(A) to denote the language defined by A. • I.e. L is the function L: RegExpr(Σ) → the set of languages over Σ, with L(A) = the language denoted by RegExpr A • The meaning of RegExpr can be made clear by explicitly defining L.

  25. Atomic Regular Expressions • 1. Single symbol: c L(c) = { c } (for any c ∈ Σ) • 2. Epsilon (empty string): ε L(ε) = { ε }

  26. Compound Regular Expressions • 3. Alternation (or union or choice): L( (A | B) ) = { s | s ∈ L(A) or s ∈ L(B) } • 4. Concatenation: AB (where A and B are reg. exp.) L( (A • B) ) = L(A) • L(B) =def { ab | a ∈ L(A) and b ∈ L(B) } • Note: • Parentheses enclosing (A|B) and (A•B) can be omitted if there is no worry of confusion. • A • B (set concatenation) and a • b (string concatenation) will be abbreviated to AB and ab, respectively. • AA and L(A) • L(A) are abbreviated as A² and L(A)², respectively.

  27. Examples • if | then | else → { if, then, else } • 0 | 1 | … | 9 → { 0, 1, …, 9 } • (0 | 1) (0 | 1) → { 00, 01, 10, 11 }

  28. More Compound Regular Expressions • 5. Repetition (or iteration): A* L(A*) = { ε } ∪ L(A) ∪ L(A)² ∪ L(A)³ ∪ … • Examples: • 0* : { ε, 0, 00, 000, … } • 10* : strings starting with 1 and followed by 0’s. • (0|1)* 0 : binary even numbers. • (a|b)* aa (a|b)* : strings of a’s and b’s containing consecutive a’s. • b* (abb*)* (a|ε) : strings of a’s and b’s with no consecutive a’s.
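
These iteration examples can be sanity-checked with Python's re, where fullmatch tests whole-string membership (the patterns are direct transliterations of the REs above; (a|) plays the role of (a|ε)):

```python
import re

even_binary = re.compile(r"[01]*0")       # (0|1)*0 : ends in 0, i.e. even
assert even_binary.fullmatch("10")        # 2 is even
assert even_binary.fullmatch("110")       # 6 is even
assert not even_binary.fullmatch("101")   # 5 is odd

no_aa = re.compile(r"b*(abb*)*(a|)")      # b*(abb*)*(a|ε): no consecutive a's
assert no_aa.fullmatch("ababb")
assert not no_aa.fullmatch("baab")        # contains "aa"
```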

  29. Example: Keyword • Keyword: else or if or begin… else | if | begin | …

  30. Example: Integers • Integer: a non-empty string of digits ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )* • Problem: reusing such a complicated expression everywhere is tedious • Improvement: define intermediate reg. expr. digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 number = digit digit* • Abbreviation: A+ = A A*
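
The intermediate-definition idea can be sketched in Python by composing pattern strings, so digit is written once and reused:

```python
import re

digit  = r"[0-9]"               # digit  = 0 | 1 | ... | 9
number = digit + digit + "*"    # number = digit digit*  (i.e. digit+)

assert re.fullmatch(number, "007")
assert not re.fullmatch(number, "")   # non-empty: at least one digit
```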

  31. Regular Definitions • Names for regular expressions: • d1 = r1 • d2 = r2 • ... • dn = rn where each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di−1} • Note: recursion is not allowed.

  32. Example • Identifier: strings of letters or digits, starting with a letter digit = 0 | 1 | ... | 9 letter = A | … | Z | a | … | z identifier = letter (letter | digit) * • Is (letter* | digit*) the same ?
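
The closing question can be settled experimentally; a small sketch showing that (letter* | digit*) is not the same language as letter (letter | digit)*:

```python
import re

letter, digit = "[A-Za-z]", "[0-9]"
identifier = f"{letter}({letter}|{digit})*"
candidate  = f"({letter}*|{digit}*)"   # the alternative from the slide

assert re.fullmatch(identifier, "var34")
assert not re.fullmatch(candidate, "var34")  # cannot mix letters and digits
assert re.fullmatch(candidate, "345")        # accepts digit-only strings
assert re.fullmatch(candidate, "")           # and the empty string
```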

  33. Example: Whitespace • Whitespace: a non-empty sequence of blanks, tabs, newlines, and CRLF sequences WS = (\ | \t | \n | \r\n )+

  34. Example: Email Addresses • Consider chencc@cs.nccu.edu.tw • Σ = letters ∪ { . , @ } • name = letter+ • address = name ‘@’ name (‘.’ name)*
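
Transliterated into Python's re syntax (the quoted '@' and '.' become a literal and an escape), the definition accepts the example address; a quick check:

```python
import re

name = r"[A-Za-z]+"                     # name = letter+
address = rf"{name}@{name}(\.{name})*"  # address = name '@' name ('.' name)*

assert re.fullmatch(address, "chencc@cs.nccu.edu.tw")
assert not re.fullmatch(address, "chencc@")
```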

  35. Notational Shorthands • One or more instances: • r+ = r r* • r* = (r+ | ε) • Zero or one instance: • r? = (r | ε) • Character classes: • [abc] = a | b | c • [a-z] = a | b | ... | z • [ac-f] = a | c | d | e | f • [^ac-f] = Σ − [ac-f]
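
These shorthands carry over almost verbatim to Python's re syntax (here Σ is taken to be the ASCII characters); a quick check:

```python
import re

assert re.fullmatch(r"colou?r", "color")   # r? = (r | ε)
assert re.fullmatch(r"[abc]", "b")         # [abc] = a | b | c
assert re.fullmatch(r"[ac-f]", "e")        # [ac-f] = a | c | d | e | f
assert not re.fullmatch(r"[^ac-f]", "d")   # complement excludes d
assert re.fullmatch(r"[^ac-f]", "z")       # but allows anything else
```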

  36. Summary • Regular expressions describe many useful languages • Regular languages are a language specification • We still need an implementation • Problem: Given a string s and a rexp R, is s ∈ L(R)?

  37. 4. Use Regular expressions in lexical specification

  38. Goal • Specifying lexical structure using regular expressions

  39. Regular Expressions in Lexical Specification • Last lecture: specifying all lexemes of a single token type with a regular expression. • But we want a specification of all lexemes of all token types in a programming language, • which may enable us to partition the input into lexemes. • We will adapt regular expressions to this goal.

  40. Regular Expressions => Lexical Spec. (1) • Select a set of token types • Number, Keyword, Identifier, ... • Write a rexp for the lexemes of each token type • Number = digit+ • Keyword = if | else | … • Identifier = letter (letter | digit)* • LParen = ‘(’ • …

  41. Regular Expressions => Lexical Spec. (2) • Construct R, matching all lexemes for all tokens R = Keyword | Identifier | Number | … = R1 | R2 | R3 | … • Facts: If s ∈ L(R) then s is a lexeme • Furthermore s ∈ L(Ri) for some “i” • This “i” determines the token type that is reported

  42. Regular Expressions => Lexical Spec. (3) 4. Let the input be x1…xn (x1 ... xn are symbols in the language alphabet) • For 1 ≤ i ≤ n, check x1…xi ∈ L(R) ? 5. It must be that x1…xi ∈ L(Rj) for some j 6. Remove t = x1…xi from the input; if t is a normal token, pass it to the parser // else it is whitespace or a comment, just skip it! 7. Go to (4)

  43. Ambiguities (1) • There are ambiguities in the algorithm • How much input is used? What if • x1…xi ∈ L(R) and also • x1…xk ∈ L(R) for some i ≠ k • Rule: Pick the longest possible substring • The longest match principle!

  44. Ambiguities (2) • Which token is used? What if • x1…xi ∈ L(Rj) and also • x1…xi ∈ L(Rk) • Rule: use the rule listed first (j if j < k) • Earlier rule first! • Example: • R1 = Keyword and R2 = Identifier • “if” matches both. • Treat “if” as a keyword, not an identifier
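
Putting slides 42-44 together, here is a Python sketch of the scanning loop that implements both disambiguation rules: longest match wins, and on ties the earlier rule wins. The rule set and names are illustrative, not the lecture's:

```python
import re

RULES = [                                  # earlier rules have priority
    ("Keyword",    re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number",     re.compile(r"[0-9]+")),
    ("Operator",   re.compile(r"==|=|\(|\)|;")),
    ("Whitespace", re.compile(r"[ \t\n]+")),
]
SKIP = {"Whitespace"}                      # special tokens: not passed on

def scan(text):
    i = 0
    while i < len(text):
        best_len, best_type = 0, None
        for ttype, rx in RULES:
            m = rx.match(text, i)
            # strictly longer match wins; ties go to the earlier rule
            if m and m.end() - i > best_len:
                best_len, best_type = m.end() - i, ttype
        if best_type is None:
            # crude stand-in for the Error rule of the next slide
            raise SyntaxError(f"no rule matches at position {i}")
        if best_type not in SKIP:          # normal tokens go to the parser
            yield best_type, text[i:i + best_len]
        i += best_len

print(list(scan("if (iffy == 1) x = 0;")))
# "if" is a Keyword (earlier rule on a tie with Identifier);
# "iffy" is an Identifier (longest match beats the Keyword prefix "if")
```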

  45. Error Handling • What if no rule matches a prefix of the input? • Problem: Can’t just get stuck … • Solution: • Write a rule matching all “bad” strings • Put it last • Lexer tools allow the writing of: R = R1 | ... | Rn | Error • Token Error matches if nothing else matches

  46. Summary • Regular expressions provide a concise notation for string patterns • Their use in lexical analysis requires small extensions • To resolve ambiguities • To handle errors • Efficient algorithms exist (next) • Require only a single pass over the input • Few operations per character (table lookup)

  47. 5. Finite Automata • Regular expressions = specification • Finite automata = implementation • A finite automaton consists of • An input alphabet Σ • A finite set of states S • A start state n • A set of accepting states F ⊆ S • A set of transitions: state →(input) state

  48. Finite Automata • Transition s1 →a s2 • is read: in state s1, on input “a”, go to state s2 • At end of input (or if no transition is possible): • If in an accepting state => accept • Otherwise => reject

  49. Finite Automata State Transition Graphs • A state: a circle • The start state: a circle with an incoming arrow • An accepting state: a double circle • A transition: an arrow between states, labeled with an input symbol (e.g., a)

  50. A Simple Example • A finite automaton that accepts only “1”: a single transition labeled 1 from the start state to an accepting state
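
A direct Python encoding of the five components from slide 47, instantiated with this automaton (the state names A and B are my own labels; the slide only draws the states):

```python
# delta is a partial function: a missing entry means "no transition => reject".
DFA = {
    "start":  "A",
    "accept": {"B"},
    "delta":  {("A", "1"): "B"},   # the only transition: A --1--> B
}

def run(dfa, s):
    state = dfa["start"]
    for ch in s:
        if (state, ch) not in dfa["delta"]:
            return False                     # no transition possible
        state = dfa["delta"][(state, ch)]
    return state in dfa["accept"]            # accept iff in an accepting state

assert run(DFA, "1")
assert not run(DFA, "") and not run(DFA, "0") and not run(DFA, "11")
```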
