
Lexical Analysis



  1. Lexical Analysis Natawut Nupairoj, Ph.D. Department of Computer Engineering Chulalongkorn University

  2. Outline • Overview. • Token, Lexeme, and Pattern. • Lexical Analysis Specification. • Lexical Analysis Engine.

  3. Front-End Front-End Components [diagram] The Scanner reads the source program as a text stream (e.g. the characters m a i n ( ) { ...) and groups them into tokens (identifier main, symbol "("), handing one token to the Parser each time it asks for next-token. The Parser constructs the parse tree, and the Semantic Analyzer checks semantic/contextual constraints, producing the Intermediate Representation (file or in memory). All components share the Symbol Table.

  4. Tasks for Scanner • Read input and group tokens for the Parser. • Strip comments and white spaces. • Count line numbers. • Create entries in the symbol table. • Preprocessing functions.
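A minimal sketch of these housekeeping tasks, assuming a hand-written scanner in Python (the class and method names here are illustrative, not from the slides):

```python
# Sketch of the scanner housekeeping described on slide 4: strip white space
# and comments, and keep a line count for error reporting.
class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.line = 1          # current line number

    def skip_ws_and_comments(self):
        while self.pos < len(self.text):
            ch = self.text[self.pos]
            if ch == '\n':                                  # count line numbers
                self.line += 1
                self.pos += 1
            elif ch in ' \t':                               # strip white space
                self.pos += 1
            elif self.text.startswith('//', self.pos):      # strip a line comment
                while self.pos < len(self.text) and self.text[self.pos] != '\n':
                    self.pos += 1
            else:
                break
```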

  5. Benefits • Simpler design • the parser doesn’t worry about comments and white spaces. • More efficient scanner • optimize only the scanning process. • use specialized buffering techniques. • Portability • handle standard symbols on different platforms.

  6. Basic Terminology • Token • a set of strings • Ex: token = identifier • Lexeme • a sequence of characters in the source program matched by the pattern for a token. • Ex: lexeme = counter

  7. Basic Terminology • Pattern • a description of the strings that can belong to a particular token set. • Ex: pattern = a letter followed by letters or digits {A,…,Z,a,…,z}{A,…,Z,a,…,z,0,…,9}*

  8. Token / Lexeme / Pattern • const / const / const • if / if / if • relation / <, <=, …, >= / comparison symbols • id / counter, x, y / letter (letter | digit)* • num / 12.53, 1.42E-10 / any numeric constant • literal / “Hello World” / characters between “ ”
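To make the token/lexeme/pattern relationship concrete, here is a small Python illustration; the regular expressions are one possible encoding of the patterns in the table above, not definitions from the slides:

```python
import re

# token name -> pattern; the text matched by the pattern is the lexeme
patterns = {
    'id':       r'[A-Za-z][A-Za-z0-9]*',          # letter (letter | digit)*
    'num':      r'\d+(\.\d+)?([Ee][+-]?\d+)?',    # any numeric constant
    'relation': r'<=|>=|<>|<|>|=',                # comparison symbols
    'literal':  r'"[^"]*"',                       # characters between " "
}

samples = {'id': 'counter', 'num': '12.53', 'relation': '<=', 'literal': '"Hello World"'}
for token, text in samples.items():
    lexeme = re.fullmatch(patterns[token], text).group(0)
    print(f'token={token}  lexeme={lexeme}  pattern={patterns[token]}')
```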

  9. Language and Lexical Analysis • Fixed-format input • e.g. FORTRAN • must consider the alignment of a lexeme. • difficult to scan. • No reserved words • e.g. PL/I • keywords vs. id ? -- complex rules. if if = then then then := else; else else := then;

  10. Regular Expression Revisited • ε is a regular expression that denotes {ε}. • If a is a symbol in the alphabet, a is a regular expression that denotes {a}. • Suppose r and s are regular expressions: • (r)|(s) denotes L(r) ∪ L(s). • (r)(s) denotes L(r)L(s). • (r)* denotes (L(r))*

  11. Precedence of Operators • Levels of precedence (highest first) • Kleene closure (*) • concatenation • union (|) • All operators are left associative. • Ex: a*b | cd* = ((a*)b) | (c(d*))

  12. Regular Definition • A sequence of definitions: d1 → r1 d2 → r2 ... dn → rn • di is a distinct name • ri is a regular expression over Σ ∪ {d1, …, di-1}

  13. Examples letter → A | B | … | Z | a | b | … | z digit → 0 | 1 | … | 9 id → letter ( letter | digit )* digits → digit digit* opt_fraction → . digits | ε opt_exponent → ( E ( + | - | ε ) digits ) | ε num → digits opt_fraction opt_exponent

  14. Notational Shorthands • One or more instances • r+ = rr* • Zero or one instance • r? = r | ε • (rs)? = rs | ε • Character Class • [A-Za-z] = A | B | … | Z | a | b | … | z

  15. Examples digit → [0-9] digits → digit+ opt_fraction → ( . digits )? opt_exponent → ( E ( + | - )? digits )? num → digits opt_fraction opt_exponent id → [A-Za-z][A-Za-z0-9]*
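These shorthand definitions translate almost one for one into a common regex dialect; a sketch using Python's re module (an assumption of this example, not part of the slides):

```python
import re

# Regular definitions from slide 15, built up exactly as written there.
digit        = r'[0-9]'
digits       = digit + r'+'
opt_fraction = r'(\.' + digits + r')?'
opt_exponent = r'(E[+-]?' + digits + r')?'
num          = digits + opt_fraction + opt_exponent
ident        = r'[A-Za-z][A-Za-z0-9]*'               # id

print(re.fullmatch(num, '31.02E-15') is not None)    # True
print(re.fullmatch(ident, 'counter') is not None)    # True
```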

  16. Recognition of Tokens • Consider the tokens of the grammar. • token • pattern • attribute • Draw NFAs, with retraction (*) where the last character read is not part of the token.

  17. Example : Grammar stmt ::= if expr then stmt | if expr then stmt else stmt | expr expr ::= term relop term | term term ::= id | num

  18. Example : Regular Definition if → if then → then else → else relop → < | <= | = | <> | > | >= id → letter (letter | digit)* num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )? delim → blank | tab | newline ws → delim+
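A small regex-based scanner for exactly this token set might look like the sketch below (an illustration, not the transition-diagram engine built on the following slides; the helper names are made up). Applied to `if count >= 0 then` it yields the token stream shown on slide 20:

```python
import re

# Regular definitions from slide 18, encoded as (token, pattern) pairs.
# Order matters: ws is discarded, <= and >= must be tried before < and >.
token_spec = [
    ('ws',    r'[ \t\n]+'),
    ('num',   r'\d+(\.\d+)?(E[+-]?\d+)?'),
    ('relop', r'<=|>=|<>|<|>|='),
    ('id',    r'[A-Za-z][A-Za-z0-9]*'),
]
keywords = {'if', 'then', 'else'}

def tokenize(text):
    pos = 0
    while pos < len(text):
        for name, pattern in token_spec:
            m = re.match(pattern, text[pos:])
            if m:
                lexeme = m.group(0)
                pos += len(lexeme)
                if name == 'ws':
                    break                      # strip white space
                if name == 'id' and lexeme in keywords:
                    yield (lexeme, None)       # keyword: token type is the lexeme
                else:
                    yield (name, lexeme)
                break
        else:
            raise SyntaxError(f'unexpected character {text[pos]!r}')

print(list(tokenize('if count >= 0 then')))
# [('if', None), ('id', 'count'), ('relop', '>='), ('num', '0'), ('then', None)]
```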

  19. Example: Pattern-Token-Attribute (Regular Expression / Token / Attribute-Value) • ws / - / - • if / if / - • then / then / - • else / else / - • id / id / index in table • num / num / index in table • < / relop / LT • <= / relop / LE • = / relop / EQ • <> / relop / NE • ...

  20. Attributes for Tokens if count >= 0 then ... <if, > <id, index for count in symbol table> <relop, GE> <num, integer value 0> <then, >

  21. NFA – Lexical Analysis Engine [transition diagram for relop, states 0–8] From start state 0: < followed by = gives return(relop, LE); < followed by > gives return(relop, NE); < followed by any other character retracts (*) and gives return(relop, LT); = gives return(relop, EQ); > followed by = gives return(relop, GE); > followed by any other character retracts (*) and gives return(relop, GT).
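The relop diagram above can be hand-coded as a small state machine; a sketch (the function name and return convention are illustrative):

```python
# Follows the relop transition diagram: on the *-marked edges the character
# just read belongs to the next token, so the scanner retracts (does not
# advance past it).
def scan_relop(text, pos):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    if pos >= len(text):
        return None
    c = text[pos]
    nxt = text[pos + 1] if pos + 1 < len(text) else ''
    if c == '<':
        if nxt == '=':
            return ('relop', 'LE'), pos + 2
        if nxt == '>':
            return ('relop', 'NE'), pos + 2
        return ('relop', 'LT'), pos + 1      # other: retract
    if c == '=':
        return ('relop', 'EQ'), pos + 1
    if c == '>':
        if nxt == '=':
            return ('relop', 'GE'), pos + 2
        return ('relop', 'GT'), pos + 1      # other: retract
    return None

print(scan_relop('count >= 0', 6))           # (('relop', 'GE'), 8)
```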

  22. Handle Numbers • Pattern for number contains options. num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )? • 31, 31.02, 31.02E-15 • Always get the longest possible match. • try to match the longest pattern first • if it does not match, try the next possible pattern.

  23. Handle Numbers [transition diagrams for num, states 12–27] One diagram per alternative, tried longest first: digit+ . digit+ E (+ | -)? digit+, then digit+ . digit+, then digit+; each accepting state retracts on any other character (*) and calls return(num, getnum()).
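The longest-match rule from slide 22 is what a greedy regex engine does by default; a short Python check (illustrative, not from the slides) shows that an incomplete exponent falls back to the longest pattern that did match:

```python
import re

# num -> digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
num_re = re.compile(r'\d+(\.\d+)?(E[+-]?\d+)?')

for text in ['31', '31.02', '31.02E-15', '31.02E']:
    print(text, '->', num_re.match(text).group(0))
# '31.02E' yields only '31.02': the exponent part fails after E,
# so the scanner keeps the longest match it already had.
```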

  24. Handle Keywords • Two approaches: • encode keywords into the NFA (if, then, etc.) • complex NFA (too many states). • use the symbol table • simple. • requires some tricks. [transition diagram for identifiers, states 9–11] letter followed by letters or digits; on any other character retract (*) and return(gettoken(), install_id()).

  25. Handle Keywords • Symbol table contains both the lexeme and the token type. • Initialize the symbol table with all keywords and their token types: lexeme if / token type if, lexeme then / token type then, lexeme else / token type else.

  26. Handle Keywords [diagram] Before scanning begins, the Scanner initializes the Symbol Table with one entry per keyword (lexeme and token type for if, then, else); the remaining entries stay free for identifiers.

  27. Handle Keywords • gettoken(): • If id is not found in the table, return token type ID. • Otherwise, return token type from the table.

  28. Handle Keywords [diagram] Scanning the source text "if count <= ...": the Scanner reads the lexeme if, calls gettoken(), finds it in the Symbol Table, and returns token type if to the Parser as next-token.

  29. Handle Keywords • install_id(): • If id is not found in the table, it’s a new id. INSERT NEW ID INTO TABLE and return pointer to the new entry. • If id is found and its type is ID, return pointer to that entry. • Otherwise, it’s a keyword. Return 0.
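Slides 25 to 29 fit together in a few lines; here is a sketch assuming a simple list-backed symbol table preloaded with the keywords (entry numbers echo slides 26 to 32, but the Python representation is just an illustration):

```python
# Entry 0 is reserved so install_id() can return 0 for keywords;
# entries 1-3 hold the keywords, as described on slide 25.
symtab = [None, ('if', 'if'), ('then', 'then'), ('else', 'else')]

def gettoken(lexeme):
    """Return the token type stored for lexeme, or 'id' if it is not in the table."""
    for lex, token_type in symtab[1:]:
        if lex == lexeme:
            return token_type
    return 'id'

def install_id(lexeme):
    """Return the table index for an identifier, or 0 if lexeme is a keyword."""
    for i, (lex, token_type) in enumerate(symtab[1:], start=1):
        if lex == lexeme:
            return i if token_type == 'id' else 0
    symtab.append((lexeme, 'id'))            # new id: insert and point to it
    return len(symtab) - 1

print(gettoken('if'), install_id('if'))          # if 0
print(gettoken('count'), install_id('count'))    # id 4
```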

  30. Handle Keywords [diagram] For the same lexeme if, install_id() finds a keyword entry in the Symbol Table and returns 0, so the if token carries no symbol-table pointer when it is passed to the Parser.

  31. Handle Keywords [diagram] Scanning the lexeme count: gettoken() does not find it in the Symbol Table (not found!), so it returns token type id.

  32. Handle Keywords [diagram] install_id() does not find count either, so it inserts a new entry (lexeme count, token type id) into the Symbol Table at entry 4 and returns that index as the attribute of the id token passed to the Parser.
