
Lexical Analysis



  1. Lexical Analysis Natawut Nupairoj, Ph.D. Department of Computer Engineering Chulalongkorn University

  2. Outline • Overview. • Token, Lexeme, and Pattern. • Lexical Analysis Specification. • Lexical Analysis Engine.

  3. Front-End Front-End Components [diagram] The Scanner reads the source program as a text stream (e.g. the characters m a i n ( ) { ...) and groups them into tokens (identifier main, symbol "("), handing one token to the Parser each time it asks for next-token. The Parser constructs the parse tree, and the Semantic Analyzer checks semantic/contextual constraints, producing the Intermediate Representation (file or in memory). All components share the Symbol Table.

  4. Tasks for Scanner • Read input and group tokens for the Parser. • Strip comments and white spaces. • Count line numbers. • Create entries in the symbol table. • Preprocessing functions.
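A minimal sketch of these housekeeping tasks, assuming a hand-written scanner in Python (the class and method names here are illustrative, not from the slides):

```python
# Sketch of the scanner housekeeping described on slide 4: strip white space
# and comments, and keep a line count for error reporting.
class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.line = 1          # current line number

    def skip_ws_and_comments(self):
        while self.pos < len(self.text):
            ch = self.text[self.pos]
            if ch == '\n':                                  # count line numbers
                self.line += 1
                self.pos += 1
            elif ch in ' \t':                               # strip white space
                self.pos += 1
            elif self.text.startswith('//', self.pos):      # strip a line comment
                while self.pos < len(self.text) and self.text[self.pos] != '\n':
                    self.pos += 1
            else:
                break
```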

  5. Benefits • Simpler design • the parser doesn’t worry about comments and white spaces. • More efficient scanner • optimize only the scanning process. • use specialized buffering techniques. • Portability • handle standard symbols on different platforms.

  6. Basic Terminology • Token • a set of strings • Ex: token = identifier • Lexeme • a sequence of characters in the source program matched by the pattern for a token. • Ex: lexeme = counter

  7. Basic Terminology • Pattern • a description of the strings that can belong to a particular token set. • Ex: pattern = a letter followed by letters or digits {A,…,Z,a,…,z}{A,…,Z,a,…,z,0,…,9}*

  8. Token / Lexeme / Pattern • const / const / const • if / if / if • relation / <, <=, …, >= / comparison symbols • id / counter, x, y / letter (letter | digit)* • num / 12.53, 1.42E-10 / any numeric constant • literal / “Hello World” / characters between “ ”
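To make the token/lexeme/pattern relationship concrete, here is a small Python illustration; the regular expressions are one possible encoding of the patterns in the table above, not definitions from the slides:

```python
import re

# token name -> pattern; the text matched by the pattern is the lexeme
patterns = {
    'id':       r'[A-Za-z][A-Za-z0-9]*',          # letter (letter | digit)*
    'num':      r'\d+(\.\d+)?([Ee][+-]?\d+)?',    # any numeric constant
    'relation': r'<=|>=|<>|<|>|=',                # comparison symbols
    'literal':  r'"[^"]*"',                       # characters between " "
}

samples = {'id': 'counter', 'num': '12.53', 'relation': '<=', 'literal': '"Hello World"'}
for token, text in samples.items():
    lexeme = re.fullmatch(patterns[token], text).group(0)
    print(f'token={token}  lexeme={lexeme}  pattern={patterns[token]}')
```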

  9. Language and Lexical Analysis • Fixed-format input • e.g. FORTRAN • must consider the alignment of a lexeme. • difficult to scan. • No reserved words • e.g. PL/I • keywords vs. id ? -- complex rules. if if = then then then := else; else else := then;

  10. Regular Expression Revisited • ε is a regular expression that denotes {ε}. • If a is a symbol in the alphabet, a is a regular expression that denotes {a}. • Suppose r and s are regular expressions: • (r)|(s) denotes L(r) ∪ L(s). • (r)(s) denotes L(r)L(s). • (r)* denotes (L(r))*

  11. Precedence of Operators • Levels of precedence (highest first) • Kleene closure (*) • concatenation • union (|) • All operators are left associative. • Ex: a*b | cd* = ((a*)b) | (c(d*))

  12. Regular Definition • A sequence of definitions: d1 → r1 d2 → r2 ... dn → rn • di is a distinct name • ri is a regular expression over Σ ∪ {d1, …, di-1}

  13. Examples letter → A | B | … | Z | a | b | … | z digit → 0 | 1 | … | 9 id → letter ( letter | digit )* digits → digit digit* opt_fraction → . digits | ε opt_exponent → ( E ( + | - | ε ) digits ) | ε num → digits opt_fraction opt_exponent

  14. Notational Shorthands • One or more instances • r+ = rr* • Zero or one instance • r? = r | ε • (rs)? = rs | ε • Character Class • [A-Za-z] = A | B | … | Z | a | b | … | z

  15. Examples digit → [0-9] digits → digit+ opt_fraction → ( . digits )? opt_exponent → ( E ( + | - )? digits )? num → digits opt_fraction opt_exponent id → [A-Za-z][A-Za-z0-9]*
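These shorthand definitions translate almost one for one into a common regex dialect; a sketch using Python's re module (an assumption of this example, not part of the slides):

```python
import re

# Regular definitions from slide 15, built up exactly as written there.
digit        = r'[0-9]'
digits       = digit + r'+'
opt_fraction = r'(\.' + digits + r')?'
opt_exponent = r'(E[+-]?' + digits + r')?'
num          = digits + opt_fraction + opt_exponent
ident        = r'[A-Za-z][A-Za-z0-9]*'               # id

print(re.fullmatch(num, '31.02E-15') is not None)    # True
print(re.fullmatch(ident, 'counter') is not None)    # True
```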

  16. Recognition of Tokens • Consider the tokens of the grammar. • token • pattern • attribute • Draw NFAs, with retraction (*) where the last character read is not part of the token.

  17. Example : Grammar stmt ::= if expr then stmt | if expr then stmt else stmt | expr expr ::= term relop term | term term ::= id | num

  18. Example : Regular Definition if → if then → then else → else relop → < | <= | = | <> | > | >= id → letter (letter | digit)* num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )? delim → blank | tab | newline ws → delim+
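A small regex-based scanner for exactly this token set might look like the sketch below (an illustration, not the transition-diagram engine built on the following slides; the helper names are made up). Applied to `if count >= 0 then` it yields the token stream shown on slide 20:

```python
import re

# Regular definitions from slide 18, encoded as (token, pattern) pairs.
# Order matters: ws is discarded, <= and >= must be tried before < and >.
token_spec = [
    ('ws',    r'[ \t\n]+'),
    ('num',   r'\d+(\.\d+)?(E[+-]?\d+)?'),
    ('relop', r'<=|>=|<>|<|>|='),
    ('id',    r'[A-Za-z][A-Za-z0-9]*'),
]
keywords = {'if', 'then', 'else'}

def tokenize(text):
    pos = 0
    while pos < len(text):
        for name, pattern in token_spec:
            m = re.match(pattern, text[pos:])
            if m:
                lexeme = m.group(0)
                pos += len(lexeme)
                if name == 'ws':
                    break                      # strip white space
                if name == 'id' and lexeme in keywords:
                    yield (lexeme, None)       # keyword: token type is the lexeme
                else:
                    yield (name, lexeme)
                break
        else:
            raise SyntaxError(f'unexpected character {text[pos]!r}')

print(list(tokenize('if count >= 0 then')))
# [('if', None), ('id', 'count'), ('relop', '>='), ('num', '0'), ('then', None)]
```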

  19. Example: Pattern-Token-Attribute (Regular Expression / Token / Attribute-Value) • ws / - / - • if / if / - • then / then / - • else / else / - • id / id / index in table • num / num / index in table • < / relop / LT • <= / relop / LE • = / relop / EQ • <> / relop / NE • ...

  20. Attributes for Tokens if count >= 0 then ... <if, > <id, index for count in symbol table> <relop, GE> <num, integer value 0> <then, >

  21. NFA – Lexical Analysis Engine [transition diagram for relop, states 0–8] From start state 0: < followed by = gives return(relop, LE); < followed by > gives return(relop, NE); < followed by any other character retracts (*) and gives return(relop, LT); = gives return(relop, EQ); > followed by = gives return(relop, GE); > followed by any other character retracts (*) and gives return(relop, GT).
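The relop diagram above can be hand-coded as a small state machine; a sketch (the function name and return convention are illustrative):

```python
# Follows the relop transition diagram: on the *-marked edges the character
# just read belongs to the next token, so the scanner retracts (does not
# advance past it).
def scan_relop(text, pos):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    if pos >= len(text):
        return None
    c = text[pos]
    nxt = text[pos + 1] if pos + 1 < len(text) else ''
    if c == '<':
        if nxt == '=':
            return ('relop', 'LE'), pos + 2
        if nxt == '>':
            return ('relop', 'NE'), pos + 2
        return ('relop', 'LT'), pos + 1      # other: retract
    if c == '=':
        return ('relop', 'EQ'), pos + 1
    if c == '>':
        if nxt == '=':
            return ('relop', 'GE'), pos + 2
        return ('relop', 'GT'), pos + 1      # other: retract
    return None

print(scan_relop('count >= 0', 6))           # (('relop', 'GE'), 8)
```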

  22. Handle Numbers • Pattern for number contains options. num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )? • 31, 31.02, 31.02E-15 • Always get the longest possible match. • try to match the longest pattern first • if it does not match, try the next possible pattern.

  23. Handle Numbers [transition diagrams for num, states 12–27] One diagram per alternative, tried longest first: digit+ . digit+ E (+ | -)? digit+, then digit+ . digit+, then digit+; each accepting state retracts on any other character (*) and calls return(num, getnum()).
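The longest-match rule from slide 22 is what a greedy regex engine does by default; a short Python check (illustrative, not from the slides) shows that an incomplete exponent falls back to the longest pattern that did match:

```python
import re

# num -> digit+ ( . digit+ )? ( E (+ | -)? digit+ )?
num_re = re.compile(r'\d+(\.\d+)?(E[+-]?\d+)?')

for text in ['31', '31.02', '31.02E-15', '31.02E']:
    print(text, '->', num_re.match(text).group(0))
# '31.02E' yields only '31.02': the exponent part fails after E,
# so the scanner keeps the longest match it already had.
```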

  24. Handle Keywords • Two approaches: • encode keywords into the NFA (if, then, etc.) • complex NFA (too many states). • use the symbol table • simple. • requires some tricks. [transition diagram for identifiers, states 9–11] letter followed by letters or digits; on any other character retract (*) and return(gettoken(), install_id()).

  25. Handle Keywords • Symbol table contains both the lexeme and the token type. • Initialize the symbol table with all keywords and their token types: lexeme if / token type if, lexeme then / token type then, lexeme else / token type else.

  26. Handle Keywords [diagram] Before scanning begins, the Scanner initializes the Symbol Table with one entry per keyword (lexeme and token type for if, then, else); the remaining entries stay free for identifiers.

  27. Handle Keywords • gettoken(): • If id is not found in the table, return token type ID. • Otherwise, return token type from the table.

  28. Handle Keywords [diagram] Scanning the source text "if count <= ...": the Scanner reads the lexeme if, calls gettoken(), finds it in the Symbol Table, and returns token type if to the Parser as next-token.

  29. Handle Keywords • install_id(): • If id is not found in the table, it’s a new id. INSERT NEW ID INTO TABLE and return pointer to the new entry. • If id is found and its type is ID, return pointer to that entry. • Otherwise, it’s a keyword. Return 0.
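Slides 25 to 29 fit together in a few lines; here is a sketch assuming a simple list-backed symbol table preloaded with the keywords (entry numbers echo slides 26 to 32, but the Python representation is just an illustration):

```python
# Entry 0 is reserved so install_id() can return 0 for keywords;
# entries 1-3 hold the keywords, as described on slide 25.
symtab = [None, ('if', 'if'), ('then', 'then'), ('else', 'else')]

def gettoken(lexeme):
    """Return the token type stored for lexeme, or 'id' if it is not in the table."""
    for lex, token_type in symtab[1:]:
        if lex == lexeme:
            return token_type
    return 'id'

def install_id(lexeme):
    """Return the table index for an identifier, or 0 if lexeme is a keyword."""
    for i, (lex, token_type) in enumerate(symtab[1:], start=1):
        if lex == lexeme:
            return i if token_type == 'id' else 0
    symtab.append((lexeme, 'id'))            # new id: insert and point to it
    return len(symtab) - 1

print(gettoken('if'), install_id('if'))          # if 0
print(gettoken('count'), install_id('count'))    # id 4
```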

  30. Handle Keywords [diagram] For the same lexeme if, install_id() finds a keyword entry in the Symbol Table and returns 0, so the if token carries no symbol-table pointer when it is passed to the Parser.

  31. Handle Keywords [diagram] Scanning the lexeme count: gettoken() does not find it in the Symbol Table (not found!), so it returns token type id.

  32. Handle Keywords [diagram] install_id() does not find count either, so it inserts a new entry (lexeme count, token type id) into the Symbol Table at entry 4 and returns that index as the attribute of the id token passed to the Parser.
