1 / 23

Lexical Analysis - An Introduction

Lexical Analysis - An Introduction. The Front End. The purpose of the front end is to deal with the input language Perform a membership test: code  source language? Is the program well-formed (semantically) ? Build an IR version of the code for the rest of the compiler. Back End.

livia
Download Presentation

Lexical Analysis - An Introduction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lexical Analysis - An Introduction

  2. The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source language? Is the program well-formed (semantically) ? Build an IR version of the code for the rest of the compiler Back End Machine code Source code IR Front End Errors

  3. The Front End Scanner Maps stream of characters into words Basic unit of syntax x = x + y ;becomes <id,x> <eq,=> <id,x> <pl,+> <id,y> <sc,; > Characters that form a word are its lexeme Its part of speech (or syntactic category) is called its token type Scanner discards white space & (often) comments IR Source code tokens Parser Scanner Errors Speed is an issue in scanning  use a specialized recognizer

  4. The Front End Parser Checks stream of classified words(parts of speech) for grammatical correctness Determines if code is syntactically well-formed Guides checking at deeper levels than syntax Builds an IR representation of the code Source code tokens Scanner Errors IR Parser

  5. The Big Picture Language syntax is specified with parts of speech, not words Syntax checking matches parts of speech against a grammar 1. goalexpr 2. exprexpr op term 3. | term 4. termnumber 5. | id 6. op + 7. | – S = goal T = { number, id, +, - } N = { goal, expr, term, op } P = { 1, 2, 3, 4, 5, 6, 7}

  6. The Big Picture Language syntax is specified with parts of speech, not words Syntax checking matches parts of speech against a grammar 1. goalexpr 2. exprexpr op term 3. | term 4. termnumber 5. | id 6. op + 7. | – S = goal T = { number, id, +, - } N = { goal, expr, term, op } P = { 1, 2, 3, 4, 5, 6, 7} Parts of speech, not words! No words here!

  7. The Big Picture source code parts of speech & words Scanner tables or code Represent words as indices into a global table specifications Scanner Generator Specifications written as “regular expressions”

  8. The Big Picture • Why study lexical analysis? • Goals: • To simplify specification & implementation of scanners • To understand the underlying techniques and technologies

  9. How to implement a scanner Regular Expressions NFA DFA

  10. Regular Expressions Lexical patterns form a regular language *** any finite language is regular *** Regular expressions (REs) describe regular languages Ever type “rm *.o a.out” ?

  11. Regular Expressions Regular Expression (over alphabet )  is a RE denoting the set {} If a is in , then a is a RE denoting {a} If x and y are REs denoting L(x) and L(y) then x |y is an RE denoting L(x)  L(y) xy is an RE denoting L(x)L(y) x* is an RE denoting L(x)*

  12. Set Operations (review)

  13. Examples of Regular Expressions Identifiers: Letter  (a|b|c| … |z|A|B|C| … |Z) Digit  (0|1|2| … |9) Identifier  Letter ( Letter | Digit )* Numbers: Integer  (+|-|) (0| (1|2|3| … |9)(Digit *) ) Decimal  Integer . Digit * Real  ( Integer | Decimal ) E (+|-|) Digit * Complex  (Real,Real )

  14. Regular Expressions (the point) Regular expressions can be used to specify the words to be translated to parts of speech by a lexical analyzer Using results from automata theory and theory of algorithms, we can automatically build recognizers from regular expressions  We study REs and associated theory to automate scanner construction !

  15. Consider the problem of recognizing ILOC register names Register  r (0|1|2| … | 9)(0|1|2| … | 9)* Allows registers of arbitrary number Requires at least one digit RE corresponds to a recognizer (or DFA) Example (0|1|2| … 9) (0|1|2| … 9) r S0 S1 S2 accepting state Recognizer for Register

  16. DFA operation Start in state S0 & take transitions on each input character DFA accepts a word x iff x leaves it in a final state (S2 ) So, r17 takes it through s0, s1, s2and accepts r takes it through s0, s1 and fails Example (0|1|2| … 9) r (0|1|2| … 9) S0 S1 S2 accepting state Recognizer for Register

  17. Example (continued) To be useful, recognizer must turn into code Char  next character State  s0 while (Char  EOF) State  (State,Char) Char  next character if (State is a final state) then report success else report failure Skeleton recognizer Table encoding RE

  18. Example (continued) Char  next character State  s0 while (Char  EOF) State  (State,Char) perform specified action Char  next character if (State is a final state) then report success else report failure Skeleton recognizer Table encoding RE

  19. Open a file to read from Open a file to write to Create a scanner object Call a method from Scanner class to scan, classify each token and write to the output file Algorithm Project 1

  20. Read a line from input file ci= first character of this line true ci==‘#’ || (ci==‘/’ && ci+1 ==‘/’ false Algorithm scanner Not the end of file Recognize the meta character Current Token = meta character Print out token with new line i+= len of token true Recognize the string (read until you reach another “) Current Token = string Print out the token Ci ==‘”’ i+= len of token false Ci is a digit true Recognize the number Current Token = number Print out the token i+= len of token false true true Ci is a letter Recognize the id token is not a keyword Token is an ID, print with tag false true false Ci is symbol Ci is symbol Recognize the symbol Print i+= len of token Token is a keyword Print the token false Print ci i++ i++ the end of line Read another line the end of file

  21. Sample:test1.c #include <stdio.h> void sample() { int b=4; printf("Helloworld %d",b); } int main() { sample(); }

  22. Token list #include <stdio.h> ---- meta void ---- keyword sample ---- id ( ---- symbol ) ---- symbol { ---- symbol int ---- 2 b ---- id = ---- symbol 4 ---- number ; ---- symbol printf ---- keyword ( ---- symbol "Helloworld %d" ---- string , ---- symbol b ---- id ) ---- symbol ; ---- symbol } ---- symbol int ---- keyword main ---- id ( ---- symbol ) ---- symbol { ---- symbol sample ---- symbol ( ---- symbol ) ---- symbol ; ---- symbol } ---- symbol

  23. Result #include <stdio.h> void CS322sample() { int CS322b=4; printf("Helloworld %d",CS322b); } int CS322main() { CS322sample(); }

More Related