
Chapter 3. Lexical Analysis (1)



Presentation Transcript


  1. Chapter 3. Lexical Analysis (1)

  2. Interaction of lexical analyzer with parser.

  3. Lexical Analysis
     • Issues (reasons to separate lexical analysis from parsing)
       • Simpler design is preferred
       • Compiler efficiency is improved
       • Compiler portability is improved
     • Terms
       • Tokens → terminal symbols in a grammar
       • Patterns → rules describing the strings of a token
       • Lexemes → character sequences in the source that are matched by the pattern of a token

  4. Examples of tokens.

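  5. Difficulties in implementing lexical analyzers
     • FORTRAN
       • No delimiter is used (blanks are not significant)
       • DO 5 I=1.25 vs. DO 5 I=1,25 vs. DO 5 I= 1 25
       • The first is an assignment to the variable DO5I and the second begins a DO loop; the scanner cannot tell them apart until it reaches the "." or ","
     • PL/I
       • Keywords are not reserved
       • IF THEN THEN THEN = ELSE; ELSE ELSE=THEN;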

  6. Attributes for tokens
     • A lexical analyzer collects information about tokens into their associated attributes
     • Example: E = M * C ** 2
       • <id, pointer to symbol-table entry for E>
       • <assign_op, >
       • <id, pointer to symbol-table entry for M>
       • <mult_op, >
       • <id, pointer to symbol-table entry for C>
       • <exp_op, >
       • <num, integer value 2> (generally stored in a constant table)
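
The token/attribute pairs on this slide can be reproduced with a small scanner. The following is only a sketch under stated assumptions: the token names follow the slide, but the regular expressions, the `tokenize` function, and the use of a plain list index in place of a symbol-table pointer are illustrative choices, not part of the original deck.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("exp_op", r"\*\*"),   # listed before mult_op so "**" is not split in two
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table); each token is a (name, attribute) pair."""
    symbol_table = []  # a list index stands in for a symbol-table pointer
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue  # white space produces no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))  # operators carry no attribute
    return tokens, symbol_table
```

Calling `tokenize("E = M * C ** 2")` yields seven pairs in the same order as the slide, with `E`, `M`, and `C` entered into the table once each.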

  7. Lexical Errors
     • Rules for error recovery
       • Deleting an extraneous character
       • Inserting a missing character
       • Replacing an incorrect character by a correct character
       • Transposing two adjacent characters
     • Minimum-distance error correction
     • Example
       • Detectable: 2as3, 2#31, …
       • Undetectable: fi(a == f(x)) … (fi is a valid identifier, so the scanner alone cannot tell that if was intended)
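
The four repair rules can be read as single-character edits, and a minimum-distance repair then looks for a valid lexeme reachable with the fewest such edits. The sketch below enumerates only one-edit repairs; the lowercase alphabet, the keyword list, and the function names are assumptions made for illustration.

```python
import string

ALPHABET = string.ascii_lowercase  # assumed alphabet for this illustration


def single_edit_repairs(lexeme):
    """All strings reachable from lexeme by applying one repair rule once."""
    splits = [(lexeme[:i], lexeme[i:]) for i in range(len(lexeme) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return (deletes | inserts | replaces | transposes) - {lexeme}


def closest_keywords(lexeme, keywords):
    """Keywords exactly one edit away from lexeme, in sorted order."""
    return sorted(single_edit_repairs(lexeme) & set(keywords))
```

For instance, `closest_keywords("fi", ["if", "then", "else"])` finds `if` through a transposition, which is the repair a later phase might apply to the undetectable example on the slide.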

  8. Input Buffering
     • A single buffer causes serious difficulty
       • A word may straddle two buffer loads
       • DECLARE (arg1, …, argn) → array or function? (lookahead is needed to decide)
     • Buffer pairs
       • A good solution
       • With sentinels, there is no need to test for both end of buffer and end of file on every advance

  9. Sentinels at end of each buffer half.
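
A buffer pair with sentinels can be sketched as follows. This is a minimal illustration, not the figure from the deck: the class name, the tiny half size `N`, and the choice of `"\0"` as the sentinel (assumed never to occur in the source text) are all assumptions.

```python
import io

EOF = "\0"  # sentinel character; assumed never to appear in the source text
N = 8       # size of each buffer half, kept tiny for illustration


class TwoHalfBuffer:
    """Buffer pair laid out as [half 0][EOF][half 1][EOF]."""

    def __init__(self, stream):
        self.stream = stream
        self.buf = [EOF] * (2 * N + 2)
        self._load(0)
        self.forward = 0

    def _load(self, start):
        chunk = self.stream.read(N)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        # After a full read this rewrites the half's sentinel slot;
        # after a short read it marks the real end of input.
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        """Return the next character, or None at end of input."""
        ch = self.buf[self.forward]
        if ch != EOF:                    # common case: one comparison only
            self.forward += 1
            return ch
        if self.forward == N:            # sentinel closing the first half
            self._load(N + 1)
            self.forward = N + 1
            return self.next_char()
        if self.forward == 2 * N + 1:    # sentinel closing the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                      # EOF inside a half: input exhausted


def read_all(text):
    """Drain a TwoHalfBuffer over text and return what was read."""
    buffer = TwoHalfBuffer(io.StringIO(text))
    out = []
    ch = buffer.next_char()
    while ch is not None:
        out.append(ch)
        ch = buffer.next_char()
    return "".join(out)
```

The point of the design is the first branch of `next_char`: for ordinary characters only one test is made, and the end-of-half and end-of-file cases are distinguished only after a sentinel has been seen.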

  10. Specification of Tokens
      • Strings and languages
        • Alphabet or character class → finite set of symbols
        • String = sentence = word
        • |s| → length of a string s
        • ε: empty string; ∅: empty set; {ε}: the set containing only the empty string (so {ε} ≠ ∅)
        • If x and y are strings → xy: concatenation, εx = xε = x
      • Operations on languages

  11. Terms for parts of a string.

  12. Definitions of operations on languages.
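
The standard operations on languages (union, concatenation, Kleene closure, positive closure) can be tried out on small finite sets. The sketch below models a language as a Python set of strings; because L* is infinite in general, the closures are truncated at an assumed `max_power`, and all function names are illustrative.

```python
from itertools import product


def union(L, M):
    """L ∪ M."""
    return L | M


def concat(L, M):
    """LM = { st | s in L and t in M }."""
    return {s + t for s, t in product(L, M)}


def power(L, i):
    """L^i, with L^0 = {ε} (ε is the empty Python string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene(L, max_power):
    """L^0 ∪ L^1 ∪ ... ∪ L^max_power, a finite approximation of L*."""
    out = set()
    for i in range(max_power + 1):
        out |= power(L, i)
    return out


def positive(L, max_power):
    """L^1 ∪ ... ∪ L^max_power, a finite approximation of L+."""
    out = set()
    for i in range(1, max_power + 1):
        out |= power(L, i)
    return out
```

With `L = {"A", "B"}` and `D = {"0", "1"}`, `concat(L, D)` gives the four two-character strings, and `concat(set(), L)` is empty while `concat({""}, L)` is `L` again, which shows concretely why ∅ and {ε} must be kept apart.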

  13. Regular Expressions
      1. ε is a regular expression that denotes {ε}, that is, the set containing the empty string.
      2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e., the set containing the string a. Although we use the same notation for all three, technically, the regular expression a is different from the string a or the symbol a. It will be clear from the context whether we are talking about a as a regular expression, string, or symbol.
      3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
         a) (r)|(s) is a regular expression denoting L(r) ∪ L(s).
         b) (r)(s) is a regular expression denoting L(r)L(s).
         c) (r)* is a regular expression denoting (L(r))*.
         d) (r) is a regular expression denoting L(r).

  14. Examples on operations in regular expressions
      • Σ = {a, b} → alphabet
      • a | b → {a, b}
      • (a|b)(c|d) → {ac, ad, bc, bd} (here over the larger alphabet {a, b, c, d})
      • a* → {ε, a, aa, aaa, …}
      • (a|b)* = (a*|b*)*
      • aa* = a+, ε|a+ = a*
      • (a|b) = (b|a)
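
The identities on this slide can be spot-checked by brute force: enumerate every string over the alphabet up to some length and compare which ones each expression matches. This is a bounded check, not a proof, and it relies on Python's `re` syntax standing in for the textbook notation; the helper names and the length bound are assumptions.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """Strings over alphabet, at most max_len long, fully matched by pattern."""
    rx = re.compile(pattern)
    words = (
        "".join(p)
        for n in range(max_len + 1)
        for p in product(alphabet, repeat=n)
    )
    return {w for w in words if rx.fullmatch(w)}


def same_language(p, q, alphabet="ab", max_len=5):
    """True if p and q accept the same strings up to max_len."""
    return language(p, alphabet, max_len) == language(q, alphabet, max_len)
```

For example, `same_language("(a|b)*", "(a*|b*)*")` and `same_language("aa*", "a+")` both hold on all strings up to the bound, whereas `same_language("a*", "a+")` fails because only the first accepts ε.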

  15. Algebraic properties of regular expressions.

  16. Regular Definitions
      • Regular definition
        • d1 → r1, d2 → r2, …, dn → rn
      • Example
        • letter → A|B| … |Z|a|b| … |z
        • digit → 0|1| … |9
        • id → letter (letter|digit)*

  17. Unsigned numbers
      • Pascal
        • digit → 0|1| … |9
        • digits → digit digit*
        • optional_fraction → . digits | ε
        • optional_exponent → ( E ( + | - | ε ) digits ) | ε
        • num → digits optional_fraction optional_exponent
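
A regular definition can be turned into a single regular expression by substituting each name with its body, in order. The sketch below does this for the unsigned-number definition using Python's `re` syntax; the `{name}` placeholder convention (borrowed from Lex), the `expand` helper, and `is_unsigned_number` are illustrative assumptions, and an empty alternative stands in for ε.

```python
import re

# Each body may refer only to names defined earlier, as in a regular definition.
DEFINITIONS = [
    ("digit", r"[0-9]"),
    ("digits", r"{digit}{digit}*"),
    ("optional_fraction", r"(?:\.{digits}|)"),          # ". digits | ε"
    ("optional_exponent", r"(?:E(?:\+|-|){digits}|)"),  # "(E (+|-|ε) digits) | ε"
    ("num", r"{digits}{optional_fraction}{optional_exponent}"),
]


def expand(definitions):
    """Replace every {name} by the already expanded body of that name."""
    expanded = {}
    for name, body in definitions:
        for known, pattern in expanded.items():
            body = body.replace("{" + known + "}", "(?:" + pattern + ")")
        expanded[name] = body
    return expanded


NUM = re.compile(expand(DEFINITIONS)["num"])


def is_unsigned_number(text):
    """True if the whole of text is generated by num."""
    return NUM.fullmatch(text) is not None
```

Under this definition `5280`, `39.37`, `6.336E4`, and `1.894E-4` are accepted, while `1.` is rejected because the fraction requires at least one digit after the point.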

  18. Notational Shorthands (1/2)
      • One or more instances. The unary postfix operator + means “one or more instances of.” If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+. Thus, the regular expression a+ denotes the set of all strings of one or more a’s. The operator + has the same precedence and associativity as the operator *. The two algebraic identities r* = r+|ε and r+ = rr* relate the Kleene and positive closure operators.
      • Zero or one instance. The unary postfix operator ? means “zero or one instance of.” The notation r? is a shorthand for r|ε. If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}. For example, using the + and ? operators, we can rewrite the regular definition for num in Example 3.5 as
        • digit → 0|1| … |9
        • digits → digit+
        • optional_fraction → ( . digits )?
        • optional_exponent → ( E ( + | - )? digits )?
        • num → digits optional_fraction optional_exponent

  19. Notational Shorthands (2/2)
      • Character classes. The notation [abc], where a, b, and c are alphabet symbols, denotes the regular expression a | b | c. An abbreviated character class such as [a-z] denotes the regular expression a | b | ··· | z. Using character classes, we can describe identifiers as being strings generated by the regular expression [A-Za-z][A-Za-z0-9]*

  20. Nonregular set
      • { w c w⁻¹ | w is a string of a’s and b’s }, where w⁻¹ denotes w reversed → no regular expression denotes this set; a context-free grammar is required to describe these strings

  21. Regular-expression patterns for tokens.

  22. Transition diagram
      • Finite-state automata
        • states and edges
      • Several examples are shown …
      • Next page: Fig. 3.14 is drawn from the preceding examples

  23. Transition diagram for identifiers and keywords.
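
A transition diagram of this kind is commonly coded as a small state machine: a start state that requires a letter, a second state that loops on letters and digits, and an accepting exit on any other character, which is not consumed. The sketch below assumes that two-state shape and a keyword table consulted on acceptance; it is not a transcription of the figure, and the names are illustrative.

```python
# Reserved words taken from the Lex example in this deck; the table is an assumption.
KEYWORDS = {"if": "IF", "then": "THEN", "else": "ELSE"}


def scan_identifier(text, start=0):
    """Run the diagram from position start.

    Returns (token, lexeme, next_position), or None if no identifier
    begins at start.
    """
    state = 0
    forward = start
    while True:
        # An empty string stands for "past the end of the input".
        ch = text[forward] if forward < len(text) else ""
        if state == 0:
            if ch.isascii() and ch.isalpha():
                state = 1
                forward += 1
            else:
                return None  # start state has no edge for this character
        else:  # state == 1
            if ch.isascii() and ch.isalnum():
                forward += 1  # loop on letter or digit
            else:
                # "other" edge: accept and retract, ch is left unread
                lexeme = text[start:forward]
                return KEYWORDS.get(lexeme, "ID"), lexeme, forward
```

Looking the lexeme up in `KEYWORDS` only after the identifier has been recognized keeps the diagram small: keywords need no states of their own.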

  24. Implementation with Lex
      • Regular definition → finite automata, transition diagram
      • Output is a C program
      • Lexical analysis, pattern matching, …

  25. Creating a lexical analyzer with Lex.

  26. Lex program for the tokens of Fig. 3.10. (1/2)

      %{
      /* definitions of manifest constants
         LT, LE, EQ, NE, GT, GE,
         IF, THEN, ELSE, ID, NUMBER, RELOP */
      %}

      /* regular definitions */
      delim    [ \t\n]
      ws       {delim}+
      letter   [A-Za-z]
      digit    [0-9]
      id       {letter}({letter}|{digit})*
      number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?

  27. Lex program for the tokens of Fig. 3.10. (2/2)

      %%

      {ws}      { /* no action and no return */ }
      if        { return(IF); }
      then      { return(THEN); }
      else      { return(ELSE); }
      {id}      { yylval = install_id(); return(ID); }
      {number}  { yylval = install_num(); return(NUMBER); }
      "<"       { yylval = LT; return(RELOP); }
      "<="      { yylval = LE; return(RELOP); }
      "="       { yylval = EQ; return(RELOP); }
      "<>"      { yylval = NE; return(RELOP); }
      ">"       { yylval = GT; return(RELOP); }
      ">="      { yylval = GE; return(RELOP); }

      %%

      int install_id() {
          /* procedure to install the lexeme, whose first character
             is pointed to by yytext and whose length is yyleng,
             into the symbol table and return a pointer thereto */
      }

      int install_num() {
          /* similar procedure to install a lexeme that is a number */
      }

  28. Lookahead operator
      • DO 5 I = 1.25 vs. DO 5 I=1,25
        • DO/({letter}|{digit})* = ({letter}|{digit})*,
        • DO/{id}* = {digit}*,
      • IF(I,J)=3 vs. IF(condition) statement
        • IF/ \( .* \) {letter}
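
The Lex pattern r1/r2 matches r1 only when r2 follows, without consuming r2. Python's `(?=...)` lookahead assertion behaves similarly, so the DO example can be imitated as below. This is a sketch under assumptions: blanks are stripped first to mimic FORTRAN, only uppercase letters and digits are handled, and `first_token` is a made-up helper rather than anything Lex generates.

```python
import re

# DO is a keyword only if "letters/digits = letters/digits ," follows it.
# The lookahead checks that context but leaves it unconsumed.
DO_KEYWORD = re.compile(r"DO(?=[A-Z0-9]*=[A-Z0-9]*,)")
IDENTIFIER = re.compile(r"[A-Z][A-Z0-9]*")


def first_token(statement):
    """Classify the first token of a FORTRAN-like statement."""
    text = statement.replace(" ", "")  # blanks are not significant
    m = DO_KEYWORD.match(text)
    if m:
        return ("DO", m.group())
    m = IDENTIFIER.match(text)
    if m:
        return ("ID", m.group())
    return None
```

With this, `first_token("DO 5 I = 1,25")` reports the keyword `DO`, while `first_token("DO 5 I = 1.25")` reports the identifier `DO5I`, because the comma required by the lookahead is never found.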
