
Automata and Regular Expression

Automata and Regular Expression. Discrete Mathematics and Its Applications Baojian Hua bjhua@ustc.edu.cn. Nonterminal symbols. Terminal symbols. MiniC Formal Grammar. prog -> stm prog | stm stm -> id = exp ; | print ( exp ); exp -> exp + exp | exp - exp


Presentation Transcript


  1. Automata and Regular Expression Discrete Mathematics and Its Applications Baojian Hua bjhua@ustc.edu.cn

  2. MiniC Formal Grammar (diagram labels: nonterminal symbols, terminal symbols)
       prog -> stm prog | stm
       stm  -> id = exp ; | print ( exp ) ;
       exp  -> exp + exp | exp - exp | exp * exp | exp / exp | id | num | ( exp )

  3. ident and intconst • For most programming languages, the terminal symbols represent the basic punctuation symbols, keywords, and operators • <id> and <num> are special in that they represent (infinite) sets of terminal symbols
       id     ::= letter idRest
       letter ::= _ | a | … | z | A | … | Z
       idRest ::= ε | letter idRest | digit idRest
       digit  ::= 0 | 1 | 2 | … | 9
       <exp>  ::= … | <ident> | <intconst>
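The grammar for identifiers can be turned into a recognizer almost rule by rule. The following is a minimal sketch, not part of the slides: the function names (`is_letter`, `is_digit`, `is_id_rest`, `is_identifier`) are hypothetical, and the code assumes an ASCII character set.

```c
#include <stdbool.h>

/* letter ::= _ | a | ... | z | A | ... | Z   (assumes ASCII) */
static bool is_letter(char c) {
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* digit ::= 0 | 1 | ... | 9 */
static bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* idRest ::= epsilon | letter idRest | digit idRest */
static bool is_id_rest(const char *s) {
    if (*s == '\0') return true;                 /* the epsilon alternative */
    if (is_letter(*s) || is_digit(*s)) return is_id_rest(s + 1);
    return false;                                /* no alternative applies */
}

/* id ::= letter idRest   (hypothetical helper, not from the slides) */
bool is_identifier(const char *s) {
    return is_letter(*s) && is_id_rest(s + 1);
}
```

Calling `is_identifier("_tmp1")` follows the derivation id → letter idRest → … one character at a time; a string such as `"1x"` fails at the first rule because it does not start with a letter.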

  4. Lexical Analysis • The lexical analyzer translates the source program into a stream of lexical tokens • Source program: • stream of (ASCII or Unicode) characters • Lexical token: • internal data structure that represents the occurrence of a terminal symbol

  5. Example
       x = 11;
       y = 2;
       z = x + y;
       print (z);
     lexical analysis produces:
       IDENT(x) ASSIGN INTCONST(11) SEMICOLON NEWLINE
       IDENT(y) ASSIGN INTCONST(2) SEMICOLON NEWLINE
       IDENT(z) ASSIGN IDENT(x) PLUS IDENT(y) SEMICOLON NEWLINE
       PRINT LPAREN IDENT(z) RPAREN SEMICOLON EOF

  6. Practical Issues • What are tokens? • Recall the tagged union in our previous slides • What if the input characters are illegal? • Limited checking of the grammatical structure of input • the lexer only checks that the input stream can be viewed as a stream of terminal symbols

  7. Lexical Errors
       x = 11              (not a lexical error)
       y = = = = = = 2;    (not a lexical error)
       z =@ x + y;         (lexical error: @)
       print (#z);         (lexical error: #)

  8. Position Info • For the purpose of later phases, it is useful to attach position information to each token • we'll see how to make use of this kind of info in later slides
       LPAREN(1,4) IDENT(x,4,5) MINUS(6,7) …
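One possible way to compute position information is to track a line and column counter while consuming characters. The sketch below is an illustration only: `struct position`, `advance`, and `position_at` are hypothetical names, and the 1-based line/column convention is an assumption rather than the convention used in the slide's example.

```c
/* Hypothetical position record; field names and 1-based counting are assumptions. */
struct position {
    int line;
    int column;
};

/* Move the position past one input character. */
void advance(struct position *p, char c) {
    if (c == '\n') {
        p->line++;        /* a newline starts the next line ... */
        p->column = 1;    /* ... at its first column */
    } else {
        p->column++;
    }
}

/* Position of the character at src[offset], counting lines and columns from 1. */
struct position position_at(const char *src, int offset) {
    struct position p = {1, 1};
    for (int i = 0; i < offset && src[i] != '\0'; i++)
        advance(&p, src[i]);
    return p;
}
```

A lexer would typically call `advance` for every character it reads and copy the current position into each token when the token starts.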

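  9. Tokens in C
       #ifndef TOKEN_H
       #define TOKEN_H

       enum tokenKind {ID, NUM, ASSIGN, LPAREN, …};

       /* a token is a pointer to a tagged record */
       typedef struct tokenStruct *token;

       struct tokenStruct {
         enum tokenKind kind;   /* the tag: which terminal symbol this is */
         union {…} u;           /* kind-specific payload */
         int line;              /* position info */
         int column;
       };

       #endif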

  10. Lexer Interface
       #ifndef LEXER_H
       #define LEXER_H

       #include "token.h"

       token nextToken (char *fileName);

       #endif

  11. Client Code
       #include "lexer.h"

       int main() {
         // we want to analyze the file "test.c"
         token t = nextToken ("test.c");
         while (t != EOF) {   // loop until the lexer reports end of file
           …
           t = nextToken ("test.c");
           …
         }
         return 0;
       }

  12. Finite-state Automata

  13. Finite-state Automata (FAs)
       Input String → M → {Yes, No}
       $M = (\Sigma, S, s_0, F, f)$
         $\Sigma$: input alphabet
         $S$: state set
         $s_0$: initial state
         $F$: final states
         $f$: transition function

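  14. Transition Functions • A deterministic finite automaton (DFA) has the transition function $f: S \times \Sigma \to S$ • which can be extended to strings: $f': S \times \Sigma^* \to S$ • in an inductive form, for $a \in \Sigma$ and $\beta \in \Sigma^*$: • $f'(q, \varepsilon) = q$ • $f'(q, a\beta) = f'(f(q, a), \beta)$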

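  15. DFA Example (diagram: states 0, 1, 2; edges labeled a, a, b, b, and a,b) • Which strings of as and bs are accepted? • Transition function: { $(s_0,a) \mapsto s_1$, $(s_0,b) \mapsto s_0$, $(s_1,a) \mapsto s_2$, $(s_1,b) \mapsto s_1$, $(s_2,a) \mapsto s_2$, $(s_2,b) \mapsto s_2$ }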
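The transition function of this example can be stored as a matrix and the extended transition function computed by a simple loop. This is a sketch under stated assumptions: the names `delta`, `dfa_run`, and `dfa_accepts` are hypothetical, and because the transcript does not say which states are final, the set of final states is passed in as a bitmask by the caller.

```c
#include <stdbool.h>

enum { NUM_STATES = 3, NUM_SYMBOLS = 2 };

/* delta[state][symbol], with symbol 0 = 'a' and symbol 1 = 'b' */
static const int delta[NUM_STATES][NUM_SYMBOLS] = {
    /* s0 */ {1, 0},
    /* s1 */ {2, 1},
    /* s2 */ {2, 2},
};

/* Extended transition function: apply delta once per input character.
 * Returns the state reached from q, or -1 if w contains a symbol
 * outside the alphabet {a, b}. */
int dfa_run(int q, const char *w) {
    for (; *w != '\0'; w++) {
        if (*w != 'a' && *w != 'b') return -1;
        q = delta[q][*w == 'a' ? 0 : 1];
    }
    return q;
}

/* finals is a bitmask: bit i set means state s_i is final.
 * Which states are final is the caller's ASSUMPTION. */
bool dfa_accepts(const char *w, unsigned finals) {
    int q = dfa_run(0, w);
    return q >= 0 && ((finals >> q) & 1u) != 0;
}
```

If one assumes that s2 is the only final state, `dfa_accepts(w, 1u << 2)` is true exactly for strings containing at least two as, since each a moves one state to the right and s2 is never left.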

  16. Nondeterministic FAs (NFAs) • NFAs can transition to more than one state on any input • $f: S \times \Sigma \to \mathcal{P}(S)$ • As before, can extend: • $f': S \times \Sigma^* \to \mathcal{P}(S)$ • Inductively: $f'(q, \varepsilon) = \{q\}$ and $f'(q, a\beta) = \bigcup_{p \in f(q, a)} f'(p, \beta)$

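  17. NFA Example (diagram: states 0 and 1; edges labeled a,b, b, a, and b) • Transition function: { $(s_0,a) \mapsto \{s_0,s_1\}$, $(s_0,b) \mapsto \{s_1\}$, $(s_1,a) \mapsto \emptyset$, $(s_1,b) \mapsto \{s_0,s_1\}$ }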
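An NFA can be simulated by tracking the set of states it might be in. The sketch below encodes a set of states as a bitmask; the names `StateSet`, `nfa_step`, `nfa_run`, and `nfa_accepts` are hypothetical, and since the transcript does not mark the final states, they are supplied by the caller.

```c
#include <stdbool.h>

/* A set of states as a bitmask: bit i set means s_i is in the set. */
typedef unsigned StateSet;
#define S0 (1u << 0)
#define S1 (1u << 1)

enum { NFA_STATES = 2 };

/* nfa_delta[state][symbol], with symbol 0 = 'a' and symbol 1 = 'b' */
static const StateSet nfa_delta[NFA_STATES][2] = {
    /* s0 */ {S0 | S1, S1},
    /* s1 */ {0,       S0 | S1},
};

/* One step on a set of states: the union of f(p, symbol) over every p in the set. */
static StateSet nfa_step(StateSet set, int symbol) {
    StateSet next = 0;
    for (int p = 0; p < NFA_STATES; p++)
        if (set & (1u << p)) next |= nfa_delta[p][symbol];
    return next;
}

/* Extended transition function f'(q, w), returned as a set of states.
 * A symbol outside {a, b} yields the empty set. */
StateSet nfa_run(int q, const char *w) {
    StateSet set = 1u << q;
    for (; *w != '\0'; w++) {
        if (*w != 'a' && *w != 'b') return 0;
        set = nfa_step(set, *w == 'a' ? 0 : 1);
    }
    return set;
}

/* Accept if some reachable state is final; finals is the caller's ASSUMPTION. */
bool nfa_accepts(const char *w, StateSet finals) {
    return (nfa_run(0, w) & finals) != 0;
}
```

For example, `nfa_run(0, "ba")` is the empty set: b leads from s0 to {s1}, and s1 has no transition on a, so the NFA rejects that input whichever states are final.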

  18. Regular Expression

  19. Regular Expressions • A regular language can always be described using a regular expression. • Examples • (01)* • 00 • ε • (a|b)*ab • this | that | theother • 0*1*2* • 01*|0 = 01* • 00*11*22* = $0^+1^+2^+$ • (1|0)*00(0|1)*
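Several of these examples can be tried out with the POSIX regular expression functions `regcomp`, `regexec`, and `regfree` from `<regex.h>`. The following sketch assumes a POSIX system; `full_match` is a hypothetical helper that anchors the pattern so that the whole input string must belong to the language.

```c
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/* Returns true if the whole string s is in the language of the POSIX
 * extended regular expression re. Hypothetical helper for illustration. */
bool full_match(const char *re, const char *s) {
    char anchored[256];
    regex_t compiled;

    /* Wrap the pattern as ^(re)$ so that it must cover the entire input. */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", re);
    if (n < 0 || n >= (int)sizeof anchored)
        return false;                     /* pattern too long for this sketch */

    if (regcomp(&compiled, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                     /* pattern did not compile */

    bool ok = regexec(&compiled, s, 0, NULL, 0) == 0;
    regfree(&compiled);
    return ok;
}
```

With this helper, `full_match("(a|b)*ab", "abab")` holds while `full_match("(a|b)*ab", "aba")` does not, and `01*|0` and `01*` agree on inputs such as `"0"` and `"0111"`, which is what the equation on the slide states.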

  20. Regular Expressions and Tokens • Regular expressions are convenient for describing lexical tokens • intconst: [0-9][0-9]* • ident: [_a-zA-Z][_a-zA-Z0-9_]* • others: = | print | + | …
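These token patterns are simple enough to match by hand. The sketch below is one possible hand-written scanner, not the slides' implementation: `match_intconst`, `match_ident`, and `scan` are hypothetical names, the one-character token codes are invented for the example, keywords such as `print` are not distinguished from identifiers, and the `<ctype.h>` classification is assumed to run in the default "C" locale.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest prefix of s matching [0-9][0-9]*, or 0 if none. */
size_t match_intconst(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Length of the longest prefix of s matching [_a-zA-Z][_a-zA-Z0-9]*, or 0 if none. */
size_t match_ident(const char *s) {
    if (!(s[0] == '_' || isalpha((unsigned char)s[0]))) return 0;
    size_t n = 1;
    while (s[n] == '_' || isalnum((unsigned char)s[n])) n++;
    return n;
}

/* Writes one code per token into out: 'I' for ident, 'N' for intconst,
 * or the operator/punctuation character itself. Whitespace is skipped.
 * Returns the number of tokens, or -1 on an illegal character or if
 * out (of capacity cap) is too small. */
int scan(const char *src, char *out, size_t cap) {
    size_t count = 0;
    if (cap == 0) return -1;
    while (*src != '\0') {
        if (isspace((unsigned char)*src)) { src++; continue; }
        size_t len;
        char code;
        if ((len = match_ident(src)) > 0) code = 'I';
        else if ((len = match_intconst(src)) > 0) code = 'N';
        else if (strchr("=+-*/();", *src) != NULL) { len = 1; code = *src; }
        else return -1;                       /* lexical error */
        if (count + 1 >= cap) return -1;      /* keep room for the terminator */
        out[count++] = code;
        src += len;
    }
    out[count] = '\0';
    return (int)count;
}
```

Scanning `"z = x + y;"` yields the six codes `I=I+I;`, while `"z =@ x"` is rejected because `@` matches none of the token patterns.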

  21-28. Regular Expressions • Let $\Sigma = \{a, b\}$. • $\varepsilon$ is a regular expression • $L_\varepsilon = \{\varepsilon\}$ • $\emptyset$ is a regular expression • $L_\emptyset = \{\}$ • $a$ is a regular expression • $L_a = \{a\}$ • $R|S$ is a regular expression if $R$ and $S$ are • $L_{R|S} = L_R \cup L_S$ (diagrams: a single edge labeled a for $a$; for $R|S$, the automata for R and S placed in parallel and joined by ε-edges)

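  29. Regular Expressions • Let $\Sigma = \{a, b\}$. • $\varepsilon$ is a regular expression • $\emptyset$ is a regular expression • $a$ is a regular expression • $R|S$ is a regular expression if $R$ and $S$ are • $RS$ is a regular expression if $R$ and $S$ are • $L_{RS} = \{uv \mid u \in L_R \land v \in L_S\}$ (diagram: the automata for R and S connected in sequence by an ε-edge)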

  30. Regular Expressions • Let $\Sigma = \{a, b\}$. • $\varepsilon$ is a regular expression • $\emptyset$ is a regular expression • $a$ is a regular expression • $R|S$ is a regular expression if $R$ and $S$ are • $RS$ is a regular expression if $R$ and $S$ are • $R^*$ is a regular expression if $R$ is • $L_{R^*} = \bigcup_{i \ge 0} L_R^i$ (diagram: the automaton for R wrapped in a loop of ε-edges)

  31. Regular Expressions • The language described by a regular expression can be accepted by an FA: RE → ε-NFA → NFA → DFA • A regular grammar can always be described using a regular expression: RG → RE
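The NFA → DFA step is commonly done with the subset construction, in which each DFA state is a set of NFA states. As a small illustration, the sketch below applies it to the NFA from slide 17, which has no ε-edges. The names are hypothetical, and for simplicity the table is filled in for all four subsets rather than only the subsets reachable from {s0}.

```c
/* A set of NFA states as a bitmask: bit i set means s_i is in the set. */
typedef unsigned StateSet;

enum { NFA_STATES = 2, NUM_SYMBOLS = 2, NUM_SUBSETS = 1 << NFA_STATES };

/* NFA of slide 17; symbol 0 = 'a', symbol 1 = 'b' */
static const StateSet nfa_delta[NFA_STATES][NUM_SYMBOLS] = {
    /* s0 */ {0x3, 0x2},   /* a -> {s0,s1}, b -> {s1}    */
    /* s1 */ {0x0, 0x3},   /* a -> {},      b -> {s0,s1} */
};

/* Subset construction: the DFA state for a set T moves, on symbol c,
 * to the union of nfa_delta[p][c] over all p in T.
 * DFA states are indexed by their bitmask, 0 .. NUM_SUBSETS-1. */
void build_dfa(StateSet dfa_delta[NUM_SUBSETS][NUM_SYMBOLS]) {
    for (StateSet set = 0; set < (StateSet)NUM_SUBSETS; set++) {
        for (int c = 0; c < NUM_SYMBOLS; c++) {
            StateSet next = 0;
            for (int p = 0; p < NFA_STATES; p++)
                if (set & (1u << p)) next |= nfa_delta[p][c];
            dfa_delta[set][c] = next;
        }
    }
}
```

After `build_dfa(d)`, the entry `d[0x1][0]` is `0x3`: from the DFA state {s0}, reading a leads to the DFA state {s0, s1}. The empty set (index 0) acts as a dead state that maps to itself on every symbol.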

  32. Building FAs • An FA is a directed graph • How large is the input alphabet? • How many states? • How fast must it run? • How to get the lowest constant factor? • How to minimize space? • Representations: matrix; array of lists; hashtable; switch statement (for simplicity, we recommend this method in the assignment)
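To illustrate the switch-statement representation, the sketch below hard-codes a DFA for the regular expression 0*1*2* from the earlier examples. The state names and the functions `step` and `accepts_012` are hypothetical; the DFA itself was worked out by hand for this example and is not taken from the slides.

```c
#include <stdbool.h>

/* One state per "phase" of 0*1*2*, plus a dead state for rejected input. */
enum state { IN_ZEROS, IN_ONES, IN_TWOS, DEAD };

/* The transition function, written as nested switch statements. */
static enum state step(enum state s, char c) {
    switch (s) {
    case IN_ZEROS:
        switch (c) {
        case '0': return IN_ZEROS;
        case '1': return IN_ONES;
        case '2': return IN_TWOS;
        default:  return DEAD;
        }
    case IN_ONES:
        switch (c) {
        case '1': return IN_ONES;
        case '2': return IN_TWOS;
        default:  return DEAD;
        }
    case IN_TWOS:
        return c == '2' ? IN_TWOS : DEAD;
    case DEAD:
        return DEAD;
    }
    return DEAD;   /* not reached; keeps compilers quiet */
}

/* Every state except DEAD is accepting, since each of 0*, 1*, 2* may be empty. */
bool accepts_012(const char *w) {
    enum state s = IN_ZEROS;
    for (; *w != '\0'; w++) s = step(s, *w);
    return s != DEAD;
}
```

Compared with a matrix, this form needs no table and reads close to the state diagram, at the cost of more code when the number of states or symbols grows.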

  33. Lex -- Automatic Lexer Generation Tools

  34. History • Lexical analysis was once a performance bottleneck • certainly not true today! • As a result, early research investigated methods for efficient lexical analysis • While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use

  35. History: A long-standing Goal • In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (aka compiler-compiler) (diagram: declarative compiler specification → compiler)

  36. History: Unix and C • In the late 1960s and early 1970s at Bell Labs, Ritchie, Thompson, and others were developing Unix • A key part of this project was the development of C and a compiler for it • Johnson et al., in 1968, proposed the use of finite state machines for lexical analysis [CACM 11(12), 1968] • Lex, written at Bell Labs by Lesk and Schmidt in 1975, realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers

  37. The Lex tool (diagram: lexical analyzer specification → Lex → fast lexical analyzer) • The original Lex generated lexers written in C. Today every major language has its own lex tool(s): • flex, sml-lex, ocamllex, JLex, JFlex, C#Flex, …
