Lexical Analysis Part 1

Lexical Analysis Part 1 Professor Yihjia Tsai Tamkang University

Definitions • lex·i·cal : 1. Of or relating to the vocabulary, words, or morphemes of a language. 2. Of or relating to lexicography or a lexicon. • lex·i·con: 1. A dictionary. 2. A stock of terms used in a particular profession, subject, or style; a vocabulary: the lexicon of surrealist art. 3. Linguistics. The morphemes of a language considered as a group.

Lexical Analyzer • The lexical analyzer takes a stream of characters and produces a stream of names, keywords, and punctuation marks; it discards white space and comments between the tokens. • A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language.

Lexical Analysis: What’s to come • Programs could be parsed directly from characters, and parse trees would then go down to the character level • That approach is machine specific, obscures parsing, and is cumbersome • Lexical analysis is a firewall between program representation and parsing actions • A prior lexical analysis phase obtains tokens consisting of a type (ID) and a value (the lexeme matched) • In principle: simple transition diagrams (finite state automata) characterize each of the “things” that can be recognized • In practice: a program combines the multiple automata definitions into an efficient state machine

Lexical Phase • Simple (non-recursive) • Efficient (special purpose code) • Portable (ignore character-set and architecture differences) • Use JavaCC, lex, flex, etc. • Used in practice with Bison/Yacc, etc.

Lexical Processing • Token: terminal symbols in a grammar. At the lexical level this is a symbol constant, and in “print” is represented in bold • Pattern: set of matching strings. For a keyword it is a constant. For a variable or value it can be represented by a regular expression • Lexeme: character sequence matched by an instance of the token

Lexical Processing • Token attributes: pointer to a symbol-table entry, may include the lexeme, scope information, etc. • Languages may have special rules (e.g., PL/1 does not have “reserved words” and Fortran allows spaces in variable names; both are obscure design choices)

Lexical Analysis – sequences • Expression • base * base - 0x4 * height * width • Token sequence • name:base operator:times name:base operator:minus hexConstant:4 operator:times name:height operator:times name:width • Lexical phase returns token and value (yylval, yytext, etc.)
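The token sequence above can be reproduced with a small hand-written scanner. This is only a sketch in Python using the standard `re` module; the token names follow the slide, while the patterns, the `OPERATOR_NAMES` table, and the `//` comment syntax are assumptions made for illustration.

```python
import re

# Illustrative patterns (assumptions, not taken from the slides).
# "skip" comes first so that "//" is read as a comment, not as two operators.
TOKEN_SPEC = [
    ("skip",        r"\s+|//[^\n]*"),            # white space and line comments
    ("hexConstant", r"0[xX][0-9a-fA-F]+"),
    ("name",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator",    r"[*\-+/]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))
OPERATOR_NAMES = {"*": "times", "-": "minus", "+": "plus", "/": "divide"}

def tokenize(text):
    """Return a list of (token, value) pairs, discarding skipped text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue                              # not a token: ignore it
        if kind == "operator":
            tokens.append((kind, OPERATOR_NAMES[lexeme]))
        elif kind == "hexConstant":
            tokens.append((kind, int(lexeme, 16)))  # value of the lexeme
        else:
            tokens.append((kind, lexeme))
    return tokens

for token in tokenize("base * base - 0x4 * height * width // area"):
    print(token)
```

Each pair plays the role of the token type and its value; a generated scanner would expose the same information through variables such as yylval and yytext.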

Tokens • Token attributes: pointer to a symbol-table entry, may include the lexeme, scope information, etc. • Formal specification of tokens by regular expressions: define the alphabet, strings, and languages

Regular Expression Definitions • A regular expression (abbreviated as regexp, regex, or regxp, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings, according to certain syntax rules. • “Regular expression” is the term for a codified method of searching, invented (or defined) by the American mathematician Stephen Kleene. • (1) A mechanism to select specific strings from a set of character strings. (2) A set of characters, metacharacters, and operators that define a string or group of strings in a search pattern. (3) A string containing wildcard characters and operations that define a set of one or more possible strings.

Regular Expression Notation • Σ: alphabet, a set of symbols to be used in the language • a: an ordinary letter from our alphabet, a ∈ Σ • ε: the empty string • r1 | r2: choosing from r1 or r2 • r1r2: concatenation of r1 and r2 • r*: zero or more times (Kleene closure) • r+: one or more times • r?: zero or one occurrence • [a-zA-Z]: character class (choice) • . : period stands for any single character except newline
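Most of this notation carries over directly to practical regex libraries. As a small check, the sketch below tries each operator with Python's `re` module; the helper name `matches` and the sample strings are assumptions chosen for illustration.

```python
import re

def matches(pattern, s):
    """True when the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None

examples = [
    (r"a|b",      "b",  True),    # r1 | r2: choice
    (r"ab",       "ab", True),    # r1r2: concatenation
    (r"a*",       "",   True),    # r*: zero or more, so ε is included
    (r"a+",       "",   False),   # r+: one or more, so ε is excluded
    (r"ab?",      "a",  True),    # r?: zero or one occurrence of b
    (r"[a-zA-Z]", "q",  True),    # character class
    (r".",        "\n", False),   # period does not match newline
]

for pattern, s, expected in examples:
    assert matches(pattern, s) == expected
```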

Semantics of Regular Expressions • L(ε) = {ε} • L(a) = {a} for all a in Σ • L(r1 | r2) = L(r1) ∪ L(r2) • L(r1 r2) = {xy | x in L(r1), y in L(r2)} • L(r*) = {ε} ∪ L(r) ∪ {x1x2 | x1, x2 in L(r)} ∪ … ∪ {x1…xn | x1, …, xn in L(r)} ∪ …
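These set definitions can be evaluated directly if the languages are cut off at a maximum string length. The following is a sketch under that assumption: the tuple encoding of expressions and the function name `lang` are hypothetical, and each symbol is taken to be a single character.

```python
def lang(r, max_len):
    """Strings of length <= max_len in L(r).

    r is a nested tuple: ("eps",), ("sym", a), ("alt", r1, r2),
    ("cat", r1, r2) or ("star", r1).
    """
    kind = r[0]
    if kind == "eps":
        return {""}                                  # L(ε) = {ε}
    if kind == "sym":
        return {r[1]} if max_len >= 1 else set()     # L(a) = {a}
    if kind == "alt":                                # union
        return lang(r[1], max_len) | lang(r[2], max_len)
    if kind == "cat":                                # all concatenations xy
        left, right = lang(r[1], max_len), lang(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":                               # ε, then repeated concatenation
        base = lang(r[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node {kind!r}")

a, b = ("sym", "a"), ("sym", "b")
print(sorted(lang(("star", ("alt", a, b)), 2)))
```

The cutoff is needed only because L(r*) is infinite whenever L(r) contains a non-empty string.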

For Homework • Suppose Σ is {a, b}. What is the regular expression for: • All strings beginning and ending in a? • All strings with an odd number of a’s? • All strings without two consecutive a’s? • All strings with an odd number of b’s followed by an even number of a’s? • What’s the description for a Java floating point number? • What’s the description of a variable name in Java?

Why we care about Regular Expressions • For every regular expression, there is a deterministic finite-state machine that defines the same language, and vice versa • [Figure: Lexical Specification of Tokens → Regular expressions → NFA → DFA → Table-driven Implementation of DFA]

Finite State Machine Definition • A model of computation consisting of a set of states, a start state, an input alphabet, and a transition function that maps input symbols and current states to a next state. Computation begins in the start state with an input string. It changes to new states depending on the transition function. There are many variants, for instance, machines having actions (outputs) associated with transitions (Mealy machine) or states (Moore machine), multiple start states, transitions conditioned on no input symbol (a null) or more than one transition for a given symbol and state (nondeterministic finite state machine), one or more states designated as accepting states (recognizer), etc.
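A deterministic recognizer of this kind fits in a few lines once the transition function is stored as a table. The sketch below is one possible encoding, not a prescribed one: the state names `q0` and `q1` and the chosen language (binary strings ending in 1) are assumptions for illustration.

```python
# Hypothetical DFA accepting binary strings that end in 1.
START = "q0"
ACCEPTING = {"q1"}
TRANSITIONS = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q0", ("q1", "1"): "q1",
}

def accepts(text):
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = START                       # computation begins in the start state
    for symbol in text:
        key = (state, symbol)
        if key not in TRANSITIONS:      # symbol outside the input alphabet
            return False
        state = TRANSITIONS[key]        # exactly one next state per (state, symbol)
    return state in ACCEPTING

print(accepts("0101"), accepts("10"))
```

Because there is exactly one next state per pair, the running time is proportional to the length of the input.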

Regular Expressions • Automaton is a good “visual” aid • but is not suitable as a specification (its textual description is too clumsy) • However regular expressions are a suitable specification • a compact way to define a language that can be accepted by an automaton.

RegExp Use and Construction • Used as the input to a scanner generator like lex or flex or JavaCC • define each token, and also • define white-space, comments, etc. • these do not correspond to tokens, but must be recognized and ignored. • An NFA can be constructed from a RegExp via Thompson’s Construction

Thompson’s Construction • There are building blocks for each regular expression operator • More complex RegExps are constructed by composing smaller building blocks • Assumes that the NFAs at each step of the construction will have a single accepting state

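Regular Expressions to NFA (1) • For each kind of regexp, define an NFA • Notation: NFA for regexp M [Figure: box labeled M with a start state and an accepting state] • For ε: a start state with an ε-transition to the accepting state • For input a: a start state with a transition on a to the accepting state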

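Regular Expressions to NFA (2) • For A B: the NFA for A, followed by an ε-transition into the NFA for B • For A | B: a new start state with ε-transitions into the NFAs for A and B, each followed by an ε-transition to a new accepting state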

Regular Expressions to NFA (3) • For A*: [Figure: the NFA for A with added ε-transitions that allow A to be skipped or repeated]
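The building blocks can be written as small functions that each return an NFA fragment with one start state and one accepting state. This is a sketch of one common textbook variant of Thompson's construction; the exact edge layout of the slide figures is not recoverable, so the layout here, along with the names `Fragment`, `symbol`, `concat`, `alt`, and `star`, is an assumption.

```python
import itertools

EPS = None                      # label used for ε-transitions (a representation choice)
_ids = itertools.count()        # source of fresh state numbers

class Fragment:
    """An NFA piece with a single start state and a single accepting state."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def _new_state():
    return next(_ids)

def symbol(a):
    """NFA for a single input symbol a (or for ε when a is EPS)."""
    s, f = _new_state(), _new_state()
    return Fragment(s, f, [(s, a, f)])

def concat(A, B):
    """NFA for A B: link A's accepting state to B's start with an ε edge."""
    return Fragment(A.start, B.accept,
                    A.edges + B.edges + [(A.accept, EPS, B.start)])

def alt(A, B):
    """NFA for A | B: new start and accepting states joined by ε edges."""
    s, f = _new_state(), _new_state()
    return Fragment(s, f, A.edges + B.edges + [
        (s, EPS, A.start), (s, EPS, B.start),
        (A.accept, EPS, f), (B.accept, EPS, f)])

def star(A):
    """NFA for A*: ε edges allow skipping A or looping back into it."""
    s, f = _new_state(), _new_state()
    return Fragment(s, f, A.edges + [
        (s, EPS, A.start), (s, EPS, f),
        (A.accept, EPS, A.start), (A.accept, EPS, f)])

# Compose the blocks for (1|0)*1.
nfa = concat(star(alt(symbol("1"), symbol("0"))), symbol("1"))
states = {s for (s, _, _) in nfa.edges} | {t for (_, _, t) in nfa.edges}
print(len(states), "states,", len(nfa.edges), "edges")
```

Every operator adds at most two states, which is where the bound of twice the number of symbols and operators comes from.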

Others • What would be the representation for A+? • What would be the representation for A?? • What about for [a-z]?

Example of RegExp -> NFA conversion • Consider the regular expression (1|0)*1 • The NFA is [Figure: NFA with states A through J, ε-transitions, and edges labeled 1 and 0]

More Homework Problems • What is the NFA for the following RE? (a(b+c))* a • What is the NFA for the following RE? ((a|b)*c) | (a b c*)

Lexical Analyzer • Can be programmed in a high-level language. • Can be generated using tools like LEX/Flex • Integrate these tools with C/C++ or Java code • For Java there are other tools, for example JFlex

How can a tool like LEX or JavaCC work? • Translate regular expressions to Non-deterministic Finite Automata (NFA) • An easier form to express than the DFA • Automata theory tells us how to optimize • Run the automata • Simulate the NFA, or • Translate NFA to DFA: a new DFA where each state corresponds to a set of NFA states (see pages 28-29 in Appel for set construction) • Have the DFA move between states in simulation of the NFA’s states

Non-deterministic FA • An NFA allows zero, one, or MORE transitions from a state on the same input symbol • Easier to express complex patterns as an NFA • Harder to mechanically simulate an NFA: which transition do we take on an input symbol? (simulate all of them, then check whether any path accepted) • DFA and NFA are functionally equivalent.
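Simulating all transitions at once amounts to tracking the set of states the NFA could currently be in. The sketch below does this for an NFA without null moves; the machine itself (strings over {0,1} ending in 01) and the names `NFA` and `nfa_accepts` are assumptions for illustration.

```python
# Hypothetical NFA: state q0 has two possible moves on "0",
# and q1 and q2 have no move at all on some symbols.
NFA = {
    ("q0", "0"): {"q0", "q1"},
    ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}
START = "q0"
ACCEPTING = {"q2"}

def nfa_accepts(text):
    """Follow every possible transition in parallel."""
    current = {START}
    for symbol in text:
        # Union of the targets of every state we might be in.
        current = {t for s in current for t in NFA.get((s, symbol), set())}
    # Accept if at least one of the simulated paths ended in an accepting state.
    return bool(current & ACCEPTING)

print(nfa_accepts("1101"), nfa_accepts("010"))
```

The per-symbol cost grows with the number of NFA states, which is the time penalty paid for the smaller machine.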

NFA with null moves • The model of NFA can be extended to include transitions on <null> input. • Change the state without reading any symbol from the input stream. • ε-closure(q): set of all states reachable from q without reading any input symbol (following the null edges)

eClosure Operator • The eClosure operator is defined as eClosure(s) = { s } ∪ states reachable from s using ε transitions. • Example: eClosure(1) = {1,3} [Figure: NFA with start state 1, states 1 through 5, an ε edge, and edges labeled a, b, and a/b]
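The operator is a plain graph reachability search over the null edges. In the sketch below the edge table is hypothetical: it contains only the single ε edge from state 1 to state 3 that the example result implies, since the rest of the figure is not recoverable.

```python
def eclosure(states, eps_edges):
    """All states reachable from `states` using zero or more ε-transitions."""
    closure = set(states)           # every state is in its own closure
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps_edges.get(s, ()):
            if t not in closure:    # visit each state only once
                closure.add(t)
                stack.append(t)
    return closure

# Assumed ε edges: 1 --ε--> 3 and nothing else.
EPS_EDGES = {1: {3}}
print(eclosure({1}, EPS_EDGES))

# Closures follow chains of ε edges, not just single steps.
CHAIN = {1: {2}, 2: {3}}
print(eclosure({1}, CHAIN))
```

The same function accepts a set of states, which is the form needed when converting an NFA to a DFA.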

RE to FA • If we write the expression as an RE (easy for people), how do we turn it into an FA (something a machine can simulate)? • Use Thompson’s Construction • At most twice as many states as there are symbols and operators in the regular expression. • Results in an NFA (needs a non-deterministic computer to run most efficiently, hmm…)

NFA to DFA • Build “super states” in a DFA where each “super state” represents the set of NFA states that the NFA could be in after reading the input so far • ε-closure(q): states that can be arrived at from q with just null transitions • move(S, a): states that can be reached from states in S on scanning a symbol a (from the input) • ε-closure(S): states that can be reached with ε transitions from states in S

NFA to DFA (cont.) • Subset Construction (alg 3.2)
    FAStates = { ε-closure(q0) }, initially unmarked
    while (some S in FAStates is unmarked) {
        mark S
        for each a in alphabet {
            T = ε-closure( move(S, a) );
            if (T ∉ FAStates) FAStates.include( T );   // T is added unmarked
            FATran[S, a] = T;
        }
    }
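The algorithm translates almost line for line into runnable code. What follows is a sketch in Python rather than the textbook's own formulation: the dictionary encoding of the NFA, the state names `s`, `p`, `q`, and the function names are assumptions, and the sample NFA accepts binary strings ending in 1.

```python
EPS = None  # label for ε-transitions

def eclosure(states, nfa):
    """States reachable from `states` using only ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)       # frozenset so it can be a dictionary key

def move(states, a, nfa):
    """States reachable from `states` by reading the symbol a."""
    return {t for s in states for t in nfa.get((s, a), ())}

def subset_construction(nfa, start, alphabet):
    start_set = eclosure({start}, nfa)
    dfa_states = {start_set}
    unmarked = [start_set]
    dfa_tran = {}
    while unmarked:
        S = unmarked.pop()          # popping S from the worklist marks it
        for a in alphabet:
            T = eclosure(move(S, a, nfa), nfa)
            if T not in dfa_states: # an empty T would act as a dead state
                dfa_states.add(T)
                unmarked.append(T)
            dfa_tran[(S, a)] = T
    return start_set, dfa_states, dfa_tran

# Hypothetical NFA: s --ε--> p, p loops on 0 and 1, p --1--> q (accepting).
NFA = {("s", EPS): {"p"}, ("p", "0"): {"p"}, ("p", "1"): {"p", "q"}}
start_set, dfa_states, dfa_tran = subset_construction(NFA, "s", ["0", "1"])
for (S, a), T in sorted(dfa_tran.items(), key=lambda kv: (sorted(kv[0][0]), kv[0][1])):
    print(sorted(S), a, "->", sorted(T))
```

A DFA state is accepting when it contains at least one accepting NFA state, here any set that includes `q`.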

DFA vs. NFA • NFA is smaller, O(|r|) space, but needs more time for simulation, O(|r|*|x|) time, even with the nice properties of Thompson’s construction • DFA is faster, O(|x|) time, but is not space efficient, O(2^|r|) space

NFA to DFA • What is the difference between the two? • Is there a single DFA for a corresponding NFA? • Why do we want to do this anyway?

Subset Construction for NFA -> DFA • (1) Compute A = eClosure(start) • (2) Compute the set of states reachable from A on transition a, call this new set S’ • (3) Compute eClosure(S’): this is the new state; label it with the next available label • (4) Continue for all possible transitions from the current state, for all applicable symbols of the alphabet • (5) Repeat steps 2-4 for each new state

Example: a c*b [Figure: NFA with states 1 through 6, ε-transitions, and edges labeled a, b, and c]

References • Aho, A.V., R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1988. ISBN 0-201-10088-6. Chapter 3. • Appel, A., Modern Compiler Implementation in Java (2nd Ed.), Cambridge University Press, 2002. ISBN 0-521-82060-X.
