Presentation Transcript
CS375

Compilers: Lexical Analysis

4th February, 2010

Outline
  • Overview of a compiler
  • What is lexical analysis?
  • Writing a lexer
    • Specifying tokens: regular expressions
    • Converting regular expressions to NFAs and DFAs
  • Optimizations
How It Works

[Figure: program representations through the compiler front end]

  Source code (character stream): if (b == 0) a = b;
    ↓ Lexical Analysis
  Token stream: if ( b == 0 ) a = b ;
    ↓ Syntax Analysis (Parsing)
  Abstract syntax tree (AST): if node with children == (b, 0) and = (a, b)
    ↓ Semantic Analysis
  Decorated AST: the == node has type boolean and int operands b and 0; the = node assigns int b to a, an int lvalue
What is a lexical analyzer?
  • What?
    • Reads in a stream of characters and groups them into “tokens” or “lexemes”.
    • The language definition describes which tokens are valid.
  • Why?
    • Makes writing the parser much easier; the parser operates on tokens.
    • Isolates input-dependent functionality such as character codes, EOF, and newline characters.
First Step: Lexical Analysis

[Figure: the front-end pipeline again]

  Source code (character stream): if (b == 0) a = b;
    ↓ Lexical Analysis
  Token stream: if ( b == 0 ) a = b ;
    ↓ Syntax Analysis → Semantic Analysis
What should it do?

[Figure: a token description and an input string w feed a box that answers “Token? Yes/No”]
We want some way to describe tokens, and have our Lexer take that description as input and decide if a string is a token or not.

Tokens
  • Logical grouping of characters.
  • Identifiers: x y11 elsen _i00
  • Keywords: if else while break
  • Constants:
    • Integer: 2 1000 -500 5L 0x777
    • Floating-point: 2.0 0.00020 .02 1. 1e5 0.e-10
    • String: ”x” ”He said, \”Are you?\”\n”
    • Character: ’c’ ’\000’
  • Symbols: + * { } ++ < << [ ] >=
  • Whitespace (typically recognized and discarded):
    • Comment: /** don’t change this **/
    • Space: <space>
    • Format characters: <newline> <return>
Ad-hoc Lexer
  • Hand-write code to generate tokens
  • How to read identifier tokens?

    Token readIdentifier() {
      String id = "";
      while (true) {
        char c = input.read();
        if (!identifierChar(c))
          return new Token(ID, id, lineNumber);
        id = id + String(c);
      }
    }

  • Problems
    • How to start?
    • What to do with following character?
    • How to avoid quadratic complexity of repeated concatenation?
    • How to recognize keywords?
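The last two problems have standard answers: accumulate characters in a StringBuilder (linear, not quadratic) and look finished identifiers up in a keyword table. A minimal sketch, assuming hypothetical names (AdHocSketch, Kind) not taken from the slides:

```java
import java.util.Map;

// Sketch of an identifier reader: StringBuilder avoids quadratic
// concatenation, and a keyword map turns an identifier into a keyword
// token after it has been read in full.
public class AdHocSketch {
    enum Kind { ID, IF, ELSE, WHILE }

    static final Map<String, Kind> KEYWORDS =
        Map.of("if", Kind.IF, "else", Kind.ELSE, "while", Kind.WHILE);

    static boolean identifierChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Reads an identifier starting at pos; returns its kind (keyword or ID).
    static Kind readIdentifier(String input, int pos) {
        StringBuilder id = new StringBuilder();   // O(n) accumulation
        while (pos < input.length() && identifierChar(input.charAt(pos))) {
            id.append(input.charAt(pos++));
        }
        return KEYWORDS.getOrDefault(id.toString(), Kind.ID);
    }

    public static void main(String[] args) {
        System.out.println(readIdentifier("if (b == 0)", 0));  // IF
        System.out.println(readIdentifier("ifx (b)", 0));      // ID
    }
}
```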
Look-ahead Character
  • Scan text one character at a time
  • Use a look-ahead character (next) to determine what kind of token to read and when the current token ends

    char next;
    while (identifierChar(next)) {
      id = id + String(next);
      next = input.read();
    }

[Figure: input e l s e n; next points at the look-ahead character]

Ad-hoc Lexer: Top-level Loop

    class Lexer {
      InputStream s;
      char next;
      Lexer(InputStream _s) { s = _s; next = s.read(); }
      Token nextToken() {
        if (identifierFirstChar(next)) // starts with a letter
          return readIdentifier();     // is an identifier
        if (numericFirstChar(next))    // starts with a digit
          return readNumber();         // is a number
        if (next == '"') return readStringConst();
      }
    }

Problems
  • Might not know what kind of token we are going to read from seeing the first character
    • if a token begins with “i”, is it an identifier? (what about int, if?)
    • if a token begins with “2”, is it an integer constant?
    • interleaved tokenizer code is hard to write correctly, harder to maintain
    • in general, unbounded look-ahead may be needed
Problems (cont.)
  • How to specify tokens unambiguously?
  • Once specified, how to implement them in a systematic way?
  • How to implement them efficiently?
Problems (cont.)
  • For instance, consider:
    • How to describe tokens unambiguously

      2.e0 20.e-01 2.0000
      “” “x” “\\” “\”\’”

    • How to break up text into tokens

      if (x == 0) a = x<<1;
      if (x == 0) a = x<1;

    • How to tokenize efficiently
      • tokens may have similar prefixes
      • want to look at each character ~1 time
Principled Approach
  • Need a principled approach. Two options:
  • Lexer generators
    • a lexer generator (a.k.a. scanner generator) generates an efficient tokenizer automatically (e.g., lex, flex, JLex)
  • Your own lexer
    • Describe the programming language’s tokens with a set of regular expressions
    • Generate the scanning automaton from that set of regular expressions
Top-level idea…
  • Have a formal language to describe tokens.
    • Use regular expressions.
  • Have a mechanical way of converting this formal description to code.
    • Convert regular expressions to finite automata (acceptors/state machines).
  • Run the code on actual inputs.
    • Simulate the finite automaton.

An Example: Integers
  • Consider integers.
  • We can describe integers using the following grammar:
      • Num → ‘-’ Pos
      • Num → Pos
      • Pos → 0 | 1 | … | 9
      • Pos → 0 | 1 | … | 9 Pos
  • Or, in more compact notation: Num → -? [0-9]+

An Example: Integers
  • Using Num → -? [0-9]+ we can generate integers such as -12, 23, 0.
  • We can also represent this regular expression as a state machine.
    • This will be useful for simulating the regular expression.

An Example: Integers

[Figure: NFA for -?[0-9]+ with states 0–3: 0 →‘-’→ 1, 0 →ε→ 1, 1 →[0-9]→ 2, 2 →[0-9]→ 2 (loop), 2 →ε→ 3 (accepting)]

  • The non-deterministic finite automaton is as above.
  • We can verify that -123, 65, 0 are accepted by the state machine.
  • But which path to take? When to follow the ε-paths?

An Example: Integers

[Figure: equivalent DFA with states {0,1}, {1}, {2,3}: {0,1} →‘-’→ {1}; {0,1} →[0-9]→ {2,3}; {1} →[0-9]→ {2,3}; {2,3} →[0-9]→ {2,3} (accepting)]

  • The NFA can be converted to an equivalent deterministic FA, as above.
    • We shall see later how.
  • It accepts the same tokens: -123, 65, 0.

An Example: Integers
  • The deterministic finite automaton makes implementation much easier, as we shall see later.
  • So all we have to do is:
    • Express tokens as regular expressions
    • Convert the RE to an NFA
    • Convert the NFA to a DFA
    • Simulate the DFA on inputs

The larger picture…

[Figure: regular expression R describing tokens → RE→NFA conversion → NFA→DFA conversion → DFA simulation; input string w feeds the DFA simulation, which answers “Yes” if w is a valid token, “No” if not]

Language Theory Review
  • Let Σ be a finite set
    • Σ is called an alphabet
    • a ∈ Σ is called a symbol
  • Σ* is the set of all finite strings consisting of symbols from Σ
  • A subset L ⊆ Σ* is called a language
  • If L1 and L2 are languages, then L1 L2 is the concatenation of L1 and L2, i.e., the set of all pair-wise concatenations of strings from L1 and L2, respectively
Language Theory Review, ctd.
  • Let L ⊆ Σ* be a language
  • Then
    • L^0 = {ε}
    • L^(n+1) = L L^n for all n ≥ 0
  • Examples
    • if L = {a, b} then
      • L^1 = L = {a, b}
      • L^2 = {aa, ab, ba, bb}
      • L^3 = {aaa, aab, aba, abb, baa, bab, bba, bbb}
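The L^n definition is easy to check mechanically as pairwise concatenation; a small sketch in Java (the class and method names are ours):

```java
import java.util.Set;
import java.util.TreeSet;

// Pairwise concatenation of two languages, as in the definition of L1 L2.
public class LangConcat {
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> out = new TreeSet<>();     // sorted, for readable output
        for (String x : l1)
            for (String y : l2)
                out.add(x + y);
        return out;
    }

    public static void main(String[] args) {
        Set<String> l = Set.of("a", "b");
        System.out.println(concat(l, l));            // L^2 = [aa, ab, ba, bb]
        System.out.println(concat(l, concat(l, l))); // L^3: all 8 strings of length 3
    }
}
```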
Syntax of Regular Expressions
  • The set of regular expressions (RE) over alphabet Σ is defined inductively by
    • Let a ∈ Σ and R, S ∈ RE. Then:
      • a ∈ RE
      • ε ∈ RE
      • ∅ ∈ RE
      • R|S ∈ RE
      • RS ∈ RE
      • R* ∈ RE
  • In concrete syntactic form: precedence rules, parentheses, and abbreviations
Semantics of Regular Expressions
  • A regular expression R ∈ RE denotes the language L(R) ⊆ Σ*, given according to the inductive structure of R:
    • L(a) = {a}  the string “a”
    • L(ε) = {“”}  the empty string
    • L(∅) = {}  the empty set
    • L(R|S) = L(R) ∪ L(S)  alternation
    • L(RS) = L(R) L(S)  concatenation
    • L(R*) = {“”} ∪ L(R) ∪ L(R)^2 ∪ L(R)^3 ∪ …  Kleene closure
Simple Examples
  • L(R) = the “language” defined by R
    • L( abc ) = { abc }
    • L( hello|goodbye ) = {hello, goodbye}
      • | is the OR operator, so L(a|b) contains the string a and the string b.
    • L( 1(0|1)* ) = all non-zero binary numerals beginning with 1
      • Kleene star: zero or more repetitions of the enclosed expression.
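These example languages can be tried directly with java.util.regex, whose syntax covers the operators defined here (String.matches tests the whole string):

```java
// The example REs above, checked with Java's built-in regex engine.
public class ReDemo {
    public static void main(String[] args) {
        System.out.println("hello".matches("hello|goodbye")); // true: alternation
        System.out.println("1101".matches("1(0|1)*"));        // true: 1 then any bits
        System.out.println("0101".matches("1(0|1)*"));        // false: must start with 1
        System.out.println("1".matches("1(0|1)*"));           // true: star allows zero reps
    }
}
```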
Convenient RE Shorthand

  R+        one or more strings from L(R): R(R*)
  R?        optional R: (R|ε)
  [abce]    one of the listed characters: (a|b|c|e)
  [a-z]     one character from this range: (a|b|c|d|e|…|y|z)
  [^ab]     anything but one of the listed characters
  [^a-z]    one character not from this range
  ”abc”     the string “abc”
  \(        the character ’(’
  . . .
  id=R      named non-recursive regular expressions

More Examples

  Regular Expression R              Strings in L(R)
  digit = [0-9]                     “0” “1” “2” “3” …
  posint = digit+                   “8” “412” …
  int = -? posint                   “-42” “1024” …
  real = int (. posint)?            “-1.56” “12” “1.0”
       = (-|ε)([0-9]+)((. [0-9]+)|ε)
  [a-zA-Z_][a-zA-Z0-9_]*            C identifiers
  else                              the keyword “else”
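Two rows of the table can be sanity-checked with java.util.regex (the class name is ours):

```java
// Checking the int and C-identifier patterns from the table above.
public class ShorthandDemo {
    static final String IDENT = "[a-zA-Z_][a-zA-Z0-9_]*";
    static final String INT   = "-?[0-9]+";

    public static void main(String[] args) {
        System.out.println("_i00".matches(IDENT)); // true
        System.out.println("9x".matches(IDENT));   // false: cannot start with a digit
        System.out.println("-42".matches(INT));    // true
        System.out.println("-".matches(INT));      // false: needs at least one digit
    }
}
```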

Historical Anomalies
  • PL/I
    • Keywords not reserved
      • IF IF THEN THEN ELSE ELSE;
  • FORTRAN
    • Whitespace stripped out prior to scanning, so these differ only at one character:
      • DO 123 I = 1 . 2  (an assignment to the variable DO123I)
      • DO 123 I = 1 , 2  (a DO loop)
  • By and large, modern language design intentionally makes scanning easier
Writing a Lexer
  • Regular expressions are very useful for describing languages (tokens).
    • Use an automatic lexer generator (lex, flex, JLex) to generate a lexer from the language specification, or
    • Have a systematic way of writing a lexer from a specification such as regular expressions.
How To Use Regular Expressions
  • Given R ∈ RE and an input string w, we need a mechanism to determine if w ∈ L(R)
  • Such a mechanism is called an acceptor

[Figure: R ∈ RE (describing a token family) and input string w (from the program) feed a box “?” that answers “Yes” if w is a token, “No” if not]

Acceptors
  • An acceptor determines if an input string belongs to a language L
  • Finite automata are acceptors for languages described by regular expressions

[Figure: a description of language L → Finite Automaton (acceptor) ← input string w; answers “Yes” if w ∈ L, “No” if w ∉ L]

Finite Automata
  • Informally, a finite automaton consists of:
    • A finite set of states
    • Transitions between states
    • An initial state (start state)
    • A set of final states (accepting states)
  • Two kinds of finite automata:
    • Deterministic finite automata (DFA): the transition from each state is uniquely determined by the current input character
    • Non-deterministic finite automata (NFA): there may be multiple possible choices, and some “spontaneous” transitions without input
DFA Example
  • A finite automaton that accepts the strings in the language denoted by regular expression ab*a
  • Can be represented as a graph or a transition table.
    • As a graph:
      • Read a symbol
      • Follow the outgoing edge

[Figure: states 0, 1, 2; 0 →a→ 1, 1 →b→ 1 (loop), 1 →a→ 2 (accepting)]

DFA Example (cont.)
  • Representing the FA as a transition table makes the implementation very easy.
  • The above FA can be represented as:

      state   a       b
      0       1       Error
      1       2       1
      2       Error   Error

  • The current state and current symbol determine the next state.
  • Continue until:
    • the error state is reached, or
    • end of input.
Simulating the DFA
  • Determine if the DFA accepts an input string

    transition_table[NumSTATES][NumCHARS]
    accept_states[NumSTATES]
    state = INITIAL

    while (state != Error) {
      c = input.read();
      if (c == EOF) break;
      state = trans_table[state][c];
    }
    return (state != Error) && accept_states[state];
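A runnable Java version of this table-driven simulation for the ab*a DFA, with -1 standing in for the Error state (the class name is ours):

```java
// Table-driven simulation of the DFA for ab*a (states 0, 1, 2).
public class DfaSim {
    static final int ERROR = -1;
    // rows: states 0..2; columns: symbol a (index 0) and b (index 1)
    static final int[][] TRANS = {
        { 1, ERROR },    // state 0: a -> 1
        { 2, 1 },        // state 1: a -> 2, b -> 1 (loop on b)
        { ERROR, ERROR } // state 2: accepting, no outgoing moves
    };
    static final boolean[] ACCEPT = { false, false, true };

    static boolean accepts(String w) {
        int state = 0;
        for (char c : w.toCharArray()) {
            if (state == ERROR) return false;
            if (c != 'a' && c != 'b') return false; // not in the alphabet
            state = TRANS[state][c == 'a' ? 0 : 1];
        }
        return state != ERROR && ACCEPT[state];
    }

    public static void main(String[] args) {
        System.out.println(accepts("abba")); // true
        System.out.println(accepts("aa"));   // true
        System.out.println(accepts("ab"));   // false: ends before the final a
    }
}
```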

RE → Finite automaton?
  • Can we build a finite automaton for every regular expression?
  • Strategy: build the finite automaton inductively, based on the definition of regular expressions

[Figure: base cases: an automaton for ε (a single accepting state) and an automaton for a symbol a (two states joined by an a-edge)]

RE → Finite automaton?
  • Alternation R|S: branch into either the R automaton or the S automaton
  • Concatenation RS: the R automaton followed by the S automaton
  • Recall that “?” marks an optional move.

[Figure: R|S as a start state with optional moves into the R automaton and the S automaton; RS as the R automaton joined to the S automaton]
NFA Definition
  • A non-deterministic finite automaton (NFA) is an automaton where:
    • There may be ε-transitions (transitions that do not consume input characters)
    • There may be multiple transitions from the same state on the same input character

[Figure: example NFA with ε-edges and two a-edges leaving the same state]
RE → NFA intuition

    -?[0-9]+

[Figure: NFA taking either a ‘-’ edge or an ε-edge, then [0-9] edges]

  • When to take the ε-path?

NFA construction (Thompson)
  • An NFA only needs one stop state (why?)
  • Canonical NFA form: a single start state and a single accepting state
  • Use this canonical form to inductively construct NFAs for regular expressions
Inductive NFA Construction

[Figure: Thompson construction.
  R|S: a new start state with ε-edges into the R and S automata, and ε-edges from their accepting states into a new accepting state.
  RS: the accepting state of R connected by an ε-edge to the start state of S.
  R*: a new start state with an ε-edge into R and an ε-edge bypassing it to a new accepting state, plus an ε-edge from R’s accepting state back around.]

DFA vs NFA
  • DFA: the action of the automaton on each input symbol is fully determined
    • obvious table-driven implementation
  • NFA:
    • the automaton may have a choice on each step
    • the automaton accepts a string if there is any way to make choices that arrives at an accepting state
    • every path from the start state to an accept state spells a string accepted by the automaton
    • not obvious how to implement!
Simulating an NFA
  • Problem: how to execute an NFA?

    “strings accepted are those for which there is some corresponding path from the start state to an accept state”

  • Solution: search all paths in the graph consistent with the string, in parallel
    • Keep track of the subset of NFA states that the search could be in after seeing a string prefix
    • “Multiple fingers” pointing into the graph
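The multiple-fingers idea in runnable form, hard-coding the -?[0-9]+ NFA from the integer example (states 0–3; ε: 0→1 and 2→3; ‘-’: 0→1; digit: 1→2 and 2→2; accepting state 3). The helper names are ours:

```java
import java.util.HashSet;
import java.util.Set;

// Simulate the -?[0-9]+ NFA by tracking the set of reachable states.
public class NfaSim {
    // ε-closure for this fixed NFA: ε-edges are 0->1 and 2->3.
    static Set<Integer> closure(Set<Integer> s) {
        Set<Integer> t = new HashSet<>(s);
        boolean changed = true;
        while (changed) {
            changed = false;
            if (t.contains(0) && t.add(1)) changed = true;
            if (t.contains(2) && t.add(3)) changed = true;
        }
        return t;
    }

    // All states reachable on character c, followed by ε-closure.
    static Set<Integer> step(Set<Integer> s, char c) {
        Set<Integer> t = new HashSet<>();
        for (int q : s) {
            if (q == 0 && c == '-') t.add(1);
            if ((q == 1 || q == 2) && Character.isDigit(c)) t.add(2);
        }
        return closure(t);
    }

    static boolean accepts(String w) {
        Set<Integer> s = closure(Set.of(0)); // start: {0,1}
        for (char c : w.toCharArray()) s = step(s, c);
        return s.contains(3);                // 3 is the accepting state
    }

    public static void main(String[] args) {
        System.out.println(accepts("-23")); // true: {0,1} -> {1} -> {2,3} -> {2,3}
        System.out.println(accepts("65"));  // true
        System.out.println(accepts("-"));   // false: no digit seen
    }
}
```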
Example

[Figure: the -?[0-9]+ NFA from before, states 0–3]

  • Input string: -23
  • NFA states:
    • Start: {0,1}
    • “-”: {1}
    • “2”: {2,3}
    • “3”: {2,3}
  • But this is very difficult to implement directly.
NFA → DFA conversion
  • Can convert the NFA directly to a DFA by the same approach
  • Create one DFA state for each distinct subset of NFA states that could arise
  • States: {0,1}, {1}, {2,3}
  • Called the “subset construction”

[Figure: the NFA (states 0–3) alongside the resulting DFA: {0,1} →‘-’→ {1}; {0,1} and {1} →[0-9]→ {2,3}; {2,3} →[0-9]→ {2,3}]

Algorithm
  • For a set S of states in the NFA, compute ε-closure(S) = the set of states reachable from states in S by zero or more ε-transitions

    T = S
    Repeat  T = T ∪ {s’ | s ∈ T, (s,s’) is an ε-transition}
    Until  T remains unchanged
    ε-closure(S) = T

  • For a set S of ε-closed states in the NFA, compute DFAedge(S,c) = the set of states reachable from states in S by transitions on symbol c, followed by ε-transitions

    DFAedge(S,c) = ε-closure( {s’ | s ∈ S, (s,s’) is a c-transition} )

Algorithm

    DFA-initial-state = ε-closure(NFA-initial-state)
    Worklist = { DFA-initial-state }
    While ( Worklist not empty )
      Pick state S from Worklist
      For each character c
        S’ = DFAedge(S,c)
        if (S’ not in DFA states)
          Add S’ to DFA states and worklist
        Add an edge (S, S’) labeled c in DFA
    For each DFA-state S
      If S contains an NFA-final state
        Mark S as DFA-final-state
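The two algorithms above can be sketched together in Java. The NFA is the -?[0-9]+ example, with ‘d’ standing in for any digit; the edge-map encoding is an illustrative choice, not from the slides:

```java
import java.util.*;

// Worklist subset construction for the -?[0-9]+ NFA (ε: 0->1, 2->3).
public class SubsetConstruction {
    // eps.get(s) = states reachable from s by one ε-edge
    static final Map<Integer, Set<Integer>> eps = Map.of(0, Set.of(1), 2, Set.of(3));
    // delta.get(s).get(c) = states reachable from s on character c
    static final Map<Integer, Map<Character, Set<Integer>>> delta = Map.of(
        0, Map.of('-', Set.of(1)),
        1, Map.of('d', Set.of(2)),   // 'd' stands for any digit
        2, Map.of('d', Set.of(2)));

    // ε-closure(S): grow S along ε-edges until nothing changes.
    static Set<Integer> closure(Set<Integer> s) {
        Set<Integer> t = new TreeSet<>(s);
        Deque<Integer> work = new ArrayDeque<>(s);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (int r : eps.getOrDefault(q, Set.of()))
                if (t.add(r)) work.push(r);
        }
        return t;
    }

    // DFAedge(S,c): move on c, then ε-close.
    static Set<Integer> dfaEdge(Set<Integer> s, char c) {
        Set<Integer> t = new TreeSet<>();
        for (int q : s)
            t.addAll(delta.getOrDefault(q, Map.of()).getOrDefault(c, Set.of()));
        return closure(t);
    }

    public static void main(String[] args) {
        Set<Set<Integer>> dfaStates = new LinkedHashSet<>();
        Deque<Set<Integer>> worklist = new ArrayDeque<>();
        Set<Integer> start = closure(Set.of(0));
        dfaStates.add(start);
        worklist.push(start);
        while (!worklist.isEmpty()) {
            Set<Integer> s = worklist.pop();
            for (char c : new char[]{'-', 'd'}) {
                Set<Integer> s2 = dfaEdge(s, c);
                if (!s2.isEmpty() && dfaStates.add(s2)) worklist.push(s2);
            }
        }
        System.out.println(dfaStates); // [[0, 1], [1], [2, 3]]
    }
}
```

The three DFA states printed are exactly the subsets {0,1}, {1}, {2,3} from the slides.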

Putting the Pieces Together

[Figure: regular expression R → RE→NFA conversion → NFA→DFA conversion → DFA simulation; input string w feeds the DFA simulation, which answers “Yes” if w ∈ L(R), “No” if w ∉ L(R)]

State Minimization
  • State minimization is an optimization that converts a DFA to another DFA that recognizes the same language and has a minimum number of states.
    • Divide all states into “equivalence” groups.
      • Two states p and q are equivalent if, for all symbols, the outgoing edges either both lead to Error or lead to the same destination group.
      • Collapse the states of a group into a single state (instead of p and q, have a single state).
State Minimization (Equivalence)
  • More formally, all states in group Gi are equivalent iff for any two states p and q in Gi, and for every symbol σ, transition(p,σ) and transition(q,σ) are either both Error, or are states in the same group Gj (possibly Gi itself).
  • For example:

[Figure: states p, q, r in group Gi; for each symbol a, b, c, all three move to states in the same group (Gj, Gk, or Error), so p, q, r are equivalent]

State Minimization
  • Step 1. Partition the states of the original DFA into maximal-sized groups of “equivalent” states S = {G1, … , Gn}
  • Step 2. Construct the minimized DFA with one state per group Gi

[Figure: an example DFA with states 0–4 over symbols a, b, and the minimized DFA obtained by merging equivalent states]

DFA Minimization
  • Step 1. Partition the states of the original DFA into maximal-sized groups of equivalent states
    • Step 1a. Discard states not reachable from the start state
    • Step 1b. The initial partition is S = {Final, Non-final}
    • Step 1c. Repeatedly refine the partition {G1,…,Gn} while some group Gi contains states p and q such that, for some symbol σ, the transitions from p and q on σ go to different groups

[Figure: p and q in Gi; on a, p goes to Gj and q goes to Gk (or Error), with j ≠ k, so Gi must be split]
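Steps 1b/1c can be sketched as naive signature-based refinement: give each state a signature of (its current group, the groups its transitions reach) and split groups until nothing changes. The example DFA in main is hypothetical (it accepts a|b), and all names are ours:

```java
import java.util.*;

// Naive DFA minimization: refine {final, non-final} by transition signatures.
public class Minimize {
    static int[][] trans;      // trans[state][symbol] = next state, -1 = Error
    static boolean[] accept;

    static int[] refine() {
        int n = trans.length;
        int[] group = new int[n];
        for (int s = 0; s < n; s++) group[s] = accept[s] ? 1 : 0; // Step 1b
        boolean changed = true;
        while (changed) {                                         // Step 1c
            changed = false;
            Map<List<Integer>, Integer> sig2group = new HashMap<>();
            int[] next = new int[n];
            for (int s = 0; s < n; s++) {
                List<Integer> sig = new ArrayList<>();
                sig.add(group[s]);                          // current group
                for (int c = 0; c < trans[s].length; c++)   // target groups
                    sig.add(trans[s][c] < 0 ? -1 : group[trans[s][c]]);
                next[s] = sig2group.computeIfAbsent(sig, k -> sig2group.size());
            }
            if (!Arrays.equals(group, next)) { changed = true; group = next; }
        }
        return group; // states with equal entries are equivalent
    }

    public static void main(String[] args) {
        // a|b over states 0..2: 0 -a-> 1, 0 -b-> 2; 1 and 2 are accepting dead ends
        trans = new int[][]{ {1, 2}, {-1, -1}, {-1, -1} };
        accept = new boolean[]{ false, true, true };
        System.out.println(Arrays.toString(refine())); // states 1 and 2 merge
    }
}
```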

[Figure: the refinement step splits Gi into Gi (containing p) and Gi’ (containing q)]

After State Minimization
  • We have an optimized acceptor.

[Figure: regular expression R → RE→NFA → NFA→DFA → Minimize DFA → DFA simulation; input string w in, “Yes” if w ∈ L(R), “No” if w ∉ L(R)]

Lexical Analyzers vs Acceptors
  • We really need a lexer, not an acceptor.
  • Lexical analyzers use the same mechanism, but they:
    • Have multiple RE descriptions for multiple tokens
    • Output a sequence of matching tokens (or an error)
    • Always return the longest matching token
    • For multiple longest matching tokens, use rule priorities
Lexical Analyzers

[Figure: REs for all valid tokens, R1…Rn → RE→NFA → NFA→DFA → Minimize DFA; the character stream of the program feeds the DFA simulation, which outputs the token stream (and errors)]

Handling Multiple REs
  • Construct one NFA for each RE
  • Associate the final state of each NFA with the given RE
  • Combine the NFAs for all the REs into one NFA
  • Convert the NFA to a minimized DFA, associating each final DFA state with the highest-priority RE of the corresponding NFA states

[Figure: separate NFAs for keywords, whitespace, identifier, and number combined into one minimized DFA]

Using Roll Back

[Figure: combined DFA: 0 →a→ 1 →a→ 2 (accepts aa), 2 →b→ 3 →b→ 4 (accepts aabb), 0 →b→ 5 →a→ 6 (accepts ba)]

  • Consider three REs: {aa, ba, aabb} and input: aaba
  • Reach state 3 with no transition on the next character a
  • Roll the input back to its position on entering state 2 (i.e., having read aa)
  • Emit the token for aa
  • On the next call to the scanner, start in state 0 again with input ba
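The roll-back strategy in runnable form, on the combined DFA for {aa, ba, aabb} described above: remember the last accepting position, and when the DFA gets stuck, emit that token and resume after it. The table encoding and class name are ours:

```java
import java.util.ArrayList;
import java.util.List;

// Maximal-munch scanning with roll back over the {aa, ba, aabb} DFA.
public class MaxMunch {
    // states 0..6; columns: a (0), b (1); -1 = Error
    static final int[][] TRANS = {
        {1, 5}, {2, -1}, {-1, 3}, {-1, 4}, {-1, -1}, {6, -1}, {-1, -1}
    };
    // token accepted in each state (null if not an accepting state)
    static final String[] TOKEN = { null, null, "aa", null, "aabb", null, "ba" };

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int state = 0, lastAccept = -1, i = pos;
            while (i < input.length() && state != -1) {
                state = TRANS[state][input.charAt(i) == 'a' ? 0 : 1];
                i++;
                if (state != -1 && TOKEN[state] != null)
                    lastAccept = i;              // remember last accepting position
            }
            if (lastAccept < 0) throw new IllegalArgumentException("lexical error");
            tokens.add(input.substring(pos, lastAccept));
            pos = lastAccept;                    // roll back and restart in state 0
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("aaba")); // [aa, ba], as in the slide
    }
}
```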
Automatic Lexer Generators
  • Input: token specification
    • a list of regular expressions in priority order
    • an associated action for each RE (generates the appropriate kind of token, other bookkeeping)
  • Output: lexer program
    • a program that reads an input stream and breaks it up into tokens according to the REs (or reports a lexical error: “Unexpected character”)


Example: JLex (Java)

    %%
    digits = 0|[1-9][0-9]*
    letter = [A-Za-z]
    identifier = {letter}({letter}|[0-9_])*
    whitespace = [\ \t\n\r]+
    %%
    {whitespace} { /* discard */ }
    {digits}     { return new Token(INT, Integer.parseInt(yytext())); }
    "if"         { return new Token(IF, yytext()); }
    "while"      { return new Token(WHILE, yytext()); }
    {identifier} { return new Token(ID, yytext()); }

Example Output (Java)
  • A Java lexer which implements the functionality described in the language specification.
  • For instance:

    case 5: { return new Token(WHILE, yytext()); }
    case 6: break;
    case 2: { return new Token(ID, yytext()); }
    case 7: break;
    case 4: { return new Token(IF, yytext()); }
    case 8: break;
    case 1: { return new Token(INT, Integer.parseInt(yytext())); }

Start States
  • A mechanism that specifies the state in which to start the execution of the DFA
  • Declare states in the second section:

    %state STATE

  • Use states as prefixes of regular expressions in the third section:

    <STATE> regex { action }

  • Set the current state in the actions:

    yybegin(STATE)

  • There is a pre-defined initial state: YYINITIAL

Example

    %%
    %state STRING
    %%
    <YYINITIAL> "if"  { return new Token(IF, null); }
    <YYINITIAL> "\""  { yybegin(STRING); … }
    <STRING> "\""     { yybegin(YYINITIAL); … }
    <STRING> .        { … }

[Figure: two start states, YYINITIAL and STRING; a double quote switches between them, ‘.’ loops in STRING, and “if” is matched in YYINITIAL]

Summary
  • A lexical analyzer converts a text stream to tokens
  • Ad-hoc lexers are hard to get right and hard to maintain
  • For most languages, legal tokens are conveniently and precisely defined using regular expressions
  • Lexer generators generate the lexer automaton automatically from the token REs and their prioritization
Summary
  • To write your own lexer:
    • Describe the tokens using regular expressions.
    • Construct NFAs for those tokens.
      • If you have no ambiguities in the NFA, or you obtain a DFA directly from the regular expressions, you are done.
    • Construct a DFA from the NFA using the algorithm described.
    • Systematically implement the DFA using transition tables.
Reading
  • IC language spec
  • JLex manual
  • CVS manual
  • Links on the course web home page
  • “Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)”, Russ Cox, January 2007

Acknowledgement

The slides are based on similar content by Tim Teitelbaum, Cornell.