Compiler Construction

Compiler Construction, Week 2 Lecture: Lexical Analysis

Lexical Analysis [Figure: block diagram connecting Source Program, Lexical Analyzer, Parser, and Symbol Table, with edges labeled "token" and "get next token".] • “get next token” is a command sent from the parser to the lexical analyzer. • On receipt of the command, the lexical analyzer scans the input until it determines the next token, and returns it.

Other jobs of the lexical analyzer • We also want the lexer to • Strip out comments and white space from the source code. • Correlate parser errors with the source code location (the parser doesn’t know what line of the file it’s at, but the lexer does)

Tokens, patterns, and lexemes • A TOKEN is a set of strings over the source alphabet. • A PATTERN is a rule that describes that set. • A LEXEME is a sequence of characters matching that pattern. • E.g. in Pascal, for the statement const pi = 3.1416; • The substring pi is a lexeme for the token identifier

Example tokens, lexemes, patterns

Tokens • Together, the complete set of tokens forms the set of terminal symbols used in the grammar for the parser. • In most languages, the tokens fall into these categories: • Keywords • Operators • Identifiers • Constants • Literal strings • Punctuation • Usually the token is represented as an integer. • The lexer and parser just agree on which integers are used for each token.

Token attributes • If there is more than one lexeme for a token, we have to save additional information about the token. • Example: the token number matches lexemes 10 and 20. • Code generation needs the actual number, not just the token. • With each token, we associate ATTRIBUTES. Normally just a pointer into the symbol table.

Example attributes • For C source code E = M * C * C • We have token/attribute pairs <ID, ptr to symbol table entry for E> <Assign_op, NULL> <ID, ptr to symbol table entry for M> <Mult_op, NULL> <ID, ptr to symbol table entry for C> <Mult_op, NULL> <ID, ptr to symbol table entry for C>
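The token/attribute pairs above can be sketched in a few lines of Python. This is only an illustration: the SymbolTable class and its install() method are hypothetical helpers, the symbol-table "pointer" is modeled as an integer index, and the input is assumed to have its lexemes separated by spaces.

```python
# Sketch: produce <token, attribute> pairs for the statement E = M * C * C.
# Token names (ID, Assign_op, Mult_op) follow the slide; SymbolTable and
# install() are hypothetical helpers introduced for this example.

class SymbolTable:
    def __init__(self):
        self.entries = []   # one entry per distinct identifier
        self.index = {}     # lexeme -> position in entries

    def install(self, lexeme):
        """Return the index of the entry for lexeme, creating it if needed."""
        if lexeme not in self.index:
            self.index[lexeme] = len(self.entries)
            self.entries.append({"lexeme": lexeme})
        return self.index[lexeme]


def tokenize(source, table):
    pairs = []
    for lexeme in source.split():   # assumption: lexemes are space-separated
        if lexeme == "=":
            pairs.append(("Assign_op", None))
        elif lexeme == "*":
            pairs.append(("Mult_op", None))
        elif lexeme.isidentifier():
            pairs.append(("ID", table.install(lexeme)))
        else:
            raise ValueError(f"unexpected lexeme: {lexeme!r}")
    return pairs


table = SymbolTable()
pairs = tokenize("E = M * C * C", table)
for pair in pairs:
    print(pair)
```

Both occurrences of C receive the same attribute because they refer to the same symbol table entry.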

Lexical errors • When errors occur, we could just crash. • It is better to print an error message and then continue. • Possible techniques to continue on error: • Delete a character • Insert a missing character • Replace an incorrect character by a correct character • Transpose adjacent characters

Token specification • REGULAR EXPRESSIONS (REs) are the most common notation for pattern specification. • Every pattern specifies a set of strings, so an RE names a set of strings. • Definitions: • The ALPHABET (often written ∑) is the set of legal input symbols • A STRING over some alphabet ∑ is a finite sequence of symbols from ∑ • The LENGTH of string s is written |s| • The EMPTY STRING is a special 0-length string denoted ε

More definitions: strings and substrings • A PREFIX of s is formed by removing 0 or more trailing symbols of s • A SUFFIX of s is formed by removing 0 or more leading symbols of s • A SUBSTRING of s is formed by deleting a prefix and a suffix from s • A PROPER prefix, suffix, or substring is a nonempty string x that is, respectively, a prefix, suffix, or substring of s but with x ≠ s.

More definitions • A LANGUAGE is a set of strings over a fixed alphabet ∑. • Example languages: • Ø (the empty set) • { ε } • { a, aa, aaa, aaaa } • The CONCATENATION of two strings x and y is written xy • String EXPONENTIATION is written s^i, where s^0 = ε and s^i = s^(i-1)s for i > 0.

Operations on languages We often want to perform operations on sets of strings (languages). The important ones are: • The UNION of L and M: L ∪ M = { s | s is in L OR s is in M } • The CONCATENATION of L and M: LM = { st | s is in L and t is in M } • The KLEENE CLOSURE of L: L* = L^0 ∪ L^1 ∪ L^2 ∪ … (zero or more concatenations of L) • The POSITIVE CLOSURE of L: L+ = L^1 ∪ L^2 ∪ L^3 ∪ … (one or more concatenations of L)
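A minimal sketch of these operations on finite languages, using Python sets of strings. The closures are infinite in general, so the functions below only build the union up to a chosen power; the function names and the max_power cutoff are assumptions made for this illustration.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i: L concatenated with itself i times; L^0 is {ε}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_closure(L, max_power):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result

def positive_closure(L, max_power):
    """Finite approximation of L+: L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(kleene_closure({"a"}, 3)))
```

The only difference between the two closures is whether the empty string (from L^0) is included.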

Regular expressions • REs let us precisely define a set of strings. • For C identifiers, we might use ( letter | _ ) ( letter | digit | _ )* • Parentheses are for grouping, | means “OR”, and * means Kleene closure. • Every RE defines a language L(r).

Regular expressions • Here are the rules for writing REs over an alphabet ∑ : • ε is an RE denoting { ε }, the language containing only the empty string. • If a is in ∑, then a is a RE denoting { a }. • If r and s are REs denoting L(r) and L(s), then • (r)|(s) is a RE denoting L(r) ∪ L(s) • (r)(s) is a RE denoting L(r) L(s) • (r)* is a RE denoting (L(r))* • (r) is a RE denoting L(r)

Additional conventions • To avoid too many parentheses, we assume: • * has the highest precedence, and is left associative. • Concatenation has the 2nd highest precedence, and is left associative. • | has the lowest precedence and is left associative.

Example REs • a | b • ( a | b ) ( a | b ) • a* • (a | b )* • a | a*b
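To see which languages these example REs denote, one option is to enumerate short strings over {a, b} and keep those that each RE matches in full. The sketch below uses Python's re module as a stand-in for the formal notation (the slide's spaces are dropped); matching_strings and the length cutoff are assumptions for this illustration.

```python
import re
from itertools import product

def matching_strings(pattern, alphabet="ab", max_len=3):
    """Return every string over `alphabet` of length <= max_len that the
    whole pattern matches, ordered by length and then alphabet order."""
    compiled = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            s = "".join(letters)
            if compiled.fullmatch(s):
                found.append(s)
    return found

for pattern in ["a|b", "(a|b)(a|b)", "a*", "(a|b)*", "a|a*b"]:
    print(pattern, matching_strings(pattern))
```

For example, a|a*b yields a, b, ab, and aab up to length 3, which shows that * binds tighter than concatenation and concatenation binds tighter than |.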

Equivalence of REs

Regular definitions • To make our REs simpler, we can give names to subexpressions. A REGULAR DEFINITION is a sequence d1 -> r1 d2 -> r2 … dn -> rn

Regular definitions • Example for identifiers in C:
    letter -> A | B | … | Z | a | b | … | z
    digit -> 0 | 1 | … | 9
    id -> ( letter | _ ) ( letter | digit | _ )*
• Example for numbers in Pascal:
    digit -> 0 | 1 | … | 9
    digits -> digit digit*
    optional_fraction -> . digits | ε
    optional_exponent -> ( E ( + | - | ε ) digits ) | ε
    num -> digits optional_fraction optional_exponent
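The Pascal number definition can be mirrored almost line for line by composing Python regular expression strings. This is a sketch under the assumption that ε is written as an empty alternative; the variable names simply copy the names in the regular definition.

```python
import re

# Each name is built from the names defined before it, as in a
# regular definition. An empty alternative plays the role of ε.
digit = r"[0-9]"
digits = rf"{digit}{digit}*"
optional_fraction = rf"(\.{digits}|)"
optional_exponent = rf"(E(\+|-|){digits}|)"
num = rf"{digits}{optional_fraction}{optional_exponent}"

NUM = re.compile(num)

def is_num(s):
    """True if the whole string s is a lexeme of the token num."""
    return NUM.fullmatch(s) is not None

for lexeme in ["3.1416", "6.02E23", "1E-5", "1.", ".5", "1E"]:
    print(lexeme, is_num(lexeme))
```

Note that "1." and ".5" are rejected: the definition requires digits on both sides of the decimal point.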

Notational shorthand • To simplify our REs, we can use a few shortcuts: • 1. + means “one or more instances of”: a+, (ab)+ • 2. ? means “zero or one instance of”: optional_fraction -> ( . digits )? • 3. [] creates a character class: [A-Za-z][A-Za-z0-9]* • You can prove that these shortcuts do not increase the representational power of REs, but they are convenient.

Token recognition • We now know how to specify the tokens for our language. But how do we write a program to recognize them?
    if -> if
    then -> then
    else -> else
    relop -> < | <= | = | <> | > | >=
    id -> letter ( letter | digit )*
    num -> digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

Token recognition • We also want to strip whitespace, so we need definitions delim -> blank | tab | newline ws -> delim+
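One hand-written way to recognize these tokens is to try the patterns at the current input position and skip anything that matches ws. The sketch below does this with Python's re module. It makes several assumptions: num is read as one or more digits in each position, keywords are found by matching id first and then checking a keyword set, and the relop alternatives are ordered longest first because Python tries alternatives left to right rather than picking the longest match. The names scan, TOKEN_RE, and KEYWORDS are hypothetical.

```python
import re

KEYWORDS = {"if", "then", "else"}

# One named group per token class; inner groups are non-capturing so that
# m.lastgroup always names the token class that matched.
TOKEN_RE = re.compile(r"""
    (?P<ws>    [ \t\n]+ )
  | (?P<num>   [0-9]+ (?: \. [0-9]+ )? (?: E [+-]? [0-9]+ )? )
  | (?P<id>    [A-Za-z] [A-Za-z0-9]* )
  | (?P<relop> <= | <> | >= | < | > | = )
""", re.VERBOSE)

def scan(source):
    """Return a list of (token, lexeme) pairs, with whitespace stripped."""
    pos, out = 0, []
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"illegal character {source[pos]!r} at offset {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                  # delimiters produce no token
        if kind == "id" and lexeme in KEYWORDS:
            kind = lexeme             # if, then, else are their own tokens
        out.append((kind, lexeme))
    return out

print(scan("if x1 <= 42 then y <> 3.5E+2 else z"))
```

Checking identifiers against a keyword set avoids misreading an identifier such as "iffy" as the keyword if followed by "fy".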

Attribute values

Transition diagrams • Transition diagrams are also called finite automata. • We have a collection of STATES drawn as nodes in a graph. • TRANSITIONS between states are represented by directed edges in the graph. • Each transition leaving a state s is labeled with a set of input characters that can occur after state s. • For now, the transitions must be DETERMINISTIC. • Each transition diagram has a single START state and a set of TERMINAL STATES. • The label OTHER on an edge indicates all possible inputs not handled by the other transitions. • Usually, when we recognize OTHER, we need to put it back in the source stream since it is part of the next token. This action is denoted with a * next to the corresponding state.
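A transition diagram translates fairly directly into code: one branch per state, one test per outgoing edge. The sketch below does this for the relop token, including the put-back action on OTHER edges. The state numbers and the attribute names (LT, LE, EQ, NE, GT, GE) are assumptions made for this example, not taken from a diagram in these slides.

```python
def scan_relop(text, pos):
    """Simulate a transition diagram for relop starting at text[pos].

    Returns (attribute, new_pos), or None if no relop starts at pos.
    """
    state = 0
    while True:
        c = text[pos] if pos < len(text) else ""   # "" stands for end of input
        pos += 1
        if state == 0:                 # start state
            if c == "<":
                state = 1
            elif c == "=":
                return ("EQ", pos)
            elif c == ">":
                state = 6
            else:
                return None            # not a relop
        elif state == 1:               # seen "<"
            if c == "=":
                return ("LE", pos)
            if c == ">":
                return ("NE", pos)
            return ("LT", pos - 1)     # OTHER: put the character back (*)
        elif state == 6:               # seen ">"
            if c == "=":
                return ("GE", pos)
            return ("GT", pos - 1)     # OTHER: put the character back (*)

print(scan_relop("<=", 0))
print(scan_relop("<5", 0))
```

In the second call the character 5 is read to decide that the token is "<" alone, and the returned position shows that it was handed back to the input.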

Automated lexical analyzer generation • Next time we discuss Lex and how it does its job: • Given a set of regular expressions, produce C code to recognize the tokens.

Lexical Analysis

Lexical Analysis Example

Lexical Analysis With Lex

Lex source program format • The Lex program has three sections, separated by %%: declarations %% translation rules %% auxiliary code

Declarations section • Code between %{ and %} is inserted directly into lex.yy.c. It should contain: • Manifest constants (#define for each token) • Global variables, function declarations, typedefs • Outside %{ and %}, REGULAR DEFINITIONS are declared. Examples:
    delim [ \t\n]
    ws {delim}+
    letter [A-Za-z]
Each definition is a name followed by a pattern. Declared names can be used in later patterns, if surrounded by { }.

Translation rules section Translation rules take the form p1 { action1 } p2 { action2 } …… pn { actionn } Where pi is a regular expression and actioni is a C program fragment to be executed whenever pi is recognized in the input stream. In regular expressions, references to regular definitions must be enclosed in {} to distinguish them from the corresponding character sequences.

Auxiliary procedures • Arbitrary C code can be placed in this section, e.g. functions to manipulate the symbol table. • (Already covered.)

Special characters Some characters have special meaning to Lex. • ‘.’ in a RE stands for any character except newline • ‘*’ stands for Kleene closure • ‘+’ stands for positive closure • ‘?’ stands for 0-or-1 instance of • ‘-’ produces a character range (e.g. in [A-Z]) When you want to use these characters literally in a RE, they must be “escaped”. For example, in the RE {digit}+(\.{digit}+)? the ‘.’ is escaped with ‘\’.

Lex interface to yacc • The yacc parser calls a function yylex() produced by lex. • yylex() returns the next token it finds in the input stream. • yacc expects the token’s attribute, if any, to be returned via the global variable yylval. • The declaration of yylval is up to you (the compiler writer). In our example, we use a union, since we have a few different kinds of attributes.

Lookahead in Lex Sometimes we don’t know what the next token is until we have looked ahead several characters. Recognition of the DO keyword in Fortran is a famous example. DO5I=1.25 assigns the value 1.25 to DO5I. DO5I=1,25 is a DO loop. Lex handles this lookahead with the r1/r2 operator, which matches r1 only when it is followed by r2: DO/({letter}|{digit})*=({letter}|{digit})*, This recognizes the keyword DO only if it is followed by letters and digits, ‘=’, more letters and digits, and then a ‘,’.
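The same idea can be tried outside Lex with a lookahead assertion. The sketch below uses Python's (?=...) syntax, which plays a role similar to Lex's r1/r2: the text after DO must fit the pattern, but it is not consumed as part of the match. The names DO_KEYWORD and starts_with_do_keyword are assumptions for this example.

```python
import re

# DO is matched only if followed by letters/digits, "=", letters/digits, ",".
# The lookahead part is checked but not consumed.
DO_KEYWORD = re.compile(r"DO(?=[A-Za-z0-9]*=[A-Za-z0-9]*,)")

def starts_with_do_keyword(statement):
    """True if the statement begins with the keyword DO (not an identifier)."""
    return DO_KEYWORD.match(statement) is not None

print(starts_with_do_keyword("DO5I=1,25"))   # DO loop
print(starts_with_do_keyword("DO5I=1.25"))   # assignment to DO5I
```

When the match succeeds, it ends right after DO, so scanning resumes at 5I.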

Finite Automata for Lexical Analysis

Automatic lexical analyzer generation • How do Lex and similar tools do their job? • Lex translates regular expressions into transition diagrams. • Then it translates the transition diagrams into C code to recognize tokens in the input stream. • There are many possible algorithms. • The simplest algorithm is RE -> NFA -> DFA -> C code.

Finite automata (FAs) and regular languages • A RECOGNIZER takes language L and string x as input, and responds YES if x∈L, or NO otherwise. • The finite automaton (FA) is one class of recognizer. • A FA is DETERMINISTIC if there is only one possible transition for each <state,input> pair. • A FA is NONDETERMINISTIC if there is more than one possible transition for some <state,input> pair. • BUT both DFAs and NFAs recognize the same class of languages: REGULAR languages, or the class of languages that can be written as regular expressions.

NFAs • A NFA is a 5-tuple < S, ∑, move, s0, F > • S is the set of STATES in the automaton. • ∑ is the INPUT CHARACTER SET • move( s, c ) = T is the TRANSITION FUNCTION specifying the set of states T ⊆ S that the automaton can move to on seeing input c while in state s. • s0 is the START STATE. • F is the set of FINAL, or ACCEPTING STATES

NFA example [Figure: NFA state diagram] • The NFA recognizes the language L = (a|b)*abb (the set of all strings of a’s and b’s ending with abb). • The NFA has move() function: [Table: transition table for the NFA]

The language defined by a NFA • An NFA ACCEPTS string x iff there exists a path from s0 to an accepting state, such that the edge labels along the path spell out x. • The LANGUAGE DEFINED BY a NFA N, written L(N), is the set of strings it accepts.
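Acceptance by an NFA can be checked by tracking the set of states reachable after each input symbol. The sketch below does so for a four-state NFA for (a|b)*abb. The transition table is an assumption (one common NFA for this language, without ε-transitions) and may differ from the diagram on the slide.

```python
# Assumed NFA for (a|b)*abb: state 0 loops on a and b, and a path
# 0 -a-> 1 -b-> 2 -b-> 3 leads to the accepting state 3.
NFA_MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}

def nfa_accepts(x):
    """True if some path from START spelling x ends in an accepting state."""
    current = {START}
    for c in x:
        nxt = set()
        for s in current:
            nxt |= NFA_MOVE.get((s, c), set())   # missing entry: no transition
        current = nxt
    return bool(current & ACCEPTING)

for x in ["abb", "aabb", "babb", "ab", "abba", ""]:
    print(repr(x), nfa_accepts(x))
```

The string is accepted if at least one of the tracked states is accepting, which matches the "there exists a path" wording of the definition.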

Another NFA example This NFA accepts L = aa*|bb*

Deterministic FAs (DFAs) The DFA is a special case of the NFA in which: • No state has an ε-transition • No state has more than one edge leaving it for the same input character. The benefit of DFAs is that they are simple to simulate: there is only one choice for the machine’s state after each input symbol.

Algorithm to simulate a DFA
Inputs: string x terminated by EOF; DFA D = < S, ∑, move, s0, F >
Outputs: YES if D accepts x; NO otherwise
Method:
    s = s0;
    c = nextchar;
    while ( c != EOF ) {
        s = move( s, c );
        c = nextchar;
    }
    if ( s ∈ F ) return YES
    else return NO

DFA example This DFA accepts L = (a|b)*abb
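The DFA simulation algorithm is short enough to write out in full. The sketch below applies it to a DFA for (a|b)*abb. The transition table is an assumption (a standard four-state DFA for this language) and the state numbering may not match the slide's diagram; the end of the Python string plays the role of EOF.

```python
# Assumed DFA for (a|b)*abb. State k means "the last k characters read
# match the first k characters of abb", with state 3 accepting.
DFA_MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
DFA_START = 0
DFA_ACCEPTING = {3}

def dfa_accepts(x):
    """Run the DFA over x; exactly one state is tracked at each step."""
    s = DFA_START
    for c in x:
        if (s, c) not in DFA_MOVE:
            return False          # symbol outside the alphabet: reject
        s = DFA_MOVE[(s, c)]
    return s in DFA_ACCEPTING

for x in ["abb", "aabb", "babb", "ab", "abba", ""]:
    print(repr(x), dfa_accepts(x))
```

Each input character costs one table lookup, which is why DFAs are attractive for generated lexers.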

RE -> DFA • Now we know how to simulate DFAs. • If we can convert our REs into a DFA, we can automatically generate lexical analyzers. • BUT it is not easy to convert REs directly into a DFA. • Instead, we will convert our REs to a NFA then convert the NFA to a DFA.

Converting a NFA to a DFA

NFA -> DFA • NFAs are ambiguous: we don’t know what state a NFA is in after observing each input. • The simplest conversion method is to have the DFA track the SUBSET of states the NFA MIGHT be in. • We need three functions for the construction: • ε-closure(s): the set of NFA states reachable from NFA state s on ε-transitions alone. • ε-closure(T): the set of NFA states reachable from some state s ∈ T on ε-transitions alone. • move(T,a): the set of NFA states to which there is a transition on input a from some NFA state s ∈ T
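The three functions can be sketched as follows, together with a loop that uses them to build DFA states as sets of NFA states. This is an illustration under stated assumptions: the NFA below is a hypothetical five-state machine for aa*|bb* with ε-transitions out of its start state, ε is represented by the empty string, and subset_construction is one straightforward worklist formulation rather than the only possible one.

```python
EPS = ""   # label used for ε-transitions in this sketch

# Assumed NFA for aa*|bb*: state 0 branches on ε to an "a" part and a "b" part.
NFA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, "a"): {2},
    (3, "b"): {4},
    (4, "b"): {4},
}
NFA_START, NFA_ACCEPTING = 0, {2, 4}

def eps_closure(T):
    """NFA states reachable from some state in T on ε-transitions alone."""
    closure, stack = set(T), list(T)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(T, a):
    """NFA states with a transition on input a from some state in T."""
    result = set()
    for s in T:
        result |= NFA.get((s, a), set())
    return result

def subset_construction(alphabet):
    """Build DFA transitions whose states are sets of NFA states."""
    start = eps_closure({NFA_START})
    dfa_move, seen, work = {}, {start}, [start]
    while work:
        T = work.pop()
        for a in alphabet:
            U = eps_closure(move(T, a))
            if not U:
                continue              # no transition: implicit dead state
            dfa_move[(T, a)] = U
            if U not in seen:
                seen.add(U)
                work.append(U)
    accepting = {T for T in seen if T & NFA_ACCEPTING}
    return start, dfa_move, accepting

start, dfa_move, accepting = subset_construction("ab")
print(sorted(start), len(dfa_move), [sorted(T) for T in accepting])
```

A DFA state is accepting when the set it represents contains at least one accepting NFA state.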
