
Compiler Construction


Week 2 Lecture

Lexical Analysis

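[Figure: block diagram with the source program, the lexical analyzer, the parser, and the symbol table; the parser issues “get next token” and the lexical analyzer returns a token.]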

- “get next token” is a command sent from the parser to the lexical analyzer.
- On receipt of the command, the lexical analyzer scans the input until it determines the next token, and returns it.

- We also want the lexer to
- Strip out comments and white space from the source code.
- Correlate parser errors with the source code location (the parser doesn’t know what line of the file it’s at, but the lexer does)

- A TOKEN is a set of strings over the source alphabet.
- A PATTERN is a rule that describes that set.
- A LEXEME is a sequence of characters matching that pattern.
- E.g. in Pascal, for the statement
const pi = 3.1416;

- The substring pi is a lexeme for the token identifier

- Together, the complete set of tokens form the set of terminal symbols used in the grammar for the parser.
- In most languages, the tokens fall into these categories:
- Keywords
- Operators
- Identifiers
- Constants
- Literal strings
- Punctuation

- Usually the token is represented as an integer.
- The lexer and parser just agree on which integers are used for each token.

- If there is more than one lexeme for a token, we have to save additional information about the token.
- Example: the token number matches lexemes 10 and 20.
- Code generation needs the actual number, not just the token.
- With each token, we associate ATTRIBUTES. Normally just a pointer into the symbol table.
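A minimal sketch of how a lexer might pair each token with an attribute, for a statement such as E = M * C * C. The integer token codes, the `tokenize` function, and the use of a list index as the "pointer" into the symbol table are all assumptions made for illustration, not part of the lecture.

```python
import re

# Hypothetical integer token codes that the lexer and parser agree on.
ID, ASSIGN_OP, MULT_OP = 1, 2, 3

TOKEN_PATTERNS = [
    (ID, re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    (ASSIGN_OP, re.compile(r"=")),
    (MULT_OP, re.compile(r"\*")),
]


def tokenize(source):
    """Return (pairs, symbol_table) for a tiny expression language."""
    symbol_table = []   # identifier names; the list index stands in for a pointer
    pairs = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():          # strip white space
            pos += 1
            continue
        for code, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match:
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        lexeme = match.group()
        if code == ID:
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            pairs.append((code, symbol_table.index(lexeme)))
        else:
            pairs.append((code, None))     # operators carry no attribute
        pos = match.end()
    return pairs, symbol_table
```

Calling `tokenize("E = M * C * C")` yields one pair per lexeme; both occurrences of C share the same symbol table entry.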

- For C source code
E = M * C * C

- We have token/attribute pairs
<ID, ptr to symbol table entry for E>

<Assign_op, NULL>

<ID, ptr to symbol table entry for M>

<Mult_op, NULL>

<ID, ptr to symbol table entry for C>

<Mult_op, NULL>

<ID, ptr to symbol table entry for C>

- When errors occur, we could just crash
- It is better to print an error message and then continue.
- Possible techniques to continue on error:
- Delete a character
- Insert a missing character
- Replace an incorrect character by a correct character
- Transpose adjacent characters

- REGULAR EXPRESSIONS (REs) are the most common notation for pattern specification.
- Every pattern specifies a set of strings, so an RE names a set of strings.
- Definitions:
- The ALPHABET (often written ∑) is the set of legal input symbols
- A STRING over some alphabet ∑ is a finite sequence of symbols from ∑
- The LENGTH of string s is written |s|
- The EMPTY STRING is a special 0-length string denoted ε

- A PREFIX of s is formed by removing 0 or more trailing symbols of s
- A SUFFIX of s is formed by removing 0 or more leading symbols of s
- A SUBSTRING of s is formed by deleting a prefix and a suffix from s
- A PROPER prefix, suffix, or substring is a nonempty string x that is, respectively, a prefix, suffix, or substring of s but with x ≠ s.

- A LANGUAGE is a set of strings over a fixed alphabet ∑.
- Example languages:
- Ø (the empty set)
- { ε }
- { a, aa, aaa, aaaa }

- The CONCATENATION of two strings x and y is written xy
- String EXPONENTIATION is written s^i, where s^0 = ε and s^i = s^(i-1) s for i > 0.

We often want to perform operations on sets of strings (languages). The important ones are:

- The UNION of L and M: L ∪ M = { s | s is in L OR s is in M }
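- The CONCATENATION of L and M: LM = { st | s is in L and t is in M }
- The KLEENE CLOSURE of L: L* = L^0 ∪ L^1 ∪ L^2 ∪ … (zero or more concatenations of L, where L^0 = { ε })
- The POSITIVE CLOSURE of L: L+ = L^1 ∪ L^2 ∪ L^3 ∪ … (one or more concatenations of L)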
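The language operations can be tried out on small finite sets. The sketch below represents a language as a Python set of strings; because the Kleene and positive closures are infinite in general, the hypothetical helpers `kleene_up_to` and `positive_up_to` only build the union of the first n powers.

```python
def union(L, M):
    """L ∪ M."""
    return L | M


def concat(L, M):
    """LM = { st | s in L and t in M }."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L^i: L concatenated with itself i times, with L^0 = {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_up_to(L, n):
    """Finite approximation of L*: the union of L^0 .. L^n."""
    result = set()
    for i in range(n + 1):
        result |= power(L, i)
    return result


def positive_up_to(L, n):
    """Finite approximation of L+: the union of L^1 .. L^n."""
    result = set()
    for i in range(1, n + 1):
        result |= power(L, i)
    return result
```

The only difference between the two closures is whether L^0, and therefore the empty string, is included.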

- REs let us precisely define a set of strings.
- For C identifiers, we might use ( letter | _ ) ( letter | digit | _ )*
- Parentheses are for grouping, | means “OR”, and * means Kleene closure.
- Every RE defines a language L(r).

- Here are the rules for writing REs over an alphabet ∑ :
- ε is an RE denoting { ε }, the language containing only the empty string.
- If a is in ∑, then a is a RE denoting { a }.
- If r and s are REs denoting L(r) and L(s), then
- (r)|(s) is a RE denoting L(r) ∪ L(s)
- (r)(s) is a RE denoting L(r) L(s)
- (r)* is a RE denoting (L(r))*
- (r) is a RE denoting L(r)

- To avoid too many parentheses, we assume:
- * has the highest precedence, and is left associative.
- Concatenation has the 2nd highest precedence, and is left associative.
- | has the lowest precedence and is left associative.

- a | b
- ( a | b ) ( a | b )
- a*
- (a | b )*
- a | a*b
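To see which languages these five expressions denote, one option is to enumerate short strings over { a, b } and keep those that match. The sketch below uses Python's `re` module, whose syntax for |, * and parentheses agrees with the notation here; `language_up_to` is a hypothetical helper, and the length bound is only there to keep the output finite.

```python
import re
from itertools import product


def language_up_to(pattern, alphabet="ab", max_len=2):
    """List every string over `alphabet` of length <= max_len that the RE matches."""
    compiled = re.compile(pattern)
    matches = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if compiled.fullmatch(s):      # the whole string must match
                matches.append(s)
    return matches
```

For instance, `language_up_to("a|a*b")` shows that precedence makes the expression mean a | ((a*)b): the strings a, b and ab match, but aa does not.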

- To make our REs simpler, we can give names to subexpressions. A REGULAR DEFINITION is a sequence
d1 -> r1

d2 -> r2

…

dn -> rn

- Example for identifiers in C:
letter -> A | B | … | Z | a | b | … | z

digit -> 0 | 1 | … | 9

id -> ( letter | _ ) ( letter | digit | _ )*

- Example for numbers in Pascal:
digit -> 0 | 1 | … | 9

digits -> digit digit*

optional_fraction -> . digits | ε

optional_exponent -> ( E ( + | - | ε ) digits ) | ε

num -> digits optional_fraction optional_exponent
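The Pascal number definition can be mirrored almost line for line by building one pattern out of named pieces. This is a sketch using Python's `re` module rather than Lex; the constant names and the `is_num` helper are illustrative, and an empty alternative such as `(?:x|)` plays the role of "| ε".

```python
import re

DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}{DIGIT}*"                          # digits -> digit digit*
OPTIONAL_FRACTION = rf"(?:\.{DIGITS}|)"               # . digits | ε
OPTIONAL_EXPONENT = rf"(?:E(?:\+|-|){DIGITS}|)"       # ( E ( + | - | ε ) digits ) | ε
NUM = re.compile(rf"{DIGITS}{OPTIONAL_FRACTION}{OPTIONAL_EXPONENT}")


def is_num(text):
    """True if the whole of `text` is a lexeme for the token num."""
    return NUM.fullmatch(text) is not None
```

Note that the definition requires digits on both sides of the decimal point, so strings like .5 and 1. are rejected.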

- To simplify our REs, we can use a few shortcuts:
- 1. + means “one or more instances of”, e.g. a+ and (ab)+
- 2. ? means “zero or one instance of”, e.g. optional_fraction -> ( . digits )?
- 3. [] creates a character class, e.g. [A-Za-z][A-Za-z0-9]*

- You can prove that these shortcuts do not increase the representational power of REs, but they are convenient.

- We now know how to specify the tokens for our language. But how do we write a program to recognize them?
if -> if

then -> then

else -> else

relop -> < | <= | = | <> | > | >=

id -> letter ( letter | digit )*

num -> digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

- We also want to strip whitespace, so we need definitions
delim -> blank | tab | newline

ws -> delim+

- Transition diagrams are also called finite automata.
- We have a collection of STATES drawn as nodes in a graph.
- TRANSITIONS between states are represented by directed edges in the graph.
- Each transition leaving a state s is labeled with a set of input characters that can occur after state s.
- For now, the transitions must be DETERMINISTIC.
- Each transition diagram has a single START state and a set of TERMINAL STATES.
- The label OTHER on an edge indicates all possible inputs not handled by the other transitions.
- Usually, when we recognize OTHER, we need to put it back in the source stream since it is part of the next token. This action is denoted with a * next to the corresponding state.
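A transition diagram can be coded by hand as nested tests on the current character. The sketch below does this for the relop definition; the token names and the `scan_relop` function are assumptions for illustration. The OTHER edges correspond to the branches that return without consuming the lookahead character, which is the "put it back" step marked with * in a diagram.

```python
LT, LE, EQ, NE, GT, GE = "LT", "LE", "EQ", "NE", "GT", "GE"


def scan_relop(text, pos=0):
    """Return (token, next_pos) for a relop starting at pos, or None if there is none."""

    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands in for end of input

    c = peek(pos)
    if c == "<":
        d = peek(pos + 1)
        if d == "=":
            return LE, pos + 2
        if d == ">":
            return NE, pos + 2
        return LT, pos + 1     # OTHER: the lookahead character is not consumed
    if c == "=":
        return EQ, pos + 1
    if c == ">":
        if peek(pos + 1) == "=":
            return GE, pos + 2
        return GT, pos + 1     # OTHER: the lookahead character is not consumed
    return None
```

The returned position tells the caller where the next token starts, so a character seen on an OTHER edge is scanned again as part of that next token.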

- Next time we discuss Lex and how it does its job:
- Given a set of regular expressions, produce C code to recognize the tokens.

Lexical Analysis With Lex

- The Lex program has three sections, separated by %%:
declarations

%%

translation rules

%%

auxiliary code

- Code between %{ and %} is inserted directly into lex.yy.c. It should contain:
- Manifest constants (#define for each token)
- Global variables, function declarations, typedefs

- Outside %{ and %}, REGULAR DEFINITIONS are declared. Examples:
delim [ \t\n]

ws {delim}+

letter [A-Za-z]

Each definition is a name followed by a pattern.

Declared names can be used in later patterns, if surrounded by { }.

Translation rules take the form

p1 { action1 }

p2 { action2 }

……

pn { actionn }

Where pi is a regular expression and actioni is a C program fragment to be executed whenever pi is recognized in the input stream.

In regular expressions, references to regular definitions must be enclosed in {} to distinguish them from the corresponding character sequences.

- Arbitrary C code can be placed in this section, e.g. functions to manipulate the symbol table.
- (Already explained.)

Some characters have special meaning to Lex.

- ‘.’ in a RE stands for ANY character
- ‘*’ stands for Kleene closure
- ‘+’ stands for positive closure
- ‘?’ stands for 0-or-1 instance of
- ‘-’ produces a character range (e.g. in [A-Z])
When you want to use these characters in a RE, they must be “escaped”

e.g. in RE {digit}+(\.{digit}+)? ‘.’ is escaped with ‘\’

- The yacc parser calls a function yylex() produced by lex.
- yylex() returns the next token it finds in the input stream.
- yacc expects the token’s attribute, if any, to be returned via the global variable yylval.
- The declaration of yylval is up to you (the compiler writer). In our example, we use a union, since we have a few different kinds of attributes.

Sometimes, we don’t know until looking ahead several characters what the next token is. Recognition of the DO keyword in Fortran is a famous example.

DO5I=1.25 assigns the value 1.25 to DO5I

DO5I=1,25 is a DO loop

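Lex handles long-range lookahead with the trailing-context operator r1/r2, which matches r1 only when it is followed by r2:

DO/({letter}|{digit})*=({letter}|{digit})*,

This rule recognizes the keyword DO only if it is followed by letters and digits, ‘=’, more letters and digits, and then a ‘,’.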
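The same lookahead idea can be tried outside Lex. The sketch below uses a lookahead assertion `(?=...)` in Python's `re` module as a rough analogue of Lex's r1/r2: the text after DO is inspected but not consumed. It assumes blanks have already been removed from the statement, and `do_keyword_end` is a hypothetical helper.

```python
import re

# DO, but only when followed by letters/digits, '=', letters/digits, and ','.
DO_KEYWORD = re.compile(r"DO(?=[A-Za-z0-9]*=[A-Za-z0-9]*,)")


def do_keyword_end(statement):
    """Return the position just after the keyword DO, or None if DO is not a keyword here."""
    match = DO_KEYWORD.match(statement)
    return match.end() if match else None
```

When the function returns 2, only the two characters of DO have been consumed, so scanning resumes at the loop label.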

Finite Automata for Lexical Analysis

- How do Lex and similar tools do their job?
- Lex translates regular expressions into transition diagrams.
- Then it translates the transition diagrams into C code to recognize tokens in the input stream.

- There are many possible algorithms.
- The simplest algorithm is RE -> NFA -> DFA -> C code.

- A RECOGNIZER takes language L and string x as input, and responds YES if x∈L, or NO otherwise.
- The finite automaton (FA) is one class of recognizer.
- A FA is DETERMINISTIC if there is only one possible transition for each <state,input> pair.
- A FA is NONDETERMINISTIC if there is more than one possible transition for some <state,input> pair.
- BUT both DFAs and NFAs recognize the same class of languages: REGULAR languages, or the class of languages that can be written as regular expressions.

- A NFA is a 5-tuple < S, ∑, move, s0, F >
- S is the set of STATES in the automaton.
- ∑ is the INPUT CHARACTER SET
- move( s, c ) = T is the TRANSITION FUNCTION, specifying the set of states T ⊆ S that the automaton can move to on seeing input c while in state s.
- s0 is the START STATE.
- F is the set of FINAL, or ACCEPTING STATES

The NFA

[Figure: NFA transition diagram omitted]

has move() function:

[Table: move() entries omitted]

and recognizes the language L = (a|b)*abb

(the set of all strings of a’s and b’s ending with abb)

- An NFA ACCEPTS string x iff there exists a path from s0 to an accepting state, such that the edge labels along the path spell out x.
- The LANGUAGE DEFINED BY a NFA N, written L(N), is the set of strings it accepts.
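The acceptance condition can be checked mechanically by tracking every state the NFA might be in after each symbol. The sketch below assumes one common four-state NFA for (a|b)*abb with no ε-transitions; the state numbers and the `MOVE` table are illustrative and are not taken from the lecture's diagram.

```python
# Assumed NFA for (a|b)*abb: state 0 loops on a and b, and on a it may also
# "guess" that the final abb has started by moving to state 1.
MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


def nfa_accepts(x):
    """True iff some path from START spells out x and ends in an accepting state."""
    current = {START}
    for c in x:
        # Follow every possible transition on c from every state we might be in.
        current = {t for s in current for t in MOVE.get((s, c), set())}
    return bool(current & ACCEPTING)
```

A path exists exactly when the final set of possible states contains an accepting state, so the set-based loop is equivalent to searching all paths.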

This NFA accepts L = aa*|bb*

A DFA is a special case of an NFA in which:

- No state has an ε-transition
- No state has more than one edge leaving it for the same input character.
The benefit of DFAs is that they are simple to simulate: there is only one choice for the machine’s state after each input symbol.

Inputs: string x terminated by EOF; DFA D = < S, ∑, move, s0, F >

Outputs: YES if D accepts x; NO otherwise

Method:

    s = s0;
    c = nextchar;
    while ( c != EOF ) {
        s = move( s, c );
        c = nextchar;
    }
    if ( s ∈ F ) return YES
    else return NO

This DFA accepts L = (a|b)*abb
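The DFA simulation method translates directly into a short loop. The sketch below assumes one standard four-state DFA for (a|b)*abb; the state numbers and the `DFA_MOVE` table are illustrative, and a missing table entry is treated as rejection.

```python
# Assumed DFA for (a|b)*abb. State k means "the last k symbols of abb were just seen".
DFA_MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}


def dfa_accepts(x, move=DFA_MOVE, start=0, accepting=frozenset({3})):
    """Simulate the DFA on x: exactly one state is tracked at a time."""
    s = start
    for c in x:
        if (s, c) not in move:     # symbol outside the alphabet: reject
            return False
        s = move[(s, c)]
    return s in accepting
```

The loop does a constant amount of work per input character, which is why generated lexers prefer DFAs.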

- Now we know how to simulate DFAs.
- If we can convert our REs into a DFA, we can automatically generate lexical analyzers.
- BUT it is not easy to convert REs directly into a DFA.
- Instead, we will convert our REs to a NFA then convert the NFA to a DFA.

Converting a NFA to a DFA

- NFAs are ambiguous: we don’t know what state a NFA is in after observing each input.
- The simplest conversion method is to have the DFA track the SUBSET of states the NFA MIGHT be in.
- We need three functions for the construction:
- ε-closure(s): the set of NFA states reachable from NFA state s on ε-transitions alone.
- ε-closure(T): the set of NFA states reachable from some state s ∈ T on ε-transitions alone.
- move(T,a): the set of NFA states to which there is a transition on input a from some NFA state s ∈ T

- Inputs: a NFA N = < SN, ∑, tranN, n0, FN >
- Outputs: a DFA D = < SD, ∑, tranD, d0, FD >
- Method:
    add a state d0 to SD corresponding to ε-closure(n0)
    while there is an unexpanded state di ∈ SD {
        for each input symbol a ∈ ∑ {
            dj = ε-closure(move(di,a))
            if dj ∉ SD,
                add dj to SD
            tranD( di, a ) = dj
        }
    }
    FD = the set of states in SD that contain at least one state of FN
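The subset construction can be sketched in a few functions. Here the NFA is a dictionary from (state, symbol) to a set of states, the empty string labels ε-transitions, and each DFA state is a frozenset of NFA states; these representation choices and all function names are assumptions for illustration. Unlike the method above, this sketch leaves a transition undefined when no NFA state is reachable instead of adding an empty "dead" state.

```python
EPS = ""  # label used for ε-transitions in this sketch


def eps_closure(nfa, states):
    """All NFA states reachable from `states` on ε-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(nfa, states, a):
    """All NFA states reachable from some state in `states` on input a."""
    return {t for s in states for t in nfa.get((s, a), set())}


def subset_construction(nfa, start, accepting, alphabet):
    d0 = eps_closure(nfa, {start})
    dstates, tran, unexpanded = {d0}, {}, [d0]
    while unexpanded:
        di = unexpanded.pop()
        for a in alphabet:
            dj = eps_closure(nfa, move(nfa, di, a))
            if not dj:
                continue               # nothing reachable: leave undefined
            if dj not in dstates:
                dstates.add(dj)
                unexpanded.append(dj)
            tran[(di, a)] = dj
    # A DFA state accepts if it contains at least one accepting NFA state.
    d_accepting = {d for d in dstates if d & accepting}
    return d0, dstates, tran, d_accepting


# Example input: an assumed ε-free NFA for (a|b)*abb.
EXAMPLE_NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
```

Running `subset_construction(EXAMPLE_NFA, 0, {3}, "ab")` on this example produces four DFA states, one for each set of NFA states that is actually reachable.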


Converting a RE to a NFA

- The construction is bottom up.
- Construct NFAs to recognize ε and each element a ∈ ∑.
- Recursively expand those NFAs for alternation, concatenation, and Kleene closure.
- Every step introduces at most two additional NFA states.
- Therefore the NFA is at most twice as large as the regular expression.

Inputs: A RE r over alphabet ∑

Outputs: A NFA N accepting L(r)

Method: Parse r.

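If r = ε, then N is a new start state i and a new accepting state f joined by a single ε-transition.

If r = a ∈ ∑ , then N is a new start state i and a new accepting state f joined by a single transition on a.

If r = s | t, construct N(s) for s and N(t) for t; then N has a new start state i with ε-transitions to the start states of N(s) and N(t), and a new accepting state f reached by ε-transitions from their accepting states.

If r = st, construct N(s) for s and N(t) for t; then N is N(s) followed by N(t), with the accepting state of N(s) merged with the start state of N(t).

If r = s*, construct N(s) for s; then N has a new start state i and a new accepting state f, with ε-transitions from i to the start state of N(s), from the accepting state of N(s) to f, from i to f, and from the accepting state of N(s) back to its start state.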

If r = ( s ), construct N(s) then let N be N(s).

Use the NFA construction algorithm to build a NFA for r = (a|b)*abb
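One way to check a hand-built answer is to run the bottom-up construction in code. The sketch below skips parsing and instead offers one hypothetical builder function per rule (`symbol`, `union`, `concat`, `star`), so the expression is assembled by nesting calls. As a simplification, `concat` links the two fragments with an ε-transition instead of merging states; this accepts the same language but uses a few more states.

```python
from itertools import count

EPS = None            # label for ε-transitions in this sketch
_ids = count()        # source of fresh state numbers


class Fragment:
    """An NFA with one start state, one accepting state, and a list of edges."""

    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges


def symbol(a):
    i, f = next(_ids), next(_ids)
    return Fragment(i, f, [(i, a, f)])


def union(s, t):
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, s.start), (i, EPS, t.start),
             (s.accept, EPS, f), (t.accept, EPS, f)]
    return Fragment(i, f, s.edges + t.edges + extra)


def concat(s, t):
    # Simplification: an ε-edge joins the fragments instead of merging states.
    return Fragment(s.start, t.accept, s.edges + t.edges + [(s.accept, EPS, t.start)])


def star(s):
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, s.start), (s.accept, EPS, f),
             (i, EPS, f), (s.accept, EPS, s.start)]
    return Fragment(i, f, s.edges + extra)


def accepts(nfa, x):
    """Simulate the NFA on x by tracking the ε-closed set of possible states."""

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            p = stack.pop()
            for (src, label, dst) in nfa.edges:
                if src == p and label is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({nfa.start})
    for c in x:
        current = closure({dst for (src, label, dst) in nfa.edges
                           if src in current and label == c})
    return nfa.accept in current


# r = (a|b)*abb, assembled bottom up.
N = concat(concat(concat(star(union(symbol("a"), symbol("b"))),
                         symbol("a")), symbol("b")), symbol("b"))
```

Calling `accepts(N, "babb")` exercises the union, the closure loop, and the three concatenations in one run.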