
Commandment VI


Presentation Transcript


  1. Commandment VI
  If a function be advertised to return an error code in the event of difficulties, thou shalt check for that error code, yea, even though the checks triple the size of thy code and produce aches in thy typing fingers, for if thou thinkest "it cannot happen to me", the gods shall surely punish thee for thy arrogance.

  2. Commandment VI Strikes Back
  • Linux kernel bug released this morning
  • Allows normal users to become root
  • Details are a bit complex (they depend on the Linux virtual memory system), but the gist is:
      do_munmap(current->mm, new_addr, new_len);
  • Should be:
      ret = do_munmap(current->mm, new_addr, new_len);
      if (ret && new_len)
          goto out;
  • For details, check out: http://www.isec.pl/vulnerabilities04.html

  3. A Little Lex

  4. Tokenization and Lexical Analysis
  • Your tokenizers are functions that transform character streams into token streams
  • "Good call! I've added" -> [goo], [ood], [odc], [dca], [cal], [all], ...
  • This is an example of lexical analysis: the general process of transforming one data stream into another data stream
  • Caveat: lexical analysis specifically refers to transformations based on regular languages (regexps)
  • Can be done in linear time
  • Doesn't have to be just chars -> toks!
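  A minimal sketch of the character n-gram idea above, in Java (the class and method names are illustrative, not from the slides). It lowercases and strips non-letters before windowing, which reproduces the [goo], [ood], ... example:

    import java.util.ArrayList;
    import java.util.List;

    public class NGramDemo {
        // Slide an n-char window across the cleaned text, one token per position.
        static List<String> charNGrams(String text, int n) {
            // Keep letters only and lowercase, so "Good call!..." becomes "goodcallive..."
            String cleaned = text.toLowerCase().replaceAll("[^a-z]", "");
            List<String> toks = new ArrayList<>();
            for (int i = 0; i + n <= cleaned.length(); i++) {
                toks.add(cleaned.substring(i, i + n));
            }
            return toks;
        }

        public static void main(String[] args) {
            // Prints [goo, ood, odc, dca, cal, all, lli, ...]
            System.out.println(charNGrams("Good call! I've added", 3));
        }
    }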

  5. The Big Picture View of Lexing
  [Diagram: a character stream feeds the lexer/tokenizer via getChar(); downstream code ("here be monsters") pulls tokens from it via getTok().]

  6. Character Stream Transformations
  [Diagram: a character stream passes through a chain of transforms (Xform1, Xform2, ...), each exposing the same getChar() interface to the next stage.]

  7. Transformations as Filters
  • Stream transforms are kinds of filters
  • Can change one char into another char (e.g., upcase or downcase)
  • Can selectively omit stuff (punctuation, whitespace, entire lines)
  • Can be chained together -- the total transform is the composition of the functions: f_tot(x) = (f2 ∘ f1)(x)
  • In LISP: (map xform2 (map xform1 data))
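  A hedged Java sketch of such a filter chain, built on java.io.FilterReader; the class names and demo string are made up for illustration, and only the single-char read() path (the getChar() of these slides) is overridden:

    import java.io.*;

    class UpcaseReader extends FilterReader {
        UpcaseReader(Reader in) { super(in); }
        @Override public int read() throws IOException {
            int c = in.read();                      // pull one char from upstream
            return (c == -1) ? -1 : Character.toUpperCase((char) c);
        }
    }

    class DropPunctReader extends FilterReader {
        DropPunctReader(Reader in) { super(in); }
        @Override public int read() throws IOException {
            int c;
            do { c = in.read(); }                   // skip chars this filter omits
            while (c != -1 && !Character.isLetterOrDigit((char) c)
                           && !Character.isWhitespace((char) c));
            return c;
        }
    }

    public class FilterChainDemo {
        public static void main(String[] args) throws IOException {
            // Composition: upcase(dropPunct(x)) -- each stage sees only its upstream's output.
            Reader r = new UpcaseReader(new DropPunctReader(
                    new StringReader("Good call! I've added")));
            for (int c; (c = r.read()) != -1; ) System.out.print((char) c);
            System.out.println();   // prints: GOOD CALL IVE ADDED
        }
    }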

  8. For Spam Filter Tokenization
  [Diagram: the raw character stream flows through the MBox handler, then the Message handler, then the Tokenizer; each stage pulls from the previous one via getChar().]

  9. The MBox Handler
  • A very simplified and specialized form of parser
  • Input from the raw char stream (Reader) via read()
  • "Transforms" the char stream by:
    • Recognizing the start/end of messages
    • Omitting the POSTMARK line from its output stream
    • Emitting "null" (or another end-of-stream marker) at end of message
  • Has to provide:
    • read()/getChar() -- next char in current msg
    • An "Is there another message?" method
    • A "Reset and start next message" method
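  A rough Java skeleton of the interface just described, assuming the classic mbox convention that a line beginning with "From " is the POSTMARK separating messages; only the method names come from the slides:

    import java.io.*;

    public class MBoxHandler {
        private final BufferedReader in;
        private String line;          // current line of the current message, or null
        private int pos;              // next char within that line
        private boolean sawPostmark;  // true once the next message's POSTMARK is seen
        private boolean eof;

        public MBoxHandler(Reader r) throws IOException {
            in = new BufferedReader(r);
            in.readLine();            // swallow the first message's POSTMARK line
        }

        // Next char of the current message, or -1 at end of this message.
        public int read() throws IOException {
            if (sawPostmark || eof) return -1;
            if (line != null && pos < line.length()) return line.charAt(pos++);
            if (line != null) { line = null; return '\n'; }  // newline between lines
            line = in.readLine();
            pos = 0;
            if (line == null) { eof = true; return -1; }
            if (line.startsWith("From ")) {                  // POSTMARK: omit it,
                sawPostmark = true; line = null; return -1;  // end current message
            }
            return read();
        }

        public boolean hasMoreMessages() { return sawPostmark; }

        public void startNextMessage() { sawPostmark = false; }
    }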

  10. The Message Handler
  • Input from MBox handler via read()/getChar()
  • "Transforms" that stream by:
    • Recognizing headers/body
    • Omitting most headers
    • Omitting the "To:", "From:" FIELD-NAMEs
    • Providing the rest of the content of those headers directly
    • Providing the body
    • Emitting the null/EOS marker at end of message
  • Has to provide:
    • read()/getChar()
    • "Reset and read next message"
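  A matching sketch for the Message handler, building on the MBoxHandler skeleton after slide 9; the blank-line header/body boundary and all helper names are assumptions beyond what the slide states:

    import java.io.*;

    public class MessageHandler {
        private final MBoxHandler mbox;
        private String pending = "";  // header chars waiting to be handed out
        private int pos = 0;
        private boolean inBody = false;

        public MessageHandler(MBoxHandler mbox) { this.mbox = mbox; }

        public int read() throws IOException {
            while (pos >= pending.length()) {
                if (inBody) return mbox.read();         // body passes straight through
                String line = readLine();
                if (line == null) return -1;            // end of message
                if (line.isEmpty()) { inBody = true; continue; } // blank line ends headers
                if (line.startsWith("To:") || line.startsWith("From:")) {
                    // Keep the header content, omit the FIELD-NAME itself.
                    pending = line.substring(line.indexOf(':') + 1) + "\n";
                    pos = 0;
                }
                // all other headers are omitted entirely
            }
            return pending.charAt(pos++);
        }

        private String readLine() throws IOException {  // assemble one line from mbox.read()
            StringBuilder sb = new StringBuilder();
            int c = mbox.read();
            if (c == -1) return null;
            while (c != -1 && c != '\n') { sb.append((char) c); c = mbox.read(); }
            return sb.toString();
        }

        public void startNextMessage() { inBody = false; pending = ""; pos = 0; }
    }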

  11. The Parse/Tokenize Loop
  Reader r = new FileReader("blah");
  MBoxHandler mbh = new MBoxHandler(r);
  MessageHandler mh = new MessageHandler(mbh);
  Tokenizer tok = new NGramTokenizer(mh);   // tokenizer reads from the message handler
  while (mbh.hasMoreMessages()) {
    while (tok.hasMoreTokens()) {
      String next = tok.getTok();
      // handle new token
    }
    // handle end of message
    mbh.startNextMessage();
    mh.startNextMessage();
  }

  12. Lexing in More Detail
  • Can think of a lexer/tokenizer as a little machine:
    • Reads in chars one-by-one
    • Glues them together until a "full token" is ready
    • Returns the full token
  • Specifically, all lexers can be seen as finite state machines
    • "States" == "different recognizable tokens/token classes"
    • Token class might be "all whitespace-separated words"

  13. 3 Ways to Think of Lexers
  • 3 (mathematically) equivalent ways to represent lexers:
    • Large nested set of conditionals
    • Finite state machine
    • Regular expression
  • Key similarities:
    • All look at each char only once
    • All depend on "remembering" only a finite amount of info about history (state)
    • Don't need an arbitrary amount of memory/stack/etc.

  14. A Small Lexer Example
  • Consider a lexer whose job is: return integers and non-numeric words; discard whitespace and punctuation
  • Equivalent to the regular expressions:
    • NUM := [0-9]+
    • WORD := [a-zA-Z]+
    • DISCARD := [^0-9a-zA-Z]
  • Three states here:
    • "Working on a number"
    • "Working on a word"
    • Discarding/waiting for input
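  Slide 13's third representation, the regular expression, can be demonstrated directly with java.util.regex; a hedged sketch (the group names and demo input are illustrative, the patterns come straight from this slide):

    import java.util.regex.*;

    public class RegexLexerDemo {
        private static final Pattern TOKEN =
            Pattern.compile("(?<NUM>[0-9]+)|(?<WORD>[a-zA-Z]+)|(?<DISCARD>[^0-9a-zA-Z]+)");

        public static void main(String[] args) {
            Matcher m = TOKEN.matcher("abc 123 x9!");
            while (m.find()) {
                if (m.group("DISCARD") != null) continue;   // drop junk, per the slide
                String kind = (m.group("NUM") != null) ? "NUM" : "WORD";
                System.out.println(kind + ": " + m.group());
            }
            // prints: WORD: abc / NUM: 123 / WORD: x / NUM: 9
        }
    }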

  15. The State Machine Picture
  [State machine diagram: states start, Num, Word; on each edge, "buf" appends the char, "drop" discards it, "return" emits the buffered token.]
  • start: [0-9]/buf -> Num; [a-zA-Z]/buf -> Word; [^0-9a-zA-Z]/drop -> start
  • Num: [0-9]/buf -> Num; [a-zA-Z]/buf&return -> Word; [^0-9a-zA-Z]/drop&return -> start
  • Word: [a-zA-Z]/buf -> Word; [0-9]/buf&return -> Num; [^0-9a-zA-Z]/drop&return -> start

  16. Turning it into code
  state = ST_START;
  StringBuffer buf = new StringBuffer("");
  while ((c = nextInputChar()) != EOF) {
    if (state == ST_START) {
      if (c >= '0' && c <= '9') {
        buf.append((char) c);
        state = ST_NUM;
        continue;
      }
      if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
        buf.append((char) c);
        state = ST_WORD;
        continue;
      }

  17. Turning it into code, cont'd...
    if (state == ST_WORD) {
      if (c >= '0' && c <= '9') {
        state = ST_NUM;
        String rval = buf.toString();
        buf = new StringBuffer().append((char) c);
        return rval;
      }
      // etc...
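  Pieced together, a complete runnable version of this machine might look like the following; the class name, token-list driver, and demo input are assumptions, but the transitions are exactly the slide 15 diagram:

    import java.io.*;
    import java.util.*;

    public class SmallLexer {
        private static final int ST_START = 0, ST_NUM = 1, ST_WORD = 2;

        static List<String> lex(Reader in) throws IOException {
            List<String> toks = new ArrayList<>();
            StringBuilder buf = new StringBuilder();
            int state = ST_START, c;
            while ((c = in.read()) != -1) {
                boolean digit = c >= '0' && c <= '9';
                boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
                int next = digit ? ST_NUM : letter ? ST_WORD : ST_START;
                // Leaving NUM or WORD (or switching between them) emits the buffered token.
                if (state != ST_START && next != state) {
                    toks.add(buf.toString());
                    buf.setLength(0);
                }
                if (next != ST_START) buf.append((char) c);  // "buf"; otherwise "drop"
                state = next;
            }
            if (state != ST_START) toks.add(buf.toString()); // flush final token at EOF
            return toks;
        }

        public static void main(String[] args) throws IOException {
            // prints [abc, 123, x, 9]
            System.out.println(lex(new StringReader("abc 123 x9!")));
        }
    }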

  18. Coolnesses
  • Pretty fast -- only sees each char once
  • Can turn any regexp into this kind of code
    • (Alt: anything you can write a diagram for)
  • Essentially, anything lexable in linear time can be done this way
  • Mostly cookbook -- given a diagram/regexp, just have to write out all the rules/cases longhand

  19. Bogusnesses
  • Getting all the rules right is a big pain -- easy to miss some
  • For moderately complex regexps, there can be very many states/rules
  • Easy to introduce bugs in the middle of "if" statements; very hard to debug
  • Not as fast as it could be -- each char may have to pass through many "if" statements

  20. In Practice...
  • There are tools to compile your regexps into a lexer automatically for you (flex, ANTLR, etc.)
  • They work by building large tables that compactly encode the rules:
      Actions[][] actArr = new Actions[N_STATE][N_CHAR];
      for (s = 0; s < N_STATE; ++s) {
        for (c = 0; c < N_CHAR; ++c) {
          actArr[s][c] = ...;  // action to take with char c in state s
        }
      }
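  To show how such a table gets used (the slide only builds it), here is a hedged sketch of the table-driven dispatch loop; the Action encoding and all names are assumptions, not the actual flex/ANTLR machinery:

    // One table lookup per char -- no chain of "if" statements.
    class Action {
        final int nextState;
        final boolean buffer;   // append this char to the token buffer?
        final boolean emit;     // is a completed token ready to return?
        Action(int nextState, boolean buffer, boolean emit) {
            this.nextState = nextState; this.buffer = buffer; this.emit = emit;
        }
    }

    class TableLexer {
        private final Action[][] actArr;   // [state][char], built as on the slide
        private int state = 0;
        private final StringBuilder buf = new StringBuilder();

        TableLexer(Action[][] actArr) { this.actArr = actArr; }

        // Feed one char; returns a completed token, or null until one is ready.
        String feed(char c) {
            Action a = actArr[state][c];
            String tok = a.emit ? buf.toString() : null;  // emit old token first,
            if (a.emit) buf.setLength(0);
            if (a.buffer) buf.append(c);                  // then buffer the new char
            state = a.nextState;
            return tok;
        }
    }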

  21. Exercise
  • Build the state/action diagram (finite state machine) and the action-table representation for the following set of tokens:
    • EVEN_NUMBER := [0-9]*[02468]
    • LC_WORD := [a-z]+
    • SPAM_WORDS := (nigeria|nightmare)
