
Commandment VI


Presentation Transcript


  1. Commandment VI
  If a function be advertised to return an error code in the event of difficulties, thou shalt check for that error code, yea, even though the checks triple the size of thy code and produce aches in thy typing fingers, for if thou thinkest "it cannot happen to me", the gods shall surely punish thee for thy arrogance.

  2. Commandment VI Strikes Back
  • Linux kernel bug released this morning
  • Allows normal users to become root
  • Details are a bit complex (they depend on the Linux virtual memory system), but the gist is:
      do_munmap(current->mm, new_addr, new_len);
  • Should be:
      ret = do_munmap(current->mm, new_addr, new_len);
      if (ret && new_len)
          goto out;
  • For details, check out: http://www.isec.pl/vulnerabilities04.html

  3. A Little Lex

  4. Tokenization and Lexical Analysis
  • Your tokenizers are functions that transform character streams into token streams
  • "Good call! I've added" -> [goo], [ood], [odc], [dca], [cal], [all], ...
  • This is an example of lexical analysis: the general process of transforming one data stream into another data stream
  • Caveat: lexical analysis specifically refers to transformations based on regular languages (regexps)
  • Can be done in linear time
  • Doesn't have to be just chars -> toks!
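  A minimal sketch of the character n-gram idea above, in Java (the class and method names are illustrative, not from the slides). It lowercases and strips non-letters before windowing, which reproduces the [goo], [ood], ... example:

    import java.util.ArrayList;
    import java.util.List;

    public class NGramDemo {
        // Slide an n-char window across the cleaned text, one token per position.
        static List<String> charNGrams(String text, int n) {
            // Keep letters only and lowercase, so "Good call!..." becomes "goodcallive..."
            String cleaned = text.toLowerCase().replaceAll("[^a-z]", "");
            List<String> toks = new ArrayList<>();
            for (int i = 0; i + n <= cleaned.length(); i++) {
                toks.add(cleaned.substring(i, i + n));
            }
            return toks;
        }

        public static void main(String[] args) {
            // Prints [goo, ood, odc, dca, cal, all, lli, ...]
            System.out.println(charNGrams("Good call! I've added", 3));
        }
    }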

  5. The Big Picture View of Lexing
  [Diagram: a character stream feeds the lexer/tokenizer via getChar(); downstream code ("here be monsters") pulls tokens from it via getTok().]

  6. Character Stream Transformations
  [Diagram: a character stream passes through a chain of transforms (Xform1, Xform2, ...), each exposing the same getChar() interface to the next stage.]

  7. Transformations as Filters
  • Stream transforms are kinds of filters
  • Can change one char into another char (e.g., upcase or downcase)
  • Can selectively omit stuff (punctuation, whitespace, entire lines)
  • Can be chained together -- the total transform is the composition of the functions: f_tot(x) = (f2 ∘ f1)(x)
  • In LISP: (map xform2 (map xform1 data))
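  A hedged Java sketch of such a filter chain, built on java.io.FilterReader; the class names and demo string are made up for illustration, and only the single-char read() path (the getChar() of these slides) is overridden:

    import java.io.*;

    class UpcaseReader extends FilterReader {
        UpcaseReader(Reader in) { super(in); }
        @Override public int read() throws IOException {
            int c = in.read();                      // pull one char from upstream
            return (c == -1) ? -1 : Character.toUpperCase((char) c);
        }
    }

    class DropPunctReader extends FilterReader {
        DropPunctReader(Reader in) { super(in); }
        @Override public int read() throws IOException {
            int c;
            do { c = in.read(); }                   // skip chars this filter omits
            while (c != -1 && !Character.isLetterOrDigit((char) c)
                           && !Character.isWhitespace((char) c));
            return c;
        }
    }

    public class FilterChainDemo {
        public static void main(String[] args) throws IOException {
            // Composition: upcase(dropPunct(x)) -- each stage sees only its upstream's output.
            Reader r = new UpcaseReader(new DropPunctReader(
                    new StringReader("Good call! I've added")));
            for (int c; (c = r.read()) != -1; ) System.out.print((char) c);
            System.out.println();   // prints: GOOD CALL IVE ADDED
        }
    }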

  8. For Spam Filter Tokenization
  [Diagram: the raw character stream flows through the MBox handler, then the Message handler, then the Tokenizer; each stage pulls from the previous one via getChar().]

  9. The MBox Handler
  • A very simplified and specialized form of parser
  • Input from the raw char stream (Reader) via read()
  • "Transforms" the char stream by:
    • Recognizing the start/end of messages
    • Omitting the POSTMARK line from its output stream
    • Emitting "null" (or another end-of-stream marker) at end of message
  • Has to provide:
    • read()/getChar() -- next char in current msg
    • An "Is there another message?" method
    • A "Reset and start next message" method
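  A rough Java skeleton of the interface just described, assuming the classic mbox convention that a line beginning with "From " is the POSTMARK separating messages; only the method names come from the slides:

    import java.io.*;

    public class MBoxHandler {
        private final BufferedReader in;
        private String line;          // current line of the current message, or null
        private int pos;              // next char within that line
        private boolean sawPostmark;  // true once the next message's POSTMARK is seen
        private boolean eof;

        public MBoxHandler(Reader r) throws IOException {
            in = new BufferedReader(r);
            in.readLine();            // swallow the first message's POSTMARK line
        }

        // Next char of the current message, or -1 at end of this message.
        public int read() throws IOException {
            if (sawPostmark || eof) return -1;
            if (line != null && pos < line.length()) return line.charAt(pos++);
            if (line != null) { line = null; return '\n'; }  // newline between lines
            line = in.readLine();
            pos = 0;
            if (line == null) { eof = true; return -1; }
            if (line.startsWith("From ")) {                  // POSTMARK: omit it,
                sawPostmark = true; line = null; return -1;  // end current message
            }
            return read();
        }

        public boolean hasMoreMessages() { return sawPostmark; }

        public void startNextMessage() { sawPostmark = false; }
    }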

  10. The Message Handler
  • Input from MBox handler via read()/getChar()
  • "Transforms" that stream by:
    • Recognizing headers/body
    • Omitting most headers
    • Omitting the "To:", "From:" FIELD-NAMEs
    • Providing the rest of the content of those headers directly
    • Providing the body
    • Emitting the null/EOS marker at end of message
  • Has to provide:
    • read()/getChar()
    • "Reset and read next message"
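  A matching sketch for the Message handler, building on the MBoxHandler skeleton after slide 9; the blank-line header/body boundary and all helper names are assumptions beyond what the slide states:

    import java.io.*;

    public class MessageHandler {
        private final MBoxHandler mbox;
        private String pending = "";  // header chars waiting to be handed out
        private int pos = 0;
        private boolean inBody = false;

        public MessageHandler(MBoxHandler mbox) { this.mbox = mbox; }

        public int read() throws IOException {
            while (pos >= pending.length()) {
                if (inBody) return mbox.read();         // body passes straight through
                String line = readLine();
                if (line == null) return -1;            // end of message
                if (line.isEmpty()) { inBody = true; continue; } // blank line ends headers
                if (line.startsWith("To:") || line.startsWith("From:")) {
                    // Keep the header content, omit the FIELD-NAME itself.
                    pending = line.substring(line.indexOf(':') + 1) + "\n";
                    pos = 0;
                }
                // all other headers are omitted entirely
            }
            return pending.charAt(pos++);
        }

        private String readLine() throws IOException {  // assemble one line from mbox.read()
            StringBuilder sb = new StringBuilder();
            int c = mbox.read();
            if (c == -1) return null;
            while (c != -1 && c != '\n') { sb.append((char) c); c = mbox.read(); }
            return sb.toString();
        }

        public void startNextMessage() { inBody = false; pending = ""; pos = 0; }
    }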

  11. The Parse/Tokenize Loop
  Reader r = new FileReader("blah");
  MBoxHandler mbh = new MBoxHandler(r);
  MessageHandler mh = new MessageHandler(mbh);
  Tokenizer tok = new NGramTokenizer(mh);   // tokenizer reads from the message handler
  while (mbh.hasMoreMessages()) {
    while (tok.hasMoreTokens()) {
      String next = tok.getTok();
      // handle new token
    }
    // handle end of message
    mbh.startNextMessage();
    mh.startNextMessage();
  }

  12. Lexing in More Detail
  • Can think of a lexer/tokenizer as a little machine:
    • Reads in chars one-by-one
    • Glues them together until a "full token" is ready
    • Returns the full token
  • Specifically, all lexers can be seen as finite state machines
    • "States" == "different recognizable tokens/token classes"
    • Token class might be "all whitespace-separated words"

  13. 3 Ways to Think of Lexers
  • 3 (mathematically) equivalent ways to represent lexers:
    • Large nested set of conditionals
    • Finite state machine
    • Regular expression
  • Key similarities:
    • All look at each char only once
    • All depend on "remembering" only a finite amount of info about history (state)
    • Don't need an arbitrary amount of memory/stack/etc.

  14. A Small Lexer Example
  • Consider a lexer whose job is: return integers and non-numeric words; discard whitespace and punctuation
  • Equivalent to the regular expressions:
    • NUM := [0-9]+
    • WORD := [a-zA-Z]+
    • DISCARD := [^0-9a-zA-Z]
  • Three states here:
    • "Working on a number"
    • "Working on a word"
    • Discarding/waiting for input
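  Slide 13's third representation, the regular expression, can be demonstrated directly with java.util.regex; a hedged sketch (the group names and demo input are illustrative, the patterns come straight from this slide):

    import java.util.regex.*;

    public class RegexLexerDemo {
        private static final Pattern TOKEN =
            Pattern.compile("(?<NUM>[0-9]+)|(?<WORD>[a-zA-Z]+)|(?<DISCARD>[^0-9a-zA-Z]+)");

        public static void main(String[] args) {
            Matcher m = TOKEN.matcher("abc 123 x9!");
            while (m.find()) {
                if (m.group("DISCARD") != null) continue;   // drop junk, per the slide
                String kind = (m.group("NUM") != null) ? "NUM" : "WORD";
                System.out.println(kind + ": " + m.group());
            }
            // prints: WORD: abc / NUM: 123 / WORD: x / NUM: 9
        }
    }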

  15. The State Machine Picture
  [State machine diagram: states start, Num, Word; on each edge, "buf" appends the char, "drop" discards it, "return" emits the buffered token.]
  • start: [0-9]/buf -> Num; [a-zA-Z]/buf -> Word; [^0-9a-zA-Z]/drop -> start
  • Num: [0-9]/buf -> Num; [a-zA-Z]/buf&return -> Word; [^0-9a-zA-Z]/drop&return -> start
  • Word: [a-zA-Z]/buf -> Word; [0-9]/buf&return -> Num; [^0-9a-zA-Z]/drop&return -> start

  16. Turning it into code
  state = ST_START;
  StringBuffer buf = new StringBuffer("");
  while ((c = nextInputChar()) != EOF) {
    if (state == ST_START) {
      if (c >= '0' && c <= '9') {
        buf.append((char) c);
        state = ST_NUM;
        continue;
      }
      if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
        buf.append((char) c);
        state = ST_WORD;
        continue;
      }

  17. Turning it into code, cont'd...
    if (state == ST_WORD) {
      if (c >= '0' && c <= '9') {
        state = ST_NUM;
        String rval = buf.toString();
        buf = new StringBuffer().append((char) c);
        return rval;
      }
      // etc...
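  Pieced together, a complete runnable version of this machine might look like the following; the class name, token-list driver, and demo input are assumptions, but the transitions are exactly the slide 15 diagram:

    import java.io.*;
    import java.util.*;

    public class SmallLexer {
        private static final int ST_START = 0, ST_NUM = 1, ST_WORD = 2;

        static List<String> lex(Reader in) throws IOException {
            List<String> toks = new ArrayList<>();
            StringBuilder buf = new StringBuilder();
            int state = ST_START, c;
            while ((c = in.read()) != -1) {
                boolean digit = c >= '0' && c <= '9';
                boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
                int next = digit ? ST_NUM : letter ? ST_WORD : ST_START;
                // Leaving NUM or WORD (or switching between them) emits the buffered token.
                if (state != ST_START && next != state) {
                    toks.add(buf.toString());
                    buf.setLength(0);
                }
                if (next != ST_START) buf.append((char) c);  // "buf"; otherwise "drop"
                state = next;
            }
            if (state != ST_START) toks.add(buf.toString()); // flush final token at EOF
            return toks;
        }

        public static void main(String[] args) throws IOException {
            // prints [abc, 123, x, 9]
            System.out.println(lex(new StringReader("abc 123 x9!")));
        }
    }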

  18. Coolnesses
  • Pretty fast -- only sees each char once
  • Can turn any regexp into this kind of code
    • (Alt: anything you can write a diagram for)
  • Essentially, anything lexable in linear time can be done this way
  • Mostly cookbook -- given a diagram/regexp, just have to write out all the rules/cases longhand

  19. Bogusnesses
  • Getting all the rules right is a big pain -- easy to miss some
  • For moderately complex regexps, there can be very many states/rules
  • Easy to introduce bugs in the middle of "if" statements; very hard to debug
  • Not as fast as it could be -- each char may have to pass through many "if" statements

  20. In Practice...
  • There are tools to compile your regexps into a lexer automatically for you (flex, ANTLR, etc.)
  • They work by building large tables that compactly encode the rules:
      Actions[][] actArr = new Actions[N_STATE][N_CHAR];
      for (s = 0; s < N_STATE; ++s) {
        for (c = 0; c < N_CHAR; ++c) {
          actArr[s][c] = ...;  // action to take with char c in state s
        }
      }
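  To show how such a table gets used (the slide only builds it), here is a hedged sketch of the table-driven dispatch loop; the Action encoding and all names are assumptions, not the actual flex/ANTLR machinery:

    // One table lookup per char -- no chain of "if" statements.
    class Action {
        final int nextState;
        final boolean buffer;   // append this char to the token buffer?
        final boolean emit;     // is a completed token ready to return?
        Action(int nextState, boolean buffer, boolean emit) {
            this.nextState = nextState; this.buffer = buffer; this.emit = emit;
        }
    }

    class TableLexer {
        private final Action[][] actArr;   // [state][char], built as on the slide
        private int state = 0;
        private final StringBuilder buf = new StringBuilder();

        TableLexer(Action[][] actArr) { this.actArr = actArr; }

        // Feed one char; returns a completed token, or null until one is ready.
        String feed(char c) {
            Action a = actArr[state][c];
            String tok = a.emit ? buf.toString() : null;  // emit old token first,
            if (a.emit) buf.setLength(0);
            if (a.buffer) buf.append(c);                  // then buffer the new char
            state = a.nextState;
            return tok;
        }
    }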

  21. Exercise
  • Build the state/action diagram (finite state machine) and the action-table representation for the following set of tokens:
    • EVEN_NUMBER := [0-9]*[02468]
    • LC_WORD := [a-z]+
    • SPAM_WORDS := (nigeria|nightmare)
