Parsing

Parsing
• Obtain text from somewhere (file, user input, web page, ..)
• Analyze text: split it into meaningful tokens
• Extract relevant information, disregard irrelevant information
• ‘Meaningful’, ‘relevant’ depend on application: what are we looking for?
    • Search phone book for all people named “Ole Hansen”
    • Search phone book for all phone numbers starting with 86
    • Search phone book for all people living in Ny Munkegade

Example: Torleif game
Sort of like Master Mind with words and letters:
• Two players, each picks a 5-letter noun
• Take turns guessing
• Score each guess by
    • Number of correctly placed letters (same letter at the same position in the hidden word)
    • Number of incorrectly placed letters (present in the hidden word, but at another position)

Hidden word: sport
    trofæ   1 correct, 2 incorrect
    frygt   1 correct, 1 incorrect
    ..
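The scoring rule can be sketched as a small function. This is a minimal sketch in Python 3 syntax (the slides themselves use Python 2); the name score_guess is hypothetical, and it assumes that each letter of the hidden word can be matched by at most one letter of the guess.

```python
def score_guess(hidden, guess):
    """Return (correct, misplaced) for a guess against the hidden word."""
    remaining = []  # hidden letters that were not matched in place
    leftovers = []  # guess letters that were not matched in place
    correct = 0
    for h, g in zip(hidden, guess):
        if h == g:
            correct += 1
        else:
            remaining.append(h)
            leftovers.append(g)

    misplaced = 0
    for g in leftovers:
        if g in remaining:
            misplaced += 1
            remaining.remove(g)  # a hidden letter can be matched only once
    return correct, misplaced


print(score_guess("sport", "trofæ"))  # (1, 2)
print(score_guess("sport", "frygt"))  # (1, 1)
```

The two calls reproduce the scores listed for the hidden word sport.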

Let’s write a computer player:
1. Pick random word (from homepage of Dansk Sprognævn).
2. Ask for a guess
3. Was the guess correct?
4. Otherwise score the guess
5. Go to 2.

Dansk Sprognævn, dictionary web page
• Ask for all words starting with ..
• Page displays at most 50 words at a time
• We are looking for 5-letter strings in bold followed by the string “sb” in italics (“sb.” abbreviates “substantiv”, i.e. noun)

Parsing the web page
The source code of the dynamically generated web page has 370 lines. Some of it looks like this:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <HTML>
    <HEAD>
    <META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <TITLE>Retskrivningsordbogen på nettet fra Dansk Sprognævn</TITLE>
    <META name="Author" content="Erik Bo Krantz Simonsen, www.progresso.dk">
    <META name="Description" content="The official Danish orthography dictionary on the web">
    <META name="KeyWords" content="RO2001, Retskrivningsordbogen, ordbog, dictionary, orthography, Dansk Sprognævn">
    <LINK rel="STYLESHEET" href="http://www.dsn.dk/ordbog.aux/ro2001ie.css" type="text/css">
    <SCRIPT language="JavaScript" type="text/javascript">
    <!--
    self.focus(); // frame focus
    if (document.searchForm && document.searchForm.P)
    src="http://www.dsn.dk/ordbog.aux/lowerRight.gif"></td></tr></table></TD>
    <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
    <TR><TD rowspan="2" valign="top"><TABLE BORDER=0 CELLSPACING=0 CELLPADDING=7 WIDTH=390>
    <TR BGCOLOR="#d0e0d0"><TD> spondæisk adj., itk. d.s. </TD></TR>
    <TR><TD> sporstof sb., -fet, -fer. </TD></TR>
    <TR BGCOLOR="#d0e0d0"><TD> sport sb., -en, i sms. sports-, fx sportsstævne. </TD></TR>
    </HTML>

Algorithm for picking a random word
• Pick a random initial letter x (weighted: count the total number of words beginning with each letter)
• Pick a random index (in the list of all words starting with x)
• Ask the website for the page with the next 50 x-words starting at the chosen index
• Parse the page and look for the first 5-letter noun
• If none is found, ask for the next 50 (wrap-around)
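The weighted pick in the first step can be sketched with a running sum over the weights. This is a minimal sketch in Python 3 syntax; the function name pick_weighted_index and the optional parameter r (which makes the choice reproducible) are hypothetical and not part of the slides.

```python
import random


def pick_weighted_index(weights, r=None):
    """Return index i with probability weights[i] / sum(weights)."""
    total = sum(weights)
    if r is None:
        r = random.randrange(total)  # uniform integer in [0, total)
    running = 0
    for index, weight in enumerate(weights):
        running += weight
        if r < running:  # r falls inside the slice belonging to this index
            return index
    raise ValueError("r must lie in [0, sum(weights))")


# An entry with weight 0 owns an empty slice and is never returned.
print(pick_weighted_index([2, 0, 3], r=2))  # 2
```

In Python 3.6 and later, random.choices(population, weights=...) from the standard library performs the same kind of weighted selection.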

get_random_word.py module

    import urllib
    import sys
    import re
    import random

    def getRandom5letterNoun():
        # hyppighed (frequency): one weight per initial letter, a-z followed by æ, ø, å.
        # q has weight 0 since there are no 5-letter Danish nouns starting with q!
        hyppighed = (3194, 4540, 759, 2651, 1556, 5221, 2658, 3141, 1890, 526,
                     4979, 2327, 3086, 1665, 2074, 3480, 0, 2455, 8460, 3845,
                     2315, 2230, 78, 20, 102, 77, 262, 252, 175)  # sum: 64018

        # pick random (weighted) starting letter
        r = random.randrange(0, 64018)
        sum = hyppighed[0]
        startbogstav = 0  # index of the starting letter
        while sum <= r:   # '<=' gives letter i exactly hyppighed[i] of the possible values of r
            startbogstav += 1
            sum += hyppighed[startbogstav]

        bogstavhyppighed = hyppighed[startbogstav]
        startindex = random.randrange(0, bogstavhyppighed)  # pick random index

        # translate from chosen character code into actual letter
        if startbogstav == 26:
            startbogstav = 'æ'
        elif startbogstav == 27:
            startbogstav = 'ø'
        elif startbogstav == 28:
            startbogstav = 'å'
        else:
            startbogstav = chr(startbogstav + 97)

        found_word = 0
        while not found_word:
            try:
                # get next 50 words, starting from chosen index, from website:
                myurl = "http://www.dsn.dk/cgi-bin/ordbog/ronet?M=1&P=%s&L=50&F=%d&T=%d" \
                        % (startbogstav, startindex, bogstavhyppighed)
                tempfile = urllib.urlopen(myurl)
                tekst = tempfile.read()
                tempfile.close()
            except IOError:
                print "Kan ikke få fat på Dansk Sprognævn"  # "Cannot reach Dansk Sprognævn"
                sys.exit(1)

            # replace special codes (HTML entities) with corresponding letters
            tekst = tekst.replace("&aelig;", "æ")
            tekst = tekst.replace("&oslash;", "ø")
            tekst = tekst.replace("&aring;", "å")

            wordRE = "([a-zæøå]{5}) sb"  # look for 5-letter noun
            compiled_word = re.compile(wordRE)
            resultat = compiled_word.search(tekst)
            if resultat:
                word = resultat.group(1)
                found_word = 1
            else:
                # get next 50 words from website
                startindex += 50
                if startindex > bogstavhyppighed:
                    startindex = 0
        return word

Sample words returned: fatwa pligt areal intet synål ceder tvist

Game program
Sample session (the hidden word is salon; r = correctly placed, f = incorrectly placed):

    Dit 5-bogstavers bud? sport

    sport 1r 1f

    Dit 5-bogstavers bud? stang

    sport 1r 1f
    stang 1r 2f

    Dit 5-bogstavers bud? satin

    sport 1r 1f
    stang 1r 2f
    satin 3r 0f

    Dit 5-bogstavers bud? salon

    sport 1r 1f
    stang 1r 2f
    satin 3r 0f
    salon 5r 0f

The program:

    from get_random_word import getRandom5letterNoun

    ord = getRandom5letterNoun()  # the hidden word ("ord" is Danish for "word")
    g = ""
    svar = "\n"                   # accumulated answers
    while ord != g:
        g = ""
        while len(g) != 5:
            g = raw_input("Dit 5-bogstavers bud? ").strip()  # "Your 5-letter guess?"
        guess = g
        kopi = ord
        r = 0  # number of correctly placed matching letters
        f = 0  # number of incorrectly placed matching letters
        for b in range(5):
            if guess[b] == kopi[b]:
                r += 1
                # blank out the matched letters so they are not counted again
                kopi = kopi[0:b] + '*' + kopi[b+1:]
                guess = guess[0:b] + '@' + guess[b+1:]
        for b in range(5):
            index = kopi.find(guess[b])
            if index >= 0:
                f += 1
                kopi = kopi[0:index] + '*' + kopi[index+1:]
                guess = guess[0:b] + '@' + guess[b+1:]
        svar = svar + "%s %dr %df\n" % (g, r, f)
        print svar

Intermezzo 1
Find it on the web: http://www.daimi.au.dk/~chili/CSS/Intermezzi/9.10.1.html
• Copy the get_random_word module: /users/chili/CSS.E03/ExamplePrograms/get_random_word.py
• Make a new program that imports this module and prints out 5 random words.
• Make a new version of the get_random_word module so that it returns a random noun of between 5 and 10 letters. Import this module and print out 5 random words.
• Make a new version of the get_random_word module so that it finds a random word which has an alternative spelling and returns a tuple of both versions, e.g. (sponsering, sponsorering). Import this module and print out 5 random such word pairs using e.g. print "%s or %s" %getWords()
  (Hint: See this sample webpage generated by Dansk Sprognævn's website and find the word sponsorering, which has an alternative spelling ("el." is short for "eller", which means "or"). Then look at the source code of this page, which you can find here: /users/chili/CSS.E03/ExamplePrograms/dsn_page.txt. Check how exactly the words sponsering and sponsorering appear in the html. Use that example to write a new regular expression.)

Solution

    # 5-10 letter nouns:
    wordRE = "([a-zæøå]{5,10}) sb"

    # words with alternative spelling (look at html first):
    wordRE = "([a-zæøå]+) \(el. ([a-zæøå]+)"
    compiled_word = re.compile(wordRE)
    resultat = compiled_word.search(tekst)
    if resultat:
        word = resultat.group(1)
        word2 = resultat.group(2)
    ..
    return (word, word2)

Matching line in the html:

    ..
    <TR><TD> sponsere (el. sponsorere) vb., -ede. </TD></TR>
    ..
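The two patterns can be tried out on the dictionary lines quoted in this transcript. This is a minimal sketch in Python 3 syntax, under the assumption that the text has already been stripped of html tags, as the lines appear here. The function names are hypothetical. The \b word boundary and the escaped dot are additions that are not in the slides: on tag-stripped text, the 5-letter pattern without \b would also match the tail of a longer noun, such as "rstof" in "sporstof sb".

```python
import re

# Tag-stripped dictionary lines copied from the page excerpt and the solution.
lines = [
    "spondæisk adj., itk. d.s.",
    "sporstof sb., -fet, -fer.",
    "sport sb., -en, i sms. sports-, fx sportsstævne.",
    "sponsere (el. sponsorere) vb., -ede.",
]
text = "\n".join(lines)

# \b (an addition) makes sure the five letters form a whole word.
NOUN_RE = re.compile(r"\b([a-zæøå]{5}) sb")
# Word, then "(el. <alternative>)"; the dot is escaped to match literally.
ALT_RE = re.compile(r"([a-zæøå]+) \(el\. ([a-zæøå]+)\)")


def first_five_letter_noun(text):
    """Return the first 5-letter noun in the text, or None."""
    match = NOUN_RE.search(text)
    return match.group(1) if match else None


def first_alternative_spelling(text):
    """Return (word, alternative) for the first word with two spellings, or None."""
    match = ALT_RE.search(text)
    return match.groups() if match else None


print(first_five_letter_noun(text))      # sport
print(first_alternative_spelling(text))  # ('sponsere', 'sponsorere')
```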

Sequence formats
Say we get this sequence in fasta format from some database:

    >FOSB_MOUSE Protein fosB. 338 bp
    MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
    ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
    GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
    DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
    LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
    TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Now we need to compare this sequence to all sequences in some other database. Unfortunately this database uses the phylip format, so we need to translate:

Phylip Format: The first line of the input file contains the number of species, the number of sequences and their length (in characters) separated by blanks. The next line contains the sequence name, followed by the sequence in blocks of 10 characters.

Sequence formats

fasta:

    >FOSB_MOUSE Protein fosB. 338 bp
    MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
    ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
    GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
    DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
    LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
    TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

So we copy and paste and translate the sequence:

phylip:

    1 1 338
    FOSB_MOUSE MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
               PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
               GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA RPRRPREETL
               TPEEEEKRRV RRERNKLAAA KCRNRRRELT DRLQAETDQL EEEKAELESE
               IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSTSAKED
               GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
               TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL

and all is well. Then our boss says “Do it for these 5000 sequences.”

We need an automatic filter!
• Need a program that reads any number of fasta sequences and converts them into phylip format (want to run sequences through a filter)
• Program structure:
    • Open fasta file
    • Parse file to extract needed information
    • Create and save phylip file
• We will use this definition for the fasta format (and assume only one sequence per file):
    • The description line starts with a greater-than symbol (">").
    • The word immediately following the ">" is the "ID" (name) of the sequence; the rest of the line is the description.
    • The "ID" and the description are optional.
    • All lines of text should be shorter than 80 characters.
    • The sequence ends when another ">" appears at the beginning of a line, which is where the next sequence begins.

Pseudo-code fasta→phylip filter
• Open fasta file
• Find line starting with >
• Parse this line and extract first word after the > (sequence name)
• Read the sequence (count its length)
• Open phylip file
• Write “1 1” followed by seq. length
• Write seq. name
• Write sequence in blocks of 10
• Close files
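The steps above can be sketched as a function that works on text instead of files, which keeps the example easy to try out. This is a minimal sketch in Python 3 syntax. It assumes one sequence per input, a name padded to 10 characters, and five blocks per output line. The function name and the parameters block and blocks_per_line are hypothetical.

```python
def fasta_to_phylip(fasta_text, block=10, blocks_per_line=5):
    """Convert the text of a single-sequence fasta file to phylip text."""
    name = "unknown"  # the ID is optional in the fasta definition
    parts = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                name = words[0]  # first word after '>' is the sequence name
        else:
            parts.append(line)
    sequence = "".join(parts)

    # Cut the sequence into blocks of 10 characters.
    blocks = [sequence[i:i + block] for i in range(0, len(sequence), block)]

    out = ["1 1 %d" % len(sequence)]
    prefix = "%-10s " % name
    for i in range(0, len(blocks), blocks_per_line):
        out.append(prefix + " ".join(blocks[i:i + blocks_per_line]))
        prefix = " " * len(prefix)  # later lines are indented instead of named
    return "\n".join(out) + "\n"


print(fasta_to_phylip(">SEQ1 toy example\nABCDEFGHIJKL\nMNOP\n"))
```

Writing the result to a file is then a single open(...).write(...) call, so the parsing and formatting logic can be tested without touching the disk.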

The other way too: pseudo-code phylip→fasta filter
• Open phylip file
• Find first non-empty line, ignore!
• Parse next line and extract first word (sequence name)
• Read rest of line and following lines to get the sequence
• Open fasta file
• Write “>” followed by seq. name
• Write sequence in lines of 80
• Close files
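The reverse direction can be sketched in the same text-to-text style. This is a minimal sketch in Python 3 syntax. It assumes one sequence per input and a sequence name without spaces. The function name and the width parameter are hypothetical.

```python
def phylip_to_fasta(phylip_text, width=80):
    """Convert the text of a single-sequence phylip file to fasta text."""
    lines = [ln for ln in phylip_text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and at least one sequence line")

    # lines[0] is the "1 1 <length>" header, which is ignored.
    first = lines[1].split()
    name = first[0]     # first word is the sequence name
    chunks = first[1:]  # the rest of the line is sequence blocks
    for ln in lines[2:]:
        chunks.extend(ln.split())
    sequence = "".join(chunks)

    out = [">" + name]
    for i in range(0, len(sequence), width):
        out.append(sequence[i:i + width])
    return "\n".join(out) + "\n"


print(phylip_to_fasta("1 1 16\nSEQ1       ABCDEFGHIJ KLMNOP\n"))
```

The default width follows the pseudo-code ("lines of 80"). Passing width=60 gives the line length of the fasta example and keeps every line shorter than 80 characters.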

More formats?
• Boss: “Great! What about EMBL and GDE formats?”
• Coding, coding, ..: 12 filters!
[Diagram: phylip and fasta, connected by a phylip-fasta filter and a fasta-phylip filter]

More formats?
• Boss: “Super. And GenBank and ClustalW..?”
• Coding, coding, coding, ..: 30 filters (6 formats, each converted directly to the 5 others)
• Next new format: 12 new filters! I.e., this doesn’t scale.

Intermediate format
• Use our own internal format as intermediate step
• Two formats: four filters (phylip-internal, internal-phylip, fasta-internal, internal-fasta)
[Diagram: phylip and fasta, each connected to the internal format in both directions]

Intermediate format
• Six formats: 12 filters (not 30)
• New format: always two new filters only
[Diagram: the formats connected through the central i-format]
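The filter counts quoted on these slides follow from two small formulas: converting directly between n formats takes one filter per ordered pair, n(n-1), while going through an internal format takes one reader and one writer per format, 2n. A short check in Python 3 syntax (the function names are hypothetical):

```python
def direct_filters(n_formats):
    """One filter for every ordered pair of distinct formats."""
    return n_formats * (n_formats - 1)


def hub_filters(n_formats):
    """One x2internal and one internal2x filter per format."""
    return 2 * n_formats


for n in (2, 4, 6, 7):
    print(n, direct_filters(n), hub_filters(n))
```

Going from six to seven formats adds 12 direct filters but only 2 filters when the internal format is used.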

Let’s build a structured program!
• Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format
• Each internal2y filter module: save each i-format sequence in separate file in y format
• Example: Overall phylip-fasta filter:
    • import phylip2i and i2fasta modules
    • obtain filenames to load from and save to from command line
    • call parse_file method of the phylip2i module
    • call the save_to_files method of the i2fasta module

Our internal format revisited

    class Isequence:
        """Definition of abstract data type representing a sequence in I-format - internal format"""

        def __init__(self, t="unknown", n="unknown", i="unknown"):
            """Initialize fields to given values"""
            self.type = t
            self.name = n
            self.id = i
            self.sequence = ""  # represent the sequence itself as a string

Thus, the information we keep about a sequence is type, name, id (plus the sequence itself); all other information is disregarded
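The filter modules on the following slides call extend_sequence, get_sequence, get_name and get_id on Isequence objects, but those methods are not shown in this transcript. A minimal sketch of how the class could be completed, in Python 3 syntax; the method bodies are assumptions inferred from how the methods are called, not the original implementation.

```python
class Isequence:
    """A sequence in I-format, the internal format."""

    def __init__(self, t="unknown", n="unknown", i="unknown"):
        self.type = t
        self.name = n
        self.id = i
        self.sequence = ""  # the sequence itself, stored as a string

    # Assumed methods: only their names and call sites appear in the slides.
    def extend_sequence(self, fragment):
        """Append one more piece (typically one file line) to the sequence."""
        self.sequence += fragment

    def get_sequence(self):
        return self.sequence

    def get_name(self):
        return self.name

    def get_id(self):
        return self.id


seq = Isequence("protein", "Protein fosB.", "FOSB_MOUSE")
seq.extend_sequence("MFQAFPGDYD")
seq.extend_sequence("SGSRCSSSPS")
print(seq.get_id(), len(seq.get_sequence()))  # FOSB_MOUSE 20
```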

Example: fasta/phylip filter
• Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format

    from Isequence import Isequence

    class Parser:
        # loads and parses fasta file into list of i-sequences

        def __init__(self):
            self.iseqlist = []  # initialize empty list
            self.iseq = None    # no sequence is being built yet

        def parse_file(self, loadfilename):
            <<load file, save content in variable lines>>
            for line in lines:
                if line[0] == '>':  # new sequence starts
                    items = line.split()
                    # assume: dna, first word after > is the id, next two words are the name.
                    self.iseq = Isequence("dna", " ".join(items[1:3]), items[0][1:])
                    self.iseqlist.append(self.iseq)  # put new Isequence object in list
                elif self.iseq:
                    # we are currently building an iseq object, extend its sequence
                    self.iseq.extend_sequence(line.strip())  # skip trailing newline
            return self.iseqlist

• Each internal2y filter module: save each i-format sequence in separate file in y format

    import sys
    from Isequence import Isequence

    class SaveToFiles:
        # save i-sequences in phylip format

        def save_to_files(self, iseqlist, savefilename):
            try:
                for seq in iseqlist:
                    <<create appropriate suffix for the savefilename (a unique file per sequence)>>
                    savefile = open(savefilename + suffix, "w")
                    seqstring = seq.get_sequence()
                    print >> savefile, "1 1 %d" % len(seqstring)
                    prefix = "%-10s " % seq.get_name()  # write name
                    savefile.write(prefix)
                    prefix = " " * len(prefix)  # on remaining lines write spaces instead of name
                    counter = 1
                    for char in seqstring:
                        savefile.write(char)
                        if counter % 10 == 0:   # end of a block of 10
                            savefile.write(" ")
                        if counter % 50 == 0:   # end of a line of 5 blocks
                            savefile.write("\n%s" % prefix)
                        counter += 1
                    savefile.close()
            except IOError, message:
                sys.exit(message)

Command-line arguments
• Python stores command-line arguments in a list called sys.argv
• The first element, sys.argv[0], is the name of the program that the user is running from the command line

    # filename: command_line_arguments.py
    import sys

    print "first argument is program name:", sys.argv[0]
    print "arguments for the program start at index 1:"
    for arg in sys.argv[1:]:
        print arg

    threonine:~...ExamplePrograms% python command_line_arguments.py 1 2 3 qq
    first argument is program name: command_line_arguments.py
    arguments for the program start at index 1:
    1
    2
    3
    qq

Overall fasta/phylip filter
• import the fasta2i and i2phylip modules
• obtain filenames to load from and save to from command line
• call the parse_file method of the fasta2i module
• call the save_to_files method of the i2phylip module

    import Isequence
    from i2phylip import SaveToFiles
    from fasta2i import Parser
    import sys

    # Now SaveToFiles is a class that can save i-format sequences in phylip format,
    # and Parser is a class that reads a fasta file and parses it into i-format.

    # load a fasta file, save each sequence in its own file in phylip format
    if len(sys.argv) != 3:
        sys.exit("""Program takes two arguments: file to load fasta sequence(s) from
    and file (prefix) to save phylip sequences in.""")

    loadfilename = sys.argv[1]
    savefilename = sys.argv[2]

    # parse file and store each sequence in Isequence object:
    input_parser = Parser()
    iseq_list = input_parser.parse_file(loadfilename)

    # save each Isequence in required format in separate files:
    save_object = SaveToFiles()
    save_object.save_to_files(iseq_list, savefilename)

NB: apart from comments and the usage message, nothing below the import lines is specific to phylip or fasta.

i2embl filter module..?

    import sys
    from Isequence import Isequence

    class SaveToFiles:  # same class name

        def save_to_files(self, iseqlist, savefilename):  # same method name
            try:
                for seq in iseqlist:
                    <<create appropriate suffix for the savefilename (a unique file per sequence)>>
                    savefile = open(savefilename + suffix, "w")
                    <<convert i-sequence to embl format and write to file>>
                    savefile.close()
            except IOError, message:
                sys.exit(message)

Fasta/embl filter..?

    import Isequence
    from i2embl import SaveToFiles  # import same class name from different module
    from fasta2i import Parser
    import sys

    # Now SaveToFiles is a class that can save i-format sequences in embl format,
    # and Parser is a class that reads a fasta file and parses it into i-format.

    # load a fasta file, save each sequence in its own file in embl format
    if len(sys.argv) != 3:
        sys.exit("""Program takes two arguments: file to load fasta sequence(s) from
    and file (prefix) to save embl sequences in.""")

    loadfilename = sys.argv[1]
    savefilename = sys.argv[2]

    # parse file and store each sequence in Isequence object:
    input_parser = Parser()
    iseq_list = input_parser.parse_file(loadfilename)

    # save each Isequence in required format in separate files:
    save_object = SaveToFiles()
    save_object.save_to_files(iseq_list, savefilename)
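The reason the main program stays unchanged is that every saver offers the same method name, so the caller never needs to know which format it is writing. A minimal in-memory sketch of that idea in Python 3 syntax; the class names, the method name save and the tab-separated format are all invented for illustration, and the savers return text instead of writing files.

```python
class FastaSaver:
    """Hypothetical saver: renders (name, sequence) pairs as fasta text."""

    def save(self, records):
        return "".join(">%s\n%s\n" % (name, seq) for name, seq in records)


class TabSaver:
    """Hypothetical saver for an invented tab-separated format."""

    def save(self, records):
        return "".join("%s\t%s\n" % (name, seq) for name, seq in records)


def run_filter(records, saver):
    # Format-agnostic: relies only on the shared method name 'save'.
    return saver.save(records)


records = [("seq1", "ACGT"), ("seq2", "GGCC")]
print(run_filter(records, FastaSaver()))
print(run_filter(records, TabSaver()))
```

Swapping the saver object here plays the role that swapping the import line plays in the filter programs.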

Intermezzo 2
On the web: http://www.daimi.au.dk/~chili/CSS/Intermezzi/9.10.2.html

Oh no, the phylip format has been changed by its designers!
• The first line of a file with a sequence in the new phylip format is a comment line and begins with "@@". This comment line should contain the name of the author of the file, the year of creation, and the name of the author's favorite football player, separated by commas.
• In the next lines the sequence is written, starting with "##".
• In the final line (starting with "!!"), the sequence name is written.

Thus, a phylip format file might look like this:

    @@Jakob Fredslund, 2003, Zinedine Zidane
    ##cgactaagcttagcacggatcgatcggaattctagagcgacgacgtctagcagcgcgtaacgtatagctcgcgaggaaagctctgtaggggactg
    cgagaagatgg
    !!Tyrannosaurus Rex

Rewrite the fasta/phylip filter to incorporate the changed phylip format. Find all needed files here. I.e.:
• Copy the needed files (remember Isequence.py)
• Run the overall fasta/phylip filter on the given example fasta file and check the resulting phylip files to see how it works.
• Make the necessary changes in the right places.

Solution
• We only need to modify the i2phylip module (neat! Another good reason to use an intermediate format).

    # save each sequence in separate file:
    try:
        for seq in iseqlist:
            suffix = ".phylip"
            if len(iseqlist) > 1:
                suffix = "_" + seq.get_id() + suffix
            savefile = open(savefilename + suffix, "w")
            seqstring = seq.get_sequence()
            print >> savefile, "@@Jakob Fredslund, 2003, Zidane"
            print >> savefile, "##%s" % seqstring
            print >> savefile, "!!%s" % seq.get_name()
            savefile.close()
    except IOError, message:
        sys.exit(message)
