1 / 47

Control flow File handling

Control flow File handling. Karin Lagesen karin.lagesen@medisin.uio.no. Indentation and scope. Python does not use brackets or other symbols to delineate a block of code Python uses indentation – either tab or space

kort
Download Presentation

Control flow File handling

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Control flowFile handling Karin Lagesen karin.lagesen@medisin.uio.no

  2. Indentation and scope • Python does not use brackets or other symbols to delineate a block of code • Python uses indentation – either tab or space • Note: variables can only be seen and used within the block of code it is in – this is called scope

  3. Statements • Statements: small stand-alone pieces of code • Simple statements: • print 42 • my_list = [1, 4, 8] • largest = max(my_list) • Can combine statements – often done with control flow statements

  4. Boolean expressions – True or False • Comparisons • Other values • True: non-empty lists, numbers != 0, and more • False: 0 and None

  5. Doing comparisons Simple comparisons >>> a = 1 >>> b = 3 >>> a == b False >>> a < b True >>> a = "Hello" >>> b = "World" >>> a != b True >>> a < b True >>>

  6. Python logical operators Boolean expression <LOGICAL OPERATOR> Boolean expression

  7. Combining comparisons Comparisons canbe combined Simple comparisons >>> a = 1 >>> b = 3 >>> a == b False >>> a < b True >>> a = "Hello" >>> b = "World" >>> a != b True >>> a < b True >>> >>> a = 1 >>> b = 4 >>> c = "Hello" >>> d = "World" >>> a < b and c < d True >>> a < b and c > d False >>> a < b or c > d True >>> a < b and not c > d False >>>

  8. Control flow • Control flow determines which blocks of code that will be executed • One conditional statement • If - elif - else • Two iteration statements • For: iterate over group of elements • While: do until something is true

  9. If – elif - else • Structure: IF <boolean expression>: code block 1ELIF <boolean expression>: code block 2ELSE: code block 3 • Optional: elif and else • Can have more than one elif • Only one of these code blocks are executed • Executed block: the one whose expression first evaluates to True (else always True) Note the : (colon)has to be there!Note: indentationis mandatory!

  10. Basic if example • Basic test: Note indentation, done by typing Tab >>> fakevariable = 12 >>> if fakevariable > 10: ... print "Var is greater than 10" ... Var is greater than 10 >>>

  11. Adding an else • Can add an else to the if statement – executed as default if nothing else is true >>> fakevariable = 12 >>> if fakevariable > 10: ... print "Variable greater than 10" ... else: ... print "Variable not greater than 10" ... Variable greater than 10 >>> First case: If conditional true - first statement executed >>> fakevariable = 9 >>> if fakevariable > 10: ... print "Variable greater than 10" ... else: ... print "Variable not greater than 10" ... Variable not greater than 10 >>> Second case: If conditional not true - else statement executed

  12. Adding elifs • Can test on multiple conditions by introducing one or more elifs >>> fakevariable = 7 >>> if fakevariable > 10: ... print "Variable greater than 10" ... eliffakevariable > 5: ... print "Variable greater than 5" ... else: ... print "Variable smaller than 5" ... Variable greater than 5 >>> IF statements are order sensitive – thecode inside of the first boolean expressionthat evaluates as True gets executed!

  13. Testing sequence length • Have DNA string in variable DNA • Calculate the length of the DNA string • Write if statement that does the following: • If string longer than 10 nts: • prints “DNA string is longer than 10 nts” • Else if string is between 5 and 10 • prints “DNA string is between 5 and 10 nts” • Otherwise, just • prints “DNA string is shorter than 5 nts”

  14. Testing sequence length # The variable DNA contains the DNA string in question textlength = len(DNA) if textlength > 10: print "DNA string is longer than 10 nts" elif textlength > 5 and textlength < 10: print "DNA string is between 5 and 10 nts" else: print "DNA string is shorter than 5 nts”

  15. For • Structure: for VAR in ITERABLE: code block • Code block executed for each element in ITERABLE • VAR takes on value of current element • Iterables are: • Strings, lists and other data types Note the : (colon)has to be there!

  16. For loop on lists • Can use a for loop to access each element in the list: >>> a = [1,2,3,4,5,6,7,8,9] >>> for var in a: ... print var ... 1 2 3 4 5 6 7 8 9 >>> Note indentation, done by typing Tab

  17. For loops and manipulation • Can manipulate data inside the for loop: >>> a = [1,2,3,4,5,6,7,8,9] >>> for var in a: ... print var*var ... 1 4 9 16 25 36 49 64 81 >>>

  18. For on strings... • Strings are also iterable • What happens if we for over a string? >>> a = "ATGGCGGA" >>> for var in a: ... print var ... A T G G C G G A >>>

  19. Print words and their lengths • Have variable called list containing words words = ["red", "green", "blue", "yellow”] • Iterate over list and: print word, length of word >>> words = ["red", "green", "blue", "yellow"] >>> for word in words: ... print word, len(word) ... red 3 green 5 blue 4 yellow 6 >>>

  20. While loop • Structure while BOOLEAN EXPRESSION == True: code block • Important: code block MUST change truth value of expression, otherwise infinite loop

  21. While example • Iterate over all codons in DNA string >>> DNA = "AAAGGGCCCTTT" >>> i = 0 >>> while i < len(DNA): ... print DNA[i:i+3] ... i = i + 1 ... AAA AAG AGG GGG GGC GCC CCC CCT CTT TTT TT T >>> Note i = i + 1, increment the counter so that expressionwill at some point be false

  22. Slicing DNA into codons • Have DNA string in variable DNA • Goal: print out all codons in the string >>> DNA = "123123" >>> i = 0 >>> while i < len(DNA) - 2: ... print DNA[i:i+3] ... i = i + 3 ... 123 123 >>> Length = 6, last number that is smaller than 6-2 is then 3.

  23. Working with files • Reading – get info into your program • Parsing – processing file contents • Writing – get info out of your program

  24. Reading and writing • Three-step process • Open file • create file handle – reference to file • Read or write to file • Close file • File will be automatically closed on program end, but bad form to not close

  25. Opening a file • How to open a file: fh = open(“filename”, “mode”) • fh = filehandle, reference to a file, NOT the file itself • Opening modes: • “r” - read file • “w” - write file • “a” - append to end of file

  26. Reading a file • Three ways to read • read([n]), n = bytes to read, default is all • readline(), read one line, incl. newline • readlines(), read file into a list, one element per line, including newline

  27. Reading example >>> fh = open("reading_file.txt", "r") >>> fh <open file 'reading_file.txt', mode 'r' at 0x1027e4540> >>> lines = fh.readlines() >>> lines ['This is a test file.\n', 'This file contains \n', 'three lines of text.\n'] >>> Opening file gives reference to file, notcontents. Contents must be read explicitly. .readlines() gives a list with strings, each string ending in a newline – a \n

  28. Parsing • Getting information out of a file • Commonly used string methods oneline: contains text • oneline.split(“character”) – splits line into list on character, default is whitespace • oneline.replace(“in string”, “put into instead”) • slicing

  29. Type conversions • Everything that comes from a file is a string • Everything that will be written to a file has to be a string. • Conversions: • int(X) • string cannot have decimals • floats will be floored • float(X) • str(X)

  30. Parsing example protein_name at_content prot1 0.4 prot2 0.5 prot3 0.2 prot4 0.8 Goal: calculate the average AT content of the proteins

  31. Parsing example – take 1 >>> fh = open("parsing_file.txt", "r") >>> lines = fh.readlines() >>> fh.close() >>> print lines ['protein_name\tat_content\n', 'prot1\t\t0.4\n', 'prot2\t\t0.5\n', 'prot3\t\t0.2\n', 'prot4\t\t0.8\n'] >>> for line in lines: ... print line ... protein_name at_content prot1 0.4 prot2 0.5 prot3 0.2 prot4 0.8 >>> Open the file, read inthe text, close it Iterate through eachof the lines in the file

  32. Parsing example – take 2 Removing the newline, so we only get the line itself. >>> print lines ['protein_name\tat_content\n', 'prot1\t\t0.4\n', 'prot2\t\t0.5\n', 'prot3\t\t0.2\n', 'prot4\t\t0.8\n'] >>> for line in lines: ... print line.replace("\n", "") ... protein_name at_content prot1 0.4 prot2 0.5 prot3 0.2 prot4 0.8 >>> Replace the newlinewith nothing

  33. Parsing example – take 3 Skipping the first line (i.e. skipping the first element of the list): >>> print lines ['protein_name\tat_content\n', 'prot1\t\t0.4\n', 'prot2\t\t0.5\n', 'prot3\t\t0.2\n', 'prot4\t\t0.8\n'] >>> for line in lines[1:]: ... print line.replace("\n", "") ... prot1 0.4 prot2 0.5 prot3 0.2 prot4 0.8 >>> Cut out the first line inthe file – no at contentpresent in that line

  34. Parsing example – take 4 Getting at only the AT content value >>> print lines ['protein_name\tat_content\n', 'prot1\t\t0.4\n', 'prot2\t\t0.5\n', 'prot3\t\t0.2\n', 'prot4\t\t0.8\n'] >>> for line in lines[1:]: ... text = line.replace("\n", "") ... fields = text.split() ... print fields ... print fields[1] ... ['prot1', '0.4'] 0.4 ['prot2', '0.5'] 0.5 ['prot3', '0.2'] 0.2 ['prot4', '0.8'] 0.8 >>> Split each line on newline,print out all fields, thenonly the second field

  35. Parsing example – take 5 Putting the AT content value into a list >>> print lines ['protein_name\tat_content\n', 'prot1\t\t0.4\n', 'prot2\t\t0.5\n', 'prot3\t\t0.2\n', 'prot4\t\t0.8\n'] >>> at_content_list = [] >>> for line in lines[1:]: ... text = line.replace("\n", "") ... fields = text.split() ... this_at_content = fields[1] ... at_content_list.append(this_at_content) ... >>> print at_content_list ['0.4', '0.5', '0.2', '0.8'] >>> Create list to hold the at contents. Created outside of loop, otherwise outof scope Get only at content value,append it to the list Print out the resulting list

  36. Parsing example – take 6 Putting the AT content value into a list as NUMBERS >>> print lines ['protein_name\tat_content\n', 'prot1\t\t0.4\n', 'prot2\t\t0.5\n', 'prot3\t\t0.2\n', 'prot4\t\t0.8\n'] >>> at_content_list = [] >>> for line in lines[1:]: ... text = line.replace("\n", "") ... fields = text.split() ... this_at_content = float(fields[1]) ... at_content_list.append(this_at_content) ... >>> print at_content_list [0.4, 0.5, 0.2, 0.8] >>> Converting the string withthe at content in to float

  37. Parsing example – take 6 Calculating the average >>> print lines ['protein_name\tat_content\n', 'prot1\t\t0.4\n', 'prot2\t\t0.5\n', 'prot3\t\t0.2\n', 'prot4\t\t0.8\n'] >>> at_content_list = [] >>> for line in lines[1:]: ... text = line.replace("\n", "") ... fields = text.split() ... this_at_content = float(fields[1]) ... at_content_list.append(this_at_content) ... >>> print at_content_list [0.4, 0.5, 0.2, 0.8] >>> sum(at_content_list)/len(at_content_list) 0.47500000000000003 >>> Calculating the average

  38. Parsing blast output • Columns tab separated • Headings: • Query, Subject, % id, aln length,# mismatches, # gap openings, q.start,q.end, s.start, s.end, e-value, bit score Isotig13419 contig698252 99.79 472 1 0 1538 2009 1187 716 0.0 928 Isotig13419 contig698252 100.00 114 0 0 1369 1482 1356 1243 1e-56 226 Isotig13419 contig698252 100.00 100 0 0 1247 1346 2243 2144 2e-48 198 Isotig13419 contig698252 98.95 95 1 0 2088 2182 637 543 5e-43 180 isotig13419 contig889828 99.72 361 1 0 570 930 1631 1271 0.0 708 isotig13419 contig889828 99.60 251 0 1 321 570 2064 1814 2e-119 434 Isotig13419 contig889828 100.00 193 0 0 981 1173 266 74 2e-82 311 isotig13419 contig889828 100.00 63 0 0 1185 1247 63 1 3e-26 125 Isotig13419 contig362216 99.72 361 1 0 570 930 364 4 0.0 708 Isotig13419 contig362216 99.60 251 0 1 321 570 797 547 2e-119 434 Isotig13419 contig362215 100.00 193 0 0 981 1173 266 74 2e-82 311

  39. Length of blast matches • Goal: figure out how long each of the matches in the subject is • Output: subject name, length of match • First: read in file • Second: access columns 2 (subject name), columns 9 and 10 (s.start, s.end) • Remember: python is zero-based! • Third: convert 9 and 10 to int, subtract and print results

  40. Parsing blast file fh = open(“blastout2.txt”, “r”) lines = fh.readlines() fh.close() for line in lines: text = line.replace(“\n”,””) fields = text.split() name = fields[1] start = int(fields[8]) stop = int(fields[9]) print name, stop - start python script, called parse_blast.py contig698252 -471 contig698252 -113 contig698252 -99 [skippingsomelines of output here] contig539930 90 contig127790 78 contig710791 33 Results from runningthe script – some output in the middleis skipped

  41. How to get absolute length? • Results from previous slide – some lengths were negative • Examine input file and figure out why

  42. Absolute lengths • Some matches are on the reverse strand, i.e. stop < start • Solution: reverse that • But: only in cases where the match is actually on the reverse strand • How to detect: see if stop < start

  43. Print absolute lengths, also for reverse strand matches • Modify code on the slide earlier so that in cases where stop < start, we print start – stop instead fh = open("blastout2.txt", "r") lines = fh.readlines() fh.close() for line in lines: text = line.replace("\n","") fields = text.split() name = fields[1] start = int(fields[8]) stop = int(fields[9]) if stop < start: print name, start - stop else: print name, stop - start If statement that lets us print outstart – stop if stop < starts, andstop – start otherwise

  44. Writing to files • Similar procedure as for read • Open file, mode is “w” or “a” • fo.write(string) • Note: one single string • Newlines have to be added specifically • fo.close()

  45. Modify script to print to file • Changes to do: • Open output file • Write to output file • Close output file

  46. Print to file fh = open("blastout2.txt", "r") lines = fh.readlines() fh.close() fo = open("results_file.txt", "w") for line in lines: text = line.replace("\n","") fields = text.split() name = fields[1] start = int(fields[8]) stop = int(fields[9]) difference = 0 if stop < start: difference = start - stop else: difference = stop – start outstring = name + "\t" + str(difference) + "\n" fo.write(outstring) fo.close() Open output file for writing. Putting the difference in a separatevariable makes it easier to print Creating output string, and writingit. Notice we added a newline! Closing the output file

  47. Results contig698252 471 contig698252 113 contig698252 99 contig698252 94 contig889828 360 contig889828 250 contig889828 192 contig889828 62 contig362216 360 contig362216 250 contig362215 192 contig362215 62 contig855826 139 contig701635 138 contig747126 141 contig280145 115 contig560635 99 contig279861 85 contig539930 90 contig127790 78 contig710791 33

More Related