Python Dictionaries

Python Dictionaries “mappings”

Python Dictionaries
• Sequences are accessed by position, that is, by an index indicating how many entries they are past the first entry.
• There is another class of collections known generally as mappings.
• The items in a mapping are accessed by a key rather than by position. (Since Python 3.7, dictionaries preserve insertion order, but items are still looked up by key, not by index.)
• Python has only one built-in mapping type, namely dictionaries, with type name dict.
• Aside for the CS students in the class: Python dictionaries are implemented using hash tables.
• Dictionaries are mutable: they can be changed in place and can grow and shrink on demand, like lists.

Python Dictionaries
• The literals used in directly defining a dictionary are a comma-separated sequence of “key: value” pairs enclosed in curly braces.
    D = {'food': 'Spam', 'quantity': 4, 'color': 'pink'}
• We “index” the dictionary by key to fetch and change the key's associated value:
    >>> D['food']
    'Spam'
    >>> D['quantity'] += 1
    >>> D
    {'food': 'Spam', 'quantity': 5, 'color': 'pink'}

Python Dictionaries
• A dictionary can also be built up one item at a time.
• First, create an empty dictionary:
    >>> D = {}
• Then create keys by assignment:
    >>> D['name'] = 'Bob'
    >>> D['job'] = 'dev'
    >>> D['age'] = 40
    >>> D
    {'name': 'Bob', 'job': 'dev', 'age': 40}
    >>> print(D['name'])
    Bob
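Because dictionaries are mutable, the same assignment syntax also changes existing values, and del removes entries. The following is a minimal sketch of growing and shrinking a dictionary in place; the 'lang' key is a made-up entry used only for illustration.

```python
# Start from the dictionary built one item at a time.
D = {'name': 'Bob', 'job': 'dev', 'age': 40}

D['age'] = 41          # change an existing value in place
D['lang'] = 'Python'   # assigning to a new key grows the dictionary
del D['job']           # del removes a key together with its value

print('job' in D)      # membership tests look at the keys
print(len(D))          # number of key/value pairs
print(D)
```

Note that fetching a key that is not present, such as D['job'] after the del, raises a KeyError.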

Dictionary Methods
• The objects returned by d.keys(), d.values(), and d.items() may be iterated over like a sequence; to convert one to an actual sequence, use list(), as in list(d.keys())
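As a short illustration of the most commonly used dictionary methods, the sketch below uses only built-in dict methods on the example dictionary from the earlier slides; the key 'size' is a hypothetical missing key.

```python
d = {'food': 'Spam', 'quantity': 4, 'color': 'pink'}

# keys(), values() and items() return view objects that can be iterated over.
for key, value in d.items():
    print(key, '->', value)

# Convert a view to an actual list when a real sequence is needed.
keys = list(d.keys())
values = list(d.values())
print(keys[0])                       # lists support indexing; views do not

# get() returns a default instead of raising KeyError for a missing key.
missing = d.get('size', 'unknown')   # 'size' is not a key of d
print(missing)
```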

Dictionaries are the natural Python representation of tabular data.
• We will illustrate this by representing the codon table showing how the codons determine amino acids during translation.

Codon Table as Dictionary
    >>> RNA_codon_table = {
    #                        Second Base
    #        U             C             A             G
    # U
        'UUU': 'Phe', 'UCU': 'Ser', 'UAU': 'Tyr', 'UGU': 'Cys',     # UxU
        'UUC': 'Phe', 'UCC': 'Ser', 'UAC': 'Tyr', 'UGC': 'Cys',     # UxC
        'UUA': 'Leu', 'UCA': 'Ser', 'UAA': '---', 'UGA': '---',     # UxA
        'UUG': 'Leu', 'UCG': 'Ser', 'UAG': '---', 'UGG': 'Trp',     # UxG
    # C
        'CUU': 'Leu', 'CCU': 'Pro', 'CAU': 'His', 'CGU': 'Arg',     # CxU
        'CUC': 'Leu', 'CCC': 'Pro', 'CAC': 'His', 'CGC': 'Arg',     # CxC
        'CUA': 'Leu', 'CCA': 'Pro', 'CAA': 'Gln', 'CGA': 'Arg',     # CxA
        'CUG': 'Leu', 'CCG': 'Pro', 'CAG': 'Gln', 'CGG': 'Arg',     # CxG
    # A
        'AUU': 'Ile', 'ACU': 'Thr', 'AAU': 'Asn', 'AGU': 'Ser',     # AxU
        'AUC': 'Ile', 'ACC': 'Thr', 'AAC': 'Asn', 'AGC': 'Ser',     # AxC
        'AUA': 'Ile', 'ACA': 'Thr', 'AAA': 'Lys', 'AGA': 'Arg',     # AxA
        'AUG': 'Met', 'ACG': 'Thr', 'AAG': 'Lys', 'AGG': 'Arg',     # AxG
    # G
        'GUU': 'Val', 'GCU': 'Ala', 'GAU': 'Asp', 'GGU': 'Gly',     # GxU
        'GUC': 'Val', 'GCC': 'Ala', 'GAC': 'Asp', 'GGC': 'Gly',     # GxC
        'GUA': 'Val', 'GCA': 'Ala', 'GAA': 'Glu', 'GGA': 'Gly',     # GxA
        'GUG': 'Val', 'GCG': 'Ala', 'GAG': 'Glu', 'GGG': 'Gly'      # GxG
    }

Codon Table as Dictionary
• The entries below print in arbitrary order, as they did in older Python versions; since Python 3.7 a dictionary prints its entries in insertion order.
    >>> RNA_codon_table
    {'ACC': 'Thr', 'GUC': 'Val', 'ACA': 'Thr', 'AAA': 'Lys',
     'GUU': 'Val', 'AAC': 'Asn', 'CCU': 'Pro', 'UGG': 'Trp',
     'AGC': 'Ser', 'AUC': 'Ile', 'CAU': 'His', 'AAU': 'Asn',
     'AGU': 'Ser', 'ACU': 'Thr', 'GUG': 'Val', 'CAC': 'His',
     'ACG': 'Thr', 'CAA': 'Gln', 'CCA': 'Pro', 'CCG': 'Pro',
     'CCC': 'Pro', 'GGU': 'Gly', 'UCU': 'Ser', 'GCG': 'Ala',
     'UGC': 'Cys', 'CAG': 'Gln', 'UGA': '---', 'UAU': 'Tyr',
     'CGG': 'Arg', 'UCG': 'Ser', 'AGG': 'Arg', 'GGG': 'Gly',
     'UCC': 'Ser', 'UCA': 'Ser', 'GAA': 'Glu', 'UAA': '---',
     'GGA': 'Gly', 'UAC': 'Tyr', 'CGU': 'Arg', 'UGU': 'Cys',
     'AUA': 'Ile', 'GCA': 'Ala', 'CUU': 'Leu', 'GGC': 'Gly',
     'AUG': 'Met', 'CUG': 'Leu', 'GAG': 'Glu', 'CUC': 'Leu',
     'AGA': 'Arg', 'CUA': 'Leu', 'GCC': 'Ala', 'AAG': 'Lys',
     'GAU': 'Asp', 'UUU': 'Phe', 'GAC': 'Asp', 'GUA': 'Val',
     'CGA': 'Arg', 'GCU': 'Ala', 'UAG': '---', 'AUU': 'Ile',
     'UUG': 'Leu', 'UUA': 'Leu', 'CGC': 'Arg', 'UUC': 'Phe'}

“Pretty Printing”
    >>> from pprint import pprint
    >>> pprint(RNA_codon_table)
    {'AAA': 'Lys',
     'AAC': 'Asn',
     'AAG': 'Lys',
     'AAU': 'Asn',
     'ACA': 'Thr',
     'ACC': 'Thr',
     'ACG': 'Thr',
     'ACU': 'Thr',
     'AGA': 'Arg',
     . . .
     'UCU': 'Ser',
     'UGA': '---',
     'UGC': 'Cys',
     'UGG': 'Trp',
     'UGU': 'Cys',
     'UUA': 'Leu',
     'UUC': 'Phe',
     'UUG': 'Leu',
     'UUU': 'Phe'}

Using the RNA_codon_table
    >>> def translate_RNA_codon(codon):
    ...     return RNA_codon_table[codon]
    ...
    >>> translate_RNA_codon('AGA')
    'Arg'
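The same lookup extends naturally from one codon to a whole RNA string. The sketch below is only an illustration: translate_RNA is a hypothetical helper that does not appear in the slides, and codon_subset is a four-entry excerpt of the codon table so that the example is self-contained.

```python
# A small excerpt of the codon table, enough for this example.
codon_subset = {'AUG': 'Met', 'UUU': 'Phe', 'GGC': 'Gly', 'UAA': '---'}

def translate_RNA(seq, table):
    """Translate seq three bases at a time (hypothetical helper).

    Trailing bases that do not form a complete codon are ignored.
    """
    amino_acids = []
    usable_length = len(seq) - len(seq) % 3
    for i in range(0, usable_length, 3):
        codon = seq[i:i + 3]
        amino_acids.append(table[codon])   # dictionary lookup by key
    return amino_acids

result = translate_RNA('AUGUUUGGCUAA', codon_subset)
print(result)
```

A codon that is missing from the table raises a KeyError, which is usually the desired behavior for malformed input.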

Streams
• A stream is a temporally ordered sequence of indefinite length
• Usually limited to one type
• Two ends:
    • source, provides the elements
    • sink, absorbs the elements
• Examples of Python stream sources:
    • files
    • network connections
    • output of special functions called generators
• Examples of stream sinks:
    • files
    • network connections
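To make the idea of a generator as a stream source concrete, here is a minimal sketch. The function count_up is a made-up example; it produces elements one at a time, on demand, and has no predetermined length.

```python
def count_up(start):
    """Generator function: an unbounded source of integers (illustrative only)."""
    n = start
    while True:
        yield n      # hand one element to the consumer, then pause
        n += 1

source = count_up(10)

# The consumer (the sink side) pulls elements one at a time.
first_three = [next(source) for _ in range(3)]
print(first_three)
```

Nothing is computed until the consumer asks for the next element, which is what allows the sequence to be of indefinite length.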

Streams
• Input to a command-line shell or the Python interpreter becomes a stream of characters
• When Python prints to the terminal, the output is also a stream of characters
• This illustrates the temporal nature of streams
    • Keystrokes don't "come from" anywhere
    • They are events that happen in time
• Implementation detail: buffering

Files
• Depending on a parameter used on creation, the elements “flowing” to/from the file stream are either bytes or Unicode characters
• Some methods treat files as streams of bytes or characters, others as lines of bytes or characters
• Most of the time, a file is a one-way stream: it can be either read from or written to
• While it is possible to create a two-way file object, it is better to think of it as two separate streams
• Files opened for reading are assumed to already exist
    • An attempt to open a non-existent file for reading results in an error
• Files can also be opened for appending to an existing file
• When a file is opened for writing, it is created if it does not exist
    • If the file does exist, its contents are erased as the result of being opened
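The behaviors listed above can be checked with a small experiment. This is a sketch that works in a temporary directory; the file name demo.txt is made up for the example.

```python
import os
import tempfile

# A path inside a fresh temporary directory, so the file does not exist yet.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Opening a non-existent file for reading is an error.
try:
    open(path, 'r')
    missing_error = False
except FileNotFoundError:
    missing_error = True

# Opening for writing creates the file.
f = open(path, 'w')
f.write('first\n')
f.close()

# Opening for appending keeps the existing contents and adds to the end.
f = open(path, 'a')
f.write('second\n')
f.close()

f = open(path, 'r')
after_append = f.read()
f.close()

# Opening an existing file for writing erases its previous contents.
f = open(path, 'w')
f.write('third\n')
f.close()

f = open(path, 'r')
after_rewrite = f.read()
f.close()

print(missing_error, repr(after_append), repr(after_rewrite))
```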

Creating File Objects
• File objects are created by a call to the built-in function open(path, mode)
• path is a string that specifies the location of the physical file represented by the Python file object
• mode is a string of length one, two, or three which specifies the type of file interaction
    • A substring specifying the intended use of the file is mandatory
        • The use options are 'r', 'w', 'a', 'r+', 'w+', 'a+'; 'r' is the default
    • An optional single character specifies the file object's value type
        • The value options are 't' and 'b'; 't' is the default
• The meaning of the mode string contents is given in the following tables
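The difference between the 't' and 'b' value options shows up in the type of object that the read methods return. The sketch below writes a small file in a temporary directory (the file name modes.txt is made up) and reads it back both ways.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

# 'wt' = write, text: the elements written are Unicode characters.
f = open(path, 'wt', encoding='utf-8')
f.write('ACGU')
f.close()

# 'rt' = read, text: read() returns a str.
f = open(path, 'rt', encoding='utf-8')
as_text = f.read()
f.close()

# 'rb' = read, binary: read() returns a bytes object.
f = open(path, 'rb')
as_bytes = f.read()
f.close()

print(type(as_text).__name__, type(as_bytes).__name__)
```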

File Modes (Unicode)

Creating File Objects
• Simple use of the open function:
    f = open(r'C:\Users\rtindell\myfile', 'r')
    • The r prefix makes the path a raw string, so the backslashes in the Windows path are not treated as escape sequences
• When you are finished using the file object, you must close it:
    f.close()
• There are hazards to using this approach which, although relatively rare, can have bad results
• If your script crashes before the close statement is executed, there may be writes whose data was not actually written to the physical file
• This is because transfers to and from external drives are done in fairly large chunks (blocks) because of the way the underlying hardware works
• The chunks are kept in special pieces of computer memory called buffers
• Requests for reading or writing are satisfied by the buffers until all the contents of the buffer have been used
• At that point, entire blocks are transferred between main memory and the drive
• If the buffer was half-full when your script crashed, the buffer data would never be written to the disk

The with Statement
• Python provides a way to make sure files are closed regardless of other events: the with statement
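A minimal sketch of the with statement follows. It uses a made-up file in a temporary directory and a deliberately raised exception to show that the file is closed in both the normal and the error case.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

# The file object is bound to f for the duration of the indented block.
with open(path, 'w') as f:
    f.write('data\n')
    print(f.closed)     # still open inside the block

print(f.closed)         # closed automatically on leaving the block

# The file is closed even when the block is left because of an exception.
try:
    with open(path, 'r') as g:
        raise ValueError('simulated failure')
except ValueError:
    pass

print(g.closed)
```

Closing the file also flushes its buffer, so data written inside the block reaches the physical file.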

File Read Methods
In the following, f is a file object opened in text mode; for a file opened in binary mode, read "bytes" for "characters" and "bytes object" for "string"
• f.read(count)
    • Treats the file as an input stream of characters
    • Reads up to count characters from the current file position into a string and returns that string
    • The file position is then the next character in the file
    • If there were fewer than count characters left in the file, returns just the remaining characters
    • If the file position is the end of the file, returns the empty string
• f.read()
    • Reads the characters from the current position to the end of the file into a string and returns that string
    • The file position is then the end of the file

File Read Methods
In the following, f is a file object
• f.readline()
    • Treats the file as an input stream of lines
    • Reads one line from the file and returns the entire line, including the end-of-line character
    • The file position is then the beginning of the next line of the file, or the end of the file if the file was exhausted
    • If the initial file position is the end of the file, returns the empty string
• f.readline(count)
    • Same as f.readline(), but limits the number of characters read to count
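The read methods can be tried out on a small two-line file. This is a sketch using a made-up file in a temporary directory; each call starts where the previous one left off.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'read_demo.txt')
with open(path, 'w') as f:
    f.write('line one\nline two\n')

with open(path, 'r') as f:
    first_four = f.read(4)        # up to 4 characters from the current position
    rest_of_line = f.readline()   # the rest of the first line, newline included
    short = f.readline(3)         # at most 3 characters of the second line
    remainder = f.read()          # everything up to the end of the file
    at_end = f.read()             # already at the end: the empty string

print(repr(first_four), repr(rest_of_line), repr(short),
      repr(remainder), repr(at_end))
```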

File Write Methods
In the following, f is a file object, s is a string, and seq is a sequence object
• f.write(s)
    • Treats the file as an output stream of characters
    • Writes the string s to the file represented by f
• f.writelines(seq)
    • Treats the file as an output stream of lines
    • All elements of seq must be strings
    • Writes each element of seq to the file
    • Despite the name, it does not add newline characters to the elements
• The print function also provides a mechanism for writing to a file
    • You pass the file object as the keyword argument file, after the values to be printed
    • Example: print('Hello', 'young', 'bioinformaticians', file=f)
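The three ways of writing can be compared side by side. The sketch below writes to a made-up file in a temporary directory and then reads the result back.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'write_demo.txt')

with open(path, 'w') as f:
    f.write('>seq1\n')                    # write() adds nothing to the string
    f.writelines(['ACGU\n', 'GGCC\n'])    # newlines must be supplied by the caller
    # print() separates its arguments with spaces and appends a newline.
    print('Hello', 'young', 'bioinformaticians', file=f)

with open(path, 'r') as f:
    contents = f.read()

print(contents)
```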

FASTA Format
• FASTA formatted files are widely used in bioinformatics
• They consist of one or more base or amino acid sequences broken up into fixed-size lines, each sequence preceded by a single header line
• The header line starts with a ">" symbol
• The first word on this line is the name of the sequence. The rest of the line optionally provides a description of the sequence.
• The meaning of the header line entries is given in the first row below, which does not exist in the FASTA file. The second row is the actual entry for our example.

    Identifier    Molecule Type    Gene Name    Sequence Length
    FOSB_MOUSE    Protein          fosB         338 bp

FASTA Format
• FASTA sequence identifiers are usually more complex than previously shown and distinguish various possible sources for the sequence
• Below is a table of identifier formats
• In a genomic context, locus refers to a position on a chromosome. It may, therefore, refer to a marker, a gene, or any other landmark that can be described.
• accession = Accession Number

FASTA Example
    >FOSB_MOUSE Protein fosB. 338 bp
    MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
    ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
    GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
    DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
    LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
    TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
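Splitting a header line into its identifier and description takes only a few string operations. The sketch below applies them to the header from the example; the variable names are chosen here for illustration.

```python
header = '>FOSB_MOUSE Protein fosB. 338 bp'

words = header[1:].split()         # drop the leading '>' and split on whitespace
identifier = words[0]              # the first word names the sequence
description = ' '.join(words[1:])  # the rest, if any, describes it

print(identifier)
print(description)
```

If the header holds only an identifier, words[1:] is empty and the description becomes the empty string.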

Creating a Sequence Dictionary from a FASTA File
• We next present three ways to read the contents of a file containing FASTA format data into a dictionary whose keys are the sequence identifiers
• A dictionary value will be a length-two list of strings
    • The first string will contain the sequence description if present in the FASTA file; otherwise it will be the empty string
    • The second string will be a single string containing the sequence itself
• If D were the name of our dictionary and seqid the identifier of a sequence from the file, we could access that sequence as D['seqid'][1], for example D['FOSB_MOUSE'][1]
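The target data structure can be pictured with a hand-built example. In this sketch the FOSB_MOUSE sequence is cut down to its first ten residues to keep the example short, and SEQ_X is a made-up entry showing the case of a missing description.

```python
D = {
    # identifier: [description, sequence]
    'FOSB_MOUSE': ['Protein fosB. 338 bp', 'MFQAFPGDYD'],   # sequence shortened
    'SEQ_X': ['', 'ACGU'],                                  # hypothetical entry
}

description = D['FOSB_MOUSE'][0]   # index 0 holds the description
sequence = D['FOSB_MOUSE'][1]      # index 1 holds the sequence

print(description)
print(sequence)
print(D['SEQ_X'][0] == '')         # an absent description is the empty string
```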

Method 1: Reading the Entire File into a String
    # file fasta_dict1.py
    def fasta_to_dictionary(fpath):
        D = {}
        with open(fpath, 'r') as f:
            # Read the whole file, then separate the entries
            S = f.read()
            J0 = S.split('>')
            J = [j for j in J0 if j != '']   # Eliminate empty strings
        # J is now a list of strings, each of which contains one of
        # the sequence specifications from the FASTA file

        # (fasta_to_dictionary definition continued)
        for B in J:
            C = [k for k in B.splitlines() if k != '']
            # C[0] is the first line of B
            # and is thus the name-description line
            comps = C[0].split()
            key = comps[0]   # First word is the identifier for the sequence
            # Remaining words in comps are sequence description components
            if len(comps) > 1:
                descr = ' '.join(comps[1:])
            else:
                descr = ''
            # Remaining lines of B contain the split-up sequence
            # so join them into a single line
            seq = ''.join(C[1:])
            D[key] = [descr, seq]
        # No f.close() is needed: the with statement has already closed f
        return D

Main Body of fasta_dict1.py
    # file fasta_dict1.py continued
    # Test the function
    D = fasta_to_dictionary('fdata')
    if len(D) == 0:
        print('No FASTA data found')
    else:
        for k in D:
            print('Sequence Identifier:\t\t', k)
            print('Sequence Description:\t', D[k][0])
            print('Sequence:')
            print(D[k][1], '\n')

Contents of Test File fdata
    >FOSB_MOUSE Protein fosB. 338 bp
    MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
    ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
    GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
    DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
    LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
    TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
    >FOSB_RAT Protein fosB. 338 bp
    MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
    ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
    GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
    DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
    LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
    TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Output of fasta_dict1.py
    Sequence Identifier:      FOSB_MOUSE
    Sequence Description:     Protein fosB. 338 bp
    Sequence:
    MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

    Sequence Identifier:      FOSB_RAT
    Sequence Description:     Protein fosB. 338 bp
    Sequence:
    MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Possible Problems with Script fasta_dict1.py
• If we were trying to process a very large file, reading its entire contents into a string in memory might not be possible
• We will modify the script so that it processes one line at a time by iterating over the file object, which yields the same lines as repeated calls to the readline method
• Of course, this is only one part of a correction, since the dictionary we build would be at least as large as the file!
• We will address that problem later
• We will use the with statement so that you can see it in use
• Note that all statements that use the file object f must be in the block of the with statement
• Why? After you leave the with block, f has been closed.

Method 2: Reading the File One Line at a Time
• Since the two scripts differ only in the fasta_to_dictionary function, we show only that part here.
    # file fasta_dict2.py
    def fasta_to_dictionary(fpath):
        D = {}
        with open(fpath, 'r') as f:
            key = ''
            descr = ''
            seq = ''

Method 2: Reading the File One Line at a Time (continued)
            for line in f:
                line = line.strip()
                if line.startswith('>'):
                    if key != '':   # Finished with a sequence
                        D[key] = [descr, seq]
                    # Drop the leading '>' so the key matches Method 1
                    comps = line[1:].split()
                    key = comps[0]
                    if len(comps) > 1:
                        descr = ' '.join(comps[1:])
                    else:
                        descr = ''
                    seq = ''
                else:
                    seq += line
                    # If there are lines preceding the first
                    # '>' line they will accumulate here and
                    # be discarded when we start processing
                    # the first '>' line
        # END OF with SUITE
        # Save the final sequence, which was terminated by file's end
        if key != '':
            D[key] = [descr, seq]
        return D
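One possible way to avoid holding every sequence in memory at once is to turn the line-by-line loop into a generator that hands back one record at a time. The following is only a sketch under that assumption, not the approach the slides go on to present; fasta_records is a hypothetical name, and io.StringIO stands in for an open file so that the example runs without a FASTA file on disk.

```python
import io

def fasta_records(f):
    """Yield (identifier, description, sequence) tuples from an open file f.

    Hypothetical helper: only the record currently being read is held in memory.
    """
    key, descr, parts = '', '', []
    for line in f:
        line = line.strip()
        if line.startswith('>'):
            if key != '':                    # finished with a sequence
                yield key, descr, ''.join(parts)
            comps = line[1:].split()         # drop the leading '>'
            key = comps[0] if comps else ''
            descr = ' '.join(comps[1:])
            parts = []
        elif key != '':
            parts.append(line)               # lines before the first '>' are ignored
    if key != '':                            # final record, ended by end of file
        yield key, descr, ''.join(parts)

# Made-up two-record input used only to exercise the function.
sample = io.StringIO('>s1 first test\nACGU\nGG\n>s2\nUUUU\n')
records = list(fasta_records(sample))
print(records)
```

With a real file, the caller would loop over fasta_records(f) inside a with block and process each record before the next one is read.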

Exploring the Preceding Examples
• The files fasta_dict1.py and fasta_dict2.py have been posted on the Practice Problems page of the course website (not Blackboard)
• One way to get an understanding of the scripts is to insert print statements that print the intermediate data appearing in the script

Exploring the Preceding Examples
• For example, in fasta_dict1.py, you could replace
        with open(fpath, 'r') as f:
            S = f.read()
            J0 = S.split('>')
            J = [j for j in J0 if j != '']
  with
        with open(fpath, 'r') as f:
            S = f.read()
            print(S)
            J0 = S.split('>')
            print(J0)
            J = [j for j in J0 if j != '']
            print(J)
