Advanced File Parsing: Regular Expressions

Presentation Transcript


  1. Advanced File Parsing: Regular Expressions
     BCHB524, 2008, Lecture 6
     BCHB524 - 2008 - Edwards

  2. Outline
     • Review
     • Lecture 4 exercises
     • Regular Expressions
     • Protein active sites / functional domains
     • Restriction / digestion enzymes
     • Specialized text parsing
     • Exercises

  3. Review
     • Basic data-types: immutable
         • Integers, floats, strings, tuples, booleans, None
     • Statements:
         • Assignment, if statements, for statements
     • Compound data-structures: mutable
         • Lists, dictionaries, sets, arrays, files
     • Lists ↔ Strings
         • Reading sequences from files, parsing NCBI tax names
     • Advanced iteration:
         • Iterables, comprehensions, generators, sorting keys
     • Modules:
         • BioPython, parsing Fasta, RefSeq, and UniProt files

  4. Lecture 4 exercise discussion

  5. Regular Expressions
     • Good HOWTO
         • http://py-howto.sourceforge.net/regex/regex.html
     • Andrew Dalke's lecture on this is superb
         • See link to courses on "Lecture 1 Links" post.
     • Useful for many, many string tasks
         • Most string methods can be implemented using re
         • Parsing, picking apart text-based formats
         • DNA sequence motifs
         • Protein sequence motifs
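
The claim that most string methods can be implemented using re can be illustrated with a minimal sketch. It is written for Python 3 and uses only the standard re module; the sequence is made up for illustration. Each line pairs a string method with one possible re equivalent.

```python
import re

text = "ATGGCCAAGTAA"   # hypothetical DNA sequence, for illustration only

# "GCC" in text  ->  re.search returns a match object, or None if no match
has_gcc = re.search(r"GCC", text) is not None

# text.find("GCC")  ->  start() of the first match
find_pos = re.search(r"GCC", text).start()

# text.startswith("ATG")  ->  re.match only matches at the start of the string
starts_atg = re.match(r"ATG", text) is not None

# text.replace("A", "a")  ->  re.sub replaces every match
lowered_a = re.sub(r"A", "a", text)

# text.split("G")  ->  re.split splits at every match
pieces = re.split(r"G", text)

print(has_gcc, find_pos, starts_atg)
print(lowered_a)
print(pieces)
```

When the text being searched for contains characters that are special in regular expressions (such as . or [), re.escape() can be applied to it first.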

  6. Regular Expressions
     • "Look" ugly!
     • Esoteric syntax
     • Are used (overused?) for everything in Perl, Linux
     • Can be very hard to get right
     • Constant source of frustration and bugs
     • So powerful, you just can’t afford not to know and use them.

  7. Protein function signatures
     • Protein sequence suggests structure / shape
         • …which suggests function.
     • Functional protein domains have similar sequences
         • some very, very similar
         • others quite dissimilar
     • ProSite is a database of protein signatures
         • Signatures represented as consensus pattern

  8. p53 tumor antigen protein
     • Many contain the string: MCNSSCMGGMNRR
     • Others contain the string: MCNSSCVGGMNRR
         import Bio.SeqIO
         handle = open("sprot_chunk.dat")
         for seq_record in Bio.SeqIO.parse(handle, "swiss"):
             # get the residues as a plain string
             seq = seq_record.seq.tostring()
             # test each variant of the signature separately
             if 'MCNSSCMGGMNRR' in seq or 'MCNSSCVGGMNRR' in seq:
                 print seq_record.id, "is a p53 tumor antigen."
         handle.close()

  9. p53 tumor antigen protein
     • A better way:
         • Match MCNSSC, then M or V, then GGMNRR
         • [...] is list of matching residues
         • So, match MCNSSC[MV]GGMNRR instead.
         import Bio.SeqIO
         import re
         handle = open("sprot_chunk.dat")
         for seq_record in Bio.SeqIO.parse(handle, "swiss"):
             seq = seq_record.seq.tostring()
             # [MV] accepts either residue at that position
             if re.search(r'MCNSSC[MV]GGMNRR',seq):
                 print seq_record.id, "is a p53 tumor antigen."
         handle.close()
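
The effect of the [MV] character class can be tried without a SwissProt file. The following is a minimal Python 3 sketch; the three sequence fragments and their names are made up for illustration and are not real database entries.

```python
import re

# Hypothetical sequence fragments, made up for illustration only.
fragments = {
    "frag_M": "AAPMCNSSCMGGMNRRPIL",   # M at the variable position
    "frag_V": "AAPMCNSSCVGGMNRRPIL",   # V at the variable position
    "frag_L": "AAPMCNSSCLGGMNRRPIL",   # L is not in [MV]
}

p53_pattern = re.compile(r"MCNSSC[MV]GGMNRR")

# Keep the names of the fragments in which the pattern is found.
hits = [name for name, seq in fragments.items() if p53_pattern.search(seq)]
print(hits)
```

Only the M and V variants are reported; the L variant differs at the one position that the character class constrains.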

  10. Antennapedia signature
     • 'Homeobox' antennapedia-type protein signature is more interesting:
         • [LIVMFE] - [FY] - P - W - M - [KRQTA]
     • As a regular expression:
         • [LIVMFE][FY]PWM[KRQTA]
     • Some matches in human proteins:
         • EYPWMK, IFPWMK, VYPWMK, IYPWMR, VYPWMQ, IYPWMR, EFPWMK, IFPWMK, VYPWMR, IFPWMR, VYPWMQ, IYPWMR, LFPWMR, VYPWMK, IYPWMT, IYPWMQ, MFPWMR, IFPWMK, VYPWMK, MFPWMR
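
As a quick check, the regular expression can be run against some of the hexapeptides listed on the slide. This is a minimal Python 3 sketch; the "near miss" peptides are made up for illustration, each breaking the pattern at one position.

```python
import re

antennapedia = re.compile(r"[LIVMFE][FY]PWM[KRQTA]")

# A few of the hexapeptides listed on the slide.
listed = ["EYPWMK", "IFPWMK", "VYPWMQ", "LFPWMR", "IYPWMT", "MFPWMR"]
# Hypothetical near misses: wrong first, second, and last residue.
near_misses = ["AYPWMK", "EWPWMK", "EYPWMD"]

# fullmatch requires the whole peptide to match the pattern.
all_listed_match = all(antennapedia.fullmatch(p) for p in listed)
any_near_miss_matches = any(antennapedia.fullmatch(p) for p in near_misses)
print(all_listed_match, any_near_miss_matches)
```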

  11. N-Glycosylation site
     • Pattern is N, then any residue except P, then S or T, then any residue except P.
     • Could use (for not P): [ACDEFGHIKLMNQRSTVWY]
     • Better: [^P]
     • Caveat: [^P] includes B, J, O, Z, %, $, a, c, …
         if re.search(r'N[^P][ST][^P]',seq):
             print "glycosylation site!"
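
The caveat about [^P] can be made concrete with a minimal Python 3 sketch. Both sequences are hypothetical; the second contains a stray '*' and a newline, the kind of debris that survives careless file parsing.

```python
import re

loose = re.compile(r"N[^P][ST][^P]")
# Stricter variant: spell out the 19 standard residues other than proline.
NOT_P = "[ACDEFGHIKLMNQRSTVWY]"
strict = re.compile("N" + NOT_P + "[ST]" + NOT_P)

clean = "MKNGSAL"      # hypothetical fragment containing N-G-S-A
messy = "MKN*S\nAL"    # hypothetical fragment with stray characters

clean_result = (bool(loose.search(clean)), bool(strict.search(clean)))
messy_result = (bool(loose.search(messy)), bool(strict.search(messy)))
print(clean_result)
print(messy_result)
```

On the clean fragment both patterns agree. On the messy fragment only the loose pattern reports a site, because [^P] accepts '*' and even the newline character.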

  12. Trypsin digest site
     • Pattern is K or R, followed by any residue except P.
         if re.search(r'[KR][^P]',seq):
             print "tryptic digest site!"

  13. Barwin domain signature
     • Signature: C - G - [KR] - C - L - x - V - x - N
     • '.' (period) matches any character/residue
     • As a regular expression: CG[KR]CL.V.N
     • Matches (list not preserved in this transcript)

  14. Repeated Residues
     • For example, 3 hydrophobic residues
         [FILAPVM][FILAPVM][FILAPVM]
     • Regular expression:
         [FILAPVM]{3}   - exactly 3 hydrophobic residues
         [FILAPVM]{3,5} - between 3 and 5
         [FILAPVM]{,3}  - at most 3
         [FILAPVM]{3,}  - at least 3
     • .{10} matches exactly 10 characters, residues
         • domain signatures often have spacers
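
The difference between the repeat forms shows up when a run is longer than the minimum. This is a minimal Python 3 sketch with a made-up sequence containing one run of five and one run of four hydrophobic residues.

```python
import re

seq = "GSFILAVKDEPVMAG"   # hypothetical sequence, made up for illustration
hydrophobic = "[FILAPVM]"

# {3,} is greedy: each match is a whole run of 3 or more residues.
runs = [(m.start(), m.group()) for m in re.finditer(hydrophobic + "{3,}", seq)]
print(runs)

# {3} consumes exactly three residues per match, so only the first
# three residues of each run are reported here.
triples = re.findall(hydrophobic + "{3}", seq)
print(triples)
```

Matches do not overlap: after FIL is consumed, the remaining AV of the first run is too short to match {3} again.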

  15. Aspartic acid and asparagine hydroxylation site
     • Consensus pattern: C - C - x(13) - C - x(2) - [GN] - x(12) - C - x - C - x(2,4) - C
     • As regular expression: CC.{13}C.{2}[GN].{12}C.C.{2,4}C
         • . is same as .{1}, of course
     • Special repeat ranges:
         • Optional: ? is same as {0,1}
         • 0 or more: * is same as {0,}
         • 1 or more: + is same as {1,}
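
The translation from consensus pattern to regular expression is mechanical, so it can be sketched as a small function. This is a minimal Python 3 sketch, not a complete ProSite parser: the function name prosite_to_regex is made up for illustration, and it handles only the elements that appear on these slides (single residues, [..] sets, x, and x(n) or x(n,m) repeats).

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple consensus pattern to a Python regular expression.

    Sketch only: supports single residues, [..] sets, x, and
    x(n) / x(n,m) repeat counts. Anything else raises ValueError.
    """
    element_re = r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+(?:,\d+)?)\))?"
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = re.fullmatch(element_re, element)
        if m is None:
            raise ValueError("unsupported element: %r" % element)
        residue, count = m.groups()
        # x means "any residue", which is '.' in a regular expression
        regex = "." if residue == "x" else residue
        if count is not None:
            # x(13) becomes .{13}; x(2,4) becomes .{2,4}
            regex += "{" + count + "}"
        parts.append(regex)
    return "".join(parts)

hydroxylation = prosite_to_regex(
    "C - C - x(13) - C - x(2) - [GN] - x(12) - C - x - C - x(2,4) - C")
antennapedia = prosite_to_regex("[LIVMFE] - [FY] - P - W - M - [KRQTA]")
barwin = prosite_to_regex("C - G - [KR] - C - L - x - V - x - N")
print(hydroxylation)
print(antennapedia)
print(barwin)
```

The three conversions reproduce the regular expressions given on the slides for the hydroxylation site, the antennapedia signature, and the Barwin domain.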

  16. N- and C- terminals
     • We can match at start or end of sequence only
         • ^ matches at start of sequence
         • $ matches at end of sequence
     • Starts with methionine:
         • re.search(r'^M',seq)
     • Ends with proline codon:
         • re.search(r'CC.$',seq)
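
The anchors can be tried on short made-up sequences. This is a minimal Python 3 sketch; all three sequences are hypothetical, and the last comparison shows what the ^ anchor changes.

```python
import re

protein = "MKTAYIAKQR"    # hypothetical protein sequence
dna = "ATGGCTAAGCCG"      # hypothetical coding sequence ending in CCG

starts_with_met = re.search(r"^M", protein) is not None
ends_with_pro_codon = re.search(r"CC.$", dna) is not None

# Without the anchor an internal M is found; with it, only a leading M counts.
internal_m = re.search(r"M", "AKMR") is not None
anchored_m = re.search(r"^M", "AKMR") is not None

print(starts_with_met, ends_with_pro_codon)
print(internal_m, anchored_m)
```

Note that CC.$ only looks at the last three characters; it does not check that they fall in the reading frame.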

  17. Regular expressions in Python
     • re module
     • re.search(regex,string) to find a match
         • returns a "match" object, or None
     • Match objects store information about a successful match.
         m = re.search(r'[KR][^P]',seq)
         if m is not None:
             # m.start() is 0-based; add 1 for 1-based residue numbering
             print "tryptic digest site at",(m.start()+1)
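
re.search reports only the first match. To visit every match, the re module also provides re.finditer, which yields one match object per non-overlapping match. The following minimal Python 3 sketch applies it to the N-glycosylation pattern from earlier in the lecture; the sequence is made up for illustration.

```python
import re

seq = "MNGTLKNPSANVSW"   # hypothetical sequence, made up for illustration

sites = []
for m in re.finditer(r"N[^P][ST][^P]", seq):
    # start() is 0-based and end() is exclusive, so start()+1 and end()
    # give 1-based inclusive residue positions.
    sites.append((m.start() + 1, m.end(), m.group()))
print(sites)
```

The N at position 7 is skipped because it is followed by P.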

  18. Regular expressions in Python
     • Groups store part of a match for later.
     • Indicate with (…)
     • Particularly useful with variable length matches
         import re
         pattern = r'[ASD]{3,5}([LI])[^P]{2,5}'
         seq = "EASALWTRD"
         m = re.search(pattern,seq)
         if m is not None:
             # extent of the whole match
             print m.start(),m.end()
             # extent and text of group 1, the ([LI]) part
             print m.start(1),m.end(1)
             print m.group(1)
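
Tracing this example by hand helps: E is not in [ASD], so the match starts at index 1 with ASA; the group then captures L at index 4; finally [^P]{2,5} takes the remaining WTRD. A minimal Python 3 version of the same example confirms the trace.

```python
import re

pattern = r"[ASD]{3,5}([LI])[^P]{2,5}"
seq = "EASALWTRD"
m = re.search(pattern, seq)

# group(0) is the whole match; group(1) is the parenthesized part.
whole = (m.start(), m.end(), m.group(0))
group1 = (m.start(1), m.end(1), m.group(1))
print(whole)    # (1, 9, 'ASALWTRD')
print(group1)   # (4, 5, 'L')
```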

  19. Groups are great for parsing
     • Check for a match, and then pick out the piece you need if the match succeeds
         import re
         dbxrefs = ['EMBL:CR940353', 'RefSeq:XP_953099.1', 'GeneID:3863060',
                    'KEGG:tan:TA08425', 'GO:GO:0005886', 'InterPro:IPR007480',
                    'Pfam:PF04385']
         for r in dbxrefs:
             # group 1 captures the accession without its version suffix
             m = re.search(r'^RefSeq:([NX]P_[0-9]+)\.[0-9]+$',r)
             if m is not None:
                 print "RefSeq accession is",m.group(1)
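
The same idea extends to more than one group. As a minimal Python 3 sketch (the pattern and the by_db dictionary are illustrative assumptions, not part of the lecture), two groups can split every cross-reference into a database name and an identifier, including the entries whose identifier itself contains a colon.

```python
import re

dbxrefs = ['EMBL:CR940353', 'RefSeq:XP_953099.1', 'GeneID:3863060',
           'KEGG:tan:TA08425', 'GO:GO:0005886', 'InterPro:IPR007480',
           'Pfam:PF04385']

# Group 1: database name (everything before the first colon).
# Group 2: identifier (the rest, which may itself contain colons).
xref_pattern = re.compile(r'^([^:]+):(.+)$')

by_db = {}
for r in dbxrefs:
    m = xref_pattern.search(r)
    if m is not None:
        by_db[m.group(1)] = m.group(2)

print(by_db['RefSeq'])
print(by_db['KEGG'])
print(by_db['GO'])
```

Using [^:]+ for the first group stops it at the first colon, so 'KEGG:tan:TA08425' splits into 'KEGG' and 'tan:TA08425'.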

  20. Lab exercises
     • Try each of the examples shown in these slides.
     • Read through the Python Regular Expression HOWTO and Andrew Dalke's lecture "Searching and Regular Expressions"
     • Write a regular expression to match the codons that code for each amino-acid.
         • Note: S, R and L are hard!

  21. Lab exercises
     • Construct regular expressions for the restriction enzyme motifs:
         • GANTC, where N represents A, C, T, or G
         • CCWGG, where W represents A or T
     • Write a program to chop a protein sequence into tryptic peptides.
         • Print out each tryptic peptide, as well as its start and end position.

  22. Lab exercises
     • The GN "line" in a SwissProt entry lists various types of gene names for the protein
         • A BioPython seq_record object stores this in the dictionary seq_record.annotations, with key 'gene_name'.
     • Using a regular expression, find which SwissProt entries have a gene name denoted "Name"
     • For those with a "Name" gene name, extract the gene name and print out the protein's id and the gene name.
     • Try the above without BioPython.
     • Try the above without regular expressions!
