
Pattern databases


Presentation Transcript


  1. Pattern databases Gopalan Vivek

  2. Pattern databases - topics • Definition • Applications • Classifications • Common Databases • Conclusions

  3. Pattern databases • Definition • Applications • Classifications • Common Databases • Conclusions

  4. Pattern databases – definition • Secondary databases derived from conserved patterns obtained by multiple sequence alignment of sequences from primary databases such as GenBank, EMBL, DDBJ, SP/TrEMBL, PIR, etc.

  5. Primary databases (SWISS-PROT: protein; GenBank: DNA), millions of sequences → Pattern extraction (multiple sequence alignment) → Pattern databases, thousands of patterns

  6. Pattern databases • Definition • Applications • Classifications • Common Databases • Conclusions

  7. Pattern Databases - Applications • Function prediction of protein/nucleotide sequences even when sequence similarity is low (<25%). • Useful for classification of protein sequences into families. • Searching the patterns takes less time than searching the primary database, since patterns are a compact representation of the features of many sequences.

  8. Pattern databases • Definition • Applications • Classifications • Common Databases • Conclusions

  9. Multiple Sequence Alignment (MSA) • Family based databases – consider the full MSA • Motif based databases – consider local regions in the MSA (e.g. Motif-1, Motif-3)

  10. Pattern Databases – Protein • Motif based: PROSITE, PRINTS, BLOCKS • Family based: ProDom, PIR-ALN, ProtoMap, DOMO, ProClass, Pfam, SMART, TIGRFAMs, SBASE, SYSTERS

  11. InterPro - Integrated resources of protein families and sites • Combines PROSITE, PRINTS, BLOCKS, Pfam and ProDom into InterPro

  12. Pattern databases • Definition • Applications • Classifications • Common Databases • PROSITE, PRINTS, BLOCKS & SMART (motif based) • MetaFam, InterPro (Integrated databases) • Conclusions

  13. Databases – General Tips • Source • Input formats & parameters • Output formats • Quality of the data • Other details – updates, coverage, speed, download, reference, methods etc.

  14. Focus • To search pattern databases for the “Alkaline phosphatase” enzyme using their text or keyword search options. • To analyze the quality of the results from each of these databases (sensitivity, specificity). • Sequence & pattern searches: in the afternoon’s practical.
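
The two quality measures named on this slide can be computed from a confusion table of search hits. The sketch below uses the standard definitions; the counts are made-up placeholders, not results from any of the databases discussed here.

```python
def sensitivity(tp, fn):
    """Fraction of true family members that the signature detects."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-members that the signature correctly rejects."""
    return tn / (tn + fp)


# Hypothetical counts for one signature scanned against a sequence database:
# tp = true positives, fn = false negatives (missed family members),
# fp = false positives, tn = true negatives.
tp, fn, fp, tn = 40, 10, 5, 945

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

A signature that is too short tends to lose specificity (more false positives), while one that is too strict tends to lose sensitivity (more missed family members).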

  15. PROSITE http://www.expasy.org/prosite/ • Consists of biologically significant protein sites, patterns and profiles that help to reliably identify which known protein family (if any) a new sequence belongs to. • Based on SWISSPROT/TrEMBL

  16. http://www.expasy.org/prosite/ ID and text Search Text Search Sequence Scanner

  17. Result: PROSITE Documentation page • PROSITE ID • Details about the pattern/profile • PROSITE Pattern: [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T [S is the active site residue]
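
To make the pattern notation on this slide concrete, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function name `prosite_to_regex` and the test sequence are illustrative assumptions; only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts and the `<`/`>` anchors) are handled.

```python
import re

# One pattern element: optional "<" anchor, a residue spec, optional (n) or
# (n,m) repeat count, optional ">" anchor.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern string into a regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":                 # x = any residue
            core = "."
        elif core.startswith("{"):      # {ABC} = any residue except A, B, C
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:             # x(2) -> .{2}, x(2,4) -> .{2,4}
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T")

# Made-up sequence fragment, used only to show a match.
hit = re.search(regex, "AAVTDSAASATAW")
print(regex, hit.start(), hit.group())
```

In the matched segment the serine following the aspartate (the fourth residue of the match) is the position the slide marks as the active site.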

  18. Detailed View - page 1 PROSITE Pattern Numerical Results

  19. Detailed View - page 2 True Positives False Positives View entry in raw text format (no links)

  20. Raw Text Format – PROSITE Format

  21. ID: Identification • AC: Accession number • DT: Date • DE: Short description • PA: Pattern • MA: Matrix/profile • RU: Rule • NR: Numerical results • CC: Comments • DR: Cross-references to SWISS-PROT • 3D: Cross-references to PDB • DO: Pointer to the documentation file • //: Termination line
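
The line codes above lend themselves to a very small parser. The following sketch groups the lines of each entry by their two-letter code and closes an entry at the `//` termination line. The sample entry is a placeholder written for illustration (its identifier and accession values are invented); only the pattern line is taken from slide 17.

```python
from collections import defaultdict

# Placeholder entry: ID, AC and DO values are hypothetical.
SAMPLE_ENTRY = """\
ID   EXAMPLE_SITE; PATTERN.
AC   PS99999;
DE   Example active site (placeholder entry).
PA   [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T.
DO   PDOC99999;
//
"""


def parse_entries(text):
    """Split flat-file text into entries: dicts mapping line code -> values."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):       # termination line closes the entry
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
            continue
        if len(line) < 2:
            continue                    # skip blank lines
        code, value = line[:2], line[2:].strip()
        current[code].append(value)     # a code may occur on several lines
    if current:                         # tolerate a missing final "//"
        entries.append(dict(current))
    return entries


entries = parse_entries(SAMPLE_ENTRY)
print(entries[0]["ID"], entries[0]["PA"])
```

Values are kept as lists because codes such as CC or DR can span several lines within one entry.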

  22. PROSITE Profiles

  23. Highly degenerate protein structural and functional domains (immunoglobulin domains, SH2 and SH3 domains) • Consensus sequences of repetitive DNA elements (SINEs, LINEs) • Basic gene expression signals (promoter elements, RNA processing signals, translational initiation sites) • DNA-binding protein motifs • Protein and nucleic acid compositional domains (glutamine-rich activation domains, CpG islands)

  24. PROSITE - features • Completeness • High specificity • Documentation • Periodic reviewing • Parallel update with SWISS-PROT (primary database)

  25. Multiple Sequence Alignment (motif: cydeggis, cyedggis, cyeeggit, cyhgdggs, cyrgdgnt) → Find 4-5 functionally conserved residues → CORE PATTERN: C-Y-x2-[DG]-G-x-[ST] → Scan SWISS-PROT → More FALSE POSITIVES? • YES: increase the sequence length of the pattern and scan again • NO: add the pattern to the PROSITE DB
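
The step from the aligned motif to the core pattern can be sketched column by column. The rule used below (keep a column if it holds at most two distinct residues, otherwise write `x`) is an assumption chosen for illustration, not the curated procedure behind PROSITE; the five motif rows are the ones shown on this slide.

```python
from itertools import groupby

MOTIFS = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]


def core_pattern(aligned, max_set=2):
    """Derive a PROSITE-style pattern from equal-length, ungapped motif rows."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("all rows must have the same length")
    elements = []
    for column in zip(*(row.upper() for row in aligned)):
        residues = sorted(set(column))
        if len(residues) == 1:              # fully conserved position
            elements.append(residues[0])
        elif len(residues) <= max_set:      # small set of allowed residues
            elements.append("[" + "".join(residues) + "]")
        else:                               # too variable: any residue
            elements.append("x")
    parts = []
    for key, group in groupby(elements):    # collapse runs of x into x(n)
        count = len(list(group))
        if key == "x" and count > 1:
            parts.append(f"x({count})")
        else:
            parts.extend([key] * count)
    return "-".join(parts)


print(core_pattern(MOTIFS))
```

With these rows and `max_set=2` the sketch yields `C-Y-x(2)-[DG]-G-x-[ST]`, which is the slide's core pattern with the repeat written as `x(2)` instead of `x2`.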

  26. PRINTS http://bioinf.man.ac.uk/dbbrowser/PRINTS/ • Protein fingerprint database • Fingerprint: a set of motifs that represent the most conserved regions of a multiple sequence alignment. • Better diagnostic reliability than single-motif methods • Source: SWISSPROT/TrEMBL

  27. Multiple Sequence Alignment → Identification of ALL the conserved regions (motifs, e.g. cydeggis, cyedggis, cyeeggit, cyhgdggs) → fingerprint → Creation of frequency matrices → Iterative scanning of the protein databases (SWISS-PROT / TrEMBL) with the frequency matrices until convergence → PRINTS DB
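
As a rough illustration of the frequency-matrix step, the sketch below counts residues per column of one aligned motif and scores a candidate segment by summing the counts of its residues. It covers a single scoring pass only; the iterative rescanning until convergence is not implemented, and the scoring rule and function names are assumptions made for this example.

```python
from collections import Counter

# The four motif rows shown on this slide.
MOTIF_ROWS = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs"]


def frequency_matrix(rows):
    """Return one Counter of residue counts per alignment column."""
    rows = [row.upper() for row in rows]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same length")
    return [Counter(column) for column in zip(*rows)]


def score(matrix, segment):
    """Sum, over positions, how often the segment's residue was observed."""
    if len(segment) != len(matrix):
        raise ValueError("segment length must match the motif length")
    # Counter returns 0 for residues never seen in a column.
    return sum(column[residue] for column, residue in zip(matrix, segment.upper()))


matrix = frequency_matrix(MOTIF_ROWS)
best = score(matrix, "CYEEGGIS")    # most frequent residue in every column
worst = score(matrix, "AAAAAAAA")   # no residue ever observed
print(best, worst)
```

In an iterative scheme, segments scoring above some threshold would be added to the alignment and the matrix rebuilt, repeating until no new sequences are picked up.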

  28. http://bioinf.man.ac.uk/dbbrowser/PRINTS/ • Database ID, no. of motifs and text search • Motif scanner (for searching a sequence or pattern against the PRINTS database)

  29. Page 1 for ‘alkaline phosphatase’ entry in PRINTS • Documentation, links & references

  30. Page 2 Fingerprint details Sequence Summary

  31. Page 3 Motif no. 1 Motif no. 2 “Raw” motif SWISSPROT -IDs Start and Interval between motifs in the fingerprint

  32. BLOCKS http://blocks.fhcrc.org/blocks/ • Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins • The BLOCKS database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins.

  33. Blocks Making • Blocks are produced by the automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust motif-finder to a set of related protein sequences.

  34. http://blocks.fhcrc.org/blocks/blocksdiag.jpg

  35. Sequence, no. of blocks and text Searches Blocks Maker http://blocks.fhcrc.org/blocks/

  36. Page 1 Summary Search methods using blocks

  37. Page 2 • BLOCK - 1 • Weak Blocks: Strength < 1100 • Strong Blocks: Strength >= 1100 • SWISSPROT ID • Start position of the block

  38. SMART http://smart.embl-heidelberg.de/ • Contains >500 domain families found in signaling, extracellular and chromatin-associated proteins. • Each domain is extensively annotated with phyletic distributions, functional class, tertiary structures and functionally important residues.

  39. ID & sequence Search Domain & GO search ID and text Search Alkaline Phosphatase

  40. Results – Alkaline phosphatase “Signatures” • PROSITE: represented as a single motif • PRINTS: represented as 5 motif regions • BLOCKS: represented as 6 block regions • SMART: represented as a single profile

  41. Composite Pattern Databases • MetaFam • InterPro • CDD (Conserved Domain Database) • iProClass

  42. Metafam & PANAL • Metafam - http://metafam.ahc.umn.edu/ • PANAL – Protein ANALysis tool page of Metafam http://mgd.ahc.umn.edu/panal/ • Protein family classification built with Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, Prosite, ProDom, SBASE, SYSTERS.

  43. PANAL

  44. InterPro • http://www.ebi.ac.uk/interpro • Built from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAM, SWISS-PROT and TrEMBL • Text- and sequence-based searches.

  45. http://www.ebi.ac.uk/interpro/
