1 / 65

Exploring 3D Molecular Structures Using NCBI Tools

Exploring 3D Molecular Structures Using NCBI Tools ... By Structure Similarity: VAST. By Conserved Function: RPS-BLAST and CDD. Finding a Structural Template for a Query Protein ...

PamelaLan
Download Presentation

Exploring 3D Molecular Structures Using NCBI Tools

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    Slide 1:Exploring 3D Molecular Structures Using NCBI Tools

    A Field Guide June 17, 2004

    Slide 2:NCBI Structure Resources

    Overview of Structural Informatics at NCBI How 3D Macromolecular Structures are Determined Indexing Structural Data at NCBI Finding Homologous Structures By Sequence Similarity: BLAST By Structure Similarity: VAST By Conserved Function: RPS-BLAST and CDD Finding a Structural Template for a Query Protein

    Slide 3:The National Center for Biotechnology Information

    Created as a part of NLM in 1988 Establish public databases Perform research in computational biology Develop software tools for sequence analysis Disseminate biomedical information

    Slide 4:Structural Informatics

    Chemical Formula 3D Conformation Function ARKLMPQSCSW… Modifications Ions Ligands Binding Sites Catalytic Residues Kinetics Thermodynamics Substrates Intermediates Structure Dynamics Active States Folding

    Slide 5:Structural Informatics

    Chemical Formula 3D Conformation Function GenPept NCBI RefSeq SWISS-PROT PIR PRF Multiple Sequence Alignments: Pfam, SMART, COGs, CDD PDB

    Slide 6:Structural Informatics at NCBI

    Chemical Formula 3D Conformation Function GenPept NCBI RefSeq SWISS-PROT PIR PRF Multiple Sequence Alignments: Pfam, SMART, COGs, CDD PDB 4,818,495 25,003 11,382 103,820

    Slide 7:The Entrez System

    Entrez Nucleotide PubMed Protein Taxonomy Structure Domains 3D Domains Books Journals PMC OMIM UniSTS PopSet Genome SNP UniGene Gene GEO GEO Datasets MeSH

    Slide 9:More About Resolution

    1EJG: Crambin at 0.54 Å 2TMA: Tropomyosin at 15 Å protons!! only alpha carbons!! 3 Å “Temperature”

    Slide 11:PDB

    Slide 12:PDB File: Header

    HEADER ISOMERASE/DNA 01-MAR-00 1EJ9 TITLE CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX COMPND MOL_ID: 1; COMPND 2 MOLECULE: DNA TOPOISOMERASE I; COMPND 3 CHAIN: A; COMPND 4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES 203-765; COMPND 5 EC: 5.99.1.2; COMPND 6 ENGINEERED: YES; COMPND 7 MUTATION: YES; COMPND 8 MOL_ID: 2; COMPND 9 MOLECULE: DNA (5'- COMPND 10 D(*C*AP*AP*AP*AP*AP*GP*AP*CP*TP*CP*AP*GP*AP*AP*AP*AP*AP*TP* COMPND 11 TP*TP*TP*T)-3'); COMPND 12 CHAIN: C; COMPND 13 ENGINEERED: YES; COMPND 14 MOL_ID: 3; COMPND 15 MOLECULE: DNA (5'- COMPND 16 D(*C*AP*AP*AP*AP*AP*TP*TP*TP*TP*TP*CP*TP*GP*AP*GP*TP*CP*TP* COMPND 17 TP*TP*TP*T)-3'); COMPND 18 CHAIN: D; COMPND 19 ENGINEERED: YES SOURCE MOL_ID: 1; SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS; SOURCE 3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM; SOURCE 4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS; SOURCE 5 MOL_ID: 2; SOURCE 6 SYNTHETIC: YES; SOURCE 7 MOL_ID: 3; SOURCE 8 SYNTHETIC: YES KEYWDS PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN REMARK 1 REMARK 2 REMARK 2 RESOLUTION. 2.60 ANGSTROMS. REMARK 3 REMARK 3 REFINEMENT. REMARK 3 PROGRAM : X-PLOR 3.1 REMARK 3 AUTHORS : BRUNGER … REMARK 280 REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20 REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT REMARK 290 ...

    Slide 14:From PDB to Entrez

    Structure 3D Domains Protein Domains

    Slide 15:From Coordinates to Models

    1EJ9: Human topoisomerase I

    Slide 16:Building the Structure Summary

    Taxonomy Pubmed Protein 3D Domains Domains Nucleotide

    Slide 17:Indexing into MMDB

    Structure Import only experimentally determined structures Convert to ASN.1 Verify sequences inter-residue-bonds { { atom-id-1 { molecule-id 1 , residue-id 1 , atom-id 1 } , atom-id-2 { molecule-id 1 , residue-id 2 , atom-id 9 } } , id 1 , name "helix 1" , type helix , location subgraph residues interval { { molecule-id 1 , from 49 , to 61 } } } , Add secondary structure Add chemical bonds Create “backbone” model (Ca, P only) Create single-conformer model

    Slide 18:Structure Indexing

    Entrez MMDB-ID MMDB entry date EC number Organism PDB Accession Release date Class Source Description Comment Ligands PDB code PDB name PDB description Literature Article title Author Journal Publication date Experimental Method Resolution Counters Ligand types Modified amino acids Modified nucleotides Modified ribonucleotides Protein chains DNA chains RNA chains topoisomerase AND 2[dnachaincount] AND human[organism]

    Slide 19:Creating Sequence Records

    Protein Nucleotide Nucleotide 1EJ9A 1EJ9C 1EJ9D One record per chain

    Slide 20:Building the Structure Summary

    Slide 21:Building the Structure Summary

    Slide 22:Annotating Secondary Structure

    1EJ9: Human topoisomerase I a-Helices ß-strands coils/loops

    Slide 23:Creating 3D Domains

    3D Domain 0: 1EJ9A0 = entire polypeptide

    Slide 24:Creating 3D Domains

    1EJ9A1 1EJ9A3 1EJ9A2 1EJ9A4 1EJ9A5 < 3 Secondary Structure Elements

    Slide 25:Building the Structure Summary

    Slide 26:Building the Structure Summary

    Slide 27:3D Domain Indexing

    Entrez SDI MMDB-ID Accession MMDB entry date Organism Domain number Cumulative number PDB Accession Release date Class Source Description Comment Literature Article title Author Publication date Counters Modified amino acids a-Helices ß-Strands Residues Molecular weight REMEMBER: 3D Domain 0 is the entire polypeptide chain! 4[helixcount] AND 0[strandcount] AND 0[domainno] AND viruses[organism] Find all viral four helix bundles

    Slide 28:Conserved Domains

    Weakly conserved serine Active site serine Sequences Aligned by Function

    Slide 29:Linking Sequence to Function

    The PSSM Position Specific Score Matrix A R N D C Q E G H I L K M F P S T W Y V 206 D 0 -2 0 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1 207 G -2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2 -1 0 -6 -5 208 V -1 1 -3 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 0 -2 -6 -4 -2 209 I -3 3 -3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1 -4 0 210 S -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 1 -3 -7 -5 -6 211 S 4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3 212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 -4 -4 -5 0 -4 213 N -2 0 2 -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3 214 G -2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6 215 D -5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7 216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5 217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7 218 G -3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7 219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6 220 L -4 -6 -7 -7 -5 -5 -6 -7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0 221 N -1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2 -1 -6 -1 6 222 C 0 -4 -5 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 -1 0 -5 0 0 223 Q 0 1 4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3 -4 224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 0 -2 -2 -2 -3 Serine scored differently in these two positions Active site nucleophile

    Pfam-A seeds: HMM based models representing a wide variety of functional domains derived from SWISS-PROT COG SMART CD

    Slide 30:Entrez Domains (CDD v2.00)

    HMM based models originally concentrating on eukaryotic signaling domains, now expanding BLAST based alignments derived from complete proteomes of prokaryotes NCBI curated domains based on sequence and structural alignments Pfam pfam01234 smart00123 cd01234 COG0123 NCBI NCBI Sanger EMBL Single Domains Protein Families A database of Position Specific Score Matrices (PSSMs)

    Slide 31:CD-Search Output

    CD SMART Pfam COG Click on a colored bar to align your sequence to the CD

    Slide 32:CD Summary

    Alignment view controls Cn3D launch PSSM created Aligned query

    Slide 33:Building the Structure Summary

    Slide 34:Building the Structure Summary

    Cn3D

    Slide 36:Links to CDs

    CD-Search / RPS-BLAST 1EJ9A Query: protein sequence Database: PSSMs pre-computed in Entrez Protein Enter accession, GI, or FASTA sequence into RPS-BLAST

    Slide 37:Finding Homologous Structures

    By sequence similarity: BLAST By structural similarity: VAST By conserved function: CD-Search Entrez Protein

    Slide 38:BLAST: Sequence Neighbors

    BLAST Related Structures Displays a graphical and text alignment between a query sequence and a similar sequence with structure Accessed from Blink Any protein BLAST search ? GVKWKYLEHKGPVFAPPYDPLP GIKWKFLEHKGPVFAPPYEPLP

    Slide 39:BLink Neighbors

    EAA05377: ENSANGP00000011118 from A. gambiae Related Structures

    Slide 40:Related Structures from BLASTp

    Slide 41:Related Structures

    Cn3D

    Slide 42:VAST: Searching by Structure

    Why search for similar structures? To find homologs that sequence searches cannot: distant protein homologs often conserve structure more strongly than sequence To explore protein evolution: similar protein folds can be used to support different functions To identify conserved core elements of a protein fold that can be used to model related proteins of unknown structure

    Slide 43:VAST: Structure Neighbors

    Vector Alignment Search Tool For each protein chain, locate SSEs (secondary structure elements), and represent them as individual vectors. 1 2 3 4 5 6 Human IL-4 structure: 1RCBstructure: 1RCB

    Slide 45:VAST: Calculate (rik, zik)

    3 1 z For both the query and target structures, For each SSE k, set the origin at the midpoint of k. Then calculate rik and zik for the endpoints of SSEs i ? k. Vector position relative to the xy plane xy z13 r13 structure: 1RCBstructure: 1RCB

    Slide 47:VAST: Refinement

    Aligned residues are red Alignment extended to the end of this strand Ca atoms are added to the aligned SSEs Alignments are allowed to extend beyond SSE boundaries All atoms are added to the models, and the detailed backbone and sidechain positions are refined

    Slide 48:VAST: Alignment of Sequence

    Aligned blocks represent structural core elements Aligned blocks have no internal gaps Aligned residues occupy the same position in space Aligned residues are shown in CAPITAL letters Helix 1 Helix 2 Helix 3 Helix 4

    Slide 49:VAST: Scoring

    p = d P(s > s0, n) c(n, P1, P2) P(s > s0, n) Probability of observing an alignment of n SSEs with a score greater than s0 by chance. c(n, P1, P2) Search space: Number of possible alignments of n SSEs between vector sets P1 and P2. d Number of structures searched (set to 500) The probability that the VAST alignment occurred by chance.

    Slide 51:Query by Chain vs 3D Domain

    Query by whole chain Query by domain 5 Not found using whole chain query! c(n, P1, P2) is smaller for a 3D domain!

    Slide 52:VAST: Multiple Alignments

    Cn3D

    Slide 53:nr-PDB Sets

    Choose criteria for inclusion in a set Non-redundant set of sequence similar clusters VAST reports one representative from each cluster

    Slide 54:Submitting a PDB File to VAST

    Pick the correct file format Remove all records except ATOM This is the best way to convert PDB into MMDB format!

    Slide 55:Blocks in CD Alignments

    Alignment view controls Aligned query Cn3D launch Block 1 Block 2 Block 3 Consensus sequence created PSSM created

    Slide 56:Curating CD Alignments

    smart00235 VAST cd00203 Cn3D Cn3D

    Slide 57:Curated CD Summary

    List of annotated features Customized view of the selected feature in Cn3D Residues comprising the selected feature Cn3D

    Slide 58:

    CD-Curation: Effect on model alignment accuracy A. Marchler-Bauer

    Slide 59:CDART

    Only available for single domain records: cd, pfam, smart

    Slide 60:Finding a Structural Template

    Overall Strategy: For a query protein sequence, construct a block alignment representing conserved core SSEs of the most sequence similar structures to the query, and then align the query sequence to this template. Construct the block alignment Curated CD: Locate using CD-Search and use the sequences most similar to the query VAST: Find the most sequence similar structure and find its VAST neighbors Align the query to the template: Use Cn3D PSI-BLAST: Aligns sequence using PSSM of current alignment BLOCKER: Aligns sequence to an existing block alignment: use where sequence similarity is high Threader: Aligns sequence to a structure and a block alignment: use where sequence similarity is low

    Slide 61:BLOCKER: The Block Aligner

    PSSM Creates alignments that match the existing block structure Matches are scored from a PSSM generated from the block alignment An entire block must be matched with no internal gaps There are no penalties for gaps between blocks up to a set gap length Can perform both local and global alignments Generally used after BLAST or PSI-BLAST The Block Aligner tests the existing block structure

    Slide 63:The NCBI Threader

    L R L S L E Q L Q V I A I A N Input Structure Block alignment Sequence Attempts to find matches based on chemical contacts, mainly buried hydrophobic interactions Useful on blocks for which sequence alignment methods fail Should be iterated with varying block structures Cn3D

    Slide 64:The Future

    More curated CDs: they keep coming… Pre-computed Related Structures for all sequences in Entrez Protein CD “children”: subfamilies of large CD records based on sequence and structure similarity Improved mapping of SNP data onto 3D structures Further linking of structural and genomic biology

    Slide 65:What comes next…

    Workshop I Working with Structures Workshop II Working with Alignments All exercises and other resources will remain on the course web pages info@ncbi.nlm.nih.gov NCBI Handbook, Ch. 3

More Related