1. Bioinformatic Databases Norie de la Cruz, PhD
2. Take home The internet is a powerful resource containing a large volume of data and tools to manipulate them.
Unfortunately, connecting data across these resources can sometimes be tricky.
3. Overview Whirlwind tour of Web databases
The Rat Genome Database: data, tools, and operations
4. Bioinformatic databases on the WWW Loose definition of database here
Vary widely in terms of offerings, data, tools and specialization
Vary widely in terms of data collection methodologies
5. Some classifications per NAR Major sequence repositories
Gene Expression
Comparative genomics
Gene Identification and Structure
Genetic and physical maps
Genomic Databases
Intermolecular interactions
Metabolic Pathways and Cellular Regulation
Mutation Databases
Pathology
6. Some classifications per NAR Protein Databases
Protein sequence Motifs
Proteome Resources
Retrieval systems
RNA Sequences
Structure
Transgenics
Varied Biomedical Content
7. Major Sequence Repositories GenBank
RefSeq
DDBJ
Ensembl
UniGene
Collection of sequence data
Genomic
Markers
Genes
Proteins
Some provide tools to expedite access
Blast Search
Alignment tools
Translation tools etc.
Varying degrees of quality control
Machine data upload
Human curation and QC
8. Major Sequence Repositories: GenBank All known nucleotide and protein sequences
Provides submission system for various authors
Little QC
9. Major Sequence Repositories: RefSeq Non redundant collection of naturally occurring biological molecules
Human QC
Comprehensive, integrated set of sequences for major research organisms
Provides a stable reference for further characterization of sequences including comparative analyses, mutations, expression, etc.
10. Major Sequence Repositories: UniGene Attempts to cluster GenBank sequences into gene-oriented clusters
Each cluster contains sequences that represent one gene
Provides a stable reference for further characterization of sequences including comparative analyses, mutations, expression, etc.
11. Major Sequence Repositories: DDBJ (DNA Data Bank of Japan) Japanese equivalent to NCBI efforts
Attempting to gather all known nucleotide and protein sequences
Part of the International Nucleotide Sequence Database Collaboration
12. Major Sequence Repositories: EMBL Nucleotide Sequence Database European equivalent to NCBI efforts
Attempting to gather all known nucleotide and protein sequences
Part of the International Nucleotide Sequence Database Collaboration
13. Major Sequence Repositories: UCSC Genome Browser Visual representation of genome and sequence data
Run by University of California at Santa Cruz
14. Comparative Genomics
15. Comparative Genomics: Microbial Genome Database for Comparative Analysis
16. Comparative Genomics: Some specialized sites
17. Comparative Genomics: Clusters of Orthologous Groups Phylogenetic classification of the proteins encoded in complete genomes
Proteins grouped according to sequence by a program called COGNITOR
Must be represented in at least three species in a group of 43 species representing phylogenetic lineages
Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
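The three-lineage rule on this slide can be sketched as a simple filter. This only illustrates the membership criterion, not the COGNITOR program itself; the function name, protein identifiers, and lineage labels are hypothetical.

```python
MIN_LINEAGES = 3  # threshold stated on the slide


def filter_candidate_cogs(clusters, lineage_of):
    """Keep clusters whose members come from at least MIN_LINEAGES lineages.

    clusters   -- dict mapping a cluster id to a list of protein ids
    lineage_of -- dict mapping a protein id to its phylogenetic lineage
    Returns a dict mapping each kept cluster id to its sorted lineages.
    """
    kept = {}
    for cluster_id, proteins in clusters.items():
        # Paralogs from the same lineage collapse into one set entry.
        lineages = {lineage_of[protein] for protein in proteins}
        if len(lineages) >= MIN_LINEAGES:
            kept[cluster_id] = sorted(lineages)
    return kept


# Hypothetical example data (not taken from the COG database).
lineage_of = {
    "ecoA": "Proteobacteria",
    "ecoB": "Proteobacteria",  # paralog of ecoA
    "bsuA": "Firmicutes",
    "mjaA": "Euryarchaeota",
    "ecoC": "Proteobacteria",
    "bsuC": "Firmicutes",
}
clusters = {
    "cluster1": ["ecoA", "ecoB", "bsuA", "mjaA"],  # three lineages: kept
    "cluster2": ["ecoC", "bsuC"],                  # two lineages: dropped
}

candidate_cogs = filter_candidate_cogs(clusters, lineage_of)
```

Counting lineages rather than proteins is what lets a group of paralogs from one genome count only once toward the threshold.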
18. Gene Expression
19. Gene Expression: Array Express
20. Gene Expression: Edinburgh Mouse Atlas Project
21. Gene Expression: HugeIndex (Human Gene Expression Index)
22. Gene Expression: Other specialized sites
23. Gene Identification and Structure
24. Gene Identification and Structure: SNP Consortium database
25. Gene Identification and Structure: Alternative Splicing Annotation Project (ASAP)
26. Gene Identification and Structure: PromEC
27. Gene Identification and Structure: Some other specialized sites
28. Genetic and physical maps Repository for marker information
Data on gene locations within the genome
Map of cloned sequences
Tools to integrate information across genomes
29. Genetic and Physical Maps: HuGeMap
30. Genetic and Physical Maps: GeneMap99
31. Genomic Databases Data repositories for research results on various model organisms
Rat
Human
Fruit fly
Worm
Arabidopsis
Some other rodent
Linking information across databases
Tools to organize and integrate information
32. Genomic Databases: The Rat Genome Database
33. Genomic Databases: FlyBase
34. Genomic Databases: EcoGene
35. Genomic Databases: Some other examples
36. Mutation Databases Allele distributions in populations
Inherited genetic diseases
Mutations in proteins implicated in disease development
37. Mutation Databases: ALFRED designed to make allele frequency data on anthropologically defined human population samples readily available to the scientific community
link these polymorphism data to the molecular genetics-human genome databases
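For context, an allele frequency of the kind ALFRED reports can be computed from genotype counts in a population sample. This is a minimal sketch; the function name and the genotype counts are made up for illustration and are not taken from ALFRED.

```python
def allele_frequencies(genotype_counts):
    """Compute allele frequencies from diploid genotype counts.

    genotype_counts maps a genotype (allele1, allele2) to the number of
    individuals carrying it. Each individual contributes two alleles.
    """
    allele_counts = {}
    for (allele1, allele2), n_individuals in genotype_counts.items():
        allele_counts[allele1] = allele_counts.get(allele1, 0) + n_individuals
        allele_counts[allele2] = allele_counts.get(allele2, 0) + n_individuals
    total_alleles = sum(allele_counts.values())
    return {allele: count / total_alleles for allele, count in allele_counts.items()}


# Hypothetical sample of 100 individuals typed at one biallelic site.
sample = {("A", "A"): 30, ("A", "G"): 50, ("G", "G"): 20}
frequencies = allele_frequencies(sample)
# A: (2*30 + 50) / 200, G: (50 + 2*20) / 200
```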
38. Mutation Databases: Human Gene Mutation Database an attempt to collate known (published) gene lesions responsible for human inherited disease
provides information of practical diagnostic importance to
researchers and diagnosticians in human molecular genetics
physicians interested in a particular inherited condition in a given patient or family
genetic counsellors.
39. Mutation Databases: Online Mendelian Inheritance in Man (OMIM) catalog of human genes and genetic disorders
contains textual information, pictures, and reference information
40. Mutation Databases: Other examples Atlas of Genetics and Cytogenetics in Oncology and Haematology
Database of Germline p53 Mutations
SV40 Large T-Antigen Mutant Database
KinMutBase: disease-causing kinase mutations
41. Protein Databases Protein sequences collection
Clustering of protein data into families
Specialized protein sites
Organism
Function
Large variety of enzymes
42. Protein Databases: InterPro a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences
amalgamates the major protein signature databases; their data have been manually integrated and curated and are available in InterPro
PROSITE
Pfam
PRINTS
ProDom
SMART
TIGRFAMs
43. Protein Databases: ProtoNet provides a global classification of the proteins from the SWISS-PROT database into hierarchical clusters
clustering is based on an all-against-all BLAST similarity search
44. Protein Databases: iProClass an integrated resource that provides comprehensive family relationships and structural/functional features of proteins
currently consists of non-redundant PIR and SwissProt/TrEMBL proteins
36,200 PIR superfamilies
145,300 families
5720 domains
1300 motifs
280 post-translational modification sites
links to over 50 biological databases.
45. Protein Databases: Other Examples Nuclear Protein Database: proteins localized in the nucleus
PLANT-PIs: plant protease inhibitors
SWISS-PROT/TrEMBL: curated protein sequences
SENTRA: sensory signal transduction proteins
Ribonuclease P Database
46. Protein Sequence Motifs Alignment of protein sequences
Organization of proteins into families
47. Protein Sequence Motifs: BLOCKS multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins
Tools:
Block Searcher -- compare a protein or DNA sequence to a database of protein blocks
Get Blocks -- retrieve blocks
Block Maker -- create new blocks
48. Protein Sequence Motifs: Pfam a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
For each family in Pfam you can:
Look at multiple alignments
View protein domain architectures
Examine species distribution
Follow links to other databases
View known protein structures
49. Protein Sequence Motifs: PROSITE database of protein families and domains. It consists of biologically characterized sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs
currently contains patterns and profiles specific for more than a thousand protein families or domains.
each of these signatures comes with documentation providing background information on the structure and function of these proteins
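A PROSITE-style pattern maps fairly directly onto a regular expression. The sketch below handles the common pattern elements: x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) repeats, and the < and > terminus anchors. The function name and the test sequence are hypothetical, and the example pattern is modelled on the C2H2 zinc finger signature; consult the PROSITE entry itself for the authoritative form. Profiles are not covered here.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        residue, count = match.group(1), match.group(2)
        if residue == "x":            # any amino acid
            residue = "."
        elif residue.startswith("{"): # any amino acid except those listed
            residue = "[^" + residue[1:-1] + "]"
        # Single letters and [..] sets are already valid regex syntax.
        repeat = "{" + count + "}" if count else ""
        parts.append(prefix + residue + repeat + suffix)
    return "".join(parts)


# Illustrative pattern in PROSITE syntax (assumption: modelled on C2H2 zinc fingers).
zinc_finger = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
zinc_finger_regex = prosite_to_regex(zinc_finger)

# Hypothetical sequence constructed to contain one match.
sequence = "AA" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "AA"
hit = re.search(zinc_finger_regex, sequence)
```

Scanning a new sequence with `re.search` against each translated pattern is the regular-expression analogue of asking which known family, if any, the sequence belongs to.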
50. Protein Sequence Motifs: Other Examples ASC (Active Sequence Collection): biologically active oligopeptides
CluSTr: automatic classification of SWISS-PROT and TrEMBL proteins
TMPDB: experimentally characterized transmembrane topology
O-GLYCBASE: O- and C-linked glycosylation sites in proteins
51. RNA Sequences Repository of RNA sequences
RNA structure data
RNA metabolism information
Specialized sites by organism, function, etc.
52. RNA Sequences: HyPaLib contains annotated structural elements characteristic for certain classes of structural and/or functional RNAs
developing software tools that allow a user to search sequence databases for any pattern in HyPaLib
53. RNA Sequences: Rfam a collection of multiple sequence alignments and covariance models representing non-coding RNA families
allows the user to search a query sequence against a library of covariance models, and view multiple sequence alignments and family annotation
54. RNA Sequences: tRNA sequences compilation of tRNA sequences and sequences of tRNA genes
55. RNA Sequences: Other Examples 16S and 23S Ribosomal RNA Mutation Database
ACTIVITY: functional DNA/RNA site activity
PLANTncRNAs: plant non-coding RNAs
RNA Modification Database: naturally modified nucleosides in RNA
56. Structure Information on protein structure derived from physical data (crystallography, NMR)
Classification of proteins according to tertiary structures
Specialized sites for specific proteins
57. Structure: ASTRAL provides databases and tools useful for analyzing protein structures and their sequences
Partially derived from the SCOP database (Structural Classification of Proteins)
58. Structure: SCOP Comprehensive ordering of proteins of known structure based on their evolutionary and structural relationships
Protein domains are grouped into species and hierarchically classified into families, superfamilies, folds, and classes
59. Structure: PDB Structure data determined by X-ray crystallography and NMR
60. Structure: Other Examples CADB: conformation angles of protein structures, with associated crystallographic data
Database of Macromolecular Movements
DSDBase: disulfide bonds in proteins
PSSH: alignment between sequences and tertiary structures
SUPERFAMILY: assignments of proteins to structural superfamilies
61. Other Databases Intermolecular Interactions
Metabolic Pathways and Cellular Regulation
Pathology
Proteome Resources
Retrieval Systems and Database Structure
Transgenics
Varied Medical Content
62. Other Databases: Intermolecular Interactions BIND: molecular interactions, complexes and pathways
DIP (Database of Interacting Proteins): experimentally determined protein-protein interactions
KDBI: kinetic data on biomolecular interactions
63. Other Databases: Metabolic Pathways and Cellular Regulation KEGG: Kyoto Encyclopedia of Genes and Genomes
MetaCyc: metabolic pathways and enzymes from various organisms
PathDB
EcoCyc: E. coli K-12 genome and pathway data
PRODORIC: gene regulation and regulatory networks in prokaryotes
64. Other Databases: Pathology BayGenomics: cardiovascular and pulmonary disease
INFEVERS: hereditary inflammatory disorders
GOLD.db: lipid-associated disorders
Mouse Tumor Biology Database
65. Other Databases: Proteome Resources GELBANK: 2D gel data repository
REBASE: restriction enzymes and associated methylases
SWISS-2DPAGE: annotated two-dimensional gel electrophoresis database
66. Other Databases: Retrieval Systems and Database Structure TESS: Transcription Element Search System
Virgil: database interconnectivity
67. Other Databases: Transgenics Cre Transgenic database: Cre transgenic mouse lines
Transgenic/targeted mutation database: information on transgenic animals and targeted mutations
68. Other Databases: Varied Medical Content Tree of Life: phylogeny and biodiversity
PubMed: biomedical literature
NCBI Taxonomy Browser: organisms with at least one sequence deposited in the database
PharmGKB: pharmacogenomics and variations in drug response based on human variation
69. The Rat Genome Database Data
Tools
Operations
70. The Rat Genome Database: data Genes
Maps and Markers
QTLs
Strains
Homologs
71. The Rat Genome Database: tools VCMap
Mapserver
Meta Gene
Genome Scanner
Ontology Browser
72. The Rat Genome Database: operations Curation
Data QC and Loading
Data development
Tool development
73. The Rat Genome Database Operations: Curation Information gathering from peer-reviewed work
Coordination with other model organism databases
Data quality policy development and assessment
74. The Rat Genome Database Operations: data development Development of data integration strategies
Development of ontology annotation protocols
Some development of curation policies
Outreach
Ontology development
75. The Rat Genome Database Operations: tool development Ontology system development
Systems analysis
Tool integration
Tool building
Software system migration