Finding What you Need in Biological Databases Cédric Notredame
Databases: Where is my Needle ?
Our Scope Give you means to answer simple questions Databases are UNFRIENDLY INFORMATION DESKS Give you an idea of what is possible WHAT can you ask ? HOW can you ask it ?
Outline - An Overall view - Asking a biological question to a database - Turning a question into a query - Bibliographic Databases: Medline, OMIM - Gene Databases: GenBank, LocusLink, ENSEMBL - Protein Databases: SwissProt, InterPro, Prodom - SRS
Database: What is a Database ?
DataBase Entries
1 entry = 1 Sequence (SEQ), e.g. AGCTGTCGAGGGATAGGACA TATACATAAATTAATATAAT
1 entry = 1 File = Sequence (SEQ) + Documentation (DOC) = Flat File
Database = Collection of Flat Files
DataBase Entries: Flat Files
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2002
http://www.expasy.org/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000, sept 2001;DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000; sept 2001; DEA=oct-nov-dec 2000;
http://www.expasy.org/people/Marie-Claude.Blatter-Garin.html
//
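As an illustration, here is a minimal Python sketch of how such a flat file could be read and searched. The function names `parse_flat_file` and `search`, and the `URL` key used for the bare web address, are assumptions made for this example; they are not part of any real database software. Note that entries are split on a line containing only `//`, since `//` also occurs inside URLs.

```python
# Toy flat file in the layout of the slide: "Field: value" lines, "//" ends an entry.
FLAT_FILE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2002
http://www.expasy.org/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000, sept 2001;DEA=oct-nov-dec 2000;
//
"""


def parse_flat_file(text):
    """Return one dict per entry of the flat file."""
    entries, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "//":                      # entry terminator
            if current:
                entries.append(current)
            current = {}
        elif line.startswith("http"):         # bare URL line (key name is an assumption)
            current["URL"] = line
        elif ": " in line:                    # regular "Field: value" line
            field, value = line.split(": ", 1)
            current[field.strip()] = value.strip()
    if current:                               # file not closed by "//"
        entries.append(current)
    return entries


def search(entries, field, word):
    """Text-based query: entries whose field contains the word (case-insensitive)."""
    return [e for e in entries if word.lower() in e.get(field, "").lower()]


entries = parse_flat_file(FLAT_FILE)
embnet = search(entries, "Course", "EMBnet")
print([e["Last Name"] for e in embnet])
```

Real databases build an index instead of scanning every entry, but the principle of one structured record per entry is the same.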
DataBase: Relational Databases Relational database (« table file »):
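The same information can be sketched as tables with Python's built-in `sqlite3` module. The table and column names below (`teacher`, `course`, `accession`, ...) are assumptions chosen for this example, not the layout of the original slide; the rows come from the flat-file entries shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throw-away in-memory database

# One table per kind of object; the accession number links the two tables.
conn.execute(
    "CREATE TABLE teacher (accession INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.execute(
    "CREATE TABLE course (accession INTEGER REFERENCES teacher(accession), name TEXT, period TEXT)"
)

conn.executemany(
    "INSERT INTO teacher VALUES (?, ?, ?)",
    [(1, "Amos", "Bairoch"), (2, "Laurent", "Falquet"), (3, "Marie-Claude", "Blatter Garin")],
)
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [
        (1, "DEA", "oct-nov-dec 2002"),
        (2, "EMBnet", "sept 2000"),
        (2, "EMBnet", "sept 2001"),
        (2, "DEA", "oct-nov-dec 2000"),
        (3, "EMBnet", "sept 2000"),
        (3, "EMBnet", "sept 2001"),
        (3, "DEA", "oct-nov-dec 2000"),
    ],
)

# Query: who taught an EMBnet course?
rows = conn.execute(
    "SELECT DISTINCT t.last_name FROM teacher t "
    "JOIN course c ON c.accession = t.accession "
    "WHERE c.name = ? ORDER BY t.last_name",
    ("EMBnet",),
).fetchall()
embnet_teachers = [row[0] for row in rows]
print(embnet_teachers)
```

Compared with the flat file, each course is now its own row, so it can be queried without parsing a free-text "Course" field.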
To Summarize: What’s a database ? • Collection of Data that is: • Structured • Searchable (index) -> table of contents • Updated periodically (release) -> new edition • Cross-referenced (hyperlinks) -> links with other db • Collection of tools (software) necessary for: Searching - Updating - Releasing • Data storage management: flat files, relational databases…
Database: What’s on the Menu?
A large amount of information • More than 1000 different databases • Generally accessible through the web • EBI: http://www.ebi.ac.uk/ • NCBI: http://www.ncbi.nlm.nih.gov • Google: http://www.google.com • Variable size: <100Kb to >10Gb • DNA: > 10 Gb • Protein: 1 Gb • 3D structure: 5 Gb • Other: smaller • Update frequency: daily to annually
A Non Exhaustive List AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc. !!! There Exists A Specialized Database on Almost anything you can think of
What’s on the Menu: The Art of Eating Well • Always Use Fresh Data: The Latest Update of your DataBase • Make Sure The DataBase is Maintained: Many Databases are poorly maintained • Treat DataBases like Publications: Some Journals are Better than Others
Bio-Google: How Can I Search a Database ?
Searching Databases There are 2 ways to search databases: • Similarity searches on the sequence (SEQ), e.g. BLAST with AGCTGTCGAGGGATAGGACA TATACATAAATTAATATAAT • Text-based queries on the documentation (DOC), e.g. Medline, Entrez: search for « Smith AND dUTPase »
Searching Databases Each database is a little kingdom… • Has its own query system • Has its own information structure • The main databases are well documented and this documentation is available online • Most databases can be searched using SRS or Entrez
Databases: Asking the right Question When you search a Database you must have an idea of what your Needle-in-a-hay-stack looks like Databases ARE NOT meant for browsing
Databases: Asking the right Question Browsing a database is like using your phone book in place of a dating agency…
Databases: Asking the right Question • Finding Data: Database Search • Finding Questions: Data Mining
The Kind Of Questions We Can Ask: SEQUENCE Based • Any Known Domain in my Protein ??? -> InterPro • Any Protein like mine ??? -> SwissProt These ARE Predictions
The Kind Of Questions We Can Ask: TEXT Based • Who Worked on my Protein ??? -> Medline • Function of My Protein ??? -> SwissProt • Structure of My Protein ??? -> PDB These are NOT Predictions
Just like When You Google: Specific Queries give Precise Answers
Medline: Who worked on my Protein ?
What is in Medline ? • MEDLINE covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences • More than 4,000 biomedical journals and more than 10 million citations from 1966 to the present • Contains links to biological db and to some journals • Many papers not dealing with human are not in Medline • Before 1970, keeps only the first 10 authors !
Using Medline: Asking a question During the last Lab Meeting, I heard the word dUTPase. What can it be ? What has been published on this ?
Using Medline: Asking a question When you type several words (e.g. « Abergel dUTPase »), by default Medline assumes you mean: Abergel AND dUTPase
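The slides use the PubMed web page, but the same query can also be sent programmatically. The sketch below only builds the request URL for the NCBI E-utilities `esearch` service (parameters `db`, `term` and `retmax`); it does not contact the server. The helper names `and_query` and `esearch_url` are assumptions made for this example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_query(*words):
    """Make the implicit AND between search words explicit."""
    return " AND ".join(words)


def esearch_url(term, retmax=20):
    """Build (but do not send) an esearch request against PubMed."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


term = and_query("Abergel", "dUTPase")
url = esearch_url(term)
print(term)
print(url)
```

Opening the printed URL in a browser (or fetching it with `urllib.request.urlopen`) should return an XML list of matching PubMed identifiers.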
Using Medline: Asking a question I have found the reference I wanted. Now I want to save it so that I can use it later, for instance to import it into EndNote, my reference manager. Save Your Data in the Proper DataBase format
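To see why the saved format matters, here is a minimal sketch that reads a reference saved in the tagged MEDLINE text format (a tag padded to four characters, then `- `, with continuation lines indented by six spaces). The record below is a made-up placeholder, not a real citation, and `parse_medline` is a name chosen for this example.

```python
# Placeholder record in MEDLINE tagged format (NOT a real reference).
RECORD = """\
PMID- 00000000
TI  - Placeholder title that is long enough to wrap onto
      a second line.
AU  - Placeholder A
AU  - Placeholder B
"""


def parse_medline(text):
    """Return {tag: [values]} for one MEDLINE-format record."""
    record, tag = {}, None
    for line in text.splitlines():
        if line[4:6] == "- ":                        # new field: "TAG - value"
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag and line.startswith("      "):      # continuation of previous field
            record[tag][-1] += " " + line.strip()
    return record


rec = parse_medline(RECORD)
print(rec["PMID"], rec["AU"])
```

Because every field is tagged, a reference manager can import authors, title and identifier without guessing, which free-text copy and paste does not allow.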
Restricted fields ([AB], [AD]): Retrieving EXACTLY the Information that you need
Using Medline: Looking for a Review I Want to Find the LATEST REVIEW on the dUTPase. Use The Limit Option of Medline
Using Medline: Looking For a Review 1-Limits: Title OR Abstract, Language, Article type
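The Limits page is a web form, but comparable restrictions can be written directly in the query with PubMed search field tags such as `[tiab]` (title or abstract), `[la]` (language) and `[pt]` (publication type). The sketch below only assembles the query string; the helper names `tagged` and `build_query` are assumptions made for this example.

```python
def tagged(value, tag):
    """Restrict a search word to one field, quoting multi-word phrases."""
    if " " in value:
        value = f'"{value}"'
    return f"{value}[{tag}]"


def build_query(*clauses):
    """Combine restricted clauses with AND."""
    return " AND ".join(clauses)


# "A review on dUTPase, in English, with dUTPase in the title or abstract"
query = build_query(
    tagged("dUTPase", "tiab"),
    tagged("review", "pt"),
    tagged("english", "la"),
)
phrase = tagged("down syndrome", "ti")
print(query)
print(phrase)
```

The resulting string can be pasted into the PubMed search box as it is.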
Using Medline: A Few Tips • Quoted queries (e.g. «down syndrome» ) behave as a single word, and are great to improve the relevance of your search • Adding initials to names (e.g. “Abergel C” ), if you can, also reduces your output • Write down the PubMed Identifier (the number in the PMID field) of that interesting paper you just found. It could be very useful in your subsequent search for related items such as associated gene and protein sequences
Using Medline: A Few Tips • If a search gives you nothing, check for spelling mistakes, wrong field restrictions or wrong Limits settings: these may be the problem • Use abstracts to enlarge your vocabulary and look for synonyms: some papers on dUTPase might use dUTP pyrophosphatase instead! • Try the “related papers” button (on the extreme right of the PubMed output) from time to time, to enlarge a search that is not giving you enough references
Using Medline: A Few Tips • Storing your PDFs • Memory is cheap, access is sometimes strange… • Storing your favourite PDFs is a good idea • Which name on your disk? • THE MEDLINE ID NUMBER !!! • With a reference manager like EndNote
GenBank: What is the Sequence of my Gene ?
GenBank: an Overview EMBL, GenBank and DDBJ contain the same data: the three databases are synchronized every day.
GenBank: an Overview GenBank contains EVERY piece of DNA that has been sequenced and made publicly available. It contains GOOD and BAD data. There is a Historical Aspect in the GenBank data: Complex Genes are spread over many entries.
GenBank Entries Are Complex because Genes are complex Prokaryotic Example [Figure: Gene (Promoter, RBS, ATG ... STOP) -> mRNA (ORF) -> Protein]
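The ATG ... STOP part of the prokaryotic example can be sketched in a few lines. This toy function returns the first open reading frame on the forward strand, using the three stop codons of the standard genetic code; it ignores the promoter and RBS, and the input sequence is invented for the example. The name `first_orf` is an assumption, not a function from any library.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def first_orf(dna):
    """Return the first ATG ... STOP stretch read codon by codon, or None."""
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        # Walk in steps of 3 from the codon following the ATG.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[start:i + 3]
        start = dna.find("ATG", start + 1)   # no in-frame stop: try the next ATG
    return None


orf = first_orf("CCATGAAATTTTAGGG")  # toy sequence, not real data
print(orf)
```

A real gene finder would also scan the reverse strand and apply a minimum length, which is one reason why annotated entries carry so much documentation.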
GenBank Entries Are Complex because Genes are complex [Figure: Gene (Promoter, exons) -> mRNA (form1) -> Protein (form1), and mRNA (form2) -> Protein (form2)]
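The idea that one gene can give several mRNA forms can be illustrated with a toy example. Below, exons are written in upper case and introns in lower case; the sequence, the exon coordinates and the function name `splice` are all invented for this sketch.

```python
# Toy gene: EXON1 + intron + EXON2 + intron + EXON3 (not real data).
GENE = "ATGGCC" + "gtaag" + "TTTGGG" + "gtcag" + "CCCTAA"

# Exon coordinates as (start, end) pairs, 0-based, end excluded.
EXONS = [(0, 6), (11, 17), (22, 28)]


def splice(gene, exons):
    """Join the selected exons to obtain the mature mRNA (written as DNA)."""
    return "".join(gene[start:end] for start, end in exons)


form1 = splice(GENE, EXONS)                    # all three exons
form2 = splice(GENE, [EXONS[0], EXONS[2]])     # exon 2 skipped
print(form1)
print(form2)
```

A database entry therefore has to record the coordinates of each exon and of each form, not just one continuous sequence.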