
MBG305 Applied Bioinformatics

Week 2 (05.10.2010), Jens Allmer


Presentation Transcript


  1. MBG305 Applied Bioinformatics Week 2 (05.10.2010) Jens Allmer

  2. Databases • Bioinformatics needs data • Where is this data? • Is there any organization? • How should I cite data?

  3. Where is the data? • Many targeted resources exist • miRBase http://www.mirbase.org/ • Contains microRNAs • PDB http://www.rcsb.org/pdb/home/home.do • Contains protein structures • PeptideAtlas http://www.peptideatlas.org/ • Contains mass spectrometric measurements • KEGG http://www.genome.jp/kegg/ • Contains regulatory and biochemical pathways • PubMed http://www.ncbi.nlm.nih.gov/pubmed/ • Contains indexed journals • ...

  4. Where is the data? • Sequence Databases • EBI (www.ebi.ac.uk/) • Ensembl (www.ensembl.org) • GenBank (www.ncbi.nlm.nih.gov/Genbank) • SwissProt (www.uniprot.org) • ... • Bookmark these pages • Are your bookmarks available wherever you are? • Try: http://www.delicious.com • Or bring your own browser • http://portableapps.com/apps/internet/google_chrome_portable

  5. How is Data Organized? • Flat Text Files • FASTA Format • Structured Text Files • ASN.1 and XML-based formats • Databases • Structure • Index • Users • Details in MBG403

  6. Flat Text Files • FASTA Format (Pearson and Lipman, 1988) • Allows multiple sequences per file • Requires identifiers for each sequence • Some special characters and formatting rules • > introduces the definition line (sequence identifier) • 80 characters per sequence line • Only supported characters (IUPAC) • http://www.bioinformatics.org/sms/iupac.html • Example:
     >gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 Opium poppy root cDNA library Papaver somniferum cDNA, mRNA sequence
     GAACGAAGGGAGAGAACGAAAAAGAAGGAGAGAATGTGTGAGGGTCGGTTTCATACGTTTGGTGTTAACTGAGTTATGCA
     ATCTGCAAAAGAGGAGAGATTAGATAGAAGATGAGAAGAATTATGACAACCTAGTCAAGTATGGATCATTGCTCTAATTC
     ...
     >gi|189457344|gb|FG613049.1|FG613049 stem_S093_F08.SEQ Opium poppy stem cDNA library Papaver somniferum cDNA, mRNA sequence
     CTTTCTCTAGGTTTCTCCGCAATTTTCAAGTGGACGAATCCAAATAGAATTTGCCAAGCTTTTCTTGATTTATCCTACTC
     GGTGTAAAAATGGCGACAATAGGAGCTTCCTCAGCTTGCTGCATGATCAGAAGCACACCCCAGAACAGTGGTAAAATTGC
     ...
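The rules on this slide (a ">" definition line followed by one or more sequence lines, several records per file) can be sketched as a small reader. This is a minimal illustration, not a course tool: the function name `parse_fasta` and the toy records are assumptions made for the example.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (definition_line, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without separators
        else:
            raise ValueError("sequence data found before the first '>' line")
    if header is not None:
        yield header, "".join(chunks)


# Toy input (hypothetical records, much shorter than real entries).
example = """>seq1 first toy record
GAACGAAGGG
AGAGAACG
>seq2 second toy record
CTTTCTCTAGG
"""

records = list(parse_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

Joining the sequence lines means the 80-character line wrapping is treated purely as layout and does not affect the stored sequence.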

  7. FASTA Tools • FASTA Viewer and DNA Translator • http://www.biolnk.com/ • Some FASTA Tools • http://bioinformatics.iyte.edu.tr/index.php?n=Softwares.FastaTools • FASTA Validator/ Converter to CSV file • http://mbg305.allmer.de/tools/
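What a FASTA validator and CSV converter does can be sketched in a few lines. The sketch below is not the tool linked on the slide; the function names, the column names, and the decision to accept "-" as a gap character are assumptions made for illustration. It checks nucleotide sequences against the IUPAC codes and writes one CSV row per record.

```python
import csv
import io

# IUPAC nucleotide codes; accepting '-' as a gap is an assumption of this sketch.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN-")


def _write_record(writer, header, chunks):
    """Validate one record and write it as a CSV row."""
    sequence = "".join(chunks).upper()
    invalid = sorted(set(sequence) - IUPAC_NUCLEOTIDES)
    if invalid:
        raise ValueError(f"{header!r}: unsupported characters {invalid}")
    # The identifier is everything up to the first space of the definition line.
    identifier, _, description = header.partition(" ")
    writer.writerow([identifier, description, sequence])


def fasta_to_csv(text: str) -> str:
    """Convert FASTA text to CSV with identifier, description, sequence columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["identifier", "description", "sequence"])
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                _write_record(writer, header, chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        _write_record(writer, header, chunks)
    return out.getvalue()


print(fasta_to_csv(">s1 toy\nACGT\nNN\n"))
```

A protein validator would need a different alphabet; only the character set changes, not the structure.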

  8. FASTA Usage • Most programs that accept sequence input accept FASTA format • BLAST (partially) • FastA (obviously) • Multiple Sequence Alignment Tools • Most • MS-based Database Search Engines • Some (only database, not queries) • Most Online Forms

  9. FASTA Definition Line Formats • http://en.wikipedia.org/wiki/Fasta_format • GenBank: gi|gi-number|gb|accession|locus • EMBL Data Library: gi|gi-number|emb|accession|locus • DDBJ, DNA Database of Japan: gi|gi-number|dbj|accession|locus • NBRF PIR: pir||entry • Protein Research Foundation: prf||name • SWISS-PROT: sp|accession|name • Brookhaven Protein Data Bank (1): pdb|entry|chain • Brookhaven Protein Data Bank (2): entry:chain|PDBID|CHAIN|SEQUENCE • Patents: pat|country|number • GenInfo Backbone Id: bbs|number • General database identifier: gnl|database|identifier • NCBI Reference Sequence: ref|accession|locus • Local Sequence identifier: lcl|identifier
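Because the fields of a definition line are separated by "|", they can be pulled apart with a simple split. The sketch below handles only two of the layouts listed above (the gi|...|db|accession|locus form and the sp|accession|name form); the function name `parse_defline` and the returned dictionary keys are assumptions made for the example.

```python
def parse_defline(defline: str) -> dict:
    """Split a FASTA definition line into named fields (partial sketch)."""
    # Everything up to the first space is the identifier; the rest is free text.
    identifier, _, description = defline.lstrip(">").partition(" ")
    parts = identifier.split("|")
    result = {"description": description}
    if parts[0] == "gi" and len(parts) >= 5:
        # gi|gi-number|db|accession|locus (GenBank, EMBL, DDBJ)
        result.update(gi=parts[1], database=parts[2],
                      accession=parts[3], locus=parts[4])
    elif parts[0] == "sp" and len(parts) >= 3:
        # sp|accession|name (SWISS-PROT)
        result.update(database="sp", accession=parts[1], name=parts[2])
    else:
        # Layouts not covered by this sketch are returned unparsed.
        result["raw_identifier"] = identifier
    return result


fields = parse_defline(
    ">gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 "
    "Opium poppy root cDNA library"
)
print(fields["database"], fields["accession"], fields["locus"])
```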

  10. GenBank Flat Text File • GenBank • Sample record and explanation: • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord • FAQs • http://www.ncbi.nlm.nih.gov/books/NBK49541/#NucProtFAQ.Section_A_GenBank_nucleotide

  11. Structured Text Files • Different ways to structure text files • ASN.1 • XML • JSON • Wait for MBG403 for details
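To give a feel for what "structured" means before MBG403, the same record can be written as JSON and as XML using only the Python standard library. The field names and the record layout below are invented for illustration and do not follow any real database schema; ASN.1 is left out because the standard library has no ASN.1 support.

```python
import json
import xml.etree.ElementTree as ET

# A toy record; the keys are hypothetical, not an official schema.
record = {
    "accession": "FG602538",
    "version": 1,
    "organism": "Papaver somniferum",
    "sequence": "GAACGAAGGG",
}

# JSON: keys and values map directly onto the dictionary.
json_text = json.dumps(record, indent=2)

# XML: each field becomes a child element of <record>.
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_text = ET.tostring(root, encoding="unicode")

print(json_text)
print(xml_text)

# Both texts can be parsed back by a program without guessing at the layout.
restored = json.loads(json_text)
organism = ET.fromstring(xml_text).find("organism").text
```

Unlike a FASTA definition line, every value here is labelled, so a program does not need to know the position of a field to find it.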

  12. Structured Text Files • ASN.1 Example • http://www.ncbi.nlm.nih.gov/nuccore/NC_003622.1?report=asn1&log$=seqview • http://www.ncbi.nlm.nih.gov/nuccore/NC_003622 • Select Display Settings ASN.1

  13. Databases • Unlike the previous formats, not easily human-readable • Special tools and languages are used to add, edit, retrieve, and view data • Advantages • Secure • Stable • Distributed • Fast Access • Huge sizes supported • http://www.freerepublic.com/focus/f-chat/2508670/posts • Ever tried to search in 100 TB of text for something?
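The idea of using a query language and an index instead of scanning text can be tried with SQLite, which ships with Python. This is a toy sketch: the table layout, the column names, and the third record are assumptions made for the example, and real sequence databases are organized very differently.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    "accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
)
# The index lets the database find rows by organism without reading every row.
conn.execute("CREATE INDEX idx_organism ON sequences (organism)")

conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("FG602538", "Papaver somniferum", "GAACGAAGGG"),
        ("FG613049", "Papaver somniferum", "CTTTCTCTAG"),
        ("TOY00001", "Homo sapiens", "ACGTACGTAC"),  # hypothetical record
    ],
)

# SQL is the "special language" used to retrieve data.
rows = conn.execute(
    "SELECT accession FROM sequences WHERE organism = ? ORDER BY accession",
    ("Papaver somniferum",),
).fetchall()
print(rows)
```

With three rows the index makes no measurable difference; its benefit appears as the table grows toward the sizes mentioned on the slide.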

  14. Scientific Data Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf

  15. Characteristics of Scientific Data • Highly Complex • Images, sequences, time series, ... • Strong interdependence of data • In Science • Outliers are of interest • Focus of interest changes rapidly • Data is usually shared • Data must be secure • Never change data, only add • Many viewers, few creators • Collections • Large collections must be shared via strong servers • Small collections (e.g. SwissProt 63MB) can be shared more easily • New methodologies (MS, NGS, ...) have expanded the size of databases

  16. Desired Features for Databases • Efficiency • Scalability • Concurrency • Security • Integrity • Stability • Cross references to other databases • Universally accessible • Query Language • Data mining • Data Warehouse

  17. How Many Bioinformatics Databases? Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf

  18. An Abundance of Databases • Databases and Collections on http://www.hsls.pitt.edu/obrc/ • DNA Sequence Databases and Analysis Tools (499) • Enzymes and Pathways (281) • Gene Mutations, Genetic Variations and Diseases (303) • Genomics Databases and Analysis Tools (703) • Immunological Databases and Tools (61) • Microarray, SAGE, and other Gene Expression (215) • Organelle Databases (29) • Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and others) (179) • Plant Databases (159) • Protein Sequence Databases and Analysis Tools (492) • Proteomics Resources (74) • RNA Databases and Analysis Tools (257) • Structure Databases and Analysis Tools (452) • Sum: 3704

  19. Data Warehouses • Are resources like NCBI and EBI databases? • No, they are larger than what is generally called a database • They can be called data warehouses • They consist of many interlinked databases

  20. Need for Improvement • Anyone can submit data to online resources • Rigorous data checking is necessary • Saçar and Allmer (http://journal.imbio.de/index.php?paper_id=215) • Bağcı and Allmer (http://dx.doi.org/10.1109/HIBIT.2012.6209038) • Data must be standardized • Quality of data must be specified

  21. How to Cite Data • It is rarely necessary to present a sequence in any writing • In general it suffices to give • Accession number of sequence • Database where sequence is located • If database is not given try • Accession Parser (www.biolnk.com) • In case you have a new sequence • Generally required to deposit it in a database • E.g.: http://www.ncbi.nlm.nih.gov/genbank/submit/ • Then cite the assigned accession number(s)

  22. End of Theoretical Part 1 • Mind mapping • 10 min break

  23. Practical Part 1

  24. Where is the data? • Turn on your computers and let’s find out • EBI (www.ebi.ac.uk/) • Ensembl (www.ensembl.org) • GenBank (www.ncbi.nlm.nih.gov/Genbank) • SwissProt (www.uniprot.org) • Bookmark these pages • Are your bookmarks available wherever you are? • Try: http://www.delicious.com

  25. Retrieve Data • You want the DNA sequence of a human hemoglobin • How do you get it? • Try to achieve this goal for a few minutes

  26. Interesting

  27. Ctrl-F

  28. No results

  29. Where have we gone wrong? Language! Database!

  30. GenBank

  31. GenBank • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

  32. GenBank • Accession number • Applies to full record • X00000 • XX000000 • Never changes

  33. GenBank • Version • Identifies a single sequence • Adds a version suffix to the accession number format • X00000.0 • The version number increases (e.g. .1 -> .2) if even a single nucleotide in the sequence changes • Other versions are referenced • http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
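The accession and version formats from the last two slides (X00000, XX000000, and an optional numeric suffix after a dot) can be expressed as a regular expression. The sketch below covers only those two patterns; other identifier styles, such as RefSeq accessions with an underscore, are deliberately rejected. The function name `split_accession` is an assumption made for the example.

```python
import re
from typing import Optional, Tuple

# One letter + five digits, or two letters + six digits, then an optional ".version".
ACCESSION_VERSION = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


def split_accession(text: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (accession, version) or None if the text does not match."""
    match = ACCESSION_VERSION.match(text.strip().upper())
    if match is None:
        return None
    accession, version = match.groups()
    if version is None:
        return accession, None  # plain accession: refers to the full record
    return accession, int(version)  # accession.version: one specific sequence


print(split_accession("FG602538.1"))
print(split_accession("X00000"))
print(split_accession("NC_003622.1"))
```

Keeping the accession and the version as separate values mirrors the slide: the accession never changes, while the version identifies one exact sequence.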

  34. GenBank • GenInfo identifier (GI) • Any change to the sequence forces a new GI number • Translations get separate GI numbers • GI:00000

  35. GenBank

  36. GenBank • Sequence?

  37. GenBank • Eukaryotic

  38. Retrieving Sequences By Example • Basic Local Alignment Search Tool • BLAST

  39. http://www.ebi.ac.uk/

  40. What did we do? • We wanted to find one of the human hemoglobins • The nucleotide sequence in FASTA format • We wanted to find similar sequences • BLAST (NCBI) • FASTA (EBI) • Who got lost in the jungle of links? • That is normal • Bioinformatics is a quickly growing field • Consolidation is not expected any time soon

  41. End of Practical Part 1 • 15 min break

  42. Theoretical Part 2 • And now for something completely different • http://en.wikipedia.org/wiki/And_Now_for_Something_Completely_Different • How can we find sequences? • Can the algorithm we found last week be used?

  43. Similarity Searching • Search Algorithms • BLAST • FASTA • ... • This is at the heart of bioinformatics • It demands a lot of attention
