MBG305 Applied Bioinformatics

MBG305 Applied Bioinformatics Week 2 (05.10.2010) Jens Allmer

Databases • Bioinformatics needs data • Where is this data? • Is there any organization? • How should I cite data?

Where is the data? • Many targeted resources exist • miRBase http://www.mirbase.org/ • Contains microRNAs • PDB http://www.rcsb.org/pdb/home/home.do • Contains protein structures • PeptideAtlas http://www.peptideatlas.org/ • Contains mass spectrometric measurements • KEGG http://www.genome.jp/kegg/ • Contains regulatory and biochemical pathways • PubMed http://www.ncbi.nlm.nih.gov/pubmed/ • Contains indexed journals • ...

Where is the data? • Sequence Databases • EBI (www.ebi.ac.uk/) • Ensembl (www.ensembl.org) • GenBank (www.ncbi.nlm.nih.gov/Genbank) • SwissProt (www.tigr.org/tdb) • ... • Make these pages bookmarks • Are your bookmarks where you are? • Try: http://www.delicious.com • Or bring your own browser • http://portableapps.com/apps/internet/google_chrome_portable

How is Data Organized? • Flat Text Files • FASTA Format • Structured Text Files • ASN.1 and XML based Formats • Databases • Structure • Index • Users • Details in MBG403

Flat Text Files • FASTA Format (Pearson and Lipman, 1988) • Allows multiple sequences per file • Requires identifiers for each sequence • Some special characters and formatting rules • > introduces the definition line (sequence identifier) • 80 characters per sequence line • Only supported characters (IUPAC) • http://www.bioinformatics.org/sms/iupac.html • Example:
>gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 Opium poppy root cDNA library Papaver somniferum cDNA, mRNA sequence
GAACGAAGGGAGAGAACGAAAAAGAAGGAGAGAATGTGTGAGGGTCGGTTTCATACGTTTGGTGTTAACTGAGTTATGCA
ATCTGCAAAAGAGGAGAGATTAGATAGAAGATGAGAAGAATTATGACAACCTAGTCAAGTATGGATCATTGCTCTAATTC
...
>gi|189457344|gb|FG613049.1|FG613049 stem_S093_F08.SEQ Opium poppy stem cDNA library Papaver somniferum cDNA, mRNA sequence
CTTTCTCTAGGTTTCTCCGCAATTTTCAAGTGGACGAATCCAAATAGAATTTGCCAAGCTTTTCTTGATTTATCCTACTC
GGTGTAAAAATGGCGACAATAGGAGCTTCCTCAGCTTGCTGCATGATCAGAAGCACACCCCAGAACAGTGGTAAAATTGC
...
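The FASTA rules on this slide (a ">" definition line, sequence data on the following lines, IUPAC characters only) can be illustrated with a minimal Python sketch. The function names `parse_fasta` and `is_valid_nucleotide` and the two short records are hypothetical and made up for illustration; this is not one of the course tools.

```python
from typing import Dict, Iterable

# IUPAC nucleotide codes (upper case) plus the gap character.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN-")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {definition line: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            if header in records:
                raise ValueError(f"duplicate identifier: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data before the first '>' line")
        else:
            # Sequence lines belonging to one record are concatenated.
            records[header] += line.upper()
    return records


def is_valid_nucleotide(sequence: str) -> bool:
    """True if every character is an IUPAC nucleotide code (or a gap)."""
    return set(sequence) <= IUPAC_NUCLEOTIDES


# Hypothetical two-record example, not real database entries.
example = """>seq1 hypothetical example record
GAACGAAGGG
AGAGAACG
>seq2 another hypothetical record
CTTTCTCTAGN
""".splitlines()

records = parse_fasta(example)
for defline, sequence in records.items():
    print(defline, len(sequence), is_valid_nucleotide(sequence))
```

The sketch keeps whole records in memory, which is fine for small files; for large databases a streaming reader would be the more likely choice.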

FASTA Tools • FASTA Viewer and DNA Translator • http://www.biolnk.com/ • Some FASTA Tools • http://bioinformatics.iyte.edu.tr/index.php?n=Softwares.FastaTools • FASTA Validator/ Converter to CSV file • http://mbg305.allmer.de/tools/

FASTA Usage • Most programs that accept sequence input accept FASTA format • BLAST (partially) • FastA (obviously) • Multiple Sequence Alignment Tools • Most • MS-based Database Search Engines • Some (only database, not queries) • Most Online Forms

FASTA Definition Line Formats • http://en.wikipedia.org/wiki/Fasta_format • GenBank gi|gi-number|gb|accession|locus • EMBL Data Library gi|gi-number|emb|accession|locus • DDBJ, DNA Database of Japan gi|gi-number|dbj|accession|locus • NBRF PIR pir||entry • Protein Research Foundation prf||name • SWISS-PROT sp|accession|name • Brookhaven Protein Data Bank (1) pdb|entry|chain • Brookhaven Protein Data Bank (2) entry:chain|PDBID|CHAIN|SEQUENCE • Patents pat|country|number • GenInfo Backbone Id bbs|number • General database identifier gnl|database|identifier • NCBI Reference Sequence ref|accession|locus • Local Sequence identifier lcl|identifier
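As a rough illustration of how these pipe-separated definition lines can be taken apart, the following Python sketch handles a subset of the formats listed above. The function name `parse_defline`, the `DEFLINE_FIELDS` table, and the output keys are assumptions made for this example, not a standard API; the pir, prf, and second PDB formats are left out for brevity.

```python
from typing import Dict

# Field names per database tag, following the formats listed on the slide.
DEFLINE_FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "ref": ("accession", "locus"),
    "sp": ("accession", "name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "lcl": ("identifier",),
}


def parse_defline(defline: str) -> Dict[str, str]:
    """Split a FASTA definition line into its identifier fields."""
    # The identifier ends at the first space; the rest is free text.
    identifier, _, description = defline.lstrip(">").partition(" ")
    tokens = identifier.split("|")
    result = {"description": description}
    # Many NCBI-style identifiers start with 'gi|<gi-number>|'.
    if tokens[0] == "gi" and len(tokens) >= 2:
        result["gi"] = tokens[1]
        tokens = tokens[2:]
    if tokens:
        tag, values = tokens[0], tokens[1:]
        result["tag"] = tag
        for name, value in zip(DEFLINE_FIELDS.get(tag, ()), values):
            if value:
                result[name] = value
    return result


fields = parse_defline(
    ">gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 "
    "Opium poppy root cDNA library Papaver somniferum cDNA, mRNA sequence"
)
print(fields["gi"], fields["tag"], fields["accession"], fields["locus"])
```

Unknown tags are kept in `tag` but their fields are not named, so the sketch degrades gracefully on formats it does not know.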

GenBank Flat Text File • GenBank • Sample record and explanation: • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord • FAQs • http://www.ncbi.nlm.nih.gov/books/NBK49541/#NucProtFAQ.Section_A_GenBank_nucleotide

Structured Text Files • Different ways to structure text files • ASN.1 • XML • JSON • Wait for MBG403 for details

Structured Text Files • ASN.1 Example • http://www.ncbi.nlm.nih.gov/nuccore/NC_003622.1?report=asn1&log$=seqview • http://www.ncbi.nlm.nih.gov/nuccore/NC_003622 • Select Display Settings ASN.1

Databases • Unlike the previous formats, not easily human-readable • Special tools and languages are used to add, edit, retrieve, and view data • Advantages • Secure • Stable • Distributed • Fast Access • Huge sizes supported • http://www.freerepublic.com/focus/f-chat/2508670/posts • Ever tried to search in 100 TB of text for something?
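To make "special tools and languages" a little more concrete, here is a small sketch using SQL through Python's built-in `sqlite3` module. The table layout, the index name, and the third record are hypothetical; the two FG accessions come from the FASTA example earlier in the lecture, with their sequences truncated to the first 20 nucleotides. The actual database systems are covered in MBG403.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    "accession TEXT PRIMARY KEY, "
    "organism TEXT NOT NULL, "
    "sequence TEXT NOT NULL)"
)

# Add data; the third row is a made-up placeholder record.
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("FG602538.1", "Papaver somniferum", "GAACGAAGGGAGAGAACGAA"),
        ("FG613049.1", "Papaver somniferum", "CTTTCTCTAGGTTTCTCCGC"),
        ("XX000000.1", "Homo sapiens", "ACGTACGTACGT"),
    ],
)

# An index lets the database find rows by organism without scanning everything.
conn.execute("CREATE INDEX idx_organism ON sequences (organism)")

# Retrieve data with a query instead of searching through text.
rows = conn.execute(
    "SELECT accession FROM sequences WHERE organism = ? ORDER BY accession",
    ("Papaver somniferum",),
).fetchall()
print(rows)
conn.close()
```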

Scientific Data Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf

Characteristics of Scientific Data • Highly Complex • Images, sequences, time series, ... • Strong interdependence of data • In Science • Outliers are of interest • Focus of interest changes rapidly • Data is usually shared • Data must be secure • Never change data, only add • Many viewers, few creators • Collections • Large collections must be shared via strong servers • Small collections (e.g. SwissProt 63MB) can be shared more easily • New methodologies (MS, NGS, ...) have expanded size of databases

Desired Features for Databases • Efficiency • Scalability • Concurrency • Security • Integrity • Stability • Cross references to other databases • Universally accessible • Query Language • Data mining • Data Warehouse

How Many Bioinformatics Databases? Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf

An Abundance of Databases • Databases and Collections on http://www.hsls.pitt.edu/obrc/ • DNA Sequence Databases and Analysis Tools (499) • Enzymes and Pathways (281) • Gene Mutations, Genetic Variations and Diseases (303) • Genomics Databases and Analysis Tools (703) • Immunological Databases and Tools (61) • Microarray, SAGE, and other Gene Expression (215) • Organelle Databases (29) • Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and others) (179) • Plant Databases (159) • Protein Sequence Databases and Analysis Tools (492) • Proteomics Resources (74) • RNA Databases and Analysis Tools (257) • Structure Databases and Analysis Tools (452) • Sum: 3704

Data Warehouses • Are resources like NCBI and EBI databases? • No, they are larger than what is generally called a database • They can be called data warehouses • They consist of many interlinked databases

Need for Improvement • Anyone can submit data to online resources • Rigorous data checking is necessary • Saçar and Allmer (http://journal.imbio.de/index.php?paper_id=215) • Bağcı and Allmer (http://dx.doi.org/10.1109/HIBIT.2012.6209038) • Data must be standardized • Quality of data must be specified

How to Cite Data • It is rarely necessary to present a sequence in any writing • In general it suffices to give • Accession number of sequence • Database where sequence is located • If database is not given try • Accession Parser (www.biolnk.com) • In case you have a new sequence • Generally required to deposit it in a database • E.g.: http://www.ncbi.nlm.nih.gov/genbank/submit/ • Then cite the assigned accession number(s)

End of Theoretical Part 1 • Mind mapping • 10 min break

Practical Part 1

Where is the data? • Turn on your computers and let’s find out • EBI (www.ebi.ac.uk/) • Ensembl (www.ensembl.org) • GenBank (www.ncbi.nlm.nih.gov/Genbank) • SwissProt (www.tigr.org/tdb) • Make these pages bookmarks • Are your bookmarks where you are? • Try: http://www.delicious.com

Retrieve Data • You want the DNA sequence of some human hemoglobin • How do you get it? • Try to achieve this goal for a few minutes

Interesting

Ctrl-F

No results

Where have we gone wrong? Language! Database!

GenBank

GenBank • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

GenBank • Accession number • Applies to full record • X00000 • XX000000 • Never changes

GenBank • Version • Identifies a single sequence • Adds version to accession number format • X00000.0 • The version (e.g., .0 -> .1) changes if even a single nucleotide in the sequence is changed • Other versions are referenced • http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
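The accession and version formats described on the last two slides can be checked with a short Python sketch. The function name `split_accession` is hypothetical, and the regular expression covers only the two patterns shown on the slides (one letter plus five digits, two letters plus six digits); other accession formats exist and are deliberately not handled here.

```python
import re
from typing import Optional, Tuple

# X00000 or XX000000, optionally followed by '.<version>'.
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


def split_accession(text: str) -> Tuple[str, Optional[int]]:
    """Split 'FG602538.1' into ('FG602538', 1); version is None if absent."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not in one of the two slide formats: {text!r}")
    accession, version = match.groups()
    return accession, (int(version) if version is not None else None)


print(split_accession("FG602538.1"))
print(split_accession("X00000"))
```

Keeping the accession and the version apart makes the difference visible: the accession stays fixed for the record, while the version counts changes to the sequence.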

GenBank • GenInfo identifier (GI) • Any change to the sequence forces a new gi number • Translations get separate gi numbers • GI:00000

GenBank

GenBank • Sequence?

GenBank • Eukaryotic

Retrieving Sequences By Example • Basic Local Alignment Search Tool • BLAST

http://www.ebi.ac.uk/

What did we do? • We wanted to find one of the human hemoglobins • The nucleotide sequence in FASTA format • We wanted to find similar sequences • BLAST (NCBI) • FASTA (EBI) • Who got lost in the jungle of LINKS? • That is normal • Bioinformatics is a quickly growing field • Consolidation is not expected any time soon
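Once an accession number is known, the FASTA record can also be requested programmatically instead of clicking through the web pages. The sketch below assumes NCBI's E-utilities `efetch` service, which is not part of these slides; the function names are hypothetical, and the accession is the one from the FASTA example earlier in the lecture. Only the URL is built when the code runs; `fetch_fasta` needs a network connection and is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch service (an assumption of this sketch).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build an efetch URL that asks for one record in FASTA format."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH_URL}?{query}"


def fetch_fasta(accession: str) -> str:
    """Download the FASTA text for an accession (requires network access)."""
    with urlopen(efetch_fasta_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


url = efetch_fasta_url("FG602538.1")
print(url)
```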

End of Practical Part 1 • 15 min break

Theoretical Part 2 • And now for something completely different • http://en.wikipedia.org/wiki/And_Now_for_Something_Completely_Different • How can we find sequences? • Can the algorithm we found last week be used?

Similarity Searching • Search Algorithms • BLAST • FASTA • ... • This is at the heart of bioinformatics • It demands a lot of attention
