MBG305 Applied Bioinformatics

Week 2 (05.10.2010)

Jens Allmer


Databases

  • Bioinformatics needs data

    • Where is this data?

    • Is there any organization?

    • How should I cite data?

Where is the data?

  • Many targeted resources exist

    • miRBase http://www.mirbase.org/

      • Contains microRNAs

    • PDB http://www.rcsb.org/pdb/home/home.do

      • Contains protein structures

    • PeptideAtlas http://www.peptideatlas.org/

      • Contains mass spectrometric measurements

    • KEGG http://www.genome.jp/kegg/

      • Contains regulatory and biochemical pathways

    • PubMed http://www.ncbi.nlm.nih.gov/pubmed/

      • Contains indexed journals

    • ...

Where is the data?

  • Sequence Databases

    • EBI (www.ebi.ac.uk/)

    • Ensembl (www.ensembl.org)

    • GenBank (www.ncbi.nlm.nih.gov/Genbank)

    • SwissProt (www.tigr.org/tdb)

    • ...

  • Make these pages bookmarks

    • Are your bookmarks where you are?

      • Try: http://www.delicious.com

    • Or bring your own browser

      • http://portableapps.com/apps/internet/google_chrome_portable

How is Data Organized?

  • Flat Text Files

    • FASTA Format

  • Structured Text Files

    • E.g.: ASN.1 and XML-based formats

  • Databases

    • Structure

    • Index

    • Users

    • Details in MBG403

Flat Text Files

  • FASTA Format (Pearson and Lipman, 1988)

    • Allows multiple sequences per file

    • Requires identifiers for each sequence

    • Some special characters and formatting rules

      • > introduces the definition line (sequence identifier)

      • At most 80 characters per sequence line (by convention)

      • Only supported characters (IUPAC)

        • http://www.bioinformatics.org/sms/iupac.html

  • Example

    >gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 Opium poppy root cDNA library Papaver somniferum cDNA, mRNA sequence




    >gi|189457344|gb|FG613049.1|FG613049 stem_S093_F08.SEQ Opium poppy stem cDNA library Papaver somniferum cDNA, mRNA sequence
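A minimal sketch of reading the FASTA layout described above: a '>' definition line followed by one or more sequence lines. The function name `parse_fasta` and the two example records are made up for illustration; this is not one of the tools referenced in these slides.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each definition line (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            current = line[1:]  # start of a new record
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            records[current] += line  # a sequence may span several lines
    return records


# Hypothetical two-record input (not real database entries).
example = """>seq1 hypothetical first record
ACGTAC
GTTA
>seq2 hypothetical second record
NNACGT
""".splitlines()

records = parse_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence))
```

The sketch does not check line length or the IUPAC alphabet; a validator would add those checks on top.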




FASTA Tools

  • FASTA Viewer and DNA Translator

    • http://www.biolnk.com/

  • Some FASTA Tools

    • http://bioinformatics.iyte.edu.tr/index.php?n=Softwares.FastaTools

  • FASTA Validator/ Converter to CSV file

    • http://mbg305.allmer.de/tools/

FASTA Usage

  • Most programs that accept sequence input accept FASTA format

    • BLAST (partially)

    • FastA (obviously)

    • Multiple Sequence Alignment Tools

      • Most

    • MS-based Database Search Engines

      • Some (only database, not queries)

    • Most Online Forms

FASTA Definition Line Formats

  • http://en.wikipedia.org/wiki/Fasta_format

    • GenBank gi|gi-number|gb|accession|locus

    • EMBL Data Library gi|gi-number|emb|accession|locus

    • DDBJ, DNA Database of Japan gi|gi-number|dbj|accession|locus

    • NBRF PIR pir||entry

    • Protein Research Foundation prf||name

    • SWISS-PROT sp|accession|name

    • Brookhaven Protein Data Bank (1) pdb|entry|chain

    • Brookhaven Protein Data Bank (2) entry:chain|PDBID|CHAIN|SEQUENCE

    • Patents pat|country|number

    • GenInfo Backbone Id bbs|number

    • General database identifier gnl|database|identifier

    • NCBI Reference Sequence ref|accession|locus

    • Local Sequence identifier lcl|identifier
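The pipe-separated layouts listed above can be split mechanically. The following is a rough sketch for the GenBank-style `gi|gi-number|gb|accession|locus` form only; the function name `parse_defline` and the returned dictionary keys are assumptions made for this example, and the other layouts in the list would need their own handling.

```python
def parse_defline(defline: str) -> dict:
    """Split a definition line into identifier fields and free-text description."""
    text = defline.lstrip(">")
    identifier, _, description = text.partition(" ")
    fields = identifier.split("|")
    parsed = {"description": description}
    # Only the 'gi|gi-number|db|accession|locus' layout is interpreted here.
    if len(fields) >= 4 and fields[0] == "gi":
        parsed["gi"] = fields[1]
        parsed["database"] = fields[2]
        parsed["accession"] = fields[3]
        parsed["locus"] = fields[4] if len(fields) > 4 else ""
    else:
        parsed["raw_fields"] = fields  # leave other layouts uninterpreted
    return parsed


# Definition line taken from the FASTA example in these slides (shortened).
header = ">gi|189443480|gb|FG602538.1|FG602538 Opium poppy root cDNA library"
info = parse_defline(header)
print(info["database"], info["accession"])
```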

GenBank Flat Text File

  • GenBank

    • Sample record and explanation:

      • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord

    • FAQs

      • http://www.ncbi.nlm.nih.gov/books/NBK49541/#NucProtFAQ.Section_A_GenBank_nucleotide

Structured Text Files

  • Different ways to structure text files

    • ASN.1

    • XML

    • JSON

    • Wait for MBG403 for details

Structured Text Files

  • ASN.1 Example

    • http://www.ncbi.nlm.nih.gov/nuccore/NC_003622.1?report=asn1&log$=seqview

    • http://www.ncbi.nlm.nih.gov/nuccore/NC_003622

      • Select Display Settings ASN.1


Databases

  • Unlike the previous formats, not easily human-readable

    • Special tools and languages are used to add, edit, retrieve, and view data

  • Advantages

    • Secure

    • Stable

    • Distributed

    • Fast Access

    • Huge sizes supported

      • http://www.freerepublic.com/focus/f-chat/2508670/posts

      • Ever tried to search in 100 TB of text for something?

Scientific Data

Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf

Characteristics of Scientific Data

  • Highly Complex

    • Images, sequences, time series, ...

    • Strong interdependence of data

  • In Science

    • Outliers are of interest

    • Focus of interest changes rapidly

    • Data is usually shared

    • Data must be secure

      • Never change data, only add

      • Many viewers, few creators

  • Collections

    • Large collections must be shared via strong servers

    • Small collections (e.g. SwissProt 63MB) can be shared more easily

    • New methodologies (MS, NGS, ...) have expanded size of databases

Desired Features for Databases

  • Efficiency

  • Scalability

  • Concurrency

  • Security

  • Integrity

  • Stability

  • Cross references to other databases

  • Universally accessible

  • Query Language

  • Data mining

  • Data Warehouse

How Many Bioinformatics Databases?

Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf

An Abundance of Databases

  • Databases and Collections on http://www.hsls.pitt.edu/obrc/

    • DNA Sequence Databases and Analysis Tools (499)

    • Enzymes and Pathways (281)

    • Gene Mutations, Genetic Variations and Diseases (303)

    • Genomics Databases and Analysis Tools (703)

    • Immunological Databases and Tools (61)

    • Microarray, SAGE, and other Gene Expression (215)

    • Organelle Databases (29)

    • Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and others) (179)

    • Plant Databases (159)

    • Protein Sequence Databases and Analysis Tools (492)

    • Proteomics Resources (74)

    • RNA Databases and Analysis Tools (257)

    • Structure Databases and Analysis Tools (452)

  • Sum: 3704

Data Warehouses

  • Are resources like NCBI and EBI databases?

    • No, they are larger than what is generally called a database

    • They can be called data warehouses

    • They consist of many interlinked databases

Need for Improvement

  • Anyone can submit data to online resources

  • Rigorous data checking is necessary

    • Saçar and Allmer (http://journal.imbio.de/index.php?paper_id=215)

    • Bağcı and Allmer (http://dx.doi.org/10.1109/HIBIT.2012.6209038)

  • Data must be standardized

  • Quality of data must be specified

How to Cite Data

  • It is rarely necessary to present a sequence in any writing

  • In general it suffices to give

    • Accession number of sequence

    • Database where sequence is located

      • If database is not given try

        • Accession Parser (www.biolnk.com)

  • In case you have a new sequence

    • Generally required to deposit it in a database

    • E.g.: http://www.ncbi.nlm.nih.gov/genbank/submit/

    • Then cite the assigned accession number(s)

End of Theoretical Part 1

  • Mind mapping

  • 10 min break

Where is the data?

  • Turn on your computers and let’s find out

  • EBI (www.ebi.ac.uk/)

  • Ensembl (www.ensembl.org)

  • GenBank (www.ncbi.nlm.nih.gov/Genbank)

  • SwissProt (www.tigr.org/tdb)

  • Make these pages bookmarks

    • Are your bookmarks where you are?

    • Try: http://www.delicious.com

Retrieve Data

  • You want the DNA sequence of a human hemoglobin

  • How do you get it?

  • Try to achieve this goal for a few minutes

Where have we gone wrong?




  • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html


  • Accession number

    • Applies to full record

    • X00000

    • XX000000

    • Never changes


  • Version

    • Identifies a single sequence

    • Adds version to accession number format

      • X00000.0

    • The version suffix changes (e.g. .0 -> .1) if even a single nucleotide in the sequence is changed

    • Other versions are referenced

  • http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi


  • GenInfo identifier (GI)

    • Any change to the sequences forces a new gi number

    • Translations get separate gi numbers

    • GI:00000
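As a small illustration of the accession.version layout described above (`X00000.0`), the following sketch separates the stable accession from the version suffix. The helper name `split_accession` is hypothetical, and the sketch does not validate the accession against any database.

```python
from typing import Optional, Tuple


def split_accession(identifier: str) -> Tuple[str, Optional[int]]:
    """Return (accession, version); version is None if no '.N' suffix is present."""
    accession, dot, version = identifier.partition(".")
    return accession, int(version) if dot else None


# Accession taken from the FASTA example in these slides.
print(split_accession("FG602538.1"))
print(split_accession("FG602538"))
```

The accession part stays the same across revisions, so it can be used to group versions of the same record.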

GenBank


  • Sequence?


  • Eukaryotic

Retrieving Sequences By Example

  • Basic Local Alignment Search Tool


What did we do?

  • We wanted to find one of the human hemoglobins

    • The nucleotide sequence in FASTA format

  • We wanted to find similar sequences

    • BLAST (ncbi)

    • FASTA (ebi)

  • Who got lost in the jungle of LINKS?

    • That is normal

    • Bioinformatics is a quickly growing field

    • Consolidation is not expected any time soon

Theoretical Part 2

  • And now for something completely different

    • http://en.wikipedia.org/wiki/And_Now_for_Something_Completely_Different

  • How can we find sequences?

  • Can the algorithm we found last week be used?

Similarity Searching

  • Search Algorithms

    • BLAST

    • FASTA

    • ...

  • This is at the heart of bioinformatics

  • It demands a lot of attention

Similarity Searching

  • Exact pattern matching

  • Approximate pattern matching

String Matching Math

  • Remember the string matching we did last week?

  • Today we will look at the math of finding EXACT matches between queries and databases

  • If time allows we will look into substitution matrices

Probability for perfect matches

Query (Q): ATTGCC




$L_Q$ = length of the query (number of nucleotides)

$L_T$ = length of the target sequence (number of nucleotides)

Element Probability

Probability of finding a nucleotide

Very roughly 0.25

Given the sequence:


What are the probabilities for A, C, G, and T?
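One way to estimate the element probabilities is to count each nucleotide and divide by the sequence length. The sketch below assumes a made-up sequence, since the one from the slide is not reproduced in this transcript; the function name `nucleotide_frequencies` is likewise only illustrative.

```python
from collections import Counter
from typing import Dict


def nucleotide_frequencies(sequence: str) -> Dict[str, float]:
    """Relative frequency of A, C, G and T; other characters are ignored."""
    counts = Counter(sequence.upper())
    total = sum(counts[n] for n in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {n: counts[n] / total for n in "ACGT"}


# Hypothetical sequence, not the one shown on the original slide.
freqs = nucleotide_frequencies("ATTGCCATGA")
for nucleotide, frequency in freqs.items():
    print(nucleotide, frequency)
```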

Sequence Probability


What is p?

p = the probability of randomly generating the sequence given the frequency and number of its elements (e.g.: $P_A$).

There is no sequential dependency assumed in this model.

What is the probability of generating

AAAAGTTT given the probabilities that we just calculated?

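$p = P_A^4 \cdot P_G \cdot P_T^3 = 0.24^4 \cdot 0.26 \cdot 0.32^3$

$\approx 0.003 \cdot 0.260 \cdot 0.033$

$\approx 0.000026$ (about 0.000028 if the intermediate values are not rounded)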
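The calculation can be checked with a few lines of code. This sketch reads the slide's values as $P_A = 0.24$, $P_G = 0.26$ and $P_T = 0.32$; the value used for C (0.18) is an assumption chosen so that the four probabilities sum to 1, and it does not affect the result because AAAAGTTT contains no C.

```python
from math import prod
from typing import Dict


def sequence_probability(sequence: str, probabilities: Dict[str, float]) -> float:
    """Probability of generating the sequence if positions are independent."""
    return prod(probabilities[n] for n in sequence)


# P_C = 0.18 is assumed (not given on the slide); the others are read from it.
probs = {"A": 0.24, "C": 0.18, "G": 0.26, "T": 0.32}
p = sequence_probability("AAAAGTTT", probs)
print(f"{p:.7f}")
```

Because no sequential dependency is assumed, the order of the nucleotides does not matter: any sequence with four A, one G and three T has the same probability under this model.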

How Often do we Expect to Find the Query

  • The number of matches is restricted by the database size

  • How often can we shift Q (Query) against T (Target)?

  • This defines the number of possible matching operations

$n = L_T - L_Q + 1$

Example: $L_Q = 6$, $L_T = 10$

$n = 10 - 6 + 1 = 5$
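The number of shifts can be made concrete by sliding the query along the target and comparing at every position. This is a naive sketch; the target string below is invented for the example (only the query ATTGCC comes from the slides), and the function name `count_exact_matches` is hypothetical.

```python
from typing import Tuple


def count_exact_matches(query: str, target: str) -> Tuple[int, int]:
    """Return (number of alignment positions n, number of exact matches)."""
    n = len(target) - len(query) + 1  # n = L_T - L_Q + 1
    if n < 1:
        return 0, 0  # the query is longer than the target
    matches = sum(
        1 for start in range(n) if target[start:start + len(query)] == query
    )
    return n, matches


# Query from the slides; the 10-nucleotide target is made up.
n, matches = count_exact_matches("ATTGCC", "GGATTGCCTA")
print(n, matches)
```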



[Figure: binomial distributions for n = 20 with p = 0.1, p = 0.5, and p = 0.8]

The probability distribution of the number of matches is approximately binomial:

Definition: $q = 1 - p$

$p(x) = \frac{n!}{x!\,(n - x)!}\, p^x q^{n-x}$

What is p?

What is n?

What is q?


$p$: probability of a trial being true (a match)

$q$: probability of a trial being false (no match)

$n$: number of trials

$x$: number of successes


  • Factorials lead to overflows in computer programs

  • For large $n$ and small $p$ ($n p < 1$), the distribution can be approximated by a Poisson distribution

    • Much easier to calculate for a computer

Poisson vs. Binomial Distributions

Poisson: $p(x) = e^{-\lambda} \frac{\lambda^x}{x!}$, with $\lambda = n p$

Binomial: $p(x) = \frac{n!}{x!\,(n - x)!}\, p^x q^{n-x}$
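To see how close the two distributions are, both formulas can be evaluated side by side. The values n = 1000 and p = 0.0005 are arbitrary choices for this sketch (large n, small p, n*p < 1), not numbers from the lecture.

```python
from math import comb, exp, factorial


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability of exactly x events with mean lam."""
    return exp(-lam) * lam**x / factorial(x)


# Arbitrary illustrative parameters: many trials, small success probability.
n, p = 1000, 0.0005
lam = n * p
for x in range(4):
    print(x, binomial_pmf(x, n, p), poisson_pmf(x, lam))
```

Python integers do not overflow, so `comb` is safe here; in languages with fixed-width integers the factorials in the binomial formula are the part that overflows, which is one motivation for the Poisson form.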

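Partial Matches

  • So far we considered matching the complete query

  • Partial match of length $L$ (with $L \le L_Q \land L \le L_T$):

$p = 2^{-2L}$ (each position matches with probability $1/4 = 2^{-2}$)

$m = L_Q - L + 1$

$n = L_D - L + 1$, where $L_D$ is the length of the database (the target)

$E = m\,n\,2^{-2L}$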

BLAST E-value

  • $E = m\,n\,2^{-S}$

    $E = m\,n\,2^{-2L}$

  • Describes the number of expected matches which are equally good or better
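A short sketch of the simplified E-value for exact matches of length L. It assumes the number of start positions in a sequence is its length minus L plus 1, following the shift count $n = L_T - L_Q + 1$ used earlier; the function name and the example numbers are hypothetical and are not the values from the exercises.

```python
def expected_matches(query_length: int, database_length: int, match_length: int) -> float:
    """Expected number of chance matches of match_length: E = m * n * 2**(-2L)."""
    m = query_length - match_length + 1      # start positions in the query
    n = database_length - match_length + 1   # start positions in the database
    if m < 1 or n < 1:
        return 0.0  # the match would not fit into the query or the database
    return m * n * 2.0 ** (-2 * match_length)


# Arbitrary illustrative numbers.
e_value = expected_matches(query_length=20, database_length=1_000_000, match_length=11)
print(e_value)
```

Each additional matching nucleotide divides E by 4, so E drops quickly as the required match length grows.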

End of Theoretical Part 2

  • Mind mapping

  • 10 min break

Practice Poisson vs Binomial



Calculate p(0), p(1) and p(3)

Note: at least one match = 1 – p(0)

$E = m\,n\,2^{-2L}$

Assuming a database size of 10 000 000 and a query length of 10, calculate the number of matches that would happen by chance.

Practical Concerns

  • Human genome 3 billion nucleotides

  • Dogma: 14 nucleotides are enough to uniquely identify a gene

  • Verify this using Poisson distribution


$p(x) = e^{-\lambda} \frac{\lambda^x}{x!}$

$\lambda = n p$
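One possible way to set up this check, as a sketch only: it assumes equal nucleotide probabilities of 0.25, independence between positions, and a single strand of 3 billion nucleotides. Real genomes violate all three assumptions, so the numbers are a rough guide.

```python
from math import exp

GENOME_LENGTH = 3_000_000_000  # human genome size as given on the slide
QUERY_LENGTH = 14              # length claimed to be sufficient


def expected_random_matches(target_length: int, query_length: int) -> float:
    """lambda = n * p with n = L_T - L_Q + 1 and p = 0.25 ** L_Q."""
    n = target_length - query_length + 1
    p = 0.25 ** query_length
    return n * p


lam = expected_random_matches(GENOME_LENGTH, QUERY_LENGTH)
p_zero = exp(-lam)            # Poisson p(0): no chance match at all
p_at_least_one = 1 - p_zero   # at least one chance match
print(lam, p_zero, p_at_least_one)
```

Changing `QUERY_LENGTH` shows how the expected number of chance matches shrinks by a factor of 4 per added nucleotide.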

BLAST Interface

  • Setting a cutoff E-value

    • Consider the calculation you just did

    • If someone were to set the cutoff to 0.01 with the same assumptions

      • How many results would you expect?

      • What would you advise the user?

  • Topic will be revisited later

Amino Acid Sequences

  • What changes when instead of nucleotide sequences we were to use amino acid sequences?

Practise this

  • Determine how long a query must be so that it can uniquely identify a gene in the human genome

    • p < 0.05

  • Make a table showing the E-value against $L_Q$ (10..100) with $L_D$ = 3 000 000 000

  • Use Excel to do this


  • Go to GenBank and inspect all parameters

    • Find their meaning (even if you think you know what it means)

    • Sometimes definitions are surprising

  • Collect information about parameters that pose problems to you

    • Submit this information to us so that we can discuss in the following week