
MBG305 Applied Bioinformatics

Week 2 (05.10.2010)

Jens Allmer



Databases

  • Bioinformatics needs data

    • Where is this data?

    • Is there any organization?

    • How should I cite data?



Where is the data?

  • Many targeted resources exist

    • miRBase http://www.mirbase.org/

      • Contains microRNAs

    • PDB http://www.rcsb.org/pdb/home/home.do

      • Contains protein structures

    • PeptideAtlas http://www.peptideatlas.org/

      • Contains mass spectrometric measurements

    • KEGG http://www.genome.jp/kegg/

      • Contains regulatory and biochemical pathways

    • PubMed http://www.ncbi.nlm.nih.gov/pubmed/

      • Contains indexed journals

    • ...



Where is the data?

  • Sequence Databases

    • EBI (www.ebi.ac.uk/)

    • Ensembl (www.ensembl.org)

    • GenBank (www.ncbi.nlm.nih.gov/Genbank)

    • SwissProt (www.tigr.org/tdb)

    • ...

  • Bookmark these pages

    • Are your bookmarks where you are?

      • Try: http://www.delicious.com

    • Or bring your own browser

      • http://portableapps.com/apps/internet/google_chrome_portable



How is Data Organized?

  • Flat Text Files

    • FASTA Format

  • Structured Text Files

    • Structured formats (e.g.: ASN.1, XML)

  • Databases

    • Structure

    • Index

    • Users

    • Details in MBG403



Flat Text Files

  • FASTA Format (Pearson and Lipman, 1988)

    • Allows multiple sequences per file

    • Requires identifiers for each sequence

    • Some special characters and formatting rules

      • > introduces the definition line (sequence identifier)

      • Up to 80 characters per sequence line

      • Only supported characters are allowed (IUPAC codes)

        • http://www.bioinformatics.org/sms/iupac.html

  • Example

    >gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 Opium poppy root cDNA library Papaver somniferum cDNA, mRNA sequence

    GAACGAAGGGAGAGAACGAAAAAGAAGGAGAGAATGTGTGAGGGTCGGTTTCATACGTTTGGTGTTAACTGAGTTATGCA

    ATCTGCAAAAGAGGAGAGATTAGATAGAAGATGAGAAGAATTATGACAACCTAGTCAAGTATGGATCATTGCTCTAATTC

    ...

    >gi|189457344|gb|FG613049.1|FG613049 stem_S093_F08.SEQ Opium poppy stem cDNA library Papaver somniferum cDNA, mRNA sequence

    CTTTCTCTAGGTTTCTCCGCAATTTTCAAGTGGACGAATCCAAATAGAATTTGCCAAGCTTTTCTTGATTTATCCTACTC

    GGTGTAAAAATGGCGACAATAGGAGCTTCCTCAGCTTGCTGCATGATCAGAAGCACACCCCAGAACAGTGGTAAAATTGC

    ...
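The formatting rules above can be sketched as a small reader. This is a minimal illustration, not one of the course tools: the function name `parse_fasta` and the two short example records are hypothetical, and only the `>` rule and multi-line sequences are handled (no IUPAC validation).

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {definition line: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            if header in records:
                raise ValueError(f"duplicate identifier: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data before the first definition line")
        else:
            records[header] += line  # a sequence may span several lines
    return records


# Hypothetical two-record example (not real database entries).
example = """\
>seq1 hypothetical first record
GAACGAAGGG
AGAGAACG
>seq2 hypothetical second record
CTTTCTCTAG
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

Keeping the whole definition line as the key is a simplification; real tools usually split the identifier from the free-text description.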



FASTA Tools

  • FASTA Viewer and DNA Translator

    • http://www.biolnk.com/

  • Some FASTA Tools

    • http://bioinformatics.iyte.edu.tr/index.php?n=Softwares.FastaTools

  • FASTA Validator/ Converter to CSV file

    • http://mbg305.allmer.de/tools/



FASTA Usage

  • Most programs that accept sequence input accept FASTA format

    • BLAST (partially)

    • FastA (obviously)

    • Multiple Sequence Alignment Tools

      • Most

    • MS-based Database Search Engines

      • Some (only database, not queries)

    • Most Online Forms



FASTA Definition Line Formats

  • http://en.wikipedia.org/wiki/Fasta_format

    • GenBank gi|gi-number|gb|accession|locus

    • EMBL Data Library gi|gi-number|emb|accession|locus

    • DDBJ, DNA Database of Japan gi|gi-number|dbj|accession|locus

    • NBRF PIR pir||entry

    • Protein Research Foundation prf||name

    • SWISS-PROT sp|accession|name

    • Brookhaven Protein Data Bank (1) pdb|entry|chain

    • Brookhaven Protein Data Bank (2) entry:chain|PDBID|CHAIN|SEQUENCE

    • Patents pat|country|number

    • GenInfo Backbone Id bbs|number

    • General database identifier gnl|database|identifier

    • NCBI Reference Sequence ref|accession|locus

    • Local Sequence identifier lcl|identifier
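The pipe-separated layouts listed above can be split programmatically. The sketch below is only an illustration: the function name `parse_defline` and the returned dictionary keys are assumptions, and only the `gi|...` layout is treated specially; every other prefix is returned as a database tag plus its raw fields.

```python
def parse_defline(defline: str) -> dict:
    """Split a FASTA definition line into identifier fields and description."""
    text = defline.lstrip(">")
    # The identifier ends at the first space; the rest is free-text description.
    ident, _, description = text.partition(" ")
    fields = ident.split("|")
    parsed = {"description": description}
    if fields[0] == "gi" and len(fields) >= 4:
        # Layout: gi|gi-number|db|accession|locus
        parsed["gi"] = fields[1]
        parsed["db"] = fields[2]
        parsed["accession"] = fields[3]
        parsed["locus"] = fields[4] if len(fields) > 4 else ""
    else:
        # Layout: db|field|field..., e.g. sp|accession|name
        parsed["db"] = fields[0]
        parsed["ids"] = fields[1:]
    return parsed


genbank = parse_defline(
    ">gi|189443480|gb|FG602538.1|FG602538 PF_T3_37R_G02_08AUG2003_004 "
    "Opium poppy root cDNA library Papaver somniferum cDNA, mRNA sequence"
)
swissprot = parse_defline(">sp|P69905|HBA_HUMAN")
print(genbank["accession"], swissprot["ids"])
```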



GenBank Flat Text File

  • GenBank

    • Sample record and explanation:

      • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord

    • FAQs

      • http://www.ncbi.nlm.nih.gov/books/NBK49541/#NucProtFAQ.Section_A_GenBank_nucleotide



Structured Text Files

  • Different ways to structure text files

    • ASN.1

    • XML

    • JSON

    • Wait for MBG403 for details



Structured Text Files

  • ASN.1 Example

    • http://www.ncbi.nlm.nih.gov/nuccore/NC_003622.1?report=asn1&log$=seqview

    • http://www.ncbi.nlm.nih.gov/nuccore/NC_003622

      • Select Display Settings ASN.1



Databases

  • Unlike the previous formats, databases are not easily human-readable

    • Special tools and languages are used to add, edit, retrieve, and view data

  • Advantages

    • Secure

    • Stable

    • Distributed

    • Fast Access

    • Huge sizes supported

      • http://www.freerepublic.com/focus/f-chat/2508670/posts

      • Ever tried to search in 100 TB of text for something?



Scientific Data

Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf



Characteristics of Scientific Data

  • Highly Complex

    • Images, sequences, time series, ...

    • Strong interdependence of data

  • In Science

    • Outliers are of interest

    • Focus of interest changes rapidly

    • Data is usually shared

    • Data must be secure

      • Never change data, only add

      • Many viewers, few creators

  • Collections

    • Large collections must be shared via strong servers

    • Small collections (e.g. SwissProt 63MB) can be shared more easily

    • New methodologies (MS, NGS, ...) have expanded size of databases



Desired Features for Databases

  • Efficiency

  • Scalability

  • Concurrency

  • Security

  • Integrity

  • Stability

  • Cross references to other databases

  • Universally accessible

  • Query Language

  • Data mining

  • Data Warehouse



How Many Bioinformatics Databases?

Source: http://www.bioinformatics.wsu.edu/bioinfo_course/notes/Lecture12.pdf



An Abundance of Databases

  • Databases and Collections on http://www.hsls.pitt.edu/obrc/

    • DNA Sequence Databases and Analysis Tools (499)

    • Enzymes and Pathways (281)

    • Gene Mutations, Genetic Variations and Diseases (303)

    • Genomics Databases and Analysis Tools (703)

    • Immunological Databases and Tools (61)

    • Microarray, SAGE, and other Gene Expression (215)

    • Organelle Databases (29)

    • Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and others) (179)

    • Plant Databases (159)

    • Protein Sequence Databases and Analysis Tools (492)

    • Proteomics Resources (74)

    • RNA Databases and Analysis Tools (257)

    • Structure Databases and Analysis Tools (452)

  • Sum: 3704



Data Warehouses

  • Are resources like NCBI and EBI databases?

    • No, they are larger than what is generally called a database

    • They can be called data warehouses

    • They consist of many interlinked databases



Need for Improvement

  • Anyone can submit data to online resources

  • Rigorous data checking is necessary

    • Saçar and Allmer (http://journal.imbio.de/index.php?paper_id=215)

    • Bağcı and Allmer (http://dx.doi.org/10.1109/HIBIT.2012.6209038)

  • Data must be standardized

  • Quality of data must be specified



How to Cite Data

  • It is rarely necessary to present a sequence in any writing

  • In general it suffices to give

    • Accession number of sequence

    • Database where sequence is located

      • If database is not given try

        • Accession Parser (www.biolnk.com)

  • In case you have a new sequence

    • Generally required to deposit it in a database

    • E.g.: http://www.ncbi.nlm.nih.gov/genbank/submit/

    • Then cite the assigned accession number(s)



End of Theoretical Part 1

  • Mind mapping

  • 10 min break



Practical Part 1



Where is the data?

  • Turn on your computers and let’s find out

  • EBI (www.ebi.ac.uk/)

  • Ensembl (www.ensembl.org)

  • GenBank (www.ncbi.nlm.nih.gov/Genbank)

  • SwissProt (www.tigr.org/tdb)

  • Bookmark these pages

    • Are your bookmarks where you are?

    • Try: http://www.delicious.com



Retrieve Data

  • You want the DNA sequence of a human hemoglobin

  • How do you get it?

  • Try to achieve this goal for a few minutes



Interesting



Ctrl-F



No results



Where have we gone wrong?

Language!

Database!



GenBank



GenBank

  • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html



GenBank

  • Accession number

    • Applies to full record

    • X00000

    • XX000000

    • Never changes



GenBank

  • Version

    • Identifies a single sequence

    • Adds version to accession number format

      • X00000.0

    • The version suffix changes (e.g. .0 -> .1) if even a single nucleotide in the sequence is changed

    • Other versions are referenced

  • http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi
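As a small illustration of the accession.version convention, the sketch below separates the stable accession from the version number. It assumes only the two accession layouts shown on these slides (one letter plus five digits, or two letters plus six digits); real databases use additional layouts, and the name `split_accession_version` is hypothetical.

```python
import re

# Assumption: only the X00000 and XX000000 layouts shown on the slides.
ACCESSION_VERSION = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})\.(\d+)$")


def split_accession_version(identifier: str) -> tuple:
    """Return (accession, version) for an identifier such as 'FG602538.1'."""
    match = ACCESSION_VERSION.match(identifier)
    if match is None:
        raise ValueError(f"not in accession.version form: {identifier!r}")
    return match.group(1), int(match.group(2))


print(split_accession_version("FG602538.1"))
```

The accession part stays constant across revisions, so comparing only the first element of the tuple tells you whether two identifiers refer to the same record.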



GenBank

  • GenInfo identifier (GI)

    • Any change to the sequence forces a new GI number

    • Translations get separate GI numbers

    • GI:00000



GenBank



GenBank

  • Sequence?



GenBank

  • Eukaryotic



Retrieving Sequences By Example

  • Basic Local Alignment Search Tool

  • BLAST



http://www.ebi.ac.uk/



What did we do?

  • We wanted to find one of the human hemoglobins

    • The nucleotide sequence in FASTA format

  • We wanted to find similar sequences

    • BLAST (NCBI)

    • FASTA (EBI)

  • Who got lost in the jungle of links?

    • That is normal

    • Bioinformatics is a quickly growing field

    • Consolidation is not expected any time soon



End of Practical Part 1

  • 15 min break



Theoretical Part 2

  • And now for something completely different

    • http://en.wikipedia.org/wiki/And_Now_for_Something_Completely_Different

  • How can we find sequences?

  • Can the algorithm we found last week be used?



Similarity Searching

  • Search Algorithms

    • BLAST

    • FASTA

    • ...

  • This is at the heart of bioinformatics

  • It demands a lot of attention



Similarity Searching

  • Exact pattern matching

  • Approximate pattern matching



String Matching Math

  • Remember the string matching we did last week?

  • Today we will look at the math of finding EXACT matches between queries and databases

  • If time allows we will look into substitution matrices



Probability for perfect matches

Query (Q): ATTGCC

Target (T): CGATTGCCCG

$L_Q$ = length of query (number of nucleotides)

$L_T$ = length of target sequence (number of nucleotides)



Element Probability

Probability of finding a given nucleotide at a position: very roughly 0.25 (four nucleotides, assumed equally likely)

Given the sequence:

ATTTCCGGGGTAGCTAGCTAGTATATTATCGGCGCTAA

What are the probabilities for A, C, G, and T?



Sequence Probability

$p = P_A \cdot P_C^{2} \cdot P_G \cdot P_T^{2}$

What is p?

p = the probability of randomly generating the sequence, given the frequency and number of its elements (e.g.: $P_A$).

There is no sequential dependency assumed in this model.

What is the probability of generating AAAAGTTT given the probabilities that we just calculated?

$p = 0.24^{4} \cdot 0.26 \cdot 0.32^{3}$

$\approx 0.003 \cdot 0.260 \cdot 0.033$

$\approx 0.000026$
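The calculation above can be reproduced with a short script. This is a minimal sketch of the independence model described on the slide; the function names are hypothetical, and the probabilities are estimated from the 38-nucleotide example sequence given earlier.

```python
from collections import Counter
from math import prod

# Example sequence from the "Element Probability" slide.
reference = "ATTTCCGGGGTAGCTAGCTAGTATATTATCGGCGCTAA"


def element_probabilities(sequence: str) -> dict:
    """Relative frequency of each symbol in the sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return {base: count / total for base, count in counts.items()}


def sequence_probability(sequence: str, probs: dict) -> float:
    """Probability of generating the sequence, assuming independent positions."""
    return prod(probs[base] for base in sequence)


probs = element_probabilities(reference)
p = sequence_probability("AAAAGTTT", probs)
print({base: round(value, 2) for base, value in sorted(probs.items())})
print(f"{p:.6f}")
```

Because no sequential dependency is assumed, the order of the letters in the query does not matter, only how often each letter occurs.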



How Often do we Expect to Find the Query

  • The number of matches is restricted by the database size

  • How often can we shift Q (Query) against T (Target)?

  • This defines the number of possible matching operations

$n = L_T - L_Q + 1$

Example:

$L_Q = 6$

$L_T = 10$

$n = 10 - 6 + 1$

$n = 5$

Query: ATTGCC

Target: CGATTGCCCG
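Shifting the query along the target can be written out directly. The sketch below is a naive illustration with hypothetical function names: it counts the possible matching operations and lists the offsets at which the query matches exactly.

```python
def count_alignments(query: str, target: str) -> int:
    """Number of positions at which the query can be placed on the target."""
    return max(0, len(target) - len(query) + 1)


def exact_match_positions(query: str, target: str) -> list:
    """0-based offsets at which the query matches the target exactly."""
    n = count_alignments(query, target)
    return [i for i in range(n) if target[i:i + len(query)] == query]


query = "ATTGCC"
target = "CGATTGCCCG"
n = count_alignments(query, target)
positions = exact_match_positions(query, target)
print(n, positions)
```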



[Figure: binomial distributions for n = 20 with p = 0.1, p = 0.5, and p = 0.8]

The probability distribution of the number of matches is approximately binomial:

Definition: $q = 1 - p$

$p(x) = \frac{n!}{x!\,(n - x)!} \, p^{x} q^{n - x}$

What is p?

What is n?

What is q?

http://en.wikipedia.org/wiki/Binomial_distribution

p: probability of being true (a match in a single trial)

q: probability of being false (no match in a single trial)

n: number of trials

x: number of successes
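The binomial formula can be evaluated directly for small n. The sketch below uses Python's `math.comb` for the binomial coefficient; the function name `binomial_pmf` is hypothetical, and the parameters n = 20 with p = 0.1, 0.5, and 0.8 mirror the values shown with the figure.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    q = 1.0 - p
    return comb(n, x) * p**x * q ** (n - x)


n = 20
for p in (0.1, 0.5, 0.8):
    distribution = [binomial_pmf(x, n, p) for x in range(n + 1)]
    most_likely = max(range(n + 1), key=lambda x: distribution[x])
    print(f"p = {p}: most likely number of successes = {most_likely}")
```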



Problem

  • Computing factorials quickly leads to overflows in computer programs

  • With n*p < 1 and large n, the distribution can be approximated by a Poisson distribution

    • Much easier to calculate for a computer



Poisson vs. Binomial Distributions

Poisson

$p(x) = e^{-\lambda} \frac{\lambda^{x}}{x!}$

$\lambda = n \cdot p$

Binomial

$p(x) = \frac{n!}{x!\,(n - x)!} \, p^{x} q^{n - x}$
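To see how close the two distributions are when n is large and n*p < 1, both formulas can be evaluated side by side. This is a sketch with illustrative values (n = 1000, p = 0.0005, chosen only for the comparison); the function names are hypothetical.

```python
from math import comb, exp, factorial


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Exact binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)


def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability of x events with mean lam."""
    return exp(-lam) * lam**x / factorial(x)


# Illustrative values only: large n, small p, n * p = 0.5 < 1.
n, p = 1000, 0.0005
lam = n * p
for x in range(4):
    print(x, f"{binomial_pmf(x, n, p):.5f}", f"{poisson_pmf(x, lam):.5f}")
```

The Poisson form needs only x! for the small number of matches x, whereas the binomial coefficient involves n!, which is where fixed-width number types in many languages run into trouble.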



Partial matches

  • So far we considered matching the complete query

  • Partial match:

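Match length $L$ with $L \le L_Q$ and $L \le L_D$ (where $L_D$ is the length of the database, i.e. the target)

$p = 2^{-2L}$ (each nucleotide matches with probability $1/4 = 2^{-2}$)

$m = L_Q - L + 1$

$n = L_D - L + 1$

$E = m \, n \, 2^{-2L}$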
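The expected number of chance matches can be computed as follows. This is a sketch under two stated assumptions: all four nucleotides are equally likely, and the number of start positions in a sequence of length N for a window of length L is N - L + 1, in line with the count n = L_T - L_Q + 1 used for full-length matches. The function name `expected_matches` is hypothetical.

```python
def expected_matches(query_len: int, db_len: int, match_len: int) -> float:
    """Expected number of exact chance matches of length match_len."""
    if match_len < 1 or match_len > min(query_len, db_len):
        raise ValueError("match length must fit into both query and database")
    m = query_len - match_len + 1  # start positions in the query
    n = db_len - match_len + 1     # start positions in the database
    # Assumption: each base matches with probability 1/4 = 2**-2.
    return m * n * 2.0 ** (-2 * match_len)


# Full-length match of the 6-nucleotide query against the 10-nucleotide target.
print(expected_matches(6, 10, 6))
# Shorter partial match of length 3 between the same two sequences.
print(expected_matches(6, 10, 3))
```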



BLAST E-Value

  • $E = m \, n \, 2^{-S}$

    $E = m \, n \, 2^{-2L}$

  • Describes the number of expected matches which are equally good or better



End of Theoretical Part 2

  • Mind mapping

  • 10 min break



Practical Part 2



Practice Poisson vs Binomial

Q: ATG

D: CGATTGCCCG

Calculate p(0), p(1) and p(3)

Note: at least one match = 1 – p(0)



$E = m \, n \, 2^{-2L}$

Assuming a database size of 10 000 000 and a query length of 10, calculate the number of matches that would happen by chance.



Practical Concerns

  • Human genome: about 3 billion nucleotides

  • Dogma: 14 nucleotides are enough to uniquely identify a gene

  • Verify this using the Poisson distribution

Poisson

$p(x) = e^{-\lambda} \frac{\lambda^{x}}{x!}$

$\lambda = n \cdot p$
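One way to set up this check is sketched below. It assumes a random genome with equally likely nucleotides, counts n as the number of start positions for a 14-nucleotide window, and takes p as the chance that a random window equals the query; the function name `poisson_rate` and the exact genome length of 3 000 000 000 are assumptions for illustration.

```python
from math import exp

GENOME_LENGTH = 3_000_000_000  # assumed round figure from the slide


def poisson_rate(word_len: int, genome_len: int = GENOME_LENGTH) -> float:
    """Expected number of chance occurrences of one specific word."""
    n = genome_len - word_len + 1  # possible start positions
    p = 0.25 ** word_len           # assumption: equally likely bases
    return n * p


lam = poisson_rate(14)
p_zero = exp(-lam)  # Poisson p(0): no chance occurrence at all
print(f"lambda = {lam:.2f}")
print(f"p(0) = {p_zero:.2e}")
```

Under these assumptions lambda comes out near 11, so a specific 14-nucleotide word is expected to occur by chance several times in a random sequence of this size, which argues against treating 14 nucleotides as sufficient on their own.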



BLAST Interface

  • Setting a cutoff E-value

    • Consider the calculation you just did

    • If someone were to set the cutoff to 0.01 under the same assumptions

      • How many results would you expect?

      • What would you advise the user?

  • Topic will be revisited later



Amino Acid Sequences

  • What changes when we use amino acid sequences instead of nucleotide sequences?



Practise this

  • Determine how long a query must be so that it can uniquely identify a gene in the human genome

    • p < 0.05

  • Make a table showing the E value against $L_Q$ (10..100) with $L_D$ = 3 000 000 000

  • Use Excel to do this
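The same table can also be produced with a short script instead of a spreadsheet. The sketch below makes several assumptions that are not stated on the slide: the whole query must match (L equals the query length), nucleotides are equally likely, and "p < 0.05" is read as the Poisson probability of at least one chance match being below 0.05. All function names are hypothetical.

```python
from math import expm1

DB_LENGTH = 3_000_000_000


def expected_chance_matches(query_len: int, db_len: int = DB_LENGTH) -> float:
    """E value for a full-length exact match of the query."""
    n = db_len - query_len + 1
    return n * 4.0 ** (-query_len)


def prob_at_least_one(query_len: int, db_len: int = DB_LENGTH) -> float:
    """Poisson probability of one or more chance matches: 1 - exp(-E)."""
    return -expm1(-expected_chance_matches(query_len, db_len))


def shortest_unique_length(threshold: float = 0.05) -> int:
    """Smallest query length whose chance-match probability is below threshold."""
    for length in range(1, 101):
        if prob_at_least_one(length) < threshold:
            return length
    raise ValueError("no length up to 100 satisfies the threshold")


for length in range(10, 101, 10):
    print(f"{length:>4}  {expected_chance_matches(length):.3e}")
print("shortest length with p < 0.05:", shortest_unique_length())
```

Using `expm1` avoids the loss of precision that `1 - exp(-E)` would suffer once E becomes very small for long queries.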



Assignments

  • Go to GenBank and inspect all parameters

    • Find their meaning (even if you think you know what it means)

    • Sometimes definitions are surprising

  • Collect information about parameters that pose problems to you

    • Submit this information to us so that we can discuss in the following week

