Introduction to Bioinformatics

Iosif Vaisman

Email: [email protected]


NIH working definition of bioinformatics and computational biology (July 2000)

The NIH Biomedical Information Science and Technology Initiative Consortium agreed on the following definitions of bioinformatics and computational biology recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations.

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.


Bioinformatics bibliography (papers with the word “bioinformatics” in title or abstract)

Liebman MN, Molecular modeling of protein structure and function: a bioinformatic approach. J Comput Aided Mol Des 1988, 1(4):323-41


Dynamics of Database Growth


Comparative Sequence Sizes (bases)

  • Yeast chromosome 3 350,000

  • Escherichia coli (bacterium) genome 4,600,000

  • Largest yeast chromosome now mapped 5,800,000

  • Entire yeast genome 15,000,000

  • Smallest human chromosome (Y) 50,000,000

  • Largest human chromosome (1) 250,000,000

  • Entire human genome 3,000,000,000


The String Alignment Problem

string - a sequence of characters from some alphabet

given: two strings acbcdb and cadbd

one of the possible alignments:

a c - - b c d b
- c a d b - d -

scoring function: exact match +2, mismatch -1, insertion -1

score: 3 × (+2) + 5 × (-1) = 1
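The score above can be recomputed column by column. The following is a minimal sketch, assuming the slide's scoring function (+2 match, -1 mismatch, -1 for a column containing a gap); the function name `score_alignment` is illustrative and not part of the original material.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap=-1):
    """Score two equal-length aligned rows column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot consist of gaps only")
        if x == "-" or y == "-":
            total += gap          # insertion/deletion column
        elif x == y:
            total += match        # exact match
        else:
            total += mismatch     # substitution
    return total


# The alignment of acbcdb and cadbd shown above: 3 matches, 5 gap columns.
print(score_alignment("ac--bcdb", "-cadb-d-"))
```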


The String Alignment Problem

given: two strings CTCATG and TACTTG

C T C A T G
    |   | |
T A C T T G

score: 3 × (+2) + 3 × (-1) = 3

C T C A - T - G
  |   |   |   |
- T - A C T T G

score: 4 × (+2) + 4 × (-1) = 4
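The two alignments above are candidates, and neither is guaranteed to be the best one. A minimal sketch of how the best global score could be found by dynamic programming (the Needleman-Wunsch recurrence), assuming the same +2/-1/-1 scoring; the function name is illustrative.

```python
def global_alignment_score(s, t, match=2, mismatch=-1, gap=-1):
    """Best global alignment score by dynamic programming (Needleman-Wunsch)."""
    # prev[j] is the best score for the already processed prefix of s against t[:j]
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align s[i-1] with t[j-1]
                            prev[j] + gap,        # s[i-1] against a gap
                            curr[j - 1] + gap))   # t[j-1] against a gap
        prev = curr
    return prev[-1]


print(global_alignment_score("CTCATG", "TACTTG"))
```

Under this scoring the sketch reports 5 for CTCATG and TACTTG, reached for example by C T - C A T G over - T A C T T G (four matches, one mismatch, two gap columns).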


Entropy and Redundancy of Language

CUR F W D DIS AND P

A SED IEND ROUGHT EATH EASE AIN

BLES FR B BR AND AG


Entropy and Redundancy of Language

** CUR**** F*****W******* D***** DIS*****AND P***

|| |||| ||||| ||||||| ||||| ||||| |||

**BLES****FR*****B*******BR*****AND ***** AG***

The sequences are 65% identical

A CURSED FIEND WROUGHT DEATH DISEASE AND PAIN

|| |||| ||||| ||||||| ||||| ||||| |||

A BLESSED FRIEND BROUGHT BREATH AND EASE AGAIN


Substitution Matrices

  • Dayhoff (or MDM, or PAM) - Derived from global alignments of closely related sequences. PAM100 - number refers to evolutionary distance (Percentage of Acceptable point Mutations per 10^8 years)

[Figure: diagram labeled PAM100, PAM150 and PAM200 with a time scale of 100, 200 and 300 million years]


Substitution Matrices

  • BLOSUM (BLOcks SUbstitution Matrix) - Derived from local, ungapped alignments of distantly related sequences. BLOSUM62 - number refers to the minimum percent identity

Reference: Henikoff & Henikoff, Proteins 17:49, 1993


Selecting a Matrix

Low PAM: short segments, high similarity

High PAM: long segments, low similarity

  • Compared sequences are related: 200 PAM or 250 PAM

  • Database scanning: 120 PAM

  • Local alignment search: 40 PAM, 120 PAM, 250 PAM

  • Detection of related sequences using BLAST: BLOSUM 62

THERE IS NO “ONE SIZE FITS ALL” MATRIX !


Matrix Example

     A     B     C     D     E     F     G     H     I     K  ..
   1.5   0.2   0.3   0.3   0.3  -0.5   0.7  -0.1   0.0   0.0  ..  A
         1.1  -0.4   1.1   0.7  -0.7   0.6   0.4  -0.2   0.4  ..  B
               1.5  -0.5  -0.6  -0.1   0.2  -0.1   0.2  -0.6  ..  C
                     1.5   1.0  -1.0   0.7   0.4  -0.2   0.3  ..  D
                           1.5  -0.7   0.5   0.4  -0.2   0.3  ..  E
                                 1.5  -0.6  -0.1   0.7  -0.7  ..  F
                                       1.5  -0.2  -0.3  -0.1  ..  G
                                             1.5  -0.3   0.1  ..  H
                                                   1.5  -0.2  ..  I
                                                         1.5  ..  K


Dayhoff’s Acceptable Point Mutations

Ala A
Arg R   30
Asn N  109   17
Asp D  154    0  532
Cys C   33   10    0    0
Gln Q   93  120   50   76    0
Glu E  266    0   94  831    0  422
Gly G  579   10  156  162   10   30  112
His H   21  103  226   43   10  243   23   10
Ile I   66   30   36   13   17    8   35    0    3
Leu L   95   17   37    0    0   75   15   17   40  253
Lys K   57  477  322   85    0  147  104   60   23   43   39
Met M   29   17    0    0    0   20    7    7    0   57  207   90
Phe F   20    7    7    0    0    0    0   17   20   90  167    0   17
Pro P  345   67   27   10   10   93   40   49   50    7   43   43    4    7
Ser S  772  137  432   98  117   47   86  450   26   20   32  168   20   40  269
Thr T  590   20  169   57   10   37   31   50   14  129   52  200   28   10   73  696
Trp W    0   27    3    0    0    0    0    0    3    0   13    0    0   10    0   17    0
Tyr Y   20    3   36    0   30    0   10    0   40   13   23   10    0  260    0   22   23    6
Val V  365   20   13   17   33   27   37   97   30  661  303   17   77   10   50   43  186    0   17
         A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y
       Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr


Search and alignment entropy

  • Information content per position: PAM10 - 3.43 bits; PAM120 - 0.98 bits; PAM160 - 0.70 bits; PAM250 - 0.38 bits; BLOSUM62 - 0.70 bits

  • Information requirements: for search - 30 bits; for alignment - 16 bits
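Dividing the required information by the information available per position gives a rough lower bound on how many aligned positions a matrix needs. A minimal sketch using only the figures listed above; the dictionary and function names are illustrative, and treating the ratio as a minimum length is an assumption about how the two bullet points are meant to combine.

```python
import math

# Average information per aligned position, in bits (values from the slide).
BITS_PER_POSITION = {
    "PAM10": 3.43,
    "PAM120": 0.98,
    "PAM160": 0.70,
    "PAM250": 0.38,
    "BLOSUM62": 0.70,
}

# Information needed for each task, in bits (values from the slide).
REQUIRED_BITS = {"search": 30, "alignment": 16}


def min_aligned_positions(matrix, task):
    """Smallest number of aligned positions that supplies the required bits."""
    return math.ceil(REQUIRED_BITS[task] / BITS_PER_POSITION[matrix])


for name in BITS_PER_POSITION:
    print(name, min_aligned_positions(name, "search"),
          min_aligned_positions(name, "alignment"))
```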


Search and alignment entropy

Recommended matrices for different query lengths

Query length   Substitution matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (11,1)


FASTA Algorithm

1. First run (identities)

[Figure: dot matrix of Sequence A against Sequence B]
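The first pass looks only for identities between the two sequences. A minimal sketch, assuming identities are detected as shared k-tuples and grouped by the diagonal of the dot matrix on which they fall; the `ktup` parameter and the function name are illustrative and do not come from the slide.

```python
from collections import defaultdict


def diagonal_hits(seq_a, seq_b, ktup=2):
    """Count identical k-tuples shared by two sequences, grouped by diagonal."""
    positions = defaultdict(list)            # k-tuple -> start positions in seq_a
    for i in range(len(seq_a) - ktup + 1):
        positions[seq_a[i:i + ktup]].append(i)

    hits = defaultdict(int)                  # diagonal offset (i - j) -> hit count
    for j in range(len(seq_b) - ktup + 1):
        for i in positions.get(seq_b[j:j + ktup], []):
            hits[i - j] += 1
    return dict(hits)


print(diagonal_hits("CTCATG", "TACTTG"))
```

Diagonals with many hits mark the regions that the later steps rescore and join.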


FASTA Algorithm

2. Rescoring using PAM matrix

[Figure: dot matrix of Sequence A against Sequence B with high-score and low-score regions marked]

The score of the highest scoring initial region is saved as the init1 score.


FASTA Algorithm

3. Joining threshold - eliminates disjointed segments

[Figure: dot matrix of Sequence A against Sequence B]

Non-overlapping regions are joined. The score equals the sum of the scores of the regions minus a gap penalty. The score of the highest scoring region, at the end of this step, is saved as the initn score.


FASTA Algorithm

4. Alignment optimization using dynamic programming

[Figure: dot matrix of Sequence A against Sequence B]

The score for this alignment is the opt score.


FASTA Algorithm

FASTA uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair.

Using the distribution of the z-score, the program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() score.


FASTA Results

  • When init1 = initn = opt: 100% homology over the matched stretch.

  • When initn > init1: more than 1 matching region in the database with poorly matching separating regions.

  • When opt > initn: the matching regions are greatly improved by adding gaps in one or both of the sequences.


BLAST - Basic Local Alignment Search Tool

  • BLAST programs use a heuristic search algorithm. The programs use the statistical methods of Karlin and Altschul (1990, 1993).

  • BLAST programs were designed for fast database searching, with minimal sacrifice of sensitivity to distantly related sequences.


BLAST Algorithm

1. Word list

Query sequence of length L

Maximum of L-w+1 words (typically w = 3 for proteins)

For each word from the query sequence find the list of words with high score using a substitution matrix (PAM or BLOSUM)
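The word-list step can be sketched as follows. This is an illustration under stated assumptions: `toy_score` is a hypothetical stand-in for a PAM or BLOSUM lookup, and the `threshold` parameter is not given on the slide.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical placeholder for a PAM/BLOSUM entry: +5 identity, -1 otherwise."""
    return 5 if a == b else -1


def query_words(query, w=3):
    """All overlapping words of length w; a query of length L has at most L - w + 1."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold, score=toy_score):
    """Words of the same length that score at least `threshold` against `word`."""
    return ["".join(cand)
            for cand in product(AMINO_ACIDS, repeat=len(word))
            if sum(score(a, b) for a, b in zip(word, cand)) >= threshold]


words = query_words("MKVLA")
print(words)
print(len(neighborhood(words[0], threshold=9)))
```

With a real substitution matrix the neighborhood also contains words with no identical residues, provided the substitutions are conservative enough to reach the threshold.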


BLAST Algorithm

2. Exact matches of words from the word list to the database sequences

[Figure: word list matched against database sequences]


BLAST Algorithm

3. Maximal Segment Pairs (MSPs)

For each exact word match, alignment is extended in both directions to find high score segments
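A minimal sketch of the extension step, assuming ungapped extension that stops once the running score falls more than `x_drop` below the best score seen so far. The stopping rule, the +2/-1 scoring and all names are assumptions made for illustration; the slide does not specify them.

```python
def extend_hit(query, subject, q_start, s_start, w=3, x_drop=5,
               score=lambda a, b: 2 if a == b else -1):
    """Extend a word hit without gaps.

    Returns (score, q_begin, q_end) with q_end exclusive.
    """
    def extend(q_pos, s_pos, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length   # remember the best extension
            elif best - running > x_drop:
                break                              # score dropped too far
            q_pos += step
            s_pos += step
        return best, best_len

    seed = sum(score(query[q_start + k], subject[s_start + k]) for k in range(w))
    right, right_len = extend(q_start + w, s_start + w, +1)
    left, left_len = extend(q_start - 1, s_start - 1, -1)
    return seed + left + right, q_start - left_len, q_start + w + right_len


# Word hit "GTA" at position 4 in both sequences.
print(extend_hit("XXACGTACGTYY", "ZZACGTACGTWW", 4, 4))
```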


Gapped BLAST

  • The Gapped BLAST algorithm allows gaps to be introduced into the alignments. That means that similar regions are not broken into several segments.

  • This method reflects biological relationships much better.


BLAST family of programs

  • blastp - amino acid query sequence against a protein sequence database

  • blastn - nucleotide query sequence against a nucleotide sequence database

  • blastx - nucleotide query sequence translated in all reading frames against a protein database

  • tblastn - protein query sequence against a nucleotide sequence database dynamically translated in all reading frames

  • tblastx - six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
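blastx, tblastn and tblastx all rely on translating a nucleotide sequence in every reading frame. A minimal sketch of six-frame translation, assuming the standard genetic code and upper-case A/C/G/T input; the function names are illustrative and this is not how the BLAST programs themselves are implemented.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return {f"{sign}{frame + 1}": translate(strand[frame:])
            for sign, strand in (("+", dna), ("-", reverse))
            for frame in range(3)}


print(six_frame_translations("ATGGCCTAA"))
```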


Database Searches

  • Run BLAST first, then depending on your results run a finer tool (FASTA, Smith-Waterman, etc.)

  • Where possible use translated sequence.

  • E() < 0.05 is statistically significant, usually biologically interesting. Check also 0.05 < E() < 10 because you might find interesting hits.

  • Pay attention to abnormal composition of the query sequence; it usually causes biased scoring.

  • Split large query sequences (if >1000 for DNA, >200 for protein).

  • If the query has repeated segments, remove them and repeat the search.


Documenting the Search

  • Algorithm(s)

  • Substitution matrix

  • Gap penalty (FASTA)

  • Name of database

  • Version of database

  • Computer used


MULTIPLE SEQUENCE ALIGNMENT


Computational complexity

Alignment of protein sequences with 200 amino acid residues:


Multiple alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG

VTISCTGTSSNIGS--ITVNWYQQLPG

LRLSCSSSGFIFSS--YAMYWVRQAPG

LSLTCTVSGTSFDD--YYSTWVRQPPG

PEVTCVVVDVSHEDPQVKFNWYVDG--

ATLVCLISDFYPGA--VTVAWKADS--

AALGCLVKDYFPEP--VTVSWNSG---

VSLTCLVKGFYPSD--IAVEWESNG--

Column cost: the sum of costs for all possible pairs
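The column cost can be sketched as a sum over all pairs of rows. This is a minimal illustration: `unit_cost` is a hypothetical pairwise cost (0 for identical symbols, 1 otherwise), whereas a real program would take the pair costs from a substitution matrix and a gap penalty.

```python
from itertools import combinations

ROWS = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWESNG--",
]


def unit_cost(a, b):
    """Hypothetical pairwise cost: 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


def column_cost(column, pair_cost=unit_cost):
    """Sum of the pairwise costs over all pairs of symbols in one column."""
    return sum(pair_cost(a, b) for a, b in combinations(column, 2))


columns = list(zip(*ROWS))
print(column_cost(columns[4]))   # the fully conserved C column
print(column_cost(columns[0]))   # the first, variable column
```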


Multiple alignment

A correct multiple alignment corresponds to an evolutionary history:

  • there is no correct way to determine it

  • the practical way is to find an alignment with the maximum score


Multiple sequence alignment

Given k (k > 2) sequences, s1,…, sk, each sequence consisting of characters from an alphabet A, a multiple alignment is a rectangular array, consisting of characters from the alphabet A’ (A + "-"), that satisfies the following 3 conditions:

1. There are exactly k rows.

2. Ignoring the gap character, row number i is exactly the sequence si.

3. Each column contains at least one character different from "-".
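The three conditions translate directly into a check. A minimal sketch; the function name is illustrative, and the explicit test for equal row lengths encodes the "rectangular array" requirement of the definition.

```python
def is_multiple_alignment(rows, sequences, gap="-"):
    """Check the conditions of the definition for candidate alignment rows."""
    # Condition 1: exactly k rows, one per sequence.
    if len(rows) != len(sequences):
        return False
    # Rectangular array: all rows have the same length.
    if len({len(row) for row in rows}) > 1:
        return False
    # Condition 2: removing gaps from row i gives back sequence i.
    if any(row.replace(gap, "") != seq for row, seq in zip(rows, sequences)):
        return False
    # Condition 3: no column consists of gap characters only.
    return all(any(ch != gap for ch in col) for col in zip(*rows))


print(is_multiple_alignment(["VSN-S", "-SNA-", "---AS"], ["VSNS", "SNA", "AS"]))
```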


Consensus

Plurality - minimum number of votes for a consensus

Threshold - scoring matrix value below which a symbol may not vote for a coalition.

Sensitivity - minimum score to select consensus

Profiles - blocks of prealigned sequences


Multiple alignment algorithm

1. Pairwise alignments (progressive pairwise alignments)

2. Distance matrix calculation

3. Guide tree creation (hierarchical clustering)

4. New sequence addition


Scoring system (distances)

D(ij) = -ln[ (Sreal(ij) - Srand(ij)) / (Siden(ij) - Srand(ij)) ] × 100

Sreal(ij) - observed similarity score for two aligned sequences i and j

Siden(ij) - average of the two scores for each sequence aligned with itself

Srand(ij) - average score determined from 100 global randomizations of the two sequences

The distances D(ij) are used to generate the distance matrix from which the approximate guide tree is generated.
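A minimal sketch of the distance calculation, assuming the factor of 100 multiplies the logarithm (so that two identical sequences have distance 0); the slide layout leaves this placement ambiguous. The function and argument names are illustrative.

```python
import math


def alignment_distance(s_real, s_iden, s_rand):
    """Distance D(ij) from the observed, identity and randomized scores."""
    effective = (s_real - s_rand) / (s_iden - s_rand)   # normalized similarity
    if effective <= 0:
        raise ValueError("observed score is not above the randomized score")
    return -math.log(effective) * 100


# Hypothetical scores: the observed score lies halfway between random and identical.
print(round(alignment_distance(s_real=55, s_iden=100, s_rand=10), 2))
```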


Multiple alignment

[Figure: lattice vertices (0,0), (0,1), (1,0), (1,1) for sequences A and B, and (0,0,0) to (1,1,1) for sequences A, B and C]

Segment - line joining two vertices

Each unit m-dimensional cube in the lattice contains 2^m - 1 segments
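The count of segments per unit cube, and the size of the whole lattice, show why exact multiple alignment becomes expensive quickly. A minimal sketch; the function names are illustrative, and the 200-residue example length is borrowed from the computational complexity slide.

```python
from itertools import product


def lattice_moves(m):
    """Segments leaving one vertex of an m-dimensional unit cube.

    Each sequence either advances (1) or takes a gap (0); the all-zero
    step is excluded, which leaves 2**m - 1 possibilities.
    """
    return [step for step in product((0, 1), repeat=m) if any(step)]


def lattice_vertices(lengths):
    """Number of vertices in the alignment lattice for the given sequence lengths."""
    total = 1
    for n in lengths:
        total *= n + 1
    return total


for m in (2, 3, 4):
    print(m, len(lattice_moves(m)), lattice_vertices([200] * m))
```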


Multiple alignment

Alignment Path for 3 Sequences

(0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1), (4,3,2)


Multiple alignment

V S N - S

- S N A -

- - - A S

Pairwise Projections of the Alignment
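A path through the lattice determines the alignment: each step emits one column, with a residue for every coordinate that advances and a gap for every coordinate that stays. A minimal sketch, assuming the three sequences are VSNS, SNA and AS (read off the rows above with gaps removed); the function name is illustrative.

```python
def path_to_alignment(path, sequences, gap="-"):
    """Turn a lattice path into alignment rows, one column per step."""
    rows = ["" for _ in sequences]
    for prev, curr in zip(path, path[1:]):
        for k, seq in enumerate(sequences):
            step = curr[k] - prev[k]
            if step not in (0, 1):
                raise ValueError("each coordinate may advance by 0 or 1 per step")
            rows[k] += seq[prev[k]] if step else gap
    return rows


path = [(0, 0, 0), (1, 0, 0), (2, 1, 0), (3, 2, 0), (3, 3, 1), (4, 3, 2)]
for row in path_to_alignment(path, ["VSNS", "SNA", "AS"]):
    print(" ".join(row))
```

Dropping one coordinate from every vertex of the path gives the corresponding pairwise projection.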


Alignment statistics

Rablpb Humcetp Rabcetp Bovbpi

Humlbpa Ratlbp Maccetp Humbpi

       1     2     3     4     5     6     7     8
     478   67%   65%   19%   19%   18%   42%   43%
1      0   82%   80%   39%   39%   36%   64%   65%
       0    1%    0%    5%    5%   12%    2%    2%

     327   483   58%   16%   16%   16%   39%   41%
2    400     0   75%   38%   38%   35%   62%   63%
       5     0    0%    5%    5%   12%    1%    1%

     318   284   482   18%   18%   17%   40%   43%
3    390   367     0   38%   38%   35%   64%   64%
       4     1     0    5%    5%   12%    1%    1%

      96    84    95   494   95%   74%   20%   21%
4    198   192   194     0   98%   84%   40%   41%
      30    29    28     0    0%    7%    6%    5%


Alignment score

Rablpb Humcetp Rabcetp Bovbpi

Humlbpa Ratlbp Maccetp Humbpi

1 2 3 4 5 6 7 8

1 4077

2 5358 4129

3 5323 5650 4096

4 8103 8229 8112 4210

5 8109 8243 8118 4332 4219

6 8535 8672 8575 5511 5519 4261

7 6474 6531 6500 8103 8119 8572 4103

8 6392 6434 6378 8033 8035 8520 5508 4083

1 2 3 4 5 6 7 8


Alignment visualization

Identity

Summary view


Alignment visualization

Physico-chemical properties

Differences mode


Alignment visualization (tree)


Sequence Logos: a quantitative graphical display for binding sites and proteins

Reference: Schneider, T.D. Meth. Enzym 274:445, 1996



Multiple Alignment Programs

  • Pileup (GCG): Needleman and Wunsch algorithm for pairwise alignment and UPGMA method for tree construction

  • CLUSTAL: Wilbur and Lipman algorithm for pairwise alignment (CABIOS 8:189, 1992)

  • PIMA: pattern-matching based algorithm (PNAS 87:118, 1990)

  • TreeAlign: phylogenetic algorithm (Meth. Enzymol. 18:626, 1990)


Patterns in protein sequences


Regular Expressions

Patterns described in a standard way are known as regular expressions

x        ANY
[ ]      OR            [ILV]     I or L or V
{ }      NOT           {DE}      not D or E
( )      repetitions   x(2,3)    x-x or x-x-x
-        separator
<        N-terminal
>        C-terminal
.        END


Regular Expressions

[AC]-x-V-x(4)-{ED}.

[Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

...LKHVAYVFQALIYWIK...

...AVEMAGVKYLQVQHGS...

...LYTGAIVTNNDGPYMA...

...KEYKCKVEKELTDICN...
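The pattern notation maps almost one-to-one onto the regular expressions of general-purpose languages. A minimal sketch of a converter to Python's `re` syntax, covering only the elements listed in the table; the function name is illustrative and this is not the code used by any pattern database.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a pattern such as [AC]-x-V-x(4)-{ED}. into a Python regex."""
    pattern = pattern.rstrip(".")                  # "." marks the end of the pattern
    parts = []
    for element in pattern.split("-"):             # "-" separates positions
        at_start = element.startswith("<")         # "<" anchors at the N-terminus
        at_end = element.endswith(">")             # ">" anchors at the C-terminus
        element = element.strip("<>")
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                             # x: any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"         # {..}: none of these residues
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
print(regex)
for fragment in ("LKHVAYVFQALIYWIK", "AVEMAGVKYLQVQHGS",
                 "LYTGAIVTNNDGPYMA", "KEYKCKVEKELTDICN"):
    print(fragment, re.search(regex, fragment).group())
```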


PROSITE Database

Current version contains 1079 documentation entries that describe 1459 different patterns, rules and profiles/matrices

[ST]-x(2)-[DE]
Casein kinase II phosphorylation site

[AG]-x(4)-G-K-[ST]
ATP/GTP-binding site motif A (P-loop)

Y-x-[NQH]-K-[DE]-[IVA]-F-[LM]-R-[ED]
Heat shock hsp90 proteins family signature

http://www.expasy.ch/prosite


Blocks Database

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins

N-6 Adenine-specific DNA methylases proteins

width=9 seqs=78

DMA_VIBCH|Q08318 (85) SCTQWWPPF 77

HEMK_MYCLE|P45832 (181) DLFVAQPTL 100

MT57_ECOLI|P25240 (111) DGALGNPPF 13

MTC1_CHVN1|Q01511 (172) NFVFLDPPY 8

MTC1_COREQ|P42828 (71) QLSFSCPPF 49

MTH2_HAEHA|P00473 (32) KIAFFDPQY 52

MTH3_HAEIN|P43871 (23) HAIISDIPY 73

MTM1_MICAM|P50190 (306) AAVLTNPPF 14

MTM2_MORBO|P23192 (25) QLAVIDPPY 10

MTMU_MYCSP|P43641 (37) QVIYADPPW 13

MTR1_RHOSH|P14751 (60) QLIICDPPY 8

....................................

http://www.blocks.fhcrc.org/
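Because a block is ungapped, it can be summarized position by position. A minimal sketch that counts residues in each column of the eleven segments listed above (the block itself has 78 sequences, so these counts describe only the excerpt); the names are illustrative.

```python
from collections import Counter

# The eleven segments shown on the slide (excerpt of a 78-sequence block).
SEGMENTS = [
    "SCTQWWPPF", "DLFVAQPTL", "DGALGNPPF", "NFVFLDPPY", "QLSFSCPPF",
    "KIAFFDPQY", "HAIISDIPY", "AAVLTNPPF", "QLAVIDPPY", "QVIYADPPW",
    "QLIICDPPY",
]


def column_counts(segments):
    """Residue counts for every column of an ungapped block."""
    return [Counter(column) for column in zip(*segments)]


counts = column_counts(SEGMENTS)
for position, counter in enumerate(counts, start=1):
    residue, n = counter.most_common(1)[0]
    print(position, residue, n)
```

Counts like these are the raw material for position-specific scoring matrices and for sequence logos.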


Pfam Database

Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains

Zinc finger, C2H2 type

TYY1_HUMAN/383-407 YVCPF.DGCN...KKFAQSTNLKSHILT...H

ZG52_XENLA/61-83 YTCT...QCN...KQFSHSAQLRAHIST...H

KRUP_DROME/306-328 YTCE...ICD...GKFSDSNQLKSHMLV...H

YKQ8_CAEEL/78-102 YKCT...VCR...KDISSSESLRTHMFKQ.HH

DEFI_CHICK/268-292 YECP...NCK...KRFSHSGSYSSHISSK.KC

ZFH1_DROME/389-413 FGCD...NCG...KRFSHSGSFSSHMTSK.KC

YL57_CAEEL/42-65 YLCY...YCG...KTLSDRLEYQQHMLK..VH

ZFA_MOUSE/542-564 FKCD...ICL...LTFSDTKEVQQHALV...H

BASO_HUMAN/719-742 FQCD...ICK...KTFKNACSVKIHHKN..MH

HUNB_DROME/297-319 FQCD...KCS...YTCVNKSMLNSHRKS...H

SFP1_YEAST/598-623 FKCPV.IGCE...KTYKNQNGLKYHRLH..GH

ZG29_XENLA/62-84 FVCT...VCG...KTYKYKHGLNTHLHS...H

http://pfam.wustl.edu/


Other Motif Databases

PRINTS : a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family

http://bioinf.man.ac.uk/dbbrowser/PRINTS/

DOMO : a protein domain database

http://www.infobiogen.fr/~gracy/domo/home.htm

ProDom : a protein domain database

http://protein.toulouse.inra.fr/prodom.html


InterPro Database

InterPro : integrated resource for the commonly used signature databases - Pfam, PRINTS, PROSITE, ProDom and SWISS-PROT + TrEMBL.

Current release of InterPro (3.2) contains 3939 entries, representing 1009 domains, 2850 families, 65 repeats and 15 post-translational modification sites.

http://www.ebi.ac.uk/interpro



From genes to proteins

DNA [PROMOTER ELEMENTS] → TRANSCRIPTION → RNA [SPLICE SITES] → SPLICING → mRNA [START CODON, STOP CODON] → TRANSLATION → PROTEIN



Chromosome 19 gene map


Computational Gene Prediction

  • Where are the genes unlikely to be located?

  • How do transcription factors know where to bind a region of DNA?

  • Where are the transcription, splicing, and translation start and stop signals?

  • What do coding regions do (and non-coding regions do not)?

  • Can we learn from examples?

  • Does this sequence look familiar?


Measures of Prediction Accuracy

Nucleotide Level

[Figure: REALITY and PREDICTION tracks with segments labeled TN, FN, TP and FP; table of PREDICTION (c, nc) against REALITY (c, nc) with cells TP, FP, FN, TN]

Sensitivity: Sn = TP / (TP + FN)

Specificity: Sp = TP / (TP + FP)
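The nucleotide-level measures can be computed directly from two annotation tracks. A minimal sketch, assuming each track is a string with 'c' for coding and 'n' for non-coding positions; this encoding and the function name are illustrative.

```python
def nucleotide_accuracy(reality, prediction):
    """Return (Sn, Sp) for two equal-length tracks of 'c' and 'n' labels."""
    if len(reality) != len(prediction):
        raise ValueError("tracks must have the same length")
    pairs = list(zip(reality, prediction))
    tp = sum(r == "c" and p == "c" for r, p in pairs)   # coding, predicted coding
    fn = sum(r == "c" and p == "n" for r, p in pairs)   # coding, missed
    fp = sum(r == "n" and p == "c" for r, p in pairs)   # non-coding, predicted coding
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp


print(nucleotide_accuracy("ccccnnnn", "ccnncnnn"))
```

Note that Sp as defined here is TP / (TP + FP), the convention used in gene prediction, and TN does not enter either measure.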


Measures of Prediction Accuracy

Exon Level

Sensitivity: Sn = number of correct exons / number of actual exons

Specificity: Sp = number of correct exons / number of predicted exons

[Figure: REALITY and PREDICTION tracks marking a MISSING EXON, a WRONG EXON and a CORRECT EXON]
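At the exon level the same two ratios are taken over whole exons. A minimal sketch, assuming an exon counts as correct only when both of its boundaries match exactly; the coordinates below are hypothetical and the function name is illustrative.

```python
def exon_accuracy(actual, predicted):
    """Return (Sn, Sp) for exons given as (start, end) coordinate pairs."""
    actual, predicted = set(actual), set(predicted)
    correct = len(actual & predicted)                  # both boundaries match
    sn = correct / len(actual) if actual else float("nan")
    sp = correct / len(predicted) if predicted else float("nan")
    return sn, sp


# Hypothetical coordinates: one missing exon and two wrong exons.
actual_exons = [(100, 200), (300, 400), (500, 650)]
predicted_exons = [(100, 200), (300, 400), (450, 480), (700, 760)]
print(exon_accuracy(actual_exons, predicted_exons))
```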


Spliced Alignment (Procrustes)

  • New genomic sequence

  • Selection of candidate exons: AUG --- GU initial exons; AG --- GU internal exons; AG --- UAA or UAG or UGA terminal exons

  • Filtration (based on the codon usage statistics)

  • Construction of all possible chains of candidate exons

  • Finding a chain with the maximum global similarity to the target protein
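The candidate selection step can be illustrated for internal exons, which start right after an AG and end right before a GU. A minimal sketch that enumerates every such span in an RNA string; the function name, the `min_len` parameter and the example sequence are assumptions, and the filtration, chaining and similarity steps are not shown.

```python
def candidate_internal_exons(rna, min_len=1):
    """Spans (start, end), end exclusive, preceded by AG and followed by GU."""
    acceptors = [i + 2 for i in range(len(rna) - 1) if rna[i:i + 2] == "AG"]
    donors = [i for i in range(len(rna) - 1) if rna[i:i + 2] == "GU"]
    return [(a, d) for a in acceptors for d in donors if d - a >= min_len]


rna = "CCAGAUGCCGUAA"          # hypothetical sequence
for start, end in candidate_internal_exons(rna):
    print(start, end, rna[start:end])
```

The number of candidates grows with the product of AG and GU occurrences, which is why the filtration step precedes chain construction.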



Predicted Exon Assembly (Procrustes)


PCR Primers Prediction (GenePrimer)

Exon 1085..1182 (98) hit using first 2 primers

Exon 1628..1676 (49) missed

Exon 1900..2001 (102) hit using first 8 primers

Exon 2110..2184 (75) missed

Exon 2516..2722 (207) hit using first 4 primers

Exon 3385..3472 (88) missed

Exon 3546..3746 (201) hit using first primer

...


GRAIL gene identification program

[Figure: POSSIBLE EXONS, FINAL EXON CANDIDATES, REFINED EXON POSITIONS]




Bibliography (GeneParser)

http://linkage.rockefeller.edu/wli/gene/list.html

and

http://www-hto.usc.edu/software/procrustes/fans_ref/

Gene Discovery Exercise

http://metalab.unc.edu/pharmacy/Bioinfo/Gene

