Computing in Molecular Biology



Hugues Sicotte

National Center for Biotechnology Information



Alignment methods

A sequence alignment can be represented as a dot plot, with the query sequence along one axis and the subject sequence along the other.

For a query of N letters against a subject sequence of M letters, building the plot requires M×N comparisons.
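The dot plot described above can be sketched in a few lines of Python. This is only an illustration: the function names `dot_plot` and `render` and the two short example sequences are assumptions, not part of the original slides. Every query letter is compared with every subject letter, which is where the M×N cost comes from.

```python
def dot_plot(query, subject):
    """Return an N x M grid of booleans: True where query[i] == subject[j]."""
    # One comparison per (query letter, subject letter) pair: N * M in total.
    return [[q == s for s in subject] for q in query]


def render(query, subject):
    """Render the dot plot as text, one row per query letter."""
    grid = dot_plot(query, subject)
    lines = ["  " + subject]
    for letter, row in zip(query, grid):
        lines.append(letter + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical toy sequences; a shared region shows up as a diagonal of '*'.
    print(render("MLIIKR", "KMLIIK"))
```

Diagonal runs of matches in the rendered grid correspond to ungapped aligned segments.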


Hashing methods

Hashing is a common method for accelerating database searches.

Query sequence: MLIIKRDELVISWASHERE

All overlapping words of size 3:

MLI, LII, IIK, IKR, KRD, RDE, DEL, ELV, LVI, VIS, ISW, SWA, WAS, ASH, SHE, HER, ERE

Compile a “dictionary” of words from the query sequence. Put each word in a look-up table that points to its original position in the sequence. Thus, given one word, you can tell whether it is in the query in a single operation.
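A minimal sketch of such a word dictionary, assuming a plain Python dict as the look-up table and 1-based positions; the function name `build_word_table` is hypothetical.

```python
from collections import defaultdict


def build_word_table(query, k=3):
    """Map each overlapping k-letter word to the 1-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i + 1)
    return dict(table)


table = build_word_table("MLIIKRDELVISWASHERE")

if __name__ == "__main__":
    # Membership is a single dictionary lookup.
    print("ELV" in table, table.get("ELV"))
    print("AAA" in table)
```

A query of N letters yields N-2 words of size 3, so the 19-letter example produces 17 entries.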


Index lookup

  • Each word is assigned a unique integer.

  • E.g. for a word of 3 letters made up of an alphabet of 20 letters:

  • 1. Assign a code to each letter, Code(l) (0 to 19).

  • 2. For a word of 3 letters L1 L2 L3 the code is

  • index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3)

  • 3. Have an array with a list of the positions that have that word.

[Figure: look-up array indexed by word code (AAA = 0, AAB = 1, ...); the entry for MLI holds 1, the position of that word in the query sequence.]
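The index formula can be checked with a short sketch. The letter ordering in `ALPHABET` is an assumption (the slides only say codes run from 0 to 19), and `word_index` and `build_index_array` are hypothetical names.

```python
# Assumed ordering of the 20 amino-acid letters; any fixed ordering works.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
CODE = {letter: i for i, letter in enumerate(ALPHABET)}


def word_index(word):
    """index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3), generalised to any length."""
    index = 0
    for letter in word:
        index = index * len(ALPHABET) + CODE[letter]
    return index


def build_index_array(query, k=3):
    """Array of 20^k lists; entry i holds the 1-based positions of the word with index i."""
    positions = [[] for _ in range(len(ALPHABET) ** k)]
    for i in range(len(query) - k + 1):
        positions[word_index(query[i:i + k])].append(i + 1)
    return positions


if __name__ == "__main__":
    array = build_index_array("MLIIKRDELVISWASHERE")
    print(word_index("MLI"), array[word_index("MLI")])
```

With this ordering M, L and I have codes 10, 9 and 7, so MLI maps to 10*400 + 9*20 + 7 = 4187. The array has 20^3 = 8000 slots for 3-letter words, which is why the lookup needs no searching.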


Hashing methods

Query sequence: MLIIKRDELVISWASHERE (all overlapping words of size 3, from MLI to ERE)

Building the dictionary for the query sequence requires (N-2) operations.

The database contains (M-2) words, and it takes only one operation to see whether each word is in the query.


Hashing methods

Scan the subject, looking up words in the dictionary.

Use word hits to determine where to search for alignments.

This marks the word hits in the dynamic programming matrix in (N-2)+(M-2) operations instead of M×N.

[Figure: dot plot of the query sequence against the subject sequence.]
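A rough sketch of the scan step, assuming a dict-based word table and 0-based positions. The names `word_hits` and `hits_by_diagonal` and the example subject sequence are illustrative only.

```python
from collections import defaultdict


def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs, 0-based, where a k-letter word matches."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):       # N - k + 1 insertions
        table[query[i:i + k]].append(i)
    hits = []
    for j in range(len(subject) - k + 1):     # M - k + 1 lookups
        for i in table.get(subject[j:j + k], ()):
            hits.append((i, j))
    return hits


def hits_by_diagonal(hits):
    """Group hits by diagonal (subject_pos - query_pos)."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append((i, j))
    return dict(diagonals)


if __name__ == "__main__":
    hits = word_hits("MLIIKRDELVISWASHERE", "AAELVISWASAA")
    print(hits_by_diagonal(hits))
```

Hits that share a diagonal are consecutive pieces of the same ungapped match, so diagonals with several hits are natural places to look for alignments.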


Hashing methods

Scan the database, looking up words in the dictionary.

Use word hits to determine where to search for alignments.

FASTA searches for alignments within a band around the word hits.

[Figure: dot plot of the query sequence against the subject sequence, with the search band marked.]


Hashing methods

Scan the database, looking up words in the dictionary.

Use word hits to determine where to search for alignments.

BLAST extends alignments outward from word hits.

[Figure: dot plot of the query sequence against the database sequence, with extensions from word hits.]
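The idea of extending from a word hit can be illustrated with a simplified, ungapped extension. This is a sketch, not BLAST's actual implementation: the match/mismatch scores, the `x_drop` cut-off and the function name `extend_hit` are all assumptions.

```python
def extend_hit(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=3):
    """Extend the word hit query[i:i+k] / subject[j:j+k] in both directions, no gaps.

    Extension in a direction stops when the running score falls more than
    x_drop below the best score seen so far. Returns
    (score, query_start, query_end) with query_end exclusive.
    """
    def extend(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    seed = sum(match if query[i + t] == subject[j + t] else mismatch
               for t in range(k))
    right, right_len = extend(i + k, j + k, +1)
    left, left_len = extend(i - 1, j - 1, -1)
    return seed + left + right, i - left_len, i + k + right_len


if __name__ == "__main__":
    q, s = "MLIIKRDELVISWASHERE", "AAELVISWASAA"
    score, start, end = extend_hit(q, s, 7, 2)   # seed word ELV
    print(score, q[start:end])
```

Only the neighbourhood of each hit is examined, which is why this is much cheaper than filling the whole M×N matrix.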


Database Search Space

The simplest database search can be viewed as one large dynamic programming problem, with all the database sequences concatenated one after another.

[Figure: query sequence plotted against the concatenated database sequence.]


Database Search Space

Which alignment is more significant?

[Figure: query sequence plotted against the concatenated database sequence, showing several candidate alignments.]


Database Search Space

A score can be used to judge alignments, but its absolute value is a function of the scoring parameters.

Match=+1, Mismatch=-1, Gap_open=5, Gap_extend=1

yields the same alignments as

Match=+10, Mismatch=-10, Gap_open=50, Gap_extend=10

Scores are therefore useful for relative ranking.

[Figure: query sequence plotted against the concatenated database sequence.]
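The effect of rescaling the parameters can be checked on a single gapped alignment. The sketch below assumes that a gap of length n costs gap_open + n * gap_extend (the slides do not state the convention); the function name `alignment_score` and the two aligned strings are made up for illustration.

```python
def alignment_score(a, b, match, mismatch, gap_open, gap_extend):
    """Score two equal-length aligned strings that use '-' for gaps.

    Assumed convention: a gap of length n costs gap_open + n * gap_extend.
    """
    assert len(a) == len(b)
    score = 0
    gap_in = None  # which sequence ('a' or 'b') the currently open gap is in
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if gap_in != side:
                score -= gap_open      # a new gap starts here
            score -= gap_extend
            gap_in = side
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score


small = alignment_score("ACGTTGCA", "ACG--GCT", 1, -1, 5, 1)
large = alignment_score("ACGTTGCA", "ACG--GCT", 10, -10, 50, 10)

if __name__ == "__main__":
    print(small, large)
```

Every alignment's score is multiplied by the same factor of 10, so the ranking of alignments, and therefore the best one, does not change.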


Database Search Space

To judge the relevance of an alignment, one needs to judge whether the match is significant.

E-value = Expect(S) is a function of the score, the database size and composition, and the query size.

It is the number of alignments with scores >= S expected if the query were a random sequence, given the database size and composition.

An Expect of 0.0 means a very good match that is unlikely to be random.

[Figure: query sequence plotted against the concatenated database sequence.]


Database searching

Compare one query sequence against an entire database.

> fasta myquery swissprot -ktup 2

A typical search has four basic elements: the search program (fasta), the query sequence (myquery), the sequence database (swissprot) and optional parameters (-ktup 2).


Database searching

With exponential database growth, searches keep taking more time.

> fasta myquery swissprot -ktup 2

searching ......


E-value

“Hits” can be sorted according to their E-value or their score.

The E-value, also known as the Expect value, is a function of the score, the database size and the query sequence length.

E-value: the number of alignments with a score >= S that you would expect to find if the database were a collection of random letters.

E.g. a score of 1 requires only 1 match, so there should be an enormous number of such alignments. One expects to find fewer alignments with a score of 5, and so on. Eventually, when the score is big enough, one expects to find only an insignificant number of alignments that could be due to chance.

E-values of less than 1e-6 (1×10^-6 in scientific notation) are usually very good, and for proteins E < 1e-2 is usually considered significant. It is still possible for a hit with E > 1 to be biologically meaningful, but more analysis is required to confirm that.

Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence…).
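The slides do not give a formula for the E-value. One commonly used form, the Karlin-Altschul estimate behind BLAST-style statistics, is E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below assumes that form; the default `lam` and `k` values are illustrative placeholders, since the real values depend on the scoring system, and the function name `expect_value` is hypothetical. FASTA, whose output appears in this deck, estimates its E() values differently.

```python
import math


def expect_value(score, query_len, db_len, lam=0.318, k=0.134):
    """Karlin-Altschul style estimate: E = K * m * n * exp(-lambda * S).

    lam and k are placeholder values; in practice they depend on the
    substitution matrix and gap penalties in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    for s in (20, 40, 60, 80):
        print(s, expect_value(s, query_len=150, db_len=30_000_000))
```

The estimate falls exponentially with the score and grows linearly with the size of the search space, which matches the qualitative description above.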


Database searching

The “hit list” gives titles and scores for matched sequences.

> fasta myquery swissprot -ktup 2

The best scores are: initn init1 opt z-sc E(77110)

gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)- 996 996 996 1262.1 0

gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL) 412 382 395 507.6 1.4e-21

gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI 238 133 316 407.4 5.4e-16

gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT- 153 98 190 253.1 2.1e-07

gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7 163 163 184 244.8 6.1e-07

gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN 164 164 170 227.2 5.8e-06

gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI 130 91 157 210.3 5.1e-05

gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1 125 125 148 199.7 0.0002

gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4 42 42 140 191.3 0.00058

gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9 128 73 139 188.7 0.00082

gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT- 76 76 133 181.0 0.0022

gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1 27 27 119 165.2 0.017

gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO 66 66 118 163.0 0.022

gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO 65 65 116 160.5 0.03

gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT- 52 52 117 160.3 0.031

gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO 66 66 115 159.3 0.035

gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO 66 66 112 155.5 0.057

gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN 73 73 112 155.4 0.058

gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN 76 76 110 153.8 0.072

gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP 58 58 104 138.5 0.51

gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE 47 47 103 137.8 0.56

gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T 63 63 98 131.3 1.3

gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA 58 58 99 129.4 1.6

gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA 70 48 91 122.9 3.7

gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE 75 50 92 121.9 4.3

gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU 36 36 85 121.3 4.7

gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC 36 36 84 120.0 5.5

gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA 45 45 90 118.9 6.3

gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA 48 48 92 117.4 7.6

gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED 59 59 89 117.0 8

gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC 48 48 97 117.0 8

gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO 38 38 83 116.8 8.3
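A hit list like the one above is easy to post-process. The sketch below assumes that each line ends with exactly five whitespace-separated numeric columns (initn, init1, opt, z-score, E-value), as in the output shown; the function names `parse_hit` and `significant` are hypothetical.

```python
def parse_hit(line):
    """Split one hit-list line into its description and five numeric columns."""
    # Split from the right so that numbers inside the title stay in the description.
    description, initn, init1, opt, z_score, e_value = line.rsplit(None, 5)
    return {
        "id": description.split()[0],
        "description": description,
        "initn": int(initn),
        "init1": int(init1),
        "opt": int(opt),
        "z_score": float(z_score),
        "e_value": float(e_value),
    }


def significant(lines, threshold=1e-2):
    """Keep hits whose E-value is below the threshold."""
    hits = [parse_hit(line) for line in lines if line.strip()]
    return [hit for hit in hits if hit["e_value"] < threshold]


if __name__ == "__main__":
    sample = [
        "gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL) 412 382 395 507.6 1.4e-21",
        "gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7 163 163 184 244.8 6.1e-07",
        "gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA 70 48 91 122.9 3.7",
    ]
    for hit in significant(sample):
        print(hit["id"], hit["opt"], hit["e_value"])
```

Splitting from the right matters for titles such as "HYPOTHETICAL 15.7", where a number is part of the description rather than a score column.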


Database searching

Detailed alignments are shown farther down in the output.

> fasta myquery swissprot -ktup 2

>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)

initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21

Smith-Waterman score: 395; 52.3% identity in 109 aa overlap

10 20 30 40 50

gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF

: X: .:.:: :.:: ::..:::::: : : : :..:: :.:..:::

gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF

10 20 30 40 50 60

60 70 80 90 100 110

gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK

....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.:

gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK

70 80 90 100 110 120

120 130 140

gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ

..

gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE

130 140 150 160 170 180

>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)

initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16

Smith-Waterman score: 316; 37.4% identity in 131 aa overlap

10 20 30 40

gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER

:.. :. .v^: :.. ..:::: ::.::::::. ::X :


Database Search Space

Some matches are not meaningful because they occur VERY often in the database, e.g.:

nucleotide AAA (from polyA)

biological repeated elements (retroposons, Alu)

low-complexity repeated patterns (CAGCAG, QQQ, KKK, …)

These elements should be FILTERED or MASKED to avoid generating false ‘hits’. It is ‘OK’ to align through them if they are near meaningful diagonal ‘hits’.

[Figure: query sequence plotted against the concatenated database sequence.]
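As a toy illustration of masking, the sketch below replaces long runs of a single repeated letter (such as QQQQ or AAAA) with X. This is a deliberately simple stand-in and not the filter used by real search programs, which also catch periodic and compositionally biased regions; the function name `mask_runs` and the `min_run` threshold are assumptions.

```python
import re


def mask_runs(sequence, min_run=4, mask_char="X"):
    """Replace every run of min_run or more identical letters with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), sequence)


if __name__ == "__main__":
    print(mask_runs("MSTQQQQQQLVKK"))
    print(mask_runs("GATTACAAAAAAAAAAGC"))
```

The masked sequence keeps its original length, so the positions of any remaining hits still refer to the unmasked query.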


Score and Statistics

Some amino acid mutations do not affect structure or function very much. Amino acids with similar physico-chemical and steric properties can often replace each other.

A good scoring system therefore penalizes mutations to a similar amino acid only lightly.

PAM matrices: Point Accepted Mutation. Defined in terms of a unit of divergence, 1 PAM, which corresponds to 1 accepted point mutation per 100 residues. For distant sequences use PAM250, while for closer sequences (like DNA) use PAM100. Some sites accumulate mutations while others don’t, so using the PAM100 matrix doesn’t mean that the sequences compared were 100% mutated.

BLOSUM: BLOCKS substitution matrices. Built from the BLOCKS database of multiple alignments, using only distant sequences. BLOSUM62 means that the proteins compared were never closer than 62% identity. BLOSUM50 matrices involve alignments of more distant sequences. BLOSUM matrices (BLOSUM62) are recommended for most protein alignments.


Scoring systems

A 4

R -1 5

N -2 0 6

D -2 -2 1 6

C 0 -3 -3 -3 9

Q -1 1 0 0 -3 5

E -1 0 0 2 -4 2 5

G 0 -2 0 -1 -3 -2 -2 6

H -2 0 1 -1 -3 0 0 -2 8

I -1 -3 -3 -3 -1 -3 -3 -4 -3 4

L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4

K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5

M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5

F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6

P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7

S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11

Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7

V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

A R N D C Q E G H I L K M F P S T W Y V

Some amino acid substitutions are more common than others

BLOSUM62

Substitution scores come from an odds ratio based on measured substitution rates

Figure 7.8
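The odds ratio behind a substitution score can be written as s(i,j) = log(q_ij / (p_i * p_j)), scaled and rounded to an integer, where q_ij is the observed frequency of the aligned pair and p_i, p_j are the background frequencies of the two residues. The sketch below assumes half-bit units (2 * log2), the scaling commonly quoted for BLOSUM62; the frequencies in the example are invented for illustration and are not taken from the BLOCKS data, and `log_odds_score` is a hypothetical name.

```python
import math


def log_odds_score(q_ij, p_i, p_j, units_per_bit=2):
    """Rounded log-odds score: units_per_bit * log2(observed / expected)."""
    return round(units_per_bit * math.log2(q_ij / (p_i * p_j)))


if __name__ == "__main__":
    # Invented frequencies: a pair seen 4 times more often than chance predicts.
    print(log_odds_score(0.01, 0.05, 0.05))
    # A pair seen 4 times less often than chance predicts.
    print(log_odds_score(0.000625, 0.05, 0.05))
```

Pairs observed more often than expected by chance get positive scores, pairs observed less often get negative scores, and a pair observed exactly as often as chance predicts scores 0.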


Scoring systems

A 4

R -1 5

N -2 0 6

D -2 -2 1 6

C 0 -3 -3 -3 9

Q -1 1 0 0 -3 5

E -1 0 0 2 -4 2 5

G 0 -2 0 -1 -3 -2 -2 6

H -2 0 1 -1 -3 0 0 -2 8

I -1 -3 -3 -3 -1 -3 -3 -4 -3 4

L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4

K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5

M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5

F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6

P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7

S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11

Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7

V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

A R N D C Q E G H I L K M F P S T W Y V

Identities get positive scores, but some are better than others

BLOSUM62

Figure 7.8


Scoring systems

A 4

R -1 5

N -2 0 6

D -2 -2 1 6

C 0 -3 -3 -3 9

Q -1 1 0 0 -3 5

E -1 0 0 2 -4 2 5

G 0 -2 0 -1 -3 -2 -2 6

H -2 0 1 -1 -3 0 0 -2 8

I -1 -3 -3 -3 -1 -3 -3 -4 -3 4

L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4

K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5

M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5

F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6

P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7

S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11

Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7

V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

A R N D C Q E G H I L K M F P S T W Y V

Some non-identities have positive scores, but most are negative

BLOSUM62

Figure 7.8


BLAST and BLAST2SEQUENCES

BLAST is a database search engine that uses hashing to accelerate the search.

blastn (nucleotide query against nucleotide database)

blastp (protein query against protein database)

blastx (nucleotide query against protein database)

- translates a nucleotide query in all 6 reading frames and compares it to a protein database.

tblastn (protein query against nucleotide database)

- compares a protein against a nucleotide database translated in all 6 reading frames.

tblastx (nucleotide query against nucleotide database)

- compares a nucleotide sequence against a nucleotide database by translating the query and database in all 6 reading frames. Very slow!

A pairwise alignment implementation of this program is available at:

http://www.ncbi.nlm.nih.gov/gorf/bl2.html
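The six-frame translation used by blastx, tblastn and tblastx can be sketched as follows. The code assumes the standard genetic code and an unambiguous A/C/G/T input; the frame labels (+1 to +3, -1 to -3) and the function names are illustrative and are not taken from the BLAST source.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate both strands at offsets 0, 1 and 2."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

tblastx has to do this for both the query and every database sequence, which is one reason it is so slow.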


Protein BLAST databases

nr All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF

month All new or revised GenBank CDS translations + PDB + SwissProt + PIR + PRF released in the last 30 days.

swissprot Last major release of the SWISS-PROT protein sequence database (no updates)

Drosophila Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP).

yeast Yeast (Saccharomyces cerevisiae) genomic CDS translations

ecoli Escherichia coli genomic CDS translations

pdb Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank

kabat [kabatpro] Kabat's database of sequences of immunological interest

alu Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.


Nucleotide BLAST databases

nr All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".

month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.

Drosophila genome Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP).

dbest Database of GenBank+EMBL+DDBJ sequences from EST Divisions

dbsts Database of GenBank+EMBL+DDBJ sequences from STS Divisions

htgs Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)

gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.

yeast Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences

E. coli Escherichia coli genomic nucleotide sequences


Nucleotide BLAST databases (continued)

pdb Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank

kabat [kabatnuc] Kabat's database of sequences of immunological interest

vector Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/

mito Database of mitochondrial sequences

alu Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.

epd Eukaryotic Promoter Database, found on the web at http://www.genome.ad.jp/dbget-bin/www_bfind?epd


BLASTN SEARCH (M29204)

Search nucleotide sequence M29204 against nr.

http://www.ncbi.nlm.nih.gov/blast/blast.cgi?Jform=1


BLASTP and filtering

Search using blastp against nr, first with filtering ON (the default),

then with filtering OFF.

>GCF

MKKRVTNRERHWTHRRRRQRTRKKKKKKKRVLGRRALGPRPWLTGRKGLFGSARLIPATA


BLASTN vs BLASTX

Search U15595 using blastn against nr (nucleotide).

Now search using blastx against nr (protein).

Now search using blastx against ALU.


TBLASTX against dbEST

Search using tblastx against dbEST.

This picks up homologs based on the protein homology of the translations.

>OCRL-selected mRNA, partial sequence
TTGAACATCATGAAACATGAGGTTGTCATTTGGTTGGGAGATTTGAATTATAGACTTTGCATGCCTGATGCCAATGAGGTGAAAAGTCTTATTAATAAGAAAGACCTTCAGAGACTCTTGAAATTCGACCAGCTAAATATTCAGCGCACACAGAAAAAAGCTTTTGTTGACTTCAATGAAGGGGAAATCAAGTTCATCCCCACTTATAAGTATGACTCTAA


Prosite search

Search prosite for

NP_000271 (Pax6a)

http://www.expasy.ch/prosite


PHI-BLAST search

Search the Prosite db using NCBI’s PHI-BLAST (Pattern-Hit Initiated BLAST) with the pattern for Pax6a.

[LIVMFYG]-[ASLVR]-X(2)-[LIVMSTACN]-X-(4)-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W

-e 2e-14
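A PROSITE-style pattern can be turned into a regular expression for quick local checks. The sketch below handles only a subset of the syntax (residue classes in square brackets, excluded residues in braces, x or X for any residue, and repeat counts in parentheses); `prosite_to_regex` is a hypothetical helper and not part of PHI-BLAST. The pattern as printed above contains the element "X-(4)", which this parser rejects, so the example uses only the leading portion of the pattern, and the test sequence is invented.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, low, high = m.groups()
        if residue in ("x", "X"):
            residue = "."                             # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"      # excluded residues
        if low is not None:
            residue += "{" + low + ("," + high if high else "") + "}"
        parts.append(residue)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[LIVMFYG]-[ASLVR]-X(2)-[LIVMSTACN]")
    print(regex)
    print(re.search(regex, "QQLAKKLQQ"))
```

A pattern match like this only marks candidate positions; PHI-BLAST then uses those positions to seed and evaluate alignments around them.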


PSI-BLAST search

Search AB026911 using PSI-BLAST (at NCBI).

PSI stands for Position-Specific Iterated.

At each iteration, PSI-BLAST modifies the scoring matrix as a function of the conserved or unconserved residues in the alignments.


Online tutorials

Details of Blast methodology.

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Blast usage and Tutorial

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

Quick overview of terminology.

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html

