Discover how to use Sequence Similarity Search to find new functions based on similar sequences within databases. Learn about BLAST algorithms for efficient searching and alignment, scoring systems for amino acid matches, and interpreting BLAST results for significant sequence similarities.
Lesson 3 Database Similarity Search
Sequence similarity search is a key to discovering new functions
Basic assumption: similar sequences => similar function. WHY?
• They have the properties required to carry out the function
• They come from the same origin
[Figure: a new sequence of unknown function is compared against a sequence database; a similar database sequence suggests a similar function.]
Searching databases for similar sequences
Because the databases are so numerous and so large, using an exact algorithm to compare a sequence (the query) to every sequence in the databases is not feasible.
Solution: use a heuristic (approximate) algorithm.
Heuristic strategy
• Use efficient search strategies
• Preprocess the database into a new data structure that enables fast access
BLAST: Basic Local Alignment Search Tool (Altschul et al. 1990)
• General idea: a good alignment contains subsequences of high identity (local alignment):
ACGCCCGGGAGCGC
CTGGGCGTATAGCCC
• First, identify (as efficiently as possible) short, almost exact matches.
• Next, extend them to longer regions of similarity.
• Finally, optimize the alignment using an exact algorithm.
As with pairwise sequence alignment, BLAST can be used for DNA/RNA (nucleotide) sequences or for protein (amino acid) sequences
• BLASTN (nucleotide)
• BLASTP (protein)
DNA/RNA vs protein alphabet
DNA (4): A T G C, where A T = A G …
RNA (4): A U G C, where A T = A G …
Protein (20): ACDEFGHIKLMNPQRSTVWY, where A G >> A W …
WHY is it different?
[Figure: the 20 amino acids; A, G and W are labeled.]
BLAST (protein sequence example)
1. Identify (as efficiently as possible) short, almost exact matches between the query sequence and the database.
Query sequence: …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA
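The word list on this slide can be reproduced with a sliding window. This is a minimal sketch; the function name `words` and its parameter `k` are illustrative and do not come from BLAST itself.

```python
def words(sequence: str, k: int = 3) -> list[str]:
    """Return every overlapping word of length k, in left-to-right order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(words("FSGTWYA"))  # ['FSG', 'SGT', 'GTW', 'TWY', 'WYA']
```

A sequence of length L yields L - k + 1 words, so a sequence shorter than k yields none.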
BLAST: preprocessing of the database
Seq 1 FSGTWYA: FSG, SGT, GTW, TWY, WYA
Seq 2 FDRTSYV: FDR, DRT, RTS, TSY, SYV
Seq 3 SWRTYVA: SWR, WRT, RTY, TYV, YVA
…
Bag of words (BOW), linked to sequences such as Seq 1, Seq 102, Seq 3546:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG… SVT… GSW… TWF… WYS…
BLAST
Query sequence: …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA…
DATABASE:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS…
SEQ N: INVIEIAFDGTWTCATTNAMHEWASNINETEEN
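The preprocessing and lookup steps can be sketched as an inverted index from words to the places where they occur. This is a simplified illustration, not BLAST's actual data structure: it matches exact words only, whereas the word list on the slides also contains near matches such as YSG, which would need a scoring matrix. The names `build_word_index` and `find_seeds` are hypothetical.

```python
from collections import defaultdict


def build_word_index(database: dict[str, str], k: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Map every word of length k to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query: str, index: dict[str, list[tuple[str, int]]], k: int = 3) -> list[tuple[str, int, int]]:
    """Return (sequence name, query position, database position) for each exact word hit."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((name, qpos, dpos))
    return seeds


# Sequences taken from the slides.
database = {
    "Seq 2": "FDRTSYV",
    "Seq 3": "SWRTYVA",
    "Seq N": "INVIEIAFDGTWTCATTNAMHEWASNINETEEN",
}
index = build_word_index(database)
print(find_seeds("FSGTWYA", index))
```

The index is built once per database, so each query only pays for one dictionary lookup per word instead of a scan of every sequence.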
BLAST
2. Extend word pairs as far as possible (no gaps) until the local alignment score meets or exceeds a threshold or cutoff score (t). The resulting segments are High-scoring Segment Pairs (HSPs).
Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD
D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN
3. Finally, optimize the alignment using an exact algorithm.
Q = query sequence, D = sequence in database
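The ungapped extension step can be sketched as follows. Several choices here are assumptions for illustration and are not stated on the slides: scoring is +1 for a match and -1 for a mismatch (BLASTP would use an amino acid substitution matrix), and extension in each direction stops once the running score falls more than `max_drop` below the best score seen. The function name `extend_seed` is hypothetical.

```python
def extend_seed(query, target, qpos, tpos, k=3, match=1, mismatch=-1, max_drop=2):
    """Extend an exact k-letter seed left and right without gaps.

    Returns (score, query_start, target_start, length) of the best segment found.
    """
    def score(a, b):
        return match if a == b else mismatch

    def extend(q_i, t_i, step):
        # Walk in direction `step`; remember the best gain and how many letters it covers.
        best = running = 0
        kept = n = 0
        while 0 <= q_i < len(query) and 0 <= t_i < len(target):
            running += score(query[q_i], target[t_i])
            n += 1
            if running > best:
                best, kept = running, n
            elif best - running > max_drop:
                break
            q_i += step
            t_i += step
        return best, kept

    seed_score = sum(score(query[qpos + i], target[tpos + i]) for i in range(k))
    right_gain, right_len = extend(qpos + k, tpos + k, +1)
    left_gain, left_len = extend(qpos - 1, tpos - 1, -1)
    return (seed_score + left_gain + right_gain,
            qpos - left_len, tpos - left_len,
            left_len + k + right_len)


# Seed "GTW" from the slide: position 12 in Q and position 9 in D.
Q = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
D = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
print(extend_seed(Q, D, 12, 9))
```

A segment would then be kept as an HSP only if its score reaches the cutoff t; that comparison is left to the caller.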
Treating gaps in BLAST
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
BLAST by definition is a local alignment tool
Sometimes we want to include gaps in alignments!
• Standard solution: affine gap model, w_x = g + r(x - 1)
w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length
• One-off cost for opening a gap
• Lower cost for extending the gap
• Changes required to the algorithm
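The affine gap formula w_x = g + r(x - 1) is small enough to compute directly. In this sketch the default values g = 10 and r = 1 are illustrative assumptions, not BLAST defaults.

```python
def affine_gap_penalty(length: int, gap_open: float = 10, gap_extend: float = 1) -> float:
    """Total penalty w_x = g + r(x - 1) for a gap of `length` positions."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)


for x in (1, 2, 10):
    print(x, affine_gap_penalty(x))
```

With these values one gap of length 10 costs 19, while ten separate gaps of length 1 cost 100, so the model favors a few long gaps over many short ones.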
Running BLAST to predict the function of a new protein
>Arrestin protein (C. elegans)
MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKG
IGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQF
GSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPF
GCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKK
LAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTAL
PGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR
How to interpret a BLAST score:
• The score is a measure of the similarity of the query to the sequence shown.
How do we know if the score is significant?
- Statistical significance
- Biological significance
How to interpret a BLAST search:
For each BLAST score we can calculate an expectation value (E-value).
The E-value is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
BLAST E-value: E = K·m·n·e^(-λs)
• Increases linearly with the length of the query sequence
• Decreases exponentially with the score of the alignment
• Increases linearly with the length of the database
m = length of query; n = length of database; s = score
K, λ: statistical parameters dependent upon the scoring system and background residue frequencies
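The three proportionalities on this slide correspond to the Karlin-Altschul form E = K·m·n·e^(-λs). The sketch below evaluates it; the values K = 0.1 and λ = 0.3 are placeholders chosen only for illustration, since the real values depend on the scoring system and the background residue frequencies. The function name `expect_value` is hypothetical.

```python
import math


def expect_value(score: float, query_len: int, db_len: int, K: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * s): expected number of chance hits scoring at least s."""
    return K * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters, for illustration only.
K, LAM = 0.1, 0.3
m, n = 350, 1_000_000

for s in (40, 60, 80):
    print(s, expect_value(s, m, n, K, LAM))
```

Doubling either the query length or the database length doubles E, while each additional score point multiplies E by e^(-λ).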
What is a good E-value? (Rule of thumb)
• E-values of less than 0.00001 show that sequences are almost always related.
• Larger E-values can represent functional relationships as well.
• Sometimes a real (biological) match has an E-value > 1.
• Sometimes a short exact match and a long, less exact match have similar E-values.
How to interpret a BLAST search:
• The score is a measure of the similarity of the query to the sequence shown.
How do we know if the score is significant?
- Statistical significance
- Biological significance
(How) can we decide if two sequences really have the same function?
Homologs = come from a common origin => have the same function
Homologous proteins = come from a common origin => have the same function
[Figure: Last Universal Common Ancestor]
Homology: rule of thumb
- Proteins are homologous if they are 25%-35% identical
- DNA sequences are homologous if they are 70% identical
Can we always go by the rules?
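Percent identity, the quantity these rules of thumb refer to, can be computed from a pairwise alignment as sketched below. The denominator is an assumption: here it is the number of aligned columns, gaps included, and other definitions (for example, dividing by the shorter sequence length) give different numbers. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both rows are gaps; they carry no information.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * identical / len(columns)


print(percent_identity("ACGT", "ACGA"))  # 75.0
```

Identity alone says nothing about alignment length, so it is best read together with the E-value.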
Alignment between the worm and human arrestin: VERY SIGNIFICANT, NOT HIGH IDENTITY
Assessing whether proteins are functionally homologous
• High levels of RBP4 (retinol binding protein 4) and PAEP (pregnancy associated protein) were found to be correlated with pre-eclampsia
• High levels of RBP4 were found to be correlated with childhood obesity
RBP4 = carrier of vitamin A in the blood
PAEP = pregnancy associated protein
Are they functionally homologous? [Figure: PAEP and RBP4]
Assessing whether proteins are functionally homologous
RBP4 (retinol binding) and PAEP (pregnancy protein): E-value = 0.49; identity = 24%
Are they functionally homologous?
The lipocalin protein family (each dot is a protein)
[Figure labels: PAEP, RBP4, retinol-binding protein, odorant-binding protein, apolipoprotein D]
Are they functionally homologous? PAEP and RBP4 belong to the same protein family = they have a common ancestor. Their functions have probably diverged.