
Protein Analysis Course

Day 1: Databases, dotplots and pairwise alignment

Today's timetable
  • Databases and file formats
    • Exercises
  • Dotplot and pairwise alignment
    • Exercises

Coffee breaks during the exercises

Databases and file formats
  • Sequence file format
    • FASTA
  • UniProt (Universal protein resource)
    • Primary structure
  • PDB (Protein Data Bank)
    • Tertiary structure
Sequence file format
  • FASTA (a.k.a. Pearson format)
    • Most commonly used
    • Can easily be constructed by hand if needed
    • Straightforward way to store multiple sequences: just concatenate multiple FASTA files
    • Content:
      • First line (header line) always starts with the symbol ">" followed by identifiers and descriptions
      • The header is ALWAYS exactly one line, placed before the sequence
      • The sequence starts on the line after the header (presented using single-letter codes)
      • Sequence is normally divided into multiple lines (often required)
      • Recommended maximum line length is 80 characters (this also applies to the header line)
FASTA

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
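
The format rules above are enough to write a small reader. The Python sketch below is only an illustration: the function name `read_fasta` is not part of the course material, and it assumes well-formed input in which every record starts with a ">" header line.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are simply concatenated.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
"""

for name, sequence in read_fasta(example):
    print(name, len(sequence))
```

Because records are delimited only by the ">" lines, concatenating several FASTA files gives text that the same function can read.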

Databases: UniProt
  • UniProt is the universal protein resource, a central repository of protein data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the world's most comprehensive resource on protein information [wikipedia]
  • UniProt provides three core databases:
    • The UniProt Archive (UniParc) provides a stable, comprehensive sequence collection without redundant sequences by storing the complete body of publicly available protein sequence data
    • The UniProt Reference Clusters (UniRef) databases provide non-redundant reference data collections based on the UniProt knowledgebase in order to obtain complete coverage of sequence space at several resolutions
    • The UniProt Knowledgebase (UniProtKB) is the central database of protein sequences with accurate, consistent, and rich sequence and functional annotation
UniProt Archive (UniParc)
  • Comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world
  • Currently UniParc contains protein sequences from the following publicly available databases:
    • EMBL/DDBJ/GenBank nucleotide sequence databases
    • Ensembl
    • European Patent Office (EPO)
    • FlyBase
    • H-Invitational Database (H-Inv)
    • International Protein Index (IPI)
    • Japan Patent Office (JPO)
    • PIR-PSD
    • Protein Data Bank (PDB)
    • Protein Research Foundation (PRF)
    • RefSeq
    • Saccharomyces Genome database (SGD)
    • TAIR Arabidopsis thaliana Information Resource
    • TROME
    • USA Patent Office (USPTO)
    • UniProtKB/Swiss-Prot, UniProtKB/Swiss-Prot protein isoforms, UniProtKB/TrEMBL
    • Vertebrate Genome Annotation database (VEGA)
    • WormBase
UniProt Reference Clusters (UniRef)
  • Sequence clusters, used to speed up similarity searches
  • UniRef100
    • Cluster is composed of sequences that are identical
  • UniRef90
    • Cluster is composed of sequences that have at least 90% sequence identity
  • UniRef50
    • Cluster is composed of sequences that have at least 50% sequence identity
Protein knowledgebase (UniProtKB)
  • Is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation
  • Consists of two sections:
    • Swiss-Prot, which is manually annotated and reviewed by curators
    • TrEMBL, which is automatically annotated and is not reviewed
UniProt entry
  • Every line in an entry begins with a two-letter identifier
  • UniProt format closely resembles EMBL format except that considerably more information about physical and biochemical properties is provided
  • More information here
Databases: PDB
  • Founded in 1971 by Brookhaven National Laboratory, New York.
  • Transferred to the Research Collaboratory for Structural Bioinformatics (RCSB) in 1998.
  • Currently it holds more than 55,000 released structures.
PDB
  • Methods used to solve 3d structure:
    • X-ray: 86%
    • NMR: 13%
    • Electron Microscopy: 0.7%
    • Other: 0.3%
PDB file format
  • Text file – you can edit with a text editor e.g. WordPad
  • Atomic co-ordinates
  • Rich annotation
    • Citation
    • Experimental Method
    • Biological source
    • Etc.
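
Since a PDB file is plain text with fixed-width columns, the atomic co-ordinates can be read with simple string slicing. The following Python sketch assumes the standard ATOM/HETATM column layout; the function names are illustrative and the two sample lines carry made-up co-ordinates, not data from a real entry.

```python
def parse_atom_line(line):
    """Extract a few fields from one fixed-width ATOM/HETATM record."""
    return {
        "serial": int(line[6:11]),          # columns 7-11
        "atom_name": line[12:16].strip(),   # columns 13-16
        "residue_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],               # column 22
        "residue_number": int(line[22:26]),  # columns 23-26
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
    }


def read_atoms(text):
    """Return the parsed ATOM/HETATM records found in PDB-formatted text."""
    return [
        parse_atom_line(line)
        for line in text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    ]


# Made-up sample records for illustration only.
sample = (
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67\n"
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38\n"
)

for atom in read_atoms(sample):
    print(atom["atom_name"], atom["residue_name"], atom["x"], atom["y"], atom["z"])
```

Annotation records (citation, experimental method, source) use other record names and would need their own handling.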
FYI: Errors in databases
  • Be aware of errors in the databases:
    • sequence errors:
      • genome projects' error rate is 1 per 10,000 nucleotides;
      • ESTs' error rate is 1 per 100 nucleotides.
    • annotation errors:
      • Automated computer programs do not always give correct annotations.
      • Swiss-Prot is a protein database curated and annotated manually by biologists. It is the most reliable database, but it is not always up to date.
Exercises
  • Go to the course web page and start with exercises given in file: database_exercises.doc
  • http://ekhidna.biocenter.helsinki.fi/how
Pairwise sequence alignments
  • Motivation – Why alignments?
  • Sequence comparison
    • Dotplot
    • The alignment problem
  • Pairwise alignment algorithms
    • Exact algorithms
    • Heuristic algorithms
    • Database searches
  • Web tools:
    • Build alignments using EBI server,
    • Blast at NCBI, EBI,
    • PairsDB, …
Motivation
  • Proteins perform most of the functions required in biological systems:
    • Signaling (kinases, ...)
    • Enzymes (proteases, …)
    • Structural (collagen, elastin, …)
    • Immune system (antibodies, ...)
    • Storage and transport (hemoglobin, …)
  • Large amount of information available in current databanks.
  • Goal: Want to extrapolate information about the function of a newly discovered sequence by comparing it to annotated sequences.
Does it make sense?
  • All functional information is ultimately contained within the sequence.
  • Proteins are evolutionarily related:
    • Selective pressure is on function, and thus on residues with functional role (eg: active site or structural key residues are conserved).
    • Modular nature of proteins.
  • Two sequences have the same structure if corresponding residues are similar enough on physico-chemical level.
Application of sequence alignment
  • Determining function of newly discovered genetic or protein sequences.
  • Identification of functional patterns/domains.
  • Predicting structure of proteins.
  • Determining evolutionary relationships among genes, proteins, and entire species.

Aligning and comparing sequences, and searching databases for similar sequences: a cornerstone of bioinformatics!

Pairwise alignment

Pairwise alignment = identification of residue-residue correspondence.

For the alignment to be meaningful, the correspondence should reflect the functional or evolutionary relationship

What criteria should we use to obtain biologically meaningful alignments?

?????     101 AGVIGTILLISYGIRRLIKKSPSDVKP 115
              ||:||.|||::|..|||.|:.|:||.|
GLP_HORSE  60 AGIIGIILLLAYVSRRLRKRPPADVPP 86

Terminology
  • Identity:
    • percentage of pairs of identical residues between two aligned sequences.
  • Similarity:
    • percentage of pairs of similar residues between two aligned sequences.
    • one must define what similar means. Eg:
      • as observed in well-studied, evolutionarily related protein families,
      • physico-chemical amino acid properties: hydropathy, size, …
  • Homology:
    • two sequences are homologous if and only if they have a common ancestor.
    • it's either yes or no.
    • Two types: orthology and paralogy
    • not to be confused with similarity!
    • don’t mix up with analogy
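
Identity, unlike homology, is a number that can be computed directly from an alignment. The Python sketch below counts identical pairs in two aligned rows; it assumes gaps are written as "-" and uses the alignment length as the denominator, which is one of several conventions in use. The function names are illustrative.

```python
def count_identities(row_a, row_b):
    """Number of alignment columns holding the same residue in both rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")


def percent_identity(row_a, row_b):
    """Identical columns as a percentage of the alignment length (assumed convention)."""
    return 100.0 * count_identities(row_a, row_b) / len(row_a)


# The two aligned segments from the "Pairwise alignment" slide.
query = "AGVIGTILLISYGIRRLIKKSPSDVKP"
glp_horse = "AGIIGIILLLAYVSRRLRKRPPADVPP"

print(count_identities(query, glp_horse), "identical positions")
print(round(percent_identity(query, glp_horse), 1), "% identity")
```

Similarity would be computed the same way once a definition of "similar" (for example a grouping of residues by physico-chemical properties) has been chosen.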
DotPlot
  • The simplest way of comparing two sequences:
    • A dot is placed where both sequence elements are identical.
  • Gives an overview of all possible alignments.
  • Each diagonal indicates a possible (ungapped) alignment
Filtering Out the Noise in Dotplots
  • Dots may be scored according to a sliding window and a similarity cutoff to reduce noise:
  • The smaller the window, the more noise.
  • With large windows, the sensitivity for short sequences is reduced.
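
The sliding-window filter can be sketched in a few lines of Python. This is an illustration under stated assumptions: a dot is placed at the start of a window pair when at least `cutoff` of the `window` positions are identical (real tools may score similarity with a substitution matrix instead), and the function names are not from any particular program.

```python
def dotplot(seq_a, seq_b, window=5, cutoff=3):
    """Return the set of (i, j) window start positions that pass the cutoff."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residues along the diagonal inside the window.
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= cutoff:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, residue in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else " " for i in range(len(seq_a)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


seq_1 = "LETVHKKLYAGQYQNAGQFCDDIWLMLDNA"
seq_2 = "LSTIKRKLDTGQYQEPWQYVDDVWLMFNN"
print(render(seq_1, seq_2, dotplot(seq_1, seq_2, window=5, cutoff=3)))
```

Setting `window=1, cutoff=1` reproduces the unfiltered plot in which every identical pair of residues gets a dot.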

[Figure: dotplots of the two sequences below, filtered with window size = 5 and similarity cutoff = 3]

LETVHKKLYAGQYQNAGQFCDDIWLMLDNA
| |   ||  ||||   |  || |||  |
LSTIKRKLDTGQYQEPWQYVDDVWLMFNN

Dotlet

At http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Let's find repeated domains in the following sequence:

> SLIT_DROME (P24014):

MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCTGLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVITTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSWLSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTLPDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLLLNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCESPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGRISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFEHLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCTCTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYNKLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQMKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNATCTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAKCMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHECKHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAVELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDPAQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLENKCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGNQCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY

DotPlot summary
  • Comparing a sequence with itself can be used to identify:
    • Repeated domains,
    • Regions of low complexity (eg, …GYCAAAAAAAAALK…).
  • Comparing two protein sequences can be used to identify:
    • Local regions of similarity,
    • Conserved protein domains.
The Pairwise Alignment Problem
  • Line up the diagonals using edit operations:
    • substitution (mutation)
    • gap or indel (insertion/deletion)

[Figure: sequence 1 plotted against sequence 2, with a substitution, a deletion and an insertion marked]

seq1 IGTILLISYGIRRLIKKSPSDVKP----LPSPDTDVP
     || |||  |  ||| |  | || |     || |   |
seq2 IGIILLLAYVSRRLRKRPPADVPPPASTVPSADAPPP

(The "----" run in seq1 is a gap.)

But there are many ways to align 2 sequences, so we need to score alignments to decide which is the best.

Scoring the Edit Operations
  • For example:
    • identical: +10 (it's good)
    • substitution: +2 for S-A, -1 for K-P, …
    • gap: -3

PSDVKP--P
| || |  |
PADVPPPAP

Score: +50+2-1+2*(-3) = 45
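
The same arithmetic can be written as a short Python sketch. Only the scores quoted on the slide are encoded; the slide elides the remaining substitution scores, so this illustrative function simply raises a `KeyError` for any pair it has no score for.

```python
IDENTICAL = 10
GAP = -3
# Only the two substitution scores given on the slide; other pairs are unknown.
SUBSTITUTION = {frozenset("SA"): 2, frozenset("KP"): -1}


def score_alignment(row_a, row_b):
    """Sum the per-column scores of two aligned rows (gaps written as '-')."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += GAP
        elif x == y:
            total += IDENTICAL
        else:
            total += SUBSTITUTION[frozenset((x, y))]
    return total


print(score_alignment("PSDVKP--P", "PADVPPPAP"))  # 5*10 + 2 - 1 + 2*(-3) = 45
```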

Choosing an appropriate scoring scheme is where biological information is introduced (eg, reward the evolutionarily most likely alignment).

Standard notation:

  • | for identical
  • : for very similar (eg, size and hydropathy)
  • . for somewhat similar (eg, size or hydropathy)
Gap penalty

TIL--------LISYGIRRLIK
TILKKSPSDVKLISYGIRRLIK

  • Different scores for
    • gap opening, eg: -5
    • gap extension, eg: L*(-1) with L=length of extension
    • the gap opening penalty is larger than the gap extension penalty

A few long gaps (as above) are better than many small gaps:

IG-TI--LYDL-SYYAG---IR
IGKIIPRL--LVAY--VLIGSR

[Figure: the first alignment with its gap opening and gap extension marked; gap score = -5 -6]
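
A scheme with separate opening and extension costs can be sketched in Python as follows. How L is counted differs between programs; here every position of a gap run pays the extension cost in addition to one opening cost per run, which is an assumption and may not match the counting used in the figure. The function name is illustrative.

```python
import re


def affine_gap_score(aligned_row, gap_open=-5, gap_extend=-1):
    """Total gap score of one aligned row, with gaps written as '-'."""
    total = 0
    for run in re.findall(r"-+", aligned_row):
        # One opening cost per run, plus one extension cost per gap position.
        total += gap_open + gap_extend * len(run)
    return total


one_long_gap = affine_gap_score("TIL--------LISYGIRRLIK")
many_small_gaps = (
    affine_gap_score("IG-TI--LYDL-SYYAG---IR")
    + affine_gap_score("IGKIIPRL--LVAY--VLIGSR")
)
print("one long gap:   ", one_long_gap)
print("many small gaps:", many_small_gaps)
```

Because each new run pays the opening cost again, the alignment with six short gaps is penalised far more heavily than the one with a single long gap.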

Gap penalty
  • Can also consider special penalty for gaps at end/beginning of alignment (eg, zero penalty).
  • Need to be careful in adjusting the gap score to the substitution score:
    • too strong a penalty → no gaps,
    • too weak a penalty → too many gaps.
  • Insertions and deletions have been found to occur in nature at significantly lower frequency than mutations.
Residue Substitution
  • A substitution score for each aa pair → a substitution matrix.
  • Most used: based on evolutionary relationship.
  • Two types:
    • PAM series,
    • BLOSUM series.
PAM (Percent Accepted Mutation)

[Figure: PAM250 matrix]

  • PAM1: observed mutations in carefully selected sets of closely related proteins (1572 sequences from 71 families). (1978)
  • Idea: observed substitutions are the result of 1 mutation (not many).
  • PAMn: iterate PAM1 n times to obtain substitution rate between more divergent sequences.

Which PAM matrix to use, depending on sequence identity:

PAM:        0    30   80   110  200  250
% identity: 100  75   60   50   25   20

BLOSUM (BLOck Substitution Matrix)
  • Based on a larger set than PAM is.
  • More recent than PAM. (1992)
  • Different approach than PAM:
    • not based on an explicit evolutionary model,
    • observed aa substitutions in a set of conserved aa patterns called blocks.
  • BLOSUMn: from blocks which are n% identical.
  • BLOSUM62: empirically shown to be among the best at detecting weak similarity.

[Figure: BLOSUM62 matrix]

Tips for using substitution matrices

Less divergent              More divergent
BLOSUM 80     BLOSUM 62     BLOSUM 45
PAM 1         PAM 120       PAM 250

  • Generally, BLOSUM matrices perform better than PAM for local similarity searches.
  • For database searches, the most commonly used matrix is BLOSUM62.
  • When comparing closely related proteins, one should use lower PAM or higher BLOSUM, for distantly related proteins higher PAM or lower BLOSUM matrices
  • Caution: substitution matrices are statistical in nature. In a given alignment, a substitution may or may not correspond to an actual mutation.
Pairwise Alignment Algorithms
  • Given a scoring scheme, an alignment algorithm tries to find the best alignment between 2 sequences according to that scheme.
  • Exact algorithms:
    • guaranteed to return an alignment with the best possible score.
  • Heuristic alignments:
    • not guaranteed to return best alignments.
    • but they are quicker (and hopefully still return good alignments).
  • Two types of alignment:
    • Global: forced over the entire length of 2 sequences.
    • Local: between substrings of 2 sequences.
Global vs Local Alignment
  • Global alignments:
    • are sensitive to gap penalties,
    • Assumes homology.
    • Outputs everything – either matches or gaps
    • can be used to compare 2 proteins with same function (in, eg, human/mouse).
  • Local alignments:
    • Can be used to look for conserved domains or motifs in 2 proteins,
    • search for local similarities in large sequences,
    • database searches,
    • scanning an entire genome with a short sequence.
    • Does not output everything – only the best hits
Exact Algorithms: Dynamic Programming

How can we find the best alignment between 2 sequences?

  • Exhaustive search among all possible alignments is not possible (eg, for 2 sequences of 100 and 95 residues: 55 million possible alignments with 5 gaps).
  • Problem solved by dynamic programming:
    • initialize top row and left column,
    • compute best local scores iteratively,
    • keep track of where best local score comes from,
    • traceback to obtain the best alignments.
  • Several best solutions may exist: an alignment reported to you may be one among a number of possibilities.
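
The four steps above can be sketched as a minimal global alignment in Python. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, not the scheme used on the slide, and instead of storing back-pointers the traceback re-derives where each cell's score came from, which is equivalent. When several best alignments exist, this sketch returns only one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # 1. Initialize top row and left column with cumulative gap costs.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # 2. Fill the table: each cell takes the best of its three predecessors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # 3./4. Trace back from the bottom-right corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("ATTCTCTGA", "TACTGA")
print(best)
print(top)
print(bottom)
```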

Example of 2 best solutions (both with the best global score):

ATTCTCTGA
-TAC--TGA

ATTCTCTGA
-TA--CTGA

The example is from www.pasteur.fr

Local and global Alignment Servers (Exact Algorithm)

Use the Needleman-Wunsch algorithm (1970) and the Smith-Waterman algorithm (1981).

  • Server at EBI: EMBOSS-Align
    • Let's submit the following sequences to http://www.ebi.ac.uk/emboss/align/index.html:

>uniprot|P35858|ALS_HUMAN Insulin-like growth factor-binding protein complex

MALRKGGLALALLLLSWVALGPRSLEGADPGTPGEAEGPACPAACVCSYDDDADELSVFC

SSRNLTRLPDGVPGGTQALWLDGNNLSSVPPAAFQNLSSLGFLNLQGGQLGSLEPQALLG

LENLCHLHLERNQLRSLALGTFAHTPALASLGLSNNRLSRLEDGLFEGLGSLWDLNLGWN

SLAVLPDAAFRGLGSLRELVLAGNRLAYLQPALFSGLAELRELDLSRNALRAIKANVFVQ

LPRLQKLYLDRNLIAAVAPGAFLGLKALRWLDLSHNRVAGLLEDTFPGLLGLRVLRLSHN

AIASLRPRTFKDLHFLEELQLGHNRIRQLAERSFEGLGQLEVLTLDHNQLQEVKAGAFLG

LTNVAVMNLSGNCLRNLPEQVFRGLGKLHSLHLEGSCLGRIRPHTFTGLSGLRRLFLKDN

GLVGIEEQSLWGLAELLELDLTSNQLTHLPHRLFQGLGKLEYLLLSRNRLAELPADALGP

LQRAFWLDVSHNRLEALPNSLLAPLGRLRYLSLRNNSLRTFTPQPPGLERLWLEGNPWDC

GCPLKALRDFALQNPSAVPRFVQAICEGDDCQPPAYTYNNITCASPPEVVGLDLRDLSEA

HFAPC

>uniprot|O08770|GPV_RAT Platelet glycoprotein V precursor (GPV) (CD42D).

MLRSVLLSAVLSLVGAQPFPCPKTCKCVVRDAVQCSGGSVAHIAELGLPTNLTHILLFRM

DRGVLQSHSFSGMTVLQRLMLSDSHISAIDPGTFNDLVKLKTLRLTRNKISHLPRAILDK

MVLLEQLFLDHNALRDLDQNLFQKLLNLRDLCLNQNQLSFLPANLFSSLGKLKVLDLSRN

NLTHLPQGLLGAQIKLEKLLLYSNRLMSLDSGLLANLGALTELRLERNHLRSIAPGAFDS

LGNLSTLTLSGNLLESLPPALFLHVSWLTRLTLFENPLEELPEVLFGEMAGLRELWLNGT

HLRTLPAAAFRNLSGLQTLGLTRNPLLSALPPGMFHGLTELRVLAVHTNALEELPEDALR

GLGRLRQVSLRHNRLRALPRTLFRNLSSLVTVQLEHNQLKTLPGDVFAALPQLTRVLLGH

NPWLCDCGLWPFLQWLRHHLELLGRDEPPQCNGPESRASLTFWELLQGDQWCPSSRGLPP

DPPTENALKAPDPTQRPNSSQSWAWVQLVARGESPDNRFYWNLYILLLIAQATIAGFIVF

AMIKIGQLFRTLIREELLFEAMGKSSN
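
For comparison with the global case, the core of a local (Smith-Waterman style) alignment is sketched below in Python. It computes only the best local score; the scoring values (match +1, mismatch -1, linear gap -1) are illustrative assumptions, whereas a server would use a substitution matrix and separate gap opening and extension penalties.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between any substrings of a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))  # shared "ACG" scores 3
```

The two differences from the global algorithm are the zero floor in each cell and taking the maximum over the whole table rather than the bottom-right corner.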

Heuristic Algorithms
  • Motivations:
    • Exact algorithms are exhaustive but computationally expensive.
    • Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning),
    • and so, database scanning requires faster alignment algorithm (at the cost of optimality).
Heuristic Algorithms
  • Probing a database with a query is similar to aligning a query with a very long sequence.
  • Main idea:
    • Use dynamic programming, but limited to (sub-)sequences which are likely to produce interesting alignments with the query.
    • Heuristic part of the algorithm: eliminate from search uninteresting sequences (need to make a guess).
  • Algorithms:
    • FASTA : Lipman-Pearson (1985).
    • BLAST (Basic Local Alignment Search Tool) : Altschul et al. (1990).

→ We need fast local alignment methods.

BLAST Overview
  • Many versions for different query-database cases:
    • blastp: protein - protein
    • blastn: nucleotide - nucleotide
    • blastx: nucleotide → protein - protein
    • tblastn: protein - protein ← nucleotide
    • tblastx: nucleotide → protein - protein ← nucleotide
  • Comes in many flavours.
  • Fast and reliable.
  • Easy to use.
BLAST Overview
  • BLAST computes “an alignment”, not necessarily the exact optimal alignment.
  • Given the query and the database (long sequence):
    • Find all words of length k (default: k=3 for AA and k=11 for DNA) that match the query with a score high enough.
    • Look for subsequences in the database that contain these words.
    • Extend subsequences to see if match score can be increased.
    • Compute total score when no more extensions are possible.
  • Rank the alignments.
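
The seed-and-extend idea can be illustrated with a simplified Python sketch. It is not the BLAST implementation: seeds here are exact word matches only (BLAST also accepts similar words scoring above a threshold), the extension is ungapped with assumed scores of +1/-1, and the stopping rule `max_drop` is a stand-in for BLAST's drop-off parameter. All names are illustrative.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a word of length k matches exactly."""
    index = defaultdict(list)
    for s in range(len(subject) - k + 1):
        index[subject[s:s + k]].append(s)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, k=3, match=1, mismatch=-1, max_drop=2):
    """Ungapped extension of one seed; returns (score, q_start, q_end), q_end exclusive."""

    def walk(qi, si, step):
        score = best = 0
        length = best_length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_length = score, length
            elif best - score >= max_drop:
                break  # score has fallen too far below the best seen so far
            qi += step
            si += step
        return best, best_length

    right_score, right_length = walk(q + k, s + k, +1)
    left_score, left_length = walk(q - 1, s - 1, -1)
    total = k * match + right_score + left_score
    return total, q - left_length, q + k + right_length


query = "ACDEFGHIK"
subject = "WWWCDEFGHWW"
for q, s in find_seeds(query, subject):
    score, start, end = extend_seed(query, subject, q, s)
    print((q, s), score, query[start:end])
```

Several seeds on the same diagonal extend to the same segment; a real implementation merges them before ranking.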
BLAST at NCBI

>1IGR:A INSULIN-LIKE GROWTH FACTOR RECEPTOR

EICGPGIDIRNDYQQLKRLENCTVIEGYLHILLISKAEDYRSYR

FPKLTVITEYSLGDLFPNLTVIRGWKLFYNYALVIFEMTNLKDI

GLYNLRNITRGAIRIEKNADLCYLSTVDWSLILDAVSNNYIVGN

KPPKECGDLCPGTMEEKPMCEKTTINNEYNYRCWTTNRCQKMCP

STCGKRACTENNECCHPECLGSCSAPDNDTACVACRHYYYAGVC

VPACPPNTYRFEGWRCVDRDFCANILSAESSDSEGFVIHDGECM

QECPSGFIRNGSQSMYCIPCEGPCPKVCEEEKKTKTIDSVTSAQ

MLQGCTIFKGNLLINIRRGNNIASELENFMGLIEVVTGYVKIRH

SHALVSLSFLKNLRLILGEEQLEGNYSFYVLDNQNLQQLWDWDH

RNLTIKAGKMYFAFNPKLCVSEIYRMEEVTGTKGRQSKGDINTR

NNGERASCESDVDDDDKEQKLISEEDLN

Let's submit the query sequence at http://www.ncbi.nlm.nih.gov/BLAST/


Bit score: S’

The value S’ is derived from the raw alignment score S, but statistical properties of the scoring system have been taken into account. Because bit scores are normalised w.r.t. scoring system, they can be used to compare alignment scores from different searches.

E value: Expectation value.

Expected # of alignments with scores equivalent to or better than S to occur by chance. The lower the E value, the more significant the score.

NCBI Blast output help: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Blast_output.html

BLAST servers
  • Pairwise alignment:
    • BLAST: http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi
  • Database screening:
    • BLAST:
      • http://www.ncbi.nlm.nih.gov/BLAST/
      • http://www.ebi.ac.uk/blast/index.html
      • http://www.ch.embnet.org/software/bBLAST.html
      • http://www.ch.embnet.org/software/aBLAST.html

Remark: there is a server with a powerful implementation of Smith-Waterman for database screening: http://www.ebi.ac.uk/MPsrch/. It runs about 50 times slower, but is more sensitive and returns fewer false positives than BLAST.

PSI-BLAST
  • Position-Specific Iterated Blast:
    • More sensitive, ie better at detecting distant relationships, than BLAST.
    • Computes position-specific substitution matrices (PSSMs) to score matches between query and database sequences. (Blast uses precomputed substitution matrices, eg BLOSUM62.)
PSI-BLAST
  • Repeatedly searches the target databases.
  • At each round:
    • compute a multiple alignment of high scoring sequences to generate a new PSSM for next round of searching.
  • Iterates until no new sequences found (or until a maximal number of iteration is reached).
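
How a PSSM can be derived from a multiple alignment is sketched below in Python. This is a strong simplification of what PSI-BLAST does: it assumes a uniform background frequency, a flat pseudocount and no sequence weighting, and the three-residue alignment is a made-up toy input. Function names are illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """One log-odds column (dict residue -> score) per alignment position."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    pssm = []
    for pos in range(len(aligned_seqs[0])):
        # Gap characters are ignored because they are not in ALPHABET.
        counts = Counter(s[pos] for s in aligned_seqs if s[pos] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        column = {}
        for residue in ALPHABET:
            frequency = (counts[residue] + pseudocount) / total
            column[residue] = math.log2(frequency / background)
        pssm.append(column)
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


toy_alignment = ["ACD", "ACE", "ACD"]  # made-up example
pssm = build_pssm(toy_alignment)
print(score_with_pssm(pssm, "ACD"), score_with_pssm(pssm, "WWW"))
```

Unlike a fixed matrix such as BLOSUM62, the score for a residue depends on the position: a conserved column rewards its residue strongly and penalises everything else.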
Significance of Alignments
  • Scores cannot be used to rank alignments:
    • a bad but long alignment may have a higher score than a good but short alignment.
  • We need a normalized scoring scheme that allows us to compare alignments and evaluate their biological significance.
  • Idea:
    • Probe the database with random sequences.
    • This gives a distribution of scores (it follows the extreme-value distribution).
    • Establish a threshold for significance.
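
The idea can be tried on a small scale in Python. The sketch below is a stand-in, not the procedure used by any server: instead of probing a database, it shuffles the subject sequence to obtain random sequences, scores each against the query with a simple local alignment (assumed scores +1/-1/-1), and reports the fraction of random scores that reach the observed one. Names and parameter values are illustrative.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman recurrence, score only)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


def empirical_p_value(query, subject, n_random=200, seed=0):
    """Return (observed score, estimated P-value) from shuffled subjects."""
    rng = random.Random(seed)
    observed = local_score(query, subject)
    letters = list(subject)
    at_least_as_good = 0
    for _ in range(n_random):
        rng.shuffle(letters)  # same composition, random order
        if local_score(query, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (n_random + 1)


print(empirical_p_value("AGVIGTILLISYGIRRLIKKSPSDVKP", "AGIIGIILLLAYVSRRLRKRPPADVPP"))
```

With only a few hundred shuffles the smallest P-value that can be estimated is about 1/n_random, which is why programs fit the extreme-value distribution rather than rely on counting.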
Extreme-Value Distribution

[Figure: score distribution for random sequences; x-axis: score, with the score of our query marked]

P-value: probability that the score of our query is no better than random.

Difficulty: finding a significance threshold.

Quantifying the Significance of Alignments

For an alignment with raw score S:

  • P-value:
    • The probability of an alignment occurring with score S or better if the aligned-against sequence is random.
    • The lower the P-value, the more significant the alignment.
  • E-value:
    • Expected number of alignments with scores equivalent to or better than S to occur by chance only.
    • The lower the E-value, the more significant the alignment.
    • E-value = P-value * size of database.
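
The relation between the two values is a single multiplication, shown here as a small Python sketch. The hit names, P-values and database size are made-up numbers for illustration.

```python
def e_value(p_value, database_size):
    """Expected number of chance hits: E-value = P-value * size of database."""
    return p_value * database_size


database_size = 500_000                      # made-up database size
hits = [("hit_1", 1e-9), ("hit_2", 1e-4)]    # made-up (name, P-value) pairs

for name, p in hits:
    print(name, "P =", p, "E =", e_value(p, database_size))
```

The same P-value therefore becomes less significant as the database grows, which is why E-values from searches against different databases should not be compared directly.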
Rough Guide for P-values and E-values
  • P-Value (reported by many programs): 0 ≤ P-val ≤ 1
  • E-value (reported by some programs, eg PSI-Blast): 0 ≤ E-val ≤ size of database
Rules of thumb for pairwise alignment
  • Use server defaults in the absence of any other information.
  • Adjust the substitution matrix to the expected divergence of the 2 sequences. Use BLOSUM62 if no a priori information.
  • For distantly related sequences, use PSI-BLAST rather than BLAST. If PSI-BLAST doesn't give you anything, use GTG.
  • Many ways of aligning 2 sequences.
    • A returned alignment is not the absolute truth.
    • Inspect the alignment from the biologist's perspective.
Exercises
  • Go to the course web page and start with exercises given in file: p_alignment_exercises.doc
  • http://ekhidna.biocenter.helsinki.fi/how