Aug 27, 2008 Biochemistry 111 Thang Chiu, MEB 7E1, x2024
Presentation Transcript
slide1

Sequence Alignments and Database Searching

Aug 27, 2008

Biochemistry 111

Thang Chiu, MEB 7E1, x2024

Adapted from DKW lecture

slide2

Protein A of interest to you.

ornithine decarboxylase?

Why compare protein sequences?

Significant sequence similarities allow associations based upon known functions.

slide3

Homology vs. similarity

Possible for proteins to possess high sequence identity/similarity between segments and not be homologous

Homologous proteins (i.e. having similar structures) need not possess high sequence identity/similarity:

S. griseus trypsin 36%

S. griseus protease A 25%

Cytochrome c4 has reasonably high sequence identity/similarity with trypsins, yet has neither a common ancestor nor a common fold.

subtilisin has same spatial arrangement of active site residues, but is not related to trypsins

Extracted from ISMB2000 tutorial,

WR Pearson, U. of Virginia

slide4

Homology vs. similarity

  • Homologous proteins always share a common three-dimensional fold, often with common active or binding site.
  • Proteins that share a common ancestor are homologous.
  • Proteins that possess >25% identity across entire length generally will be homologous.
  • Proteins with <20% identity are not necessarily non-homologous.
slide5

Homology vs. similarity

Extracted from ISMB2000 tutorial,

WR Pearson, U. of Virginia

Orthologous cytochrome c isozymes

Homologous sequences are either: 1) orthologous, or 2) paralogous

Orthologs - sequence differences arise from divergence in different species (e.g. cytochrome c)

Paralogs - sequence differences arise after gene duplication within a given species (e.g. GPCRs, hemoglobins)

Hemoglobins contain both orthologs and paralogs

  • For orthologs - sequence divergence and evolutionary relationships will agree.
  • For paralogs - no necessary linkage between sequence divergence and speciation.
slide6

We’ve all seen and/or used sequence alignments, but how

are they accomplished?

Sequence searches and alignments using DNA/RNA are usually not as informative as searches and alignments using protein sequences. However, DNA/RNA searches are intuitively easier to understand:

AGGCTTAGCAAA........TCAGGGCCTAATGCG
|||||||| |||        ||||||||||| |||
AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG

The above alignment could be scored by giving +1 for each identical nucleotide, 0 for a mismatch, -4 for opening a gap, and -1 for each extension of the gap. Here there are 25 identities, 2 mismatches, and one 8-position gap (opening plus 7 extensions = 11). So score = 25 – 11 = 14
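The scoring arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name and parameters are made up for illustration, and it assumes the gap-opening penalty covers the first gap position, with each further position charged as an extension.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=-4, gap_extend=-1):
    """Score a pairwise alignment in which '.' or '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a in ".-" or b in ".-":
            # First gap position pays the opening penalty, later ones the extension.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


top = "AGGCTTAGCAAA........TCAGGGCCTAATGCG"
bottom = "AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG"
print(score_alignment(top, bottom))  # 25 identities - 11 gap cost = 14
```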

slide7

Protein sequence alignments are much more complicated.

How would this alignment be scored?

ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH
|      | | |     |         |||  | | ||   |||
AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH

Unlike nucleotide sequence alignments, which are either identical or not identical at a given position, protein sequence alignments include “shades of grey” where one might acknowledge that a T is sort of equivalent to an S etc. But how equivalent? What number would you assign to an S-T mismatch? And what about gaps? Since alanine is a common amino acid, couldn’t the A-A match be by chance? Since Trp and Cys are uncommon, should those matches be given higher scores?

Do you see that accurately aligning sequences and accurately finding related sequences are the same problem?

slide8

Needleman-Wunsch global sequence alignment (JMB (1970), 48, 443-453)

Assign score to all cells (1 = match, 0 = mismatch) for the sequences ABCNJR (columns) and AJCJNR (rows):

    A B C N J R
A   1 0 0 0 0 0
J   0 0 0 0 1 0
C   0 0 1 0 0 0
J   0 0 0 0 1 0
N   0 0 0 1 0 0
R   0 0 0 0 0 1

SUM S(i,j) with max of S(i,j) of previous column/row

Traceback

[Figure: partially summed and fully summed matrices for ABCNJR vs. AJCJNR, with two alternative traceback paths (one OR the other) through the summed matrix]
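A compact sketch of global alignment for the two sequences on this slide follows. It uses the now-standard forward dynamic-programming formulation rather than the exact summing procedure drawn on the slide, and it assumes the slide's scoring (match = 1, mismatch = 0) with a gap cost of 0. The function name and parameters are illustrative only.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal (align), up or left (gap).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner; ties mean several paths exist.
    i, j, top, bottom = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, aligned_a, aligned_b = needleman_wunsch("ABCNJR", "AJCJNR")
print(best)
print(aligned_a)
print(aligned_b)
```

Because several paths tie for the best score, the traceback shown is only one of the equally good alignments, which matches the slide's "OR".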

slide9

Databases

Nucleotide: GenBank (NCBI), EMBL, DDBJ

Protein: SwissProt, TrEMBL, GenPept(GenBank)

Huge databases that share much information. Many entries are linked to other databases (e.g. PDB). SwissProt is small but well “curated”. The NCBI non-redundant (nr) protein sequence database is very large but sometimes confusing.

These databases can be searched in a number of ways. Can search only human or metazoan sequences. Can eliminate entries made before a given date. Etc.

slide11

Continued….

NCBI

GI numbers: a series of digits that are assigned consecutively by NCBI to each sequence it processes.

Version numbers: consist of the accession number followed by a dot and a version number.

Nucleotide sequence: GI: 6995995, VERSION: NM_000492.2

Protein translation: GI: 6995996, VERSION: NP_000483.2

>gi|897557|gb|AAA98443.1| TIAM1 protein
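As a small illustration, the definition line above can be split into its GI number, source database tag, accession and version. This is a sketch that assumes the pipe-delimited layout shown on this slide; the function name and dictionary keys are made up for the example.

```python
def parse_defline(line):
    """Split a '>gi|number|db|accession.version| description' definition line."""
    header, _, description = line.lstrip(">").partition(" ")
    fields = header.split("|")
    accession, _, version = fields[3].partition(".")
    return {
        "gi": int(fields[1]),
        "db": fields[2],
        "accession": accession,
        "version": int(version),
        "description": description.strip(),
    }


record = parse_defline(">gi|897557|gb|AAA98443.1| TIAM1 protein")
print(record)
```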

http://www.ornl.gov/sci/techresources/Human_Genome/posters/chromosome/geneguide.shtml

slide12

We’ve got the data, now how do we score/search?

First, we need a way to assign numbers to “shades

of grey” matches.

Genetic code scoring system – This assumes that changes in protein

sequence arise from mutations. If only one point mutation is needed

to change a given AA to another (at a specific position in alignment),

the two amino-acids are more closely related than if two point mutations

were required.
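The genetic code scoring idea can be sketched by counting the minimum number of nucleotide changes needed to turn any codon of one amino acid into any codon of another. The snippet assumes the standard genetic code; the helper name is illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    CODONS.setdefault(aa, []).append(codon)


def min_point_mutations(aa1, aa2):
    """Smallest number of base changes converting a codon for aa1 into one for aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )


print(min_point_mutations("S", "T"))  # Ser -> Thr
print(min_point_mutations("W", "A"))  # Trp -> Ala
```

Under this scheme Ser and Thr are one point mutation apart, while Trp and Ala need two, so Ser/Thr would count as the more closely related pair.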

Physicochemical scoring system – a Thr is like a Ser, a Trp is not like

an Ala……

These systems are seldom used because they have problems. Why try to second guess Nature? Since there are many related sequences out there, we can look at some (trusted) alignments to SEE which substitutions have occurred and the frequency with which they occur.

slide13

PAM (Point Accepted Mutation) matrices

  • Are derived from studying global alignments of well-characterized protein families.
  • PAM1 = only 1% of residues have changed (i.e. short evolutionary distance).
  • Raise this to the 250th power to get 250% change between two sequences (greater evolutionary distance), or about 20% sequence identity.
  • Therefore, a PAM 30 would be used to analyze more closely related proteins, and a PAM 400 is used for finding and analyzing distantly related proteins.
  • PAMx = (PAM1)^x
  • (Dayhoff, Atlas of Protein Sequence and Structure, vol. 5, suppl 3, p 345-352)
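The relation PAMx = (PAM1)^x can be illustrated with a toy mutation matrix. The sketch below does NOT use Dayhoff's data: it assumes a hypothetical 20-letter alphabet in which every residue has a 99% chance of being unchanged per PAM unit and mutates to each other residue with equal probability. It only shows why 250 accepted changes per 100 residues still leaves some identity, since repeated and back mutations hit the same positions.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    while power:
        if power & 1:
            result = matmul(result, m)
        m = matmul(m, m)
        power >>= 1
    return result


N = 20  # alphabet size
# Toy "PAM1": 1% of residues change, spread evenly over the 19 alternatives.
toy_pam1 = [[0.99 if i == j else 0.01 / (N - 1) for j in range(N)] for i in range(N)]
toy_pam250 = matpow(toy_pam1, 250)

# Probability that a residue is (still or again) the same after 250 PAM units.
print(round(toy_pam250[0][0], 3))
```

With these uniform assumptions the remaining identity comes out at roughly 12%; the real Dayhoff matrices, where substitutions are far from uniform, give the roughly 20% quoted on the slide.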
slide14

Block substitution matrices (BLOSUM)

Are derived from studying local alignments (blocks) of sequences from related proteins that share no more than X% identity. (Henikoff & Henikoff, PNAS ‘92, 89, p10915-10919)

In other words, one might use the portions of aligned sequences from related proteins that have no more than 62% identity (in the portions or blocks) to derive the BLOSUM 62 scoring matrix.

One might use only the blocks that have <80% identity to derive the BLOSUM 80 matrix.

BLOSUM and PAM substitution matrix numbers run in opposite directions:

The higher the number of the BLOSUM matrix (BLOSUM X), the more closely related proteins you are looking for.

The higher the number of the PAM matrix (PAM X), the more distantly related proteins you are looking for.

slide15

Amino acid substitution matrices

  • Negative scores - unlikely substitutions

Note that for identical matches, scores vary depending upon observed frequencies. That is, rare amino acids (e.g. Trp) that are not substituted have high scores; frequently occurring amino acids (e.g. Ala) are down-weighted because of the high probability of aligning by chance.

PAM250 matrix

Extracted from ISMB2000 tutorial,

WR Pearson, U. of Virginia

slide16

Gap penalties: Intuitively one recognizes that there should be a penalty for introducing (requiring) a gap during identification/alignment of a given sequence. But if two sequences are related, the gaps may well be located in loop regions which are more tolerant of mutational events and probably have little impact on structure. Therefore, a new gap should be penalized, but extending an existing gap should be penalized very little.

Filtering: many proteins and nucleotides contain simple repeats or regions of low sequence complexity. These must be excluded from searches and alignments. Why?
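One way to see the problem: a low-complexity stretch (for example a run of Q) can match many unrelated sequences with high scores that say nothing about homology. A rough way to flag such regions is to measure the Shannon entropy of the residue composition in a sliding window. The sketch below is illustrative only; the window size and threshold are hypothetical choices, not the settings of any particular filtering program.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy (bits) of the residue composition of a sequence window."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(window).values())


def low_complexity_windows(seq, size=12, threshold=2.2):
    """Start positions of windows whose composition entropy is below threshold."""
    return [i for i in range(len(seq) - size + 1)
            if shannon_entropy(seq[i:i + size]) < threshold]


seq = "QQQQQQQQQQQQ" + "ARDTGQEPSSFW"
print(low_complexity_windows(seq))
```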

Significance of a “hit” during a search: More important than an arbitrary score is an estimation of the likelihood of finding a hit through pure chance. Ergo the “Expectation value” or E-value. E-values can be as low as 10^-70.

slide17

E-value

So, for sufficiently large databases (so can apply statistics):

E = Kmne-S

m- query length

n - database length

E - expectation value

K - scale factor for search space (database)

 - scale factor for scoring system

S - score, dependent on substitution matrix, gap-penalties, etc.

Doubling either sequence string doubles number of sequences with a given expectation value; similarly, double the score and expectation value decreases exponentially

Expectation value - probability that given score will occur by chance given the query AND database strings
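The behaviour of the E-value formula is easy to explore numerically. In this sketch the default values of K and λ are assumed, illustrative numbers only; the real values depend on the substitution matrix and gap penalties in use.

```python
import math


def e_value(score, m, n, K=0.041, lam=0.267):
    """E = K * m * n * exp(-lambda * S); K and lam are illustrative defaults."""
    return K * m * n * math.exp(-lam * score)


query_len, db_len = 300, 1e9
for s in (50, 100, 150):
    print(s, e_value(s, query_len, db_len))

# Doubling the query length doubles E; raising the score shrinks it exponentially.
print(e_value(100, 2 * query_len, db_len) / e_value(100, query_len, db_len))
```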

slide18

Removing length bias from scoring statistics

  • Must account for increases in similarity score due to increase in sequence length searched.
  • Scaling the score against the sequence length allows the detection of distantly-related sequences.
  • solids = individual sequence
  • opens = average score

Extracted from ISMB2000 tutorial,

WR Pearson, U. of Virginia

slide19

Global versus local alignments

  • Global scores require alignment of entire sequence length.
    • Cannot be used to detect relationships between domains in mosaic proteins.

Local alignments are necessary to detect domains within mosaic proteins, internal duplications.

Extracted from ISMB2000 tutorial,

WR Pearson, U. of Virginia

slide20

Basic local alignment search tool (BLAST)

  • 1) Break query up into “words”, e.g. ASTGHKDLLV gives the words AST, STG, TGH, ...
  • 2) Generate expanded list of words that would match with a score of at least T (e.g. using PAM250). You’re acknowledging that you may not have any exact matches with original list of words.
  • 3) Use expanded list of words to search database for exact matches.
  • 4) Extend alignments from where word(s) found exact match.
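Steps 1 to 3 can be sketched in a few lines. This is not the BLAST implementation: the scoring function below is a hypothetical stand-in for a real substitution matrix (identity = 4, same rough residue group = 1, otherwise -2), and the residue groups, function names and threshold are all assumptions made for the example.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
GROUPS = ["AG", "ST", "DE", "NQ", "KRH", "ILVM", "FYW"]  # hypothetical grouping


def toy_score(a, b):
    """Stand-in for a substitution matrix lookup."""
    if a == b:
        return 4
    if any(a in g and b in g for g in GROUPS):
        return 1
    return -2


def words(seq, w=3):
    """Step 1: all overlapping words of length w."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def neighborhood(word, threshold):
    """Step 2: every word scoring at least threshold against the given word."""
    return {"".join(c) for c in product(AMINO_ACIDS, repeat=len(word))
            if sum(toy_score(a, b) for a, b in zip(word, c)) >= threshold}


def seed_hits(query, subject, w=3, threshold=9):
    """Step 3: exact matches between subject words and the expanded word list."""
    index = {}
    for qpos, word in enumerate(words(query, w)):
        for neighbor in neighborhood(word, threshold):
            index.setdefault(neighbor, []).append(qpos)
    return [(qpos, spos)
            for spos, word in enumerate(words(subject, w))
            for qpos in index.get(word, [])]


print(words("ASTGHKDLLV")[:3])
print(seed_hits("ASTGHKDLLV", "MMATTGHK"))
```

Each hit is a (query position, subject position) pair; step 4 would extend the alignment in both directions from such seeds.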

Heuristic algorithm: uses guesses. Increases speed without a great loss of accuracy (BLASTP and FASTA are local and heuristic; Smith-Waterman (S-W) is local and rigorous; Needleman-Wunsch is global and rigorous).

slide21

Pictorial representation of BLAST algorithm

(Basic Local Alignment Search Tool).

Query sequence

Words (they overlap)

Expand list of words

Search database, find exact hits, extend alignments

Report sorted list of hits

slide22

BLAST

ATCGCCATGCTTAATTGGGCTT

CATGCTTAATT

exact word match

one hit

Nucleotide BLAST looks for exact matches

Protein BLAST requires two hits

GTQITVEDLFYNI

SEI YYN

neighborhood words

NCBI

two hits

slide23

FASTA

Instead of breaking up query into words (and then generating a list

of similar words), find all sequences in the database that contain

short sequences that are exact or nearly exact matches for sequences

within the query. Score these and sort. Sort of reverse methodology to

BLAST

Database sequence

Query sequence

slide28

sorted by e values

5 × 10^-98

S’ = (λS – ln K)/ln 2

E = mn 2^{-S’}
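The bit score S’ simply folds K and λ into the score, so that E can be computed from S’ and the search space alone. The sketch below checks that the two forms agree; the K and λ values are assumed for illustration and are not taken from the slide.

```python
import math

K, LAM = 0.041, 0.267  # illustrative values only


def bit_score(raw_score, K=K, lam=LAM):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


m, n, raw = 300, 1e9, 100
via_bits = e_from_bits(bit_score(raw), m, n)
direct = K * m * n * math.exp(-LAM * raw)
print(via_bits, direct)  # the two routes should agree
```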

link to entrez

LocusLink

slide29

Identifying distant homologies

(use several different query sequences)

Also remember - If A is homologous to B, and B to C, then A should be homologous to C

Examine output carefully. A lack of statistical significance doesn’t necessarily mean a lack of homology!

Extracted from ISMB2000 tutorial,

WR Pearson, U. of Virginia

slide30

PSI-BLAST

Very sensitive, but must not include a non-member sequence!

  • Regular BLAST search.
  • Sequences above a certain threshold (< specified E-value) are included and assumed to be related proteins. This group of sequences is used to define a “profile” that contains the essence of the “family”.
  • Now, with the important sequence positions highlighted, can look for more distantly related sequences that should still have the essence of the protein family.
  • Inclusion of more distantly related sequences modifies the profile further (further defines the essence) and allows for identification of even more distantly related sequences. Etc.

Note: PSI-BLAST may find and then subsequently lose a homologous sequence during the iteration process! “Drifting” of the program would be the gradual loss of close homologs during the iteration process.

slide31

>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH

MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYY

VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQ

EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNG

RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTH

VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY

PSI-BLAST: initial run

NCBI

e value cutoff for PSSM

slide33

PSI-BLAST: first PSSM search

Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

NCBI

slide34

PSI-BLAST: importance of original query

(remember, if A is like B….)

iteration 1

iteration 2

PSI-Blast of

human Tiam1

slide35

PSI-BLAST: importance of original query

iteration 1

iteration 2

PSI-Blast of

mouse Tiam2 (~90% identity with human Tiam1)

Ras-binding domains

iteration 3

slide36

Weakly conserved serine

Active site serine

Position specific scoring matrix (PSSM)

(learning from your “hits”)

NCBI

slide37

Position specific scoring matrix (PSSM)

A R N D C Q E G H I L K M F P S T W Y V

206 D 0 -2 0 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 0 -1 -6 -4 -1

207 G -2 -1 0 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -2 -1 0 -6 -5

208 V -1 1 -3 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 0 -2 -6 -4 -2

209 I -3 3 -3 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 -3 0 -1 -4 0

210 S -2 -5 0 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 1 -3 -7 -5 -6

211 S 4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3

212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 -4 -4 -5 0 -4

213 N -2 0 2 -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 -1 -3 -3 -4 -3

214 G -2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6

215 D -5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7

216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5

217 G -3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7

218 G -3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7

219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6

220 L -4 -6 -7 -7 -5 -5 -6 -7 0 -1 6 -6 1 0 -6 -6 -5 -5 -4 0

221 N -1 -6 0 -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -2 -1 -6 -1 6

222 C 0 -4 -5 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 -1 0 -5 0 0

223 Q 0 1 4 2 -5 2 0 0 0 -4 -2 1 0 0 0 -1 -1 -3 -3 -4

224 A -1 -1 1 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 0 -2 -2 -2 -3

Serine scored differently

in these two positions

Active site nucleophile

NCBI
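To make the position-specific idea concrete, the sketch below copies a few rows of the matrix above (positions 211 and 215 to 219, as transcribed here) and scores residues against them. The function name is illustrative, and which two serine positions the slide's annotation points at is an assumption; 211 and 216 are used because both are serine rows with clearly different scores.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows copied from the PSSM on this slide (position: scores in COLUMNS order).
PSSM = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
    219: [-2, -6, -6, -5, -6, -5, -5, -6, -6, -6, -7, -4, -6, -7, 9, -4, -4, -7, -7, -6],
}


def score_window(peptide, start):
    """Sum the position-specific scores of a peptide placed at position `start`."""
    return sum(PSSM[start + i][COLUMNS.index(res)] for i, res in enumerate(peptide))


ser = COLUMNS.index("S")
print(PSSM[211][ser], PSSM[216][ser])  # same residue, different score per position
print(score_window("DSGGP", 215))
```

Unlike a PAM or BLOSUM matrix, where a Ser-Ser match always gets the same score, the PSSM rewards serine much more at the strongly conserved position.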

slide38

Multiple sequence alignments (MSAs)

In this example, an MSA is used to identify regions of high sequence conservation presumably reflecting structural and functional constraints. Useful for delimiting known domains and potential new functional regions (e.g. the Ras-binding domain in yellow and the blue box of currently unknown function).

slide39

Fun with MSA...

MSA used to locate functional residues and domain boundaries in homologs of Dbl-proteins with known structure (Dbs and Tiam1).

Red amino acids directly interact with GTPases. Blue residues directly interact with phosphoinositides.

slide40

Tutorial on Jalview for MSA

Determining domain boundary for construct to express

Secondary, possible 3D structural information to help

narrow down 5’ and 3’ regions for PCR primers

slide41

What you should know

Homology: If two proteins are homologous, they have a common fold and a common ancestor. If two proteins have >25% identity across their entire length, they are likely to be homologs. However, sometimes true homologs have quite low sequence identity!

Orthologs: Homologous (and equivalent) proteins from different species. Arise from speciation.

Paralogs: Homologous (and equivalent) proteins found in same species. Divergence of sequences NOT from speciation.

Alignments: How to score? Minimum # of mutations? Physicochemical properties (as perceived by us)? Or learn from nature?

Scoring schemes: PAM, BLOSUM

slide42

E values: What it means in words. E = Kmn e^{-λS}

Alignment algorithms: BLAST (Basic Local Alignment Search Tool), FASTA (FAST-All), Needleman-Wunsch (global alignment)

Why use local alignment algorithm?