CSCE555 Bioinformatics

1 / 31

# CSCE555 Bioinformatics - PowerPoint PPT Presentation

CSCE555 Bioinformatics. Lecture 6 Sequence Alignment (partIII) Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555. University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu. Roadmap.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'CSCE555 Bioinformatics' - barr

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### CSCE555 Bioinformatics

Lecture 6 Sequence Alignment (partIII)

Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun HuCourse page: http://www.scigen.org/csce555

University of South Carolina

Department of Computer Science and Engineering

2008 www.cse.sc.edu.

• Hashing Function based quick search
• Heuristic algorithm: FASTA, BLAST
• Multiple Sequence Alignment algorithm: Clustal W
• Summary
Searching
• Consider the problem of searching an array for a given value
• If the array is not sorted, the search requires O(n) time
• If the value isn’t there, we need to search all n elements
• If the value is there, we search n/2 elements on average
• If the array is sorted, we can do a binary search
• A binary search requires O(log n) time
• About equally fast whether the element is found or not
• It doesn’t seem like we could do much better
• How about an O(1), that is, constant time search?
• We can do it if the array is organized in a particular way
Hashing
• Suppose we were to come up with a “magic function” that, given a value to search for, would tell us exactly where in the array to look
• If it’s in that location, it’s in the array
• If it’s not in that location, it’s not in the array
• This function is called a hash function because it “makes hash” of its inputs
(Magic) Hashing Function
• A hash function is a function that:
• When applied to an Object, returns a number
• When applied to equal Objects, returns the same number for each
• When applied to unequal Objects, is very unlikely to return the same number for each
• Hash functions turn out to be very important for searching, that is, looking things up fast

kiwi

0

1

2

3

4

5

6

7

8

9

banana

watermelon

apple

mango

cantaloupe

grapes

strawberry

Example (ideal) hash function
• Suppose our hash function gave us the following values:
• hashCode("apple") = 5hashCode("watermelon") = 3hashCode("grapes") = 8hashCode("cantaloupe") = 7hashCode("kiwi") = 0hashCode("strawberry") = 9hashCode("mango") = 6hashCode("banana") = 2
Example of Hash Function
• PRIVATE int hash_number (const char *key, int size)
• { int hash = 0;
• if (key) { const char * ptr = key;
• for(; *ptr; ptr++)
• hash = (int) ((hash*3 + (*(unsigned char*)ptr)) % size);
• }
• return hash;

}

BLAST (Basic Local Alignment Search Tool)
• Approach (BLAST) (Altschul et al. 1990, developed by NCBI)
• View sequences as sequences of short words (k-tuple)
• DNA: 11 bases, protein: 3 amino acids
• Create hash table of neighborhood (closely-matching) words
• Use statistics to set threshold for “closeness”
• Start from exact matches to neighborhood words
• Motivation
• Good alignments should contain many close matches
• Statistics can determine which matches are significant
• Much more sensitive than % identity
• Hashing can find matches in O(1) time
• Extending matches in both directions finds alignment
• Yields high-scoring/maximum segment pairs (HSP/MSP)
Multiple Sequence Alignment
• Alignment containing multiple DNA / protein sequences
• Look for conserved regions → similar function
• Example:

#Rat ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT

#Mouse ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT

#Rabbit ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC

#Human ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC

#OppossumATGGTGCACTTGACTTTT---GAGGAGAAGAACTG

#Chicken ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT

#Frog ---ATGGGTTTGACAGCACATGATCGT---CAGCT

Multiple Sequence Alignment: Why?
• Identify highly conserved residues
• Likely to be essential sites for structure/function
• More precision from multiple sequences
• Better structure/function prediction, pairwise alignments
• Building gene/protein families
• Use conserved regions to guide search
• Basis for phylogenetic analysis
• Infer evolutionary relationships between genes
• Develop primers & probes
• Use conserved region to develop
• Primers for PCR
• Probes for DNA micro-arrays
Multiple Alignment Model

Q1: How should we define s?

Q2: How should we define A?

Model: scoring function s: A

X1=x11,…,x1m1

X1=x11,…,x1m1

Possible alignments of all Xi’s: A ={a1,…,ak}

Find the best alignment(s)

X2=x21,…,x2m2

X2=x21,…,x2m2

S(a*)= 21

XN=xN1,…,xNmN

XN=xN1,…,xNmN

Q4: Is the alignment biologically

Meaningful?

Q3: How can we find a* quickly?

Minimum Entropy Scoring
• Intuition:
• A perfectly aligned column has one single symbol (least uncertainty)
• A poorly aligned column has many distinct symbols (high uncertainty)

Count of symbol a in column i

=Maximum score of an alignment up to the subsequences ending with

Alignment: 0,0,0…,0---|x1| , …, |xN|

We can vary both the model and the alignment strategies

Multidimensional Dynamic Programming

NP-complete problem. High complexity

Approximate Algorithms for Multiple Alignment
• Two major methods (but it remains a worthy research topic)
• Reduce a multiple alignment to a series of pairwise alignments and then combine the result (e.g., Feng-Doolittle alignment)
• Using HMMs (Hidden Markov Models)
• Feng-Doolittle alignment (4 steps)
• Compute all possible pairwise alignments
• Convert alignment scores to distances
• Construct a “guide tree” by clustering
• Progressive alignment based on the guide tree (bottom up)
How to Align One Sequence to an Existing Alignment?

Add a sequence to an existing group:

a sequence s: CGAAATC want to align to a existing alignment

s1 AG–AT–

s2 -GAATC

The high scoring pairwise alignment is

s2 -G–AATC

s CGAAATC

Hence , s is merged into the group alignment as:

s1 AG--AT–

s2 -G–AATC

s CGAAATC

fixed

How to Align a Group to Another Group?
• Two groups:
• S1 ATTGCCATT--
• S2 ATC-CAATTTT
• S3 ATGGCCATT
• S4 ATCTTC-TT

The highest score alignment is S1 – S3 , so it is used for aligning the two groups as

S2 ATC–CAATTTT

S1 ATTGCCATT--

S3 ATGGCCATT--

S4 ATCTTC-TT--

Limitation of Feng-Doolittle Alignment
• Problems of Feng-Doolittle alignment
• All alignments are completely determined by pairwise alignment (restricted search space)
• No backtracking (subalignment is “frozen”)
• No way to correct an early mistake
• Non-optimality: Mismatches and gaps at highly conserved region should be penalized more, but we can’t tell where is a highly conserved region early in the process
• Iterative Refinement
• Re-assigning a sequence to a different cluster/profile
• Repeatedly do this for a fixed number of times or until the score converges
• Essentially to enlarge the search space
Clustal W: A Multiple Alignment Tool
• CLUSTAL and its variants are software packages often used to produce multiple alignments
• Essentially following Feng-Doolittle
• Do pairwise alignment (dynamic programming)
• Do score conversion/normalization (Kimura’s model)
• Construct a guide tree (neighbour-journing clustering)
• Progressively align all sequences using profile alignment
• Offer capabilities of using substitution matrices like BLOSUM or PAM
• Many Heuristics
• Kalign
• MAFFT (Multiple Alignment using Fast Fourier Transform)
• MUSCLE stands for MUltiple Sequence Comparison by Log-Expectation. MUSCLE is claimed to achieve both better average accuracy and better speed than ClustalW2 or T-Coffee
• T-Coffee allows you to combine results obtained with several alignment methods
Measuring Alignment Significance
• The statistical significance of a an alignment score is used to try to determine if an alignment is the result of homology or just random chance.
• The E-value of an alignment score is the expected number of unrelated sequences in a database that would have a score at least as good.
E-values and p-values
• The E-value of a particular score is determined by multiplying the number of sequences in the database, n, times the p-value of the score.
• The p-value of score X is the probability of a single random alignmenthaving score X or larger.

E-value(X) = n•p-value(X)

To compute the p-value of X, we must know how random scores are distributed.

The p-value of X is equal to the area under the distribution curve to the right of X.

For ungapped local alignments, the distribution can be computed analytically.

For gapped alignments, it must be estimated empirically.

Computing p-values
Summary
• Hashing for quick search
• Blast and Fasta
• Progressive Multiple Sequence alignment
• Testing significance of alignments
Next Lecture
• Profiles and HMM