Sequence Analysis, Pair Wise Alignment, and Database Searching

Sequence Analysis, Pair Wise Alignment, and Database Searching Stephen Sontum Middlbury College sontum@middlebury.edu Chapter 5: Alignments and Phylogenetic Trees Introduction to Bioinformatics by Arthur M. LeskPapers by: Needleman and Pearson

What we hope to learnPair-wise Alignment • Finite Parts List Approach (review) • Protein evolution, similarity, and homology • Methods for Pairwise alignment- Hand alignment- Dot Plots- Heuristic FastA and BLAST- Dynamic Programming • Parameters of Sequence Alignment- Gap penalties- Protein scoring matrices Core

How Do We Organize Information in an OME? ● Controlled Vocabulary ● Ranked Tree Structure» Biota » Eukarya » Animalia » Chordata » Mammalia » Primates » Hominidae » Homo @ Homo Sapiens Core

A Parts List Approach to Bike Maintenance

A Parts List Approach to Bike Maintenance How many roles can these play? How flexible and adaptable are they mechanically? Core What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)? What are the common parts -- types of parts (nuts & washers)? Where are the parts located?

A Parts List Approach to Baby Maintenance Clearly this is a much larger informatics problem but we can begin in the same way by finding and organizing the parts. TGAATCCCAGTTCAGCTCTTCAGCCTTTCGTGGATAAGAGAAGGCTGAAAGCGGGTCACGTTTTGGACTAAGCGACGCCC TTGCCAGGCATCCAGCTTAGTGGCTGTTGGTTTATTTGTAGAGTCCCCTTAACTCTCTCTCCCCCACATCGCCCATCTCC ACCGACGCCTCTCTCTCTCGTGTTATTTCTCCCCATTCTCGCTTCATTTCCCATCCATTTTCGAGTTCTGCAATATCCTC ACTAACTAGTATAGCCATGGTACGCCTCACTCGATCATCATCGTTGTTCGTGCGCTCAAACGCATCCGCTGTGCGGGGCA GATCTACTGGTGTCCTCCTGCGTAGATGAGCTGACGACTTCACTTCCAGGCCGACTCTCTGACCGAAGAGCAAGTTTCCG AGTACAAGGAGGCCTTCTCCCTATTTGTAAGTGCCATTGGTTACTGTTATATCAAAATCGAATTTGTATTGAGAGTATAC TAATACATTCCGCACTAAACAGGACAAGGATGGCGATGGTTAGTGCATCTGTCCCCCCAGGCTTGATCGCATTCGCCCAG CATGTCTGCTGTAGCTCTATATAACCGTTTCTGACAAACGGCGACAGGCCAGATTACCACTAAGGAGCTTGGCACTGTCA TGCGCTCGCTCGGTCAGAATCCTTCAGAGTCTGAGCTTCAGGACATGATCAACGAAGTTGACGCCGACAACAATGGCACC ATTGACTTTCCAGGTACGCGAACTCCCCAATCTACTTCGCACCAGCCTAGAAATGTACTAATGCTAAACAGAGTTCCTTA CCATGATGGCCAGAAAGATGAAGGACACCGATTCCGAGGAGGAAATTCGGGAGGCGTTCAAGGTCTTCGACCGTGACAAC AATGGTTTCATCTCCGCTGCTGAGCTGCGTCACGTCATGACCTCGATCGGTGAGAAGCTCACCGATGACGAAGTCGACGA

Organize:Part = Project ProposalX= Galileo DNA Transposon • (Important problem)Galileo is the only transposable element (TEs) known to have generated natural chromosomal inversion in the Genus Drosophila. It has not been assigned to a DNA transposon superfamily nor has its diversity in Drosophila genus been investigated. • (You understand various approaches) Galileo is a Foldback-like element that contain a transposase (TPase), which catalyzes their movement in the genome. It was discovered in Drosophilia buzzatii. A search of the Drosophilia Genus using TPase/buzzatti will be done to find homologues usefull in its classification. • (You have a good source of data) Chomosome assemblies of D.melanogaster, D.simulans, and the Contig CAF1 assemblies of 10 other Drosophila genomes may be found at http://rna.lbl.gov/drosophila

Organize:Part = Project ProposalX= Galileo DNA Transposon • (You found an interesting approach)TBLASTN searches using the 40 terminal bp of Dbuz\Galileo will be used to find homologs. One interesting homolog in meganogaster Dmel\P has 30% identity, which can be used to broaden the homology search. • (You have a specific plan) Contigs containing Putative TPases will be scrutinized to characterize the different number of copies of TEs using dot plots to define the boundaries of each copy. BLASTN searches with the full Galileo sequence will be used to investigate the abundance of this TEs within each species. • (You will know when you are done)Ten superfamilies of TEs at ENCODE site have been well characterized. If homologs to Galileo fall into one of these superfamilies we will be able to place it into a specific family. Find a paper that investigates a similar problem. This was base on: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2268567/?tool=pubmed

Organize:Part = DNA ENCODE Project The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of the consortium is to build a comprehensive parts listof the functional elements of the human genome, including elements that act at the protein level (coding genes) and RNA level (non-coding genes), and regulatory elements that control the cells and circumstances in which a gene is active. The results of ENCODE experiments, collected in the ENCODE DCC database, are displayed on the UCSC Genome Browser.

Organize:Part = DNA ENCODE Project • 1% of the Human genome 29.9 Mb is being compared to 28 other vertebrates • Only 60% of the sequences under evolutionary constraint map to functional elements • Almost all of the human genome is transcribed • histone modifications can be mapped to temporal patterns of DNA replication Figure 2.5 Phylogenetic tree of species treated in the ENCODE project

Organize: Part = Sequence g c t g – a a – c g | | | | |- c t a t a a t c - • Homologue – common ancestor • Orthologue • Similar function different species • Paralogue • Different function in the same species • Pair-wise Alignment • Similarity of sequence • Sequences with 25% identity are homologus

Protein Evolution “For many protein sequences, evolutionary history can be traced back 1-2 billion years” -William Pearson • When we align sequences, we assume that they share a common ancestor • They are then homologous • Protein folds are much more conserved than protein sequence • DNA sequences tend to be less informative than protein sequences

Use Protein Sequences for Similarity Searches 1) 4 DNA bases vs. 20 amino acids - less chance similarity 2) Similarity of AAs can be scored - # of mutations, chemical similarity, PAM matrix 3) Protein databanks are much smaller than DNA databanks less random matches. 4) Similarity is determined by pair-wise alignment of different sequences Core

Pair-wise Alignment • The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem. • There are lots of possible alignments. • Two sequences can always be aligned. • Sequence alignments have to be scored. • Often there is more than one solution with the same score.

Methods of Alignment • By hand - slide sequences on two lines of a word processor • Dot plot • with windows • Heuristic methods (fast, approximate) • BLAST and FASTA • Word matching and hash tables • Dynamic programming • Needleman-Wunsch(slow, optimal) Core

By Hand Raw Data ???T C A T GC A T T G

By Hand Raw Data ???T C A T GC A T T G 2 matches, 0 gaps T C A T G | |C A T T G 3 matches (2 end gaps) T C A T G . | | | . C A T T G We still need some kind of scoring system to find the best alignment

By Hand Raw Data ???T C A T GC A T T G 2 matches, 0 gaps T C A T G | |C A T T G 3 matches (2 end gaps) T C A T G . | | | . C A T T G Hamming distance = # mismatched characters Hamming = 3 Levenshtein distance = # edit operations need to change one string into the other Levenshtein = 3 What about gaps?

By Hand Raw Data ???T C A T GC A T T G Core 2 matches, 0 gaps T C A T G | |C A T T G 3 matches (2 end gaps) T C A T G . | | | . C A T T G 4 matches, 1 gap T C A - T G| | | |. C A T T G 4 matches, 1 gap T C A T - G| | | |. C A T T G We need to base our score on evolution – deletions or insertions are common

A Brute Force String Search HERE_IS_A_SIMPLE_EXAMPLE ... EXAMPLE EXAMPLE Slide string one character at a time

A Brute Force String Search HERE_IS_A_SIMPLE_EXAMPLE ... EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE Slide string one character at a time

A Brute Force String Search HERE_IS_A_SIMPLE_EXAMPLE ... EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE HERE_IS_A_SIMPLE_EXAMPLE ... Slide string one character at a time

A Brute Force String Search HERE_IS_A_SIMPLE_EXAMPLE ... EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE HERE_IS_A_SIMPLE_EXAMPLE ... Slide string one character at a time M = Length of Query Pattern (7) N = Length of Database (24) Brute Force Expected N - M Worst Case N*M (gaps)

A Brute Force String Search HERE_IS_A_SIMPLE_EXAMPLE ... EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE EXAMPLE HERE_IS_A_SIMPLE_EXAMPLE ... Slide string one character at a time M = Length of Query Pattern (7) N = Length of Database (24) Brute Force Expected N - M Worst Case N*M (gaps) But when N = 3 billion we are going to need a more efficient methods.

T A C A T T A C G T A C A T A C A C T T A Dotplot: Visualizing the Sequence A dot plot gives an overview of all possible alignments Sequence 1 Sequence 2 1024 x 1024 pixels on a screen – we can view 1 Kb sequences

Dotplot: Visualizing the Sequence Alternately, both conventions will be used in our software A   T   T   C  A   C  A   T   A    T A C A T T A C G T A C Sequence 2 Sequence 1

Dotplot: In a dotplot each diagonal corresponds to a possible (ungapped) alignment A   T   T   C  A   C  A   T   A    T A C A T T A C G T A C Sequence 2 Core Sequence 1 T A C A T T A C G T A C A T A C A C T T A Possible alignment:

Insertions / Deletions in a Dotplot T A C T G T C A T T A C T G T T C A T Sequence 2 Sequence 1 T A C T G-T C A T | | | | | | | | | T A C T G T T C A T

Word Size Algorithm T A C G G T A T G A C A G T A T C Word Size = 3 C T A T G A C A T A C G G T A T G T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C 

Window / Stringency 12 11 10 9 8 7 6 5 4 3 2 1 0 Score = 11 PTHPLASKTQILPEDLASEDLTI  PTHPLAGERAIGLARLAEEDFGM Score = 11 PTHPLASKTQILPEDLASEDLTI Scoring Matrix Filtering  PTHPLAGERAIGLARLAEEDFGM Matrix: PAM250 Window = 12 Higher Stringency 12 Lower Stringency 9 Score = 7 PTHPLASKTQILPEDLASEDLTI PTHPLAGERAIGLARLAEEDFGM

DotLet http://myhits.isb-sib.ch/cgi-bin/dotlet Linear frequency Stringency low high Stringency Log frequency

DotLet Core http://myhits.isb-sib.ch/cgi-bin/dotlet • Each pixel represents a residue match • The pixel’s intensity depends on the similarity of the match • Black is a low score, white is a high score • Use the Histogram window to tune the gray scale to make the background noise disappear and the similar region stand out. • Move the bottom slide (low score threshold) past the blue peak. • Move the high score threshold past the purple peak. Move copy slit_drome.fasta from our class web site data to the desktop then copy into DotLet 1.5 under links

Repeated Protein Domains Drosophila melanogaster SLIT protein against itself. http://www.isrec.isb-sib.ch/java/dotlet/repeats.html

Dotlet Examples • slit_drome.fastaSLIT protein against itself • BSGTABX.fastaUTP-glucose pyrophosphorylase (gtab) gene against itself • ANCALM.fasta gene vs CALM_EMENI.fastaThe Calmodulin gene versus its protein product • CTRB1_HUMAN.fasta; CTRB1_MOUSE.fasta;CTRB1_FISH.fasta Chymotrypsin B proteins Down load these files to your desktop from the Data link on our webpage. When you need to grab a sequence open the files by right clicking on then and selecting edit with VIM. Load all of these into Dotlet1.5.

Terminators and other stem-loop structures UTP-glucose pyrophosphorylase (gtaB L12272) gene against itself The stem-loop functions as the transcription terminator for this gene. http://www.isrec.isb-sib.ch/java/dotlet/terminator.html

Introns and Exons Calmodulin gene against the calmodulin protein in Emericella nidulans Choose a stringent Matrix because you expect an exact match and a small window to find a sharp exon-intron boundary. Actually there are five you will need even a smaller window size http://www.isrec.isb-sib.ch/java/dotlet/exonintron.html

Sequence Analysis, Pair Wise Alignment, and Database Searching Week 3 First Lecture End Here • Methods for Pairwise alignment- Heuristic FastA and BLAST- Dynamic Programming • Parameters of Sequence Alignment- Gap penalties- Protein scoring matrices

Trypsin Phylogenetics Mouse vs Human

Trypsin Phylogenetics Fish vs Human

Trypsin Phylogenetics NCBI Blastp of CTRA_BOVIN (P00766) http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins

Main Algorithms for SequenceSearching Core • FastA– Better for nucleotides than for proteins • BLAST - Basic Local Alignment Search Tool– Better for proteins than for nucleotides • Needleman-Wunsch or Smith-Waterman– More sensitive than FastA or BLAST. Sensitivity: the ability to detect "true positive" matches. The most sensitive search finds all true matches, but might have lots of false positives Specificity: the ability to reject "false positive" matches. The most specific search will return only true matches, but might have lots of false negatives Ideally we would like our searches to be Sensitive and Specific

Main Algorithms for SequenceSearching Sources for Alignment Programs and Servers • EMBOSS: http://emboss.sourceforge.net/ • servers: http://pro.genomics.purdue.edu/cgi-bin/emboss/needle • Expasy tools: http://ca.expasy.org/tools/ • Karplus page: http://users.soe.ucsc.edu/~karplus/compbio_pages.html • NCBI tools: http://www.ncbi.nlm.nih.gov/guide/data-software/#tools_ • Genome Workbench: http://www.ncbi.nlm.nih.gov/projects/gbench/

Main Algorithms for SequenceSearching

Main Algorithms for SequenceSearching Fish Cow Human Rat Mouse Search CTRB1_*

FastA http://fasta.bioch.virginia.edu/ • A set of programs for fast local sequence alignment maintained by William Pearson at the University of Virginia • Searches for short sequences called “ktups” that have previously been found and located in the data base. • The program builds local alignments based on these ktups matches

FastA: Ktups Regions FastA locates regions of the query sequence and the search set sequence that have high densities of exact word matches. The ten regions with the highest density of matches are rescored using a scoring matrix.

FastA: Gapped Alignment • The high scoring ktup regions are joined together to form an approximate alignment with gaps. • Only non-overlapping regions may be joined. • The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap. • The highest scoring gapped alignment is optimized using a variation of the Smith-Waterman algorithm.

FastA: Programs

FastA: EMBL-EBIhttp://www.ebi.ac.uk/fasta33 Core

Sequence Analysis, Pair Wise Alignment, and Database Searching