CONT

Outline 1.what is BLAT & why we need it 2.BLAT's similarity & difference compared with BLAST 3.BLAT's application forms 4.BLAT's 3 major application 5. conclusion

What is BLAT & why we need it there exist many alignment tools-SmithWaterman's algo :solves two short sequence alignment problem -FASTA,NCBIBLAST,MegaBLAST WU-BLAST :provides flexible & fast alignment involving large database -Sim4 :does a fine job with cDNA alignment-SAM,PSI-BLAST :slowly but surely find remote homology

CONTprocess of assembling and annotating the human genome-aligning three millions ESTs and aligning 13 million mouse whole- genome random reads against the human genome-need to be done in less than two weeks in order to have time to process an updated genome every month or two==>we need a very high speed alignment algorithm so the author developed BLATthe Blast-Like Alignment Tool

CONT -BLAT(compared with existing tools) -more accurate -500 times faster in mRNA/DNA alignment -50 times faster in protein/protein alignment -BLAT’s steps 1.using nonoverlapping k-mers to create index 2.using index to find homologous region 3.aligning these regions seperately 4.stiches these aligned region into larger alignment 5.revisit small internal exons possibly missed in first stage and adjusts large gap boundaries that have canonical splice sites where feasible

CONT -BLAT’s speed & sensitivity are decided by 1.k-mer size (finding hits step) 2.mismatch scheme (aligning step) 3.number of required index matches (find hits step)

BLAT's similarity & difference compare with BLAST Similarity:-scans relative short matchs(hits) ie.build index then find hits-extend hits into high-scoring pairs (HSPs)

CONTDifference:-BLAST build index for query sequence but BLAT build index for database-BLAST scans linearly through database but BLAST scans linearly through query sequence -BLAST triggers an extension when one or two hits occur in proximity to each other but BLAT can trigger extensions on any number of perfect or near-perfect hits

CONTDifference:-BLAST returns each area of homology between two sequence but BLAT stitches them together into a larger alignment-BLAT has special code to handle introns in RNA/DNA alignments i.e. BLAT unsplices mRNA onto the genome

BLAT's application formsserver-client-building index is a relatively slow procedure a BLAT server is available for keeping index in memory for clients to query ==>good for interactive applicationsstand-alone-suitable for batch runs on one or more CPUs

BLAT's 3 major application & evaluation -mRNA/DNA alignment -Mouse/Human Translated alignment -client/server version to power interactive searches

CONTEvaluating mRNA/DNA Alignments(compared with Sim4)-test set: remapped 713 mRNAs to genes on chromosome 22-speed: BLAT:26 sec Sim4:5hr-sensitivity: BLAT: 99.99% agreed of the annotated bases Sim4: 99.96%

CONTEvaluating Mouse/Human Tanslated Alignment(compared with TBLASTX)-for Human/Mouse :it has been shown that gapless alignment are in many ways preferable to gapped alignment for detecting coding regions

CONTEvaluating Mouse/Human Tanslated Alignment(compared with TBLASTX)speed comparison:method k N matrix timeWU-TBLASTX 5 1 +15/-12 2736sWU-TBLASTX 5 1 BLOSUM62 2714sBLAT 5 1 +2/-1 61sBLAT 4 2 +2/-1 37sk:the size of perfect matching hit N:how many hits required to trigger a detailed alignmentmatrix:scoring method

CONTEvaluating Mouse/Human Tanslated Alignment(compared with TBLASTX)sensitivity comparison:method %chr 22 % Refseq Enrichment % Refseq bases exonsWU-TBLASTX 2.67% 81.7% 31x 84.5% BLAT 2.89% 80.8% 28x 86.7%%chr 22:percentage of chromosome 22 coverd by the alignment%Refseq:percentage of bases inside of human RefSeq coding sequence covered by the alignmentEnrichment:column 2 / column 1 and high level indicate more specificity%Refseq exons:percentage of RefSeq coding exons covered by the alignment

CONTserver/client to power interactive searches-thousands of interactive sequence searches per day -just one time for building index and keeps index in memory for query ===>efficient-but not as efficient as stand-alone version -because server need to save memory so it only keep the index,not the database

BLAT – The BLAST-Like Alignment Tool W.James Kent Genome Research 2002 陳韋仰

Database & Query Sequence • Database : nonoverlapping • Query sequence : overlapping database ……… K-mer query sequence K-mer

Three Search Criteria • Single Perfect Matches • Single Almost Perfect Matches • Multiple Perfect Matches

Definition • K : The K-mer size • M : The match ratio between homologous areas • H : The size of a homologous area • G : The size of the database • Q : The size of the query sequence • A : The alphabet size 20 for amino acids 4 for nucleotides

T • How many nonoverlapping K-mers in the homologous region ? H : Homologous area size K : K-mer size

Single Perfect • : The probability that a specific K-mer in a homologous region of the database matches perfectly with the corresponding K-mer in the query = (M : The match ratio between homologous areas)

Sensitivity • P : The probability that at least one nonoverlapping K-mer in the homologous region matches perfectly with the corresponding K-mer in the query P = 1 – (1 – )T = 1 – (1 – )T (T : #nonoverlapping K-mers in the homologous region)

Specificity • F : The number of nonoverlapping K-mers that are expected to match by chance F = (Q - K +1) * ( ) * ( )K #K-mers in the query sequence #K-mers in the database

Single Perfect (Nucleotide) M P H = 100 ; G = 3 billion , Q = 500

Single Almost Perfect : The probability that a nonoverlapping K-mer in a homologous region of the database matches almost perfectly with the corresponding K-mer in the query = + K * (1 – M) One letter may mismatch

Sensitivity • P : The probability that any nonoverlapping K-mer in the homologous region matches almost perfectly with the corresponding K-mer in the query P = 1 – (1 – )T

Specificity • F : The number of nonoverlapping K-mers that are expected to match by chance F = (Q - K +1) * ( ) * ( )K + (Q - K +1) * ( ) * (K * ( )K-1(1 - ( )))

Single Almost Perfect (Nucleotide) H = 100 ; G = 3 billion , Q = 500

Multiple Perfect • There must be N perfect matches, each no further than W letters from each other in the target coordinate, and have the same diagonal coordinate • Example : N = 2

Sensitivity • N = 1 , = • Pn : The probability that there are exactly n matches within the homologous region Pn = n(1 – )T – n ( ) • The probability that there are N or more matches => Pn+ Pn+1 +…+PT

Specificity • FN : the number of chance matches of N K-mers each separated by no more than W from the previous match • N = 1, F1 = (Q - K +1) * ( ) * ( )K

Specificity (continued) • S : The probability of a second match occuring within W letters after the first S = 1 – (1 - ( )K)W/K => Consider the Nth match is within W letters after the (N-1)th match FN = S * FN-1 FN = F1 * SN-1

Multiple Perfect (Nucleotide)

Default Match Criteria • Nucleotide : two perfect 11-mer • Protein : • stand-alone --- single perfect 5-mer • client/server --- three perfect 4-mer Reference : http://www.csie.ntu.edu.tw/~kmchao/seq05fall/BLAT_final.ppt

Implementation mickey

Algorithm • 1. Search stage • The program detects regions of the two sequences which are likely to be homologous. • 2. Alignment stage • Examining these regions in more detail and producing alignments for the regions.

Search stage • 1. building up an index • creating non-overlapping k-mers and their positions in the database. • 2. excluding useless k-mers • deleting K-mers that occur too often from index and containing ambiguity codes.

Search stage database … non-overlapping K-mer database position Index P1 P2 P1 P1 P2 P1 …

Search stage Index (K-mers) query sequence database position … P1 P2 P1 P1 P2 P1 … … overlapping K-mers

Search stage Hit list K-mer database position P1 query sequence position P1 database position P1 query sequence position P1 …

Search stage (example) picture from:http://www.csie.ntu.edu.tw/~kmchao/seq05fall/

Search stage (example) • According to previous page, we know that… If diagonal values are equal, they are on the same diagonal.

Search stage DP – QP = 0 DP – QP < 0 query Coordinate DP – QP > 0 database Coordinate … bucket 1 bucket 2 bucket 3

Search stage Don’t care 2. Hits within proto-clumps are then sorted along the database coordinates and put into real clumps if they are within the window limit. 1. Hits that are within the gap limit are bundled together into proto-clumps. picture from:http://www.csie.ntu.edu.tw/~kmchao/seq05fall/

Search stage • Clumps with less than the minimum number of hits are discarded • The rest are used to define regions of the database which are homologous to the query sequence. • Clumps which are within 300 bases or 100 amino acids in the database are merged together. 500 additional bases are added on each side to form the final homologous region.

3. Homologous region 1. Two clumps with the distance < 300 2. Adding 500 bases on each side Search stage picture from:http://www.csie.ntu.edu.tw/~kmchao/seq05fall/

Nucleotide Alignments • 1. Search hits • generating a hit list between the query and the homologous region of the database. • 2. Extend hits

Nucleotide Alignments (Extend hits) • The extension first merges adjacent hits and expands their ends as far as the cDNA and genomic DNA match perfectly. (overlapping hits are also matched) • Allow N's in the cDNA to match any single base. unaligned areas http://www.soe.ucsc.edu/~kent/intronerator/algo.html

Nucleotide Alignments (Extend hits) • The program then recurses, making up tiles and trying to match in the unaligned areas. • The recursion runs until either no tiles are found or until the gap between aligned blocks in the genome or cDNA becomes less than 6 (5 in BLAT) Possibly introns tiles (Using smaller k to find match in BLAT) http://www.soe.ucsc.edu/~kent/intronerator/algo.html

Nucleotide Alignments • Extensions that allow 1 or 2 mismatches if followed by multiple matches. • Extensions that allow 1 or 2 insertions or deletions (indels) followed by multiple matches are pursued. http://www.soe.ucsc.edu/~kent/intronerator/algo.html

CONT

CONT

Presentation Transcript

SQL (cont.)

Mammals Cont.

RAM (cont.)

References cont.

Spreadsheets, cont.

Solutions, Cont.

Cont…

cont

CONT SOFT

Energy (cont.)

Normalization (cont.)

Pointers (... Cont.)

(Cont ...)

Cont.

Indexing (Cont.)

P.E.M (Cont…)

Cont.

CONT