Introduction on biological sequence indexing, searching and text mining

王世融

Outline • Introduction to sequence searching • Pattern searching: suffix trees • BLAST • Introduction to text mining: case study

Bioinformatics topics

A glance at bioinformatics: Information Retrieval • Indexing: designing a spelling checker • The user typed “absurb” (mistyped) but meant “absorb” (really wanted) • How fast can the system retrieve all potential candidates (absurb, absorb, …)? • What if the user had mistyped it as “cbsorb”?

Indexing
• Create a database that lists, for each letter, all words containing it:
  b: boy, by, absorb, …
  i: I, ill, illustrate, …
  n: next, net, no, …
  d: dance, drive, dog, …
• Which words contain “b, i, n, d”?
• When someone types “bond”: which words contain “b, o, n, d”, or at least 3 of these 4 letters?
• Dictionary: a simple search engine for lexicons
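
The letter index described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the extra lexicon entries ("bind", "bond", "band") are assumptions added for the example, not part of the original slides.

```python
from collections import defaultdict

def build_letter_index(words):
    """Map each letter to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for letter in set(word.lower()):
            index[letter].add(word)
    return index

def candidates(index, typed, min_shared):
    """Return words sharing at least `min_shared` distinct letters with `typed`."""
    counts = defaultdict(int)
    for letter in set(typed.lower()):
        for word in index.get(letter, ()):
            counts[word] += 1
    return sorted(w for w, c in counts.items() if c >= min_shared)

# Hypothetical lexicon: the slide's words plus three made-up neighbours of "bond".
lexicon = ["boy", "by", "absorb", "ill", "illustrate", "next", "net", "no",
           "dance", "drive", "dog", "bind", "bond", "band"]
index = build_letter_index(lexicon)
all_four = candidates(index, "bond", 4)       # words containing b, o, n and d
three_of_four = candidates(index, "bond", 3)  # words containing at least 3 of them
```

Only the words listed under the typed letters are ever touched, which is why such an index retrieves candidates much faster than scanning the whole lexicon.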

What is the relation? Treating genomic/proteomic data as a language

Decoding an unknown language
• For proteomic data:
  amino acid ↔ alphabet
  motif ↔ word
  protein ↔ sentence
  protein structure ↔ sentence meaning
• Finding the interrelationships of data: data mining, knowledge discovery

Pattern Matching: sequence indexing & searching • Hash Tables • Repeat Finding • Exact Pattern Matching • Keyword Trees • Suffix Trees • Approximate String Matching

Why Pattern Matching? • Pattern matching has important applications in bioinformatics: - Finding repeats in genomes - Finding unique sequences in genomes - Searching databases for known and similar nucleotide sequences

Genomic Repeats • Example of repeats: ATGGTCTAGGTCCTAGTGGTC • Motivation to find them: -Genomic rearrangement, duplications, and deletions are often associated with repeats -Trace evolutionary secrets -Many cancers are characterized by an explosion of repeats

k-mer Repeats • Long repeats are difficult to find • Short repeats are easy to find (e.g., by hashing) • Simple approach to finding long repeats: find exact repeats of short k-mers (k is usually 10 to 13), then use the k-mer repeats to extend into longer, maximal repeats • Shortening a repeat still gives a repeat; extending a unique sequence still gives a unique sequence • Example: in GCTTACAGATTCAGTCTTACAGATGGT the 4-mer TTAC starts at locations 3 and 17

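Extending k-mer Repeats • Sequence: GCTTACAGATTCAGTCTTACAGATGGT • Extend the 4-mer matches (TTAC at locations 3 and 17) for as long as the flanking bases agree • Extending to the right gives TTACAGAT; the preceding C also matches at both sites, so the maximal repeat is CTTACAGAT (locations 2 and 16) • To find maximal repeats in this way, we need ALL start locations of all k-mers in the genome • Hashing lets us find repeats quickly in this manner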

Hashing: What is it? • What does hashing do? For each piece of data, generate an integer (ideally unique), then store the data in an array at the index given by that integer • Hashing is a very efficient way to store and retrieve data

Hashing: Example • Where do the animals eat? • Records: each animal • Key: where each animal eats

Hashing DNA sequences • Each k-mer can be translated into a binary string (A, T, C, G can be represented as 00,01,10,11) • After assigning a unique integer per k-mer it is easy to get all start locations of each k-mer in a genome
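
As a minimal sketch of this encoding, the code below packs each k-mer into an integer using the 2-bit codes from the slide and records every start location under that integer. The function names are illustrative assumptions, and locations are reported 1-based to match the earlier slides.

```python
# 2-bit codes as given on the slide: A=00, T=01, C=10, G=11.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}

def kmer_to_int(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def kmer_positions(genome, k):
    """Return {integer key: [1-based start locations]} for every k-mer."""
    table = {}
    for i in range(len(genome) - k + 1):
        key = kmer_to_int(genome[i:i + k])
        table.setdefault(key, []).append(i + 1)
    return table

genome = "GCTTACAGATTCAGTCTTACAGATGGT"
table = kmer_positions(genome, 4)
ttac = table[kmer_to_int("TTAC")]  # start locations of the 4-mer TTAC
```

Because every k-mer maps to a distinct integer below 4^k, the integer can serve directly as an array index when k is small.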

Hashing: Maximal Repeats • To find repeats in a genome: For all k-mers in the genome, note the start position and the sequence Generate a hash table index for each unique k-mer sequence In each index of the hash table, store all genome start locations of the k-mer which generates that index • Extend k-mer repeats to maximal repeats
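
The procedure above might look roughly like the following sketch. It relies on Python's built-in dictionary as the hash table (an assumption of this example, so collisions are handled by the language), and the helper names are hypothetical. Every pair of occurrences of a repeated k-mer is extended left and right while the bases agree.

```python
from collections import defaultdict

def kmer_index(genome, k):
    """Hash table from k-mer to the list of its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def extend_pair(genome, i, j, k):
    """Extend a k-mer match at 0-based starts i < j in both directions."""
    left = 0
    while i - left - 1 >= 0 and genome[i - left - 1] == genome[j - left - 1]:
        left += 1
    right = k
    while j + right < len(genome) and genome[i + right] == genome[j + right]:
        right += 1
    return i - left, j - left, left + right

def maximal_repeats(genome, k):
    """Return sorted (repeat, 1-based start, 1-based start) tuples."""
    found = set()
    for starts in kmer_index(genome, k).values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                s1, s2, length = extend_pair(genome, starts[a], starts[b], k)
                found.add((genome[s1:s1 + length], s1 + 1, s2 + 1))
    return sorted(found)

repeats = maximal_repeats("GCTTACAGATTCAGTCTTACAGATGGT", 4)
```

On the example sequence from the k-mer slides this sketch reports CTTACAGAT at locations 2 and 16. That is the 8-mer TTACAGAT extended by one base to the left, because the C preceding TTAC is also shared by both copies.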

Hashing: Collisions • Dealing with collisions: “chain” all start locations of the k-mers that share an index (linked list)

Pattern Matching • What if, instead of finding repeats in a genome, we want to find sequences in a database that contain a given pattern? • This leads us to a different problem, the Pattern Matching Problem

Exact Pattern Matching: An Example • Pattern: GCAT • Text: CGCATC • The pattern is slid along the text one position at a time and compared at each alignment; here it matches starting at position 2 of the text
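
The sliding comparison in this example corresponds to the naive exact-matching algorithm. A minimal sketch, with an assumed function name and 1-based positions, could be:

```python
def naive_match(pattern, text):
    """Return 1-based start positions of exact occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    hits = []
    for i in range(m - n + 1):          # try every alignment of the pattern
        j = 0
        while j < n and text[i + j] == pattern[j]:
            j += 1
        if j == n:                      # all n characters matched
            hits.append(i + 1)
    return hits

hits = naive_match("GCAT", "CGCATC")
```

In the worst case each of the roughly m alignments costs n comparisons, although most alignments are abandoned after the first mismatch.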

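Exact Pattern Matching: Running Time • Naive pattern matching runtime: O(nm) for a pattern of length n and a text of length m • On typical text most alignments fail after a few comparisons, so the expected running time is well below the O(nm) worst case • Better solution: suffix trees - Can build the suffix tree of a text of length m in O(m) time - Can then find a pattern of length n in O(n) time - Conceptually related to keyword trees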

Keyword Trees • Properties of keyword trees: - Store a set of keywords in a rooted labeled tree - Each edge is labeled with a letter from an alphabet - Every stored keyword can be spelled on a path from the root to some leaf

Keyword Trees: Threading • To match patterns in a text using a keyword tree: -Build keyword tree of patterns -“Thread” the text through the keyword tree

Keyword Tree: Threading (cont’d) • Threading is “complete” when we reach a leaf in the keyword tree • When threading is “complete,” we’ve found a pattern in the text
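
Threading can be sketched with nested dictionaries standing in for the tree. This is a simplified illustration: the keywords and text are made up, the function names are assumptions, and the sketch restarts at the root for every text position instead of using failure links as in Aho-Corasick. A `None` key marks the end of a keyword, which plays the role of reaching a leaf.

```python
def build_keyword_tree(keywords):
    """Build a keyword tree (trie) as nested dicts; None marks a keyword end."""
    root = {}
    for word in keywords:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[None] = word
    return root

def thread(tree, text):
    """Return (1-based start, keyword) for every keyword occurrence in text."""
    hits = []
    for start in range(len(text)):
        node = tree
        for letter in text[start:]:
            if letter not in node:
                break                   # threading fails at this start position
            node = node[letter]
            if None in node:            # threading is complete: keyword found
                hits.append((start + 1, node[None]))
    return hits

# Hypothetical keywords and text, chosen only for illustration.
tree = build_keyword_tree(["apple", "apropos", "banana", "bandana", "orange"])
hits = thread(tree, "pineapple and banana")
```

Threading "appeal" through this tree stops after "app", since no edge labeled "e" follows, while threading "apple" reaches the end of a keyword.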

Keyword Tree: Threading (cont’d) • Thread “appeal”

Keyword Tree: Threading (cont’d) • Thread “apple”

Suffix Trees • Similar to keyword trees, except that edges are condensed - All internal nodes have at least two outgoing edges - Leaves are labeled with the suffix start position - A suffix tree can be built faster than a keyword tree of all suffixes

Use of Suffix Trees • Suffix trees hold all suffixes of a text, e.g., for ATCGC: ATCGC, TCGC, CGC, GC, C • Can be built in O(m) time for a text of length m • To find any pattern of length n in a text: - Build the suffix tree for the text - Thread the pattern through the suffix tree - The pattern is found in the text in O(n) time! • O(n+m) time for the “Pattern Matching Problem”: build the suffix tree and look up the pattern
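
To make the lookup step concrete, here is a deliberately naive sketch. It builds an uncompressed suffix trie by inserting every suffix, which takes O(m^2) time and space, so it does not achieve the O(m) construction quoted above (that requires a condensed tree and an algorithm such as Ukkonen's). Threading a pattern still takes O(n) steps. The function names are assumptions.

```python
def build_suffix_trie(text):
    """Naive suffix trie; each node stores the 1-based starts passing through it."""
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
            node.setdefault(None, []).append(start + 1)
    return root

def find(trie, pattern):
    """Thread the pattern through the trie; return all 1-based occurrences."""
    node = trie
    for letter in pattern:
        if letter not in node:
            return []
        node = node[letter]
    return node.get(None, [])

trie = build_suffix_trie("ATCGC")
cgc = find(trie, "CGC")
c = find(trie, "C")
```

Here the text is indexed once and any number of patterns can then be threaded through it, the reverse of the keyword-tree setup.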

Suffix Trees: Example

Multiple Pattern Matching: Summary • Keyword and suffix trees are used to find patterns in a text (e.g., homology between different genes) • Keyword trees: - Build a keyword tree of the patterns, and thread the text through it • Suffix trees: - Build a suffix tree of the text, and thread the patterns through it

An Excellent Research Example: “Annotating Large Genomes With Exact Word Matches” • John Healy, Elizabeth E. Thomas, Jacob T. Schwartz, and Michael Wigler • Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA; Courant Institute of Mathematical Sciences, New York University, New York, New York 10003, USA • Genome Research, 2003

An Excellent Research example: Motivation • Finding highly unique probes from the genome • Finding word count for words of arbitrary length in any given genome

An Excellent Research Example: Issues and Solutions • When the word length exceeds 15, directly addressable structures become impractical - Suffix array, suffix tree? • Solution: create a reversible permutation of the text that tends to be more compressible than the original text - Based on the Burrows-Wheeler transform (1994)
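
The "reversible permutation" idea can be illustrated with a textbook version of the Burrows-Wheeler transform. This sketch sorts all rotations explicitly, which is far too slow for a genome and is not the data structure used in the paper; it assumes the end marker `$` does not occur in the text.

```python
def bwt(text, end="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last, end="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]

transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform tends to group identical characters into runs, which is what makes the output more compressible, and the inverse shows that no information is lost.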

Approximate vs. Exact Pattern Matching • So far all we’ve seen are exact pattern matching algorithms • Because of mutations, it usually makes much more biological sense to find approximate pattern matches • Biologists often use fast heuristic approaches (rather than local alignment) to find approximate matches

Dot Matrices • Dot matrices show similarities between two sequences • FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches)

Dot Matrices (cont’d) • Identify diagonals above a threshold length - Diagonals in the dot matrix indicate exact substring matches • Extend diagonals and try to link them together, allowing for minimal mismatches/indels - Linking diagonals reveals approximate matches over longer substrings • Single-nucleotide matching vs. di-nucleotide matching
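
The first step, identifying diagonals above a threshold length, can be sketched as follows. The dot matrix is kept implicit (a cell is a dot when the two characters are equal), the function name is an assumption, and the linking of diagonals across mismatches or indels is not attempted here.

```python
def diagonals(a, b, min_len):
    """Return (start in a, start in b, length) for maximal diagonal runs of dots.

    Positions are 1-based; only runs of at least `min_len` matches are kept.
    """
    runs = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue                    # no dot in this cell
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue                    # run already started further up-left
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= min_len:
                runs.append((i + 1, j + 1, length))
    return runs

# Two short example sequences, chosen only for illustration.
runs = diagonals("GTAAGGTCCAGT", "GTTAGGTCCTAGT", 6)
```

With a threshold of 6 the only diagonal kept is the shared substring AGGTCC; lowering the threshold admits shorter and increasingly spurious diagonals.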

BLAST: Basic Local Alignment Search Tool • A Google for biological sequences • Approximate sequence alignment • The goal is not only to use BLAST, but also to understand why it works - It is not just a computational tool - Misapplying it leads to misinterpreting its results

Interpreting New Words with a Dictionary • Finding the sentence with the same (or similar) semantics • Encountering a new word, “rucksack”: meaningless without a dictionary or some point of reference • Encountering a DNA or protein sequence: a point of reference is needed; no dictionary is available, but a thesaurus exists • Rucksack → backpack, bag, purse: this does not give the exact meaning, but helps with understanding

The statistics of sequence similarity scores • Raw score S

The statistics of sequence similarity scores • Bit score S’ - Calculated from the raw score S - Normalized with respect to the scoring system - Used to compare scores from different searches • E-value - Expectation value - The number of different alignments with scores ≥ S expected to occur by chance in a database search - The lower the E-value, the more significant the score S

BLAST: Locally Maximal Segment Pairs • A segment pair is maximal if it has the best score over all segment pairs • A segment pair is locally maximal if its score can’t be improved by extending or shortening it • Statistically relevant locally maximal segment pairs are of biological interest • BLAST finds all locally maximal segment pairs with scores above some threshold - A sufficiently high threshold will filter out “junk”, irrelevant matches

High-Scoring Pairs (HSPs) • All segment pairs whose scores cannot be improved by extension or trimming • A model of random sequences is needed to judge how high a score is relative to chance

Expected number of HSPs • Expected number of HSPs with score ≥ S • E-value E for the score S: $E = K m n\, e^{-\lambda S}$ • Given: two sequences of lengths m and n • The statistics of HSP scores are characterized by two parameters, K and λ - K: scale for the search space size - λ: scale for the scoring system
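
A small numerical sketch may help here. It uses the standard Karlin-Altschul form E = K·m·n·e^(−λS) and the usual bit-score normalization S’ = (λS − ln K) / ln 2, under which E = m·n·2^(−S’). The values of K, λ, m and n below are illustrative assumptions only, not parameters taken from the slides.

```python
import math

def e_value(raw_score, m, n, K, lam):
    """Expected number of HSPs with score >= raw_score: E = K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * raw_score)

def bit_score(raw_score, K, lam):
    """Normalized score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value_from_bits(bits, m, n):
    """Same E-value expressed through the bit score: E = m*n*2**(-S')."""
    return m * n * 2.0 ** (-bits)

# Hypothetical parameters, chosen only to exercise the formulas.
K, lam = 0.13, 0.32
m, n = 250, 5_000_000
e_low = e_value(60, m, n, K, lam)
e_high = e_value(120, m, n, K, lam)
e_bits = e_value_from_bits(bit_score(120, K, lam), m, n)
```

The E-value grows linearly with the search space m·n and falls exponentially with the score, and the bit score absorbs K and λ so that scores from different scoring systems become comparable.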

Indexing-based local alignment
• Dictionary: all words of length k (~10)
• Alignment is initiated between words with alignment score ≥ T (typically T = k)
• Alignment: ungapped extensions until the score falls below a statistical threshold
• Output: all local alignments with score > statistical threshold
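
The seed-and-extend idea on this slide can be sketched as below. This is a simplified illustration and not the actual BLAST implementation: seeds are exact word matches (the T = k case with +1 per match), extension is ungapped with a simple drop-off rule, and the scoring values, the `x_drop` parameter and all function names are assumptions.

```python
from collections import defaultdict

def index_words(subject, k):
    """Dictionary of all words of length k in the subject (0-based starts)."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return index

def extend(query, subject, i, j, k, x_drop, match=1, mismatch=-1):
    """Ungapped extension of the seed query[i:i+k] == subject[j:j+k]."""
    def grow(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break                       # score dropped too far: stop
            qi += step
            sj += step
        return best, best_len
    right_score, right_len = grow(i + k, j + k, 1)
    left_score, left_len = grow(i - 1, j - 1, -1)
    score = k * match + left_score + right_score
    return score, i - left_len, j - left_len, left_len + k + right_len

def seed_and_extend(query, subject, k, x_drop, min_score):
    """Return sorted (score, 1-based query start, 1-based subject start, length)."""
    index = index_words(subject, k)
    hits = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            score, qs, ss, length = extend(query, subject, i, j, k, x_drop)
            if score >= min_score:
                hits.add((score, qs + 1, ss + 1, length))
    return sorted(hits, reverse=True)

hits = seed_and_extend("GTAAGGTCCAGT", "GTTAGGTCCTAGT", 4, 2, 6)
```

In this example three overlapping seeds (AGGT, GGTC, GTCC) all extend to the same segment pair, GTAAGGTCC against GTTAGGTCC, with eight matches and one mismatch; the extension stops where a gap would be needed to continue.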

BLAST algorithm (cont’d)

Indexing-based local alignment: Extensions
• Gapped extensions: extensions with gaps in a band around the anchor
• Output:
  GTAAGGTCC-AGT
  GTTAGGTCCTAGT

Types of BLAST • blastn: Nucleotide-nucleotide • blastp: Protein-protein • blastx: Translated query vs. protein database • tblastn: Protein query vs. translated database • tblastx: Translated query vs. translated database (6 frames each)

Right Tool for the Right Job

Variants of BLAST
• NCBI BLAST: search the universe, http://www.ncbi.nlm.nih.gov/BLAST/
• PSI-BLAST: finds members of a protein family or builds a custom position-specific score matrix; bootstraps results to find related sequences
• MEGABLAST: optimized to align very similar sequences; works best when k = 4i ≥ 16; linear gap penalty
• WU-BLAST (Wash U BLAST): very good optimizations; good set of features and command-line arguments
• BLAT: faster, less sensitive than BLAST; good for aligning huge numbers of queries
• CHAOS: uses inexact k-mers, sensitive
• PatternHunter: uses patterns instead of k-mers
• BlastZ: uses patterns, good for finding genes

Sample BLAST output • BLAST of human beta globin protein against zebrafish

Sample BLAST output (cont’d) • BLAST of human beta globin DNA against human DNA
