Pattern Recognition CIS 786 Prof. Barry Cohen Pavan Tipirneni Niranjan Mulay Rana Farha Ketal Patel
What is Pattern Recognition? • A technique for identifying interesting patterns of events (such as amino acids, nucleotides, or gene expression levels) that occur repeatedly in a particular set of data.
Pattern Recognition in Molecular Biology • Human Genome Project • Protein analysis • Gene Expression & DNA Microarray Analysis • Drug Discovery
Pattern Discovery in Proteins • Three main steps: • - Proteins related to a query sequence are found by searching the database for similar sequences. • - Sequences found in this initial screen are then used as query sequences to search for other family members. • - This process is repeated until no new sequences are found.
Tandem Repeats • These are two or more contiguous, approximate copies of a pattern of nucleotides. • These duplicates occur as a result of mutational events in which an original segment of DNA, the pattern, is converted into a sequence of individual copies. • They have been linked to a number of different diseases. • They might play a role in gene regulation and in the development of immune system cells.
Types of Patterns • Deterministic: a pattern either matches a given string or it does not. • Probabilistic: each sequence is assigned the probability that it was generated by a model. The higher the probability, the better the match between sequence and pattern.
TEIRESIAS Algorithm • TEIRESIAS searches for patterns consisting of characters of the alphabet Σ and wild-card characters ‘.’. • An ambiguous character is a character corresponding to a subset of Σ. Ex. A-[LF]-G • A wild-card (or don’t care) is a special kind of ambiguous character that matches any character in Σ. Ex. N in nucleotide sequences and X in protein sequences; both are also denoted by ‘.’. • A flexible gap is a gap of variable length. Ex. X(4,6) matches any gap of length 4, 5 or 6. X(I) denotes a fixed gap of length I.
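The notation on this slide maps naturally onto regular expressions. The sketch below is only an illustration, assuming dash-separated patterns as in the A-[LF]-G example; the helper name `pattern_to_regex` is hypothetical and is not part of TEIRESIAS.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate a dash-separated pattern such as 'A-[LF]-G' or 'C-X(4,6)-H'
    into a Python regular expression (illustrative helper, not TEIRESIAS code)."""
    parts = []
    for element in pattern.split("-"):
        gap = re.fullmatch(r"[xX]\((\d+)(?:,(\d+))?\)", element)
        if gap:
            # X(4,6) is a flexible gap, X(3) a fixed gap
            low, high = gap.group(1), gap.group(2)
            parts.append(f".{{{low},{high}}}" if high else f".{{{low}}}")
        elif element in (".", "x", "X"):
            # wild-card: matches any single character
            parts.append(".")
        elif element.startswith("["):
            # ambiguous character, e.g. [LF]
            parts.append(element)
        else:
            # literal character from the alphabet
            parts.append(re.escape(element))
    return "".join(parts)

print(pattern_to_regex("A-[LF]-G"))    # A[LF]G
print(pattern_to_regex("C-X(4,6)-H"))  # C.{4,6}H
```

Treating X as the wild-card is an assumption suited to protein sequences; for nucleotide data the same role would be played by N.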
(L,W) Patterns • A pattern P is an (L,W) pattern iff: • P is a string of characters from Σ and wild cards ‘.’. • P starts and ends with a character from Σ. • Any subpattern of P (i.e., a substring starting and ending with a character from Σ) containing exactly L non-wildcard characters has length at most W. • Ex. For L=3 and W=5: AF..CH..E
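The definition can be checked mechanically. The following is a minimal sketch of such a check; the function name `is_lw_pattern` is made up for illustration, and patterns are assumed to use single characters with ‘.’ as the only wild-card.

```python
def is_lw_pattern(pattern: str, L: int, W: int, wildcard: str = ".") -> bool:
    """Check the (L,W) condition from the slide for a single-character pattern."""
    # The pattern must start and end with a real character, not a wild-card.
    if not pattern or pattern[0] == wildcard or pattern[-1] == wildcard:
        return False
    # Positions of all non-wildcard characters.
    residues = [i for i, c in enumerate(pattern) if c != wildcard]
    # Every run of L consecutive non-wildcards must span at most W positions.
    # (A pattern with fewer than L non-wildcards passes vacuously.)
    for first, last in zip(residues, residues[L - 1:]):
        if last - first + 1 > W:
            return False
    return True

print(is_lw_pattern("AF..CH..E", L=3, W=5))  # True
print(is_lw_pattern("AF..CH..E", L=3, W=4))  # False
```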
Algorithm • Idea: If a pattern P is an (L,W) pattern occurring in at least K sequences, then its subpatterns are also (L,W) patterns occurring in at least K sequences. • Necessary condition: K >= 2 • P is more specific than Q if Q can be obtained from P by removing characters from P and/or replacing non-wildcard characters with wildcard characters. • Ex: AB.CD.E is more specific than AB..D.
Two Phases • The algorithm works in two phases. • Scanning phase: finds all (L,W) patterns containing exactly L non-wildcards that occur in at least K sequences. • Pruned exhaustive search: • find a short pattern that appears in at least K input sequences • extend it as long as the support does not drop below K • once a pattern cannot be extended further, it is maximal and can be written to the output.
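The scanning phase can be sketched as a brute-force enumeration. This is a simplified illustration rather than the actual TEIRESIAS implementation; the function name `scan` and the toy input are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def scan(sequences, L, W, K):
    """Enumerate patterns with exactly L non-wildcards and length at most W
    that occur in at least K sequences. Returns {pattern: supporting sequences}."""
    support = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for start in range(len(seq)):
            window = seq[start:start + W]
            # The first position is always kept; choose the other L-1 positions.
            for rest in combinations(range(1, len(window)), L - 1):
                keep = (0,) + rest
                last = keep[-1]
                pattern = "".join(
                    window[i] if i in keep else "." for i in range(last + 1)
                )
                support[pattern].add(idx)
    return {p: len(s) for p, s in support.items() if len(s) >= K}

print(scan(["ABC", "AXC"], L=2, W=3, K=2))  # {'A.C': 2}
```

The enumeration is exponential in L for a fixed W, which is acceptable here because elementary patterns are short by construction.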
Convolution Phase • For each elementary pattern P, try to extend the pattern with other elementary patterns. • Extend pattern P: • While there exists an elementary pattern Q that can be glued to the left side of P: • Take the Q that is largest in suffix ordering. • Let R be the pattern resulting from gluing Q to the left side of P. • If pattern R has at least K occurrences and is maximal with respect to the set of already reported patterns: • Try to extend pattern R with other elementary patterns. • If pattern R has the same number of occurrences as pattern P, then P is not maximal and we do not need to search for other extensions of P. • Otherwise pattern R is not a significant pattern. • Repeat the same process for the elementary patterns that can be glued to the right side of P. • Report pattern P.
Demonstration • http://cbcsrv.watson.ibm.com/Ttwpd.html • Example for convolution phase: • QK...LLI.K.PFQ...R.I • FQ...R.IAQ..K.D.R • QK...LLI.K.PFQ...R.I.AQ..K.D.R
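The gluing step of the convolution phase can be sketched in a few lines. The overlap rule used here (two patterns can be glued when the suffix of the left one and the prefix of the right one, each containing exactly L-1 non-wildcards, are identical) is how the operation is commonly described; the slides do not spell it out, so treat it as an assumption. All function names and the toy patterns are hypothetical.

```python
def overlap_prefix(pattern, n, wildcard="."):
    """Shortest prefix of `pattern` containing exactly n non-wildcard characters."""
    count = 0
    for i, c in enumerate(pattern):
        if c != wildcard:
            count += 1
            if count == n:
                return pattern[: i + 1]
    return None  # fewer than n non-wildcards (or n < 1)

def overlap_suffix(pattern, n, wildcard="."):
    """Shortest suffix of `pattern` containing exactly n non-wildcard characters."""
    reversed_prefix = overlap_prefix(pattern[::-1], n, wildcard)
    return None if reversed_prefix is None else reversed_prefix[::-1]

def glue(left, right, L, wildcard="."):
    """Glue `right` onto the right side of `left`; requires L >= 2.
    Returns the combined pattern, or None if the overlaps differ."""
    suffix = overlap_suffix(left, L - 1, wildcard)
    prefix = overlap_prefix(right, L - 1, wildcard)
    if suffix is None or suffix != prefix:
        return None
    return left + right[len(prefix):]

print(glue("AB.C", "B.CD", L=3))  # AB.CD
print(glue("AB.C", "A.CD", L=3))  # None
```

Gluing on the left side of a pattern P is the same call with the arguments swapped, i.e. `glue(Q, P, L)`.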
Snapshots(Contd….) • For L=2 W=3 K=2 • For L=2 W=4,5,6,7,8 K=2 • For L=2 W=9…. K=2
Pattern Discovery Approaches • Different Pattern Discovery Approaches • Depth First Approach of PRATT
Other Approaches • Sequence pattern discovery • Structural pattern discovery  • Enumeration (Brute Force) • Pruning (Divide-n-conquer) • NP hard – machine learning
What is PRATT? • Pattern discovery software • Uses pattern graphs • Uses a depth-first algorithm
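Depth First Search in Pattern Discovery • Sequences: abb, aab, bab • Minimum support K = 2 • Search tree (root: empty pattern): • a (supp=3) → aa (supp=1), ab (supp=3) • ab (supp=3) → aba (supp=0), abb (supp=1) • b (supp=3) → ba (supp=1), bb (supp=1) • Result is ab, b and a.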
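The search on this slide can be reproduced with a short depth-first enumeration. This is only a sketch for plain substring patterns (no wild-cards or ambiguous positions), so it is far simpler than what PRATT actually searches; the function names are made up for the example.

```python
def support(pattern, sequences):
    """Number of sequences that contain `pattern` as a substring."""
    return sum(pattern in seq for seq in sequences)

def depth_first_patterns(sequences, k):
    """Depth-first enumeration of all substrings with support >= k."""
    alphabet = sorted(set("".join(sequences)))
    frequent = []

    def extend(pattern):
        for ch in alphabet:
            candidate = pattern + ch
            # Prune: if the candidate is infrequent, so is every extension of it.
            if support(candidate, sequences) >= k:
                frequent.append(candidate)
                extend(candidate)

    extend("")
    return frequent

print(depth_first_patterns(["abb", "aab", "bab"], k=2))  # ['a', 'ab', 'b']
```

The pruning step is what keeps the search small: a branch is abandoned as soon as its support falls below K, as happens for aa, ba, bb, aba and abb in the slide's example.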
Advantages • Fast on average inputs • Finds maximal patterns  • Practically linear time algorithm
SPLASH: Structural Pattern Localization Analysis by Sequential Histograms • Pattern discovery is usually reduced to an enumeration and verification problem or a multiple alignment problem. • Both classes of problems are NP-hard, so most of the proposed solutions use heuristics or ad hoc constraints to discover patterns effectively.
E.g.: • Probabilistic algorithms such as MEME maximize a likelihood function. • Enumeration algorithms such as PRATT limit the maximum size of discovered patterns to avoid exponential requirements on system memory. • Splash is a deterministic pattern discovery algorithm that can find sparse amino acid or nucleic acid patterns matching identically in a set of protein or DNA sequences.
Splash can deal with very general patterns that are defined through arbitrary homology metrics. This means Splash is not limited to detecting identity in signals but can just as easily detect similarity.
Pattern discovery by Splash • Given a set of protein or DNA sequences A1, A2, …, An, Splash will discover patterns of the form T (T ∪ ‘.’)* T, where T, called a token, is an amino acid, a nucleic acid, or a class of amino acids, and ‘.’ is a wild-card character. • E.g.: String 1: A L C A L F A A G S K Q • String 2: K C A Q W S G G R N P S • Pattern: CA.[FW]..G
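The example pattern on this slide happens to be valid regular-expression syntax, so its occurrences in the two strings can be located directly. The sketch below only verifies a given pattern; it does not discover patterns the way Splash does, and `find_occurrences` is a hypothetical helper name.

```python
import re

sequences = ["ALCALFAAGSKQ", "KCAQWSGGRNPS"]  # the two strings from the slide
pattern = "CA.[FW]..G"

def find_occurrences(pattern, sequences):
    """Return (sequence index, start offset, matched text) for every match."""
    regex = re.compile(pattern)
    return [
        (i, m.start(), m.group())
        for i, seq in enumerate(sequences)
        for m in regex.finditer(seq)
    ]

print(find_occurrences(pattern, sequences))
# [(0, 2, 'CALFAAG'), (1, 1, 'CAQWSGG')]
```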
Constraints: • Minimum support: there are two choices: • a) The pattern must occur at least j0 times in the set of sequences. • b) The pattern must occur in at least j0 independent sequences. • Density constraint: patterns must have at least k0 matching tokens in each substring of length w0 that starts with a token. These parameters can be set independently.
Identical matches: either one or two characters in the pattern must match identically. • Length: patterns are reported only if they have at least l0 tokens.
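The density constraint can be illustrated with a small check. This is one possible reading of the constraint, assuming that only windows of the full length w0 are tested and that a bracketed class such as [FW] counts as one token; how Splash itself treats windows at the end of a pattern is not stated in the slides. The function names are hypothetical.

```python
import re

def tokenize(pattern):
    """Split a pattern into positions; a bracketed class counts as one position."""
    return re.findall(r"\[[^\]]+\]|.", pattern)

def satisfies_density(pattern, k0, w0, wildcard="."):
    """True if every full-length window of w0 positions that starts with a
    token contains at least k0 tokens (assumed reading of the constraint)."""
    elements = tokenize(pattern)
    for start, element in enumerate(elements):
        window = elements[start:start + w0]
        # Skip windows that start with a wild-card or run past the pattern end.
        if element == wildcard or len(window) < w0:
            continue
        if sum(e != wildcard for e in window) < k0:
            return False
    return True

print(satisfies_density("CA.[FW]..G", k0=2, w0=4))  # True
print(satisfies_density("CA.[FW]..G", k0=3, w0=4))  # False
```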
Algorithm: • An initial density constraint (k0, lmin) and minimum support j0 are chosen. • How it works: • Splash uses the MOTIF algorithm as its starting point and combines it with a maximality principle. • It works as follows: • 1) Enumerate all L-tuples of amino acids that appear in the input set and whose first and last elements are at most W positions apart. The L-tuples whose number of instances exceeds the threshold are used as anchor regions to induce local alignment patterns. This is the principle of MOTIF.
2) If fewer than n0 patterns are found, decrease the density constraint while progressively increasing the value of l0. 3) If the value of lmax is exceeded without discovering at least n0 patterns, the minimum support j0 is decreased and the procedure is repeated. 4) If a predefined support threshold jmin is reached without any pattern being discovered, the procedure is halted and no pattern is reported.
Note: Patterns are reported only if their z-score is greater than or equal to a predefined threshold z0. The z-score, computed by Splash as a measure of a pattern's statistical significance, is the number of standard deviations by which the observed number of occurrences deviates from the mean number of patterns of that type expected in a randomized database.
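As a rough illustration of the z-score, the mean and standard deviation can be estimated from pattern counts in a handful of randomized databases. This empirical estimate is an assumption made for the example; the slides do not say how Splash obtains the expected value, and the counts below are invented.

```python
from statistics import mean, pstdev

def z_score(observed, random_counts):
    """Standard deviations between the observed count and the mean count
    seen in randomized databases (empirical estimate, for illustration)."""
    mu = mean(random_counts)
    sigma = pstdev(random_counts)
    if sigma == 0:
        raise ValueError("randomized counts have zero variance")
    return (observed - mu) / sigma

# Hypothetical counts of one pattern type in eight shuffled databases.
random_counts = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(10, random_counts))  # 2.5
```

A pattern would then be reported only if this value is at least the chosen threshold z0.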
Performance: A comparison with PRATT
Applications: • Exhaustive Motif discovery • Hierarchical Motif discovery • Remote Homology Detection • Analysis of data from gene expression arrays • Phylogeny • The analysis of promoter regions • Analysis and prediction of protein secondary and tertiary structure.
Exhaustive Motif discovery • Splash can be used to exhaustively analyze a sequence database for all non-overlapping motifs that are statistically significant. This is useful for identifying, in order of relative sequence support, all regions of a protein family that have been preserved by evolution and may therefore play a structural role.
Comparison with TEIRESIAS: 1) Teiresias takes exponential time on sparse patterns, a disadvantage that Splash overcomes. 2) Teiresias enumerates only patterns consistent with the data set. Splash is not limited to a fixed alphabet size. 3) Patterns have to match identically in Teiresias; they do not have to in Splash, as it uses a homology metric rather than a distance metric.
Pattern recognition techniques are used in: • Text mining • Protein structure characterization and prediction • Promoter signal detection • Gene Expression analysis
Tools Provided by IBM Bioinformatics and Pattern Discovery Group • Protein Annotation w/Biodictionary • Gene Expression analysis • Sequence pattern discovery • Multiple sequence alignment • Gene discovery • Motif Discovery
Protein annotation w/Biodictionary • An important task is to determine a sequence's membership in a protein family, metal-binding sites, domains of the amino acid sequence, and structural conformation such as helix or turn • It uses the TEIRESIAS algorithm
Input • The sequence is entered in FASTA format • The query is searched against the patterns available in a database called the Biodictionary
Output • Plot of the similarities that the query sequence has with other sequences in the database, in descending order • Features such as active sites, binding sites, modified sites, signals and various domains that can be identified in the processed query