Sampling Approaches to Pattern Extraction

Sampling Approaches to Pattern Extraction (Lecture for CS397-CXZ Algorithms in Bioinformatics) April 16, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

Pattern Extraction: Probabilistic vs. Combinatorial
Problem: Find common patterns (motifs) in sequences S = {s_1, …, s_N}
Probabilistic
• Motif M = probabilistic model p(S|M)
• M matches every sequence, but with different probabilities
• E.g., M = {p(x|i)}, i = 1, …, w (width), where p(x|i) = probability that symbol x occurs in position i
• Task: Find the best M and the matching positions in each s_i
• Best = p(S|M) is the highest
Combinatorial
• Motif M = deterministic pattern
• M either matches a sequence or not
• E.g., M = AT..G
• Task: Find the best M's
• Best = highly frequent
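To make the combinatorial side concrete, here is a minimal sketch that counts how many sequences a deterministic pattern such as AT..G matches. The function name `combinatorial_support` and the toy sequences are assumptions made for illustration; they are not from the lecture.

```python
import re

def combinatorial_support(pattern, sequences):
    """Count how many sequences contain the deterministic pattern.

    A '.' in the pattern matches any single symbol, as in AT..G.
    """
    regex = re.compile(pattern)
    return sum(1 for s in sequences if regex.search(s))

# Hypothetical toy sequences, for illustration only.
seqs = ["CCATGAGTT", "ATCCGAAA", "GGGGGGGG", "TTATTTGC"]
support = combinatorial_support("AT..G", seqs)
print(support)
```

A combinatorial method would rank candidate patterns by this kind of support count, whereas a probabilistic motif assigns every sequence some probability instead of a yes/no match.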

Probabilistic Pattern Extraction
• Motif M = probabilistic model of sequences p(Seq|M)
• M matches every sequence, but with different probabilities
• E.g., M = {p(x|i)}, i = 1, …, w (width)
• p(x|i) = probability that symbol x occurs in position i
• Task = Find the best M and the matching positions in each s_i; "best" = p(S|M) is the highest
• Method = Search for the best model
• Sampling is an efficient way of searching

Position Weighted Matrix (PWM)
[Figure: linear chain of states for positions 1, 2, 3, …, w-1, w; every transition has probability 1.0]
• Essentially a simple linear HMM
• Parameters: q_ij = p(symbol j | position i)
• E.g., q_1A = q_1G = 0.5; q_1C = q_1T = 0
• E.g., q_9C = 1.0; q_9A = q_9G = q_9T = 0
• Covers a deterministic pattern such as AT.G as a special case, with the following Q matrix (rows = symbols, columns = positions):
        1      2      3      4
  A    1.0     0     0.25    0
  T     0     1.0    0.25    0
  C     0      0     0.25    0
  G     0      0     0.25   1.0
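The Q matrix for AT.G can be written down directly and used to score a segment. The following is a minimal sketch; the list-of-dicts layout and the helper name `segment_prob` are illustrative assumptions, not part of the lecture.

```python
# PWM for the deterministic pattern AT.G, one dict per position.
Q = [
    {"A": 1.0, "T": 0.0, "C": 0.0, "G": 0.0},      # position 1: always A
    {"A": 0.0, "T": 1.0, "C": 0.0, "G": 0.0},      # position 2: always T
    {"A": 0.25, "T": 0.25, "C": 0.25, "G": 0.25},  # position 3: any symbol
    {"A": 0.0, "T": 0.0, "C": 0.0, "G": 1.0},      # position 4: always G
]

def segment_prob(Q, segment):
    """p(segment | M): product of q_ij over the positions of the segment."""
    if len(segment) != len(Q):
        raise ValueError("segment length must equal the motif width")
    prob = 1.0
    for column, symbol in zip(Q, segment):
        prob *= column[symbol]
    return prob

print(segment_prob(Q, "ATCG"))  # matches AT.G
print(segment_prob(Q, "TTCG"))  # does not match AT.G
```

Every segment that matches AT.G gets the same nonzero probability and every other segment gets zero, which is the sense in which the deterministic pattern is a special case of the PWM.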

Discovering a PWM from Sequences
• Given
  • a set of sequences S = {s_1, …, s_N}
  • a pattern width w (e.g., 10)
• Discover the most discriminative PWM M, i.e., the M that maximizes p(S|M)/p(S|Background)
  • p(S|M) = p(s_1|M) … p(s_N|M) (roughly!)
  • A prior could be incorporated by maximizing the posterior probability of M
• How to discover it?
  • Using an HMM training algorithm? (not all observations are relevant)
  • Gibbs Sampler

Gibbs Sampler: Basic Idea
• Introduce an auxiliary variable a_k to record the position of the pattern in sequence s_k
• Randomly choose initial positions a_k
• Iterate with the following two steps
  • Predictive update: Use the current positions to estimate the model q_ij
  • Sampling: Use the current model to improve the position in one sequence (e.g., a_k)
    • Take one sequence and compute the probability ratio p(x|i)/p(x|Background) for each position
    • Sample a position based on the ratio weight
    • In general, we get a high-ratio position, but not always the highest
• Observations
  • If a position is improved, then the model will be improved
  • If the model is improved, then all the positions will also be improved
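The sampling step draws a position in proportion to its ratio rather than always taking the maximum. The sketch below illustrates that behaviour with made-up ratio values; the function name `sample_position` and the numbers are assumptions for illustration only.

```python
import random

def sample_position(ratios, rng=random):
    """Draw an index with probability proportional to its ratio."""
    total = sum(ratios)
    weights = [r / total for r in ratios]  # normalize the ratios to probabilities
    return rng.choices(range(len(ratios)), weights=weights, k=1)[0]

rng = random.Random(0)               # fixed seed so the run is repeatable
ratios = [0.1, 0.2, 8.0, 1.5, 0.2]   # hypothetical ratio values
draws = [sample_position(ratios, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(ratios))]
print(counts)
```

Index 2 carries most of the weight, so it is drawn most often, but lower-ratio positions are still chosen occasionally. That occasional non-greedy choice is what lets the sampler move away from a poor initial alignment.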

Gibbs Sampler: Details of One Iteration
• At every step, take one sequence out (e.g., sequence z) for position improvement
• Use the rest to estimate two models, q_ij and p_j (background)
  • q_ij is estimated from the matching segments at the current positions
  • p_j is estimated from all other regions of these sequences (negative model)
• For each position i in the sequence taken out, compute the probability ratio
• Normalize the ratios to get probabilities and choose a position stochastically according to these probabilities
• Change the current position for sequence z to the new position obtained
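The iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: the function names, the pseudocount of 0.5 per symbol, the fixed iteration count, and the toy sequences are all hypothetical choices.

```python
import random

ALPHABET = "ACGT"

def estimate_models(seqs, positions, w, skip, pseudo=0.5):
    """Estimate q[i][j] from the current segments and p[j] from all other
    regions, ignoring the held-out sequence `skip`. Pseudocounts smooth both."""
    q_counts = [{c: pseudo for c in ALPHABET} for _ in range(w)]
    p_counts = {c: pseudo for c in ALPHABET}
    for k, (s, a) in enumerate(zip(seqs, positions)):
        if k == skip:
            continue
        for idx, c in enumerate(s):
            if a <= idx < a + w:
                q_counts[idx - a][c] += 1   # inside the matching segment
            else:
                p_counts[c] += 1            # background region
    q = []
    for col in q_counts:
        total = sum(col.values())
        q.append({c: col[c] / total for c in ALPHABET})
    p_total = sum(p_counts.values())
    p = {c: p_counts[c] / p_total for c in ALPHABET}
    return q, p

def position_ratios(seq, q, p, w):
    """Probability ratio (motif vs. background) for every start position."""
    ratios = []
    for x in range(len(seq) - w + 1):
        r = 1.0
        for i in range(w):
            c = seq[x + i]
            r *= q[i][c] / p[c]
        ratios.append(r)
    return ratios

def gibbs_sampler(seqs, w, iterations=200, seed=0):
    rng = random.Random(seed)
    # Random initial positions a_k.
    positions = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):          # take sequence z out
            q, p = estimate_models(seqs, positions, w, skip=z)
            ratios = position_ratios(seqs[z], q, p, w)
            # random.choices normalizes the weights internally.
            positions[z] = rng.choices(range(len(ratios)), weights=ratios, k=1)[0]
    return positions

# Hypothetical toy sequences, for illustration only.
seqs = ["TTGACGTCAT", "ACGTCAGGTT", "GGTTACGTCA", "CACGTCATGG"]
positions = gibbs_sampler(seqs, w=6, iterations=50)
print(positions)
```

Because the procedure is stochastic, different seeds can give different final positions; in practice one would typically run it several times and keep the alignment with the best score.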

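Estimation of q_ij and p_j
• q_ij is estimated from the sequence segments at the current "matching positions" a_1, …, a_N
• p_j is estimated from the "non-matching regions" of all the sequences (relative frequency)
• In general, smoothing is necessary:
  q_ij = (c_ij + b_j) / (N - 1 + B)
  where c_ij = total count of symbol j in relative position i, b_j = pseudocount for symbol j, and B = sum of the pseudocounts over all symbols; N - 1 sequences contribute because one sequence is held out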

Example of Estimating q_ij
N = 6, W = 10, without smoothing (one sequence is held out, so each estimate is a count out of N - 1 = 5):
q_1A = 3/5, q_2G = 2/5, …, q_1G = 0
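The effect of smoothing on a single column can be sketched in a few lines. The column of five symbols below is hypothetical: it is chosen only so that the unsmoothed estimates agree with the values q_1A = 3/5 and q_1G = 0 quoted above. The pseudocount of 0.5 per symbol and the function name `estimate_column` are likewise assumptions.

```python
from collections import Counter

def estimate_column(symbols, pseudocount=0.0, alphabet="ATGC"):
    """Estimate q_ij for one relative position i from the symbols observed
    there, adding the same pseudocount to every symbol."""
    counts = Counter(symbols)
    total = len(symbols) + pseudocount * len(alphabet)
    return {j: (counts[j] + pseudocount) / total for j in alphabet}

# Hypothetical first-position symbols of the 5 segments still in use.
column = ["A", "A", "A", "T", "C"]

raw = estimate_column(column)                         # no smoothing
smoothed = estimate_column(column, pseudocount=0.5)   # pseudocount 0.5 per symbol

print(raw["A"], raw["G"])
print(smoothed["A"], smoothed["G"])
```

Without smoothing, G gets probability zero, so any candidate segment starting with G would have a ratio of exactly zero and could never be sampled. With pseudocounts, unseen symbols keep a small nonzero probability.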

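Example of Computing the Ratios
• Ratio for a candidate start position x = Π_{1≤i≤W} q_{i,r_i} / p_{r_i}, where r_i is the symbol at relative position i of the segment starting at x
• Select a set of a_k's that maximizes the product of these ratios or, equivalently, the log-scale score F
• F = Σ_{1≤i≤W} Σ_{j∈{A,T,G,C}} c_ij log(q_ij / p_j)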
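As a rough illustration of the score F, the sketch below computes it for a set of already chosen segments. Several things here are assumptions rather than part of the lecture: the function name `alignment_score`, the uniform background, the pseudocount of 0.5, the toy segments, and the choice to estimate q_ij from all of the given segments.

```python
import math

def alignment_score(segments, p, pseudocount=0.5, alphabet="ATGC"):
    """F = sum over positions i and symbols j of c_ij * log(q_ij / p_j)."""
    w = len(segments[0])
    n = len(segments)
    B = pseudocount * len(alphabet)          # total pseudocount mass
    F = 0.0
    for i in range(w):
        for j in alphabet:
            c_ij = sum(1 for seg in segments if seg[i] == j)
            q_ij = (c_ij + pseudocount) / (n + B)   # smoothed estimate, always > 0
            F += c_ij * math.log(q_ij / p[j])
    return F

p = {j: 0.25 for j in "ATGC"}          # hypothetical uniform background
aligned = ["ATCG", "ATTG", "ATAG"]     # segments that share the pattern AT.G
shuffled = ["ATCG", "TGAT", "GCAT"]    # segments with little in common

print(alignment_score(aligned, p))
print(alignment_score(shuffled, p))
```

The well-aligned set scores higher because its columns are concentrated on one symbol, which pushes q_ij far above the background p_j in exactly the cells where the counts c_ij are large.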
