
Sampling Approaches to Pattern Extraction

Lecture for CS397-CXZ Algorithms in Bioinformatics, April 16, 2004. ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign.


Presentation Transcript


  1. Sampling Approaches to Pattern Extraction (Lecture for CS397-CXZ Algorithms in Bioinformatics), April 16, 2004. ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign

  2. Pattern Extraction: Probabilistic vs. Combinatorial
     Problem: Find common patterns (motifs) in sequences S = {s_1, …, s_N}
     Probabilistic
     • Motif M = probabilistic model p(S|M)
     • M matches every sequence, but with different probabilities
     • E.g., M = {p(x|i)}, i = 1, …, w (width), where p(x|i) = probability that symbol x occurs in position i
     • Task: Find the best M and the matching positions in each s_i
     • Best = the M for which p(S|M) is highest
     Combinatorial
     • Motif M = deterministic pattern
     • M either matches a sequence or does not
     • E.g., M = AT..G
     • Task: Find the best M's
     • Best = highly frequent

  3. Probabilistic Pattern Extraction
     • Motif M = probabilistic model of sequences, p(Seq|M)
     • M matches every sequence, but with different probabilities
     • E.g., M = {p(x|i)}, i = 1, …, w (width)
     • p(x|i) = probability that symbol x occurs in position i
     • Task = find the best M and the matching positions in each s_i; "best" = p(S|M) is the highest
     • Method = search for the best model
     • Sampling is an efficient way of searching

  4. Position Weighted Matrix (PWM)
     • Essentially a simple linear HMM: one state per position 1, 2, 3, …, w-1, w, with every transition probability equal to 1.0
     • Parameters: q_ij = p(symbol j | position i)
     • E.g., q_1A = q_1G = 0.5, q_1C = q_1T = 0; q_9C = 1.0, q_9A = q_9G = q_9T = 0
     • Covers a deterministic pattern such as AT.G as a special case, with the following Q matrix:

            1     2     3     4
        A   1.0   0     0.25  0
        T   0     1.0   0.25  0
        C   0     0     0.25  0
        G   0     0     0.25  1.0
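
To make the special case concrete, the following minimal sketch stores the AT.G matrix from the slide and multiplies the per-position probabilities of a candidate segment. The names `Q` and `segment_probability` are illustrative choices, not part of the lecture.

```python
# Q matrix for the pattern AT.G, copied from the slide: one dict per position.
Q = [
    {"A": 1.0, "T": 0.0, "C": 0.0, "G": 0.0},
    {"A": 0.0, "T": 1.0, "C": 0.0, "G": 0.0},
    {"A": 0.25, "T": 0.25, "C": 0.25, "G": 0.25},
    {"A": 0.0, "T": 0.0, "C": 0.0, "G": 1.0},
]


def segment_probability(segment, q):
    """Return the product over positions i of q[i][segment[i]]."""
    if len(segment) != len(q):
        raise ValueError("segment length must equal the motif width")
    prob = 1.0
    for i, symbol in enumerate(segment):
        prob *= q[i][symbol]
    return prob
```

Under this matrix any segment of the form AT.G gets probability 0.25 (the wildcard column spreads its mass evenly), and any segment that violates a fixed position gets probability 0.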

  5. Discovering a PWM from Sequences
     • Given
       - a set of sequences S = {s_1, …, s_N}
       - a pattern width w (e.g., 10)
     • Discover the most discriminative PWM M, i.e., the M that maximizes p(S|M)/p(S|Background)
       - p(S|M) = p(s_1|M) … p(s_N|M) (roughly!)
       - A prior could be incorporated by maximizing the posterior probability of M
     • How to discover it?
       - Using an HMM training algorithm? (Not all observations are relevant)
       - Gibbs sampler

  6. Gibbs Sampler: Basic Idea
     • Introduce an auxiliary variable a_k to record the position of the pattern in sequence s_k
     • Randomly choose initial positions a_k
     • Iterate with the following two steps
       - Predictive update: use the current positions to estimate the model q_ij
       - Sampling: use the current model to improve the position in one sequence (e.g., a_k)
         - Take one sequence and compute the probability ratio p(x|i)/p(x|Background) for each position
         - Sample a position with probability proportional to its ratio
         - In general, we get a high-ratio position, but not always the highest
     • Observations
       - If a position is improved, then the model will be improved
       - If the model is improved, then all the positions will also be improved
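
The sampling step can be illustrated in isolation. The sketch below assumes the ratios for one sequence have already been computed; `normalize` and `sample_position` are hypothetical helper names, and the ratio values in the usage note are made up for illustration.

```python
import random


def normalize(ratios):
    """Turn non-negative ratios into probabilities that sum to 1."""
    total = sum(ratios)
    return [r / total for r in ratios]


def sample_position(ratios, rng):
    """Draw one position index with probability proportional to its ratio."""
    probs = normalize(ratios)
    return rng.choices(range(len(probs)), weights=probs)[0]
```

With hypothetical ratios `[0.1, 8.0, 0.5, 1.4]`, position 1 is drawn about 80% of the time, so repeated draws usually return the high-ratio position but occasionally return another one. That occasional deviation is what lets the sampler move away from a poor local choice.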

  7. Gibbs Sampler: Details of One Iteration
     • At every step, take one sequence out (e.g., sequence z) for position improvement
     • Use the rest to estimate two models, q_ij and p_j (background)
       - q_ij is estimated from the matching segments at the current positions
       - p_j is estimated from all other regions of these sequences (negative model)
     • For each position i in the sequence taken out, compute the probability ratio
     • Normalize the ratios to get probabilities, and choose a position stochastically according to those probabilities
     • Change the current position for sequence z to the new position obtained
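
The steps of one iteration can be sketched as follows. This is a minimal illustration, not the lecture's own implementation: the function names, the `ACGT` alphabet ordering, and the flat pseudocount of 0.5 are all assumptions made here so that no probability is exactly zero.

```python
import random

ALPHABET = "ACGT"


def estimate_models(seqs, starts, w, held_out, pseudo=0.5):
    """Estimate q (motif) and p (background) from every sequence except held_out."""
    q_counts = [{x: pseudo for x in ALPHABET} for _ in range(w)]
    p_counts = {x: pseudo for x in ALPHABET}
    for k, seq in enumerate(seqs):
        if k == held_out:
            continue
        for pos, x in enumerate(seq):
            offset = pos - starts[k]
            if 0 <= offset < w:
                q_counts[offset][x] += 1  # inside the current motif segment
            else:
                p_counts[x] += 1          # everything else counts as background
    q = [{x: col[x] / sum(col.values()) for x in ALPHABET} for col in q_counts]
    p_total = sum(p_counts.values())
    p = {x: p_counts[x] / p_total for x in ALPHABET}
    return q, p


def position_ratios(seq, q, p, w):
    """Ratio of motif to background probability for each candidate start."""
    ratios = []
    for a in range(len(seq) - w + 1):
        r = 1.0
        for i in range(w):
            x = seq[a + i]
            r *= q[i][x] / p[x]
        ratios.append(r)
    return ratios


def gibbs_iteration(seqs, starts, w, z, rng):
    """Resample the motif start of sequence z in place; return the probabilities used."""
    q, p = estimate_models(seqs, starts, w, held_out=z)
    ratios = position_ratios(seqs[z], q, p, w)
    total = sum(ratios)
    probs = [r / total for r in ratios]
    starts[z] = rng.choices(range(len(probs)), weights=probs)[0]
    return probs
```

A full sampler would call `gibbs_iteration` repeatedly, cycling z over all sequences, starting from randomly chosen positions.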

  8. Estimation of q_ij and p_j
     • q_ij is estimated from the sequence segments at the current "matching positions" a_1, …, a_N
     • p_j is estimated from the "non-matching regions" of all the sequences (relative frequency)
     • In general, smoothing is necessary: the estimate of q_ij combines c_ij, the total count of symbol j in relative position i, with pseudocounts
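
The transcript does not preserve the slide's formula. One common way to write a pseudocount-smoothed estimate is sketched below. The pseudocounts b_j, their sum B, and the background counts n_j (occurrences of symbol j in the non-matching regions) are symbols assumed here for illustration. The N - 1 reflects that one sequence is held out, so N - 1 segments contribute to each column.

```latex
\begin{align*}
q_{ij} &= \frac{c_{ij} + b_j}{(N - 1) + B}, \qquad B = \sum_{j} b_j \\
p_j &= \frac{n_j + b_j}{\sum_{j'} n_{j'} + B}
\end{align*}
```

Setting every b_j to 0 recovers the unsmoothed relative frequencies, which can assign probability 0 to a symbol that happens not to occur in a column.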

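  9. Example of Estimating q_ij
     • N = 6, W = 10, without smoothing
     • q_1A = 3/5, q_2G = 2/5, …, q_1G = 0
     • The denominator 5 is consistent with holding one of the N = 6 sequences out, so that 5 segments contribute to each column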

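  10. Example of Computing the Ratios
     • Ratio = probability of a candidate segment under the motif model q_ij divided by its probability under the background model p_j
     • Select a set of a_k's that maximizes the product of these ratios, or, in log form, F:
       $$F = \sum_{1 \le i \le W} \sum_{j \in \{A,T,G,C\}} c_{i,j} \log\left(\frac{q_{ij}}{p_j}\right)$$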
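
Taking the logarithm of the product of per-symbol ratios q_ij/p_j over all selected segments groups identical terms by position i and symbol j, which gives the count-weighted sum F. The sketch below computes F from a count matrix; the function name and the dict-based layout are assumptions for illustration. Terms with a zero count are skipped, which avoids log(0) when q_ij is 0 without smoothing.

```python
import math

ALPHABET = "ATGC"


def objective_f(counts, q, p):
    """Compute F = sum_i sum_j c[i][j] * log(q[i][j] / p[j])."""
    total = 0.0
    for c_i, q_i in zip(counts, q):
        for j in ALPHABET:
            if c_i[j] > 0:  # a zero count contributes nothing to the sum
                total += c_i[j] * math.log(q_i[j] / p[j])
    return total
```

For a single column in which all four aligned segments show A, with q_1A = 1.0 and a uniform background of 0.25, F equals 4 log 4.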
