Discovering Motif Patterns Using Haplotype Models for Genetic Association Studies

Finding Bit Patterns: Applying haplotype models to association study design. Natalie Castellana, Kedar Dhamdhere, Russell Schwartz. August 16, 2005

Problem: Applying haplotype models • Input: a matrix of binary rows, e.g. 10000010100010010 00010100101101001 01101101001000010 10101011111000010 • Output: a set of recurring patterns of the form (start column, end column, pattern), e.g. (14,17,“0010”)
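The input/output relationship on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' algorithm: it counts every fixed-position substring and keeps those seen in several rows. The function name `recurring_motifs` and the thresholds `min_len` and `min_rows` are assumptions made for the example.

```python
from collections import Counter

# The four input rows shown on the slide.
rows = [
    "10000010100010010",
    "00010100101101001",
    "01101101001000010",
    "10101011111000010",
]

def recurring_motifs(rows, min_len=4, min_rows=3):
    """Return {(start, end, pattern): row count} for patterns seen in >= min_rows rows.

    Columns are 1-indexed and inclusive, matching the slide's notation.
    min_len and min_rows are hypothetical thresholds, not from the slides.
    """
    n = len(rows[0])
    counts = Counter()
    for start in range(1, n + 1):
        for end in range(start + min_len - 1, n + 1):
            for row in rows:
                counts[(start, end, row[start - 1:end])] += 1
    return {motif: c for motif, c in counts.items() if c >= min_rows}

motifs = recurring_motifs(rows)
print(motifs[(14, 17, "0010")])  # number of rows containing 0010 in columns 14-17
```

With these rows, the slide's example pattern (14,17,“0010”) appears in three of the four rows; the second row has 1001 in those columns.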

Background • Terms: haplotype, SNP, major allele, minor allele, association test. • Example sequence: 1000011010110100000010 • Association test: given that this sample has haplotype 1101, does it have the disease?

Genetic Variation • Sources: mutation and recombination. (Figure sequences: …1001001…, …1000001…, …1000101…, …1110011…, …1000011…, …1110101….) • Because of recombination, similar genetic variation can be found within closely linked regions.

Data Sets • Flowchart steps: download from HapMap.org; generate using MS; apply disease model (cases and controls); apply haplotype model; perform association tests. • Input: 1001001010110 1001001110100 0110010110100 1000111010010 • (Figure sequences: 10010011101 01100101101 10010010101 10001110100)

Testing individual SNPs • Go through each SNP and determine which SNPs accurately predict which samples have the disease and which do not. Case: 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 1 1 1 0 0 0 Control: 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 1
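A per-SNP test of this kind can be sketched as follows. The slides do not say which statistic was used, so the Pearson chi-square on a 2x2 allele-by-status table is an assumption here, as are the function names and the toy case/control matrices (they are not the values shown on the slide, whose row layout is not recoverable).

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:          # a row or column of the table is empty: no evidence either way
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def snp_scores(cases, controls):
    """Score each SNP column; rows are samples, 1 is assumed to mark the minor allele."""
    scores = []
    for j in range(len(cases[0])):
        case_minor = sum(row[j] for row in cases)
        ctrl_minor = sum(row[j] for row in controls)
        scores.append(chi_square_2x2(
            case_minor, len(cases) - case_minor,
            ctrl_minor, len(controls) - ctrl_minor,
        ))
    return scores

# Hypothetical toy data: SNP 0 separates cases from controls, SNPs 1 and 2 do not.
cases = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1]]
controls = [[0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1]]
scores = snp_scores(cases, controls)
print(scores)  # [8.0, 0.0, 0.0]
```

A larger score means the SNP's allele counts differ more between cases and controls; turning the score into a p-value and correcting for multiple tests is left out of this sketch.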

Haplotype block method • Instead of looking at each individual SNP, we can look at groups of contiguous SNPs. 1101000000…11… 1101100100…01… 0111000000…10… 1101100100…00…

Haplotype motif method • A sequence is treated as a concatenation of segments (as in the block method), but segment boundaries need not be conserved across sequences. 1101000000… 1100100100… 0111000000… 1101100111…

Approximation Algorithm • General idea: pick the best partition, minimizing the number of motifs needed to explain all the data. (Figure: the rows 10000100…, 00011100…, 11011110…, 01010110… with segments labeled c.)

Finding Motifs (Figure: a tree labeled C that branches on 0 and 1, with leaves 000…000, 000..100, …, 111…111.)

Problems • Really, really, really slow: took over a week to partition our biggest data set. • Added a ‘max leaves explored’ feature. • Useless for larger c.

Real Data

Simulated Data

False Positives

General Linear Program
Objective function: minimize x + y + z
Constraints:
  x + y <= 2
  x + 2z <= 5
Matrix form:
  [1 1 0]   [x]    [2]
  [1 0 2] * [y] <= [5]
            [z]
Bounds:
  0 <= x <= 3
  0 <= y <= Inf
  -Inf <= z <= 0

A Linear Program Input: A matrix with M rows and N columns Output: The minimum number of motifs.

Variables • X’s: each x corresponds to a motif. A motif is defined by a tuple: (start column, end column, string pattern). • Y’s: each y corresponds to a row partition. A row partition is defined by a set of motifs that together cover the row: {(1,e1,“…”),(e1+1,e2,“…”),...,(en+1,N,“…”)}

Constraints • Exactly one partition must be chosen per row. • If a motif used in a row partition is not chosen, then the row partition may not be chosen. • Objective: minimize the sum of all X’s.
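For very small inputs, the optimum of this program can be checked by exhaustive search, which makes the two constraints and the objective concrete. The sketch below is not the linear-programming formulation itself: it tries one partition per row (constraint 1), treats every motif used by a chosen partition as chosen (constraint 2), and counts distinct motifs (the objective). The function names and the six toy rows are assumptions for illustration.

```python
from itertools import combinations, product

def partitions(row):
    """Yield every partition of `row` as a tuple of (start, end, pattern) motifs (1-indexed, inclusive)."""
    n = len(row)
    for k in range(n):                              # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield tuple((s + 1, e, row[s:e]) for s, e in zip(bounds, bounds[1:]))

def min_motifs_brute_force(rows):
    """Choose exactly one partition per row so that the number of distinct motifs is minimal."""
    best_motifs, best_choice = None, None
    for choice in product(*(list(partitions(r)) for r in rows)):
        used = set().union(*choice)                 # motifs forced on by the chosen partitions
        if best_motifs is None or len(used) < len(best_motifs):
            best_motifs, best_choice = used, choice
    return best_motifs, best_choice

# Hypothetical toy matrix: column 1 is 0 or 1, columns 2-3 are 00, 11 or 01.
rows = ["000", "011", "001", "100", "111", "101"]
best_motifs, best_choice = min_motifs_brute_force(rows)
print(len(best_motifs))  # 5: two motifs for column 1 plus three for columns 2-3
```

The search space is the product of the per-row partition counts, so this is only usable for a handful of short rows.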

Example 10001101 • X’s: (1,1,“1”),(1,2,“10”),(1,3,“100”), etc. • Y’s: (1,1,“1”),(2,8,“0001101”) and (1,2,“10”),(3,3,“0”),(4,8,“01101”)
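The X and Y variables for this example row can be enumerated directly. The sketch below lists every motif and every partition of 10001101, encoding each partition as a bitmask of cut points; the function names are hypothetical.

```python
def motifs(row):
    """All (start, end, pattern) motifs of a row, with 1-indexed inclusive columns."""
    n = len(row)
    return [(s, e, row[s - 1:e]) for s in range(1, n + 1) for e in range(s, n + 1)]

def row_partitions(row):
    """All partitions of a row; bit k of `mask` set means 'cut after column k + 1'."""
    n = len(row)
    result = []
    for mask in range(1 << (n - 1)):
        part, start = [], 1
        for col in range(1, n + 1):
            if col == n or (mask >> (col - 1)) & 1:
                part.append((start, col, row[start - 1:col]))
                start = col + 1
        result.append(tuple(part))
    return result

row = "10001101"
xs = motifs(row)
ys = row_partitions(row)
print(len(xs), len(ys))  # 36 128
print(((1, 2, "10"), (3, 3, "0"), (4, 8, "01101")) in ys)  # True
```

For this row of length 8 there are 36 candidate motifs and 128 partitions, including the three-segment partition listed on the slide and the partition that cuts only after column 1, ((1,1,“1”),(2,8,“0001101”)).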

Constraint Matrix (1): Exactly one row partition must be chosen per row.
            all X’s                               all Y’s
            (1,1,“1”)  (1,1,“0”)  …  (1,2,“10”)   Y_1  Y_2  …
Row 1       0          0          …  0            1    1    …    = 1
Row 2       0          0          …  0            0    0    …    = 1
Row 3       0          0          …  0            1    1    …    = 1
..
Row M       0          0          …  0            …              = 1
Y_1 := (1,1,“1”),(2,8,“0001101”)
Y_2 := (1,2,“10”),(3,3,“0”),(4,8,“01101”)

Constraint Matrix (2): If a motif used in a row partition is not chosen, then the row partition may not be chosen.
                all X’s                               all Y’s
Row i:          (1,1,“1”)  (1,1,“0”)  …  (1,2,“10”)   Y_1  Y_2  …
(1,1,“1”)       1          0          …  0            -1   0    …    >= 0
(1,2,“10”)      0          0          …  1            0    -1   …    >= 0
(1,3,“100”)     0          0          …  0            0    0    …    >= 0
..              …          …          …  …            …    …    …
(8,8,“1”)       0          0          …  0            0    0         >= 0
Y_1 := (1,1,“1”),(2,8,“0001101”)
Y_2 := (1,2,“10”),(3,3,“0”),(4,8,“01101”)

Constraint Matrix
            x’s (columns 1 … K)      y’s (columns K+1 … K+P)
** Constraint 1 ** (one row per data row, each == 1)
  1         0 0 0 0 0 … 0 0 0 0      1 1 1 0 0 0 0 … 0
  2         0 0 0 0 0 … 0 0 0 0      1 0 0 1 1 1 0 … 0
  …
  M         0 0 0 0 0 … 0 0 0 0      0 0 1 0 0 0 1 … 1
** Constraint 2 ** (for each data row 1 … M, one row per motif, each >= 0)
  1: 1      1 0 0 0 0 … 0 0 0 0      -1 0 0 0 … 0 0
     2      0 1 0 0 0 … 0 0 0 0      -1 -1 0 0 … -1 0
     …
     K_1    0 0 1 0 0 … 0 0 0 0      0 0 0 0 … 0 0
  …
  M
Where K is the number of unique motifs, K_i is the number of motifs appearing in row i, and P is the number of unique partitions.

Problems • Each row has N(N+1)/2 motifs, so there will be a polynomial number of X’s. Good! • Each row can be partitioned in 2^(N-1) ways, so there will be an exponential number of Y’s. Bad! • Solution: column generation

Column generation • We find the optimal solution to a restricted problem that contains all X’s and only some of the Y’s. • Then we check whether adding any of the remaining Y’s would improve the solution; if so, we add them and solve again.
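The slides do not say how candidate Y’s are tested, so the following is a sketch under stated assumptions: the program is solved as a linear relaxation, each row has a dual value for its "exactly one partition" constraint, and each motif in that row has a non-negative dual value for its "motif must be chosen" constraint. A Y column has objective coefficient 0, coefficient +1 in its row constraint and -1 in each of its motif constraints, so under the usual convention its reduced cost is (sum of its motif duals) minus (row dual). Finding the most promising partition is then a shortest-path style dynamic program over cut points. All names and the dual values below are hypothetical; real duals would come from an LP solver, and sign conventions differ between solvers.

```python
def best_partition(row, weight):
    """Minimum-total-weight partition of `row`, found by dynamic programming.

    weight(start, end, pattern) returns the (hypothetical) dual value attached to
    that motif's constraint for this row; columns are 1-indexed and inclusive.
    """
    n = len(row)
    cost = [0.0] + [float("inf")] * n   # cost[e]: cheapest partition of columns 1..e
    back = [0] * (n + 1)                # back[e]: 0-based start of the last segment
    for e in range(1, n + 1):
        for s in range(e):
            c = cost[s] + weight(s + 1, e, row[s:e])
            if c < cost[e]:
                cost[e], back[e] = c, s
    parts, e = [], n
    while e > 0:                        # walk the back-pointers to recover the segments
        s = back[e]
        parts.append((s + 1, e, row[s:e]))
        e = s
    return cost[n], tuple(reversed(parts))

def price_row(row, weight, row_dual, tol=1e-9):
    """Return (partition, reduced cost); partition is None if no column would improve the LP."""
    total, part = best_partition(row, weight)
    reduced_cost = total - row_dual
    return (part, reduced_cost) if reduced_cost < -tol else (None, reduced_cost)

# Hypothetical duals: two motifs are already cheap, every other motif costs 1.0.
cheap = {(1, 2, "10"), (3, 4, "01")}
weight = lambda s, e, p: 0.25 if (s, e, p) in cheap else 1.0
part, rc = price_row("1001", weight, row_dual=1.0)
print(part, rc)  # ((1, 2, '10'), (3, 4, '01')) -0.5
```

The dynamic program examines N(N+1)/2 segments per row instead of all 2^(N-1) partitions, which is what makes checking the omitted Y’s tractable in this sketch.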

Where are we now? • Where are we going?
