Clustering and optimization in genetic data: the problem of Tag-SNPs selection

Paola Bertolazzi, Serena D'Aguanno, Giovanni Felici*, Paola Festa**
* Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti", CNR
** Dipartimento di Matematica e Applicazioni "R.M. Caccioppoli", Università degli Studi di Napoli "Federico II"

Summary
• Biological background
    • DNA
    • Chromosomes
    • Haplotypes and genotypes
    • SNPs
    • Haplotype analysis
• Tag SNPs selection
    • Problem definition
    • State of the art
    • Reconstruction function and linkage disequilibrium
    • Clustering techniques
    • Set covering techniques
• Computational results
• Conclusions and future work

DNA
• Structure: double helix (Watson-Crick) of two sequences of nucleotides A, T, C, G
• Base pairs (A-T, G-C) are complementary
• One DNA sequence contains regions (e.g. genes, introns, exons) located in the same position of the sequence in each individual of a species

Chromosomes
• An individual's genome is organized in chromosomes, i.e. large DNA macromolecules packaged in linear or circular shape
• In polyploid organisms multiple copies of each chromosome exist
• In diploid organisms (e.g. humans) there are two copies of each chromosome, packaged in linear shape
• Each chromosome includes hundreds of different genes
• Four-arm structure during meiosis and mitosis

Haplotypes and genotypes
• A single 'copy' of a chromosome is called a haplotype, while a description of the mixed data on the two 'copies' is called a genotype.
• H1 AATCGCCTTA (maternal chrom.)
• H2 ACACGTCTCA (paternal chrom.)
• G(H1,H2) A A/C T/A C G C/T C T T/C A
• For disease association studies, haplotype data are more valuable than genotype data
• Haplotype data are hard to collect
• Genotype data are easy to collect
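As a small illustration of the relationship between the two haplotypes and the genotype, the following Python sketch combines H1 and H2 site by site. The function name `genotype` and the `X/Y` notation for heterozygous sites are assumptions made for this example.

```python
def genotype(h1: str, h2: str) -> list[str]:
    """Combine two haplotypes into an unphased genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Homozygous sites keep the shared nucleotide; heterozygous sites list both.
    return [a if a == b else f"{a}/{b}" for a, b in zip(h1, h2)]


h1 = "AATCGCCTTA"  # maternal chromosome
h2 = "ACACGTCTCA"  # paternal chromosome
print(" ".join(genotype(h1, h2)))
```

A heterozygous entry such as `A/C` does not say which nucleotide lies on which chromosome, which is why haplotypes cannot be read directly off genotype data.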

SNPs
• All humans are 99.99% identical. Diversity comes from polymorphism.
• A SNP is a Single Nucleotide Polymorphism: a site in the genome where two different nucleotides appear with sufficient frequency in the population (say, each with 5% frequency or more).
[Figure: aligned DNA sequence fragments]
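The frequency criterion in the SNP definition can be made concrete with a short Python sketch. It assumes the sequences are already aligned and of equal length; the function name `snp_sites` and the parameter `min_freq` are hypothetical names chosen for this example.

```python
from collections import Counter


def snp_sites(sequences: list[str], min_freq: float = 0.05) -> list[int]:
    """Return the 0-based positions where at least two different
    nucleotides each reach the frequency threshold min_freq."""
    n = len(sequences)
    sites = []
    for pos in range(len(sequences[0])):
        counts = Counter(seq[pos] for seq in sequences)
        frequent = [nt for nt, c in counts.items() if c / n >= min_freq]
        if len(frequent) >= 2:
            sites.append(pos)
    return sites


# Toy population: position 1 carries C in 80% and T in 20% of the sequences.
population = ["ACGT"] * 8 + ["ATGT"] * 2
print(snp_sites(population))
```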

Haplotype analysis 1/2
• Haplotype analysis* focuses on haplotypes and genotypes that are sequences of SNPs
* http://www.hapmap.org/

Haplotype analysis 2/2
• To reduce prohibitively expensive haplotyping costs, a two-stage methodology has been proposed [1]
• Pilot study
    • All SNPs of interest are genotyped in a small sample of the population
    • Common haplotypes are inferred using statistical methods
    • A set of tag SNPs is selected for the population study
• Population study
    • Tag SNPs are genotyped in the remaining population
    • Statistical methods are used to infer haplotypes over the tag SNPs
    • Haplotypes over the tag SNPs are extrapolated to full haplotypes
• Two problems:
    • Find a set of tag SNPs of minimum cardinality
    • Find a reconstruction function

Tag SNPs Selection: methods and models
• Methods that find a minimum set of clusters of SNPs in high correlation (e.g. linkage disequilibrium) with each other; the clusters are called blocks. SNP prediction should be easier within a block.
• Methods that, given the block structure (based on correlation or on proximity), either find a minimum set of SNPs able to distinguish each pair of haplotypes in a block, or assume that the number of tag SNPs is given and find a set of tag SNPs that can reconstruct the haplotype of an unknown sample with high accuracy.

Tag SNPs Selection: Problem definition
Given a population of N haplotypes over M SNPs, find a small set of SNPs (Tag SNPs) such that the values of all the other SNPs can be derived, with some reconstruction rule, from the values of the Tag SNPs.
Two aspects:
(1) Find a reconstruction function
(2) Find a set of minimum cardinality that can reconstruct the other SNPs using (1)
And also:
(3) Given (1) and (2), is there a proper way to identify blocks?

Tag SNPs Selection: Problem definition
The Approaches
• Use a reconstruction function based on SNP similarity
• Method 1: cluster the SNPs according to a proper metric; select the centroid of each cluster as a TAG SNP
• Method 2: select a subset of SNPs able to differentiate each pair of haplotypes (Set Covering formulation)
• Both methods are consistent with the adopted reconstruction function
• The reconstruction performance can be used to derive the blocks ex post

The reconstruction function
• The "Majority Vote"
• Given
    • the set of TAG SNPs
    • a training set T of haplotypes for which we know the value of all the SNPs
    • a new haplotype H for which we know only the value of the TAG SNPs
• Let S be the set of haplotypes in T that have the same values as H on the TAG SNPs
• For each non-TAG SNP, determine its most frequent value in S and use it as the prediction of the value of that SNP in H
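The majority vote rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: haplotypes are assumed to be equal-length strings over a binary alphabet, the names `majority_vote`, `tags` and `tag_values` are hypothetical, and the `?` marker for the case where S is empty is an assumption, since the slides do not say how that case is handled.

```python
from collections import Counter


def majority_vote(training, tags, tag_values):
    """Predict a full haplotype H from its TAG SNP values.

    training   : list of equal-length haplotype strings (all SNPs known)
    tags       : list of TAG SNP positions (0-based)
    tag_values : values of H at those positions, in the same order
    """
    # S: training haplotypes that agree with H on every TAG SNP.
    matching = [h for h in training
                if all(h[t] == v for t, v in zip(tags, tag_values))]
    predicted = []
    for pos in range(len(training[0])):
        if pos in tags:
            # TAG SNPs are observed, not predicted.
            predicted.append(tag_values[tags.index(pos)])
        elif matching:
            # Most frequent value of this SNP among the haplotypes in S.
            predicted.append(Counter(h[pos] for h in matching).most_common(1)[0][0])
        else:
            predicted.append("?")  # assumption: empty S means undetermined
    return "".join(predicted)


training = ["0011", "0010", "1100", "0011"]
print(majority_vote(training, tags=[0], tag_values=["0"]))
```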

The reconstruction function
• The majority vote rule is based on the assumption that the TAG SNPs characterize the haplotype almost completely
• That is, if two haplotypes are equal on the TAG SNPs, they are assumed to be equal on the other SNPs as well

Method 1: SNPs Clustering
• Clustering: find groups of elements with high dissimilarity between groups and small dissimilarity within each group, w.r.t. a chosen distance function
• Main assumption: TAG SNPs are those that are very similar to many other SNPs in the training data
• Cluster the SNPs in the haplotype space using Hamming Distance (HD) with the k-means algorithm, for a proper value of k
• Select k TAG SNPs as those closest to the HD-centroid of each cluster
• Use the TAG SNPs to reconstruct the non-TAG SNPs of new haplotypes using the Majority Rule
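The clustering step can be sketched as a k-means-style loop over SNP columns with Hamming distance and modal centroids. This is a minimal sketch under stated assumptions, not the authors' code: the names `cluster_tag_snps`, `init`, `n_iter` and `seed` are hypothetical, ties are broken by index order, and an empty cluster simply keeps its previous centroid.

```python
import random


def hamming(a, b):
    """Number of positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def modal_centroid(columns):
    """Entry-wise most frequent value over the SNP vectors of one cluster."""
    return tuple(max(sorted(set(vals)), key=vals.count) for vals in zip(*columns))


def cluster_tag_snps(haplotypes, k, init=None, n_iter=20, seed=0):
    # Transpose: one vector per SNP, with one entry per haplotype.
    snps = [tuple(col) for col in zip(*haplotypes)]
    if init is None:
        # Random starting centroids, drawn among the SNP columns.
        init = random.Random(seed).sample(range(len(snps)), k)
    centroids = [snps[i] for i in init]
    for _ in range(n_iter):
        # Assignment step: each SNP joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for j, s in enumerate(snps):
            nearest = min(range(k), key=lambda c: hamming(s, centroids[c]))
            clusters[nearest].append(j)
        # Update step: modal centroid per cluster (unchanged if the cluster is empty).
        new_centroids = [
            modal_centroid([snps[j] for j in members]) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # TAG SNP of each non-empty cluster: the member closest to its centroid.
    return sorted(
        min(members, key=lambda j: hamming(snps[j], centroids[c]))
        for c, members in enumerate(clusters) if members
    )


haplotypes = ["000111", "000111", "111000", "110000"]
print(cluster_tag_snps(haplotypes, k=2, init=[2, 3]))
```

Passing `init` explicitly makes the run reproducible; leaving it out draws the starting centroids at random.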

Method 2: Set Covering Model
• The "classical" model: find a minimal subset of TAG SNPs such that each pair of haplotypes in the training set differs in the value of at least 1 TAG SNP
• Select the SNPs associated with x_i = 1 in the solution of the SC problem, where x_i is the binary variable indicating whether SNP i is selected
• Use the TAG SNPs to reconstruct the non-TAG SNPs of new haplotypes using the Majority Rule
• The above problem cannot be solved optimally for realistic sizes
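To make the covering structure concrete, the sketch below builds one constraint per pair of distinct haplotypes and selects SNPs with the textbook greedy set-cover heuristic. The greedy rule is a stand-in chosen for brevity; it is not the method used in this work and does not guarantee a minimum-cardinality set. The function name `greedy_tag_cover` is hypothetical.

```python
from itertools import combinations


def greedy_tag_cover(haplotypes):
    """Greedy heuristic: pick SNPs until every pair of distinct
    haplotypes differs on at least one selected SNP."""
    n_snps = len(haplotypes[0])
    # One covering constraint per pair of haplotypes that differ somewhere.
    pairs = [(a, b) for a, b in combinations(range(len(haplotypes)), 2)
             if haplotypes[a] != haplotypes[b]]
    # covers[i]: the pairs that SNP i is able to tell apart.
    covers = [{p for p in pairs if haplotypes[p[0]][i] != haplotypes[p[1]][i]}
              for i in range(n_snps)]
    uncovered, tags = set(pairs), []
    while uncovered:
        # Choose the SNP separating the most still-uncovered pairs.
        best = max(range(n_snps), key=lambda i: len(covers[i] & uncovered))
        tags.append(best)
        uncovered -= covers[best]
    return sorted(tags)


print(greedy_tag_cover(["000", "011", "101", "110"]))
```

The list `pairs` grows quadratically with the number of haplotypes, which is the main source of difficulty for exact methods on realistic data.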

Variants of the Set Covering Model
• The SC problem has a number of constraints quadratic in the number of haplotypes
• We use variations of the SC model (SCV) that make it possible to control the number of TAGs and their quality more effectively:
    • Minimize the number of TAGs for a given level of differentiation between haplotypes
    • Maximize the capacity to differentiate between haplotypes for a given number of TAG SNPs
• Solved with an iterative heuristic based on reduced costs
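One possible way to write the two variants down is given below, as a hedged sketch only. The symbols $x_i$, $a^{pq}_i$, $\alpha$, $\beta$ and $z$ are assumptions introduced here for illustration, and the max-min objective in (V2) is just one reading of "capacity to differentiate"; the actual SCV models may differ. Pairs $p<q$ range over distinct haplotypes of the training set.

```latex
% Assumed notation (not taken from the slides):
%   x_i      = 1 if SNP i is selected as a TAG, 0 otherwise
%   a^{pq}_i = 1 if haplotypes p and q differ on SNP i, 0 otherwise
%   \alpha   = required number of TAG SNPs separating each pair
%   \beta    = allowed number of TAG SNPs
\begin{align*}
\text{(V1)}\quad & \min \sum_{i=1}^{M} x_i
  && \text{s.t. } \sum_{i=1}^{M} a^{pq}_i x_i \ge \alpha \quad \forall\, p<q,
  \qquad x_i \in \{0,1\} \\
\text{(V2)}\quad & \max z
  && \text{s.t. } \sum_{i=1}^{M} a^{pq}_i x_i \ge z \quad \forall\, p<q,
  \qquad \sum_{i=1}^{M} x_i \le \beta,
  \qquad x_i \in \{0,1\}
\end{align*}
```

With $\alpha = 1$, (V1) reduces to the classical set covering model, and with $N$ haplotypes both models have up to $N(N-1)/2$ pair constraints.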

Some Remarks
• A good estimate of the number of TAG SNPs to be used in the model can be found efficiently by measuring the quality of the clusters for different numbers of clusters
• The quality of the two methods (Clustering and Set Covering) can be compared directly using TAG SNP sets of the same size
• SC is still intractable if all SNPs are used (most literature uses the first 1000-1500 SNPs)
• Solution scheme:
    • Start with the centroids of the clustering
    • Add columns with pricing until LP optimal
    • Add columns with a metric on SNPs until the objective function increases
    • Solve the IP

Computational results
International HapMap Project data on Chromosome 21 of the human genome
• YRI: Yoruba in Ibadan, Nigeria
• JPT: Japanese in Tokyo, Japan
• CHB: Han Chinese in Beijing, China
• CEU: Utah residents with Northern and Western European ancestry

Population   # haplotypes   # SNPs
YRI          120            38,852
JPT+CHB      180            33,878
CEU          120            34,103

Computational results
Experiments Setting
• Limited to the first block of 1500 SNPs (as in related literature), or using all SNPs (approx. 40,000)
• Used clustering with standard HD, with modal centroids and random starting centroids
• Used SCV with its parameter held fixed, using iterative heuristics based on reduced costs, solved with CPLEX
• Reconstruction with the majority rule
• Quality of reconstruction: if the SNP value is coherent in more than 70% of the matching haplotypes (set S), then predict it, else declare it undetermined
• 2/3 of the haplotypes used for training, 1/3 for testing
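The evaluation protocol can be sketched as follows. Several details are assumptions made for this example rather than facts from the experiments: the split takes the first 2/3 of the list (the slides do not say how haplotypes were assigned), `None` marks an undetermined SNP, an empty matching set counts as undetermined, and the names `split_train_test`, `predict_with_threshold` and `evaluate` are hypothetical.

```python
from collections import Counter


def split_train_test(haplotypes):
    """First 2/3 of the haplotypes for training, the remaining 1/3 for testing."""
    cut = (2 * len(haplotypes)) // 3
    return haplotypes[:cut], haplotypes[cut:]


def predict_with_threshold(training, tags, target, threshold=0.7):
    """Majority-rule prediction of the non-TAG SNPs of `target`.

    Only the TAG positions of `target` are read. A value is predicted when
    it appears in more than `threshold` of the matching haplotypes;
    otherwise the SNP is marked undetermined (None).
    """
    matching = [h for h in training if all(h[t] == target[t] for t in tags)]
    prediction = {}
    for pos in range(len(target)):
        if pos in tags:
            continue
        if not matching:
            prediction[pos] = None  # assumption: empty S means undetermined
            continue
        value, count = Counter(h[pos] for h in matching).most_common(1)[0]
        prediction[pos] = value if count / len(matching) > threshold else None
    return prediction


def evaluate(training, test, tags, threshold=0.7):
    """Count correct, wrong and undetermined predictions over the test set."""
    counts = Counter()
    for target in test:
        predicted = predict_with_threshold(training, tags, target, threshold)
        for pos, value in predicted.items():
            if value is None:
                counts["undetermined"] += 1
            elif value == target[pos]:
                counts["correct"] += 1
            else:
                counts["wrong"] += 1
    return counts


training = ["0000", "0001", "0001", "0001", "1111"]
print(evaluate(training, test=["0000"], tags=[0]))
```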


Computational results
Observations
• Reconstruction error is in the range of 20% of the SNPs, improving on previous results (where comparable)
• The SCV method performs better than clustering, especially when all SNPs are used
• Best results are obtained with approx. 30 TAG SNPs; larger values do not reduce the reconstruction error and slow down the computation
• First time so many SNPs are treated simultaneously
• Completely correct SNPs are in the range 10-20%
• With approx. 30 TAGs we can correctly reconstruct approx. 6000 SNPs

Computational results
Work in Progress
Use the proposed method to identify the blocks:
• Use all SNPs on the training set
• Apply SCV to select the TAG SNPs
• Apply the majority rule to the test set and select those SNPs that are predicted correctly over the whole test set
• Create one block with these SNPs, associate them with the TAG set, and remove these SNPs from the samples
• Iterate until the sample contains only TAG SNPs or no improvement is obtained
Preliminary results are encouraging. Larger data sets are needed in order to test the method properly.
