This study presents a novel genotype-calling algorithm that addresses the biological problem of determining an individual’s SNP values. By using a fast clustering approach, it improves the accuracy of SNP assignments while reducing reliance on heavily tuned parameters. The method convolves the data with a Gaussian kernel to identify initial clusters, iteratively refines cluster parameters, and assigns genotype calls based on probe measurements. Testing on the Affy 100K XBA CEU dataset yielded an accuracy of 96.47%, demonstrating the method’s efficiency and effectiveness across populations.
Genotype Calling
Matt Schuerman
Biological Problem • How do we know an individual’s SNP values (genotype)? • Each SNP can have two values (A/B) • Each individual has two copies of the SNP • Probes can be used to measure how well a sample matches a particular SNP value • Need a reliable way to declare values based on probe measurements
Computational Problem • Given a set of data points, how can we partition them to maximize similarity within subsets? • The clustering problem • The similarity function is arbitrary, but often based on statistical or distance measures • Several accepted algorithms exist
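To make the objective concrete, here is a minimal sketch of one common way to score a partition: summed squared distance from each point to its cluster’s centroid (the function name and scoring choice are my own illustration, not from the slides):

```python
def within_cluster_dispersion(points, labels):
    """Score a partition: sum of squared Euclidean distances from each
    2-D point to the centroid of its assigned cluster (lower is better)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for members in clusters.values():
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
    return total
```

A partition that groups nearby points scores lower than one that mixes distant points, which is the "maximize similarity within subsets" criterion in distance form.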
Standard Solutions • Algorithms exist that call HapMap genotypes with >99% accuracy • Not general: many hidden parameters tuned to work on existing data • Other algorithms require prior knowledge, such as how many clusters are present • Again, not general
My Solution • Wanted a more general method with few tuned parameters • Mine has almost no “tuned” parameters • Wanted a fast solution • Many accepted clustering algorithms have exponential run times • Mine is O(n²), but closer to linear in practice
My Solution • Convolve a Gaussian kernel over the data to find initial cluster candidates • Iteratively re-calculate cluster parameters and then re-assign data points to clusters • Assign calls to clusters based on the ratio of probe measurements
Phase 1: Initial clusters • Bin data points to a grid • Convolve with a 5x5 Gaussian kernel • All peaks are considered potential clusters
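The binning-and-smoothing step can be sketched roughly as follows. The kernel sigma, the grid size, and all function names are assumptions of mine; the slides only specify the 5x5 kernel footprint:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized size x size Gaussian kernel (sigma is an assumption;
    the slides specify only the 5x5 footprint)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def initial_clusters(points, bins=20):
    """Bin 2-D points to a grid, smooth with a 5x5 Gaussian kernel, and
    return the (x, y) centres of grid cells that are local maxima."""
    hist, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    kernel = gaussian_kernel()
    padded = np.pad(hist, 2)  # zero-pad so the kernel fits at the border
    smooth = np.zeros_like(hist)
    for i in range(bins):
        for j in range(bins):
            smooth[i, j] = np.sum(padded[i:i + 5, j:j + 5] * kernel)
    # A peak is a nonzero cell strictly greater than its 8 neighbours.
    ps = np.pad(smooth, 1)
    peaks = []
    for i in range(bins):
        for j in range(bins):
            window = ps[i:i + 3, j:j + 3]
            c = smooth[i, j]
            if c > 0 and (window > c).sum() == 0 and (window == c).sum() == 1:
                peaks.append(((xedges[i] + xedges[i + 1]) / 2,
                              (yedges[j] + yedges[j + 1]) / 2))
    return peaks
```

Each returned peak becomes a candidate cluster centre for Phase 2.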
Phase 2: Cluster Iteration • While the clusters are changing … • Calculate the mean position and covariance matrix of each cluster • Merge clusters within 3 standard deviations of each other using Mahalanobis distance • Assign each data point to the cluster with the shortest Mahalanobis distance
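The loop above can be sketched as follows. The initial covariances, the convergence cap, and all names are my assumptions; only the merge threshold (3 Mahalanobis standard deviations) and the loop structure come from the slides:

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of point x from a cluster (mean, inverse covariance)."""
    d = np.asarray(x, float) - mean
    return float(np.sqrt(d @ cov_inv @ d))

def iterate_clusters(points, centers, max_iter=50):
    """Phase 2 sketch: merge clusters whose centres lie within 3
    Mahalanobis standard deviations, assign each point to the nearest
    cluster, re-fit each cluster's mean and covariance, and repeat
    until the assignment stops changing."""
    centers = [np.asarray(c, float) for c in centers]
    covs = [0.01 * np.eye(2) for _ in centers]  # small initial spread (an assumption)
    labels = -np.ones(len(points), dtype=int)
    for _ in range(max_iter):
        # Merge: drop any centre within 3 s.d. of an already-kept one.
        keep = []
        for i in range(len(centers)):
            if all(mahalanobis(centers[i], centers[k], np.linalg.inv(covs[k])) >= 3
                   for k in keep):
                keep.append(i)
        centers = [centers[k] for k in keep]
        covs = [covs[k] for k in keep]
        # Assign each point to the cluster with the shortest Mahalanobis distance.
        inv = [np.linalg.inv(c) for c in covs]
        new_labels = np.array([min(range(len(centers)),
                                   key=lambda k: mahalanobis(p, centers[k], inv[k]))
                               for p in points])
        if np.array_equal(new_labels, labels):
            break  # no change, so done
        labels = new_labels
        # Re-calculate the mean position and covariance matrix of each cluster.
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 2:
                centers[k] = members.mean(axis=0)
                covs[k] = np.cov(members.T) + 1e-6 * np.eye(2)
    return centers, labels
```

Seeding with a duplicate candidate centre (as Phase 1 can produce near-duplicate peaks) shows the merge step collapsing it on the first pass.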
Phase 2: Cluster Iteration Iteration 1 …
Phase 2: Cluster Iteration Iteration 2 …
Phase 2: Cluster Iteration Iteration 3 …
Phase 2: Cluster Iteration Iteration 4: no change, so done!
Phase 3: Assigning calls • Based on the ratio of y to x at the center of each cluster • If y/x ~ 1.3, then call as BB • If y/x ~ 1, then call as AB • If y/x ~ 0.7, then call as AA • If 2 or 3 clusters are present, find which of these values each cluster is closest to
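The nearest-reference-ratio rule above can be sketched in a few lines (the helper name is hypothetical; the three reference ratios are the ones given in the slides):

```python
def call_genotype(center):
    """Map a cluster centre (x, y) to a genotype call by comparing its
    y/x ratio against the reference ratios from the slides."""
    x, y = center
    ratio = y / x
    targets = {"AA": 0.7, "AB": 1.0, "BB": 1.3}
    # Pick whichever reference ratio the cluster's ratio lies closest to.
    return min(targets, key=lambda g: abs(targets[g] - ratio))
```

Because each cluster is matched to the closest reference ratio, the rule works unchanged whether 2 or 3 clusters are present.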
Results • Clustering works much better when done within populations • The algorithm’s performance is comparable across all populations • Testing on 1111 SNPs in the Affy 100K XBA CEU dataset found the method to be 96.47% accurate
Results: Example Assignment Ignore point at (10,10). One incorrect call in black.
Results • Sometimes assigning calls is problematic • Sometimes clusters get improperly split • Sometimes clusters get improperly merged • Sometimes the grouping is right, but one of the clusters was miscalled • Could probably be fixed if the ratios were set more precisely
Conclusions • Accuracy is close to that of the best published algorithms • Faster run time • Simpler approach with less tuning • Need to run on more data