This study presents a novel genotype-calling algorithm that addresses the biological problem of determining an individual’s SNP values. By using a fast clustering approach, it improves the accuracy of SNP assignments while reducing reliance on heavily tuned parameters. The method convolves the data with a Gaussian kernel to identify initial clusters, iteratively refines cluster parameters, and assigns genotype calls based on probe measurements. Testing on the Affy 100K XBA CEU dataset yielded an accuracy of 96.47%, demonstrating the method’s efficiency and effectiveness across populations.
Genotype Calling
Matt Schuerman
Biological Problem • How do we know an individual’s SNP values (genotype)? • Each SNP can have two values (A/B) • Each individual has two copies of the SNP • Probes can be used to measure how well a sample matches a particular SNP value • Need a reliable way to declare values based on probe measurements
Computational Problem • Given a set of data points, how can we partition them to maximize similarity within subsets? • The clustering problem • The similarity function is arbitrary, but often based on statistical or distance measures • Several accepted algorithms exist
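To make the objective concrete, here is a minimal sketch of one common way to score a partition: summed squared distance from each point to its cluster’s centroid (the function name and scoring choice are my own illustration, not from the slides):

```python
def within_cluster_dispersion(points, labels):
    """Score a partition: sum of squared Euclidean distances from each
    2-D point to the centroid of its assigned cluster (lower is better)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for members in clusters.values():
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
    return total
```

A partition that groups nearby points scores lower than one that mixes distant points, which is the "maximize similarity within subsets" criterion in distance form.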
Standard Solutions • Algorithms exist that call HapMap genotypes with >99% accuracy • Not general: many hidden parameters tuned to work on existing data • Other algorithms require prior knowledge, such as how many clusters are present • Again, not general
My Solution • Wanted a more general method with few tuned parameters • Mine has almost no “tuned” parameters • Wanted a fast solution • Many accepted clustering algorithms have exponential run times • Mine is O(n²), but closer to linear in practice
My Solution • Convolve a Gaussian kernel over the data to find initial cluster candidates • Iteratively re-calculate cluster parameters and then re-assign data points to clusters • Assign calls to clusters based on the ratio of probe measurements
Phase 1: Initial clusters • Bin data points to a grid • Convolve with a 5x5 Gaussian kernel • All peaks are considered potential clusters
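The binning-and-smoothing step can be sketched roughly as follows. The kernel sigma, the grid size, and all function names are assumptions of mine; the slides only specify the 5x5 kernel footprint:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized size x size Gaussian kernel (sigma is an assumption;
    the slides specify only the 5x5 footprint)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def initial_clusters(points, bins=20):
    """Bin 2-D points to a grid, smooth with a 5x5 Gaussian kernel, and
    return the (x, y) centres of grid cells that are local maxima."""
    hist, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    kernel = gaussian_kernel()
    padded = np.pad(hist, 2)  # zero-pad so the kernel fits at the border
    smooth = np.zeros_like(hist)
    for i in range(bins):
        for j in range(bins):
            smooth[i, j] = np.sum(padded[i:i + 5, j:j + 5] * kernel)
    # A peak is a nonzero cell strictly greater than its 8 neighbours.
    ps = np.pad(smooth, 1)
    peaks = []
    for i in range(bins):
        for j in range(bins):
            window = ps[i:i + 3, j:j + 3]
            c = smooth[i, j]
            if c > 0 and (window > c).sum() == 0 and (window == c).sum() == 1:
                peaks.append(((xedges[i] + xedges[i + 1]) / 2,
                              (yedges[j] + yedges[j + 1]) / 2))
    return peaks
```

Each returned peak becomes a candidate cluster centre for Phase 2.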
Phase 2: Cluster Iteration • While the clusters are changing … • Calculate the mean position and covariance matrix of each cluster • Merge clusters within 3 standard deviations of each other using Mahalanobis distance • Assign each data point to the cluster with the shortest Mahalanobis distance
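The loop above can be sketched as follows. The initial covariances, the convergence cap, and all names are my assumptions; only the merge threshold (3 Mahalanobis standard deviations) and the loop structure come from the slides:

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of point x from a cluster (mean, inverse covariance)."""
    d = np.asarray(x, float) - mean
    return float(np.sqrt(d @ cov_inv @ d))

def iterate_clusters(points, centers, max_iter=50):
    """Phase 2 sketch: merge clusters whose centres lie within 3
    Mahalanobis standard deviations, assign each point to the nearest
    cluster, re-fit each cluster's mean and covariance, and repeat
    until the assignment stops changing."""
    centers = [np.asarray(c, float) for c in centers]
    covs = [0.01 * np.eye(2) for _ in centers]  # small initial spread (an assumption)
    labels = -np.ones(len(points), dtype=int)
    for _ in range(max_iter):
        # Merge: drop any centre within 3 s.d. of an already-kept one.
        keep = []
        for i in range(len(centers)):
            if all(mahalanobis(centers[i], centers[k], np.linalg.inv(covs[k])) >= 3
                   for k in keep):
                keep.append(i)
        centers = [centers[k] for k in keep]
        covs = [covs[k] for k in keep]
        # Assign each point to the cluster with the shortest Mahalanobis distance.
        inv = [np.linalg.inv(c) for c in covs]
        new_labels = np.array([min(range(len(centers)),
                                   key=lambda k: mahalanobis(p, centers[k], inv[k]))
                               for p in points])
        if np.array_equal(new_labels, labels):
            break  # no change, so done
        labels = new_labels
        # Re-calculate the mean position and covariance matrix of each cluster.
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 2:
                centers[k] = members.mean(axis=0)
                covs[k] = np.cov(members.T) + 1e-6 * np.eye(2)
    return centers, labels
```

Seeding with a duplicate candidate centre (as Phase 1 can produce near-duplicate peaks) shows the merge step collapsing it on the first pass.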
Phase 2: Cluster Iteration Iteration 1 …
Phase 2: Cluster Iteration Iteration 2 …
Phase 2: Cluster Iteration Iteration 3 …
Phase 2: Cluster Iteration Iteration 4: no change, so done!
Phase 3: Assigning calls • Based on the ratio of y to x at the center of each cluster • If y/x ~ 1.3, then call as BB • If y/x ~ 1, then call as AB • If y/x ~ 0.7, then call as AA • If 2 or 3 clusters are present, find which of these values each cluster is closest to
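The nearest-reference-ratio rule above can be sketched in a few lines (the helper name is hypothetical; the three reference ratios are the ones given in the slides):

```python
def call_genotype(center):
    """Map a cluster centre (x, y) to a genotype call by comparing its
    y/x ratio against the reference ratios from the slides."""
    x, y = center
    ratio = y / x
    targets = {"AA": 0.7, "AB": 1.0, "BB": 1.3}
    # Pick whichever reference ratio the cluster's ratio lies closest to.
    return min(targets, key=lambda g: abs(targets[g] - ratio))
```

Because each cluster is matched to the closest reference ratio, the rule works unchanged whether 2 or 3 clusters are present.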
Results • Clustering works much better when done within populations • The algorithm’s performance is comparable across all populations • Testing on 1111 SNPs in the Affy 100K XBA CEU dataset found the method to be 96.47% accurate
Results: Example Assignment Ignore point at (10,10). One incorrect call in black.
Results • Sometimes assigning calls is problematic • Sometimes clusters get improperly split • Sometimes clusters get improperly merged • Sometimes the grouping is right, but one of the clusters was miscalled • Could probably be fixed if the ratios were set more precisely
Conclusions • Accuracy is close to that of the best published algorithms • Faster run time • Simpler approach with less tuning • Need to run on more data