
Case Studies and Explorations with Kmeans Clustering


Presentation Transcript


  1. Case Studies and Explorations with Kmeans Clustering Paul Rodriguez PACE Gordon Summer Institute 2012

  2. Clustering with Kmeans • Kmeans is a standard data-driven technique • Kmeans clustering: assign each point to one of a few clusters so that the total distance to the cluster centers is minimized • Options: distance function, number of clusters, initial cluster centers, number of iterations, stopping criteria • (a minimal sketch of the algorithm follows below)
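A minimal sketch of that loop, as a reference point for the variants later in the talk. It assumes squared Euclidean distance and random initialization; the function and variable names are illustrative, not the presentation's kmeans.m:

    function [idx, C] = kmeans_simple(X, K, maxIters)
      % X: N-by-P data matrix, one point per row; K: number of clusters
      N = size(X, 1);
      C = X(randperm(N, K), :);            % initial centers: K random points
      for iter = 1:maxIters
        % assign each point to its nearest center (squared Euclidean)
        D = zeros(N, K);
        for k = 1:K
          diffs = bsxfun(@minus, X, C(k, :));
          D(:, k) = sum(diffs.^2, 2);
        end
        [~, idx] = min(D, [], 2);
        % recompute each center as the mean of its assigned points
        for k = 1:K
          members = (idx == k);
          if any(members)
            C(k, :) = mean(X(members, :), 1);
          end
        end
      end
    end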

  3. Clustering in HPC environment • Data Set Up: • Does it fit in memory? Split? Sample? • Tools and Processors: • Which machines/queues: normal compute nodes or vSMP? • How much coding/prototyping/optimization?

  4. Clustering in HPC environment • Data to Try: • NYTimes Articles • 1000 Genomes Data • Tools and Processors to Try: • Matlab: high-level math programming tool • Map/Reduce: C++ program library

  5. Matlab Parallel Computing Toolbox • Communication is handled for you (MPI or threads under the hood) • You still have to decide data/task set up • [Diagram: the CLIENT holds the full matrix X; each of LAB 1 … LAB N (separate nodes or threads) holds a local part of X]

  6. Matlab PCT in a nutshell • Distributed toolbox provides distribute/gather functions • In job submission: • Create a job object: createMatlabPoolJob(scheduler information) • Create tasks for that job: createTask(job, @myfunction, number_of_output_args, {parameters...}) • In your code (see the submission sketch below):

    spmd
      D = codistributed(X);   % or D = codistributed.build(X);
      <statements>
    end;
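A sketch of that submission pattern using the PCT API of this era (findResource, createMatlabPoolJob, and related functions were later deprecated); the scheduler type, worker-function name, and its arguments here are assumptions, not taken from the talk:

    % look up a scheduler ('local' here; a jobmanager or PBS/LSF type
    % would be used on a real cluster)
    sched = findResource('scheduler', 'type', 'local');
    job   = createMatlabPoolJob(sched);
    set(job, 'MaximumNumberOfWorkers', 8);
    % hypothetical worker function; third argument = number of output args
    createTask(job, @run_kmeans_spmd, 1, {X, num_clusters});
    submit(job);
    waitForState(job, 'finished');
    results = getAllOutputArguments(job);
    destroy(job);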

  7. Matlab PCT in a nutshell • A codistributed array is divided into local parts, each residing in the workspace of a different lab. • Practically: • find the bottleneck in the kmeans.m program • add code to distribute and gather data

  8. Matlab PCT pseudo code

    … old Kmeans code …
    % NEW CODE: distribute data matrix X across the labs
    spmd
      Xsd = codistributed(X);                  % declare X as distributed
      X_Local = getLocalPart(Xsd);             % get the part for this lab
      % also distribute the cluster means
      Csd = codistributed(Cluster_Means_Set);
      Cluster_Means_Local = getLocalPart(Csd);
      …

  9. Matlab PCT pseudo code

    % ALTERNATIVE: split the input file into parts beforehand and
    % use the lab index to read the correct file
    spmd
      currentlab = labindex;
      % read file ['nytimes_forlab_' num2str(currentlab)]
      …

  10. Matlab PCT pseudo code

      % calculate the distance matrix for this part as usual
      Distance_Part = get_distance(X_Local, Cluster_Means_Local);
    end;  % end of SPMD block
    % Distance_Part is now available on the client as a Composite,
    % with one part per lab; combine the parts:
    Distance = 0;
    for i = 1:num_labs
      Distance = Distance + Distance_Part{i};
    end
    … rest of Kmeans code …

  11. Matlab in vSMP setting • vSMP submission indicates threads • In the submission script, set environment variables for MKL (Intel's Math Kernel Library) • In Matlab code: setenv('MKL_NUM_THREADS', num2str(number_of_procs)); • No programming changes necessary, but programming considerations exist

  12. Matlab: threads vs comm. • [Plots: runtime (s) of matrix multiplication and matrix inversion with 8, 16, and 32 threads, for square matrices of size N = 10K, 20K, 30K, 40K, 50K (about 2, 6.5, 14, 25, 40 GB)] • threads: more is better for multiplication, less is better for inversion • (or use a different operation)

  13. Matlab original Kmeans Script • Difference_by_col = X(:,1) - Cluster_Means(1,1) • [Figure: X is N-by-P, Cluster_Means is M-by-P; each row is a point in R^P] • square the difference • sum as you loop across columns to get distances to each cluster center • Works better for large N, small P

  14. Matlab Kmeans Script altered • Difference_by_row = X(1,:) - Cluster_Means(1,:) • [Figure: same N-by-P matrix X and M-by-P Cluster_Means] • dot(Difference_by_row, Difference_by_row) • loop across rows to get distances • Works better for large P, and dot() will use threads • (both loop orders are sketched below)
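A compact sketch contrasting the two loop orders from slides 13 and 14. It assumes X (N-by-P data) and Cluster_Means (M-by-P centers) as in the figures; the loop bodies are illustrative, not the actual kmeans.m code:

    [N, P] = size(X);
    M = size(Cluster_Means, 1);

    % Column-wise accumulation (slide 13): loop over the P features,
    % accumulating squared differences; favors large N, small P.
    D1 = zeros(N, M);                      % N points x M cluster centers
    for p = 1:P
      diff_col = bsxfun(@minus, X(:, p), Cluster_Means(:, p)');  % N x M
      D1 = D1 + diff_col.^2;
    end

    % Row-wise version (slide 14): loop over points and centers, letting
    % dot() do the per-row work; favors large P, and dot() is threaded.
    D2 = zeros(N, M);
    for i = 1:N
      for m = 1:M
        d = X(i, :) - Cluster_Means(m, :);
        D2(i, m) = dot(d, d);
      end
    end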

  15. Matlab Kmeans Benchmarks • Kmeans on 10,000,000 entries from NYTimes articles (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words) • Running as full data matrix: ~45K articles x 102K words • Each cell holds a word count (double float) • about 37 GB in Matlab; total memory for the script about 61 GB • Kmeans (original) runtime ~50 hours • Kmeans (altered) runtime ~10 hours, 8 threads

  16. Matlab Kmeans Results • [Figure: cluster means shown as words, with each coordinate's value determining the word's font size] • 7 viable clusters found

  17. MapReduce Framework • A library for distributed computing • Started by Google, gaining popularity • Various implementations: Hadoop (distributed), Phoenix (threaded), Sandia (MPI) • [Diagram, after Ekanayake et al.: MR provides parallelization, concurrency, and intermediate key/value handling; the user defines the functions that output keys & values]

  18. Paradigmatic Example: string counting • Scheduler: manage threads, initiate the data split, and call Map • Map: count strings, output key=string & value=count • Scheduler: re-partition keys & values • Reduce: sum up the counts • [Diagram: MR provides parallelization, concurrency, and intermediate key/value handling; the user defines the Map and Reduce functions] • (see the sketch below)
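A serial MATLAB emulation of this map/reduce word-count pattern, purely to illustrate the key/value flow (Phoenix itself is a C library; the chunk contents below are made up):

    chunks = {{'cat','dog','cat'}, {'dog','fish'}};   % pre-split input

    % Map phase: each chunk independently emits (word, count) pairs
    partials = cell(size(chunks));
    for c = 1:numel(chunks)
      m = containers.Map('KeyType', 'char', 'ValueType', 'double');
      words = chunks{c};
      for w = 1:numel(words)
        key = words{w};
        if isKey(m, key), m(key) = m(key) + 1; else m(key) = 1; end
      end
      partials{c} = m;
    end

    % Reduce phase: sum the counts for each key across all chunks
    total = containers.Map('KeyType', 'char', 'ValueType', 'double');
    for c = 1:numel(partials)
      ks = keys(partials{c});
      for k = 1:numel(ks)
        key = ks{k};
        v = partials{c}(key);
        if isKey(total, key), total(key) = total(key) + v;
        else total(key) = v; end
      end
    end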

  19. MapReduce Kmeans clustering • C code for Kmeans (sample code with MapReduce Phoenix) • Use 10,000,000 entries from NYTimes articles • Running as full data matrix (int): ~45K docs x 102K word tokens, ~20 GB total in vSMP • Running time ~20 min, 32 threads • (a conceptual sketch of one map/reduce kmeans iteration follows)
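A conceptual, serial MATLAB rendering of one MapReduce kmeans iteration, showing the key/value structure: Map assigns each point to its nearest center and emits (cluster, [point, 1]); Reduce averages the emissions per cluster. This is an assumption-laden sketch, not the Phoenix sample program:

    function C_new = kmeans_mr_iteration(X, C)
      [N, P] = size(X); K = size(C, 1);
      sums   = zeros(K, P);   % accumulated per-key point sums
      counts = zeros(K, 1);   % accumulated per-key counts
      % Map phase (each point could be handled by a different worker)
      for i = 1:N
        d = sum(bsxfun(@minus, C, X(i, :)).^2, 2);  % distance to each center
        [~, k] = min(d);                            % key = nearest cluster
        sums(k, :) = sums(k, :) + X(i, :);          % emitted value, combined
        counts(k)  = counts(k) + 1;
      end
      % Reduce phase: new center = mean of the points emitted under its key
      C_new = C;
      nz = counts > 0;
      C_new(nz, :) = bsxfun(@rdivide, sums(nz, :), counts(nz));
    end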

  20. MapReduce Kmeans clustering • Use ~70,000,000 entries from NYTimes articles • full data matrix (int): ~300K docs x 102K word tokens • ~120 GB total in vSMP memory • Running time ~120 min, 32 threads • Running time ~175 min, serial version

  21. Case Study: Genomic Data (with Multi-Modal Imaging Lab, UCSD) • Genomic sequence database on over 1000 subjects (1000genomes.org) • Each sequence mapping is ~10 GB => ~10 TB total data • Goal: identify genetic variants and priors for analysis of brain imaging & sequence data together

  22. Exploring Genomic Data • How does genomic clustering match demographics? • What categories of genes (e.g. coding, regulation) account for differences? • Start small: kmeans clustering for one chromosome.

  23. Exploring Genomic Data • Starting small: Chromosome 11 aligned data • Shell script to retrieve 75 subjects • wget ftp://ftp-trace.ncbi.nih.gov/ … (about 700 MB each file) • Preprocessing: • Download BAM (binary sequence alignment data) utilities • Run BAM function to get consensus sequences (other consensus methods?) • Use perl, grep, wc, etc. to strip headers and get metadata about the files • Pick a coding scheme (A,C,G,T = 1,2,3,4; no allele = 0), sketched below • Gather summary statistics
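A minimal MATLAB sketch of that coding scheme, assuming the consensus sequence arrives as a character string (the fragment below is hypothetical):

    seq   = 'ACGTNNACG';              % hypothetical consensus fragment
    coded = zeros(1, numel(seq));     % 0 = no allele / unknown
    coded(seq == 'A') = 1;
    coded(seq == 'C') = 2;
    coded(seq == 'G') = 3;
    coded(seq == 'T') = 4;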

  24. Exploring Genomic Data • Data ends up as 250M integers (alleles): • 000001243111333324000004311322224 ….. • Try subsets: 3M, 10M, 50M integers • Use Matlab and MapReduce

  25. Case Study: Cluster Exploration • High correlation of clustering despite far fewer alleles used • [Plot: cluster assignment ('A' vs 'B') for subjects 1-75, with markers for the number of alleles used (. 3M, O 50M, + 10M); subject groups labeled Finnish, GBR, and all others]

  26. Case Study: Cluster Exploration • Do outliers belong in their own cluster? • Should these be reassigned? • [Plot: distance to cluster mean for subjects 1-75, with an outlier marked and candidate points for reassignment highlighted]

  27. Case Study: More steps • running multiple cluster sizes • visualizing clusters • using other distance functions (e.g. city-block) • doing more data with vSMP (in progress) • other cluster algorithms; compare to PCA, MDS

  28. How to use full genome? • Distance Speed-Up Heuristics: • sample columns during the distance calculation (sketched below) • start with subsets of data points (i.e. rows) and add one at a time • only process outliers or points in between clusters • HPC set up: • put some data on flash, some data in memory • use distributed jobs for some processing steps, vSMP for others
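A hedged sketch of the column-sampling heuristic, assuming X and Cluster_Means as on the earlier slides; the 10% sample rate and the P/s rescaling are illustrative choices, not from the talk:

    [N, P] = size(X);
    M    = size(Cluster_Means, 1);
    s    = round(0.1 * P);            % sample 10% of the columns
    cols = randperm(P, s);            % random column subset
    D_approx = zeros(N, M);
    for m = 1:M
      diffs = bsxfun(@minus, X(:, cols), Cluster_Means(m, cols));
      D_approx(:, m) = (P / s) * sum(diffs.^2, 2);   % rescaled estimate
    end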

  29. PACE Ongoing and Future • Continue building experience with large-memory trade-offs for data mining algorithms • Support a variety of ways to execute a variety of tools
