
Case Studies and Explorations with Kmeans Clustering


Presentation Transcript


  1. Case Studies and Explorations with Kmeans Clustering Paul Rodriguez PACE Gordon Summer Institute 2012

  2. Clustering with Kmeans • Kmeans is a standard data-driven technique • Kmeans clustering: assign each point to one of a few clusters so that the total distance to the cluster centers is minimized • Options: distance function, number of clusters, initial cluster centers, number of iterations, stopping criteria • (a minimal sketch of the algorithm follows below)
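A minimal sketch of that loop, as a reference point for the variants later in the talk. It assumes squared Euclidean distance and random initialization; the function and variable names are illustrative, not the presentation's kmeans.m:

    function [idx, C] = kmeans_simple(X, K, maxIters)
      % X: N-by-P data matrix, one point per row; K: number of clusters
      N = size(X, 1);
      C = X(randperm(N, K), :);            % initial centers: K random points
      for iter = 1:maxIters
        % assign each point to its nearest center (squared Euclidean)
        D = zeros(N, K);
        for k = 1:K
          diffs = bsxfun(@minus, X, C(k, :));
          D(:, k) = sum(diffs.^2, 2);
        end
        [~, idx] = min(D, [], 2);
        % recompute each center as the mean of its assigned points
        for k = 1:K
          members = (idx == k);
          if any(members)
            C(k, :) = mean(X(members, :), 1);
          end
        end
      end
    end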

  3. Clustering in HPC environment • Data Set Up: • Does it fit in memory? Split? Sample? • Tools and Processors: • Which machines/queues: normal compute nodes or vSMP? • How much coding/prototyping/optimization?

  4. Clustering in HPC environment • Data to Try: • NYTimes Articles • 1000 Genomes Data • Tools and Processors to Try: • Matlab: high-level math programming tool • Map/Reduce: C++ program library

  5. Matlab Parallel Computing Toolbox • Communication is handled for you (MPI or threads under the hood) • You still have to decide data/task set up • [Diagram: the CLIENT holds the full matrix X; each of LAB 1 … LAB N (separate nodes or threads) holds a local part of X]

  6. Matlab PCT in a nutshell • Distributed toolbox provides distribute/gather functions • In job submission: • Create a job object: createMatlabPoolJob(scheduler information) • Create tasks for that job: createTask(job, @myfunction, number_of_output_args, {parameters...}) • In your code (see the submission sketch below):

    spmd
      D = codistributed(X);   % or D = codistributed.build(X);
      <statements>
    end;
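A sketch of that submission pattern using the PCT API of this era (findResource, createMatlabPoolJob, and related functions were later deprecated); the scheduler type, worker-function name, and its arguments here are assumptions, not taken from the talk:

    % look up a scheduler ('local' here; a jobmanager or PBS/LSF type
    % would be used on a real cluster)
    sched = findResource('scheduler', 'type', 'local');
    job   = createMatlabPoolJob(sched);
    set(job, 'MaximumNumberOfWorkers', 8);
    % hypothetical worker function; third argument = number of output args
    createTask(job, @run_kmeans_spmd, 1, {X, num_clusters});
    submit(job);
    waitForState(job, 'finished');
    results = getAllOutputArguments(job);
    destroy(job);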

  7. Matlab PCT in a nutshell • A codistributed array is divided into local parts, each residing in the workspace of a different lab. • Practically: • find the bottleneck in the kmeans.m program • add code to distribute and gather data

  8. Matlab PCT pseudo code

    … old Kmeans code …
    % NEW CODE: distribute data matrix X across the labs
    spmd
      Xsd = codistributed(X);                  % declare X as distributed
      X_Local = getLocalPart(Xsd);             % get the part for this lab
      % also distribute the cluster means
      Csd = codistributed(Cluster_Means_Set);
      Cluster_Means_Local = getLocalPart(Csd);
      …

  9. Matlab PCT pseudo code

    % ALTERNATIVE: split the input file into parts beforehand and
    % use the lab index to read the correct file
    spmd
      currentlab = labindex;
      % read file ['nytimes_forlab_' num2str(currentlab)]
      …

  10. Matlab PCT pseudo code

      % calculate the distance matrix for this part as usual
      Distance_Part = get_distance(X_Local, Cluster_Means_Local);
    end;  % end of SPMD block
    % Distance_Part is now available on the client as a Composite,
    % with one part per lab; combine the parts:
    Distance = 0;
    for i = 1:num_labs
      Distance = Distance + Distance_Part{i};
    end
    … rest of Kmeans code …

  11. Matlab in vSMP setting • vSMP submission indicates threads • In the submission script, set environment variables for MKL (Intel's Math Kernel Library) • In Matlab code: setenv('MKL_NUM_THREADS', num2str(number_of_procs)); • No programming changes necessary, but programming considerations exist

  12. Matlab: threads vs comm. • [Plots: runtime (s) of matrix multiplication and matrix inversion with 8, 16, and 32 threads, for square matrices of size N = 10K, 20K, 30K, 40K, 50K (about 2, 6.5, 14, 25, 40 GB)] • threads: more is better for multiplication, less is better for inversion • (or use a different operation)

  13. Matlab original Kmeans Script • Difference_by_col = X(:,1) - Cluster_Means(1,1) • [Figure: X is N-by-P, Cluster_Means is M-by-P; each row is a point in R^P] • square the difference • sum as you loop across columns to get distances to each cluster center • Works better for large N, small P

  14. Matlab Kmeans Script altered • Difference_by_row = X(1,:) - Cluster_Means(1,:) • [Figure: same N-by-P matrix X and M-by-P Cluster_Means] • dot(Difference_by_row, Difference_by_row) • loop across rows to get distances • Works better for large P, and dot() will use threads • (both loop orders are sketched below)
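A compact sketch contrasting the two loop orders from slides 13 and 14. It assumes X (N-by-P data) and Cluster_Means (M-by-P centers) as in the figures; the loop bodies are illustrative, not the actual kmeans.m code:

    [N, P] = size(X);
    M = size(Cluster_Means, 1);

    % Column-wise accumulation (slide 13): loop over the P features,
    % accumulating squared differences; favors large N, small P.
    D1 = zeros(N, M);                      % N points x M cluster centers
    for p = 1:P
      diff_col = bsxfun(@minus, X(:, p), Cluster_Means(:, p)');  % N x M
      D1 = D1 + diff_col.^2;
    end

    % Row-wise version (slide 14): loop over points and centers, letting
    % dot() do the per-row work; favors large P, and dot() is threaded.
    D2 = zeros(N, M);
    for i = 1:N
      for m = 1:M
        d = X(i, :) - Cluster_Means(m, :);
        D2(i, m) = dot(d, d);
      end
    end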

  15. Matlab Kmeans Benchmarks • Kmeans on 10,000,000 entries from NYTimes articles (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words) • Running as full data matrix: ~45K articles x 102K words • Each cell holds a word count (double float) • about 37 GB in Matlab; total memory for the script about 61 GB • Kmeans (original) runtime ~50 hours • Kmeans (altered) runtime ~10 hours, 8 threads

  16. Matlab Kmeans Results • [Figure: cluster means shown as words, with each coordinate's value determining the word's font size] • 7 viable clusters found

  17. MapReduce Framework • A library for distributed computing • Started by Google, gaining popularity • Various implementations: Hadoop (distributed), Phoenix (threaded), Sandia (MPI) • [Diagram, after Ekanayake et al.: MR provides parallelization, concurrency, and intermediate key/value handling; the user defines the functions that output keys & values]

  18. Paradigmatic Example: string counting • Scheduler: manage threads, initiate the data split, and call Map • Map: count strings, output key=string & value=count • Scheduler: re-partition keys & values • Reduce: sum up the counts • [Diagram: MR provides parallelization, concurrency, and intermediate key/value handling; the user defines the Map and Reduce functions] • (see the sketch below)
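A serial MATLAB emulation of this map/reduce word-count pattern, purely to illustrate the key/value flow (Phoenix itself is a C library; the chunk contents below are made up):

    chunks = {{'cat','dog','cat'}, {'dog','fish'}};   % pre-split input

    % Map phase: each chunk independently emits (word, count) pairs
    partials = cell(size(chunks));
    for c = 1:numel(chunks)
      m = containers.Map('KeyType', 'char', 'ValueType', 'double');
      words = chunks{c};
      for w = 1:numel(words)
        key = words{w};
        if isKey(m, key), m(key) = m(key) + 1; else m(key) = 1; end
      end
      partials{c} = m;
    end

    % Reduce phase: sum the counts for each key across all chunks
    total = containers.Map('KeyType', 'char', 'ValueType', 'double');
    for c = 1:numel(partials)
      ks = keys(partials{c});
      for k = 1:numel(ks)
        key = ks{k};
        v = partials{c}(key);
        if isKey(total, key), total(key) = total(key) + v;
        else total(key) = v; end
      end
    end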

  19. MapReduce Kmeans clustering • C code for Kmeans (sample code with MapReduce Phoenix) • Use 10,000,000 entries from NYTimes articles • Running as full data matrix (int): ~45K docs x 102K word tokens, ~20 GB total in vSMP • Running time ~20 min, 32 threads • (a conceptual sketch of one map/reduce kmeans iteration follows)
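A conceptual, serial MATLAB rendering of one MapReduce kmeans iteration, showing the key/value structure: Map assigns each point to its nearest center and emits (cluster, [point, 1]); Reduce averages the emissions per cluster. This is an assumption-laden sketch, not the Phoenix sample program:

    function C_new = kmeans_mr_iteration(X, C)
      [N, P] = size(X); K = size(C, 1);
      sums   = zeros(K, P);   % accumulated per-key point sums
      counts = zeros(K, 1);   % accumulated per-key counts
      % Map phase (each point could be handled by a different worker)
      for i = 1:N
        d = sum(bsxfun(@minus, C, X(i, :)).^2, 2);  % distance to each center
        [~, k] = min(d);                            % key = nearest cluster
        sums(k, :) = sums(k, :) + X(i, :);          % emitted value, combined
        counts(k)  = counts(k) + 1;
      end
      % Reduce phase: new center = mean of the points emitted under its key
      C_new = C;
      nz = counts > 0;
      C_new(nz, :) = bsxfun(@rdivide, sums(nz, :), counts(nz));
    end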

  20. MapReduce Kmeans clustering • Use ~70,000,000 entries from NYTimes articles • full data matrix (int): ~300K docs x 102K word tokens • ~120 GB total in vSMP memory • Running time ~120 min, 32 threads • Running time ~175 min, serial version

  21. Case Study: Genomic Data (with Multi-Modal Imaging Lab, UCSD) • Genomic sequence database on over 1000 subjects (1000genomes.org) • Each sequence mapping is ~10 GB => ~10 TB total data • Goal: identify genetic variants and priors for analysis of brain imaging & sequence data together

  22. Exploring Genomic Data • How does genomic clustering match demographics? • What categories of genes (e.g. coding, regulation) account for differences? • Start small: kmeans clustering for one chromosome.

  23. Exploring Genomic Data • Starting small: Chromosome 11 aligned data • Shell script to retrieve 75 subjects • wget ftp://ftp-trace.ncbi.nih.gov/ … (about 700 MB each file) • Preprocessing: • Download BAM (binary sequence alignment data) utilities • Run BAM function to get consensus sequences (other consensus methods?) • Use perl, grep, wc, etc. to strip headers and get metadata about the files • Pick a coding scheme (A,C,G,T = 1,2,3,4; no allele = 0), sketched below • Gather summary statistics
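A minimal MATLAB sketch of that coding scheme, assuming the consensus sequence arrives as a character string (the fragment below is hypothetical):

    seq   = 'ACGTNNACG';              % hypothetical consensus fragment
    coded = zeros(1, numel(seq));     % 0 = no allele / unknown
    coded(seq == 'A') = 1;
    coded(seq == 'C') = 2;
    coded(seq == 'G') = 3;
    coded(seq == 'T') = 4;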

  24. Exploring Genomic Data • Data ends up as 250M integers (alleles): • 000001243111333324000004311322224 ….. • Try subsets: 3M, 10M, 50M integers • Use Matlab and MapReduce

  25. Case Study: Cluster Exploration • High correlation of clustering despite far fewer alleles used • [Plot: cluster assignment ('A' vs 'B') for subjects 1-75, with markers for the number of alleles used (. 3M, O 50M, + 10M); subject groups labeled Finnish, GBR, and all others]

  26. Case Study: Cluster Exploration • Do outliers belong in their own cluster? • Should these be reassigned? • [Plot: distance to cluster mean for subjects 1-75, with an outlier marked and candidate points for reassignment highlighted]

  27. Case Study: More steps • running multiple cluster sizes • visualizing clusters • using other distance functions (e.g. city-block) • doing more data with vSMP (in progress) • other cluster algorithms; compare to PCA, MDS

  28. How to use full genome? • Distance Speed-Up Heuristics: • sample columns during the distance calculation (sketched below) • start with subsets of data points (i.e. rows) and add one at a time • only process outliers or points in between clusters • HPC set up: • put some data on flash, some data in memory • use distributed jobs for some processing steps, vSMP for others
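A hedged sketch of the column-sampling heuristic, assuming X and Cluster_Means as on the earlier slides; the 10% sample rate and the P/s rescaling are illustrative choices, not from the talk:

    [N, P] = size(X);
    M    = size(Cluster_Means, 1);
    s    = round(0.1 * P);            % sample 10% of the columns
    cols = randperm(P, s);            % random column subset
    D_approx = zeros(N, M);
    for m = 1:M
      diffs = bsxfun(@minus, X(:, cols), Cluster_Means(m, cols));
      D_approx(:, m) = (P / s) * sum(diffs.^2, 2);   % rescaled estimate
    end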

  29. PACE Ongoing and Future • Continue building experience with large-memory trade-offs for data mining algorithms • Support a variety of ways to execute a variety of tools
