
Vertical K Median Clustering



Presentation Transcript


  1. Vertical K Median Clustering Amal Perera, William Perrizo {amal.perera, william.perrizo}@ndsu.edu Dept. of CS, North Dakota State University. CATA 2006 – Seattle Washington

  2. Outline
  • Introduction
  • Background
  • Our Approach
  • Results
  • Conclusions

  3. Introduction
  • Clustering: automated identification of groups of objects based on similarity.
  • Application areas: data mining, search-engine indexing, pattern recognition, image processing, trend analysis, and many others.
  • Clustering algorithm families: partition, hierarchical, density, and grid based.
  • Major problem: scalability with respect to data set size.
  • We propose a partition-based Vertical K Median Clustering.

  4. Background
  • Many clustering algorithms work well on small data sets.
  • Current approaches for large data sets include:
  • Sampling, e.g.:
  • CLARA: chooses a representative sample.
  • CLARANS: selects a randomized sample for each iteration.
  • Preserving summary statistics, e.g.:
  • BIRCH: a tree structure that records sufficient statistics for the data set; requires input parameters based on prior knowledge.
  • These techniques may lead to suboptimal solutions.

  5. Background
  • Partition clustering (k):
  • n objects in the original data set are broken into k partitions (iteratively, each pass yielding an improved k-clustering) to achieve a certain optimality criterion.
  • Computational steps:
  • Find a representative for each cluster component.
  • Assign every other object to the cluster of its best representative.
  • Calculate the error (repeat if the error is too high).
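The three computational steps above can be sketched as a generic partition-clustering loop. This is only an illustration of the iteration pattern (with a mean representative and Euclidean error, not the paper's vertical median approach); all names are hypothetical.

```python
# Hypothetical sketch of the generic k-partition loop (find representative,
# assign membership, repeat) -- not the paper's vertical implementation.
import random

def k_partition(points, k, max_iter=20):
    # 1. Pick an initial representative for each of the k clusters.
    reps = random.sample(points, k)
    for _ in range(max_iter):
        # 2. Assign every point to the cluster of its best (nearest) representative.
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(p, reps[i])))
            clusters[best].append(p)
        # 3. Recompute representatives; stop when they no longer change.
        new_reps = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else reps[i]
                    for i, cl in enumerate(clusters)]
        if new_reps == reps:
            break
        reps = new_reps
    return reps, clusters
```

On two well-separated groups, any initialization converges to the group centers within a few passes.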

  6. Our Approach
  • Scalability is addressed because:
  • it is a partition-based approach,
  • it uses a vertical data structure (the P-tree), and
  • the computation is efficient:
  • the partition representative is selected by a simple directed search across bit slices rather than down rows,
  • membership is assigned using bit slices with geometric reasoning, and
  • error is computed by position-based manipulation of bit slices.
  • Solution quality is improved or maintained while speed and scalability increase.
  • A median is used rather than a mean.

  7. P-tree* Vertical Data Structure
  • Predicate trees (P-trees) are lossless, compressed, and data-mining-ready.
  • They have been used successfully in KNN, ARM, Bayesian classification, etc.
  • A basic P-tree represents one attribute bit slice, reorganized into a tree structure by recursively subdividing while recording the truth value of the purity predicate for each subdivision.
  • Each level of the tree contains truth bits that represent pure subtrees.
  • Construction continues recursively down each tree path until a pure subdivision is reached.
  * Predicate Tree (P-tree) technology is patented by North Dakota State University (William Perrizo, primary inventor of record); patent number 6,941,303, issued September 6, 2005.
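The recursive construction described above can be sketched in a few lines. This is a minimal illustration, assuming the "pure 1" predicate and binary halving; the nested-tuple representation is hypothetical, not NDSU's actual compressed structure.

```python
# Minimal sketch of basic P-tree construction (assumption: binary halving,
# "pure 1" predicate). A node is 1 (pure-1), 0 (pure-0), or a mixed node
# ('m', left, right) that is subdivided further.
def build_ptree(bits):
    if all(bits):
        return 1          # pure-1 subdivision: record truth bit 1, stop
    if not any(bits):
        return 0          # pure-0 subdivision: this branch also ends
    mid = len(bits) // 2  # mixed: recurse on the two halves
    return ('m', build_ptree(bits[:mid]), build_ptree(bits[mid:]))
```

For the bit slice 0 0 0 0 1 0 1 1 (slide 8's R11 example), the first half is pure-0 and ends immediately, while the second half keeps subdividing until purity.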

  8. A file R(A1..An) contains horizontal structures (horizontal records), which are normally processed vertically (vertical scans). To build P-trees, the file is instead vertically partitioned into bit slices R11, R12, ..., R43, and each vertical bit slice is compressed into a basic P-tree P11, P12, ..., P43. A 1-dimensional P-tree is built by recording the truth of the predicate "pure 1" recursively on halves until purity is reached; a pure-0 half likewise ends its branch. For example, to count occurrences of the tuple 111 000 001 100, the basic P-trees (or their complements) are processed horizontally with one multi-operand logical AND: P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43.

  9. Centroid: Partition Representative
  • Mean vs. median:
  • The median is usually considered the better estimator; for example, it handles outliers much better.
  • Finding the true median vector is an NP-hard problem.
  • Existing median (medoid) based solutions:
  • PAM: exhaustive search.
  • CLARA: chooses a representative sample.
  • CLARANS: selects a randomized sample for each iteration.

  10. Vector of Medians
  • Vector of medians (Hayford, 1902): the vector of the median values from each individual dimension.
  • With a traditional horizontal approach:
  • Mean: N scans.
  • Median: 3N scans (requires a partial sort).
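The vector of medians is simply the componentwise median. A minimal horizontal-style sketch (the baseline the vertical method improves on):

```python
# Vector of medians (Hayford, 1902): one median per dimension.
# Horizontal baseline -- a vertical P-tree computation avoids these scans.
import statistics

def vector_of_medians(points):
    # zip(*points) transposes rows of points into per-dimension columns.
    return tuple(statistics.median(dim) for dim in zip(*points))
```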

  11. 000 001 010 011 100 101 110 111 0 2 4 2 3 5 1,_,_ 0,1,_ 0,1,1 0,0,_ 0,1,0 0,_,_ Pi,1 P'i,2 Pi,0 Pi,1 Pi,2 Pi,2 Pi,0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 rc =4 6 4 4 3 5 5 [1] Median with Ptrees Median Pattern • Starting from the Most Significant Bit, Repeatedly AND appropriate Bit Slice until the Least Significant Bit is reached while building the Median pattern. Distribution of values _,_,_ OneCnt ZeroCnt Corresp bit of Median < 0=hi bit > 010 1 < 0 Scalability? e.g., if the cardinality= 232=4,294,967,296 Rather than scan 4 billion records, we AND log2=32 P-trees.

  12. [2] Bulk Membership Assignment (not 1-by-1)
  • Find perpendicular-bisector boundaries from the centroids (vectors of attribute medians, which are easily computed as in the previous slide).
  • Assign membership to all the points within these boundaries at once.
  • Assignment is done using AND and OR of the respective bit slices, without a scan.
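The bulk assignment rests on range queries over bit slices: a mask of all records with an attribute value at or above a boundary threshold, built with AND/OR/NOT only. A sketch under the same packed-integer assumption as before (per-attribute masks can then be ANDed to select a hyper-rectangle):

```python
# P-tree-style range query: mask of records whose value >= t, computed
# purely from bit slices (no record scan). slices[0] is the MSB slice;
# bit i of each slice belongs to record i.
def ge_mask(slices, n, t, width):
    full = (1 << n) - 1
    gt = 0        # records already known to be strictly greater than t
    eq = full     # records equal to t on every bit examined so far
    for j, s in enumerate(slices):
        bit = (t >> (width - 1 - j)) & 1
        if bit == 0:
            gt |= eq & s       # equal-so-far records with a 1 here exceed t
            eq &= full & ~s
        else:
            eq &= s            # must also have a 1 here to stay >= t
    return gt | eq
```

ANDing such masks (and their complements for upper bounds) across attributes yields the membership mask for an entire boundary region in one shot.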

  13. Reverse HOBBit Membership
  • Assume some points (the red points in the slide figure) remain unassigned after hyper-rectangle pruning.
  • Starting from the higher-order bit, zoom into each HOBBit rectangle in which centroids exist and assign all the points in it to those centroids.
  • Stop before the total number of assigned points becomes smaller than the number of available points.
  • This may lead to multiple assignments; the motivation is efficiency over accuracy.

  14. [3] Efficient Error Computation
  • Error = sum of squared distances from the centroid (a) to the points in the cluster (X), computed from counts of ANDed bit slices (the formula appears as an image in the original slide), where:
  • Pi,j: P-tree for the jth bit of the ith attribute.
  • COUNT(P): count of the number of truth bits.
  • PX: P-tree (mask) for the cluster subset X.
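Since the slide's formula survives only as an image, here is one standard way the count-based computation can work, for a single attribute. It expands sum((x-a)^2) = sum(x^2) - 2a*sum(x) + |X|*a^2 and obtains each sum from popcounts of ANDed slices; the packed-integer slices are again a stand-in for compressed P-trees, and the function name is hypothetical.

```python
# Squared error of one attribute for a whole cluster, from bit-slice counts
# alone (no per-point scan). px masks the cluster subset X; a is the
# centroid component; slices[0] is the MSB slice.
def cluster_sq_error(slices, px, a, width):
    count = lambda m: bin(m).count("1")          # COUNT(P): truth-bit count
    size = count(px)                             # |X|
    # sum(x) = sum_j 2^j * COUNT(P_j AND P_X)
    sx = sum((1 << (width - 1 - j)) * count(s & px)
             for j, s in enumerate(slices))
    # sum(x^2) = sum_{j,k} 2^(j+k) * COUNT(P_j AND P_k AND P_X)
    sx2 = sum((1 << (2 * width - 2 - j - k)) * count(slices[j] & slices[k] & px)
              for j in range(width) for k in range(width))
    return sx2 - 2 * a * sx + size * a * a
```

Summing this quantity over all attributes gives the total squared error for the cluster.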

  15. Algorithm
  Input: DataSet, K, Threshold
  Output: K clusters
  Initialize K clusters for DataSet
  Repeat
    Assign membership using hyper-rectangle pruning
    Assign membership for points outside the boundary with reverse HOBBit OR a DB scan
    Find Error = sum of Sq.Dist(SetCi, Centroidi) for all i
    Find new centroid = vector of medians
  Until (Threshold < QualityGain | MaxIteration < Iteration)

  16. Experimental Results
  • Objective: quality and scalability.
  • Datasets:
  • Synthetic data: quality.
  • Iris Plant Data: quality.
  • KDD-99 Network Intrusion Data: quality.
  • Remotely Sensed Image (RSI) Data: scalability.
  • Quality is measured with an F-measure (formula shown as an image in the original slide), where F = 1 for perfect clustering, comparing each original cluster against each found cluster.
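The slide's F-measure formula survives only as an image; the following is the standard clustering F-measure one would assume from the description (weighted best F1 over original-class / found-cluster pairs), offered as a sketch rather than the authors' exact definition:

```python
# Assumed clustering F-measure: for each original class, take the best F1
# against any found cluster, then weight by class size. F = 1 iff perfect.
from collections import Counter

def f_measure(true_labels, pred_labels):
    n = len(true_labels)
    pair = Counter(zip(true_labels, pred_labels))  # n_ij contingency counts
    csize = Counter(true_labels)                   # original class sizes
    ksize = Counter(pred_labels)                   # found cluster sizes
    total = 0.0
    for c in csize:
        best = 0.0
        for k in ksize:
            nij = pair[(c, k)]
            if nij:
                p, r = nij / ksize[k], nij / csize[c]   # precision, recall
                best = max(best, 2 * p * r / (p + r))
        total += csize[c] / n * best
    return total
```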

  17. Results: Iterations
  • Synthetic data (executed until F-measure = 1).
  • Iteration counts per approach and dataset (table in the original slide).

  18. Results: Quality and Iterations
  • IRIS data for 3 classes.
  • NOTE: Quality(PAM) > Quality(CLARANS) > Quality(CLARA).

  19. (Figure-only slide; chart not recoverable from the transcript.)

  20. Results: Quality
  • UCI network data for 2, 4, and 6 classes.

  21. Results: Unit Performance
  • Time in seconds (1M RSI data for k = 4) on a P4 2.4 GHz with 4 GB (table in the original slide).
  * Root-mean-squared-error calculation overlaps with Find Membership.
  + Best C++ implementation from the standard template algorithm library.

  22. Results: Scalability

  23. Conclusions
  • Vertical bit-slice-based computation of the median is computationally less expensive than the best horizontal approach.
  • Hyper-rectangular queries can be used to make bulk cluster-membership assignments.
  • Position-based manipulation and accumulation of vertical bit slices can compute the squared error for an entire cluster without scanning the DB for individual data points.
  • Completely-vertical K Median Clustering is a scalable technique that can produce high-quality clusters at lower cost.
