
Region Discovery Using Supervised Clustering Algorithms

Presentation Transcript


  1. Region Discovery Using Supervised Clustering Algorithms Kim Keen Wee

  2. Outline • Goals of the Thesis • Introduction • Supervised Clustering (SC) • Fitness Function for Region Discovery • An Environment for Region Discovery • Experimental Results • Conclusion and Future Work

  3. Goals of the Thesis • Investigate using Supervised Clustering (SC) Algorithms for region discovery • Design a graphic display program to aid in visualizing the results of region discovery • Create census-based spatial datasets for the state of Wyoming • Analyze and compare the performance of SC algorithms in region discovery

  4. Outline • Goals of the Thesis • Introduction • Supervised Clustering (SC) • Fitness Function for Region Discovery • An Environment for Region Discovery • Experimental Results • Conclusion and Future Work

  5. Introduction • Spatial Data Mining seeks to discover meaningful and interesting patterns in data where a key dimension of the data is geographical location • Region Discovery subdivides a territory into disjoint, contiguous regions while optimizing some measure of interestingness

  6. Overview of Clustering • Identify groups of objects (clusters) in a dataset according to their similarity with respect to a particular distance metric • Three types of clustering: unsupervised (or traditional) clustering, semi-supervised clustering, and supervised clustering

  7. Outline • Goals of the Thesis • Introduction • Supervised Clustering (SC) • Fitness Function for Region Discovery • An Environment for Region Discovery • Experimental Results • Conclusion and Future Work

  8. Supervised Clustering (SC) • Representative-based: • Find a set of objects (representatives) that best represent the objects in a dataset • A solution is a set of representatives • Objects are assigned to their nearest representative to form clusters • The goal of SC is to find a clustering that minimizes the given fitness function or measure of interestingness
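
The assignment step described on this slide can be sketched in a few lines. This is a minimal illustration, assuming 2-D points and Euclidean distance; the function name and the sample coordinates are made up for the example and do not come from the thesis.

```python
import math

def assign_to_representatives(points, representatives):
    """Return, for each point, the index of its nearest representative."""
    assignment = []
    for p in points:
        # Euclidean distance from this point to every representative.
        distances = [math.dist(p, r) for r in representatives]
        assignment.append(distances.index(min(distances)))
    return assignment

# Illustrative data: two representatives, four points.
points = [(0.0, 0.0), (1.0, 0.5), (9.0, 9.0), (8.5, 10.0)]
representatives = [(0.0, 0.0), (9.0, 9.0)]
clusters = assign_to_representatives(points, representatives)
```

Each distinct index in `clusters` identifies one cluster, so a solution is fully described by its set of representatives.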

  9. Supervised Clustering Algorithms Used • SRIDHCR: Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart • SCEC: Supervised Clustering using Evolutionary Computation
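
The slides give only the expanded name of SRIDHCR, so the following is a rough sketch of what that name suggests rather than the thesis implementation: start from a random set of representatives, repeatedly apply the single insertion or deletion that lowers the fitness the most, stop at a local optimum, and repeat from several random starts. The toy data, the toy fitness, and all function names are assumptions made for illustration.

```python
import random

def sridhcr(n_objects, fitness, restarts=5, seed=0):
    """Hill climbing over sets of representative indices; lower fitness is better."""
    rng = random.Random(seed)
    best_reps, best_score = None, float("inf")
    for _ in range(restarts):
        # Random initial solution: a small random set of representatives.
        size = rng.randint(1, min(n_objects, 3))
        current = frozenset(rng.sample(range(n_objects), size))
        current_score = fitness(current)
        while True:
            # Neighbors: insert one non-representative or delete one representative.
            neighbors = [current | {i} for i in range(n_objects) if i not in current]
            if len(current) > 1:
                neighbors += [current - {i} for i in current]
            if not neighbors:
                break
            candidate = min(neighbors, key=fitness)  # steepest step
            candidate_score = fitness(candidate)
            if candidate_score >= current_score:
                break  # local optimum: no single change improves the fitness
            current, current_score = candidate, candidate_score
        if current_score < best_score:
            best_reps, best_score = current, current_score
    return sorted(best_reps), best_score

# Toy 1-D dataset (illustrative values only).
xs = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
labels = ["A", "A", "A", "B", "B", "B"]

def toy_fitness(reps):
    """Hypothetical fitness: minority fraction plus a small cost per cluster."""
    reps = sorted(reps)
    members = {r: [] for r in reps}
    for x, lab in zip(xs, labels):
        nearest = min(reps, key=lambda r: abs(x - xs[r]))
        members[nearest].append(lab)
    minority = sum(len(m) - max(m.count(c) for c in set(m))
                   for m in members.values() if m)
    return minority / len(xs) + 0.05 * len(reps)

reps, score = sridhcr(len(xs), toy_fitness, restarts=5, seed=0)
```

On this toy data the search settles on two representatives, one from each class, because any third representative only adds cost.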

  10. Outline • Goals of the Thesis • Introduction • Supervised Clustering (SC) • Fitness Function for Region Discovery • An Environment for Region Discovery • Experimental Results • Conclusion and Future Work

  11. Fitness Functions for Region Discovery • Three fitness functions to evaluate region discovery • Traditional SC Fitness Function • Gerrymandering Fitness Function • Reward-based Fitness Function

  12. Traditional SC Fitness Function • Tries to maximize the purity of clusters while keeping the number of clusters low

  13. Example of Traditional SC Fitness Function • Identify majority class for each cluster • Count minority examples for each cluster
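
The transcript does not reproduce the formula, so the sketch below assumes the form commonly used in the supervised clustering literature: q(X) = Impurity(X) + β × Penalty(k), where Impurity is the fraction of minority examples and Penalty(k) = sqrt((k − c)/n) when the number of clusters k is at least the number of classes c, and 0 otherwise. Treat both the formula and the sample clusters as assumptions, not as the thesis definition.

```python
import math
from collections import Counter

def traditional_sc_fitness(clusters, n_classes, beta):
    """clusters: list of lists of class labels, one inner list per cluster.

    Assumed form: impurity + beta * penalty (lower is better).
    """
    n = sum(len(c) for c in clusters)
    # Minority examples: everything in a cluster that is not its majority class.
    minority = sum(len(c) - Counter(c).most_common(1)[0][1] for c in clusters if c)
    impurity = minority / n
    k = len(clusters)
    # Penalize clusterings that use more clusters than there are classes.
    penalty = math.sqrt((k - n_classes) / n) if k >= n_classes else 0.0
    return impurity + beta * penalty

# Illustrative clustering: 9 objects, 2 classes, 3 clusters.
clusters = [["A", "A", "B"], ["B", "B", "B", "A"], ["A", "A"]]
q = traditional_sc_fitness(clusters, n_classes=2, beta=0.1)
```

Here two of the nine objects are minority examples, and the one extra cluster adds a penalty of 0.1 × sqrt(1/9).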

  14. Gerrymandering Fitness Function (1) • Seeks a clustering in which a particular class (the class of interest) dominates as many clusters as possible while minimizing the imbalance among clusters • Example: a total of 15 objects; class A has 6 objects, class B has 9 objects; let the class of interest be class A

  15. Gerrymandering Fitness Function (2) • The Gerrymandering fitness function incorporates three different criteria: • Maximize the number of clusters (regions) that are dominated by a particular class • Keep the number of regions close to the number specified by the user (controlled by parameter β; ǩ denotes the user-specified number of regions desired) • Maintain equality of population (controlled by parameter ζ)

  16. Gerrymandering Fitness Function (3)
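
The formula for this fitness function is not reproduced in the transcript, and the sketch below does not attempt to reconstruct it. It only computes three quantities that correspond to the criteria listed above for the 15-object example (6 objects of class A, 9 of class B). "Dominates" is assumed to mean a strict majority, and the imbalance measure (largest relative deviation from the mean cluster size) is a hypothetical stand-in.

```python
from collections import Counter

def gerrymandering_ingredients(clusters, class_of_interest, desired_k):
    """clusters: list of lists of class labels. Returns illustrative quantities only."""
    sizes = [len(c) for c in clusters]
    # Clusters in which the class of interest holds a strict majority (assumption).
    dominated = sum(
        1 for c in clusters if Counter(c)[class_of_interest] > len(c) / 2
    )
    mean_size = sum(sizes) / len(sizes)
    # Hypothetical imbalance measure: largest relative deviation from the mean size.
    imbalance = max(abs(s - mean_size) for s in sizes) / mean_size
    return {
        "dominated": dominated,
        "k_deviation": abs(len(clusters) - desired_k),
        "imbalance": imbalance,
    }

# 15 objects: class A has 6, class B has 9; class of interest is A.
balanced = [["A", "A", "B"]] * 3 + [["B", "B", "B"]] * 2
lopsided = [["A", "A", "B"]] * 3 + [["B"] * 6]
balanced_stats = gerrymandering_ingredients(balanced, "A", desired_k=5)
lopsided_stats = gerrymandering_ingredients(lopsided, "A", desired_k=5)
```

In both partitions class A dominates three clusters even though it is the overall minority, but the second partition does so with fewer, unevenly sized clusters, which the population-equality criterion is meant to discourage.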

  17. Reward-based Fitness Function (1) • Evaluates a clustering based on the density of a class of focus C and assigns rewards to regions in which the distribution of class C deviates significantly from the prior probability of class C in the whole dataset • The quality of a clustering, q_C(X), is the sum of the rewards τ_C(c) associated with each cluster c in X • Reward is higher for larger clusters using β=1

  18. Reward-based Fitness Function (2)

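  19. Example of Reward-based Fitness Function
      • Parameters: γ1=0.5, γ2=1.5, R+=1, R-=1, β=1.1
      • Prior(Poor)=0.2, n=1000
      • p(c1,Poor)=20/50=0.4, p(c2,Poor)=40/200=0.2, p(c3,Poor)=10/200=0.05, p(c4,Poor)=30/350=0.0857, p(c5,Poor)=100/200=0.5
      • Number line (figure): c3 and c4 lie below 0.1; c2 lies between 0.1 and 0.3; c1 and c5 lie above 0.3
      • Reward annotations (figure): c1: (0.4-0.3)/0.7 = 1/7; c3: (0.1-0.05)/0.1 = 1/2
      • $$q_{Poor}(X) = \frac{(\tfrac{1}{7} \times 50)^{1.1}}{1000} + 0 + \frac{(\tfrac{1}{2} \times 200)^{1.1}}{1000} + \frac{(0.143 \times 350)^{1.1}}{1000} + \frac{(\tfrac{2}{7} \times 200)^{1.1}}{1000} = 0.00869 + 0 + 0.15849 + 0.07402 + 0.08564 = 0.32684$$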
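
The arithmetic on this slide can be checked with a short script. The reward function below is an assumption inferred from the numbers on the slide: zero reward when the class density lies between γ1 × prior and γ2 × prior, and a linear interpolation outside that band, scaled by R- or R+. The exponent `eta` is not mentioned on the slide and is set to 1 here. The per-cluster term mirrors the grouping used in the slide's arithmetic, (reward × cluster size)^β / n.

```python
def reward(p, prior, gamma1, gamma2, r_plus, r_minus, eta=1.0):
    """Assumed reward: 0 inside [prior*gamma1, prior*gamma2], interpolated outside."""
    low, high = prior * gamma1, prior * gamma2
    if p < low:
        return ((low - p) / low) ** eta * r_minus
    if p > high:
        return ((p - high) / (1.0 - high)) ** eta * r_plus
    return 0.0

def reward_based_fitness(clusters, prior, gamma1, gamma2, r_plus, r_minus, beta):
    """clusters: list of (count_of_class_C, cluster_size) pairs."""
    n = sum(size for _, size in clusters)
    total = 0.0
    for count, size in clusters:
        tau = reward(count / size, prior, gamma1, gamma2, r_plus, r_minus)
        # Same grouping as the slide: (reward * size) ** beta / n.
        total += (tau * size) ** beta / n
    return total

# Clusters c1..c5 from the slide: (number of Poor objects, cluster size).
clusters = [(20, 50), (40, 200), (10, 200), (30, 350), (100, 200)]
q_poor = reward_based_fitness(
    clusters, prior=0.2, gamma1=0.5, gamma2=1.5, r_plus=1, r_minus=1, beta=1.1
)
```

With unrounded rewards the result comes out very slightly below the slide's 0.32684, since the slide rounds the c4 reward to 0.143.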

  20. Outline • Goals of the Thesis • Introduction • Supervised Clustering (SC) • Fitness Function for Region Discovery • An Environment for Region Discovery • Experimental Results • Conclusion and Future Work

  21. An Environment for Region Discovery • Components (from diagram): Spatial Datasets • Fitness Functions Support • Graphic Display Tool • RSC Algorithms

  22. Outline • Goals of the Thesis • Introduction • Supervised Clustering (SC) • Fitness Function for Region Discovery • An Environment for Region Discovery • Experimental Results • Conclusion and Future Work

  23. Objectives of the Experiments • To illustrate how SRIDHCR and SCEC work in region discovery for four Wyoming state spatial datasets and two artificial spatial datasets • To evaluate the performance of SRIDHCR and SCEC with three individual fitness functions in region discovery • To study how the parameter values of the three fitness functions affect the clustering results (regions discovered) and to select a set of good parameters for the fitness functions • To analyze and compare the performance of SRIDHCR and SCEC in region discovery

  24. Datasets Used in the Experiments • Artificial Datasets: Matlab datasets • Wyoming Datasets: created from U.S. Census Bureau data

  25. Creation of Wyoming Datasets: Six-Step Process for State Spatial Datasets Creation • Step 1: Obtain Boundary File of the State • Step 2: Preprocess the Boundary File • Step 3: Get Report of Census 2000 on Selected State (by County) • Step 4: Generate Random Population (by County) • Step 5: Associate Class Label Based on Census Data • Step 6: Combine all Counties
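
Steps 4 to 6 can be illustrated with a small sketch. It is a simplification under stated assumptions: each county is approximated by a bounding box rather than its real boundary polygon, and the class shares are made-up numbers, not Census 2000 figures. The function name and parameters are hypothetical.

```python
import random

def generate_county_points(bbox, class_shares, n_points, seed=0):
    """Return (x, y, label) tuples for one county.

    bbox: (min_x, min_y, max_x, max_y), a stand-in for the county boundary.
    class_shares: mapping from class label to its share of the population.
    """
    rng = random.Random(seed)
    min_x, min_y, max_x, max_y = bbox
    classes = list(class_shares)
    weights = [class_shares[c] for c in classes]
    points = []
    for _ in range(n_points):
        # Step 4: random location inside the (approximated) county.
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        # Step 5: class label drawn according to the county's class shares.
        label = rng.choices(classes, weights=weights, k=1)[0]
        points.append((x, y, label))
    return points

# Two made-up counties; the shares are illustrative values only.
county_a = generate_county_points((0.0, 0.0, 1.0, 1.0), {"Poor": 0.2, "NotPoor": 0.8}, 100, seed=1)
county_b = generate_county_points((1.0, 0.0, 2.0, 1.0), {"Poor": 0.4, "NotPoor": 0.6}, 100, seed=2)

# Step 6: combine all counties into one state-level dataset.
dataset = county_a + county_b
```

A fuller version would reject points that fall outside the county polygon from the boundary file instead of relying on a bounding box.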

  26. Original Wyoming Datasets (Census 2000) • Household Income in 1999 • Poverty Status in 1999 • Age • Race

  27. Wyoming Poverty Dataset and Modified Poverty Dataset • Figure panels: Wyoming Poverty Dataset; Modified Poverty Dataset

  28. Example Output of Clustering • Each color represents a cluster • Classes are represented by different point shapes • Representatives are circled in white

  29. Clustering using Traditional SC Fitness Function

  30. Modified Poverty Dataset

  31. SCEC – Traditional SC Fitness Function (1) • Clustering Output of Modified Poverty Dataset (parameter: β=0.3)

  32. SCEC – Traditional SC Fitness Function (2) • Clustering Output of Modified Poverty Dataset (parameter: β=0.1)

  33. SRIDHCR – Traditional SC Fitness Function • Clustering Output of Modified Poverty Dataset (parameter: β=0.1)

  34. Clustering using Gerrymandering Fitness Function

  35. Wyoming Age Dataset

  36. Wyoming Modified Age Dataset

  37. SCEC – Gerrymandering Fitness Function (1) • Clustering Output of Wyoming Age Dataset (parameters: ǩ=7, β=30000, ζ=0.01) • Figure labels: 5, 7

  38. SCEC – Gerrymandering Fitness Function (2) • Clustering Output of Wyoming Age Dataset (parameters: ǩ=12, β=30000, ζ=0.01) • Figure labels: 10, 12

  39. SCEC – Gerrymandering Fitness Function (3) • Clustering Output of Wyoming Age Dataset (parameters: ǩ=12, β=30000, ζ=0.08) • Figure labels: 7, 12

  40. SRIDHCR – Gerrymandering Fitness Function (1) • Clustering Output of Wyoming Age Dataset (parameters: ǩ=12, β=30000, ζ=0.01) • Figure labels: 11, 12

  41. SRIDHCR – Gerrymandering Fitness Function (2) • Clustering Output of Wyoming Age Dataset (parameters: ǩ=12, β=30000, ζ=0.08) • Figure labels: 8, 12

  42. Clustering using Reward-based Fitness Function

  43. Wyoming Poverty Dataset

  44. SCEC – Reward-based Fitness Function (2) • Clustering Output of Wyoming Poverty Dataset (parameters: γ1=0.5, γ2=1.5, R+=10, R-=10, β=1.1)

  45. Modified Poverty Dataset

  46. SCEC – Reward-based Fitness Function (3) • Clustering Output of Modified Poverty Dataset (parameters: γ1=0.5, γ2=1.5, R+=10, R-=10, β=1.1)

  47. Wyoming Income Dataset

  48. SCEC – Reward-based Fitness Function (4) • Clustering Output of Wyoming Income Dataset, class of interest: class 1 (parameters: γ1=0.5, γ2=1.5, R+=10, R-=10, β=2)

  49. SCEC – Reward-based Fitness Function (5) • Clustering Output of Wyoming Income Dataset, class of interest: class 4 (parameters: γ1=0.5, γ2=1.5, R+=10, R-=10, β=2)

  50. SCEC – Reward-based Fitness Function (6) • Clustering Output of Wyoming Income Dataset, class of interest: class 4 (parameters: γ1=0.5, γ2=1.5, R+=10, R-=0, β=1.1)
