Introduction to clustering
by Scott Amack


What is data mining?

Data mining is also known by several other names:

  • knowledge mining from data

  • knowledge extraction

  • data/pattern analysis

  • knowledge discovery from data


Why is data mining important?

  • Data mining helps us to find interesting patterns that we may not have noticed otherwise.

  • Not all patterns are interesting


Applications of data mining

  • Google Flu Trends http://www.google.org/flutrends/

  • Grocery stores (market basket analysis, e.g. the rule "diapers imply beer")

  • Intrusion Detection Systems (outlier analysis)

  • Search Engines

  • Fraud Detection


What is clustering?

  • Clustering is the process of partitioning a set of data objects into subsets.

  • Each subset is a cluster.

  • Objects within a cluster are similar to one another.

  • Objects in different clusters are dissimilar to one another.


Why use clustering?

  • Clustering can lead to the discovery of previously unknown groups within a data set.


Business Intelligence:

  • organize large numbers of customers into groups

  • this helps develop business strategies

  • this enhances customer relationships


Image Recognition:

  • used to recognize different handwriting styles (all 2’s are placed in a cluster)

  • can help optical character recognition


Web Searches:

  • Search results can be organized into clusters

  • helps organize the large number of hits returned by a query

  • can cluster documents into topics


Clustering is Automatic Classification

  • Classification is the process of finding a model that describes and distinguishes data classes.

  • Clustering is unsupervised learning because class label information is not present.

  • Clustering is a form of learning by observation.


Partitioning Method

  • k-means

  • Given a data set D of n objects and the number of clusters k to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster.


k-means algorithm

  • Each cluster’s center is represented by the mean value of the objects in the cluster.
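
The cluster center can be written out in symbols. This is only a sketch: the notation (C_i for the i-th cluster, c_i for its center, E for the total within-cluster variation) is assumed here, and so is the use of squared Euclidean distance, which the slides do not specify.

```latex
c_i = \frac{1}{|C_i|} \sum_{p \in C_i} p,
\qquad
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - c_i \rVert^2
```

Under these assumptions, a good clustering is one where E is small: every object lies close to the mean of its own cluster.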


Input:

  • k: the number of clusters

  • The value of k may need to be adjusted for optimal results. This is not done automatically by the basic algorithm.

  • D: a data set containing n objects


Output:

  • A set of k clusters


Method:

  • arbitrarily choose k objects from D as the initial cluster centers;

  • repeat until no assignment changes:

    • (re)assign each object to the cluster whose mean it is most similar to;

    • update the cluster means, i.e. recompute the mean of the objects in each cluster.
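
The method can be sketched in plain Python. This is an illustration under stated assumptions, not the presenter's code: similarity is taken to be squared Euclidean distance, an empty cluster keeps its previous center, and the names `k_means`, `squared_distance`, `mean_point`, `seed` and `max_iter` are all hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, k, seed=None, max_iter=100):
    """Basic k-means on a list of numeric tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Arbitrarily choose k objects from the data set as the initial centers.
    centers = [tuple(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in data
        ]
        # Stop once no object changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update the cluster means (assumption: empty clusters keep their center).
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centers[j] = mean_point(members)
    return centers, labels


# Usage: two visibly separated groups of 2-D points, k = 2.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(data, k=2, seed=0)
print(centers)
print(labels)
```

On this small example the first three points end up in one cluster and the last three in the other. Which of the two clusters is numbered 0 depends on the random initial choice.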


Example input data:

  • 28.7812 34.4632 31.3381 31.2834 28.9207 33.7596 25.3969 27.7849 35.2479 27.1159 32.8717 29.2171 36.0253 32.337 34.5249 32.8717 34.1173 26.5235 27.6623 26.3693 25.7744 29.27 30.7326 29.5054 33.0292 25.04 28.9167 24.3437 26.1203 34.9424 25.0293 26.6311 35.6541 28.4353 29.1495 28.1584 26.1927 33.3182 30.9772 27.0443 35.5344 26.2353 28.9964 32.0036 31.0558 34.2553 28.0721 28.9402 35.4973 29.747 31.4333 24.5556 33.7431 25.0466 34.9318 34.9879 32.4721 33.3759 25.4652 25.8717

  • 24.8923 25.741 27.5532 32.8217 27.8789 31.5926 31.4861 35.5469 27.9516 31.6595 27.5415 31.1887 27.4867 31.391 27.811 24.488 27.5918 35.6273 35.4102 31.4167 30.7447 24.1311 35.1422 30.4719 31.9874 33.6615 25.5511 30.4686 33.6472 25.0701 34.0765 32.5981 28.3038 26.1471 26.9414 31.5203 33.1089 24.1491 28.5157 25.7906 35.9519 26.5301 24.8578 25.9562 32.8357 28.5322 26.3458 30.6213 28.9861 29.4047 32.5577 31.0205 26.6418 28.4331 33.6564 26.4244 28.4661 34.2484 32.1005 26.691

  • 31.3987 30.6316 26.3983 24.2905 27.8613 28.5491 24.9717 32.4358 25.2239 27.3068 31.8387 27.2587 28.2572 26.5819 24.0455 35.0625 31.5717 32.5614 31.0308 34.1202 26.9337 31.4781 35.0173 32.3851 24.3323 30.2001 31.2452 26.6814 31.5137 28.8778 27.3086 24.246 26.9631 25.2919 31.6114 24.7131 27.4809 24.2075 26.8059 35.1253 32.6293 31.0561 26.3583 28.0861 31.4391 27.3057 29.6082 35.9725 34.1444 27.1717 33.6318 26.5966 25.5387 32.5434 25.5772 29.9897 31.351 33.9002 29.5446 29.343

  • 25.774 30.5262 35.4209 25.6033 27.97 25.2702 28.132 29.4268 31.4549 27.32 28.9564 28.9916 29.9578 30.2773 30.4447 24.3037 24.314 35.0966 25.3679 32.0968 33.3303 25.0102 35.3155 31.6264 29.2806 34.2021 26.5077 32.2279 25.5265 24.824 27.5587 28.3714 32.3667 26.9752 35.9346 35.1146 24.3749 27.6083 27.8433 29.8557 32.4185 26.8908 31.3209 29.3849 34.3336 24.7381 35.769 31.8725 34.2054 31.156 34.6292 28.7261 28.2979 31.5787 34.6156 32.5492 30.9827 24.8938 27.3659 25.3069

  • 27.1798 29.2498 33.6928 25.6264 24.6555 28.9446 35.798 34.9446 24.5596 34.2366 27.9634 25.3216 35.4154 34.862 25.1472 29.4686 33.1739 31.1274 31.3701 26.5173 28.6486 31.6565 35.9497 33.0321 24.6081 33.2025 27.4335 32.6355 35.8773 28.0295 33.1247 33.4129 26.9245 30.2123 29.6526 30.8644 24.5119 33.9931 33.3094 33.204 31.2651 27.9072 35.111 35.0757 33.833 25.9481 29.1348 24.2875 32.3223 34.9244 27.7218 27.9601 35.7198 27.576 35.3375 29.9993 34.2149 33.1276 31.1057 31.0179

  • 25.5067 29.7929 28.0765 34.4812 33.8 27.6671 30.6122 25.6393 30.1171 26.5188 30.1524 27.8514 29.5582 32.3601 29.2064 26.1001 33.4677 33.901 29.2674 34.8311 31.9815 26.496 32.6645 27.7188 35.7385 32.8309 30.1509 30.5593 27.3321 27.4559 24.2361 34.7268 29.9207 27.273 35.9963 32.3917 27.139 26.4589 25.0466 35.5002 27.9961 25.8897 31.3951 30.7583 34.9652 28.0919 35.6706 33.4401 28.458 31.1795 26.9458 35.8381 26.7134 25.1641 27.341 25.2093 33.4669 24.1094 33.1669 35.4907

  • 28.6989 29.2101 30.9291 34.6229 31.4138 28.4636 35.9115 32.9058 28.7669 24.2868 34.8983 33.7291 29.1154 26.2804 33.4559 31.6103 33.3061 24.553 29.1587 27.8378 25.3525 25.2126 26.9565 27.9928 29.5057 31.0723 26.3605 27.7434 34.0438 25.1053 24.4462 35.4191 33.3472 32.2356 24.5244 29.4635 24.6889 28.1962 34.2994 31.6316 30.8005 35.7727 31.3444 25.5691 32.7839 32.7707 24.1047 34.006 28.8249 24.0499 29.8274 24.0323 31.0756 34.3358 25.4358 25.893 35.6732 25.1869 29.6669 26.4637

  • 30.9493 34.317 35.5674 34.8829 30.6691 35.2667 35.895 25.9022 28.8917 32.2092 28.9898 26.0572 31.7516 32.294 31.0631 24.1612 26.6554 25.2452 30.5956 31.391 32.1604 33.7765 31.1336 32.626 28.8616 27.6223 33.9381 33.9836 34.8895 29.4617 34.5734 32.4431 30.0745 25.0495 29.2942 28.2689 28.4819 29.8917 33.1162 26.4574 27.4442 33.0784 33.2286 27.5837 24.4895 26.2151 24.0331 26.4765 34.8568 30.5934 35.4341 31.1248 24.2424 29.7172 35.9365 36.0187 26.3866 33.1842 31.3025 34.523

  • 35.2538 34.6402 35.7584 28.551 25.6518 29.6442 31.94 35.9086 28.9622 24.6224 29.7635 29.5098 28.2109 34.2855 27.5473 25.4274 32.3429 34.79 33.7012 25.3495 33.7603 26.4442 24.5097 30.4135 28.4948 28.8433 32.4284 24.5071 31.7032 29.8722 35.852 35.7172 27.1922 24.3206 25.2698 29.6203 24.3243 31.1824 25.0701 31.8824 28.6468 32.857 24.7469 29.3045 27.3994 26.1497 31.9613 34.7492 24.8681 32.0285 32.7486 33.5455 24.1792 25.398 35.2296 28.3435 26.5999 28.9487 30.2372 32.3833

  • 29.1734 31.5089 33.1944 35.6177 31.589 35.1223 28.6222 34.3145 33.1096 35.6842 29.7341 28.2495 30.4642 24.7513 26.3774 27.9832 33.6256 33.9634 33.7699 24.4594 26.146 32.9271 27.3248 35.6258 31.3697 26.8933 24.1528 25.8813 30.6996 32.7152 24.3243 29.6383 27.6021 26.7215 24.3555 34.1569 27.5892 24.3209 32.5984 25.127 35.3361 36.0073 34.266 26.9172 31.7032 35.5142 28.3009 35.5839 31.8163 33.7563 35.6908 25.5129 30.5505 33.3061 32.9789 24.0496 32.8313 29.2413 35.4543 25.8732

  • 35.2623 35.6805 31.0851 30.2589 24.1366 27.0766 25.777 33.1482 24.7965 26.2804 33.3828 28.5484 26.7417 35.9152 30.3841 31.6716 24.0114 33.0446 33.5631 33.9432 31.3396 24.4979 31.7928 34.6457 26.8155 33.8731 26.529 35.2714 30.506 29.6243 32.4622 33.0465 29.1157 26.2378 28.1837 32.0623 28.3332 30.979 33.2201 24.2508 27.1735 32.1189 24.5262 30.8321 27.5047 26.0098 25.2243 24.1142 29.4958 26.2298 30.5457 24.8259 29.9075 28.0949 31.5622 26.4082 34.1331 28.1918 30.2934 29.4113

  • 26.7115 24.0969 30.0213 29.9423 31.546 33.7673 26.7519 28.7294 26.0583 35.7051 33.7974 25.0704 33.0428 25.8115 33.0476 26.1295 34.1154 31.8647 29.3071 35.8575 28.5234 28.8444 29.064 29.9277 25.607 29.2266 35.8355 25.3635 29.9563 26.7468 27.8587 34.6508 24.1421 31.6518 34.3266 25.0469 32.1332 32.3726 35.2883 33.8705 27.6212 25.3705 30.0375 31.0275 35.0871 32.3201 35.2013 28.5212 31.4061 29.7925 32.4699 28.1966 31.9195 35.4859 24.267 30.7084 28.9351 35.5112 24.769 33.8874

  • 34.2296 32.8783 24.2457 26.9036 25.8677 34.0545 30.8923 28.7287 26.8493 25.1119 35.5939 32.8941 31.6723 31.0947 27.8547 27.3891 32.38 33.363 28.6751 35.9185 28.6512 31.0771 27.9359 24.4726 32.836 33.6076 32.7391 28.5572 26.6983 29.3717 30.0338 32.171 25.1575 30.7488 33.8756 26.0792 35.3339 36.0081 32.6829 29.1686 30.0925 31.9654 30.9023 31.138 25.2074 34.7371 24.5134 25.3686 30.3896 27.58 32.5904 30.0628 35.5009 33.8415 24.3448 25.8394 30.8182 27.5833 32.3322 32.7611


Results


Problems with k-means

  • is not guaranteed to converge to a global optimum and often terminates at a local optimum

  • results depend on the initial random selection of cluster centers

  • in practice it is often run multiple times with different initial centers to obtain good results
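
A small sketch of the "run it several times" remedy follows. It assumes that a run is scored by the sum of squared Euclidean distances from each object to its cluster center, a criterion the slides do not name. The helper names `run_k_means`, `within_cluster_sse` and `best_of_restarts` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def run_k_means(data, k, rng, max_iter=100):
    """One k-means run from randomly chosen initial centers."""
    centers = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


def within_cluster_sse(data, centers, labels):
    """Sum of squared distances from each object to its cluster center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(data, labels))


def best_of_restarts(data, k, restarts=10, seed=0):
    """Run k-means several times and keep the run with the lowest score."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centers, labels = run_k_means(data, k, rng)
        sse = within_cluster_sse(data, centers, labels)
        if best is None or sse < best[0]:
            best = (sse, centers, labels)
    return best


# Four corners of a wide rectangle: splitting left/right scores 1.0,
# while splitting top/bottom is a stable local optimum that scores 16.0.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
sse, centers, labels = best_of_restarts(corners, k=2, restarts=20, seed=0)
print(sse, centers, labels)
```

If both initial centers happen to fall on the same short side of the rectangle, a single run settles on the top/bottom split and never leaves it. Keeping the best of several runs makes that outcome very unlikely to be the final answer.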


More problems

  • The time complexity of the k-means algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations.

  • can only be applied when the mean of a set is defined

  • is very sensitive to outliers
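
The sensitivity to outliers comes from the mean itself, as a one-dimensional illustration shows. The values and the function name `centers_with_outlier` are made up for this sketch, and the median is included only as a point of comparison.

```python
from statistics import mean, median


def centers_with_outlier(values, outlier):
    """Mean and median of a 1-D cluster before and after adding one outlier."""
    extended = values + [outlier]
    return {
        "mean_before": mean(values),
        "mean_after": mean(extended),
        "median_before": median(values),
        "median_after": median(extended),
    }


result = centers_with_outlier([1.0, 2.0, 3.0, 4.0, 5.0], 100.0)
print(result)
```

One extreme value moves the mean from 3.0 to about 19.2, far from every original object, while the median only moves from 3.0 to 3.5. A cluster center defined as a mean is dragged toward outliers in the same way.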


Other clustering algorithms.

  • There are many other clustering algorithms to explore.


Questions?

