Explore the fundamentals of clustering methods, including data preparation, representation, evaluation, and types of clustering methods to organize unlabeled data into groups. Learn the application of clustering in various fields such as data mining, information retrieval, and marketing.
College of Science & Technology Dept. of Computer Science & IT BSc of Information Technology Data Mining Chapter 6: Clustering Methods Prepared by: Mahmoud Rafeek Al-Farra 2013 www.cst.ps/staff/mfarra
Course Outline • Introduction • Data Preparation and Preprocessing • Data Representation • Classification Methods • Evaluation • Clustering Methods • Mid Exam • Association Rules • Knowledge Representation • Special Case Study: Document Clustering • Discussion of Case Studies by Students
Outline • Definition of Clustering • Why clustering? • Where to use clustering? • Next: Types of Data in Cluster Analysis • Next: A Categorization of Major Clustering Methods
Definition of Clustering • Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. • Clustering is “the process of organizing objects into groups whose members are similar in some way”. • A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Definition of Clustering • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Grouping a set of data objects into clusters • Clustering is unsupervised classification: no predefined classes
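The "similar within, dissimilar between" idea can be checked numerically. The following is a minimal sketch, assuming 2-D points, Euclidean distance as the similarity measure, and a hand-made grouping; the data and the names `clusters`, `mean_intra_distance`, and `mean_inter_distance` are hypothetical and are not taken from the slides.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical toy data: two hand-made groups of 2-D points.
clusters = {
    "a": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    "b": [(8.0, 8.0), (9.0, 8.5), (8.5, 9.5)],
}

def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [
        dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(pairs) / len(pairs)

def mean_inter_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    pairs = [
        dist(p, q)
        for points_a, points_b in combinations(clusters.values(), 2)
        for p in points_a
        for q in points_b
    ]
    return sum(pairs) / len(pairs)

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

For a grouping that matches the definition, the intra-cluster average should be clearly smaller than the inter-cluster average. Other similarity measures could be substituted for `dist` depending on the attribute types.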
Why clustering? • Simplification • Pattern detection • Useful in data concept construction • Unsupervised learning process
Where to use clustering? • Data mining • Information retrieval • Text mining • Web analysis • Marketing • Medical diagnosis
Which method should I use? • Type of attributes in the data • Scalability to larger datasets • Ability to work with irregular data • Time cost • Complexity • Data order dependency • Result presentation
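To make the "data order dependency" criterion concrete, here is a small sketch using a simple single-pass "leader" scheme. This algorithm is an assumption chosen only for illustration, not a method named in the slides; the function `leader_clustering`, the `threshold` parameter, and the data values are hypothetical.

```python
def leader_clustering(values, threshold):
    """Single pass: put each value in the first cluster whose leader
    (its first member) is within `threshold`; otherwise start a new cluster."""
    clusters = []  # each cluster is a list; element 0 is the leader
    for v in values:
        for cluster in clusters:
            if abs(v - cluster[0]) <= threshold:
                cluster.append(v)
                break
        else:
            # No existing leader is close enough, so v becomes a new leader.
            clusters.append([v])
    return clusters

# The same three values, presented in two different orders.
forward = leader_clustering([1.0, 2.0, 3.0], threshold=1.0)
reordered = leader_clustering([2.0, 1.0, 3.0], threshold=1.0)

print(forward)    # [[1.0, 2.0], [3.0]]
print(reordered)  # [[2.0, 1.0, 3.0]]
```

The same data yields two clusters in one order and a single cluster in the other, which is the kind of behavior to look for when judging whether a method depends on input order.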