Dr. M. Sulaiman Khan (mskhan@liv.ac.uk) ‏ Dept. of Computer Science University of Liverpool 2010

COMP207: Data Mining COMP207: Data Mining Dr. M. Sulaiman Khan • (mskhan@liv.ac.uk)‏ • Dept. of Computer Science • University of Liverpool • 2010 Introduction to Data Mining

Today's Topics COMP207: Data Mining • What is Data Mining? • Definitions • KDD: Knowledge Discovery in Databases • KDD Process • Differences with Statistics • Views on the Process • Basic Functions • Why would you do this? • Motivations • Applications • Summary Introduction to Data Mining

What is Data Mining? COMP207: Data Mining • Some Definitions: • “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” (Piatetsky-Shapiro) • "...the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, ... or data streams." (Han, pg xxi)‏ • “...the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful...” (Witten, pg 5)‏ • “...finding hidden information in a database.” (Dunham, pg 3)‏ • “...the process of employing one or more computer learning techniques to automatically analyse and extract knowledge from data contained within a database.” (Roiger, pg 4)‏ Introduction to Data Mining

What is Data Mining? COMP207: Data Mining Keywords from each definition: • “The nontrivialextraction of implicit, previously unknown, and potentially useful informationfrom data” (Piatetsky-Shapiro)‏ • "...the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, ... or data streams." (Han, pg xxi) • “...the process of discoveringpatterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful...” (Witten, pg 5)‏ • “...finding hidden informationin a database.” (Dunham, pg 3)‏ • “...the process of employing one or more computer learning techniques to automaticallyanalyze and extractknowledgefrom data contained within a database.” (Roiger, pg 4)‏ Introduction to Data Mining

KDD: Knowledge Discovery in Databases COMP207: Data Mining Many texts treat KDD and Data Mining as the same process, but it is also possible to think of Data Mining as the discovery part of KDD. Dunham: KDD is the process of finding useful information and patterns in data.Data Mining is the use of algorithms to extract information and patterns derived by the KDD process. For this course, we will discuss the entire process (KDD) but focus mostly on the algorithms used for discovery. Introduction to Data Mining

KDD: Knowledge Discovery in Databases COMP207: Data Mining KDD (Knowledge Discovery in Databases) is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. (Fayyad, Shapiro, & Smyth, CACM 96) • Or KDD : non-trivial extraction of implicit, previously unknown and potentially useful information • Data mining is just a part of the KDD process • Data mining applies algorithms to large data to produce models or patterns interesting to the user. Introduction to Data Mining

The Data Mining (KDD) Process COMP207: Data Mining Introduction to Data Mining

KDD Process Components COMP207: Data Mining • Operational Data • - Day-to-day data used to run business • Clean, collect and summarise • - Most Data is not suitable for data mining • - Errors or Noise, missing data, invalid formats • Data warehouse • - Mega store of clean (analysis) data • Data Preparation • - Validating the data for mining (e.g. remove noise, formatting, running validation routines etc.) • Training Data – Data used as test case for mining • Data Mining – the process of applying mining algorithms on data to produce interesting patterns Introduction to Data Mining

Data Mining Algorithms scale to large data Data is used secondary for Data mining DM–tools use background knowledge for End-User Strategy : Exploration Cyclic Statistics Many algorithms with quadratic run time. Data is used for the Statistic (primary) Statistical background is often required Strategy: Conformational Verifying Few loops Differences with Statistics COMP207: Data Mining Introduction to Data Mining

Piatetsky-Shapiro View COMP207: Data Mining Knowledge Interpretation Data Model Data Mining Transformed Data Transformation Preprocessed Data Preprocessing Target Data Selection Initial Data (As tweaked by Dunham)‏ Introduction to Data Mining

CRISP-DM View COMP207: Data Mining Introduction to Data Mining

Data Mining Functions COMP207: Data Mining All Data Mining functions can be thought of as attempting to find a model to fit the data. Each function needs Criteria to create one model over another. Each function needs a technique to Compare the data. Two types of model: • Predictive models predict unknown values based on known data • Descriptive models identify patterns in data Each type has several sub-categories, each of which has many algorithms. We won't have time to look at ALL of them in detail. Introduction to Data Mining

Data Mining Functions COMP207: Data Mining Classification: Maps data into predefined classes Regression: Maps data into a function Prediction: Predict future data states Time Series Analysis: Analyze data over time (Supervised Learning)‏ Predictive Data Mining Clustering: Find groups of similar items Association Rules: Find relationships between items Characterisation: Derive representative information Sequence Discovery: Find sequential patterns (Unsupervised Learning)‏ Descriptive Introduction to Data Mining

Classification COMP207: Data Mining The aim of classification is to create a model that can predict the 'type' or some category for a data instance that doesn't have one. Two phases: 1. Given labelled data instances, learn model for how to predict the class label for them. (Training)‏ 2. Given an unlabelled, unseen instance, use the model to predict the class label. (Prediction)‏ Some algorithms predict only a binary split (yes/no), some can predict 1 of N classes, some give probabilities for each of N classes. Introduction to Data Mining

Clustering COMP207: Data Mining The aim of clustering is similar to classification, but without predefined classes. Clustering attempts to find clusters of data instances which are more similar to each other than to instances outside of the cluster. Unsupervised Learning: learning by observation, rather than by example. Some algorithms must be told how many clusters to find, others try to find an 'appropriate' number of clusters. Introduction to Data Mining

Association Rule Mining COMP207: Data Mining The aim of association rule mining is to find patterns that occur in the data set frequently enough to be interesting. Hence the association or correlation of data attributes within instances, rather than between instances. These correlations are then expressed as rules – if X and Y appear in an instance, then Z also appears. Most algorithms are extensions of a single base algorithm known as 'A Priori', however a few others also exist. Introduction to Data Mining

Why? COMP207: Data Mining That all sounds ... complicated. Why should I learn about Data Mining? What's wrong with just a relational database? Why would I want to go through these extra [complicated] steps? Isn't it expensive? It sounds like it takes a lot of skill, programming, computational time and storage space. Where's the benefit? Data Mining isn't just a cute academic exercise, it has very profitable real world uses. Practically all large companies and many governments perform data mining as part of their planning and analysis. Introduction to Data Mining

Why Data Mining? Some general reasons COMP207: Data Mining • We are Data rich but knowledge poor • Computing affordable • - Storage, CPU, networking • Data is too large to analyse (Very Large Databases (VLBD) • - Dimensionality (size) • - distributed (location spread) • - heterogeneous (different types of data) • Traditional techniques infeasible • - Statistics, databases • Competitive pressure in business enterprises • - Customer profiling (Need to know who is a good customer) • - Business to Business (B2B – Being “old” is not profitable) Introduction to Data Mining

Data is Everywhere! COMP207: Data Mining • Relational database—A commodity of every enterprise • Huge data warehouses are under construction • POS (Point of Sales): Transactional DBs in terabytes • Object, relational, distributed, heterogeneous and legacy databases • Spatial databases (GIS), remote sensing database (EOS), and scientific/engineering databases (Genetic data etc) • Time-series data (e.g., stock trading) and temporal data • Text (documents, emails) and multimedia databases • WWW:A huge, hyper-linked, dynamic, global information system (XML, Web content and Web usage data) • Crime data – terrorist data  more recent applications Introduction to Data Mining

The Data Explosion COMP207: Data Mining The rate of data creation is accelerating each year. In 2003, UC Berkeley estimated that the previous year generated 5 exabytes of data, of which 92% was stored on electronically accessible media.Mega < Giga < Tera < Peta < Exa ... All the data in all the books in the US Library of Congress is ~136 Terabytes. So 37,000 New Libraries of Congress in 2002. VLBI Telescopes produce 16 Gigabytes of data every second. Each engine of each plane of each company produces ~1 Gigabyte of data every trans-atlantic length journey. Google searches 18 billion+ accessible web pages. Introduction to Data Mining

Data Explosion Implications COMP207: Data Mining As the amount of data increases, the proportion of information decreases. As more and more data is generated automatically, we need to find automatic solutions to turn those stored raw results into information. Companies need to turn stored data into profit ... otherwise why are they storing it? Let's look at some real world examples. Introduction to Data Mining

Classification COMP207: Data Mining The data generated by airplane engines can be used to determine when it needs to be serviced. By discovering the patterns that are indicative of problems, companies can service working engines less often (increasing profit) and discover faults before they materialise (increasing safety). Loan companies can “give you results in minutes” by classifying you into a good credit risk or a bad risk, based on your personal information and a large supply of previous, similar customers. Cell phone companies can classify customers into those likely to leave, and hence need enticement, and those that are likely to stay regardless. Introduction to Data Mining

Clustering COMP207: Data Mining Discover previously unknown groups of customers/items. By finding clusters of customers, companies can then determine how best to handle that particular cluster. For example, this could be used for targeted advertising, special offers, transferring information gathered by association rule mining to other members of the cluster, and so forth. The concept of 'Similarity' is often used for determining other items that you might be interested in, eg 'More Like This' links. Introduction to Data Mining

Association Rule Mining COMP207: Data Mining By finding association rules from shopping baskets, supermarkets can use this information for many things, including: • Product placement in the store • What to put on sale • What to create as 'joint special offers' • What to offer the customer in terms of coupons • What to advertise together It shouldn't be surprising that your Tesco coupons are for things that you sometimes buy, rather than things you always or never buy. Wal-Mart in the US records every transaction at every store -- petabytes of information to sift through. (TeraData)‏ Introduction to Data Mining

Data/Information/Knowledge/Wisdom COMP207: Data Mining Note well that data mining applications have no wisdom. They cannot apply the knowledge that they discover appropriately. For example, a data mining application may tell you that there is a correlation between buying music magazines and beer, but it doesn't tell you how to use that knowledge. Should you put the two close together to reinforce the tendency, or should you put them far apart as people will buy them anyway and thus stay in the store longer? Data mining can help managers plan strategies for a company, it does not give them the strategies. Introduction to Data Mining

Summary COMP207: Data Mining • What is data mining? • KDD - knowledge discovery in databases: nontrivial extraction of implicit, previously unknown and potentially useful information • Why do we need data mining? • - Very large data - data explosion, • - Dimensionality of data • - Heterogeneity of data • - Technology rich • - Traditional techniques infeasible Introduction to Data Mining

Dr. M. Sulaiman Khan (mskhan@liv.ac.uk) ‏ Dept. of Computer Science University of Liverpool 2010