Introduction to Data Mining: Overview & Methods

Introduction: Overview of the Class • Includes Review of Syllabus • Last updated 1/14/19

What is this Class About? • This class introduces data mining • The types of problems that can be addressed and the methods that can be used • Focus will be on basic concepts, understanding how the methods work, and when they are appropriate • Some hands-on experience will be provided • A significant project is required • At the graduate level, this class is a key core class in the MS in Data Analytics (MSDA) • More mathematical aspects of data mining are covered in the Machine Learning class

Textbook • We will use “Intro to Data Mining” by Tan, Steinbach, Karpatne, and Kumar (2nd edition) • One of the commonly used DM textbooks for CS • Not always perfectly clear, but other books are not really any better • Class slides include material from other sources, integrated into the course over several years

The Class Website • The class website is: • http://storm.cis.fordham.edu/~gweiss/classes/cisc4631/ • Includes the syllabus and is linked to the class schedule • Class schedule is subject to change • Blackboard will only be used for grading and attendance • Now let's visit the class website …

Data Mining: Introduction to Data Mining • Much of the material in this presentation is not from the textbook

Let’s Start By Seeing What You Know • Quick Quiz • Do you know what Data Mining is? • Do you know of any examples of Data Mining?

What is Data Mining? • Data Mining has many definitions • Non-trivial extraction of implicit, previously unknown and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Alternative Names • What are some alternative names for Data Mining? • Recent new names (maybe with different emphases): • Data Science • Big Data • Data Mining was/is known by these other names (although many of these have lost favor over time): • Knowledge discovery in databases (KDD) • Knowledge extraction • Data/pattern analysis • Data archeology, data dredging, information harvesting, business intelligence, etc.

What is Big Data? • So what is meant by “Big Data”? • Technically, big data means the data is so large that conventional data mining methods cannot be applied in the normal manner • Using this definition, in most cases a data set with 10,000,000 cases is not big data • The term is overused and not always used this way • Much of Data Mining is not Big Data • Big data may require newer technologies • For distributing work on large datasets across machines

Some Examples • Netflix and Amazon use data mining to recommend products (recommender systems) • Companies use data mining for marketing • Who should be mailed a catalog • Who should see what online ads (Google Adwords) • Online advertising: large impact • Financial companies use credit scoring; fraud detection • Customer Churn: who will leave • Fordham’s WISDM project uses smartphone/watch accelerometer data to classify user activities and perform biometric identification • Some search engines cluster retrieved documents into meaningful groups • Group pages about Jaguar into “car” pages and “cat” pages

Interesting specific example • Wal-Mart used data mining to find out what is needed when a hurricane is coming • Strawberry PopTarts increase in sales 7X ahead of a hurricane and the pre-hurricane top selling item is beer. (Example from “Data Science for Business” page 3)

A Significant Example • Signet Bank was convinced that modelling profitability, not just default probability, was the way to go • But they did not have the proper data • Constrained by having data only for strategies they already used • Decided to purposefully offer loans in new cases (explore new strategies) • Initially poor results, but they eventually learned from the data and got it right • Became one of the most successful credit card issuers: Capital One

Characteristics of Big Data • 3 V’s • Volume (Scale) • Variety (Complexity) • Velocity (Speed) • Some add “Veracity” to make 4 V’s

Volume • Data Volume • 44x increase from 2009 to 2020 • From 0.8 zettabytes (ZB) to 35 ZB • The volume of collected/generated data is increasing exponentially
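As a rough check on the figures above, the sketch below computes the overall multiplier and the yearly growth factor it would imply. It assumes smooth compound growth between 2009 and 2020, which the slide does not claim; the variable names are illustrative only.

```python
# Figures taken from the slide: 0.8 ZB in 2009, 35 ZB forecast for 2020.
start_zb = 0.8
end_zb = 35.0
years = 2020 - 2009

# Overall multiplier (the slide rounds this to 44x).
multiplier = end_zb / start_zb

# Assumption: constant compound growth, so multiplier = annual_rate ** years.
annual_rate = multiplier ** (1 / years)

print(f"Overall increase: {multiplier:.2f}x")
print(f"Implied yearly growth factor: {annual_rate:.2f}")
```

Under that assumption the volume grows by roughly 41% per year, which corresponds to doubling about every two years.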

Variety • Various formats, types, and structures. Can you name a few? • Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc… • Static data vs. streaming data • A single application can be generating/collecting many types of data

Variety (cont.) • Types of Data • Relational Data (Tables/Transaction/Legacy Data) • Text Data (Web) • Semi-structured Data (XML) • Graph Data • Social Network, Semantic Web (RDF), … • Streaming Data • You can only scan the data once • A single application can be generating/collecting many types of data • Big Public Data (online, weather, finance, etc.)
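The single-scan constraint on streaming data can be illustrated with a small sketch. It keeps a running count, mean, and maximum while reading each value exactly once and never storing the stream. The class name `RunningStats`, the generator `sensor_stream`, and the sample readings are hypothetical, chosen only for illustration.

```python
from typing import Iterator


class RunningStats:
    """Summary statistics updated one value at a time (single pass)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.maximum = float("-inf")

    def update(self, value: float) -> None:
        self.count += 1
        # Incremental mean: shift the old mean toward the new value.
        self.mean += (value - self.mean) / self.count
        self.maximum = max(self.maximum, value)


def sensor_stream() -> Iterator[float]:
    """Hypothetical stream; a generator can be consumed only once."""
    yield from [2.0, 4.0, 6.0, 8.0]


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)

print(stats.count, stats.mean, stats.maximum)
```

Memory use stays constant no matter how long the stream runs, which is the property that matters when the data cannot be revisited.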

Velocity (Speed) • Data generated fast and must be processed fast • Online Data Analytics • Late decisions → missing opportunities • Examples • E-Promotions: Based on your current location, your purchase history, what you like → send promotions right now for the store next to you • Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction

Summary of 3V’s • Big Data is more than just volume

Why Data Mining and Why Now? • Data Mining did not arise as a field until the 1990s (prior to that, the closest thing was machine learning) • Quick Quiz: What do you think changed?

Large Amounts of Available Data • Tremendous amounts of data automatically collected and warehoused • Web data, e-commerce • Store purchases • Bank/Credit Card transactions • Smartphone GPS information • Smartphone and Smartwatch Sensor Data • Data from Home Assistants (e.g., Amazon Echo)

Enabling Technologies • What technological changes have helped make data mining so prevalent now? • Computers: cheaper and more powerful • Smaller mobile devices are exploding in popularity • Disk and other storage: greater capacity and cheaper • Increased use of on-line resources and Internet • Advances in algorithms (but most data mining algorithms are relatively mature)

Useful Knowledge • Often info “hidden” in data is not evident • Analysts may take weeks to discover useful information • Much of the data is never analyzed at all • There is just too much data to analyze without “assistance”

Scientific Need • Focus is often on business and consumer applications, but science needs data mining • Many scientific sensors collect huge amounts of data • remote sensors on satellites • telescopes scanning the skies • microarrays generating gene expression data • scientific simulations • CERN’s Large Hadron Collider generates 15 PB per year • Traditional techniques are infeasible

How Big is the Data? • Select examples of large data sets • AT&T’s 26 TB call detail database (2003) • eBay 6 PB, IRS 150 TB data warehouse • Yahoo has a 2 PB DB to analyze behavior of ½ billion web visitors/month (24 billion events/day) • Wal-Mart has a 583 TB database (2006) • Google knows about over 100 trillion web pages (2016) • Sites like Facebook, Flickr & Twitter contain lots of data

How Much Data is Being Created? • 5 exabytes of new data created (2002, UC Berkeley) • Humans created/copied 161/281 exabytes in 2006/2007 (IDC) • 1 exabyte = 10^18 bytes • 12 stacks of books stretching from Earth to Sun • 3 million times the books ever written • Not all data stored at once (includes temporary data) • In 2012, 2.8 ZB (2,800 EB) of data will be created/copied • Forecast for 2020: 40 ZB (57X the number of grains of sand on Earth) • OK, we get the point already! Head hurts.
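The unit relationships used above can be checked with a short sketch. It assumes decimal (SI) units, where one exabyte is 10^18 bytes and one zettabyte is 10^21 bytes; the helper name `zb_to_eb` is illustrative.

```python
# Decimal (SI) storage units expressed in bytes.
EXABYTE = 10**18
ZETTABYTE = 10**21


def zb_to_eb(zettabytes: float) -> float:
    """Convert zettabytes to exabytes (1 ZB = 1000 EB)."""
    return zettabytes * ZETTABYTE / EXABYTE


# Figures from the slide.
created_2012_eb = zb_to_eb(2.8)   # 2012 estimate, in exabytes
forecast_2020_eb = zb_to_eb(40)   # 2020 forecast, in exabytes

print(f"2012: {created_2012_eb:.0f} EB")
print(f"2020 forecast is {forecast_2020_eb / created_2012_eb:.1f}x the 2012 figure")
```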

Origins of Data Mining • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems* • Traditional techniques may be unsuitable due to • Enormity of data • High dimensionality • Heterogeneous & distributed data • [Figure: Data Mining shown at the intersection of Artificial Intelligence / Machine Learning / Pattern Recognition, Statistics, and Database systems] • *Note: databases have limited impact; data mining is rarely done in a database but rather on “flat files”. In fact, the 2nd edition of the textbook does not include databases in the “origin chart”

Statistics vs. Data Mining • Students familiar with statistics are often confused if the differences aren’t highlighted • When compared to Data Mining: • Statistics is more theory-based • Data mining methods are based on heuristic algorithms • Statistics is based firmly on mathematics (e.g., probability) • Statistics is more focused on testing hypotheses vs. finding interesting relationships • Statistics makes more assumptions about the data

The Process of Data Mining Data Mining is a process, formerly referred to as a knowledge discovery process. In this process there is a data mining step that applies data mining algorithms to extract knowledge. About 80% of our class will focus on the data mining step but in the real world 80% of the time is spent on the other steps. The process below was articulated by Fayyad in a seminal paper on Data Mining and KDD. There should be a loop since the process is iterative. The diagram in our textbook includes only three main steps since it combines all steps before data mining into “Data Preprocessing” and all after into “Postprocessing”

CRISP Data Mining Process CRISP = Cross Industry Standard Process for Data Mining

Second Part of Introduction: Data Mining Tasks

Top-Level Data Mining Tasks • At highest level, data mining tasks can be divided into: • Prediction Tasks (supervised learning) • Use some variables to predict unknown or future values of other variables • Description Tasks (unsupervised learning) • Find human-interpretable patterns that describe the data

Key Data Mining Tasks • Overview of the major data mining tasks studied in this course: • Prediction Tasks • Classification (and class probability estimation) • Regression • Description Tasks • Clustering • Association Analysis • Anomaly Detection (I do not view this as basic as the other two and we may or may not cover this)

Classification: Definition • Given a collection of records (training set) • Each record contains a set of attributes; one of the attributes is the class, which is to be predicted • Find a model for the class attribute as a function of the values of the other attributes • Model maps a record to a class value • Goal: previously unseen records should be assigned a class as accurately as possible • A test set is used to determine the accuracy of the model • Class Probability Estimation: estimate the probability that an object belongs to a class • Can you think of classification tasks?
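The definition above can be sketched in a few lines. The example below learns from a training set and measures accuracy on a separate test set. It uses a 1-nearest-neighbor classifier purely as an illustration; the slide does not name a particular method, and the records, attribute meanings, and function names are all hypothetical.

```python
import math

# Hypothetical records: (attributes, class). The two attributes stand for
# (hours of activity per week, purchases per month).
training_set = [
    ((1.0, 1.0), "no"),
    ((1.5, 2.0), "no"),
    ((2.0, 1.0), "no"),
    ((6.0, 5.0), "yes"),
    ((7.0, 6.5), "yes"),
    ((8.0, 6.0), "yes"),
]
test_set = [
    ((1.2, 1.5), "no"),
    ((7.5, 6.0), "yes"),
    ((2.5, 2.0), "no"),
    ((6.5, 5.5), "yes"),
]


def predict(attributes, training):
    """Return the class of the closest training record (1-nearest neighbor)."""
    nearest = min(training, key=lambda rec: math.dist(attributes, rec[0]))
    return nearest[1]


def accuracy(test, training):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict(attrs, training) == label for attrs, label in test)
    return correct / len(test)


print(f"Test-set accuracy: {accuracy(test_set, training_set):.2f}")
```

The test records are never used to build the model, so the accuracy estimates how well the classifier handles previously unseen records.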

Classification Paradigm • [Figure: a training set with categorical and continuous attributes and a class column is used to learn a classifier; the resulting model is applied to a test set]

Direct Marketing Application • Direct Marketing • Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a product (or in this case donate to a charity) • Approach: • Use historical data about past solicitations and outcomes • We know which customers decided to donate • We also have some demographic information • Use this info as input attributes to learn a classifier model • Specific Example • KDD Cup is a competition associated with a top DM conference • The KDD Cup 1998 competition was about direct marketing for a charity. Lots of information is provided • http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html

Classifying Galaxies • Class: • Stages of formation (early, intermediate, late) • Attributes: • Image features • Characteristics of light waves received, etc. • Data Size: • 72 million stars, 20 million galaxies • Object Catalog: 9 GB • Image Database: 150 GB • Courtesy: http://aps.umn.edu

Regression • Predict a value of a given continuous (numerical) variable based on the values of other variables • Greatly studied in statistics • Examples: • Predicting sales amounts of new product based on advertising expenditure. • Predicting wind velocities as a function of temperature, humidity, air pressure, etc. • Time series prediction of stock market indices
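To make the sales-versus-advertising example concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares. The slide does not specify a regression method, so this choice is an assumption, and the advertising and sales figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales (same units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6.0 + intercept  # prediction for an unseen spend of 6.0

print(f"sales = {slope:.2f} * advertising + {intercept:.2f}")
print(f"Predicted sales at 6.0: {predicted:.2f}")
```

Unlike classification, the predicted value is a number on a continuous scale rather than one of a fixed set of classes.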

Clustering • Given a set of data points find clusters so that • Data points in same cluster are similar • Data points in different clusters are dissimilar You try it on the Simpsons. How can we cluster these 5 “data points”?

What is a natural grouping among these objects?

What is a natural grouping among these objects? • Clustering is subjective • Possible groupings: Simpson's Family, School Employees, Females, Males

Clustering Application • Can you name any clustering applications? • Market Segmentation: • Goal: subdivide a market into distinct subsets of similar customers • Approach: • Collect different attributes of customers based on their geographical and lifestyle related information. • Find clusters of similar customers. • Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
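The "find clusters of similar customers" step can be sketched with k-means, one common clustering algorithm. The slide does not say which algorithm is used, so k-means is an assumption here, as are the customer attributes, the data values, and the choice of starting centroids.

```python
import math


def k_means(points, centroids, iterations=10):
    """Basic k-means: alternate assignment and centroid update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid  # keep the centroid of an empty cluster
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical customers: (age, yearly spend in thousands of dollars).
customers = [(22, 1.0), (25, 1.5), (24, 1.2), (51, 8.0), (55, 9.5), (53, 9.0)]

clusters, centroids = k_means(customers, centroids=[customers[0], customers[3]])
for cluster, centroid in zip(clusters, centroids):
    print(centroid, cluster)
```

Points in the same cluster end up close together and far from the other cluster, matching the definition of clustering given earlier. In practice attributes on very different scales, such as age and spend here, are usually normalized first so that no single attribute dominates the distance.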

Association Rule Discovery • Given a set of records, each of which contains some number of items from a given collection • Produce dependency rules that predict the occurrence of an item based on occurrences of other items • Rules discovered: • {Milk} --> {Coke} • {Diaper, Milk} --> {Beer}
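One way to judge rules like the two listed above is with support and confidence, two standard measures that this slide does not name. The sketch below computes them over a small set of market-basket transactions. The transactions are hypothetical, since the slide's own transaction table did not survive in this transcript.

```python
# Hypothetical market-basket transactions (not taken from the slide).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = support(antecedent | consequent, baskets)
    return both / support(antecedent, baskets)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs, transactions):.2f}",
          f"confidence={confidence(lhs, rhs, transactions):.2f}")
```

A rule discovery algorithm would search for all rules whose support and confidence exceed chosen thresholds; this sketch only scores two given rules.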

Association Rule Discovery Application • Can you think of any applications? • Marketing and Sales Promotion Applications • When items are purchased together, one can be used to drive sales of the other • Can help determine where to position store items • Supermarket shelf management • Some stores place bananas in the cereal aisle

Challenges of Data Mining • Scalability • High Dimensionality • Heterogeneous and Complex Data • Streaming Data • Data Quality • Data Ownership and Distribution • Includes data security and privacy

What is (and is not) Data Mining? • Based on the definitions of data mining, are these DM or not? • Finding a phone number in a directory • Not data mining (trivial?, DB query) • Grouping related documents returned by search engine • Is data mining (not trivial, clustering) • Identifying who has a disease based on symptoms • Is data mining (not trivial, classification) • Web search on keyword using search engine • May be data mining** ** More of an information retrieval task than data mining task. However, since Google does much more than keyword matching, there will be a data mining component. For example, Google mines the link structure of the Web to decide which pages are important (link mining is a type of data mining).

If you are Interested in Data Mining • Data sets • NYC open data (https://nycopendata.socrata.com/) • UCI Data Repository (http://archive.ics.uci.edu/ml/) • Short list near top of our class schedule web page • Visit kdnuggets, an online newsletter and more • http://www.kdnuggets.com • You can arrange to have newsletter emailed to you • Also includes job openings • Also lists many dataset repositories • ACM SIGKDD professional organization associated with data mining • ACM Special Interest Group (SIG) on data mining • Can join SIGKDD for $22 or for $54 can also join ACM as student member
