- 76 Views
- Uploaded on
- Presentation posted in: General

Introduction to Data Mining

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Donghui Zhang

CCIS, Northeastern University

Data Mining: Concepts and Techniques

The current talk slide was extracted and modified from Dr. Han’s lecture slides.

Data Mining: Concepts and Techniques

- Data explosion problem
- Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories

- We are drowning in data, but starving for knowledge!
- Solution: Data warehousing and data mining
- Data warehousing and on-line analytical processing
- Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Data Mining: Concepts and Techniques

- 1960s:
- Data collection, database creation, IMS and network DBMS

- 1970s:
- Relational data model, relational DBMS implementation

- 1980s:
- RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
- Application-oriented DBMS (spatial, scientific, engineering, etc.)

- 1990s:
- Data mining, data warehousing, multimedia databases, and Web databases

- 2000s
- Stream data management and mining
- Data mining with a variety of applications
- Web technology and global information systems

Data Mining: Concepts and Techniques

Database

Systems

Statistics

Data Mining

Machine

Learning

Visualization

Algorithm

Other

Disciplines

Data Mining: Concepts and Techniques

- Data mining (knowledge discovery from data)
- Extraction of interesting (non-trivial,implicit, previously unknown and potentially useful)patterns or knowledge from huge amount of data
- Data mining: a misnomer?

- Alternative names
- Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

- Watch out: Is everything “data mining”?
- (Deductive) query processing.
- Expert systems or small ML/statistical programs

Data Mining: Concepts and Techniques

- Data analysis and decision support
- Market analysis and management
- Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation

- Risk analysis and management
- Forecasting, customer retention, improved underwriting, quality control, competitive analysis

- Fraud detection and detection of unusual patterns (outliers)

- Market analysis and management
- Other Applications
- Text mining (news group, email, documents) and Web mining
- Stream data mining
- DNA and bio-data analysis

Data Mining: Concepts and Techniques

Knowledge

- Data mining—core of knowledge discovery process

Pattern Evaluation

Data Mining

Task-relevant Data

Selection

Data Warehouse

Data Cleaning

Data Integration

Databases

Data Mining: Concepts and Techniques

- Learning the application domain
- relevant prior knowledge and goals of application

- Creating a target data set: data selection
- Data cleaning and preprocessing: (may take 60% of effort!)
- Data reduction and transformation
- Find useful features, dimensionality/variable reduction, invariant representation.

- Choosing functions of data mining
- summarization, classification, regression, association, clustering.

- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
- visualization, transformation, removing redundant patterns, etc.

- Use of discovered knowledge

Data Mining: Concepts and Techniques

Graphical user interface

Pattern evaluation

Data mining engine

Knowledge-base

Database or data warehouse server

Filtering

Data cleaning & data integration

Data

Warehouse

Databases

Data Mining: Concepts and Techniques

- Relational database
- Data warehouse
- Transactional database
- Advanced database and information repository
- Object-relational database
- Spatial and temporal data
- Time-series data
- Stream data
- Multimedia database
- Heterogeneous and legacy database
- Text databases & WWW

Data Mining: Concepts and Techniques

- Concept description: Characterization and discrimination
- Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

- Association (correlation and causality)
- Diaper à Beer [0.5%, 75%]

- Classification and Prediction
- Construct models (functions) that describe and distinguish classes or concepts for future prediction
- E.g., classify countries based on climate, or classify cars based on gas mileage

- Presentation: decision-tree, classification rule, neural network
- Predict some unknown or missing numerical values

- Construct models (functions) that describe and distinguish classes or concepts for future prediction

Data Mining: Concepts and Techniques

- Cluster analysis
- Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
- Maximizing intra-class similarity & minimizing interclass similarity

- Mining complex types of data

Data Mining: Concepts and Techniques

- Descriptive vs. predictive data mining
- Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms
- Predictive mining: Based on data and analysis, constructs models for the database, and predicts the trend and properties of unknown data

- Concept description:
- Characterization: provides a concise and succinct summarization of the given collection of data
- Comparison: provides descriptions comparing two or more collections of data

Initial Relation

Prime Generalized Relation

Customer

buys both

Customer

buys diaper

Customer

buys beer

- Itemset X={x1, …, xk}
- Find all the rules XYwith min confidence and support
- support, s, probability that a transaction contains XY
- confidence, c,conditional probability that a transaction having X also contains Y.

- Let min_support = 50%, min_conf = 50%:
- A C (50%, 66.7%)
- C A (50%, 100%)

Data Mining: Concepts and Techniques

- Any subset of a frequent itemset must be frequent
- if {beer, diaper, nuts} is frequent, so is {beer, diaper}
- Every transaction having {beer, diaper, nuts} also contains {beer, diaper}

- Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
- Method:
- generate length (k+1) candidate itemsets from length k frequent itemsets, and
- test the candidates against DB

Data Mining: Concepts and Techniques

Database TDB

L1

C1

1st scan

C2

C2

L2

2nd scan

L3

C3

3rd scan

Data Mining: Concepts and Techniques

- Given a set of sequences, find the complete set of frequent subsequences

A sequence : < (ef) (ab) (df) c b >

A sequence database

An element may contain a set of items.

Items within an element are unordered

and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support thresholdmin_sup =2, <(ab)c> is a sequential pattern

Data Mining: Concepts and Techniques

- Classification:
- predicts categorical class labels (discrete or nominal)
- classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

- Prediction:
- models continuous-valued functions, i.e., predicts unknown or missing values

- Typical Applications
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis

Data Mining: Concepts and Techniques

This follows an example from Quinlan’s ID3

Data Mining: Concepts and Techniques

age?

<=30

overcast

>40

30..40

student?

credit rating?

yes

no

yes

fair

excellent

no

yes

no

yes

Data Mining: Concepts and Techniques

- Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

- Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
- There are no samples left

Data Mining: Concepts and Techniques

- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association rule mining

Data Mining: Concepts and Techniques

- Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters

- Cluster analysis
- Grouping a set of data objects into clusters

- Clustering is unsupervised classification: no predefined classes
- Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms

- A good clustering method will produce high quality clusters with
- high intra-class similarity
- low inter-class similarity

- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Data Mining: Concepts and Techniques

- Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
- Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to each other

Data Mining: Concepts and Techniques

- Given k, the k-means algorithm is implemented in four steps:
- Partition objects into k nonempty subsets
- Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
- Assign each object to the cluster with the nearest seed point
- Go back to Step 2, stop when no more new assignment

Data Mining: Concepts and Techniques

- Mining spatial databases
- Mining multimedia databases
- Mining time-series and sequence data
- Mining stream data
- Mining text databases
- Mining the World-Wide Web

Data Mining: Concepts and Techniques

Time-series plot

Data Mining: Concepts and Techniques

- Predict whether increase or decrease
- Long-term or trend movements (trend curve)
- Cyclic movements or cycle variations, e.g., business cycles
- Seasonal movements or seasonal variations
- i.e, almost identical patterns that a time series appears to follow during corresponding months of successive years.

- Irregular or random movements

Data Mining: Concepts and Techniques

- Normal database query finds exact match
- Similarity search finds data sequences that differ only slightly from the given query sequence
- Two categories of similarity queries
- find a sequence that is similar to the query sequence
- find all pairs of similar sequences

Data Mining: Concepts and Techniques

Data Warehouse

Data Mining: Concepts and Techniques

- Defined in many different ways, but not rigorously.
- A decision support database that is maintained separately from the organization’s operational database
- Support information processing by providing a solid platform of consolidated, historical data for analysis.

- “A data warehouse is asubject-oriented, integrated, time-variant, and nonvolatilecollection of data in support of management’s decision-making process.”—W. H. Inmon
- Data warehousing:
- The process of constructing and using data warehouses

Data Mining: Concepts and Techniques

- Modeling data warehouses: dimensions & measures
- Star schema: A fact table in the middle connected to a set of dimension tables
- Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
- Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Data Mining: Concepts and Techniques

item

time

item_key

item_name

brand

type

supplier_type

time_key

day

day_of_the_week

month

quarter

year

location

branch

location_key

street

city

state_or_province

country

branch_key

branch_name

branch_type

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

Data Mining: Concepts and Techniques

supplier

item

time

item_key

item_name

brand

type

supplier_key

supplier_key

supplier_type

time_key

day

day_of_the_week

month

quarter

year

city

location

branch

location_key

street

city_key

city_key

city

state_or_province

country

branch_key

branch_name

branch_type

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

Data Mining: Concepts and Techniques

item

time

item_key

item_name

brand

type

supplier_type

time_key

day

day_of_the_week

month

quarter

year

location

location_key

street

city

province_or_state

country

shipper

branch

shipper_key

shipper_name

location_key

shipper_type

branch_key

branch_name

branch_type

Shipping Fact Table

time_key

Sales Fact Table

item_key

time_key

shipper_key

item_key

from_location

branch_key

to_location

location_key

dollars_cost

units_sold

units_shipped

dollars_sold

avg_sales

Measures

Data Mining: Concepts and Techniques

- Sales volume as a function of product, month, and region

Dimensions: Product, Location, Time

Hierarchical summarization paths

Region

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

Product

Month

Data Mining: Concepts and Techniques

all

0-D(apex) cuboid

region

product

month

1-D cuboids

product, month

product, region

month, region

2-D cuboids

3-D(base) cuboid

product, month, region

Data Mining: Concepts and Techniques

- Relational OLAP (ROLAP)
- Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces
- Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
- greater scalability

- Multidimensional OLAP (MOLAP)
- Array-based multidimensional storage engine (sparse matrix techniques)
- fast indexing to pre-computed summarized data

- Hybrid OLAP (HOLAP)
- User flexibility, e.g., low level: relational, high-level: array

- Specialized SQL servers
- specialized support for SQL queries over star/snowflake schemas

Data Mining: Concepts and Techniques

- Data extraction:
- get data from multiple, heterogeneous, and external sources

- Data cleaning:
- detect errors in the data and rectify them when possible

- Data transformation:
- convert data from legacy or host format to warehouse format

- Load:
- sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions

- Refresh
- propagate the updates from the data sources to the warehouse

Data Mining: Concepts and Techniques

- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- Data mining functionalities: characterization, association, classification, clustering, mining complex data, etc.
- Data warehousing

Data Mining: Concepts and Techniques

- Data mining and KDD (SIGKDD: CDROM)
- Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
- Journal: Data Mining and Knowledge Discovery, KDD Explorations

- Database systems (SIGMOD: CD ROM)
- Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
- Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.

- AI & Machine Learning
- Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
- Journals: Machine Learning, Artificial Intelligence, etc.

- Statistics
- Conferences: Joint Stat. Meeting, etc.
- Journals: Annals of statistics, etc.

- Visualization
- Conference proceedings: CHI, ACM-SIGGraph, etc.
- Journals: IEEE Trans. visualization and computer graphics, etc.

Data Mining: Concepts and Techniques