Data Mining


Lecture 6

- Case Study 1: Working with and exploring the properties of the Retail Banking Data Mart (Week 4 – Assignment 1)
- Data Analysis Techniques (Week 5)
- Statistical Background
- Trends/Outliers/Normalizations
- Principal Component Analysis
- Discretization Techniques

- Case Study 2: Working with and exploring the discretization infrastructure of the Retail Banking Data Mart (Week 5 – Assignment 2)
- Lecture Talk: Searching/Matching Engine

- Clustering Techniques (Week 6)
- K-Means Clustering
- Condorcet Clustering
- Other Clustering Techniques

- Case Study 3: Working with and exploring the clustering infrastructure for the Retail Banking Data Mart (Week 6 – Assignment 3)
- Lecture Talk: Different Perspectives on Searching/Matching

- Three types of attributes:
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Continuous: numeric values, e.g., integers or real numbers

- Discretization:
- Divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes.
- Reduce data size by discretization
- Prepare for further analysis

- Discretization
- Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Discretization can be performed recursively on an attribute
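
As a rough illustration of unsupervised discretization, the sketch below bins a small list of ages into equal-width and equal-depth (equal-frequency) intervals. The function names and the sample values are assumptions made for this example, and the equal-width version assumes the values are not all identical.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (labels 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_depth_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // len(values)
    return labels


ages = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
width_labels = equal_width_bins(ages, 3)
depth_labels = equal_depth_bins(ages, 3)
print(width_labels)
print(depth_labels)
```

The interval labels returned here play the role of the labels that replace the actual data values.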

- Concept hierarchy formation
- Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as young, middle-aged, or senior)
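
A minimal sketch of the age example: numeric ages are replaced by the higher-level concepts young, middle-aged, and senior. The cut-off ages (29 and 59) are hypothetical thresholds chosen only for illustration; the lecture does not specify them.

```python
def age_concept(age, young_max=29, middle_max=59):
    """Map a numeric age to a higher-level concept (thresholds are assumptions)."""
    if age <= young_max:
        return "young"
    if age <= middle_max:
        return "middle-aged"
    return "senior"


ages = [23, 35, 61, 47, 18]  # illustrative data
concepts = [age_concept(a) for a in ages]
print(concepts)
```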

- Typical methods: All the methods can be applied recursively
- Binning (covered above)
- Top-down split, unsupervised

- Histogram analysis (covered above)
- Top-down split, unsupervised

- Clustering analysis (covered above)
- Either top-down split or bottom-up merge, unsupervised

- Entropy-based discretization: supervised, top-down split
- Interval merging by χ² analysis: supervised, bottom-up merge
- Segmentation by natural partitioning: top-down split, unsupervised

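- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected entropy after partitioning is $I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$, and the information gain is $\mathrm{Entropy}(S) - I(S, T)$
- Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is $\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2 p_i$
- where $p_i$ is the probability of class i in S1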

- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
- The process is recursively applied to partitions obtained until some stopping criterion is met
- Such a boundary may reduce data size and improve classification accuracy
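
One binary split of entropy-based discretization might be sketched as follows. The candidate boundaries (midpoints between consecutive distinct values), the function names, and the sample data are assumptions for this example; the recursive application and the stopping criterion are left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_boundary(values, labels):
    """Return the boundary T that minimizes the weighted entropy of S1 and S2."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e


values = [1, 2, 3, 10, 11, 12]            # illustrative data
labels = ["a", "a", "a", "b", "b", "b"]
boundary, weighted_entropy = best_boundary(values, labels)
print(boundary, weighted_entropy)
```

Here the split between 3 and 10 separates the two classes perfectly, so the weighted entropy drops to zero.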

- Merging-based (bottom-up) vs. splitting-based methods
- Merge: Find the best neighboring intervals and merge them to form larger intervals recursively
- ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
- Initially, each distinct value of a numerical attr. A is considered to be one interval
- χ² tests are performed for every pair of adjacent intervals
- Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
- This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)
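
The merge loop described above can be sketched roughly as follows. This is a simplified illustration, not Kerber's exact procedure: it uses only a max-interval stopping criterion, merges one pair per pass, and skips cells whose expected count is zero. The function names and the data are assumptions for this example.

```python
from collections import Counter


def chi2_pair(a, b, classes):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2


def chimerge(values, labels, max_intervals):
    """Return the lower bounds of the intervals left after merging."""
    classes = sorted(set(labels))
    # Start with one interval per distinct value: (lower bound, class counts).
    counts = {}
    for v, lab in zip(values, labels):
        counts.setdefault(v, Counter())[lab] += 1
    intervals = [(v, counts[v]) for v in sorted(counts)]
    while len(intervals) > max_intervals:
        scores = [chi2_pair(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals[i:i + 2] = [merged]
    return [lower for lower, _ in intervals]


values = [1, 2, 3, 10, 11, 12]            # illustrative data
labels = ["a", "a", "a", "b", "b", "b"]
boundaries = chimerge(values, labels, max_intervals=2)
print(boundaries)
```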

- χ² (chi-square) test: $\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$
- The larger the χ² value, the more likely the variables are related
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
- Correlation does not imply causality

- χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)
- It shows that like_science_fiction and play_chess are correlated in the group
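
A small sketch of the χ² calculation for a contingency table. The observed counts below are illustrative assumptions (the slide's table did not survive extraction); expected counts are derived from the row and column totals.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows: like_science_fiction yes / no; columns: play_chess yes / no.
# These counts are assumed for illustration only.
observed = [[250, 200],
            [50, 1000]]
statistic = chi_square(observed)
print(round(statistic, 2))  # about 507.94
```

With these counts the expected value of the first cell is 90 against an observed 250, which is the kind of large gap that drives the statistic up.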

- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equi-width intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equi-width intervals

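- Example: applying the 3-4-5 rule to profit data
- Step 1: Min = -$351, Low (i.e., 5th percentile) = -$159, High (i.e., 95th percentile) = $1,838, Max = $4,700
- Step 2: msd = 1,000, Low = -$1,000, High = $2,000, giving the range (-$1,000 to $2,000)
- Step 3: Partition into 3 equi-width intervals: (-$1,000 to 0), (0 to $1,000), ($1,000 to $2,000)
- Step 4: Adjust the boundaries to cover Min and Max, giving the overall range (-$400 to $5,000) with intervals (-$400 to 0), (0 to $1,000), ($1,000 to $2,000), ($2,000 to $5,000); each interval is then partitioned further:
- (-$400 to 0): (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0)
- (0 to $1,000): (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800), ($800 to $1,000)
- ($1,000 to $2,000): ($1,000 to $1,200), ($1,200 to $1,400), ($1,400 to $1,600), ($1,600 to $1,800), ($1,800 to $2,000)
- ($2,000 to $5,000): ($2,000 to $3,000), ($3,000 to $4,000), ($4,000 to $5,000)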
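
The top-level split of the 3-4-5 rule might be sketched like this. It is a simplified illustration under stated assumptions: it takes the Low and High values as given, always uses equi-width intervals, and does not perform the Min/Max adjustment or the recursive sub-partitioning. The function name is hypothetical.

```python
import math


def three_four_five(low, high):
    """Top-level 3-4-5 split of the range [low, high] (assumes not both zero)."""
    # Unit of the most significant digit, e.g. 1000 for 1838.
    msd = 10 ** math.floor(math.log10(max(abs(low), abs(high))))
    lo = math.floor(low / msd) * msd   # round low down to the msd
    hi = math.ceil(high / msd) * msd   # round high up to the msd
    distinct = round((hi - lo) / msd)  # distinct values at the msd
    if distinct in (3, 6, 7, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    elif distinct in (1, 5, 10):
        k = 5
    else:
        raise ValueError("range not covered by the 3-4-5 rule")
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]


# Low and High taken from the profit example (5th and 95th percentiles).
intervals = three_four_five(-159, 1838)
print(intervals)
```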

- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
- street < city < state < country

- Specification of a hierarchy for a set of values by explicit data grouping
- {Urbana, Champaign, Chicago} < Illinois

- Specification of only a partial set of attributes
- E.g., only street < city, not others

- Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
- E.g., for a set of attributes: {street, city, state, country}

- country: 15 distinct values
- province_or_state: 365 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values

- Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
- The attribute with the most distinct values is placed at the lowest level of the hierarchy
- Exceptions, e.g., weekday, month, quarter, year
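
A minimal sketch of this heuristic: attributes are ordered by their number of distinct values, with the fewest at the top of the hierarchy. The function name is an assumption; the counts are the ones quoted for the {street, city, state, country} example.

```python
def order_by_distinct_values(distinct_counts):
    """Return attributes from the top level (fewest distinct values) down."""
    return sorted(distinct_counts, key=distinct_counts.get)


distinct_counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}
hierarchy = order_by_distinct_values(distinct_counts)
print(" > ".join(hierarchy))
```

The same function would order weekday (7 values) above month (12 values), which is why such cases are listed as exceptions and need manual correction.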

- Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters

- Cluster analysis
- Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

- Unsupervised learning: no predefined classes
- Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms

- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: Identification of areas of similar land use in an earth observation database
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
- City-planning: Identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

- A good clustering method will produce high quality clusters with
- high intra-class similarity
- low inter-class similarity

- The quality of a clustering result depends on both the similarity measure used by the method and its implementation
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

- Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically metric: d(i, j)
- There is a separate “quality” function that measures the “goodness” of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.

- Scalability
- Ability to deal with different types of attributes
- Ability to handle dynamic data
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Able to deal with noise and outliers
- Insensitive to order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

- Data matrix
- (two modes)

- Dissimilarity matrix
- (one mode)

- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

- Standardize data
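- Calculate the mean absolute deviation: $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$
- where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$
- Calculate the standardized measurement (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$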

- Using the mean absolute deviation is more robust to outliers than using the standard deviation
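
The standardization steps can be sketched for a single variable as follows. The function name and the sample values are assumptions for this example, and the sketch assumes the values are not all equal (otherwise the mean absolute deviation is zero).

```python
def standardize(column):
    """Standardize one variable using its mean absolute deviation."""
    n = len(column)
    mean = sum(column) / n
    mad = sum(abs(x - mean) for x in column) / n  # mean absolute deviation
    return [(x - mean) / mad for x in column]


incomes = [20, 30, 40, 50, 60]  # illustrative data
z_scores = standardize(incomes)
print(z_scores)
```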

- Distances are normally used to measure the similarity or dissimilarity between two data objects
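- Some popular ones include the Minkowski distance: $d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$
- where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer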

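- If q = 1, d is the Manhattan distance: $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$

- If q = 2, d is the Euclidean distance: $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$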
- Properties
- d(i,j) ≥ 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j)

- Also, one can use weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures
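
A short sketch of the Minkowski distance, with the Manhattan (q = 1) and Euclidean (q = 2) cases computed on two assumed points. The function name and the data are illustrative only.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


i = (1.0, 2.0)  # illustrative points
j = (4.0, 6.0)
k = (0.0, 0.0)
manhattan = minkowski(i, j, 1)
euclidean = minkowski(i, j, 2)
print(manhattan, euclidean)
```

The coordinate differences are 3 and 4, so the two distances come out as 7 and 5.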

- Partitioning approach:
- Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
- Typical methods: k-means, k-medoids, CLARANS

- Hierarchical approach:
- Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

- Density-based approach:
- Based on connectivity and density functions
- Typical methods: DBSCAN, OPTICS, DenClue

- Grid-based approach:
- based on a multiple-level granularity structure
- Typical methods: STING, WaveCluster, CLIQUE

- Model-based:
- A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
- Typical methods: EM, SOM, COBWEB

- Frequent pattern-based:
- Based on the analysis of frequent patterns
- Typical methods: pCluster

- User-guided or constraint-based:
- Clustering by considering user-specified or application-specific constraints
- Typical methods: COD (obstacles), constrained clustering
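
As an illustration of the partitioning approach, here is a minimal k-means sketch that alternates assignment and centroid-update steps, which reduces the sum of squared errors. Taking the first k points as initial centroids is a simplifying assumption made so the example is deterministic; practical implementations usually pick the initial centroids differently. The names and data are illustrative.

```python
def kmeans(points, k, iterations=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    centroids = [list(p) for p in points[:k]]  # naive initialization (assumption)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else centroids[idx]
            for idx, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
          (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]  # illustrative data
centroids, clusters = kmeans(points, 2)
print(centroids)
```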

- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq) over all tip in Ki and tjq in Kj
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq) over all tip in Ki and tjq in Kj
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq) over all tip in Ki and tjq in Kj
- Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
- Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
- Medoid: one chosen, centrally located object in the cluster
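
The inter-cluster distances above might be computed as in the sketch below, using Euclidean distance between elements. The function names and the two small clusters are assumptions for this example; `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import product


def single_link(ki, kj):
    """Smallest distance between an element of ki and an element of kj."""
    return min(math.dist(p, q) for p, q in product(ki, kj))


def complete_link(ki, kj):
    """Largest distance between an element of ki and an element of kj."""
    return max(math.dist(p, q) for p, q in product(ki, kj))


def average_link(ki, kj):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(ki, kj)) / (len(ki) * len(kj))


def centroid_link(ki, kj):
    """Distance between the centroids of the two clusters."""
    ci = [sum(dim) / len(ki) for dim in zip(*ki)]
    cj = [sum(dim) / len(kj) for dim in zip(*kj)]
    return math.dist(ci, cj)


ki = [(0.0, 0.0), (0.0, 2.0)]  # illustrative clusters
kj = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(ki, kj), complete_link(ki, kj),
      average_link(ki, kj), centroid_link(ki, kj))
```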

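Centroid: the “middle” of a cluster of N points $t_1, \dots, t_N$, $C_m = \frac{1}{N}\sum_{i=1}^{N} t_i$

Radius: square root of the average squared distance from any point of the cluster to its centroid, $R_m = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (t_i - C_m)^2}$

Diameter: square root of the average squared distance between all pairs of points in the cluster, $D_m = \sqrt{\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N} (t_i - t_j)^2}$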
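
A small sketch of the three cluster summaries, assuming Euclidean distance and the squared-distance forms of radius and diameter. The function names and the four-point cluster are illustrative; `math.dist` requires Python 3.8 or later.

```python
import math


def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))


def radius(points):
    """Square root of the average squared distance to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))


def diameter(points):
    """Square root of the average squared distance over all pairs of points."""
    n = len(points)
    total = sum(math.dist(p, q) ** 2 for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))


cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]  # illustrative data
print(centroid(cluster), radius(cluster), diameter(cluster))
```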

- Assignment 2 (please share your ideas with your group)
- choose a dataset freely; my suggestion: http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
- use Weka: http://www.cs.waikato.ac.nz/ml/weka/

- apply the different discretization strategies that you have learned in class (equi-width, equi-depth, entropy-based, merging, splitting, ...)

- read
- Course Text Book Chapter 7