230 likes | 507 Views
FINAL LECTURE. Prof. Navneet Goyal CSIS Department, BITS- Pilani. Topics. Grid-based Clustering Anomaly/Outlier Analysis Research Trends in Data Mining Stream Data Mining Unstructured Data Mining Multi-relational Data Mining Applications Sciences Social WWW. FINAL LECTURE.
E N D
FINAL LECTURE Prof. NavneetGoyal CSIS Department, BITS-Pilani
Topics • Grid-based Clustering • Anomaly/Outlier Analysis • Research Trends in Data Mining • Stream Data Mining • Unstructured Data Mining • Multi-relational Data Mining • Applications • Sciences • Social • WWW
FINAL LECTURE This is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning! - Winston Churchill
Grid-based Clustering • DBSCAN: simple but effective algorithm for finding density-based clusters, i.e., dense regions of objects that are surrounded by low-density regions • We now look at additional density-based clustering techniques that address issues of: • Efficiency (GRID METHODS – STING) • Finding clusters in subspaces (CLIQUE) • More accurately modeling density (DENCLUE)
Grid-based Clustering • Using multi-resolution grid data structure • Several interesting methods • STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997) • WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98) • A multi-resolution clustering approach using wavelet method • CLIQUE: Agrawal, et al. (SIGMOD’98)
STING • STatistical INformation Grid approach) by Wang, Yang and Muntz (VLDB’97) • The spatial area area is divided into rectangular cells • There are several levels of cells corresponding to different levels of resolution
STING • Each cell at a high level is partitioned into a number of smaller cells in the next lower level • Statistical info of each cell is calculated and stored beforehand and is used to answer queries • Parameters of higher level cells can be easily calculated from parameters of lower level cell • count, mean, s, min, max • type of distribution—normal, uniform, etc. • Use a top-down approach to answer spatial data queries • Start from a pre-selected layer—typically with a small number of cells • For each cell in the current level compute the confidence interval
STING • Remove the irrelevant cells from further consideration • When finish examining the current layer, proceed to the next lower level • Repeat this process until the bottom layer is reached • Advantages: • Query-independent, easy to parallelize, incremental update • O(K), where K is the number of grid cells at the lowest level • Disadvantages: • All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
CLIQUE (Clustering In QUEst) • Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98). • Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space • CLIQUE can be considered as both density-based and grid-based • It partitions each dimension into the same number of equal length interval • It partitions an m-dimensional data space into non-overlapping rectangular units • A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter • A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps • Partition the data space and find the number of points that lie inside each cell of the partition. • Identify the subspaces that contain clusters using the Apriori principle • Identify clusters: • Determine dense units in all subspaces of interests • Determine connected dense units in all subspaces of interests. • Generate minimal description for the clusters • Determine maximal regions that cover a cluster of connected dense units for each cluster • Determination of minimal cover for each cluster
Vacation(week) 7 6 5 4 3 2 1 age 0 20 30 40 50 60 Vacation 30 50 Salary age Salary (10,000) 7 6 5 4 3 2 1 age 0 20 30 40 50 60 = 3
Anomaly/Outlier Analysis What are outliers? • Outlier (noun) : something that is situated away from or classed differently from a main or related body • A statistical observation that is markedly different in value from the others of the sample • The set of objects are considerably dissimilar from the remainder of the data • Example: Sports: Michael Jordon, Sachin Tendulkar, Tiger Woods
Anomaly/Outlier Analysis • Applications: • Credit card fraud detection • Telecom fraud detection • IDS • Terrorism Prevention
Anomaly/Outlier Detection • What are anomalies/outliers? • The set of data points that are considerably different than the remainder of the data • Variants of Anomaly/Outlier Detection Problems • Given a database D, find all the data points x D with anomaly scores greater than some threshold t • Given a database D, find all the data points x D having the top-n largest anomaly scores f(x) • Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
Importance of Anomaly Detection Ozone Depletion History • In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels • Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? • The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded! Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html
Anomaly Detection • Challenges • How many outliers are there in the data? • Method is unsupervised • Validation can be quite challenging (just like for clustering) • Finding needle in a haystack • Working assumption: • There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Outlier Discovery: Statistical Approaches • Assume a model underlying distribution that generates data set (e.g. normal distribution) • Use discordancy tests depending on • data distribution • distribution parameter (e.g., mean, variance) • number of expected outliers • Drawbacks • most tests are for single attribute • In many cases, data distribution may not be known
Outlier Discovery: Distance-Based Approach • Introduced to counter the main limitations imposed by statistical methods • We need multi-dimensional analysis without knowing data distribution. • Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lies at a distance greater than D from O • Algorithms for mining distance-based outliers • Index-based algorithm • Nested-loop algorithm • Cell-based algorithm
Anomaly Detection Schemes • General Steps • Build a profile of the “normal” behavior • Profile can be patterns or summary statistics for the overall population • Use the “normal” profile to detect anomalies • Anomalies are observations whose characteristicsdiffer significantly from the normal profile • Types of anomaly detection schemes • Graphical & Statistical-based • Distance-based • Model-based
Convex Hull Method • Extreme points are assumed to be outliers • Use convex hull method to detect extreme values • What if the outlier occurs in the middle of the data?
Nearest-Neighbor Based Approach • Approach: • Compute the distance between every pair of data points • There are various ways to define outliers: • Data points for which there are fewer than p neighboring points within a distance D • The top n data points whose distance to the kth nearest neighbor is greatest • The top n data points whose average distance to the k nearest neighbors is greatest