Density based Clustering

Anushree Garg, Krithika Chandramouli

Types of Clustering algorithms
  • Partitioning based
    • K-Means, K-Medoids
  • Hierarchical based
    • BIRCH, Chameleon
  • Density based
    • DBSCAN, DENCLUE, D-Stream
  • Grid Based
    • STING, WaveCluster
DBSCAN

“A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”

- Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu

  • Density-based
  • Discovers clusters of arbitrary shape
  • Requires minimal domain knowledge to choose its input parameters
Definitions
  • Core Point
    • A point with at least MinPts points in its Eps-neighborhood
  • Border Point
    • Not a core point, but lies in the Eps-neighborhood of a core point
  • Directly Density-Reachable
    • Point p is directly density-reachable from q if q is a core point and p is in the Eps-neighborhood of q
  • Density-Reachable
    • Point p is density-reachable from q if there is a chain of points p1, …, pM with p1 = q and pM = p such that p(i+1) is directly density-reachable from pi
  • Density-Connected
    • p and q are density-connected if there is a point o such that both p and q are density-reachable from o
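The definitions above can be made concrete with a small sketch. It assumes Euclidean distance and the "at least MinPts" convention for core points. The function names and the sample points are illustrative and do not come from the slides.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """A core point has at least min_pts points in its Eps-neighborhood."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if point index p is directly density-reachable from point index q."""
    return is_core(points, q, eps, min_pts) and p in eps_neighborhood(points, q, eps)

# Hypothetical sample data: a small group of points plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
core_flags = [is_core(points, i, eps=1.0, min_pts=3) for i in range(len(points))]
```

Direct density-reachability is not symmetric: a border point can be reached from a core point, but nothing is reachable from the border point itself.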
Algorithm
  • Start with an arbitrary point p
    • Retrieve all points density-reachable from p
    • If p is a core point, these points form a cluster
    • If p is a border point, no points are density-reachable from it, so no cluster is formed and the next point in the database is visited
  • Repeat the process until all points have been visited
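A minimal sketch of the procedure outlined above, assuming Euclidean distance and a brute-force neighborhood search (the original paper uses a spatial index instead). The names `dbscan`, `region_query` and `NOISE` are chosen here for illustration.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)          # None means "not yet visited"

    def region_query(i):
        # Brute-force Eps-neighborhood; includes the point itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may later turn out to be a border point
            continue
        labels[i] = cluster_id             # i is a core point: start a new cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point: keep expanding
        cluster_id += 1
    return labels

# Hypothetical data: two tight groups and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
labels = dbscan(data, eps=1.5, min_pts=3)
```

The brute-force `region_query` makes this sketch quadratic in the number of points, so it is only suitable for small examples.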
Conclusion (DBSCAN)
  • A density-based clustering algorithm
  • Effectively finds clusters of arbitrary shape
  • Needs little domain knowledge
DENCLUE

“An Efficient Approach to Clustering in Large Multimedia Databases with Noise”

- Alexander Hinneburg, Daniel A. Keim

  • Density-based clustering
  • Uses influence functions to model density
  • Handles large amounts of noise
Idea
  • Each data point has an influence that extends over a range
    • Modeled by an influence function
  • Summing the influence functions of all data points
    • Gives the density function
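As a sketch of this idea, the snippet below uses a Gaussian influence function, which is one common choice. The width parameter `sigma` and the sample data are assumptions made for illustration.

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of data point y at location x (Gaussian kernel, assumed here)."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Density function: the sum of the influences of all data points at x."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

# Hypothetical data: three nearby points and one distant point.
data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0)]
near = density((0.1, 0.1), data, sigma=0.5)   # inside the dense group
far = density((2.5, 2.5), data, sigma=0.5)    # in the empty space between
```

Locations inside the dense group get a much higher density value than locations in empty space, which is what the later hill-climbing step exploits.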
Definitions
  • Density Attractor x*
    • Local maximum of the density function
  • Density attracted points
    • Points from which a path to x* exists for which the gradient is continuously positive
Center Defined Clusters
  • All points that are density attracted to a given density attractor x*
  • Density function at the attractor must exceed the threshold ξ
  • Points that are attracted to smaller maxima are considered outliers
Arbitrary-Shape Clusters

Merges center-defined clusters if a path exists between their density attractors along which the density function continuously exceeds ξ

Algorithm
  • Step 1: Construct a map of data points
    • Uses hypercubes with edge length 2σ
    • Only populated cubes are saved
  • Step 2: Determine density attractors for all points using hill-climbing
    • Keeps track of paths that have been taken and points close to them
Step 1: Constructing the map
  • Hypercubes contain
    • Number of data points
    • Pointers to data points
    • Sum of data values (for mean)
  • Save populated hypercubes in B+ tree
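A rough sketch of the map construction step. It stores only populated cubes and keeps the count, point indices and coordinate sum for each. A plain Python dictionary stands in for the B+ tree here, the cube edge is taken as twice a width parameter `sigma`, and all names are illustrative.

```python
import math

def build_map(data, sigma):
    """Group points into hypercubes of edge length 2 * sigma; keep populated cubes only."""
    edge = 2 * sigma
    cubes = {}
    for idx, x in enumerate(data):
        key = tuple(math.floor(c / edge) for c in x)     # cube coordinates
        cube = cubes.setdefault(
            key, {"count": 0, "points": [], "sum": [0.0] * len(x)})
        cube["count"] += 1                                # number of data points
        cube["points"].append(idx)                        # "pointers" to data points
        cube["sum"] = [s + c for s, c in zip(cube["sum"], x)]  # sum of data values
    return cubes

def cube_mean(cube):
    """Mean of the points in a cube, recovered from the stored sum and count."""
    return [s / cube["count"] for s in cube["sum"]]

# Hypothetical data; with sigma = 0.5 each cube has edge length 1.
data = [(0.1, 0.1), (0.3, 0.5), (2.5, 2.5)]
cubes = build_map(data, sigma=0.5)
```

Keeping the sum and the count means the mean of a cube is available without revisiting its points.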
Step 2: Clustering Step
  • Uses only highly populated cubes and cubes that are connected to them
  • Hill-climbing based on local density function and its gradient
  • Points within σ/2 of each hill-climbing path are attached to the cluster as well
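The hill-climbing step can be sketched as below, assuming a Gaussian density function computed over all points rather than only over nearby cubes. The fixed step size, the stopping rule and the function names are simplifications chosen for this example, and the path-tracking shortcut from the slide is omitted.

```python
import math

def density(x, data, sigma):
    """Gaussian density function (an assumed choice of influence function)."""
    return sum(math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2)) for y in data)

def gradient(x, data, sigma):
    """Gradient direction of the density: sum of (y - x) weighted by the influence."""
    grad = [0.0] * len(x)
    for y in data:
        w = math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))
        for k in range(len(x)):
            grad[k] += (y[k] - x[k]) * w
    return grad

def hill_climb(x, data, sigma, step=0.05, max_iter=1000):
    """Follow the normalized gradient until the density stops increasing."""
    x = list(x)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:
            break
        candidate = [xk + step * gk / norm for xk, gk in zip(x, g)]
        if density(candidate, data, sigma) < density(x, data, sigma):
            break                      # passed the local maximum: stop here
        x = candidate
    return x                           # approximate density attractor

# Hypothetical data with two dense groups.
data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
a = hill_climb((0.5, 0.5), data, sigma=0.5)
b = hill_climb((4.5, 4.5), data, sigma=0.5)
```

Starting points that end at the same approximate attractor would be assigned to the same center-defined cluster.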
Time Complexity / Efficiency
  • Worst case, for N data points
    • O(N log(N))
  • Average case
    • O(log(N))
    • Explanation: Only highly populated areas are considered
  • Up to 45 times faster than DBSCAN
Comparison with DBSCAN
  • Corresponding setup
    • A square-wave influence function with radius σ models the Eps-neighborhood in DBSCAN
    • MinPts in DBSCAN's definition of core objects corresponds to the threshold ξ
    • Density-reachable in DBSCAN becomes density-attracted in DENCLUE
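To illustrate the correspondence, the sketch below uses a square-wave influence function, so the density at a point is simply the number of data points within the given radius. Comparing that count against a threshold with `>=` mirrors the MinPts test for core objects. The comparison operator and the names used are assumptions of this example.

```python
import math

def square_wave_influence(x, y, sigma):
    """1 inside the radius sigma, 0 outside."""
    return 1.0 if math.dist(x, y) <= sigma else 0.0

def density(x, data, sigma):
    """With a square-wave influence, this counts the points within sigma of x."""
    return sum(square_wave_influence(x, y, sigma) for y in data)

def is_core_like(x, data, sigma, xi):
    """Threshold test playing the role of DBSCAN's core-object condition."""
    return density(x, data, sigma) >= xi

# Hypothetical data: three close points and one isolated point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
count_at_origin = density((0.0, 0.0), data, sigma=1.0)
```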
Conclusion (DENCLUE)
  • DENCLUE is reported to be considerably faster than comparable algorithms such as DBSCAN
    • Efficient data structure
  • Used for large multimedia databases
  • Works well with a large number of outliers
D-Stream
  • Chen, Yixin, and Li Tu. "Density-based clustering for real-time stream data." Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007.
Data stream clustering
  • Clustering a high-dimensional stream in real time is a challenging task
  • Massive volumes of raw data arrive in real time and can be scanned only once
  • Applications: stocks, weather monitoring, ...
Clustering algorithms – then vs now
  • Then
    • Used a single-phase model
    • Treated data stream clustering as a continuous version of static clustering
    • Divide and conquer
    • Weighted outdated and recent data equally
    • Did not capture the evolving characteristics of the data
    • CluStream: two-phase framework
    • Its offline component is based on k-means, which identifies spherical clusters but not arbitrary shapes
    • Requires multiple scans of data
Clustering algorithms – then vs now
  • Now
    • D-Stream is density-based
    • Doesn’t treat the data stream as a long sequence of static data
    • Captures the dynamism of the stream with a decay factor
    • Doesn’t require the user to specify the number of clusters
    • Discretizes the data space into grids; new data maps to these grids
The D-Stream algorithm
  • Key features of the algorithm
    • The timestamp of each data point is labelled by an integer
    • Online component + offline component
  • Online component
    • Reads each incoming data record
    • Places this multi-dimensional record into the appropriate density grid
    • Updates the characteristic vector of the grid
  • Offline component
    • Dynamically adjusts clusters in the time gap (time between arrival of data)
    • Adjusts the clusters periodically
D-Stream definitions
  • Input: d-dimensional data in the space S = S1 × S2 × … × Sd
  • Density grid: each dimension Si is divided into pi partitions
  • Grid g = S1,j1 × S2,j2 × … × Sd,jd, written as (j1, j2, …, jd)
  • Every data record x = (x1, x2, …, xd) is mapped onto a grid g
  • Timestamp of arrival: T(x)
  • Density coefficient of x at time t: D(x, t) = λ^(t − T(x))
  • λ ∈ (0, 1) is the decay factor
  • Grid density: D(g, t) is the sum of the density coefficients of all records mapped to g
  • For each grid, the time when the last data record was received is stored, so the density can be updated when new data arrives
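The sketch below shows one way a record could be mapped to a grid and how a grid density could be updated incrementally. It assumes each record contributes λ^(t − T(x)) at time t, so a grid last updated at time tl and receiving a record at time tn gets density λ^(tn − tl) · D(g, tl) + 1. The class name, the equal-width partitioning and the parameter names are illustrative choices.

```python
import math

def grid_key(x, lower, width):
    """Map record x to the index tuple (j1, ..., jd) of its grid cell."""
    return tuple(math.floor((xi - lo) / w) for xi, lo, w in zip(x, lower, width))

class DecayedGrids:
    """Keeps, per grid, only the density and time of the last update."""

    def __init__(self, decay):
        self.decay = decay          # decay factor lambda in (0, 1)
        self.density = {}           # grid key -> density at its last update
        self.last_update = {}       # grid key -> time of its last update

    def density_at(self, key, t):
        """Density of a grid at time t, decayed from its last update."""
        if key not in self.density:
            return 0.0
        return self.density[key] * self.decay ** (t - self.last_update[key])

    def add(self, key, t):
        """Record one new data point in grid `key` at integer time t."""
        self.density[key] = self.density_at(key, t) + 1.0
        self.last_update[key] = t

# Hypothetical usage: three records fall into the same grid at times 0, 1 and 3.
grids = DecayedGrids(decay=0.5)
key = grid_key((0.25, 0.75), lower=(0.0, 0.0), width=(0.5, 0.5))
for t in (0, 1, 3):
    grids.add(key, t)
```

Because only the last density and update time are stored per grid, no individual record has to be kept, which fits the single-scan constraint of stream data.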
D-Stream definitions
  • Characteristic vector of a grid g is (tg, tm, D, label, status)
    • tg – last time g was updated
    • tm – last time g was removed from grid_list
    • D – grid density
    • label – class label
    • status – SPORADIC or NORMAL, used to remove sporadic grids
  • Dense grid – density at or above an upper threshold
  • Sparse grid – density at or below a lower threshold
  • Transitional grid – density between the two thresholds
  • Sporadic grids – contain very few data points
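A possible way to separate dense, sparse and transitional grids is sketched below. It assumes the thresholds are multiples of the average grid density 1 / (N(1 − λ)), where N is the number of grids and λ the decay factor. That follows the D-Stream paper as best recalled and should be checked against it. The multipliers `c_m` and `c_l` are hypothetical values picked for this example.

```python
def classify_grid(density, n_grids, decay, c_m=3.0, c_l=0.8):
    """Label a grid as dense, sparse or transitional from its density.

    Assumed thresholds (to be verified against the paper):
      dense  if density >= c_m / (n_grids * (1 - decay)), with c_m > 1
      sparse if density <= c_l / (n_grids * (1 - decay)), with 0 < c_l < 1
    """
    average = 1.0 / (n_grids * (1.0 - decay))
    if density >= c_m * average:
        return "dense"
    if density <= c_l * average:
        return "sparse"
    return "transitional"

# Hypothetical setting: 10 grids and decay factor 0.5, so the average density is 0.2.
labels = [classify_grid(d, n_grids=10, decay=0.5) for d in (1.0, 0.3, 0.1)]
```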
Components of D-Stream
  • New data x is mapped to grid g, and the density is updated
    • A decay scheme gradually reduces the density of each record and grid
  • Periodically form clusters
    • The time interval between grid inspections can’t be too long or too short
    • Compute the minimum time for a dense grid to become a sparse grid
  • Remove sporadic grids
    • Grids containing very few data points
    • Removed by density thresholding
    • grid_list keeps track of all grids under analysis
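The minimum time for a dense grid to become sparse can be estimated directly from the decay model. A grid sitting exactly at the dense threshold that receives no new data loses a factor λ per time step, so the sketch counts the steps until its density falls to the sparse threshold. The threshold values and the function name are assumptions, and the exact expression in the paper may differ.

```python
def steps_dense_to_sparse(d_m, d_l, decay):
    """Smallest number of time steps after which a grid at density d_m,
    receiving no new data, has decayed to d_l or below."""
    if not (0.0 < decay < 1.0) or d_l <= 0.0:
        raise ValueError("need 0 < decay < 1 and d_l > 0")
    steps = 0
    density = d_m
    while density > d_l:
        density *= decay        # one time step of decay
        steps += 1
    return steps

# Hypothetical thresholds: dense at 0.6, sparse at 0.16, decay factor 0.5.
gap = steps_dense_to_sparse(d_m=0.6, d_l=0.16, decay=0.5)
```

Inspecting the grids at least this often means no grid can go from dense to sparse between two inspections without being noticed.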
Results
  • Data – a network intrusion data stream and synthetic data
  • Data set sizes – 30K to 85K data points
Conclusion
  • D-Stream is a clustering technique for fast-changing data streams
  • Finds clusters of arbitrary shape
  • Sporadic grids are dynamically removed