Mining Complex Types of Data 2004/10/29
Outline • 1. Generalization of Structured Data • 2. Mining Spatial Databases • 3. Mining Time-Series and Sequence Data • 4. Mining Text Databases • 5. Mining the World Wide Web
1. Generalization of Structured Data • Generalization means reducing attribute values to a small set of categories drawn from a concept hierarchy. • This reduction often requires background knowledge. • E.g., hobby = {tennis, hockey, chess, violin, nintendo_games} generalizes to {sports, music, video_games}
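A minimal sketch of the hobby example above, assuming the concept hierarchy is stored as a plain child-to-parent mapping. The mapping itself is an assumption (for instance, placing chess under sports), as are the names `HOBBY_HIERARCHY` and `generalize`.

```python
from collections import Counter

# Hypothetical concept hierarchy (child -> parent). The assignment of each
# hobby to a category is an assumption made to reproduce the slide's example.
HOBBY_HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",
    "violin": "music",
    "nintendo_games": "video_games",
}


def generalize(values, hierarchy):
    """Replace each value by its parent concept; unknown values become 'ANY'."""
    return [hierarchy.get(value, "ANY") for value in values]


# Made-up attribute values for six people.
hobbies = ["tennis", "violin", "chess", "hockey", "nintendo_games", "tennis"]
generalized = generalize(hobbies, HOBBY_HIERARCHY)

# After generalization, five distinct values shrink to three categories.
counts = Counter(generalized)
print(counts)
```

Counting the generalized values is what makes the reduction useful: the result is small enough to summarize or to feed into further mining.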
Generalization Based Knowledge Discovery • Requires existence of background knowledge (concept hierarchies) for both spatial and non-spatial data. • Concept hierarchies are typically given by domain experts.
An Example: Plan Mining by Divide and Conquer • Plan: a variable sequence of actions • E.g., Travel (flight): <traveler, departure, arrival, d-time, a-time, airline, price, seat> • Plan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans) • E.g., discover travel patterns in an air flight database, or • find significant patterns from the sequences of actions in the repair of automobiles • Method • Attribute-oriented induction on sequence data • A generalized travel plan: <small-big*-small> • Divide & conquer: mine characteristics for each subsequence • E.g., big*: same airline, small-big: nearby region
A Travel Database for Plan Mining • Example: Mining a travel planbase Travel plans table
Strategy • Generalize the planbase in different directions • Look for sequential patterns in the generalized plans • Derive high-level plans • Multidimensional analysis: a multi-D model for the planbase
Multidimensional Generalization Multi-D generalization of the planbase Merging consecutive, identical actions in plans
Generalization-Based Sequence Mining • Generalize the planbase in a multidimensional way using dimension tables • Use the number of distinct values (cardinality) at each level to determine the right level of generalization (level-“planning”) • Use the operators merge “+” and option “[]” to further generalize patterns • Retain patterns with significant support
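One way to read the cardinality bullet above: walk a dimension from its most specific level upward and stop at the first level with few enough distinct values. The sketch below assumes that reading; the dimension table, the level names, and the threshold `max_distinct` are all hypothetical.

```python
# Hypothetical airport dimension table; every value is made up for illustration.
AIRPORT_DIM = [
    {"airport": "YVR", "city": "Vancouver", "region": "West"},
    {"airport": "YYJ", "city": "Victoria", "region": "West"},
    {"airport": "SEA", "city": "Seattle", "region": "West"},
    {"airport": "JFK", "city": "New York", "region": "East"},
    {"airport": "LGA", "city": "New York", "region": "East"},
    {"airport": "BOS", "city": "Boston", "region": "East"},
]

# Levels ordered from most specific to most general (an assumed hierarchy).
LEVELS = ["airport", "city", "region"]


def choose_level(rows, levels, max_distinct):
    """Return (level, cardinality) for the most specific level that has at
    most max_distinct distinct values; fall back to the top concept 'ANY'."""
    for level in levels:
        cardinality = len({row[level] for row in rows})
        if cardinality <= max_distinct:
            return level, cardinality
    return "ANY", 1


print(choose_level(AIRPORT_DIM, LEVELS, max_distinct=3))
```

A tighter threshold pushes the choice to a more general level; a looser one keeps more detail.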
Generalized Sequence Patterns • The AirportSize sequence survives the minimum threshold (after applying the merge operator): S-L+-S [35%], L+-S [30%], S-L+ [24.5%], L+ [9%] • After applying the option operator: [S]-L+-[S] [98.5%] • Most of the time, people fly via large airports to reach their final destination • Other plans: the remaining 1.5% follow other patterns, such as S-S and L-S-L
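The merge and option operators can be sketched on toy data. This is an illustration under stated assumptions, not the method used to produce the percentages above: here merge collapses any run of two or more identical symbols into `X+`, and the option pattern [S]-L+-[S] is checked with a regular expression in which `L+` means one or more large airports. The plans are made up.

```python
import re
from collections import Counter
from itertools import groupby


def merge(sequence):
    """Collapse each run of two or more identical symbols into 'X+'.
    (Assumption: a single symbol is left unchanged.)"""
    parts = []
    for symbol, run in groupby(sequence):
        length = len(list(run))
        parts.append(symbol + "+" if length > 1 else symbol)
    return "-".join(parts)


# [S]-L+-[S]: an optional small airport, one or more large ones, an optional
# small airport. Written over the raw, unmerged symbol string.
OPTION_PATTERN = re.compile(r"S?L+S?")


def matches_option_pattern(sequence):
    """True if the whole sequence fits the generalized pattern [S]-L+-[S]."""
    return OPTION_PATTERN.fullmatch("".join(sequence)) is not None


# Hypothetical airport-size sequences (S = small, L = large), one per plan.
plans = ["SLLS", "SLS", "LLS", "SLL", "SS", "LSL", "SLLS", "LL"]

# Support of each merged pattern = share of plans that generalize to it.
support = {p: n / len(plans) for p, n in Counter(map(merge, plans)).items()}
option_share = sum(matches_option_pattern(p) for p in plans) / len(plans)

print(support)
print(option_share)
```

Patterns whose support falls below a chosen minimum threshold would then be dropped, as the last bullet of the previous slide describes.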
2. Mining Spatial Databases • Introduction • Spatial Association Rules • Spatial Clustering • Spatial Classification
Introduction • Spatial data • Spatial data contain some geometrical information • Objects are defined by points, lines, polygons. • Objects in the spatial database represent real-world entities (e.g., rivers) with associated attributes (e.g., flow, depth, etc.). • Objects usually are described with both spatial and nonspatial attributes. • Multidimensional trees are used to build indices for spatial data in spatial databases • E.g., quad trees, k-d trees, R-trees.
Database primitives for spatial mining • Topology • E.g., A covers B; B covered-by A
Database primitives for spatial mining • Distance
Database primitives for spatial mining • Direction
Spatial data mining • Discover interesting spatial patterns and features • Capture intrinsic relationships between spatial and non-spatial data • Applications • GIS • Image database exploration
Spatial Association Rules • A spatial association rule is an association rule containing at least one spatial neighborhood relation • Topological relations: intersects, overlaps, disjoins, etc. • Direction relations: north, east, south_west, etc. • Distance relations: close_to, far_away, etc.
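To make the definition concrete, the sketch below scores one hypothetical rule, large(x) → close_to(x, water), over a set of towns. The rule, the coordinates, and the distance threshold are all assumptions; close_to is approximated by Euclidean distance to the nearest water point.

```python
import math

# Hypothetical towns with a non-spatial attribute (large) and a location.
towns = [
    {"name": "A", "xy": (0, 0), "large": True},
    {"name": "B", "xy": (5, 5), "large": True},
    {"name": "C", "xy": (10, 0), "large": False},
    {"name": "D", "xy": (1, 8), "large": True},
]
# Hypothetical water bodies, reduced to single points for simplicity.
water = [(0, 1), (6, 5)]


def close_to(xy, targets, threshold):
    """Distance relation: true if any target lies within the threshold."""
    return any(math.dist(xy, t) <= threshold for t in targets)


def rule_stats(towns, water, threshold):
    """Support and confidence of large(x) -> close_to(x, water)."""
    antecedent = [t for t in towns if t["large"]]
    both = [t for t in antecedent if close_to(t["xy"], water, threshold)]
    support = len(both) / len(towns)          # share of all towns
    confidence = len(both) / len(antecedent)  # share of large towns
    return support, confidence


support, confidence = rule_stats(towns, water, threshold=2.0)
print(support, confidence)
```

The rule counts as a spatial association rule because its consequent is a spatial neighborhood relation; the antecedent here happens to be non-spatial.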
Example: Spatial Associations • Answers:
• oasis → elephants in neighbourhood • wildebeests → lions in neighbourhood
• lots of cheetahs → fewer zebras • no zebras → fewer cheetahs
Hierarchy of spatial neighborhood relations • "g_close_to" may be specialized to near_by, touch, intersect, contain, etc. • Basic idea: if two objects do not fulfill a rough relationship (such as intersect), they cannot fulfill a refined relationship (such as meet).
Using the tree to explore: • Collect task-relevant data. • Start the computation at a high level of spatial predicates, such as g_close_to. • Use spatial indexing methods. • For those patterns that pass the filtering at the high levels, refine further at the lower levels with predicates such as adjacent_to, intersects, distance_less_than_x, etc. • Filter out patterns that do not exceed the minimum support threshold or the minimum confidence threshold. • Derive the strong association rules.
40 large towns in B.C. min_support=50%
Level-2 min_support is reduced to 25%
Level-3 min_support is reduced to 15%
Two-step procedure for discovering spatial neighborhood relations • Step 1: rough spatial computation (as a filter) • Using MBR or R-tree for rough estimation • Step 2: detailed spatial algorithm (as refinement) • Is very expensive (e.g. intersect test). • Apply only to those objects which have passed the rough spatial association test (no less than min_support).
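The two steps can be sketched for a close_to test. Assumptions: objects are lists of vertices, the rough step compares minimum bounding rectangles (MBRs) directly rather than through an R-tree, and the detailed step uses the smallest vertex-to-vertex distance as a simplified stand-in for an exact geometric test. All names and coordinates are hypothetical.

```python
import math


def mbr(points):
    """Minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def mbr_distance(a, b):
    """Smallest distance between two axis-aligned rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def exact_distance(p, q):
    """Refinement: smallest vertex-to-vertex distance (a simplification)."""
    return min(math.dist(u, v) for u in p for v in q)


def close_pairs(objects, others, d):
    """Step 1 keeps pairs whose MBRs lie within d; step 2 re-checks survivors."""
    candidates = [(a, b) for a in objects for b in others
                  if mbr_distance(mbr(objects[a]), mbr(others[b])) <= d]
    refined = [(a, b) for a, b in candidates
               if exact_distance(objects[a], others[b]) <= d]
    return candidates, refined


# Hypothetical towns (polygons) and rivers (polylines).
towns = {"T1": [(0, 0), (1, 0), (1, 1), (0, 1)],
         "T2": [(10, 10), (11, 10), (11, 11)]}
rivers = {"R1": [(2, 0), (2, 3)],
          "R2": [(2, 5), (5, 2)]}

candidates, refined = close_pairs(towns, rivers, d=2.0)
print(candidates)
print(refined)
```

Because the MBR distance never exceeds the vertex distance, the filter cannot discard a pair that the refinement would accept; it only spares the expensive test for pairs that are clearly too far apart.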
Spatial Classification • A number of questions can be associated with spatial classification • Which attributes or predicates are relevant to the classification process? • How should one determine the size of the buffers that produce classes with high purity? • Can one accelerate the process of finding relevant predicates?
Example: What Kind of Houses Are Highly Valued? [Figure: map of houses labeled H or L, with a highway and a lake]
An efficient two-step method for classification of spatial data • Step 1: rough spatial computation (as a filter) • Using MBR or R-tree for rough estimation • Using nearest neighbor approach to find relevant predicates • Step 2: detailed computation (as refinement) • Only the relevant predicates are computed in detail for all classified objects • In the construction of the decision tree, the information gain utilized in ID3 is used
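The information gain used by ID3 can be illustrated on a tiny table of houses. The predicates `close_to_lake` and `close_to_highway` and all rows are made up; only the entropy and gain formulas are standard.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, predicate, target):
    """Entropy of the target minus its expected entropy after splitting
    the rows on the given predicate."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[predicate] for row in rows}:
        subset = [row[target] for row in rows if row[predicate] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical classified houses: (close_to_lake, close_to_highway, value).
raw = [(True, False, "H"), (True, True, "H"), (True, False, "H"),
       (True, True, "H"), (False, True, "L"), (False, False, "L"),
       (False, True, "L"), (False, False, "L")]
houses = [{"close_to_lake": lake, "close_to_highway": hwy, "value": v}
          for lake, hwy, v in raw]

gain_lake = information_gain(houses, "close_to_lake", "value")
gain_highway = information_gain(houses, "close_to_highway", "value")
print(gain_lake, gain_highway)
```

In this toy table the lake predicate separates the classes completely while the highway predicate tells nothing, so a decision tree would split on the lake predicate first.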
[Figure: decision tree with High_value leaves]
Spatial Clustering • Clustering in spatial data mining groups similar objects based on their distance, connectivity, or relative density in space. • In the real world, there are many physical obstacles such as rivers, lakes, and highways, and their presence may affect the result of clustering substantially.
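The effect of an obstacle can be shown with a deliberately simple stand-in for a real clustering algorithm: points are grouped into connected components, where two points are linked if they lie within `eps` of each other and the straight line between them does not cross an obstacle segment. This is not any specific published obstacle-aware method; the points, the river segment, and `eps` are hypothetical.

```python
import math


def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, 0 or -1."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def crosses(p, q, a, b):
    """True if segment pq properly crosses segment ab (touching is ignored)."""
    return (_orient(p, q, a) * _orient(p, q, b) < 0
            and _orient(a, b, p) * _orient(a, b, q) < 0)


def cluster(points, eps, obstacles=()):
    """Label points by connected component over unobstructed links <= eps."""
    labels = [None] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j, q in enumerate(points):
                if labels[j] is not None or math.dist(points[i], q) > eps:
                    continue
                if any(crosses(points[i], q, a, b) for a, b in obstacles):
                    continue  # the link is blocked by an obstacle
                labels[j] = current
                stack.append(j)
        current += 1
    return labels


# Hypothetical customer locations and a river drawn as one vertical segment.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
river = [((1.5, -5), (1.5, 5))]

print(cluster(points, eps=1.5))
print(cluster(points, eps=1.5, obstacles=river))
```

Without the river the first four points form one cluster; with it, the cluster splits at the river because the only link across it is blocked.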
Example: Spatial Cluster • 1854 cholera epidemic, London map • [Figure labels: disease cluster; infected water pump?]
Planning the locations of ATMs • Spatial data with obstacles [Figure labels: clusters C1, C2, C3, C4; river; bridge; mountain] • Clustering without taking obstacles into consideration
[Figure panels: not taking obstacles into account; taking obstacles into account]