Mining Complex Types of Data

Mining Complex Types of Data 2004/10/29

Outline • 1. Generalization of Structured Data • 2. Mining Spatial Databases • 3. Mining Time-Series and Sequence Data • 4. Mining Text Databases • 5. Mining the World Wide Web

1. Generalization of Structured Data • Generalization means reducing attribute values to a small set of categories taken from a concept hierarchy. • This reduction often requires background knowledge. • E.g., hobby = {tennis, hockey, chess, violin, nintendo_games} generalizes to {sports, music, video_games}
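
A minimal sketch of the hobby example, assuming the concept hierarchy is stored as a plain child-to-parent mapping. The mapping itself is hypothetical: in particular, placing chess under sports is an assumption, since the slide does not say which category it belongs to.

```python
from collections import Counter

# Hypothetical concept hierarchy: raw hobby -> parent category.
# Assumption: chess is filed under "sports"; the slide does not specify this.
HOBBY_HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",
    "violin": "music",
    "nintendo_games": "video_games",
}

def generalize(values, hierarchy):
    """Replace each value by its parent concept; unknown values are kept as-is."""
    return [hierarchy.get(v, v) for v in values]

hobbies = ["tennis", "hockey", "chess", "violin", "nintendo_games", "tennis"]
generalized = generalize(hobbies, HOBBY_HIERARCHY)

# After generalization the attribute takes only a few distinct values,
# so counts per category become meaningful.
counts = Counter(generalized)
print(counts)
```

Without the background knowledge encoded in the mapping, the five raw values could not be grouped at all, which is why the hierarchy is usually supplied by a domain expert.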

Generalization Based Knowledge Discovery • Requires existence of background knowledge (concept hierarchies) for both spatial and non-spatial data. • Concept hierarchies are typically given by domain experts.

Spatial Attribute Concept Hierarchy

An Example: Plan Mining by Divide and Conquer • Plan: a variable sequence of actions • E.g., Travel (flight): <traveler, departure, arrival, d-time, a-time, airline, price, seat> • Plan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans) • E.g., discover travel patterns in an air flight database, or • find significant patterns in the sequences of actions taken to repair automobiles • Method • Attribute-oriented induction on sequence data • A generalized travel plan: <small-big*-small> • Divide & conquer: mine characteristics for each subsequence • E.g., big*: same airline, small-big: nearby region

A Travel Database for Plan Mining • Example: Mining a travel planbase • [Table: travel plans]

Strategy • Generalize the planbase in different directions • Look for sequential patterns in the generalized plans • Derive high-level plans • Multidimensional analysis: a multi-D model for the planbase

Multidimensional Generalization • Multi-D generalization of the planbase • Merging consecutive, identical actions in plans

Generalization-Based Sequence Mining • Generalize the planbase in a multidimensional way using dimension tables • Use the number of distinct values (cardinality) at each level to determine the right level of generalization (level-"planning") • Use the operators merge "+" and option "[]" to further generalize patterns • Retain patterns with significant support
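
The first and third steps above can be sketched as follows. The airport codes and their sizes form a hypothetical dimension table, and the convention that only runs longer than one symbol receive a "+" is an assumption made for this illustration.

```python
from itertools import groupby

# Hypothetical dimension table: airport code -> airport size (S = small, L = large).
AIRPORT_SIZE = {"ALB": "S", "JFK": "L", "ORD": "L", "LAX": "L", "SBA": "S"}

def generalize_plan(airports, dimension):
    """Replace each airport in a plan by its value in the dimension table."""
    return [dimension[a] for a in airports]

def merge(sequence):
    """Collapse each run of identical consecutive symbols into one symbol.

    Assumption: a run longer than one is written with a trailing '+',
    a single occurrence is left unmarked.
    """
    merged = []
    for symbol, run in groupby(sequence):
        length = len(list(run))
        merged.append(symbol + "+" if length > 1 else symbol)
    return "-".join(merged)

plan = ["ALB", "JFK", "ORD", "LAX", "SBA"]
sizes = generalize_plan(plan, AIRPORT_SIZE)   # ['S', 'L', 'L', 'L', 'S']
print(merge(sizes))
```

Counting how often each merged string occurs across the planbase, and discarding those below a minimum support, would give the retained patterns.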

Generalized Sequence Patterns • The AirportSize sequences that survive the minimum support threshold (after applying the merge operator): S-L+-S [35%], L+-S [30%], S-L+ [24.5%], L+ [9%] • After applying the option operator: [S]-L+-[S] [98.5%] • Most of the time, people fly via large airports to reach their final destination • Other plans: the remaining 1.5% follow other patterns, e.g., S-S, L-S-L
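
The 98.5% figure is the sum of the four merged patterns' supports. The sketch below checks this, reading [S]-L+-[S] as "an optional S, then L+, then an optional S"; that reading and the regular expression encoding it are assumptions for illustration.

```python
import re

# Supports of the merged patterns as listed on the slide (percent).
merged_support = {"S-L+-S": 35.0, "L+-S": 30.0, "S-L+": 24.5, "L+": 9.0}

# Assumed encoding of the option pattern [S]-L+-[S] over the merged strings.
OPTION_PATTERN = re.compile(r"^(S-)?L\+(-S)?$")

covered = [p for p in merged_support if OPTION_PATTERN.match(p)]
total = sum(merged_support[p] for p in covered)

print(covered)
print(total)          # combined support of the option pattern
print(100.0 - total)  # share left for other patterns such as S-S or L-S-L
```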

2. Mining Spatial Databases • Introduction • Spatial Association Rules • Spatial Clustering • Spatial Classification

Introduction • Spatial data • Spatial data contain geometrical information • Objects are defined by points, lines, and polygons. • Objects in the spatial database represent real-world entities (e.g., rivers) with associated attributes (e.g., flow, depth, etc.). • Objects are usually described with both spatial and nonspatial attributes. • Multidimensional trees are used to build indices for spatial data in spatial databases • E.g., quad trees, k-d trees, R-trees.

Database primitives for spatial mining • Topology • E.g., A covers B; B covered-by A

Database primitives for spatial mining • Distance

Database primitives for spatial mining • Direction

Spatial data mining • Discover interesting spatial patterns and features • Capture intrinsic relationships between spatial and non-spatial data • Applications • GIS • Image database exploration

Spatial Association Rules • A spatial association rule is an association rule containing at least one spatial neighborhood relation • Topological relations: intersects, overlaps, disjoint, etc. • Direction relations: north, east, south_west, etc. • Distance relations: close_to, far_away, etc.

Example: Spatial Associations

• oasis → elephants in neighbourhood • wildebeests → lions in neighbourhood

• lots of cheetahs → fewer zebras • no zebras → fewer cheetahs

Hierarchy of spatial neighborhood relations • "g_close_to" may be specialized to near_by, touch, intersect, contain, etc. • Basic idea: if two objects do not fulfill a rough relationship (such as intersect), they cannot fulfill a refined relationship (such as meet).

Using the hierarchy tree to explore: • Collect task-relevant data. • Start the computation at a high level of spatial predicates, such as g_close_to. • Utilize spatial indexing methods. • For those patterns that pass the filtering at the high levels, refine further at the lower levels, with predicates such as adjacent_to, intersects, distance_less_than_x, etc. • Filter out the patterns that do not reach the minimum support threshold or the minimum confidence threshold. • Derive the strong association rules.
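
A minimal sketch of this level-wise refinement, under several assumptions: the towns and water features are hypothetical points, g_close_to is taken to mean "within 10 units", and the finer predicate is taken to mean "within 2 units". Only towns that pass the coarse level are examined at the finer level.

```python
from math import dist

# Hypothetical data: town coordinates and water features (points, arbitrary units).
towns = {"A": (0, 0), "B": (10, 0), "C": (40, 40), "D": (3, 4)}
water = [(1, 1), (12, 1)]

def support(candidates, features, max_distance):
    """Return the candidates with a feature within max_distance, and their share."""
    hits = {name for name, p in candidates.items()
            if any(dist(p, f) <= max_distance for f in features)}
    return hits, len(hits) / len(candidates)

MIN_SUPPORT = 0.5

# Level 1: coarse predicate g_close_to (assumed threshold: 10 units).
coarse_hits, coarse_support = support(towns, water, max_distance=10)

refined_support = None
if coarse_support >= MIN_SUPPORT:
    # Level 2: refine only the towns that passed the coarse filter
    # (assumed threshold for the finer predicate: 2 units).
    survivors = {name: towns[name] for name in coarse_hits}
    fine_hits, _ = support(survivors, water, max_distance=2)
    refined_support = len(fine_hits) / len(towns)

print(coarse_support, refined_support)
```

Because a town that fails the coarse predicate cannot satisfy the finer one, skipping it at level 2 loses nothing. The refined support is naturally lower, which is one reason to relax the threshold at deeper levels.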

Example

The map of British Columbia

Representation of spatial objects

Hierarchies for data relations

40 large towns in B.C. min_support=50%

Level-1

Level-2 min_support is reduced to 25%

Level-3 min_support is reduced to 15%

Two-step procedure for discovering spatial neighborhood relations • Step 1: rough spatial computation (as a filter) • Use minimum bounding rectangles (MBRs) or an R-tree for rough estimation • Step 2: detailed spatial algorithm (as refinement) • The detailed computation is very expensive (e.g., an intersect test). • Apply it only to those objects that have passed the rough spatial association test (support no less than min_support).
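
The filter-and-refine idea can be sketched with a toy geometry. As an assumption made to keep the exact test short, the objects here are circles rather than polygons; the object names and coordinates are hypothetical.

```python
from itertools import combinations
from math import dist

# Hypothetical objects: circles given as (center_x, center_y, radius).
objects = {
    "lake": (0, 0, 5),
    "park": (8, 8, 5),
    "mine": (30, 30, 2),
    "town": (6, 0, 2),
}

def mbr(circle):
    """Minimum bounding rectangle of a circle as (x_min, y_min, x_max, y_max)."""
    x, y, r = circle
    return (x - r, y - r, x + r, y + r)

def mbrs_overlap(a, b):
    """Cheap rectangle test: may report overlap where the shapes do not intersect."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def circles_intersect(a, b):
    """Exact test, standing in for an expensive polygon intersection."""
    return dist(a[:2], b[:2]) <= a[2] + b[2]

# Step 1: rough filter on MBRs.
candidates = [(p, q) for p, q in combinations(objects, 2)
              if mbrs_overlap(mbr(objects[p]), mbr(objects[q]))]

# Step 2: exact test on the surviving pairs only.
confirmed = [(p, q) for p, q in candidates
             if circles_intersect(objects[p], objects[q])]

print(candidates)
print(confirmed)
```

The lake and park pair passes the rectangle filter but fails the exact test, which shows why the refinement step is still required. In a real system an R-tree would replace the pairwise loop in step 1.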

Spatial Classification • A number of questions can be associated with spatial classification • Which attributes or predicates are relevant to the classification process? • How should one determine the size of the buffers that produce classes with high purity? • Can one accelerate the process of finding relevant predicates?

Example: What Kind of Houses Are Highly Valued? • [Figure: map of houses labeled H or L, shown relative to a highway and a lake]

An efficient two-step method for classification of spatial data • Step 1: rough spatial computation (as a filter) • Use MBRs or an R-tree for rough estimation • Use a nearest-neighbor approach to find relevant predicates • Step 2: detailed computation (as refinement) • Only the relevant predicates are computed in detail for all classified objects • The decision tree is built using information gain, as in ID3, to select splits
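
To make the information-gain criterion concrete, here is a minimal sketch on a hypothetical table of four houses. The predicate names close_to_lake and close_to_highway and all values are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, predicate, label="value"):
    """Entropy of the labels minus the weighted entropy after splitting on predicate."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for outcome in {r[predicate] for r in rows}:
        subset = [r[label] for r in rows if r[predicate] == outcome]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical houses: two spatial predicates and a class label (H = high, L = low).
houses = [
    {"close_to_lake": True,  "close_to_highway": True,  "value": "H"},
    {"close_to_lake": True,  "close_to_highway": False, "value": "H"},
    {"close_to_lake": False, "close_to_highway": True,  "value": "L"},
    {"close_to_lake": False, "close_to_highway": False, "value": "L"},
]

gain_lake = information_gain(houses, "close_to_lake")
gain_highway = information_gain(houses, "close_to_highway")
print(gain_lake, gain_highway)
```

In this toy table, splitting on close_to_lake separates the classes completely while close_to_highway tells us nothing, so an ID3-style tree would choose the lake predicate at the root.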

[Figure: classification result; the labels shown are High_value]

Spatial Clustering • Clustering in spatial data mining groups similar objects based on their distance, connectivity, or relative density in space. • In the real world, there are many physical obstacles such as rivers, lakes, and highways, and their presence may affect the clustering result substantially.
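
The effect of an obstacle on distance-based grouping can be illustrated with a small sketch. Everything here is a simplifying assumption: the river is the vertical line x = 5, and it can be crossed only at a single bridge point.

```python
from math import dist

RIVER_X = 5.0        # hypothetical river: the vertical line x = 5
BRIDGE = (5.0, 0.0)  # hypothetical single crossing point

def obstacle_distance(p, q):
    """Straight-line distance, unless the river separates p and q;
    in that case the path is routed through the bridge."""
    same_side = (p[0] - RIVER_X) * (q[0] - RIVER_X) >= 0
    if same_side:
        return dist(p, q)
    return dist(p, BRIDGE) + dist(BRIDGE, q)

a, b = (4.0, 10.0), (6.0, 10.0)
print(dist(a, b))               # straight line across the river
print(obstacle_distance(a, b))  # detour through the bridge
```

The two points are close in a straight line but far apart once the detour is counted, so a clustering algorithm using obstacle_distance would be unlikely to place them in the same cluster.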

Example: Spatial Cluster • 1854 cholera epidemic, London map • [Figure: disease cluster; annotation: infected water pump?]

Clustering data objects with constraints

Planning the locations of ATMs • [Figure: clusters C1 to C4 with a river, a bridge, and a mountain; panels: spatial data with obstacles, clustering without taking obstacles into consideration]

[Figure panels: not taking obstacles into account; taking obstacles into account]
