
## Mining Complex Types of Data


2004/10/29

**Outline**
• 1. Generalization of Structured Data
• 2. Mining Spatial Databases
• 3. Mining Time-Series and Sequence Data
• 4. Mining Text Databases
• 5. Mining the World Wide Web

**1. Generalization of Structured Data**
• Generalization means reducing attribute values to a small set of categories drawn from a concept hierarchy.
• This reduction often requires background knowledge.
• E.g., hobby = {tennis, hockey, chess, violin, nintendo_games} generalizes to {sports, music, video_games}

**Generalization Based Knowledge Discovery**
• Requires existence of background knowledge (concept hierarchies) for both spatial and non-spatial data.
• Concept hierarchies are typically given by domain experts.

**An Example: Plan Mining by Divide and Conquer**
• Plan: a variable sequence of actions
  • E.g., Travel (flight): <traveler, departure, arrival, d-time, a-time, airline, price, seat>
• Plan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans)
  • E.g., discover travel patterns in an air flight database, or
  • find significant patterns from the sequences of actions in the repair of automobiles
• Method
  • Attribute-oriented induction on sequence data
  • A generalized travel plan: <small-big*-small>
  • Divide & conquer: mine characteristics for each subsequence
  • E.g., big*: same airline, small-big: nearby region

**A Travel Database for Plan Mining**
• Example: Mining a travel planbase
[Table: travel plans]

**Strategy**
• Generalize the planbase in different directions
• Look for sequential patterns in the generalized plans
• Derive high-level plans
[Figure: Multidimensional Analysis, a multi-D model for the planbase]

**Multidimensional Generalization**
• Multi-D generalization of the planbase
• Merging consecutive, identical actions in plans

**Generalization-Based Sequence Mining**
• Generalize the planbase in a multidimensional way using dimension tables
• Use the number of distinct values (cardinality) at each level to determine the right level of generalization (level-"planning")
• Use the operators merge "+" and option "[]" to further generalize patterns
• Retain patterns with significant support

**Generalized Sequence Patterns**
• AirportSize-sequence survives the min threshold (after applying the merge operator): S-L+-S [35%], L+-S [30%], S-L+ [24.5%], L+ [9%]
• After applying the option operator: [S]-L+-[S] [98.5%]
• Most of the time, people fly via large airports to get to their final destination
• Other plans: the remaining 1.5% follow other patterns, e.g., S-S, L-S-L

**2. Mining Spatial Databases**
• Introduction
• Spatial Association Rules
• Spatial Clustering
• Spatial Classification

**Introduction**
• Spatial data
  • Spatial data contain some geometrical information
  • Objects are defined by points, lines, polygons.
  • Objects in the spatial database represent real-world entities (e.g., rivers) with associated attributes (e.g., flow, depth, etc.).
  • Objects usually are described with both spatial and nonspatial attributes.
• Multidimensional trees are used to build indices for spatial data in spatial databases
  • E.g., quad trees, k-d trees, R-trees.

**Database primitives for spatial mining**
• Topology, e.g., A covers B; B covered-by A
• Distance
• Direction

**Spatial data mining**
• Discover interesting spatial patterns and features
• Capture intrinsic relationships between spatial and non-spatial data
• Applications
  • GIS
  • Image database exploration

**Spatial Association Rules**
• A spatial association rule is an association rule containing at least one spatial neighborhood relation
• Topological relations: intersects, overlaps, disjoint, etc.
• Direction relations: north, east, south_west, etc.
• Distance relations: close_to, far_away, etc.

**Example: Spatial Associations**
Answers:
• oasis → elephants in neighbourhood
• wildebeests → lions in neighbourhood
• lots of cheetahs → fewer zebras
• no zebras → fewer cheetahs

**Hierarchy of spatial neighborhood relations**
• "g_close_to" may be specialized to near_by, touch, intersect, contain, etc.
• Basic idea: if two objects do not fulfill a rough relationship (such as intersect), they cannot fulfill a refined relationship (such as meet).

**Using the tree to explore:**
• Collect task-relevant data.
• Computation starts at a high level of spatial predicates, like g_close_to.
• Utilize spatial indexing methods.
• For those patterns that pass the filtering at the high levels, do further refinements at the lower levels, like adjacent_to, intersects, distance_less_than_x, etc.
• Filter out those patterns that do not exceed the minimum support threshold or the minimum confidence threshold.
• Derive the strong association rules.

**40 large towns in B.C.**
• min_support = 50%

**Level-2**
• min_support is reduced to 25%

**Level-3**
• min_support is reduced to 15%

**Two-step procedure for discovering spatial neighborhood relations**
• Step 1: rough spatial computation (as a filter)
  • Using MBR (minimum bounding rectangle) or R-tree for rough estimation
• Step 2: detailed spatial algorithm (as refinement)
  • Is very expensive (e.g., intersect test).
  • Apply only to those objects which have passed the rough spatial association test (no less than min_support).

**Spatial Classification**
• A number of questions can be associated with spatial classification
• Which attributes or predicates are relevant to the classification process?
• How should one determine the size of the buffers that produce classes with high purity?
• Can one accelerate the process of finding relevant predicates?

**Example: What Kind of Houses Are Highly Valued?**
[Figure: map of houses labeled H or L, shown relative to a highway and a lake]

**An efficient two-step method for classification of spatial data**
• Step 1: rough spatial computation (as a filter)
  • Using MBR or R-tree for rough estimation
  • Using nearest neighbor approach to find relevant predicates
• Step 2: detailed computation (as refinement)
  • Only the relevant predicates are computed in detail for all classified objects
  • In the construction of the decision tree, the information gain utilized in ID3 is used

[Figure: decision tree with High_value leaves]

**Spatial Clustering**
• Clustering in spatial data mining groups similar objects based on their distance, connectivity, or relative density in space.
• In the real world, there exist many physical obstacles such as rivers, lakes, and highways, and their presence may affect the result of clustering substantially.

**Example: Spatial Cluster**
• 1854 cholera epidemic, London map
• Disease cluster: infected water pump?

**Planning the locations of ATMs**
[Figure: spatial data with obstacles (river, bridge, mountain) and clusters C1 to C4; one panel shows clustering without taking obstacles into account, the other with obstacles taken into account]
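**Sketch: generalizing an attribute along a concept hierarchy**

The hobby example in section 1 and the cardinality rule from "Generalization-Based Sequence Mining" can be illustrated with a small sketch. It climbs a concept hierarchy one level at a time until the number of distinct values is small enough. The names `HOBBY_HIERARCHY` and `generalize` are hypothetical, and placing chess under sports is an assumption, since the slides list only the resulting categories.

```python
# Hypothetical one-level concept hierarchy (child -> parent).
# The slides give the categories only; mapping chess to sports is assumed.
HOBBY_HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",
    "violin": "music",
    "nintendo_games": "video_games",
}


def generalize(values, hierarchy, max_distinct):
    """Climb the hierarchy until at most max_distinct distinct values remain."""
    current = list(values)
    while len(set(current)) > max_distinct:
        # Replace each value by its parent; values without a parent stay put.
        climbed = [hierarchy.get(v, v) for v in current]
        if climbed == current:
            break  # top of the hierarchy reached, cannot generalize further
        current = climbed
    return current


hobbies = ["tennis", "hockey", "chess", "violin", "nintendo_games"]
generalized = generalize(hobbies, HOBBY_HIERARCHY, max_distinct=3)
print(sorted(set(generalized)))  # ['music', 'sports', 'video_games']
```

In practice the hierarchy would come from a domain expert, as the slides note, and could have several levels.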
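**Sketch: merge and option operators on airport-size sequences**

The merge "+" and option "[]" operators from "Generalized Sequence Patterns" might be sketched as follows. This is an illustration under stated assumptions, not the method used in the slides: plans are encoded as strings over S (small airport) and L (large airport), a run of two or more identical symbols is merged into `X+`, and the option pattern [S]-L+-[S] is expressed as the regular expression `S?L+S?`. The sample plans are made up.

```python
import re
from itertools import groupby


def merge(plan):
    """Merge operator: collapse runs of identical consecutive symbols.

    A run of length two or more becomes 'X+'; single symbols are kept as-is.
    """
    parts = []
    for symbol, run in groupby(plan):
        length = sum(1 for _ in run)
        parts.append(symbol + "+" if length > 1 else symbol)
    return "-".join(parts)


def support(plans, regex):
    """Fraction of plans whose whole symbol string matches the pattern."""
    pattern = re.compile(regex)
    hits = sum(1 for plan in plans if pattern.fullmatch(plan))
    return hits / len(plans)


# Made-up planbase: each string is the airport-size sequence of one trip.
plans = ["SLLS", "SLS", "LLS", "SL", "L", "SS"]

print(merge("SLLS"))  # S-L+-S

# Option operator: the leading and trailing S are optional.
option_support = support(plans, r"S?L+S?")
```

Here five of the six sample plans match the option pattern; only the S-S plan does not, which mirrors the small share of "other plans" reported in the slides.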
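**Sketch: two-step filter and refinement for a close_to relation**

The two-step procedure for spatial neighborhood relations can be sketched as a cheap bounding-box filter followed by an exact test. The sketch assumes objects are stored as plain sets of points and that close_to means "minimum distance at most a threshold"; both are simplifications, and the object names and coordinates are made up. The filter is safe because the distance between two bounding rectangles never exceeds the distance between the objects they contain.

```python
import math
from itertools import combinations


def mbr(points):
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def mbr_distance(a, b):
    """Distance between two rectangles; 0.0 if they overlap or touch."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def exact_distance(p, q):
    """Expensive step: minimum distance over all pairs of points."""
    return min(math.dist(u, v) for u in p for v in q)


def close_pairs(objects, threshold):
    """Return (pairs satisfying close_to, number of exact tests performed)."""
    boxes = {name: mbr(points) for name, points in objects.items()}
    pairs, refined = [], 0
    for a, b in combinations(sorted(objects), 2):
        # Step 1 (filter): rectangles too far apart rule the pair out.
        if mbr_distance(boxes[a], boxes[b]) > threshold:
            continue
        # Step 2 (refinement): exact test only for the surviving pairs.
        refined += 1
        if exact_distance(objects[a], objects[b]) <= threshold:
            pairs.append((a, b))
    return pairs, refined


# Made-up objects, each given as a list of (x, y) points.
objects = {
    "town": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "lake": [(2, 0), (3, 0), (3, 1), (2, 1)],
    "mine": [(10, 10), (11, 11)],
    "road": [(0, 5), (5, 0)],
}
pairs, refined = close_pairs(objects, threshold=1.5)
print(pairs, refined)  # [('lake', 'town')] 3
```

Of the six candidate pairs, three are discarded by the rectangle test alone. A real system would typically use an R-tree to avoid enumerating all pairs and exact geometry instead of point sets.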
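**Sketch: information gain for choosing a spatial predicate**

The refinement step of the spatial classification method uses the information gain from ID3 when building the decision tree. A minimal sketch of that measure is shown below. The predicate names `close_to_lake` and `close_to_highway` and the four sample houses are invented for illustration; they are not data from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, predicate, target):
    """Entropy reduction obtained by splitting the rows on one predicate."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[predicate], []).append(row[target])
    # Weighted average entropy of the partitions produced by the split.
    remainder = sum(
        len(group) / len(rows) * entropy(group) for group in groups.values()
    )
    return base - remainder


# Made-up houses: value is H (high) or L (low).
houses = [
    {"close_to_lake": True, "close_to_highway": False, "value": "H"},
    {"close_to_lake": True, "close_to_highway": True, "value": "H"},
    {"close_to_lake": False, "close_to_highway": True, "value": "L"},
    {"close_to_lake": False, "close_to_highway": False, "value": "L"},
]

gain_lake = information_gain(houses, "close_to_lake", "value")
gain_highway = information_gain(houses, "close_to_highway", "value")
best = max(["close_to_lake", "close_to_highway"],
           key=lambda p: information_gain(houses, p, "value"))
```

In this toy data, `close_to_lake` separates the classes perfectly while `close_to_highway` carries no information, so the tree would split on the lake predicate first.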
