Spatial Data Management: Summary
Agenda Today • We will discuss a few interesting spatial data mining patterns • Then come back to summarize what we have learned in this course so far
Course Summary • 1. Introduction to Spatial Databases • 2. Spatial Concepts and Data Models • 3. Spatial Query Languages: SQL3 • 4. Spatial Storage and Indexing: R-tree, Grid File • 5. Query Processing and Query Optimization • Strategies for range query, nearest neighbor query • Spatial joins (e.g. tree matching), cost models • 6. Spatial Network Model • 7. Spatial Data Mining • Spatial auto-correlation, co-location patterns, spatial outliers, classification methods • 8. Trends in Spatial Database (Moving Object)
1. Introduction • Traditional (non-spatial) database management systems provide: • Persistence across failures • Concurrent access to data • Scalability to search queries on very large datasets that do not fit in main memory • They are efficient for non-spatial queries, but not for spatial queries • Non-spatial queries: • List the names of all bookstores with more than ten thousand titles. • List the names of the top ten customers, in terms of sales, in the year 2001. • Use an index to narrow down the search • Spatial queries: • List the names of all bookstores within ten miles of Minneapolis. • List all customers who live in Tennessee and its adjoining states. • List all the customers who reside within fifty miles of the company headquarters.
1. Spatial Data Examples • Examples of non-spatial data • Names, phone numbers, … • Examples of spatial data • Census data • NASA satellite imagery: terabytes of data per day • Weather and climate data • Rivers, farms, ecological impact • Medical imaging
2. Spatial Object Model • Object model concepts • Objects: distinct identifiable things relevant to an application • Objects have attributes and operations • Attribute: a simple (e.g. numeric, string) property of an object • Operations: functions that map object attributes to other objects • Example from a roadmap • Objects: roads, landmarks, ... • Attributes of road objects: • spatial: location, e.g. polygon boundary of land-parcel • non-spatial: name (e.g. Route 66), type (e.g. interstate, residential street), number of lanes, speed limit, … • Operations on road objects: determine center line, determine length, determine intersection with other roads, ...
2. Classifying Spatial objects • Spatial objects are spatial attributes of general objects • Spatial objects are of many types • Simple • 0-dimensional (points), 1-dimensional (curves), 2-dimensional (surfaces) • Example given at the bottom of this slide • Collections • Polygon collection (e.g. boundary of Japan or Hawaii), … • See a more complete list in Figure 2.2
2. Spatial Object Types in OGIS Data Model Fig 2.2: Each rectangle shows a distinct spatial object type
2. Classifying Operations on spatial objects in Object Model • Classifying operations • Set based: 2-dimensional spatial objects (e.g. polygons) are sets of points • A set operation (e.g. intersection) of 2 polygons produces another polygon • Topological: the boundary of the USA touches the boundary of Canada • Directional: New York City is to the east of Chicago • Metric: Chicago is about 700 miles from New York City.
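The four classes of operations can be illustrated with a small sketch. It assumes axis-aligned rectangles as a stand-in for general polygons, and the names `Rect`, `intersection`, `touches`, `is_east_of` and `distance` are hypothetical helpers, not part of any standard or library.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle used here in place of a general polygon."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersection(a: Rect, b: Rect) -> Optional[Rect]:
    """Set based: the shared region, or None if the interiors are disjoint."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin < xmax and ymin < ymax:
        return Rect(xmin, ymin, xmax, ymax)
    return None


def touches(a: Rect, b: Rect) -> bool:
    """Topological: boundaries share a point but interiors do not overlap."""
    share_point = (a.xmin <= b.xmax and b.xmin <= a.xmax and
                   a.ymin <= b.ymax and b.ymin <= a.ymax)
    return share_point and intersection(a, b) is None


def center(r: Rect) -> Tuple[float, float]:
    return ((r.xmin + r.xmax) / 2, (r.ymin + r.ymax) / 2)


def is_east_of(a: Rect, b: Rect) -> bool:
    """Directional: compares the x-coordinates of the two centers."""
    return center(a)[0] > center(b)[0]


def distance(a: Rect, b: Rect) -> float:
    """Metric: Euclidean distance between the two centers."""
    (ax, ay), (bx, by) = center(a), center(b)
    return hypot(ax - bx, ay - by)


south = Rect(0, 0, 4, 2)
north = Rect(0, 2, 4, 5)      # shares only the edge y = 2 with `south`
a = Rect(0, 0, 2, 2)
b = Rect(1, 1, 3, 3)          # overlaps `a` in the unit square (1,1)-(2,2)
```

Using centers for direction and distance is one simple choice; a real system would define these operations on the full geometries.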
2. Specifying topological operation Fig 2.3: 9-intersection matrices for a few topological operations
2. Conceptual DM: The ER Model • 3 basic concepts • Entities have an independent conceptual or physical existence. • Examples: Forest, Road, Manager, ... • Entities are characterized by Attributes • Example: Forest has attributes of name, elevation, etc. • An Entity interacts with another Entity through relationships. • Roads allow access to Forest interiors. • This relationship may be named “Accesses”
2. ER Diagram for “State-Park” Fig 2.4
2. Mapping ER to Relational • Highlights of translation rules • Entity becomes Relation • Attributes become columns in the relation • Multi-valued attributes become a new relation • includes a foreign key to link to the relation for the entity • Relationships (1:1, 1:N) become foreign keys • M:N relationships become a relation • containing foreign keys to the relations of the participating entities
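A minimal sketch of these translation rules, using Python's built-in `sqlite3` module and the Forest, Road and "Accesses" example from the ER slide. The column names, the `ForestSpecies` relation and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
-- Each entity becomes a relation; its attributes become columns.
CREATE TABLE Forest (name TEXT PRIMARY KEY, elevation REAL);
CREATE TABLE Road   (name TEXT PRIMARY KEY, num_lanes INTEGER);

-- A multi-valued attribute (hypothetical: tree species) becomes a new
-- relation with a foreign key back to the entity's relation.
CREATE TABLE ForestSpecies (
    forest_name TEXT REFERENCES Forest(name),
    species     TEXT,
    PRIMARY KEY (forest_name, species)
);

-- The M:N relationship "Accesses" becomes a relation of foreign keys.
CREATE TABLE Accesses (
    road_name   TEXT REFERENCES Road(name),
    forest_name TEXT REFERENCES Forest(name),
    PRIMARY KEY (road_name, forest_name)
);
""")

# Illustrative rows only.
conn.execute("INSERT INTO Forest VALUES ('Pine Ridge', 1200.0)")
conn.execute("INSERT INTO Road VALUES ('Route 66', 4)")
conn.execute("INSERT INTO Accesses VALUES ('Route 66', 'Pine Ridge')")

rows = conn.execute("""
    SELECT r.name, f.name
    FROM Road r
    JOIN Accesses a ON a.road_name = r.name
    JOIN Forest f   ON f.name = a.forest_name
""").fetchall()
```

The join at the end recovers the road and forest pairs that the relationship connected in the ER diagram.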
3. Three Components of SQL? • Data Definition Language (DDL) • Creation and modification of relational schema • Schema objects include relations, indexes, etc. • Data Manipulation Language (DML) • Insert, delete, update rows in tables • Query data in tables • Data Control Language (DCL) • Concurrency control, transactions • Administrative tasks, e.g. set up database users, security permissions
3. Creating Tables in SQL • Table definition • “CREATE TABLE” statement • Specifies table name, attribute names and data types • Creates a table with no rows. • See an example at the bottom • Related statements • ALTER TABLE statement modifies table schema if needed • DROP TABLE statement removes a table, including any rows it contains
3. Populating Tables in SQL • Adding a row to an existing table • “INSERT INTO” statement • Specifies table name, attribute names and values • Example: • INSERT INTO River(Name, Origin, Length) VALUES('Mississippi', 'USA', 6000) • Related statements • SELECT statement with INTO clause can insert multiple rows in a table • Bulk load, import commands also add multiple rows • DELETE statement removes rows • UPDATE statement can change values within selected rows
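The table-definition and table-population statements can be tried end to end with Python's built-in `sqlite3` module. This is only a sketch: the column types, the `Mouth` column and the rows named "River A" and "River B" are placeholders introduced here, and SQLite's dialect differs in places from SQL3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE TABLE defines the schema and yields a table with no rows.
conn.execute("CREATE TABLE River (Name TEXT, Origin TEXT, Length INTEGER)")
empty_count = conn.execute("SELECT COUNT(*) FROM River").fetchone()[0]

# INSERT INTO adds one row; this is the statement from the slide.
conn.execute("INSERT INTO River(Name, Origin, Length) "
             "VALUES ('Mississippi', 'USA', 6000)")

# executemany adds several rows at once; the values are placeholders.
conn.executemany(
    "INSERT INTO River(Name, Origin, Length) VALUES (?, ?, ?)",
    [("River A", "Country X", 100), ("River B", "Country Y", 200)],
)

# UPDATE changes values within selected rows.
conn.execute("UPDATE River SET Length = 150 WHERE Name = 'River A'")
# DELETE removes selected rows.
conn.execute("DELETE FROM River WHERE Name = 'River B'")
# ALTER TABLE modifies the schema of an existing table.
conn.execute("ALTER TABLE River ADD COLUMN Mouth TEXT")

rows = conn.execute(
    "SELECT Name, Length, Mouth FROM River ORDER BY Name"
).fetchall()

# DROP TABLE removes the table together with any rows it still holds.
conn.execute("DROP TABLE River")
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```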
3. SELECT Statement: General Information • Clauses • SELECT specifies desired columns • FROM specifies relevant tables • WHERE specifies qualifying conditions for rows • ORDER BY specifies sorting columns for results • GROUP BY and HAVING specify aggregation and statistics • Operators and functions • arithmetic operators, e.g. +, -, … • comparison operators, e.g. =, <, >, BETWEEN, LIKE, … • logical operators, e.g. AND, OR, NOT, EXISTS, … • set operators, e.g. UNION, IN, ALL, ANY, … • statistical functions, e.g. SUM, COUNT, ... • many other operators on strings, date, currency, ...
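One query can exercise all of the clauses listed above. The sketch below again uses Python's `sqlite3` module; the `River` rows are made-up placeholder data chosen so the effect of each clause is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE River (Name TEXT, Origin TEXT, Length INTEGER)")
conn.executemany("INSERT INTO River VALUES (?, ?, ?)", [
    ("A", "X", 300), ("B", "X", 500),
    ("C", "Y", 700), ("D", "Y", 5000),
    ("E", "Z", 200), ("F", "Z", 900),
])

query = """
SELECT Origin, COUNT(*) AS n, SUM(Length) AS total  -- desired columns, statistics
FROM River                                          -- relevant table
WHERE Length BETWEEN 100 AND 1000                   -- row-level condition
GROUP BY Origin                                     -- one group per origin
HAVING COUNT(*) >= 2                                -- group-level condition
ORDER BY total DESC                                 -- sort the result
"""
result = conn.execute(query).fetchall()
# result == [("Z", 2, 1100), ("X", 2, 800)]
```

WHERE drops river D before grouping, which leaves origin Y with a single row, so HAVING then removes that group.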
4. Query Operation & Spatial Index • Filter Step: • Select the objects whose minimum bounding box (mbb) satisfies the spatial predicate • Traverse the index, applying the spatial test to the mbbs • Output: a set of object ids (oids) • Refinement Step: • The spatial test is done on the actual geometries of the objects whose mbb satisfied the filter step • Costly operation • Executed only on a limited number of objects • Concentrate on the design of efficient spatial access methods (SAMs) for the filter step
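The two-step scheme can be sketched for a point query over polygons. This is a simplified illustration: the filter step scans every mbb instead of traversing an index, the refinement uses a standard ray-casting test, and all function names and sample polygons are assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def mbb(poly: Polygon) -> Tuple[float, float, float, float]:
    """Minimum bounding box as (xmin, ymin, xmax, ymax)."""
    xs, ys = [p[0] for p in poly], [p[1] for p in poly]
    return min(xs), min(ys), max(xs), max(ys)


def mbb_contains(box: Tuple[float, float, float, float], p: Point) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def point_in_polygon(p: Point, poly: Polygon) -> bool:
    """Exact test by ray casting: count edge crossings to the right of p."""
    px, py = p
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > py) != (y2 > py):                  # edge spans the ray's y
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def point_query(objects: Dict[str, Polygon], p: Point):
    # Filter step: cheap test on bounding boxes, yields candidate oids.
    candidates = [oid for oid, poly in objects.items()
                  if mbb_contains(mbb(poly), p)]
    # Refinement step: costly exact test, run on the candidates only.
    answers = [oid for oid in candidates if point_in_polygon(p, objects[oid])]
    return candidates, answers


objects = {
    "triangle": [(0, 0), (4, 0), (0, 4)],
    "square": [(2, 2), (5, 2), (5, 5), (2, 5)],
    "far": [(10, 10), (12, 10), (11, 12)],
}
candidates, answers = point_query(objects, (3, 3))
```

The point (3, 3) lies inside the triangle's bounding box but outside the triangle itself, so it passes the filter and is rejected only in the refinement step.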
4. Why spatial index methods? • B-trees & hash tables • Guarantee that the number of I/O operations is, respectively, logarithmic and constant in the collection size • Index a collection on a key • Rely on a total order on the key domain, e.g. the order of natural numbers or the lexicographic order on strings • There is no such total order for geometric objects • SAMs were designed to preserve spatial object proximity as much as possible
4. Space-Driven vs. Data-Driven SAMs • Space-driven structures: • Partition the embedding 2D space into rectangular cells • Independently of the distribution of the objects • Objects are mapped to the cells based on some geometric criterion • Examples: grid file, linear structures • Data-driven structures: • Organized by partitioning the set of objects, as opposed to the embedding space • Adapt to the objects’ distribution in the embedding space • Examples: R-tree, R*-tree, R+-tree
4. Grid File – point indexing • One page is associated with each cell • When a cell overflows, it is split into two cells and its points are redistributed between them • Two adjacent cells can reference the same page • The cells are of different sizes and the partition adapts to the point distribution
4. The Quad tree • The index is represented as a quaternary tree • Each internal node has four children, one per quadrant • NW, NE, SW, SE • Each leaf is associated with a disk page, which stores the index entries
4. The original R-Tree • A leaf entry is a pair (mbb, oid) • A non-leaf node contains an array of node entries • The number of entries is between m and M • For each entry (dr, node_id) in a non-leaf node N, dr is the directory rectangle of a child node of N, whose page address is node_id • All leaves are at the same level • An object appears in one and only one of the tree leaves
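A small in-memory sketch of the quad tree structure described above, for points only. The class name, the `capacity` parameter (standing in for the number of entries that fit in a disk page) and the sample points are assumptions; a real implementation would also have to cope with many identical points, which this version would keep splitting on.

```python
class QuadTree:
    """Point quad tree: a leaf holds up to `capacity` points, then splits."""

    def __init__(self, xmin, ymin, xmax, ymax, capacity=2):
        self.box = (xmin, ymin, xmax, ymax)
        self.capacity = capacity
        self.points = []       # leaf payload, standing in for a disk page
        self.children = None   # [NW, NE, SW, SE] once the node is internal

    def _middle(self):
        xmin, ymin, xmax, ymax = self.box
        return (xmin + xmax) / 2, (ymin + ymax) / 2

    def _child_for(self, p):
        xm, ym = self._middle()
        north, east = p[1] >= ym, p[0] >= xm
        return self.children[(0 if north else 2) + (1 if east else 0)]

    def _split(self):
        xmin, ymin, xmax, ymax = self.box
        xm, ym = self._middle()
        c = self.capacity
        self.children = [
            QuadTree(xmin, ym, xm, ymax, c),   # NW
            QuadTree(xm, ym, xmax, ymax, c),   # NE
            QuadTree(xmin, ymin, xm, ym, c),   # SW
            QuadTree(xm, ymin, xmax, ym, c),   # SE
        ]
        old, self.points = self.points, []
        for p in old:                          # push entries down to the leaves
            self._child_for(p).insert(p)

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity:
            self._split()

    def range_query(self, qxmin, qymin, qxmax, qymax):
        xmin, ymin, xmax, ymax = self.box
        if qxmax < xmin or qxmin > xmax or qymax < ymin or qymin > ymax:
            return []                          # quadrant misses the window
        if self.children is None:
            return [p for p in self.points
                    if qxmin <= p[0] <= qxmax and qymin <= p[1] <= qymax]
        found = []
        for child in self.children:
            found.extend(child.range_query(qxmin, qymin, qxmax, qymax))
        return found


tree = QuadTree(0, 0, 8, 8, capacity=2)
for point in [(1, 1), (2, 6), (6, 6), (7, 1), (5, 5)]:
    tree.insert(point)
hits = sorted(tree.range_query(4, 4, 8, 8))
```

The third insertion overflows the root, which then splits into four quadrants; the range query afterwards only descends into quadrants that intersect the query window.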
4. The R+ Tree • The directory rectangles at a given level do not overlap • For a point query, a single path is followed from the root to a leaf • The I/O complexity is bounded by the depth of the tree
5. What is Query Processing and Optimization (QPO)? • Basic idea of QPO • In SQL, queries are expressed in high level declarative form • QPO translates a SQL query to an execution plan • over physical data model • using operations on file structures, indices, etc. • Ideal execution plan answers Q in as little time as possible • Constraints: QPO overheads are small • Computation time for QPO steps << that for execution plan
5. QPO Challenges in SDBMS • Building Blocks for spatial queries • Rich set of spatial data types, operations • A consensus on “building blocks” is lacking • Current choices include spatial select, spatial join, nearest neighbor • Choice of strategies • Limited choice for some building blocks, e.g. nearest neighbor • Choosing best strategies • Cost models are more complex since • Spatial Queries are both CPU and I/O intensive • While traditional queries are I/O intensive • Cost models of spatial strategies are not mature.
5. Choice of building blocks • Choice of building blocks • Varies across software vendors and products • List of representative building blocks • Point Query: Name a highlighted city on a digital map. • Returns one spatial object out of a table • Range Query: List all countries crossed by the river Amazon. • Returns several objects within a spatial region from a table • Spatial Join: List all pairs of overlapping rivers and countries. • Returns pairs from 2 tables satisfying a spatial predicate • Nearest Neighbor: Find the city closest to Mount Everest. • Returns one spatial object from a collection
5. Strategies for Spatial Joins • Recall Spatial Join Example: • List all pairs of overlapping rivers and countries. • Return pairs from 2 tables satisfying a spatial predicate • List of strategies • Nested loop: • Test all possible pairs for spatial predicate • All rivers are paired with all countries • Space Partitioning: • Test pairs of objects from common spatial regions only • Rivers in Africa are tested with countries in Africa only! • Tree Matching • Hierarchical pairing of object groups from each table, section 5.1.6 pp.121 • Other, e.g. spatial-join-index based, external plane-sweep, …
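The difference between the nested loop and space partitioning strategies can be sketched on bounding rectangles. This is an illustration under assumptions: rectangles are `(xmin, ymin, xmax, ymax)` tuples, the partitioning is a uniform grid with a hypothetical `cell` size, and the rivers and countries are made-up boxes. Both functions also count how many pairs they test.

```python
from collections import defaultdict
from itertools import product


def overlaps(a, b):
    """True if two (xmin, ymin, xmax, ymax) rectangles share a point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def nested_loop_join(rivers, countries):
    """Test every river against every country."""
    result, tests = set(), 0
    for (rid, r), (cid, c) in product(rivers.items(), countries.items()):
        tests += 1
        if overlaps(r, c):
            result.add((rid, cid))
    return result, tests


def cells_of(rect, cell):
    """Grid cells touched by a rectangle."""
    x0, y0 = int(rect[0] // cell), int(rect[1] // cell)
    x1, y1 = int(rect[2] // cell), int(rect[3] // cell)
    return [(i, j) for i in range(x0, x1 + 1) for j in range(y0, y1 + 1)]


def partition_join(rivers, countries, cell):
    """Test only pairs of objects that fall into a common grid cell."""
    buckets = defaultdict(lambda: ([], []))
    for rid, r in rivers.items():
        for key in cells_of(r, cell):
            buckets[key][0].append(rid)
    for cid, c in countries.items():
        for key in cells_of(c, cell):
            buckets[key][1].append(cid)
    result, tests = set(), 0
    for rids, cids in buckets.values():
        for rid in rids:
            for cid in cids:
                tests += 1
                if overlaps(rivers[rid], countries[cid]):
                    result.add((rid, cid))   # the set removes duplicates
    return result, tests


rivers = {"r1": (1, 1, 2, 8), "r2": (12, 12, 18, 13)}
countries = {"c1": (0, 0, 5, 5), "c2": (0, 6, 5, 9),
             "c3": (11, 11, 19, 19), "c4": (11, 0, 19, 5)}

nl_result, nl_tests = nested_loop_join(rivers, countries)
sp_result, sp_tests = partition_join(rivers, countries, cell=10)
```

Both strategies return the same pairs, but on this data the nested loop tests 8 pairs while the partitioned version tests 3. Objects that span several cells are replicated into each of them, which is why the result is collected in a set.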
5. Query Processing and Optimizer process • A sight-seeing trip • Start: A SQL Query • End: An execution plan • Intermediate Stopovers • query trees • logical tree transforms • strategy selection • What happens after the journey? • Execution plan is executed • Query answer returned Fig 5.2
5. Query Trees • Nodes = building blocks of (spatial) queries • See section 3.2 (pp.55) for symbols sigma, pi and join • Children = inputs to a building block • Leaves = Tables • Example SQL query and its query tree follow: Fig 5.3
5. Logical Transformation of Query Trees • Motivation • Transformations do not change the answer of the query • But can reduce computational cost by • reducing data produced by sub-queries • reducing computation needs of parent node • Example Transformation • Push down select operation below join • Example: Fig. 5.4 (compare w/ Fig 5.3, last slide) • Reduces size of table for join operation • Other common transformations • Push project down • Reorder join operations • ... Fig 5.4
5. Execution Plans • An execution plan has 3 components • A query tree • An ordering of evaluation of non-leaf nodes • A strategy selected for each non-leaf node • Example • Strategies for Query tree in Fig. 5.5 • Use scan for Area(L.Geometry) > 20 • Use index for Fa.Name = ‘Campground’ • Use space-partitioning join for Distance(Fa, L) < 50 • Use on-the-fly for projection • Ordering • As listed above Fig 5.5
7. What is Spatial Data Mining? • Non-trivial search for interesting and unexpected spatial patterns • Non-trivial Search • Large (e.g. exponential) search space of plausible hypotheses • Ex. Asiatic cholera: causes: water, food, air, insects, …; water delivery mechanisms: numerous pumps, rivers, ponds, wells, pipes, ... • Interesting • Useful in certain application domains • Ex. Shutting off the identified water pump => saved human lives • Unexpected • Pattern is not common knowledge • May provide a new understanding of the world • Ex. The water pump - cholera connection led to the “germ” theory
7. Choice of Methods • Two Approaches to mining Spatial Data • Pick spatial features; use classical DM methods • Use novel spatial data mining techniques • Possible Approach: • Define the problem: capture special needs • Explore data using maps, other visualization • Try reusing classical DM methods • If classical DM methods perform poorly, try new methods • Evaluate chosen methods rigorously • Performance tuning as needed
7. Location Prediction as a classification problem • Given: • 1. Spatial framework • 2. Explanatory functions • 3. A dependent class • 4. A family of function mappings • Find: a classification model • Objective: maximize classification_accuracy • Constraint: spatial autocorrelation exists • Color version of Fig. 7.3, pp. 188 (panels: nest locations, distance to open water, vegetation durability, water depth)
7. Techniques for Location Prediction • Classical methods: • logistic regression, decision trees, Bayesian classifier • assume learning samples are independent of each other • Spatial auto-correlation violates this assumption! • Q? What would a map look like if the properties of a pixel were independent of the properties of other pixels? (see below - Fig. 7.4, pp. 189) • New spatial methods • Spatial auto-regression (SAR) • Markov random field • Bayesian classifier
7. Spatial AutoRegression (SAR) • Spatial Autoregression Model (SAR) • $y = \rho W y + X\beta + \epsilon$ • $W$ models neighborhood relationships • $\rho$ models the strength of spatial dependencies • $\epsilon$ is the error vector • Solutions • $\rho$ and $\beta$ can be estimated using maximum likelihood (ML) or Bayesian statistics • e.g., a spatial econometrics package uses a Bayesian approach based on sampling with the Markov Chain Monte Carlo (MCMC) method • Likelihood-based estimation requires $O(n^3)$ operations • Other alternatives: divide and conquer, sparse matrix, LU decomposition, etc.
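Because $y$ appears on both sides of the SAR equation, it can help to solve for it explicitly. The short derivation below writes $\rho$ for the spatial dependence strength, $\beta$ for the regression coefficients and $\epsilon$ for the error vector, and assumes that $I - \rho W$ is invertible.

```latex
\begin{aligned}
y &= \rho W y + X\beta + \epsilon \\
(I - \rho W)\, y &= X\beta + \epsilon \\
y &= (I - \rho W)^{-1} (X\beta + \epsilon)
\end{aligned}
```

With $\rho = 0$ the model reduces to the classical linear regression $y = X\beta + \epsilon$. Handling the $n \times n$ matrix $I - \rho W$ is also where dense-matrix costs arise in estimation, which is consistent with the interest in sparse-matrix and LU-decomposition alternatives.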
7. Associations, Spatial associations, Co-location
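7. Association Rules: Formal Definitions • Consider a set of items, $I = \{i_1, \ldots, i_k\}$ • Consider a set of transactions $T = \{t_1, \ldots, t_n\}$ • where each $t_i$ is a subset of $I$ • Support of $C \subseteq I$: $\sigma(C) = |\{t \mid t \in T,\ C \subseteq t\}|$ • Then $A \rightarrow B$ (with $A, B \subseteq I$) iff • Support: $A \cup B$ occurs in at least $s$ percent of the transactions: $\sigma(A \cup B) / |T| \geq s/100$ • Confidence: at least $c\%$ of the transactions that contain $A$ also contain $B$: $\sigma(A \cup B) / \sigma(A) \geq c/100$ • Example: Table 7.4 (pp. 202) using data in Section 7.4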
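Support and confidence are simple counting measures, as the following sketch shows. The function names and the five transactions over items "a" to "d" are made up for illustration and are not the data of Table 7.4.

```python
from typing import Iterable, List, Set


def support(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions: List[Set[str]]) -> float:
    """Among transactions containing `lhs`, the fraction also containing `rhs`."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


transactions = [
    {"a", "b"},
    {"a", "c", "d"},
    {"b", "c", "d"},
    {"a", "b", "c"},
    {"a", "b", "d"},
]

sup_cd = support({"c", "d"}, transactions)            # 2 of 5 transactions
conf_c_to_d = confidence({"c"}, {"d"}, transactions)  # 2 of the 3 containing "c"
```

A rule such as c → d is reported only if both values reach the chosen thresholds s and c.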
7. Co-location rules vs. association rules
                                  Association rules          Co-location rules
Underlying space                  discrete sets              continuous space
Item-types                        item-types                 events / Boolean spatial features
Collection                        Transaction (T)            Neighborhood (N)
Prevalence measure                support                    participation index
Conditional probability metric    Pr.[ A in T | B in T ]     Pr.[ A in N(L) | B at location L ]
• Participation index = $\min_i \{pr(f_i, c)\}$ • where $pr(f_i, c)$ of feature $f_i$ in co-location $c = \{f_1, f_2, \ldots, f_k\}$ is the fraction of instances of $f_i$ with features $\{f_1, \ldots, f_{i-1}, f_{i+1}, \ldots, f_k\}$ nearby • $N(L)$ = neighborhood of location $L$
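The participation index can be computed directly from the definition above. The sketch assumes that "nearby" means within a Euclidean distance `d`, and that an instance of a feature participates when every other feature of the co-location has at least one instance within that distance. The feature names, coordinates and the threshold are illustrative.

```python
from math import dist
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def participation_ratio(feature: str, colocation: List[str],
                        instances: Dict[str, List[Point]], d: float) -> float:
    """Fraction of instances of `feature` with all other features nearby."""
    others = [f for f in colocation if f != feature]
    participating = 0
    for p in instances[feature]:
        if all(any(dist(p, q) <= d for q in instances[other])
               for other in others):
            participating += 1
    return participating / len(instances[feature])


def participation_index(colocation: List[str],
                        instances: Dict[str, List[Point]], d: float) -> float:
    """Minimum participation ratio over the features of the co-location."""
    return min(participation_ratio(f, colocation, instances, d)
               for f in colocation)


instances = {
    "A": [(0, 0), (10, 10), (20, 20)],
    "B": [(0, 1), (10, 11), (50, 50), (60, 60)],
}
pr_a = participation_ratio("A", ["A", "B"], instances, d=2.0)  # 2 of 3
pr_b = participation_ratio("B", ["A", "B"], instances, d=2.0)  # 2 of 4
pi = participation_index(["A", "B"], instances, d=2.0)
```

Taking the minimum makes the index conservative: the co-location {A, B} is only as prevalent as its least participating feature, here B.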
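7. Spatial Outlier Detection • Compute $Z_{S(x)} = |(S(x) - \mu_s) / \sigma_s|$, where $S(x) = f(x) - E_{y \in N(x)}(f(y))$ and $\mu_s$, $\sigma_s$ are the mean and standard deviation of $S$ • Select points (e.g. S) with $Z_{S(x)}$ above 3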
7. Spatial Outlier Detection: Example • Color version of Fig. 7.19 pp. 219 • Given • A spatial graph $G = \{V, E\}$ • A neighbor relationship (K neighbors) • An attribute function $f : V \rightarrow R$ • Find • $O = \{v_i \mid v_i \in V,\ v_i \text{ is a spatial outlier}\}$ • Spatial Outlier Detection Test • 1. Choice of spatial statistic: $S(x) = f(x) - E_{y \in N(x)}(f(y))$ • 2. Test for outlier detection: $|(S(x) - \mu_s) / \sigma_s| > \theta$ • Rationale • Theorem: $S(x)$ is normally distributed if $f(x)$ is normally distributed
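The test can be sketched on a small graph. Here S(x) is the attribute value at x minus the average over its neighbors, and a node is flagged when the standardized S(x) exceeds a threshold `theta`. The chain-shaped graph, the attribute values and the use of the population standard deviation are assumptions made for this illustration.

```python
from statistics import mean, pstdev
from typing import Dict, Hashable, List


def spatial_outliers(f: Dict[Hashable, float],
                     neighbors: Dict[Hashable, List[Hashable]],
                     theta: float) -> List[Hashable]:
    """Nodes whose difference from their neighborhood average is extreme."""
    # 1. Spatial statistic: own value minus the mean over the neighbors.
    s = {x: f[x] - mean(f[y] for y in neighbors[x]) for x in f}
    # 2. Standardize S over all nodes and compare with the threshold.
    mu, sigma = mean(s.values()), pstdev(s.values())
    if sigma == 0:
        return []  # every node agrees with its neighborhood
    return [x for x in f if abs((s[x] - mu) / sigma) > theta]


# A chain of 20 nodes; all values are 10 except a spike at node 10.
n = 20
f = {i: 10.0 for i in range(n)}
f[10] = 50.0
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

outliers = spatial_outliers(f, neighbors, theta=3.0)
```

Nodes 9 and 11 also get a non-zero S(x), because the spike pulls their neighborhood average up, but only node 10 passes the threshold of 3.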
8. Spatiotemporal Data • Two types of problems: • Indexing the current positions and movements of objects and querying their anticipated future positions. • Indexing and querying the past movements of mobile objects. • On Indexing Mobile Objects • Indexing the Positions of Continuously Moving Objects
Spatiotemporal Data (cont’d) • Indexing current/future locations of mobile objects • The TPR-tree • Like the R-tree, but the MBRs are time-parameterized into conservative bounding intervals (CBIs). • How are the CBIs computed? What is the best way to group objects into a CBI? • By minimizing an objective function (e.g., overlap) over the time the TPR-tree is valid. • How do we answer queries using the TPR-tree?
Conclusion • Good progress… still more work is needed: • Devising clean and complete semantics for data models and operators for spatial data, spatial-temporal data • Efficient implementation • Indexing, query processing, query optimization, cost model • Develop efficient algorithms to mine spatial data • Alternative architectures • spatial-temporal data, moving objects • mobile, wireless applications • web GIS
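The idea of a time-parameterized, conservative bounding interval can be sketched in one dimension. This is a simplified illustration rather than the TPR-tree algorithm itself: objects move linearly, the lower bound moves with the smallest velocity and the upper bound with the largest, and all names and sample values are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MovingPoint:
    x: float  # position at the reference time
    v: float  # constant velocity

    def position(self, t: float, t_ref: float = 0.0) -> float:
        return self.x + self.v * (t - t_ref)


def conservative_interval(objs: List[MovingPoint]) -> Tuple[float, float, float, float]:
    """(lo, hi, v_lo, v_hi): tight at the reference time, growing afterwards."""
    return (min(o.x for o in objs), max(o.x for o in objs),
            min(o.v for o in objs), max(o.v for o in objs))


def interval_at(cbi: Tuple[float, float, float, float],
                t: float, t_ref: float = 0.0) -> Tuple[float, float]:
    """Bounds at time t >= t_ref; they enclose every object in the group."""
    lo, hi, v_lo, v_hi = cbi
    dt = t - t_ref
    return lo + v_lo * dt, hi + v_hi * dt


def may_intersect(cbi, q_lo: float, q_hi: float, t: float) -> bool:
    """Filter test for a query window [q_lo, q_hi] at a future time t."""
    lo, hi = interval_at(cbi, t)
    return lo <= q_hi and q_lo <= hi


objs = [MovingPoint(0.0, 1.0), MovingPoint(5.0, -1.0), MovingPoint(2.0, 0.5)]
cbi = conservative_interval(objs)
bounds_t3 = interval_at(cbi, 3.0)
```

At time 3 the objects are at 3.0, 2.0 and 3.5, while the interval has grown to (-3.0, 8.0). The bound stays valid but becomes loose, which is why grouping objects to minimize an objective such as overlap over the tree's lifetime matters.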