Nearest Neighbor Search in Spatial and Spatiotemporal Databases

Dimitris Papadias, Hong Kong University of Science and Technology

Spatial and spatiotemporal databases • Spatial databases manage large collections of multi-dimensional objects. • Important query types: • Window query: Retrieve all rivers in CA • Nearest neighbor: Find my nearest gas station • Spatial join: Report pairs of (city C, river R) such that R crosses C • Spatiotemporal databases deal with the same queries but assume moving objects. Applications: • Mobile computing • Traffic supervision • Flight control • Weather forecasting

R-trees [Guttman, SIGMOD 84; Sellis et al., VLDB 87; Beckmann et al., SIGMOD 90]

TPR-trees [Saltenis et al., SIGMOD 00; our group, VLDB 03] • Extend the R-tree by introducing the velocity bounding rectangle (VBR) in non-leaf entries. • Objects are grouped together based on both their locations and velocities.

Conventional NN search with R- (TPR-) trees • Depth-first [Roussopoulos et al., SIGMOD 95] • Best-first traversal [Hjaltason and Samet, TODS 99]: incremental and optimal
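As a rough illustration of best-first traversal, the sketch below keeps a priority queue ordered by mindist and reports points in ascending distance. The dictionary-based tree, the helper names (`leaf`, `inner`, `best_first_nn`) and the 2D rectangles are assumptions made for this example; it is not the cited algorithm's actual implementation.

```python
import heapq
import math
from itertools import count

def mindist(q, mbr):
    """Smallest Euclidean distance from point q to rectangle (xlo, ylo, xhi, yhi)."""
    dx = max(mbr[0] - q[0], 0.0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0.0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def leaf(points):
    """Hypothetical leaf node: a list of points plus their bounding rectangle."""
    xs, ys = zip(*points)
    return {"mbr": (min(xs), min(ys), max(xs), max(ys)), "points": points}

def inner(children):
    """Hypothetical non-leaf node: child nodes plus the rectangle enclosing them."""
    boxes = [c["mbr"] for c in children]
    mbr = (min(b[0] for b in boxes), min(b[1] for b in boxes),
           max(b[2] for b in boxes), max(b[3] for b in boxes))
    return {"mbr": mbr, "children": children}

def best_first_nn(root, q):
    """Yield (distance, point) pairs in ascending distance from q."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(mindist(q, root["mbr"]), next(tie), root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if "point" in item:            # a data point reached the top: report it
            yield d, item["point"]
        elif "points" in item:         # leaf: enqueue its points by actual distance
            for p in item["points"]:
                heapq.heappush(heap, (math.dist(q, p), next(tie), {"point": p}))
        else:                          # non-leaf: enqueue children by mindist
            for child in item["children"]:
                heapq.heappush(heap, (mindist(q, child["mbr"]), next(tie), child))

root = inner([leaf([(1, 1), (2, 3)]), leaf([(8, 8), (9, 6)])])
print(next(best_first_nn(root, (3, 3))))  # (1.0, (2, 3))
```

Because the generator is lazy, a caller can stop after the first result (single NN) or keep pulling results to obtain further neighbors incrementally.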

NN search - other approaches • Several algorithms and theoretical performance bounds have been devised for exact and approximate processing in main memory. Here we care about I/O efficiency (minimizing node and page accesses) as well as cost models that predict practical performance (suitable for query optimization). • Several approaches target NN search in high-dimensional spaces (but the problem is different due to the dimensionality curse). Here we consider low-dimensional spaces (spatial and spatiotemporal databases). • Ferhatosmanoglu et al. [SSTD 01] discover the NN in a constrained area of the data space (e.g., find the NN to the south of the query point). • Korn and Muthukrishnan [SIGMOD 00] discuss reverse nearest neighbor queries, where the goal is to retrieve the data points whose nearest neighbor is a specified query point. • Korn et al. [VLDB 02] study the same problem in the context of data streams, where the data are not known in advance.

NN search for mobile queries • [Zheng and Lee, SSTD 01]: return the current NN and the validity time of the result. • Restrictions: (i) assumes a maximum speed, (ii) applies only to a single NN, (iii) requires Voronoi diagrams. • [Song and Roussopoulos, SSTD 01]: minimize the number of queries for moving clients by returning m>k NNs. • Problem: how to determine m. IF 2·dist(q,q') ≤ dist(q,b) − dist(q,a), THEN the 2 NNs at q' are among the 4 NNs of the first query.

Time parameterized NN (our group, SIGMOD 02) • Assuming a constant and known velocity, a TP NN query returns: • The current query result R • The validity period T of R • The change C of the result at the end of T • Example result: R={i}, T=2, C={j}

TP NN queries: Influence Time • Some objects have “infinite” influence time. • The object that will become the next nearest neighbor is the one with the minimum influence time.
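A minimal sketch of the influence-time computation, under the assumption (made here for simplicity) that the data points are static and only the query moves with constant velocity v. In that case the difference of squared distances to the candidate o and to the current NN is linear in t, so the crossing time has a closed form. The function names are hypothetical.

```python
import math

def influence_time(q, v, nn, o):
    """Earliest t >= 0 at which o becomes as close to q + v*t as the current NN."""
    # dist(q(t), o)^2 - dist(q(t), nn)^2 = a - b*t, with a >= 0 while nn is the NN
    a = math.dist(q, o) ** 2 - math.dist(q, nn) ** 2
    b = 2 * sum(vi * (oi - ni) for vi, oi, ni in zip(v, o, nn))
    if b <= 0:
        return math.inf  # the query never gets relatively closer to o
    return a / b

def tp_nn(q, v, points):
    """Return (R, T, C): current NN, validity period, and the object(s) that replace it."""
    nn = min(points, key=lambda p: math.dist(q, p))
    times = {p: influence_time(q, v, nn, p) for p in points if p != nn}
    t = min(times.values(), default=math.inf)
    change = [p for p, tp in times.items() if tp == t and t < math.inf]
    return nn, t, change

print(tp_nn((0, 0), (1, 0), [(1, 0), (5, 0), (0, 4)]))  # ((1, 0), 3.0, [(5, 0)])
```

In the usage line, the point (0, 4) has infinite influence time because the query moves away from it relative to the current NN, while (5, 0) takes over at t = 3, when the query at (3, 0) is equidistant from (1, 0) and (5, 0).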

Processing TP NN with R- (TPR-) trees • Influence time of an MBR: the earliest possible time that any object in the MBR can become the new NN. • Algorithm: traverse the R-tree depth-first or best-first, using the influence time instead of the mindist. • The cost of TP NN queries is about the same as that of conventional queries because we have to visit the influencing nodes anyway (to find the NN).

Continuous Nearest Neighbors (CNN) (our group, VLDB 02) Given a line segment q=[s,e], find the NN of every point on q. Result representation: {s(.NN=a), s1(.NN=c), s2(.NN=f), s3(.NN=h), e}. The points (s, s1, s2, s3,e) are the split points.

Main idea • Maintain the set of split points incrementally. [Figure panels: after processing a; after processing c]
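The incremental maintenance of split points can be sketched as follows. This is an in-memory illustration that examines every data point (no R-tree, no pruning), and the parametrization of the segment by t in [0, 1] and all function names are assumptions of this example. It relies on the fact that, along a line segment, the difference of squared distances to two points is linear in t, so each existing interval is split at most once by a new point.

```python
def sqdist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def point_at(s, e, t):
    return (s[0] + t * (e[0] - s[0]), s[1] + t * (e[1] - s[1]))

def continuous_nn(s, e, points):
    """Return [(t, nn), ...]: nn is the NN of s + t*(e - s) from t up to the next entry (or t = 1)."""
    points = list(points)
    split_list = [(0.0, points[0])]
    for p in points[1:]:
        updated = []
        for i, (t0, o) in enumerate(split_list):
            t1 = split_list[i + 1][0] if i + 1 < len(split_list) else 1.0
            x0, x1 = point_at(s, e, t0), point_at(s, e, t1)
            f0 = sqdist(x0, p) - sqdist(x0, o)   # negative where p is closer than o
            f1 = sqdist(x1, p) - sqdist(x1, o)
            if f0 >= 0 and f1 >= 0:              # o stays the NN on the whole interval
                pieces = [(t0, o)]
            elif f0 <= 0 and f1 <= 0:            # p replaces o on the whole interval
                pieces = [(t0, p)]
            else:                                # one crossing: new split point
                t_star = t0 + (t1 - t0) * f0 / (f0 - f1)
                pieces = [(t0, p), (t_star, o)] if f0 < 0 else [(t0, o), (t_star, p)]
            for piece in pieces:                 # merge neighbors with the same NN
                if not updated or updated[-1][1] != piece[1]:
                    updated.append(piece)
        split_list = updated
    return split_list

print(continuous_nn((0, 0), (10, 0), [(1, 1), (5, 1), (9, 1)]))
```

For the usage line, the split points fall where the perpendicular bisectors of consecutive neighbors cross the segment, at x = 3 and x = 7.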

Processing TP NN with an R- (TPR-) tree • Avoid examination of all points. • Given an MBR E and query segment q, E must be searched if and only if there exists a split point s_i ∈ SL such that dist(s_i, s_i.NN) > mindist(s_i, E).

Location Based NN queries (LBNN) (our group, SIGMOD 03) • A location-based kNN query q returns: • The current k NNs • A validity region such that the result remains the same as long as q remains in the region. • The validity region of q is the Voronoi Cell (VC) of the NN o.

Computing the Voronoi Cell on-the-fly • Step 1 – Find the current NN • Step 2 – Use TP NN queries to tighten the validity region

NN queries in road networks (our group, VLDB 03) • Find my nearest gas station in terms of driving distance. • Answer: Hotel b (the Euclidean NN is d) Assumptions: • We can incrementally compute Euclidean NN using conventional NN algorithms. • We can compute the network distance between the query and any point (i.e., the length of the shortest path connecting them) using Dijkstra's algorithm.

Euclidean Restriction Algorithm [Figure panels: 1st Euclidean NN; 2nd Euclidean NN]
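One way to read the Euclidean restriction idea is sketched below, under several simplifying assumptions that are not in the slides: the query and the objects sit on graph nodes, every edge is at least as long as the straight line between its endpoints (so the Euclidean distance never exceeds the network distance), and a sorted list stands in for incremental Euclidean NN retrieval. Candidates are examined in Euclidean order, and the search stops once the next candidate's Euclidean distance is no smaller than the best network distance found so far.

```python
import heapq
import math

def network_distance(graph, src, dst):
    """Dijkstra shortest-path length; graph maps node -> [(neighbor, edge_length), ...]."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for w, length in graph[u]:
            nd = d + length
            if nd < dist.get(w, math.inf):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return math.inf

def euclidean_restriction_nn(graph, coords, objects, q):
    """Return (object, network distance) of the network NN of node q."""
    # Stand-in for incremental Euclidean NN search on an R-tree.
    by_euclid = sorted(objects, key=lambda o: math.dist(coords[q], coords[o]))
    best, best_d = None, math.inf
    for o in by_euclid:
        if math.dist(coords[q], coords[o]) >= best_d:
            break  # no remaining candidate can have a shorter network distance
        d = network_distance(graph, q, o)
        if d < best_d:
            best, best_d = o, d
    return best, best_d

coords = {"q": (0, 0), "d": (0, 2), "b": (3, 0), "x": (3, 2)}
graph = {
    "q": [("b", 3.0)],
    "b": [("q", 3.0), ("x", 2.0)],
    "x": [("b", 2.0), ("d", 3.0)],
    "d": [("x", 3.0)],
}
print(euclidean_restriction_nn(graph, coords, ["d", "b"], "q"))  # ('b', 3.0)
```

In this toy network d is the Euclidean NN (distance 2), but reaching it requires driving 8 units, so b is the network NN.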

Network Expansion Algorithm

NN in the presence of obstacles (not published) • The NN of q in terms of obstructed distance is b, although the Euclidean NN is a.

Visibility graphs • Have been used widely in computational geometry for shortest path problems (e.g., find the shortest path from p_start to p_end that does not cross any obstacle). • Problem: We cannot maintain the entire visibility graph in memory for real spatial datasets. • Solution: We only need the obstacles and objects that affect the result of the query.

Obstacle nearest neighbor algorithm • Idea: Similar to the Euclidean Restriction algorithm for road networks. • BUT how do we perform the obstructed distance computations?

Obstructed distance computation • Goal: compute the obstructed distance between p and q. • First retrieve obstacles o1, o2 in the Euclidean range. • Compute a provisional distance d1(p,q) using only o1, o2. • d1(p,q) is not enough because the shortest path is obstructed by o3. • Perform a second Euclidean range query on the obstacle R-tree using d1(p,q) and retrieve o3, o4. • Compute a new obstructed distance d2(p,q) taking o3, o4 into account. • Repeat the process until the obstructed distance remains the same for two consecutive iterations.
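The iteration above can be sketched as a small loop. The two callables are hypothetical placeholders supplied by the caller: `obstacles_in_range(center, radius)` stands for the Euclidean range query on the obstacle R-tree, and `shortest_path_length(p, q, obstacles)` stands for a shortest-path computation on a visibility graph built from the retrieved obstacles. Centering the range at q is also an assumption of this sketch.

```python
import math

def obstructed_distance(p, q, obstacles_in_range, shortest_path_length):
    """Enlarge the search range until the provisional distance stops changing."""
    radius = math.dist(p, q)   # the first range is the Euclidean distance
    previous = None
    while True:
        obstacles = obstacles_in_range(q, radius)
        d = shortest_path_length(p, q, obstacles)
        if d == previous:      # same distance in two consecutive iterations
            return d
        previous, radius = d, d
```

Each provisional distance is used as the radius of the next range query, so obstacles that were outside the earlier range but could block the provisional path are picked up in the following round.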

Other related work By our group: Concepts similar to the ones presented here apply to several other spatial queries, e.g., TP spatial joins and continuous window queries. • Cost Models for TP and continuous queries [TODS 03]. • Analysis of predictive NN (and other) queries [TODS to appear]. • An Efficient Cost Model for Optimization of Nearest Neighbor Search in Low and Medium Dimensional Spaces [TKDE to appear]. By other groups: increasing interest in novel types of NN search in the context of mobile computing and data stream applications • Iwerks et al. [VLDB 03] discuss continuous NN in the presence of object updates. • Shekhar et al. [ACM GIS 03] discuss the in-route nearest neighbor query, which, given a trajectory, retrieves the single NN (e.g., gas station) that results in the minimum diversion from the trajectory. • Jensen et al. [ACM GIS 03] discuss NN for objects moving on road networks.

Group NN queries (our group, ICDE 04) • Input: a set P={p1,…,pN} of static data points in multidimensional space and a group of query points Q={q1,…,qn}. • Output: the k (≥1) data point(s) with the smallest sum of distances to all points in Q. The distance between a data point p and Q is defined as $\mathrm{dist}(p,Q)=\sum_{i=1}^{n}|pq_i|$, where $|pq_i|$ is the Euclidean distance between p and query point qi. • Example: three users at locations q1, q2 and q3 want to find a meeting point (e.g., a restaurant); the corresponding query returns the data point p that minimizes the sum of Euclidean distances $|pq_i|$ for 1 ≤ i ≤ 3. • Assumption: the data points are indexed by an R-tree. Q may or may not fit in main memory.
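To make the definition concrete, here is a brute-force sketch that evaluates the sum of distances for every data point. It ignores the R-tree entirely and is only a reference for what a group NN query returns; the function names are illustrative.

```python
import math

def group_dist(p, Q):
    """Sum of Euclidean distances from data point p to every query point in Q."""
    return sum(math.dist(p, q) for q in Q)

def group_nn(P, Q, k=1):
    """Return the k data points with the smallest sum of distances to Q."""
    return sorted(P, key=lambda p: group_dist(p, Q))[:k]

P = [(0, 0), (2, 2), (5, 5)]
Q = [(1, 1), (3, 3), (2, 1)]
print(group_nn(P, Q))  # [(2, 2)]
```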

Multiple Query Method (MQM) • Idea: Perform incremental NN queries for each point in Q and combine their results. • <p10, 7>, <p11, 6>, T=5 (2+3) • <p11, 7> T=6 (3+3) MQM terminates • Problem: MQM may visit the same node and discover the same data point many times (for different query points).
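A minimal sketch of the MQM idea, assuming a threshold-style termination test: each query point keeps the distance of its most recently retrieved neighbor as a threshold, and the search stops once the best sum of distances found so far does not exceed the sum T of the thresholds, since any point not yet retrieved by any stream must have a sum of at least T. A sorted list stands in for an incremental R-tree NN search, and all names are hypothetical.

```python
import math

def incremental_nn(P, q):
    """Stand-in for incremental NN search: yields (distance, point) in ascending order."""
    yield from sorted((math.dist(p, q), p) for p in P)

def mqm(P, Q):
    """Return (point, sum of distances) of the group nearest neighbor of Q."""
    streams = [incremental_nn(P, q) for q in Q]
    thresholds = [0.0] * len(Q)
    best, best_dist = None, math.inf
    while True:
        for i, stream in enumerate(streams):   # round-robin over the query points
            nxt = next(stream, None)
            if nxt is None:                    # this stream has seen every data point
                return best, best_dist
            thresholds[i], p = nxt
            d = sum(math.dist(p, q) for q in Q)
            if d < best_dist:
                best, best_dist = p, d
            if best_dist <= sum(thresholds):   # no unseen point can do better
                return best, best_dist

print(mqm([(0, 0), (2, 2), (5, 5)], [(1, 1), (3, 3), (2, 1)])[0])  # (2, 2)
```

The sketch also exhibits the drawback noted above: the same data point can be retrieved by several streams, and its global distance is then recomputed each time.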

Minimum Bounding Method (MBM) • Applies the MBR of Q to prune the search space. • Heuristic 1: Let M be the MBR of Q, and best_dist be the distance of the best GNN found so far. A node N cannot contain qualifying points, if: • Heuristic 2: A node N cannot contain qualifying points, if:
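Two lower bounds follow directly from the definitions of M and best_dist and can serve as pruning tests; they are offered here as a sketch, and the exact form used on the slide may differ. First, every query point lies inside M, so any point in N has a sum of distances of at least n · mindist(N, M). Second, a tighter but costlier bound sums mindist(N, q_i) over the individual query points. Rectangles are (xlo, ylo, xhi, yhi) tuples and the function names are hypothetical.

```python
import math

def mindist_point_rect(q, r):
    dx = max(r[0] - q[0], 0.0, q[0] - r[2])
    dy = max(r[1] - q[1], 0.0, q[1] - r[3])
    return math.hypot(dx, dy)

def mindist_rect_rect(a, b):
    dx = max(a[0] - b[2], 0.0, b[0] - a[2])
    dy = max(a[1] - b[3], 0.0, b[1] - a[3])
    return math.hypot(dx, dy)

def prune_by_query_mbr(node, M, n, best_dist):
    """Cheap test: one rectangle-to-rectangle distance, scaled by the number of query points."""
    return n * mindist_rect_rect(node, M) >= best_dist

def prune_by_query_points(node, Q, best_dist):
    """Tighter test: one point-to-rectangle distance per query point."""
    return sum(mindist_point_rect(q, node) for q in Q) >= best_dist

Q = [(0, 0), (2, 0), (1, 2)]
M = (0, 0, 2, 2)
node = (5, 0, 6, 1)
print(prune_by_query_mbr(node, M, len(Q), 10.0))   # False: bound is 3 * 3 = 9
print(prune_by_query_points(node, Q, 10.0))        # True: bound is 5 + 3 + sqrt(17)
```

A natural order is to apply the cheap test first and fall back to the per-point test only for nodes that survive it.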

File Multiple Query Method (F-MQM) What happens if Q does not fit in memory? • F-MQM sorts the query points according to their Hilbert value and splits Q into blocks {Q1, .., Qm} that fit in memory. • For each block, it computes the GNN using one of the main-memory algorithms. • It finally combines their results using MQM. Complication: once a NN of a group has been retrieved, we cannot compute its global distance (i.e., with respect to all query points) immediately.

F-MQM (cont) Solution: lazy evaluation: • First we find the GNN p1 of the first group Q1 • Then, we load in memory the second group Q2 and retrieve its NN p2. At the same time, we also compute the distance between p1 and Q2. • Similarly, when we load Q3, we update the current distances of p1 and p2 taking into account the objects of the third group. • After the end of the first round, we only have one data point (p1), whose global distance with respect to all query points has been computed.

File Minimum Bounding Method (F-MBM) • First, the points of Q are sorted by their Hilbert value and are assigned to groups (that fit in memory) according to this order. • For each group Qi, F-MBM keeps in memory its MBR Mi and cardinality ni (but not its contents). • F-MBM descends the R-tree of P (in depth-first or best-first traversal), only following nodes that may contain qualifying points. Heuristic: Let best_dist be the distance of the best GNN found so far. A node N can be safely pruned if:
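A pruning bound of this kind can be derived from the quantities F-MBM keeps in memory; the derivation below is a sketch consistent with the definitions above, not necessarily the exact formula on the slide. The symbol m (the number of groups) is introduced here for illustration. For any data point p inside node N and any query point q in group Q_i, |pq| ≥ mindist(N, M_i), because q lies inside M_i. Summing over all groups gives:

```latex
\mathrm{dist}(p,Q)
  \;=\; \sum_{i=1}^{m} \sum_{q \in Q_i} |pq|
  \;\ge\; \sum_{i=1}^{m} n_i \cdot \mathrm{mindist}(N, M_i)
```

Under this bound, N cannot contain a better result whenever $\sum_{i=1}^{m} n_i \cdot \mathrm{mindist}(N, M_i) \ge best\_dist$, and the test needs only the MBRs and cardinalities of the groups, not their contents.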
