Multimedia DBs

Multimedia DBs • A multimedia database stores text, strings, and images • Similarity queries (content-based retrieval) • Given an image, find the images in the database that are similar to it (or "describe" the query image instead) • Extract features, index them in feature space, and answer similarity queries using GEMINI • Again, average values help! (Used in QBIC, IBM Almaden)

Image Features • Features extracted from an image are based on: • Color distribution • Shapes and structure • …

Images - color • Q: What is an image? A: a 2-d RGB array

Images - color • Color histograms and a distance function between them

Images - color • Mathematically, the distance function between a vector x and a query q is: $D(x, q) = (x-q)^T A (x-q) = \sum_{i,j} a_{ij} (x_i-q_i)(x_j-q_j)$ • A = I?
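The quadratic-form distance can be sketched in a few lines of plain Python. This is only an illustration: the function name and the 2-bin matrices are made up for the example, and the off-diagonal value 0.5 is an assumed similarity between neighboring color bins, not a value from QBIC.

```python
def quadratic_form_distance(x, q, A):
    """D(x, q) = (x - q)^T A (x - q), with vectors and A as plain lists."""
    diff = [xi - qi for xi, qi in zip(x, q)]
    n = len(diff)
    return sum(A[i][j] * diff[i] * diff[j] for i in range(n) for j in range(n))


# A = I reduces D to the squared Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical matrix with cross-talk between the two bins.
cross_talk = [[1.0, 0.5], [0.5, 1.0]]

x, q = [1.0, 0.0], [0.0, 1.0]
d_identity = quadratic_form_distance(x, q, identity)
d_cross = quadratic_form_distance(x, q, cross_talk)
```

With A = I the two histograms are 2.0 apart; with the assumed cross-talk matrix the same pair is only 1.0 apart, because the mass moved to a "similar" bin is partly forgiven.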

Images - color • Problem: 'cross-talk': features are not orthogonal -> SAMs (spatial access methods) will not work properly • Q: What to do? A: It is a feature-extraction question

Images - color • Possible answer: use avg red, avg green, avg blue as the features • It turns out that the distance on these averages lower-bounds the histogram distance -> no cross-talk; SAMs are applicable
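As a rough illustration of the average-color feature, the sketch below reduces an image to its (avg red, avg green, avg blue) vector and compares two images with a plain Euclidean distance. The image layout (a list of rows of `(r, g, b)` tuples) and the function names are assumptions for the example. The code does not prove the lower-bounding property; that depends on how the histogram distance is defined.

```python
import math


def average_rgb(image):
    """Return (avg red, avg green, avg blue) of a 2-d array of (r, g, b) pixels."""
    pixels = [pixel for row in image for pixel in row]
    n = len(pixels)
    return tuple(sum(pixel[c] for pixel in pixels) / n for c in range(3))


def avg_rgb_distance(image_a, image_b):
    """Euclidean distance between the 3-d average-color features."""
    return math.dist(average_rgb(image_a), average_rgb(image_b))


red_and_black = [[(255, 0, 0), (0, 0, 0)]]
all_black = [[(0, 0, 0), (0, 0, 0)]]
feature = average_rgb(red_and_black)
distance = avg_rgb_distance(red_and_black, all_black)
```

Because the feature has only three orthogonal dimensions, it can be stored in an ordinary spatial index and used as a cheap filter before the full histogram distance is computed.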

Images - color • Time performance [Figure: time vs. selectivity, comparing seq scan and w/ avg RGB]

Images - shapes • Distance function: Euclidean, on the area, perimeter, and 20 'moments' • Q: How to normalize them? A: Divide by the standard deviation
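A minimal sketch of the normalization step, assuming each shape is a row of raw feature values (area, perimeter, moments) and that the population standard deviation of each column is used. The function name and the guard for constant columns are illustrative choices.

```python
import statistics


def normalize_by_std(rows):
    """Divide every feature (column) by its standard deviation across all rows."""
    columns = list(zip(*rows))
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(column) or 1.0 for column in columns]
    return [[value / std for value, std in zip(row, stds)] for row in rows]


# Two features on very different scales (say, area and one moment).
raw = [[100.0, 1.0], [300.0, 3.0]]
normalized = normalize_by_std(raw)
```

After this step a Euclidean distance no longer lets the feature with the largest numeric range dominate.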

Images - shapes • Distance function: Euclidean, on the area, perimeter, and 20 'moments' • Q: Other 'features' / distance functions? • A1: turning angle • A2: dilations/erosions • A3: ...

Images - shapes • Distance function: Euclidean, on the area, perimeter, and 20 'moments' • Q: How to do dim. reduction? A: Karhunen-Loeve (= centered PCA/SVD)

Images - shapes • Performance: ~10x faster [Figure: log(# of I/Os) vs. # of features kept, including "all kept"]

Dimensionality Reduction • Many problems (like time-series and image similarity) can be expressed as proximity problems in a high-dimensional space • Given a query point, we try to find the points that are close to it • But in high-dimensional spaces things are different!

Effects of High-dimensionality • Assume a uniformly distributed set of points in the high-dimensional unit cube $[0,1]^d$ • Take a query with side length 0.1 in each dimension -> query selectivity in 100-d is $10^{-100}$ • If we want constant selectivity (0.1), the side length must be ~1!
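The arithmetic behind these two numbers is short enough to check directly. The sketch assumes uniform data in the unit cube and a query box that lies fully inside it; the function names are illustrative.

```python
def box_selectivity(side, d):
    """Fraction of uniform points in [0,1]^d that fall inside a box of the given side."""
    return side ** d


def side_for_selectivity(selectivity, d):
    """Side length a d-dimensional box needs in order to capture the given fraction."""
    return selectivity ** (1.0 / d)


tiny = box_selectivity(0.1, 100)          # about 1e-100
needed = side_for_selectivity(0.1, 100)   # about 0.977
```

So a box that returns 10% of the data in 100 dimensions must span roughly 98% of every axis.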

Effects of High-dimensionality • Surface is everything! • Probability that a point is closer than 0.1 to a (d-1)-dimensional surface ($1 - 0.8^d$ for uniform points in $[0,1]^d$): • d = 2: 0.36 • d = 10: ~0.89 • d = 100: ~1
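A quick check of these probabilities, under the assumption that "surface" means any face of the unit cube and that points are uniform: a point avoids all faces only if every coordinate lies in [0.1, 0.9], which has probability 0.8 per dimension. The function name is illustrative.

```python
def prob_near_surface(d, eps=0.1):
    """P(a uniform point in [0,1]^d is within eps of some face of the cube)."""
    return 1.0 - (1.0 - 2.0 * eps) ** d


p2 = prob_near_surface(2)      # 1 - 0.8**2
p10 = prob_near_surface(10)    # 1 - 0.8**10
p100 = prob_near_surface(100)  # 1 - 0.8**100
```

The value reaches 0.36 at d = 2, roughly 0.89 at d = 10, and is indistinguishable from 1 at d = 100.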

Effects of High-dimensionality • Number of grid cells and surfaces • Number of k-dimensional surfaces in a d-dimensional hypercube • Binary partitioning -> $2^d$ cells • Indexing in high dimensions is extremely difficult: the "curse of dimensionality"

X-tree • Performance impacted by the amount of overlap between index nodes • Need to follow different paths • Overlap, multi-overlap, weighted overlap • R*-tree when overlap is small • Sequential access when overlap is large • When an overflow occurs • Split into two nodes if overlap is small • Otherwise create a super-node with twice the capacity • Tradeoffs made locally over different regions of data space • No performance comparisons with linear scan!

Pyramid Tree • Designed for Range queries • Map each d-dimensional point to 1-d value • Build B+-tree on 1-d values • A range query is transformed into a set of 1-d ranges • More efficient than X-tree, Hilbert order, and sequential scan

Pyramid transformation • $2d$ pyramids (two per dimension) with their tops at the center of the data space • Points in different pyramids are ordered by pyramid id • Points within a pyramid are ordered by height • value(v) = pyramid(v) + height(v)
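The mapping value(v) = pyramid(v) + height(v) might look as follows. This is a sketch that assumes the data space is the unit cube $[0,1]^d$ with center (0.5, ..., 0.5), and that pyramids are numbered 0..d-1 for the "low" side of each dimension and d..2d-1 for the "high" side; the function name is illustrative.

```python
def pyramid_value(v):
    """Map a point in [0,1]^d to a 1-d key: pyramid id plus height in that pyramid."""
    d = len(v)
    # The dimension in which the point is farthest from the center picks the pyramid.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    pyramid = j_max if v[j_max] < 0.5 else j_max + d
    # Height is the distance from the center along that dimension (at most 0.5).
    height = abs(0.5 - v[j_max])
    return pyramid + height


low_side = pyramid_value((0.1, 0.6))   # pyramid 0, height 0.4
high_side = pyramid_value((0.5, 0.9))  # pyramid 3 (= 1 + d), height 0.4
```

Since the height never exceeds 0.5, keys from different pyramids cannot interleave, so a B+-tree on the keys keeps each pyramid contiguous.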

Vector Approximation (VA) file • Tile the d-dimensional data space uniformly • A fixed number of bits in each dimension (8) • 256 partitions along each dimension • $256^d$ tiles • Approximate each point by its corresponding tile • Size of approximation = 8d bits = d bytes • Size of each point = 4d bytes (assuming a word per dimension) • 2-step approach, the first step using the VA file
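A small sketch of the approximation and of the distance bounds it yields, assuming points lie in $[0,1]^d$, 8 bits per dimension, and Euclidean distance. The names `approximate` and `bounds` are made up for the example.

```python
import math

BITS = 8
CELLS = 1 << BITS  # 256 partitions along each dimension


def approximate(point):
    """One byte per dimension: the index of the tile slice containing the coordinate."""
    return bytes(min(int(x * CELLS), CELLS - 1) for x in point)


def bounds(q, approx):
    """Lower and upper bound on the Euclidean distance from q to any point in the tile."""
    lo_sq = hi_sq = 0.0
    for qj, cell in zip(q, approx):
        low, high = cell / CELLS, (cell + 1) / CELLS
        near = max(low - qj, qj - high, 0.0)  # gap to the nearest edge of the slice
        far = max(qj - low, high - qj)        # gap to the farthest edge of the slice
        lo_sq += near * near
        hi_sq += far * far
    return math.sqrt(lo_sq), math.sqrt(hi_sq)


point, query = (0.3, 0.7), (0.9, 0.1)
lb, ub = bounds(query, approximate(point))
```

Here `lb <= dist(query, point) <= ub` holds for every point in the tile, which is what lets the first step discard vectors without reading them.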

Simple NN searching
• δ = distance to kth NN so far
• For each approximation a_i
  • If lb(q, a_i) < δ then
    • Compute r = distance(q, v_i)
    • If r < δ then
      • Add point i to the set of NNs
      • Update δ
• Performance depends on the ordering of the vectors and their approximations
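The loop above can be sketched in Python as follows. The grid helpers (`approximate`, `lower_bound`) are assumptions for the example: unit-cube data, 256 slices per dimension, Euclidean distance. A bounded max-heap stands in for "the set of NNs".

```python
import heapq
import math

CELLS = 256


def approximate(point):
    """Tile index per dimension for a point in [0,1]^d."""
    return tuple(min(int(x * CELLS), CELLS - 1) for x in point)


def lower_bound(q, approx):
    """Smallest possible Euclidean distance from q to a point in the tile."""
    total = 0.0
    for qj, cell in zip(q, approx):
        gap = max(cell / CELLS - qj, qj - (cell + 1) / CELLS, 0.0)
        total += gap * gap
    return math.sqrt(total)


def simple_knn(q, vectors, approximations, k):
    """Scan all approximations; fetch a vector only if its lower bound beats delta."""
    best = []  # max-heap of the k best so far, stored as (-distance, index)
    exact_computations = 0
    for i, a in enumerate(approximations):
        delta = -best[0][0] if len(best) == k else math.inf
        if lower_bound(q, a) < delta:
            r = math.dist(q, vectors[i])
            exact_computations += 1
            if r < delta:
                if len(best) == k:
                    heapq.heapreplace(best, (-r, i))  # drop the current kth NN
                else:
                    heapq.heappush(best, (-r, i))
    return sorted((-neg, i) for neg, i in best), exact_computations
```

The second return value counts how many full vectors were read; it depends on the scan order, which is the point made in the last bullet.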

Near-optimal NN searching
• δ = kth smallest ub(q, a) so far
• For each approximation a_i
  • Compute lb(q, a_i) and ub(q, a_i)
  • If lb(q, a_i) <= δ then
    • If ub(q, a_i) < δ then
      • Add point i to the set of NNs
      • Update δ
    • InsertHeap(Heap, lb(q, a_i), i)

Near-optimal NN searching (2)
• δ = distance to kth NN so far
• Repeat
  • Examine the next entry (l_i, i) from the heap
  • If δ < l_i then break
  • Else
    • Compute r = distance(q, v_i)
    • If r < δ then
      • Add point i to the set of NNs
      • Update δ
• Forever
• Sub-linear (log n) number of vectors examined after the first phase
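The two phases can be put together in one sketch. As before, the grid helpers are assumptions for the example (unit-cube data, 256 slices per dimension, Euclidean distance), and the heaps are one possible way to hold δ and the candidate list.

```python
import heapq
import math

CELLS = 256


def approximate(point):
    return tuple(min(int(x * CELLS), CELLS - 1) for x in point)


def bounds(q, approx):
    """(lb, ub) on the Euclidean distance from q to any point in the tile."""
    lo_sq = hi_sq = 0.0
    for qj, cell in zip(q, approx):
        low, high = cell / CELLS, (cell + 1) / CELLS
        near = max(low - qj, qj - high, 0.0)
        far = max(qj - low, high - qj)
        lo_sq += near * near
        hi_sq += far * far
    return math.sqrt(lo_sq), math.sqrt(hi_sq)


def push_bounded(heap, item, k):
    """Keep only the k smallest items in a max-heap of negated keys."""
    if len(heap) == k:
        heapq.heapreplace(heap, item)
    else:
        heapq.heappush(heap, item)


def near_optimal_knn(q, vectors, approximations, k):
    # Phase 1: scan approximations only; delta is the kth smallest upper bound.
    upper = []       # max-heap (negated) of the k smallest upper bounds
    candidates = []  # min-heap of (lower bound, index)
    for i, a in enumerate(approximations):
        lb, ub = bounds(q, a)
        delta = -upper[0] if len(upper) == k else math.inf
        if lb <= delta:
            if ub < delta:
                push_bounded(upper, -ub, k)
            heapq.heappush(candidates, (lb, i))
    # Phase 2: visit candidates by increasing lower bound; delta is the kth NN distance.
    best, visited = [], 0
    while candidates:
        lb, i = heapq.heappop(candidates)
        delta = -best[0][0] if len(best) == k else math.inf
        if delta < lb:
            break
        r = math.dist(q, vectors[i])
        visited += 1
        if r < delta:
            push_bounded(best, (-r, i), k)
    return sorted((-neg, i) for neg, i in best), visited
```

Only the vectors popped in phase 2 are read in full, and the loop stops as soon as the next lower bound exceeds the current kth distance.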

SS-tree and SR-tree • SS-tree: use spheres for index nodes • Higher fanout since storage cost is reduced • SR-tree: use rectangles and spheres for index nodes • Index node defined by the intersection of the two volumes • More accurate representation of data • Higher storage cost

Metric Tree (M-tree) • Definition of a metric • d(x,y) >= 0 • d(x,y) = d(y,x) • d(x,y) + d(y,z) >= d(x,z) • d(x,x) = 0 • Non-vector spaces • Edit distance • $d(u,v) = \sqrt{(u-v)^T A (u-v)}$ used in QBIC
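Edit distance is a convenient example of a metric on a non-vector space. The sketch below uses the standard Levenshtein dynamic program (unit cost for insert, delete, and substitute); the function name is illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


d = edit_distance("kitten", "sitting")
```

Strings have no coordinates, so a spatial index cannot store them, but the four metric properties listed above still hold, and that is all an M-tree needs.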

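Basic idea • Index entry = (routing object, distance to parent, covering radius) • All objects in a subtree are within a distance of "covering radius" from its routing object: d(y,z) <= r(y) for every object z in the subtree of y [Figure: parent p with child entries (x, d(x,p), r(x)) and (y, d(y,p), r(y)); object z inside the ball around y]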

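Range queries
• Query q with range t; z is any object in the subtree of y
• Triangle inequality: d(q,z) >= d(q,y) - d(y,z)
• Covering radius: d(y,z) <= r(y)
• So, d(q,z) >= d(q,y) - r(y)
• If d(q,y) - r(y) > t then d(q,z) > t
• Prune subtree y if d(q,y) - r(y) > t (C1)
[Figure: parent p with entries (x, d(x,p), r(x)) and (y, d(y,p), r(y)); query q with range t; object z in the subtree of y]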

Range queries
• Prune subtree y if d(q,y) - r(y) > t (C1)
• d(q,y) >= d(q,p) - d(p,y)
• d(q,y) >= d(p,y) - d(q,p)
• So, d(q,y) >= |d(q,p) - d(p,y)|
• If |d(q,p) - d(p,y)| - r(y) > t then d(q,y) - r(y) > t
• Prune subtree y if |d(q,p) - d(p,y)| - r(y) > t (C2)
[Figure: parent p with entries (x, d(x,p), r(x)) and (y, d(y,p), r(y)); query q with range t; object z in the subtree of y]

Range query algorithm • RQ(q, t, Root, Subtrees S1, S2, …) • For each subtree Si • prune if condition C2 holds • otherwise compute distance to root of Si and prune if condition C1 holds • otherwise search the children of Si
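A compact sketch of the range query with both pruning conditions. Several things are assumptions for the example: entries are a small `Entry` dataclass, the metric is Euclidean, data objects are entries with radius 0 and no children, and the two-level tree is built by hand with `build_subtree` rather than by M-tree insertion.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Entry:
    obj: tuple              # routing object, or a data object in a leaf
    dist_to_parent: float   # stored d(obj, parent routing object)
    radius: float = 0.0     # covering radius; 0 for data objects
    children: list = field(default_factory=list)


def build_subtree(routing, objects):
    """Hand-built subtree: one routing entry over a non-empty list of data objects."""
    leaves = [Entry(o, math.dist(o, routing)) for o in objects]
    return Entry(routing, 0.0, max(e.dist_to_parent for e in leaves), leaves)


def range_query(entries, q, t, d_q_parent=None):
    """Return all data objects within distance t of q."""
    hits = []
    for e in entries:
        # C2: uses only stored distances, so no new distance computation is needed.
        if d_q_parent is not None and abs(d_q_parent - e.dist_to_parent) - e.radius > t:
            continue
        d_qe = math.dist(q, e.obj)
        # C1: the whole ball around e.obj lies outside the query range.
        if d_qe - e.radius > t:
            continue
        if e.children:
            hits.extend(range_query(e.children, q, t, d_qe))
        else:
            hits.append(e.obj)
    return hits


cluster_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
cluster_b = [(10.0, 10.0), (11.0, 10.0), (10.0, 12.0)]
root = [build_subtree((0.0, 0.0), cluster_a), build_subtree((10.0, 10.0), cluster_b)]
near_b = range_query(root, (10.5, 10.0), 0.6)
```

In the last call the subtree around (0, 0) is dropped by C1, and the object (10, 12) is dropped by C2 without its distance to the query ever being computed.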

Nearest neighbor query
• Maintain a priority list of k NN distances
• Minimum distance to a subtree with root x: dmin(q,x) = max(d(q,x) - r(x), 0)
  • |d(q,p) - d(p,x)| - r(x) <= d(q,x) - r(x)
  • May not need to compute d(q,x)
• Maximum distance to a subtree with root x: dmax(q,x) = d(q,x) + r(x)
• For any object z in the subtree of x:
  • d(q,z) + r(x) >= d(q,x), so d(q,z) >= d(q,x) - r(x)
  • d(q,z) <= d(q,x) + r(x)
[Figure: query q, routing object x with covering radius r(x), object z inside the ball]

Nearest neighbor query • Maintain an estimate dp of the kth smallest maximum distance • Prune a subtree x if dmin(q,x) >= dp
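The pruning rule can be illustrated on a flat list of subtrees. The sketch assumes each subtree is summarized by the pair (d(q,x), r(x)) and contains at least one object, so the kth smallest dmax is a valid upper bound dp on the kth NN distance. The function names are illustrative, and a full M-tree search would apply this rule recursively.

```python
def d_min(d_qx, r_x):
    """Smallest possible distance from q to any object in the subtree of x."""
    return max(d_qx - r_x, 0.0)


def d_max(d_qx, r_x):
    """Largest possible distance from q to any object in the subtree of x."""
    return d_qx + r_x


def surviving_subtrees(subtrees, k):
    """Drop every subtree whose d_min is at least dp, the kth smallest d_max."""
    maxima = sorted(d_max(d, r) for d, r in subtrees)
    dp = maxima[k - 1] if len(maxima) >= k else float("inf")
    return [(d, r) for d, r in subtrees if d_min(d, r) < dp]


# (d(q, x), r(x)) for three hypothetical subtrees.
subtrees = [(1.0, 0.5), (5.0, 1.0), (10.0, 2.0)]
kept_for_1nn = surviving_subtrees(subtrees, 1)
kept_for_2nn = surviving_subtrees(subtrees, 2)
```

For k = 1 the estimate is dp = 1.5, so only the first subtree survives; for k = 2 it grows to 6.0 and the second subtree has to be searched as well.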

References
• Christos Faloutsos, Ron Barber, Myron Flickner, Jim Hafner, Wayne Niblack, Dragutin Petkovic, William Equitz: Efficient and Effective Querying by Image Content. JIIS 3(3/4): 231-262 (1994)
• Stefan Berchtold, Daniel A. Keim, Hans-Peter Kriegel: The X-tree: An Index Structure for High-Dimensional Data. VLDB 1996: 28-39
• Stefan Berchtold, Christian Böhm, Hans-Peter Kriegel: The Pyramid-Technique: Towards Breaking the Curse of Dimensionality. SIGMOD Conference 1998: 142-153
• Roger Weber, Hans-Jörg Schek, Stephen Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205
• Paolo Ciaccia, Marco Patella, Pavel Zezula: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997: 426-435
