
Multimedia DBs


- A multimedia database stores text, strings and images
- Similarity queries (content based retrieval)
- Given an image find the images in the database that are similar (or you can “describe” the query image)

- Extract features, index them in feature space, and answer similarity queries using GEMINI
- Again, average values help!
(Used in QBIC, IBM Almaden)

- Features extracted from an image are based on:
- Color distribution
- Shapes and structure
- …..

what is an image?

A: 2-d RGB array

Color histograms, and distance function

Mathematically, the distance function between a vector x and a query q is:

$$D(x, q) = (x - q)^T A (x - q) = \sum_{i,j} a_{ij} (x_i - q_i)(x_j - q_j)$$

A=I ?
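The quadratic-form distance can be tried out numerically. The sketch below is illustrative only: the 3-bin histograms and the similarity matrix `A` are made-up values, and `quadratic_form_distance` is a hypothetical helper name, not part of QBIC.

```python
import numpy as np

def quadratic_form_distance(x, q, A):
    """Return (x - q)^T A (x - q) for 1-D arrays x, q and a square matrix A."""
    diff = np.asarray(x, dtype=float) - np.asarray(q, dtype=float)
    return float(diff @ np.asarray(A, dtype=float) @ diff)

# Toy 3-bin color histograms (hypothetical values).
x = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# A = I: bins are treated as unrelated, so this is the squared Euclidean distance.
d_identity = quadratic_form_distance(x, q, np.eye(3))

# Hypothetical similarity matrix: bins 0 and 2 hold similar colors (a_02 = 0.5).
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0],
              [0.5, 0.0, 1.0]])
d_cross = quadratic_form_distance(x, q, A)

print(d_identity, d_cross)
```

With A = I the off-diagonal terms vanish; with a non-diagonal A, mass moved between similar bins is penalized less, which is exactly the cross-talk between features.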

Problem: ‘cross-talk’:

Features are not orthogonal -> spatial access methods (SAMs) will not work properly

Q: what to do?

A: feature-extraction question

possible answers:

avg red, avg green, avg blue

it turns out that this lower-bounds the histogram distance ->

no cross-talk

SAMs are applicable
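The filter-and-refine idea behind a lower-bounding feature can be sketched as follows. This is not the avg-RGB bound itself: as a stand-in, the sketch uses plain Euclidean distance and a filter that keeps only the first `m` coordinates, which is a simple case where the lower-bounding property is easy to see. The data, `eps`, `m`, and the function name are all assumptions made for illustration.

```python
import numpy as np

def range_query_filter_refine(data, q, eps, m):
    """Return indices i with ||data[i] - q|| <= eps, using a cheap filter first.

    Filter distance: Euclidean distance on the first m coordinates only.
    Dropping coordinates can only shrink a Euclidean distance, so the filter
    lower-bounds the full distance and never discards a true answer.
    """
    # Step 1 (filter): cheap distance in the reduced feature space.
    d_feat = np.linalg.norm(data[:, :m] - q[:m], axis=1)
    candidates = np.nonzero(d_feat <= eps)[0]
    # Step 2 (refine): full distance only for the surviving candidates.
    d_full = np.linalg.norm(data[candidates] - q, axis=1)
    return candidates[d_full <= eps], len(candidates)

rng = np.random.default_rng(0)
data = rng.random((1000, 16))   # hypothetical 16-d feature vectors
q = rng.random(16)
hits, n_candidates = range_query_filter_refine(data, q, eps=1.0, m=3)

# Reference answer by sequential scan over the full vectors.
expected = np.nonzero(np.linalg.norm(data - q, axis=1) <= 1.0)[0]
```

Because the filter distance never exceeds the full distance, the filter step may let false alarms through but produces no false dismissals; the refine step removes the false alarms.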

[Figure: performance. Axes: time vs. selectivity. Curves: seq scan, w/ avg RGB]

distance function: Euclidean, on the area, perimeter, and 20 ‘moments’

(Q: how to normalize them?

A: divide by standard deviation)
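Dividing by the standard deviation keeps a large-valued feature such as area from dominating the Euclidean distance. A minimal sketch, assuming the features are stored one shape per row; the sample values and the helper name `normalize_by_std` are hypothetical.

```python
import numpy as np

def normalize_by_std(features):
    """Divide each feature column by its standard deviation."""
    features = np.asarray(features, dtype=float)
    std = features.std(axis=0)
    std[std == 0] = 1.0          # leave constant columns unchanged
    return features / std

# Hypothetical shape features: [area, perimeter, one moment] per row.
shapes = np.array([[1200.0, 140.0, 0.02],
                   [ 300.0,  70.0, 0.05],
                   [ 800.0, 110.0, 0.01]])
scaled = normalize_by_std(shapes)

# After scaling, every feature contributes on a comparable scale.
dist = np.linalg.norm(scaled[0] - scaled[1])
```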

(Q: other ‘features’ / distance functions?

A1: turning angle

A2: dilations/erosions

A3: ... )

Q: how to do dim. reduction?

A: Karhunen-Loeve (= centered PCA/SVD)

Performance: ~10x faster

[Figure: log(# of I/Os) vs. # of features kept; reference point: all kept]
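The Karhunen-Loeve step (centered PCA via SVD) can be sketched with NumPy. The data here is random and only stands in for the 22 shape features (area, perimeter, 20 moments); `kl_transform` and `k=5` are illustrative choices, not values from the original system.

```python
import numpy as np

def kl_transform(features, k):
    """Project rows of `features` onto their top-k principal directions."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                       # centering step
    # Rows of Vt are orthonormal principal directions, strongest first.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, mean, components

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 22))          # stand-in for 22 shape features
reduced, mean, components = kl_transform(X, k=5)

# An orthonormal projection never increases Euclidean distance,
# so distances in the reduced space lower-bound the original ones.
d_full = np.linalg.norm(X[0] - X[1])
d_reduced = np.linalg.norm(reduced[0] - reduced[1])
```

The lower-bounding property is what allows the reduced vectors to be indexed while the full vectors are used only to discard false alarms.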

- Many problems (like time-series and image similarity) can be expressed as proximity problems in a high dimensional space
- Given a query point we try to find the points that are close…
- But in high-dimensional spaces things are different!

- Assume a uniformly distributed set of points in high dimensions, [0,1]^d
- Take a query with side length 0.1 in each dimension: in 100-d the query selectivity is 0.1^100 = 10^-100
- If we want constant selectivity (0.1), the side length must be 0.1^(1/100) ≈ 0.98, i.e. ~1!
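The two numbers above follow from the volume of a hypercube query under the uniformity assumption. A small sketch (the function names are illustrative):

```python
def box_selectivity(side, d):
    """Fraction of uniform points in [0,1]^d covered by a cube query of given side."""
    return side ** d

def side_for_selectivity(selectivity, d):
    """Side length a cube query needs in order to cover the given fraction."""
    return selectivity ** (1.0 / d)

sel_100d = box_selectivity(0.1, 100)         # on the order of 1e-100
side_100d = side_for_selectivity(0.1, 100)   # close to 1

print(sel_100d, side_100d)
```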

- Surface is everything!
- Probability that a point is closer than 0.1 to a (d-1)-dimensional surface: 1 - 0.8^d
  - d = 2: 0.36
  - d = 10: ≈ 0.89
  - d = 100: ~1
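These probabilities can be reproduced directly: a uniform point in [0,1]^d avoids all boundary faces only if every coordinate falls in [0.1, 0.9]. A short sketch, with `prob_near_surface` as an illustrative name:

```python
def prob_near_surface(d, eps=0.1):
    """P(a uniform point in [0,1]^d lies within eps of some boundary face)."""
    # Each coordinate must land in [eps, 1 - eps] for the point to be "interior".
    return 1.0 - (1.0 - 2.0 * eps) ** d

for d in (2, 10, 100):
    print(d, prob_near_surface(d))
```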

- Number of grid cells and surfaces
- Number of k-dimensional surfaces in a d-dimensional hypercube: C(d,k) · 2^(d-k)
- Binary partitioning: 2^d cells

- Indexing in high dimensions is extremely difficult: the “curse of dimensionality”

- Performance impacted by the amount of overlap between index nodes
- Need to follow different paths
- Overlap, multi-overlap, weighted overlap

- The X-tree behaves like an R*-tree when overlap is small
- It behaves like sequential access when overlap is large
- When an overflow occurs
  - Split into two nodes if overlap is small
  - Otherwise create a super-node with twice the capacity
- Tradeoffs made locally over different regions of data space

- No performance comparisons with linear scan!

- Designed for Range queries
- Map each d-dimensional point to 1-d value
- Build B+-tree on 1-d values
- A range query is transformed into a set of 1-d ranges
- More efficient than X-tree, Hilbert order, and sequential scan

- 2d pyramids (two per dimension) with top at the center of the data space
- Points in different pyramids are ordered based on pyramid id
- Points within a pyramid are ordered based on height
- value(v) = pyramid(v) + height(v)
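The mapping to a 1-d value can be sketched as below. The pyramid numbering (0..d-1 for the low side of each dimension, d..2d-1 for the high side) is an assumed convention for this sketch, and the sample points are made up.

```python
def pyramid_value(v):
    """Map a point v in [0,1]^d to its 1-d pyramid value."""
    d = len(v)
    # Dimension in which v is farthest from the center 0.5.
    j_max = max(range(d), key=lambda j: abs(0.5 - v[j]))
    height = abs(0.5 - v[j_max])            # always in [0, 0.5]
    # Assumed numbering: pyramids 0..d-1 on the low side, d..2d-1 on the high side.
    pyramid = j_max if v[j_max] < 0.5 else j_max + d
    return pyramid + height

points = [(0.2, 0.6), (0.5, 0.9), (0.45, 0.5), (0.7, 0.4)]
# A B+-tree would store the points in this key order.
ordered = sorted(points, key=pyramid_value)
```

Since the height never exceeds 0.5, values from different pyramids cannot interleave, so the integer part of the key identifies the pyramid.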

- Tile the d-dimensional data space uniformly
- A fixed number of bits in each dimension (8)
- 256 partitions along each dimension
- 256^d tiles
- Approximate each point by its corresponding tile
- size of approximation = 8d bits = d bytes
- size of each point = 4d bytes (assuming a word per dimension)
- 2-step approach, the first step using the VA-file
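The tiling and the storage arithmetic can be illustrated with NumPy. This is a sketch under stated assumptions: random 16-d points stored as 32-bit floats, a uniform grid, and at most 8 bits per dimension so that one `uint8` per dimension suffices. `va_approximate` is a hypothetical name.

```python
import numpy as np

BITS = 8  # bits per dimension, as in the text (this sketch assumes bits <= 8)

def va_approximate(points, bits=BITS):
    """Map points in [0,1]^d to per-dimension cell numbers (one uint8 each)."""
    cells = 1 << bits
    idx = np.floor(np.asarray(points) * cells).astype(np.int64)
    return np.clip(idx, 0, cells - 1).astype(np.uint8)

rng = np.random.default_rng(2)
points = rng.random((1000, 16)).astype(np.float32)   # 4 bytes per dimension
approx = va_approximate(points)                      # 1 byte per dimension

print(points.nbytes, approx.nbytes)
```

Each approximation identifies the tile that contains the point, so the cell boundaries give lower and upper bounds on any distance to the real vector.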

- δ = distance to the kth NN so far
- For each approximation a_i
  - If lb(q, a_i) < δ then
    - Compute r = distance(q, v_i)
    - If r < δ then
      - Add point i to the set of NNs
      - Update δ
- Performance based on ordering of vectors and their approximations

- δ = kth smallest ub(q, a) so far
- For each approximation a_i
  - Compute lb(q, a_i) and ub(q, a_i)
  - If lb(q, a_i) <= δ then
    - If ub(q, a_i) < δ then
      - Add point i to the set of NNs
      - Update δ
    - InsertHeap(Heap, lb(q, a_i), i)

- δ = distance to the kth NN so far
- Repeat
  - Examine the next entry (l_i, i) from the heap
  - If δ < l_i then break
  - Else
    - Compute r = distance(q, v_i)
    - If r < δ then
      - Add point i to the set of NNs
      - Update δ
- Forever

- Sub-linear (log n) vectors after first phase
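The two phases can be put together in a compact sketch. It assumes Euclidean distance, points in [0,1]^d, and an 8-bit uniform grid; the helper names (`quantize`, `bounds`, `push_k_smallest`, `va_knn`) and the random test data are illustrative, and the code reads everything from memory rather than from disk.

```python
import heapq
import numpy as np

CELLS = 256  # 8 bits per dimension

def quantize(points):
    """Approximate each point in [0,1]^d by the index of its tile."""
    return np.clip(np.floor(points * CELLS), 0, CELLS - 1).astype(np.int64)

def bounds(q, cell):
    """Lower and upper bound on the distance from q to any point in tile `cell`."""
    lo, hi = cell / CELLS, (cell + 1) / CELLS
    near = np.maximum(np.maximum(lo - q, q - hi), 0.0)  # 0 where q is inside the slab
    far = np.maximum(np.abs(q - lo), np.abs(q - hi))
    return float(np.linalg.norm(near)), float(np.linalg.norm(far))

def push_k_smallest(heap, k, value, item):
    """Keep the k smallest values in a max-heap; return the kth smallest (or inf)."""
    if len(heap) < k:
        heapq.heappush(heap, (-value, item))
    elif value < -heap[0][0]:
        heapq.heapreplace(heap, (-value, item))
    return -heap[0][0] if len(heap) == k else np.inf

def va_knn(points, approx, q, k):
    # Phase 1: scan approximations only; delta = kth smallest upper bound so far.
    ub_heap, candidates, delta = [], [], np.inf
    for i, cell in enumerate(approx):
        lb, ub = bounds(q, cell)
        if lb <= delta:
            delta = push_k_smallest(ub_heap, k, ub, i)
            heapq.heappush(candidates, (lb, i))
    # Phase 2: visit candidates by increasing lower bound, reading real vectors.
    nn_heap, delta, visited = [], np.inf, 0
    while candidates:
        lb, i = heapq.heappop(candidates)
        if delta < lb:          # no remaining candidate can beat the kth NN
            break
        r = float(np.linalg.norm(points[i] - q))
        visited += 1
        delta = push_k_smallest(nn_heap, k, r, i)
    return sorted((-neg, i) for neg, i in nn_heap), visited

rng = np.random.default_rng(3)
points = rng.random((500, 8))
q = rng.random(8)
neighbors, visited = va_knn(points, quantize(points), q, k=5)
```

`visited` counts how many real vectors the second phase had to read; comparing it with the total number of points gives a feel for how much work the approximations save.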

- Use Spheres for index nodes (SS-tree)
- Higher fanout since storage cost is reduced

- Use rectangles and spheres for index nodes
- Index node defined by the intersection of two volumes
- More accurate representation of data
- Higher storage cost

- Definition of a metric
- d(x,y) >= 0
- d(x,y) = d(y,x)
- d(x,y) + d(y,z) >= d(x,z)
- d(x,x) = 0

- Non-vector spaces
- Edit distance
- $d(u,v) = \sqrt{(u-v)^T A (u-v)}$, used in QBIC
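Edit distance is a convenient example of a metric on objects that are not vectors. The sketch below implements the standard Levenshtein distance (unit cost for insertion, deletion, and substitution) and spot-checks the metric properties on a few made-up words.

```python
from itertools import product

def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character edits from s to t."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute, or match for free
        prev = curr
    return prev[-1]

words = ["kitten", "sitting", "mitten", "fitting", ""]
# Spot-check symmetry and the triangle inequality on all triples.
metric_ok = all(
    edit_distance(x, y) == edit_distance(y, x)
    and edit_distance(x, y) + edit_distance(y, z) >= edit_distance(x, z)
    for x, y, z in product(words, repeat=3)
)
```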

[Figure: parent p with index entries (x, d(x,p), r(x)) and (y, d(y,p), r(y)); every object z in the subtree of y satisfies d(y,z) <= r(y)]

Index entry = (routing object, distance to parent, covering radius)

All objects in a subtree are within a distance of “covering radius” from its routing object.

[Figure: parent p with entries (x, d(x,p), r(x)) and (y, d(y,p), r(y)); query q with range t; z is an object in the subtree of y]

d(q,z) >= d(q,y) - d(y,z)

d(y,z) <= r(y)

So, d(q,z) >= d(q,y) -r(y)

if d(q,y) - r(y) > t then d(q,z) > t

Prune subtree y if d(q,y) - r(y) > t (C1)


Prune subtree y if d(q,y) - r(y) > t (C1)

d(q,y) >= d(q,p) - d(p,y)

d(q,y) >= d(p,y) - d(q,p)

So, d(q,y) >= |d(q,p) - d(p,y)|

if |d(q,p) - d(p,y)| - r(y) > t then d(q,y) - r(y) > t

Prune subtree y if |d(q,p) - d(p,y)| - r(y) > t (C2)

- RQ(q, t, Root, Subtrees S1, S2, …)
  - For each subtree Si
    - prune if condition C2 holds
    - otherwise compute the distance to the root of Si and prune if condition C1 holds
    - otherwise search the children of Si
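The range-query procedure with conditions C1 and C2 can be sketched on a tiny hand-built tree. Everything here is illustrative: the `Entry` class, the two-level tree, the 2-d points, and the use of Euclidean distance as the metric are assumptions, and no insertion or split logic is shown.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Entry:
    obj: tuple                  # routing object (internal) or data object (leaf)
    d_parent: float = 0.0       # stored distance to the parent routing object
    radius: float = 0.0         # covering radius (0 for data objects)
    children: list = field(default_factory=list)   # empty for data objects

def make_subtree(routing_obj, objects):
    """Build a routing entry whose children are the given data objects."""
    leaves = [Entry(o, d_parent=math.dist(o, routing_obj)) for o in objects]
    return Entry(routing_obj, radius=max(l.d_parent for l in leaves), children=leaves)

def range_query(entries, q, t, stats, d_qp=None):
    """Return data objects within distance t of q; d_qp = d(q, parent) or None at the root."""
    result = []
    for e in entries:
        # C2: uses only stored distances, so no new distance computation is needed.
        if d_qp is not None and abs(d_qp - e.d_parent) - e.radius > t:
            continue
        d_qe = math.dist(q, e.obj)
        stats["dist"] += 1
        # C1: the whole subtree lies outside the query ball.
        if d_qe - e.radius > t:
            continue
        if e.children:
            result += range_query(e.children, q, t, stats, d_qe)
        else:
            result.append(e.obj)
    return result

root = [make_subtree((0, 0), [(0, 0), (1, 0), (0, 1)]),
        make_subtree((10, 10), [(10, 10), (11, 10), (10, 12)])]
stats = {"dist": 0}
found = range_query(root, q=(2, 0), t=1.2, stats=stats)
```

In this run the far subtree is dropped by C1 after one distance computation, and the object (0, 0) is dropped by C2 without computing its distance to q at all.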

- Maintain a priority list of k NN distances
- Minimum distance to a subtree with root x: dmin(q,x) = max(d(q,x) - r(x), 0)
  - |d(q,p) - d(p,x)| - r(x) <= d(q,x) - r(x)
  - may not need to compute d(q,x)
- Maximum distance to a subtree with root x: dmax(q,x) = d(q,x) + r(x)

[Figure: query q, routing object x with covering radius r(x), and an object z in the subtree of x]

d(q,z) + r(x) >= d(q,x)

d(q,z) >= d(q,x) - r(x)

d(q,z) <= d(q,x) + r(x)

- Maintain an estimate dp of the kth smallest maximum distance
- Prune a subtree x if dmin(q,x) >= dp
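A best-first k-NN search using dmin for pruning can be sketched as follows. This is a simplification, stated plainly: here dp is tightened only from distances to actual data objects, not from the dmax bounds of unvisited subtrees, and the parent-distance shortcut is not used. The `Entry` class, the hand-built tree, and Euclidean distance are assumptions made for illustration.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field

@dataclass
class Entry:
    obj: tuple                  # routing object (internal) or data object (leaf)
    radius: float = 0.0         # covering radius (0 for data objects)
    children: list = field(default_factory=list)

def make_subtree(routing_obj, objects):
    leaves = [Entry(o) for o in objects]
    radius = max(math.dist(o, routing_obj) for o in objects)
    return Entry(routing_obj, radius=radius, children=leaves)

def knn(root_entries, q, k):
    """Return the k nearest data objects to q as sorted (distance, object) pairs."""
    tie = itertools.count()                  # keeps heap tuples comparable
    pending = [(0.0, next(tie), root_entries)]   # (dmin, tie-breaker, entries)
    best = []                                # max-heap of (-distance, object)
    d_p = math.inf                           # current kth smallest distance
    while pending:
        d_min, _, entries = heapq.heappop(pending)
        if d_min >= d_p:                     # every remaining subtree is at least this far
            break
        for e in entries:
            d_qe = math.dist(q, e.obj)
            if e.children:
                sub_min = max(d_qe - e.radius, 0.0)      # dmin(q, x)
                if sub_min < d_p:
                    heapq.heappush(pending, (sub_min, next(tie), e.children))
            else:
                if len(best) < k:
                    heapq.heappush(best, (-d_qe, e.obj))
                elif d_qe < d_p:
                    heapq.heapreplace(best, (-d_qe, e.obj))
                if len(best) == k:
                    d_p = -best[0][0]
    return sorted((-neg, obj) for neg, obj in best)

root = [make_subtree((0, 0), [(0, 0), (1, 0), (0, 1)]),
        make_subtree((10, 10), [(10, 10), (11, 10), (10, 12)])]
nearest = knn(root, q=(2, 0), k=2)
```

Visiting subtrees in increasing dmin order means the search can stop as soon as the smallest pending dmin reaches dp, which is the pruning rule stated above.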

- Christos Faloutsos, Ron Barber, Myron Flickner, Jim Hafner, Wayne Niblack, Dragutin Petkovic, William Equitz: Efficient and Effective Querying by Image Content. JIIS 3(3/4): 231-262 (1994)
- Stefan Berchtold, Daniel A. Keim, Hans-Peter Kriegel: The X-tree : An Index Structure for High-Dimensional Data. VLDB 1996: 28-39
- Stefan Berchtold, Christian Böhm, Hans-Peter Kriegel: The Pyramid-Technique: Towards Breaking the Curse of Dimensionality. SIGMOD Conference 1998: 142-153
- Roger Weber, Hans-Jörg Schek, Stephen Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205
- Paolo Ciaccia, Marco Patella, Pavel Zezula: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB 1997: 426-435