
Multimedia DBs



  1. Multimedia DBs

  2. Multimedia DBs • A multimedia database stores text, strings, and images • Similarity queries (content-based retrieval) • Given a query image, find the images in the database that are similar to it (or you can “describe” the query image) • Extract features, index in feature space, answer similarity queries using GEMINI • Again, average values help!

  3. Image Features • Features extracted from an image are based on: • Color distribution • Shapes and structure • …..

  4. Images - color • Q: what is an image? • A: a 2-d array of RGB values

  5. Images - color • Color histograms, and a distance function between them

  6. Images - color • Mathematically, the distance between a histogram x and a query histogram q is: D(x, q) = (x − q)^T A (x − q) = Σij aij (xi − qi)(xj − qj) • Q: can we simply use A = I? (see the sketch below)
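To make the quadratic-form distance concrete, here is a minimal numpy sketch; the 3-bin histograms and the color-similarity matrix A are toy values chosen for illustration, not from the slides:

```python
import numpy as np

def hist_distance(x, q, A):
    """D(x, q) = (x - q)^T A (x - q); a_ij encodes similarity of colors i, j."""
    d = x - q
    return float(d @ A @ d)

x = np.array([0.6, 0.3, 0.1])   # 3-bin color histogram of a database image
q = np.array([0.5, 0.4, 0.1])   # histogram of the query image
A = np.array([[1.0, 0.5, 0.0],  # toy matrix: neighboring bins are similar
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])

print(hist_distance(x, q, A))          # cross-terms couple nearby colors
print(hist_distance(x, q, np.eye(3)))  # A = I: plain squared Euclidean
```

With A = I the cross-terms vanish; a perceptually sensible A is not the identity, which is exactly the ‘cross-talk’ problem of the next slide.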

  7. Images - color • Problem: ‘cross-talk’: features are not orthogonal → SAMs (spatial access methods) will not work properly • Q: what to do? • A: it is a feature-extraction question

  8. Images - color • A possible answer: keep the avg red, avg green, avg blue • It turns out that this lower-bounds the histogram distance → no cross-talk → SAMs are applicable (see the sketch below)
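A small sketch of the resulting GEMINI-style filter step, assuming images arrive as H x W x 3 arrays; the function names and the toy data are illustrative:

```python
import numpy as np

def avg_rgb(image):
    """3-d feature: the mean red, green, and blue of an H x W x 3 array."""
    return image.reshape(-1, 3).mean(axis=0)

def filter_step(db_features, q_feature, eps):
    """Keep images whose cheap 3-d distance is within eps.  Because the
    avg-RGB distance lower-bounds the full histogram distance, this step
    never drops a true match; survivors are refined with the full distance."""
    d = np.linalg.norm(db_features - q_feature, axis=1)
    return np.nonzero(d <= eps)[0]

rng = np.random.default_rng(0)
db = [rng.random((32, 32, 3)) for _ in range(100)]      # toy image "database"
feats = np.array([avg_rgb(im) for im in db])
print(filter_step(feats, avg_rgb(rng.random((32, 32, 3))), eps=0.05))
```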

  9. Images - color • [Plot: response time vs. selectivity, comparing a sequential scan against filtering with avg RGB]

  10. Images - shapes • Distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ • Q: how to normalize them?

  11. Images - shapes • Distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ • Q: how to normalize them? A: divide by the standard deviation (see the sketch below)
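As a sketch of that normalization on a hypothetical 22-column shape-feature table (area, perimeter, 20 moments), with the columns deliberately put on wildly different scales:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy table: 1000 shapes x 22 features, spanning ~8 orders of magnitude
F = rng.random((1000, 22)) * np.logspace(4, -4, 22)

F_norm = F / F.std(axis=0)   # divide each feature by its std, per the slide
# (z-scoring, (F - F.mean(0)) / F.std(0), is the common centered variant)
print(F_norm.std(axis=0))    # every feature now contributes on equal footing
```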

  12. Images - shapes • Distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ • Q: other ‘features’ / distance functions?

  13. Images - shapes • Distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ • Q: other ‘features’ / distance functions? • A1: turning angle • A2: dilations/erosions • A3: …

  14. Images - shapes • Distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ • Q: how to do dimensionality reduction?

  15. Images - shapes • Distance function: Euclidean, on the area, perimeter, and 20 ‘moments’ • Q: how to do dimensionality reduction? A: Karhunen-Loeve (= centered PCA/SVD; see the sketch below)
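A minimal sketch of Karhunen-Loeve as centered PCA via SVD; the 22-feature table and k = 4 are illustrative choices:

```python
import numpy as np

def kl_transform(F, k):
    """Project the rows of F onto the top-k principal directions."""
    mu = F.mean(axis=0)
    X = F - mu                                  # center the data first
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T, Vt[:k], mu             # k-dim points + the map

rng = np.random.default_rng(0)
F = rng.random((1000, 22))                      # toy shape features
F_k, components, mu = kl_transform(F, k=4)
q_k = (rng.random(22) - mu) @ components.T      # mapping a new query is cheap
```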

  16. Images - shapes • Performance: ~10x faster • [Plot: log(# of I/Os) vs. # of features kept; ‘all kept’ marks the unreduced case]

  17. Dimensionality Reduction • Many problems (like time-series and image similarity) can be expressed as proximity problems in a high-dimensional space • Given a query point, we try to find the points that are close to it… • But in high-dimensional spaces things are different!

  18. Effects of High-dimensionality • Assume a uniformly distributed set of points in high dimensions, in [0,1]^d • Consider a range query with length 0.1 in each dimension → its selectivity in 100-d is 0.1^100 = 10^−100 • If we want constant selectivity (0.1), the length of each side must be 0.1^(1/100) ≈ 0.98, i.e. ~1!
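The slide's numbers, checked directly (for uniform data in [0,1]^d):

```python
d = 100
print(0.1 ** d)        # selectivity of a side-0.1 cube query: 1e-100
print(0.1 ** (1 / d))  # side needed for selectivity 0.1: ~0.977 -- almost 1
```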

  19. Effects of High-dimensionality • Surface is everything! • Probability that a uniform point is closer than 0.1 to the (d−1)-dimensional boundary surface: 1 − 0.8^d • d = 2: 0.36 • d = 10: ≈ 0.9 • d = 100: ≈ 1

  20. Effects of High-dimensionality • Number of grid cells and surfaces explodes with d • Number of k-dimensional surfaces in a d-dimensional hypercube: C(d, k) · 2^(d−k) • Binary partitioning (one split per axis) → 2^d cells • Indexing in high dimensions is extremely difficult: the “curse of dimensionality”
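A quick check of both counts; the k-face expression C(d, k) · 2^(d−k) is the standard hypercube formula:

```python
from math import comb

def k_faces(d, k):
    """Number of k-dimensional faces of a d-dimensional hypercube."""
    return comb(d, k) * 2 ** (d - k)

print(k_faces(3, 2))  # an ordinary cube has 6 square faces
print(k_faces(3, 1))  # ...and 12 edges
print(2 ** 100)       # one binary split per axis in 100-d: ~1.3e30 cells
```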

  21. Dimensionality Reduction • The main idea: reduce the dimensionality of the space • Project the d-dimensional points into a k-dimensional space so that: • k << d • distances are preserved as well as possible • Solve the problem in low dimensions • (the GEMINI idea, of course…)

  22. DR requirements • The ideal mapping should: • Be fast to compute: O(N) or O(N log N), but not O(N^2) • Preserve distances, with small discrepancies • Provide a fast algorithm to map a new query (why? because queries arrive at run time and must be mapped into the k-dim space before searching)

  23. MDS (multidimensional scaling) • Input: a set of N items, the pair-wise (dis)similarities, and the target dimensionality k • Optimization criterion: stress = ( Σij (D(Si, Sj) − D(Sik, Sjk))^2 / Σij D(Si, Sj)^2 )^(1/2) • where D(Si, Sj) is the distance between time series Si, Sj, and D(Sik, Sjk) is the Euclidean distance between their k-dim representations • Steepest-descent algorithm: • start with an assignment (time series to k-dim point) • minimize stress by moving points (see the sketch below)
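A minimal numpy sketch of that steepest-descent loop; the step size, iteration count, and random start are arbitrary choices, not from the slide:

```python
import numpy as np

def stress(D, P):
    """Stress between the given distances D and the k-dim embedding P."""
    Dk = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
    return np.sqrt(((D - Dk) ** 2).sum() / (D ** 2).sum())

def mds(D, k=2, steps=500, lr=0.01, seed=0):
    P = np.random.default_rng(seed).normal(size=(D.shape[0], k))
    for _ in range(steps):                     # O(N^2) work per step
        diff = P[:, None] - P[None, :]
        Dk = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(Dk, 1.0)              # avoid division by zero
        grad = (((Dk - D) / Dk)[..., None] * diff).sum(axis=1)
        P -= lr * grad                         # move points downhill on stress
    return P

rng = np.random.default_rng(1)
X = rng.random((20, 5))                        # 20 toy items in 5-d
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(stress(D, mds(D, k=2)))                  # small stress = good embedding
```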

  24. MDS • Disadvantages: • Running time is O(N^2), since all pairs are examined, and convergence is slow • Also, it requires O(N) time to map a new point, which is not practical for queries

  25. FastMap [Faloutsos and Lin, 1995] • Maps objects to k-dimensional points so that distances are preserved well • It is an approximation of multidimensional scaling • Works even when only the distances are known • Is efficient, and allows efficient query transformation

  26. FastMap • Find two objects (‘pivots’) that are far away from each other • Project all points onto the line the two pivots define, to get the first coordinate

  27. FastMap - next iteration • Pretend the objects live in the hyperplane perpendicular to the pivot line, with updated distances d′(i, j)^2 = d(i, j)^2 − (xi − xj)^2, where xi, xj are the coordinates just computed, and recurse for the remaining k − 1 coordinates (see the sketch below)
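A minimal sketch of the whole loop on a precomputed distance matrix; the simple two-pass pivot choice below stands in for the paper's choose-distant-objects heuristic:

```python
import numpy as np

def fastmap(D, k):
    """Map n objects, given only their n x n distance matrix D, to k-d points."""
    n = D.shape[0]
    D2 = D.astype(float) ** 2                  # work with squared distances
    X = np.zeros((n, k))
    for c in range(k):
        a = 0
        b = int(np.argmax(D2[a]))              # object farthest from a
        a = int(np.argmax(D2[b]))              # then farthest from b
        if D2[a, b] == 0:                      # nothing left to separate
            break
        # cosine-law projection onto the line through the pivots a, b
        X[:, c] = (D2[a] + D2[a, b] - D2[b]) / (2 * np.sqrt(D2[a, b]))
        # next iteration: squared distances within the perpendicular hyperplane
        D2 = np.maximum(D2 - (X[:, c, None] - X[None, :, c]) ** 2, 0)
    return X

rng = np.random.default_rng(0)
pts = rng.random((50, 10))                     # toy objects in 10-d
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(fastmap(D, k=2)[:3])                     # first three 2-d images
```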

  28. Results • Documents / cosine similarity → Euclidean distance (how? for unit-length vectors, ||x − y||^2 = 2(1 − cos θ), as in the sketch below)
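One standard reading of the ‘how?’ (my assumption, not spelled out on the slide): for unit-length document vectors, ||x − y||^2 = 2(1 − cos θ), so cosine similarity converts directly into a Euclidean distance that FastMap can consume:

```python
import numpy as np

def cosine_to_euclidean(sim):
    """Distance between unit vectors with cosine similarity `sim`."""
    return np.sqrt(np.maximum(2.0 * (1.0 - sim), 0.0))

x = np.array([1.0, 0.0]); y = np.array([0.0, 1.0])        # unit vectors
print(cosine_to_euclidean(x @ y), np.linalg.norm(x - y))  # both sqrt(2)
```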
