Advanced topics in databases

Advanced topics in databases. V. Megalooikonomou Generic Multimedia Indexing (slides are based on notes by C. Faloutsos). General Overview. Multimedia Indexing Spatial Access Methods (SAMs) k-d trees Point Quadtrees MX-Quadtree z-ordering R-trees Generic Multimedia Indexing.


V. Megalooikonomou

Generic Multimedia Indexing

(slides are based on notes by C. Faloutsos)


General Overview

  • Multimedia Indexing

    • Spatial Access Methods (SAMs)

      • k-d trees

      • Point Quadtrees

      • MX-Quadtree

      • z-ordering

      • R-trees

    • Generic Multimedia Indexing


Multimedia Indexing – Detailed outline

  • Generic Multimedia Indexing

    • Problem definition

    • Distance function

    • Similarity queries – Types

    • Requirements (ideal method)

    • Basic idea, Lower-bounding

    • Gemini approach

    • Applications

      • 1-D Time sequences

      • 2-D Color images


Generic Multimedia Indexing - problem

  • Given a database of multimedia objects

  • Design fast search algorithms that locate objects that match a query object, exactly or approximately

    • Objects:

      • 1-d time sequences

      • Digitized voice or music

      • 2-d color images

      • 2-d or 3-d gray scale medical images

      • Video clips

  • E.g.: “Find companies whose stock prices move similarly”


Generic Multimedia Indexing – problem

  • 1st step: provide a measure for the distance between two objects

    • Distance function D():

      • Given two objects OA, OB, the distance (= dissimilarity) of the two objects is denoted by

        D(OA, OB)

        E.g., the Euclidean distance (square root of the sum of squared differences) of two equal-length time series
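
As a minimal sketch of such a distance function, the Euclidean distance of two equal-length time series could be written as below. The function name `euclidean_distance` is illustrative and not taken from the slides.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance D(a, b) between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # Square root of the sum of squared point-wise differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```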


Types of Similarity Queries

[Figure: time sequences S1, …, Sn (value vs. day, 1 to 365) are mapped by F() to points F(S1), …, F(Sn) in a feature space with axes ‘avg’ and ‘std’]

  • Similarity queries are classified into:

    • Whole match queries:

      • Given a collection of N objects O1, …, ON and a query object Q, find the data objects that are within distance ε of Q

    • Sub-pattern match queries:

      • Given a collection of N objects O1, …, ON, a query (sub-)object Q, and a tolerance ε, identify the parts of the data objects that match the query Q
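
The two query types can be sketched with a plain sequential scan, which is the baseline the rest of the deck tries to beat. This is only an illustration: the names `whole_match` and `subpattern_match`, the tolerance parameter `eps`, and the choice of Euclidean distance on numeric sequences are assumptions.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def whole_match(objects, query, eps):
    """Indices of the objects within distance eps of the query."""
    return [i for i, obj in enumerate(objects) if dist(obj, query) <= eps]


def subpattern_match(objects, query, eps):
    """(object index, offset) pairs where a window of the object matches."""
    m = len(query)
    hits = []
    for i, obj in enumerate(objects):
        # Slide a window of the query's length over the object.
        for off in range(len(obj) - m + 1):
            if dist(obj[off:off + m], query) <= eps:
                hits.append((i, off))
    return hits


objects = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 5]]
print(whole_match(objects, [1, 2, 3, 4], 1.0))
print(subpattern_match(objects, [2, 3], 0.0))
```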


Types of Similarity Queries

  • Additional types of queries:

    • K-nearest-neighbor queries:

      • Given a collection of N objects O1, …, ON and a query object Q, find the K data objects most similar to Q

    • All-pairs queries (or ‘spatial joins’):

      • Given a collection of N objects O1, …, ON, find all pairs of objects that are within distance ε of each other
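
A brute-force sketch of these two query types follows, again assuming numeric sequences and Euclidean distance. The names `k_nearest` and `all_pairs` are hypothetical; an index-based method would replace the exhaustive loops.

```python
import heapq
import math
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(objects, query, k):
    """Indices of the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, range(len(objects)),
                           key=lambda i: dist(objects[i], query))


def all_pairs(objects, eps):
    """Index pairs (i, j), i < j, whose objects are within eps of each other."""
    return [(i, j) for i, j in combinations(range(len(objects)), 2)
            if dist(objects[i], objects[j]) <= eps]


objects = [[0, 0], [1, 0], [5, 5], [0, 2]]
print(k_nearest(objects, [0, 0], 2))
print(all_pairs(objects, 2.0))
```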


Ideal method – requirements

  • Fast: sequential scanning, with a distance calculation against each and every object, is too slow for large databases

  • “Correct”: no false dismissals. False alarms are acceptable. Why? They can be removed afterwards by checking the actual distance, whereas a dismissed object is lost for good

  • Small space overhead

  • Dynamic: easy to insert, delete, and update objects


Approach Outline

  • Use k feature-extraction functions to map objects into points in k-dimensional space (i.e., apply a mapping F())

  • Use highly fine-tuned database SAMs (Spatial Access Methods), such as R-trees, to accelerate the search by pruning large portions of the database that are not promising


Basic idea

  • Focus on ‘whole match’ queries

    • Given a collection of N objects O1, …, ON, a distance/dissimilarity function D(Oi, Oj), and a query object Q, find the data objects that are within distance ε of Q

  • Sequential scanning?

    It may be too slow, for the following reasons:

    • Distance computation is expensive (e.g., the editing distance between DNA strings)

    • The database size N may be huge

  • Faster alternative?


Basic idea

  • Faster alternative:

    • Step 1: a ‘quick and dirty’ test to quickly discard the vast majority of non-qualifying objects

    • Step 2: use of SAMs to achieve faster-than-sequential searching

  • Example:

    • Database of yearly stock price movements

    • Euclidean distance function

    • Characterize each sequence with a single number (‘feature’)

    • Or use two or more features
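
The ‘quick and dirty’ test can be illustrated with a single feature per sequence. The sketch below assumes the feature is the average scaled by sqrt(n); with that scaling the feature difference never exceeds the Euclidean distance (by the Cauchy-Schwarz inequality), so no qualifying sequence is discarded. The scaling, the names `feature` and `filter_and_refine`, and the toy data are assumptions made for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def feature(seq):
    """Single-number feature: sqrt(n) * mean, i.e. sum / sqrt(n)."""
    return sum(seq) / math.sqrt(len(seq))


def filter_and_refine(database, query, eps):
    fq = feature(query)
    # Quick-and-dirty test: compare one number per sequence.
    candidates = [i for i, s in enumerate(database)
                  if abs(feature(s) - fq) <= eps]
    # Clean-up: compute the full distance only for the survivors.
    answers = [i for i in candidates if dist(database[i], query) <= eps]
    return candidates, answers


database = [
    [10, 11, 12, 13],
    [10, 11, 12, 14],
    [13, 12, 11, 10],   # same average as the query, different shape
    [50, 51, 52, 53],
]
candidates, answers = filter_and_refine(database, [10, 11, 12, 13], 1.5)
print("candidates:", candidates)
print("answers   :", answers)
```

In this toy run the third sequence passes the filter but fails the full distance check: it is a false alarm, which costs time but not correctness.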


Basic idea – illustration

[Figure: time sequences S1, …, Sn (value vs. day, 1 to 365) are mapped to points F(S1), …, F(Sn) in a feature space with axes Feature1 and Feature2]

  • A query with tolerance ε becomes a sphere with radius ε


Basic idea – caution!

  • The mapping F() from objects to k-d points should not distort the distances

  • D(): distance of two objects

  • Df(): distance of the corresponding feature vectors

  • Ideally, perfect preservation of distances

  • In practice, a guarantee of no false dismissals

  • How? By making sure that the distance in f-space matches or underestimates the distance between the two objects in the original space


Basic idea – Lower bounding

  • Let O1, O2 be two objects with distance function D(), and let F(O1), F(O2) be their feature vectors with distance function Df(). Then:

    To guarantee no false dismissals for whole match queries, the feature extraction function F() should satisfy:

    Df(F(O1), F(O2)) ≤ D(O1, O2)

    for every pair of objects O1, O2


Lower bounding – Proof

  • Let Q be the query object, O a qualifying object, and ε the tolerance.

  • Prove: if object O qualifies, it will be retrieved by a range query in the f-space

  • Or: D(Q, O) ≤ ε ⇒ Df(F(Q), F(O)) ≤ ε

  • Indeed, by lower-bounding: Df(F(Q), F(O)) ≤ D(Q, O) ≤ ε

  • What about ‘all-pairs’ queries? (‘spatial join’ on the f-space)

  • What about ‘nearest-neighbor’ queries?
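
The slides leave the nearest-neighbor case as an open question. One commonly described way to reuse the lower-bounding property is sketched below, as an assumption rather than as the slides' own answer: find k neighbors in feature space, use their actual distances to derive a safe radius, run a range query with that radius, then rank by actual distance. The scaled-mean feature and the linear scans (standing in for a SAM) are illustrative choices.

```python
import heapq
import math


def dist(a, b):
    """Actual distance D(): Euclidean distance of two sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def feature(seq):
    """One feature per sequence; |feature difference| lower-bounds dist()."""
    return sum(seq) / math.sqrt(len(seq))


def knn_lower_bound(database, query, k):
    fq = feature(query)
    fdist = [abs(feature(s) - fq) for s in database]
    # 1. k nearest neighbours in feature space (cheap to compute).
    first = heapq.nsmallest(k, range(len(database)), key=fdist.__getitem__)
    # 2. Their actual distances give a radius that must contain the true k-NN.
    eps = max(dist(database[i], query) for i in first)
    # 3. Range query in feature space: since Df <= D, nothing within eps is lost.
    candidates = [i for i, d in enumerate(fdist) if d <= eps]
    # 4. Rank the candidates by actual distance and keep the best k.
    return heapq.nsmallest(k, candidates,
                           key=lambda i: dist(database[i], query))


database = [
    [0, 0, 0, 0],
    [3, -3, 3, -3],     # same average as the query, but far away
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [10, 10, 10, 10],
]
print(knn_lower_bound(database, [0, 0, 0, 0], 2))
```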


GEneric Multimedia object INdexIng

  • GEMINI approach:

    • Determine the distance function D()

    • Find one or more numerical feature-extraction functions (to provide a ‘quick and dirty’ test)

    • Prove that Df() lower-bounds D(), to guarantee no false dismissals

    • Use a SAM (e.g., an R-tree) to store and retrieve the k-d feature vectors

  • Note: the methodology focuses only on the speed of the search, not on the quality of the results, which depends on the distance function
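
The four GEMINI steps can be put together in a toy pipeline. In this sketch a sorted list searched with `bisect` stands in for the SAM, which only works because a single feature (k = 1) is used; a real system would keep k-d feature vectors in an R-tree or a similar structure. The class name `GeminiIndex1D` and the scaled-mean feature are assumptions made for illustration.

```python
import bisect
import math


class GeminiIndex1D:
    """Toy GEMINI pipeline with one feature and a sorted list as the 'SAM'."""

    def __init__(self, objects):
        self.objects = list(objects)
        # (feature value, object id) pairs, sorted by feature value.
        self.entries = sorted((self.feature(o), i)
                              for i, o in enumerate(self.objects))
        self.keys = [f for f, _ in self.entries]

    @staticmethod
    def distance(a, b):
        """Step 1: the distance function D() (Euclidean here)."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    @staticmethod
    def feature(seq):
        """Steps 2 and 3: a feature whose differences lower-bound D()."""
        return sum(seq) / math.sqrt(len(seq))

    def range_query(self, query, eps):
        """Step 4: search the index, then discard the false alarms."""
        fq = self.feature(query)
        lo = bisect.bisect_left(self.keys, fq - eps)
        hi = bisect.bisect_right(self.keys, fq + eps)
        candidates = [i for _, i in self.entries[lo:hi]]
        return sorted(i for i in candidates
                      if self.distance(self.objects[i], query) <= eps)


index = GeminiIndex1D([
    [10, 11, 12, 13],
    [10, 11, 12, 14],
    [13, 12, 11, 10],
    [50, 51, 52, 53],
])
print(index.range_query([10, 11, 12, 13], 1.5))
```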


Generic Multimedia Object Indexing

  • Applications:

    • 1-d time sequences

    • 2-d color images

  • Problems to solve:

    • How to apply the lower-bounding lemma

    • ‘Curse of dimensionality’ (time sequences)

    • ‘Cross-talk’ of features (color images)


1-D Time Sequences

  • Distance function: Euclidean distance

  • Find features that:

    • Preserve/lower-bound the distance

    • Carry as much information as possible (to reduce false alarms)

  • If we are allowed to use only one feature, what would it be? The average.

  • Extending it:

    • The average of the 1st half, of the 2nd half, of the 1st quarter, etc.

    • Coefficients of the Fourier transform (DFT), wavelet transform, etc.
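
The "averages of parts" idea can be sketched with m equal-length, non-overlapping segments taken at a single level (halves, or quarters, and so on). Scaling each segment average by the square root of the segment length is an assumption added here so that the Euclidean distance between feature vectors lower-bounds the Euclidean distance between the sequences. The name `segment_features` is hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def segment_features(seq, m):
    """Map a sequence to m features: scaled averages of m equal segments."""
    n = len(seq)
    if n % m != 0:
        raise ValueError("sequence length must be a multiple of m")
    w = n // m  # segment length
    # sqrt(w) * mean(segment) equals sum(segment) / sqrt(w).
    return [sum(seq[j * w:(j + 1) * w]) / math.sqrt(w) for j in range(m)]


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 2, 5, 3, 5, 9, 7, 6]
print("actual distance:", dist(x, y))
for m in (1, 2, 4, 8):
    fx, fy = segment_features(x, m), segment_features(y, m)
    print(m, "feature(s):", dist(fx, fy))
```

With more segments the feature distance gets closer to the actual distance, which means fewer false alarms at the price of a higher-dimensional feature space.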


1-D Time Sequences

  • Show that the distance in feature space lower-bounds the actual distance

  • What about DFT?

    Parseval’s Theorem: the DFT preserves the energy of a signal, as well as the distance between two signals:

    D(x, y) = D(X, Y)

    where X and Y are the Fourier transforms of x and y

  • If we keep only the first k ≤ n coefficients of the DFT, we lower-bound the actual distance
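
Both claims can be checked numerically with a small, direct DFT. The sketch assumes the unitary normalization (a factor of 1/sqrt(n)), under which Parseval's theorem holds without extra constants; the function names and sample sequences are illustrative.

```python
import cmath
import math


def dft(x):
    """Unitary DFT: X_f = (1/sqrt(n)) * sum_t x_t * exp(-2*pi*i*f*t/n)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]


def dist(a, b):
    """Euclidean distance; abs() also handles complex coefficients."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))


x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0]
y = [2.0, 2.0, 4.0, 4.0, 6.0, 5.0, 7.0, 7.0]
X, Y = dft(x), dft(y)

print("time domain      :", round(dist(x, y), 6))
print("all coefficients :", round(dist(X, Y), 6))
for k in (1, 2, 4):
    # Dropping coefficients drops non-negative terms, so the distance shrinks.
    print("first", k, "coeff(s):", round(dist(X[:k], Y[:k]), 6))
```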


1-D Time Sequences

  • Response time improves as the transform concentrates more of the signal’s energy in the few retained coefficients

  • The DFT concentrates the energy for a large class of signals, the colored noises

  • Colored noises: skewed energy spectrum that drops as O(f^(-b))

  • The energy spectrum (or power spectrum) of a signal is the square of the amplitude |Xf| as a function of the frequency f

  • b = 2: random walks or brown noise (very predictable)

  • b > 2: black noises

  • b = 1: pink noise

  • b = 0: white noise (completely unpredictable)

  • Colored noises appear even in images (photographs)
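
The energy-concentration claim can be explored with a small experiment that compares white noise with a random walk (brown noise). Everything here is an illustrative assumption: the sequence length, the number of retained coefficients, the seed, and the use of Gaussian steps.

```python
import cmath
import math
import random


def energy_in_first_k(x, k):
    """Fraction of the signal energy held by DFT coefficients 0..k-1."""
    n = len(x)
    total = sum(v * v for v in x)  # by Parseval, equals the sum of |X_f|^2
    head = 0.0
    for f in range(k):
        X_f = sum(x[t] * cmath.exp(-2j * math.pi * f * t / n)
                  for t in range(n)) / math.sqrt(n)
        head += abs(X_f) ** 2
    return head / total


def white_noise(n, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def random_walk(n, rng):
    walk, level = [], 0.0
    for step in white_noise(n, rng):
        level += step          # cumulative sum of white-noise steps
        walk.append(level)
    return walk


rng = random.Random(0)
n, k, trials = 128, 4, 20
avg_white = sum(energy_in_first_k(white_noise(n, rng), k)
                for _ in range(trials)) / trials
avg_walk = sum(energy_in_first_k(random_walk(n, rng), k)
               for _ in range(trials)) / trials
print("white noise, energy in first", k, "coefficients:", avg_white)
print("random walk, energy in first", k, "coefficients:", avg_walk)
```

For white noise the energy is spread roughly evenly over all coefficients, so the first few hold only a small share; for the random walk they hold a much larger share.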


2-D color images

  • Image features for Content-Based Image Retrieval (CBIR):

    • Low level:

      • Color – color histograms

      • Texture – directionality, granularity, contrast

      • Shape – turning angle, moments of inertia, pattern spectrum

      • Position – 2D strings method

      • …etc

    • Object level:

      • Regions


2-D color images – Color histograms

  • Each color image is a 2-d array of pixels

  • Each pixel has 3 color components (R, G, B)

  • h colors, each denoting a point in 3-d color space (h can be as high as 2^24 colors)

  • For each image, compute the h-element color histogram: each component is the percentage of pixels that are most similar to that color

  • The histogram of image I is defined as:

    For a color Ci, Hci(I) represents the number of pixels of color Ci in image I

    OR:

    For any pixel in image I, Hci(I) represents the probability of that pixel having color Ci.
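
A minimal sketch of the normalized (probability) form of the histogram is shown below. It assumes pixels are (R, G, B) tuples with 8-bit channels and that colors are binned by uniform quantization into `levels` values per channel, giving h = levels³ bins; the slides do not prescribe this particular binning.

```python
def color_histogram(pixels, levels=4):
    """Normalized histogram over h = levels**3 color bins."""
    h = levels ** 3
    counts = [0] * h
    for r, g, b in pixels:
        # Map each 0..255 channel value to one of `levels` equal-width bins.
        ri, gi, bi = (c * levels // 256 for c in (r, g, b))
        counts[(ri * levels + gi) * levels + bi] += 1
    total = len(pixels)
    # Each entry is the fraction of pixels falling into that color bin.
    return [c / total for c in counts]


pixels = [(255, 0, 0)] * 3 + [(0, 0, 255)]
hist = color_histogram(pixels, levels=2)
print(hist)
```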


2-D color images – Color histograms

  • Usually, similar colors are clustered together and one representative color is chosen for each ‘color bin’

  • Most commercial CBIR systems include the color histogram as one of the features (e.g., QBIC of IBM)

  • No spatial information


Color histograms – distance

  • One method to measure the distance between two histograms x and y is:

    $d_h^2(x, y) = (x - y)^T A (x - y) = \sum_{i=1}^{h} \sum_{j=1}^{h} a_{ij} (x_i - y_i)(x_j - y_j)$

    where the color-to-color similarity matrix A has entries aij that describe the similarity between color i and color j
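
Assuming the common quadratic-form definition, d_h(x, y)² = (x − y)ᵀ A (x − y), the histogram distance can be sketched as follows. The similarity values in the example matrix are made up for illustration.

```python
import math


def histogram_distance(x, y, A):
    """Quadratic-form distance sqrt((x - y)^T A (x - y))."""
    z = [xi - yi for xi, yi in zip(x, y)]
    h = len(z)
    # Every pair of bins (i, j) contributes: this is the 'cross-talk'.
    total = sum(A[i][j] * z[i] * z[j] for i in range(h) for j in range(h))
    return math.sqrt(max(total, 0.0))  # guard against tiny negative round-off


# Three bins; bins 0 and 1 hold similar colors (assumed similarity 0.5).
A = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
x = [1.0, 0.0, 0.0]
y = [0.0, 1.0, 0.0]
print("with cross-talk   :", histogram_distance(x, y, A))
print("without (identity):", histogram_distance(x, y, identity))
```

With the identity matrix the measure reduces to the Euclidean distance; the off-diagonal entries make two images whose pixels fall in similar bins count as closer.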


Color histograms – lower bounding

  • Two obstacles for using color histograms as feature vectors in GEMINI:

    • ‘Dimensionality curse’ (h is large, e.g., 64 or 128)

    • The distance function is quadratic

      • It involves all cross terms (‘cross-talk’ among features)

        - expensive to compute

        - precludes the use of SAMs

[Figure: two histograms, x and q, over e.g. 64 color bins, with bins labeled bright red, pink, and orange]


Color histograms – lower bounding

  • 1st step: define the distance function between two color images, D() = dh()

  • 2nd step: find numerical features (one or more) whose Euclidean distance lower-bounds dh()

  • If we are allowed to use one numerical feature to describe the color image, what should it be?

    • The average amount of each color component (R, G, B)

    • where $R_{avg} = \frac{1}{P} \sum_{p=1}^{P} R(p)$, and similarly for G and B

      Here P is the number of pixels in the image and R(p) is the red component (intensity) of the p-th pixel
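
The average color of an image, and the Euclidean distance between two average colors, can be sketched as below. Pixels are assumed to be (R, G, B) tuples; the names `average_color` and `d_avg` are illustrative.

```python
import math


def average_color(pixels):
    """Average (R, G, B) vector of an image given as a list of pixels."""
    P = len(pixels)
    # Component c of the result is (1/P) * sum over all pixels of channel c.
    return tuple(sum(p[c] for p in pixels) / P for c in range(3))


def d_avg(pixels_a, pixels_b):
    """Euclidean distance between the average colors of two images."""
    xa, xb = average_color(pixels_a), average_color(pixels_b)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(xa, xb)))


image_a = [(255, 0, 0), (0, 0, 255)]
image_b = [(255, 0, 0), (255, 0, 0)]
print(average_color(image_a))
print(d_avg(image_a, image_b))
```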


Color histograms – lower bounding

  • Given the average color vectors x̄ and ȳ of two images, we define davg() as the Euclidean distance between the two 3-d average color vectors

  • 3rd step: prove that the feature distance davg() lower-bounds the actual distance dh()

  • Main idea of the approach:

    • First, a filtering step using the average (R, G, B) color,

    • then a more accurate matching using the full h-element histogram


Color auto-correlogram

  • Pick any pixel p1 of color Ci in the image I

  • At distance k away from p1, pick another pixel p2

  • What is the probability that p2 is also of color Ci?

[Figure: image I with a red pixel P1 and a pixel P2 at distance k from it; is P2 also red?]


Color auto-correlogram

  • The auto-correlogram of image I for color Ci and distance k:

    $\gamma_{C_i}^{(k)}(I) = \Pr[\, p_2 \in I_{C_i} \mid p_1 \in I_{C_i},\ |p_1 - p_2| = k \,]$

    where $I_{C_i}$ denotes the set of pixels of color Ci in image I

  • It integrates both color information and spatial information.

Implementations

  • Pixel distance measures

    • Use the D8 distance (also called chessboard distance): $D_8(p, q) = \max(|p_x - q_x|, |p_y - q_y|)$

    • Choose distances k = 1, 3, 5, 7

    • Computation complexity:

      • Histogram:

      • Correlogram:
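
A direct, unoptimized sketch of the auto-correlogram with the chessboard distance follows. The image is assumed to be a 2-D list of color indices, and the probability is estimated by counting, for every pixel of the given color, how many in-image pixels at D8 distance exactly k share that color. The function names are illustrative, and this is not presented as the implementation evaluated in the slides.

```python
def d8(p, q):
    """Chessboard (D8) distance between pixels p = (row, col) and q."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def auto_correlogram(image, color, k):
    """Estimate Pr[p2 has `color` | p1 has `color`, D8(p1, p2) = k]."""
    rows, cols = len(image), len(image[0])
    same = total = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != color:
                continue
            # Visit every in-image pixel at D8 distance exactly k from (r, c).
            for rr in range(max(0, r - k), min(rows, r + k + 1)):
                for cc in range(max(0, c - k), min(cols, c + k + 1)):
                    if d8((r, c), (rr, cc)) == k:
                        total += 1
                        same += image[rr][cc] == color
    return same / total if total else 0.0


checkerboard = [[0, 1], [1, 0]]
uniform = [[0, 0], [0, 0]]
print(auto_correlogram(checkerboard, 0, 1))
print(auto_correlogram(uniform, 0, 1))
```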


Implementations

  • Feature distance measures:

    • If D( f(I1) - f(I2) ) is small, then I1 and I2 are similar.

    • Example: f(a)=1000, f(a’)=1050; f(b)=100, f(b’)=150

    • For histogram:

    • For correlogram:


Color Histogram vs Correlogram

  • If there is no difference between the query and the target images, both methods perform well.

[Figure: a query image (512 colors) with the five top-ranked results (1st to 5th) of the correlogram method and of the histogram method]


Color Histogram vs Correlogram

  • The correlogram method is more stable under color change than the histogram method.

[Figure: query and target images; the target is ranked 1st by the correlogram method and 48th by the histogram method]


Color Histogram vs Correlogram

  • The correlogram method is more stable under large appearance change than the histogram method

[Figure: query and target images; the target is ranked 1st by the correlogram method and 31st by the histogram method]


Color Histogram vs Correlogram

  • The correlogram method is more stable under contrast and brightness change than the histogram method.

[Figure: a target image and four queries (Query 1 to Query 4); ranks of the target under the correlogram (C) and histogram (H) methods, as listed in the figure: C: 178th, H: 230th; C: 1st, H: 1st; C: 1st, H: 3rd; C: 5th, H: 18th]


Color Histogram vs Correlogram

  • The color correlogram describes the global distribution of local spatial correlations of colors.

  • It is easy to compute

  • It is more stable than the color histogram method


Multimedia Indexing – Conclusions

  • GEMINI is a popular method

  • Whole matching problem

  • Should pay attention to:

    • Distance functions

    • Feature extraction functions

    • Lower bounding

    • Particular application

  • Sub-pattern matching?

