## Multimedia DBs

**PAA and APCA**

- Another approach: segment the time series into equal-length parts and store the average value of each part.
- Use an index to store the averages and the segment end points.

**Feature Spaces**

[Figure: a time series X and its approximation X' in three feature spaces: SVD (eigenwaves 0-7; Korn, Jagadish, Faloutsos 1997), DFT (Agrawal, Faloutsos, Swami 1993), and DWT (Haar 0-7; Chan & Fu 1999).]

**Piecewise Aggregate Approximation (PAA)**

- Original time series (n-dimensional vector): $S = \{s_1, s_2, \dots, s_n\}$
- n'-segment PAA representation (n'-dimensional vector): $S = \{sv_1, sv_2, \dots, sv_{n'}\}$, where $sv_i$ is the average value of the i-th segment.
- The PAA representation satisfies the lower bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos 2000).

[Figure: a time series approximated by eight equal-length segments with values sv1 to sv8; value axis vs. time axis.]

**Adaptive Piecewise Constant Approximation (APCA)**

- Can we improve upon PAA?
- n'/2-segment APCA representation (n'-dimensional vector): $S = \{sv_1, sr_1, sv_2, sr_2, \dots, sv_M, sr_M\}$, where $M = n'/2$ is the number of segments, $sv_i$ is the average value of segment i, and $sr_i$ is its right end point.

[Figure: the same time series approximated by four variable-length segments with values sv1 to sv4 and right end points sr1 to sr4.]
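The PAA definition and its lower-bounding property can be sketched in a few lines. This is a minimal illustration, assuming the series length is a multiple of the number of segments. The function names `paa` and `paa_lower_bound` and the sample series are made up for this example and do not come from the slides. The bound used is the usual PAA one: the Euclidean distance between the two PAA vectors, scaled by the square root of the segment length.

```python
import math


def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment.

    Simplifying assumption: len(series) is a multiple of n_segments.
    """
    n = len(series)
    if n % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")
    width = n // n_segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(n_segments)]


def euclidean(q, s):
    """Exact Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)))


def paa_lower_bound(q_paa, s_paa, n):
    """Lower-bounding distance computed from two PAA vectors.

    n is the length of the original series; each segment covers n / n' points.
    """
    width = n / len(q_paa)
    return math.sqrt(width * sum((a - b) ** 2 for a, b in zip(q_paa, s_paa)))


# Illustrative data only.
s = [1.0, 3.0, 2.0, 4.0, 8.0, 6.0, 5.0, 7.0]
q = [2.0, 2.0, 5.0, 1.0, 6.0, 6.0, 9.0, 7.0]

s_paa = paa(s, 4)  # [2.0, 3.0, 7.0, 6.0]
q_paa = paa(q, 4)  # [2.0, 3.0, 6.0, 8.0]

d_exact = euclidean(q, s)
d_lb = paa_lower_bound(q_paa, s_paa, len(s))
assert d_lb <= d_exact  # the feature-space distance never overestimates
```

Because the feature-space distance never exceeds the exact distance, a range or nearest-neighbor search that filters on it cannot produce false dismissals; candidates are then verified against the full series.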
**Reconstruction error: PAA vs. APCA**

- APCA approximates the original signal better than PAA.
- Improvement factor for the plotted examples: 3.77, 1.69, 1.21, 1.03, 3.02, 1.75.

**APCA representation can be computed efficiently**

- A near-optimal representation can be computed in $O(n \log n)$ time.
- The optimal representation can be computed in $O(n^2 M)$ time (Koudas et al.).

**Distance Measure**

- Exact (Euclidean) distance: $D(Q,S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}$
- Lower bounding distance: $D_{LB}(Q',S) \le D(Q,S)$, where $Q'$ is the approximation of the query $Q$.

**Index on 2M-dimensional APCA space**

- Each APCA representation is a point in the 2M-dimensional APCA space.
- Any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree).

[Figure: data points S1 to S9 grouped into MBRs R1 to R4, and the corresponding index tree.]

**k-nearest neighbor algorithm**

- For any node U of the index structure with MBR R, $\mathrm{MINDIST}(Q,R) \le D(Q,S)$ for any data item S under U.

[Figure: query Q with MINDIST(Q,R2), MINDIST(Q,R3), and MINDIST(Q,R4).]

**Index Modification for MINDIST Computation**

- APCA point: $S = \{sv_1, sr_1, sv_2, sr_2, \dots, sv_M, sr_M\}$
- APCA rectangle: $S = (L, H)$, where $L = \{smin_1, sr_1, smin_2, sr_2, \dots, smin_M, sr_M\}$ and $H = \{smax_1, sr_1, smax_2, sr_2, \dots, smax_M, sr_M\}$; $smin_i$ and $smax_i$ are the minimum and maximum values of S within segment i.

**MBR Representation in time-value space**

- We can view the MBR $R = (L, H)$ of any node U as two APCA representations, $L = \{l_1, l_2, \dots, l_{N-1}, l_N\}$ and $H = \{h_1, h_2, \dots, h_{N-1}, h_N\}$, with $N = 2M$.

[Figure: an MBR with $L = \{l_1, \dots, l_6\}$ and $H = \{h_1, \dots, h_6\}$ drawn as Regions 1 to 3; value axis vs. time axis.]

**Regions**

- M regions are associated with each MBR. The i-th region spans values from $l_{2i-1}$ to $h_{2i-1}$ and time instants from $l_{2i-2}+1$ to $h_{2i}$.
- The i-th region is active at time instant t if it spans across t.
- The value $s_t$ of any time series S under node U at time instant t must lie in one of the regions active at t (Lemma 2).

**MINDIST Computation**

- $\mathrm{MINDIST}(Q,R) = \sqrt{\sum_{t=1}^{n} \mathrm{MINDIST}(Q,R,t)}$
- For time instant t, $\mathrm{MINDIST}(Q,R,t) = \min_{\text{region } G \text{ active at } t} \mathrm{MINDIST}(Q,G,t)$.
- Example from the figure, where Regions 1 and 2 are active at $t_1$: $\mathrm{MINDIST}(Q,R,t_1) = \min(\mathrm{MINDIST}(Q,\text{Region 1},t_1), \mathrm{MINDIST}(Q,\text{Region 2},t_1)) = \min((q_{t_1}-h_1)^2, (q_{t_1}-h_3)^2) = (q_{t_1}-h_1)^2$
- Lemma 3: $\mathrm{MINDIST}(Q,R) \le D(Q,C)$ for any time series C under node U.

**Approximate Search**

- A simpler definition of the distance in the feature space is possible. [Equation not preserved.]
- But there is one problem… what?

**Multimedia dbs**

- A multimedia database also stores images.
- Again, similarity queries (content-based retrieval).
- Extract features, index in feature space, and answer similarity queries using GEMINI.
- Again, average values help!

**Images - color**

- What is an image? A: a 2-d array.
- Color histograms, and a distance function between them.
- Mathematically, the distance function is: [equation not preserved]
- Problem: 'cross-talk'. Features are not orthogonal -> SAMs (spatial access methods) will not work properly.
- Q: what to do? A: it is a feature-extraction question.
- Possible answer: avg red, avg green, avg blue. It turns out that this lower-bounds the histogram distance -> no cross-talk, and SAMs are applicable.
- Time performance: [Figure: time vs. selectivity, sequential scan compared with avg RGB.]

**Images - shapes**

- Distance function: Euclidean, on the area, perimeter, and 20 'moments'.
- Q: how to normalize them? A: divide by standard deviation.
- Q: other 'features' / distance functions? A1: turning angle. A2: dilations/erosions. A3: ...
- Q: how to do dim. reduction? A: Karhunen-Loeve (= centered PCA/SVD).
- Performance: ~10x faster. [Figure: log(# of I/Os) vs. # of features kept, compared with all features kept.]
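The normalization answer for shape features (divide each feature by its standard deviation) can be sketched as follows. This is only an illustration under stated assumptions: the two-feature vectors (area, perimeter) and their values are invented, the helper name `scale_by_std` is hypothetical, and the population standard deviation is used because the slides do not say which variant is meant.

```python
import math
from statistics import pstdev


def scale_by_std(vectors):
    """Divide every feature (column) by its standard deviation over the collection.

    Assumption: population standard deviation; a constant feature is left unscaled.
    """
    columns = list(zip(*vectors))
    stds = [pstdev(col) or 1.0 for col in columns]
    scaled = [[x / s for x, s in zip(vec, stds)] for vec in vectors]
    return scaled, stds


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical (area, perimeter) features for three shapes; values are made up.
shapes = [[100.0, 4.0], [300.0, 6.0], [200.0, 8.0]]
scaled, stds = scale_by_std(shapes)

raw_distance = euclidean(shapes[0], shapes[1])     # dominated by the area feature
scaled_distance = euclidean(scaled[0], scaled[1])  # both features now contribute
```

Without scaling, the feature with the largest numeric range (area in this toy data) dominates the Euclidean distance; after scaling, every feature has unit spread, so the perimeter and the moments can influence the ranking as well.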

