Probabilistic similarity query on dimension incomplete data

Wei Cheng (1), Xiaoming Jin (1), and Jian-Tao Sun (2). (1) Intelligent Data Engineering Group, School of Software, Tsinghua University; (2) Microsoft Research Asia. ICDM 2009, Miami.


Outline

  • Motivation & Problem

  • Our Solution

  • Experiments

  • Related Work

  • Summary and Future Work


Motivation

  • Multidimensional data are everywhere

    • Time series

      • stock data

      • data collected from sensor monitors

    • Feature vectors extracted from images or texts

    • ……

  • Similarity query on multidimensional data is important

    • data mining

    • database

    • information retrieval


Similarity query is challenging when the data is incomplete

  • Data incompleteness happens when:

    • Sensors do not work properly

    • Certain features are missing from particular feature vectors

    • …….

To process a similarity query on such data, imputation is necessary, i.e. “completing” the missing data by filling in specific values.

[Figure: a query compared against sensor data, a text vector, and an image vector, each containing missing elements]


Dimension incomplete data

  • Dimension incomplete data satisfies:

    • (a) At least one of its data elements is missing;

    • (b) The dimension of the missing data element cannot be determined.

    • E.g.

      • Observed data: (3, 6)

      • But we know the complete data should have three dimensions

      • The missing element might be in the first, second, or third dimension.


Causes of dimension incompleteness

  • Dimension incompleteness happens when:

    • Data elements go missing while their order serves as the implicit dimension indicator

    • The dimension indicator itself is lost

    • ……


Similarity query is more challenging when the dimension is incomplete

  • To measure the similarity between a query and a dimension incomplete data object, we should first recover the incomplete data.

  • Enumerating all combination cases? Too time-consuming.

    • E.g. $X_{obs} = (3, 6)$ has lost one dimension, so there are 3 possible results after data recovery, where X is the imputed element: (X, 3, 6), (3, X, 6), (3, 6, X).

    • For an m-dimensional data object which has n elements missing, there are $C_m^n$ cases to recover it.
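The enumeration described on this slide can be sketched in a few lines of Python. This is only an illustration of the counting argument; the function name `recovery_patterns` and the `"X"` placeholder are assumptions made for the example, not part of the original slides.

```python
from itertools import combinations
from math import comb


def recovery_patterns(observed, m, placeholder="X"):
    """Yield every way to place the observed elements, in order, into m slots.

    Slots that receive no observed element hold the placeholder, which
    stands for an element that still has to be imputed.
    """
    k = len(observed)
    for positions in combinations(range(m), k):
        candidate = [placeholder] * m
        for value, pos in zip(observed, positions):
            candidate[pos] = value
        yield tuple(candidate)


# The slide's example: observed data (3, 6), complete data has 3 dimensions.
patterns = list(recovery_patterns((3, 6), 3))
for p in patterns:
    print(p)
print(len(patterns), "patterns; C(3, 1) =", comb(3, 1))
```

The number of patterns grows combinatorially with the number of missing elements, which is why enumerating them is too costly in general.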


Problem statement


Two assumptions

  • Each recovery result is equally probable.

  • The missing values follow a normal distribution.


Efficient approach for PSQ-DID

  • A gradual refinement search strategy including two pruning methods:

    • Lower/upper bounds of confidence

    • Probability triangle inequality

  • Our Overall Query Process


Lower and upper bounds of confidence

  • The missing part and the observed part of the dimension incomplete data are treated separately. Since we use Euclidean distance, we have:

    • Lower/upper bounds of the observed part, denoted by $\delta_{LBobs}$ and $\delta_{UBobs}$.

    • Lower/upper bounds of the missing part, denoted by $\delta_{LBmis}$ and $\delta_{UBmis}$.
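The reason the two parts can be treated separately is that squared Euclidean distance is a sum over dimensions. A minimal way to write this for one fixed recovery version $X_{rv}$ is given here; the index sets $P_{obs}$ and $P_{mis}$ (positions holding observed and imputed elements) are notation assumed for this sketch, not taken from the slides.

```latex
\delta^2(Q, X_{rv})
  = \sum_{i \in P_{obs}} (q_i - x_i)^2
  + \sum_{j \in P_{mis}} (q_j - x_j)^2,
\qquad P_{obs} \cup P_{mis} = \{1, \dots, |Q|\},
\quad  P_{obs} \cap P_{mis} = \emptyset
```

The first sum depends only on where the observed elements are placed, and the second only on the imputed random variables.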


E.g.

  • $X_{obs} = (2, 8, 7)$, $Q = (1, 4, 5, 6, 7)$

  • $\delta^2_{LBobs}(Q, X_{obs}) = (2-1)^2 + (8-6)^2 + (7-7)^2 = 5$, corresponding to the recovery version $(2, x_1, x_2, 8, 7)$.

  • For the imputed random variables $X_{mis} = \{x_1, x_2\}$, if the imputation policy uses the mean of the two adjacent observed elements as the expectation of the imputed random variables, then $\delta^2_{LBmis}(Q, X_{mis}) = (4-x_1)^2 + (5-x_2)^2$ with $E(x_1) = E(x_2) = 5$, corresponding to $X_{rv} = (2, 5, 5, 8, 7)$.
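The numbers in this example can be checked by brute force. The sketch below assumes that the observed-part lower and upper bounds are the minimum and maximum squared distance over all order-preserving alignments of the observed elements to the query. That reading is an assumption, and the exhaustive search is for illustration only; it is not the efficient computation the authors propose. The function name `observed_part_bounds` is hypothetical.

```python
from itertools import combinations


def observed_part_bounds(query, observed):
    """Return ((min_d2, positions), (max_d2, positions)) for the observed part.

    Each alignment assigns the observed elements, in order, to a strictly
    increasing tuple of query positions (0-based).
    """
    lowest = highest = None
    for positions in combinations(range(len(query)), len(observed)):
        d2 = sum((x - query[p]) ** 2 for x, p in zip(observed, positions))
        if lowest is None or d2 < lowest[0]:
            lowest = (d2, positions)
        if highest is None or d2 > highest[0]:
            highest = (d2, positions)
    return lowest, highest


lo, hi = observed_part_bounds((1, 4, 5, 6, 7), (2, 8, 7))
print(lo)  # (5, (0, 3, 4)): observed elements sit in dimensions 1, 4, 5
print(hi)  # (21, (0, 1, 2)): observed elements sit in dimensions 1, 2, 3
```

The minimum of 5 is reached when the observed elements occupy the first, fourth, and fifth dimensions, which leaves the second and third dimensions to be imputed.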


Lower and upper bounds of confidence

  • We prove that

Denoted by: ,


Probability triangle inequality

  • Given a query $Q$ and a multidimensional data object $R$ with $|Q| = |R|$, and a dimension incomplete data object $X_{obs}$ whose underlying complete version is $X$, we have:

    • (1)

    • (2)

Calculated in advance and stored in the database: $O(|X_{obs}|(|Q|-|X_{obs}|)^2)$

Calculated during query processing: $O(|Q|)$
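For orientation, the classical (deterministic) triangle inequality already shows how a reference object $R$ splits the work into a precomputed part and a query-time part. The sketch below is that deterministic analogue only, applied to complete vectors and a range query with radius r. It is not the probabilistic inequality from the slides, and the names `euclidean` and `triangle_prune` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triangle_prune(d_qr, d_rx, r):
    """Decide a range query 'distance(Q, X) <= r' without touching X.

    d_qr: distance(Q, R), computed once per query.
    d_rx: distance(R, X), computed in advance and stored.
    """
    lower = abs(d_qr - d_rx)  # distance(Q, X) can never be smaller
    upper = d_qr + d_rx       # distance(Q, X) can never be larger
    if lower > r:
        return "dismissal"
    if upper <= r:
        return "result"
    return "undecided"


d_qr = euclidean((0.0, 0.0), (3.0, 4.0))  # 5.0
print(triangle_prune(d_qr, 1.0, 3.0))  # dismissal
print(triangle_prune(d_qr, 1.0, 6.0))  # result
print(triangle_prune(d_qr, 1.0, 5.0))  # undecided
```

Objects left undecided still need a more expensive check, which matches the gradual refinement strategy outlined earlier.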


Experiments

  • Data sets:

    • Standard and Poor 500 index historical stock data (S&P500), 251 dimensions

      • A new data set with 30 dimensions, built by segmenting the S&P500 data set, resulting in 4328 data objects.

    • Corel Color Histogram data (IMAGE)

      • 68040 images

      • 32 dimensions

  • Dimension incomplete data set:

    • built by randomly removing some dimensions of each data object.
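A dimension incomplete object can be simulated by deleting randomly chosen positions and keeping only the surviving values in order, so the positions themselves are lost. The sketch below assumes a fixed missing ratio per object and at least one removed element; the slides do not say how many dimensions were removed from each object, and `drop_dimensions` is a hypothetical helper.

```python
import random


def drop_dimensions(vector, missing_ratio, rng):
    """Return the vector with a random subset of its elements removed.

    The surviving elements keep their relative order, but their original
    positions are not recorded, which is what makes the result
    dimension incomplete.
    """
    m = len(vector)
    n_missing = max(1, round(m * missing_ratio))  # assumption: at least one
    keep = sorted(rng.sample(range(m), m - n_missing))
    return [vector[i] for i in keep]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
complete = list(range(30))  # stand-in for one 30-dimensional object
observed = drop_dimensions(complete, 0.1, rng)
print(len(observed))  # 27 elements remain
```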


Experiment Setup

  • Ground truth:

    • Similarity query results on the complete data

  • Performance measures

    • Precision, recall, pruning power

  • Pruning power=Ndefinite/Nprocessed

    • Nprocessed : number of all data objects

    • Ndefinite: number of data objects judged as dismissals or search results by the pruner.

  • Query: 100 data objects randomly sampled from the data set
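The three performance measures listed above can be computed as follows. This is a minimal sketch: precision and recall use their standard set-based definitions against the ground truth, pruning power follows the ratio given on the slide, and the function names and the zero-for-empty convention are assumptions.

```python
def precision_recall(returned, ground_truth):
    """Precision and recall of returned object ids against the ground truth."""
    returned, ground_truth = set(returned), set(ground_truth)
    hits = len(returned & ground_truth)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


def pruning_power(n_definite, n_processed):
    """Fraction of objects the pruner settles as dismissals or results."""
    return n_definite / n_processed


p, rec = precision_recall([1, 2, 3, 4], [2, 3, 5])
print(p, rec)                  # 2 of 4 returned are correct, 2 of 3 found
print(pruning_power(80, 100))  # 0.8
```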


Effectiveness of probabilistic similarity query on dimension incomplete data

Query precision on S&P500 data set

Query recall on S&P500 data set


Effectiveness of probabilistic similarity query on dimension incomplete data

Query precision on IMAGE data set

Query recall on IMAGE data set


Effect of the confidence threshold

  • Missing ratio=0.1; r=60 for S&P500, r=0.7 for IMAGE data

Confidence threshold vs precision-recall


Effectiveness of different pruners

Pruning power of probability triangle inequality


Pruning Power of Four Pruners

  • Pruner1: probability triangle inequality using the confidence lower bound

  • Pruner2: probability triangle inequality using the confidence upper bound

  • Pruner3: confidence lower bound

  • Pruner4: confidence upper bound

  • Missing ratio=10%, c=0.1, number of assistant objects=20

Pruning power of four pruners


Comparison of query quality when neglecting naïve verification

  • For data objects that the four pruners cannot judge, Pos simply outputs them as query results; Neg, by contrast, judges them as dismissals.

  • c=0.1

Comparison of query quality


Performance analysis

Time cost


Related Work

  • Few research papers discuss similarity search on dimension incomplete data

  • Incomplete data

    • Recovery

      • D. Williams et al. [ICML’05], K. Lakshminarayan et al. [Applied Intelligence’99],…

    • Indexing

      • G. Canahuate et al. [EDBT’06], B. C. Ooi et al. [VLDB’98],…

  • Uncertain data

    • J. Pei et al. [SIGMOD’08], D. Burdick et al. [VLDB’05],…

  • Dimension incomplete data

    • Symbolic sequences

      • J. Gu et al. [DEXA’07]


Summary and Future Work

  • Problem:

    • Tackle similarity query on a new form of uncertain data (dimension incomplete data)

  • Solution:

    • Lower and upper bounds of confidence

      • So that we can avoid enumerating all $C_{|Q|}^{|X_{mis}|}$ recovery cases

    • Probability triangle inequality

      • Further boost the performance in query processing procedure

  • Future work

    • Other similarity measurements

    • Index dimension incomplete data


Many thanks!