Probabilistic Similarity Query on Dimension Incomplete Data

### Probabilistic Similarity Query on Dimension Incomplete Data

Wei Cheng¹, Xiaoming Jin¹, and Jian-Tao Sun²

¹ Intelligent Data Engineering Group, School of Software, Tsinghua University

² Microsoft Research Asia

ICDM 2009, Miami

Outline

- Motivation & Problem
- Our Solution
- Experiments
- Related Work
- Summary and Future Work

Motivation

- Multidimensional data are everywhere
  - Time series
    - Stock data
    - Data collected from sensor monitors
  - Feature vectors extracted from images or texts
  - …
- Similarity query on multidimensional data is important
  - Data mining
  - Databases
  - Information retrieval

Similarity query is challenging when the data is incomplete

- Data incompleteness happens when:
  - Sensors do not work properly
  - Certain features are missing from particular feature vectors
  - …

To process a similarity query, imputation is necessary, i.e. "completing" the missing data by filling in specific values.

[Figure: a query compared against sensor data, text vectors, and image vectors, each containing a missing element (marked X, Y, and Z).]

Dimension incomplete data

- Dimension incomplete data satisfies:
  - (a) At least one of its data elements is missing;
  - (b) The dimension of the missing data element cannot be determined.
- E.g.
  - Observed data: (3, 6)
  - We know the complete data should have three dimensions.
  - The missing element might be on the first, second, or third dimension.

Causes of dimension incompleteness

- Dimension incompleteness happens when:
  - Data elements are lost while their order serves as the implicit dimension indicator
  - The dimension indicator itself may also be lost
  - …

Similarity query is more challenging when the dimension is incomplete

- To measure the similarity between a query and a dimension incomplete data object, we should first recover the incomplete data.
- Enumerating all recovery cases? Too time-consuming.
  - For an m-dimensional data object with n missing elements, there are $C_m^n = \binom{m}{n}$ ways to recover it.
- E.g. $X_{obs} = (3, 6)$, which has lost one dimension, has 3 possible results after data recovery, where X is the imputed element:
  - (X, 3, 6)
  - (3, X, 6)
  - (3, 6, X)
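The number of recovery cases for an observed object can be checked by enumerating them directly. The following is a minimal sketch, not the authors' code: the function name `recovery_cases` and the `"X"` placeholder for an imputed element are illustrative assumptions.

```python
from itertools import combinations
from math import comb

def recovery_cases(x_obs, m, placeholder="X"):
    """Yield every way to place the observed elements, in order, into m dimensions."""
    for positions in combinations(range(m), len(x_obs)):
        case = [placeholder] * m              # start with every dimension missing
        for pos, value in zip(positions, x_obs):
            case[pos] = value                 # observed elements keep their relative order
        yield tuple(case)

cases = list(recovery_cases((3, 6), m=3))
print(cases)                      # the candidate recoveries of the observed data (3, 6)
print(len(cases) == comb(3, 1))   # one missing element out of three dimensions
```

Because the count grows combinatorially with the number of missing elements, enumeration like this is only practical for small examples.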

Two assumptions:

- Every recovery result is equally probable.
- The missing values follow a normal distribution.
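Under these two assumptions, the probability that a dimension incomplete object lies within a distance r of the query can be approximated by sampling. The sketch below is a naive Monte Carlo baseline, not the authors' method. It assumes the confidence of an object means Pr[δ(Q, X) ≤ r], that each missing value has mean equal to the average of its nearest observed neighbours, and that the standard deviation `sigma` is a free parameter. The names `neighbour_mean` and `estimate_confidence` are hypothetical.

```python
import random
from itertools import combinations

def neighbour_mean(observed, p):
    """Mean of the nearest observed elements to the left and right of position p."""
    left = next((observed[i] for i in range(p - 1, -1, -1) if observed[i] is not None), None)
    right = next((observed[i] for i in range(p + 1, len(observed)) if observed[i] is not None), None)
    neighbours = [v for v in (left, right) if v is not None]
    return sum(neighbours) / len(neighbours)

def estimate_confidence(q, x_obs, r, sigma=1.0, samples=2000, seed=0):
    """Monte Carlo estimate of Pr[distance(q, recovered x) <= r]; x_obs must be non-empty."""
    rng = random.Random(seed)
    m = len(q)
    alignments = list(combinations(range(m), len(x_obs)))
    hits = 0
    for _ in range(samples):
        positions = rng.choice(alignments)    # assumption 1: recovery cases are equally likely
        observed = [None] * m
        for pos, value in zip(positions, x_obs):
            observed[pos] = value
        recovered = [
            # assumption 2: each missing value is drawn from a normal distribution
            v if v is not None else rng.gauss(neighbour_mean(observed, p), sigma)
            for p, v in enumerate(observed)
        ]
        dist_sq = sum((a - b) ** 2 for a, b in zip(q, recovered))
        hits += dist_sq <= r * r
    return hits / samples

print(estimate_confidence((1, 4, 5, 6, 7), (2, 8, 7), r=5.0))
```

Sampling of this kind still touches every object, which is the cost the pruning methods in the following slides aim to avoid.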

Efficient approach for PSQ-DID

- A gradual refinement search strategy with two pruning methods:
  - Lower/upper bounds of confidence
  - Probability triangle inequality
- Our overall query process

Lower and upper bounds of confidence

- The missing part and the observed part of the dimension incomplete data are treated separately. Because we use Euclidean distance, the squared distance is a sum over dimensions, so each part can be bounded on its own:
  - Lower/upper bounds of the observed part, denoted by $\delta_{LB_{obs}}$ and $\delta_{UB_{obs}}$.
  - Lower/upper bounds of the missing part, denoted by $\delta_{LB_{mis}}$ and $\delta_{UB_{mis}}$.

E.g.

- $X_{obs} = (2, 8, 7)$, $Q = (1, 4, 5, 6, 7)$
- $\delta^2_{LB_{obs}}(Q, X_{obs}) = (2-1)^2 + (8-6)^2 + (7-7)^2 = 5$, corresponding to the recovery version $(2, x_1, x_2, 8, 7)$.
- For the imputed random variables $X_{mis} = \{x_1, x_2\}$, if the imputation policy uses the mean of the two adjacent observed elements as the expectation of each imputed random variable, then $\delta^2_{LB_{mis}}(Q, X_{mis}) = (4-x_1)^2 + (5-x_2)^2$ with $E(x_1) = E(x_2) = (2+8)/2 = 5$, corresponding to $X_{rv} = (2, x_1, x_2, 8, 7)$.
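The observed-part bound in the example can be reproduced without enumerating every alignment. The sketch below uses dynamic programming, assuming the lower bound is the minimum squared distance over all order-preserving placements of the observed elements into the query's dimensions, and the upper bound is the corresponding maximum. The slides only show the lower-bound example, so the upper-bound reading is an assumption, and the name `observed_bounds` is hypothetical.

```python
def observed_bounds(q, x_obs):
    """Return (min, max) squared distance of x_obs to q over order-preserving alignments."""
    m, k = len(q), len(x_obs)
    inf = float("inf")
    # lo[j] / hi[j]: best cost of placing the first i observed elements
    # somewhere within the first j query dimensions (i = 0 costs nothing).
    lo = [0] * (m + 1)
    hi = [0] * (m + 1)
    for i in range(1, k + 1):
        new_lo = [inf] * (m + 1)
        new_hi = [-inf] * (m + 1)
        for j in range(i, m + 1):
            cost = (x_obs[i - 1] - q[j - 1]) ** 2
            # Either leave query dimension j unmatched, or match x_obs[i-1] to it.
            new_lo[j] = min(new_lo[j - 1], lo[j - 1] + cost)
            new_hi[j] = max(new_hi[j - 1], hi[j - 1] + cost)
        lo, hi = new_lo, new_hi
    return lo[m], hi[m]

lower, upper = observed_bounds((1, 4, 5, 6, 7), (2, 8, 7))
print(lower, upper)  # 5 21
```

The table has |X_obs| rows and |Q| + 1 columns, so the cost is proportional to |X_obs| · |Q| rather than to the number of recovery cases.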

Probability triangle inequality

- Given a query Q and a multidimensional data object R with |Q| = |R|, for a dimension incomplete data object $X_{obs}$ whose underlying complete version is X, we have:
  - (1)
  - (2)

Calculated in advance and stored in the database: $O(|X_{obs}|(|Q|-|X_{obs}|)^2)$

Calculated during query processing: $O(|Q|)$
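Inequalities (1) and (2) did not survive in this transcript, so they are not reproduced here. As background only, the sketch below shows the classical, deterministic triangle inequality pruning on complete vectors, where the distance between a stored object X and a reference object R is precomputed and only the distance between Q and R is computed at query time. It is not the probabilistic version from the slides, and the names `euclidean` and `triangle_prune` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triangle_prune(d_qr, d_rx, r):
    """Decide on X from d(Q, R), d(R, X), and the query range r without computing d(Q, X)."""
    lower = abs(d_qr - d_rx)   # d(Q, X) >= |d(Q, R) - d(R, X)|
    upper = d_qr + d_rx        # d(Q, X) <= d(Q, R) + d(R, X)
    if lower > r:
        return "dismissal"     # X cannot be within r of Q
    if upper <= r:
        return "result"        # X must be within r of Q
    return "undecided"         # needs further verification

q, ref = (0.0, 0.0), (10.0, 0.0)
x = (10.0, 1.0)
print(triangle_prune(euclidean(q, ref), euclidean(ref, x), r=5.0))  # dismissal
```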

Experiments

- Data sets:
  - Standard & Poor's 500 index historical stock data (S&P500), 251 dimensions
    - A new data set with 30 dimensions is obtained by segmenting the S&P500 data set, resulting in 4328 data objects.
  - Corel Color Histogram data (IMAGE)
    - 68040 images
    - 32 dimensions
- Dimension incomplete data set:
  - Generated by randomly removing some dimensions of each data object.
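One way to build such a dimension incomplete data set is sketched below. The slides do not say how the number of removed dimensions is chosen, so this assumes it is the missing ratio times the dimensionality, rounded to the nearest integer. The function name `drop_dimensions` is hypothetical.

```python
import random

def drop_dimensions(x, missing_ratio, rng):
    """Remove a random subset of dimensions, keeping the survivors in their original order."""
    n_missing = round(len(x) * missing_ratio)   # assumed rule for how many elements to drop
    keep = sorted(rng.sample(range(len(x)), len(x) - n_missing))
    return [x[i] for i in keep]                 # dimension indices are discarded on purpose

rng = random.Random(42)
complete = list(range(30))                      # stand-in for one 30-dimensional object
incomplete = drop_dimensions(complete, missing_ratio=0.1, rng=rng)
print(len(incomplete))                          # 27
```

Only the surviving values are returned, so the result carries no information about which dimensions were removed, which is what makes the data dimension incomplete.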

Experiment Setup

- Ground truth:
  - Similarity query results on the complete data
- Performance measures:
  - Precision, recall, pruning power
  - Pruning power = $N_{definite} / N_{processed}$
    - $N_{processed}$: number of all data objects
    - $N_{definite}$: number of data objects judged as dismissals or search results by the pruner
- Query: 100 data objects randomly sampled from the data set
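The three measures can be computed as in the following sketch. The function names are illustrative, and returning 0.0 when a set is empty is a convention chosen here, not something the slides specify.

```python
def precision_recall(returned, ground_truth):
    """Precision and recall of a returned result set against the complete-data results."""
    returned, ground_truth = set(returned), set(ground_truth)
    true_positives = len(returned & ground_truth)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall

def pruning_power(n_definite, n_processed):
    """Fraction of processed objects that the pruner settled as results or dismissals."""
    return n_definite / n_processed

print(precision_recall({1, 2, 3, 4}, {2, 3, 5}))  # precision 0.5, recall 2/3
print(pruning_power(75, 100))                     # 0.75
```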

Effectiveness of probabilistic similarity query on dimension incomplete data

Query precision on S&P500 data set

Query recall on S&P500 data set

Query precision on IMAGE data set

Query recall on IMAGE data set

Effect of the confidence threshold

- Missing ratio=0.1; r=60 for S&P500, r=0.7 for IMAGE data

Confidence threshold vs precision-recall

Effectiveness of different pruners

Pruning power of probability triangle inequality

Pruning Power of Four Pruners

- Pruner1: probability triangle inequality using the confidence lower bound
- Pruner2: probability triangle inequality using the confidence upper bound
- Pruner3: confidence lower bound
- Pruner4: confidence upper bound
- Missing ratio = 10%, c = 0.1, number of assistant objects = 20

Pruning power of four pruners

Comparison of query quality when neglecting naïve verification

- For data objects that the four pruners cannot judge, Pos simply outputs them as query results; Neg, by contrast, judges them as dismissals.
- c = 0.1
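The two policies differ only in how they treat objects left undecided by the pruners, as the following sketch illustrates. The verdict labels and the function name `resolve_undecided` are assumptions made for illustration.

```python
def resolve_undecided(verdicts, policy):
    """Return the final result set given pruner verdicts and a 'Pos' or 'Neg' policy.

    verdicts maps an object id to 'result', 'dismissal', or 'undecided'.
    """
    if policy not in ("Pos", "Neg"):
        raise ValueError("policy must be 'Pos' or 'Neg'")
    fallback = "result" if policy == "Pos" else "dismissal"
    return {
        obj for obj, verdict in verdicts.items()
        if (fallback if verdict == "undecided" else verdict) == "result"
    }

verdicts = {"a": "result", "b": "undecided", "c": "dismissal"}
print(sorted(resolve_undecided(verdicts, "Pos")))  # ['a', 'b']
print(sorted(resolve_undecided(verdicts, "Neg")))  # ['a']
```

Pos can only add objects to the result set and Neg can only withhold them, so Pos tends to favour recall and Neg tends to favour precision.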

Comparison of query quality

Performance analysis

Time cost

Related Work

- Few research papers discuss similarity search on dimension incomplete data
- Incomplete data
  - Recovery
    - D. Williams et al. [ICML’05], K. Lakshminarayan et al. [Applied Intelligence’99], …
  - Indexing
    - G. Canahuate et al. [EDBT’06], B. C. Ooi et al. [VLDB’98], …
- Uncertain data
  - J. Pei et al. [SIGMOD’08], D. Burdick et al. [VLDB’05], …
- Dimension incomplete data
  - Symbolic sequences
    - J. Gu et al. [DEXA’07]

Summary and Future Work

- Problem:
  - Tackle similarity queries on a new form of uncertain data (dimension incomplete data)
- Solution:
  - Lower and upper bounds of confidence
    - Avoid enumerating all $\binom{|Q|}{|X_{mis}|}$ recovery cases
  - Probability triangle inequality
    - Further boosts performance during query processing
- Future work
  - Other similarity measurements
  - Indexing dimension incomplete data
