Efficiently Evaluating Order Preserving Similarity Queries over Historical Market-Basket Data

Reza Sherkat and Davood Rafiei

Department of Computing Science

University of Alberta

Canada

Travel assistance provided by the Mary Louise Imrie Graduate Student Award

ICDE06


Overview

  • Introduction

    • Histories and Time-series

    • Similarity model for histories

  • Problem Definition

  • Proposed Approach

  • Results Highlight

  • Conclusions

Querying Histories: Introduction

  • Querying multiple snapshots of data

    • Temporal selection, projection, and join queries

  • Finding similar time-series

    • Finding companies having similar stocks

  • Is it possible to define a notion of similarity for objects based on the similarity of their histories?

Histories

History: A sequence of time-stamped observations

  • Time-series: observations are real values

  • Observations can be more general

    • the history of a web page: each observation is a bag of words

    • the history of a patient

Similarity Model for Histories

History for 3 patients

  • Similarity of two histories depends on:

    • Pair-wise similarity of their observations

    • The order in which similar observations are recorded

      • Constraints on time-stamps of observations

Problem Definition

Given a history as a query:

  • Evaluate k-NN and Range queries efficiently.

  • For each history in the result, find its common signature with the query, i.e., where the similarity comes from.

Similarity Measure for Histories

Alignment of histories:

  • An approach to line up subsequences of two histories

  • Denoted by a sequence of matches, each pairing an element from history A with an element from history B

  • Each element of a match is an observation in A (B) or a gap

  • Each match has a score

  • The alignment score measures the quality of an alignment
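The representation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: observations are modeled as sets of items, a gap is represented by `None`, a gap match scores zero, and the Jaccard coefficient stands in for the match score. The names `jaccard`, `match_score`, and `alignment_score` are hypothetical and do not come from the slides.

```python
from typing import List, Optional, Set, Tuple

# Assumption: an observation is a set of items (a market basket).
Observation = Set[str]
# Assumption: a match pairs two elements; None stands for a gap.
Match = Tuple[Optional[Observation], Optional[Observation]]


def jaccard(x: Observation, y: Observation) -> float:
    """Jaccard coefficient of two item sets (0.0 if both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def match_score(match: Match, gap_score: float = 0.0) -> float:
    """Score of a single match; a match with a gap gets gap_score (assumed 0)."""
    a, b = match
    if a is None or b is None:
        return gap_score
    return jaccard(a, b)


def alignment_score(alignment: List[Match]) -> float:
    """One possible alignment score: the sum of the scores of its matches."""
    return sum(match_score(m) for m in alignment)


example = [
    ({"milk", "bread"}, {"milk", "bread"}),  # identical baskets
    ({"beer"}, None),                        # observation matched with a gap
    ({"eggs", "tea"}, {"eggs"}),             # partially overlapping baskets
]
```

Calling `alignment_score(example)` adds up one score per match, so a different match-score function can be swapped in without touching the aggregation.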

Alignments of Histories

Alignment score can be the sum of the scores of the matches in the alignment.

The best alignment of two histories:

What is the best alignment of length 3?

If the match could not be considered, what would be the best alignment of length 2?

Constraints on the Alignments of Histories

  • The number of matches in the alignment.

    • l-alignment: alignment with l matches

  • The r-neighborhood constraint

    • For each match

  • r, l: parameters of the similarity query.
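A minimal sketch of checking both constraints on a candidate alignment follows. The exact form of the r-neighborhood condition did not survive in this transcript, so the code assumes it bounds the difference between the time-stamps of the two matched observations by r. The function name `satisfies_constraints` and the representation of an alignment as index pairs are hypothetical.

```python
from typing import List, Sequence, Tuple


def satisfies_constraints(
    pairs: List[Tuple[int, int]],
    times_a: Sequence[float],
    times_b: Sequence[float],
    l: int,
    r: float,
) -> bool:
    """Check an alignment given as (i, j) index pairs of matched observations.

    Assumptions (not taken from the slides):
      * gaps are implicit: unmatched observations are simply not listed;
      * the r-neighborhood constraint means |t_A[i] - t_B[j]| <= r.
    """
    # l-alignment: exactly l observation-to-observation matches.
    if len(pairs) != l:
        return False
    # Order preservation: indices must strictly increase in both histories.
    for (i1, j1), (i2, j2) in zip(pairs, pairs[1:]):
        if not (i1 < i2 and j1 < j2):
            return False
    # Assumed r-neighborhood: matched time-stamps are at most r apart.
    return all(abs(times_a[i] - times_b[j]) <= r for i, j in pairs)


times_a = [1, 2, 5]
times_b = [1, 4, 6]
```

With these time-stamps, `satisfies_constraints([(0, 0), (2, 1)], times_a, times_b, l=2, r=1)` accepts the alignment, while tightening r to 0 rejects it.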

Principle of Optimality

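Histories A and B are each split into two parts: p(A), s(A) and p(B), s(B).

The principle of optimality relates three alignments through a concatenation operator:

  • the optimal alignment of p(A) and p(B)

  • the optimal alignment of s(A) and s(B)

  • the optimal alignment of A and B

The principle of optimality holds if the optimal alignment of A and B is the concatenation of the optimal alignment of p(A) and p(B) with the optimal alignment of s(A) and s(B).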

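Score of Optimal l-alignment

  • An optimal l-alignment of two suffixes can be formed by one of:

    • Concatenating a match of their first observations with an optimal (l-1)-alignment of the remaining suffixes

    • Matching the first observation of one suffix with a gap, and considering an l-alignment of the remaining suffixes

    • Matching the first observation of the other suffix with a gap, and considering an l-alignment of the remaining suffixes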
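The three cases above translate into a dynamic program over suffixes. The sketch below is one way to write it, not necessarily the authors' formulation: it assumes that gap matches score zero, that "l matches" counts only observation-to-observation matches, and it omits the r-neighborhood constraint. The function name `optimal_l_alignment_score` and the `sim` parameter are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def optimal_l_alignment_score(
    a: Sequence[T],
    b: Sequence[T],
    l: int,
    sim: Callable[[T, T], float],
) -> float:
    """Best total score of an order-preserving alignment with exactly l matches.

    Returns -inf when no l-alignment exists (l exceeds a history's length).
    Assumption: a match with a gap contributes 0 to the score.
    """
    n, m = len(a), len(b)
    neg_inf = float("-inf")
    # prev[i][j]: best score of a (k-1)-alignment of suffixes a[i:] and b[j:].
    prev = [[0.0] * (m + 1) for _ in range(n + 1)]  # k = 0: nothing to match
    for _ in range(l):
        cur = [[neg_inf] * (m + 1) for _ in range(n + 1)]
        for i in range(n - 1, -1, -1):
            for j in range(m - 1, -1, -1):
                cur[i][j] = max(
                    sim(a[i], b[j]) + prev[i + 1][j + 1],  # match a[i] with b[j]
                    cur[i + 1][j],                         # a[i] matched with a gap
                    cur[i][j + 1],                         # b[j] matched with a gap
                )
        prev = cur
    return prev[0][0]


def equality(x: str, y: str) -> float:
    """0/1 score used only to illustrate the link to longest common subsequence."""
    return 1.0 if x == y else 0.0
```

With the 0/1 `equality` score, every matched pair contributes at most 1, so the result cannot exceed the length of the longest common subsequence. For the strings "ABCBDAB" and "BDCABA", whose longest common subsequence has length 4, the optimal 4-alignment scores 4.0.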

Similarity Measure for Histories

The similarity measure: the score of the optimal l-alignment of two histories.

  • The optimal l-alignment can be used to find the common signature of histories:

    • A sequence of observations that appear in the same order in two histories.

    • Generalizes the notion of longest common subsequence.

Similarity Queries over Collection of Histories

  • Straightforward (not practical) approach: naïve scan

  • Indexing techniques are proposed for metric spaces, but the similarity measure is not metric:

    • when the distance between observations is not metric.

    • when an r-neighborhood constraint is specified.

  • We propose upper bounds to prune history search space.
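To illustrate how an upper bound can prune the search space, here is a generic filter-and-refine loop for k-NN queries. It is a standard pattern, not a transcription of the authors' algorithm. The names `knn_with_pruning`, `upper_bound`, and `exact_score` are hypothetical, and the only requirement assumed is that `upper_bound` never underestimates `exact_score`.

```python
import heapq
from typing import Callable, List, Sequence, Tuple, TypeVar

H = TypeVar("H")


def knn_with_pruning(
    query: H,
    histories: Sequence[H],
    k: int,
    upper_bound: Callable[[H, H], float],
    exact_score: Callable[[H, H], float],
) -> Tuple[List[Tuple[float, int]], int]:
    """Return the k best (score, index) pairs and the number of exact evaluations.

    Candidates are visited in decreasing order of their upper bound; the scan
    stops once no remaining candidate can beat the current k-th best score.
    """
    ranked = sorted(
        ((upper_bound(query, h), idx) for idx, h in enumerate(histories)),
        reverse=True,
    )
    top: List[Tuple[float, int]] = []  # min-heap holding the k best so far
    examined = 0
    for bound, idx in ranked:
        if len(top) == k and bound <= top[0][0]:
            break  # every remaining bound is at most this one: safe to stop
        examined += 1
        score = exact_score(query, histories[idx])
        if len(top) < k:
            heapq.heappush(top, (score, idx))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, idx))
    return sorted(top, reverse=True), examined


# Toy stand-ins: "histories" are plain sets, the exact score is the overlap
# size, and the smaller set size is a valid (cheap) upper bound on it.
toy_histories = [{1, 2, 3}, {4}, {1, 2}, {7, 8, 9, 10}]
toy_query = {1, 2, 3}
toy_result, toy_examined = knn_with_pruning(
    toy_query,
    toy_histories,
    k=1,
    upper_bound=lambda q, h: min(len(q), len(h)),
    exact_score=lambda q, h: len(q & h),
)
```

In the toy run only two of the four candidates need an exact evaluation; the other two are skipped because their bounds cannot exceed the best score already found.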

A General Upper Bound for the Similarity Measure

Intuition: The score of an optimal relaxed l-alignment is not less than the score of optimal l-alignment.

  • For each observation, find an optimal match.

  • Aggregate the scores of the top l optimal matches to find an upper bound for the similarity measure.

This upper bound can prune some extra computations, but all histories must still be accessed to evaluate a query.
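The two steps of this bound can be sketched as follows, assuming the relaxation simply drops the order-preservation requirement and that gap matches score zero. The function name `relaxed_upper_bound` and the `sim` parameter are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def relaxed_upper_bound(
    a: Sequence[T],
    b: Sequence[T],
    l: int,
    sim: Callable[[T, T], float],
) -> float:
    """Upper bound on the score of any l-alignment of histories a and b.

    Step 1: for each observation of a, find its best match anywhere in b,
            ignoring order.
    Step 2: add up the l largest of those best-match scores.
    Any l-alignment uses l distinct observations of a, and each of its matches
    scores no more than that observation's best match, so the sum is a bound.
    """
    if not b:
        return 0.0
    best_per_observation = [max(sim(x, y) for y in b) for x in a]
    return sum(sorted(best_per_observation, reverse=True)[:l])


def same(x: str, y: str) -> float:
    """0/1 score used only for the toy example."""
    return 1.0 if x == y else 0.0
```

For the histories "ABC" and "CA" with l = 2, the bound is 2.0 even though no order-preserving alignment can match both A and C, which shows why the bound is only a filter: it still touches every history in the collection.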

An Index-based Upper Bound for the Similarity Measure

  • Intuitions:

    • Observations are sparse in real-life applications.

    • The score of an optimal relaxed match is not less than the score of an optimal match.

    • The score of an optimal relaxed alignment is not less than the score of an optimal relaxed l-alignment.

This upper bound can be evaluated efficiently by exploiting an inverted index if the similarity between observations is Cosine or the Extended Jaccard Coefficient.
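The sketch below shows only why sparsity makes an inverted index useful; it is a simplified illustration, not the authors' index-based bound. It assumes observations are sets of items and relies on the fact that the Cosine and Jaccard measures are zero for two observations that share no item. The names `build_inverted_index` and `candidate_histories` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set

History = Sequence[Set[str]]  # assumption: a history is a sequence of item sets


def build_inverted_index(histories: Sequence[History]) -> Dict[str, Set[int]]:
    """Map each item to the ids of the histories that contain it somewhere."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for history_id, history in enumerate(histories):
        for observation in history:
            for item in observation:
                index[item].add(history_id)
    return index


def candidate_histories(query: History, index: Dict[str, Set[int]]) -> Set[int]:
    """Ids of histories sharing at least one item with the query.

    Every other history has only zero-score matches with the query under
    Cosine or Jaccard, so it never has to be read from the collection.
    """
    candidates: Set[int] = set()
    for observation in query:
        for item in observation:
            candidates |= index.get(item, set())
    return candidates


collection: List[History] = [
    [{"milk", "bread"}, {"beer"}],
    [{"tea"}],
    [{"bread"}, {"eggs"}],
]
item_index = build_inverted_index(collection)
query_history: History = [{"bread"}, {"jam"}]
```

Here `candidate_histories(query_history, item_index)` touches only the posting lists for "bread" and "jam", so the history containing only "tea" is never accessed.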

Experiments

  • Experiments performed on an AMD/XP 2600 with 512 MB RAM

  • Datasets:

    • DBLP

    • Synth1: Our synthetic data

    • Synth2: Modified IBM synthetic data generator

  • Investigated:

    • Effectiveness of similarity measure

    • Efficiency of our approach

      • Pruning power, Running time, Scalability

Effectiveness of our Similarity Measure

  • Synth2 dataset contains:

    • 20,000 histories

    • Length of histories: {32,…,64}

    • A parameter selected randomly from {1,…,10} for each history

  • Each history is a sequence V(1), …, V(i), V(i+1), …, V(n):

    • Observation: document modeled as bit string

    • First observation: randomly selected

    • V(i+1): bit string following V(i) in a pre-determined order

    • Poisson distribution [Cho et al. VLDB 2000]

Effectiveness of our Similarity Measure (cont.)

Mean deviation of from for k-NN queries:

* For 2,000 randomly generated queries

Pruning Power vs. k

[Plot: fraction of database examined vs. no. of neighbours in k-NN query (log scale)]

Running Time vs. k

[Plot: time (msec) vs. no. of neighbours in k-NN query (log scale). Dataset: Synth2, 8,000 histories, 1,000 items]

Scalability for 1-NN queries

[Plot: time (msec) vs. no. of histories in the collection]

Running time vs. Sparseness of Observations

[Plot: time (msec) vs. no. of items (log scale)]

Conclusions

  • Introduced a domain-independent framework to formulate and evaluate similarity queries over historical data.

  • Generalized a few concepts, including edit distance and longest common subsequence, to histories.

  • Developed upper bounds to efficiently evaluate queries. One of our upper bounds can directly take advantage of an index even though the similarity measure is not metric.

  • Our experiments confirm the effectiveness and efficiency of our approach.

Thank you for your attention!

Related Works

  • Detecting, representing, querying histories

    • [Chawathe 1998], [Chien 2001]

  • Similarity-based sequence matching

    • [Altschul 1990], [Pearson 1990], [Bieganski 1994]

  • Finding similar sequences of events

    • [Wang 2003]

  • Finding similar time series

    • [Agrawal 1995], [Rafiei 1997], [Keogh 2002], [Vlachos 2002, 2003], ...
