
Reverse Spatial and Textual k Nearest Neighbor Search. Presentation in Aalborg University. Jiaheng Lu Renmin University of China August 11 2011. Research experience. Associate Professor: Renmin University of China XML data management, Spatial data management, Cloud data management


Reverse Spatial and Textual k Nearest Neighbor Search

Presentation in Aalborg University

Jiaheng Lu

Renmin University of China

August 11 2011

Research experience

Associate Professor: Renmin University of China

  • XML data management, Spatial data management, Cloud data management

Post-doc: University of California, Irvine

  • Data integration, Approximate string matching

PhD: National University of Singapore

  • XML data management
Outline

XML data management

  • XML twig query processing
  • XML keyword search

Approximate string matching

Reverse Spatial and Textual k Nearest Neighbor Search

XML twig query processing
  • XPath: Section[Title]/Paragraph//Figure
  • Twig pattern

[Figure: twig pattern with nodes Section, Title, Paragraph, and Figure.]

XML twig query processing (Cont.)
  • Problem Statement
    • Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.
  • E.g., consider the following query and document:

[Figure: query twig with nodes Section, title, and figure; document tree with nodes s1, t1, s2, t2, p1, and f1.]

Query solutions:

(s1, t1, f1)

(s2, t2, f1)

(s1, t2, f1)
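
The problem statement can be illustrated with a naive matcher. This is only a sketch: the `(start, end)` region encoding and the exact tree shape (s1 contains t1 and s2; s2 contains t2 and p1; p1 contains f1) are assumptions chosen to be consistent with the three listed solutions, and the twig is read as "a section with a descendant title and a descendant figure". The holistic algorithms named in this talk exist to avoid exactly this kind of exhaustive enumeration.

```python
from itertools import product

# Hypothetical region encoding: node -> (tag, start, end).
# Node a is an ancestor of node b when a.start < b.start and b.end < a.end.
DOC = {
    "s1": ("section", 1, 12),
    "t1": ("title", 2, 3),
    "s2": ("section", 4, 11),
    "t2": ("title", 5, 6),
    "p1": ("paragraph", 7, 10),
    "f1": ("figure", 8, 9),
}


def is_ancestor(a, b):
    """True if node a strictly contains node b in the region encoding."""
    _, a_start, a_end = DOC[a]
    _, b_start, b_end = DOC[b]
    return a_start < b_start and b_end < a_end


def nodes_with_tag(tag):
    return [name for name, (node_tag, _, _) in DOC.items() if node_tag == tag]


def match_twig():
    """Enumerate every (section, title, figure) triple satisfying the twig."""
    answers = []
    for s, t, f in product(
        nodes_with_tag("section"), nodes_with_tag("title"), nodes_with_tag("figure")
    ):
        if is_ancestor(s, t) and is_ancestor(s, f):
            answers.append((s, t, f))
    return answers


if __name__ == "__main__":
    for answer in match_twig():
        print(answer)
```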

XML twig query processing (Cont.)
  • Several efficient pattern matching algorithms
    • TJFast (VLDB 05)
    • iTwigJoin (SIGMOD 05)
    • TwigStackList (CIKM 04)
    • TreeMatch (TKDE 10)
  • Current work: distributed XML twig pattern processing
XML twig query processing
  • Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542
  • Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204
  • Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189
  • Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119
  • Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309
  • Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178
  • Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263
  • Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298
  • Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466
  • ……
Background: XQuery vs. keyword search

XML keyword search

Query: papers by "Mike"

XQuery (complicated):

for $a in doc("bib.xml")//author,
    $n in $a/name
where $n = "Mike"
return $a//inproceedings

Keyword search:

Mike, inproceedings


XML keyword search

  • The proposed keyword search returns the set of smallest trees containing all keywords.

[Figure: example bib XML tree with author nodes (name, publications, hobby) and inproceedings and article entries (title, year), shown together with the query keywords.]
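
One way to make "smallest trees containing all keywords" concrete is the smallest-LCA idea: keep every node whose subtree contains all keywords and that has no descendant with the same property. The sketch below is a brute-force illustration, not the method from the papers listed in this talk. It assumes nodes are identified by Dewey labels (tuples giving the path from the root), and the labels in the usage example are hypothetical.

```python
from itertools import product


def common_prefix(labels):
    """Dewey label of the lowest common ancestor of the given nodes."""
    prefix = []
    for parts in zip(*labels):
        if all(part == parts[0] for part in parts):
            prefix.append(parts[0])
        else:
            break
    return tuple(prefix)


def smallest_lcas(keyword_nodes):
    """keyword_nodes holds one list of Dewey labels per keyword.

    Returns the LCAs that have no proper descendant that is also an LCA,
    i.e. the roots of the smallest subtrees containing every keyword.
    """
    lcas = {common_prefix(combo) for combo in product(*keyword_nodes)}
    return sorted(
        a
        for a in lcas
        if not any(len(b) > len(a) and b[: len(a)] == a for b in lcas)
    )


if __name__ == "__main__":
    # Hypothetical labels: (1,) is the root, (1, 1) and (1, 2) are two authors.
    mike = [(1, 1, 1)]                 # name node of the first author
    hobby = [(1, 1, 3), (1, 2, 3)]     # hobby nodes of both authors
    print(smallest_lcas([mike, hobby]))
```

With these labels the root also contains both keywords, but it is discarded because the first author's subtree is a smaller tree that already contains them.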

Effectiveness

Capture user’s search intention

  • Identify the target that users intend to search for
  • Infer the predicate constraints that the user intends to search via

Result ranking

  • Rank the query results according to their objective relevance to the user's search intention

XML keyword search

  • Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934
  • Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109
  • Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010)
  • Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754
  • Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528
  • Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537
  • Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716
  • ……
Outline

XML data management

  • XML twig query processing
  • XML keyword search

Approximate string matching

Reverse Spatial and Textual k Nearest Neighbor Search


Motivation: Data Cleaning

Should clearly be “Niels Bohr”

  • Real-world data is dirty
  • Typos
  • Inconsistent representations
  • (PO Box vs. P.O. Box)
  • Approximately check against clean dictionary

Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008


Motivation: Record Linkage

We want to link records belonging to the same entity

No exact match!

The same entity may have similar representations

Arnold Schwarzeneger versus

Arnold Schwarzenegger

Forrest Whittaker versus

Forest Whittacker


Motivation: Query Relaxation

  • Errors in queries
  • Errors in data
  • Bring query and meaningful results closer together

Actual queries gathered by Google

http://www.google.com/jobs/britney.html


What is Approximate String Search?

Queries against collection:

Find all entries similar to “Forrest Whitaker”

Find all entries similar to “Arnold Schwarzenegger”

Find all entries similar to “Brittany Spears”

String Collection: (People)

Brad Pitt

Forest Whittacker

George Bush

Angelina Jolie

Arnold Schwarzeneger

What do we mean by similar to?

  • Edit Distance
  • Jaccard Similarity
  • Cosine Similarity
  • Dice
  • Etc.

The similar to predicate can help our described applications!

How can we support these types of queries efficiently?
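
To make the "similar to" predicate concrete, here is a minimal sketch of two of the listed measures: edit distance via the standard dynamic program, and Jaccard similarity on sets. The function names are illustrative and are not taken from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(edit_distance("Arnold Schwarzeneger", "Arnold Schwarzenegger"))
    print(edit_distance("Forrest Whittaker", "Forest Whittacker"))
    print(jaccard({"po", "box"}, {"p.o.", "box"}))
```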


Approximate Query Answering

Main Idea: Use q-grams as signatures for a string

irvine

Sliding Window

2-grams {ir, rv, vi, in, ne}

Intuition: Similar strings share a certain number of grams

Inverted index on grams supports finding all data strings sharing enough grams with a query
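
The sliding window and the gram index can be sketched in a few lines. The function names and the tiny string collection are assumptions made for illustration only.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """All overlapping substrings of length q, in sliding-window order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q=2):
    """Map each gram to the sorted list of IDs of strings containing it."""
    index = defaultdict(list)
    for string_id, s in enumerate(strings):
        for gram in sorted(set(qgrams(s, q))):
            index[gram].append(string_id)
    return index


if __name__ == "__main__":
    print(qgrams("irvine"))
    index = build_inverted_index(["irvine", "irving"])
    print(index["ir"], index["ne"])
```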


Approximate Query Example

Query: “irvine”, Edit Distance 1

2-grams {ir, rv, vi, in, ne}

[Figure: each 2-gram of the query is looked up in the gram index, which maps 2-grams (e.g., tf, vi, ir, ef, rv, ne, un, in) to inverted lists of string IDs.]

Candidates = {1, 5, 9}

May have false positives

Need to compute real similarity

Each edit operation can “destroy” at most q grams

Answers must share at least T = 5 – 1 * 2 = 3 grams with the query (5 query grams, edit distance 1, q = 2)

T-Occurrence problem: Find elements occurring at least T=3 times among inverted lists. This is called list-merging. T is called merging-threshold.
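
The filter-then-verify pipeline can be sketched as follows. This is a simple counting-based merge under assumed names and a made-up string collection, not one of the optimized merging algorithms from the cited papers. Repeated grams can only inflate the counts, so the filter may admit extra false positives but never drops a true answer; when T is not positive the filter cannot prune, and every string becomes a candidate.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def approximate_search(query, strings, k=1, q=2):
    """Return (candidate IDs after the count filter, verified answers)."""
    # Inverted index: gram -> string IDs (one entry per gram occurrence).
    index = defaultdict(list)
    for string_id, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].append(string_id)

    query_grams = qgrams(query, q)
    threshold = len(query_grams) - k * q  # merging threshold T

    if threshold <= 0:
        candidates = set(range(len(strings)))  # filter cannot prune anything
    else:
        counts = Counter()
        for gram in query_grams:
            counts.update(index.get(gram, []))
        candidates = {sid for sid, c in counts.items() if c >= threshold}

    # Verification step: remove false positives with the real distance.
    answers = [strings[sid] for sid in sorted(candidates)
               if edit_distance(query, strings[sid]) <= k]
    return candidates, answers


if __name__ == "__main__":
    data = ["irvine", "irving", "marvin", "alpine"]
    print(approximate_search("irvine", data, k=1, q=2))
```

In this toy collection "marvin" shares exactly three grams with the query and passes the filter, but it is rejected during verification because its edit distance to "irvine" is larger than 1.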

Approximate string matching
  • Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for approximate member extraction using signature-based inverted lists. CIKM 2009:315-324
  • Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615
  • Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266
  • Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739
  • ……
Outline

XML data management

  • XML twig query processing
  • XML keyword search

Approximate string matching

Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)

Motivation

[Figure: map of existing shops labeled by category (clothes, food, sports) around a candidate location Q.]

If a new shop is added at Q, which shops will be influenced?

Influence factors

  • Spatial Distance
    • Results: D, F
  • Textual Similarity
    • Services/Products...
    • Results: F, C

Problems of finding Influential Sets

Traditional query

Reverse k nearest neighbor query (RkNN)

Our new query

Reverse spatial and textual k nearest neighbor query (RSTkNN)

Problem Statement

Spatial-Textual Similarity

  • Describes the similarity between two objects based on both spatial proximity and textual similarity.

Spatial-Textual Similarity Function
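
The function itself is not preserved in this transcript. A minimal sketch of one plausible form follows, assuming a weighted sum in which alpha weights spatial proximity and (1 - alpha) weights textual similarity. The normalization by a maximum distance `max_dist` and the use of extended Jaccard on weighted term vectors are assumptions made for illustration, as are all names.

```python
import math


def spatial_sim(loc_a, loc_b, max_dist):
    """Spatial proximity in [0, 1]: 1 for identical locations,
    0 for locations max_dist apart (max_dist is an assumed normalizer)."""
    return 1.0 - math.dist(loc_a, loc_b) / max_dist


def textual_sim(vec_a, vec_b):
    """Extended Jaccard on term-weight dictionaries (an assumed choice)."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = sum(w * w for w in vec_a.values())
    norm_b = sum(w * w for w in vec_b.values())
    denom = norm_a + norm_b - dot
    return dot / denom if denom else 0.0


def sim_st(obj_a, obj_b, alpha, max_dist):
    """Weighted combination; each object is a (location, term_vector) pair."""
    (loc_a, vec_a), (loc_b, vec_b) = obj_a, obj_b
    return (alpha * spatial_sim(loc_a, loc_b, max_dist)
            + (1.0 - alpha) * textual_sim(vec_a, vec_b))


if __name__ == "__main__":
    shop = ((0.0, 0.0), {"clothes": 1.0})
    other = ((3.0, 4.0), {"food": 1.0})
    print(sim_st(shop, other, alpha=0.6, max_dist=10.0))
```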

Problem Statement (Cont.)

RSTkNN query

  • Find the objects that have the query object among their k most spatial-textually similar objects.
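
The definition can be checked directly with a brute-force scan, which is useful as a correctness baseline for any index-based method. This is a sketch under stated assumptions: the toy similarity `toy_sim` (one-dimensional positions, keyword sets, a fixed normalizer of 10) and the data are hypothetical, and ties are resolved in favor of the query object.

```python
def rstknn_bruteforce(objects, query, k, sim):
    """Return names of objects that would have `query` among their
    k most similar objects, by comparing against every other object."""
    results = []
    for name, obj in objects.items():
        sim_to_query = sim(obj, query)
        # Count existing objects that are strictly more similar than the query.
        closer = sum(
            1
            for other_name, other in objects.items()
            if other_name != name and sim(obj, other) > sim_to_query
        )
        if closer < k:
            results.append(name)
    return results


def toy_sim(a, b, alpha=0.5, max_dist=10.0):
    """Hypothetical similarity: objects are (position, keyword set) pairs."""
    (pos_a, words_a), (pos_b, words_b) = a, b
    spatial = 1.0 - abs(pos_a - pos_b) / max_dist
    union = words_a | words_b
    textual = len(words_a & words_b) / len(union) if union else 1.0
    return alpha * spatial + (1.0 - alpha) * textual


if __name__ == "__main__":
    shops = {
        "A": (0.0, {"clothes"}),
        "B": (1.0, {"clothes"}),
        "C": (10.0, {"food"}),
    }
    new_shop = (9.0, {"food"})
    print(rstknn_bruteforce(shops, new_shop, k=1, sim=toy_sim))
    print(rstknn_bruteforce(shops, new_shop, k=2, sim=toy_sim))
```

The scan costs a quadratic number of similarity computations per query, which is the cost the index and bounds in the rest of the talk aim to avoid.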

Related Work
  • Pre-computing the kNN for each object

(Korn et al., SIGMOD 2000; Yang et al., ICDE 2001)

  • (Hyper) Voronoi cell/planes pruning strategy

(Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009)

  • 60-degree-pruning method

(Stanoi et al., SIGMOD 2000)

  • Branch and Bound (based on Lp-norm metric space)

(Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009)

Challenging Features:

  • Euclidean geometric properties are lost.
  • The text space is high-dimensional.
  • k and α vary from query to query.

Overview of Search Algorithm

RSTkNN Algorithm:

  • Traverse the IUR-tree starting from the root
  • Progressively update lower and upper bounds
  • Apply search strategy:
    • prune unrelated entries into Pruned;
    • report result entries to Ans;
    • add candidate objects to Cnd.
  • FinalVerification
    • For each object in Cnd, decide whether it is a result by tightening its bounds, expanding entries in Pruned as needed.
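
The search strategy can be pictured as a three-way decision per entry. The sketch below is an assumption about how such bounds might be compared, not the algorithm's actual pseudocode: each entry is taken to carry lower and upper bounds on its similarity to the query, plus lower and upper bounds on the similarity to its k-th most similar object. All field names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    min_sim_to_query: float  # lower bound on similarity to the query
    max_sim_to_query: float  # upper bound on similarity to the query
    knn_lower: float         # lower bound on similarity to the k-th neighbor
    knn_upper: float         # upper bound on similarity to the k-th neighbor


def classify(entry):
    """Decide what the bounds alone allow us to conclude."""
    if entry.max_sim_to_query < entry.knn_lower:
        return "pruned"      # the query can never enter the top k
    if entry.min_sim_to_query >= entry.knn_upper:
        return "answer"      # the query is certainly in the top k
    return "candidate"       # bounds overlap: needs refinement


def partition(entries):
    """Split entries into the Pruned, Ans, and Cnd sets."""
    groups = {"pruned": [], "answer": [], "candidate": []}
    for entry in entries:
        groups[classify(entry)].append(entry.name)
    return groups


if __name__ == "__main__":
    demo = [
        Entry("e1", 0.10, 0.20, 0.40, 0.50),
        Entry("e2", 0.70, 0.80, 0.30, 0.60),
        Entry("e3", 0.35, 0.55, 0.40, 0.50),
    ]
    print(partition(demo))
```

Entries left in the candidate group are the ones that a final verification phase would need to resolve by tightening their bounds.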

Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6

[Figure: IUR-tree with root N4, child nodes N1, N2, N3, and objects p1, p2, p3, p4, p5.]

Operations:

  • Initialize N4.CLs;
  • EnQueue(U, N4);

State:

  • U: N4(0, 0)

Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6

[Figure: IUR-tree with nodes N1 to N4 and objects p1 to p5, showing the mutual effect among N1, N2, and N3.]

Operations:

  • DeQueue(U, N4)
  • EnQueue(U, N2)
  • EnQueue(U, N3)
  • Pruned.add(N1)

State:

  • Pruned: N1(0.37, 0.432)
  • U: N3(0.323, 0.619), N2(0.21, 0.619)

Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6

[Figure: IUR-tree with nodes N1 to N4 and objects p1 to p5. Mutual-effect pairs: p4 with N2; p5 with p4, N2.]

Operations:

  • DeQueue(U, N3)
  • Answer.add(p4)
  • Candidate.add(p5)

State:

  • Pruned: N1(0.37, 0.432)
  • Answer: p4(0.21, 0.619)
  • U: N2(0.21, 0.619)
  • Candidate: p5(0.374, 0.374)

Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6

[Figure: IUR-tree with nodes N1 to N4 and objects p1 to p5. Mutual-effect pairs: p2 with p4, p5; p3 with p2, p4, p5.]

Operations:

  • DeQueue(U, N2)
  • Answer.add(p2, p3)
  • Pruned.add(p5)

Since U and Candidate are now both empty, the algorithm ends.

Results: p2, p3, p4.

State:

  • Pruned: N1(0.37, 0.432), p5(0.374, 0.374)
  • Answer: p4, p2, p3
  • U: empty
  • Candidate: empty

Cluster IUR-tree: CIUR-tree

IUR-tree: Texts in an index node could be very different.

CIUR-tree: An enhanced IUR-tree by incorporating textual clusters.

Optimizations
  • Motivation
    • To give a tighter bound during CIUR-tree traversal
    • To purify the textual description in the index node
  • Outlier Detection and Extraction (ODE-CIUR)
    • Extract subtrees with outlier clusters
    • Take the outliers into special account and calculate their bounds separately.
  • Text-entropy based optimization (TE-CIUR)
    • Define TextEntropy to describe the distribution of text clusters in an entry of the CIUR-tree
    • Visit entries with higher TextEntropy first, i.e., those with more diverse texts.
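
The exact definition of TextEntropy is not given in this transcript. As an illustration only, the sketch below assumes it is the Shannon entropy of the cluster labels of the objects under an entry, and orders entries so that the most textually diverse ones are visited first. The entry names and cluster labels are hypothetical.

```python
import math
from collections import Counter


def text_entropy(cluster_ids):
    """Shannon entropy (in bits) of the cluster-label distribution.
    0 means all objects fall in one cluster; larger means more diverse."""
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def visit_order(entries):
    """Sort entry names by decreasing entropy of their cluster labels."""
    return sorted(entries, key=lambda name: text_entropy(entries[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical entries: name -> cluster label of each object underneath.
    entries = {
        "N1": [0, 0, 0, 0],
        "N2": [0, 0, 1, 1],
        "N3": [0, 1, 2, 3],
    }
    for name in visit_order(entries):
        print(name, text_entropy(entries[name]))
```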

Experimental Study

Experimental Setup

  • OS: Windows XP; CPU: 2.0GHz; Memory: 4GB
  • Page size: 4KB; Language: C/C++.

Compared Methods

  • baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE.

Datasets

  • ShopBranches (Shop), extended from a small real dataset
  • GeographicNames (GN), real data
  • CaliforniaDBpedia (CD), generated by combining locations in California with documents from DBpedia.

Metric

  • Total query time
  • Page access number

Scalability

(1) Log-scale version

(2) Linear-scale version

Effect of k

Query time

Conclusion

Propose a new query problem RSTkNN.

Present a hybrid index IUR-Tree.

Show the enhanced variant CIUR-Tree and two optimizations ODE-CIUR and TE-CIUR to further improve search processing.

Current and future work
  • Distributed XML query processing
  • Cloud-based data management