Extending PRIX for Similarity-based XML Query

Presentation Transcript


  1. Extending PRIX for Similarity-based XML Query Group Members: Yan Qi, Jicheng Zhao, Dan Situ, Ning Liao

  2. Agenda • System Architecture Introduction • Semantic-based Similarity Search • Query Expansion • Semantic Similarity Computation • Structural-based Similarity Search • Adapting PRIX algorithm • Indexing • Query Processing • Structural Similarity Computation • Similarity Computation and Ranking • Discussion & Conclusion

  3. System Architecture Introduction

  5. Query Expansion (I) An Example: • Tags in a sample query: {title, Praveen Rao, information retrieval} • Keywords: {title, Praveen, Rao, information, retrieval} • Keyword Extensions: {{title, status title, deed, claim, entity, style}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}} • Valid Keyword Extensions: {{title, claim, entity}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}} (Continued on the next slide)

  6. Query Expansion (II) • Tag Extensions: {{title}, {claim}, {entity}, {Praveen}, {Rao}, {data, retrieval}, {data, recovery}, {information, retrieval}, {information, recovery}, {entropy, retrieval}, {entropy, recovery}} • Valid Tag Extensions: {{title}, {A claim on theory of computation}, {entity}, {Praveen Rao}, {modern information retrieval}, {A survey on information retrieval}, {information recovery}} • Query Expansions: { {title}, {Praveen Rao}, {modern information retrieval} }, { {A claim on theory of computation}, {Praveen Rao}, {modern information retrieval} }, …… • Valid Queries: { {title}, {Praveen Rao}, {modern information retrieval} }
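The two-keyword tag extensions listed on this slide match the Cartesian product of the per-keyword extension sets. A minimal sketch of that step follows; the function name `tag_extensions` is hypothetical, and the later validation against the indexed data is not shown because the slides do not describe how it is done.

```python
from itertools import product

def tag_extensions(keyword_extensions):
    """Combine one alternative per keyword into candidate tag extensions.

    keyword_extensions: one list of alternatives per keyword in the tag.
    (Hypothetical helper; the slides only show the resulting sets.)
    """
    return [tuple(combo) for combo in product(*keyword_extensions)]

# Valid keyword extensions for the tag "information retrieval" (from the slide).
valid_keyword_extensions = [
    ["data", "entropy", "information"],
    ["retrieval", "recovery"],
]

extensions = tag_extensions(valid_keyword_extensions)
for ext in extensions:
    print(" ".join(ext))
```

Each candidate would then be checked against the database, which is how the list shrinks to the valid tag extensions on the slide.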

  7. Semantic Similarity Computation • Similarity between query q and one of its extensions q’ • t: tag in query q • t’: tag in query q’ • n: number of tags in q • m: number of keywords in tag t • Similarity of keywords k_i and k_i’: 1 if k_i = k_i’; α (0 ≤ α < 1) if k_i ≠ k_i’
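The transcript keeps only the keyword-level rule (1 for an unchanged keyword, α otherwise) and the meaning of n and m; the combining formulas were on the slide image. The sketch below assumes the simplest combination, an average over the m keywords of each tag and then over the n tags of the query. That averaging, the positional pairing of keywords, and the value α = 0.5 are all assumptions, not the authors' stated formula.

```python
def keyword_sim(k, k_ext, alpha):
    # Rule from the slide: 1 if the keyword is unchanged, alpha otherwise.
    return 1.0 if k == k_ext else alpha

def tag_sim(tag, tag_ext, alpha):
    # ASSUMPTION: keywords are paired by position and averaged over m.
    if len(tag) != len(tag_ext):
        raise ValueError("tags must have the same number of keywords")
    return sum(keyword_sim(k, ke, alpha) for k, ke in zip(tag, tag_ext)) / len(tag)

def semantic_sim(query, query_ext, alpha=0.5):
    # ASSUMPTION: tag similarities are averaged over the n tags of q.
    if len(query) != len(query_ext):
        raise ValueError("queries must have the same number of tags")
    return sum(tag_sim(t, te, alpha) for t, te in zip(query, query_ext)) / len(query)

q = [["title"], ["Praveen", "Rao"], ["information", "retrieval"]]
q_ext = [["claim"], ["Praveen", "Rao"], ["information", "recovery"]]
score = semantic_sim(q, q_ext, alpha=0.5)
print(score)
```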

  9. Indexing: Prix (PRüfer sequences for Indexing Xml)
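As a rough illustration of the indexing idea, the sketch below builds a labeled and a numbered Prüfer sequence for a small tree. It assumes nodes are numbered in postorder and that the leaf with the smallest number is deleted repeatedly while its parent is recorded; under postorder numbering this reduces to listing the parent of nodes 1 to n-1. The tuple-based tree encoding and the function name are illustrative choices, not part of the slides.

```python
def prufer_sequences(tree):
    """Return (labeled sequence, numbered sequence) for tree = (label, [children])."""
    labels = []   # labels[i] is the label of the node with postorder number i + 1
    parents = []  # parents[i] is the postorder number of that node's parent

    def visit(node):
        label, children = node
        child_numbers = [visit(child) for child in children]
        labels.append(label)
        parents.append(None)          # filled in once the parent is numbered
        number = len(labels)          # postorder number of this node
        for c in child_numbers:
            parents[c - 1] = number
        return number

    visit(tree)
    nps = parents[:-1]                # the root (last in postorder) has no parent
    lps = [labels[p - 1] for p in nps]
    return lps, nps

# Postorder numbers: c=1, d=2, b=3, e=4, a=5
example = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
lps, nps = prufer_sequences(example)
print(lps, nps)
```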

  10. Indexing: Prix (PRüfer sequences for Indexing Xml) • AD-Label (Ancestor-Descendant) • Indexing structure in DB

  11. Query Processing • Procedure • Filtering • Based on Subsequence matching • O (n*n*m) : n is the number of nodes in the document; m is the number of nodes in the query. • Refinement • Connectivity • Gap Consistency • Frequency Consistency

  12. Subsequence Matching • Definition - Example: * Good results: media, mult, mm, ted, tia, etc… • Why does it work? • It is not enough on its own; more refinements are needed…
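A subsequence keeps the order of symbols but may skip any number of them. The short check below illustrates the definition. The slide's source string was lost in the transcript; "multimedia" is an assumption, chosen because every listed result (media, mult, mm, ted, tia) is a subsequence of it.

```python
def is_subsequence(query, data):
    """True if query can be obtained from data by deleting zero or more symbols."""
    remaining = iter(data)
    # `symbol in remaining` advances the iterator past the first match,
    # so later symbols can only match further to the right.
    return all(symbol in remaining for symbol in query)

source = "multimedia"  # ASSUMPTION: not preserved in the transcript
results = {w: is_subsequence(w, source) for w in ["media", "mult", "mm", "ted", "tia", "dam"]}
print(results)
```

The same test applies to Prüfer sequences: a query sequence that is not a subsequence of a document's sequence can be filtered out, while one that is still needs the refinement steps.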

  13. Refinement #1: Concept of Dummy Nodes - PRIX offers only partial matching - Solution: extend PRIX to the leaf level - Example:
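One way to read "extend to the leaf level": a plain Prüfer sequence records only parents, so leaf labels never appear in it. Giving every leaf a dummy child makes the original leaves show up as parents. The sketch below assumes postorder traversal and a `(label, [children])` tuple encoding; the dummy label `#` and both function names are hypothetical.

```python
def add_dummy_leaves(tree, dummy="#"):
    """Attach one dummy child to every leaf of tree = (label, [children])."""
    label, children = tree
    if not children:
        return (label, [(dummy, [])])
    return (label, [add_dummy_leaves(child, dummy) for child in children])

def labeled_sequence(tree):
    """Parent label of every non-root node, listed in postorder."""
    sequence = []

    def visit(node, parent_label):
        label, children = node
        for child in children:
            visit(child, label)
        if parent_label is not None:   # the root has no parent to record
            sequence.append(parent_label)

    visit(tree, None)
    return sequence

tree = ("a", [("b", []), ("c", [])])
plain = labeled_sequence(tree)
extended = labeled_sequence(add_dummy_leaves(tree))
print(plain, extended)
```

In the plain sequence the leaves b and c are invisible; in the extended one they appear, so a query can be matched down to its leaves.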

  14. Refinement #2: Connection vs Connectionless - Definition - How to check it? - If not connected, then what? - Solution: apply penalty - Example (Disconnected By Gap): - Example (Disconnected By Unknown):

  15. Refinement #3 • Checking for Gap Consistency - Gap consistency depends on the gaps in the Prüfer sequence - How to check it? - Determines whether the query tree is a subset of the search domain
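The slide does not spell out the check, so the following is only a sketch under an assumed definition: take the gap between adjacent numbers in the query's numbered sequence and in the matched subsequence of the document's numbered sequence, and require that the two gaps have the same sign and that the query gap is no larger in magnitude than the document gap. The definition and the function name are assumptions.

```python
def gap_consistent(data_subsequence, query_sequence):
    """Compare adjacent gaps of a matched document subsequence and the query.

    ASSUMED rule: gaps must share a sign (or both be zero), and the query gap
    must not exceed the document gap in magnitude.
    """
    if len(data_subsequence) != len(query_sequence):
        return False
    for i in range(len(query_sequence) - 1):
        data_gap = data_subsequence[i] - data_subsequence[i + 1]
        query_gap = query_sequence[i] - query_sequence[i + 1]
        same_sign = (data_gap > 0) == (query_gap > 0) and (data_gap < 0) == (query_gap < 0)
        if not same_sign or abs(query_gap) > abs(data_gap):
            return False
    return True

print(gap_consistent([4, 9], [2, 3]))   # both gaps negative, |-1| <= |-5|
print(gap_consistent([4, 9], [3, 3]))   # query gap is zero, document gap is not
```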

  16. Refinement #4 • Checking for Frequency Consistency - Frequency consistency depends on gap consistency and on the occurrences of numbers in the NPS (numbered Prüfer sequence) - How to check it? - Determines whether the query tree is an exact match in the search domain - If not frequency consistent, then what? - Solution: apply penalty
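A sketch of one plausible frequency check, under an assumed definition: two positions hold the same number in the query's numbered sequence exactly when they hold the same number in the matched document subsequence. In other words, repeated parents in the query must map to repeated parents in the document and vice versa. The definition and the function name are assumptions, since the slide gives neither.

```python
def frequency_consistent(data_subsequence, query_sequence):
    """True if equal numbers occur at exactly the same positions in both sequences."""
    if len(data_subsequence) != len(query_sequence):
        return False
    query_to_data, data_to_query = {}, {}
    for d, q in zip(data_subsequence, query_sequence):
        # setdefault records the first pairing and returns it on later visits,
        # so any conflicting pairing is detected in either direction.
        if query_to_data.setdefault(q, d) != d or data_to_query.setdefault(d, q) != q:
            return False
    return True

print(frequency_consistent([7, 7, 9], [2, 2, 3]))
print(frequency_consistent([7, 8, 9], [2, 2, 3]))
```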

  17. Structure Similarity • Calculations are based on edit distances, which are transformed into penalty values • Each mismatched node in the structure has a penalty equal to the size of its subtree + 1 • Overall penalty is dot product of all mismatches • All results are normalized with respect to the worst-case penalty
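A minimal sketch of the penalty idea, with several assumptions stated up front: subtree size counts the mismatched node and all its descendants, the per-node penalties are simply summed (the slide's "dot product" is not defined in the transcript), and similarity is taken as 1 minus the normalized penalty. All names are hypothetical.

```python
def subtree_size(tree):
    """Number of nodes in tree = (label, [children]), including the root."""
    _, children = tree
    return 1 + sum(subtree_size(child) for child in children)

def mismatch_penalty(mismatched_subtree):
    # Rule from the slide: size of subtree + 1.
    return subtree_size(mismatched_subtree) + 1

def structure_sim(mismatched_subtrees, worst_case_penalty):
    # ASSUMPTION: penalties are summed, then normalized by the worst case.
    total = sum(mismatch_penalty(t) for t in mismatched_subtrees)
    return max(0.0, 1.0 - total / worst_case_penalty)

mismatch = ("b", [("c", []), ("d", [])])   # subtree of 3 nodes, penalty 4
similarity = structure_sim([mismatch], worst_case_penalty=8)
print(similarity)
```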

  18. Structural Similarity #1: Connectivity

  19. Structural Similarity #2: Gap Similarity

  20. Structural Similarity #3: Frequency Similarity

  22. Rank returned XML patterns Similarity (q, q’’)= Semantic_sim(q, q’) * Structure_sim (q’, q’’)
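The ranking rule multiplies the two scores, so a pattern must do well on both to rank highly. A small sketch, assuming each candidate arrives with its two scores already computed; the tuple layout and the function name are illustrative.

```python
def rank_patterns(candidates):
    """Sort candidates by semantic_sim * structure_sim, best first.

    candidates: iterable of (pattern, semantic_sim, structure_sim) tuples.
    """
    scored = [(semantic * structure, pattern) for pattern, semantic, structure in candidates]
    return sorted(scored, key=lambda item: item[0], reverse=True)

candidates = [
    ("p1", 1.0, 0.5),    # exact keywords, weak structural match
    ("p2", 0.75, 0.8),   # expanded keywords, strong structural match
    ("p3", 0.5, 0.5),
]
ranking = rank_patterns(candidates)
print([pattern for _, pattern in ranking])
```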

  23. Advantages of the approach • Prix Indexing • Faster • Captures all structural information • Similarity based • Structure similarity • Semantic similarity

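  24. Limitations and Extensions • Limitation of Prix: • Ordering of nodes • We need to handle it in query extension • Example: the tree a with children (b, c) and the tree a with children (c, b) differ only in sibling order, yet their sequences differ: baca vs. caba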

  25. Limitations and Extensions • More Limitations of Prix: • It is difficult to map intuitive structural similarities between trees to similarities between Prix sequences • It is thus difficult to define the similarity accurately • However: • Translating tree structures into equivalent sequences, and then doing data mining or similarity matching on the sequences, is a promising direction

  26. Limitations and Extensions • Limitations of Semantic similarity • Too many similar results • However: • We consider semantic similarity together with structure information • In a broad sense, similarity covers: • Structure similarity • Semantic similarity • Syntax similarity • Similarity information from co-occurrences of keywords • Similarity information from user feedback • Similarity information from metadata (DTD, data source, region, language, link structure of XML files, etc.)
