
Adaptive XML Search

Adaptive XML Search. Dr Wilfred Ng, Department of Computer Science, The Hong Kong University of Science and Technology. Outline: Motivation; Key-Tag Search; Multi-Ranker Model; Ranking SVM in a Voting SpyNB Framework (RSSF); Experiments; Conclusions and Ongoing Work.





Presentation Transcript


  1. Adaptive XML Search Dr Wilfred Ng Department of Computer Science The Hong Kong University of Science and Technology

  2. Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking SVM in a Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work

  3. Motivation

  4. Why Do We Need an XML Search Engine? • Different nature of HTML and XML data • HTML data • Hyperlink-intensive • Declarative language • Tags have no semantic meaning • XML data • Self-describing tags • Extra structural information • XML search engines can retrieve more accurate fragments

  5. Why Do We Need an XML Search Engine? • Web searching • Document paradigm • Matching keywords vs. documents • Returns links to whole documents (web pages) • XML searching • Query keywords may be tags or data values • The structure of XML documents is diverse, e.g. DBLP and Shakespeare • Should not return the whole document: it may be 100 MB or larger • Returns fragments instead

  6. DBLP
  <dblp>
    <incollection mdate="2002-01-03" key="books/acm/kim95/AnnevelinkACFHK95">
      <author>Jurgen Annevelink</author>
      <title>Object SQL - A Language for the Design and Implementation of Object Databases.</title>
      <pages>42-68</pages>
      <year>1995</year>
      <booktitle>Modern Database Systems</booktitle>
      <url>db/books/collections/kim95.html</url>
    </incollection>
    ….

  7. Shakespeare
  <SPEECH>
    <SPEAKER>OCTAVIUS CAESAR</SPEAKER>
    <LINE>No, my most wronged sister; Cleopatra</LINE>
    <LINE>Hath nodded him to her. He hath given his empire</LINE>
    <LINE>Up to a whore; who now are levying</LINE>
    <LINE>The kings o' the earth for war; he hath assembled</LINE>
    <LINE>Bocchus, the king of Libya; Archelaus,</LINE>
    <LINE>Of Cappadocia; Philadelphos, king</LINE>
    <LINE>Of Paphlagonia; the Thracian king, Adallas;</LINE>
    <LINE>King Malchus of Arabia; King of Pont;</LINE>
    <LINE>Herod of Jewry; Mithridates, king</LINE>
    <LINE>Of Comagene; Polemon and Amyntas,</LINE>
    <LINE>The kings of Mede and Lycaonia,</LINE>
    <LINE>With a more larger list of sceptres.</LINE>
  </SPEECH>

  8. Research Ideas • In the Information Retrieval community, many ranking techniques have been developed • Weighted keywords • Vector space • Searching and ranking XML as plain text using IR techniques is possible, but • It is too simple • It does not exploit the advantages of XML data • Better accuracy can be achieved using features of XML data: • Structures • Tag semantics

  9. Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking SVM in a Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work

  10. Key-Tag Search

  11. Key-Tag Query vs. XQuery • Keywords in Web search engines vs. SQL • The goals of key-tag queries and XQuery are different • XQuery: for $x in doc("some.xml") where $x/author[. ftcontains "Mary"] return $x/title • Too complicated for ordinary users!! Will users input such complex XQuery in search engines? • Key-Tag Query: <author>Mary</author> • Simple • Easy to understand • Flexible

  12. Key-Tag Search Query • A key-tag query pairs a tag with a key. For example, <author>Mary</author>, where author is the tag and Mary is the key.

  13. Key-Tag Query Semantics • A fragment is considered a result candidate if at least one key-tag is found in it. • If F1 and F2 both contain the same instance of a key-tag and F1 is a subtree of F2, F1 is chosen to be the only answer. • For example, given a query <b>b</b> and the fragment <b><c><b>b</b></c></b>, both F1: <b>b</b> and F2: <b><c><b>b</b></c></b> contain the same key-tag instance, so F1 will be the answer. • In contrast, the fragments F1: <a><b>b</b></a> (instance B1) and F2: <a><c><b>b</b></c></a> (instance B2) contain different instances of the key-tag.
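The subsumption rule above can be sketched in a few lines. This is an illustrative approximation, not the original system: the function name find_answers and the use of Python's ElementTree are assumptions. Matching an element directly on its own tag and text yields the minimal enclosing fragment, so the outer <b> in the example is never returned.

```python
import xml.etree.ElementTree as ET

def find_answers(xml_text, tag, key):
    """Return the minimal fragments answering the key-tag query <tag>key</tag>."""
    root = ET.fromstring(xml_text)
    answers = []
    for elem in root.iter(tag):
        # A fragment matches when its own tag equals the query tag and its
        # direct text contains the query key; any enclosing fragment that
        # contains the same instance is thereby excluded (subsumption rule).
        if elem.text and key in elem.text:
            answers.append(elem)
    return answers

doc = "<b><c><b>b</b></c></b>"
hits = find_answers(doc, "b", "b")
# The outer <b> has no direct text, so only the inner F1 is returned.
print([ET.tostring(e, encoding="unicode") for e in hits])  # ['<b>b</b>']
```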

  14. Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking SVM in a Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work

  15. Multi-Ranker Model

  16. Introduction to MRM • Handles diverse XML documents and user preferences

  17. Multi-Ranker Model [Architecture diagram: user profiles feed RSSF, which trains the weights w11–w14 of the Adaptive Ranking Level (AR1, AR2, …, ARn). The Standard Ranking Level (XR) holds four rankers: STR, DAT, DFT, CUS. The Feature Ranking Level holds Similarity features (Keyword, Access, Path, Element, Order, Category) and Granularity features (Sibling, Children, Distance+, Distance-, Tag, Attribute); new features and rankers can be added at both levels.]

  18. Adaptive Ranking Level (AR) • AR maintains a feature vector Φ, which adapts to the four XRs; the vector is weighted and trained by RSSF • Φ = (STR, DAT, DFT, CUS, STR, DAT, DFT, CUS) • The adaptive ranking of fragments is calculated by W · Φ, where W is the weight vector generated by RSSF (introduced later).
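The adaptive score W · Φ is just a weighted sum over the per-ranker feature scores. A minimal sketch, with made-up numbers for illustration (the function name and the weights are not from the original system):

```python
def adaptive_rank(W, phi):
    """Dot product of the RSSF-trained weight vector W with a fragment's
    feature vector Phi; higher scores rank the fragment earlier."""
    assert len(W) == len(phi)
    return sum(w * f for w, f in zip(W, phi))

# Hypothetical weights and feature scores over (STR, DAT, DFT, CUS).
W   = [0.4, 0.3, 0.2, 0.1]
phi = [0.8, 0.5, 0.9, 0.2]
print(round(adaptive_rank(W, phi), 2))  # 0.67
```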

  19. Standard Ranking Level (XR) • Four XRs • Structure ranker (STR): focuses on ranking XML fragments based on their structure • Data ranker (DAT): ignores the structure and ranks XML fragments by their textual data • System default ranker (DFT): a balance of the structure and data rankers • Customized ranker (CUS): the system administrator selects low-level features for tuning; in our experiments, the low-level features are randomly picked

  20. Feature Ranking Level • Similarity Features: Keyword, Access, Path, Element, Order, Category • For example, Q = {<author>Mary</author>, <title>XML</title>} • Predefined categories: Academic category {article, title, author}; Sport category {team, player, match, year}; … • Category vector for Q: <2/3, 0>; category vector for F: <1, 1/4> • Category similarity = distance of sqrt((1/3)² + (1/4)²) = 0.4167 • Keyword similarity = …; Access similarity = 3/7; Path similarity = 3/4; Element similarity = 2/7 • Order in Q: author > title; sibling order in F: author > title, author > year, title > year, first > last • Ancestor order similarity = 0; sibling order similarity = 1/4
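The category feature on this slide can be reproduced directly: each category vector holds, per predefined category, the fraction of that category's tags that appear in the query or fragment, and the slide's similarity value is the Euclidean distance between the two vectors. A sketch under those assumptions (function names are illustrative):

```python
import math

# Predefined categories, as given on the slide.
CATEGORIES = {
    "academic": {"article", "title", "author"},
    "sport":    {"team", "player", "match", "year"},
}

def category_vector(tags):
    # Fraction of each category's tags that occur in the given tag set.
    return [len(tags & cat) / len(cat) for cat in CATEGORIES.values()]

def category_distance(q_tags, f_tags):
    qv, fv = category_vector(q_tags), category_vector(f_tags)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(qv, fv)))

q = {"author", "title"}                     # category vector <2/3, 0>
f = {"article", "title", "author", "year"}  # category vector <1, 1/4>
print(round(category_distance(q, f), 4))  # 0.4167, matching the slide
```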

  21. Feature Ranking Level • Granularity Features: Sibling, Children, Distance+, Distance-, Tag, Attribute • Involve statistical data in the database • For example, Q = {<author>Mary</author>, <title>XML</title>} • Distance+: the length of the path from the root to the farthest leaf, e.g. dblp/article/author/first: length = 4 • Distance-: the length of the path from the root to the nearest leaf, e.g. dblp/article/title: length = 3 • Sibling: number of fragments whose roots are dblp • Children: number of tags whose parent is dblp • Tag: number of tags in F: 7 • Attribute: number of attributes in F: 0
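The per-fragment granularity features (Tag, Attribute, Distance+, Distance-) can be computed from the fragment alone; Sibling and Children need database-wide statistics and are omitted here. A sketch, assuming Python's ElementTree and illustrative names:

```python
import xml.etree.ElementTree as ET

def granularity_features(fragment_xml):
    """Per-fragment granularity features; depths count the root as 1, so
    dblp/article/author/first has length 4 as on the slide."""
    root = ET.fromstring(fragment_xml)

    def leaf_depths(elem, d=1):
        kids = list(elem)
        if not kids:
            return [d]
        out = []
        for k in kids:
            out.extend(leaf_depths(k, d + 1))
        return out

    depths = leaf_depths(root)
    return {
        "tag": sum(1 for _ in root.iter()),                    # tags in F
        "attribute": sum(len(e.attrib) for e in root.iter()),  # attributes in F
        "distance+": max(depths),   # root to farthest leaf
        "distance-": min(depths),   # root to nearest leaf
    }

frag = ("<dblp><article><author><first>Mary</first></author>"
        "<title>XML</title></article></dblp>")
print(granularity_features(frag))
```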

  22. Highlights of MRM • Highly flexible • Adding or removing new features or XRs is straightforward • Only requires updating the feature vector Φ • “Ranking Level Independence” • Analogous to data independence in the relational model

  23. Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking SVM in a Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work

  24. Features of RSSF • Input: a set of labeled fragments • Output: a trained ranker • Naïve Bayes is a successful algorithm for learning to classify text documents • It requires only a small amount of training data, but needs both positive and negative samples • In our setting, we only have labeled (positive) and unlabeled data, so we extend Naïve Bayes with a spying technique to obtain the negative training samples

  25. The RSSF

  26. Ranking SVM Techniques • Find a weight vector w such that the fragment scores respect the desired ordering, i.e. the inequalities w · F1 < w · F2 < w · F3 hold
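The core idea of Ranking SVM is to turn ordering constraints into classification on pairwise difference vectors: w · (Fj − Fi) > 0 whenever Fj should outrank Fi. The sketch below substitutes a simple perceptron on the differences for a full SVM solver, to keep it self-contained; the function names and toy feature vectors are illustrative, not from the paper.

```python
def learn_ranking_weights(pairs, dim, epochs=100, lr=0.1):
    """pairs: list of (phi_lo, phi_hi) where phi_hi should outrank phi_lo.
    Returns a weight vector w satisfying w . (phi_hi - phi_lo) > 0."""
    w = [0.0] * dim
    for _ in range(epochs):
        for lo, hi in pairs:
            diff = [h - l for h, l in zip(hi, lo)]
            # Perceptron update on violated ordering constraints.
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# Toy fragment feature vectors with desired ranking F1 < F2 < F3.
F1, F2, F3 = [0.1, 0.9], [0.5, 0.5], [0.9, 0.2]
w = learn_ranking_weights([(F1, F2), (F2, F3), (F1, F3)], dim=2)
score = lambda f: sum(wi * fi for wi, fi in zip(w, f))
print(score(F1) < score(F2) < score(F3))  # True
```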

  27. Voting Spy Naïve Bayes

  28. Voting Spy Naïve Bayes [Diagram: the positive fragments are split into spy groups P1, P2, P3; for each group, a Naïve Bayes classifier is trained with the remaining positives against the unclassified pool plus the spies (treated as negative), yielding an estimated negative set once training is completed.]

  29. Voting Spy Naïve Bayes [Diagram: the classifiers built from spy groups P1, P2, P3 each predict negatives among the unclassified fragments; F11 receives three votes, F12 two, F13 and F14 one each. Fragments with enough votes form the final estimated negative set.]
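The voting procedure in the two slides above can be approximated in code. This is an illustrative sketch, not the paper's implementation: each run plants a spy group of positives inside the unlabeled pool, trains a tiny multinomial Naïve Bayes with the pool treated as negative, uses the spies' minimum log-odds as the negative cutoff, and keeps unlabeled fragments voted negative by enough runs. All function names and the toy token data are invented.

```python
import math, random
from collections import Counter

def train_nb(pos_docs, neg_docs):
    """Tiny multinomial Naive Bayes over token lists; returns a function
    giving the log-odds of the positive class for a document."""
    vocab = {t for d in pos_docs + neg_docs for t in d}
    def loglik(docs):
        counts = Counter(t for d in docs for t in d)
        denom = sum(counts.values()) + len(vocab)  # Laplace smoothing
        return {t: math.log((counts[t] + 1) / denom) for t in vocab}
    pos_ll, neg_ll = loglik(pos_docs), loglik(neg_docs)
    n = len(pos_docs) + len(neg_docs)
    lp, ln = math.log(len(pos_docs) / n), math.log(len(neg_docs) / n)
    def log_odds(doc):
        p = lp + sum(pos_ll.get(t, 0.0) for t in doc)
        q = ln + sum(neg_ll.get(t, 0.0) for t in doc)
        return p - q
    return log_odds

def voting_spy_nb(positive, unlabeled, n_runs=3, vote_threshold=2, seed=0):
    votes = Counter()
    rng = random.Random(seed)
    for _ in range(n_runs):
        spies = rng.sample(positive, k=max(1, len(positive) // n_runs))
        rest = [d for d in positive if d not in spies]
        # Spies hide inside the unlabeled pool, which is treated as negative.
        log_odds = train_nb(rest, unlabeled + spies)
        cutoff = min(log_odds(s) for s in spies)
        for i, doc in enumerate(unlabeled):
            if log_odds(doc) < cutoff:  # less positive than every spy
                votes[i] += 1
    return [unlabeled[i] for i, v in votes.items() if v >= vote_threshold]

positive = [["xml", "search"], ["xml", "rank"], ["xml", "query"]]
unlabeled = [["xml", "search"], ["football", "match"], ["football", "team"]]
print(voting_spy_nb(positive, unlabeled))
```

On this toy data, the football documents are voted negative in every run, while the unlabeled xml document always scores at least as high as some spy and escapes the negative set.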

  30. Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking SVM in a Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work

  31. Effect of Varying Voting Threshold [Plot: x-axis — voting threshold; y-axis — relative average rank of labeled fragments, i.e. new average rank / original average rank.]

  32. Effectiveness of Low-Level Features on XR • In this experiment, we remove individual low-level features from the STR and DAT rankers and measure the new precision • The queries we use can be found in the appendix of the proposal

  33. Processing Time

  34. Comparison with TopX • TopX is a search engine for XML data available online • State-of-the-art XML search engine • We measure MAP and precision@k • MAP (mean average precision): the average precision over 100 recall points for each query, averaged over all queries • precision@k: (number of relevant results in the top k) / k
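The two measures can be stated precisely in a few lines. Note the slide's MAP interpolates over 100 recall points; the sketch below uses the common rank-based variant of average precision instead, and the function names and toy data are illustrative.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    return sum(1 for r in ranked[:k] if r in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k where a relevant item appears."""
    hits, total = 0, 0.0
    for k, r in enumerate(ranked, start=1):
        if r in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """results: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)

ranked = ["f1", "f2", "f3", "f4"]
relevant = {"f1", "f3"}
print(precision_at_k(ranked, relevant, 2))            # 0.5
print(round(average_precision(ranked, relevant), 3))  # 0.833
```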

  35. Outline • Motivation • Key-Tag Search • Multi-Ranker Model • Ranking SVM in a Voting SpyNB Framework (RSSF) • Experiments • Conclusions and Ongoing Work

  36. Further remarks • Searching and ranking XML data are important, since existing Web search engines cannot handle XML well • We present an effective approach to adaptive XML searching and ranking that extends traditional IR techniques by considering different features of XML data

  37. Ongoing Work – INEX 2007 • The Initiative for the Evaluation of XML Retrieval (INEX) • A community that aims to provide large test collections and scoring methods for researchers to evaluate their retrieval systems • It has been attracting increasing attention • We participated in INEX in 2006 and 2007 • The INEX 2007 collection is a Wikipedia XML corpus with a set of 659,388 XML documents • We are running experiments using their data and queries

  38. Ongoing Work – INEX 2007

  39. Ongoing Work – Merging • Displaying a list of fragments one by one to the user may not be adequate in the XML setting • Relevant fragments may be scattered across the list • Duplicated fragments may appear in different structures • Users may need to refine a search query to obtain more and better results • Idea: make use of the schema information (DTD), consider the fragments as entities, and merge them in a concise way

  40. My Publications
  • Ho-Lam LAU and Wilfred NG. A Multi-Ranker Model for Adaptive XML Searching. Accepted and to appear: VLDB Journal, (2007).
  • Ho-Lam LAU and Wilfred NG. Towards an Adaptive Information Merging Using Selected XML Fragments. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 1013-1019, (2007).
  • James CHENG and Wilfred NG. A Development of Hash-Lookup Trees to Support Querying Streaming XML. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 768-780, (2007).
  • Wilfred NG and James CHENG. An Efficient Index Lattice for XML Query Evaluation. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 753-767, (2007).
  • Wilfred NG and Ho-Lam LAU. A Co-Training Framework for Searching XML Documents. Information Systems, 32(3), pp. 477-503, (2007).
  • Yin YANG, Wilfred NG, Ho-Lam LAU and James CHENG. An Efficient Approach to Support Querying Secure Outsourced XML Information. Conference on Advanced Information Systems Engineering, CAiSE 2006, Lecture Notes in Computer Science Vol. 4007, Luxembourg, pp. 157-171, (2006).
  • Wilfred NG and Ho-Lam LAU. Effective Approaches for Watermarking XML Data. 10th International Conference on Database Systems for Advanced Applications, DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 68-80, (2005).
  • Ho-Lam LAU and Wilfred NG. A Unifying Framework for Merging and Evaluating XML Information. 10th International Conference on Database Systems for Advanced Applications, DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 81-94, (2005).
