CoXML: A Cooperative XML Query Answering System

CoXML: A Cooperative XML Query Answering System Shaorong Liu and Wesley W. Chu APWeb/WAIM 2007

Motivation • XML has become the standard format for information representation and data exchange • XML schema is usually very complex • E.g., the XML schema for the IEEE Computer Society publications contains about 170 distinct tags and more than 1000 distinct paths • It is often unrealistic for users to fully understand a schema before asking queries • Exact query answering isinadequateand approximate query answering is more desirable!

A new paradigm for XML approximate query answering that places users and their demands in the center of the design approach Query Cooperative XML Query Answering Approximate Answers XML Database Engine XML Documents Our Contribution: CoXML

Roadmap • Introduction • Background • CoXML • Related Work • Conclusion

article article article title title year year section section title year section search engine search engine 2003 2003 spam detection spam detection search engine 2000-2005 spam detection document title year section search engine 2003 spam detection XML Query Relaxation Types • Value relaxation: enlarging a value condition’s search scope • Node relabel: changing the label a node to a similar or a more general label by domain knowledge [1] Tree Pattern Relaxation (S. Amer-Yahia, et al., 2000)

article article article title year section title year section title year section search engine 2003 spam detection search engine 2003 spam detection search engine 2003 spam detection article search engine year section 2003 spam detection XML Query Relaxation Types • Edge generalization: relaxing a ‘/’ edge to a ‘//’ edge • Node deletion: dropping a node from a query tree

XML Relaxation Properties • Definition • Relaxation operation: an application of a relaxation type to a specific query node or edge • Lemma • Given a query tree with n applicable relaxation operations, there are potentially up to 2n relaxed trees • Possible combinations:

XML Query Relaxation Challenges • Query relaxation is often user-specific • Different users may have different approximate matching specifications for a given query tree • How to provide user-specific approximate query answering? • A query with n relaxation operations has potentially up to 2n relaxed queries • How to systematically relax a query? • Query relaxation generates a set of approximate answers • How to effectively rank the returned approximate answers?

Relaxation Index XML Documents CoXML System Overview relaxation language ranked results query Ranking Module Relaxation Engine results relaxed query Relaxation Index Builder exact answers query CoXML XML Database Engine

Roadmap • Introduction • Background • CoXML • Relaxation Language • Relaxation Index Structure • Ranking of Approximate Answers • Experimental Studies • Related Work • Conclusion

Relaxation Language • A relaxation-enabled query is a tuple {T,R, C, S} • T: tree-pattern query • R: relaxation constructs • E.g., delete/re-label a node, generalize an edge • C: relaxation controls • E.g., prefer/reject certain relaxation operations, use certain relaxation types, control relaxation orders, etc • S: stop condition • E.g., the minimum # of approximate answers to be returned

article $1 C = !Rel($3, -)  !Del($3)  Reject($2, bb) fm $2 atl $3 “digital libraries” !Rel($3, -) : $3 cannot be re-labeled !Del($3): $3 cannot be deleted Reject($2, bb): $2 cannot be re-labeled to bb Relaxation Language Example <inex_topic topic_id="267" > <castitle> //article//fm//atl[about(., "digital libraries")] </castitle> <description> Articles containing "digital libraries" in their title. </description> <narrative> I'm interested in articles discussing Digital Libraries as their main subject. Therefore I require that the title of any relevant article mentions "digital library" explicitly. Documents that mention digital libraries only under the bibliography are not relevant, as well as documents that do not have the phrase "digital library" in their title. </narrative> </inex_topic>

How to Relax Queries? • Naïve approach • Generate all possible relaxed queries & iteratively select the best relaxed query to derive approximate answers • Exhaustive, but not scalable • Observation • Many queries share the same (or similar) tree structures • Our approach: relaxation index structure • Consider the structure of a query tree T as a template • Build indexes on the relaxed trees of T • Use the index to guide the relaxations of any query with the same (or similar) tree structure as that of T

Relaxation Index Structure - XTAH • XTAH • A hierarchical multi-level labeled cluster of relaxed trees for a given query tree • Building an XTAH • Given a query structure template T, generate all possible relaxed trees • Each relaxed trees uses an unique set of relaxation operations • Cluster relaxed trees into groups based on relaxation operations and distances -- similar to “suffix-tree” clustering

article relax body T6 T2 article section title body edge_generalization node_relabel node_deletion T7 T4 article article T3 article section title body title body section {gen(e$1,$2)} … {gen(e$3, $4)} ... {del($2)} … section section T1 article … {gen(e$1, $2), gen(e$3, $4)} … {gen(e$3, $4), gen(e$1,$3)} … {del($2), del($3)} title body article $1 section … … … title $2 $3 body section $4 Template structure T XTAH Example for Template Structure T gen(e$u, $v) – relaxing the edge between $u and $v del($u) – deleting the node $u

XTAH Properties • Each group consists of a set of relaxed trees derived from similar relaxation operations • The relaxed trees can be located efficiently based on the type of relaxation operation • The higher level group in the XTAH yields lesser relaxation than the lower group • Query can be relaxed to different level of granularities by traversing up and down the XTAH

Ranking of XML Approximate Answers • Content similarity – cont_sim(A, Q) • An extended vector space model [2] • Structure similarity – struct_dist(A, Q) • Use tree editing distance for measuring structure similarity • Propose a cost model that assigns operation cost based on relaxation semantics • Overall relevancy – sim(A, Q) • A ranking model combing both content similarity and structure distance  is a small constant between 0 and 1 [2] Configurable Indexing and Ranking for XML Information Retrieval (S. Liu, et al., 2004)

Experimental Studies • Experiment Setup • INEX (INitiative for the Evaluation of Xml) 05 test collection • Document collection • Query set • Gold standard • Evaluation Metrics • nxCG (normalized extended cumulative gain) • the official evaluation metric used in INEX 05 • Given a number i (i1), nxCG@i, similar to precision@i, measures the relative gain users accumulated up to the rank i

Retrieval performance improvements with semantic cost model • Query set: all content-and-structure queries in INEX 05 nxCG@10 (, cost model) Assigning relaxation operation with different cost based on the similarities of the nodes being operated improves retrieval performance! nxCG@25 and nxCG@50 yield similar results

article $1 C = !Rel($3, -)  !Del($3)  Reject($2, bb) fm $2 atl $3 “digital libraries” Perfect accuracy Evaluation of Relaxation Control • Query: topic 267 • Result: Relaxation control enables the system to provide answers with greater relevancy!

Related Work • Relaxation based on schema conversions ([LC01, LMC01], [LMC03]) • Without structure relaxation • Native XML relaxation • Proposed structure relaxation types [e.g., KS01, ACS02] • Used the relaxation types [ACS02] in our work • Investigate efficient algorithms for deriving top-K answers based on relaxation types [e.g, Sch02, ACS02, ALP04, AKM05] • Without relaxation control

Conclusion • Cooperative XML (CoXML) query answering • Relaxation-enabled query language allows users to effectively express the relaxed query conditions as well as controlling the relaxation process • XTAH provides systematic query relaxation guidance • Used both content and structure similarity metrics for evaluating the relevancy of approximate answers • Evaluation studies with the INEX test collections validate the effectiveness of our methodology

CoXML: A Cooperative XML Query Answering System