A Distributed Indexing Strategy for Efficient XML Retrieval

Efficiency Issues in Information Retrieval Workshop (EIIR) @ 30th European Conference on Information Retrieval (ECIR), Glasgow, GB, March/April 2008
Judith Winter
Institute for Informatics / Telematics Group
J. W. Goethe-University / Frankfurt am Main, Germany

Overview
• Introduction
• A search engine for XML IR in P2P
• Indexing techniques
• Outlook on current implementation
• Questions and discussion

XML Information Retrieval in Peer-to-Peer Systems
• Challenges:
• bandwidth consumption / communication overhead
• only selected information available
• vague queries
• relevance-ranking
Information Retrieval, Peer-to-Peer, XML-Retrieval
• structured documents
• more precise search
• based on client/server architectures
• distributed
• autonomous peers
• growing amount of XML documents

System characteristics:
• Queries: content-and-structure (CAS)
• Indexing: includes structure
• Fixed limit for posting list sizes; pre-computing of posting lists for popular term combinations → highly discriminative keys (HDKs)
• Hybrid indexing: globally or locally (distributing summaries) depending on peer status
• Pruning posting lists by considering structural information
• Ranking: extended vector space model
• Results/Retrieval units: document or passage retrieval

[Figure: architecture of a peer in three layers on top of the P2P network.
Application: file system; graphical user interface (querying and result presentation); indexing of local documents.
Information retrieval: indexing component, ranking component, retrieval component; index storage component with a local index (HDK index, document index, retrieval unit index) and a distributed index (frequent XTerm index, HDK index, document index, retrieval unit index).
Peer-to-peer: P2P component, a variant of a DHT algorithm (Kademlia/Chord).
Data flows shown: documents d_n, term statistics for retrieval units, frequencies, query q, results for q.]

HDK-based indexing:
• Use of XTerms: (content, structure)-tuples
• Rare tuple-combinations: Highly Discriminative Keys (HDKs)
• Over 80% of queries are multi-term → precomputed key combinations
• If a key is frequent (its frequency exceeds a threshold): combine it with other frequent keys of the same window (e.g. the same XML element)
• Example:
  apple   \book\chapter → dok1(14.5), dok2(12.4)
          \magazine\p → dok2(5.3), dok3(2.7), dok4(0.7)
  chips   \book → dok4(18.4), dok1(2.3), dok2(2.1), dok3(1.5)
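The key expansion described on this slide could be sketched roughly as follows. This is a minimal illustration, not the SPIRIX implementation: the function name `build_hdk_index`, the parameter `max_postings`, the use of a whole document as the "window", and summing scores for combined keys are all assumptions made for the example. Only the XTerms and posting lists are taken from the slide.

```python
from itertools import combinations

# XTerms are (content, structure) tuples; postings map document -> score.
# The data below is the example from the slide.
APPLE_CHAPTER = ("apple", r"\book\chapter")
APPLE_MAG = ("apple", r"\magazine\p")
CHIPS_BOOK = ("chips", r"\book")

postings = {
    APPLE_CHAPTER: {"dok1": 14.5, "dok2": 12.4},
    APPLE_MAG: {"dok2": 5.3, "dok3": 2.7, "dok4": 0.7},
    CHIPS_BOOK: {"dok4": 18.4, "dok1": 2.3, "dok2": 2.1, "dok3": 1.5},
}


def build_hdk_index(postings, max_postings):
    """Keep rare keys unchanged and combine frequent keys pairwise.

    Assumption: a key is "frequent" if its posting list is longer than
    max_postings, and two keys share a window if they share a document.
    """
    index = {}
    frequent = []
    for xterm, plist in postings.items():
        if len(plist) > max_postings:
            frequent.append(xterm)
        else:
            index[(xterm,)] = dict(plist)  # already discriminative

    for a, b in combinations(sorted(frequent), 2):
        shared = postings[a].keys() & postings[b].keys()
        if shared:
            # Assumption: the combined score is the sum of both scores.
            index[(a, b)] = {d: postings[a][d] + postings[b][d] for d in shared}
    return index


hdk_index = build_hdk_index(postings, max_postings=2)
for key, plist in hdk_index.items():
    print(key, sorted(plist))
```

With a limit of two postings, the first XTerm stays a single-term key, while the two longer lists are replaced by one combined key whose posting list holds only the documents containing both XTerms. A fuller version would repeat the expansion if a combined key were still frequent.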

Pruning posting lists (FrequentXTermIndex):
• Entries sorted by score_t(d_i); choose the k best entries for XTerm t
• Considers document d_i, best retrieval unit ru_best, and peer p_i
• Weighting function w: BM25f-based
• PeerScore: high for peers with good collections regarding t and with good performance metrics
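A top-k pruning step of this kind might look like the sketch below. The slides do not give the scoring formula, so the combination in `score` (a weighted mix of document and retrieval-unit weight, scaled by the peer score) and the parameter `alpha` are purely hypothetical; the weights are assumed to be precomputed, for example by a BM25f-style function.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc_id: str
    doc_weight: float      # weight of XTerm t in document d_i
    ru_best_weight: float  # weight of t in the best retrieval unit of d_i
    peer_score: float      # quality of the peer holding d_i, assumed in [0, 1]


def score(p, alpha=0.5):
    # Hypothetical combination; not the formula used in the talk.
    return (alpha * p.doc_weight + (1 - alpha) * p.ru_best_weight) * p.peer_score


def prune(postings, k):
    """Return the k best entries, sorted by descending score."""
    return heapq.nlargest(k, postings, key=score)


postings = [
    Posting("d1", 4.0, 6.0, 0.9),
    Posting("d2", 8.0, 8.0, 0.5),
    Posting("d3", 2.0, 2.0, 1.0),
    Posting("d4", 6.0, 10.0, 0.25),
]
top = prune(postings, k=2)
print([p.doc_id for p in top])  # ['d1', 'd2']
```

Note that d2 has the highest raw weights but ranks below d1 because its peer score is lower, which is the effect the PeerScore bullet describes.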

Hybrid indexing:
• Indexing depending on status of peer:
  • Exhaustive indexing: per document
  • Quick indexing: per peer (summaries, e.g. tf per peer)
• Peer status considers:
  • Response times
  • Available bandwidth
  • Open IP address (vs. NAT-bound)
  • Latency
  • CPU/Memory …
  • Online time (65% of the peers were online only once, >20% of all connections lasted <1 minute, 60% of the peers stayed active <10 min)
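The choice between the two indexing modes could be sketched as below. All names and thresholds here are hypothetical, and the sketch assumes that reliable peers are indexed exhaustively while short-lived or poorly connected peers only contribute a per-peer summary; the slides list the status criteria but not the decision rule.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PeerStatus:
    response_time_ms: float
    bandwidth_kbps: float
    has_open_ip: bool      # False if the peer is NAT-bound
    online_minutes: float


def choose_indexing_mode(status):
    """Return "exhaustive" (per document) or "quick" (per-peer summary)."""
    # Thresholds are illustrative assumptions, not values from the talk.
    reliable = (
        status.online_minutes >= 10
        and status.has_open_ip
        and status.response_time_ms <= 500
        and status.bandwidth_kbps >= 256
    )
    return "exhaustive" if reliable else "quick"


def peer_summary(docs):
    """Quick indexing: aggregate term frequencies over all local documents."""
    summary = Counter()
    for terms in docs:
        summary.update(terms)
    return summary


stable = PeerStatus(response_time_ms=120, bandwidth_kbps=1024,
                    has_open_ip=True, online_minutes=300)
short_lived = PeerStatus(response_time_ms=120, bandwidth_kbps=1024,
                         has_open_ip=True, online_minutes=0.5)
print(choose_indexing_mode(stable), choose_indexing_mode(short_lived))
print(peer_summary([["apple", "chips"], ["apple"]]))
```

Publishing only a summary for short-lived peers keeps the indexing cost low for the large share of peers that, according to the online-time figures above, leave within minutes.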

Outlook on current implementation:
• Implementation of SPIRIX: Search Engine for P2P Information Retrieval in XML-Documents
• Indexing based on Terrier (centralized approach for text documents, University of Glasgow)
• P2P component:
  • Based on Kademlia/Chord
  • Collects peer characteristics
  • Adapted to the special requirements of XML IR
• Ranking:
  • Extension of the vector space model
  • BM25f-based weighting

Questions and discussion
