
Highly Scalable Algorithm for Distributed Real-time Text Indexing

Ankur Narang, Vikas Agarwal, Monu Kedia, Vijay Garg

IBM Research - India

Email: {annarang, avikas, monkedia}@in.ibm.com, garg@ece.utexas.edu

Agenda
  • Background
  • Challenges in Scalable Indexing
  • In-memory Index data structure design
  • Parallel Indexing Algorithm
    • Parallel Pipelined Indexing
    • Asymptotic Time Complexity Analysis
  • Experimental Results
    • Strong Scalability
    • Weak Scalability
    • Search Performance
  • Conclusions & Future Work
Background
  • Data Intensive Supercomputing is gaining strong research momentum
    • Large scale computations over massive and changing data sets
    • Multiple Domains: Telescope imagery, online transaction records, financial markets, medical records, weather prediction
  • Massive throughput real-time text indexing and search
    • Massive data at high rate ~ 1-10 GB/s
    • Index expected to age-off at regular intervals
    • Architectural Innovations
      • Massively parallel / many-core architectures
      • Storage-class memories with tens of terabytes of storage
    • Requirement for very high indexing rates and stringent search response time
      • Optimizations needed to
        • Maximize Indexing Throughput
        • Minimize indexing latency (time from indexing to searchability, per document)
      • Sustained search performance

Background – Index for Text Search (e.g. Lucene)

  • Lucene Index Overview
    • A Lucene index covers a set of documents
    • A document is a sequence of fields
    • A field is a sequence of terms
    • A term is a text string
    • A Lucene index consists of one or more segments
    • Each segment covers a set of documents
    • Each segment is a fully independent index
  • Index updates are serialized
  • Multiple index searches can proceed concurrently
  • Supports simultaneous index update and search
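
The hierarchy just listed can be captured in a few plain data types. This is a minimal illustrative sketch of the containment relationships only; the struct names are assumptions for illustration and not the actual Lucene/CLucene classes.

```cpp
#include <string>
#include <vector>

// Minimal sketch of the containment hierarchy described above
// (index -> segments -> documents -> fields -> terms).
// These structs are illustrative only, not the real Lucene/CLucene classes.
using Term = std::string;                  // a term is a text string

struct Field {
    std::string name;
    std::vector<Term> terms;               // a field is a sequence of terms
};

struct Document {
    std::vector<Field> fields;             // a document is a sequence of fields
};

struct Segment {                           // a fully independent index
    std::vector<Document> docs;            // covering a set of documents
};

struct Index {
    std::vector<Segment> segments;         // an index has one or more segments
};
```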
Challenges to Scalable In-memory Distributed Real-time Indexing
  • Scalability Issues in Typical Approaches
    • Merge-sort of the sorted term lists of the input segments to generate the Terms and TermInfos for the MergedSegment
    • Merging and Re-organization of the Document-List and Position-List of the input segments
      • Load imbalance increases as the number of processors grows
      • Index-Merge process quickly becomes the bottleneck
    • Large indexing latency
  • Index Data Structure Design Challenge
    • Inherent trade-offs in index-size vs. indexing throughput vs. search throughput
      • Trade-off in indexing latency vs. search response time vs. throughput
  • Performance Objective
    • Maximize Indexing performance while sustaining the search performance (including search response time and throughput)
Scalability Issues With Typical Indexing Approaches

[Figure: two input segments, Segment(1) and Segment(2), each hold Term(T(i)), its TermInfo(T(i)), a Document-List (Doc(1)/Freq(1), Doc(2)/Freq(2), ...) and a Position-List (p11, p12, p21, p22, ...). Step(1): merge-sort of the terms and creation of a new TermInfo. Step(2): merge of the Document-Lists and Position-Lists, producing the Merged Segment with the combined Document-List (Doc(1)/F1 ... Doc(4)/F4) and Positions List (p11 ... p43) for Term(T(i)).]

In-memory Indexing Data Structure Design
  • Two-level hierarchical index data structure design
    • Top-level hash table: GHT (Global Hash Table)
      • Represents complete index for a given set of documents
      • Map: Term => second-level hash table (IHT)
    • Second level hash table: IHT (Interval Hash Table)
      • Represents index for an interval of documents with contiguous IDs
      • Map: Term => list of documents containing that term
        • Postings data also stored
  • Advantages of the design
    • No need for re-organization of data while merging IHT into GHT
    • Merge-sort eliminated using hash-operations
    • Efficient encoding of IHT reduces memory requirements of an IHT
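
A minimal sketch of this two-level design is below, assuming hypothetical type names (Posting, IntervalHashTable, GlobalHashTable); the actual implementation additionally encodes each IHT into flat arrays, as shown on the next slide.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch of the two-level index described above.
// Type and field names are assumptions, not the paper's actual code.
struct Posting {
    uint32_t docId;
    uint32_t freq;
    std::vector<uint32_t> positions;   // postings data stored alongside
};

// Second level (IHT): index for an interval of documents with contiguous IDs.
struct IntervalHashTable {
    uint32_t firstDocId = 0;
    uint32_t lastDocId  = 0;           // covers [firstDocId, lastDocId]
    std::unordered_map<std::string, std::vector<Posting>> postings;  // term -> docs
};

// Top level (GHT): complete index for a given set of documents. Each term maps
// to the IHTs (document intervals) that contain it, so merging a new IHT only
// appends references and never reorganizes previously indexed intervals.
struct GlobalHashTable {
    std::unordered_map<std::string, std::vector<const IntervalHashTable*>> termToIhts;
};
```

Because the GHT only appends a reference per term when an IHT arrives, existing postings are never rewritten, which is what eliminates the merge-sort and data reorganization of the segment-merge approach.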
Interval Hash Table (IHT): Concept

[Figure: a hash table with term-collision resolution; a term Ti hashes to bucket HF(Ti), whose entry points to the IHT data for the document interval Di, Di+1, ..., Dj: for each document, its DocID, term frequency, and positions array.]

Global Hash Table (GHT): Concept

[Figure: a hash table with term-collision resolution; a term Ti hashes to bucket HF(Ti), whose entry points to a document-interval indexed hash table: each interval Dj-Dk maps to the per-document data (DocID, term frequency, positions array) for documents Dj, Dj+1, ..., Dk.]

Encoded IHT representation
  • The IHT is encoded as a set of flat sub-arrays:
    • Number of distinct terms in each hash-table entry (size: # hash-table entries)
    • Term IDs (size: # distinct terms in the IHT)
    • Number of docs in which each term occurred (size: # distinct terms in the IHT)
    • Document IDs per term (size: # docs/term * # terms)
    • Term frequency in each document (size: # docs/term * # terms)
    • Offset into position information (size: # docs/term * # terms)
  • Steps to access term positions from (TermID(Ti), DocID(Dj)): Get NumTerms from TermKey(Ti), GetTermID(Ti), GetNumDocs(Ti), GetDocIDs(Ti), GetNumTerms(Dj), then Get Offset into Position Data(Ti, Dj)
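
Below is one possible flat-array layout matching the sub-arrays listed above, together with the lookup path from a term's slot and a document ID to its position data. All field and function names are hypothetical, and the initial hash-bucket (TermKey) step is omitted to keep the sketch short.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical flat-array encoding of an IHT following the sub-arrays above.
// Field and function names are illustrative, not the paper's actual code.
struct EncodedIHT {
    std::vector<uint32_t> termsPerBucket;  // size: # hash-table entries
    std::vector<uint32_t> termIds;         // size: # distinct terms in IHT
    std::vector<uint32_t> docsPerTerm;     // size: # distinct terms in IHT
    std::vector<uint32_t> docIds;          // size: total (term, doc) pairs
    std::vector<uint32_t> termFreqs;       // size: total (term, doc) pairs
    std::vector<uint32_t> posOffsets;      // size: total (term, doc) pairs
    std::vector<uint32_t> positions;       // concatenated position data
};

// Lookup path from a term's slot (already resolved via the hash bucket) and a
// DocID to that term's positions in the document. Prefix sums over docsPerTerm
// would normally be precomputed; they are computed inline to keep this short.
inline const uint32_t* findPositions(const EncodedIHT& iht, uint32_t termSlot,
                                     uint32_t docId, uint32_t& numPositions) {
    uint32_t docBase = 0;
    for (uint32_t t = 0; t < termSlot; ++t) docBase += iht.docsPerTerm[t];
    for (uint32_t d = 0; d < iht.docsPerTerm[termSlot]; ++d) {
        if (iht.docIds[docBase + d] == docId) {
            numPositions = iht.termFreqs[docBase + d];       // freq = #positions
            return &iht.positions[iht.posOffsets[docBase + d]];
        }
    }
    numPositions = 0;
    return nullptr;                                           // term not in docId
}
```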
New Indexing Algorithm
  • Main Steps of the Indexing Algorithm
    • A posting table (LHT) is constructed for each document without sorting its terms
    • Posting tables of k documents are merged into an IHT, which is then encoded compactly
    • Encoded IHTs are merged efficiently into a single GHT (a sketch of the first two steps follows)
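
A minimal sketch of the first two steps, using assumed names (buildPostingTable, mergeIntoIht) rather than the paper's actual code: the per-document posting table is built purely with hash inserts, so terms are never sorted, and k such tables are then folded into an IHT keyed by term.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch only; names and details are assumptions.
// Step (1): per-document posting table via hash inserts, no sorting of terms.
using PostingTable = std::unordered_map<std::string, std::vector<uint32_t>>;

PostingTable buildPostingTable(const std::string& text) {
    PostingTable lht;
    std::istringstream in(text);
    std::string token;
    for (uint32_t pos = 0; in >> token; ++pos)
        lht[token].push_back(pos);          // term -> positions, hash insert only
    return lht;
}

// Step (2): merge the posting tables of k documents (contiguous IDs) into an IHT.
struct IhtEntry { uint32_t docId, freq; std::vector<uint32_t> positions; };
using Iht = std::unordered_map<std::string, std::vector<IhtEntry>>;

Iht mergeIntoIht(const std::vector<PostingTable>& docTables, uint32_t firstDocId) {
    Iht iht;
    for (uint32_t d = 0; d < docTables.size(); ++d)
        for (const auto& [term, positions] : docTables[d])
            iht[term].push_back({firstDocId + d,
                                 static_cast<uint32_t>(positions.size()),
                                 positions});
    return iht;
}
```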
GHT Construction from IHT

[Figure: a new encoded IHT(g) is taken from the array of IHTs and merged into the Global Hash Table. Step S1: the distinct terms (Ti, Tj, ...) are read from the encoded IHT array. Steps S2(a), S2(b): each distinct term is hashed (HF(Ti), HF(Tj)) into the Global Hash Table and its entry is linked to IHT(g).]

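
A minimal sketch of this merge step, with hypothetical types (IhtHandle, Ght): each distinct term of the incoming encoded IHT is hashed into the GHT and the entry is extended with a reference to that IHT, so previously merged intervals are never touched.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the merge step shown above. No merge-sort of terms
// and no rewriting of posting data is needed: the GHT is append-only per term.
struct IhtHandle {                            // stand-in for one encoded IHT(g)
    uint32_t firstDocId, lastDocId;           // document interval it covers
    std::vector<std::string> distinctTerms;   // S1: distinct terms of IHT(g)
};

using Ght = std::unordered_map<std::string, std::vector<const IhtHandle*>>;

// S2: hash each distinct term into the GHT and link its entry to IHT(g).
void mergeIhtIntoGht(Ght& ght, const IhtHandle& iht) {
    for (const std::string& term : iht.distinctTerms)
        ght[term].push_back(&iht);            // expected O(1) per term, append-only
}
```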
Parallel Group-based Indexing Algorithm

[Figure: incoming documents are partitioned across index groups I0, I1, I2, I3, I4; each indexing group indexes the documents assigned to it, while a search group serves queries over the groups' indexes.]

Pipeline Diagram

[Figure: timeline of the distributed indexing algorithm. In each round, Producer(1), Producer(2), and Producer(3) produce IHTs/segments and send them to the Consumer, which merges the IHTs/segments of the previous round; rounds are separated by barrier synchronization.]

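
The toy program below illustrates the producer/consumer overlap on a single node using threads and a shared queue; the real system runs the Producers and the Consumer on separate Blue Gene/L nodes with message passing and per-round barrier synchronization, which this sketch omits. All names and the IHT payload are placeholders.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Single-node illustration of the pipeline: producers build IHTs and hand
// them to a consumer that merges them while production of the next round
// continues. The IHT here is reduced to term -> doc IDs for brevity.
using Iht = std::unordered_map<std::string, std::vector<uint32_t>>;

struct IhtQueue {
    std::queue<Iht> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(Iht iht) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(iht)); }
        cv.notify_one();
    }
    bool pop(Iht& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || done; });
        if (q.empty()) return false;
        out = std::move(q.front()); q.pop();
        return true;
    }
    void finish() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};

int main() {
    IhtQueue queue;
    Iht ght;   // consumer-side global index (term -> doc IDs)

    // Consumer: merge IHTs into the GHT as they arrive (overlaps with production).
    std::thread consumer([&] {
        Iht iht;
        while (queue.pop(iht))
            for (auto& [term, docs] : iht)
                ght[term].insert(ght[term].end(), docs.begin(), docs.end());
    });

    // Producers: each round, index a batch of documents into an IHT and send it.
    std::vector<std::thread> producers;
    for (int p = 0; p < 3; ++p)
        producers.emplace_back([&, p] {
            for (uint32_t round = 0; round < 4; ++round) {
                Iht iht;
                iht["term" + std::to_string(p)].push_back(round);  // toy content
                queue.push(std::move(iht));
            }
        });

    for (auto& t : producers) t.join();
    queue.finish();
    consumer.join();
}
```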

Asymptotic Time Complexity Analysis
  • Definitions
    • Size of the indexing group: |G| = |P| + 1
      • P: set of Producers
      • Single Consumer
    • “n” Produce-Consume rounds
      • |P| Producers and a single Consumer in each round
      • Prod(j, i): total time for the j-th Producer in the i-th round
        • ProdComp(j, i): compute time
        • ProdComm(j, i): communication time
      • Cons(i): total time for the Consumer in the i-th round
        • ConsComp(i): compute time
        • ConsComm(i): communication time
  • Distributed Indexing
    • T(distributed) = X + Y + Z, where (see the combined formula below)
      • X = max_j ProdComp(j, 1)
      • Y = Σ_{i=2..n} max( max_j Prod(j, i), Cons(i−1) )
      • Z = Cons(n)
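
Putting the three terms together (all symbols as defined above), the pipelined indexing time can be written as:

```latex
T_{\text{distributed}}
  = \underbrace{\max_{j} \mathrm{ProdComp}(j,1)}_{X}
  \;+\; \underbrace{\sum_{i=2}^{n} \max\!\Big( \max_{j} \mathrm{Prod}(j,i),\; \mathrm{Cons}(i-1) \Big)}_{Y}
  \;+\; \underbrace{\mathrm{Cons}(n)}_{Z}
```

The middle term captures the pipeline overlap: in round i the Producers work on round i while the Consumer merges the IHTs of round i−1, so only the slower of the two contributes to the total time.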
Asymptotic Time Complexity Analysis
  • Overall Indexing Time: dependent on balance of pipeline stages
    • 2 cases
      • Produce phase dominates merge phase
      • Merge phase dominates the produce phase
  • Time complexity Upper Bounds
    • Case(1) : Production time per round > Merge time per round
      • T(Pghtl) = O(R/|P|)
      • T(Porgl) = O((R/|P|) * log(k))
    • Case(2) : Merge time per round > Production time per round
      • T(Pghtl) = O(R/k)
      • T(Porgl) = O((R/k) * log(|P|))
Experimental Setup
  • Original CLucene codebase (v0.9.20)
    • Porgl implementation
      • Distributed in-memory indexing algorithm using RAMDirectory
      • Distributed Search Implementation
    • Pghtl implementation
      • Implementation of IHT and GHT data structures
      • Distributed Indexing and Search Algorithm Implementation
  • IBM Intranet website data
    • Text data extracted from HTML files
    • Loaded equally into the memory of the producer nodes
  • Experiments run on Blue Gene/L
    • Up to 16K processor nodes (PPC 440)
      • 2 PPC 440 cores per node
      • Co-processor mode: 1 core for compute, 1 for communication (routing)
    • High Bandwidth 3D torus interconnect
  • For Porgl
    • k is chosen so that only one segment is created from all the text data fed to a Producer, which gives Porgl its best indexing throughput
Conclusions & Future Work
  • High throughput text indexing demonstrated for the first time at such a large scale.
    • Architecture independent design of new data structures
    • Algorithm for distributed in-memory real-time group-based text indexing
      • Better load balance, low communication cost, and good cache performance
  • Proved analytically: the parallel time complexity of our indexing algorithm is at least a factor of log(|P|) better asymptotically than typical indexing approaches.
  • Experimental Results
    • 3× - 7× improvement in indexing throughput and around 10× better indexing latency on Blue Gene/L.
    • Peak indexing throughput of 312 GB/min on 8K nodes
    • Estimate: 5 TB/min on 128K nodes
  • Future Work: Distributed Search Optimizations