
Highly Scalable Algorithm for Distributed Real-time Text Indexing

Ankur Narang, Vikas Agarwal, Monu Kedia, Vijay Garg

IBM Research - India

Email: {annarang, avikas, monkedia}@in.ibm.com, garg@ece.utexas.edu

Agenda
  • Background
  • Challenges in Scalable Indexing
  • In-memory Index data structure design
  • Parallel Indexing Algorithm
    • Parallel Pipelined Indexing
    • Asymptotic Time Complexity Analysis
  • Experimental Results
    • Strong Scalability
    • Weak Scalability
    • Search Performance
  • Conclusions & Future Work
Background
  • Data Intensive Supercomputing is gaining strong research momentum
    • Large scale computations over massive and changing data sets
    • Multiple Domains: Telescope imagery, online transaction records, financial markets, medical records, weather prediction
  • Massive throughput real-time text indexing and search
    • Massive data at high rate ~ 1-10 GB/s
    • Index expected to age-off at regular intervals
    • Architectural Innovations
      • Massively parallel / many-core architectures
      • Storage-class memories with tens of terabytes of storage
    • Requirement for very high indexing rates and stringent search response time
      • Optimizations needed to
        • Maximize Indexing Throughput
        • Minimize indexing latency (time from indexing to searchability, per document)
      • Sustained search performance

Background – Index for Text Search (e.g. Lucene)

  • Lucene Index Overview
    • A Lucene index covers a set of documents
    • A document is a sequence of fields
    • A field is a sequence of terms
    • A term is a text string
    • A Lucene index consists of one or more segments
    • Each segment covers a set of documents
    • Each segment is a fully independent index
  • Index updates are serialized
  • Multiple index searches can proceed concurrently
  • Supports simultaneous index update and search
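
The hierarchy just listed can be captured in a few plain data types. This is a minimal illustrative sketch of the containment relationships only; the struct names are assumptions for illustration and not the actual Lucene/CLucene classes.

```cpp
#include <string>
#include <vector>

// Minimal sketch of the containment hierarchy described above
// (index -> segments -> documents -> fields -> terms).
// These structs are illustrative only, not the real Lucene/CLucene classes.
using Term = std::string;                  // a term is a text string

struct Field {
    std::string name;
    std::vector<Term> terms;               // a field is a sequence of terms
};

struct Document {
    std::vector<Field> fields;             // a document is a sequence of fields
};

struct Segment {                           // a fully independent index
    std::vector<Document> docs;            // covering a set of documents
};

struct Index {
    std::vector<Segment> segments;         // an index has one or more segments
};
```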
Challenges to Scalable In-memory Distributed Real-time Indexing
  • Scalability Issues in Typical Approaches
    • Merge-sort of the sorted term lists of the input segments to generate the Terms and TermInfos for the MergedSegment
    • Merging and Re-organization of the Document-List and Position-List of the input segments
      • Load imbalance increases as the number of processors grows
      • Index-Merge process quickly becomes the bottleneck
    • Large indexing latency
  • Index Data Structure Design Challenge
    • Inherent trade-offs in index-size vs. indexing throughput vs. search throughput
      • Trade-off in indexing latency vs. search response time vs. throughput
  • Performance Objective
    • Maximize Indexing performance while sustaining the search performance (including search response time and throughput)
Scalability Issues With Typical Indexing Approaches

[Figure: two input segments, Segment(1) and Segment(2), each hold Term(T(i)), its TermInfo(T(i)), a Document-List (Doc(1)/Freq(1), Doc(2)/Freq(2), ...) and a Position-List (p11, p12, p21, p22, ...). Step(1): merge-sort of the terms and creation of a new TermInfo. Step(2): merge of the Document-Lists and Position-Lists, producing the Merged Segment with the combined Document-List (Doc(1)/F1 ... Doc(4)/F4) and Positions List (p11 ... p43) for Term(T(i)).]

In-memory Indexing Data Structure Design
  • Two-level hierarchical index data structure design
    • Top-level hash table: GHT (Global Hash Table)
      • Represents complete index for a given set of documents
      • Map: Term => second-level hash table (IHT)
    • Second level hash table: IHT (Interval Hash Table)
      • Represents index for an interval of documents with contiguous IDs
      • Map: Term => list of documents containing that term
        • Postings data also stored
  • Advantages of the design
    • No need for re-organization of data while merging IHT into GHT
    • Merge-sort eliminated using hash-operations
    • Efficient encoding of IHT reduces memory requirements of an IHT
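
A minimal sketch of this two-level design is below, assuming hypothetical type names (Posting, IntervalHashTable, GlobalHashTable); the actual implementation additionally encodes each IHT into flat arrays, as shown on the next slide.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch of the two-level index described above.
// Type and field names are assumptions, not the paper's actual code.
struct Posting {
    uint32_t docId;
    uint32_t freq;
    std::vector<uint32_t> positions;   // postings data stored alongside
};

// Second level (IHT): index for an interval of documents with contiguous IDs.
struct IntervalHashTable {
    uint32_t firstDocId = 0;
    uint32_t lastDocId  = 0;           // covers [firstDocId, lastDocId]
    std::unordered_map<std::string, std::vector<Posting>> postings;  // term -> docs
};

// Top level (GHT): complete index for a given set of documents. Each term maps
// to the IHTs (document intervals) that contain it, so merging a new IHT only
// appends references and never reorganizes previously indexed intervals.
struct GlobalHashTable {
    std::unordered_map<std::string, std::vector<const IntervalHashTable*>> termToIhts;
};
```

Because the GHT only appends a reference per term when an IHT arrives, existing postings are never rewritten, which is what eliminates the merge-sort and data reorganization of the segment-merge approach.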
Interval Hash Table (IHT): Concept

[Figure: a hash table with term-collision resolution; a term Ti hashes to bucket HF(Ti), whose entry points to the IHT data for the document interval Di, Di+1, ..., Dj: for each document, its DocID, term frequency, and positions array.]

Global Hash Table (GHT): Concept

[Figure: a hash table with term-collision resolution; a term Ti hashes to bucket HF(Ti), whose entry points to a document-interval indexed hash table: each interval Dj-Dk maps to the per-document data (DocID, term frequency, positions array) for documents Dj, Dj+1, ..., Dk.]

Encoded IHT representation
  • The IHT is encoded as a set of flat sub-arrays:
    • Number of distinct terms in each hash-table entry (size: # hash-table entries)
    • Term IDs (size: # distinct terms in the IHT)
    • Number of docs in which each term occurred (size: # distinct terms in the IHT)
    • Document IDs per term (size: # docs/term * # terms)
    • Term frequency in each document (size: # docs/term * # terms)
    • Offset into position information (size: # docs/term * # terms)
  • Steps to access term positions from (TermID(Ti), DocID(Dj)): Get NumTerms from TermKey(Ti), GetTermID(Ti), GetNumDocs(Ti), GetDocIDs(Ti), GetNumTerms(Dj), then Get Offset into Position Data(Ti, Dj)
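
Below is one possible flat-array layout matching the sub-arrays listed above, together with the lookup path from a term's slot and a document ID to its position data. All field and function names are hypothetical, and the initial hash-bucket (TermKey) step is omitted to keep the sketch short.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical flat-array encoding of an IHT following the sub-arrays above.
// Field and function names are illustrative, not the paper's actual code.
struct EncodedIHT {
    std::vector<uint32_t> termsPerBucket;  // size: # hash-table entries
    std::vector<uint32_t> termIds;         // size: # distinct terms in IHT
    std::vector<uint32_t> docsPerTerm;     // size: # distinct terms in IHT
    std::vector<uint32_t> docIds;          // size: total (term, doc) pairs
    std::vector<uint32_t> termFreqs;       // size: total (term, doc) pairs
    std::vector<uint32_t> posOffsets;      // size: total (term, doc) pairs
    std::vector<uint32_t> positions;       // concatenated position data
};

// Lookup path from a term's slot (already resolved via the hash bucket) and a
// DocID to that term's positions in the document. Prefix sums over docsPerTerm
// would normally be precomputed; they are computed inline to keep this short.
inline const uint32_t* findPositions(const EncodedIHT& iht, uint32_t termSlot,
                                     uint32_t docId, uint32_t& numPositions) {
    uint32_t docBase = 0;
    for (uint32_t t = 0; t < termSlot; ++t) docBase += iht.docsPerTerm[t];
    for (uint32_t d = 0; d < iht.docsPerTerm[termSlot]; ++d) {
        if (iht.docIds[docBase + d] == docId) {
            numPositions = iht.termFreqs[docBase + d];       // freq = #positions
            return &iht.positions[iht.posOffsets[docBase + d]];
        }
    }
    numPositions = 0;
    return nullptr;                                           // term not in docId
}
```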
New Indexing Algorithm
  • Main Steps of the Indexing Algorithm
    • A posting table (LHT) is constructed for each document without sorting its terms
    • Posting tables of k documents are merged into an IHT, which is then encoded compactly
    • Encoded IHTs are merged efficiently into a single GHT (a sketch of the first two steps follows)
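
A minimal sketch of the first two steps, using assumed names (buildPostingTable, mergeIntoIht) rather than the paper's actual code: the per-document posting table is built purely with hash inserts, so terms are never sorted, and k such tables are then folded into an IHT keyed by term.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch only; names and details are assumptions.
// Step (1): per-document posting table via hash inserts, no sorting of terms.
using PostingTable = std::unordered_map<std::string, std::vector<uint32_t>>;

PostingTable buildPostingTable(const std::string& text) {
    PostingTable lht;
    std::istringstream in(text);
    std::string token;
    for (uint32_t pos = 0; in >> token; ++pos)
        lht[token].push_back(pos);          // term -> positions, hash insert only
    return lht;
}

// Step (2): merge the posting tables of k documents (contiguous IDs) into an IHT.
struct IhtEntry { uint32_t docId, freq; std::vector<uint32_t> positions; };
using Iht = std::unordered_map<std::string, std::vector<IhtEntry>>;

Iht mergeIntoIht(const std::vector<PostingTable>& docTables, uint32_t firstDocId) {
    Iht iht;
    for (uint32_t d = 0; d < docTables.size(); ++d)
        for (const auto& [term, positions] : docTables[d])
            iht[term].push_back({firstDocId + d,
                                 static_cast<uint32_t>(positions.size()),
                                 positions});
    return iht;
}
```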
GHT Construction from IHT

[Figure: a new encoded IHT(g) is taken from the array of IHTs and merged into the Global Hash Table. Step S1: the distinct terms (Ti, Tj, ...) are read from the encoded IHT array. Steps S2(a), S2(b): each distinct term is hashed (HF(Ti), HF(Tj)) into the Global Hash Table and its entry is linked to IHT(g).]

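
A minimal sketch of this merge step, with hypothetical types (IhtHandle, Ght): each distinct term of the incoming encoded IHT is hashed into the GHT and the entry is extended with a reference to that IHT, so previously merged intervals are never touched.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the merge step shown above. No merge-sort of terms
// and no rewriting of posting data is needed: the GHT is append-only per term.
struct IhtHandle {                            // stand-in for one encoded IHT(g)
    uint32_t firstDocId, lastDocId;           // document interval it covers
    std::vector<std::string> distinctTerms;   // S1: distinct terms of IHT(g)
};

using Ght = std::unordered_map<std::string, std::vector<const IhtHandle*>>;

// S2: hash each distinct term into the GHT and link its entry to IHT(g).
void mergeIhtIntoGht(Ght& ght, const IhtHandle& iht) {
    for (const std::string& term : iht.distinctTerms)
        ght[term].push_back(&iht);            // expected O(1) per term, append-only
}
```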
Parallel Group-based Indexing Algorithm

[Figure: incoming documents are partitioned across index groups I0, I1, I2, I3, I4; each indexing group indexes the documents assigned to it, while a search group serves queries over the groups' indexes.]

Pipeline Diagram

[Figure: timeline of the distributed indexing algorithm. In each round, Producer(1), Producer(2), and Producer(3) produce IHTs/segments and send them to the Consumer, which merges the IHTs/segments of the previous round; rounds are separated by barrier synchronization.]

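
The toy program below illustrates the producer/consumer overlap on a single node using threads and a shared queue; the real system runs the Producers and the Consumer on separate Blue Gene/L nodes with message passing and per-round barrier synchronization, which this sketch omits. All names and the IHT payload are placeholders.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Single-node illustration of the pipeline: producers build IHTs and hand
// them to a consumer that merges them while production of the next round
// continues. The IHT here is reduced to term -> doc IDs for brevity.
using Iht = std::unordered_map<std::string, std::vector<uint32_t>>;

struct IhtQueue {
    std::queue<Iht> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(Iht iht) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(iht)); }
        cv.notify_one();
    }
    bool pop(Iht& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || done; });
        if (q.empty()) return false;
        out = std::move(q.front()); q.pop();
        return true;
    }
    void finish() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};

int main() {
    IhtQueue queue;
    Iht ght;   // consumer-side global index (term -> doc IDs)

    // Consumer: merge IHTs into the GHT as they arrive (overlaps with production).
    std::thread consumer([&] {
        Iht iht;
        while (queue.pop(iht))
            for (auto& [term, docs] : iht)
                ght[term].insert(ght[term].end(), docs.begin(), docs.end());
    });

    // Producers: each round, index a batch of documents into an IHT and send it.
    std::vector<std::thread> producers;
    for (int p = 0; p < 3; ++p)
        producers.emplace_back([&, p] {
            for (uint32_t round = 0; round < 4; ++round) {
                Iht iht;
                iht["term" + std::to_string(p)].push_back(round);  // toy content
                queue.push(std::move(iht));
            }
        });

    for (auto& t : producers) t.join();
    queue.finish();
    consumer.join();
}
```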

Asymptotic Time Complexity Analysis
  • Definitions
    • Size of the indexing group: |G| = |P| + 1
      • P: set of Producers
      • Single Consumer
    • “n” Produce-Consume rounds
      • |P| Producers and a single Consumer in each round
      • Prod(j, i): total time for the j-th Producer in the i-th round
        • ProdComp(j, i): compute time
        • ProdComm(j, i): communication time
      • Cons(i): total time for the Consumer in the i-th round
        • ConsComp(i): compute time
        • ConsComm(i): communication time
  • Distributed Indexing
    • T(distributed) = X + Y + Z, where (see the combined formula below)
      • X = max_j ProdComp(j, 1)
      • Y = Σ_{i=2..n} max( max_j Prod(j, i), Cons(i−1) )
      • Z = Cons(n)
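
Putting the three terms together (all symbols as defined above), the pipelined indexing time can be written as:

```latex
T_{\text{distributed}}
  = \underbrace{\max_{j} \mathrm{ProdComp}(j,1)}_{X}
  \;+\; \underbrace{\sum_{i=2}^{n} \max\!\Big( \max_{j} \mathrm{Prod}(j,i),\; \mathrm{Cons}(i-1) \Big)}_{Y}
  \;+\; \underbrace{\mathrm{Cons}(n)}_{Z}
```

The middle term captures the pipeline overlap: in round i the Producers work on round i while the Consumer merges the IHTs of round i−1, so only the slower of the two contributes to the total time.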
Asymptotic Time Complexity Analysis
  • Overall Indexing Time: dependent on balance of pipeline stages
    • 2 cases
      • Produce phase dominates merge phase
      • Merge phase dominates the produce phase
  • Time complexity Upper Bounds
    • Case(1) : Production time per round > Merge time per round
      • T(Pghtl) = O(R/|P|)
      • T(Porgl) = O((R/|P|) * log(k))
    • Case(2) : Merge time per round > Production time per round
      • T(Pghtl) = O(R/k)
      • T(Porgl) = O((R/k) * log(|P|))
Experimental Setup
  • Original CLucene codebase (v0.9.20)
    • Porgl implementation
      • Distributed in-memory indexing algorithm using RAMDirectory
      • Distributed Search Implementation
    • Pghtl implementation
      • Implementation of IHT and GHT data structures
      • Distributed Indexing and Search Algorithm Implementation
  • IBM Intranet website data
    • Text data extracted from HTML files
    • Loaded equally into the memory of the producer nodes
  • Experiments run on Blue Gene/L
    • Up to 16K processor nodes (PPC 440)
      • 2 PPC 440 cores per node
      • Co-processor mode: 1 core for compute, 1 for communication (routing)
    • High Bandwidth 3D torus interconnect
  • For Porgl
    • k is chosen so that only one segment is created from all the text data fed to a Producer, which gives Porgl its best indexing throughput
Conclusions & Future Work
  • High throughput text indexing demonstrated for the first time at such a large scale.
    • Architecture independent design of new data structures
    • Algorithm for distributed in-memory real-time group-based text indexing
      • Better load balance, low communication cost, and good cache performance
  • Proved analytically: the parallel time complexity of our indexing algorithm is at least a factor of log(|P|) better asymptotically than typical indexing approaches.
  • Experimental Results
    • 3× - 7× improvement in indexing throughput and around 10× better indexing latency on Blue Gene/L.
    • Peak indexing throughput of 312 GB/min on 8K nodes
    • Estimate: 5 TB/min on 128K nodes
  • Future Work: Distributed Search Optimizations