Inverted Indexing for Text Retrieval

Chapter 4, Lin and Dyer


  • Web search is a quintessential large-data problem.

  • So are any number of problems in genomics.

    • Google and Amazon (AWS) are both involved in research and discovery in this area

  • Web search, or full-text search, depends on a data structure called an inverted index.

  • Web search problem breaks down into three major components:

    • Gathering the web content (crawling) (like project 1)

    • Construction of inverted index (indexing)

    • Ranking the documents given a query (retrieval) (exam 1)

Issues with these components

  • Crawling and indexing have similar characteristics: both consume a lot of resources

    • Typically offline batch processing, except, of course, in the Twitter model

  • There are many requirements for a web crawler or, in general, a data aggregator:

    • Etiquette, bandwidth resources, multilingual content, duplicate content, frequency of changes…

    • How often to collect: collecting too rarely may miss important updates; collecting too often may gather too much information

    Web Crawling

    • Start with a “seed” URL, say a Wikipedia page, and collect content by following the links in the seed page; the depth of traversal is also specified as input

    • What are the issues?

    • See page 67
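The seed-and-depth idea above can be sketched as a breadth-first traversal. This is only an illustration: `TOY_WEB`, `crawl`, and `get_links` are hypothetical names, and the "web" here is an in-memory dictionary rather than real HTTP fetching, so none of the etiquette or bandwidth issues are handled.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of outgoing links.
# A real crawler would fetch each page over HTTP and parse its links.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}

def crawl(seed, max_depth, get_links):
    """Breadth-first crawl from seed, following links up to max_depth hops."""
    visited = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand pages that sit at the depth limit
        for link in get_links(url):
            if link not in visited:  # skip duplicates and cycles
                visited.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order

pages = crawl("seed", 2, lambda url: TOY_WEB.get(url, []))
print(pages)
```

With a depth of 2, page "d" is never collected because it is three hops from the seed.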


    • Retrieval is an online problem that demands stringent timing: sub-second response times.

      • Concurrent queries

      • Query latency

      • Load on the servers

      • Other circumstances: time of day

      • Resource consumption can be spiky or highly variable

    • The resource requirement for indexing is more predictable


    • Regular index: document → terms

    • Inverted index: term → documents

    • Example:

      term1 → {d1, p}, {d2, p}, {d23, p}

      term2 → {d2, p}, {d34, p}

      term3 → {d6, p}, {d56, p}, {d345, p}

      where d is the doc id and p is the payload (for example, term frequency; the payload can also be empty)
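As a minimal single-machine sketch of the term → documents mapping, the snippet below builds postings of the form (docid, term frequency) from three toy documents. The documents, `build_inverted_index`, and the choice of term frequency as the payload are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (illustrative only): docid -> text.
docs = {
    1: "one fish two fish",
    2: "red fish blue fish",
    3: "one red bird",
}

def build_inverted_index(docs):
    """Return term -> list of (docid, tf) postings, ordered by docid."""
    index = defaultdict(list)
    for docid in sorted(docs):              # visit docs in docid order
        tf = Counter(docs[docid].split())   # term frequency within this doc
        for term, count in tf.items():
            index[term].append((docid, count))
    return dict(index)

index = build_inverted_index(docs)
print(index["fish"])
```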

    Inverted Index

    • An inverted index consists of postings lists, one associated with each term that appears in the corpus.

    • <t, posting>n

    • <t, <docid, tf> >n

    • <t, <docid, tf, other info>>n

    • Each entry is a key-value pair: the key is the term (word) and the value is the docid followed by a “payload”

    • The payload can be empty for a simple index

    • The payload can be complex: it can provide such details as co-occurrences, additional linguistic processing, page rank of the doc, etc.

    • <t2, <d1, d4, d67, d89>>

    • <t3, <d4, d6, d7, d9, d22>>

    • Document numbers typically do not have semantic content, but docs from the same corpus are numbered together, or the numbers could be assigned based on page ranks.


    • Once the inverted index is developed, when a query comes in, retrieval involves fetching the appropriate docs.

    • The docs are ranked and the top k docs are listed.

    • It is good to have the inverted index in memory.

    • If not, some queries may involve random disk access for decoding of postings.

    • Solution: organize the disk accesses so that random seeks are minimized.
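A rough sketch of the retrieval step, assuming the index fits in memory. The scoring rule used here (sum of term frequencies over the query terms) is a placeholder chosen for brevity, not the ranking function from the slides; `INDEX` and `retrieve` are hypothetical names.

```python
import heapq
from collections import defaultdict

# Hypothetical in-memory index: term -> postings list of (docid, tf).
INDEX = {
    "fish": [(1, 2), (2, 2)],
    "red": [(2, 1), (3, 1)],
    "bird": [(3, 1)],
}

def retrieve(query_terms, index, k):
    """Fetch postings for each query term, score docs, return the top k."""
    scores = defaultdict(int)
    for term in query_terms:
        for docid, tf in index.get(term, []):
            scores[docid] += tf  # placeholder score: summed term frequency
    # Higher score first; ties are broken by the smaller docid.
    return heapq.nsmallest(k, scores.items(), key=lambda item: (-item[1], item[0]))

top = retrieve(["red", "fish"], INDEX, 2)
print(top)
```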

    Pseudo Code

    Pseudo code → Baseline implementation → value-key conversion pattern implementation…

    Inverted Index: Baseline Implementation using MR

    • Input to the mapper consists of docid and actual content.

    • Each document is analyzed and broken down into terms.

    • Processing pipeline assuming HTML docs:

      • Strip HTML tags

      • Strip JavaScript code

      • Tokenize using a set of delimiters

      • Case fold

      • Remove stop words (a, an, the, …)

      • Remove domain-specific stop words

      • Stem different word forms (…, …ed…, dogs → dog)
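The pipeline above can be sketched with the standard library alone. Everything here is a simplification: the regular expressions are not a robust HTML parser, `STOP_WORDS` is a tiny illustrative list, and `naive_stem` is a crude suffix stripper standing in for a real stemmer.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "at", "of", "and", "in"}

def strip_markup(html):
    """Drop <script> blocks first, then any remaining tags."""
    html = re.sub(r"(?is)<script.*?>.*?</script>", " ", html)
    return re.sub(r"(?s)<[^>]+>", " ", html)

def naive_stem(token):
    """Crude suffix stripping (an assumption, not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(html):
    text = strip_markup(html)
    tokens = re.split(r"[^A-Za-z0-9]+", text)             # tokenize on delimiters
    tokens = [t.lower() for t in tokens if t]              # case fold
    tokens = [t for t in tokens if t not in STOP_WORDS]    # remove stop words
    return [naive_stem(t) for t in tokens]                 # stem

terms = analyze("<p>The dogs <b>barked</b> at a bird</p><script>var x = 1;</script>")
print(terms)
```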

    Baseline implementation

    procedure map(docid n, doc d)
      H ← new AssociativeArray
      for all terms t in doc d
        H{t} ← H{t} + 1
      for all terms t in H
        emit(term t, posting <n, H{t}>)

    Reducer for baseline implementation

    procedure reduce(term t, postings [<n1, f1>, <n2, f2>, …])
      P ← new List
      for all postings <a, f> in postings
        Append(P, <a, f>)
      Sort(P) // sorted by docid
      Emit(term t, postings P)

    Shuffle and sort phase

    • The shuffle and sort phase is a very large group-by over the postings, keyed by term

    • Let's look at a toy example

    • See Fig. 4.3; some items in the figure are incorrect

    Baseline MR for II

    class Mapper
      procedure Map(docid n, doc d)
        H ← new AssociativeArray
        for all term t in doc d do
          H{t} ← H{t} + 1
        for all term t in H do
          Emit(term t, posting <n, H{t}>)

    class Reducer
      procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, …])
        P ← new List
        for all posting <a, f> in postings [<n1, f1>, <n2, f2>, …] do
          Append(P, <a, f>)
        Sort(P)
        Emit(term t, postings P)
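The baseline mapper and reducer can be simulated in plain Python to see why the reducer has to sort. This is a single-process sketch, not Hadoop code: `map_baseline`, `shuffle`, and `reduce_baseline` are hypothetical names, and `shuffle` stands in for the framework's group-by-key step. The documents are deliberately fed in out of docid order so that values reach the reducer unsorted.

```python
from collections import Counter, defaultdict

def map_baseline(docid, doc):
    """Emit (term, (docid, tf)) for every distinct term in the document."""
    tf = Counter(doc.split())
    for term, count in tf.items():
        yield term, (docid, count)

def shuffle(pairs):
    """Stand-in for the framework: group values by key, keys in sorted order.
    The order of values within a group is NOT sorted, as in MR."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_baseline(term, postings):
    """Buffer all postings for the term in memory and sort them by docid."""
    return term, sorted(postings)

# Documents arrive in arbitrary order (3, 1, 2).
docs = {3: "one red bird", 1: "one fish two fish", 2: "red fish blue fish"}
pairs = [kv for docid, doc in docs.items() for kv in map_baseline(docid, doc)]
baseline_index = dict(reduce_baseline(t, ps) for t, ps in shuffle(pairs))
print(baseline_index["one"])
```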

    Revised Implementation

    • Issue: MR does not guarantee the sort order of values, only of keys

    • So the sort in the reducer is an expensive operation, especially if the postings cannot be held in memory.

    • Let's check a revised solution

    • (term t, posting<docid, f>) to

    • (term<t,docid>, tf f)

    Inverted Index: Revised implementation

    • From Baseline to an improved version

    • Observe the sort done by the Reducer. Is there any way to push this into the MR runtime?

    • Instead of

      • (term t, posting<docid, f>)

    • Emit

      • (tuple<t, docid>, tf f)

    • This is our previously studied value-key conversion design pattern

    • This switch ensures that the keys, and hence the postings for each term, arrive at the reducer in sorted order

    • Small memory footprint; less buffer space needed at the reducer

    • See Fig. 4.4

    Modified mapper

    method map(docid n, doc d)
      H ← new AssociativeArray
      for all terms t in doc d
        H{t} ← H{t} + 1
      for all terms t in H
        emit(tuple <t, n>, tf H{t})

    Modified Reducer


    method initialize
      tprev ← ∅
      P ← new PostingsList
    method reduce(tuple <t, n>, tf [f])
      if t ≠ tprev ∧ tprev ≠ ∅
        { emit(term tprev, postings P);
          reset P; }
      P.add(<n, f>)
      tprev ← t
    method close
      emit(term tprev, postings P)

    Improved MR for II

    class Mapper
      method Map(docid n, doc d)
        H ← new AssociativeArray
        for all term t in doc d do
          H{t} ← H{t} + 1
        for all term t in H do
          Emit(tuple <t, n>, tf H{t})

    class Reducer
      method Initialize
        tprev ← ∅
        P ← new PostingsList
      method Reduce(tuple <t, n>, tf [f])
        if t ≠ tprev ∧ tprev ≠ ∅ then
          Emit(term tprev, postings P)
          P.Reset()
        P.Add(<n, f>)
        tprev ← t
      method Close
        Emit(term tprev, postings P)
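A single-process sketch of the improved version, under the assumption that `sorted()` stands in for the MR runtime's sort on the composite key (term, docid). `map_improved` and `StreamingReducer` are hypothetical names. The reducer keeps only the postings of the current term and flushes them when the term changes, so it never sorts.

```python
from collections import Counter

def map_improved(docid, doc):
    """Emit ((term, docid), tf): the docid moves from the value into the key."""
    for term, count in Counter(doc.split()).items():
        yield (term, docid), count

class StreamingReducer:
    def __init__(self):
        self.t_prev = None   # term seen in the previous call
        self.postings = []   # postings of the current term only
        self.output = []

    def reduce(self, key, tfs):
        term, docid = key
        if self.t_prev is not None and term != self.t_prev:
            # Term changed: flush the previous term's postings.
            self.output.append((self.t_prev, self.postings))
            self.postings = []
        self.postings.append((docid, tfs[0]))  # one tf per (term, docid) key
        self.t_prev = term

    def close(self):
        if self.t_prev is not None:
            self.output.append((self.t_prev, self.postings))

docs = {3: "one red bird", 1: "one fish two fish", 2: "red fish blue fish"}
# sorted() plays the role of the runtime's sort by composite key.
pairs = sorted(kv for docid, doc in docs.items() for kv in map_improved(docid, doc))
reducer = StreamingReducer()
for key, tf in pairs:
    reducer.reduce(key, [tf])
reducer.close()
improved_index = dict(reducer.output)
print(improved_index["one"])
```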

    Other modifications

    • The partitioner and shuffle have to deliver all related <key, value> pairs to the same reducer

    • Use a custom partitioner so that all tuples with the same term t go to the same reducer.

    • Let's go through a numerical example
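A minimal sketch of such a partitioner, assuming composite keys of the form (term, docid). `term_partitioner` is a hypothetical function, not a Hadoop API; it hashes only the term, so every docid of a given term maps to the same reducer. `zlib.crc32` is used because Python's built-in `hash()` for strings varies between runs.

```python
import zlib

def term_partitioner(key, num_reducers):
    """Pick a reducer from the term alone, ignoring the docid part of the key."""
    term, _docid = key
    return zlib.crc32(term.encode("utf-8")) % num_reducers

# All (fish, *) keys land on the same reducer regardless of docid.
print(term_partitioner(("fish", 1), 4) == term_partitioner(("fish", 2), 4))
```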

    What about retrieval?

    • While MR is great for indexing, it is not great for retrieval.

    Index compression for space

    • Section 4.5

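    • Postings as (docid, tf): (5,2), (7,3), (12,1), (49,1), (51,2)…

    • The same postings with each docid replaced by its gap from the previous docid (d-gap): (5,2), (2,3), (5,1), (37,1), (2,2)…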
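The two lists above differ only in how the docid is stored. A short sketch of the conversion in both directions, with hypothetical helper names `to_gaps` and `from_gaps`; it covers only the gap transformation, not the integer coding applied afterwards.

```python
def to_gaps(postings):
    """Replace each docid with its difference from the previous docid."""
    gaps, prev = [], 0
    for docid, tf in postings:
        gaps.append((docid - prev, tf))
        prev = docid
    return gaps

def from_gaps(gaps):
    """Recover the original docids by running a cumulative sum over the gaps."""
    postings, docid = [], 0
    for gap, tf in gaps:
        docid += gap
        postings.append((docid, tf))
    return postings

postings = [(5, 2), (7, 3), (12, 1), (49, 1), (51, 2)]
gaps = to_gaps(postings)
print(gaps)
```

Gaps are smaller numbers than docids when the postings are sorted, which is what makes them cheaper to encode.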

    Miscellaneous Stuff

    • How would you apply MR to spam filtering (the Naïve Bayes solution discussed in Ch. 4 of DDS)? In training the model.

    • Write the solution in the form of your main workflow configuration.

    • The prior answers the question: what is the probability of x occurring at random? E.g., what is the probability that the next person who walks into the class is female?
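One possible way to read "in training the model" is that the counts needed by Naïve Bayes are a map-and-sum job. The sketch below is an assumption about how that could look, not the solution from the course: the training messages are invented, `map_counts` and `reduce_counts` are hypothetical names, word presence per message is counted once, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical labeled training messages: (label, text).
training = [
    ("spam", "win money now"),
    ("ham", "meeting at noon"),
    ("spam", "win a prize"),
    ("ham", "lunch at noon"),
    ("ham", "project meeting"),
]

def map_counts(label, text):
    """Emit one count per message label and one per distinct word in it."""
    yield ("label", label), 1
    for word in set(text.split()):
        yield ("word", label, word), 1

def reduce_counts(pairs):
    """Sum the counts per key (what a summing reducer would do)."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

counts = reduce_counts(kv for label, text in training for kv in map_counts(label, text))
n = len(training)
prior_spam = counts[("label", "spam")] / n                                      # P(spam)
p_win_given_spam = counts[("word", "spam", "win")] / counts[("label", "spam")]  # P(win | spam)
print(prior_spam, p_win_given_spam)
```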

    NIH Solicitation in Big Data (2014)

    • ..

    • This opportunity targets four topic areas of high need for researchers working with biomedical Big Data:

      1. Data Compression/Reduction

      2. Data Provenance

      3. Data Visualization

      4. Data Wrangling

    Odds Ratio Example from 4/16/2014 news article

    • Woods is still favored to win the U.S. Open. He and Rory McIlroy are each 10/1 favorites on online betting site, Bovada. Adam Scott has the next best odds at 12/1…

    • How to interpret this?


    • Woods is also the favorite to win the Open Championship at Hoylake in July. He's 7/1 there.
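One common reading of these numbers, assumed here, is that they are fractional odds against winning: odds of a/b imply a probability of b / (a + b). The sketch ignores the bookmaker's margin, so the probabilities across all players need not sum to 1. `implied_probability` is a hypothetical helper.

```python
from fractions import Fraction

def implied_probability(against, in_favor):
    """Fractional odds against/in_favor -> implied probability of winning."""
    return Fraction(in_favor, against + in_favor)

p_us_open = implied_probability(10, 1)      # Woods and McIlroy at 10/1
p_scott = implied_probability(12, 1)        # Adam Scott at 12/1
p_open_champ = implied_probability(7, 1)    # Woods at 7/1 for the Open Championship
print(p_us_open, p_scott, p_open_champ)
```

Under this reading, 10/1 corresponds to a 1 in 11 chance, 12/1 to 1 in 13, and 7/1 to 1 in 8.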