
### Inverted Indexing for Text Retrieval

Chapter 4, Lin and Dyer

Introduction

- Web search is a quintessential large-data problem.
- So are any number of problems in genomics.
- Google and Amazon (AWS) are both involved in research and discovery in this area.
- Web search, or full-text search, depends on a data structure called an inverted index.
- The web search problem breaks down into three major components:
- Gathering the web content (crawling) (like project 1)
- Construction of the inverted index (indexing)
- Ranking the documents given a query (retrieval) (exam 1)

Issues with these components

- Crawling and indexing have similar characteristics: resource consumption is high
- Both are typically offline batch processes, except in a real-time model such as Twitter's
- There are many requirements for a web crawler or, in general, a data aggregator:
- Etiquette, bandwidth resources, multilingual content, duplicate content, frequency of changes…
- How often to collect: collecting too rarely may miss important updates; collecting too often may gather too much information

Web Crawling

- Start with a “seed” URL, say a Wikipedia page, and collect content by following the links in the seed page; the depth of traversal is also specified as an input
- What are the issues?
- See page 67
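The seed-and-depth idea above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the `TOY_WEB` dictionary and the `fetch_links` callback are hypothetical stand-ins for real HTTP fetching and link extraction, and etiquette, bandwidth limits and duplicate detection beyond exact URL matches are left out.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}

def crawl(seed, max_depth, fetch_links):
    """Breadth-first crawl from seed, following links up to max_depth hops."""
    visited = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the requested depth
        for link in fetch_links(url):
            if link not in visited:  # skip pages already collected
                visited.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order

pages = crawl("seed", 2, lambda url: TOY_WEB.get(url, []))
```

With a depth of 2, page `d` is never collected because it is three hops from the seed.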

Retrieval

- Retrieval is an online problem that demands stringent timing: sub-second response times.
- Concurrent queries
- Query latency
- Load on the servers
- Other circumstances: time of day
- Resource consumption can be spiky or highly variable
- Resource requirement for indexing is more predictable

Indexes

- Regular index: document → terms
- Inverted index: term → documents
- Example:

term1 → {d1, p}, {d2, p}, {d23, p}

term2 → {d2, p}, {d34, p}

term3 → {d6, p}, {d56, p}, {d345, p}

Where d is the doc id, p is the payload (example for payload: term frequency… this can be blank too)
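To make the two directions concrete, here is a small sketch that turns a regular index into an inverted one, using term frequency as the payload. The corpus and the names `forward_index` and `invert` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: a regular (forward) index maps each doc id to its terms.
forward_index = {
    1: ["apple", "banana", "apple"],
    2: ["banana", "cherry"],
    3: ["apple", "cherry", "cherry"],
}

def invert(forward):
    """Build term -> list of (docid, term frequency) from docid -> terms."""
    inverted = defaultdict(list)
    for docid in sorted(forward):  # visiting docs in id order keeps postings sorted
        counts = defaultdict(int)
        for term in forward[docid]:
            counts[term] += 1
        for term, tf in counts.items():
            inverted[term].append((docid, tf))
    return dict(inverted)

inverted_index = invert(forward_index)
```

Each value in `inverted_index` is a postings list: the doc ids that contain the term, each paired with its payload.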

Inverted Index

- Inverted index consists of postings lists, one associated with each term that appears in the corpus.
- <t, posting>n
- <t, <docid, tf> >n
- <t, <docid, tf, other info>>n
- Each entry is a key-value pair where the key is the term (word) and the value is the docid, followed by a “payload”
- Payload can be empty for a simple index
- Payload can be complex: provides such details as co-occurrences, additional linguistic processing, page rank of the doc, etc.
- <t2, <d1, d4, d67, d89>>
- <t3, <d4, d6, d7, d9, d22>>
- Document numbers typically do not have semantic content, but docs from the same corpus are numbered together, or the numbers could be assigned based on page ranks.

Retrieval

- Once the inverted index is developed, when a query comes in, retrieval involves fetching the appropriate docs.
- The docs are ranked and top k docs are listed.
- It is good to have the inverted index in memory.
- If not, some queries may involve random disk access for decoding of postings.
- Solution: organize the disk accesses so that random seeks are minimized.
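The fetch-and-rank step can be sketched against an in-memory index as follows. The scoring rule here (sum of term frequencies over the query terms) is an assumption chosen for brevity, not the ranking function used by any real search engine, and `INDEX` and `retrieve` are illustrative names.

```python
import heapq
from collections import defaultdict

# Hypothetical in-memory index: term -> postings list of (docid, tf), sorted by docid.
INDEX = {
    "apple": [(1, 2), (3, 1)],
    "banana": [(1, 1), (2, 1)],
    "cherry": [(2, 1), (3, 2)],
}

def retrieve(index, query_terms, k):
    """Return the top-k (docid, score) pairs, scoring each doc by summed tf."""
    scores = defaultdict(int)
    for term in query_terms:
        for docid, tf in index.get(term, []):  # unknown terms contribute nothing
            scores[docid] += tf
    # Highest score first; ties broken by smaller docid.
    return heapq.nsmallest(k, scores.items(), key=lambda item: (-item[1], item[0]))

top = retrieve(INDEX, ["apple", "cherry"], 2)
```

Because every postings list is already in memory, each query term costs one dictionary lookup and one sequential scan, with no random disk seeks.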

Pseudo Code

- Baseline implementation
- Value-key conversion pattern implementation…

Inverted Index: Baseline Implementation using MR

- Input to the mapper consists of docid and actual content.
- Each document is analyzed and broken down into terms.
- Processing pipeline assuming HTML docs:
- Strip HTML tags
- Strip Javascript code
- Tokenize using a set of delimiters
- Case fold
- Remove stop words (a, an, the…)
- Remove domain-specific stop words
- Stem different forms (..ing, ..ed…, dogs → dog)
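A rough sketch of the pipeline above, using only the standard library. Several parts are assumptions made for illustration: the stop-word list is a tiny sample, `naive_stem` is a crude suffix stripper rather than a real stemmer such as Porter's, and regular expressions are not a robust way to parse HTML. Script blocks are removed before the remaining tags so that the JavaScript body does not survive as text.

```python
import re

STOP_WORDS = {"a", "an", "the", "and", "of", "in", "is", "are"}  # illustrative sample only

def naive_stem(token):
    """Crude suffix stripping for illustration only; not the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(html, domain_stop_words=frozenset()):
    """Turn raw HTML into a list of index terms."""
    text = re.sub(r"(?is)<script.*?</script>", " ", html)  # strip JavaScript code
    text = re.sub(r"<[^>]+>", " ", text)                    # strip remaining HTML tags
    tokens = re.split(r"[^A-Za-z0-9]+", text)               # tokenize on delimiters
    tokens = [t.lower() for t in tokens if t]               # case fold, drop empties
    stop = STOP_WORDS | set(domain_stop_words)              # general + domain-specific
    tokens = [t for t in tokens if t not in stop]           # remove stop words
    return [naive_stem(t) for t in tokens]                  # stem

terms = analyze("<p>The dogs are <b>barking</b></p><script>var x = 1;</script>")
```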

Baseline implementation

procedure map(docid n, doc d)

H ← new AssociativeArray

for all terms t in doc d

H{t} ← H{t} + 1

for all terms t in H

emit(term t, posting <n, H{t}>)

Reducer for baseline implementation

procedure reduce(term t, postings [<n1, f1>, <n2, f2>, …])

P ← new List

for all postings <a, f> in postings

Append(P, <a, f>)

Sort(P) // sorted by docid

Emit(term t, postings P)

Shuffle and sort phase

- Is a very large group-by-term operation over the postings
- Let's look at a toy example
- Fig. 4.3 (some items are incorrect in the figure)
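As a toy example of the baseline algorithm, the following sketch simulates map, shuffle-and-sort, and reduce in plain Python on a single machine. It is not Hadoop code: the `shuffle` function is a stand-in for the framework's group-by-key step, and the three-document corpus is chosen only for illustration.

```python
from collections import Counter, defaultdict

def map_doc(docid, doc):
    """Mapper: emit (term, (docid, tf)) for each distinct term in the document."""
    counts = Counter(doc.split())
    return [(term, (docid, tf)) for term, tf in counts.items()]

def shuffle(pairs):
    """Stand-in for shuffle and sort: group values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_term(term, postings):
    """Reducer: buffer all postings for a term and sort them by docid."""
    return term, sorted(postings)

# Toy corpus; documents are listed out of order to show that the reducer's
# sort is what puts the postings in docid order.
docs = {3: "one red bird", 1: "one fish two fish", 2: "red fish blue fish"}

emitted = [pair for docid, doc in docs.items() for pair in map_doc(docid, doc)]
index = dict(reduce_term(term, postings) for term, postings in shuffle(emitted))
```

Note that `reduce_term` has to hold the whole postings list for a term before it can sort it.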

Baseline MR for II

class Mapper

procedure Map(docid n, doc d)

H = new AssociativeArray

for all term t in doc d do

H[t] = H[t] + 1

for all term t in H do

Emit(term t, posting <n, H[t]>)

class Reducer

procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, …])

P = new List

for all posting <n, f> in postings [<n1, f1>, <n2, f2>, …] do

Append(P, <n, f>)

Sort(P)

Emit(term t, postings P)

Revised Implementation

- Issue: MR does not guarantee the sort order of the values, only of the keys.
- So the sort in the reducer is an expensive operation, especially if the postings cannot be held in memory.
- Let's check a revised solution:
- Change (term t, posting <docid, f>) to
- (tuple <t, docid>, tf f)

Inverted Index: Revised implementation

- From Baseline to an improved version
- Observe the sort done by the Reducer. Is there any way to push this into the MR runtime?
- Instead of (term t, posting <docid, f>), emit (tuple <t, docid>, tf f)
- This is our previously studied value-key conversion design pattern
- This switching ensures the keys arrive in order at the reducer, so the postings for each term are already sorted by docid
- Small memory footprint; less buffer space needed at the reducer
- See Fig. 4.4

Modified mapper

Map(docid n, doc d)

H ← new AssociativeArray

for all terms t in doc d

H{t} ← H{t} + 1

for all terms t in H

emit(tuple <t, n>, H{t})

Modified Reducer

Initialize

tprev ← ∅

P ← new PostingsList

method reduce(tuple <t, n>, tf [f1, ..])

if t ≠ tprev ∧ tprev ≠ ∅

{ emit(term tprev, postings P);

reset P; }

P.add(<n, f1>)

tprev ← t

Close

emit(term tprev, postings P)

Improved MR for II

class Mapper

method Map(docid n, doc d)

H = new AssociativeArray

for all term t in doc d do

H[t] = H[t] + 1

for all term t in H do

Emit(tuple <t, n>, tf H[t])

class Reducer

method Initialize

tprev = ∅; P = new PostingsList

method Reduce(tuple <t, n>, tf [f])

if t ≠ tprev ∧ tprev ≠ ∅ then

Emit(term tprev, postings P)

P.Reset()

P.Add(<n, f>)

tprev = t

method Close

Emit(term tprev, postings P)
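The revised algorithm can be simulated the same way. In this sketch the call to `sorted()` stands in for the framework's sort on the composite key, and the `Reducer` class keeps state across calls, flushing a postings list whenever the term changes. The corpus and class layout are illustrative assumptions, not Hadoop API code.

```python
from collections import Counter

def map_doc(docid, doc):
    """Mapper: emit ((term, docid), tf) so the docid is part of the key."""
    return [((term, docid), tf) for term, tf in Counter(doc.split()).items()]

class Reducer:
    """Stateful reducer: keys arrive sorted by (term, docid), so no sort is needed."""

    def __init__(self):
        self.prev_term = None
        self.postings = []
        self.output = []

    def reduce(self, key, tfs):
        term, docid = key
        if term != self.prev_term and self.prev_term is not None:
            self.output.append((self.prev_term, self.postings))  # flush finished term
            self.postings = []
        self.postings.append((docid, tfs[0]))  # one tf per (term, docid) key
        self.prev_term = term

    def close(self):
        if self.prev_term is not None:
            self.output.append((self.prev_term, self.postings))  # flush last term

docs = {3: "one red bird", 1: "one fish two fish", 2: "red fish blue fish"}
emitted = [pair for docid, doc in docs.items() for pair in map_doc(docid, doc)]

reducer = Reducer()
for key, tf in sorted(emitted):  # sorted() stands in for the framework's key sort
    reducer.reduce(key, [tf])
reducer.close()
index = dict(reducer.output)
```

The flushed postings list belongs to the previous term, which is why the emit uses the remembered term rather than the one that just arrived.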

Other modifications

- Partitioner and shuffle have to deliver all related <key, value> pairs to the same reducer
- A custom partitioner is needed so that all tuples with the same term t go to the same reducer.
- Let's go through a numerical example
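A custom partitioner for the composite key only needs to ignore the docid. The sketch below is a plain Python function, not a Hadoop `Partitioner` subclass; the use of CRC32 as the hash and the reducer count of 4 are arbitrary choices for illustration.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer from the term only, ignoring the docid in the composite key."""
    term, _docid = key
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(term.encode("utf-8")) % num_reducers

keys = [("fish", 1), ("fish", 2), ("red", 2), ("red", 3)]
assignments = {key: partition(key, 4) for key in keys}
```

Hashing the full tuple instead would scatter the postings of one term across reducers, and no single reducer could assemble the complete postings list.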

What about retrieval?

- While MR is great for indexing, it is not great for retrieval.

Index compression for space

- Section 4.5
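- Postings as (docid, tf): (5,2), (7,3), (12,1), (49,1), (51,2)…
- The same postings with each docid replaced by its gap from the previous docid (d-gaps): (5,2), (2,3), (5,1), (37,1), (2,2)…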
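The second list above is the first with each docid replaced by its difference from the previous docid. A minimal sketch of that transformation and its inverse follows; the function names are illustrative, and the byte-level or bit-level coding of the resulting small integers is not shown.

```python
def to_gaps(postings):
    """Replace each docid with its difference from the previous docid."""
    gaps, prev = [], 0
    for docid, tf in postings:
        gaps.append((docid - prev, tf))
        prev = docid
    return gaps

def from_gaps(gaps):
    """Recover the original docids by running a cumulative sum over the gaps."""
    postings, docid = [], 0
    for gap, tf in gaps:
        docid += gap
        postings.append((docid, tf))
    return postings

postings = [(5, 2), (7, 3), (12, 1), (49, 1), (51, 2)]
gaps = to_gaps(postings)
```

This only works because postings are sorted by docid, which keeps every gap positive and usually small.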

Miscellaneous Stuff

- How would you use MR for the spam filtering problem (Naïve Bayes solution) discussed in Ch. 4 of DDS? In training the model.
- Write the solution in the form of your main workflow configuration.
- Prior: what is the probability of x occurring at random? E.g., what is the probability that the next person who walks into the class is female?

NIH Solicitation in Big Data (2014)

- ..
- This opportunity targets four topic areas of high need for researchers working with biomedical Big Data:

1. Data Compression/Reduction

2. Data Provenance

3. Data Visualization

4. Data Wrangling

Odds Ratio Example from 4/16/2014 news article

- Woods is still favored to win the U.S. Open. He and Rory McIlroy are each 10/1 favorites on the online betting site Bovada. Adam Scott has the next best odds at 12/1…..
- How to interpret this?
- Woods is also the favorite to win the Open Championship at Hoylake in July. He's 7/1 there.
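One common way to interpret such figures, assuming they are fractional odds against winning and ignoring the bookmaker's margin, is that odds of a/b correspond to an implied probability of b / (a + b). The helper below is an illustrative sketch under that assumption.

```python
from fractions import Fraction

def implied_probability(against, in_favor):
    """Convert fractional odds 'against/in_favor' to an implied win probability."""
    return Fraction(in_favor, against + in_favor)

us_open_woods = implied_probability(10, 1)  # 10/1
us_open_scott = implied_probability(12, 1)  # 12/1
open_woods = implied_probability(7, 1)      # 7/1
```

Under this reading, 10/1 means roughly a 9% chance (1 in 11) and 7/1 means 12.5% (1 in 8).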

