Inverted indexing is a crucial technique for effective web search and data retrieval, used by major tech companies such as Google and Amazon (AWS). This process involves gathering web content, constructing an inverted index, and retrieving ranked documents based on user queries. The chapter discusses the complexities of web crawling, indexing requirements, and the mechanics of retrieval under stringent time constraints. It also delves into the architecture of inverted indexes, highlighting their structure and the importance of efficient data handling. Understanding this foundational concept enhances capabilities in large-scale data management.
Inverted Indexing for Text Retrieval (Chapter 4, Lin and Dyer)
Introduction • Web search is a quintessential large-data problem, as are any number of problems in genomics. • Google and Amazon (AWS) are all involved in research and discovery in this area. • Web search, or full-text search, depends on a data structure called an inverted index. • The web search problem breaks down into three major components: • Gathering the web content (crawling) (like project 1) • Construction of the inverted index (indexing) • Ranking the documents given a query (retrieval) (exam 1)
Issues with these components • Crawling and indexing have similar characteristics: resource consumption is high. • Typically offline batch processing, except of course in the Twitter (streaming) model. • There are many requirements for a web crawler, or in general a data aggregator: • Etiquette, bandwidth resources, multilingual content, duplicate content, frequency of changes… • How often to collect: too infrequently and you may miss important updates; too often and you may gather too much redundant information.
Web Crawling • Start with a “seed” URL, say a Wikipedia page, and collect content by following the links in the seed page; the depth of traversal is also specified as an input (see the sketch below). • What are the issues? • See page 67
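A minimal sketch of depth-limited crawling in Python, assuming the requests and BeautifulSoup libraries and an illustrative seed URL; a real crawler would also need the politeness, robots.txt handling, and duplicate detection discussed above.

import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=2):
    """Breadth-first crawl starting from seed_url, following links up to max_depth hops."""
    seen = {seed_url}
    queue = deque([(seed_url, 0)])
    pages = {}                                  # url -> raw HTML
    while queue:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                            # skip unreachable pages
        pages[url] = html
        if depth < max_depth:
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urllib.parse.urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages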
Retrieval • Retrieval is an online problem that demands stringent timings: sub-second response times. • Concurrent queries • Query latency • Load on the servers • Other circumstances: time of day • Resource consumption can be spiky or highly variable. • Resource requirements for indexing are more predictable.
Indexes • Regular index: document → terms • Inverted index: term → documents • Example:
term1 → {d1, p}, {d2, p}, {d23, p}
term2 → {d2, p}, {d34, p}
term3 → {d6, p}, {d56, p}, {d345, p}
where d is the doc id and p is the payload (an example payload is term frequency; this can also be blank)
Inverted Index • The inverted index consists of postings lists, one associated with each term that appears in the corpus. • <t, posting>n • <t, <docid, tf>>n • <t, <docid, tf, other info>>n • A key-value pair where the key is the term (word) and the value is the docid, followed by a “payload”. • The payload can be empty for a simple index. • The payload can be complex: it provides details such as co-occurrences, additional linguistic processing, page rank of the doc, etc. • <t2, <d1, d4, d67, d89>> • <t3, <d4, d6, d7, d9, d22>> • Document numbers typically do not carry semantic content, but docs from the same corpus are numbered together, or the numbers could be assigned based on page rank. (A small sketch of this structure follows.)
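A minimal sketch of this structure in Python, assuming the corpus is a dict of already-tokenized documents keyed by docid; the payload here is just the term frequency.

from collections import defaultdict, Counter

def build_inverted_index(corpus):
    """corpus: dict mapping docid -> list of terms.
    Returns dict mapping term -> list of (docid, tf) postings."""
    index = defaultdict(list)
    for docid, terms in corpus.items():
        for term, tf in Counter(terms).items():   # tf is the payload
            index[term].append((docid, tf))
    for postings in index.values():
        postings.sort()                           # keep each postings list sorted by docid
    return dict(index)

# Toy usage
corpus = {1: ["one", "fish", "two", "fish"], 2: ["red", "fish", "blue", "fish"]}
print(build_inverted_index(corpus))               # e.g. {'fish': [(1, 2), (2, 2)], ...}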
Retrieval • Once the inverted index is developed, retrieval for an incoming query involves fetching the appropriate docs. • The docs are ranked and the top k docs are listed (see the sketch below). • It is good to have the inverted index in memory. • If not, some queries may involve random disk accesses for decoding of postings. • Solution: organize the disk accesses so that random seeks are minimized.
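A minimal sketch of query-time retrieval over the in-memory index built above, assuming a conjunctive (AND) query and a simple sum-of-tf score; real engines use much richer ranking functions.

import heapq

def retrieve(index, query_terms, k=10):
    """Return the top-k docids whose documents contain every query term,
    ranked by the sum of term frequencies (a stand-in for a real scorer)."""
    postings = [dict(index.get(t, [])) for t in query_terms]
    if not postings:
        return []
    candidates = set.intersection(*(set(p) for p in postings))   # docs with all query terms
    scores = {d: sum(p[d] for p in postings) for d in candidates}
    return heapq.nlargest(k, scores, key=scores.get)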
Pseudo Code • Baseline implementation • Value-to-key conversion pattern implementation…
Inverted Index: Baseline Implementation using MR • Input to the mapper consists of a docid and the actual content. • Each document is analyzed and broken down into terms. • Processing pipeline, assuming HTML docs (sketched below): • Strip HTML tags • Strip JavaScript code • Tokenize using a set of delimiters • Case fold • Remove stop words (a, an, the, …) • Remove domain-specific stop words • Stem different forms (…ing, …ed…, dogs → dog)
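A minimal sketch of that pipeline, assuming plain Python with an illustrative (incomplete) stop-word list and a crude suffix-stripping stemmer; a real pipeline would use an HTML parser and a proper stemmer such as Porter.

import re

STOP_WORDS = {"a", "an", "the", "and", "of", "to"}   # illustrative, incomplete list

def preprocess(html):
    """HTML document -> list of index terms."""
    text = re.sub(r"<script.*?</script>", " ", html, flags=re.S | re.I)  # strip JavaScript
    text = re.sub(r"<[^>]+>", " ", text)                                 # strip HTML tags
    tokens = re.split(r"[^a-zA-Z0-9]+", text)                            # tokenize on delimiters
    terms = []
    for tok in tokens:
        t = tok.lower()                                                  # case fold
        if not t or t in STOP_WORDS:
            continue                                                     # remove stop words
        for suffix in ("ing", "ed", "s"):                                # crude stemming
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        terms.append(t)
    return terms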
Baseline implementation (mapper)
procedure Map(docid n, doc d)
  H ← new AssociativeArray
  for all terms t in doc d do
    H{t} ← H{t} + 1
  for all terms t in H do
    Emit(term t, posting <n, H{t}>)
Reducer for baseline implementation
procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, …])
  P ← new List
  for all postings <a, f> in postings do
    Append(P, <a, f>)
  Sort(P)   // sorted by docid
  Emit(term t, postings P)
Shuffle and sort phase • Is a very large group-by-term of the postings. • Let's look at a toy example. • Fig. 4.3 (some items are incorrect in the figure)
Baseline MR for II
class Mapper
  procedure Map(docid n, doc d)
    H ← new AssociativeArray
    for all term t in doc d do
      H[t] ← H[t] + 1
    for all term t in H do
      Emit(term t, posting <n, H[t]>)
class Reducer
  procedure Reduce(term t, postings [<n1, f1>, <n2, f2>, …])
    P ← new List
    for all posting <n, f> in postings [<n1, f1>, <n2, f2>, …] do
      Append(P, <n, f>)
    Sort(P)
    Emit(term t, postings P)
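A minimal runnable sketch of this baseline job in plain Python, simulating the map, shuffle-and-sort, and reduce phases in memory (a real job would run on Hadoop); function and variable names are illustrative, not from the text.

from collections import Counter, defaultdict

def map_phase(docid, terms):
    """Emit (term, (docid, tf)) pairs for one document."""
    for term, tf in Counter(terms).items():
        yield term, (docid, tf)

def reduce_phase(term, postings):
    """Collect and sort the postings for one term by docid."""
    return term, sorted(postings)

def run_job(corpus):
    """corpus: dict docid -> list of terms. Returns term -> sorted postings list."""
    groups = defaultdict(list)                       # simulated shuffle: group values by key
    for docid, terms in corpus.items():
        for term, posting in map_phase(docid, terms):
            groups[term].append(posting)
    return dict(reduce_phase(t, ps) for t, ps in groups.items())

print(run_job({1: ["one", "fish", "two", "fish"], 2: ["red", "fish", "blue", "fish"]}))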
Revised Implementation • Issue: MR does not guarantee the sorting order of the values, only of the keys. • So the sort in the reducer is an expensive operation, especially if the postings cannot be held in memory. • Let's check a revised solution: • from (term t, posting <docid, f>) • to (tuple <t, docid>, tf f)
Inverted Index: Revised implementation • From the baseline to an improved version. • Observe the sort done by the reducer. Is there any way to push this into the MR runtime? • Instead of • (term t, posting <docid, f>) • Emit • (tuple <t, docid>, tf f) • This is our previously studied value-to-key conversion design pattern. • This switching ensures the keys arrive in order at the reducer. • Small memory footprint; less buffer space needed at the reducer. • See Fig. 4.4
Modified mapper
procedure Map(docid n, doc d)
  H ← new AssociativeArray
  for all terms t in doc d do
    H{t} ← H{t} + 1
  for all terms t in H do
    Emit(tuple <t, n>, tf H{t})
Modified Reducer
method Initialize
  tprev ← 0
  P ← new PostingsList
method Reduce(tuple <t, n>, tf [f1, …])
  if t ≠ tprev and tprev ≠ 0 then
    Emit(term tprev, postings P)
    P.Reset()
  P.Add(<n, f>)
  tprev ← t
method Close
  Emit(term tprev, postings P)
Improved MR for II
class Mapper
  method Map(docid n, doc d)
    H ← new AssociativeArray
    for all term t in doc d do
      H[t] ← H[t] + 1
    for all term t in H do
      Emit(tuple <t, n>, tf H[t])
class Reducer
  method Initialize
    tprev ← 0
    P ← new PostingsList
  method Reduce(tuple <t, n>, tf [f])
    if t ≠ tprev and tprev ≠ 0 then
      Emit(term tprev, postings P)
      P.Reset()
    P.Add(<n, f>)
    tprev ← t
  method Close
    Emit(term tprev, postings P)
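A minimal sketch of the value-to-key conversion pattern in plain Python, simulating the runtime's sort of composite (term, docid) keys so the reducer only streams postings and never buffers or sorts them itself; names are illustrative.

from collections import Counter

def improved_job(corpus):
    """corpus: dict docid -> list of terms. Returns term -> postings sorted by docid."""
    # Mapper: emit composite key (term, docid) with the tf as the value.
    pairs = [((term, docid), tf)
             for docid, terms in corpus.items()
             for term, tf in Counter(terms).items()]
    # Shuffle and sort: the framework sorts by the composite key, so postings for
    # each term arrive at the reducer already ordered by docid.
    pairs.sort(key=lambda kv: kv[0])

    # Reducer: stream the sorted pairs, emitting a postings list whenever the term changes.
    index, t_prev, postings = {}, None, []
    for (term, docid), tf in pairs:
        if term != t_prev and t_prev is not None:
            index[t_prev] = postings
            postings = []
        postings.append((docid, tf))
        t_prev = term
    if t_prev is not None:                      # the "Close" step: flush the last term
        index[t_prev] = postings
    return index

print(improved_job({1: ["one", "fish", "two", "fish"], 2: ["red", "fish", "blue", "fish"]}))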
Other modifications • The partitioner and shuffle have to deliver all related <key, value> pairs to the same reducer. • Use a custom partitioner so that all tuples for the same term t go to the same reducer (see the sketch below). • Let's go through a numerical example.
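A minimal sketch of such a partitioner, hashing only the term part of the composite (term, docid) key so that all of a term's postings land on one reducer; in Hadoop this would be a custom Partitioner class, here it is plain Python.

def partition(key, num_reducers):
    """key is the composite (term, docid); partition on the term only,
    so every posting for a given term is routed to the same reducer."""
    term, _docid = key
    return hash(term) % num_reducers

# Toy usage: both postings for "fish" go to the same partition.
print(partition(("fish", 1), 4) == partition(("fish", 2), 4))   # True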
What about retrieval? • While MR is great for indexing, it is not great for retrieval.
Index compression for space • Section 4.5 • Postings with docids: (5,2), (7,3), (12,1), (49,1), (51,2)… • The same postings with docid gaps (differences between consecutive docids): (5,2), (2,3), (5,1), (37,1), (2,2)…
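A minimal sketch of gap (delta) encoding plus variable-byte compression of the docid gaps, one common scheme from this section; the byte layout (7 payload bits per byte, high bit marking the final byte) is the standard textbook convention, not taken from the slide.

def to_gaps(postings):
    """[(docid, tf), ...] sorted by docid -> [(gap, tf), ...] of docid gaps."""
    prev, gaps = 0, []
    for docid, tf in postings:
        gaps.append((docid - prev, tf))
        prev = docid
    return gaps

def vbyte_encode(n):
    """Variable-byte encode a non-negative integer: 7 bits per byte,
    with the high bit set only on the final byte."""
    chunks = [n & 0x7F]
    n >>= 7
    while n > 0:
        chunks.append(n & 0x7F)
        n >>= 7
    chunks.reverse()               # most significant chunk first
    chunks[-1] |= 0x80             # terminate: set high bit on the last byte
    return bytes(chunks)

postings = [(5, 2), (7, 3), (12, 1), (49, 1), (51, 2)]
gaps = to_gaps(postings)           # [(5, 2), (2, 3), (5, 1), (37, 1), (2, 2)]
encoded = b"".join(vbyte_encode(g) + vbyte_encode(tf) for g, tf in gaps)
print(gaps, len(encoded), "bytes") # small gaps compress into single bytes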
Miscellaneous Stuff • How would you MapReduce the spam filtering (Naïve Bayes) solution discussed in Ch. 4 of DDS? MR is used in training the model. • Write the solution in the form of your main workflow configuration. • The prior is: what is the random probability of x occurring? E.g., what is the probability that the next person who walks into the class is female? (A sketch of the counting step follows.)
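A minimal sketch of the training counts a Naïve Bayes spam filter needs, written as map and reduce steps in plain Python rather than an actual Hadoop workflow; the labels and the "*DOC*" marker are illustrative, not taken from DDS.

from collections import defaultdict

def nb_map(label, terms):
    """Emit one count per (label, term) plus one per label for the prior."""
    yield (label, "*DOC*"), 1                 # document count -> class prior
    for term in terms:
        yield (label, term), 1                # word count -> per-class likelihood

def nb_train(labeled_docs):
    """labeled_docs: iterable of (label, [terms]). Returns summed counts."""
    counts = defaultdict(int)
    for label, terms in labeled_docs:
        for key, c in nb_map(label, terms):
            counts[key] += c                  # the "reduce": sum counts per key
    return counts

counts = nb_train([("spam", ["win", "cash", "now"]), ("ham", ["meeting", "at", "noon"])])
# P(spam) prior ~ counts[("spam", "*DOC*")] / total docs; likelihoods come from word counts.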
NIH Solicitation in Big Data (2014) • This opportunity targets four topic areas of high need for researchers working with biomedical Big Data: 1. Data Compression/Reduction 2. Data Provenance 3. Data Visualization 4. Data Wrangling
Odds Ratio Example from 4/16/2014 news article • Woods is still favored to win the U.S. Open. He and Rory McIlroy are each 10/1 favorites on the online betting site Bovada. Adam Scott has the next best odds at 12/1… • How to interpret this? • Fractional odds a/b mean a chances against for every b chances for, so the implied win probability is b/(a + b). • 10/1 odds: 1/(10 + 1) = 1/11 ≈ 9.1%. • 12/1 odds: 1/(12 + 1) = 1/13 ≈ 7.7%. • Woods is also the favorite to win the Open Championship at Hoylake in July. He's 7/1 there: 1/(7 + 1) = 1/8 = 12.5%.
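A minimal worked check of the odds-to-probability conversion above, assuming the standard fractional-odds interpretation.

def implied_probability(a, b=1):
    """Fractional odds a/b (a against, b for) -> implied win probability b / (a + b)."""
    return b / (a + b)

print(implied_probability(10))   # 10/1 -> 0.0909... (~9.1%)
print(implied_probability(12))   # 12/1 -> 0.0769... (~7.7%)
print(implied_probability(7))    # 7/1  -> 0.125    (12.5%)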