
Indexing The World Wide Web: The Journey So Far

Authors: Abhishek Das (Google Inc., USA), Ankit Jain (Google Inc., USA). Presented by: Anamika Mukherji.


Presentation Transcript


  1. Indexing The World Wide Web: The Journey So Far
     Authors: Abhishek Das (Google Inc., USA), Ankit Jain (Google Inc., USA)
     Presented by: Anamika Mukherji

  2. Is Indexing Difficult? - Yes!
     • Words are not known beforehand
     • Content is available in different languages
     • Variations in grammar and style
     • No structure: riddled with colors, fonts, images, etc.
     • Various byte-encoding schemes

  3. Answering The User’s Query
     • Retrieval for a typical query
       • Find the terms in the dictionary.
       • Start with the least frequent term, since its posting list will be the shortest.
       • Fetch the corresponding posting lists.
       • Intersect the lists on document identifiers to get the relevant documents.
       • Rank and re-order the documents to present them to the user.
     • To get quality results as fast as possible, an understanding of how each resource is used is required
       • Disk space
       • Disk transfer
       • Memory
       • CPU time
     • Choice of data structure impacts CPU and storage
       • A fixed-length array is wasteful if posting lists are kept in memory.
       • A singly linked list allows cheap insertions and updates.
       • Variable-length arrays require less CPU time.
       • A linked list of fixed-length arrays can be used for each term.
       • Avoid pointers when storing the posting list in memory.
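The retrieval steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `intersect` and `conjunctive_query` are hypothetical, the index is a plain in-memory dictionary, and the toy posting lists reuse the document identifiers from the gap example on the compression slide.

```python
from typing import Dict, List


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Merge-intersect two posting lists sorted by document identifier."""
    i, j = 0, 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def conjunctive_query(index: Dict[str, List[int]], terms: List[str]) -> List[int]:
    """Return the documents that contain every query term (ranking is omitted)."""
    postings = []
    for term in terms:
        if term not in index:  # a term missing from the dictionary matches nothing
            return []
        postings.append(index[term])
    if not postings:
        return []
    # Start with the least frequent term: its posting list is the shortest.
    postings.sort(key=len)
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
        if not result:  # nothing left to intersect
            break
    return result


# Hypothetical toy index: term -> sorted document identifiers.
index = {
    "the": [1, 2, 3, 4, 5, 6],
    "to": [1, 3, 4, 5, 6],
    "john": [2, 4, 6],
}
print(conjunctive_query(index, ["the", "to", "john"]))  # [4, 6]
```

Processing the shortest list first keeps every intermediate result no longer than that list, which is why the slide recommends starting with the least frequent term.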

  4. Better Understanding of User Intent
     • Check the proximity of different terms.
     • A positional index expands storage and slows down query processing.
     • Phrase-based indexing
       • Expensive; there is no accurate mechanism for identifying which phrase might be used.
       • Use a good phrase.
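To make the storage cost of a positional index concrete, here is a rough sketch, assuming whitespace tokenization and an in-memory dictionary. The names `build_positional_index` and `within_k` and the two sample documents are invented for illustration. Each posting now carries a list of positions rather than a single document identifier, which is where the extra storage and query time come from.

```python
from typing import Dict, List


def build_positional_index(docs: Dict[int, str]) -> Dict[str, Dict[int, List[int]]]:
    """Map term -> {doc id -> positions of the term in that document}."""
    index: Dict[str, Dict[int, List[int]]] = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def within_k(index: Dict[str, Dict[int, List[int]]], t1: str, t2: str, k: int) -> List[int]:
    """Return doc ids where t1 and t2 occur at most k positions apart."""
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    hits = []
    for doc_id in sorted(p1.keys() & p2.keys()):
        # Quadratic position check, kept simple for readability.
        if any(abs(a - b) <= k for a in p1[doc_id] for b in p2[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical sample documents.
docs = {
    1: "john went to the store",
    2: "to the store john never went",
}
pos_index = build_positional_index(docs)
print(within_k(pos_index, "john", "store", 2))  # [2]
```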

  5. Document vs. Term Based Partitioning

  6. Memory vs. Disk Storage

  7. Compressing The Index
     • Advantages of a compressed index
       • Faster transfer of data from disk to memory
       • Reduces disk seek time
     • Compression schemes
       • Variable byte encoding
       • Bit-level encoding
     • Using gaps: each document identifier is replaced by its difference from the previous one (postings are ⟨doc id, term frequency⟩ pairs)
       • Original posting lists:
           the: ⟨1, 9⟩ ⟨2, 8⟩ ⟨3, 8⟩ ⟨4, 5⟩ ⟨5, 6⟩ ⟨6, 9⟩
           to: ⟨1, 5⟩ ⟨3, 1⟩ ⟨4, 2⟩ ⟨5, 2⟩ ⟨6, 6⟩
           john: ⟨2, 4⟩ ⟨4, 1⟩ ⟨6, 4⟩
       • With gaps:
           the: ⟨1, 9⟩ ⟨1, 8⟩ ⟨1, 8⟩ ⟨1, 5⟩ ⟨1, 6⟩ ⟨1, 9⟩
           to: ⟨1, 5⟩ ⟨2, 1⟩ ⟨1, 2⟩ ⟨1, 2⟩ ⟨1, 6⟩
           john: ⟨2, 4⟩ ⟨2, 1⟩ ⟨2, 4⟩
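The gap transformation in this slide is small enough to show directly. The sketch below handles only the document identifiers (term frequencies are left out), and the helper names `to_gaps` and `from_gaps` are made up for this example.

```python
from itertools import accumulate
from typing import List


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Replace each sorted doc id by its difference from the previous one."""
    return [cur - prev for prev, cur in zip([0] + doc_ids, doc_ids)]


def from_gaps(gaps: List[int]) -> List[int]:
    """Recover the original doc ids with a running sum over the gaps."""
    return list(accumulate(gaps))


print(to_gaps([1, 3, 4, 5, 6]))  # [1, 2, 1, 1, 1]  (the "to" list)
print(to_gaps([2, 4, 6]))        # [2, 2, 2]        (the "john" list)
print(from_gaps([2, 2, 2]))      # [2, 4, 6]
```

Gaps are useful because they are typically much smaller than the identifiers themselves, so the encodings on the following slides can spend fewer bits or bytes on each one.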

  8. Variable Byte Encoding
     • Uses an integral but adaptive number of bytes, depending on the gap size.
     • The first bit of each byte is a continuation bit.
     • The remaining 7 bits of each byte encode part of the gap.
     • To decode a gap:
       • Read a sequence of bytes until the continuation bit flips.
       • Extract and concatenate the 7-bit parts to get the magnitude of the gap.
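A minimal sketch of the scheme follows. The slide does not say which value of the continuation bit marks the end of a gap, so this example assumes one common convention: the high bit is 1 on the final byte of a gap and 0 on every earlier byte. The function names are illustrative.

```python
from typing import Iterable, List


def vb_encode_number(n: int) -> bytes:
    """Encode one non-negative gap using 7 payload bits per byte."""
    if n < 0:
        raise ValueError("gaps must be non-negative")
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)  # more significant 7-bit groups go first
        n //= 128
    chunks[-1] |= 0x80  # assumed convention: high bit set on the last byte
    return bytes(chunks)


def vb_encode(gaps: Iterable[int]) -> bytes:
    """Concatenate the encodings of all gaps into one byte string."""
    return b"".join(vb_encode_number(g) for g in gaps)


def vb_decode(data: bytes) -> List[int]:
    """Read bytes until the high bit flips to 1, joining the 7-bit parts."""
    out: List[int] = []
    n = 0
    for byte in data:
        if byte & 0x80:  # final byte of this gap
            out.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return out


encoded = vb_encode([824, 5, 214577])
print(encoded.hex())       # 06b8850d0cb1
print(vb_decode(encoded))  # [824, 5, 214577]
```

Small gaps cost a single byte, and decoding needs only shifts and masks on whole bytes, which keeps CPU cost low compared with bit-level codes.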

  9. Bit Level Encoding
     • Used when disk space is at a premium.
     • These codes adapt the length of the code at a finer-grained bit level.
     • Each codeword is divided into two parts: a prefix and a suffix.
       • The prefix indicates the binary magnitude of the value and tells the decoder how many bits there are in the suffix.
       • The suffix indicates the value of the number within the corresponding binary range.
     • Query processing is more time consuming.
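The slide does not name a specific code. As one example that fits the prefix/suffix description, the sketch below uses the Elias gamma code; this is an assumption for illustration, not necessarily the code the authors have in mind. Bits are represented as a string of '0' and '1' characters for readability, whereas a real index would pack them into machine words.

```python
from typing import List


def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    binary = bin(n)[2:]               # e.g. 9 -> '1001'
    suffix = binary[1:]               # drop the leading 1 -> '001'
    prefix = "1" * len(suffix) + "0"  # unary count of suffix bits -> '1110'
    return prefix + suffix


def gamma_decode(bits: str) -> List[int]:
    """Decode a concatenation of gamma codewords (assumes well-formed input)."""
    out: List[int] = []
    i = 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # the unary prefix gives the suffix length
            length += 1
            i += 1
        i += 1                 # skip the terminating 0 of the prefix
        suffix = bits[i:i + length]
        i += length
        out.append(int("1" + suffix, 2))  # restore the implicit leading 1
    return out


print(gamma_encode(9))  # 1110001
stream = "".join(gamma_encode(g) for g in [2, 2, 2, 1, 9])
print(gamma_decode(stream))  # [2, 2, 2, 1, 9]
```

A gap of 1 costs a single bit here, but the decoder has to walk the stream bit by bit, which illustrates why query processing is slower than with byte-aligned codes.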

  10. Ordering by Highest Impact First
      Example:
      • Posting list as ⟨doc id, term frequency⟩ pairs:
          ⟨12, 2⟩ ⟨17, 2⟩ ⟨29, 1⟩ ⟨32, 1⟩ ⟨40, 6⟩ ⟨78, 1⟩ ⟨101, 3⟩ ⟨106, 1⟩
      • When the list is reordered by term frequency, it becomes:
          ⟨40, 6⟩ ⟨101, 3⟩ ⟨12, 2⟩ ⟨17, 2⟩ ⟨29, 1⟩ ⟨32, 1⟩ ⟨78, 1⟩ ⟨106, 1⟩
      • The repeated frequency information can then be factored out into a prefix component with a counter that indicates how many documents share that frequency value (⟨frequency : count : doc ids⟩):
          ⟨6 : 1 : 40⟩ ⟨3 : 1 : 101⟩ ⟨2 : 2 : 12, 17⟩ ⟨1 : 4 : 29, 32, 78, 106⟩
      • Not storing the repeated frequencies gives a considerable saving. Finally, if differences of document identifiers are taken, we get:
          ⟨6 : 1 : 40⟩ ⟨3 : 1 : 101⟩ ⟨2 : 2 : 12, 5⟩ ⟨1 : 4 : 29, 3, 46, 28⟩
      • The document gaps within each equal-frequency segment of the list are now on average larger than when the whole list was sorted by document identifier, thereby requiring more encoding bits/bytes.
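The transformation in this example can be reproduced with a short sketch. The function name `impact_order` and the tuple layout `(frequency, count, gaps)` are choices made for this illustration only.

```python
from itertools import groupby
from typing import List, Tuple


def impact_order(postings: List[Tuple[int, int]]) -> List[Tuple[int, int, List[int]]]:
    """Group (doc id, term frequency) postings by descending frequency.

    Each segment is (frequency, document count, doc id gaps), where the first
    entry of the gap list is the absolute doc id of the segment's first document.
    """
    # Highest frequency first; doc ids stay ascending inside each frequency.
    ordered = sorted(postings, key=lambda p: (-p[1], p[0]))
    segments = []
    for freq, group in groupby(ordered, key=lambda p: p[1]):
        doc_ids = [doc_id for doc_id, _ in group]
        gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        segments.append((freq, len(doc_ids), gaps))
    return segments


postings = [(12, 2), (17, 2), (29, 1), (32, 1), (40, 6), (78, 1), (101, 3), (106, 1)]
for segment in impact_order(postings):
    print(segment)
# (6, 1, [40])
# (3, 1, [101])
# (2, 2, [12, 5])
# (1, 4, [29, 3, 46, 28])
```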

  11. Managing Multiple Indices
      • Multiple indices, bucketed by rate of refreshing
        • The large index of rarely refreshing pages
        • The small index of ever-refreshing pages
        • The dynamic index of real-time/news pages
      • Waterfall approach
        • Pages discovered in one tier can be passed on to the next tier over time.
        • Invalidate older index and crawl file entries.

  12. SCALING THE SYSTEM
      • Web search engines use distributed indexing algorithms for index construction.
      • Distributed File System
        • To manage large amounts of data across large commodity clusters, a distributed file system is essential: it must provide efficient remote file access, file transfers, and the ability to carry out concurrent independent operations while being extremely fault tolerant.
      • Map-Shuffle-Reduce
        • Map: The master node chops up the problem into small chunks and assigns each chunk to a worker. The worker either processes the chunk of data with the mapper and returns the result to the master, or further chops up the input data and assigns it hierarchically.
        • Shuffle: Group the key-value pairs emitted by the mappers.
        • Reduce: Take the sub-answers and combine them to create the final output.
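As a single-process illustration of the three phases, the sketch below builds a tiny inverted index. It makes several assumptions: there is no master node, no workers, and no distributed file system, and the names `map_phase`, `shuffle_phase`, `reduce_phase`, and `build_index` are invented for the example. In a real cluster each phase would run in parallel across machines.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(doc_id: int, text: str) -> List[Tuple[str, int]]:
    """Map: emit one (term, doc id) pair per token of a document chunk."""
    return [(term, doc_id) for term in text.lower().split()]


def shuffle_phase(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group the values emitted by the mappers under their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term: str, doc_ids: List[int]) -> Tuple[str, List[int]]:
    """Reduce: combine the grouped values into a sorted, deduplicated posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle_phase(pairs).items())


# Hypothetical two-document collection.
docs = {1: "the journey so far", 2: "the web the index"}
inverted = build_index(docs)
print(inverted["the"])  # [1, 2]
print(inverted["web"])  # [2]
```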

  13. FUTURE RESEARCH DIRECTIONS
      • Real-Time Data and Search: what can we do with each tweet?
        • Create a social graph
        • Extract and index links
        • Real-time related topics
        • Sentiment analysis
      • Social and Personalized Web Search
        • Facebook, Twitter, etc.
        • Facebook users post a wealth of information
          • Static: book and movie interests
          • Dynamic: user locations, status updates, wall posts
        • Learning a user’s personal information can personalize search results
        • Facebook is impacting the world of search
          • Opened data to third party service
          • Search for 2 degrees of user

  14. Pros and Cons
      • What I liked about it
        • Delves into the history of search engines.
        • Talks about future enhancements.
        • Explains how a search engine works.
      • What I didn’t like
        • Skims the surface without going deep.
        • Includes very few examples, which makes understanding difficult.
        • The Compressing the Index section lacks structure, which makes it difficult to understand.
