CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 4: Index Construction. Plan. Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex This time: Index construction. n-z. a-hu. hy-m. $m. mace.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
CS276: Information Retrieval and Web Search
Christopher Manning and Prabhakar Raghavan
Lecture 4: Index Construction
(e.g., compare & swap a word)
4.5 bytes per word token vs. 7.5 bytes per word type: why?
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
After all documents have been parsed, the inverted file is sorted by terms.Key step
We focus on this sort step.
We have 100M items to sort.
If every comparison took 2 disk seeks, and N items could be
sorted with N log2N comparisons, how long would this take?
4How to merge the sorted runs?
must use a distributed computing cluster