Discover the intricacies of web searching, from indexing to link structures to ranking algorithms like PageRank and TF-IDF. Learn how search engines update content and adapt to the evolving web landscape.
Announcements • Research Paper due today • Research Talks • Nov. 29 (Monday) Kayatana and Lance • Dec. 1 (Wednesday) Mark and Jeremy • Dec. 3 (Friday) Joe and Anton • Dec. 5 (Monday) Colin and Paul
Web Search Lecture 23
Searching the Web • Only search what is indexed • 1999: about 800 million documents on the indexable web [7] • Largest index (Northern Light) - 16% of the indexable web • 2004: 8 billion URLs indexed by Google [1] • Largest index - ?% of the indexable web
Visualizing the Web • View the web as a directed graph of nodes and edges • set of abstract nodes (the pages) • joined by directional edges (the hyperlinks) • Structure provides significant insight about the content
Citation Analysis[2] • Use structure to identify important, or prominent, nodes • Garfield’s impact factor • Quantitative “score” for each journal proportional to the average number of citations per paper published in the previous two years • More heavily cited journals have more overall impact on a field • Consider it better to receive citations from an important journal
Influence Weights • Pinski and Narin’s notion of influence weights • strength of the connection from one journal to another • percentage of citations in the first journal that refer to the second • equilibrium: the weight of each journal J equals the sum of the weights of all journals citing J (scaled by the strengths of the connections) • If a journal receives regular citations from other journals of large weight, it will acquire large weight
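The equilibrium condition above can be written as a single equation. This is a sketch: the symbols w_J (weight of journal J) and s_{IJ} (connection strength from journal I to journal J) are names assumed here for illustration, not notation taken from the slide.

```latex
\[
  w_J = \sum_{I \,\text{cites}\, J} s_{IJ}\, w_I ,
  \qquad
  s_{IJ} = \frac{\text{citations in } I \text{ that refer to } J}{\text{all citations in } I}
\]
```

Read this way, the weights form a fixed point: a journal's weight is large exactly when it is cited often by journals whose own weights are large.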
On the web • Many dead ends in the link structure • Prominent sites may have no links to the outside world • Use a “smoothing” operation, giving all pages a small, positive connection strength to every other page • Compute equilibrium weights with respect to the modified connection strengths
Different Model on the Web • Prominent sites do not link to other prominent sites • Search engines won’t link to other search engines because they are competitors • Each wants to keep users on its own site • Large collections of pages link to many prominent sites in a focused manner • act as resource lists and guides to search engines
Hubs and Authorities • Authorities – most prominent sources of primary content for a topic • Hubs – high-quality guides and resource lists that direct users to recommended authorities • Each page is assigned a hub weight and an authority weight • authority weight - proportional to the sum of the hub weights of the pages that link to it • hub weight - proportional to the sum of the authority weights of the pages that it links to
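The two mutually dependent weight definitions above can be computed by repeated updating. The following is a minimal sketch, assuming the link graph is given as a dictionary from each page to the pages it links to. The function name `hits`, the fixed iteration count, and the Euclidean normalization are illustrative choices, not details from the slides.

```python
import math

def hits(links, iterations=50):
    """Iteratively compute hub and authority weights.

    links: dict mapping each page to a list of pages it links to.
    Returns (hub, auth), two dicts keyed by page.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub weight: sum of authority weights of the pages it links to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the weights stay bounded ("proportional to" the sums).
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth

# Hypothetical graph: three guide pages pointing at two content sites.
example_links = {
    "guide1": ["siteA", "siteB"],
    "guide2": ["siteA", "siteB"],
    "guide3": ["siteA"],
    "siteA": [],
    "siteB": [],
}
hub, auth = hits(example_links)
```

In this toy graph the content sites end up with all of the authority weight and the guide pages with all of the hub weight, which mirrors the resource-list pattern described on the previous slide.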
Simplified PageRank Algorithm[5] • Formula used by Google to rank pages • Let u be a web page • Fu is a set of pages u points to • Bu is the set of pages that point to u • Nu = |Fu| • c factor used for normalization
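The formula itself did not survive in this transcript. With the symbols defined above, the simplified PageRank in [5] is usually written as follows, where R(u) denotes the rank of page u (the name R is taken from that paper, not from the slide text):

```latex
\[
  R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
\]
```

Each page v splits its rank evenly among the N_v pages it points to, and u collects the shares sent by the pages in B_u.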
Simplified PageRank Calculation where c = 1
PageRank Formula • Account for sinks • Complete Formula • d is empirically set to about 0.15 to 0.2 by the system
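The complete formula is not present in this transcript, so the following is only a sketch under stated assumptions. It treats d as the probability of a random jump (consistent with a value of about 0.15), uses the form R(u) = d/N + (1 - d) · Σ R(v)/N_v over pages v linking to u, and handles sinks by spreading their rank evenly over all pages. The function name `pagerank` and the fixed iteration count are illustrative.

```python
def pagerank(links, d=0.15, iterations=100):
    """Power-iteration PageRank sketch.

    links: dict mapping each page to a list of pages it links to.
    d: assumed random-jump probability (not the 0.85-style damping factor).
    """
    pages = sorted(set(links) | {q for ts in links.values() for q in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by sinks (pages with no out-links) is spread evenly;
        # this sink handling is an assumption, not taken from the slide.
        sink_mass = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: d / n + (1 - d) * sink_mass / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Page p passes an equal share of its rank to each target.
                new_rank[q] += (1 - d) * rank[p] / len(targets)
        rank = new_rank
    return rank

# Hypothetical four-page graph.
example_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_graph)
```

With this formulation the ranks always sum to 1, and a page with no in-links, such as "d" here, keeps only the random-jump share d/N.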
Using Queries to Find Documents • Vector Space Model – Content Relevance Slide by Mark Levene [3]
Term Frequency (TF) • Count the number of occurrences of each term. • Bag of words approach • Ignore stopwords such as is, a, of, the, … • Stemming - computer is replaced by comput, as are its variants: computers, computing, computation, computer and computed. • Normalise TF by dividing by doc length, byte size of doc or max num of occurrences of a word in the bag. (Figure: a bag of words containing is, a, chess, game, game, computer, chess, programming, chess) Slide by Mark Levene [3]
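The counting and normalisation steps above can be sketched in a few lines. This example assumes whitespace tokenisation, a tiny stopword set built from the words listed on the slide, and normalisation by the maximum count in the bag. It skips stemming, and the name `term_frequencies` is illustrative.

```python
from collections import Counter

# Small stopword set taken from the examples on the slide.
STOPWORDS = {"is", "a", "of", "the"}

def term_frequencies(text):
    """Return normalised term frequencies for one document."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    # Normalise by the count of the most frequent term in the bag.
    max_count = max(counts.values(), default=1)
    return {word: count / max_count for word, count in counts.items()}

tf = term_frequencies("chess is a game computer chess programming chess game")
```

Here "chess" occurs three times and gets a TF of 1.0, "game" occurs twice and gets 2/3, and the stopwords are dropped before counting.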
Inverse Document Frequency (IDF) • N is the number of documents in the corpus. • n_i is the number of docs in which word i appears. • Log dampens the effect of IDF. • IDF is also the number of bits needed to represent the term. Slide by Mark Levene [3]
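The IDF formula is missing from this transcript. The standard definition consistent with the bullets above is shown below; the base-2 logarithm is an assumption suggested by the "number of bits" remark.

```latex
\[
  \mathrm{idf}_i = \log_2 \frac{N}{n_i}
\]
% Worked example with made-up numbers: N = 1024 documents, n_i = 16
% \mathrm{idf}_i = \log_2 (1024 / 16) = \log_2 64 = 6 \text{ bits}
```

A word that appears in every document has n_i = N and an IDF of 0, so it contributes nothing to distinguishing one document from another.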
Ranking with TF-IDF • i – refers to document i • j – refers to word (or term) j in doc i • q – is the query, which is a sequence of terms • score_i – is the score for document i given q • Rank results according to the scoring function. Slide by Mark Levene [3]
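The scoring function is not included in this transcript. A common form, assumed here, sums TF × IDF over the query terms that appear in each document. The sketch below uses whitespace tokenisation, max-count TF normalisation, and a base-2 logarithm; the function name `rank_documents` and the sample documents are illustrative.

```python
import math
from collections import Counter

def rank_documents(docs, query):
    """Return (score, doc_index) pairs sorted from best to worst match."""
    n_docs = len(docs)
    bags = [Counter(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    scores = []
    for i, bag in enumerate(bags):
        max_count = max(bag.values(), default=1)
        score = 0.0
        for term in query.lower().split():
            if term in bag:
                tf = bag[term] / max_count
                idf = math.log2(n_docs / doc_freq[term])
                score += tf * idf
        scores.append((score, i))
    # Highest score first; ties are broken by document index.
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))

sample_docs = [
    "chess game computer chess programming chess",
    "computer programming language",
    "cooking pasta recipe",
]
ranked = rank_documents(sample_docs, "computer chess")
```

In this sample the first document ranks highest because it matches both query terms and "chess" is rare in the corpus, while the third document matches neither term and scores 0.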
Factor in Link Metrics • Multiply by the PageRank of the document (web page). • We do not know exactly how Google factors in the PR; it may be that log(PR) is used. Slide by Mark Levene [3]
Rate of change on the Web [4] • Search engines update their index periodically in order to keep up with the evolving web • an obsolete index leads to irrelevant or “broken” search results • update both content and link structure • Sources of change • content of pages changes • new pages are added
What’s new on the Web? • New pages are created at a rate of 8% per week [4] • New pages borrow a significant amount of content from old pages • After one year, 50% of the content on the web is new • Only 20% of the pages available today will still be accessible after one year
New Link Structure • After a year, about 80% of links on the Web will be replaced with new ones • 25% change per week • week-old rankings may not reflect the current ranking of the pages very well
Change in old pages • After one week • 30% of the changed pages differ by more than 5% • After one year • fewer than 50% of the changed pages differ by more than 5% • Creation of new pages is a more significant source of change on the Web
Impact on Search Engines • Need to continually update links – this data changes more rapidly than content • most links persist for less than 6 months • Pages are removed and replaced by new ones at rapid rates • Sometimes better to use a cached version of a page • Pages that persist usually do not change very much • Past change does not predict future change
Citations
[1] Google. www.google.com
[2] J. Kleinberg. Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es), 1999.
[3] M. Levene. Lecture 4: Searching the Web. www.dsc.bbk.ac.uk/~mark/download/lec4_searching_the_web.ppt
[4] A. Ntoulas et al. What’s New on the Web? The Evolution of the Web from a Search Engine Perspective. In Proceedings of the Thirteenth International World Wide Web Conference, New York, May 17-22, 2004.
[5] L. Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper, 1998.
[6] I. Rogers. The Google PageRank Algorithm and How It Works. www.iprcom.com/papers/pagerank, April 2002.
[7] E. Selberg and O. Etzioni. On the Stability of Web Search Engines. In Proceedings of RIAO 2000 Conference, Paris, April 12-14, 2000.