Announcements

Announcements • Research Paper due today • Research Talks • Nov. 29 (Monday) Kayatana and Lance • Dec. 1 (Wednesday) Mark and Jeremy • Dec. 3 (Friday) Joe and Anton • Dec. 5 (Monday) Colin and Paul

Web Search Lecture 23

Searching the Web • A search engine can only search what it has indexed • 1999, 800 million documents indexed by Northern Light [7] • Largest Index - 16% of the indexable web • 2004, 8 billion URLs indexed by Google [1] • Largest Index - ?% of indexable web

Visualizing the Web • View the web as a directed graph of nodes and edges • set of abstract nodes (the pages) • joined by directional edges (the hyperlinks) • Structure provides significant insight about the content
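
As a minimal sketch of this view, a small web graph can be stored as a mapping from each page to the pages it links to; inverting the mapping gives the pages that link to each page. The page names and links below are made up for illustration.

```python
# Hypothetical four-page web; the names and links are illustrative only.
out_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def invert(out_links):
    """Return the mapping page -> list of pages that link to it."""
    in_links = {page: [] for page in out_links}
    for source, targets in out_links.items():
        for target in targets:
            in_links.setdefault(target, []).append(source)
    return in_links

in_links = invert(out_links)
print(in_links["C"])  # the pages that point to C
```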

Example Graph [6]

Citation Analysis[2] • Use structure to identify important, or prominent, nodes • Garfield’s impact factor • Quantitative “score” for each journal proportional to the average number of citations per paper published in the previous two years • More heavily cited journals have more overall impact on a field • Consider it better to receive citations from an important journal

Influence Weights • Pinski and Narin’s notion of influence weights • strength of the connection from one journal to another • percentage of citations in the first journal that refer to the second • equilibrium: the weight of each journal J is equal to the sum of the weights of all journals citing J (scaled by the strengths of the connections) • If a journal receives regular citations from other journals of large weight, it will acquire large weight
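
The equilibrium condition can be written out as follows. The symbols are chosen here for illustration and are not from the slide: w_J is the weight of journal J, and s_{IJ} is the connection strength from journal I to journal J.

```latex
w_J = \sum_{I} s_{IJ}\, w_I,
\qquad
s_{IJ} = \frac{\text{number of citations in } I \text{ that refer to } J}{\text{total number of citations in } I}
```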

On the web • Many dead ends in the link structure • Prominent sites may have no links to the outside world • Use a “smoothing” operation, giving all pages a small, positive connection strength to every other page • Compute equilibrium weights with respect to the modified connection strengths
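
One way such a smoothing step might look is sketched below, assuming connection strengths are held in a square list of lists. The function name, the value of epsilon, and the treatment of dead ends (a uniform row) are illustrative choices, not taken from the lecture.

```python
def smooth(strengths, epsilon=0.1):
    """Mix each row of connection strengths with a uniform distribution.

    strengths[i][j] is the connection strength from page i to page j.
    A dead-end page (all-zero row) is given equal strength to every page.
    For simplicity the uniform part also includes the page itself.
    """
    n = len(strengths)
    smoothed = []
    for row in strengths:
        total = sum(row)
        if total == 0:
            smoothed.append([1.0 / n] * n)
        else:
            smoothed.append(
                [(1 - epsilon) * s / total + epsilon / n for s in row]
            )
    return smoothed

# Page 2 is a dead end: it has no outgoing links.
strengths = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
smoothed = smooth(strengths)
```

After smoothing, every row sums to 1 and every entry is positive, so the equilibrium weights are well defined even with dead ends.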

Different Model on the Web • Prominent sites do not link to other prominent sites • Search engines won’t link to other search engines because they are competitors • Each wants to keep users on its own site • Large collection of pages link to many prominent sites in a focused manner • act as resource lists and guides to search engines

Hubs and Authorities • Authorities – most prominent sources of primary content for a topic • Hubs – high quality guides and resource lists that direct users to recommended authorities • Each page is assigned a hub weight and an authority weight • authority weight - proportional to the sum of the hub weights of the pages that link to it • hub weight - proportional to the sum of the authority weights of the pages that it links to
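
The two mutually dependent definitions can be computed by repeated updating. The sketch below assumes a fixed number of iterations and Euclidean normalisation after each round; the page names and the iteration count are illustrative assumptions.

```python
import math

def hits(out_links, iterations=50):
    """Iteratively compute hub and authority weights for each page."""
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in out_links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub weight: sum of authority weights of the pages it links to.
        new_hub = {p: sum(new_auth[t] for t in out_links[p]) for p in pages}
        # Normalise so the weights do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth

# Hypothetical graph: two guide pages pointing at three content sites.
out_links = {
    "guide1": ["siteA", "siteB"],
    "guide2": ["siteA", "siteB", "siteC"],
    "siteA": [],
    "siteB": [],
    "siteC": [],
}
hub, auth = hits(out_links)
```

In this example the guide pages end up with all the hub weight, and siteA and siteB, which are recommended by both guides, receive more authority weight than siteC.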

Simplified PageRank Algorithm [5] • Formula used by Google to rank pages • Let u be a web page • Fu is the set of pages u points to • Bu is the set of pages that point to u • Nu = |Fu| • c is a factor used for normalization
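
With these definitions, the simplified ranking in Page et al. [5] is R(u) = c * (sum over v in Bu of R(v) / Nv). A minimal sketch of iterating that rule follows, assuming every page has at least one outgoing link; the three-page graph, the starting ranks, and the iteration count are made up for illustration.

```python
def simplified_pagerank(forward, iterations=100, c=1.0):
    """Iterate R(u) = c * sum(R(v) / N_v for v in B_u).

    forward[u] is F_u, the list of pages u points to.
    Every page is assumed to have at least one outgoing link.
    """
    pages = list(forward)
    rank = {u: 1.0 / len(pages) for u in pages}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v in pages:
            share = rank[v] / len(forward[v])  # R(v) / N_v
            for u in forward[v]:               # v belongs to B_u
                new_rank[u] += c * share
        rank = new_rank
    return rank

# Hypothetical graph with no sinks.
forward = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = simplified_pagerank(forward)
```

For this graph the ranks settle at 0.4 for A and C and 0.2 for B: B receives only half of A's rank, while C receives the other half plus all of B's.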

Simplified PageRank Calculation where c = 1

PageRank Formula • Account for sinks • Complete Formula • d is empirically set to about 0.15 to 0.2 by the system
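
The complete formula is not preserved in this transcript. The sketch below assumes the common random-surfer form, in which the surfer jumps to a uniformly chosen page with probability d (consistent with d being about 0.15 to 0.2), and assumes that the rank held by sinks is spread evenly over all pages. Both are assumptions about the form, not a reproduction of the slide.

```python
def pagerank(forward, d=0.15, iterations=100):
    """PageRank with a random jump of probability d and sink handling.

    forward[u] is the list of pages u points to; a sink has an empty list.
    """
    pages = list(forward)
    n = len(pages)
    rank = {u: 1.0 / n for u in pages}
    for _ in range(iterations):
        # Rank held by sinks is redistributed evenly over all pages.
        sink_rank = sum(rank[v] for v in pages if not forward[v])
        new_rank = {u: d / n + (1 - d) * sink_rank / n for u in pages}
        for v in pages:
            if forward[v]:
                share = (1 - d) * rank[v] / len(forward[v])
                for u in forward[v]:
                    new_rank[u] += share
        rank = new_rank
    return rank

# Hypothetical graph; E is a sink with no outgoing links.
forward = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"], "E": []}
rank = pagerank(forward)
```

The ranks still sum to 1 after every iteration, and pages with no incoming links (D and E here) keep a small positive rank from the random jump.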

Using Queries to Find Documents • Vector Space Model – Content Relevance • Slide by Mark Levene [3]

Term Frequency (TF) • Count the number of occurrences of each term. • Bag of words approach • Ignore stopwords such as is, a, of, the, … • Stemming - computer is replaced by comput, as are its variants: computers, computing, computation and computed. • Normalise TF by dividing by doc length, byte size of doc or max num of occurrences of a word in the bag. • Example bag of words: is, a, chess, game, game, computer, chess, programming, chess • Slide by Mark Levene [3]
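
A minimal sketch of the counting and normalisation steps, using the maximum-count normalisation from the list above. The stopword set is a tiny illustrative one, and stemming is left out to keep the example short.

```python
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"is", "a", "of", "the"}

def term_frequencies(text):
    """Return TF per term, normalised by the most frequent term's count."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tf = term_frequencies("chess is a game computer chess programming chess game")
```

Here chess occurs three times and gets TF 1.0, game occurs twice and gets 2/3, and the stopwords are dropped before counting.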

Inverse Document Frequency (IDF) • N is the number of documents in the corpus. • ni is the number of docs in which word i appears. • Log dampens the effect of IDF. • IDF is also the number of bits to represent the term. • Slide by Mark Levene [3]
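
The formula itself is missing from the transcript; the usual definition consistent with these bullets is IDF_i = log(N / ni). The sketch below assumes base-2 logarithms (matching the "number of bits" reading) and a made-up four-document corpus; returning 0 for a term that appears nowhere is an illustrative convention.

```python
import math

def idf(term, documents):
    """Return log2(N / n_i), where documents is a list of sets of terms."""
    n = len(documents)
    n_i = sum(1 for doc in documents if term in doc)
    if n_i == 0:
        return 0.0  # illustrative convention for terms absent from the corpus
    return math.log2(n / n_i)

# Hypothetical corpus of four documents, each reduced to its set of terms.
documents = [
    {"chess", "game"},
    {"chess", "computer"},
    {"chess", "programming"},
    {"game", "theory"},
]
print(idf("computer", documents))  # rare term, high IDF
print(idf("chess", documents))     # common term, low IDF
```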

Ranking with TF-IDF • j – refers to document j • i – refers to word (or term) i in doc j • q – is the query, which is a sequence of terms • scorej – is the score for document j given q • Rank results according to the scoring function. • Slide by Mark Levene [3]
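
The scoring function is not preserved in the transcript. A minimal sketch, assuming the usual form in which scorej is the sum of TF times IDF over the terms of q, with documents indexed by j and terms by i (matching scorej and ni). The documents and query are made up, and every document is assumed to be non-empty.

```python
import math
from collections import Counter

def rank_documents(query, documents):
    """Return (doc index, score) pairs sorted by descending TF-IDF score."""
    bags = [Counter(doc.lower().split()) for doc in documents]
    n = len(bags)
    scores = []
    for j, bag in enumerate(bags):
        max_count = max(bag.values())  # assumes the document is non-empty
        score = 0.0
        for term in query.lower().split():
            n_i = sum(1 for b in bags if term in b)
            if n_i == 0:
                continue  # term appears in no document
            tf = bag[term] / max_count
            idf = math.log2(n / n_i)
            score += tf * idf
        scores.append((j, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical corpus and query.
documents = [
    "chess game computer chess programming chess",
    "computer programming languages",
    "cooking game recipes",
]
ranking = rank_documents("computer chess", documents)
```

The first document matches both query terms and ranks first; the third matches neither and scores 0.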

Factor in Link Metrics • Multiply by the PageRank of the document (web page). • We do not know exactly how Google factors in the PR; it may be that log(PR) is used. • Slide by Mark Levene [3]
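
Purely as an illustration of the idea, and not a description of what Google does, the content score could be multiplied by a link factor. The function name, the example scores, and the use of log(1 + PR) instead of log(PR) (so the factor stays positive when PR is below 1) are all assumptions made here.

```python
import math

def combined_score(content_score, pagerank, use_log=False):
    """Multiply a content score by PageRank or by a dampened version of it."""
    # log(1 + PR) is an illustrative stand-in for log(PR) that stays positive.
    link_factor = math.log(1 + pagerank) if use_log else pagerank
    return content_score * link_factor

# Hypothetical pages: the first matches the query better,
# the second has a higher PageRank.
first = combined_score(content_score=1.8, pagerank=0.1)
second = combined_score(content_score=1.5, pagerank=0.3)
```

With these made-up numbers the second page overtakes the first once PageRank is factored in, which is the effect the slide describes.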

Rate of change on the Web [4] • Search engines update their index periodically in order to keep up with the evolving web • an obsolete index leads to irrelevant or “broken” search results • update both content and link structure • Sources of change • content of pages changes • new pages are added

What’s new on the Web? • New pages are created at a rate of 8% per week [4] • New pages borrow a significant amount of content from old pages • After one year, 50% of the content on the web is new • Only 20% of the pages available today will be accessible after one year

New Link Structure • After a year, about 80% of links on the Web will be replaced with new ones • 25% change per week • week-old rankings may not reflect the current ranking of the pages very well

Change in old pages • After one week • 30% of the changed pages – difference > 5% • After one year • less than 50% of changed pages – difference > 5% • Creation of new pages more significant source of change on the Web

Impact on Search Engines • Need to continually update links – this data changes more rapidly than content • most links persist for less than 6 months • Pages are removed and replaced by new ones at rapid rates • Sometimes better to use a cached version of a page • Pages that persist usually do not change very much • Past change does not predict future change

Citations
[1] Google. www.google.com
[2] J. Kleinberg. Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es), 1999.
[3] M. Levene. Lecture 4: Searching the Web. www.dsc.bbk.ac.uk/~mark/download/lec4_searching_the_web.ppt
[4] A. Ntoulas et al. What’s New on the Web? The Evolution of the Web from a Search Engine Perspective. In Proceedings of the Thirteenth International World Wide Web Conference, New York, May 17-22, 2004.
[5] L. Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper, 1998.
[6] I. Rogers. The Google PageRank Algorithm and How It Works. www.iprcom.com/papers/pagerank, April 2002.
[7] E. Selberg and O. Etzioni. On the Stability of Web Search Engines. In Proceedings of RIAO 2000 Conference, Paris, April 12-14, 2000.
