
Announcements


morwen

Presentation Transcript


  1. Announcements • Research Paper due today • Research Talks • Nov. 29 (Monday) Kayatana and Lance • Dec. 1 (Wednesday) Mark and Jeremy • Dec. 3 (Friday) Joe and Anton • Dec. 5 (Monday) Colin and Paul

  2. Web Search Lecture 23

  3. Searching the Web • Only search what is indexed • 1999: the indexable web was estimated at about 800 million documents; the largest index, Northern Light, covered about 16% of it [7] • 2004: about 8 billion URLs indexed by Google [1] • Largest index - ?% of the indexable web

  4. Visualizing the Web • View the web as a directed graph of nodes and edges • set of abstract nodes (the pages) • joined by directional edges (the hyperlinks) • Structure provides significant insight about the content
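As a minimal sketch of the graph view, the pages can be stored as keys of a dictionary and the hyperlinks as lists of targets. The four page names below are hypothetical, and `backlinks` is an illustrative helper, not something from the lecture.

```python
# Hypothetical four-page web: each key links to the pages in its list.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def backlinks(links):
    """Invert the directed graph: map each page to the pages that link to it."""
    incoming = {page: [] for page in links}
    for source, targets in links.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming


incoming = backlinks(links)
```

Here `incoming["C"]` lists three pages while `incoming["D"]` is empty, which is the kind of structural signal the later slides build on.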

  5. Example Graph [6]

  6. Citation Analysis[2] • Use structure to identify important, or prominent, nodes • Garfield’s impact factor • Quantitative “score” for each journal proportional to the average number of citations per paper published in the previous two years • More heavily cited journals have more overall impact on a field • Consider it better to receive citations from an important journal

  7. Influence Weights • Pinski and Narin’s notion of influence weights • strength of the connection from one journal to another • percentage of citations in the first journal that refer to the second • equilibrium: the weight of each journal J equals the sum of the weights of all journals citing J (scaled by the strengths of the connections) • If a journal receives regular citations from other journals of large weight, it will acquire large weight

  8. On the web • Lot of dead-ends in the link structure • Prominent sites may have no links to outside world • Use “smoothing” operation, giving all pages a small, positive connection strength to every other page • Compute equilibrium weights with respect to modified connection strengths
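The smoothing step and the equilibrium computation could be sketched as follows. This is only an illustration under stated assumptions: the value of `eps`, the row renormalisation after smoothing, the fixed iteration count, and the three-page `strength` matrix are all choices made here, not details given on the slides.

```python
def equilibrium_weights(strength, eps=0.01, iterations=100):
    """Approximate equilibrium weights by repeated redistribution.

    strength[i][j] is the connection strength from page i to page j.
    """
    n = len(strength)
    # Smoothing: give every pair a small positive strength, then
    # renormalise each row so that it sums to 1 (an assumption).
    smoothed = []
    for row in strength:
        padded = [s + eps for s in row]
        total = sum(padded)
        smoothed.append([s / total for s in padded])
    # Start from equal weights and apply the equilibrium condition:
    # weight of j = sum over i of weight of i times strength from i to j.
    weights = [1.0 / n] * n
    for _ in range(iterations):
        weights = [
            sum(weights[i] * smoothed[i][j] for i in range(n))
            for j in range(n)
        ]
    return weights


# Hypothetical example: page 2 is a dead end with no outgoing links.
strength = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
weights = equilibrium_weights(strength)
```

Without smoothing, the dead-end page would absorb weight and pass none on; with it, every page keeps a positive weight and the heavily cited dead end still ends up with the largest one.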

  9. Different Model on the Web • Prominent sites do not link to other prominent sites • Search engines won’t link to other search engines because they are competitors • Each wants to keep users on its own site • A large collection of pages links to many prominent sites in a focused manner • These pages act as resource lists and guides to search engines

  10. Hubs and Authorities • Authorities – most prominent sources of primary content for a topic • Hubs – high-quality guides and resource lists that direct users to recommended authorities • Each page is assigned a hub weight and an authority weight • authority weight - proportional to the sum of the hub weights of pages that link to it • hub weight - proportional to the sum of the authority weights of the pages that it links to
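The two mutually dependent definitions can be iterated until the weights settle. The sketch below is one way to do that, not the lecture's own code: the function name `hits`, the Euclidean normalisation, the iteration count, and the page names are assumptions made for illustration.

```python
import math


def hits(links, iterations=50):
    """Iterate the hub/authority updates on a dict of page -> linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub weight: sum of authority weights of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise so the weights stay bounded ("proportional to").
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical resource lists (guides) pointing at content sites.
links = {
    "guide1": ["siteA", "siteB"],
    "guide2": ["siteA", "siteB", "siteC"],
}
hub, auth = hits(links)
```

In this toy graph the guides receive all of the hub weight and the sites all of the authority weight, with `siteA` and `siteB` ranked above `siteC` because both guides recommend them.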

  11. Simplified PageRank Algorithm [5] • Formula used by Google to rank pages • Let u be a web page • F_u is the set of pages u points to • B_u is the set of pages that point to u • N_u = |F_u| • c is a factor used for normalization

  12. Simplified PageRank Calculation where c = 1
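The formula itself did not survive in this transcript. In Page et al. [5] the simplified rank is $R(u) = c \sum_{v \in B_u} R(v) / N_v$: each page v splits its rank evenly among the N_v pages it points to. A minimal sketch with c = 1 follows; the three-page graph, the uniform starting ranks, and the iteration count are assumptions made here, and the sketch presumes every page has at least one outgoing link.

```python
def simplified_pagerank(forward, iterations=100, c=1.0):
    """Iterate R(u) = c * sum of R(v) / N_v over the pages v that point to u.

    forward maps each page to the list of pages it points to (F_u).
    Assumes every page has at least one outgoing link.
    """
    pages = list(forward)
    rank = {u: 1.0 / len(pages) for u in pages}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v in pages:
            n_v = len(forward[v])  # N_v = |F_v|
            for u in forward[v]:   # v belongs to B_u
                new_rank[u] += c * rank[v] / n_v
        rank = new_rank
    return rank


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
forward = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = simplified_pagerank(forward)
```

For this graph the ranks settle at 0.4 for A, 0.2 for B and 0.4 for C: B receives half of A's rank, and C receives the other half plus all of B's.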

  13. PageRank Formula • Account for sinks • Complete Formula • d is empirically set to about 0.15 to 0.2 by the system
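The complete formula is also missing from the transcript, so the following is a sketch under explicit assumptions rather than the slide's formula. It reads d as the probability of a random jump to any page (consistent with a value of 0.15 to 0.2) and spreads the rank held by sinks evenly over all pages, giving $R(u) = d/n + (1 - d)\left(\sum_{v \in B_u} R(v)/N_v + S/n\right)$, where n is the number of pages and S is the total rank of the sinks. The graph is hypothetical.

```python
def pagerank(forward, d=0.15, iterations=100):
    """PageRank sketch with random jumps; d is the assumed jump probability."""
    pages = list(forward)
    n = len(pages)
    rank = {u: 1.0 / n for u in pages}
    for _ in range(iterations):
        # Rank held by sinks (pages with no out-links) is shared evenly
        # among all pages. This handling of sinks is an assumption.
        sink_mass = sum(rank[v] for v in pages if not forward[v])
        new_rank = {u: d / n + (1 - d) * sink_mass / n for u in pages}
        for v in pages:
            for u in forward[v]:
                new_rank[u] += (1 - d) * rank[v] / len(forward[v])
        rank = new_rank
    return rank


# Hypothetical graph in which S is a sink.
forward = {"A": ["B", "S"], "B": ["S"], "S": []}
rank = pagerank(forward)
```

With the jump term in place the ranks still sum to 1, every page keeps a positive rank, and the sink no longer drains rank out of the system.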

  14. Using Queries to Find Documents: Vector Space Model – Content Relevance. Slide by Mark Levene [3]

  15. Term Frequency (TF) • Count the number of occurrences of each term. • Bag of words approach • Ignore stopwords such as is, a, of, the, … • Stemming - computer is replaced by comput, as are its variants: computers, computing, computation and computed. • Normalise TF by dividing by doc length, byte size of doc or max num of occurrences of a word in the bag. • Example bag of words: is, a, chess, game, game, computer, chess, programming, chess. Slide by Mark Levene [3]
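A minimal sketch of the counting and normalisation steps, using the words from the slide's example bag. The stopword set is just the four words named on the slide, normalisation by the maximum count is one of the three options listed, and stemming is left out because it would need a real stemmer; `term_frequencies` is an illustrative name.

```python
from collections import Counter

# Only the stopwords named on the slide; a real list would be longer.
STOPWORDS = {"is", "a", "of", "the"}


def term_frequencies(text):
    """Count terms, skip stopwords, and normalise by the largest count."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


tf = term_frequencies("is a chess game game computer chess programming chess")
```

Here "chess" occurs three times and gets a normalised TF of 1.0, "game" gets 2/3, and "computer" and "programming" get 1/3 each.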

  16. Inverse Document Frequency (IDF) • N is the number of documents in the corpus. • n_i is the number of docs in which word i appears. • Log dampens the effect of IDF. • IDF is also the number of bits needed to represent the term. Slide by Mark Levene [3]
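The formula is not in the transcript; the standard definition that matches these symbols is $\mathrm{IDF}_i = \log(N / n_i)$. The sketch below assumes that definition and uses a base-2 logarithm so that the value can be read as a number of bits. The four tiny documents are hypothetical.

```python
import math


def inverse_document_frequency(documents):
    """Return log2(N / n_i) for every term i in a list of tokenised docs."""
    n_docs = len(documents)  # N
    doc_freq = {}            # n_i: number of docs containing term i
    for doc in documents:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n_docs / n_i) for term, n_i in doc_freq.items()}


# Hypothetical corpus of four short documents.
documents = [
    ["chess", "game"],
    ["chess", "computer"],
    ["chess", "programming"],
    ["game", "computer"],
]
idf = inverse_document_frequency(documents)
```

A term found in only one of the four documents ("programming") scores 2 bits, while a term found in three of them ("chess") scores well under 1, so rare terms weigh more.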

  17. Ranking with TF-IDF • j – refers to document j • i – refers to word (or term) i in doc j • q – is the query, which is a sequence of terms • score_j – is the score for document j given q • Rank results according to the scoring function. Slide by Mark Levene [3]
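The scoring function itself is missing from the transcript. A common form, assumed here, sums TF times IDF over the query terms: the score of a document is $\sum_{t \in q} \mathrm{TF}(t, \text{doc}) \cdot \mathrm{IDF}(t)$. The sketch normalises TF by document length and uses a base-2 logarithm; the corpus, the query, and the function name `rank_documents` are hypothetical.

```python
import math
from collections import Counter


def rank_documents(documents, query):
    """Return (score, document index) pairs, best match first."""
    n_docs = len(documents)
    # Number of documents in which each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for index, doc in enumerate(documents):
        counts = Counter(doc)
        score = 0.0
        for term in query:
            if counts[term]:
                tf = counts[term] / len(doc)  # TF normalised by doc length
                idf = math.log2(n_docs / doc_freq[term])
                score += tf * idf
        scores.append((score, index))
    return sorted(scores, key=lambda pair: pair[0], reverse=True)


# Hypothetical tokenised documents and query.
documents = [
    ["chess", "game", "game"],
    ["computer", "chess", "programming"],
    ["computer", "game"],
]
ranking = rank_documents(documents, ["computer", "programming"])
```

The document containing both query terms comes first, the one containing only "computer" second, and the one containing neither scores zero.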

  18. Factor in Link Metrics • Multiply by the PageRank of the document (web page). • We do not know exactly how Google factors in the PR; it may be that log(PR) is used. Slide by Mark Levene [3]

  19. Rate of change on the Web [4] • Search engines update their index periodically in order to keep up with the evolving web • an obsolete index leads to irrelevant or “broken” search results • update both content and link structure • Sources of change • content of pages changes • new pages are added

  20. What’s new on the Web? • New pages are created at a rate of 8% a week [4] • New pages borrow a significant amount of content from old pages • After one year, 50% of the content on the web is new • Only 20% of the pages available today will be accessible after one year

  21. New Link Structure • After a year, about 80% of links on the Web will be replaced with new ones • 25% change per week • week-old rankings may not reflect the current ranking of the pages very well

  22. Change in old pages • After one week • 30% of the changed pages – difference > 5% • After one year • less than 50% of changed pages – difference > 5% • Creation of new pages more significant source of change on the Web

  23. Impact on Search Engines • Need to continually update links – this data changes more rapidly than content • most links persist for less than 6 months • Pages are removed and replaced by new ones at rapid rates • Sometimes better to use a cached version of a page • Pages that persist usually do not change very much • Past change does not predict future change

  24. Citations [1] Google. www.google.com [2] J. Kleinberg. Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es), 1999. [3] M. Levene. Lecture 4: Searching the Web. www.dsc.bbk.ac.uk/~mark/download/lec4_searching_the_web.ppt [4] A. Ntoulas et al. What’s New on the Web? The Evolution of the Web from a Search Engine Perspective. In Proceedings of the Thirteenth International World Wide Web Conference, New York, May 17-22, 2004. [5] L. Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper, 1998. [6] I. Rogers. The Google PageRank Algorithm and How It Works. www.iprcom.com/papers/pagerank, April 2002. [7] E. Selberg and O. Etzioni. On the Stability of Web Search Engines. In Proceedings of the RIAO 2000 Conference, Paris, April 12-14, 2000.
