Hypersearching the Web, Chakrabarti, Soumen

Presented by Ray Yamada.

Presentation Transcript
Overview
  • Why Do We Care?
  • Purpose of The Paper?
  • Solution by Clever Project
  • Pros / Cons of the Paper
  • Further Research
Why Do We Care?
  • Web Link Analysis is crucial for efficient Crawling and Ranking algorithms
  • Crawling: Google Sitemap Submission, Yahoo Directory
  • Ranking: Relevant Result
Purpose of The Paper?
  • To Overcome These Challenges:
    • Its Size & Growth
    • Its Content Types
    • Language Semantics
    • New Language
    • Staleness of Results
    • SPAM
    • And More…
Solution: Hyperlinks, Hyperlinks, Hyperlinks…
  • Can Think of the Web as a Directed Graph
  • Node = Web page (URL)
  • Edge = Hyperlink
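The directed-graph view above can be sketched in a few lines of Python. This is only an illustration: the function name `build_graph`, the edge-list input, and the example URLs are assumptions, not something taken from the paper.

```python
from collections import defaultdict

def build_graph(edges):
    """Build out-link and in-link maps from (source_url, target_url) pairs.

    Each URL is a node; each pair is one directed hyperlink edge.
    """
    out_links = defaultdict(set)  # page -> pages it links to
    in_links = defaultdict(set)   # page -> pages that link to it
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links

# Hypothetical example: two pages both link to the same third page.
edges = [
    ("http://a.example/", "http://b.example/"),
    ("http://c.example/", "http://b.example/"),
]
out_links, in_links = build_graph(edges)
print(len(in_links["http://b.example/"]))   # in-degree of b
print(len(out_links["http://a.example/"]))  # out-degree of a
```

Keeping both maps makes it cheap to ask "who links to this page?" as well as "what does this page link to?", which link-analysis methods need.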
Solution: HITS Algorithm
  • Hyperlink-Induced Topic Search (HITS)
    • A.k.a. Hubs and Authorities
  • Hubs – Highly-valued lists for a given query
    • Ex. Yahoo Directory, Open Directory Project and Bookmarking sites.
  • Authorities – Highly endorsed answers to the query
    • Ex. New York Times, Huffington Post, Twitter
  • It is possible for a webpage to be both Hub and Authority
    • Ex. Restaurant Review Blogs
Solution: HITS Algorithm Cont…
  • For each page p, we assign it two values hub(p) and auth(p)
  • Initial Value: For all p, hub(p) = 1, auth(p) = 1 (or any predetermined number)
  • Authority Update Rule: For each page p, update auth(p) to be the sum of the hub scores of all pages that point to it.
  • Hub Update Rule: For each page p, update hub(p) to be the sum of the authority scores of all pages that it points to.
  • Normalize and Repeat
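A minimal sketch of the update loop, assuming the graph is given as a dict from each page to the pages it links to. The authority of a page sums the hub scores of pages pointing to it, and the hub score of a page sums the authority scores of the pages it points to, as in Kleinberg's formulation. The function name `hits`, the fixed iteration count, and the choice to scale each score vector to unit length are assumptions; the slides do not say how normalization is done.

```python
from math import sqrt

def hits(links, iterations=20):
    """Return (hub, auth) score dicts for a graph {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}   # initial value: hub(p) = 1
    auth = {p: 1.0 for p in pages}  # initial value: auth(p) = 1

    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]

        # Hub update: sum the authority scores of all pages p points to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}

        # Normalize so the squared scores sum to 1 (assumed scheme).
        a_norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}

    return hub, auth

# Hypothetical graph: two list-like pages both link to two answer pages.
links = {"hub1": ["a", "b"], "hub2": ["a", "b"], "a": [], "b": []}
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True)[:2])  # top authorities
```

Without the normalization step the scores grow on every pass, so only their relative order would be meaningful.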
Pros:
  • Accurately addresses concerns and challenges we currently deal with
  • Great introduction to search engine algorithm
  • Briefly covered many topics (Breadth)
Cons:
  • Some materials are out of date (1999)
    • Ex. Google vs. Clever Project
  • Lack of Depth
    • Ex. Normalization of Hub and Auth values
Further Research: HITS Algorithm – Extreme Cases
  • Large-in-small-out sites
    • High Auth(p)
    • No Problem
  • Small-in-large-out sites
    • High Hub(p)
    • Problem
Further Research: HITS + Relevance Scoring Method
  • Vector Space Model (VSM)
    • Documents and queries are represented by vectors
    • Term Frequency
  • Okapi Measurement
    • Term Frequency + Document Length
  • Cover Density Ranking (CDR)
    • Phrase Similarity (How close terms appear)
Further Research: HITS + Relevance Scoring Method
  • Use Cosine Relevance Test
[Figure: cosine relevance illustration; labels: Price, Car]
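The cosine relevance test can be illustrated with raw term-frequency vectors. This is a sketch under simplifying assumptions: whitespace tokenization, lowercase matching, and no term weighting beyond raw counts. The function name `cosine_similarity` is hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(query, document):
    """Cosine of the angle between term-frequency vectors of two texts."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    # Dot product over the query's terms (terms absent from d count as 0).
    dot = sum(q[t] * d[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("car price", "car price"))  # same direction, close to 1
print(cosine_similarity("car", "price"))            # no shared terms, 0.0
```

A score near 1 means the document uses the query terms in similar proportions; a score of 0 means they share no terms.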

Further Research: HITS + Relevance Scoring Method
  • Three-Level Scoring Method (TLS)
    • Manual Evaluation of Relevance
      • Relevant Links = 2 points
      • Slightly Relevant Links = 1 point
      • Inactive Links + Error Links (404, 603) = 0 points
      • Irrelevant Links = 0 points
    • Order of query terms matters
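The point scheme above can be tallied as follows. This sketch assumes the manual judgments are already available as labels; the label names, the function name `tls_score`, and the division by the maximum possible score are illustrative assumptions, not the exact formula from the cited work.

```python
# Points per manual judgment, following the three-level scheme.
POINTS = {
    "relevant": 2,
    "slightly_relevant": 1,
    "irrelevant": 0,
    "inactive": 0,  # dead link
    "error": 0,     # error response
}

def tls_score(judgments):
    """Return the earned points as a fraction of the maximum possible."""
    if not judgments:
        return 0.0
    total = sum(POINTS[j] for j in judgments)
    return total / (2 * len(judgments))  # assumed normalization

# Hypothetical judgments for the top four results of one query.
print(tls_score(["relevant", "slightly_relevant", "error", "irrelevant"]))  # 0.375
```

Scaling to the range 0 to 1 makes result lists of different lengths comparable.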
Further Research: Co-citation Graph
  • Regular Link Graph:
  • Co-citation Graph:
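As a rough illustration of the difference: in a co-citation graph, two pages are connected when some third page links to both of them. The sketch below derives co-citation counts from a regular link graph; the function name `cocitation_counts` and the dict-based input are assumptions.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(links):
    """Count, for each pair of pages, how many pages link to both.

    links: {page: [pages it links to]} (the regular link graph).
    Returns a Counter keyed by alphabetically ordered page pairs.
    """
    counts = Counter()
    for source, targets in links.items():
        # Every pair of distinct targets is co-cited once by this source.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical link graph: p and q both cite a and b; q also cites c.
links = {"p": ["a", "b"], "q": ["a", "b", "c"]}
counts = cocitation_counts(links)
print(counts[("a", "b")])  # 2: both p and q link to a and b
```

Pairs with high counts are pages that many sources mention together, which suggests they cover related topics even if they never link to each other.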
What’s Next?
  • Google’s New Search Index: Caffeine
    • Announced June 8th, 2010
    • Up to 50% fresher results
    • Twice as fast
  • Real Time Search
    • Twitter / Facebook

http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html

References
  • Chakrabarti, Soumen; Dom, Byron; Kumar, S. Ravi; Raghavan, Prabhakar; Rajagopalan, Sridhar & Tomkins, Andrew (1999). "Hypersearching the Web". Scientific American, June 1999.
  • Li, Longzhuang; Shang, Yi & Zhang, Wei (2002). "Improvement of HITS-based algorithms on web documents". Proceedings of the 11th International Conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA. doi:10.1145/511446.511514.
  • Henzinger, M. (2001). "Hyperlink analysis for the Web". IEEE Internet Computing, 5(1), 45-50.
  • Kleinberg, Jon (1999). "Authoritative sources in a hyperlinked environment" (PDF). Journal of the ACM 46 (5): 604–632. doi:10.1145/324133.324140.
  • von Ahn, Luis (2008-10-19). "Hubs and Authorities" (PDF). 15-396: Science of the Web Course Notes. Carnegie Mellon University. Retrieved 2008-11-09.