Hypersearching the Web, Chakrabarti, Soumen

Presented By

Ray Yamada


Overview

  • Why Do We Care?

  • Purpose of The Paper?

  • Solution by Clever Project

  • Pros / Cons of the Paper

  • Further Research


Why Do We Care?

  • Web Link Analysis is crucial for efficient Crawling and Ranking algorithms

  • Crawling: Google Sitemap Submission, Yahoo Directory

  • Ranking: Relevant Result


Purpose of The Paper?

  • To Overcome These Challenges:

    • Its Size & Growth

    • Its Content Types

    • Language Semantics

    • New Language

    • Staleness of Results

    • SPAM

    • And More…


Solution: Hyperlinks, Hyperlinks, Hyperlinks…

  • Can Think of the Web as a Directed Graph

  • Node = Web page (URL)

  • Edge = Hyperlink
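
As a rough illustration of the directed-graph view, the sketch below stores a tiny web as an adjacency list and inverts it to find each page's incoming links. The page names and the `in_links` helper are made up for this example and are not from the paper.

```python
# Hypothetical toy web: each key is a page (node), each value lists the
# pages it links to (outgoing edges). The URLs are placeholders.
web_graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": [],
}


def in_links(graph):
    """Invert the adjacency list: map each page to the pages linking to it."""
    incoming = {page: [] for page in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming


incoming = in_links(web_graph)
print(incoming["c.example"])  # pages that link to c.example
```

Link analysis needs both directions: outgoing links describe what a page recommends, incoming links describe who recommends it.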


Solution: HITS Algorithm

  • Hyperlink-Induced Topic Search (HITS)

    • A.k.a. Hubs and Authorities

  • Hubs – Highly-valued lists for a given query

    • Ex. Yahoo Directory, Open Directory Project and Bookmarking sites.

  • Authorities – Highly endorsed answers to the query

    • Ex. New York Times, Huffington Post, Twitter

  • It is possible for a webpage to be both a Hub and an Authority

    • Ex. Restaurant Review Blogs


Solution: HITS Algorithm Cont…

  • For each page p, we assign it two values hub(p) and auth(p)

  • Initial Value: For all p, hub(p) = 1, auth(p) = 1 (or any predetermined number)

  • Authority Update Rule: For each page p, update auth(p) to be the sum of the hub scores of all pages that point to it.

  • Hub Update Rule: For each page p, update hub(p) to be the sum of the authority scores of all pages that it points to.

  • Normalize and Repeat
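
The update rules can be sketched in a few lines of Python. This is a minimal illustration, not the Clever Project's implementation. It assumes the graph is a dict mapping each page to the pages it links to, that a fixed number of iterations is enough, and that "normalize" means scaling each score vector to unit Euclidean length (the slide does not say which normalization is meant). The names `hits` and `normalize` are made up for this example.

```python
import math


def normalize(scores):
    """Scale scores so their squares sum to 1 (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(graph, iterations=50):
    """Return (hub, auth) score dicts for a dict-of-lists link graph."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that point to p.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub update: sum the authority scores of pages that p points to.
        new_hub = {p: sum(new_auth[t] for t in graph.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical example: two list pages pointing at two answer pages.
toy = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(toy)
print(sorted(auth, key=auth.get, reverse=True)[:2])
```

In this toy graph, `a1` is linked from both list pages and should end up with the higher authority score, while `h1` links to both answers and should end up with the higher hub score.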


Solution: HITS Algorithm Cont…

Calculation


Pros:

  • Accurately addresses concerns and challenges we currently deal with

  • Great introduction to search engine algorithms

  • Briefly covered many topics (Breadth)


Cons:

  • Some materials are out of date (1999)

    • Ex. Google vs. Clever Project

  • Lack of Depth

    • Ex. Normalization of Hub and Auth values


Further Research: HITS Algorithm – Extreme Cases

  • Large-in-small-out sites

    • High Auth(p)

    • No Problem

  • Small-in-large-out sites

    • High Hub(p)

    • Problem


Further Research: HITS + Relevance Scoring Method

  • Vector Space Model (VSM)

    • Documents and queries are represented by vectors

    • Term Frequency

  • Okapi Measurement

    • Term Frequency + Document Length

  • Cover Density Ranking (CDR)

    • Phrase Similarity (How close terms appear)


Further Research: HITS + Relevance Scoring Method

  • Use Cosine Relevance Test

(Figure: vectors plotted against the term axes "Price" and "Car")
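
A cosine relevance test can be sketched as follows, assuming plain term-frequency vectors and whitespace tokenization. Both are simplifications chosen for this example rather than details given on the slide. The function name `cosine_similarity` and the sample texts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = "car price"
page = "car car price"
print(round(cosine_similarity(query, page), 3))
```

A score near 1 means the page uses the query terms in similar proportions, and a score of 0 means they share no terms. One way to combine this with HITS is to down-weight pages that score low, though the slide does not spell out the combination.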


Further Research: HITS + Relevance Scoring Method

  • Three-Level Scoring Method (TLS)

    • Manual Evaluation of Relevance

      • Relevant Links = 2 points

      • Slightly Relevant Links = 1 point

      • Inactive Links + Error Links (404, 603) = 0 points

      • Irrelevant Links = 0 points

    • Order of query terms matters
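
The point scheme above could be tallied as in the sketch below. The slide gives only the per-link points, so the aggregation used here (total points divided by the maximum possible) is an assumption made for illustration, as are the label strings and the `tls_score` name.

```python
# Points per manual judgment, taken from the slide. The label strings
# themselves are hypothetical.
POINTS = {
    "relevant": 2,
    "slightly_relevant": 1,
    "inactive_or_error": 0,
    "irrelevant": 0,
}


def tls_score(judgments):
    """Fraction of the maximum points earned by a list of judged links.

    Assumption: dividing by the best possible total (2 points per link)
    is one simple way to compare result lists of different lengths.
    """
    if not judgments:
        return 0.0
    total = sum(POINTS[label] for label in judgments)
    return total / (2 * len(judgments))


judged = ["relevant", "slightly_relevant", "irrelevant", "inactive_or_error"]
print(tls_score(judged))
```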


Further Research: Co-citation Graph

  • Regular Link Graph:

  • Co-citation Graph:
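
One common way to derive a co-citation graph from a regular link graph is to connect two pages whenever some third page links to both, weighting the connection by how many pages do so. The sketch below assumes that definition and a dict-of-lists link graph. The `co_citation` name and the page names are made up for this example.

```python
from collections import Counter
from itertools import combinations


def co_citation(graph):
    """Count, for each pair of pages, how many pages link to both."""
    counts = Counter()
    for targets in graph.values():
        # Each unordered pair of distinct targets is co-cited once per source.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical link graph: p and q both cite x and y.
links = {"p": ["x", "y"], "q": ["x", "y", "z"], "r": ["z"]}
pairs = co_citation(links)
print(pairs[("x", "y")])
```

Pairs are stored with the page names in sorted order, so `("x", "y")` and `("y", "x")` refer to the same undirected edge.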


What’s Next?

  • Google’s New Search Index: Caffeine

    • Announced June 8th, 2010

    • Up to 50% fresher results

    • Twice as fast

  • Real Time Search

    • Twitter / Facebook

http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html


References

  • Chakrabarti, Soumen; Dom, Byron; Kumar, S. Ravi; Raghavan, Prabhakar; Rajagopalan, Sridhar & Tomkins, Andrew. (1999). "Hypersearching the Web". Scientific American, June 1999.

  • Li, Longzhuang; Shang, Yi & Zhang, Wei. (2002). "Improvement of HITS-based algorithms on web documents". Proceedings of the 11th International Conference on World Wide Web, May 7-11, 2002, Honolulu, Hawaii, USA. doi:10.1145/511446.511514.

  • Henzinger, M. (2001). Hyperlink analysis for the Web. IEEE Internet Computing, 5(1), 45-50.

  • Kleinberg, Jon (1999). "Authoritative sources in a hyperlinked environment" (PDF). Journal of the ACM 46 (5): 604–632. doi:10.1145/324133.324140.

  • von Ahn, Luis (2008-10-19). "Hubs and Authorities" (PDF). 15-396: Science of the Web Course Notes. Carnegie Mellon University. Retrieved 2008-11-09.


Q & A

