
VIPAS: Virtual Link Powered Authority Search in the Web

Chi-Chun Lin and Ming-Syan Chen, Network Database Laboratory, National Taiwan University.


Presentation Transcript


  1. VIPAS: Virtual Link Powered Authority Search in the Web Chi-Chun Lin and Ming-Syan Chen Network Database Laboratory National Taiwan University

  2. Outline • Motivation and Goal • Preliminaries and Related work • Introduction to Link-analysis • Defects of Traditional Link-analysis and Ideas for Improvement • System Framework and Algorithms • Implementation and Experimental Results • Conclusions NTU

  3. Motivation and Goal • To find the most relevant pages satisfying the user’s information need on the Web • Traditional means for this task • Keyword-based search engines • Problems • Some relevant pages do not contain the keywords in the page text • An alternative method • Analyze the links contained in Web pages instead of ranking by keywords

  4. HITS (1/3) • Authority pages • Pages pointed to by many other pages • Hub pages • Pages pointing to many other pages • Mutual reinforcement • An authority pointed to by many hub pages is an even better authority • A hub pointing to many authority pages is an even better hub • Based on this argument, the goal of HITS is to find the set of best authority pages

  5. HITS (2/3) • Let x_p and y_p denote the authority and hub scores of page p, respectively • x_p := sum of y_q for all q with q → p (figure: pages q1, q2, q3 linking to page p) • y_p := sum of x_q for all q with p → q (figure: page p linking to pages q1, q2, q3)

  6. HITS (3/3) • Iterative algorithm • Step 1: Obtain a set of Web pages using a keyword-based query and expand it to form a base set • Step 2: Assign each page of the base set an initial authority and hub score of 1 • Step 3: Update the scores of each page according to its links • Step 4: Normalize the scores so that Σ (x_p)^2 = 1 and Σ (y_p)^2 = 1, summing over all p in the base set • Step 5: Repeat steps 3 and 4 until the scores converge
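The iterative procedure above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `hits`, the dictionary-based link representation, the iteration cap and the convergence tolerance are all assumptions made for the example. Base-set construction (step 1) is assumed to have happened already.

```python
import math


def _normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {p: v / norm for p, v in scores.items()}


def hits(links, iterations=50, tol=1e-9):
    """Run HITS on a base set.

    links: dict mapping each page to the pages it points to
           (hypothetical representation of the base set's links).
    Returns (authority_scores, hub_scores).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}, {}

    # Step 2: every page starts with authority and hub score 1.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Step 3a: x_p := sum of y_q over all q -> p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Step 3b: y_p := sum of x_q over all p -> q.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}

        # Step 4: normalize so the squared scores sum to 1.
        new_auth = _normalize(new_auth)
        new_hub = _normalize(new_hub)

        # Step 5: stop once the scores no longer change noticeably.
        delta = max(abs(new_auth[p] - auth[p]) for p in pages)
        delta += max(abs(new_hub[p] - hub[p]) for p in pages)
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub
```

For example, with `links = {"h1": ["a", "b"], "h2": ["a"], "h3": ["a"]}`, page `a` ends up with a higher authority score than `b`, and `h1` with a higher hub score than `h2`.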

  7. The Problem with HITS • Links in Web pages only reflect page creators’ judgment • Sometimes a link will not be put in the page even though its destination is very relevant • e.g., a company’s homepage will not link to a competitor in the same industry • We argue: page readers’ consideration should be of equal importance

  8. The Notion of Virtual Links • The basic idea • Identify pages that are heavily accessed within a period, and form a “hot set” from these pages • Create “virtual links” for pages in the hot set and incorporate them into the computation of authority scores • Design a Web warehouse for this task and utilize it to identify authoritative Web pages

  9. System Framework • [Figure: system architecture. Components: Page Archive, Query Interface, Keyword & Ranking Database, Virtual Link Creator, Authority Evaluator, Clickstream Database, Clicking Observer. Labeled data flows: Web pages, page content & links, keywords, virtual links, scores, query results.]

  10. Creating Virtual Links • Scenario: A user interested in Java-related Web pages comes to our system • She submits a query with the keyword “java” • Assume that the query result contains 100 URLs • She clicks each of the top 10 URLs except the 6th • The hot set consists of the 9 URLs clicked

  11. Creating Virtual Links (cont’d) • 2 criteria • [Figure: virtual links pointing to URLs in the result list (URL 1, URL 2, URL 5, URL 6, URL 7, URL 10), drawn once from existing hubs (Hub 1, Hub 2, ..., Hub n) and once from a single Virtual Hub.]

  12. Algorithm VIPAS (Virtual Link Powered Authority Search) • Initialization Phase • For a query term, perform the regular HITS analysis • Collect a base set of pages with computed authority and hub scores and store them in the database • Virtual Link Collection Phase • Monitor user behavior to see whether each URL in the result list is clicked • After a period of observation, put URLs that are often accessed into the “hot set” • Create virtual links for pages in the hot set

  13. Algorithm VIPAS (cont’d) • Refinement Phase • For each page in the hot set, compute its new authority and hub scores • Run several iterations of score updating for pages in the base set • 2 flavors • VIPAS-VH (VIPAS with virtual links from a Virtual Hub) • VIPAS-TH (VIPAS with virtual links from Top Hubs)
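The two flavors differ only in where the virtual links originate. The sketch below illustrates one possible reading, under stated assumptions: the function names, the `__virtual_hub__` node name and the `top_k` parameter are hypothetical, and the slides do not say how many top hubs VIPAS-TH uses.

```python
def add_virtual_links_vh(links, hot_set, hub_name="__virtual_hub__"):
    """VIPAS-VH style: one artificial hub points to every hot-set page.

    links: dict mapping page -> set of pages it links to.
    Returns a new link structure; the input is left untouched.
    """
    augmented = {page: set(targets) for page, targets in links.items()}
    augmented[hub_name] = set(hot_set)
    return augmented


def add_virtual_links_th(links, hub_scores, hot_set, top_k=3):
    """VIPAS-TH style: the top_k existing hubs point to every hot-set page.

    hub_scores: hub scores from the initial HITS run.
    top_k: number of top hubs to use (an assumed parameter).
    """
    augmented = {page: set(targets) for page, targets in links.items()}
    top_hubs = sorted(hub_scores, key=hub_scores.get, reverse=True)[:top_k]
    for hub in top_hubs:
        # Avoid creating a self-link when a top hub is itself in the hot set.
        augmented.setdefault(hub, set()).update(p for p in hot_set if p != hub)
    return augmented
```

Either augmented structure could then be fed back into a score-update routine for the refinement phase.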

  14. Finding Hot Sets • In an observing period, pay attention to clicks on consecutive URLs in the list • When a user clicks several consecutive URLs and then skips some of the URLs that follow, we mark those that have been skipped • Exclude pages marked with a frequency greater than a given threshold from the forming of hot sets • Among the pages left, those that are accessed by at least a given percentage of users are put into the hot set • Note: some relevant URLs will be skipped because the user has already browsed them

  15. Finding Hot Sets (cont’d) • Result list: (1) http://java.sun.com/ (2) http://www.sun.com/java/ (3) http://www.javaworld.com/ (4) http://java.oreilly.com/ (5) http://www.jars.com/ … • Example 1: URLs 1, 2 and 3 clicked, URL 4 skipped, URL 5 clicked: URL 4 is marked • Example 2: URL 1 skipped, URLs 2 and 3 clicked, URL 4 skipped, URL 5 clicked: URL 4 is marked, but URL 1 is not
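The marking rule and the two filters can be sketched as follows. This is an illustration under assumptions the slides do not settle: a "run" of clicks is taken to mean at least `min_run` consecutive clicked URLs, URLs after the user's last click are not marked, mark frequency is a raw count, and the parameter names `max_marks` and `min_click_share` are hypothetical stand-ins for the two thresholds.

```python
from collections import Counter


def marked_urls(results, clicked, min_run=2):
    """Return the URLs skipped right after a run of consecutive clicks.

    results: ranked result list; clicked: URLs the user clicked.
    Only skips that occur before the user's last click are considered.
    """
    clicked = set(clicked)
    last_click = max((i for i, u in enumerate(results) if u in clicked), default=-1)
    marked, run, marking = [], 0, False
    for url in results[: last_click + 1]:
        if url in clicked:
            run += 1
            marking = False
        else:
            if run >= min_run:
                marking = True  # this skip stretch follows a click run
            if marking:
                marked.append(url)
            run = 0
    return marked


def hot_set(results, sessions, max_marks, min_click_share):
    """Build the hot set from per-user click sets observed in one period."""
    marks, clicks = Counter(), Counter()
    for clicked in sessions:
        marks.update(marked_urls(results, clicked))
        clicks.update(set(clicked) & set(results))
    users = len(sessions)
    if users == 0:
        return []
    return [
        url
        for url in results
        if marks[url] <= max_marks and clicks[url] / users >= min_click_share
    ]
```

Applied to the two examples on the slide, `marked_urls` marks only URL 4 in both cases; URL 1 in the second example is not marked because no click run precedes it.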

  16. Assigning Weights to Virtual Links • n pages in the hot set: t1, t2, …, tn • Clickstream 1: (t1, t2, t3, t4, x1, x2) • Clickstream 2: (t3, x1, t1)

  17. Assigning Weights to Virtual Links (cont’d) • Final weight: [formula omitted in the transcript] • For period T_i where i ≥ 2 (1/3 is the degeneration factor)

  18. Computing the New Scores • Let x_p and y_p denote the authority and hub scores of page p, respectively • For each page p, we update p’s authority score by [formula omitted in the transcript] • Similarly, we update p’s hub score by [formula omitted in the transcript]
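The update formulas themselves did not survive in this transcript. As a purely illustrative sketch, the code below assumes that a virtual link contributes to the scores like an ordinary link but scaled by its weight. This is an assumption made for the example, not the authors' stated formula, and the function name and data layout are hypothetical.

```python
def weighted_update(links, virtual, auth, hub):
    """One round of score updating with weighted virtual links.

    links:   dict page -> set of pages it really links to (weight 1).
    virtual: dict (source, target) -> weight of a virtual link (assumed form).
    auth, hub: current scores, keyed by every page involved.
    Returns (new_auth, new_hub), not normalized.
    """
    pages = set(auth)

    # Authority: hub scores flowing in over real links, then virtual links.
    new_auth = {p: 0.0 for p in pages}
    for q, targets in links.items():
        for p in targets:
            new_auth[p] += hub[q]
    for (q, p), weight in virtual.items():
        new_auth[p] += weight * hub[q]

    # Hub: authority scores of the pages pointed to, weighted the same way.
    new_hub = {p: 0.0 for p in pages}
    for q, targets in links.items():
        for p in targets:
            new_hub[q] += new_auth[p]
    for (q, p), weight in virtual.items():
        new_hub[q] += weight * new_auth[p]
    return new_auth, new_hub
```

Normalization, as in regular HITS, would follow each round.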

  19. User-behavior Observation • Use an ASP script • Query result page for keyword “Java”: each plain URL, e.g. http://java.sun.com/ (“The Source of Java(TM) Technology”), is replaced by wrapper.asp?URL=http://java.sun.com/ • When the wrapped link is clicked, the script will • Increment the click count of http://java.sun.com/ • Record the time • Redirect the user to http://java.sun.com/
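The original system used an ASP script; the Python sketch below only mirrors the three steps listed on the slide (count, timestamp, redirect). The function name `handle_wrapper_request`, the in-memory `click_log` dictionary and the scheme check are assumptions added for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def handle_wrapper_request(request_path, click_log, now=None):
    """Process a request such as /wrapper.asp?URL=http://java.sun.com/.

    click_log: dict used here as a stand-in for the clickstream database.
    Returns (status_code, headers) for the HTTP response.
    """
    query = parse_qs(urlparse(request_path).query)
    targets = query.get("URL")
    if not targets:
        return 400, {}
    target = targets[0]
    # Assumed safeguard: only redirect to ordinary Web URLs.
    if urlparse(target).scheme not in ("http", "https"):
        return 400, {}

    entry = click_log.setdefault(target, {"count": 0, "times": []})
    entry["count"] += 1                                      # increment the click count
    entry["times"].append(now or datetime.now(timezone.utc))  # record the time
    return 302, {"Location": target}                          # redirect the user
```

Calling it twice with `/wrapper.asp?URL=http://java.sun.com/` leaves a click count of 2 for that URL and returns a 302 redirect each time.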

  20. Implementation and Experiments • Experimental testbed • NTUEE website (http://www.ee.ntu.edu.tw/) • Data collection • 03/28/’02 ~ 05/31/’02 • Parameters

  21. Evaluation Method • For a keyword, we manually select a list of authority pages and compare it with the output of each algorithm • Discrepancy coefficient

  22. Discrepancy Coefficient – Regular HITS • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 41 (SN 7228)

  23. Discrepancy Coefficient – VIPAS-VH • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 6 (SN 7228)

  24. Evaluation Method • Grouping coefficient • Stability • The standard deviation of each algorithm’s discrepancy coefficients for all of the keywords
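The stability measure is the only evaluation metric whose definition is fully stated here, so it is the only one sketched. The example assumes the population standard deviation; the slides do not say whether the population or sample form was used, and the function name is hypothetical.

```python
import statistics


def stability(discrepancy_coefficients):
    """Standard deviation of one algorithm's discrepancy coefficients.

    discrepancy_coefficients: one value per keyword.
    Uses the population form (an assumption); a lower value means the
    algorithm behaves more consistently across keywords.
    """
    return statistics.pstdev(discrepancy_coefficients)
```

Swapping in `statistics.stdev` would give the sample form instead.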

  25. Grouping Coefficient – Regular HITS • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 41 (SN 7228)

  26. Grouping Coefficient – VIPAS-VH • R1 = 1 (SN 5633), R2 = 5 (SN 8682), R3 = 6 (SN 7228)

  27. Experimental Results

  28. Experimental Results (cont’d)

  29. Conclusions • Link-analysis algorithms are popular in Web information retrieval, but they need further improvement • In our work, we built a Web warehouse that incorporates user feedback into the identification of authoritative resources (Algorithm VIPAS) • Experimental results show that VIPAS is very effective and that the warehouse is able to retrieve much more valuable information for users
