
Improved Algorithms for Topic Distillation in a Hyperlinked Environment

This paper discusses improved algorithms for topic distillation in a hyperlinked environment, including connectivity analysis and pruning nodes. It also evaluates the effectiveness of these algorithms in finding related web pages.

cwaller

Presentation Transcript


  1. Improved Algorithms for Topic Distillation in a Hyperlinked Environment (ACM SIGIR '98). Ruey-Lung Hsiao, Nov 23, 2000

  2. Topic Distillation on the WWW
     • Definition
       • Given a typical user query, find quality documents related to the query topic.
     • Characteristics
       • More general than finding a precise query match.
       • Not as ambitious as trying to exactly satisfy the user's information need.
       • When the query is ambiguous, it should return relevant documents for (some of) the main query topics.

  3. Related Research
     • [1] HITS: Authoritative Sources in a Hyperlinked Environment '97
     • [2] Topic Distillation: Improved Algorithms for Topic Distillation in a Hyperlinked Environment '98
     • [3] Related Page: Finding Related Pages in the World Wide Web '99
     • [4] Web Community: Inferring Web Communities from Link Topology '98
     • [5] Reputation: What Is This Page Known For? Computing Web Page Reputations '00

  4. HITS (Hyperlink Induced Topic Search)
     $$a(p) = \sum_{q \to p} h(q), \qquad h(p) = \sum_{p \to q} a(q)$$
     • Algorithm
       • Start with a root set $S_\sigma$.
         • $S_\sigma$ is relatively small (typically up to 200 pages).
         • $S_\sigma$ is rich in relevant pages.
         • $S_\sigma$ contains most (or many) of the strongest authorities.
       • Recursively compute the degree of authority $a(p)$ and hub $h(p)$ for each element.
     • (Figure: set S and set T.)
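The two update rules on this slide can be sketched as a short iteration. This is a minimal sketch, not the authors' implementation: the function names `hits` and `normalize`, the edge-list representation, the fixed iteration count, and the L2 normalization after each round are all assumptions made for illustration.

```python
import math


def normalize(scores):
    """Scale scores to unit L2 norm (assumed normalization step)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=20):
    """Basic HITS on a directed graph given as (source, target) pairs."""
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all q that link to p
        new_auth = {n: 0.0 for n in nodes}
        for q, p in edges:
            new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all q that p links to
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub
```

For example, with edges `[("h1", "a"), ("h2", "a"), ("h1", "b")]`, page `a` ends up with a higher authority score than `b`, and `h1` with a higher hub score than `h2`.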

  5. HITS (Hyperlink Induced Topic Search)
     • Premises
       • The implicit annotation provided by human creators contains sufficient information to infer authority.
       • Sufficiently broad topics contain embedded communities of hyperlinked pages.
     • Problems
       • Mutually Reinforcing Relationships: certain arrangements of documents "conspire" to dominate the computation.
       • Automatically Generated Links: no human opinion is expressed by the link.
       • Non-relevant Documents: the graph contains documents not relevant to the query topic.

  6. Improved Algorithm
     • Improved Connectivity Analysis
       • Mutually reinforcing relationships should have the same influence on a single document.
       $$a(p) = \sum_{q \to p} h(q) \times \mathrm{auth\_wt}(q,p), \qquad h(p) = \sum_{p \to q} a(q) \times \mathrm{hub\_wt}(p,q)$$
     • Pruning Nodes from Neighborhood Graph
       $$\mathrm{Similarity}(Q, D_j) = \frac{\sum_{i=1}^{t} w_{iq} \times w_{ij}}{\sqrt{\sum_{i=1}^{t} w_{iq}^{2} \times \sum_{i=1}^{t} w_{ij}^{2}}}$$
       • Relevance threshold:
         • Median Weight
         • Start Set Median Weight
         • Fixed Fraction of Maximum Weight
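The pieces on this slide can be sketched in a few helper functions. This is a hedged sketch under stated assumptions: the slide does not define `auth_wt` and `hub_wt`, so the host-based 1/k and 1/l weights below follow the scheme described in the SIGIR '98 paper as recalled here; the function names, the `host_of` mapping, the sparse term-weight dictionaries, and the default `fraction=0.1` are hypothetical choices for illustration.

```python
import math
from collections import defaultdict
from statistics import median


def edge_weights(edges, host_of):
    """Fractional edge weights (assumes `edges` holds unique pairs).

    auth_wt(q, p) = 1/k, k = documents on q's host that link to p.
    hub_wt(p, q)  = 1/l, l = documents on q's host that p links to.
    """
    k = defaultdict(int)
    l = defaultdict(int)
    for src, dst in edges:
        k[(host_of[src], dst)] += 1
        l[(src, host_of[dst])] += 1
    auth_wt = {(s, d): 1.0 / k[(host_of[s], d)] for s, d in edges}
    hub_wt = {(s, d): 1.0 / l[(s, host_of[d])] for s, d in edges}
    return auth_wt, hub_wt


def cosine_similarity(query_vec, doc_vec):
    """Similarity(Q, D_j) for sparse term -> weight dictionaries."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    norm = math.sqrt(
        sum(w * w for w in query_vec.values())
        * sum(w * w for w in doc_vec.values())
    )
    return dot / norm if norm else 0.0


def relevance_threshold(weights, start_set, policy, fraction=0.1):
    """Pick a pruning threshold from node -> relevance weight."""
    if policy == "median":
        return median(weights.values())
    if policy == "start_median":
        return median(weights[n] for n in start_set)
    if policy == "max_fraction":
        return fraction * max(weights.values())
    raise ValueError(f"unknown policy: {policy}")
```

Nodes whose relevance weight falls below the chosen threshold would then be dropped from the neighborhood graph before the weighted iteration runs.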

  7. Partial Content Analysis
     • Selectively analyze, and prune if needed, the nodes that are most influential in the outcome.
     • Query Q formation (use 30 documents)
       • Heuristic: in_degree + 2*num_query_matches + has_out_links
     • Pruning
       • Degree Based Pruning
         • Use 4*in_degree + out_degree as a measure of influence.
         • Fetch the top 100 nodes, score them against Q, and prune if needed.
       • Iterative Pruning
         • Use connectivity analysis itself (imp) to select nodes to prune.
         • Pruning happens over a sequence of rounds; each round runs imp for 10 iterations to get a ranked list.
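The two heuristics and the degree-based pruning step on this slide can be sketched as follows. This is an illustrative sketch only: the function names, the `degrees` mapping, and the `relevance` callable (standing in for fetching a page and scoring it against Q) are hypothetical, and nodes outside the top 100 are assumed to be kept without being fetched.

```python
def query_formation_score(in_degree, num_query_matches, has_out_links):
    """Heuristic for choosing the documents used to form the query Q."""
    return in_degree + 2 * num_query_matches + (1 if has_out_links else 0)


def influence(in_degree, out_degree):
    """Degree-based measure of a node's influence on the outcome."""
    return 4 * in_degree + out_degree


def top_influential(degrees, limit=100):
    """Return the `limit` most influential nodes.

    `degrees` maps node -> (in_degree, out_degree).
    """
    ranked = sorted(degrees, key=lambda n: influence(*degrees[n]), reverse=True)
    return ranked[:limit]


def degree_based_prune(degrees, relevance, threshold, limit=100):
    """Score only the most influential nodes and drop the irrelevant ones.

    `relevance` is a hypothetical callable node -> similarity to Q.
    Nodes that are not fetched are left in the graph.
    """
    fetched = top_influential(degrees, limit)
    pruned = {n for n in fetched if relevance(n) < threshold}
    return [n for n in degrees if n not in pruned]
```

Only the fetched nodes incur a content-analysis cost, which is the point of the partial approach: the rest of the graph is left untouched.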

  8. Evaluation
     • Table: Average Precision at Top 5 and 10 ranked authority documents
     • Table: Average Precision at Top 5 and 10 ranked hub documents
     • Column groups: Without Regulation, With Regulation, Partial
     • Column labels: base, imp, med, start, max, imp, med, start, max, pca0, pca1
     • Rows: All, Rare, Popular, each at 5 and at 10
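The measure reported on this slide can be sketched as precision over the top k ranked documents, averaged across queries. This is a minimal sketch under that reading of the table titles; the function names and the input shapes (a ranked list and a set of relevant documents per query) are assumptions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k


def average_precision_at_k(runs, k):
    """Mean precision at k over (ranked, relevant) pairs, one per query."""
    return sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
```

Calling `average_precision_at_k(runs, 5)` and `average_precision_at_k(runs, 10)` on the same runs gives the "At 5" and "At 10" figures for one algorithm.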

  9. Finding Related Pages in the WWW
     • Appeared at the 8th WWW Conference
     • Definition
       • A related web page is one that addresses the same topic as the original page.
       • For example, www.washingtonpost.com is a page related to www.nytimes.com.
     • Algorithms
       • Companion algorithm: derived from HITS.
       • Cocitation algorithm: finds pages that are frequently cocited with the input URL u.
     • Evaluation
       • In precision at 10, the Companion and Cocitation algorithms are 73% and 51% better, respectively, than Netscape's "What's Related".

  10. Companion Algorithm
     • Takes as input a URL u and consists of four steps:
       • Build a vicinity graph for u.
       • Contract duplicates and near-duplicates in this graph.
       • Compute edge weights based on host-to-host connections.
       • Compute hub and authority scores.

  11. Cocitation Algorithm
     • Degree of co-citation
       • The number of common parents of two nodes.
     • (Figure: sibling set of u.)
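The degree of co-citation defined on this slide can be sketched directly from an edge list. This is a minimal sketch: the function names, the `(parent, child)` edge representation, and the `top_n` cutoff are assumptions, and it omits any limits on how many parents or siblings a real system would examine.

```python
from collections import Counter


def cocitation_degrees(edges, u):
    """Count, for each sibling of u, the parents it shares with u.

    `edges` is an iterable of (parent, child) link pairs.
    """
    unique_edges = set(edges)
    parents = {src for src, dst in unique_edges if dst == u}
    return Counter(
        dst for src, dst in unique_edges if src in parents and dst != u
    )


def most_cocited(edges, u, top_n=10):
    """Siblings of u ordered by decreasing degree of co-citation."""
    return [node for node, _ in cocitation_degrees(edges, u).most_common(top_n)]
```

Pages that share no parent with `u` never enter the count, so they cannot be returned as related pages by this sketch.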
