Improved Algorithms for Topic Distillation in a Hyperlinked Environment

Improved Algorithms for Topic Distillation in a Hyperlinked Environment (ACM SIGIR ’98)
Ruey-Lung Hsiao, Nov 23, 2000

Topic Distillation on the WWW
• Definition
  • Given a typical user query, find quality documents related to the query topic.
• Characteristics
  • More general than finding a precise query match
  • Less ambitious than trying to exactly satisfy the user's information need
  • When the query is ambiguous, it should return relevant documents for (some of) the main query topics.

Related Research
• [1] HITS: Authoritative Sources in a Hyperlinked Environment, ’97
• [2] Topic Distillation: Improved Algorithms for Topic Distillation in a Hyperlinked Environment, ’98
• [3] Related Page: Finding Related Pages in the World Wide Web, ’99
• [4] Web Community: Inferring Web Communities from Link Topology, ’98
• [5] Reputation: What Is This Page Known For? Computing Web Page Reputations, ’00

HITS (Hyperlink Induced Topic Search)
• Algorithm
  • Start with a root set S
    • S is relatively small (typically up to 200 pages)
    • S is rich in relevant pages
    • S contains most (or many) of the strongest authorities.
  • Iteratively compute the authority score a(p) and hub score h(p) of each page p:
    $a(p) = \sum_{q \to p} h(q)$
    $h(p) = \sum_{p \to q} a(q)$
[Figure: set S and set T]
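The authority and hub updates above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function names `hits` and `normalize`, the edge-list input format, the fixed iteration count, and the L2 normalization after each round are all assumptions made for the example.

```python
import math


def normalize(scores):
    """Scale scores to unit L2 norm (assumed normalization, for illustration)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=20):
    """Run the basic HITS updates on a list of directed (source, target) edges."""
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all pages q that link to p
        new_auth = {n: 0.0 for n in nodes}
        for q, p in edges:
            new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all pages q that p links to
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub
```

On a toy graph where two pages link to `a` and only one links to `b`, `a` ends up with the higher authority score, and the page that links to both ends up with the higher hub score.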

HITS (Hyperlink Induced Topic Search)
• Premises
  • The implicit annotation provided by human creators of hyperlinks contains sufficient information to infer authority.
  • Sufficiently broad topics contain embedded communities of hyperlinked pages.
• Problems
  • Mutually Reinforcing Relationships: certain arrangements of documents “conspire” to dominate the computation.
  • Automatically Generated Links: no human opinion is expressed by the link.
  • Non-relevant Documents: the graph contains documents not relevant to the query topic.

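Improved Algorithm
• Improved Connectivity Analysis
  • Documents in a mutually reinforcing relationship should together have the same influence on a single document as one document would.
    $a(p) = \sum_{q \to p} h(q) \times \mathrm{auth\_wt}(q, p)$
    $h(p) = \sum_{p \to q} a(q) \times \mathrm{hub\_wt}(p, q)$
• Pruning Nodes from the Neighborhood Graph
  • Relevance of document $D_j$ to query $Q$:
    $\mathrm{Similarity}(Q, D_j) = \frac{\sum_{i=1}^{t} w_{iq} \times w_{ij}}{\sqrt{\sum_{i=1}^{t} w_{iq}^{2} \times \sum_{i=1}^{t} w_{ij}^{2}}}$
  • Relevance threshold:
    • Median Weight
    • Start Set Median Weight
    • Fixed Fraction of Maximum Weight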
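One way to realize the edge weights is sketched below. It assumes the host-based scheme from the original paper: if k documents on one host link to the same document, each of those edges gets an authority weight of 1/k, and if one document links to l documents on the same host, each of those edges gets a hub weight of 1/l. The helper names `host`, `edge_weights`, and `weighted_step`, and the use of the URL's network location as the host, are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def host(url):
    """Host part of a URL (assumed definition of 'host' for this sketch)."""
    return urlparse(url).netloc


def edge_weights(edges):
    """Return (auth_wt, hub_wt), each keyed by the (source, target) edge."""
    # k: edges from documents on one host into a single target document
    into_target = Counter((host(q), p) for q, p in edges)
    # l: edges from a single source document to documents on one host
    out_of_source = Counter((q, host(p)) for q, p in edges)
    auth_wt = {(q, p): 1.0 / into_target[(host(q), p)] for q, p in edges}
    hub_wt = {(q, p): 1.0 / out_of_source[(q, host(p))] for q, p in edges}
    return auth_wt, hub_wt


def weighted_step(edges, auth, hub, auth_wt, hub_wt):
    """One unnormalized round of the weighted authority and hub updates."""
    # a(p) = sum over q -> p of h(q) * auth_wt(q, p)
    new_auth = dict.fromkeys(auth, 0.0)
    for q, p in edges:
        new_auth[p] += hub[q] * auth_wt[(q, p)]
    # h(p) = sum over p -> q of a(q) * hub_wt(p, q)
    new_hub = dict.fromkeys(hub, 0.0)
    for p, q in edges:
        new_hub[p] += new_auth[q] * hub_wt[(p, q)]
    return new_auth, new_hub
```

With these weights, two pages on the same host that both link to one document contribute as much authority in total as a single page on another host would.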

Partial Content Analysis
• Selectively analyze, and prune if needed, the nodes that are most influential in the outcome.
• Query Q formation (uses 30 documents)
  • Heuristic: in_degree + 2*num_query_matches + has_out_links
• Pruning
  • Degree Based Pruning
    • Use 4*in_degree + out_degree as a measure of influence
    • Fetch the top 100 nodes, score them against Q, and prune if needed.
  • Iterative Pruning
    • Use connectivity analysis itself (imp) to select nodes to prune.
    • Pruning happens over a sequence of rounds; each round runs imp for 10 iterations to get a ranked list.
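Degree based pruning can be sketched as follows, under several assumptions: documents and the query are represented as hypothetical term-to-weight dictionaries, relevance is the cosine similarity between those vectors, and a node is pruned when its similarity falls below a caller-supplied threshold. The function names and the data layout are illustrative only; how term weights are computed and how the threshold is chosen are left open here.

```python
import math


def influence(node, in_degree, out_degree):
    """Influence measure from the slide: 4*in_degree + out_degree."""
    return 4 * in_degree.get(node, 0) + out_degree.get(node, 0)


def cosine_similarity(query_weights, doc_weights):
    """Cosine similarity between two term -> weight dictionaries."""
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    q_norm = sum(w * w for w in query_weights.values())
    d_norm = sum(w * w for w in doc_weights.values())
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / math.sqrt(q_norm * d_norm)


def degree_based_prune(nodes, in_degree, out_degree, doc_vectors,
                       query_weights, threshold, top_n=100):
    """Score only the top_n most influential nodes and drop the irrelevant ones."""
    ranked = sorted(nodes,
                    key=lambda n: influence(n, in_degree, out_degree),
                    reverse=True)
    pruned = {
        n for n in ranked[:top_n]
        if cosine_similarity(query_weights, doc_vectors[n]) < threshold
    }
    # Nodes outside the top_n are never fetched or scored, so they are kept.
    return [n for n in nodes if n not in pruned]
```

Only the most influential nodes are fetched and scored, which is what keeps the content analysis partial: the remaining nodes stay in the graph without their content being examined.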

Evaluation
[Table: Average precision at top 5 and top 10 ranked authority documents]
[Table: Average precision at top 5 and top 10 ranked hub documents]
Column groups: Without Regulation, With Regulation, Partial. Column labels: base, imp, med, start, max, pca0, pca1. Rows: All, Rare, and Popular queries, each at 5 and at 10.

Finding Related Pages in the WWW
• Appeared at the 8th WWW Conference
• Definition
  • A related web page is one that addresses the same topic as the original page.
  • For example, www.washingtonpost.com is a page related to www.nytimes.com.
• Algorithms
  • Companion algorithm: derived from HITS.
  • Cocitation algorithm: finds pages that are frequently cocited with the input URL u.
• Evaluation
  • Precision at 10 for the two proposed algorithms is 73% and 51% better than that of Netscape’s “What’s Related”.

Companion Algorithm
• Takes as input a URL u and consists of four steps:
  • Build a vicinity graph for u.
  • Contract duplicates and near-duplicates in this graph.
  • Compute edge weights based on host-to-host connections.
  • Compute hub and authority scores.
[Figure: vicinity graph around u]

Cocitation Algorithm
• Degree of co-citation
  • The number of common parents of two nodes.
[Figure: u and its sibling set]
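The degree of co-citation can be computed directly from a link graph, as in the minimal sketch below. The function name `cocitation_degrees` and the edge-list input are assumptions for illustration, and the sketch ranks every page that shares at least one parent with u; any limits on how many parents or siblings are examined are not modeled here.

```python
from collections import defaultdict


def cocitation_degrees(edges, u):
    """Rank pages by the number of parents they share with the input URL u."""
    # Map each page to the set of pages that link to it (its parents).
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
    u_parents = parents.get(u, set())
    degrees = {}
    for node, node_parents in parents.items():
        if node == u:
            continue
        common = len(u_parents & node_parents)
        if common:
            degrees[node] = common
    # Highest degree of co-citation first.
    return sorted(degrees.items(), key=lambda item: item[1], reverse=True)
```

Pages that share no parent with u are left out of the result, so the returned list corresponds to the siblings of u ordered by how often they are cocited with it.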
