
Enhanced topic distillation using text, markup tags, and hyperlinks


Presentation Transcript


  1. Enhanced topic distillation using text, markup tags, and hyperlinks
  Soumen Chakrabarti, Mukul Joshi, Vivek Tawde
  www.cse.iitb.ac.in/~soumen

  2. Topic distillation
  • Given a keyword query or some example URLs
  • Collect a relevant subgraph (community) of the Web
  • Bipartite reinforcement between hubs and authorities (a minimal iteration is sketched below)
  • Prototypes:
    • HITS and Clever
    • Bharat and Henzinger
  [Figure: keyword query → search engine → root set → expanded set]
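A minimal sketch of the HITS-style bipartite reinforcement the slide refers to; the toy edge list and iteration count are illustrative assumptions, not from the slide.

    import math

    def hits(edges, iterations=20):
        """edges: list of (hub_page, authority_page) hyperlinks."""
        nodes = {n for e in edges for n in e}
        hub = {n: 1.0 for n in nodes}
        auth = {n: 1.0 for n in nodes}
        for _ in range(iterations):
            # An authority is good if good hubs point to it...
            auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
            # ...and a hub is good if it points to good authorities.
            hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
            for d in (auth, hub):   # normalize so scores stay bounded
                norm = math.sqrt(sum(x * x for x in d.values())) or 1.0
                for n in d:
                    d[n] /= norm
        return hub, auth

    # Two hubs co-citing one authority: a1 ends up ranked above a2.
    hub, auth = hits([("h1", "a1"), ("h2", "a1"), ("h2", "a2")])
    print(sorted(auth.items(), key=lambda kv: -kv[1]))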

  3. Challenges and limitations
  • Web authoring style in flux since 1996
    • Complex pages generated from templates
    • File or page boundary less meaningful
  • “Clique attacks”: rampant multi-host ‘nepotism’ via rings, ads, banner exchanges
  • Models are too simplistic
    • Hub and authority symmetry is illusory
    • Coarse-grain hub model ‘leaks’ authority
    • Ad-hoc linear segmentation is not content-aware
  • Deteriorating results of topic distillation

  4. Clique attacks!
  [Figure: irrelevant links form a pseudo-community; relevant regions lead to inclusion of the page in the base set]

  5. Benign drift and generalization
  [Figure: a mixed hub page; one section specializes on ‘Shakespeare’ while the remaining sections generalize and/or drift]

  6. A new fine-grained model
  [Figure: the Document Object Model (DOM) of a page: html → head/body → tables and lists. A frontier of differentiation separates a relevant subtree (anchors to www.fromages.com and www.teddingtoncheese.co.uk) from an irrelevant subtree (anchors to ski.qaz.com and art.qaz.com)]

    <html>…<body>…
      <table …>
        <tr><td>
          <table …>
            <tr><td><a href="http://art.qaz.com">art</a></td></tr>
            <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>…
          </table>
        </td></tr>
        <tr><td>
          <ul>
            <li><a href="http://www.fromages.com">Fromages.com</a> French cheese…</li>
            <li><a href="http://www.teddingtoncheese.co.uk">Teddington…</a> Buy online…</li>
            …
          </ul>…
        </td></tr>
      </table>…
    </body></html>
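To make the fine-grained unit concrete, here is a minimal sketch, using only Python's standard html.parser, of turning markup like the snippet above into a DOM tree and reading off each subtree's outgoing links. DomNode and DomBuilder are hypothetical names, and the sketch assumes well-formed markup; the slide does not specify the system's parser.

    from html.parser import HTMLParser

    class DomNode:
        def __init__(self, tag, parent=None):
            self.tag, self.parent = tag, parent
            self.children, self.hrefs = [], []

    class DomBuilder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.root = DomNode("#root")
            self.cur = self.root
        def handle_starttag(self, tag, attrs):
            node = DomNode(tag, self.cur)
            self.cur.children.append(node)
            if tag == "a":   # record hyperlinks at the anchor's DOM node
                node.hrefs.extend(v for k, v in attrs if k == "href")
            self.cur = node
        def handle_endtag(self, tag):
            if self.cur is not self.root and self.cur.tag == tag:
                self.cur = self.cur.parent

    def links_under(node):
        """All hyperlinks in the subtree rooted at this DOM node."""
        out = list(node.hrefs)
        for c in node.children:
            out.extend(links_under(c))
        return out

    b = DomBuilder()
    b.feed('<html><body><ul><li><a href="http://www.fromages.com">'
           'Fromages.com</a></li></ul></body></html>')
    print(links_under(b.root))   # ['http://www.fromages.com']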

  7. Generative model for hub text
  • Global hub text distribution Θ0 relevant to the given query
  • Authors use internal DOM nodes to specialize Θ0 into node-local distributions Θv
  • At a certain frontier in the DOM tree, the local distribution directly generates the text in ‘hot’ and ‘cold’ subtrees
  [Figure: the global term distribution Θ0 is progressively ‘distorted’ down the DOM tree to the model frontier; other pages share Θ0]

  8. A balanced cost measure
  • Reference distribution Θ0 at the root
  • Cumulative distortion cost along a root-to-frontier path:
    KL(Θ0; Θu) + … + KL(Θu; Θv)   (for the exponential distribution family)
  • Data encoding cost is roughly the negative log-likelihood −log Pr(Dv | Θv) of the subtree text Dv under the frontier model
  • Goal: find the minimum-cost frontier (see the sketch below)
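A minimal sketch of the two cost terms in Python: KL distortion for specializing a term distribution, plus the cost of encoding a subtree's text with the frontier model. The smoothing constant eps and the toy distributions are illustrative assumptions, not from the paper.

    import math
    from collections import Counter

    def kl(p, q, eps=1e-9):
        """KL(p; q) over the joint vocabulary, eps for unseen terms."""
        vocab = set(p) | set(q)
        return sum(p.get(t, eps) * math.log(p.get(t, eps) / q.get(t, eps))
                   for t in vocab)

    def encoding_cost(tokens, model, eps=1e-9):
        """Roughly -log Pr(D_v | Theta_v), the text's code length."""
        return -sum(math.log(model.get(t, eps)) for t in tokens)

    def distribution(text):
        counts = Counter(text.split())
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}

    theta0 = distribution("cheese wine cheese france ski travel")
    theta_v = distribution("cheese cheese france brie")     # specialized
    d_v = "cheese brie france".split()                      # subtree text

    # Total cost of placing the frontier at v (a one-step path here).
    print(kl(theta0, theta_v) + encoding_cost(d_v, theta_v))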

  9. Marking ‘hot’ subtrees
  • Hard to solve exactly (knapsack)
  • A (1+ε) dynamic programming solution exists, but is too slow for 10 million DOM nodes
  • Greedy expansion approach (sketched below): at each node v, compare the cost of
    • directly encoding Dv w.r.t. the model Θv at v, against
    • first distorting Θv to Θw for each child w of v, then encoding each Dw w.r.t. its respective Θw
  • If the latter is smaller, expand v; else prune
  • Mark relevance subtrees as “must-prune”
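The sketch referenced above: a hedged Python rendering of the greedy expand-or-prune rule. Node, cost_prune, and distort are hypothetical stand-ins; in the real system the costs come from DOM text as on the previous slide.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        children: list = field(default_factory=list)
        must_prune: bool = False   # relevance subtrees are forced to prune

    def best_frontier(v, cost_prune, distort):
        """Return (cost, frontier node names) for the subtree rooted at v."""
        prune_cost = cost_prune(v)    # encode D_v directly w.r.t. Theta_v
        if v.must_prune or not v.children:
            return prune_cost, {v.name}
        results = [best_frontier(w, cost_prune, distort) for w in v.children]
        expand_cost = sum(distort(v, w) + c   # distort Theta_v to Theta_w,
                          for (c, _), w in zip(results, v.children))
        if expand_cost < prune_cost:          # then encode each D_w below
            return expand_cost, set().union(*(f for _, f in results))
        return prune_cost, {v.name}   # stop: v itself is a frontier micro-hub

    # Toy tree: expanding the body (cost 1+3 + 1+2 = 7) beats pruning (10).
    root = Node("body", [Node("ul-cheese"), Node("table-ads", must_prune=True)])
    costs = {"body": 10.0, "ul-cheese": 3.0, "table-ads": 2.0}
    print(best_frontier(root, lambda v: costs[v.name], lambda v, w: 1.0))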

  10. Exploiting co-citation in our model
  1. Initial values of leaf hub scores = target authority scores
  2. Must-prune nodes are marked
  3. Frontier micro-hubs accumulate the scores of the leaves beneath them
  4. Aggregate hub scores are copied back to the leaves, so targets co-cited with ‘known’ authorities (which we have reason to believe could be good too) pick up score (see the sketch below)
  • This is a non-linear transform, unlike HITS
  [Figure: the four steps on a DOM tree, with illustrative scores such as 0.10 and 0.20 spreading to co-cited leaves]
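A minimal sketch of the smoothing step: leaf hub scores are summed at their frontier micro-hub and the aggregate is copied back to the leaves, so a co-cited but unknown target inherits score. The scores and the plain-sum rule here are illustrative; the paper's exact redistribution rule may differ.

    def smooth_scores(frontier_to_leaves, leaf_score):
        """frontier_to_leaves: {micro-hub: [link leaf, ...]};
        leaf_score: initial leaf hub score = its target's authority score."""
        smoothed = {}
        for frontier, leaves in frontier_to_leaves.items():
            total = sum(leaf_score.get(l, 0.0) for l in leaves)
            for l in leaves:
                smoothed[l] = total   # aggregate copied back to every leaf
        return smoothed

    # Two 'known' authorities under one micro-hub lift their unknown sibling.
    print(smooth_scores({"ul": ["a1", "a2", "a3"]},
                        {"a1": 0.10, "a2": 0.20, "a3": 0.0}))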

  11. Complete algorithm
  • Collect root set and base set
  • Pre-segment using text and mark relevant micro-hubs to be pruned
  • Initialize authority scores to 1 for root-set pages only
  • Iterate (see the driver sketch below):
    • Transfer from authorities to hub leaves
    • Re-segment hub DOM trees using links + text
    • Smooth and redistribute hub scores
    • Transfer from hub leaves to authority roots
  • Report top authorities and ‘hot’ micro-hubs
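A structural sketch only of the loop above: each step is an injected callable standing in for a component from the earlier slides, with hypothetical signatures; this is an outline under assumptions, not the authors' implementation.

    def distill(root_set, iterations,
                transfer_to_hub_leaves,    # authority scores -> leaf hub scores
                resegment,                 # re-run DOM segmentation (link + text)
                smooth,                    # aggregate/redistribute at frontiers
                transfer_to_authorities):  # leaf hub scores -> authority scores
        auth = {p: 1.0 for p in root_set}   # only root-set pages start at 1
        for _ in range(iterations):
            leaf_hub = transfer_to_hub_leaves(auth)
            frontiers = resegment(leaf_hub)
            leaf_hub = smooth(frontiers, leaf_hub)
            auth = transfer_to_authorities(leaf_hub)
        return auth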

  12. Experimental setup
  • Large data sets
    • 28 queries from Clever, >20 topics from Dmoz
    • Collect 2000…10000 pages per query/topic
    • Several million DOM nodes and fine-grained links
  • Find top authorities using various algorithms
  • For ad-hoc queries, measure the cosine similarity of authorities to the root-set centroid in vector space (see the sketch below)
  • For Dmoz, use an automatic classifier…
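A minimal sketch of the ad-hoc evaluation measure: cosine similarity between an authority's term vector and the root-set centroid. The toy term-frequency vectors are illustrative assumptions.

    import math

    def centroid(vectors):
        out = {}
        for v in vectors:
            for t, w in v.items():
                out[t] = out.get(t, 0.0) + w / len(vectors)
        return out

    def cosine(u, v):
        dot = sum(u.get(t, 0.0) * w for t, w in v.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    root_set = [{"cheese": 2, "france": 1}, {"cheese": 1, "wine": 1}]
    print(cosine(centroid(root_set), {"cheese": 3, "brie": 1}))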

  13. Avoiding topic drift via micro-hubs
  [Figure: two runs side by side. Query ‘cycling’: no danger of topic drift. Query ‘affirmative action’: topic drift from software sites]

  14. Results for the Clever benchmark
  • Take the top 40 authorities
  • Find their average cosine similarity to the root-set centroid
  • HITS < DOM+Text < DOM in similarity
  • DOM alone cannot prune well enough: most of its top authorities come from the root set
  • HITS drifts often

  15. Dmoz experiments and results
  • 223 topics from http://dmoz.org
  • Sample root set URLs from a class c
  • Top authorities not in the root set are submitted to the Rainbow classifier
  • Σd Pr(c | d) is the expected number of relevant documents (see the micro-example below)
  • DOM+Text performs best
  [Figure: pipeline from a Dmoz class such as Music: sample a root set, expand it, train the Rainbow classifier, test the top authorities]
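A worked micro-example of the metric: summing the classifier's Pr(c | d) over the returned documents gives the expected number that are relevant. The probabilities below are invented for illustration.

    # Hypothetical Rainbow outputs Pr(c | d) for four top authorities.
    pr_c_given_d = [0.9, 0.7, 0.2, 0.1]
    # Expected number of relevant documents = sum of the probabilities.
    print(sum(pr_c_given_d))   # 1.9 of the 4 are expected to be relevant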

  16. Anecdotes
  • “amusement parks”: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
  • The new algorithm reduces this drift
  • Mixed hubs are accurately segmented, e.g. amusement parks, classical guitar, Shakespeare, and sushi
  • Mixed hubs appeared in the top 50 for 13/28 queries

  17. Conclusion and ongoing work
  • Hypertext shows complex idioms, missed by the coarse-grained graph model
  • Enhanced fine-grained distillation:
    • Identifies content-bearing ‘hot’ micro-hubs
    • Disaggregates hub scores
    • Reduces topic drift via mixed hubs and pseudo-communities
  • Application: topic-based focused crawling
  • Need: probabilistic combination of evidence from text and links
