
Focused Crawling and Collection Synthesis


Presentation Transcript


  1. Focused Crawling and Collection Synthesis Donna Bergmark Cornell Information Systems CUL Metadata WG Meeting

  2. Outline • Crawlers • Collection Synthesis • Focused Crawling • Some Results • Student Project (Fall 2002)

  3. Definition Spider = robot = crawler. Crawlers are computer programs that roam the Web, automating specific Web-related tasks.

  4. Crawlers – some background • Resource discovery • Crawlers and internet history • Crawling and crawlers • Mercator

  5. Resource Discovery • Finding info on the Web • Surfing (random strategy, goal is serendipity) • Searching (inverted indices; specific info) • Crawling (“all” the info) • Uses for crawling • Find stuff • Gather stuff • Check stuff

  6. Crawlers and internet history • 1991: HTTP • 1992: 26 servers • 1993: 60+ servers; self-register; archie • 1994 (early) – first crawlers • 1996 – search engines abound • 1998 – focused crawling • 1999 – web graph studies • 2002 – use for digital libraries

  7. Crawling and Crawlers • The Web overlays the internet • A crawl overlays the Web, spreading outward from a seed

  8. Crawler Issues • The web is so big • Visit Order • The URL itself • Politeness • Robot Traps • The hidden web • System Considerations
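
To make these issues concrete, here is a minimal Python sketch (not from the slides, and not Mercator's API) of the crawl loop they all revolve around; `fetch` and `extract_links` are hypothetical stand-ins supplied by the caller:

```python
from collections import deque

# Minimal crawl-loop sketch. Visit order here is FIFO (breadth-first);
# a priority queue would let a focused crawler pick the most promising
# URL from the frontier instead.
def crawl(seeds, fetch, extract_links, max_pages=1000):
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # guards against cycles (one kind of robot trap)
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        page = fetch(url)     # politeness (delays, robots.txt) belongs here
        if page is None:
            continue          # fetch failed or was disallowed
        pages.append((url, page))
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```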

  9. Standard for Robot Exclusion • Martijn Koster (1994) • http://any-server:80/robots.txt • Maintained by the webmaster • Forbid access to pages, directories • Commonly excluded: /cgi-bin/ • Adherence is voluntary for the crawler
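
Python's standard library can consult this file directly; a small sketch, reusing the slide's example URL:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the webmaster-maintained exclusion file.
rp = RobotFileParser()
rp.set_url("http://any-server:80/robots.txt")
rp.read()

# Adherence is voluntary: the crawler must ask before fetching.
if rp.can_fetch("MyCrawler", "http://any-server:80/cgi-bin/search"):
    print("allowed")
else:
    print("disallowed (e.g. /cgi-bin/ is commonly excluded)")
```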

  10. Robot Traps • Cycles in the Web graph • Infinite links on a page • Traps set out by the Webmaster

  11. The Hidden Web • Dynamic pages increasing • Subscription pages • Username and password pages • Research in progress on how crawlers can “get into” the hidden web

  12. System Issues • Crawlers are complicated systems • Efficiency is of utmost importance • Crawlers are demanding of system and network resources

  13. [Figure slide: image only, no recoverable text.]

  14. Mercator Features • Written in Java • One file configures a crawl • Can add your own code • Extend one or more of Mercator’s base classes • Add totally new classes called by your own • Industrial-strength crawler: uses its own DNS and java.net package

  15. Collection Synthesis • The NSDL • National Science Digital Library • Educational materials for K-through-grave • A collection of digital collections • Collection (automatically derived): 20-50 items on a topic, represented by their URLs, expository in nature; precision trumps recall

  16. Crawler is the Key • A general search engine is good for precise results, few in number • A search engine must cover all topics, not just scientific • For automatic collection assembly, a Web crawler is needed • A focused crawler is the key

  17. Focused Crawling

  18. [Diagram: two crawl trees rooted at seed R. A breadth-first crawl visits pages 1-7 level by level; a focused crawl visits on-topic pages 1-5 and prunes off-topic branches (marked X).]

  19. Collections and Clusters • Traditional – document universe is divided into clusters, or collections • Each collection represented by its centroid • Web – size of document universe is infinite • Agglomerative clustering is used instead • Two aspects: • Collection descriptor • Rule for when items belong to that Collection

  20. [Figure: two example clusterings, labeled Q = 0.2 and Q = 0.6.]

  21. The Setup: a virtual collection of items about Chebyshev Polynomials

  22. Adding a Centroid: an empty collection of items about Chebyshev Polynomials

  23. Document Vector Space • Classic information retrieval technique • Each word is a dimension in N-space • Each document is a vector in N-space • Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001> • Normalize the weights • Both the “centroid” and the downloaded document are term vectors
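
A minimal sketch of the idea, assuming simple whitespace tokenization and raw term counts as weights; a real system would use the tf-idf weights described on the following slides:

```python
import math
from collections import Counter

def term_vector(text):
    # Each distinct word is a dimension; weights are normalized to unit length.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def cosine(v, w):
    # For unit vectors, the dot product is the cosine correlation.
    return sum(weight * w.get(term, 0.0) for term, weight in v.items())

doc = term_vector("chebyshev polynomials are orthogonal polynomials")
centroid = term_vector("chebyshev polynomials recurrence orthogonal")
print(round(cosine(doc, centroid), 3))
```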

  24. Agglomerate: a collection with 3 items about Chebyshev Polynomials

  25. Where does the Centroid come from? [Diagram: the query “Chebyshev Polynomials” yields a really good centroid for a collection about Chebyshev Polynomials.]

  26. Building a Centroid
  1. Google(“Chebyshev Polynomials”) → {url1 … urln}
  2. Let H be a hash (k, v) where k = word, v = frequency
  3. For each url in {url1 … urln}: D ← download(url); V ← term vector(D); for each term t in V: if t is not in H, add it; H(t)++
  4. Compute tf-idf weights. C ← top 20 terms.
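
The recipe above as a Python sketch; `search`, `download`, and `tokenize` are hypothetical stand-ins for the Google query, the page fetch, and term extraction, and the classic log(N/df) idf is assumed:

```python
import math
from collections import Counter

def build_centroid(query, search, download, tokenize, top_k=20):
    urls = search(query)             # step 1: query -> {url1 ... urln}
    h = Counter()                    # step 2: H maps word -> frequency
    doc_freq = Counter()             # in how many pages each term occurs
    for url in urls:                 # step 3: accumulate term frequencies
        terms = tokenize(download(url))
        h.update(terms)
        doc_freq.update(set(terms))
    n = len(urls)                    # step 4: tf-idf weights, keep top 20
    weights = {t: tf * math.log(n / doc_freq[t]) for t, tf in h.items()}
    top = sorted(weights, key=weights.get, reverse=True)[:top_k]
    return {t: weights[t] for t in top}
```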

  27. Dictionary • Given centroids C1, C2, C3 … • Dictionary is C1 + C2 + C3 … • Terms are union of terms in Ci • Term Frequencies are total frequency in Ci • Document Frequency is how many C’s have t • Term IDF is as from Berkeley • Dictionary is 300-500 terms
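
A short sketch of the merge, assuming each centroid is a mapping from term to frequency (the Berkeley idf formula itself is not reproduced here):

```python
from collections import Counter

def build_dictionary(centroids):
    term_freq = Counter()   # term frequencies sum across centroids
    doc_freq = Counter()    # document frequency: how many centroids have t
    for c in centroids:     # terms are the union across all centroids
        term_freq.update(c)
        doc_freq.update(set(c))
    return term_freq, doc_freq
```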

  28. Focused Crawling • Recall the cartoon for a focused crawl (seed R, on-topic pages kept, off-topic branches pruned with X) • A simple way to do it is with 2 “knobs”

  29. Focusing the Crawl • Threshold: page is on-topic if correlation to the closest centroid is above this value • Cutoff: follow links from pages whose “distance” from closest on-topic ancestor is less than the cutoff
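
Put together, the two knobs reduce to a few lines; a sketch, where the exact boundary convention (< vs <=) is an assumption:

```python
THRESHOLD = 0.3   # on-topic if correlation to closest centroid >= this
CUTOFF = 1        # how far past an on-topic page the crawler may wander

def visit(correlation, parent_distance):
    # Distance resets to 0 on an on-topic page, else grows by one per link.
    on_topic = correlation >= THRESHOLD
    distance = 0 if on_topic else parent_distance + 1
    follow_links = distance <= CUTOFF
    return on_topic, distance, follow_links
```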

  30. Illustration [Diagram: a crawl with cutoff = 1; pages with corr >= threshold are on-topic, and branches more than one link past an on-topic page are pruned (X).]

  31. [Figure: items ordered from closest to furthest from the centroid.]

  32. Collection “Evaluation” • Assume higher correlations are good • With human relevance assessments, one can also compute a “precision” curve • Precision P(n) after considering the n most highly ranked items is the number of relevant items divided by n.
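
The precision curve is just this per-rank ratio; a small sketch with made-up URLs:

```python
def precision_at(relevant, ranked, n):
    # P(n): relevant items among the n most highly ranked, divided by n.
    return sum(1 for item in ranked[:n] if item in relevant) / n

ranked = ["u1", "u2", "u3", "u4"]   # sorted by correlation, best first
relevant = {"u1", "u3"}             # human relevance assessments
print(precision_at(relevant, ranked, 4))  # 2 of 4 relevant -> 0.5
```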

  33. [Figure: crawl results with cutoff = 0 and threshold = 0.3.]

  34. [Figure slide: image only, no recoverable text.]

  35. Tunneling with Cutoff • Nugget – dud – dud – … – dud – nugget • Notation: 0 – X – X – … – X – 0 • Fixed cutoff: 0 – X1 – X2 – … – Xc • Adaptive cutoff: 0 – X1 – X2 – … – X?

  36. Statistics Collected • 500,000 documents • Number of seeds: 4 • Path data for all but seeds • 6620 completed paths (0 – X – … – X – 0) • 100,000s of incomplete paths (0 – X – … – X – …)

  37. [Chart: nuggets that are x steps from a nugget.]

  38. [Chart: nuggets that are x steps from a seed and/or a nugget.]

  39. Better parents have better children.

  40. Using the Empirical Observations • Use the path history • Use the page quality (cosine correlation) • Current distance should increase exponentially as you get away from quality nodes • Distance = 0 if this page is a nugget; otherwise Distance = max(1, (1 − corr) × e^(2 × parent’s distance / cutoff))
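
A sketch of that rule, where reading the slide's “1 or …” as “at least 1” is an assumption:

```python
import math

def adaptive_distance(is_nugget, corr, parent_distance, cutoff):
    if is_nugget:
        return 0.0   # a quality (on-topic) page resets the distance
    # Otherwise distance grows exponentially with the parent's distance,
    # scaled by how far the page's correlation falls short of 1.
    return max(1.0, (1.0 - corr) * math.exp(2.0 * parent_distance / cutoff))
```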

  41. Results • Details in the ECDL paper • Smaller frontier → more docs/second • More documents downloaded in same time • Higher-scoring documents were downloaded • Cutoff of 20 averaged 7 steps at the cutoff

  42. Fall 2002 Student Project [Diagram: a query (“Chebyshev P.s”) is turned into a centroid; Mercator crawls HTML pages into term vectors; outputs include the centroids, dictionary, collection URLs, and a collection description.]

  43. Conclusion • We’ve covered crawling – history, technology, use • Focused crawling with tunneling • Adaptive cutoff with tunneling • We have a good experimental setup for exploring automatic collection synthesis
