
User-Centric Web Crawling

Presentation Transcript


  1. User-Centric Web Crawling • Sandeep Pandey & Christopher Olston, Carnegie Mellon University

  2. Web Crawling • One important application (our focus): search • Topic-specific search engines + general-purpose ones • [Diagram: a user issues search queries against an index built from a repository, which a crawler populates from the WWW]

  3. Out-of-date Repository • The Web is always changing [Arasu et al., TOIT’01] • 23% of Web pages change daily • 40% of commercial Web pages change daily • An out-of-date repository causes many problems • Hurts both precision and recall

  4. Web Crawling as an Optimization Problem • Not enough resources to (re)download every web document every day/hour • Must pick and choose → an optimization problem • Prior approaches: objective function = avg. freshness or age • Our goal: focus directly on the impact on users • [Same crawler/repository/index/user diagram as slide 2]

  5. Web Search User Interface • User enters keywords • Search engine returns a ranked list of results • User visits a subset of the result documents • [Figure: mock results page showing a ranked list]

  6. Objective: Maximize Repository Quality (as perceived by users) • Suppose a user issues search query q: Quality_q = Σ_D ( (likelihood of viewing D) × (relevance of D to q) ), summed over documents D • Given a workload W of user queries: Average quality = (1/K) × Σ_{q ∈ W} ( freq_q × Quality_q )
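
A minimal sketch of this metric in Python, assuming hypothetical helpers rank_of, view_prob, and relevance (none of these names come from the paper); K is taken here to be the total query frequency, so freq_q / K acts as the probability of query q:

```python
# Sketch of the repository-quality metric (illustrative only).
# Assumed helpers (not from the paper):
#   rank_of(d, q)   -- rank of document d in the result list for query q
#   view_prob(r)    -- likelihood that a user views the result at rank r
#   relevance(d, q) -- scoring-function value of d for q, in [0, 1]

def query_quality(docs, q, rank_of, view_prob, relevance):
    """Quality_q = sum over documents D of P(view D) * relevance(D, q)."""
    return sum(view_prob(rank_of(d, q)) * relevance(d, q) for d in docs)

def average_quality(docs, workload, rank_of, view_prob, relevance):
    """Average quality = (1/K) * sum over q in W of freq_q * Quality_q,
    with workload W given as a list of (query, freq_q) pairs."""
    k = sum(freq for _, freq in workload)  # normalizer (assumed: total query count)
    return sum(freq * query_quality(docs, q, rank_of, view_prob, relevance)
               for q, freq in workload) / k
```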

  7. Viewing Likelihood • Depends primarily on rank in the result list [Joachims KDD’02] • From AltaVista data [Lempel et al. WWW’03]: the probability of viewing a result falls off sharply with rank, approximated by ViewProbability(r) ∝ r^(–1.5) • [Plot: probability of viewing vs. rank, with the fitted r^(–1.5) curve]
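
The slides give only the shape of this curve; a one-line sketch with an illustrative (not fitted) constant:

```python
def view_prob(rank, c=1.0):
    """Power-law viewing likelihood: proportional to rank**(-1.5).
    The constant c is a placeholder, not a value fitted in the paper."""
    return c * rank ** -1.5
```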

  8. Relevance Scoring Function • A search engine’s internal notion of how well a document matches a query • Each (document, query) pair → numerical score ∈ [0, 1] • Combination of many factors, including: • Vector-space similarity (e.g., TF.IDF cosine metric) • Link-based factors (e.g., PageRank) • Anchortext of referring pages
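
As an illustration of the first factor, a simplified TF.IDF cosine similarity between a document and a query (a textbook sketch, not the engine's actual scoring function):

```python
import math
from collections import Counter

def tfidf_cosine(doc_tokens, query_tokens, doc_freq, num_docs):
    """Cosine similarity of TF.IDF vectors for a document and a query.
    doc_freq maps each term to the number of documents containing it."""
    def weights(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * math.log(num_docs / (1 + doc_freq.get(t, 0)))
                for t in tf}
    d, q = weights(doc_tokens), weights(query_tokens)
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm = (math.sqrt(sum(w * w for w in d.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0
```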

  9. (Caveat) • We use the scoring function as an absolute measure of relevance • Normally it is used only for relative ranking • So the scoring function must be crafted carefully

  10. Measuring Quality • Avg. Quality = Σ_q ( freq_q × Σ_D ( (likelihood of viewing D) × (relevance of D to q) ) ) • freq_q: taken from query logs • Likelihood of viewing D: ViewProb( Rank(D, q) ), where ranks come from the scoring function over the (possibly stale) repository and ViewProb comes from usage logs • Relevance of D to q: scoring function over the “live” copy of D

  11. Lessons from the Quality Metric • Avg. Quality = Σ_q ( freq_q × Σ_D ( ViewProb( Rank(D, q) ) × (relevance of D to q) ) ) • ViewProb(r) is monotonically nonincreasing • Quality is maximized when the ranking function orders documents in descending order of relevance • An out-of-date repository scrambles the ranking → lowers quality • Let ΔQ_D = the loss in quality due to inaccurate information about D • Equivalently, the improvement in quality if we (re)download D

  12. ΔQ_D: Improvement in Quality • [Figure: redownloading replaces the stale repository copy of D with the fresh Web copy; repository quality increases: Quality += ΔQ_D]

  13. Download Prioritization • Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly • Q: how to measure ΔQ_D? Two difficulties: • The live copy is unavailable • Even given both the “live” and repository copies of D, measuring ΔQ_D may require computing the ranks of all documents for all queries • Approach: (1) estimate ΔQ_D for past versions, (2) forecast the current ΔQ_D
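
A sketch of the resulting scheduling step, assuming forecasted ΔQ_D values are already available (the function and parameter names are illustrative, not from the paper):

```python
import heapq

def pick_downloads(forecasted_dq, budget):
    """forecasted_dq: {doc_id: forecasted delta-Q_D}.
    Returns up to `budget` documents to (re)download, largest predicted
    quality gain first (heapq is a min-heap, so priorities are negated)."""
    heap = [(-dq, doc_id) for doc_id, dq in forecasted_dq.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(budget, len(heap)))]
```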

  14. Overhead of Estimating ΔQ_D • Estimate ΔQ_D while updating the inverted index

  15. Forecast Future ΔQ_D • Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics • Queries: AltaVista query log • [Plot: avg. weekly ΔQ_D during the first 24 weeks vs. the second 24 weeks, highlighting the documents in the top 50%, 80%, and 90%]
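
A hedged sketch of the forecasting step: the slide's data suggest that past ΔQ_D is a reasonable predictor of future ΔQ_D, so one simple forecast (not necessarily the paper's exact model) is an exponentially weighted average of past weekly estimates:

```python
def forecast_dq(weekly_estimates, decay=0.5):
    """Forecast the next delta-Q_D from past weekly estimates, weighting
    recent weeks more heavily. `decay` is an illustrative parameter."""
    weight, weighted_sum, norm = 1.0, 0.0, 0.0
    for dq in reversed(weekly_estimates):  # newest estimate first
        weighted_sum += weight * dq
        norm += weight
        weight *= decay
    return weighted_sum / norm if norm else 0.0
```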

  16. Summary • Estimate ΔQ_D at index time • Forecast future ΔQ_D • Prioritize downloading according to forecasted ΔQ_D

  17. Overall Effectiveness • Staleness = fraction of out-of-date documents* [Cho et al. 2000] • Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002] • (* used “shingling” to filter out “trivial” changes) • Scoring function: PageRank (similar results for TF.IDF) • [Chart: quality achieved (as a fraction of ideal) vs. resource requirement for the Min. Staleness, Min. Embarrassment, and User-Centric policies]

  18. Reasons for Improvement • Does not rely on the size of a text change to estimate importance • [Example: a boston.com page tagged as important by the shingling measure even though it did not match many queries in the workload]

  19. Reasons for Improvement • Accounts for “false negatives” • Does not always ignore frequently-updated pages • [Example: a washingtonpost.com page that user-centric crawling repeatedly re-downloads]

  20. Related Work (1/2) • General-purpose Web crawling: • Min. Staleness [Cho, Garcia-Molina, SIGMOD’00] • Maximize average freshness (or minimize age) for a fixed set of docs. • Min. Embarrassment [Wolf et al., WWW’02]: • Maximize weighted avg. freshness for a fixed set of docs. • Document weights determined by the probability of “embarrassment” • [Edwards et al., WWW’01] • Maximize average freshness for a growing set of docs. • How to balance new downloads vs. redownloading old docs.

  21. Related Work (2/2) • Focused/topic-specific crawling • [Chakrabarti, many others] • Select subset of pages that match user interests • Our work: given a set of pages, decide when to (re)download each based on predicted content shifts + user interests

  22. Summary • Crawling: an optimization problem • Objective: maximize quality as perceived by users • Approach: • Measure ΔQ_D using the query workload and usage logs • Prioritize downloading based on forecasted ΔQ_D • Various reasons for improvement: • Accounts for false positives and negatives • Does not rely on the size of a text change to estimate importance • Does not always ignore frequently updated pages

  23. THE END • Paper available at: www.cs.cmu.edu/~olston

  24. Most Closely Related Work • [Wolf et al., WWW’02]: • Maximize weighted avg. freshness for a fixed set of docs. • Document weights determined by the probability of “embarrassment” • User-Centric Crawling: • Which queries are affected by a change, and by how much? • Change A: significantly alters relevance to several common queries • Change B: only affects relevance to infrequent queries, and not by much • The metric penalizes false negatives • Example: a doc. ranked #1000 for a popular query that should be ranked #2 • Small embarrassment, but a big loss in quality

  25. Inverted Index • Each word maps to a posting list of DocID (freq) entries • Example document Doc1: “Seminar: Cancer Symptoms” • Cancer → Doc7 (2), Doc1 (1), Doc9 (1) • Seminar → Doc5 (1), Doc6 (1), Doc1 (1) • Symptoms → Doc1 (1), Doc4 (3), Doc8 (2)
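
A minimal sketch of building such an index (illustrative helper, plain word counts only):

```python
import re
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {word: [(doc_id, freq), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        for word, freq in Counter(words).items():
            index[word].append((doc_id, freq))
    return index

# E.g. Doc1 = "Seminar: Cancer Symptoms" contributes a posting (Doc1, 1)
# to the lists for "seminar", "cancer", and "symptoms".
```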

  26. Updating Inverted Index • Stale Doc1: “Seminar: Cancer Symptoms” • Live Doc1: “Cancer management: how to detect breast cancer” • After redownloading Doc1, the posting list for Cancer is updated: Doc1 (1) → Doc1 (2), while Doc7 (2) and Doc9 (1) are unchanged
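
A tiny sketch of this per-term update, reusing the index shape from the previous sketch (illustrative only):

```python
def update_posting(index, word, doc_id, new_freq):
    """Replace (or add) the (doc_id, freq) entry for `word` after a redownload."""
    postings = [p for p in index.get(word, []) if p[0] != doc_id]
    postings.append((doc_id, new_freq))
    index[word] = postings

# E.g. after redownloading Doc1, the "cancer" posting changes from (Doc1, 1) to (Doc1, 2).
```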

  27. Measuring ΔQ_D While Updating the Index • Compute the previous and new scores of the downloaded document while updating its postings • Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our experiments) • Compute the previous and new ranks (approximately) using the computed scores and the score-to-rank mapping • Measure ΔQ_D from the previous and new ranks (by applying an approximate function derived in the paper)
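
A hedged sketch of this measurement. The score-to-rank structure below is a simple stand-in for the paper's approximate per-term mapping, and the ΔQ_D estimate only accounts for the downloaded document's own change in rank (the approximate function actually derived in the paper is not reproduced here):

```python
import bisect

def approx_rank(score, sample_scores):
    """Approximate rank of `score` against a small sample of scores for a
    query term (stand-in for the per-term score-to-rank mapping)."""
    ascending = sorted(sample_scores)
    higher = len(ascending) - bisect.bisect_right(ascending, score)
    return higher + 1  # 1-based rank

def estimate_delta_q(affected_queries, view_prob):
    """affected_queries: list of (freq_q, old_rank, new_rank, relevance)
    for queries whose postings mention the downloaded document.
    Simplified estimate: change in the document's own quality contribution."""
    return sum(freq * (view_prob(new_r) - view_prob(old_r)) * rel
               for freq, old_r, new_r, rel in affected_queries)
```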

  28. Out-of-date Repository • [Figure: the fresh Web copy of D vs. the stale repository copy of D]
