
User-Centric Web Crawling*


Presentation Transcript


  1. User-Centric Web Crawling* Christopher Olston CMU & Yahoo! Research** * Joint work with Sandeep Pandey ** Work done at Carnegie Mellon

  2. Distributed Sources of Dynamic Information • Support integrated querying • Maintain historical archive • Example sources: sensors, web sites [Diagram: a central monitoring node, subject to resource constraints, collecting from source A, source B, and source C]

  3. Workload-Driven Approach • Goal: meet usage needs while adhering to resource constraints • Tactic: pay attention to the workload (workload = usage + data dynamics) • Thesis work: cooperative sources [VLDB’00, SIGMOD’01, SIGMOD’02, SIGMOD’03a, SIGMOD’03b] • Current focus: autonomous sources • Data archival from Web sources [VLDB’04] • Supporting Web search [WWW’05] (this talk)

  4. Outline • Introduction: monitoring distributed sources • User-centric web crawling • Model + approach • Empirical results • Related & future work

  5. Web Crawling to Support Search [Diagram: users issue search queries to a search engine, which answers from an index over a repository; a crawler refreshes the repository from web site A, web site B, and web site C under a resource constraint] Q: Given a full repository, when to refresh each page?

  6. Approach • Faced with an optimization problem • Others: maximize freshness, age, or similar; Boolean model of document change • Our approach: user-centric optimization objective; rich notion of document change, attuned to the user-centric objective

  7. Web Search User Interface • User enters keywords • Search engine returns a ranked list of results • User visits a subset of the result documents [Illustration: mock ranked result list]

  8. Objective: Maximize Repository Quality, from the Search Perspective • Suppose a user issues search query q: Quality_q = Σ_{documents d} (likelihood of viewing d) × (relevance of d to q) • Given a workload W of user queries: Average quality = (1/K) × Σ_{queries q ∈ W} (freq_q × Quality_q)
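
A minimal Python sketch of this quality metric, assuming per-query rankings, a view-probability function, and relevance estimates are supplied elsewhere; the function and variable names are illustrative, and K is taken here to be the total query volume (an assumption).

```python
def query_quality(ranked_docs, query, view_prob, relevance):
    """Quality_q = sum over documents d of P(view d) * relevance(d, q)."""
    return sum(view_prob(rank) * relevance[(doc, query)]
               for rank, doc in enumerate(ranked_docs, start=1))

def average_quality(workload_freq, rankings, view_prob, relevance):
    """Average quality = (1/K) * sum over queries q of freq_q * Quality_q.
    K is taken here to be the total query volume (an assumption)."""
    k = sum(workload_freq.values())
    return sum(freq * query_quality(rankings[q], q, view_prob, relevance)
               for q, freq in workload_freq.items()) / k
```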

  9. Viewing Likelihood • Depends primarily on rank in the result list [Joachims KDD’02] • From AltaVista data [Lempel et al. WWW’03]: ViewProbability(r) ∝ r^–1.5 [Plot: probability of viewing vs. rank, dropping off sharply over the first ~150 ranks]
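
For use in the sketches below, the power-law curve can be wrapped in a small function; normalizing over the top N ranks is an assumption for illustration, not something stated on the slide.

```python
def view_probability(rank, exponent=1.5, top_n=1000):
    """ViewProbability(r) proportional to r^-1.5 (slide 9), normalized here
    (by assumption) so that the top_n ranks sum to 1."""
    norm = sum(r ** -exponent for r in range(1, top_n + 1))
    return (rank ** -exponent) / norm
```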

  10. Relevance Scoring Function • Search engines’ internal notion of how well a document matches a query • Each document/query pair → numerical score ∈ [0, 1] • Combination of many factors, e.g.: vector-space similarity (e.g., TF.IDF cosine metric), link-based factors (e.g., PageRank), anchortext of referring pages
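
For concreteness, a generic TF.IDF cosine similarity, one of the factors listed above; this is a textbook variant for illustration, not the exact scoring function used in the talk's experiments.

```python
import math
from collections import Counter

def tfidf_cosine(query_terms, doc_terms, doc_freq, num_docs):
    """Generic TF.IDF cosine similarity between a query and a document.
    doc_freq[t] = number of documents containing term t (illustrative)."""
    def weights(terms):
        tf = Counter(terms)
        return {t: (1 + math.log(c)) * math.log(1 + num_docs / (1 + doc_freq.get(t, 0)))
                for t, c in tf.items()}
    q, d = weights(query_terms), weights(doc_terms)
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0
```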

  11. (Caveat) • Using the scoring function for absolute relevance (normally it is only used for relative ranking) • Need to ensure the scoring function has meaning on an absolute scale • Probabilistic IR models, PageRank: okay • Unclear whether TF-IDF does (still debated, I believe) • Bottom line: a stricter interpretability requirement than “good relative ordering”

  12. Measuring Quality Avg. Quality = Σ_q (freq_q × Σ_d (likelihood of viewing d) × (relevance of d to q)) • freq_q: estimated from query logs • likelihood of viewing d: ViewProb(Rank(d, q)), where Rank comes from the scoring function over the (possibly stale) repository and ViewProb comes from usage logs • relevance of d to q: the scoring function over the “live” copy of d

  13. Lessons from the Quality Metric Avg. Quality = Σ_q (freq_q × Σ_d (ViewProb(Rank(d, q)) × Relevance(d, q))) • ViewProb(r) is monotonically nonincreasing • Quality is maximized when the ranking function orders documents in descending order of true relevance • An out-of-date repository scrambles the ranking → lowers quality • Let ΔQ_D = loss in quality due to inaccurate information about D (equivalently, the improvement in quality if we (re)download D)

  14. ΔQ_D: Improvement in Quality [Diagram: re-downloading D replaces the stale repository copy with the fresh web copy; Repository Quality += ΔQ_D]

  15. Formula for Quality Gain (ΔQ_D) Re-download document D at time t. • Quality beforehand: Q(t–) = Σ_q (freq_q × Σ_d (ViewProb(Rank_t–(d, q)) × Relevance(d, q))) • Quality after re-download: Q(t) = Σ_q (freq_q × Σ_d (ViewProb(Rank_t(d, q)) × Relevance(d, q))) • Quality gain: ΔQ_D(t) = Q(t) – Q(t–) = Σ_q (freq_q × Σ_d (VP × Relevance(d, q))), where VP = ViewProb(Rank_t(d, q)) – ViewProb(Rank_t–(d, q))
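
A direct (if naive) rendering of this gain formula, assuming before/after rankings and relevance scores for the affected documents are available; in practice the talk replaces this with the estimation procedure of the later DETAIL slides. All parameter names are illustrative.

```python
def quality_gain(workload_freq, docs_per_query, rank_before, rank_after,
                 relevance, view_prob):
    """Naive Delta-Q_D (slide 15): for each query q and affected document d,
    add freq_q * [ViewProb(rank after) - ViewProb(rank before)] * Rel(d, q).
    rank_before / rank_after map (doc, query) -> rank position."""
    gain = 0.0
    for q, freq in workload_freq.items():
        for d in docs_per_query[q]:
            dvp = view_prob(rank_after[(d, q)]) - view_prob(rank_before[(d, q)])
            gain += freq * dvp * relevance[(d, q)]
    return gain
```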

  16. Download Prioritization Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly (see the sketch below). Three difficulties: • ΔQ_D depends on the order of downloading • Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive • The live copy is usually unavailable
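
A minimal sketch of that prioritization step, assuming forecasted ΔQ_D values and a per-cycle download budget; the scheduler shape and names are illustrative, not from the paper.

```python
import heapq

def plan_downloads(forecast_gain, budget):
    """Select up to `budget` documents to (re)download, in descending order of
    forecasted quality gain Delta-Q_D. forecast_gain: doc -> forecasted gain."""
    heap = [(-gain, doc) for doc, gain in forecast_gain.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(budget, len(heap)))]
```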

  17. Difficulty 1: Order of Downloading Matters • ΔQ_D depends on the relative rank positions of D • Hence, ΔQ_D depends on the order of downloading • To reduce implementation complexity, avoid tracking inter-document ordering dependencies • Assume ΔQ_D is independent of the downloading of other documents: ΔQ_D(t) = Σ_q (freq_q × Σ_d (VP × Relevance(d, q))), where VP = ViewProb(Rank_t(d, q)) – ViewProb(Rank_t–(d, q))

  18. Difficulty 3: Live Copy Unavailable • Take measurements upon re-downloading D (the live copy is available at that time) • Use forecasting techniques to project forward from past re-downloads [Timeline: measured ΔQ_D(t1), ΔQ_D(t2), … are used to forecast ΔQ_D(t_now)]
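
The slide does not fix a particular forecaster; as one simple possibility, an exponentially weighted moving average over past measurements could serve as the projection.

```python
def forecast_gain(past_gains, alpha=0.3):
    """Exponentially weighted moving average over past Delta-Q_D measurements,
    used as the forecast for the next download decision (illustrative choice)."""
    if not past_gains:
        return 0.0
    estimate = past_gains[0]
    for g in past_gains[1:]:
        estimate = alpha * g + (1 - alpha) * estimate
    return estimate
```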

  19. Ability to Forecast ΔQ_D • Data: 15 web sites sampled from OpenDirectory topics • Queries: AltaVista query log • Documents downloaded once per week, in random order [Scatter plot: average weekly ΔQ_D (log scale), first 24 weeks vs. second 24 weeks, with bands marking the top 50%, 80%, and 90% of documents]

  20. Strategy So Far • Measure the shift in quality (ΔQ_D) each time we re-download document D • Forecast future ΔQ_D • Treat each D independently • Prioritize re-downloading by ΔQ_D Remaining difficulty: • Given both the “live” and repository copies of D, measuring ΔQ_D is computationally expensive

  21. Difficulty 2: Metric Expensive to Compute One problem: measurements of other documents are required. Example: • The “live” copy of D becomes less relevant to query q than before • Now D is ranked too high • Some users visit D in lieu of Y, which is more relevant • Result: less-than-ideal quality • Upon re-downloading D, measuring the quality gain requires knowing the relevance of Y and Z • Results for q: actual ranking X, D, Y, Z vs. ideal ranking X, Y, Z, D Solution: estimate! • Use approximate relevance→rank mapping functions, fit in advance for each query

  22. DETAIL: Estimation Procedure • Focus on query q (later we’ll see how to sum across all affected queries) • Let F_q(rel) be the relevance→rank mapping for q; we use a piecewise-linear function in log-log space • Let r1 = D’s old rank (r1 = F_q(Rel(D_old, q))) and r2 = its new rank • Use an integral approximation of the summation: ΔQ_D,q = Σ_d (VP(d, q) × Relevance(d, q)) = VP(D, q) × Rel(D, q) + Σ_{d≠D} (VP(d, q) × Rel(d, q)), where VP(d, q) is the change in d’s viewing probability; the second term ≈ Σ_{r=r1+1…r2} (ViewProb(r–1) – ViewProb(r)) × F_q^–1(r)
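
A sketch of this estimation step, assuming F_q has already been fit as a piecewise-linear relevance→rank curve in log-log space; the interpolation details and helper names below are assumptions for illustration.

```python
import math
import numpy as np

class RelevanceRankMap:
    """Piecewise-linear relevance->rank mapping F_q in log-log space, fit in
    advance for one query (slide 22). Assumes positive relevances and ranks."""
    def __init__(self, relevances, ranks):
        order = np.argsort(relevances)              # sort control points by relevance
        self.log_rel = np.log(np.asarray(relevances, dtype=float))[order]
        self.log_rank = np.log(np.asarray(ranks, dtype=float))[order]

    def rank_of(self, rel):                         # F_q(rel)
        return math.exp(np.interp(math.log(rel), self.log_rel, self.log_rank))

    def rel_at(self, rank):                         # F_q^{-1}(rank)
        # Rank decreases as relevance increases, so reverse for np.interp.
        return math.exp(np.interp(math.log(rank),
                                  self.log_rank[::-1], self.log_rel[::-1]))

def estimate_gain_for_query(f_q, rel_old, rel_new, view_prob):
    """Approximate Delta-Q_{D,q} (slides 22-23): D's own rank shift plus the
    displaced documents between ranks r1 and r2."""
    r1, r2 = f_q.rank_of(rel_old), f_q.rank_of(rel_new)
    own = (view_prob(r2) - view_prob(r1)) * rel_new              # D itself moves
    lo, hi = sorted((round(r1), round(r2)))
    displaced = sum((view_prob(r - 1) - view_prob(r)) * f_q.rel_at(r)
                    for r in range(lo + 1, hi + 1))              # docs shifted by one
    return own + displaced
```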

  23. DETAIL: Where We Stand Context: ΔQ_D = Σ_q (freq_q × ΔQ_D,q), with ΔQ_D,q = VP(D, q) × Rel(D, q) + Σ_{d≠D} (VP(d, q) × Rel(d, q)) • First term: VP(D, q) ≈ ViewProb(F_q(Rel(D, q))) – ViewProb(F_q(Rel(D_old, q))) • Second term: Σ_{d≠D} (VP(d, q) × Rel(d, q)) ≈ f(Rel(D, q), Rel(D_old, q)) • Altogether: ΔQ_D,q ≈ g(Rel(D, q), Rel(D_old, q))

  24. Difficulty 2, continued Additional problem: must measure the effect of the shift in rank across all queries. Solution: couple measurements with index-updating operations. Sketch: • Basic index unit: the posting • Each time a posting is inserted/deleted/updated, compute the old & new relevance contributions from the term/document pair* • Transform using the estimation procedure, and accumulate across the postings touched to get ΔQ_D * Assumes the scoring function treats term/document pairs independently

  25. DETAIL: Background on Text Indexes Basic index unit: the posting • One posting for each term/document pair • Contains the information needed by the scoring function (number of occurrences, font size, etc.) [Diagram: dictionary entries pointing to posting lists]
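
A minimal inverted-index skeleton matching this description; the field names are an illustrative subset, not Lucene's actual posting format.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Posting:
    """One posting per term/document pair, carrying the features the scoring
    function needs (illustrative subset: occurrence count, font size)."""
    doc_id: int
    term_count: int = 0
    max_font_size: int = 0

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(list)   # dictionary: term -> posting list

    def add(self, term, posting):
        self.postings[term].append(posting)
```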

  26. DETAIL: Pre-Processing to Approximate the Workload • Break multi-term queries into a set of single-term queries • Now, term ≡ query • The index has one posting for each query/document pair [Diagram: dictionary and postings as before, with each dictionary term now treated as a query]

  27. DETAIL: Taking Measurements During Index Maintenance • While updating the index: initialize a bank of ΔQ_D accumulators, one per document (actually materialized on demand using a hash table) • Each time a posting is inserted/deleted/updated: • Compute the new & old relevance contributions for the query/document pair: Rel(D, q), Rel(D_old, q) • Compute ΔQ_D,q using the estimation procedure and add it to the accumulator: ΔQ_D += freq_q × g(Rel(D, q), Rel(D_old, q))
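
A sketch of the accumulator bookkeeping hooked into posting updates, assuming a per-query estimator playing the role of g(Rel(D, q), Rel(D_old, q)) and query frequencies from the query log; the class and method names are illustrative.

```python
from collections import defaultdict

class QualityGainAccumulator:
    """Accumulates Delta-Q_D per document as postings are updated (slide 27)."""
    def __init__(self, estimator, query_freq):
        self.estimator = estimator        # plays the role of g(Rel, Rel_old)
        self.query_freq = query_freq      # freq_q from the query log
        self.gain = defaultdict(float)    # doc_id -> accumulated Delta-Q_D

    def on_posting_update(self, doc_id, query, rel_new, rel_old):
        """Called whenever a posting for (query, doc) is inserted/deleted/updated."""
        self.gain[doc_id] += (self.query_freq.get(query, 0.0)
                              * self.estimator(query, rel_new, rel_old))
```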

  28. Measurement Overhead • Implemented in Lucene • Caveat: does not handle factors that do not depend on a single term/document pair, e.g. term proximity and anchortext inclusion

  29. Summary of Approach • User-centric metric of search repository quality • (Re)downloading a document improves quality • Prioritize downloading by expected quality gain • Metric adaptations to enable a feasible, efficient implementation

  30. Next: Empirical Results • Introduction: monitoring distributed sources • User-centric web crawling • Model + approach • Empirical results • Related & future work

  31. Overall Effectiveness • Staleness = fraction of out-of-date documents* [Cho et al. 2000] • Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002] • Scoring function: PageRank (similar results for TF.IDF) * Used “shingling” to filter out “trivial” changes [Plot: quality (fraction of ideal) vs. resource requirement for the Min. Staleness, Min. Embarrassment, and User-Centric policies]

  32. Reasons for Improvement • Does not rely on the size of a text change to estimate importance [Example screenshot: boston.com, a page tagged as important by staleness- and embarrassment-based techniques even though it did not match many queries in the workload]

  33. Reasons for Improvement • Accounts for “false negatives” • Does not always ignore frequently-updated pages [Example screenshot: washingtonpost.com, a page that user-centric crawling repeatedly re-downloads]

  34. Related Work (1/2) • General-purpose web crawling • [Cho, Garcia-Molina, SIGMOD’00], [Edwards et al., WWW’01] • Maximize average freshness or age • Balance new downloads vs. redownloading old documents • Focused/topic-specific crawling • [Chakrabarti, many others] • Select subset of documents that match user interests • Our work: given a set of docs., decide when to (re)download

  35. Most Closely Related Work • [Wolf et al., WWW’02]: • Maximize weighted average freshness • Document weight = probability of “embarrassment” if not fresh • User-Centric Crawling: • Measure interplay between update and query workloads • When document X is updated, which queries are affected by the update, and by how much? • Metric penalizes false negatives • Doc. ranked #1000 for a popular query should be ranked #2 • Small embarrassment but big loss in quality

  36. Future Work: Detecting Change-Rate Changes • Current techniques schedule monitoring to exploit existing change-rate estimates (e.g., ΔQ_D) • No provision to explore change-rates explicitly → explore/exploit tradeoff • Bad case: estimated change-rate = 0, so we never monitor and won’t notice a future increase in the change-rate • Ongoing work on a Bandit Problem formulation
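
As a toy illustration of the explore/exploit tension (not the talk's bandit formulation, which is only described as ongoing work), an epsilon-greedy scheduler would occasionally re-download low-estimate pages so that a zero estimate can still be revised.

```python
import random

def choose_downloads(forecast_gain, budget, epsilon=0.1, rng=random):
    """Epsilon-greedy selection: mostly exploit the highest forecasted gains,
    but spend part of the budget exploring documents at random so pages with
    (possibly stale) zero estimates are still revisited."""
    docs = list(forecast_gain)
    explore = [d for d in docs if rng.random() < epsilon][:budget]
    chosen = set(explore)
    exploit = sorted((d for d in docs if d not in chosen),
                     key=forecast_gain.get, reverse=True)[:budget - len(explore)]
    return explore + exploit
```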

  37. Summary • Approach: • User-centric metric of search engine quality • Schedule downloading to maximize quality • Empirical results: • High quality with few downloads • Good at picking “right” docs. to re-download
