
The Web Changes Everything: Understanding the Dynamics of Web Content

Eytan Adar, Jaime Teevan, Susan Dumais, and Jon Elsas. University of Washington, Microsoft Research, and Carnegie Mellon University. WSDM’09.



  1. The Web Changes Everything: Understanding the Dynamics of Web Content • Eytan Adar, Jaime Teevan, Susan Dumais, and Jon Elsas • University of Washington, Microsoft Research, and Carnegie Mellon University • WSDM’09

  2. Who Cares About Web Change? • Revisitation • Monitoring • Page Structure • Fragility • Dynamic language • Search engine design

  3. Quantifying Change • Dynamics of the Web are well researched • Fetterly et al. (150 million pages): 65% stay the same • Koehler et al. (5 years): stabilization • Ntoulas et al. (turnover): 50% new content a year • And many others (see the paper for a summary) • But: eye towards systems issues • Crawl rates, indexing, storage needs, etc. • Always random samples • What about the visited Web? • Slow (every day at best)

  4. Outline • Behavior-Driven Sampling & Crawling • Measuring change • Basic change behavior • Page evolution • Text changes • DOM level changes • Applications

  6. Behavior Driven Sampling • Can we measure dynamics of the actually used Web? • Usage Logs • Live Toolbar • 600k from August of ’06 • Subset of total

  7. Sampling URLs • 468 (avg), 650 (med) X 120 = 54788 • All crawlable, min 2 users, 2 times • Full details: Adar et al., CHI08 [Figure: sampling dimensions: Visits Per User, Inter-arrival time, Unique Users (popularity)]

  8. Behavior Driven Sampling • Can we measure dynamics of the actually used Web? • Usage Logs • Live Toolbar • 600k from August of ’06 • Subset of total • Sampled URLs • Around 55k (use the 40k that had revisits in May/June) • Crawled hourly (and sub-hourly) for a year • May/June ’07

  9. URL Annotations • Visitation properties • Revisits, popularity, etc. • Broad type • News, Sports, Personal, Adult, etc. • Structural location • Top level page or deep within site?

  10. Outline • Behavior-Driven Sampling & Crawling • Measuring change • Basic change behavior • Page evolution • Text changes • DOM level changes • Applications

  11. Basic measures of change • Compare page version 1 and page version 2 over time • How long? (inter-version time) • How much? Dice: 2*|A ∩ B| / (|A|+|B|)
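
The Dice measure on this slide can be sketched in a few lines. This is a minimal illustration that assumes A and B are the sets of whitespace-separated terms in two page versions; the slide does not say how a page is tokenized, and the sample strings are made up.

```python
def dice(a, b):
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) over two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty versions are treated as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical page versions, tokenized on whitespace.
version_1 = "breaking news sports weather".split()
version_2 = "breaking news sports markets".split()
print(dice(version_1, version_2))  # 0.75: three shared terms out of 4 + 4
```

A value of 1 means the two versions share every term; 0 means they share none.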

  12. 66% displayed change in 5 week sample (every 123 hours on average) • Random web: 35% change after 11 weeks

  13. Average Inter-version Time by Page Popularity [Figure: hours vs. visitors] • More visitors = faster change

  14. Average Inter-version Time by Page “Depth” [Figure: hours vs. URL depth] • More shallow (closer to homepage) = faster change

  15. Change Plot by Type [Figure: mean Dice coefficient vs. mean inter-version time (hours), by page type: Sports/Recreation, News/Magazine, Music, Personal Pages, Adult, Industry/Trade]

  16. Inter-Version Distribution

  17. Sub-hourly crawls Over 60% of pages displayed some change when crawled every 60 minutes. What is the “true” change rate of the page?

  18. Sub-hourly crawls • Round-robin crawling • 8 samples over 3 (week)days, shifted by at least 4 hours [Figure: controller with original crawl 1, original crawl 2, and 2, 16, 32, and 60 minute delays]

  19. Range of Changes in Sub-hourly crawls [Figure: number of pages (0 to 40000) at 0, 2, 16, 32, and 60 minutes; series: At least once, Change every sample, Mean Dice; percentage labels: 19%, 9%, 23%, 24%, 11%, 66%, 11%, 6%, 12%, 42%]

  20. Range of Changes in Sub-hourly crawls (same figure) • “623 Users Online” • “Page generated in .6 ms” • “Served to IP address…”

  21. Outline • Behavior-Driven Sampling & Crawling • Measuring change • Basic change behavior • Page evolution • Text changes • DOM level changes • Applications

  23. Measuring change [Figure: Dice vs. time (hours) for versions t1 to t5, compared against t0] • Pages are equally (dis)similar • Similarity based on • navigation elements • base language model

  24. Two Segment Model • 2 segment (linear): hockey stick • Dynamic versus static steady state • Knot point: time at which proportion of dynamic to static remains constant

  25. Calculating the Knot Point • Optimization problem [Figure: change curve with knot point marked]
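
One way to picture the knot-point optimization is a brute-force search over candidate knots. This is a sketch only, not necessarily the formulation used in the paper: it assumes an ordinary least-squares line is fitted on each side of a candidate knot and the knot with the lowest total squared error wins. The names `fit_line` and `find_knot` and the synthetic curve are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def find_knot(hours, dice_values):
    """Try every interior sample as the knot; both segments include the knot sample."""
    best_error, best_knot = None, None
    for k in range(1, len(hours) - 1):
        error = (fit_line(hours[:k + 1], dice_values[:k + 1])
                 + fit_line(hours[k:], dice_values[k:]))
        if best_error is None or error < best_error:
            best_error, best_knot = error, hours[k]
    return best_knot


# Synthetic "hockey stick": similarity falls for 4 hours, then stays flat.
hours = list(range(10))
dice_values = [1.0, 0.9, 0.8, 0.7, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]
print(find_knot(hours, dice_values))  # 4 for this synthetic curve
```

The exhaustive search is quadratic in the number of samples, which is acceptable for a single page's change curve; a production version would likely constrain the two segments to meet at the knot.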

  29. Types of Change Curves • 3 main types • Knotted (two-segment) • Sloped • Unchanging • Automatic classification (93% accuracy*) • 70% are knotted • 145 hours mean, 92 median • 28% sloped • 2% unchanging (flat) • *Consistent with the proportions of hand labeled data

  30. Change curves • http://www.nytimes.com • http://www.allrecipes.com • Different stable segment → different ratios of dynamic to stable content

  31. Change curves • Craigslist, Anchorage, AK • Craigslist, Los Angeles, CA [Figure: Dice (.4 to 1) vs. hours (10 to 40) for AK and LA]

  32. Outline • Behavior-Driven Sampling & Crawling • Measuring change • Basic change behavior • Page evolution • Text changes • DOM level changes • Applications

  34. Nature of the Text • What terms vanish here? • Or are still here?

  35. Term Longevity Plot [Figure: terms over time, Sep. to Dec.]

  36. Term Longevity Plot • Term level representation of change curve • Pick a vertical (t0) • Compare overlap of terms to next vertical

  37. Features of Terms • Divergence • Which terms distinguish current document from the collection (at a point in time) • Staying power (σ) • Likelihood of observing a word (w) at two different times, t and t+α, in document D • σ(w,D) ≈ P(t) P(α) P(w | D_t, D_{t+α})
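
Staying power can be approximated empirically from a sequence of crawled versions. The sketch below makes simplifying assumptions that are not stated on the slide: P(t) and P(α) are taken to be uniform over the available crawl times and over offsets up to a hypothetical `max_offset`, so σ reduces to the fraction of (t, t+α) pairs in which the word appears in both versions. The term sets are made up.

```python
def staying_power(word, versions, max_offset):
    """Fraction of version pairs (t, t+alpha) in which `word` appears in both.

    versions: list of term sets ordered by crawl time.
    max_offset: largest alpha considered (assumption: uniform over 1..max_offset).
    """
    hits = 0
    pairs = 0
    for t in range(len(versions)):
        for alpha in range(1, max_offset + 1):
            if t + alpha >= len(versions):
                break
            pairs += 1
            if word in versions[t] and word in versions[t + alpha]:
                hits += 1
    return hits / pairs if pairs else 0.0


# Hypothetical term sets for four consecutive crawls of one page.
versions = [
    {"cooks", "bbq"},
    {"cooks", "salads"},
    {"cooks", "bbq"},
    {"cooks", "pork"},
]
print(staying_power("cooks", versions, max_offset=2))  # 1.0: present in every pair
print(staying_power("bbq", versions, max_offset=2))    # 0.2: present in 1 of 5 pairs
```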

  38. Distribution of terms by staying power (σ) (allrecipes.com) • Low staying power, high divergence: bbq, salads, sandwiches, pork, cheese, cool • High staying power, from high to low divergence: cooks, cookbooks, ingredient, desserts, home, you, search, …

  40. Outline • Behavior-Driven Sampling & Crawling • Measuring change • Basic change behavior • Page evolution • Text changes • DOM level changes • Applications

  41. DOM Level Changes • How long does structure hold? • Applications with assumed stability • Programming by Demonstration (PbD) • Mashups • Scrapers, etc. • DOM structure: Adar et al., “Zoetrope: Interacting with the Ephemeral Web” [UIST08]

  42. Tree Isomorphism • The “general” approach: • Compare the DOM structure of 2 trees • Produces alignment, edit distances, etc. [Grandi’04] • But: somewhat inefficient for large scale • We want: • A method for comparing many (1000s of) versions of the same page at the same time

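  43. The Idea • Serialize each DOM structure • <a>foo <b>bar</b></a> @time = 0 • Serialized columns: full path, type path, node hash, subtree hash, version [Figure: tree with root a, children foo and b, b containing bar]

  44. The Idea • Serialize each DOM structure • <a>foo <b>jar</b></a> @time = 1 [Figure: tree with root a, children foo and b, b containing jar]

  45. The Idea • Serialize each DOM structure • <a><b>jar</b></a> @time = 2 [Figure: tree with root a, child b containing jar]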

  46. Operators on Serialized Data • sort(columns) • Sorts by the variables • reduce(columns) • Generates a set of sets • Look familiar?

  47. sort(full_path, version); S = reduce(full_path); foreach s in S: calculate the difference between the minimum version id and last reported id
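
The serialize, sort, and reduce steps can be sketched with the three example versions from the earlier slides. This is an illustration under assumptions, not the authors' implementation: only the full path and version columns are kept (type path and the two hashes are omitted), the path encoding with sibling indices and `text:` entries is made up for the example, and a node that vanishes and later returns is treated as continuously present.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def serialize(markup, version):
    """Flatten one DOM version into (full_path, version) rows."""
    rows = []

    def walk(elem, path):
        rows.append((path, version))
        if elem.text and elem.text.strip():
            rows.append((f"{path}/text:{elem.text.strip()}", version))
        seen = {}
        for child in elem:
            index = seen.get(child.tag, 0)
            seen[child.tag] = index + 1
            walk(child, f"{path}/{child.tag}[{index}]")
            if child.tail and child.tail.strip():
                rows.append((f"{path}/text:{child.tail.strip()}", version))

    root = ET.fromstring(markup)
    walk(root, f"/{root.tag}[0]")
    return rows


def lifespans(rows):
    """sort(full_path, version), reduce(full_path), then last minus first version."""
    result = {}
    for path, group in groupby(sorted(rows), key=lambda row: row[0]):
        seen_in = [version for _, version in group]
        result[path] = seen_in[-1] - seen_in[0]
    return result


pages = ["<a>foo <b>bar</b></a>", "<a>foo <b>jar</b></a>", "<a><b>jar</b></a>"]
rows = [row for version, page in enumerate(pages) for row in serialize(page, version)]
for path, span in sorted(lifespans(rows).items()):
    print(path, span)
```

Because every version of a page lands in one flat table, thousands of versions can be compared with a single sort and one pass over the groups, rather than pairwise tree alignment.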

  48. Structure Survival Over Time Smaller dataset ([UIST’08]) shows that mean survival after a year is only 23%

  50. Frequencies and Motion • Frequency of change of DOM elements • Motion of elements on a page • Can we predict the motion of a page element?
