Workload Characterization of a Personalized Web Site

Workload Characterization of a Personalized Web Site And Its Implication for Dynamic Content Caching

Weisong Shi, Randy Wright*, Eli Collins, and Vijay Karamcheti

Department of Computer Science New York University

* NYUHome Team

Trends in Web Content Access

  • Rapid growth of traffic for dynamic and personalized content

    • Dynamic web services

      • E.g., My Yahoo!

    • Trickle-down effect for static web pages

      • Web caching and CDN

    • 50% of requests for dynamically generated content

      • Wolman/Voelker/Levy, SOSP’99

    • 30% of requests carry cookies (indicates personalization)

      • Caceres/Douglis/Rabinovich, SIGMETRICS Server Perf. Workshop’98

  • However, traditional web caching architectures do not work well with these trends

Problem and Solution

  • Problem: How to efficiently generate/deliver dynamic and personalized content?

  • Solution: object composition technique

  • Basic idea: reuse at sub-document level

    • Quasi-static document template (e.g., ESI or XSL-FO)

    • Multiple objects with different characteristics

    • 60% of bytes of dynamic content can be reused (Shi’02, Wills’00)

  • Several Projects

    • Server-side: DUP (Challenger’99)

    • Cache-side: HPP (Douglis’97), Content Assembly (Mikhailov/Wills’00), EdgeSuite (Akamai), Websphere (IBM)

    • The CONCA project is our effort

      • Reuse “sharable” portion of personalized content

      • Transcode content to suit client device and network connection
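As a rough illustration of reuse at the sub-document level, the sketch below fills a quasi-static template from a fragment cache and fetches only the objects that are missing or not sharable. The `{name}` placeholder syntax, the `assemble` function, and the channel names are assumptions made for this example; they are not ESI, XSL-FO, or the CONCA design.

```python
import string
from typing import Callable, Dict, List, Set, Tuple


def field_names(template: str) -> List[str]:
    """Return the placeholder names that appear in a str.format-style template."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]


def assemble(
    template: str,
    cache: Dict[str, str],
    fetch: Callable[[str], str],
    sharable: Set[str],
) -> Tuple[str, List[str]]:
    """Build a document, reusing cached sharable objects.

    Returns the assembled document and the list of objects that had to be
    fetched from the origin server (hypothetical `fetch` callback).
    """
    parts: Dict[str, str] = {}
    fetched: List[str] = []
    for name in field_names(template):
        if name in sharable and name in cache:
            parts[name] = cache[name]          # reuse the cached object
        else:
            parts[name] = fetch(name)          # private or not yet cached
            fetched.append(name)
            if name in sharable:
                cache[name] = parts[name]      # only sharable objects are cached
    return template.format(**parts), fetched
```

On a second request with a warm cache, only the non-sharable objects (for example a per-user channel) would be fetched again.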

What is Missing?

  • Questions

    • Are object composition techniques in fact required, and are they likely to be beneficial?

    • What architecture is well suited for dynamic content caching?

  • To answer these questions, we need...

  • A better understanding of the characteristics of personalized content, from both a server and a client perspective

  • This study focuses on characterization of a personalized web site

  • Complements previous work

    • Analysis of the MSNBC web site (Padmanabhan and Qiu, 2000)

    • Analysis of an e-commerce site (Arlitt et al., 2001)

Roadmap

  • Motivation

  • NYUHome

  • Trace gathering

  • Analysis of characteristics

  • Implications for dynamic content caching

  • Related work

  • Summary

NYUHome

  • Portal for NYU students, faculty, and staff (44,000 users)

  • Personalized web site

    • Tab-based design

    • Users choose channels and lay out the chosen channels as desired

NYUHome: Tabs and Channels

  • Five tabs

  • Twenty channels

Trace Gathering

  • Process time (Tp)

    • Tp = T3 - T2

  • Network latency (Tn)

    • Estimated by adding a blank pixel image to the response

    • T2 - T1 ≈ T5 - T4

    • Tn = T5 - T3
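The two measurements can be sketched as simple differences of logged timestamps: process time as T3 minus T2 and network latency as T5 minus T3. The `TraceRecord` type and its field names are hypothetical; the exact events behind each timestamp come from the slide's figure, which is not part of this transcript.

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    """One instrumented request; timestamps are in seconds (assumed unit)."""
    t2: float
    t3: float
    t5: float


def process_time(record: TraceRecord) -> float:
    """Tp = T3 - T2."""
    return record.t3 - record.t2


def network_latency(record: TraceRecord) -> float:
    """Tn = T5 - T3, where T5 is observed via the blank pixel image request."""
    return record.t5 - record.t3
```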

Overall Characteristics

  • Two-week period (02/13/2002 to 02/28/2002)

  • Aggregate statistics

    • 643,853 total requests (1706 requests/hour)

    • 27,576 total users (62% of registered users)

    • 73,119 total distinct IP addresses

Distribution of Requests

  • Classify the source IP addresses into 5 categories

    • Campus, Resnet, Dialup, Overseas, and Others

  • Findings

    • Machines on the NYU campus contribute 17% of IP addresses but 69% of requests

    • 83% of IP addresses fall outside NYU control

      • Grouped into 4183 network clusters (Krishnamurthy’00)

      • 60 clusters have more than 100 IPs
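A minimal sketch of this kind of classification, assuming each category is identified by a list of network prefixes. The prefixes below are reserved documentation ranges used as placeholders, not NYU's real address space, and the Overseas category is left out because it would need an external geolocation source.

```python
import ipaddress
from collections import Counter

# Placeholder prefixes (RFC 5737 documentation ranges), purely illustrative.
PREFIXES = {
    "Campus": [ipaddress.ip_network("192.0.2.0/24")],
    "Resnet": [ipaddress.ip_network("198.51.100.0/24")],
    "Dialup": [ipaddress.ip_network("203.0.113.0/24")],
}


def classify(ip: str) -> str:
    """Map a source IP address to a category, defaulting to 'Others'."""
    addr = ipaddress.ip_address(ip)
    for category, networks in PREFIXES.items():
        if any(addr in net for net in networks):
            return category
    return "Others"


def shares(request_ips):
    """Return (share of distinct IPs, share of requests) per category.

    `request_ips` holds one source IP string per logged request.
    """
    request_ips = list(request_ips)
    per_request = Counter(classify(ip) for ip in request_ips)
    per_ip = Counter(classify(ip) for ip in set(request_ips))
    n_requests, n_ips = sum(per_request.values()), sum(per_ip.values())
    ip_share = {c: n / n_ips for c, n in per_ip.items()}
    request_share = {c: n / n_requests for c, n in per_request.items()}
    return ip_share, request_share
```

Comparing the two dictionaries shows the kind of skew reported above, where a category can hold a small share of addresses but a large share of requests.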

Requests to Tabs

  • 90.1% to default HOME tab

    • Most users use NYUHome primarily for checking e-mail

  • Template occupies a significant portion (30% to 60%)

    • Agrees with other studies of dynamic content (Wills’00, Shi’02)

Channel Size vs. Document Size

  • 99% of channels are less than 3000 bytes

  • Modeled well by Weibull distribution with

    • Agrees with our previous study on e-commerce sites

  • 70% of the documents lie in a small range (9,725, 10,688)

    • Popularity of HOME tab, and sizeable fraction of template size

User Behavior: Session Characteristics

  • Number of requests per session

    • A session is defined as the set of requests sharing the same session key

    • 82.85% of sessions contain one request only

  • Inter-request time within a session

    • Average is 492.7 seconds; median is 92.9 seconds

    • Reason why persistent connections are disabled at NYUHome

    • Captured very well by a Lognormal distribution
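The session statistics above could be derived from a request log roughly as follows. The input format, a sequence of `(session_key, timestamp)` pairs, and the function names are assumptions for illustration.

```python
from collections import defaultdict


def group_by_session(requests):
    """Group request timestamps (seconds) by session key."""
    sessions = defaultdict(list)
    for session_key, timestamp in requests:
        sessions[session_key].append(timestamp)
    return sessions


def single_request_fraction(sessions):
    """Fraction of sessions that contain exactly one request."""
    singles = sum(1 for times in sessions.values() if len(times) == 1)
    return singles / len(sessions)


def inter_request_times(sessions):
    """Gaps between consecutive requests within each session."""
    gaps = []
    for times in sessions.values():
        ordered = sorted(times)
        gaps.extend(later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return gaps
```

Feeding the resulting gaps to `statistics.mean` and `statistics.median` gives the two summary figures; a mean far above the median, as reported here, indicates a heavy right tail.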

User Behavior: Client Popularity

  • Relation between user rank and the number of requests

    • Users are ranked by the number of requests they issue

    • If client popularity follows a Zipf-like distribution, the log-log plot should appear linear with a slope near -α

    • α = 0.35 for the top 2000 users
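One common way to estimate the exponent of a Zipf-like distribution (written α here) is an ordinary least-squares line through the points (log rank, log request count); the slope of that line is then close to -α. The sketch below assumes this fitting method, which the slides do not specify.

```python
import math


def zipf_alpha(request_counts):
    """Estimate alpha from per-user request counts via a log-log least-squares fit.

    Needs at least two users with a positive request count.
    """
    counts = sorted((c for c in request_counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope  # the fitted slope is -alpha
```

To mirror the slide, the fit would be applied to the request counts of the top 2000 users only.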

User Behavior: Personalization

  • Calculate the total number of channel combinations

  • Default vs. personalized users

    • Counted the number of distinct users who used a particular channel combination for each tab

User Behavior: Personalization

  • Percentage of requests to different channel combinations

    • Observation: a significant percentage differs only in layout
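To separate combinations that differ in content from those that differ only in layout, one option is to count each request twice: once by the ordered channel tuple and once by the unordered channel set. This is a sketch under that assumption; the channel names are hypothetical.

```python
from collections import Counter


def combination_counts(layouts):
    """Count requests per exact layout and per layout-independent channel set.

    `layouts` holds one tuple of channel names, in display order, per request.
    """
    layouts = [tuple(layout) for layout in layouts]
    by_layout = Counter(layouts)
    by_channel_set = Counter(frozenset(layout) for layout in layouts)
    return by_layout, by_channel_set
```

When `by_channel_set` has far fewer keys than `by_layout`, many users share the same channel content and differ only in arrangement, which is the case where cached channel objects can be reused across users.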

Request Processing Cost

  • Apache 1.3 on 12-processor Sun E10000 (399MHz)

  • Average Tp = 1.41s

  • Relationship with server load

    • Observation: average Tp is independent of load

    • Inherent overhead of dynamic generation of personalized content

Request Processing Cost: A Closer Look

  • Correlation coefficient between

    • Tp and the number of channels: 0.98

      • which explains the lower Tp of ACADEMICS and RESEARCH

    • Tp and document size: 0.04

  • A simple model of processing overhead

    where Tc is the time to obtain content from the cache, Tg the time to generate the content synchronously, and Ta the time to assemble the content into a document

    Relationship: Tc + Ta = 0.32s, Tg + Ta = 0.52s

    • which means generating content incurs an additional 0.2 seconds
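The slides do not say which correlation coefficient was used; the sketch below assumes the Pearson coefficient. It also spells out the last step: subtracting the two reported sums cancels Ta and leaves Tg - Tc.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Values reported on the slide, in seconds.
TC_PLUS_TA = 0.32
TG_PLUS_TA = 0.52

# (Tg + Ta) - (Tc + Ta) = Tg - Tc: the extra cost of generating over a cache hit.
extra_generation_cost = round(TG_PLUS_TA - TC_PLUS_TA, 2)
```

In this setting `pearson` would be applied to per-request pairs such as (number of channels, Tp) or (document size, Tp).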

Network Latency and Throughput

  • Average Tn=2.45s, 15% of requests require more than 5s

    • Both latency and throughput are captured well by Lognormal distribution

      • Agrees with Balakrishnan’s study of 1996 Atlanta Olympic traces

  • Correlation coefficient between Tn and document size

    • -0.0031, but strong correlation after categorization

  • Diversity of network connections

    • Two LAN-like, one WAN-like, one phone modem, and others

    • Access over NYU Dialup is 7 times slower than access from Resnet
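A Lognormal fit of the kind mentioned above can be sketched by taking the mean and standard deviation of the logarithms of the samples (the maximum-likelihood estimates). Whether the authors fitted it this way is an assumption; the function names are illustrative.

```python
import math
from statistics import mean, pstdev


def fit_lognormal(samples):
    """Return (mu, sigma) of log(samples); non-positive samples are ignored."""
    logs = [math.log(s) for s in samples if s > 0]
    return mean(logs), pstdev(logs)


def lognormal_median(mu):
    """Median of a Lognormal distribution with location mu."""
    return math.exp(mu)


def lognormal_mean(mu, sigma):
    """Mean of a Lognormal distribution; it exceeds the median whenever sigma > 0."""
    return math.exp(mu + sigma ** 2 / 2)
```

Fitting each connection category separately would match the observation that correlations only become visible after categorization.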

Implications for Dynamic Content Caching

  • Need for efficient delivery of personalized content

    • 30% of users are using personalization

    • Larger if we count the “email checker phenomenon”

    • Increased server overheads and larger network latencies

  • Potential benefits from using the object composition technique

    • Advocate caching of channel content at proxy caches or surrogates

    • 6 of 11 channels larger than 1K bytes are sharable

    • 30% to 60% contributed by quasi-static template

    • Taking the HOME tab as an example: 96% potential bandwidth saving

      • Only the Email channel needs to be fetched
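The bandwidth saving is the fraction of document bytes that need not cross the network when everything except the non-sharable objects is served from a cache. The component sizes below are hypothetical, chosen only so that the arithmetic lands on the 96% quoted above; they are not measurements from the trace.

```python
def bandwidth_saving(component_bytes, must_fetch):
    """Fraction of bytes saved when only `must_fetch` components are transferred."""
    total = sum(component_bytes.values())
    fetched = sum(
        size for name, size in component_bytes.items() if name in must_fetch
    )
    return 1.0 - fetched / total


# Hypothetical HOME tab: a 10,000-byte document with a 400-byte Email channel.
home_tab = {"template": 5000, "news": 2300, "weather": 2300, "email": 400}
saving = bandwidth_saving(home_tab, must_fetch={"email"})
```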

Implications for Dynamic Content Caching

  • Benefits from proxy prefetching and/or server pushing

    • Sizeable fraction (40%) of personalized channels

    • Solution: Server pushing or proxy prefetching

    • Long inter-request interval within a session allows more sophisticated prefetching policies

  • Benefits from predicting access patterns

    • Conflicting demand between prefetching and personalization needs to be reconciled

    • Zipf-like client popularity allows us to predict a small group

  • Need for customizing content based on network connection

    • Solution: different default layouts and channel content
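One way to act on a skewed client popularity is to pick the most active users as prefetching candidates and check how much of the request volume they cover. This is an illustrative policy sketch, not the one proposed in the talk; `top_k_coverage` and its inputs are hypothetical.

```python
from collections import Counter


def top_k_coverage(request_users, k):
    """Return the k most active users and the share of requests they issue.

    `request_users` holds one user id per logged request.
    """
    request_users = list(request_users)
    counts = Counter(request_users)
    top = counts.most_common(k)
    covered = sum(count for _, count in top)
    return [user for user, _ in top], covered / len(request_users)
```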

Related Work

  • Workload characterization

    • A lot of previous work for static web content

    • Few on dynamic web content

      • Analysis of MSNBC by Padmanabhan and Qiu (2000)

      • Analysis of a large shopping site by Arlitt et al. (2001)

    • Sub-document level and instrumented logs

  • Personalization

    • My Yahoo! user experience analysis by Manber et al. (2000)

      • Only general information and high level implications

    • Our study looks at quantitative aspects

  • Server performance

    • Web server performance for static content

      • Flash (Pai’98), SEDA (Welsh’01)

    • Server processing overhead for personalized content

Summary and Future Work

  • Characteristics of NYUHome

    • Document composition, personalization behavior, server-side overhead and network latency

  • Implications for dynamic content caching

    • Personalization functionality is increasingly accepted

    • Substantial benefits are likely from applying the object composition technique to personalized content

    • Both server load and latency can be further reduced by prefetching the content of a small number of personalized channels

    • Client-perceived latencies can be reduced by specializing the document layout and content to the network connection

  • Next step: integrating with the CONCA prototype

    • ESI-based prototype is now running

Additional Information

Moving to Wayne State University next week