Workload Characterization of a Personalized Web Site and Its Implication for Dynamic Content Caching

Weisong Shi, Randy Wright*, Eli Collins, and Vijay Karamcheti

Department of Computer Science New York University

* NYUHome Team

http://www.cs.nyu.edu/~weisong/conca.html

Trends in Web Content Access
  • Rapid growth of traffic for dynamic and personalized content
    • Dynamic web services
      • E.g., My Yahoo!, MyCiti.com
    • Trickle-down effect for static web pages
      • Web caching and CDN
    • 50% of requests for dynamically generated content
      • Wolman/Voelker/Levy, SOSP’99
    • 30% of requests carry cookies (indicates personalization)
      • Caceres/Douglis/Rabinovich, SIGMETRICS Server Perf. Workshop’98
  • However, traditional web caching architectures do not work well with these trends
Problem and Solution
  • Problem: How to efficiently generate/deliver dynamic and personalized content?
  • Solution: object composition technique
  • Basic idea: reuse at sub-document level
    • Quasi-static document template (e.g., ESI or XSL-FO)
    • Multiple objects with different characteristics
    • 60% of bytes of dynamic content can be reused (Shi’02, Wills’00)
  • Several Projects
    • Server-side: DUP (Challenger’99)
    • Cache-side: HPP (Douglis’97), Content Assembly (Mikhailov/Wills’00), EdgeSuite (Akamai), Websphere (IBM)
    • The CONCA project is our effort
      • Reuse “sharable” portion of personalized content
      • Transcode content to suit client device and network connection
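As a rough illustration of reuse at the sub-document level, the sketch below fills a quasi-static template from a shared cache and goes to the origin only for objects that are not cached. The template syntax, the object names, and the `fetch_from_origin` helper are all hypothetical; this is not ESI syntax and not the CONCA implementation.

```python
import re

# Hypothetical quasi-static template: placeholders mark where objects go.
TEMPLATE = "<html><body>{{header}}{{news}}{{email}}</body></html>"

# Hypothetical shared cache at a proxy, holding objects reusable across users.
shared_cache = {
    "header": "<div>Portal header</div>",
    "news": "<div>Campus news</div>",
}

def fetch_from_origin(name, user):
    """Stand-in for a request to the origin server for a private object."""
    return f"<div>{name} for {user}</div>"

def compose(template, user, cache):
    """Fill each placeholder from the cache, fetching only on a miss."""
    fetched = []

    def fill(match):
        name = match.group(1)
        if name in cache:
            return cache[name]
        fetched.append(name)
        return fetch_from_origin(name, user)

    page = re.sub(r"\{\{(\w+)\}\}", fill, template)
    return page, fetched

page, fetched = compose(TEMPLATE, "alice", shared_cache)
```

In this toy run only the `email` object is fetched; the header and news objects come from the cache.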
What is Missing?
  • Questions
    • Are object composition techniques in fact required, and are they likely to be beneficial?
    • What architecture is well suited for dynamic content caching?
  • To answer these questions, we need...
  • A better understanding of the characteristics of dynamic content, from both a server and a client perspective
  • This study focuses on characterization of a personalized web site
  • Complements previous work
    • Analysis of the MSNBC web site (Padmanabhan and Qiu, 2000)
    • Analysis of an e-commerce site (Arlitt et al., 2001)
Roadmap
  • Motivation
  • NYUHome
  • Trace gathering
  • Analysis of characteristics
  • Implications for dynamic content caching
  • Related work
  • Summary
NYUHome
  • Portal for NYU students, faculty, and staff (44,000 users)
  • Personalized web site
    • Tab-based design
    • Users choose channels and lay out the chosen channels as desired
NYUHome: Tabs and Channels
  • Five tabs: HOME, ACADEMICS, RESEARCH, NEWS, FILES
  • Twenty channels
Trace Gathering
  • Process time (Tp )
    • Tp = T3 - T2
  • Network latency (Tn)
    • Estimated by adding a blank pixel image at the end of the response
    • T2 - T1 ≈ T5 - T4
    • Tn = T5 - T3
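The two measurements can be sketched as simple differences of server-side timestamps, reading the definitions above as Tp = T3 - T2 and Tn = T5 - T3. The timeline figure is not part of this transcript, so the meaning attached to each timestamp here is an assumption: T2 is taken as the arrival of the request at the server, T3 as the time the response leaves the server, and T5 as the arrival of the follow-up request for the blank pixel image. The `RequestTimes` type and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RequestTimes:
    """Server-side timestamps in seconds (interpretation assumed, see above)."""
    t2: float  # request arrives at the server
    t3: float  # response leaves the server
    t5: float  # request for the blank pixel image arrives at the server

def processing_time(r: RequestTimes) -> float:
    """Tp = T3 - T2."""
    return r.t3 - r.t2

def network_latency(r: RequestTimes) -> float:
    """Tn = T5 - T3."""
    return r.t5 - r.t3

# Made-up timestamps for illustration only.
sample = RequestTimes(t2=10.0, t3=11.5, t5=13.75)
tp = processing_time(sample)
tn = network_latency(sample)
```

Both quantities use only timestamps taken at the server, so no clock synchronization with the client is needed under this reading.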
Roadmap
  • Motivation
  • NYUHome
  • Trace gathering
  • Analysis of characteristics
  • Implications for dynamic content caching
  • Related work
  • Summary
Overall Characteristics
  • Two-week period (02/13/2002 to 02/28/2002)
  • Aggregate statistics
    • 643,853 total requests (1706 requests/hour)
    • 27,576 total users (62% of registered users)
    • 73,119 total distinct IP addresses
Distribution of Requests
  • Classify the source IP addresses into 5 categories
    • Campus, Resnet, Dialup, Overseas, and Others
  • Findings
    • Machines on the NYU campus contribute 17% of IP addresses but 69% of requests
    • 83% of IP addresses fall outside NYU control
      • Grouped into 4183 network clusters (Krishnamurthy’00)
      • 60 clusters have more than 100 IPs
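One way to group addresses into network clusters is longest-prefix matching against a list of routing prefixes, which is the general idea behind network-aware clustering. The sketch below assumes that approach; the prefix list and the addresses are hypothetical documentation ranges, not data from the trace, and the exact procedure used in the study may differ.

```python
import ipaddress
from collections import defaultdict

def cluster_by_prefix(ips, prefixes):
    """Group IP strings under their longest matching prefix (None if no match)."""
    # Most specific prefixes first, so the first hit is the longest match.
    networks = sorted(
        (ipaddress.ip_network(p) for p in prefixes),
        key=lambda n: n.prefixlen,
        reverse=True,
    )
    clusters = defaultdict(list)
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        match = next((n for n in networks if addr in n), None)
        clusters[str(match) if match else None].append(ip)
    return dict(clusters)

# Hypothetical prefixes and addresses, for illustration only.
prefixes = ["192.0.2.0/24", "198.51.100.0/24", "198.51.100.128/25"]
ips = ["192.0.2.7", "192.0.2.9", "198.51.100.5", "198.51.100.200", "203.0.113.1"]
clusters = cluster_by_prefix(ips, prefixes)
```

Counting the addresses per cluster then gives figures of the kind reported above, such as the number of clusters holding more than 100 IPs.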
Requests to Tabs
  • 90.1% to default HOME tab
    • Most users use NYUHome primarily to check e-mail
  • The template accounts for a significant portion (30% to 60%) of each document
    • Agrees with other studies of dynamic content (Wills’00, Shi’02)
Channel Size vs. Document Size
  • 99% of channels are less than 3000 bytes
  • Modeled well by a Weibull distribution
    • Agrees with our previous study on e-commerce sites
  • 70% of the documents lie in a narrow size range (9,725 to 10,688 bytes)
    • Explained by the popularity of the HOME tab and the sizeable share of each document taken up by the template
User Behavior: Session Characteristics
  • Number of requests per session
    • A session is defined as the set of requests that share the same session key
    • 82.85% of sessions contain one request only
  • Inter-request time within a session
    • Average 492.7 seconds, median is 92.9 seconds
    • This is why persistent connections are disabled at NYUHome
    • Captured very well by a lognormal distribution
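The session statistics above can be derived from a request log by grouping on the session key. The following is a minimal sketch that assumes each log entry is a `(session_key, timestamp_seconds)` pair; that record layout and the sample entries are hypothetical, not the NYUHome log format.

```python
from collections import defaultdict
from statistics import mean, median

def session_stats(log):
    """Summarize sessions from (session_key, timestamp_seconds) pairs."""
    sessions = defaultdict(list)
    for key, ts in log:
        sessions[key].append(ts)

    single = sum(1 for ts in sessions.values() if len(ts) == 1)

    # Inter-request times: gaps between consecutive requests in a session.
    gaps = []
    for ts in sessions.values():
        ts.sort()
        gaps.extend(b - a for a, b in zip(ts, ts[1:]))

    return {
        "sessions": len(sessions),
        "single_request_fraction": single / len(sessions),
        "mean_gap": mean(gaps) if gaps else None,
        "median_gap": median(gaps) if gaps else None,
    }

# Made-up log entries for illustration only.
log = [("a", 0), ("a", 60), ("a", 200), ("b", 10), ("c", 30), ("d", 5), ("d", 905)]
stats = session_stats(log)
```

A mean gap well above the median, as in the reported 492.7 s versus 92.9 s, points to a long right tail, which is consistent with the lognormal fit.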
User Behavior: Client Popularity
  • Relation between user rank and the number of requests
    • Users are ranked by the number of requests they issue
    • If client popularity follows a Zipf-like distribution, the log-log plot should appear linear with a slope near -α
    • α = 0.35 for the top 2000 users
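For a Zipf-like distribution the number of requests from the user at rank r is proportional to r^(-α), so α can be estimated as the negated slope of a least-squares line through the log-log rank plot. The sketch below assumes that fitting method, which may not be the one used in the study, and checks it on synthetic counts rather than trace data; `zipf_alpha` is a hypothetical helper name.

```python
import math

def zipf_alpha(request_counts, top=None):
    """Estimate alpha from a least-squares line through the log-log rank plot."""
    counts = sorted(request_counts, reverse=True)[:top]
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return -slope

# Synthetic counts that follow rank**-0.35 exactly; not the NYUHome trace.
synthetic = [1000 * rank ** -0.35 for rank in range(1, 2001)]
alpha = zipf_alpha(synthetic, top=2000)
```

On real logs every count must be positive, and the fit is usually restricted to the head of the ranking, as with the top 2000 users here.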
User Behavior: Personalization
  • Calculate the total number of channel combinations
  • Default vs. personalized users
    • Counted the number of distinct users who used a particular channel combination for each tab
User Behavior: Personalization
  • Percentage of requests to different channel combinations
    • Observation: a significant percentage differ only in layout
Request Processing Cost
  • Apache 1.3 on 12-processor Sun E10000 (399MHz)
  • Average Tp = 1.41s
  • Relationship with server load
    • Observation: average Tp is independent of load
    • Inherent overhead of dynamic generation of personalized content
Request Processing Cost: A Closer Look
  • Correlation coefficient between
    • Tp and the number of channels: 0.98
      • which explains the lower Tp of ACADEMICS and RESEARCH
    • Tp and document size: 0.04
  • A simple model of processing overhead
    • Tc is the time to obtain channel content from cache, Tg the time to generate the content synchronously, and Ta the time to assemble the content into a document
    • Relationship: Tc + Ta = 0.32s, Tg + Ta = 0.52s
      • which means generating incurs an additional 0.2 seconds
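The correlation figures are ordinary Pearson coefficients, and the extra generation cost follows by subtracting the two reported sums. The sketch below shows both steps. The per-tab data points are made up to be exactly linear and are not values from the trace; only 0.32 and 0.52 come from the slide.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up points for illustration: Tp grows linearly with channel count.
channels = [2, 3, 5, 6, 8]
tp_seconds = [0.5, 0.7, 1.1, 1.3, 1.7]
r = pearson(channels, tp_seconds)

# Extra cost of generating rather than reading from cache, using the
# reported sums: (Tg + Ta) - (Tc + Ta) = Tg - Tc.
extra = 0.52 - 0.32
```

Because Ta appears in both sums, it cancels, so the 0.2 s difference is attributable to generation alone.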
Network Latency and Throughput
  • Average Tn=2.45s, 15% of requests require more than 5s
    • Both latency and throughput are captured well by Lognormal distribution
      • Agrees with Balakrishnan et al.’s study of the 1996 Atlanta Olympics traces
  • Correlation coefficient between Tn and document size
    • -0.0031, but strong correlation after categorization
  • Diversity of network connections
    • Two LAN-like, one WAN-like, one phone modem, and others
    • Access over NYU Dialup is 7 times slower than access from Resnet
Roadmap
  • Motivation
  • NYUHome
  • Trace gathering
  • Analysis of characteristics
  • Implications for dynamic content caching
  • Related work
  • Summary
Implications for Dynamic Content Caching
  • Need for efficient delivery of personalized content
    • 30% of users are using personalization
    • Larger if we count the “email checker phenomenon”
    • Increased server overheads and larger network latencies
  • Potential benefits from using the object composition technique
    • Advocate caching of channel content at proxy caches or surrogates
    • 6 of 11 channels larger than 1K bytes are sharable
    • 30% to 60% contributed by quasi-static template
    • Taking the HOME tab as an example: 96% potential bandwidth saving
      • Only the Email channel needs to be fetched
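The bandwidth saving is the fraction of a document's bytes that do not have to be fetched from the origin. The worked example below uses hypothetical sizes chosen to be in line with the reported ranges (documents of roughly 10 KB, channels under 3000 bytes); the actual HOME and Email sizes are not given in the slides.

```python
def bandwidth_saving(document_bytes, fetched_bytes):
    """Fraction of bytes that need not be fetched from the origin."""
    return 1 - fetched_bytes / document_bytes

# Hypothetical sizes: a 10,000-byte HOME document in which only a
# 400-byte Email channel must come from the origin server.
saving = bandwidth_saving(10_000, 400)
```

With these assumed sizes the saving works out to 96%, matching the order of magnitude claimed for the HOME tab.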
Implications for Dynamic Content Caching
  • Benefits from proxy prefetching and/or server pushing
    • Sizeable fraction (40%) of personalized channels
    • Solution: Server pushing or proxy prefetching
    • Long inter-request interval within a session allows more sophisticated prefetching policies
  • Benefits from predicting access patterns
    • Conflicting demand between prefetching and personalization needs to be reconciled
    • Zipf-like client popularity allows us to focus prediction on a small group of clients
  • Need for customizing content based on network connection
    • Solution: different default layouts and channel content
Related Work
  • Workload characterization
    • A lot of previous work for static web content
    • Few on dynamic web content
      • Analysis of MSNBC by Padmanabhan and Qiu (2000)
      • Analysis of a large shopping site by Arlitt et al. (2001)
    • Sub-document level and instrumented logs
  • Personalization
    • My Yahoo! user experience analysis by Manber et al. (2000)
      • Only general information and high level implications
    • Our study looks at quantitative aspects
  • Server performance
    • Web server performance for static content
      • Flash (Pai’98), SEDA (Welsh’01)
    • Server processing overhead for personalized content
Summary and Future Work
  • Characteristics of NYUHome
    • Document composition, personalization behavior, server-side overhead and network latency
  • Implications for dynamic content caching
    • Personalization functionality is increasingly accepted
    • Substantial benefits are likely from applying the object composition technique to personalized content
    • Both server load and latency can be further reduced by prefetching the content of a small number of personalized channels
    • Client-perceived latencies can be reduced by specializing the document layout and content to the network connection
  • Next step: integrating with the CONCA prototype
    • ESI-based prototype is now running
Additional Information

Moving to Wayne State University next week

[email protected]
