
On Network-Aware Clustering of Web Clients

Balachander Krishnamurthy (bala@research.att.com), AT&T Labs-Research, Florham Park, NJ, USA
Jia Wang (jiawang@cs.cornell.edu), Cornell University, Ithaca, NY, USA


Presentation Transcript


  1. On Network-Aware Clustering of Web Clients
     Balachander Krishnamurthy (bala@research.att.com), AT&T Labs-Research, Florham Park, NJ, USA
     Jia Wang (jiawang@cs.cornell.edu), Cornell University, Ithaca, NY, USA

  2. Outline
     • Introduction
     • Simple approaches to clustering
     • Network-aware approach
     • Applications of client clustering
     • Conclusion and future work

  3. Introduction
     • Original goal: identify the group of clients that are responsible for a significant portion of a Web site’s requests
     • Cluster
       • Non-overlapping
       • Topologically close
       • Under common administrative control
     • But identifying clusters requires knowledge that is not available to anyone outside the administrative entities.
     • Network-aware approach: BGP based

  4. Simple approaches
     • Two approaches
       • Use traditional Class A, Class B and Class C networks
       • Assume the prefix length is 24 bits
     • Both are simple, but do not give good results (~50% accuracy).
     • Counterexample
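The two simple approaches can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `classful_cluster` and `fixed24_cluster` are hypothetical, and the sample address is made up for the example.

```python
import ipaddress


def classful_cluster(ip: str) -> str:
    """Return the classful (Class A/B/C) network that contains an IPv4 address."""
    first_octet = int(ipaddress.IPv4Address(ip)) >> 24
    if first_octet < 128:        # Class A: 8-bit network part
        prefix_len = 8
    elif first_octet < 192:      # Class B: 16-bit network part
        prefix_len = 16
    elif first_octet < 224:      # Class C: 24-bit network part
        prefix_len = 24
    else:
        raise ValueError(f"{ip} is not a Class A, B or C address")
    # strict=False masks off the host bits instead of raising an error
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))


def fixed24_cluster(ip: str) -> str:
    """Return the /24 network that contains an IPv4 address."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


if __name__ == "__main__":
    sample = "12.65.147.94"  # illustrative address only
    print(classful_cluster(sample), fixed24_cluster(sample))
```

Neither function consults any routing information, which is one plausible reason the slide reports only about 50% accuracy: the cluster boundary is fixed by the address alone, whatever the real allocation looks like.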

  5. Network-aware approach
     • Use BGP routing and forwarding table snapshots
     • Routing table entries → clusters
     • Example snapshot of a BGP routing table

  6. Clustering process
     [Flow diagram. Recoverable labels: BGP routing tables → prefix extraction, unification, merging → prefix table; source of IP addresses → IP address extraction → IP addresses; client cluster identification → raw client clusters → validation (optional), examining impact of network dynamics, self-correction and adaptation → client clusters. The diagram also carries the label "Automated process".]

  7. Network prefix extraction
     • Prefix entry extraction (BGP tables from 14 places via automated scripts): AADS, MAE-EAST, MAE-WEST, PACBELL, PAIX, ARIN, AT&T-Forw, AT&T-BGP, CANET, CERFNET, NLANR, OREGON, SINGAREN, and VBNS
     • Prefix format unification and merging
       • Three formats:
         x1.x2.x3.x4/k1.k2.k3.k4
         x1.x2.x3.x4/m
         x1.x2.x3.0
     • Assembled a total of 391,497 unique prefix entries (412,109 entries by 7/24/2000)
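As a rough sketch of the unification and merging step, the three prefix formats can be normalized to a single prefix/length form and deduplicated. The slide does not say how the mask-less third format is interpreted; the code below assumes the classful default mask, which is an assumption of this example and not something the source states. The names `unify_prefix` and `merge_tables` are hypothetical.

```python
import ipaddress
from typing import Iterable, Set


def classful_len(addr: ipaddress.IPv4Address) -> int:
    """Default classful prefix length (ASSUMPTION used for mask-less entries)."""
    first_octet = int(addr) >> 24
    if first_octet < 128:
        return 8
    if first_octet < 192:
        return 16
    if first_octet < 224:
        return 24
    raise ValueError(f"{addr} has no classful network mask")


def unify_prefix(entry: str) -> ipaddress.IPv4Network:
    """Normalize one table entry to an IPv4Network (prefix/length form)."""
    entry = entry.strip()
    if "/" in entry:
        # Covers both x1.x2.x3.x4/k1.k2.k3.k4 (netmask) and x1.x2.x3.x4/m (length)
        return ipaddress.IPv4Network(entry, strict=False)
    # Mask-less form: fall back to the assumed classful mask
    addr = ipaddress.IPv4Address(entry)
    return ipaddress.IPv4Network(f"{addr}/{classful_len(addr)}", strict=False)


def merge_tables(tables: Iterable[Iterable[str]]) -> Set[ipaddress.IPv4Network]:
    """Merge several tables into one set of unique, unified prefixes."""
    merged = set()
    for table in tables:
        for entry in table:
            merged.add(unify_prefix(entry))
    return merged
```

Because the same prefix written in different formats unifies to an equal `IPv4Network` object, the set removes duplicates across tables.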

  8. Client cluster identification
     • Methodology
       • Extract the client IP addresses from the server log
       • Perform longest prefix matching on each client IP address
       • Classify all client IP addresses that have the same longest matched prefix into one client cluster
     • Experiments
       • Experiments on a wide range of Web server logs
     • Results
       • > 99% of clients can be grouped into clusters
       • ~ 90% of sampled clusters passed our validation tests
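The methodology above can be sketched as follows. This is a simple illustration of longest prefix matching under stated assumptions, not the authors' implementation: the prefixes and addresses in the usage note are invented, and the helper names (`build_prefix_index`, `longest_match`, `cluster_clients`) are hypothetical.

```python
import ipaddress
from collections import defaultdict


def build_prefix_index(prefixes):
    """Index prefixes by length: {prefix_len: set of network addresses as ints}."""
    index = defaultdict(set)
    for prefix in prefixes:
        net = ipaddress.IPv4Network(prefix, strict=False)
        index[net.prefixlen].add(int(net.network_address))
    return index


def longest_match(ip, index):
    """Return the longest prefix in the index that contains ip, or None."""
    addr = int(ipaddress.IPv4Address(ip))
    for plen in sorted(index, reverse=True):   # try the longest prefixes first
        mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
        network = addr & mask
        if network in index[plen]:
            return f"{ipaddress.IPv4Address(network)}/{plen}"
    return None


def cluster_clients(client_ips, prefixes):
    """Group client IPs by longest matched prefix; return (clusters, unclustered)."""
    index = build_prefix_index(prefixes)
    clusters = defaultdict(list)
    unclustered = []
    for ip in client_ips:
        prefix = longest_match(ip, index)
        if prefix is None:
            unclustered.append(ip)
        else:
            clusters[prefix].append(ip)
    return dict(clusters), unclustered
```

For example, with the prefix table `["12.0.0.0/8", "12.65.128.0/19", "24.48.0.0/14"]`, the address `12.65.147.94` falls into the `12.65.128.0/19` cluster even though `12.0.0.0/8` also contains it. Scanning one hash set per prefix length keeps the sketch short; a production system would more likely use a trie.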

  9. Server logs used in our experiments

  10. Example: Nagano server log

  11. Example: Nagano server log (cont.)

  12. Validation of clustering
      • Validation is a fundamentally difficult problem
      • A client cluster may be mis-identified by being too large or too small
      • Two approaches
        • nslookup-based test
        • Optimized traceroute-based test
      • Results on a 1% sample of client clusters
        • A client cluster counts as mis-identified if even one client in the cluster does not share the same suffix with the others.
        • Error rate of the network-aware approach: ~10%
        • Error rate of the simple approach: ~50%
      • Possible reasons for mis-clustering: route aggregation, national gateway proxies
      • Effect of BGP prefix changes: < 3% (during 2 weeks)
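The suffix rule used in the nslookup-based test can be illustrated with a small check over already-resolved host names. Several details here are assumptions of this sketch rather than facts from the slides: the suffix is taken to be the last two labels of the name, clients whose names did not resolve (`None`) are ignored, and the DNS lookups themselves are assumed to have been done elsewhere. The function names are hypothetical.

```python
def name_suffix(hostname, labels=2):
    """Return the last `labels` dot-separated labels of a host name."""
    parts = hostname.lower().rstrip(".").split(".")
    return ".".join(parts[-labels:])


def cluster_passes(hostnames, labels=2):
    """A cluster passes only if every resolved client shares one name suffix."""
    suffixes = {name_suffix(h, labels) for h in hostnames if h}
    return len(suffixes) <= 1


def error_rate(sampled_clusters, labels=2):
    """Fraction of sampled clusters that fail the suffix test."""
    if not sampled_clusters:
        return 0.0
    failed = sum(
        1 for names in sampled_clusters.values() if not cluster_passes(names, labels)
    )
    return failed / len(sampled_clusters)
```

With this rule a single outlier is enough to fail a cluster, which matches the strict criterion on the slide: a cluster holding `a.cs.cornell.edu` and `b.cornell.edu` passes, while one holding `a.att.com` and `b.cornell.edu` does not.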

  13. Applications
      • Web caching, content distribution, server replication, traffic management and load balancing, Internet map discovery, etc.
      • Example: Web caching
        • Client classification: normal client, proxy, and spider
        • Identifying spiders/proxies based on access patterns
      [Figure labels: spider, proxy]

  14. Detecting proxy/spider

  15. Thresholding client clusters
      • Metric: number of requests issued from within a client cluster
      • Threshold: 70% of the total requests in the server log
      • Web caching simulation
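One plausible reading of this thresholding step is to rank clusters by request count and keep the busiest ones until they cover 70% of all requests. The sketch below follows that reading; the selection rule, the function name `busy_clusters` and the `coverage` parameter are assumptions of this example.

```python
def busy_clusters(requests_per_cluster, coverage=0.70):
    """Return the busiest clusters that together reach `coverage` of all requests.

    requests_per_cluster maps a cluster identifier (e.g. its prefix) to the
    number of requests issued from within that cluster.
    """
    total = sum(requests_per_cluster.values())
    ranked = sorted(requests_per_cluster.items(), key=lambda kv: kv[1], reverse=True)
    busy, covered = [], 0
    for cluster, count in ranked:
        if covered >= coverage * total:
            break                     # threshold reached; remaining clusters are not busy
        busy.append(cluster)
        covered += count
    return busy
```

For instance, with request counts of 50, 30, 15 and 5 across four clusters, the first two clusters already account for 80 of 100 requests, so only those two are returned.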

  16. New dataset
      • Altavista server log containing 60,011,458 requests issued by 2,503,974 clients all over the world
        • # clusters: 100,091
        • # busy clusters: 242
        • Accuracy: 91%
      • Clustering works on large, general portal site data.
      • Thanks to Altavista for sharing data with us. The data included only client IP addresses with no personally identifiable information.

  17. Conclusion and future work
      • Network-aware client clustering
        • Based on BGP routing table snapshots
        • Ability to cluster > 99% of clients in the server logs
        • Error rate is ~10% (~50% for the simple approach)
        • Immune to BGP dynamics
        • Variety of applications
      • Ongoing work
        • Online algorithm
        • Super/sub clustering
        • Server clustering
        • Server replication application
      • Future work
        • Better validation
        • Lower error rate
        • Other applications

  18. Acknowledgement
      Thanks to the following people for helping us in this project: Jennifer Rexford, Anja Feldmann, Tim Griffin, Bill Manning, Vern Paxson, Craig Labovitz, Thomas Narten, Steven Bellovin, Emden Gansner, Nick Duffield, S. Keshav, and Walter Willinger.
