Using Database Technology to Improve Performance of Web Proxy Servers

K. Cheng¹, Y. Kambayashi¹, M. Mohania² (¹Kyoto University, Japan; ²Western Michigan University, USA)


Presentation Transcript


  1. Using Database Technology to Improve Performance of Web Proxy Servers K. Cheng¹, Y. Kambayashi¹, M. Mohania² ¹Kyoto University, Japan ²Western Michigan University, USA

  2. Caching on web proxy servers
     • Improve throughput of proxy servers
     • Improve response times for end users
     • Bridge the bandwidth gap between WAN and LAN
     • Distribute workload from web servers
     [Figure: clients, proxy server and web servers; the LAN side is labeled higher bandwidth and the WAN side lower bandwidth; direct access is marked with an X.]
     WebDB'2001, Santa Barbara CA

  3. Characteristics of proxy caching

  4. Limitations of current caching schemes: case 1
     • Tom found a very good page “P1” about car models
     • John was also looking for that kind of page, but he only found “P2”
     • Both “P1” and “P2” were cached, but Tom didn’t know about “P2” and John didn’t know about “P1”
     • After several days, however, both were replaced because neither was visited again
     • As a result, Tom missed “P2”, John missed “P1”, and the cache missed 2 hits
     State-of-the-art caching schemes cannot deal with this case!

  5. Limitations of current caching schemes: case 2
     • Suppose the users of a proxy server are mostly interested in “XML” but rarely in “Fuzzy”
     • Suppose some clients retrieved pages “P1” and “P2”
     • After checking the content of “P1” and “P2”, we know “P1” is an “XML” page and “P2” is a “Fuzzy” page
     Should we prefer to cache “P1” or “P2”?

  6. Why can’t current schemes deal with these cases?
     • Physical-object-based cache management
     • Content transparency → low utilization rate (Case 1)
       • Approximately 60% of the data in cache is never used
       • Approximately 90% of the data in cache is rarely used
     • Usage-based object replacement → needlessly long stay time for irrelevant content (Case 2)

  7. Our solution
     • We propose a hierarchical data model for managing web data (physical pages, logical pages and topics)
     • Object replacement based on
       • Link structure (“logical pages”)
       • Semantic similarity with other objects (“topics”)
     • Facilitate active access to cache contents

  8. A hierarchical model for web data
     [Figure: three layers connected by mappings]
     • Topics (T1, T2): topic manager; access by navigating
     • Logical pages (L1, L2, L3): logical page manager; access by searching
     • Physical pages (p1 to p6): physical page manager; access by browsing
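
The three layers and their mappings can be sketched as plain in-memory dictionaries. This is only an illustration: the class and method names (`PhysicalPage`, `LogicalPage`, `Topic`, `CacheModel`, `physical_pages_of_topic`) are assumptions and do not come from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalPage:
    url: str
    size: int  # bytes


@dataclass
class LogicalPage:
    page_id: str
    physical_urls: list = field(default_factory=list)


@dataclass
class Topic:
    topic_id: str
    logical_ids: list = field(default_factory=list)


class CacheModel:
    """Hypothetical stand-in for the three managers: one dictionary per layer."""

    def __init__(self):
        self.physical = {}  # url -> PhysicalPage
        self.logical = {}   # page_id -> LogicalPage
        self.topics = {}    # topic_id -> Topic

    def add_physical(self, url, size):
        self.physical[url] = PhysicalPage(url, size)

    def add_logical(self, page_id, urls):
        # Mapping: logical page -> physical pages (each must already be cached).
        missing = [u for u in urls if u not in self.physical]
        if missing:
            raise KeyError(f"unknown physical pages: {missing}")
        self.logical[page_id] = LogicalPage(page_id, list(urls))

    def add_topic(self, topic_id, logical_ids):
        # Mapping: topic -> logical pages (each must already exist).
        missing = [l for l in logical_ids if l not in self.logical]
        if missing:
            raise KeyError(f"unknown logical pages: {missing}")
        self.topics[topic_id] = Topic(topic_id, list(logical_ids))

    def physical_pages_of_topic(self, topic_id):
        """Follow both mappings downward, listing shared pages only once."""
        urls = []
        for lid in self.topics[topic_id].logical_ids:
            for url in self.logical[lid].physical_urls:
                if url not in urls:
                    urls.append(url)
        return urls
```

A physical page may belong to several logical pages, which is why the downward walk deduplicates.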

  9. Physical pages
     [Figure: two physical pages labeled “A” and “B”. URLs shown: http://www.difa.unibas.it/webdb2001, ../icons/webdblogo.gif, /instructionsPage/index.html]

  10. Logical page
      [Figure: a logical page, with elements labeled A and B]

  11. Managing physical pages
      • Physical page
        • HTML/plain text file (.html, .txt)
        • Embedded media file (.gif, .png, .wav, .mp3)
        • Application-generated file (.pdf, .ps, .doc)
      • Managing physical pages based on
        • URL (protocol, IP, port, path)
        • Physical properties (e.g. size, cost)
        • Usage (frequency, recency)
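
A minimal sketch of a URL-based key for physical pages, using Python's standard `urllib.parse`. The function name `url_key` and the default-port table are assumptions; the sketch keeps the host name as written instead of resolving it to an IP address.

```python
from urllib.parse import urlsplit

# Assumed defaults for URLs that omit the port.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def url_key(url):
    """Return (protocol, host, port, path) for use as a cache key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    path = parts.path or "/"
    return (scheme, host, port, path)
```

For example, `url_key("http://www.difa.unibas.it/webdb2001")` gives `("http", "www.difa.unibas.it", 80, "/webdb2001")`.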

  12. Constructing logical pages
      • Basic logical pages
        • Single multimedia document
        • HTML (1) + embedded media files (1..*)
      • Extended logical pages
        • Several closely related, directly linked pages
        • E.g. an HTML paper with sections in different multimedia documents
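
One way to build a basic logical page is to parse the HTML file and collect the media it embeds. The sketch below uses the standard `html.parser` module; the tag list and the names `EmbeddedMediaParser` and `basic_logical_page` are assumptions, not the authors' method.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedMediaParser(HTMLParser):
    """Collect absolute URLs of media embedded through a src attribute."""

    MEDIA_TAGS = ("img", "embed", "audio", "video", "source")

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.media = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.MEDIA_TAGS:
            return
        for name, value in attrs:
            if name == "src" and value:
                url = urljoin(self.base_url, value)
                if url not in self.media:
                    self.media.append(url)


def basic_logical_page(base_url, html):
    """Return the HTML page URL followed by its embedded media URLs."""
    parser = EmbeddedMediaParser(base_url)
    parser.feed(html)
    return [base_url] + parser.media
```

Hyperlinks (`<a href=...>`) are ignored here; following them would be the starting point for extended logical pages.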

  13. Managing topics
      • Defining a topic
        • Topic = <id, name, criteria, popularity, date, …>
        • Popularity = f(F, R, P, U)
          F: access frequency of the topic
          R: time interval between the last access time and the current time
          P: number of logical pages belonging to the topic
          U: number of users accessing the topic
      • Deciding membership of a logical page in a topic
        • IR approaches (K-NN, …)
        • ML approaches (e.g. Support Vector Machine, SVM)
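
A small sketch of a topic record, a popularity function and K-NN membership. The slides do not give the form of f, so the formula in `popularity` is a placeholder assumption that merely grows with F, P and U and decays with R. The bag-of-words cosine similarity and all names are likewise illustrative.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Topic:
    id: str
    name: str
    frequency: int    # F
    idle_time: float  # R, time since last access
    num_pages: int    # P
    num_users: int    # U


def popularity(t):
    # Placeholder for f(F, R, P, U); the real function is not in the slides.
    return (t.frequency + t.num_pages + t.num_users) / (1.0 + t.idle_time)


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_topic(text, labeled, k=3):
    """Assign a topic by majority vote among the k most similar labeled texts.

    labeled: list of (text, topic_name) pairs.
    """
    query = Counter(text.lower().split())
    scored = sorted(
        ((cosine(query, Counter(doc.lower().split())), topic)
         for doc, topic in labeled),
        reverse=True,
    )
    votes = Counter(topic for _, topic in scored[:k])
    return votes.most_common(1)[0][0]
```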

  14. Definitions
      • We use the term “Priority” for object replacement. It is a function of several parameters, e.g. access frequency (F), time interval (R), size of object (S), retrieval cost (C), significance (G).
      • Significance: importance of the topic

  15. Caching policy: LRU-SP+
      • Topic management
        • Priority = f(F, R, G)
      • Logical page management
        • Basic logical pages only
        • Priority = g(F, R)
      • Physical page management
        • LRU-SP: size-adjusted & popularity-aware LRU (K. Cheng et al., Compsac’00)
        • Priority = h(F, R, S)
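
The slides name the inputs of f, g and h but not their form. The functions below are placeholder assumptions that only respect the expected directions (priority rises with F and G, falls with R and S); they are not the published LRU-SP formula.

```python
def topic_priority(F, R, G):
    """Assumed f(F, R, G): frequent, recent, significant topics rank higher."""
    return G * F / (1.0 + R)


def logical_priority(F, R):
    """Assumed g(F, R): frequent, recent logical pages rank higher."""
    return F / (1.0 + R)


def physical_priority(F, R, S):
    """Assumed h(F, R, S): additionally penalizes large objects (S > 0)."""
    return F / (S * (1.0 + R))
```

With F = 10 and R = 4, a topic of significance 2.0 gets priority 4.0, while a logical page gets 2.0.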

  16. Evaluate & add new objects
      [Figure: topics T1 and T2 on a priority scale (higher to lower), logical pages L1 to L3, and physical pages P10, P11, P12, P20, P21, P22, P30, P31, P40, P41, P42; a new object “D” arrives, annotated: “D” is of higher priority]

  17. Replace an object
      • Choose a candidate topic (T1)
      • T1 has 1 logical page (L1); choose (L1)
      • (L1) has 3 physical pages (P10), (P11), (P12), where (P12) is shared by (L2)
      • Choose a victim (P*) from (P10), (P11)
      • Replace (P*) with the new page
      [Figure: topics T1, T2; logical pages L1 to L3; physical pages P10, P11, P12, P20, P21, P22, P23, P30, P31, P40, P41, P42]
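
The replacement walk can be sketched as a top-down selection. The sketch assumes the candidate at each level is the lowest-priority entry, which the slides suggest but do not state; the function name, data layout and priority values are made up for illustration.

```python
def choose_victim(topics, logical, physical_priority):
    """Pick a physical page to evict.

    topics:            {topic_id: (priority, [logical_id, ...])}
    logical:           {logical_id: (priority, [physical_id, ...])}
    physical_priority: {physical_id: priority}
    """
    # 1. Candidate topic: lowest priority (assumption).
    topic_id = min(topics, key=lambda t: topics[t][0])
    # 2. Candidate logical page within that topic: lowest priority.
    logical_id = min(topics[topic_id][1], key=lambda l: logical[l][0])
    # 3. Physical pages shared with other logical pages are protected.
    shared = {
        p
        for lid, (_, pages) in logical.items()
        if lid != logical_id
        for p in pages
    }
    candidates = [p for p in logical[logical_id][1] if p not in shared]
    if not candidates:
        return None
    # 4. Victim: lowest-priority unshared physical page.
    return min(candidates, key=lambda p: physical_priority[p])


# Example mirroring the slide: P12 is shared by L1 and L2, so it is skipped.
topics = {"T1": (1.0, ["L1"]), "T2": (2.0, ["L2", "L3"])}
logical = {
    "L1": (1.0, ["P10", "P11", "P12"]),
    "L2": (2.0, ["P20", "P21", "P12"]),
    "L3": (3.0, ["P30", "P31"]),
}
priorities = {"P10": 0.5, "P11": 0.2, "P12": 0.1, "P20": 0.9,
              "P21": 0.8, "P30": 0.7, "P31": 0.6}
victim = choose_victim(topics, logical, priorities)
```

Here P12 has the lowest priority of all, yet the victim comes from P10 and P11 because evicting P12 would also break L2.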

  18. Preliminary experiments
      • Replay access logs of our proxy server (Squid)
        • 30 clients, 30 days
        • 873,824 requests, 21.30 GB of data
        • 7 topics, Priority ∈ [1..5]
      • Significance factor (in [0, 2])
        • Measures the significance of each topic
      • Hit Rate (HR)
        • Percentage of requests satisfied by the cache
      • Profit Rate (PR)
        • [formula missing from transcript], where [symbol missing] is the significance of the topic
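
A sketch of the two metrics over a replayed request log. Hit rate follows the definition on the slide. The profit-rate formula is not preserved in the transcript, so `profit_rate` below assumes a significance-weighted hit rate; treat it as a guess at the intent, not the authors' definition. Both functions return fractions, not percentages.

```python
def hit_rate(requests):
    """Fraction of requests satisfied by the cache.

    requests: list of (topic_name, was_hit) pairs.
    """
    if not requests:
        return 0.0
    return sum(1 for _, hit in requests if hit) / len(requests)


def profit_rate(requests, significance):
    """Assumed definition: hits weighted by the significance of their topic."""
    total = sum(significance[topic] for topic, _ in requests)
    if total == 0:
        return 0.0
    gained = sum(significance[topic] for topic, hit in requests if hit)
    return gained / total
```

Under this assumption, a miss on a highly significant topic costs more profit than a miss on an insignificant one, even though both count equally toward the hit rate.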

  19. Baseline algorithm: LRV (Rizzo et al., 1998)
      • A physical-page-based algorithm
      • Uses size (S) to predict further accesses to incoming objects
      • Parameters considered
        • Access frequency (F)
        • Time interval (R)
        • Size of objects (S)

  20. Results: Hit Rates
      [Figure: hit rate versus cache space (in % of total unique data); annotation: 20% up]

  21. Results: Profit Rates
      [Figure: profit rate versus cache space (in % of total unique data); annotation: 30% up]

  22. Conclusion and future work
      • The performance of caching proxies can be remarkably improved if cache contents are well organized and managed
      • Proposed a hierarchical model and a cache management scheme based on that model
      • Future work
        • Tuning various parameters to achieve better performance (logical page clustering, priority balancing of significance and popularity, etc.)
        • More experiments
