
Adaptive Web Sites: Automatically Synthesizing Web Pages

Adaptive Web Sites: Automatically Synthesizing Web Pages. Mike Perkowitz and Oren Etzioni www.cs.washington.edu/homes/map/adaptive/. Adaptive Web Sites. Web sites that automatically reconfigure their organization and presentation by learning from user access patterns.

Presentation Transcript


  1. Adaptive Web Sites: Automatically Synthesizing Web Pages Mike Perkowitz and Oren Etzioni www.cs.washington.edu/homes/map/adaptive/

  2. Adaptive Web Sites Web sites that automatically reconfigure their organization and presentation by learning from user access patterns. (Perkowitz & Etzioni, IJCAI’97)

  3. Adaptive Web Sites • Individual Customization: site learns you like sports • Group Transformation: site learns most sports lovers also read “Tank McNamara” and cross-links them

  4. Group Transformations • Our approach: history-based • Previously: Simple transformations (Perkowitz & Etzioni, WWW6) • Goal: change in view

  5. machines.hyperreal.org

  6. Drum Machine Samples

  7. Index Page Synthesis Find groups of related documents at the site and create new pages linking to those documents. • Input: web site, access log • Output: pages of links to related pages
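
As a rough illustration of the output side of this slide, here is a minimal sketch that turns one group of related URLs into a page of links. The function name `render_index_page` and the example URLs are hypothetical; using the raw URL as the link label is only a stand-in, since titling and labeling are listed as open questions on the next slide.

```python
from html import escape

def render_index_page(title, urls):
    """Return a minimal HTML page with one link per URL (labels are just the URLs)."""
    links = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in urls
    )
    return (
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body><h1>{escape(title)}</h1>\n<ul>\n{links}\n</ul></body></html>"
    )

# Hypothetical cluster of related documents; the title is borrowed from slide 6.
page = render_index_page(
    "Drum Machine Samples",
    ["/samples/kick.html", "/samples/snare.html"],
)
print(page)
```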

  8. Questions • What links are on the index page? • How are the contents ordered? • What is the title? • How are links labeled? • How do we make the index comprehensive?

  9. Outline • Motivation • Plausible approaches • Clustering • Frequent sets • Our approach: Cluster Mining • Algorithm: PageGather • Evaluation

  10. Clustering (Voorhees-86, Willett-88, Rasmussen-92) • Similarity metric over documents • Cluster: items close together, far from others Algorithms: • Hierarchical Agglomerative Clustering (HAC) • K-means clustering

  11. Clustering Visit: set of pages accessed by an individual • Document = page • Similarity = co-occurrence in visits • Cluster → index page contents
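
The similarity measure on this slide can be sketched as a pairwise co-occurrence count over visits. This is a minimal sketch, assuming each visit is already available as a set of page paths; the function name and the example paths are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(visits):
    """Count, for each unordered pair of pages, how many visits contain both."""
    counts = Counter()
    for visit in visits:
        # Sort so each pair is always stored in the same (a, b) order.
        for a, b in combinations(sorted(set(visit)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical visits: each is the set of pages one visitor accessed.
visits = [
    {"/drums/", "/samples/", "/synths/"},
    {"/drums/", "/samples/"},
    {"/synths/", "/effects/"},
]
counts = cooccurrence_counts(visits)
print(counts[("/drums/", "/samples/")])
```

A clustering algorithm such as HAC or K-means would then operate on a similarity derived from these counts.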

  12. Clustering: Problems • Clustering induces a partition over data • Clustering can be slow

  13. Frequent Sets (Agrawal, Imielinski, & Swami-93) • Set of transactions: each a “basket” of items • Find all frequently occurring itemsets Algorithm: • Apriori

  14. Frequent Sets Visit: set of pages accessed by an individual • Item = page • Transaction = visit • Frequent set → index page contents

  15. Frequent Sets: Problems • “Frequent Item Problem” • Finds many similar itemsets • Low minimum frequency → high running time
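
The frequent-set formulation (item = page, transaction = visit) can be sketched with a small level-wise, Apriori-style search. This is a simplified illustration rather than an optimized implementation; `min_support` (the minimum number of visits) and the page names are assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        k += 1
        # Join step: unions of two frequent (k-1)-sets that contain exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if every (k-1)-subset is frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return result

# Hypothetical visits over four pages.
visits = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"}, {"C", "D"}]
sets_found = frequent_itemsets(visits, min_support=2)
for itemset, count in sorted(sets_found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Even on this toy input the search reports {A, B}, {A, C}, {B, C}, and {A, B, C} separately, which hints at the "many similar itemsets" problem named on the slide.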

  16. Idea: Cluster Mining • Find only high-quality clusters • Not a partition • Clusters may overlap

  17. The PageGather Algorithm • Graph-based representation • Nodes: pages • Edges: if P(P1|P2) and P(P2|P1) are both high • Fast and accurate
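
The edge rule on this slide can be sketched as follows. The slide only says the two conditional probabilities must be "high", so the numeric `threshold` is an assumption, as are the function name and the example visits. Each probability is estimated as the number of visits containing both pages divided by the number of visits containing the conditioning page.

```python
from collections import Counter
from itertools import combinations

def build_graph(visits, threshold):
    """Link two pages when both P(a|b) and P(b|a) are at least `threshold`."""
    page_count = Counter()
    pair_count = Counter()
    for visit in visits:
        pages = sorted(set(visit))
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))

    graph = {page: set() for page in page_count}
    for (a, b), both in pair_count.items():
        p_a_given_b = both / page_count[b]
        p_b_given_a = both / page_count[a]
        # Requiring the smaller of the two to be high enforces the "and".
        if min(p_a_given_b, p_b_given_a) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical visits; "home" is popular, "effects" is rare.
visits = [
    {"drums", "samples"},
    {"drums", "samples"},
    {"drums", "samples", "home"},
    {"home", "effects"},
    {"home"},
]
graph = build_graph(visits, threshold=0.5)
print(graph)
```

In this example "effects" always appears with "home", but "home" rarely appears with "effects", so no edge is drawn between them.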

  18. [Figure: the PageGather pipeline, illustrated with course-page paths such as /96/Autumn/Final/ and /97/Spring/Midterm/ and with sample pipe-delimited access-log lines from www.hyperreal.org and www.apache.org dated 1997/07/03. Stages: Log → Visits → Co-occurrence → Graph → Clique/CC → New Page]
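
The first stage of this pipeline, turning a log into visits, might look like the sketch below. It assumes pipe-delimited log lines in which the second field is the client host, the third is the request line, and the fourth is the content type; these positions are inferred from the sample lines, not documented. Grouping purely by client host and ignoring timestamps is a simplification, and the function name is hypothetical.

```python
from collections import defaultdict

def visits_from_log(lines):
    """Group requested HTML paths by client host (one 'visit' per host)."""
    visits = defaultdict(set)
    for line in lines:
        fields = line.split("|")
        if len(fields) < 6:
            continue  # skip malformed or truncated lines
        client, request, content_type = fields[1], fields[2], fields[3]
        parts = request.split()
        # Keep only page views; skip images and other embedded resources.
        if len(parts) < 2 or content_type != "text/html":
            continue
        visits[client].add(parts[1])
    return dict(visits)

# Sample lines in the slide's format (user-agent fields shortened here).
sample = [
    "www.apache.org|blizzard-ext.wise.edt.ericsson.se|GET /related_projects.html HTTP/1.0"
    "|text/html|200|1997/07/03-23:59:09|-|5047|-|-|http://www.apache.org/|Mozilla/3.01Gold",
    "www.apache.org|blizzard-ext.wise.edt.ericsson.se|GET /images/apache_sub.gif HTTP/1.0"
    "|image/gif|200|1997/07/03-23:59:10|-|6083|-|-|http://www.apache.org/related_projects.html|Mozilla/3.01Gold",
    "www.apache.org|r2d2.dd.dk|GET /docs/ HTTP/1.0"
    "|text/html|200|1997/07/03-23:59:11|-|2207|-|-|http://www.apache.org/|Mozilla/2.0",
]
log_visits = visits_from_log(sample)
print(log_visits)
```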

  19. PageGather • Implement with Cliques or CCs • Find all candidates, return best • Clique: maximal cliques of size  k • Clique and CC versions comparable in time and performance
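
The two ways of finding candidate clusters named on this slide can be sketched on a small graph stored as an adjacency dictionary. Connected components use a depth-first search; maximal cliques use the basic Bron-Kerbosch recursion, chosen here for brevity and not necessarily the procedure the authors used. The size bound and the "return best" ranking from the slide are not implemented, and the example graph is hypothetical.

```python
def connected_components(graph):
    """Return the connected components of an undirected graph as sets of nodes."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

def maximal_cliques(graph):
    """Return all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in list(candidates):
            expand(current | {node}, candidates & graph[node], excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    return cliques

# Hypothetical page graph: a triangle a-b-c, a tail c-d, and an isolated page e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
components = connected_components(graph)
cliques = maximal_cliques(graph)
print(sorted(sorted(c) for c in components))
print(sorted(sorted(c) for c in cliques))
```

On this graph the component {a, b, c, d} is larger than either clique inside it, which matches the later observation that cliques tend to yield smaller clusters than connected components.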

  20. Experiments machines.hyperreal.org • Site gets ~1200 visitors/day (10k hits) • Site contains ~2500 distinct documents • Training: a month of access data • Testing: ten days of data

  21. Performance Metric Are index pages helpful to users? • How well do clusters predict user navigation? • Q(C) = Given that a user visits one page in cluster C, how likely is she to visit any other?
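
One way to estimate Q(C) from test visits is sketched below. It reads the slide's definition as: among visits that touch at least one page of C, the fraction that touch at least two. That reading, the function name, and the example data are assumptions.

```python
def cluster_quality(cluster, visits):
    """Estimate Q(C): P(visit has >= 2 pages of C | visit has >= 1 page of C)."""
    cluster = set(cluster)
    overlaps = [len(cluster & set(visit)) for visit in visits]
    touched = sum(1 for n in overlaps if n >= 1)
    if touched == 0:
        return 0.0  # no test visit reached the cluster
    return sum(1 for n in overlaps if n >= 2) / touched

# Hypothetical cluster and held-out test visits.
cluster = {"drums", "samples", "synths"}
test_visits = [
    {"drums", "samples"},
    {"drums"},
    {"synths", "effects"},
    {"effects"},
    {"drums", "samples", "synths"},
]
q = cluster_quality(cluster, test_visits)
print(q)
```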

  22. Cluster Mining vs. Clustering PageGather using • Clique → 10 clusters, 1:05 min • HAC → 10 clusters, 48+ hours • K-means → 10 clusters, 3:35 min

  23. Cluster Mining vs. Clustering PageGather using • Clique → 10 clusters, 1:05 min • HAC → 10 clusters, 48+ hours • K-means → 10 clusters, 3:35 min • HAC* → 8 clusters, 21:55 min (threshold, less data, mining)

  24. Cluster Mining vs. Clustering PageGather using • Clique → 10 clusters, 1:05 min • HAC → 10 clusters, 48+ hours • K-means → 10 clusters, 3:35 min • HAC* → 7 clusters, 293:08 min (threshold, less data, mining)

  25. Cluster Mining vs. Clustering [Chart: Q for the top 10 clusters]

  26. Cluster Mining vs. Clustering [Chart: Q for the top 10 clusters]

  27. Cluster Mining vs. Clustering [Chart: Q for the top 10 clusters]

  28. PageGather vs. Frequent Sets • PG/Clique → 10 clusters, 1:05 min • Apriori → 10 frequent sets, 1:41 min

  29. PageGather vs. Frequent Sets [Chart: Q for the top 10 clusters]

  30. Contributions • Motivating problem: Web page synthesis • Method: Cluster mining • well suited for discovery of coherent sets • comparison to clustering, frequent sets • Algorithm: PageGather • graph-based, fast and accurate

  31. Clique vs. Conn-component [Chart: Q for the top 10 clusters]

  32. Clique vs. Conn-component • Comparable accuracy • Clique finds fewer, smaller clusters than CC • Clique: more accurate (at first) • Comparable running time (in practice)

  33. Future Directions • Meta-Information to improve coherence • Conceptual clustering • Improve coherence • Naming pages • Cluster mining to generate association rules
