WEB USAGE MINING NEGATIVE-ASSOCIATION • S. Vignesh, 1HK07CS073, HKBKCE
Web Mining • Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services • It discovers useful information from the World Wide Web and its usage patterns • My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
Web Mining • Data Mining Techniques • Association rules • Sequential patterns • Classification • Clustering • Outlier discovery • Applications to the Web • E-commerce • Information retrieval (search) • Network management
Examples of Discovered Patterns • Association rules • 98% of AOL users also have E-trade accounts • Classification • People with age less than 40 and salary > 40k trade on-line • Clustering • Users A and B access similar URLs • Outlier Detection • User A spends more than twice the average amount of time surfing on the Web
Web Mining • The WWW is a huge, widely distributed, global information service centre for • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc. • Hyper-link information • Access and usage information • The WWW provides rich sources of data for data mining
Why Mine the Web? • Enormous wealth of information on Web • Financial information (e.g. stock quotes) • Book/CD/Video stores (e.g. Amazon) • Restaurant information (e.g. Zagats) • Car prices (e.g. Carpoint) • Lots of data on user access patterns • Web logs contain sequence of URLs accessed by users • Possible to mine interesting nuggets of information • People who ski also travel frequently to Europe • Tech stocks have corrections in the summer and rally from November until February
Why is Web Mining Different? • The Web is a huge collection of documents, but unlike a plain document collection it also carries • Hyper-link information • Access and usage information • The Web is very dynamic • New pages are constantly being generated • Challenge: Develop new Web mining algorithms and adapt traditional data mining algorithms to • Exploit hyper-links and access patterns • Be incremental
Web Mining Applications • E-commerce (Infrastructure) • Generate user profiles • Targeted advertising • Fraud • Similar image retrieval • Information retrieval (Search) on the Web • Automated generation of topic hierarchies • Web knowledge bases • Extraction of schema for XML documents • Network Management • Performance management • Fault management
User Profiling • Important for improving customization • Provide users with pages and advertisements of interest • Example profiles: on-line trader, on-line shopper • Generate user profiles based on their access patterns • Cluster users based on frequently accessed URLs • Use a classifier to generate a profile for each cluster • Engage Technologies • Tracks web traffic to create anonymous user profiles of Web surfers • Has profiles for more than 35 million anonymous users
Internet Advertising • Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites • Plenty of startups are doing internet advertising • Doubleclick, AdForce, Flycast, AdKnowledge • Internet advertising is probably the “hottest” web mining application today
Internet Advertising • Scheme 1: • Manually associate a set of ads with each user profile • For each user, display an ad from the set based on the user's profile • Scheme 2: • Automate the association between ads and users • Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on) • For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster
Internet Advertising • Use collaborative filtering (e.g. Likeminds, Firefly) • Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought etc.) • Rij: rating of user Ui for ad Aj • Problem: Compute user Ui’s rating for an unrated ad Aj • [Figure: user-by-ad rating matrix over ads A1, A2, A3, with the unrated entry marked “?”]
Internet Advertising • Key Idea: User Ui’s rating for ad Aj is set to Rkj, where Uk is the user whose ad ratings are most similar to Ui’s • User Ui’s rating for an ad Aj that has not previously been displayed to Ui is computed as follows: • Consider each user Uk who has rated ad Aj • Compute Dik, the distance between Ui’s and Uk’s ratings on the ads both have rated • Ui’s rating for ad Aj = Rkj, where Uk is the user with the smallest Dik • Display to Ui the ad Aj with the highest computed rating
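The nearest-neighbour rating scheme above can be sketched in a few lines of Python. This is only an illustration: the slides do not say how Dik is measured, so Euclidean distance over commonly rated ads is an assumption, and the `ratings` data and all function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings R[user][ad]; a missing key means the ad is unrated.
ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 5, "A2": 1, "A3": 4},
    "U3": {"A1": 1, "A2": 5, "A3": 2},
}

def distance(ri, rk):
    """D_ik: Euclidean distance over the ads both users rated (assumed metric)."""
    common = ri.keys() & rk.keys()
    if not common:
        return float("inf")  # no overlap: treat the users as maximally distant
    return sqrt(sum((ri[a] - rk[a]) ** 2 for a in common))

def predict(ratings, ui, aj):
    """Return R_kj of the user U_k closest to U_i among those who rated A_j."""
    candidates = [uk for uk, r in ratings.items() if uk != ui and aj in r]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda uk: distance(ratings[ui], ratings[uk]))
    return ratings[nearest][aj]

def best_ad(ratings, ui):
    """Pick the unrated ad with the highest predicted rating for U_i."""
    all_ads = {a for r in ratings.values() for a in r}
    scored = {a: predict(ratings, ui, a) for a in all_ads - ratings[ui].keys()}
    scored = {a: s for a, s in scored.items() if s is not None}
    return max(scored, key=scored.get) if scored else None

print(predict(ratings, "U1", "A3"), best_ad(ratings, "U1"))
```

Here U2 agrees with U1 on both common ads, so U1 inherits U2's rating for A3. Production systems usually average over several neighbours instead of copying a single one.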
Fraud • With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web are becoming important • Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought) • If the buying pattern changes significantly, then signal fraud • HNC Software uses domain knowledge and neural networks for credit card fraud detection
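As a rough illustration of the signature idea, the sketch below keeps the mean and standard deviation of a user's past purchase amounts and flags a purchase that deviates strongly from them. The z-score test, the threshold of 3, and the sample amounts are assumptions made for this example; the slides do not specify how a "significant" change is measured.

```python
from statistics import mean, stdev

def build_signature(amounts):
    """Summarize past purchase amounts (needs at least two values)."""
    return {"mean": mean(amounts), "stdev": stdev(amounts)}

def is_suspicious(signature, amount, threshold=3.0):
    """Flag a purchase more than `threshold` standard deviations from the mean."""
    spread = signature["stdev"] or 1e-9  # guard against a zero spread
    return abs(amount - signature["mean"]) / spread > threshold

# Hypothetical purchase history for one user.
history = [20.0, 25.0, 22.0, 30.0, 18.0]
sig = build_signature(history)

print(is_suspicious(sig, 24.0))   # typical purchase
print(is_suspicious(sig, 500.0))  # far outside the usual range
```

A real signature would also cover item categories, merchants, and timing, and would be updated as new legitimate purchases arrive.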
Retrieval of Similar Images • Given: • A set of images • Find: • All images similar to a given image • All pairs of similar images • Sample applications: • Medical diagnosis • Weather prediction • Web search engine for images • E-commerce
Retrieval of Similar Images • QBIC, Virage, Photobook • Compute feature signature for each image • QBIC uses color histograms • WBIIS, WALRUS use wavelets • Use spatial index to retrieve database image whose signature is closest to the query’s signature • WALRUS decomposes an image into regions • A single signature is stored for each region • Two images are considered to be similar if they have enough similar region pairs
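A color-histogram signature of the kind attributed to QBIC above can be sketched as follows. This is not QBIC's actual implementation: the 4-bins-per-channel quantization, the L1 distance, the linear scan in place of a spatial index, and the tiny pixel lists are all assumptions chosen to keep the example short.

```python
def color_histogram(pixels, bins=4):
    """Normalized histogram over bins**3 quantized RGB colors (non-empty input)."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins-1.
        idx = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
        hist[idx] += 1
    return [h / len(pixels) for h in hist]

def l1_distance(h1, h2):
    """Sum of absolute differences between two signatures."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def most_similar(query_pixels, database):
    """Linear scan for the database image whose signature is closest to the query's."""
    q = color_histogram(query_pixels)
    return min(database, key=lambda name: l1_distance(q, color_histogram(database[name])))

# Hypothetical images given as short lists of (R, G, B) pixels.
database = {
    "sunset": [(250, 40, 30)] * 3 + [(200, 100, 20)],
    "forest": [(20, 200, 40)] * 4,
}
query = [(240, 50, 35)] * 4

print(most_similar(query, database))
```

A whole-image histogram ignores where colors appear, which is the limitation that region-based signatures such as those described for WALRUS are meant to address.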
[Figure: a query image and the images retrieved for it by WALRUS]
Problems with Web Search Today • Today’s search engines are plagued by problems: • The abundance problem (99% of info of no interest to 99% of people) • Limited coverage of the Web (internet sources hidden behind search interfaces) • Largest crawlers cover < 18% of all web pages • Limited query interface based on keyword-oriented search • Limited customization to individual users
Problems with Web Search Today (continued) • Today’s search engines are plagued by problems: • The Web is highly dynamic • Many pages are added, removed, and updated every day • Very high dimensionality
Improve Search By Adding Structure to the Web • Use Web directories (or topic hierarchies) • Provide a hierarchical classification of documents (e.g., Yahoo!) • A search performed in the context of a topic is restricted to the subset of web pages related to that topic • [Figure: topic hierarchy under the Yahoo home page, with node labels Recreation, Business, Science, News, Travel, Sports, Companies, Finance, Jobs]
Automatic Creation of Web Directories • In the Clever project, hyper-links between Web pages are taken into account when categorizing them • Use a Bayesian classifier • Exploit knowledge of the classes of the immediate neighbors of the document to be classified • Show that simply taking text from neighbors and using standard document classifiers to classify the page does not work • Inktomi’s Directory Engine uses “Concept Induction” to automatically categorize millions of documents
Network Management • Objective: To deliver content to users quickly and reliably • Traffic management • Fault management • [Figure: service provider network with routers and servers]
Why is Traffic Management Important? • While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three • The result is frequent congestion at servers and on network links • During a major event (e.g., Princess Diana’s death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world • Olympic sites during the games • NASA sites close to the launch and landing of shuttles
Traffic Management • Key Ideas • Dynamically replicate/cache content at multiple sites within the network and closer to the user • Multiple paths between any pair of sites • Route user requests to server closest to the user or least loaded server • Use path with least congested network links • Akamai, Inktomi
Traffic Management • [Figure: a request travelling through the service provider network of routers and servers, showing a congested link and a congested server]
Traffic Management • Need to mine network and Web traffic to determine • What content to replicate? • Which servers should store replicas? • Which server to route a user request to? • What path to use to route packets? • Network design issues • Where to place servers? • Where to place routers? • Which routers should be connected by links? • One can use association rule and sequential pattern mining algorithms to cache/prefetch replicas at servers
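To make the association-rule idea concrete, here is a minimal sketch that mines page-pair rules from user sessions; a rule A → B with high confidence suggests prefetching B when A is requested. The session data, the thresholds, and the restriction to single-page antecedents are assumptions for illustration, and this brute-force count is not a full Apriori implementation.

```python
from collections import Counter
from itertools import permutations

def pair_rules(sessions, min_support=0.5, min_confidence=0.7):
    """Return {(a, b): (support, confidence)} for rules a -> b over page pairs."""
    n = len(sessions)
    item_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = set(session)  # count each page once per session
        item_count.update(pages)
        pair_count.update(permutations(sorted(pages), 2))  # ordered pairs (a, b)
    rules = {}
    for (a, b), count in pair_count.items():
        support = count / n                # fraction of sessions containing a and b
        confidence = count / item_count[a]  # estimate of P(b | a)
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules

# Hypothetical sessions, each a list of URLs requested by one user.
sessions = [
    ["/home", "/news", "/sports"],
    ["/home", "/news"],
    ["/home", "/jobs"],
    ["/news", "/sports"],
]

for (a, b), (sup, conf) in pair_rules(sessions).items():
    print(f"{a} -> {b}: support={sup:.2f}, confidence={conf:.2f}")
```

With these sessions only `/sports -> /news` passes both thresholds, so a cache could prefetch `/news` whenever `/sports` is requested.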
Fault Management • Fault management involves • Quickly identifying failed/congested servers and links in network • Re-routing user requests and packets to avoid congested/down servers and links • Need to analyze alarm and traffic data to carry out root cause analysis of faults • Bayesian classifiers can be used to predict the root cause given a set of alarms
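The Bayesian classification step above might look like the following sketch, which trains a small Bernoulli naive Bayes model on past incidents and predicts the most likely root cause for a new set of alarms. The alarm names, causes, training incidents, and the use of Laplace smoothing are all hypothetical choices for this example.

```python
from collections import Counter, defaultdict
from math import log

def train(incidents):
    """incidents: iterable of (set_of_alarms, root_cause) pairs."""
    cause_count = Counter()
    alarm_count = defaultdict(Counter)
    alarms_seen = set()
    for alarms, cause in incidents:
        cause_count[cause] += 1
        for alarm in set(alarms):
            alarm_count[cause][alarm] += 1
            alarms_seen.add(alarm)
    return cause_count, alarm_count, alarms_seen

def predict_cause(model, alarms):
    """Return the cause with the highest posterior log-probability."""
    cause_count, alarm_count, alarms_seen = model
    total = sum(cause_count.values())
    observed = set(alarms)
    best, best_score = None, float("-inf")
    for cause, n in cause_count.items():
        score = log(n / total)  # log prior of the cause
        for alarm in alarms_seen:
            # P(alarm fires | cause), with Laplace smoothing to avoid zeros.
            p = (alarm_count[cause][alarm] + 1) / (n + 2)
            score += log(p if alarm in observed else 1 - p)
        if score > best_score:
            best, best_score = cause, score
    return best

# Hypothetical history of alarm sets and their diagnosed root causes.
incidents = [
    ({"link_down", "timeout"}, "link_failure"),
    ({"link_down"}, "link_failure"),
    ({"high_cpu", "timeout"}, "server_overload"),
    ({"high_cpu"}, "server_overload"),
]
model = train(incidents)

print(predict_cause(model, {"high_cpu", "timeout"}))
```

Naive Bayes treats alarms as independent given the cause, which rarely holds exactly in a network but keeps training and prediction cheap.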
Web Mining Issues • Size • Grows at about 1 million pages a day • Google indexes 9 billion documents • Number of web sites • Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html) • Diverse types of data • Images • Text • Audio/video • XML • HTML
[Figure: number of active sites and total sites across all domains, August 1995 - October 2007]
Systems Issues • Web data sets can be very large • Tens to hundreds of terabytes • Cannot mine on a single server! • Need large farms of servers • How to organize hardware/software to mine multi-terabyte data sets • Without breaking the bank!
Different Data Formats • Structured Data • Unstructured Data • OLE DB offers some solutions!
Web Data • Web pages • Intra-page structures • Inter-page structures • Usage data • Supplemental data • Profiles • Registration information • Cookies
Web Usage Mining • Pages contain information • Links are ‘roads’ • How do people navigate the Internet? • Web Usage Mining (clickstream analysis) • Information on navigation paths is available in log files • Logs can be mined from a client or a server perspective
Website Usage Analysis • Why analyze Website usage? • Knowledge about how visitors use a Website could • Provide guidelines for web site reorganization; help prevent disorientation • Help designers place important information where the visitors look for it • Support pre-fetching and caching of web pages • Enable adaptive Websites (personalization) • Questions which could be answered • What are the differences in usage and access patterns among users? • Which user behaviors change over time? • How do usage patterns change with quality of service (slow/fast)? • What is the distribution of network traffic over time?
Website Usage Analysis • Analog – Web Log File Analyser • Gives basic statistics such as • number of hits • average hits per time period • what are the popular pages in your site • who is visiting your site • what keywords are users searching for to get to you • what is being downloaded • http://www.analog.cx/
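The basic statistics listed above can be computed directly from a server log. The sketch below assumes the log is in NCSA Common Log Format, which the slides do not state; the sample lines are invented, and this is not how Analog itself is implemented.

```python
import re
from collections import Counter

# One Common Log Format entry: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def summarize(lines):
    """Return (total hits, hits per page, hits per visiting host)."""
    hits = 0
    pages = Counter()
    visitors = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not well-formed log entries
        hits += 1
        pages[match["path"]] += 1
        visitors[match["host"]] += 1
    return hits, pages, visitors

# Hypothetical log lines, including one malformed entry.
sample = [
    '10.0.0.1 - - [10/Oct/2007:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2007:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2007:13:57:12 -0700] "GET /products.html HTTP/1.0" 200 512',
    'not a log line',
]
hits, pages, visitors = summarize(sample)

print(hits, pages.most_common(1), visitors.most_common(1))
```

Search keywords and referrers are only available in the extended (combined) log format, so they are left out of this sketch.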
Web Mining Outline Goal: Examine the use of data mining on the World Wide Web • Web Content Mining • Web Structure Mining • Web Usage Mining
Web Mining Taxonomy Modified from [zai01]
Web Content Mining • Examine the contents of web pages as well as result of web searching • Can be thought of as extending the work performed by basic search engines • Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users • Web Content Mining is: the process of extracting knowledge from web contents
Semi-structured Data • Content is, in general, semi-structured • Example: • Title • Author • Publication_Date • Length • Category • Abstract • Content
Structuring Textual Data • Many methods designed to analyze structured data • If we can represent documents by a set of attributes we will be able to use existing data mining methods • How to represent a document? • Vector based representation (referred to as “bag of words” as it is invariant to permutations) • Use statistics to add a numerical dimension to unstructured text
Document Representation • A document representation aims to capture what the document is about • One possible approach: • Each entry describes a document • Attributes describe whether or not a term appears in the document
Document Representation • Another approach: • Each entry describes a document • Attributes represent the frequency with which a term appears in the document
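Both representations can be illustrated with a short sketch: one vector records whether each vocabulary term appears in a document, the other records how often. Whitespace tokenization and the two sample documents are simplifying assumptions.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def build_vocabulary(docs):
    """Sorted list of every distinct term in the collection."""
    return sorted({term for doc in docs for term in tokenize(doc)})

def binary_vector(doc, vocab):
    """1 if the term appears in the document, else 0."""
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocab]

def frequency_vector(doc, vocab):
    """Number of times each term appears in the document."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]

# Hypothetical two-document collection.
docs = ["web mining finds web patterns", "usage mining reads logs"]
vocab = build_vocabulary(docs)

print(vocab)
print(binary_vector(docs[0], vocab))
print(frequency_vector(docs[0], vocab))
```

The two vectors for the first document differ only in the entry for "web", which appears twice. Word order is discarded in both, which is why the representation is called a bag of words.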
Document Representation • Stop word removal: many words are not informative and thus irrelevant for document representation • the, and, a, an, is, of, that, … • Stemming: reducing words to their root form (reduces dimensionality) • A document may contain several occurrences of words like fish, fishes, fisher, and fishers, yet without stemming it would not be retrieved by a query with the keyword “fishing” • Different words that share the same word stem should be represented by that stem (“fish”) instead of the actual word
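A minimal sketch of both preprocessing steps follows. The stop-word list is the one given on the slide; the suffix-stripping stemmer is a toy written for this example and is far cruder than a real algorithm such as Porter's.

```python
# Stop words taken from the slide's own list.
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

# Hypothetical suffix list, ordered so longer suffixes are tried first.
SUFFIXES = ("ers", "ing", "es", "er", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining terms."""
    return [stem(term) for term in text.lower().split() if term not in STOP_WORDS]

print([stem(w) for w in ["fish", "fishes", "fisher", "fishers", "fishing"]])
print(preprocess("The fishers and the fisher is fishing"))
```

After this step a query for "fishing" and a document containing "fishers" share the term "fish", so the document can be matched.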