
Web Mining Issues



Presentation Transcript


  1. Web Mining Issues • Size • >350 million pages • Grows at about 1 million pages a day • Diverse types of data

  2. Web Mining Taxonomy

  3. Crawlers • Robot (spider) traverses the hypertext structure of the Web. • Collects information from visited pages • Used to construct indexes for search engines • Traditional Crawler – visits entire Web (?) and replaces index • Periodic Crawler – visits portions of the Web and updates subset of index • Incremental Crawler – selectively searches the Web and incrementally modifies index • Focused Crawler – visits pages related to a particular subject
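A traditional crawler's traversal can be sketched as a breadth-first walk over the Web's link structure. The sketch below stands in for live page fetching with a small in-memory link graph; all page names are hypothetical:

```python
from collections import deque

# Toy link graph standing in for the Web (hypothetical page names).
web = {
    "index": ["about", "news"],
    "about": ["index"],
    "news": ["index", "archive"],
    "archive": [],
}

def crawl(start, get_links):
    """Breadth-first traversal: visit pages, collect them, follow links."""
    visited = []
    seen = {start}
    frontier = deque([start])
    while frontier:
        page = frontier.popleft()
        visited.append(page)            # "collect information" from the page
        for link in get_links(page):    # follow outgoing links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl("index", web.get)
```

A real crawler would fetch and parse HTML to find the outgoing links; the queue-based traversal logic is the same.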

  4. Focused Crawler • Classifier also determines how useful outgoing links are

  5. Focused Crawler

  6. Personalization • Web access or contents tuned to better fit the desires of each user. • Manual techniques identify user’s preferences based on profiles or demographics. • Collaborative filtering identifies preferences based on ratings from similar users. • Content based filtering retrieves pages based on similarity between pages and user profiles.

  7. PageRank • Used by Google • Prioritizes pages returned from search by looking at Web structure. • Importance of a page is calculated based on the number of pages which point to it – Backlinks. • Weighting is used to give more importance to backlinks coming from important pages.

  8. PageRank (cont’d) • PR(p) = c (PR(p1)/N1 + … + PR(pn)/Nn) • PR(pi): PageRank of a page pi which points to target page p • Ni: number of links coming out of page pi • With rank source E: R = cAR + cE
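The formula above can be computed iteratively until the ranks converge. The sketch below uses the common damping-factor reading of the equation, with a uniform rank source E and a hypothetical three-page link graph:

```python
# links[p] lists the pages that p points to (hypothetical graph).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, c=0.85, iters=50):
    """Iterate R = cAR + cE with a uniform rank source E."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}           # start from a uniform rank
    for _ in range(iters):
        new = {p: (1 - c) / n for p in pages}  # uniform rank source E
        for p, outs in links.items():
            share = c * pr[p] / len(outs)      # PR(p)/N_p spread over outlinks
            for q in outs:
                new[q] += share
        pr = new
    return pr

ranks = pagerank(links)
```

Page C receives backlinks from both A and B, so it ends up with a higher rank than B, which receives only half of A's rank.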

  9. CLEVER • Identify authoritative and hub pages. • Authoritative Pages : • Highly important pages. • Best source for requested information. • Hub Pages : • Contain links to highly important pages.

  10. Web Usage Mining Applications • Personalization • Improve structure of a site’s Web pages • Aid in caching and prediction of future page references • Improve design of individual pages • Improve effectiveness of e-commerce (sales and advertising)

  11. Web Usage Mining Activities • Preprocessing Web log • Cleanse • Remove extraneous information • Sessionize • Session: sequence of pages referenced by one user at one sitting. • Pattern Discovery • Count patterns that occur in sessions • Pattern is a sequence of page references in a session. • Similar to association rules • Transaction: session • Itemset: pattern (or subset) • Order is important • Pattern Analysis

  12. Web Usage Mining Issues • Identification of exact user not possible. • Exact sequence of pages referenced by a user not possible due to caching. • Session not well defined • Security, privacy, and legal issues

  13. Web Log Cleansing • Replace source IP address with a unique but non-identifying ID. • Replace exact URL of pages referenced with a unique but non-identifying ID. • Delete error records and records not containing page data (such as figures and code)
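The cleansing steps above can be sketched as below; the log-record layout (IP, URL, status code) and the file-extension filter are assumptions for illustration:

```python
import hashlib

def anon(value, prefix):
    """Replace a value with a stable but non-identifying ID."""
    return prefix + hashlib.sha256(value.encode()).hexdigest()[:8]

def cleanse(records):
    kept = []
    for ip, url, status in records:
        if status != 200:                               # drop error records
            continue
        if url.endswith((".gif", ".png", ".js", ".css")):
            continue                                    # drop figures and code
        kept.append((anon(ip, "u_"), anon(url, "p_"), status))
    return kept

# Hypothetical Web log records.
log = [
    ("10.0.0.1", "/index.html", 200),
    ("10.0.0.1", "/logo.png", 200),   # figure, not page data
    ("10.0.0.2", "/missing.html", 404),  # error record
]
clean = cleanse(log)
```

Hashing preserves the property that repeat visits from the same source map to the same ID, so sessions can still be reconstructed after cleansing.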

  14. Sessionizing • Divide Web log into sessions. • Two common techniques: • Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes). • All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
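The second rule above (an interclick-time threshold) can be sketched as follows; the click records and the minute-based timestamps are hypothetical:

```python
def sessionize(clicks, max_gap=25):
    """Split (ip, page, time) clicks into sessions per source IP.

    A new session starts when the gap since the IP's previous click
    exceeds max_gap (time units are minutes here).
    """
    sessions = {}
    last_seen = {}
    for ip, page, t in sorted(clicks, key=lambda c: c[2]):
        if ip not in sessions or t - last_seen[ip] > max_gap:
            sessions.setdefault(ip, []).append([])   # open a new session
        sessions[ip][-1].append(page)
        last_seen[ip] = t
    return sessions

# Hypothetical click stream: a 40-minute gap splits the session.
clicks = [
    ("10.0.0.1", "A", 0),
    ("10.0.0.1", "B", 10),
    ("10.0.0.1", "C", 50),
]
s = sessionize(clicks)
```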

  15. Episodes • Partially ordered set of pages • Serial episode – totally ordered with time constraint • Parallel episode – partially ordered with time constraint • General episode – partially ordered with no time constraint

  16. DAG for Episode

  17. Longest Common Subseries • Given two series X and Y, find the longest subseries they have in common. • Ex: • X = <10,5,6,9,22,15,4,2> • Y = <6,9,10,5,6,22,15,4,2> • Output: <22,15,4,2> • Sim(X,Y) = l/n = 4/9
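The slide's example can be reproduced with a standard dynamic-programming table that tracks the length of the matching run ending at each pair of positions (i.e. the longest common contiguous subsequence):

```python
def longest_common_subseries(x, y):
    """Longest contiguous run of values shared by series x and y."""
    best, best_end = 0, 0
    # run[i][j] = length of the common run ending at x[i-1], y[j-1]
    run = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
                if run[i][j] > best:
                    best, best_end = run[i][j], i
    return x[best_end - best:best_end]

X = [10, 5, 6, 9, 22, 15, 4, 2]
Y = [6, 9, 10, 5, 6, 22, 15, 4, 2]
common = longest_common_subseries(X, Y)
sim = len(common) / max(len(X), len(Y))   # Sim(X,Y) = l/n
```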

  18. Similarity based on Linear Transformation • Linear transformation function f • Converts a value from one series to a value in the second • ef – tolerated difference in results • d – time value difference allowed

  19. Distance between Strings • Cost to convert one string to the other • Transformations • Match: Current characters in both strings are the same • Delete: Delete current character from the input string • Insert: Insert current character of the target string into the input string
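With exactly the three transformations listed above (match is free; delete and insert each cost 1), the distance is computable with the usual dynamic-programming table:

```python
def edit_distance(src, tgt):
    """Minimum cost to convert src into tgt using match/delete/insert."""
    m, n = len(src), len(tgt)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete the rest of src
    for j in range(n + 1):
        d[0][j] = j                 # insert the rest of tgt
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if src[i - 1] == tgt[j - 1]:
                d[i][j] = d[i - 1][j - 1]          # match: no cost
            else:
                d[i][j] = min(d[i - 1][j] + 1,     # delete from src
                              d[i][j - 1] + 1)     # insert from tgt
    return d[m][n]
```

Variants of this distance (e.g. Levenshtein) also allow a substitution operation, but the version above follows the transformation set named on the slide.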

  20. Distance between Strings

  21. Frequent Sequence

  22. Frequent Sequence Example • Purchases made by customers • s(<{A},{C}>) = 1/3 • s(<{A},{D}>) = 2/3 • s(<{B,C},{D}>) = 2/3
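The support values above can be checked mechanically: a customer's sequence supports a pattern if each of the pattern's itemsets is contained in a distinct, later itemset of the sequence. The customer database below is hypothetical, chosen so the supports come out as shown on the slide:

```python
def contains(pattern, sequence):
    """True if the itemsets of `pattern` occur in order within `sequence`."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, db):
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(pattern, seq) for seq in db) / len(db)

# Three hypothetical customers' purchase sequences.
db = [
    [{"A"}, {"C"}, {"D"}],
    [{"B", "C"}, {"A"}, {"D"}],
    [{"B", "C"}, {"D"}],
]
```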

  23. Frequent Sequence Lattice

  24. SPADE • Sequential Pattern Discovery using Equivalence classes • Divides the lattice into equivalence classes and searches each separately.

  25. SPADE Example • ID-List for Sequences of length 1: • Count for <{A}> is 3 • Count for <{A},{D}> is 2
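SPADE's id-lists can be sketched as follows: each item maps to (sequence-id, position) pairs, and the id-list for a longer sequence such as <{A},{D}> is built by a temporal join of the shorter ones. The customer database is hypothetical, chosen so the counts match the slide:

```python
def idlist(item, db):
    """(sequence-id, position) pairs where `item` occurs."""
    return [(sid, pos) for sid, seq in enumerate(db)
            for pos, itemset in enumerate(seq) if item in itemset]

def temporal_join(la, lb):
    """Keep occurrences in lb preceded by an occurrence in la (same sid)."""
    return [(sid_b, pos_b) for sid_b, pos_b in lb
            if any(sid_a == sid_b and pos_a < pos_b for sid_a, pos_a in la)]

# Hypothetical sequences: A occurs in all three, A-then-D in two.
db = [
    [{"A"}, {"D"}],
    [{"A"}, {"C"}],
    [{"A"}, {"B"}, {"D"}],
]
l_a, l_d = idlist("A", db), idlist("D", db)
l_ad = temporal_join(l_a, l_d)
count_a = len({sid for sid, _ in l_a})     # count for <{A}>
count_ad = len({sid for sid, _ in l_ad})   # count for <{A},{D}>
```

Because joins only combine id-lists within one equivalence class, each class can be processed independently, which is the key to SPADE's decomposition of the lattice.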

  26. Q1 Equivalence Classes

  27. SPADE Algorithm
