
A journey through some of the research projects at the Knowledge Discovery & Web Mining Lab at Univ. of Louisville. Olfa Nasraoui, Dept. of Computer Engineering & Computer Science, Speed School of Engineering, University of Louisville. Contact e-mail: olfa.nasraoui@louisville.edu.

Presentation Transcript


  1. A journey through some of the research projects at the Knowledge Discovery & Web Mining Lab at Univ. of Louisville. Olfa Nasraoui, Dept. of Computer Engineering & Computer Science, Speed School of Engineering, University of Louisville. Contact e-mail: olfa.nasraoui@louisville.edu. This work is supported by NSF CAREER Award IIS-0431128, NASA Grant No. AISR-03-0077-0139 issued through the Office of Space Sciences, Kentucky Science & Engr. Foundation, and a grant from NSTC via US Navy.

  2. Outline of talk • Data Mining Background • Mining Footprints Left Behind by Surfers on the Web • Web Usage Mining: WebKDD process, Profiling & Personalization • Mining Illegal Contraband Exchanges on Peer to Peer Networks • Mining Coronal Loops Created from Hot Plasma Eruptions on the Surface of the Sun

  3. Data Mining Background

  4. Too Much Data! (From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”) • There is often information “hidden” in the data that is not always evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all (Chart: “The Data Gap” — total new disk storage (TB) since 1995 vs. the number of analysts)

  5. Data Mining • Many definitions: • Non-trivial extraction of implicit, previously unknown, and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns • Typical DM tasks: clustering, classification, association rule mining • Applications: wherever there is data: • Business, World Wide Web (content, structure, usage), Biology, Medicine, Astronomy, Social Networks, E-learning, Images, etc.

  6. Mining Footprints Left Behind by Surfers on the Web

  7. Introduction • Information overload: too much information to sift/browse through in order to find desired information • Most information on Web is actually irrelevant to a particular user • This is what motivated interest in techniques for Web personalization • As they surf a website, users leave a wealth of historic data about what pages they have viewed, choices they have made, etc • Web Usage Mining: A branch of Web Mining (itself a branch of data mining) that aims to discover interesting patterns from Web usage data (typically Web Access Log data/clickstreams)

  8. Classical Knowledge Discovery Process for Web Usage Mining • Source of data: Web clickstreams recorded in Web log files: • Date, time, IP address/cookie, URL accessed, etc. • Goal: extract interesting user profiles by categorizing user sessions into groups or clusters • Profile 1: URLs a, b, c; sessions in profile 1 • Profile 2: URLs x, y, z, w; sessions in profile 2 • etc.
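The session-to-profile step above can be sketched in a few lines. This is a minimal illustration only, not the lab's HUNC algorithm: the clustering step is omitted (sessions arrive pre-grouped), and the URLs and the 0.5 frequency threshold are made-up assumptions.

```python
# Minimal sketch: turn raw sessions into binary URL vectors, then summarize
# each cluster of sessions as a profile, i.e. the URLs visited by a large
# fraction of the cluster's sessions. Toy data; not the HUNC algorithm.

from collections import Counter

urls = ["/a", "/b", "/c", "/x", "/y", "/z"]
url_index = {u: i for i, u in enumerate(urls)}

def to_vector(session):
    """Binary session vector: 1 if the URL was accessed in the session."""
    v = [0] * len(urls)
    for u in session:
        v[url_index[u]] = 1
    return v

# Toy sessions already grouped into two clusters (clustering step omitted).
clusters = {
    1: [["/a", "/b"], ["/a", "/b", "/c"], ["/b", "/c"]],
    2: [["/x", "/y", "/z"], ["/x", "/z"]],
}

def profile(sessions, min_freq=0.5):
    """Profile = URLs visited in at least min_freq of the cluster's sessions."""
    counts = Counter(u for s in sessions for u in s)
    return sorted(u for u, c in counts.items() if c / len(sessions) >= min_freq)

for cid, sessions in clusters.items():
    print(cid, profile(sessions))
```

Each profile is thus a compact summary ("URLs a, b, c") of the sessions assigned to its cluster, matching the Profile 1 / Profile 2 picture on the slide.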

  9. Classical Knowledge Discovery Process For Web Usage Mining Complete KDD process:


  12. Web Personalization • Web personalization: aims to adapt the website according to the user’s activity or interests • Intelligent Web personalization: often relies on Web usage mining (for user modeling) • Recommender systems: recommend items of interest to users depending on their interests • Content-based filtering: recommend items similar to the items liked by the current user • No notion of a community of users (specializes to only one user) • Collaborative filtering: recommend items liked by “similar” users • Combines the history of a community of users: explicit (ratings) or implicit (clickstreams) • Hybrids: combine the above (and others) (the focus of our research)
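The collaborative-filtering idea above can be shown in miniature: score an item for a target user by the ratings of users with similar rating vectors. This is a generic user-based CF sketch, not the lab's system; the users, items, and ratings are invented toy data.

```python
# User-based collaborative filtering sketch: predict a user's rating of an
# unseen item as the cosine-similarity-weighted average of other users'
# ratings for that item. All data here are made-up toy values.

import math

ratings = {  # user -> {item: rating}
    "u1": {"a": 5, "b": 4, "c": 1},
    "u2": {"a": 4, "b": 5, "d": 4},
    "u3": {"c": 5, "d": 4},
}

def cosine(r1, r2):
    """Cosine similarity between two sparse rating vectors."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = math.sqrt(sum(v * v for v in r1.values()))
    n2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (n1 * n2)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        w = cosine(ratings[user], r)
        num += w * r[item]
        den += w
    return num / den if den else None

print(predict("u1", "d"))  # u1 never rated "d"; neighbors u2 and u3 did
```

With implicit clickstream data, the same scheme applies with binary visit vectors in place of ratings, which is the bridge to the usage-mining profiles of the previous slides.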

  13. Different Steps of our Web Personalization System • STEP 1: OFFLINE PROFILE DISCOVERY: Server Logs + Site Files → Preprocessing → User Sessions → Data Mining (Transaction Clustering, Association Rule Discovery, Pattern Discovery) → Post-processing / Derivation of User Profiles → User Profiles / User Model • STEP 2: ACTIVE RECOMMENDATION: Active Session + User Profiles → Recommendation Engine → Recommendations

  14. Challenges & Questions in Web Usage Mining • Dealing with ambiguity: semantics? • Implicit taxonomy? (Nasraoui, Krishnapuram, Joshi, 1999) • Website hierarchy (can help disambiguation, but limited) • Explicit taxonomy? (Nasraoui, Soliman, Badia, 2005) • From a DB associated with dynamic URLs • Content taxonomy or ontology (can help disambiguation; powerful) • Concept hierarchy generalization / URL compression / concept abstraction (Saka & Nasraoui, 2006) • How does abstraction affect the quality of user models? Nasraoui: Web Usage Mining & Personalization in Noisy, Dynamic, and Ambiguous Environments

  15. Challenges & Questions in Web Usage Mining • User profile post-processing criteria? (Saka & Nasraoui, 2006) • Aggregated profiles (frequency average)? • Robust profiles (discount noisy data)? • How do they really perform? • How to validate? (Nasraoui & Goswami, SDM 2006)

  16. Challenges & Questions in Web Usage Mining • Evolution: detecting & characterizing profile evolution & change? (Nasraoui, Cerwinske, Rojas, Gonzalez, CIKM 2006)

  17. Challenges & Questions in Web Usage Mining • In the case of massive evolving data streams: • Need stream data mining (Nasraoui et al., ICDM 2003, WebKDD 2003) • Need stream-based recommender systems? (Nasraoui et al., CIKM 2006) • How do stream-based recommender systems perform under evolution? • How to validate the above? (Nasraoui et al., CIKM 2006)

  18. Clustering: HUNC Methodology • Hierarchical Unsupervised Niche Clustering (HUNC) algorithm: a robust genetic clustering approach. • Hierarchical: clusters the data recursively and discovers profiles at increasing resolutions, allowing even relatively small profiles/user segments to be found. • Unsupervised: determines the number of clusters automatically. • Niching: maintains a diverse population in the GA, with members distributed among niches corresponding to multiple solutions  smaller profiles can survive alongside bigger profiles. • Genetic optimization: evolves a population of candidate solutions through generations of competition and reproduction until convergence.
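The core idea behind the niching fitness can be illustrated in miniature. This is not the full HUNC/UNC algorithm, only its density criterion: a candidate cluster center's fitness is a robust density of the data around it, so centers sitting on dense niches score high while a center placed on an outlier barely does. The Gaussian weight function and the scale sigma are illustrative assumptions.

```python
# Sketch of the density-based fitness driving a niching genetic clustering
# scheme: points near a candidate center contribute nearly 1, far points
# contribute nearly 0, so dense regions (niches) yield high fitness.

import math

data = [0.9, 1.0, 1.1, 5.0, 5.1, 9.7]  # two dense niches plus a lone outlier

def density_fitness(center, points, sigma=0.5):
    """Robust density: sum of Gaussian weights of the data around the center."""
    return sum(math.exp(-((p - center) ** 2) / (2 * sigma ** 2)) for p in points)

# A center inside a niche beats a center placed on the lone outlier, which is
# why genetic competition under this fitness preserves multiple real niches.
print(density_fitness(1.0, data), density_fitness(5.05, data),
      density_fitness(9.7, data))
```

In a GA, each chromosome encodes one candidate center; niching keeps the population spread over both high-density regions instead of collapsing onto the single best one.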

  19. Hierarchical Unsupervised Niche Clustering Algorithm (H-UNC)

  20. Unsupervised Niche Clustering (UNC): an evolutionary algorithm with a niching strategy; niche sizes are estimated accurately and automatically by “hill climbing” performed in parallel with the evolution.

  21. Unsupervised Niche Clustering (UNC)

  22. Role of the Similarity Measure: Adding Semantics • We exploit both: • an implicit taxonomy, inferred from the website directory structure (obtained by tokenizing URLs), • an explicit taxonomy, inferred from external data (relations such as “object 1 is parent of object 2”, etc.) • Both implicit and explicit taxonomy information are seamlessly incorporated into clustering via a specialized Web session similarity measure

  23. Similarity Measure • Map the N_U URLs on the site to indices • User session vector s(i): a temporally compact sequence of Web accesses by a user • If the site structure is ignored  cosine similarity between session vectors • Taking the site structure into account  relate distinct URLs via the path from the root to each URL’s node in the site tree
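A structure-aware URL similarity of the kind described above can be sketched as follows. This is a hedged illustration: the exact formula in the cited work may differ, but the idea is the same, i.e. two distinct URLs are related through the overlap of their paths from the site root (the implicit taxonomy).

```python
# Sketch: URL-to-URL similarity from the site tree. URLs sharing a long path
# prefix (e.g. two pages under /courses/cs/) are similar even though they are
# distinct; URLs in different branches get similarity 0. The normalization by
# the longer path length is an illustrative choice, not the published formula.

def path(url):
    """Tokenize a URL into its path of nodes from the site root."""
    return [p for p in url.split("/") if p]

def url_sim(u1, u2):
    """Shared path-prefix length, normalized by the longer path."""
    p1, p2 = path(u1), path(u2)
    shared = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        shared += 1
    return shared / max(len(p1), len(p2))

print(url_sim("/courses/cs/a.html", "/courses/cs/b.html"))  # same branch
print(url_sim("/courses/cs/a.html", "/news/today.html"))    # different branch
```

Plugging such a URL-level similarity into the session comparison (in place of exact-match terms in the cosine) is what lets the clustering relate sessions that visit sibling pages rather than identical URLs.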

  24. Multi-faceted Web User Profiles • Pipeline: Web logs + server content DB → pre-processing → data mining → post-processing → multi-faceted Web user profiles • Facets: search queries about which companies? viewed which Web pages? from which companies?

  25. Example: Web Usage Mining of January 05 Log Data • Total number of users: 3,333 • 26 profiles discovered

  26. Web Usage Mining of January 05, Cont… (Total number of users: 3,333) • What web pages did these users visit? /FRAMES.ASPX/MENU_FILE=.JS&MAIN=/UNIVERSAL.ASPX /TOP_FRAME.ASPX /LEFT1.HTM /MAINPAGE_FRAMESET.ASPX/MENU_FILE=.JS&MAIN=/UNIVERSAL.ASPX /MENU.ASPX/MENU_FILE=.JS /UNIVERSAL.ASPX/ID=Company Connection/Manufacturers / Suppliers/Soda Blasting /UNIVERSAL.ASPX/ID= Surface Treatment/Surface Preparation/Abrasive Blasting • Who are these users? Some are: Highway Customers/Austria Facility; Naval Education And Training Program Management/United States; Magna International Inc/Canada; Norfolk Naval Shipyard (NNSY) • What page did they visit before starting their session? Some are: http://www.eniro.se/query?q=www.nns.com&what=se&hpp=&ax= ; http://www.mamma.com/Mamma?qtype=0&query=storage+tanks+in+water+treatment+plant%2Bcorrosion ; http://www.comcast.net/qry/websearch?cmd=qry&safe=on&query=mesa+cathodic+protection&searchChoice=google ; http://msxml.infospace.com/_1_2O1PUPH04B2YNZT__whenu.main/search/web/carboline.com • What did they search for before visiting our website? Some are: “shell blasting”, “Gavlon Industries”, “obrien paints”, “Induron Coatings”, “epoxy polyamide”

  27. From Profiles to Personalization

  28–31. Two-Step Recommender Systems Based on a Committee of Profile-Specific URL-Predictor Neural Networks • Input: the current (active) session • Step 1: find the closest profile discovered by HUNC • Step 2: choose that profile’s specialized network (NN #1, NN #2, NN #3, …, NN #n) • Output: the recommendations produced by the chosen network

  32. Train each neural net to complete missing puzzles (sessions from that profile/cluster) • Striped = input session (incomplete) presented to the neural network • “?” = output predicted to complete the puzzle (the completed session)
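The "complete the puzzle" task can be illustrated without a neural-network library. In the deck, a trained network per profile maps a partial session to the URLs that complete it; here a simple co-occurrence predictor stands in for that learned mapping, and all sessions are invented toy data.

```python
# Stand-in for a profile-specific URL predictor: score candidate URLs by how
# often they co-occur with the observed URLs in that profile's training
# sessions, and recommend the top-scoring unseen URLs. The deck's actual
# predictor is a neural network trained per profile; this is only a sketch.

from collections import defaultdict

profile_sessions = [  # toy training sessions of one profile
    ["/a", "/b", "/c"],
    ["/a", "/c"],
    ["/b", "/c", "/d"],
]

cooc = defaultdict(int)
for s in profile_sessions:
    for u in s:
        for v in s:
            if u != v:
                cooc[(u, v)] += 1

def recommend(partial, k=1):
    """Rank URLs not yet visited by total co-occurrence with the partial session."""
    seen = set(partial)
    candidates = {v for s in profile_sessions for v in s} - seen
    scores = {v: sum(cooc[(u, v)] for u in partial) for v in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend(["/a", "/b"]))  # the "?" completing this partial session
```

The two-step design then amounts to routing the active session to the right profile first, so each predictor only has to learn the completion patterns of its own user segment.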

  33. Precision: comparison of 2-step, 1-step, and K-NN (better for longer sessions)

  34. Coverage: comparison of 2-step, 1-step, and K-NN (better for longer sessions)

  35. Semantic Personalization for Information Retrieval on an E-Learning Platform

  36. E-Learning HyperManyMedia Resources http://www.collegedegree.com/library/college-life/the_ultimate_guide_to_using_open_courseware

  37. Architecture: semantic representation (knowledge representation), algorithms (core software), and the personalization interface.

  38. Semantic Search Engine

  39. Experimental Evaluation A total of 1,406 lectures (documents) represented the profiles, with the size of each profile varying from one learner to another, as follows. Learner1 (English)= 86 lectures, Learner2 (Consumer and Family Sciences) = 74 lectures, Learner3 (Communication Disorders) = 160 lectures, Learner4 (Engineering) = 210 lectures, Learner5 (Architecture and Manufacturing Sciences) = 119 lectures, Learner6 (Math) = 374 lectures, Learner7 (Social Work) = 86 lectures, Learner8 (Chemistry) = 58 lectures, Learner9 (Accounting) = 107 lectures, and Learner10 (History) = 132 lectures.

  40. Handling Concept Drift / Evolving Data Streams

  41. Concept drift / Data Streams

  42. Recommender Systems in Dynamic Usage Environments • For massive data streams, one must use a stream mining framework • Furthermore, one must be able to continuously mine evolving data streams • TECNO-Streams: Tracking Evolving Clusters in Noisy Streams • Inspired by the immune system • Immune system: interaction between external agents (antigens) and immune memory (B-cells) • Artificial immune system: • Antigens = data stream • B-cells = cluster/profile stream synopsis = evolving memory • B-cells have an age (since their creation) • Gradual forgetting of older B-cells • B-cells compete to survive by cloning multiple copies of themselves • Cloning is proportional to the B-cell’s stimulation • B-cell stimulation: defined as a density criterion of the data around a profile (this is what is being optimized!) O. Nasraoui, C. Cardona, C. Rojas, and F. Gonzalez. Mining Evolving User Profiles in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm, in Proc. of WebKDD 2003, Washington DC, Aug. 2003, 71-81.
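The B-cell bookkeeping described above can be rendered in a toy form. This is not the full TECNO-Streams algorithm, only the stimulation/aging mechanics: each B-cell's stimulation grows when nearby stream points arrive and decays over time, so memory of stale profiles gradually fades. The decay rate and distance scale are illustrative assumptions.

```python
# Toy B-cell from an artificial immune system for stream clustering: one
# stream point at a time, decay the old stimulation (gradual forgetting)
# and add the new point's density contribution (Gaussian weight).

import math

class BCell:
    def __init__(self, center):
        self.center = center
        self.stimulation = 0.0
        self.age = 0

    def update(self, point, decay=0.9, sigma=1.0):
        """One stream arrival: age the cell, decay memory, add the point's weight."""
        self.age += 1
        w = math.exp(-((point - self.center) ** 2) / (2 * sigma ** 2))
        self.stimulation = decay * self.stimulation + w

cell = BCell(center=5.0)
for p in [5.1, 4.9, 5.0]:      # stream points near the cell: stimulation rises
    cell.update(p)
high = cell.stimulation
for p in [20.0, 21.0, 19.5]:   # the stream drifts away: stimulation decays
    cell.update(p)
print(high, cell.stimulation)
```

In the full system, stimulation drives cloning (well-stimulated B-cells replicate) while low-stimulation, old B-cells die off, which is how the synopsis tracks evolving profiles in one pass.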

  43. The Immune Network Memory • An external antigen (red) stimulates a binding B-cell  the B-cell (green) clones copies of itself (pink) • Stimulation breeds survival • Even after the external antigen disappears, B-cells co-stimulate each other, sustaining one another  memory!

  44. General Architecture of the Proposed Approach • Evolving data stream  1-pass adaptive immune learning  immune network information system • Tracks stimulation (competition & memory), age (old vs. new), and outliers (based on activation) • Result: an evolving immune network (compressed into subnetworks)

  45. TECNO-Streams main loop (flowchart) • Start/Reset: initialize the ImmuNet and MaxLimit; compress the ImmuNet into K subNets • Present a NEW antigen (data point) from the stream • If it does not activate the ImmuNet  outlier: trap it with the initial data • Otherwise: identify the nearest subNet*, compute soft activations in subNet*, update subNet*’s ARB influence ranges/scales, and update subNet*’s ARBs’ stimulations • Clone the antigen; clone and mutate ARBs; kill lethal ARBs • Memory constraints: if #ARBs > MaxLimit, kill extra ARBs (based on an age/stimulation strategy), OR increase the acuteness of competition, OR move the oldest patterns to auxiliary (secondary) storage • Domain knowledge constraints apply throughout • Compress the ImmuNet; report ImmuNet statistics & visualization; repeat

  46. Summarizing noisy data in 1 pass • Different Levels of Noise

  47. Validation for clustering streams in one pass: detect all clusters (miss no cluster) and do not discover spurious clusters • Validation measures (averaged over 10 runs): • Hits • Spurious clusters

  48. Experimental Results • Arbitrary Shaped Clusters

  49. Experimental Results • Arbitrary Shaped Clusters

  50. Concept drift / Data Streams
