
A journey through some of the research projects at the Knowledge Discovery & Web Mining Lab at Univ. of Louisville. Olfa Nasraoui, Dept. of Computer Engineering & Computer Science, Speed School of Engineering, University of Louisville. Contact e-mail: olfa.nasraoui@louisville.edu.

Presentation Transcript


  1. A journey through some of the research projects at the Knowledge Discovery & Web Mining Lab at Univ. of Louisville. Olfa Nasraoui, Dept. of Computer Engineering & Computer Science, Speed School of Engineering, University of Louisville. Contact e-mail: olfa.nasraoui@louisville.edu. This work is supported by NSF CAREER Award IIS-0431128, NASA Grant No. AISR-03-0077-0139 issued through the Office of Space Sciences, Kentucky Science & Engr. Foundation, and a grant from NSTC via US Navy.

  2. Outline of talk • Data Mining Background • Mining Footprints Left Behind by Surfers on the Web • Web Usage Mining: WebKDD process, Profiling & Personalization • Mining Illegal Contraband Exchanges on Peer to Peer Networks • Mining Coronal Loops Created from Hot Plasma Eruptions on the Surface of the Sun

  3. Data Mining Background

  4. Too Much Data! (From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”) • There is often information “hidden” in the data that is not always evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all (Chart: “The Data Gap” — total new disk storage (TB) since 1995 vs. the number of analysts)

  5. Data Mining • Many definitions: • Non-trivial extraction of implicit, previously unknown, and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns • Typical DM tasks: clustering, classification, association rule mining • Applications: wherever there is data: • Business, World Wide Web (content, structure, usage), Biology, Medicine, Astronomy, Social Networks, E-learning, Images, etc.

  6. Mining Footprints Left Behind by Surfers on the Web

  7. Introduction • Information overload: too much information to sift/browse through in order to find desired information • Most information on Web is actually irrelevant to a particular user • This is what motivated interest in techniques for Web personalization • As they surf a website, users leave a wealth of historic data about what pages they have viewed, choices they have made, etc • Web Usage Mining: A branch of Web Mining (itself a branch of data mining) that aims to discover interesting patterns from Web usage data (typically Web Access Log data/clickstreams)

  8. Classical Knowledge Discovery Process for Web Usage Mining • Source of data: Web clickstreams recorded in Web log files: • Date, time, IP address/cookie, URL accessed, etc. • Goal: extract interesting user profiles by categorizing user sessions into groups or clusters • Profile 1: URLs a, b, c; sessions in profile 1 • Profile 2: URLs x, y, z, w; sessions in profile 2 • etc.
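The session-to-profile step above can be sketched in a few lines. This is a minimal illustration only, not the lab's HUNC algorithm: the clustering step is omitted (sessions arrive pre-grouped), and the URLs and the 0.5 frequency threshold are made-up assumptions.

```python
# Minimal sketch: turn raw sessions into binary URL vectors, then summarize
# each cluster of sessions as a profile, i.e. the URLs visited by a large
# fraction of the cluster's sessions. Toy data; not the HUNC algorithm.

from collections import Counter

urls = ["/a", "/b", "/c", "/x", "/y", "/z"]
url_index = {u: i for i, u in enumerate(urls)}

def to_vector(session):
    """Binary session vector: 1 if the URL was accessed in the session."""
    v = [0] * len(urls)
    for u in session:
        v[url_index[u]] = 1
    return v

# Toy sessions already grouped into two clusters (clustering step omitted).
clusters = {
    1: [["/a", "/b"], ["/a", "/b", "/c"], ["/b", "/c"]],
    2: [["/x", "/y", "/z"], ["/x", "/z"]],
}

def profile(sessions, min_freq=0.5):
    """Profile = URLs visited in at least min_freq of the cluster's sessions."""
    counts = Counter(u for s in sessions for u in s)
    return sorted(u for u, c in counts.items() if c / len(sessions) >= min_freq)

for cid, sessions in clusters.items():
    print(cid, profile(sessions))
```

Each profile is thus a compact summary ("URLs a, b, c") of the sessions assigned to its cluster, matching the Profile 1 / Profile 2 picture on the slide.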

  9. Classical Knowledge Discovery Process For Web Usage Mining Complete KDD process:


  12. Web Personalization • Web personalization: aims to adapt the website according to the user’s activity or interests • Intelligent Web personalization: often relies on Web usage mining (for user modeling) • Recommender systems: recommend items of interest to users depending on their interests • Content-based filtering: recommend items similar to the items liked by the current user • No notion of a community of users (specializes to only one user) • Collaborative filtering: recommend items liked by “similar” users • Combines the history of a community of users: explicit (ratings) or implicit (clickstreams) • Hybrids: combine the above (and others) (the focus of our research)
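The collaborative-filtering idea above can be shown in miniature: score an item for a target user by the ratings of users with similar rating vectors. This is a generic user-based CF sketch, not the lab's system; the users, items, and ratings are invented toy data.

```python
# User-based collaborative filtering sketch: predict a user's rating of an
# unseen item as the cosine-similarity-weighted average of other users'
# ratings for that item. All data here are made-up toy values.

import math

ratings = {  # user -> {item: rating}
    "u1": {"a": 5, "b": 4, "c": 1},
    "u2": {"a": 4, "b": 5, "d": 4},
    "u3": {"c": 5, "d": 4},
}

def cosine(r1, r2):
    """Cosine similarity between two sparse rating vectors."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = math.sqrt(sum(v * v for v in r1.values()))
    n2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (n1 * n2)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        w = cosine(ratings[user], r)
        num += w * r[item]
        den += w
    return num / den if den else None

print(predict("u1", "d"))  # u1 never rated "d"; neighbors u2 and u3 did
```

With implicit clickstream data, the same scheme applies with binary visit vectors in place of ratings, which is the bridge to the usage-mining profiles of the previous slides.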

  13. Different Steps of our Web Personalization System • STEP 1: OFFLINE PROFILE DISCOVERY: Server Logs + Site Files → Preprocessing → User Sessions → Data Mining (Transaction Clustering, Association Rule Discovery, Pattern Discovery) → Post-processing / Derivation of User Profiles → User Profiles / User Model • STEP 2: ACTIVE RECOMMENDATION: Active Session + User Profiles → Recommendation Engine → Recommendations

  14. Challenges & Questions in Web Usage Mining • Dealing with ambiguity: semantics? • Implicit taxonomy? (Nasraoui, Krishnapuram, Joshi, 1999) • Website hierarchy (can help disambiguation, but limited) • Explicit taxonomy? (Nasraoui, Soliman, Badia, 2005) • From a DB associated with dynamic URLs • Content taxonomy or ontology (can help disambiguation; powerful) • Concept hierarchy generalization / URL compression / concept abstraction (Saka & Nasraoui, 2006) • How does abstraction affect the quality of user models? Nasraoui: Web Usage Mining & Personalization in Noisy, Dynamic, and Ambiguous Environments

  15. Challenges & Questions in Web Usage Mining • User profile post-processing criteria? (Saka & Nasraoui, 2006) • Aggregated profiles (frequency average)? • Robust profiles (discount noisy data)? • How do they really perform? • How to validate? (Nasraoui & Goswami, SDM 2006)

  16. Challenges & Questions in Web Usage Mining • Evolution: detecting & characterizing profile evolution & change? (Nasraoui, Cerwinske, Rojas, Gonzalez, CIKM 2006)

  17. Challenges & Questions in Web Usage Mining • In the case of massive evolving data streams: • Need stream data mining (Nasraoui et al., ICDM 2003, WebKDD 2003) • Need stream-based recommender systems? (Nasraoui et al., CIKM 2006) • How do stream-based recommender systems perform under evolution? • How to validate the above? (Nasraoui et al., CIKM 2006)

  18. Clustering: HUNC Methodology • Hierarchical Unsupervised Niche Clustering (HUNC) algorithm: a robust genetic clustering approach. • Hierarchical: clusters the data recursively and discovers profiles at increasing resolutions, allowing even relatively small profiles/user segments to be found. • Unsupervised: determines the number of clusters automatically. • Niching: maintains a diverse population in the GA, with members distributed among niches corresponding to multiple solutions  smaller profiles can survive alongside bigger profiles. • Genetic optimization: evolves a population of candidate solutions through generations of competition and reproduction until convergence.
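The core idea behind the niching fitness can be illustrated in miniature. This is not the full HUNC/UNC algorithm, only its density criterion: a candidate cluster center's fitness is a robust density of the data around it, so centers sitting on dense niches score high while a center placed on an outlier barely does. The Gaussian weight function and the scale sigma are illustrative assumptions.

```python
# Sketch of the density-based fitness driving a niching genetic clustering
# scheme: points near a candidate center contribute nearly 1, far points
# contribute nearly 0, so dense regions (niches) yield high fitness.

import math

data = [0.9, 1.0, 1.1, 5.0, 5.1, 9.7]  # two dense niches plus a lone outlier

def density_fitness(center, points, sigma=0.5):
    """Robust density: sum of Gaussian weights of the data around the center."""
    return sum(math.exp(-((p - center) ** 2) / (2 * sigma ** 2)) for p in points)

# A center inside a niche beats a center placed on the lone outlier, which is
# why genetic competition under this fitness preserves multiple real niches.
print(density_fitness(1.0, data), density_fitness(5.05, data),
      density_fitness(9.7, data))
```

In a GA, each chromosome encodes one candidate center; niching keeps the population spread over both high-density regions instead of collapsing onto the single best one.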

  19. Hierarchical Unsupervised Niche Clustering Algorithm (H-UNC)

  20. Unsupervised Niche Clustering (UNC): an evolutionary algorithm with a niching strategy; niche sizes are estimated accurately and automatically by “hill climbing” performed in parallel with the evolution.

  21. Unsupervised Niche Clustering (UNC)

  22. Role of the Similarity Measure: Adding Semantics • We exploit both: • an implicit taxonomy, inferred from the website directory structure (obtained by tokenizing URLs), • an explicit taxonomy, inferred from external data (relations such as “object 1 is parent of object 2”, etc.) • Both implicit and explicit taxonomy information are seamlessly incorporated into clustering via a specialized Web session similarity measure

  23. Similarity Measure • Map the N_U URLs on the site to indices • User session vector s(i): a temporally compact sequence of Web accesses by a user • If the site structure is ignored  cosine similarity between session vectors • Taking the site structure into account  relate distinct URLs via the path from the root to each URL’s node in the site tree
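A structure-aware URL similarity of the kind described above can be sketched as follows. This is a hedged illustration: the exact formula in the cited work may differ, but the idea is the same, i.e. two distinct URLs are related through the overlap of their paths from the site root (the implicit taxonomy).

```python
# Sketch: URL-to-URL similarity from the site tree. URLs sharing a long path
# prefix (e.g. two pages under /courses/cs/) are similar even though they are
# distinct; URLs in different branches get similarity 0. The normalization by
# the longer path length is an illustrative choice, not the published formula.

def path(url):
    """Tokenize a URL into its path of nodes from the site root."""
    return [p for p in url.split("/") if p]

def url_sim(u1, u2):
    """Shared path-prefix length, normalized by the longer path."""
    p1, p2 = path(u1), path(u2)
    shared = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        shared += 1
    return shared / max(len(p1), len(p2))

print(url_sim("/courses/cs/a.html", "/courses/cs/b.html"))  # same branch
print(url_sim("/courses/cs/a.html", "/news/today.html"))    # different branch
```

Plugging such a URL-level similarity into the session comparison (in place of exact-match terms in the cosine) is what lets the clustering relate sessions that visit sibling pages rather than identical URLs.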

  24. Multi-faceted Web User Profiles • Pipeline: Web logs + server content DB → pre-processing → data mining → post-processing → multi-faceted Web user profiles • Facets: search queries about which companies? viewed which Web pages? from which companies?

  25. Example: Web Usage Mining of January 05 Log Data • Total number of users: 3,333 • 26 profiles discovered

  26. Web Usage Mining of January 05, Cont… (Total number of users: 3,333) • What web pages did these users visit? /FRAMES.ASPX/MENU_FILE=.JS&MAIN=/UNIVERSAL.ASPX /TOP_FRAME.ASPX /LEFT1.HTM /MAINPAGE_FRAMESET.ASPX/MENU_FILE=.JS&MAIN=/UNIVERSAL.ASPX /MENU.ASPX/MENU_FILE=.JS /UNIVERSAL.ASPX/ID=Company Connection/Manufacturers / Suppliers/Soda Blasting /UNIVERSAL.ASPX/ID= Surface Treatment/Surface Preparation/Abrasive Blasting • Who are these users? Some are: Highway Customers/Austria Facility; Naval Education And Training Program Management/United States; Magna International Inc/Canada; Norfolk Naval Shipyard (NNSY) • What page did they visit before starting their session? Some are: http://www.eniro.se/query?q=www.nns.com&what=se&hpp=&ax= ; http://www.mamma.com/Mamma?qtype=0&query=storage+tanks+in+water+treatment+plant%2Bcorrosion ; http://www.comcast.net/qry/websearch?cmd=qry&safe=on&query=mesa+cathodic+protection&searchChoice=google ; http://msxml.infospace.com/_1_2O1PUPH04B2YNZT__whenu.main/search/web/carboline.com • What did they search for before visiting our website? Some are: “shell blasting”, “Gavlon Industries”, “obrien paints”, “Induron Coatings”, “epoxy polyamide”

  27. From Profiles to Personalization

  28–31. Two-Step Recommender Systems Based on a Committee of Profile-Specific URL-Predictor Neural Networks • Input: the current (active) session • Step 1: find the closest profile discovered by HUNC • Step 2: choose that profile’s specialized network (NN #1, NN #2, NN #3, …, NN #n) • Output: the recommendations produced by the chosen network

  32. Train each neural net to complete missing puzzles (sessions from that profile/cluster) • Striped = input session (incomplete) presented to the neural network • “?” = output predicted to complete the puzzle (the completed session)
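The "complete the puzzle" task can be illustrated without a neural-network library. In the deck, a trained network per profile maps a partial session to the URLs that complete it; here a simple co-occurrence predictor stands in for that learned mapping, and all sessions are invented toy data.

```python
# Stand-in for a profile-specific URL predictor: score candidate URLs by how
# often they co-occur with the observed URLs in that profile's training
# sessions, and recommend the top-scoring unseen URLs. The deck's actual
# predictor is a neural network trained per profile; this is only a sketch.

from collections import defaultdict

profile_sessions = [  # toy training sessions of one profile
    ["/a", "/b", "/c"],
    ["/a", "/c"],
    ["/b", "/c", "/d"],
]

cooc = defaultdict(int)
for s in profile_sessions:
    for u in s:
        for v in s:
            if u != v:
                cooc[(u, v)] += 1

def recommend(partial, k=1):
    """Rank URLs not yet visited by total co-occurrence with the partial session."""
    seen = set(partial)
    candidates = {v for s in profile_sessions for v in s} - seen
    scores = {v: sum(cooc[(u, v)] for u in partial) for v in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend(["/a", "/b"]))  # the "?" completing this partial session
```

The two-step design then amounts to routing the active session to the right profile first, so each predictor only has to learn the completion patterns of its own user segment.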

  33. Precision: comparison of 2-step, 1-step, and K-NN (better for longer sessions)

  34. Coverage: comparison of 2-step, 1-step, and K-NN (better for longer sessions)

  35. Semantic Personalization for Information Retrieval on an E-Learning Platform

  36. E-Learning HyperManyMedia Resources http://www.collegedegree.com/library/college-life/the_ultimate_guide_to_using_open_courseware

  37. Architecture: semantic representation (knowledge representation), algorithms (core software), and the personalization interface.

  38. Semantic Search Engine

  39. Experimental Evaluation A total of 1,406 lectures (documents) represented the profiles, with the size of each profile varying from one learner to another, as follows. Learner1 (English)= 86 lectures, Learner2 (Consumer and Family Sciences) = 74 lectures, Learner3 (Communication Disorders) = 160 lectures, Learner4 (Engineering) = 210 lectures, Learner5 (Architecture and Manufacturing Sciences) = 119 lectures, Learner6 (Math) = 374 lectures, Learner7 (Social Work) = 86 lectures, Learner8 (Chemistry) = 58 lectures, Learner9 (Accounting) = 107 lectures, and Learner10 (History) = 132 lectures.

  40. Handling Concept Drift / Evolving Data Streams

  41. Concept drift / Data Streams

  42. Recommender Systems in Dynamic Usage Environments • For massive data streams, one must use a stream mining framework • Furthermore, one must be able to continuously mine evolving data streams • TECNO-Streams: Tracking Evolving Clusters in Noisy Streams • Inspired by the immune system • Immune system: interaction between external agents (antigens) and immune memory (B-cells) • Artificial immune system: • Antigens = data stream • B-cells = cluster/profile stream synopsis = evolving memory • B-cells have an age (since their creation) • Gradual forgetting of older B-cells • B-cells compete to survive by cloning multiple copies of themselves • Cloning is proportional to the B-cell’s stimulation • B-cell stimulation: defined as a density criterion of the data around a profile (this is what is being optimized!) O. Nasraoui, C. Cardona, C. Rojas, and F. Gonzalez. Mining Evolving User Profiles in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm, in Proc. of WebKDD 2003, Washington DC, Aug. 2003, 71-81.
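The B-cell bookkeeping described above can be rendered in a toy form. This is not the full TECNO-Streams algorithm, only the stimulation/aging mechanics: each B-cell's stimulation grows when nearby stream points arrive and decays over time, so memory of stale profiles gradually fades. The decay rate and distance scale are illustrative assumptions.

```python
# Toy B-cell from an artificial immune system for stream clustering: one
# stream point at a time, decay the old stimulation (gradual forgetting)
# and add the new point's density contribution (Gaussian weight).

import math

class BCell:
    def __init__(self, center):
        self.center = center
        self.stimulation = 0.0
        self.age = 0

    def update(self, point, decay=0.9, sigma=1.0):
        """One stream arrival: age the cell, decay memory, add the point's weight."""
        self.age += 1
        w = math.exp(-((point - self.center) ** 2) / (2 * sigma ** 2))
        self.stimulation = decay * self.stimulation + w

cell = BCell(center=5.0)
for p in [5.1, 4.9, 5.0]:      # stream points near the cell: stimulation rises
    cell.update(p)
high = cell.stimulation
for p in [20.0, 21.0, 19.5]:   # the stream drifts away: stimulation decays
    cell.update(p)
print(high, cell.stimulation)
```

In the full system, stimulation drives cloning (well-stimulated B-cells replicate) while low-stimulation, old B-cells die off, which is how the synopsis tracks evolving profiles in one pass.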

  43. The Immune Network Memory • An external antigen (red) stimulates a binding B-cell  the B-cell (green) clones copies of itself (pink) • Stimulation breeds survival • Even after the external antigen disappears, B-cells co-stimulate each other, sustaining one another  memory!

  44. General Architecture of the Proposed Approach • Evolving data stream  1-pass adaptive immune learning  immune network information system • Tracks stimulation (competition & memory), age (old vs. new), and outliers (based on activation) • Result: an evolving immune network (compressed into subnetworks)

  45. TECNO-Streams main loop (flowchart) • Start/Reset: initialize the ImmuNet and MaxLimit; compress the ImmuNet into K subNets • Present a NEW antigen (data point) from the stream • If it does not activate the ImmuNet  outlier: trap it with the initial data • Otherwise: identify the nearest subNet*, compute soft activations in subNet*, update subNet*’s ARB influence ranges/scales, and update subNet*’s ARBs’ stimulations • Clone the antigen; clone and mutate ARBs; kill lethal ARBs • Memory constraints: if #ARBs > MaxLimit, kill extra ARBs (based on an age/stimulation strategy), OR increase the acuteness of competition, OR move the oldest patterns to auxiliary (secondary) storage • Domain knowledge constraints apply throughout • Compress the ImmuNet; report ImmuNet statistics & visualization; repeat

  46. Summarizing noisy data in 1 pass • Different Levels of Noise

  47. Validation for clustering streams in one pass: detect all clusters (miss no cluster) and do not discover spurious clusters • Validation measures (averaged over 10 runs): • Hits • Spurious clusters

  48. Experimental Results • Arbitrary Shaped Clusters

  49. Experimental Results • Arbitrary Shaped Clusters

  50. Concept drift / Data Streams
