Web Usage Mining Classification • Fang Yao • MEMS 2002 • 185029 Humboldt Uni zu Berlin
Contents: • Definition and Usages • Outputs of Classification • Methods of Classification • Application to EDOC • Discussion on Incomplete Data • Discussion Questions & Outlook
Definition and Usages

Classification: a major data mining operation. Given one target attribute (e.g. play), try to predict its value for new instances by means of the other available attributes.

Example rule: "People with age less than 40 and salary > 40k trade on-line"

Usages:
• behavior prediction
• improving Web design
• personalized marketing
• ……
Decision Tree: A Small Example, the Weather Data (source: Witten & Frank, table 1.2)

outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
….        ….           ….        ….     ….

(Figure: the decision tree learned from this data, with outlook at the root and windy and humidity tested further down.)
Outputs of Classification

Decision Tree (figure: outlook at the root; the sunny branch tests humidity, the rainy branch tests windy, and overcast is a pure "yes" leaf)

Classification Rules:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
……
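Such a rule set can be read directly as executable logic. A minimal sketch in Java, assuming a simple String encoding of the nominal attributes; the method name and the fall-through default are illustrative assumptions, since the original rule list is elided ("……"):

```java
// Hypothetical illustration: the four rules above as plain Java.
// Rules are checked in order, mirroring the list on the slide.
static String play(String outlook, String humidity, boolean windy) {
    if (outlook.equals("sunny") && humidity.equals("high")) return "no";
    if (outlook.equals("rainy") && windy)                   return "no";
    if (outlook.equals("overcast"))                         return "yes";
    if (humidity.equals("normal"))                          return "yes";
    return "yes"; // assumed default for cases not covered by the listed rules
}
```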
Methods _ divide-and-conquer: constructing decision trees

Step 1: select a splitting attribute

(Figure: the yes/no class distributions that result from splitting the 14 weather instances on each of the four candidate attributes: outlook, temperature, humidity, windy.)

Gain(outlook) = 0.247 bits > Gain(humidity) = 0.152 bits > Gain(windy) = 0.048 bits > Gain(temperature) = 0.029 bits
Methods _ divide-and-conquer: constructing decision trees

Calculating the information gain:

Gain(outlook) = info([9,5]) - info([4,0],[3,2],[2,3]) = 0.247 bits

where info([4,0],[3,2],[2,3]) = (4/14) info([4,0]) + (5/14) info([3,2]) + (5/14) info([2,3]) is the informational value of creating a branch on "outlook". (Figure: the yes/no distributions in the overcast, rainy and sunny branches.)
Methods _ divide-and-conquer: calculating information

Formula for information value (entropy):

entropy(p1, p2, …, pn) = -p1 log p1 - p2 log p2 - … - pn log pn

• Logarithms are expressed in base 2, so the unit is "bits".
• The arguments pi are fractions that add up to 1.

Example: info([2,3]) = entropy(2/5, 3/5) = -(2/5) log(2/5) - (3/5) log(3/5) = 0.971 bits
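To make the arithmetic concrete, here is a small self-contained Java sketch that reproduces the two numbers above, 0.971 bits for info([2,3]) and 0.247 bits for Gain(outlook); the class name is an illustrative assumption:

```java
public class InfoGain {
    // Entropy of a class distribution in bits (logarithms in base 2).
    static double entropy(int... counts) {
        int total = 0;
        for (int c : counts) total += c;
        double e = 0.0;
        for (int c : counts) {
            if (c == 0) continue;            // treat 0 * log 0 as 0
            double p = (double) c / total;
            e -= p * Math.log(p) / Math.log(2);
        }
        return e;
    }

    public static void main(String[] args) {
        // info([2,3]) = entropy(2/5, 3/5) = 0.971 bits
        System.out.printf("info([2,3])   = %.3f bits%n", entropy(2, 3));
        // Gain(outlook) = info([9,5]) minus the weighted info of the branches
        double gain = entropy(9, 5)
                - (4.0 / 14 * entropy(4, 0)     // overcast: [4,0]
                +  5.0 / 14 * entropy(3, 2)     // rainy:    [3,2]
                +  5.0 / 14 * entropy(2, 3));   // sunny:    [2,3]
        System.out.printf("Gain(outlook) = %.3f bits%n", gain);  // 0.247
    }
}
```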
Methods _ divide-and-conquer: constructing decision trees

(Figure: the partial tree after the first split. Outlook is at the root; the overcast branch is already a pure "yes" leaf, while the sunny and rainy branches are still unresolved "?".)
Methods _ divide-and-conquer

Step 2: select a daughter attribute for the branch outlook = sunny

(Figure: the yes/no distributions that result from splitting the five "sunny" instances on temperature, humidity and windy.)

Gain(humidity) = 0.971 bits > Gain(temperature) = 0.571 bits > Gain(windy) = 0.020 bits

Do this recursively!
Methods _ divide-and-conquer: constructing decision trees

(Figure: the finished tree. Outlook is at the root; sunny leads to humidity (high: no, normal: yes); overcast: yes; rainy leads to windy (true: no, false: yes).)

Stop rules (see the sketch below):
• stop when all leaf nodes are pure
• stop when no more attributes can be split
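A compact, self-contained sketch of this divide-and-conquer procedure (ID3-style, nominal attributes only, without the C4.5 extensions on the next slide). The Map-based instance encoding and all helper names are assumptions for illustration, not WEKA's API:

```java
import java.util.*;

public class DivideAndConquer {

    // Entropy (in bits) of the class distribution over the given rows.
    static double entropy(List<Map<String, String>> rows, String target) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> r : rows) counts.merge(r.get(target), 1, Integer::sum);
        double e = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / rows.size();
            e -= p * Math.log(p) / Math.log(2);
        }
        return e;
    }

    // Group rows by their value of the given attribute (one group per branch).
    static Map<String, List<Map<String, String>>> partition(
            List<Map<String, String>> rows, String attr) {
        Map<String, List<Map<String, String>>> parts = new LinkedHashMap<>();
        for (Map<String, String> r : rows)
            parts.computeIfAbsent(r.get(attr), k -> new ArrayList<>()).add(r);
        return parts;
    }

    // Information gain of splitting the rows on the given attribute.
    static double gain(List<Map<String, String>> rows, String attr, String target) {
        double after = 0.0;
        for (List<Map<String, String>> part : partition(rows, attr).values())
            after += (double) part.size() / rows.size() * entropy(part, target);
        return entropy(rows, target) - after;
    }

    // Majority class, used as the leaf label when no attributes remain.
    static String majority(List<Map<String, String>> rows, String target) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> r : rows) counts.merge(r.get(target), 1, Integer::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Returns either a String leaf label or a nested Map representing the tree.
    static Object buildTree(List<Map<String, String>> rows,
                            Set<String> attrs, String target) {
        // Stop rule 1: the node is pure. Stop rule 2: no more attributes to split.
        if (entropy(rows, target) == 0.0 || attrs.isEmpty())
            return majority(rows, target);
        String best = Collections.max(attrs,                 // select the attribute
                Comparator.comparingDouble(a -> gain(rows, a, target))); // with highest gain
        Set<String> rest = new HashSet<>(attrs);
        rest.remove(best);
        Map<String, Object> node = new LinkedHashMap<>();
        partition(rows, best).forEach((value, part) ->       // one branch per value,
                node.put(best + " = " + value, buildTree(part, rest, target))); // recurse
        return node;
    }
}
```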
Methods _ C4.5

WHY C4.5?
• Real-world data is more complicated:
- numeric attributes
- missing values
• A final solution needs more operations:
- pruning
- from trees to rules
Methods _ C4.5
• Numeric attributes: binary splits, with numeric thresholds placed halfway between adjacent values (see the sketch below)
• Missing values: simply ignoring them loses information; instead, split the instance into partial instances
• Pruning the decision tree: subtree replacement, subtree raising (figure: in subtree raising, subtree C is raised to replace its parent node B under A)
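A tiny sketch of the first bullet: candidate binary-split thresholds are placed halfway between adjacent distinct values of a numeric attribute. The temperature readings below are illustrative values, not data from this deck:

```java
import java.util.Arrays;

public class NumericSplit {
    public static void main(String[] args) {
        // Illustrative temperature readings; C4.5 would evaluate the
        // information gain of each candidate threshold and keep the best.
        double[] temps = {64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85};
        Arrays.sort(temps);
        for (int i = 0; i + 1 < temps.length; i++)
            if (temps[i] != temps[i + 1])   // halfway between adjacent distinct values
                System.out.println("candidate split: temperature < "
                        + (temps[i] + temps[i + 1]) / 2.0);
    }
}
```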
Application in WEKA
Application in WEKA

Data: clickstream from the EDOC server log of 30 March
Method: J4.8 algorithm
Objective: prediction of dissertation reading

Attributes:
HIST-DISS {1,0}, OT-PUB-READ {1,0}, OT-CONF {1,0}, SH-START {1,0}, SH-DOCSERV {1,0}, SH-DISS {1,0}, OT-BOOKS {1,0}, SH-START-E {1,0}, HOME {1,0}, AU-START {1,0}, DSS-LOOKUP {1,0}, SH-OTHER {1,0}, OTHER {1,0}, AUHINWEISE {1,0}, DSS-RVK {1,0}, AUTBERATUNG {1,0}, DSS-ABSTR {1,0}
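A minimal sketch of what such a run looks like in WEKA's Java API, assuming the EDOC clickstream has been exported to an ARFF file named edoc.arff with the target attribute in the last column (both assumptions; J48 is WEKA's class implementing the J4.8/C4.5 learner):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class EdocJ48 {
    public static void main(String[] args) throws Exception {
        // Assumed file name; one row per session, binary {1,0} page attributes.
        Instances data = new Instances(new BufferedReader(new FileReader("edoc.arff")));
        data.setClassIndex(data.numAttributes() - 1); // assumed: target last, e.g. DSS-ABSTR
        J48 tree = new J48();        // the J4.8 decision tree learner
        tree.buildClassifier(data);
        System.out.println(tree);    // print the induced (pruned) tree
    }
}
```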
Application in WEKA

Result: (figure: the J4.8 decision tree for the target attribute DSS-ABSTR)
Application in WEKA

(Figure: the J4.8 decision tree for the target attribute DSS-LOOKUP)
Discussion on Incomplete Data

Idea: site-centric data vs. user-centric data. Models learned from incomplete (site-centric) data are inferior to those learned from complete (user-centric) data.

Example:
User-centric data:
User1: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3
User2: Expedia1, Expedia2, Expedia3, Expedia4
Site-centric data (as seen by Expedia alone):
User1: Expedia1, Expedia2, Expedia3
User2: Expedia1, Expedia2, Expedia3, Expedia4

Padmanabhan, B., Zheng, Z., and Kimbrough, S. (2001)
Discussion on Incomplete Data

Results: lift curves (source: Padmanabhan, B., Zheng, Z., and Kimbrough, S. (2001), figures 6.6-6.9)
Discussion Questions & Outlook
• What is the proper target attribute for an analysis of a non-profit site?
• What data would we prefer to have?
• Which improvements could be made to the data?
References:
• Witten, I.H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Academic Press. Sections 3.1-3.3; 4.3; 6.1
• Padmanabhan, B., Zheng, Z., & Kimbrough, S. (2001). "Personalization from Incomplete Data: What You Don't Know Can Hurt."
• http://www.cs.cmu.edu/~awm/tutorials