Learning & Data Mining
Presentation Transcript

  1. Learning & Data Mining

  2. Learning
  • A change in the content and organization of a system's knowledge that enables it to improve its performance on a task (Simon)
  • Acquire new knowledge from the environment
  • Organize its current knowledge
  • Inductive inference
    • Draw a general conclusion from examples
    • Infer an association between input and output, with some confidence
  • Incremental vs. batch learning

  3. General Model of a Learning Agent
  [Diagram: an agent senses the Environment through Sensors and acts through Effectors; a Critic compares performance against a Performance Standard and sends Feedback to the Learning Module, which makes Changes to the Performance Module's Knowledge, sets Learning Goals, and drives a Problem Generator. From Artificial Intelligence: A Modern Approach by Russell and Norvig.]

  4. Classification of Inductive Learning
  • Supervised learning
    • Given training examples: correct input-output pairs
    • Recover the unknown function from data generated by that function
    • Generalization ability for unseen data
    • Classification: the function is discrete
    • Concept learning: the output is binary
  • Unsupervised learning

  5. Classification of Inductive Learning
  • Supervised learning
  • Unsupervised learning
    • No correct input-output pairs
    • Needs another source for determining correctness
    • Reinforcement learning: a yes/no answer only (example: chess playing)
    • Clustering: group data into clusters with common characteristics
    • Map learning: explore unknown territory
    • Discovery learning: uncover new relationships

  6. Data Mining
  • Definition of data mining:
    • The task of extracting implicit, previously unknown, and potentially useful information from large volumes of real data
  • Cf. KDD (Knowledge Discovery in Databases): the entire process of extracting knowledge from data; data mining ⊂ KDD

  7. Data Mining Technology (I)
  [Diagram: KDD / data mining draws on expert systems, machine learning, databases, statistics, and visualization.]

  8. Data Mining Technology (II)
  • Primary tasks of data mining:
    • Classification
    • Clustering
    • Characterization (summarization)
    • Trend analysis
    • Association (market basket analysis)
    • Pattern analysis
    • Estimation
    • Prediction

  9. Data Mining Technology (III)
  • Application areas:
    • Marketing & retail
    • Banking
    • Finance
    • Insurance
    • Medicine & health (genetics)
    • Quality control
    • Transportation
    • Geospatial applications

  10. Data Mining Tasks (1)
  • Classification: assign objects to predefined classes
  • Examples:
    • News ⇒ [international] [domestic] [sports] [culture] …
    • Objects ⇒ [large] [medium] [small]

  11. Data Mining Tasks (2)
  • Classification (continued)
    • Credit application ⇒ [high] [medium] [low]
    • Water sample ⇒ [grade 1] [grade 2] … [sewage]
  • Algorithms: decision trees, memory-based reasoning

  12. Data Mining Tasks (3)
  • Estimation: maps data (attr1, attr2, attr3, …) to a continuous value (cf. classification, which maps to discrete categories)
  • Examples:
    • Age, sex, blood pressure, … ⇒ remaining lifespan
    • Age, sex, occupation, … ⇒ annual income
    • Region, water volume, population ⇒ pollution concentration
  • Algorithm: neural networks
  • Estimating a future value is called prediction

  13. Data Mining Tasks (4)
  • Association (market basket analysis): determine which things go together
  • Example:
    • Shopping list ⇒ cross-selling (supermarket shelf layout, catalogs, commercials, home shopping, e-shopping, …)
  • Association rules

  14. Data Mining Tasks (5)
  • Clustering: partition a heterogeneous population into homogeneous subgroups (clusters) G1, G2, G3, G4, …
  • Cf. classification uses predefined categories; clustering finds new categories and explains them

  15. Data Mining Tasks (6)
  • Clustering (continued)
  • Examples:
    • Symptoms ⇒ disease
    • Customer information ⇒ selective sales
    • Soil (water quality) data
  • Note: clustering depends on the features used
    • e.g. cards: number, color, suit, …

  16. Data Mining Tasks (7)
  • Clustering (continued)
  • Clustering is useful for finding exceptions:
    • Calling card fraud detection
    • Credit card fraud, etc.
  • Algorithm: k-means ⇒ k clusters
  • Note: directed vs. non-directed KDD

  17. Data Mining Technology (IV)
  • Data mining techniques:
    • Association rules
    • k-nearest neighbors
    • Decision trees
    • Neural networks
    • Genetic algorithms
    • Statistical techniques

  18. Market Basket Analysis (Associations) (1/10)
  • Item codes: O = orange juice, M = milk, S = soda, W = window cleaner, D = detergent
  [The table of example customer baskets is not in the transcript.]

  19. Market Basket Analysis (Associations) (2/10)
  • Co-occurrence table

  20. Market Basket Analysis (Associations) (3/10)
  • {S, O}: co-occurrence count of 2
  • R1: if S then O
  • R2: if O then S
  • Support: what percentage of all the data contains the itemset?
  • Confidence: of the data containing the LHS, what percentage satisfies the rule?
  • e.g. support of R1 = 2/5 = 40%; confidence of R1 = 2/3; confidence of R2 = 2/4
  • Support and confidence determine "how good" the rule is
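The support and confidence arithmetic above can be reproduced in a few lines of Python. The five baskets below are hypothetical, chosen only so that the counts match the slide (S in 3 of 5 baskets, O in 4 of 5, {S, O} together in 2):

```python
# Hypothetical baskets chosen so the counts match the slide:
# S appears in 3 of 5 baskets, O in 4 of 5, and {S, O} together in 2.
baskets = [
    {"O", "S"}, {"O", "S"}, {"O", "M"}, {"O", "W"}, {"S", "D"},
]

def support(itemset, baskets):
    """Fraction of all baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Of the baskets containing `lhs`, the fraction also containing `rhs`."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)

print(support({"S", "O"}, baskets))       # 0.4 -> support of R1 is 40%
print(confidence({"S"}, {"O"}, baskets))  # 2/3 -> confidence of R1
print(confidence({"O"}, {"S"}, baskets))  # 0.5 -> confidence of R2
```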

  21. Market Basket Analysis (Associations) (4/10) • Probability Table {A, B, C}

  22. Market Basket Analysis (Associations) (5/10)
  • R1: if A ∧ B then C
  • R2: if A ∧ C then B
  • R3: if B ∧ C then A
  • Compare the rules by confidence; support = 5

  23. Market Basket Analysis (Associations) (6/10)
  • R3 has the best confidence (0.33), but is it GOOD?
  • Note: R3 (if B ∧ C then A) has confidence 0.33, while P(A) alone is 0.45
    • e.g. "people with long hair are women": a rule can look plausible yet do no better than the base rate
  • Improvement ⇒ how good is the rule compared to random guessing?

  24. Market Basket Analysis (Associations) (7/10)
  • improvement = P(condition and result) / ( P(condition) × P(result) )
  • Criterion: a rule is useful when improvement > 1
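A minimal sketch of the improvement computation, using hypothetical baskets in which P(S) = 0.6, P(O) = 0.8, and P(S and O) = 0.4:

```python
# Hypothetical baskets with P(S) = 0.6, P(O) = 0.8, P(S and O) = 0.4.
baskets = [
    {"O", "S"}, {"O", "S"}, {"O", "M"}, {"O", "W"}, {"S", "D"},
]

def p(itemset):
    """Probability that a basket contains every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# improvement = P(condition and result) / (P(condition) * P(result))
improvement = p({"S", "O"}) / (p({"S"}) * p({"O"}))
print(improvement)  # ~0.833: below 1, so the rule is worse than random guessing
```

Here the items co-occur less often than independence would predict, so improvement < 1 and the rule fails the criterion despite its confidence.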

  25. Market Basket Analysis (Associations) (8/10)
  • Some issues:
    • Overall algorithm: build co-occurrence matrices for 1 item, 2 items, 3 items, etc. ⇒ complex!
    • Pruning, e.g. minimum-support pruning
    • Virtual items: season, store, geographic information, etc., combined with real items
      e.g. if OJ ∧ Milk ∧ Friday then Beer

  26. Market Basket Analysis (Associations) (9/10)
  • Level of description: how specific? (Drink ⊃ Soda ⊃ Coke)
  • Strengths:
    • Explainability
    • Undirected data mining
    • Variable-length data
    • Simple computation

  27. Market Basket Analysis (Associations) (10/10)
  • Weaknesses:
    • Becomes complex as data grows
    • Limited data types (attributes)
    • Difficult to determine the right number of items
    • Rare items tend to be pruned away

  28. Clustering Algorithm (1/2)
  • k-means method (MacQueen '67); many variations exist
  • Algorithm:
    1. Choose k initial points (seeds)
    2. Assign each record to its closest seed (initial clusters)
    3. Compute the centroid of each cluster and use it as the new seed
    4. Go to step 2; stop when the clusters no longer change
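The four steps can be sketched in plain Python. Seeding with the first k points is an arbitrary simplification (real implementations pick seeds randomly or with smarter schemes):

```python
def kmeans(points, k, iters=100):
    """A minimal k-means sketch on 2-D points given as (x, y) tuples."""
    centroids = points[:k]                      # step 1: pick k initial seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 2: assign to the closest seed
            i = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                            + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        new = [(sum(x for x, _ in c) / len(c),  # step 3: centroid of each cluster
                sum(y for _, y in c) / len(c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:                    # step 4: stop when nothing changes
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```

On this toy data the loop converges in a few iterations to the two well-separated groups.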

  29. Clustering Algorithm (2/2)
  • Finding the centroid of points (x1, y1), (x2, y2), …, (xn, yn):
    ( (x1 + … + xn) / n , (y1 + … + yn) / n )
  • Note: finding the closest neighbors uses the chosen distance measure

  30. Variations of k-means
  1. Use a probability density rather than a simple distance, e.g. Gaussian mixture models
  2. Weighted distance
  3. Agglomeration method ⇒ hierarchical clustering

  31. Agglomerative Algorithm
  1. Start with every single record as its own cluster (N clusters)
  2. Select the closest pair of clusters and combine them (N−1 clusters)
  3. Go to step 2
  4. Stop at the right level (number of clusters)
  • What does "closest" mean?

  32. Distance Between Clusters
  • Three measures:
    1. Single linkage: distance between the closest members
    2. Complete linkage: distance between the most distant members
    3. Centroids: distance between the cluster centroids
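A naive O(n³) sketch of the agglomerative algorithm using the first of these measures, single linkage; the sample points are made up:

```python
def dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def single_linkage(a, b):
    """Distance between the closest members of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points, k):
    clusters = [[p] for p in points]          # step 1: every record is a cluster
    while len(clusters) > k:                  # step 4: stop at the right number
        # steps 2-3: find the closest pair of clusters and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters = agglomerate(pts, 2)
```

Swapping `min` for `max` inside `single_linkage` would give complete linkage; the centroid measure would compare cluster means instead.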

  33. Clustering
  • Strengths:
    1. Undirected knowledge discovery
    2. Suitable for categorical, numeric, and textual data
    3. Easy to apply
  • Weaknesses:
    1. Can be difficult to choose the right (distance) measure and weights
    2. Sensitive to initial parameters
    3. Can be hard to interpret

  34. Decision Tree (Contact Lens)
  [Diagram: split on tear production; reduced ⇒ none; normal ⇒ test astigmatism; no astigmatism ⇒ soft; astigmatism ⇒ test spectacle prescription, myope ⇒ hard, hypermetrope ⇒ none.]

  35. Concept Learning
  [Diagram: a learning function maps an input to one of Class 1 … Class n (classification); concept learning maps an input to yes/no for a single concept, e.g. "good customer", and can be represented as a decision tree.]

  36. Weather Data
  • Abbreviations used in the attribute/instance table: outlook (s = sunny, o = overcast, r = rainy), temperature (h = hot, m = mild, c = cool), humidity (h = high, n = normal)
  [The table itself is not in the transcript.]

  37. Decision Tree for Weather (1/4)
  [Diagram: split on outlook; sunny ⇒ test humidity (high ⇒ no, normal ⇒ yes); overcast ⇒ yes; rainy ⇒ test windy (t ⇒ no, f ⇒ yes).]
  • If outlook = sunny and humidity = high, then play = no

  38. Decision Tree for Weather (2/4)
  • Note: temp and humidity can be numeric data, e.g.:
    • temp > 30 ⇒ hot
    • 10 ≤ temp ≤ 30 ⇒ normal
    • temp < 10 ⇒ cool

  39. Decision Tree for Weather (3/4)
  • Attribute types:
    • Nominal (categorical, discrete)
    • Ordinal (ordered values)
    • Interval, e.g. [10, 20]
    • Ratio (real numbers)

  40. Decision Tree for Weather (4/4)
  • Note: a leaf node doesn't have to be yes/no ⇒ general classification
  [Diagram: contact-lens tree; tear production reduced ⇒ no lens; normal ⇒ test astigmatism ⇒ none / soft / hard.]

  41. Prediction Using a Decision Tree
  • Build candidate trees A, B, C, … from the training set
  • Choose the best tree (say B) on the test set
  • Predict expected performance on the evaluation set
  • Apply the chosen tree to real data

  42. Box Diagram of a Decision Tree
  [Diagram: the weather tree drawn as nested boxes; outlook splits into sunny (test humidity), overcast (yes), and rain (test windy), with the y/n training instances placed in each box.]

  43. The Effect of Pruning
  [Plot: error rate vs. depth of tree; the training-data curve keeps falling while the unseen-data curve turns upward, marking where to prune.]
  • Some issues:
    • Where to prune? Too high ⇒ unnecessarily complex; too low ⇒ lose information
    • What to split on (first)?

  44. Error Rate
  • Example: a leaf holding instances [y y y n y y n] has error rate 2/7
  • Adjusted error rate of a tree: AE(T) = E(T) + α · leaf-count(T)
  • Find a subtree T1 of T such that AE(T1) ≤ AE(T), then prune all branches that are not part of T1
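The pruning criterion can be illustrated with made-up numbers; `alpha` and the error and leaf counts below are hypothetical:

```python
def adjusted_error(error, leaves, alpha=0.05):
    """AE(T) = E(T) + alpha * leaf-count(T); alpha penalizes tree size."""
    return error + alpha * leaves

full_tree = adjusted_error(error=2 / 7, leaves=7)  # deeper tree: lower raw error
pruned = adjusted_error(error=3 / 7, leaves=2)     # subtree: higher raw error, fewer leaves
print(pruned <= full_tree)  # True -> prune the branches outside the subtree
```

With these numbers the size penalty outweighs the subtree's extra misclassifications, so the smaller tree wins.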

  45. Possible Subtrees for Weather Data (1/2)
  • Which attribute to split on first?
  [Diagrams: (a) split on outlook (sunny / overcast / rainy) and (b) split on temp (hot / mild / cool), each showing the y/n class counts in the branches.]

  46. Possible Subtrees for Weather Data (2/2)
  [Diagrams: (c) split on humidity (high / normal) and (d) split on windy (true / false), each showing the y/n class counts in the branches.]

  47. Information Theory & Entropy
  • For the outlook split: info([2,3]) = 0.971 bits, info([4,0]) = 0.0 bits, info([3,2]) = 0.971 bits
    ⇒ info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
  • gain(outlook) = info([9,5]) − info([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247 bits
  • gain(temp) = 0.029 bits, gain(humidity) = 0.152 bits, gain(windy) = 0.048 bits

  48. Calculating info(x): Entropy
  • If either #yes or #no is 0, then info(x) = 0
  • If #yes = #no, then info(x) takes its maximum value
  • Covers the multi-class case, e.g. info([2,3,4]) = info([2,7]) + (7/9) × info([3,4])
  • entropy(p1, p2, …, pn) = −p1 log p1 − p2 log p2 − … − pn log pn
  • info([2,3,4]) = entropy(2/9, 3/9, 4/9)
    = −(2/9) log(2/9) − (3/9) log(3/9) − (4/9) log(4/9)
    = [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9
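A short Python sketch reproducing the entropy and gain figures from the two slides above (log base 2, so results are in bits):

```python
from math import log2

def entropy(counts):
    """info([a, b, ...]) in bits for the given class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_info(branches):
    """Weighted average entropy after a split, e.g. outlook -> [2,3], [4,0], [3,2]."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches)

gain_outlook = entropy([9, 5]) - split_info([[2, 3], [4, 0], [3, 2]])
print(round(entropy([2, 3]), 3))  # 0.971
print(round(gain_outlook, 3))     # 0.247
```

Outlook has the largest gain of the four attributes, which is why it becomes the root of the weather tree.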

  49. Algorithms: CART, C4.5
  • CART (Breiman '84): binary trees only
  • C4.5 (Quinlan), the successor of ID3 (Quinlan '86)
  • CHAID (Kass '80)
  • Commercial tools: Clementine, NCR