
Data Classification by Statistical Methods

Wei-Min Shen

Information Sciences Institute

University of Southern California

UCLA Data Mining Short Course (2)



Outline

  • Basics of Probability Theory

  • Naive Bayesian Classifier

  • Boosting Algorithms

  • Bayesian Networks


Bayesian Probability Theory

  • What is probability?

    • Subjective vs objective

  • Simple and principled

    • P(A|C) + P(~A|C) = 1

    • P(AB|C) = P(A|C)P(B|AC) = P(B|C)P(A|BC)

  • Is there anything else?

    • P(A B|C) = 1 - P(~A~B|C) = …


Certainty and Impossibility

  • If A is certain when given C, then P(A|C)=1

    • Assume B doesn’t contradict C; then P(AB|C)=P(B|C) and P(A|BC)=P(A|C). By law 2 (the product law), P(AB|C)=P(B|C)P(A|BC), so P(B|C)=P(A|C)P(B|C), and hence P(A|C)=1

  • If A is impossible given C, then P(A|C)=0

    • By law 1


The Product and Sum Laws

  • P(AB|C) = P(A|C) P(B|AC) = P(B|C)P(A|BC)

    • If A,B independent, P(AB|C) = P(A|C)P(B|C)

    • If AB, then P(B|AC)=1, P(A|C)  P(B|C)

      • P(AB|C) = P(A|C)P(B|AC) = P(A|C)*1, or

      • P(AB|C) = min(P(A|C),P(B|C)), the “fuzzy” rule

  • P(AB|C) = P(A|C) + P(B|C) - P(AB|C)

    • If AB, then P(AB|C)=P(A|C), P(A|C)  P(B|C)

      • P(A B|C) = P(A|C)+P(B|C)-P(A|C) = P(B|C)

      • P(A B|C) = max(P(A|C),P(B|C)), the other “fuzzy” rule


Generalization of Classic Logic

  • P(A|C)P(B|AC) = P(B|C)P(A|BC)

  • Quantitative deductive inference

    • If AB and A, then B

    • If AB and ~B, then ~A (abduction)

    • If AB and B, then “A becomes more plausible”

  • Quantitative inductive inference

    • If AB and ~A, then “B becomes less plausible”

    • If A”B becomes more plausible” and B, then “A becomes more plausible”


Suitable for Mining

  • P(A|BC) = P(A|C)P(B|AC) / P(B|C)

    • P(A|BC): your new knowledge of the theory A after experiment B, given the background knowledge C

    • P(A|C): what you know about A without B,

    • P(B|AC): the probability of B if your current understanding of A were correct,

    • P(B|C): the probability of observing B anyway.

    • This is recursive: your new understanding of A becomes a part of C, which is used to make a newer understanding of A.

  • Caution: the recipe for “Rabbit Stew”


Concept Learning with Probability Theory

  • Given

    • X the current information about the learning task

    • Ai: “Ci is the target concept”

    • Dj: “the instance j is in the target concept”

  • Here is how

    • P(Ai|DjX) = P(Ai|X)P(Dj|AiX) / P(Dj|X)
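As a small illustration of this update, here is a minimal Python sketch; the candidate concepts, prior values, and likelihoods are invented for the example and are not from the course:

```python
# One step of the update P(Ai|DjX) = P(Ai|X) P(Dj|AiX) / P(Dj|X) over three
# hypothetical candidate concepts. All numbers are illustrative only.
priors = {"concept_1": 0.5, "concept_2": 0.3, "concept_3": 0.2}      # P(Ai|X)
likelihood = {"concept_1": 0.9, "concept_2": 0.5, "concept_3": 0.1}  # P(Dj|AiX)

def bayes_update(priors, likelihood):
    """Return posteriors; in the recursive view these become the next priors."""
    evidence = sum(priors[a] * likelihood[a] for a in priors)        # P(Dj|X)
    return {a: priors[a] * likelihood[a] / evidence for a in priors}

posterior = bayes_update(priors, likelihood)
print(posterior)   # concept_1 gains plausibility after this instance
```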


Naive Bayesian Classifier

  • Let A1,…,Ak be attributes, [a1, …, ak] an example, and C the class to be predicted; then the optimal prediction is the class value c for which P(C=c | A1=a1 … Ak=ak) is maximal.

  • By Bayes’ rule, this equals P(A1=a1 … Ak=ak | C=c) * P(C=c) / P(A1=a1 … Ak=ak)


Naïve Bayesian Classifier

  • P(C=c) is easy to estimate from training data

  • P(A1=a1 … Ak=ak ) is irrelevant (same for all c)

  • Compute P(A1=a1 … Ak=ak |C=c)

    • assume attributes are independent (naïve)

    • P(A1=a1|C=c) P(A2=a2|C=c) … P(Ak=ak|C=c)

    • where each term can be estimated as

    • P(Aj=aj|C=c) = count(Aj=aj and C=c) / count(C=c) (see the sketch below)
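A minimal sketch of this counting-based estimation and of the resulting classifier in Python; the function names and data layout are my own, but the arithmetic follows the two slides above:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Estimate P(C=c) and P(Aj=aj | C=c) by counting, as on the slide.

    examples: list of attribute tuples; labels: list of class values."""
    class_count = Counter(labels)
    attr_count = defaultdict(Counter)                 # per class: (j, aj) -> count
    for x, c in zip(examples, labels):
        for j, aj in enumerate(x):
            attr_count[c][(j, aj)] += 1
    n = len(labels)
    prior = {c: class_count[c] / n for c in class_count}
    cond = {c: {k: v / class_count[c] for k, v in attr_count[c].items()}
            for c in class_count}
    return prior, cond

def predict(prior, cond, x):
    """Return the class c maximizing P(C=c) * prod_j P(Aj=aj | C=c)."""
    def score(c):
        s = prior[c]
        for j, aj in enumerate(x):
            s *= cond[c].get((j, aj), 0.0)            # unseen value -> probability 0
        return s
    return max(prior, key=score)
```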


Tennis Example Revisited

  • [Day, Outlook, Temp, Humidity, Wind, PlayTennis]

    • D1   Sunny     Hot   High    Weak    No
      D2   Sunny     Hot   High    Strong  No
      D3   Overcast  Hot   High    Weak    Yes
      D4   Rain      Mild  High    Weak    Yes
      D5   Rain      Cool  Normal  Weak    Yes
      D6   Rain      Cool  Normal  Strong  No
      D7   Overcast  Cool  Normal  Strong  Yes
      D8   Sunny     Mild  High    Weak    No
      D9   Sunny     Cool  Normal  Weak    Yes
      D10  Rain      Mild  Normal  Weak    Yes
      D11  Sunny     Mild  Normal  Strong  Yes
      D12  Overcast  Mild  High    Strong  Yes
      D13  Overcast  Hot   Normal  Weak    Yes
      D14  Rain      Mild  High    Strong  No


Estimated Probabilities

  • P(Play=Yes | [Sunny, Hot, High, Weak]) = ?

  • P(Play=No | [Sunny, Hot, High, Weak]) = ? (winner)

    • P(Play=Yes) = 9/14

    • P(Play=No) = 5/14

    • P(Outlook=Sunny | Play=Yes) = 2/9

    • P(Outlook=Sunny | Play=No) = 3/5

    • P(Temp=Hot | Play=Yes) = 2/9

    • P(Temp=Hot | Play=No) = 2/5

    • P(Humidity=High | Play=Yes) = 3/9

    • P(Humidity=High | Play=No) = 4/5

    • P(Wind=Weak | Play=Yes) = 6/9

    • P(Wind=Weak | Play=No) = 2/5
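Multiplying these estimates (and ignoring the common denominator P([Sunny, Hot, High, Weak])) confirms why “No” wins:

```python
# Naive Bayes scores for [Sunny, Hot, High, Weak], using the estimates above.
score_yes = (9/14) * (2/9) * (2/9) * (3/9) * (6/9)   # ~0.0071
score_no  = (5/14) * (3/5) * (2/5) * (4/5) * (2/5)   # ~0.0274
print(score_yes, score_no)    # score_no is larger, so Play=No is predicted
```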


Boosting

  • Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor

  • For t =1 to T times:

    • Learn the classifier Ht on the weighted training examples

    • Increase the weights of training examples misclassified by Ht


The AdaBoost Algorithm (Freund and Schapire 1995)

  • Let xiX be an example, yi its class

  • Assign equal weight to all N examples, w(1)i=1/N

  • For t =1 through T do:

    • Given w(t)i, obtain a hypothesis H(t): X → [0,1].

    • Let the error of H(t) be ε(t) = Σi w(t)i |yi - H(t)(xi)|.

    • Let β(t) = ε(t) / (1 - ε(t)) and let w(t+1)i = w(t)i (β(t))^(1 - |yi - H(t)(xi)|)

    • Normalize w(t+1)i to sum to 1.0.

  • The final combination hypothesis H is:

    • H(x) = 1 / [1 + Πt (β(t))^(2r(x) - 1)], where

    • r(x) = [Σt (log 1/β(t)) H(t)(x)] / [Σt log 1/β(t)].
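A compact Python sketch of this loop, assuming binary labels yi in {0, 1}, hypotheses that map inputs into [0, 1], and a user-supplied learn(examples, labels, weights) routine (a placeholder, not something defined in the slides):

```python
import math

def adaboost(examples, labels, learn, T):
    """Sketch of the AdaBoost loop described above.

    examples: list of inputs; labels: list of 0/1 targets;
    learn(examples, labels, weights) -> hypothesis h with h(x) in [0, 1].
    Assumes each round's error stays strictly between 0 and 1.
    """
    n = len(examples)
    w = [1.0 / n] * n                                    # equal initial weights
    hyps, betas = [], []
    for _ in range(T):
        h = learn(examples, labels, w)
        err = sum(wi * abs(yi - h(xi))
                  for wi, xi, yi in zip(w, examples, labels))
        beta = err / (1.0 - err)
        # Correctly classified examples are multiplied by beta (< 1 when err < 0.5),
        # so misclassified examples carry relatively more weight next round.
        w = [wi * beta ** (1.0 - abs(yi - h(xi)))
             for wi, xi, yi in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]                     # renormalize to sum to 1
        hyps.append(h)
        betas.append(beta)

    def combined(x):
        """Final combination hypothesis H(x) from the slide."""
        z = sum(math.log(1.0 / b) for b in betas)
        r = sum(math.log(1.0 / b) * h(x) for h, b in zip(hyps, betas)) / z
        prod = 1.0
        for b in betas:
            prod *= b ** (2.0 * r - 1.0)
        return 1.0 / (1.0 + prod)

    return combined
```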


Boosted Naïve Bayesian Algorithm (Elkan 97)

  • Let w(t)i be the weights of the examples at round t; then H(t)(x) = H(t)([a1,…,ak]) = argmaxc {P(t)(A1=a1|C=c) … P(t)(Ak=ak|C=c) P(t)(C=c)}, where

    • P(t)(Aj=aj|C=c) = Σg w(t)g / Σh w(t)h, summing g over the examples with Aj=aj and C=c, and h over the examples with C=c

    • P(t)(C=c) = Σl w(t)l, summing l over the examples with C=c
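A minimal sketch of these weighted estimates, assuming the training data is given as (attribute-tuple, class, weight) triples whose weights sum to 1; the representation is my own choice:

```python
from collections import defaultdict

def weighted_estimates(weighted_examples):
    """weighted_examples: list of (x, c, w) triples, with the w summing to 1.
    Returns P(t)(C=c) and P(t)(Aj=aj | C=c) computed from weights instead of counts."""
    class_weight = defaultdict(float)      # sum of weights of examples with C = c
    attr_weight = defaultdict(float)       # sum of weights with Aj = aj and C = c
    for x, c, w in weighted_examples:
        class_weight[c] += w
        for j, aj in enumerate(x):
            attr_weight[(j, aj, c)] += w
    prior = dict(class_weight)             # weights sum to 1, so this is P(t)(C=c)
    cond = {(j, aj, c): wsum / class_weight[c]
            for (j, aj, c), wsum in attr_weight.items()}
    return prior, cond
```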


Tennis Example Revisited

  • [Day, Outlook, Temp, Humidity, Wind, PlayTennis]

    • (D1 Sunny Hot High Weak No) 1/28

    • (D2 Sunny Hot High Strong No) 3/28

    • (D3 Overcast Hot High Weak Yes) 1/28

    • (D4 Rain Mild High Weak Yes) 3/28

    • (D5 Rain Cool Normal Weak Yes) 1/28

    • (D6 Rain Cool Normal Strong No) 3/28

    • (D7 Overcast Cool Normal Strong Yes) 1/28

    • (D8 Sunny Mild High Weak No) 3/28

    • (D9 Sunny Cool Normal Weak Yes) 1/28

    • (D10 Rain Mild Normal Weak Yes) 3/28

    • (D11 Sunny Mild Normal Strong Yes) 1/28

    • (D12 Overcast Mild High Strong Yes) 3/28

    • (D13 Overcast Hot Normal Weak Yes) 1/28

    • (D14 Rain Mild High Strong No) 3/28


Estimate Probabilities Based on Weighted Examples

  • P(Play=Yes | [Sunny, Hot, High, Weak]) = ?

  • P(Play=No | [Sunny, Hot, High, Weak]) = ?

    • P(Play=Yes) = 15/28

    • P(Play=No) = 13/28

    • P(Outlook=Sunny | Play=Yes) = 2/15

    • P(Outlook=Sunny | Play=No) = 7/13


How Good Is the Boosted Naive Bayesian Classifier?

  • Very Efficient:

    • Suppose there are e examples over f attributes, each attribute with v values; then the algorithm’s complexity is O(ef), independent of v.

    • Learning a decision tree without pruning requires O(ef2) time.

  • Accuracy: won first place in the 1997 KDD Cup

  • “Remarkably successful in practice, and no uniformly better learning method is known.” (Elkan 97)


Bayesian Belief Networks

  • Concise representation of conditional probabilities used in Bayesian inference

  • Plays the role of the model in data mining

    • Can encode effects of actions

    • Patterns may be probabilistic

  • Can be learned

  • Powerful representation language

    • probabilistic propositional logic

    • no quantifiers

    • relations encoded by probabilistic correlation


Bayes Nets

  • Model supports several forms of inference

    • Diagnostic: from effects to causes

      • If meningitis and whiplash both cause a stiff neck and we observe a stiff neck, which cause is more likely?

      • P(Men|Stiff) vs. P(Whip|Stiff)

      • Useful in learning action models

      • Useful in identifying hidden state

    • Causal: from causes to effects

      • What are the likely effects of contracting meningitis?

      • P(Stiff|Men)

      • Useful in predicting effects of actions

    • Mixed: combination of the above.


More Probability Theory

  • Pr(Raining|C)

    • In logic, “Raining” is a Boolean variable

    • In probability, “Raining” is a special case of a random variable (a binary variable that is true with probability p)

    • General case: random variables can take on a set of values. Pr(Precipitation=0.5in)

    • We associate a probability density function (pdf) with the r.v. This defines all probabilities for this set.

    • f(Toothache) = { (True)-->0.05, (False)-->0.95 }

    • f(Age) = gaussian(45,20)


Joint Distribution

  • probability density function of a set of variables

    • f(Toothache, Cavity):

                    Toothache   Not Toothache
      Cavity          0.04          0.06          P(Cavity) = 0.1
      Not Cavity      0.01          0.89          P(~Cavity) = 0.9

  • joint distribution expresses relationship between variables

    • eg. Shoe size vs. reading-level

  • From joint, can derive all other probabilities

    • marginal probabilities : P(A), P(B)

    • conditional (from product rule): P(B|A) = P(AB)/P(A)


Example

Cavity  Ache  ProbeCatch  Probability
  T      T       T          0.03
  T      T       F          0.01
  T      F       T          0.045
  T      F       F          0.015
  F      T       T          0.0025
  F      T       F          0.0075
  F      F       T          0.2225
  F      F       F          0.6675

P(Probe) = 0.3, P(Cavity) = 0.1, P(Ache) = 0.05 (marginals computed from the table above)

P(Cavity | Probe and Ache) = ? (worked out in the sketch below)
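Answering this query directly from the joint table (a quick Python check; the dictionary simply transcribes the eight rows above):

```python
# Full joint over (Cavity, Ache, ProbeCatch), transcribed from the table above.
joint = {
    (True,  True,  True):  0.03,   (True,  True,  False): 0.01,
    (True,  False, True):  0.045,  (True,  False, False): 0.015,
    (False, True,  True):  0.0025, (False, True,  False): 0.0075,
    (False, False, True):  0.2225, (False, False, False): 0.6675,
}

def p(event):
    """Probability of the event defined by event(cavity, ache, probe)."""
    return sum(pr for outcome, pr in joint.items() if event(*outcome))

p_probe_ache = p(lambda cav, ache, probe: ache and probe)              # 0.0325
p_cav_probe_ache = p(lambda cav, ache, probe: cav and ache and probe)  # 0.03
print(p_cav_probe_ache / p_probe_ache)   # P(Cavity | Probe, Ache) ~ 0.92
```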


Conditional Independence

  • Joint distribution can be big and cumbersome

    • n Boolean variables --> 2^n entries

  • Often don’t need the whole table!

  • Definitions: A and B are independent if:

    • Pr(AB) = P(A)P(B)

    • Pr(A|B) = P(A) --> follows from product rule

  • A and B are conditionally independent given C if:

    • Pr(AB|C) = P(A|C)P(B|C), or Pr(A|BC) = Pr(A|C)

  • Can use this to simplify joint distribution function


Example

  • Toothache example

    • To do inference, we might need all 8 conditional probabilities

    • Probe and Ache are conditionally independent given Cavity

      P(Probe | Cavity,Ache) = P(Probe | Cavity)

      P(Ache | Cavity,Probe) = P(Ache | Cavity)

      P(Ache,Probe | Cavity) = P(Ache | Cavity)P(Probe | Cavity)

  • Can get away with fewer conditional probabilities
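Using the joint table from the Example slide above, this can be checked numerically: P(Probe | Cavity) = (0.03 + 0.045) / 0.1 = 0.75, while P(Probe | Cavity, Ache) = 0.03 / (0.03 + 0.01) = 0.75, so once Cavity is known, also knowing Ache does not change the probability of Probe.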


Intuition

  • There is a relationship between conditional independence and causality

    • If A causes B and C

      • There is a correlation between B and C

      • This correlation is broken given A

        • B and C are conditionally independent given A

    • Shoe size and reading level are independent given Age since age is the “direct cause” of both

  • Important because

    • people find it hard to recognize conditional independence

    • but find it easy to state direct causes


Bayes Belief Networks

  • Simple representation that “encodes” many conditional independencies

    • Each node is a random variable

    • A has directed link to B if A is “direct cause” of B

      • In statistical terms: Knowing all parents of B makes B conditionally independent of everything else

    • Each node has a conditional probability table that quantifies the effects parents have on the node

    • The graph has no directed cycles.

  • The joint distribution is the following product:

    • f(X1,…,Xn) = Πi P(Xi | Parents(Xi))


Example

[Network: Fraud, Age, and Sex have no parents; Fraud is the parent of Buys Gas; Fraud, Age, and Sex are the parents of Buys Jewelry]

Conditional Independencies

P(A|F)=P(A), P(S|FA)=P(S), P(G|FAS)=P(G|F), P(J|FASG)=P(J|FAS)

P(fasgj) = P(f)P(asgj|f)

= P(f)P(a|f)P(sgj|af)

= P(f)P(a|f)P(s|af)P(gj|saf)

= P(f)P(a|f)P(s|af)P(g|saf)P(j|gsaf) ;; use conditional independencies

= P(f) P(a) P(s) P(g|f) P(j|saf)
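A minimal Python sketch of evaluating this factored joint; the conditional probability values below are invented purely for illustration, since the slides give the structure but not the numbers:

```python
# Factored joint for the Fraud network:
#   P(f, a, s, g, j) = P(f) P(a) P(s) P(g|f) P(j|a, s, f)
# All probability values are made up for illustration only.
p_fraud = 0.001                               # P(Fraud = True)
p_age = 0.25                                  # P(Age = "senior"), say
p_sex = 0.5                                   # P(Sex = "male"), say
p_gas = {True: 0.2, False: 0.01}              # P(Buys Gas = True | Fraud)
p_jewelry = {                                 # P(Buys Jewelry = True | Age, Sex, Fraud)
    (True, True, True): 0.05,   (True, True, False): 0.002,
    (True, False, True): 0.05,  (True, False, False): 0.001,
    (False, True, True): 0.04,  (False, True, False): 0.002,
    (False, False, True): 0.04, (False, False, False): 0.001,
}

def joint(f, a, s, g, j):
    """Joint probability of one complete assignment via the factorization above."""
    pf = p_fraud if f else 1 - p_fraud
    pa = p_age if a else 1 - p_age
    ps = p_sex if s else 1 - p_sex
    pg = p_gas[f] if g else 1 - p_gas[f]
    pj = p_jewelry[(a, s, f)] if j else 1 - p_jewelry[(a, s, f)]
    return pf * pa * ps * pg * pj

print(joint(True, False, True, True, True))   # one entry of the full joint
```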


Inference

  • Inference in Bayes network:

    • fix the value of a subset of variables (evidence variables)

    • compute the posterior distribution of the remaining variables (query variables)

    • can do this for any arbitrary subset of variables

  • Several algorithms have been proposed

  • Easy if the variables are discrete and there are no undirected cycles

  • NP-hard if there are undirected cycles

    • in this case heuristic techniques are often effective
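For small discrete networks, the first two bullets can be realized by brute-force enumeration over the full joint; here is a minimal sketch (the interface is my own, and joint_prob can be any function that returns the joint probability of a complete assignment, for instance a small wrapper around the factored joint from the Fraud example):

```python
from itertools import product

def posterior(joint_prob, domains, evidence, query):
    """Posterior P(query | evidence) by enumerating the full joint.

    joint_prob: function mapping a complete assignment dict {var: value} to its probability
    domains:    dict {var: list of possible values}
    evidence:   dict {var: observed value} (must not include the query variable)
    query:      name of the query variable
    """
    hidden = [v for v in domains if v not in evidence and v != query]
    unnormalized = {}
    for qval in domains[query]:
        total = 0.0
        for combo in product(*(domains[v] for v in hidden)):
            assignment = dict(zip(hidden, combo))
            assignment.update(evidence)
            assignment[query] = qval
            total += joint_prob(assignment)
        unnormalized[qval] = total
    z = sum(unnormalized.values())
    return {val: p / z for val, p in unnormalized.items()}
```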
