Data Classification by Statistical Methods

Wei-Min Shen

Information Sciences Institute

University of Southern California

UCLA Data Mining Short Course (2)

- Basics of probability theory
- Naive Bayesian classifier
- Boosting algorithms
- Bayesian networks

- What is probability?
- Subjective vs objective

- Simple and principled
- Law 1 (sum rule): P(A|C) + P(~A|C) = 1
- Law 2 (product rule): P(AB|C) = P(A|C)P(B|AC) = P(B|C)P(A|BC)

- Is there anything else?
- P(A∨B|C) = 1 - P(~A~B|C) = …

- If A is certain given C, then P(A|C) = 1
- Assume B does not contradict C, so P(B|C) > 0. Then P(AB|C) = P(B|C) and P(A|BC) = P(A|C), so by law 2 P(B|C) = P(A|C)P(B|C), and therefore P(A|C) = 1

- If A is impossible given C, then P(A|C) = 0
- By law 1: ~A is certain given C, so P(~A|C) = 1 and P(A|C) = 1 - P(~A|C) = 0

- P(AB|C) = P(A|C)P(B|AC) = P(B|C)P(A|BC)
- If A and B are independent, P(AB|C) = P(A|C)P(B|C)
- If A ⇒ B, then P(B|AC) = 1 and P(A|C) ≤ P(B|C)
- P(AB|C) = P(A|C)P(B|AC) = P(A|C)*1, or
- P(AB|C) = min(P(A|C), P(B|C)), the “fuzzy” rule

- P(A∨B|C) = P(A|C) + P(B|C) - P(AB|C)
- If A ⇒ B, then P(AB|C) = P(A|C) and P(A|C) ≤ P(B|C)
- P(A∨B|C) = P(A|C) + P(B|C) - P(A|C) = P(B|C), or
- P(A∨B|C) = max(P(A|C), P(B|C)), the other “fuzzy” rule
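
The min/max special case can be checked on a small sample space. The sketch below is only an illustration: the fair die and the events A and B are assumptions chosen so that A implies B, and they do not come from the slides.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
omega = set(range(1, 7))
A = {2}          # "the roll is 2"
B = {2, 4, 6}    # "the roll is even"; A implies B because A is a subset of B

def prob(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event & omega), len(omega))

p_and = prob(A & B)   # P(AB)
p_or = prob(A | B)    # P(A or B)

# Sum rule: P(A or B) = P(A) + P(B) - P(AB)
assert p_or == prob(A) + prob(B) - p_and
# When A implies B, the product and sum rules reduce to the min/max rules.
assert p_and == min(prob(A), prob(B))
assert p_or == max(prob(A), prob(B))
print(p_and, p_or)   # 1/6 1/2
```

If A did not imply B (for example A = {1, 2}), the min/max identities would generally fail, while the sum rule would still hold.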

- P(A|C)P(B|AC) = P(B|C)P(A|BC)
- Quantitative deductive inference
- If A ⇒ B and A, then B
- If A ⇒ B and ~B, then ~A

- Quantitative inductive inference
- If A ⇒ B and B, then “A becomes more plausible” (abduction)
- If A ⇒ B and ~A, then “B becomes less plausible”
- If A ⇒ “B becomes more plausible” and B, then “A becomes more plausible”

- P(A|BC) = P(A|C)P(B|AC) / P(B|C)
- P(A|BC): your new knowledge of the theory A after experiment B, given the background knowledge C
- P(A|C): what you knew about A before seeing B
- P(B|AC): the probability of B if your current understanding of A were correct
- P(B|C): the probability of observing B regardless of whether A is correct
- This is recursive: your new understanding of A becomes part of C, which is then used to form a newer understanding of A.

- Caution: the recipe of “Rabbit Stew”
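
The recursive use of Bayes' rule can be sketched in a few lines. The function name `bayes_update` and all numbers below are hypothetical: they assume a theory A that predicts a successful experiment with probability 0.9, while the experiment still succeeds with probability 0.3 if A is false.

```python
def bayes_update(prior, p_b_given_a, p_b_given_not_a):
    """Return P(A|BC) from P(A|C), P(B|AC) and P(B|~A C)."""
    # P(B|C), obtained by summing over A and ~A.
    p_b = prior * p_b_given_a + (1.0 - prior) * p_b_given_not_a
    return prior * p_b_given_a / p_b

belief = 0.5            # hypothetical starting value of P(A|C)
history = [belief]
for _ in range(3):      # three successful experiments in a row
    # The posterior of one step becomes the prior of the next step.
    belief = bayes_update(belief, 0.9, 0.3)
    history.append(belief)

print([round(b, 3) for b in history])
```

Each pass folds the latest observation into the background knowledge, so the belief in A rises from 0.5 towards 1 as confirming experiments accumulate.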

- Given
- X: the current information about the learning task
- Ai: “Ci is the target concept”
- Dj: “the instance j is in the target concept”

- Here is how to update the belief in each candidate concept:
- P(Ai|DjX) = P(Ai|X)P(Dj|AiX) / P(Dj|X)

- Let A1, …, Ak be attributes, [a1, …, ak] an example, and C the class to be predicted. The optimal prediction is the class value c for which P(C=c | A1=a1 ∧ … ∧ Ak=ak) is maximal.
- By Bayes’ rule, this equals P(A1=a1 ∧ … ∧ Ak=ak | C=c) * P(C=c) / P(A1=a1 ∧ … ∧ Ak=ak)

- P(C=c) is easy to estimate from training data
- P(A1=a1 ∧ … ∧ Ak=ak) is irrelevant (same for all c)
- Compute P(A1=a1 ∧ … ∧ Ak=ak | C=c)
- Assume the attributes are independent given the class (naïve)
- P(A1=a1|C=c) P(A2=a2|C=c) … P(Ak=ak|C=c)
- where each term can be estimated as
- P(Aj=aj|C=c) = count(Aj=aj ∧ C=c) / count(C=c)

- [Day, Outlook, Temp, Humidity, Wind, PlayTennis]
- (D1 Sunny Hot High Weak No)
- (D2 Sunny Hot High Strong No)
- (D3 Overcast Hot High Weak Yes)
- (D4 Rain Mild High Weak Yes)
- (D5 Rain Cool Normal Weak Yes)
- (D6 Rain Cool Normal Strong No)
- (D7 Overcast Cool Normal Strong Yes)
- (D8 Sunny Mild High Weak No)
- (D9 Sunny Cool Normal Weak Yes)
- (D10 Rain Mild Normal Weak Yes)
- (D11 Sunny Mild Normal Strong Yes)
- (D12 Overcast Mild High Strong Yes)
- (D13 Overcast Hot Normal Weak Yes)
- (D14 Rain Mild High Strong No)

- P(Play=Yes | [Sunny, Hot, High, Weak]) = ?
- P(Play=No | [Sunny, Hot, High, Weak]) = ? (winner)
- P(Play=Yes) = 9/14
- P(Play=No) = 5/14
- P(Outlook=Sunny | Play=Yes) = 2/9
- P(Outlook=Sunny | Play=No) = 3/5
- P(Temp=Hot | Play=Yes) = 2/9
- P(Temp=Hot | Play=No) = 2/5
- P(Humidity=High | Play=Yes) = 3/9
- P(Humidity=High | Play=No) = 4/5
- P(Wind=Weak | Play=Yes) = 6/9
- P(Wind=Weak | Play=No) = 2/5
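
The two scores can be computed directly from the 14 examples. The following is a minimal sketch rather than a reference implementation; the names `DATA` and `naive_bayes_scores` are introduced here for illustration, and the estimates are plain frequency counts with no smoothing.

```python
from collections import Counter
from fractions import Fraction

# The 14 training examples: (Outlook, Temp, Humidity, Wind, PlayTennis).
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def naive_bayes_scores(data, example):
    """Return {c: P(C=c) * prod_j P(Aj=aj | C=c)} using frequency estimates."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for c, n_c in class_counts.items():
        score = Fraction(n_c, len(data))                 # P(C=c)
        for j, a in enumerate(example):
            n_match = sum(1 for row in data if row[-1] == c and row[j] == a)
            score *= Fraction(n_match, n_c)              # P(Aj=aj | C=c)
        scores[c] = score
    return scores

scores = naive_bayes_scores(DATA, ("Sunny", "Hot", "High", "Weak"))
total = sum(scores.values())
for c, s in scores.items():
    # Unnormalized score, then the posterior after dividing by the sum.
    print(c, round(float(s), 4), round(float(s / total), 3))
print("prediction:", max(scores, key=scores.get))
```

The unnormalized scores are 4/567 for Yes and 24/875 for No, so No wins with a normalized posterior of roughly 0.80.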

- Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
- For t = 1 to T:
- Learn the classifier Ht on the weighted training examples
- Increase the weights of the training examples misclassified by Ht

- Let $x_i \in X$ be an example and $y_i$ its class
- Assign equal weight to all N examples: $w_i^{(1)} = 1/N$
- For t = 1 through T do:
- Given $w_i^{(t)}$, obtain a hypothesis $H^{(t)}: X \to [0,1]$.
- Let the error of $H^{(t)}$ be $\epsilon^{(t)} = \sum_i w_i^{(t)} |y_i - H^{(t)}(x_i)|$.
- Let $\beta^{(t)} = \epsilon^{(t)} / (1 - \epsilon^{(t)})$ and let $w_i^{(t+1)} = w_i^{(t)} (\beta^{(t)})^{1 - |y_i - H^{(t)}(x_i)|}$.
- Normalize $w_i^{(t+1)}$ to sum to 1.0.

- The final combination hypothesis H is:
- $H(x) = 1 / [1 + \prod_t (\beta^{(t)})^{2r(x)-1}]$, where
- $r(x) = [\sum_t (\log 1/\beta^{(t)}) H^{(t)}(x)] / [\sum_t \log 1/\beta^{(t)}]$.
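
The loop above can be sketched as follows. The weak learner here is an assumption: a one-dimensional threshold "stump" stands in for whatever base classifier is actually boosted, and the ten-point data set is made up for illustration. The early stop when the error is 0 or at least 0.5 is also an added safeguard, not part of the slide.

```python
import math

def stump(theta, greater):
    """Weak hypothesis: 1.0 on one side of the threshold theta, 0.0 on the other."""
    if greater:
        return lambda x: 1.0 if x > theta else 0.0
    return lambda x: 1.0 if x < theta else 0.0

def weighted_error(h, xs, ys, w):
    return sum(wi * abs(y - h(x)) for wi, x, y in zip(w, xs, ys))

def best_stump(xs, ys, w):
    """Hypothetical weak learner: the stump with the lowest weighted error."""
    thresholds = [x - 0.5 for x in sorted(set(xs))] + [max(xs) + 0.5]
    candidates = [stump(t, g) for t in thresholds for g in (False, True)]
    return min(candidates, key=lambda h: weighted_error(h, xs, ys, w))

def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n                       # equal initial weights
    hyps, betas = [], []
    for _ in range(rounds):
        h = best_stump(xs, ys, w)
        eps = weighted_error(h, xs, ys, w)
        if eps <= 0.0 or eps >= 0.5:        # beta would be 0 or >= 1, so stop
            break
        beta = eps / (1.0 - eps)
        # Correct examples are scaled down by beta; misclassified ones keep
        # their weight, so they gain relative weight after normalization.
        w = [wi * beta ** (1.0 - abs(y - h(x))) for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
        hyps.append(h)
        betas.append(beta)

    def combined(x):
        # Assumes at least one round completed, so that logs is non-empty.
        logs = [math.log(1.0 / b) for b in betas]
        r = sum(l * h(x) for l, h in zip(logs, hyps)) / sum(logs)
        return 1.0 / (1.0 + math.prod(b ** (2.0 * r - 1.0) for b in betas))

    return combined

xs = list(range(10))
ys = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]         # no single stump fits this pattern
H = adaboost(xs, ys, rounds=3)
preds = [1 if H(x) > 0.5 else 0 for x in xs]
print(preds)   # [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]
```

On this toy data the best single stump has error 0.3, yet three boosted stumps classify every training point correctly.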

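- Let $w_i^{(t)}$ be the weights of the examples at time t. Then $H^{(t)}(x) = H^{(t)}([a_1,\ldots,a_k]) = \arg\max_c \{P^{(t)}(A_1=a_1|C=c) \cdots P^{(t)}(A_k=a_k|C=c)\, P^{(t)}(C=c)\}$, where
- $P^{(t)}(A_j=a_j|C=c) = \sum_{g:\, A_j=a_j \wedge C=c} w_g^{(t)} \,/\, \sum_{h:\, C=c} w_h^{(t)}$
- $P^{(t)}(C=c) = \sum_{l:\, C=c} w_l^{(t)}$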

- [Day, Outlook, Temp, Humidity, Wind, PlayTennis]
- (D1 Sunny Hot High Weak No) 1/28
- (D2 Sunny Hot High Strong No) 3/28
- (D3 Overcast Hot High Weak Yes) 1/28
- (D4 Rain Mild High Weak Yes) 3/28
- (D5 Rain Cool Normal Weak Yes) 1/28
- (D6 Rain Cool Normal Strong No) 3/28
- (D7 Overcast Cool Normal Strong Yes) 1/28
- (D8 Sunny Mild High Weak No) 3/28
- (D9 Sunny Cool Normal Weak Yes) 1/28
- (D10 Rain Mild Normal Weak Yes) 3/28
- (D11 Sunny Mild Normal Strong Yes) 1/28
- (D12 Overcast Mild High Strong Yes) 3/28
- (D13 Overcast Hot Normal Weak Yes) 1/28
- (D14 Rain Mild High Strong No) 3/28

- P(Play=Yes | [Sunny, Hot, High, Weak]) = ?
- P(Play=No | [Sunny, Hot, High, Weak]) = ?
- P(Play=Yes) = 15/28
- P(Play=No) = 13/28
- P(Outlook=Sunny | Play=Yes) = 2/15
- P(Outlook=Sunny | Play=No) = 7/13
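
With weighted examples, counts are replaced by sums of weights. The sketch below reproduces the four estimates above; the names `ROWS`, `WEIGHTS`, `p_class` and `p_attr_given_class` are introduced here for illustration only.

```python
from fractions import Fraction

# (Outlook, Temp, Humidity, Wind, PlayTennis) for D1..D14.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
# Weights from the slide: odd-numbered days 1/28, even-numbered days 3/28.
WEIGHTS = [Fraction(1, 28) if i % 2 == 0 else Fraction(3, 28) for i in range(14)]

def p_class(c):
    """Weighted estimate of P(C=c): total weight of the examples in class c."""
    return sum(w for row, w in zip(ROWS, WEIGHTS) if row[-1] == c)

def p_attr_given_class(j, a, c):
    """Weighted estimate of P(Aj=a | C=c)."""
    num = sum(w for row, w in zip(ROWS, WEIGHTS) if row[j] == a and row[-1] == c)
    return num / p_class(c)

print(p_class("Yes"), p_class("No"))                    # 15/28 13/28
print(p_attr_given_class(0, "Sunny", "Yes"),
      p_attr_given_class(0, "Sunny", "No"))             # 2/15 7/13
```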

- Very efficient:
- Suppose there are e examples over f attributes, each with v values. Then the algorithm’s complexity is O(ef), independent of v.
- Learning a decision tree without pruning requires O(ef²) time.

- Accuracy: won first place in the 1997 KDD Cup
- “Remarkably successful in practice, and no uniformly better learning method is known.” (Elkan97)

- Concise representation of conditional probabilities used in Bayesian inference
- Plays the role of the model in data mining
- Can encode effects of actions
- Patterns may be probabilistic

- Can be learned
- Powerful representation language
- probabilistic propositional logic
- no quantifiers
- relations encoded by probabilistic correlation

- Model supports several forms of inference
- Diagnostic: from effects to causes
- If meningitis and whiplash both cause a stiff neck and we observe a stiff neck, which cause is most likely?
- P(Men|Stiff) vs. P(Whip|Stiff)
- Useful in learning action models
- Useful in identifying hidden state

- Causal: from causes to effects
- What are the likely effects of contracting meningitis?
- P(Stiff|Men)
- Useful in predicting effects of actions

- Mixed: combination of the above.

- Pr(Raining|C)
- In logic, “Raining” is a Boolean variable
- In probability, “Raining” is a special case of a random variable (a Bernoulli variable, i.e. binomial with a single trial, with probability p)
- General case: random variables can take on a set of values. Pr(Precipitation=0.5in)
- We associate a probability density function (pdf) with the r.v. This defines all probabilities for this set.
- f(Toothache) = { (True)-->0.05, (False)-->0.95 }
- f(Age) = gaussian(45,20)

- probability density function of a set of variables
- f(Toothache, Cavity):

| | Toothache | Not Toothache | Total |
|---|---|---|---|
| Cavity | .04 | .06 | P(Cavity) = .1 |
| Not Cavity | .01 | .89 | P(-Cavity) = .9 |

- joint distribution expresses relationship between variables
- e.g. shoe size vs. reading level

- From joint, can derive all other probabilities
- marginal probabilities : P(A), P(B)
- conditional (from product rule): P(B|A) = P(AB)/P(A)
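
Both derivations can be sketched from the f(Toothache, Cavity) values. The cell assignment is an assumption: .04 is taken as the Cavity-and-Toothache entry, which is the only assignment consistent with f(Toothache) = 0.05 and P(Cavity) = .1.

```python
# Joint distribution f(Toothache, Cavity), keyed by (cavity, toothache).
joint = {
    (True, True): 0.04, (True, False): 0.06,
    (False, True): 0.01, (False, False): 0.89,
}

# Marginals: sum the joint over the variable that is not of interest.
p_cavity = sum(p for (cav, _), p in joint.items() if cav)
p_toothache = sum(p for (_, ache), p in joint.items() if ache)

# Conditional from the product rule: P(Cavity|Toothache) = P(Cavity, Toothache) / P(Toothache)
p_cavity_given_toothache = joint[(True, True)] / p_toothache

print(round(p_cavity, 2), round(p_toothache, 2), round(p_cavity_given_toothache, 2))
# 0.1 0.05 0.8
```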

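| Cavity | Ache | ProbeCatch | Probability |
|---|---|---|---|
| T | T | T | 0.03 |
| T | T | F | 0.01 |
| T | F | T | 0.045 |
| T | F | F | 0.015 |
| F | T | T | 0.0025 |
| F | T | F | 0.0075 |
| F | F | T | 0.2225 |
| F | F | F | 0.6675 |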

P(Probe) = 0.3, P(Cavity) = 0.1, P(Ache) = 0.05

P(Cavity| Probe and Ache) ?
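
The question can be answered straight from the eight joint entries by dividing the probability of Cavity together with the evidence by the probability of the evidence. This is a minimal sketch; the helper name `prob` is introduced here for illustration.

```python
from itertools import product

# Joint entries in the table's row order TTT, TTF, TFT, TFF, FTT, FTF, FFT, FFF,
# keyed by (cavity, ache, probe_catch).
probs = [0.03, 0.01, 0.045, 0.015, 0.0025, 0.0075, 0.2225, 0.6675]
joint = dict(zip(product([True, False], repeat=3), probs))

def prob(pred):
    """Sum the joint over all rows for which pred(cavity, ache, probe) holds."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

p_evidence = prob(lambda cav, ache, probe: ache and probe)
p_cavity_and_evidence = prob(lambda cav, ache, probe: cav and ache and probe)
posterior = p_cavity_and_evidence / p_evidence

print(round(posterior, 4))   # 0.9231
```

So P(Cavity | Probe and Ache) = 0.03 / 0.0325, about 0.92, compared with a prior of 0.1 obtained by summing the four Cavity = T rows.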

- Joint distribution can be big and cumbersome
- n Boolean variables --> 2^n entries

- Often don’t need the whole table!
- Definitions: A and B are independent if:
- Pr(AB) = P(A)P(B)
- Pr(A|B) = P(A) --> follows from product rule

- A and B are conditionally independent given C if:
- Pr(AB|C) = P(A|C)P(B|C), or Pr(A|BC) = Pr(A|C)

- Can use this to simplify joint distribution function

- Toothache example
- To do inference, we might need all 8 probabilities of the joint table
- Probe and Ache are conditionally independent given Cavity
- P(Probe | Cavity,Ache) = P(Probe | Cavity)
- P(Ache | Cavity,Probe) = P(Ache | Cavity)
- P(Ache,Probe | Cavity) = P(Ache | Cavity)P(Probe | Cavity)

- Can get away with fewer conditional probabilities
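
As a rough check, five numbers are enough to rebuild all eight joint entries when Probe and Ache are conditionally independent given Cavity. The five values below are not stated on the slides; they are derived here from the joint table (for example P(Ache | Cavity) = 0.04 / 0.1), and `bern` is a helper introduced for illustration.

```python
from itertools import product

p_cavity = 0.1
p_ache = {True: 0.4, False: 0.01 / 0.9}    # P(Ache | Cavity), P(Ache | ~Cavity)
p_probe = {True: 0.75, False: 0.25}        # P(Probe | Cavity), P(Probe | ~Cavity)

def bern(p, value):
    """Probability that a Boolean variable with P(True) = p takes the given value."""
    return p if value else 1.0 - p

# P(c, a, pr) = P(c) * P(a | c) * P(pr | c), using the conditional independence.
joint = {
    (c, a, pr): bern(p_cavity, c) * bern(p_ache[c], a) * bern(p_probe[c], pr)
    for c, a, pr in product([True, False], repeat=3)
}

for key, p in joint.items():
    print(key, round(p, 4))
```

The rebuilt entries agree with the original table (0.03, 0.01, 0.045, 0.015, 0.0025, 0.0075, 0.2225, 0.6675) up to floating-point rounding.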

- There is a relationship between conditional independence and causality
- If A causes B and C
- There is a correlation between B and C
- This correlation is broken given A
- B and C are conditionally independent given A

- Shoe size and reading level are independent given Age since age is the “direct cause” of both

- Important because
- people find it hard to recognize conditional independence
- but find it easy to state direct causes

- Simple representation that “encodes” many conditional independencies
- Each node is a random variable
- A has directed link to B if A is “direct cause” of B
- In statistical terms: knowing all parents of B makes B conditionally independent of all its non-descendants

- Each node has a conditional probability table that quantifies the effects parents have on the node
- The graph has no directed cycles.

- The joint distribution is the following product:
- $f(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Parents}(X_i))$

[Figure: Bayesian network over the variables Fraud, Age, Sex, Buys Gas, and Buys Jewelry]

Conditional Independencies

P(A|F)=P(A), P(S|FA)=P(S), P(G|FAS)=P(G|F), P(J|FASG)=P(J|FAS)

P(fasgj) = P(f)P(asgj|f)

= P(f)P(a|f)P(sgj|af)

= P(f)P(a|f)P(s|af)P(gj|saf)

= P(f)P(a|f)P(s|af)P(g|saf)P(j|gsaf) ;; use conditional independencies

= P(f) P(a) P(s) P(g|f) P(j|saf)
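
The final factorization can be turned into code once the conditional probability tables are known. The slides give no numbers, so every value below is hypothetical, and Age and Sex are simplified to Boolean variables (young / not young, male / not male) purely for illustration.

```python
from itertools import product

# Hypothetical CPT values, chosen only for illustration.
P_F = 0.01                         # P(fraud)
P_A = 0.25                         # P(age = young)
P_S = 0.5                          # P(sex = male)
P_G = {True: 0.2, False: 0.01}     # P(buys gas | fraud)
P_J = {                            # P(buys jewelry | fraud, young, male)
    (True, True, True): 0.05, (True, True, False): 0.05,
    (True, False, True): 0.05, (True, False, False): 0.05,
    (False, True, True): 0.0001, (False, True, False): 0.0005,
    (False, False, True): 0.0004, (False, False, False): 0.002,
}

def bern(p, value):
    """Probability that a Boolean variable with P(True) = p takes the given value."""
    return p if value else 1.0 - p

def joint(f, a, s, g, j):
    """P(f, a, s, g, j) = P(f) P(a) P(s) P(g|f) P(j|s,a,f)."""
    return (bern(P_F, f) * bern(P_A, a) * bern(P_S, s)
            * bern(P_G[f], g) * bern(P_J[(f, a, s)], j))

# 1 + 1 + 1 + 2 + 8 = 13 parameters define all 2^5 = 32 joint entries.
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(round(total, 10))   # 1.0
```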

- Inference in Bayes network:
- fix the value of a subset of variables (evidence variables)
- compute the posterior distribution of the remaining variables (query variables)
- can do this for any arbitrary subset of variables

- Several algorithms have been proposed
- Easy if the variables are discrete and the graph has no undirected cycles
- NP-hard if there are undirected cycles
- in this case heuristic techniques are often effective
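
The simplest exact method is inference by enumeration: fix the evidence, sum the joint over every unobserved variable, and normalize. The sketch below assumes the small network Cavity -> Ache, Cavity -> Probe, with table values derived from the toothache joint distribution; it is not one of the efficient algorithms mentioned above, and its cost grows exponentially with the number of variables.

```python
from itertools import product

VARS = ("cavity", "ache", "probe")

def joint(cavity, ache, probe):
    """P(cavity) * P(ache | cavity) * P(probe | cavity)."""
    p_c = 0.1 if cavity else 0.9
    p_a = 0.4 if cavity else 0.01 / 0.9
    p_a = p_a if ache else 1.0 - p_a
    p_p = 0.75 if cavity else 0.25
    p_p = p_p if probe else 1.0 - p_p
    return p_c * p_a * p_p

def posterior(query, evidence):
    """P(query = True | evidence), summing the joint over all consistent worlds."""
    num = den = 0.0
    for values in product([True, False], repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue                      # inconsistent with the evidence
        p = joint(**world)
        den += p
        if world[query]:
            num += p
    return num / den

print(round(posterior("cavity", {"ache": True}), 3))    # diagnostic: 0.8
print(round(posterior("probe", {"cavity": True}), 3))   # causal: 0.75
```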
