Exploiting Common SubRelations: Learning One Belief Net for Many Classification Tasks

##### Presentation Transcript

1. Exploiting Common SubRelations: Learning One Belief Net for Many Classification Tasks • R Greiner, Wei Zhou • University of Alberta

2. Situation • CHALLENGE: Need to learn k classifiers • Cancer, from medical symptoms • Meningitis, from medical symptoms • Hepatitis, from medical symptoms • … • Option 1: Learn k different classifier systems $\{S_{Cancer}, S_{Menin}, \ldots, S_k\}$ • Then use $S_i$ to deal with the i-th “query class” • but… need to re-learn inter-relations among Factors, Symptoms, common to all k classifiers

3. Common Interrelationships • [Figure; node labels: Cancer, Menin]

4. Use Common Structure! • CHALLENGE: Need to learn k classifiers • Cancer, from medical symptoms • Meningitis, from symptoms • Hepatitis, from symptoms • … • Option 2: Learn 1 “structure” S of relationships, then use S to address all k classification tasks • Actual Approach: Learn 1 Bayesian BeliefNet, inter-relating info for all k types of queries

5. Outline • Motivation • Handle multiple class variables • Framework • Formal model • Belief Nets, …-classifier • Results • Theoretical Analysis • Algorithms (Likelihood vs Conditional Likelihood) • Empirical Comparison • 1 Structure vs k Structures; LL vs LCL • Contributions

6. MC-Learner • [Diagram; labels: Training Data, MC-Learner, MC]

7. Multi-Classifier I/O • Given a “query”: a “class variable” Q and “evidence” E=e • e.g., Cancer=?, given Gender=F, Age=35, Smoke=t • Return value Q = q • e.g., Cancer = Yes

8. MultiClassifier • Like standard Classifiers, can deal with • different evidence E • different evidence values e • Unlike standard Classifiers, can deal with • different class variables Q • Able to “answer queries” • classify new unlabeled tuples • Given “Q=?, given E=e”, return “q” • MC(Cancer; Gender=M, Age=25, Height=6’) = No • MC(Meningitis; Gender=F, BloodTest=t) = Severe

9. MC-Learner’s I/O • Input: Set of “queries” (labeled partially-specified tuples) • cf. the input to standard (partial-data) learners • Output: MultiClassifier

10. Error Measure • Query Distribution: Prob([Q, E=e] asked) • …can be uncorrelated with “tuple distribution” • MultiClassifier MC returns MC(Q, E=e) = q’ • “Labeled query”: $\langle [Q, E=e],\ q \rangle$ • Classification Error of MC: $CE(MC) = \sum_{\langle [Q, E=e],\, q \rangle} \mathrm{Prob}([Q, E=e]\ \text{asked}) \cdot [|\, MC(Q, E=e) \neq q \,|]$ • $[|\, a \neq b \,|] \equiv 1$ if $a \neq b$, 0 otherwise • i.e., “0/1” error

11. Learner’s Task • Given • space of “MultiClassifiers” $\{MC_i\}$ • sample of labeled queries drawn from “query distribution” • Find $MC^* = \arg\min_{MC_i} CE(MC_i)$, the MultiClassifier w/ minimal error over the query distribution.
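The error measure and the learner's task can be sketched on a finite sample. This is only an illustration: the expected error over the query distribution is replaced by the empirical fraction of misclassified labeled queries, and the type names, the two toy classifiers, and the sample are all made up for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration: a query is (class_variable, evidence),
# and a labeled query pairs it with the correct answer.
Query = Tuple[str, Dict[str, object]]
LabeledQuery = Tuple[Query, object]
MultiClassifier = Callable[[str, Dict[str, object]], object]


def classification_error(mc: MultiClassifier, sample: List[LabeledQuery]) -> float:
    """Empirical 0/1 error: fraction of labeled queries that mc answers wrongly."""
    if not sample:
        raise ValueError("need at least one labeled query")
    wrong = sum(1 for (q_var, evidence), label in sample
                if mc(q_var, evidence) != label)
    return wrong / len(sample)


def best_multiclassifier(candidates: List[MultiClassifier],
                         sample: List[LabeledQuery]) -> MultiClassifier:
    """Return the candidate with the smallest empirical error on the sample."""
    return min(candidates, key=lambda mc: classification_error(mc, sample))


# Two toy candidates (invented for this example).
def always_no(q_var, evidence):
    return "No"


def smoker_rule(q_var, evidence):
    return "Yes" if q_var == "Cancer" and evidence.get("Smoke") == "t" else "No"


# A toy sample of labeled queries with different class variables.
SAMPLE = [
    (("Cancer", {"Gender": "F", "Smoke": "t"}), "Yes"),
    (("Cancer", {"Gender": "M", "Smoke": "f"}), "No"),
    (("Meningitis", {"BloodTest": "t"}), "No"),
    (("Cancer", {"Age": 35, "Smoke": "t"}), "Yes"),
]

if __name__ == "__main__":
    for mc in (always_no, smoker_rule):
        print(mc.__name__, classification_error(mc, SAMPLE))
```

Because the sample is drawn from the query distribution, frequently asked queries carry more weight in the average, which is the point of measuring error over queries rather than over tuples.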

12. Outline • Motivation • Handle multiple class variables • Framework • Formal model • Belief Nets, …-classifier • Results • Theoretical Analysis • Algorithms (Likelihood vs Conditional Likelihood) • Empirical Comparison • 1 Structure vs k Structures; LL vs LCL • Contributions

13. Simple Belief Net • Nodes: H, B, J • P(J | H, B=0) = P(J | H, B=1) for all J, H ⇒ P(J | H, B) = P(J | H) • J is INDEPENDENT of B, once we know H • Don’t need B → J arc!

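14. Example of a Belief Net • Simple Belief Net: • Node ~ Variable • Link ~ “Causal dependency” • “CPTable” ~ P(child | parents) • CPTables for H, B, J:

    | P(H=1) | P(H=0) |
    |--------|--------|
    | 0.05   | 0.95   |

    | h | P(B=1 \| H=h) | P(B=0 \| H=h) |
    |---|---------------|---------------|
    | 1 | 0.95          | 0.05          |
    | 0 | 0.03          | 0.97          |

    | h | b | P(J=1 \| h, b) | P(J=0 \| h, b) |
    |---|---|----------------|----------------|
    | 1 | 1 | 0.8            | 0.2            |
    | 1 | 0 | 0.8            | 0.2            |
    | 0 | 1 | 0.3            | 0.7            |
    | 0 | 0 | 0.3            | 0.7            |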

15. Encoding Causal Links (cont’d) • Nodes: H, B, J • P(J | H, B=0) = P(J | H, B=1) for all J, H ⇒ P(J | H, B) = P(J | H) • J is INDEPENDENT of B, once we know H • Don’t need B → J arc!

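18. Include Only Causal Links • Nodes: H, B, J • Hence: P(H=1 | J=0, B=1) ∝ P(H=1) P(J=0 | H=1) P(B=1 | J=0, H=1), where P(B=1 | J=0, H=1) = P(B=1 | H=1) • Sufficient Belief Net: • Requires: • P(H=1) known • P(J=1 | H=1) known • P(B=1 | H=1) known • (Only 5 parameters: 1 for H, 2 for J given H, 2 for B given H; not the 7 of the full joint)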
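As a worked check, the posterior can be evaluated with the CPtable values from the example net (slide 14). The normalization over H is a step the slide leaves implicit; it is written out here.

```latex
\begin{align*}
P(H{=}1 \mid J{=}0, B{=}1)
  &= \frac{P(H{=}1)\,P(J{=}0 \mid H{=}1)\,P(B{=}1 \mid H{=}1)}
          {\sum_{h \in \{0,1\}} P(H{=}h)\,P(J{=}0 \mid H{=}h)\,P(B{=}1 \mid H{=}h)} \\
  &= \frac{0.05 \times 0.2 \times 0.95}
          {0.05 \times 0.2 \times 0.95 \;+\; 0.95 \times 0.7 \times 0.03} \\
  &= \frac{0.0095}{0.0095 + 0.01995} \approx 0.323
\end{align*}
```

Only the five CPtable parameters enter the computation; no entry of the full joint table is needed.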

19. BeliefNet as (Multi)Classifier • For query [Q, E=e], BN will return distribution • $P_{BN}(Q=q_1 \mid E=e),\ P_{BN}(Q=q_2 \mid E=e),\ \ldots,\ P_{BN}(Q=q_m \mid E=e)$ • [Figure: Prob over values $q_1, q_2, q_3, \ldots, q_m$] • (Multi)Classifier: $MC_{BN}(Q, E=e) = \arg\max_{q_i} P_{BN}(Q=q_i \mid E=e)$
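A minimal sketch of a belief net used as a MultiClassifier, assuming the three-node example net from earlier in the deck (H with children B and J, using the CPtable values shown there). Inference is done by brute-force enumeration of the joint, which is only practical for tiny nets; the function names are hypothetical.

```python
from itertools import product

# CPtable values from the example net; J depends only on H (no B -> J arc).
P_H = {1: 0.05, 0: 0.95}
P_B_GIVEN_H = {1: {1: 0.95, 0: 0.05}, 0: {1: 0.03, 0: 0.97}}  # [h][b]
P_J_GIVEN_H = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.3, 0: 0.7}}      # [h][j]

VARIABLES = ("H", "B", "J")


def joint(assignment):
    """P(H=h, B=b, J=j) as the product of the three CPtable entries."""
    h, b, j = assignment["H"], assignment["B"], assignment["J"]
    return P_H[h] * P_B_GIVEN_H[h][b] * P_J_GIVEN_H[h][j]


def posterior(query_var, evidence):
    """Return {q: P(query_var=q | evidence)} by summing out hidden variables."""
    hidden = [v for v in VARIABLES if v != query_var and v not in evidence]
    scores = {0: 0.0, 1: 0.0}
    for q in (0, 1):
        for values in product((0, 1), repeat=len(hidden)):
            assignment = dict(evidence)
            assignment[query_var] = q
            assignment.update(zip(hidden, values))
            scores[q] += joint(assignment)
    total = scores[0] + scores[1]
    return {q: s / total for q, s in scores.items()}


def mc_bn(query_var, evidence):
    """MultiClassifier: the most probable value of query_var given evidence."""
    dist = posterior(query_var, evidence)
    return max(dist, key=dist.get)


if __name__ == "__main__":
    # The same net answers queries about different class variables.
    print(mc_bn("H", {"J": 0, "B": 1}))
    print(mc_bn("B", {"H": 1}))
    print(mc_bn("J", {"B": 1}))
```

The class variable is an argument rather than a fixed property of the model, which is what distinguishes the MultiClassifier from a standard classifier.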

20. Learning Belief Nets • Belief Net = $\langle G, \Theta \rangle$ • G = directed acyclic graph (“structure”: what’s related to what) • $\Theta$ = “parameters”: strength of connections • Learning Belief Net $\langle G, \Theta \rangle$ from “data”: • 1. Learn structure G • 2. Find parameters $\Theta$ that are best for G • Our focus: #2 (parameters); “best” = minimal CE-error

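21. Learning BN Multi-Classifier • Input: Structure G + Labeled Queries • Goal: Find CPtables $\Theta$ to minimize CE error … • $\Theta^* = \arg\min_{\Theta} \sum_{\langle [Q, E=e],\, q \rangle} \mathrm{Prob}([Q, E=e]\ \text{asked}) \cdot [|\, MC_{\langle G, \Theta \rangle}(Q, E=e) \neq q \,|]$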

22. Issues Q1: How many labeled queries are required? Q2: How hard is learning, given distributional info? Q3: What is best algorithm for learning … • … Belief Net? • … Belief Net Classifier? • … Belief Net Multiclassifier?

23. Q1, Q2: Theoretical Results • PAC($\varepsilon$, $\delta$)-learn CPtables: Given BN structure, find CPtables whose CE-error is, with prob $\ge 1-\delta$, within $\varepsilon$ of optimal • Sample Complexity: … BN structure w/ N variables, K CPtable entries, all $\theta_i \ge \gamma > 0$, need [sample-size bound not preserved in this transcript] sample of labeled queries. • Computational Complexity: NP-hard to find CPtable w/ minimal CE error (over $\gamma$, for any $\gamma \in O(1/N)$) from labeled queries… from known structure!

24. Use Conditional Likelihood • Not standard model? • As NP-hard… • Goal: minimize “classification error”, based on training sample $\langle [Q_i, E_i=e_i],\ q_i^* \rangle$ • Sample typically includes • high-probability queries [Q, E=e] • only most likely answers to these queries: $q^* = \arg\max_q P(Q=q \mid E=e)$ • ⇒ Maximize Conditional Likelihood: $LCL_D(\Theta) = \sum_{\langle q^*, e \rangle \in D} \log P_\Theta(Q=q^* \mid E=e)$
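The conditional log-likelihood score can be sketched as follows. The two candidate models and the data are invented for illustration; each model is represented only by a function returning its posterior over the class variable, which is all that LCL looks at.

```python
import math


def lcl(posterior, labeled_queries):
    """Sum over labeled queries of log P(Q = q* | E = e) under the model."""
    return sum(math.log(posterior(q_var, evidence)[label])
               for (q_var, evidence), label in labeled_queries)


# Two hypothetical models, given directly as posterior functions.
def model_a(q_var, evidence):
    if evidence.get("Smoke") == "t":
        return {"Yes": 0.8, "No": 0.2}
    return {"Yes": 0.1, "No": 0.9}


def model_b(q_var, evidence):
    return {"Yes": 0.5, "No": 0.5}


# Toy labeled queries: ((class variable, evidence), correct answer).
D = [
    (("Cancer", {"Smoke": "t"}), "Yes"),
    (("Cancer", {"Smoke": "t"}), "Yes"),
    (("Cancer", {"Smoke": "f"}), "No"),
]

if __name__ == "__main__":
    print("LCL(model_a) =", lcl(model_a, D))
    print("LCL(model_b) =", lcl(model_b, D))
```

The score never consults how well a model predicts the evidence itself, only how much probability it gives to the labeled answer.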

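25. Gradient Descent Alg: ILQ • [Figure: network with query node Q, evidence E, and a node C with parents F1, F2; CPtable P(C | f1, f2) with entries $\theta_{c|f}$] • How to change CPtable entry $\theta_{c|f} = B(C=c \mid F=f)$ given datum “[Q=q, E=e]” • Descend along the derivative corresponding to that entry [formula not preserved in this transcript] • + sum over queries “[Q=q, E=e]”, conjugate gradient, …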

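26. Better Algorithm: ILQ • [Figure: network with query node Q, evidence E, and a node C with parents F1, F2; CPtable P(C | f1, f2) with entries $\theta_{c|f}$] • Constrained Optimization ($\theta_{c|f} \ge 0$, $\theta_{c=0|f} + \theta_{c=1|f} = 1$) • New parameterization $\beta_{c|f}$: for each “row” $r_j$, set $\beta_{c_0|r_j} = 0$ for one $c_0$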
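The transcript does not preserve the formula linking $\beta$ to $\theta$. The sketch below assumes the usual softmax form, $\theta_c = e^{\beta_c} / \sum_{c'} e^{\beta_{c'}}$, which makes the constraints hold automatically. It is not ILQ itself: for brevity it fits a single CPtable row to directly observed values by plain gradient ascent on the log-likelihood, and all function names are hypothetical.

```python
import math


def row_from_betas(betas):
    """Map unconstrained betas for one CPtable row to probabilities (softmax)."""
    m = max(betas)                       # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in betas]
    z = sum(exps)
    return [x / z for x in exps]


def grad_log_prob(betas, c):
    """Gradient of log theta_c w.r.t. each beta_i: [i == c] - theta_i."""
    theta = row_from_betas(betas)
    return [(1.0 if i == c else 0.0) - t for i, t in enumerate(theta)]


def ascend(betas, observed, step=0.5, iters=200):
    """Gradient ascent on the mean log-probability of the observed values.

    beta[0] is the pinned reference value (kept at its initial value),
    mirroring the slide's 'set beta = 0 for one c0 per row'.
    """
    betas = list(betas)
    for _ in range(iters):
        grad = [0.0] * len(betas)
        for c in observed:
            for i, g in enumerate(grad_log_prob(betas, c)):
                grad[i] += g
        for i in range(1, len(betas)):   # skip the pinned entry at index 0
            betas[i] += step * grad[i] / len(observed)
    return betas


if __name__ == "__main__":
    fitted = ascend([0.0, 0.0], observed=[1, 1, 1, 0])
    print(row_from_betas(fitted))
```

Every update leaves the row non-negative and summing to 1, so no projection step is needed. Pinning one $\beta$ per row removes the redundancy that adding a constant to all $\beta$ values in a row leaves $\theta$ unchanged.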

27. Q3: How to Learn BN MultiClassifier? • Approach 1: Minimize error ⇒ Maximize Conditional Likelihood • (In)Complete Data: ILQ • Approach 2: Fit to data ⇒ Maximize Likelihood • Complete Data: Observed Frequency Estimate • Incomplete Data: EM / APN

28. Empirical Studies • Two different objectives ⇒ 2 learning algs • Maximize Conditional Likelihood: ILQ • Maximize Likelihood: APN • Two different approaches to MultipleClasses • 1 copy of structure • k copies of structure • k naïve-bayes • Several “datasets” • Alarm • Insurance • … • Error: “0/1”; $MSE(\Theta) = \sum_i [P_{true}(q_i \mid e_i) - P_\Theta(q_i \mid e_i)]^2$

29. 1- vs k- Structures • [Figure; node labels: Menin, Cancer]

30. Empirical Study I: Alarm • Alarm Belief Net: 37 vars, 46 links, 505 parameters

31. Query Distribution • [HC’91] says, typically • 8 vars $\mathcal{Q} \subset N$ appear as query • 16 vars $\mathcal{E} \subset N$ appear as evidence • Select • $Q \in \mathcal{Q}$ uniformly • Use same set of 7 evidence $E \subset \mathcal{E}$ • Assign value e for E, based on $P_{alarm}(E=e)$ • Find “value” v based on $P_{alarm}(Q=v \mid E=e)$ • Each run uses m such queries, m = 5, 10, … 100, …
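The query-generation procedure can be sketched on a toy distribution. The three-variable joint below is a made-up stand-in for the Alarm net, and the variable names, weights, and function names are hypothetical. Sampling one full tuple from the joint and then reading off the evidence values and the query variable's value yields e distributed as P(E=e) and v distributed as P(Q=v | E=e), which is the recipe on the slide.

```python
import random
from itertools import product

# Hypothetical stand-in for the Alarm net: a joint over three binary variables,
# given by unnormalized weights (illustrative numbers only).
VARS = ("A", "B", "C")
ASSIGNMENTS = list(product((0, 1), repeat=len(VARS)))
WEIGHTS = [3, 1, 2, 2, 1, 4, 1, 2]


def sample_tuple(rng):
    """Draw one complete assignment from the joint distribution."""
    values = rng.choices(ASSIGNMENTS, weights=WEIGHTS, k=1)[0]
    return dict(zip(VARS, values))


def draw_labeled_query(rng, query_vars, evidence_vars):
    """Pick Q uniformly, then take e and the label v from one sampled tuple."""
    q_var = rng.choice(query_vars)
    full = sample_tuple(rng)
    evidence = {v: full[v] for v in evidence_vars}
    return (q_var, evidence), full[q_var]


def draw_sample(m, seed=0, query_vars=("A", "B"), evidence_vars=("C",)):
    """Generate m labeled queries with a fixed evidence set."""
    rng = random.Random(seed)
    return [draw_labeled_query(rng, query_vars, evidence_vars) for _ in range(m)]


if __name__ == "__main__":
    for labeled_query in draw_sample(5):
        print(labeled_query)
```

The query variables and evidence variables are kept disjoint here; the sketch does not guard against a variable appearing in both.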

32. Results (Alarm; ILQ; SmallSample) • CE • MSE

33. Results (Alarm; ILQ; LargeSample) • CE • MSE

34. Comments on Alarm Results • For small Sample Size • “ILQ-1 structure” better than “ILQ-k structures” • For large Sample Size • “ILQ-1 structure” ≈ “ILQ-k structures” • ILQ-k has more parameters to fit, but … lots of data • APN ok, but much slower (did not converge in bounds)

35. Empirical Study II: Insurance • Insurance Belief Net • 27 vars, (3 query, 8 evidence) • 560 parameters • Distribution: • Select 1 query randomly from 3 • Use all 8 evidence • … • (Simplified Version)

36. Results (Insur; ILQ) • CE • MSE

37. Summary of Results • Learning $\Theta$ for given structure, to minimize $CE_D(\Theta)$ or $MSE_D(\Theta)$ • Correct structure • Small number of samples ⇒ ILQ-1 (APN-1) win (over ILQ-k, APN-k) • Large number of samples ⇒ ILQ-k ≈ ILQ-1 win (over APN-1, APN-k) • Incorrect structure (naïve-bayes) ⇒ ILQ wins

38. Future Work • Best algorithm for learning optimal BN? • Actually optimize CE-Err (not LCL) • Learning STRUCTURE as well as CPtables • Special cases where ILQ is efficient (?complete data?) • Other “learning environments” • Other prior knowledge -- Query Forms • Explicitly-Labeled Queries • Better understanding of sample complexity w/out “$\gamma$” restriction

39. Related Work • Like (ML) classification but… • Probabilities, not discrete • Diff class var’s, diff evidence sets... • … see Caruana • “Learning to Reason” [KR’95]: “do well on tasks that will be encountered” … but different performance system • Sample Complexity [FY, Hoeffgen] … diff learning model • Computational Complexity [Kilian/Naor95]: NP to find ANY distr w/ min L1-error wrt uncond query; here: for BN, L2, conditional

40. Take Home Msgs • To max performance: • use Conditional Likelihood (ILQ), not Likelihood (APN/EM, OFE) • Especially if structure wrong, small sample, … • … controversial… • To deal with MultiClassifiers: • Use 1 structure, not k • If small sample, 1 struct gives better performance • If large sample, same performance, … but 1 struct is smaller • … yes, of course… • Relation to Attrib vs Relation: • Not “1 example for many class of queries” • but “1 example for 1 class of queries, BUT IN ONE COMMON STRUCTURE” • Exploiting Common Relations

41. Contributions • Appropriate model for learning: Learn MultiClassifier that works well in practice • Extends standard learning environments • Labeled Queries, with different class variables • Sample Complexity • Need “few” labeled-queries • Computational Complexity ⇒ Effective Algorithm • NP-hard ⇒ Gradient descent • Empirical Evidence: works well! • http://www.cs.ualberta.ca/~greiner/BN-results.html

42. Questions? • LCL vs LL • Does diff matter? • ILQ vs APN • Query Forms • See also http://www.cs.ualberta.ca/~greiner/BN-results.html

43. Learning Model • Most belief net learners try to maximize LIKELIHOOD: $LL_D(\Theta) = \sum_{x \in D} \log P_\Theta(x)$ … as goal is “fit to data” D • Our goal is different: We want to minimize error, over distribution of queries. • If never asked “What is p(jaun | btest-)?”, don’t care if BN(jaun | btest-) ≠ p(jaun | btest-)

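44. Different Optimization • $LL_D(\Theta) = \sum_{\langle q^*, e \rangle \in D} \log P_\Theta(Q=q^* \mid E=e) + \sum_{\langle q^*, e \rangle \in D} \log P_\Theta(E=e) = LCL_D(\Theta) + \sum_{\langle q^*, e \rangle \in D} \log P_\Theta(E=e)$ • As $\sum_{\langle q^*, e \rangle \in D} \log P_\Theta(E=e)$ is non-trivial, $\Theta_{LL} \neq \Theta_{LCL}$, where • $\Theta_{LL} = \arg\max_\Theta LL_D(\Theta)$ • $\Theta_{LCL} = \arg\max_\Theta LCL_D(\Theta)$ • Discriminant analysis: • Maximize Overall Likelihood vs • Minimize Predictive Error • To find $\Theta_{LCL}$: NP-hard, so… ILQ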
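The decomposition of the log-likelihood into the conditional term plus the evidence term can be checked numerically. The two-variable joint and the data below are hypothetical, chosen only to make the identity concrete.

```python
import math

# Hypothetical joint P(Q=q, E=e) over two binary variables; keys are (q, e).
JOINT = {(1, 1): 0.30, (0, 1): 0.10, (1, 0): 0.15, (0, 0): 0.45}


def p_evidence(e):
    """Marginal P(E=e), summing the joint over q."""
    return sum(p for (_, ev), p in JOINT.items() if ev == e)


def p_cond(q, e):
    """Conditional P(Q=q | E=e)."""
    return JOINT[(q, e)] / p_evidence(e)


def log_likelihood(data):
    """LL: sum of log P(Q=q, E=e) over the data."""
    return sum(math.log(JOINT[(q, e)]) for q, e in data)


def cond_log_likelihood(data):
    """LCL: sum of log P(Q=q | E=e) over the data."""
    return sum(math.log(p_cond(q, e)) for q, e in data)


def evidence_log_likelihood(data):
    """The remaining term: sum of log P(E=e) over the data."""
    return sum(math.log(p_evidence(e)) for _, e in data)


# Toy data: (q*, e) pairs.
DATA = [(1, 1), (1, 1), (0, 0), (1, 0)]

if __name__ == "__main__":
    ll = log_likelihood(DATA)
    lcl = cond_log_likelihood(DATA)
    rest = evidence_log_likelihood(DATA)
    print(ll, lcl + rest)
```

Maximizing LL spends part of the parameter budget on the evidence term, which a classifier never needs; that is why the two maximizers generally differ.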

45. Why Alternative Model? • A belief net is … • representation for a distribution • system for answering queries • Suppose BN must answer: “What is p(hep | jaun, btest-)?” but not “What is p(jaun | btest-)?” • So… BN is good if BN(hep | jaun, btest-) = p(hep | jaun, btest-), even if BN(jaun | btest-) ≠ p(jaun | btest-)

46. Query Distr vs Tuple Distr • Distribution over tuples p(·) • p(hep, jaun, btest-, …) = 0.07 • p(flu, cough, ~headache, …) = 0.43 • Distribution over queries sq(q) = Prob(q asked) • Ask “What is p(hep | jaun, btest-)?” 30% • Ask “What is p(flu | cough, ~headache)?” 22% • Can be uncorrelated: • EG: Prob[ Asking Cancer ] = sq(“cancer”) = 100% even if Pr[ Cancer ] = p(cancer) = 0

47. Query Distr ≠ Tuple Distr • Suppose GP asks all ADULT FEMALE patients • “Pregnant” ? • Data ⇒ P( Preg | Adult, Gender=F ) = 2/3 • Is this really TUPLE distr? • P(Gender=F) = 1 ? • NO: only reflects questions asked ! • Provides info re: P(preg | Adult=+, Gender=F) • but NOT about P(Adult), …

48. Query Distr ≠ Tuple Distr • Query Probability Prob([Q, E=e] asked) is independent of tuple probability: • ≠ P(Q=q, E=e) • Could always ask about 0-prob situation • Always ask “[Pregnant=t, Gender=Male]” ⇒ sq(Pregnant=t, Gender=Male) = 1, but P(Pregnant=t, Gender=Male) = 0 • ≠ P(E=e) • If sq(Q, E=$e_i$) ∝ P(E=$e_i$), then • P(Gender=Female) = P(Gender=Male) ⇒ sq(Pregnant, Gender=Female) = sq(Pregnant, Gender=Male) • Note: value of query (q* of $\langle [Q, E=e],\ q^* \rangle$) IS based on P(Q=q | E=e)

49. Does it matter? • If all queries involve same query variable, ok to pretend sq(.) ~ p(.) as no-one ever asks about EVIDENCE DISTRIBUTION • E.g., as no one asks “What is P(Gender)?”, doesn’t matter … • But problematic in MultiClassifier… if other queries – eg, sq(Gender; .)

50. ILQ (cond likelihood) vs APN (likelihood) • Wrong structure: • ILQ better than APN/EM • Experiments… • Artificial data • Using Naïve Bayes (UCI) • Correct structure • ILQ often better than OFE, APN/EM • Experiments • Discriminant analysis: • Maximize Overall Likelihood vs • Minimize PredictiveError