Exploiting Common SubRelations: Learning One Belief Net for Many Classification Tasks


Presentation Transcript


  1. Exploiting Common SubRelations: Learning One Belief Net for Many Classification Tasks • R. Greiner, Wei Zhou • University of Alberta

  2. Situation • CHALLENGE: Need to learn k classifiers • Cancer, from medical symptoms • Meningitis, from medical symptoms • Hepatitis, from medical symptoms • … • Option 1: Learn k different classifier systems {S_Cancer, S_Menin, …, S_k} • Then use S_i to deal with the i-th "query class" • but… need to re-learn the inter-relations among Factors and Symptoms that are common to all k classifiers

  3. Common Interrelationships • [figure: two belief nets, one with class node Cancer and one with class node Menin, sharing the same sub-network of symptoms and factors]

  4. Use Common Structure! • CHALLENGE: Need to learn k classifiers • Cancer, from medical symptoms • Meningitis, from symptoms • Hepatitis, from symptoms • … • Option 2: Learn 1 "structure" S of relationships, then use S to address all k classification tasks • Actual Approach: Learn 1 Bayesian Belief Net, inter-relating info for all k types of queries

  5. Outline • Motivation • Handle multiple class variables • Framework • Formal model • Belief Nets, …-classifier • Results • Theoretical Analysis • Algorithms (Likelihood vs Conditional Likelihood) • Empirical Comparison • 1 Structure vs k Structures; LL vs LCL • Contributions

  6. MC-Learner • [figure: Training Data → MC-Learner → MultiClassifier (MC)]

  7. Multi-Classifier I/O • Given "query": "class variable" Q and "evidence" E=e • e.g., Cancer=?, given Gender=F, Age=35, Smoke=t • Return value Q = q • e.g., Cancer = Yes

  8. MultiClassifier • Like standard Classifiers, can deal with • different evidence E • different evidence values e • Unlike standard Classifiers, can deal with • different class variables Q • Able to “answer queries” • classify new unlabeled tuples • Given “Q=?, given E=e”, return “q” MC(Cancer; Gender=M, Age=25, Height=6’) = No MC(Meningitis; Gender=F, BloodTest = t ) = Severe
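To make that interface concrete, here is a minimal Python sketch of a MultiClassifier wrapper; the class name, the classify method, and the underlying posterior(query_var, evidence) call are illustrative assumptions, not the authors' code:

# Hypothetical interface: one MultiClassifier answers "Q=?, given E=e"
# for any class variable Q, backed by a single shared belief net.
class MultiClassifier:
    def __init__(self, belief_net):
        self.bn = belief_net  # assumed to expose posterior(query_var, evidence)

    def classify(self, query_var, evidence):
        # evidence: dict such as {"Gender": "M", "Age": 25}
        dist = self.bn.posterior(query_var, evidence)   # dict: value -> P(Q=value | E=e)
        return max(dist, key=dist.get)                  # argmax_q P(Q=q | E=e)

# Usage, mirroring the slide's examples:
#   mc.classify("Cancer", {"Gender": "M", "Age": 25, "Height": "6ft"})  -> "No"
#   mc.classify("Meningitis", {"Gender": "F", "BloodTest": True})       -> "Severe"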

  9. MC-Learner’s I/O • Input: Set of “queries”(labeled partially-specified tuples) •  input to standard (partial-data) learners • Output: MultiClassifier

  10. Error Measure • Query Distribution: Prob([Q, E=e] asked) • …can be uncorrelated with "tuple distribution" • MultiClassifier MC returns MC(Q, E=e) = q' • Classification Error of MC: CE(MC) = Σ_⟨[Q, E=e], q⟩ Prob([Q, E=e] asked) · [| MC(Q, E=e) ≠ q |] • [| a ≠ b |] = 1 if a ≠ b, 0 otherwise ⇒ "0/1" error • "Labeled query": ⟨[Q, E=e], q⟩
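As a concrete reading of this 0/1 error, the following Python sketch (my illustration, not the paper's code) estimates CE(MC) from a sample of labeled queries; drawing the sample from the query distribution makes the average an unbiased estimate of the weighted sum above:

def classification_error(mc, labeled_queries):
    """Empirical 0/1 error: fraction of labeled queries the MultiClassifier gets wrong.
    labeled_queries: list of (query_var, evidence, answer) triples drawn from the
    query distribution; mc.classify is the (assumed) interface sketched earlier."""
    wrong = sum(1 for q, e, ans in labeled_queries if mc.classify(q, e) != ans)
    return wrong / len(labeled_queries)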

  11. Learner's Task • Given • space of "MultiClassifiers" { MC_i } • sample of labeled queries drawn from "query distribution" • Find MC* = argmin_{MC_i} { CE(MC_i) }, i.e. the MC w/ minimal error over the query distribution

  12. Outline • Motivation • Handle multiple class variables • Framework • Formal model • Belief Nets, …-classifier • Results • Theoretical Analysis • Algorithms (Likelihood vs Conditional Likelihood) • Empirical Comparison • 1 Structure vs k Structures; LL vs LCL • Contributions

  13. Simple Belief Net • [figure: nodes H, B, J with arcs H → B and H → J] • P(J | H, B=0) = P(J | H, B=1) ∀ J, H, i.e. P(J | H, B) = P(J | H) • J is INDEPENDENT of B, once we know H • Don't need B → J arc!

  14. Example of a Belief Net • Node ~ Variable, Link ~ "Causal dependency", "CPTable" ~ P(child | parents)
  H: P(H=1) = 0.05, P(H=0) = 0.95
  B: P(B=1 | H=1) = 0.95, P(B=0 | H=1) = 0.05; P(B=1 | H=0) = 0.03, P(B=0 | H=0) = 0.97
  J: P(J=1 | H=1, B=1) = 0.8; P(J=1 | H=1, B=0) = 0.8; P(J=1 | H=0, B=1) = 0.3; P(J=1 | H=0, B=0) = 0.3 (and P(J=0 | h, b) = 1 − P(J=1 | h, b))
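The factorization this net encodes can be written out directly; a small Python sketch using the CPTable numbers above (variable names are mine):

# With arcs H -> B and H -> J, the joint factorizes as
#   P(H, B, J) = P(H) * P(B | H) * P(J | H)     (J's CPT does not depend on B)
P_H = {1: 0.05, 0: 0.95}
P_B_given_H = {1: {1: 0.95, 0: 0.05}, 0: {1: 0.03, 0: 0.97}}   # P_B_given_H[h][b]
P_J_given_H = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.3, 0: 0.7}}       # P_J_given_H[h][j]

def joint(h, b, j):
    return P_H[h] * P_B_given_H[h][b] * P_J_given_H[h][j]

# Sanity check: the eight joint entries sum to 1.
assert abs(sum(joint(h, b, j) for h in (0, 1) for b in (0, 1) for j in (0, 1)) - 1.0) < 1e-9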

  15. Encoding Causal Links (cont'd) • [figure: H → B, H → J] • P(J | H, B=0) = P(J | H, B=1) ∀ J, H, i.e. P(J | H, B) = P(J | H) • J is INDEPENDENT of B, once we know H • Don't need B → J arc!

  18. Include Only Causal Links • [figure: H → B, H → J] • Sufficient Belief Net: the arcs give P(H=1, J=0, B=1) = P(H=1) · P(J=0 | H=1) · P(B=1 | H=1), hence P(H=1 | J=0, B=1) ∝ P(H=1) · P(J=0 | H=1) · P(B=1 | H=1) • Requires only: P(H=1), P(J=1 | H=h), P(B=1 | H=h) known (Only 5 parameters, not 7)

  19. BeliefNet as (Multi)Classifier • [figure: bar chart of Prob over answers q1, q2, q3, …, qm] • For query [Q, E=e], BN will return the distribution P_BN(Q=q1 | E=e), P_BN(Q=q2 | E=e), …, P_BN(Q=qm | E=e) • (Multi)Classifier: MC_BN(Q, E=e) = argmax_qi { P_BN(Q=qi | E=e) }
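On the toy H/B/J net above, this argmax rule can be spelled out by exact enumeration (reusing the joint() sketch from slide 14; brute-force inference like this is only for illustration, not for nets the size of Alarm):

def mc_bn_H(b, j):
    """Return (argmax value of H, full posterior) for the query [H, B=b, J=j],
    using P_BN(H=h | B=b, J=j) proportional to the joint P(H=h, B=b, J=j)."""
    scores = {h: joint(h, b, j) for h in (0, 1)}
    z = sum(scores.values())
    posterior = {h: s / z for h, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

# e.g. mc_bn_H(1, 0) classifies H given evidence B=1, J=0.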

  20. Learning Belief Nets • Belief Net = ⟨G, Θ⟩ • G = directed acyclic graph ("structure" – what's related to what) • Θ = "parameters" – strength of connections • Learning Belief Net ⟨G, Θ⟩ from "data": 1. Learning structure G 2. Find parameters Θ that are best, for G • Our focus: #2 (parameters); "Best" Θ = minimal CE-error

  21. Learning BN Multi-Classifier • Input: Structure G + Labeled Queries • Goal: Find CPtables Θ to minimize CE error … • Θ* = argmin_Θ { Σ_⟨[Q, E=e], q⟩ Prob([Q, E=e] asked) · [| MC_⟨G,Θ⟩(Q, E=e) ≠ q |] }

  22. Issues Q1: How many labeled queries are required? Q2: How hard is learning, given distributional info? Q3: What is best algorithm for learning … • … Belief Net? • … Belief Net Classifier? • … Belief Net Multiclassifier?

  23. Q1, Q2: Theoretical Results • PAC(ε, δ)-learn CPtables: Given BN structure, find CPtables whose CE-error is, with prob ≥ 1−δ, within ε of optimal • Sample Complexity: for a BN structure w/ N variables, K CPtable entries, and γ > 0, need a sample of … labeled queries • Computational Complexity: NP-hard to find CPtable w/ minimal CE error (over γ, for any γ ∈ O(1/N)) from labeled queries… even from a known structure!

  24. Use Conditional Likelihood • Not standard model? As NP-hard… • Goal: minimize "classification error", based on training sample ⟨[Qi, Ei=ei], qi*⟩ • Sample typically includes • high-probability queries [Q, E=e] • only most likely answers to these queries: q* = argmax_q { P(Q=q | E=e) } • Maximize Conditional Likelihood: LCL_D(Θ) = Σ_[q*,e]∈D log P_Θ(Q=q* | E=e)
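The conditional-likelihood objective itself is straightforward to write down; a short sketch (illustrative names, not the paper's code) that scores a model's conditional P_Θ(Q=q* | E=e) on a sample D of (q*, e) pairs:

import math

def lcl(cond_prob, sample):
    """LCL_D(theta) = sum over (q_star, e) in D of log P_theta(Q=q_star | E=e).
    cond_prob(q, e): the model's conditional probability (assumed callable);
    sample: list of (q_star, e) pairs."""
    return sum(math.log(cond_prob(q_star, e)) for q_star, e in sample)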

  25. Gradient Descent Alg: ILQ • [figure: belief net with query node Q, evidence E, and node C with parents F1, F2; CPtable entry θ_c|f = P(C=c | F=f1, f2)] • How to change CPtable entry θ_c|f = B(C=c | F=f), given datum "[Q=q, E=e]"? Descend along the derivative corresponding to that datum • + sum over queries "[Q=q, E=e]", conjugate gradient, …

  26. Better Algorithm: ILQ • [figure: belief net with query node Q, evidence E, and node C with parents F1, F2; CPtable entry θ_c|f = P(C=c | F=f1, f2)] • Constrained Optimization (θ_c|f ≥ 0, θ_c=0|f + θ_c=1|f = 1) • New parameterization β_c|f: for each "row" r_j, set β_c0|r_j = 0 for one c0
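One standard way to realize such a constrained parameterization is a softmax over unconstrained values β, with one β per row pinned to 0 to remove the redundant degree of freedom; gradient steps are then taken on β rather than on θ directly. The sketch below is my own illustration under that assumption, using a finite-difference gradient where ILQ uses analytic derivatives / conjugate gradient:

import math

def row_to_cpt(beta_row):
    """Map one row of unconstrained betas to probabilities that are >= 0 and sum to 1."""
    exps = [math.exp(b) for b in beta_row]
    z = sum(exps)
    return [x / z for x in exps]

def ascent_step(beta_row, objective, lr=0.1, h=1e-5):
    """One gradient-ascent step on objective(cpt_row), e.g. a conditional-likelihood term,
    taken in beta-space so the CPT constraints hold automatically."""
    base = objective(row_to_cpt(beta_row))
    grad = []
    for i in range(len(beta_row)):
        bumped = list(beta_row)
        bumped[i] += h
        grad.append((objective(row_to_cpt(bumped)) - base) / h)
    return [b + lr * g for b, g in zip(beta_row, grad)]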

  27. Q3: How to Learn BN MultiClassifier? • Approach 1: Minimize error ≈ Maximize Conditional Likelihood • (In)Complete Data: ILQ • Approach 2: Fit to data ≈ Maximize Likelihood • Complete Data: Observed Frequency Estimate • Incomplete Data: EM / APN

  28. Empirical Studies • Two different objectives ⇒ 2 learning algs • Maximize Conditional Likelihood: ILQ • Maximize Likelihood: APN • Two different approaches to Multiple Classes • 1 copy of structure • k copies of structure • k naïve-bayes • Several "datasets" • Alarm • Insurance • … • Error: "0/1" CE; MSE(Θ) = Σ_i [P_true(q_i|e_i) – P_Θ(q_i|e_i)]²
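For reference, the second error measure is just as direct to compute; a tiny sketch (names mine) of the MSE over a set of evaluation queries:

def mse(true_cond, model_cond, queries):
    """MSE(theta) = sum_i [P_true(q_i | e_i) - P_theta(q_i | e_i)]^2,
    where true_cond / model_cond return the two conditional probabilities."""
    return sum((true_cond(q, e) - model_cond(q, e)) ** 2 for q, e in queries)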

  29. 1- vs k- Structures • [figure: one shared belief net covering both Menin and Cancer vs separate per-class copies of the structure]

  30. Empirical Study I: Alarm • Alarm Belief Net: 37 vars, 46 links, 505 parameters

  31. Query Distribution • [HC'91] says, typically • 8 vars Q ⊆ N appear as query • 16 vars E ⊆ N appear as evidence • Select • Q ∈ Q uniformly • Use same set of 7 evidence E ⊆ E • Assign value e for E, based on P_alarm(E=e) • Find "value" v based on P_alarm(Q=v | E=e) • Each run uses m such queries, m = 5, 10, …, 100, …
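A sketch of that query-generation scheme in Python; sample_evidence and sample_answer stand in for sampling from the true Alarm distribution P_alarm and are assumptions of mine, not provided code:

import random

def labeled_query(query_vars, evidence_vars, sample_evidence, sample_answer):
    """Generate one labeled query as described above:
    pick Q uniformly from the query candidates, draw evidence values e from
    P_alarm(E=e), then draw the label v from P_alarm(Q=v | E=e)."""
    Q = random.choice(query_vars)          # Q chosen uniformly from the query set
    e = sample_evidence(evidence_vars)     # e ~ P_alarm(E)        (assumed sampler)
    v = sample_answer(Q, e)                # v ~ P_alarm(Q | E=e)  (assumed sampler)
    return Q, e, v

# A run of size m: [labeled_query(...) for _ in range(m)]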

  32. Results (Alarm; ILQ; Small Sample) • [plots: CE and MSE vs sample size]

  33. Results (Alarm; ILQ; Large Sample) • [plots: CE and MSE vs sample size]

  34. Comments on Alarm Results • For small Sample Size • "ILQ – 1 structure" better than "ILQ – k structures" • For large Sample Size • "ILQ – 1 structure" ≈ "ILQ – k structures" • ILQ-k has more parameters to fit, but … lots of data • APN ok, but much slower (did not converge within bounds)

  35. Empirical Study II: Insurance • Insurance Belief Net • 27 vars (3 query, 8 evidence) • 560 parameters • Distribution: • Select 1 query randomly from 3 • Use all 8 evidence • … • (Simplified Version)

  36. Results (Insur; ILQ) • [plots: CE and MSE vs sample size]

  37. Summary of Results • Learning Θ for given structure, to minimize CE_D(Θ) or MSE_D(Θ) • Correct structure • Small number of samples ⇒ ILQ-1 (APN-1) win (over ILQ-k, APN-k) • Large number of samples ⇒ ILQ-k ≈ ILQ-1 win (over APN-1, APN-k) • Incorrect structure (naïve-bayes) ⇒ ILQ wins

  38. Future Work • Best algorithm for learning optimal BN? • Actually optimize CE-Err (not LCL) • Learning STRUCTURE as well as CPtables • Special cases where ILQ is efficient (?complete data?) • Other "learning environments" • Other prior knowledge – Query Forms • Explicitly-Labeled Queries • Better understanding of sample complexity w/out the "γ" restriction

  39. Related Work • Like (ML) classification but… • Probabilities, not discrete • Diff class var's, diff evidence sets… • … see Caruana • "Learning to Reason" [KR'95]: "do well on tasks that will be encountered" … but different performance system • Sample Complexity [FY, Hoeffgen] … diff learning model • Computational Complexity [Kilian/Naor95]: NP-hard to find ANY distr w/ min L1-error wrt unconditional queries; here, L2 error and conditional queries for a BN

  40. Take Home Msgs • To max performance: • use Conditional Likelihood (ILQ) not Likelihood (APN/EM, OFE) • Especially if structure wrong, small sample, … … controversial… • To deal with MultiClassifiers: • Use 1 structure, not k • If small sample, 1-structure gives better performance; if large sample, same performance, … but 1-structure smaller … yes, of course… • Relation to Attrib vs Relation: • Not "1 example for many classes of queries" • but "1 example for 1 class of queries, BUT IN ONE COMMON STRUCTURE" ⇒ Exploiting Common Relations

  41. Contributions • Appropriate model for learning: Learn MultiClassifier that works well in practice • Extends standard learning environments • Labeled Queries, with different class variables • Sample Complexity • Need "few" labeled queries • Computational Complexity / Effective Algorithm • NP-hard ⇒ Gradient descent • Empirical Evidence: works well! • http://www.cs.ualberta.ca/~greiner/BN-results.html

  42. Questions? • LCL vs LL • Does diff matter? • ILQ vs APN • Query Forms • See also http://www.cs.ualberta.ca/~greiner/BN-results.html

  43. Learning Model • Most belief net learners try to maximize LIKELIHOOD: LL_D(Θ) = Σ_x∈D log P_Θ(x) … as the goal is "fit to data" D • Our goal is different: we want to minimize error over the distribution of queries • If "What is p(jaun | btest−)?" is never asked, we don't care if BN(jaun | btest−) ≠ p(jaun | btest−)

  44. Different Optimization • LL_D(Θ) = Σ_[q*,e]∈D log P_Θ(Q=q* | E=e) + Σ_[q*,e]∈D log P_Θ(E=e) = LCL_D(Θ) + Σ_[q*,e]∈D log P_Θ(E=e) • As Σ_[q*,e]∈D log P_Θ(E=e) is non-trivial, in general Θ_LL = argmax_Θ { LL_D(Θ) } ≠ Θ_LCL = argmax_Θ { LCL_D(Θ) } • Discriminant analysis: • Maximize Overall Likelihood vs • Minimize Predictive Error • To find Θ_LCL: NP-hard, so… ILQ
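On the toy H/B/J net from slide 14 (reusing its joint() sketch), the decomposition can be checked term by term, treating H as the query and (B, J) as the evidence; this is my illustration of the identity, not the paper's code:

import math

def log_p_evidence(b, j):
    return math.log(sum(joint(h, b, j) for h in (0, 1)))      # log P(E=e) = log P(B=b, J=j)

def log_p_joint(h, b, j):
    return math.log(joint(h, b, j))                           # log P(Q=h, E=e)

def log_p_cond(h, b, j):
    return log_p_joint(h, b, j) - log_p_evidence(b, j)        # log P(Q=h | E=e)

# For each labeled query (h*, b, j):  log P(h*, b, j) = log P(h* | b, j) + log P(b, j),
# so summing over D gives LL_D = LCL_D + sum log P(E=e): maximizing LL also spends effort
# fitting P(E=e), which the query distribution may never ask about.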

  45. Why Alternative Model? • A belief net is … • a representation for a distribution • a system for answering queries • Suppose BN must answer "What is p(hep | jaun, btest−)?" but not "What is p(jaun | btest−)?" • So… BN is good if BN(hep | jaun, btest−) = p(hep | jaun, btest−), even if BN(jaun | btest−) ≠ p(jaun | btest−)

  46. Query Distr vs Tuple Distr • Distribution over tuples p(·): p(hep, jaun, btest−, …) = 0.07; p(flu, cough, ~headache, …) = 0.43 • Distribution over queries sq(q) = Prob(q asked): Ask "What is p(hep | jaun, btest−)?" 30%; Ask "What is p(flu | cough, ~headache)?" 22% • Can be uncorrelated: • E.g.: Prob[Asking Cancer] = sq("cancer") = 100% even if Pr[Cancer] = p(cancer) = 0

  47. Query Distr ≠ Tuple Distr • Suppose a GP asks all ADULT FEMALE patients: "Pregnant?" • Data ⇒ P(Preg | Adult, Gender=F) = 2/3 • Is this really the TUPLE distr? P(Gender=F) = 1? • NO: only reflects questions asked! • Provides info re: P(preg | Adult=+, Gender=F) • but NOT about P(Adult), …

  48. Query Distr ≠ Tuple Distr • Query Probability is independent of tuple probability: Prob([Q, E=e] asked) • ≠ P(Q=q, E=e): could always ask about a 0-prob situation • Always ask "[Pregnant=t, Gender=Male]" ⇒ sq(Pregnant=t, Gender=Male) = 1, but P(Pregnant=t, Gender=Male) = 0 • ≠ P(E=e): if sq(Q, E=ei) tracked P(E=ei), then P(Gender=Female) = P(Gender=Male) would force sq(Pregnant; Gender=Female) = sq(Pregnant; Gender=Male) • Note: the value q* of a labeled query ⟨[Q, E=e], q*⟩ IS based on P(Q=q | E=e)

  49. Does it matter? • If all queries involve the same query variable, it is ok to pretend sq(·) ~ p(·), as no one ever asks about the EVIDENCE DISTRIBUTION • E.g.: as no one asks "What is P(Gender)?", it doesn't matter … • But problematic in a MultiClassifier… if there are other queries – e.g., sq(Gender; ·)

  50. ILQ (cond likelihood) vs APN (likelihood) • Wrong structure: • ILQ better than APN/EM • Experiments: • Artificial data • Using Naïve Bayes (UCI) • Correct structure: • ILQ often better than OFE, APN/EM • Experiments • Discriminant analysis: • Maximize Overall Likelihood vs • Minimize Predictive Error
