
Exploiting Common SubRelations: Learning One Belief Net for Many Classification Tasks

R. Greiner and Wei Zhou, University of Alberta

**Situation**

- CHALLENGE: Need to learn k classifiers:
  - Cancer, from medical symptoms
  - Meningitis, from medical symptoms
  - Hepatitis, from medical symptoms
  - …
- Option 1: Learn k different classifier systems {S_Cancer, S_Menin, …, S_k}, then use S_i to deal with the i-th "query class"…
- …but this re-learns, k times over, the inter-relations among the factors and symptoms common to all k classifiers.

**Common Interrelationships**

(Figure: the Cancer and Meningitis networks share most of their structure.)

**Use Common Structure!**

- CHALLENGE: Need to learn k classifiers (Cancer, Meningitis, Hepatitis, …, each from medical symptoms).
- Option 2: Learn 1 "structure" S of relationships, then use S to address all k classification tasks.
- Actual approach: Learn 1 Bayesian belief net, inter-relating the information for all k types of queries.

**Outline**

- Motivation: handle multiple class variables
- Framework: formal model; belief nets as (multi-)classifiers
- Results:
  - Theoretical analysis
  - Algorithms (Likelihood vs Conditional Likelihood)
  - Empirical comparison (1 structure vs k structures; LL vs LCL)
- Contributions

**MC-Learner**

(Figure: MC training data fed to the MC-Learner, which outputs a MultiClassifier.)

**Multi-Classifier I/O**

- Given a "query": a "class variable" Q and "evidence" E = e.
  E.g., Cancer = ?, given Gender=F, Age=35, Smoke=t?
- Return a value Q = q. E.g., Cancer = Yes.

**MultiClassifier**

- Like standard classifiers, it can deal with different evidence variables E and different evidence values e.
- Unlike standard classifiers, it can also deal with different class variables Q.
- It is able to "answer queries" (classify new unlabeled tuples): given "Q = ?, given E = e", return "q".
  - MC(Cancer; Gender=M, Age=25, Height=6') = No
  - MC(Meningitis; Gender=F, BloodTest=t) = Severe

**MC-Learner's I/O**

- Input: a set of "queries" (labeled, partially-specified tuples), like the input to standard (partial-data) learners.
- Output: a MultiClassifier.

**Error Measure**

- Query distribution: Prob([Q, E=e] asked). It can be uncorrelated with the "tuple distribution".
- The MultiClassifier MC returns MC(Q, E=e) = q'.
- Classification error of MC (the "0/1" error), summed over "labeled queries" ⟨[Q, E=e], q⟩:
  CE(MC) = Σ_{⟨[Q, E=e], q⟩} Prob([Q, E=e] asked) × [|MC(Q, E=e) ≠ q|]
  where [|a ≠ b|] is 1 if a ≠ b, and 0 otherwise.

**Learner's Task**

- Given a space of MultiClassifiers {MC_i} and a sample of labeled queries drawn from the "query distribution",
- find MC* = argmin_{MC_i} { CE(MC_i) }, the classifier with minimal error over the query distribution.

**Outline (revisited)**

- Framework: formal model; Belief Nets, …-classifier (next)

**Simple Belief Net**

- Structure: H → B, H → J.
- If P(J | H, B=0) = P(J | H, B=1) for all J, H, then P(J | H, B) = P(J | H):
  J is INDEPENDENT of B, once we know H, so no B → J arc is needed.
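The independence claim on the "Simple Belief Net" slide can be checked numerically. The sketch below is illustrative, not from the talk: it builds the factored joint P(H)·P(B|H)·P(J|H) for the H → B, H → J structure (the CPT numbers are assumptions matching the example net that follows) and verifies that P(J | H, B) does not depend on B.

```python
# Sketch: with structure H -> B, H -> J, the joint factors as
# P(H) * P(B|H) * P(J|H), so J is independent of B given H and
# no B -> J arc is needed.  CPT numbers are illustrative assumptions.

P_H = {1: 0.05, 0: 0.95}                   # P(H=h)
P_B_given_H = {1: {1: 0.95, 0: 0.05},      # P(B=b | H=h), indexed [h][b]
               0: {1: 0.03, 0: 0.97}}
P_J_given_H = {1: {1: 0.8, 0: 0.2},        # P(J=j | H=h), indexed [h][j]
               0: {1: 0.3, 0: 0.7}}

def joint(h, b, j):
    """P(H=h, B=b, J=j) under the factored model."""
    return P_H[h] * P_B_given_H[h][b] * P_J_given_H[h][j]

def p_j_given_h_b(j, h, b):
    """P(J=j | H=h, B=b), computed from the joint by conditioning."""
    norm = sum(joint(h, b, jj) for jj in (0, 1))
    return joint(h, b, j) / norm

# P(J | H, B=0) == P(J | H, B=1) for every (j, h): the B -> J arc is redundant.
for h in (0, 1):
    for j in (0, 1):
        assert abs(p_j_given_h_b(j, h, 0) - p_j_given_h_b(j, h, 1)) < 1e-12
```

The B-dependent factor cancels in the conditioning step, which is exactly why the arc can be dropped.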
**Example of a Belief Net**

- Node ~ variable; link ~ "causal dependency"; "CPTable" ~ P(child | parents).
- CPTables for the net H → B, H → J:

  | P(H=1) | P(H=0) |
  |--------|--------|
  | 0.05   | 0.95   |

  | h | P(B=1 \| H=h) | P(B=0 \| H=h) |
  |---|---------------|---------------|
  | 1 | 0.95          | 0.05          |
  | 0 | 0.03          | 0.97          |

  | h | b | P(J=1 \| h,b) | P(J=0 \| h,b) |
  |---|---|---------------|---------------|
  | 1 | 1 | 0.8           | 0.2           |
  | 1 | 0 | 0.8           | 0.2           |
  | 0 | 1 | 0.3           | 0.7           |
  | 0 | 0 | 0.3           | 0.7           |

**Encoding Causal Links (cont'd)**

- P(J | H, B=0) = P(J | H, B=1) for all J, H, so P(J | H, B) = P(J | H):
  J is INDEPENDENT of B, once we know H, and the B → J arc is not needed.

**Include Only Causal Links**

- By the chain rule, P(H=1, J=0, B=1) = P(H=1) · P(J=0 | H=1) · P(B=1 | J=0, H=1); by the independence above, the last factor is just P(B=1 | H=1).
- A sufficient belief net therefore requires only: P(H=1) known, P(J=1 | H=1) known, P(B=1 | H=1) known (only 5 parameters, not 7).

**BeliefNet as (Multi)Classifier**

- For a query [Q, E=e], the BN returns a distribution:
  P_BN(Q=q_1 | E=e), P_BN(Q=q_2 | E=e), …, P_BN(Q=q_m | E=e).
- The induced (multi)classifier: MC_BN(Q, E=e) = argmax_{q_i} { P_BN(Q=q_i | E=e) }.

**Learning Belief Nets**

- Belief net = ⟨G, Θ⟩:
  - G = directed acyclic graph (the "structure": what is related to what)
  - Θ = the "parameters": the strengths of the connections (the CPTables)
- Learning a belief net ⟨G, Θ⟩ from data means (1) learning the structure G, then (2) finding the parameters Θ that are best for G.
- Our focus: #2 (parameters), where "best" means minimal CE-error.

**Learning BN Multi-Classifier**

- Given: structure G + labeled queries.
- Goal: find the CPTables that minimize CE error:
  Θ* = argmin_Θ { Σ_{⟨[Q, E=e], q⟩} Prob([Q, E=e] asked) × [|MC_{⟨G,Θ⟩}(Q, E=e) ≠ q|] }

**Issues**

- Q1: How many labeled queries are required?
- Q2: How hard is learning, given distributional info?
- Q3: What is the best algorithm for learning…
  - … a belief net?
  - … a belief net classifier?
  - … a belief net multiclassifier?

**Q1, Q2: Theoretical Results**

- PAC(ε, δ)-learn CPTables: given the BN structure, find CPTables whose CE-error is, with probability 1−δ, within ε of optimal.
- Sample complexity: a BN structure with N variables and K CPTable entries, given some γ > 0, needs a sample of labeled queries whose size depends on N, K, ε, δ, and 1/γ.
- Computational complexity: it is NP-hard to find the CPTables with minimal CE-error (over γ, for any γ ∈ O(1/N)) from labeled queries… even from a known structure!

**Use Conditional Likelihood**

- (Not the standard model? Just as NP-hard…)
- Goal: minimize "classification error", based on a training sample ⟨[Q_i, E_i=e_i], q_i*⟩.
- The sample typically includes high-probability queries [Q, E=e], and only the most likely answers to these queries: q* = argmax_q { P(Q=q | E=e) }.
- So: maximize the Conditional Likelihood
  LCL_D(Θ) = Σ_{[q*, e] ∈ D} log P_Θ(Q=q* | E=e)

**Gradient Descent Alg: ILQ_Θ**

- How should the CPTable entry θ_{c|f} = B(C=c | F=f) change, given the datum "[Q=q, E=e]"?
- Descend along the derivative of LCL with respect to θ_{c|f}; sum over the queries "[Q=q, E=e]"; conjugate gradient, …

**Better Algorithm: ILQ**

- The direct version is a constrained optimization (θ_{c|f} ≥ 0, θ_{c=0|f} + θ_{c=1|f} = 1).
- Instead, use a new parameterization β_{c|f}: for each CPTable "row" r_j, set β_{c_0|r_j} = 0 for one value c_0.

**Q3: How to Learn a BN MultiClassifier?**

- Approach 1: Minimize error ⇒ Maximize Conditional Likelihood.
  - (In)complete data: ILQ
- Approach 2: Fit to data ⇒ Maximize Likelihood.
  - Complete data: Observed Frequency Estimate (OFE)
  - Incomplete data: EM / APN

**Empirical Studies**

- Two different objectives ⇒ 2 learning algorithms:
  - Maximize Conditional Likelihood: ILQ
  - Maximize Likelihood: APN
- Two different approaches to multiple classes:
  - 1 copy of the structure
  - k copies of the structure
  - k naïve-Bayes structures
- Several "datasets": Alarm, Insurance, …
- Error measures: "0/1" error, and MSE(Θ) = Σ_i [P_true(q_i | e_i) − P_Θ(q_i | e_i)]²

**1- vs k- Structures**

(Figure: one shared structure answering both Cancer and Meningitis queries, vs a separate copy of the structure per class variable.)

**Empirical Study I: Alarm**

- Alarm belief net: 37 variables, 46 links, 505 parameters.

**Query Distribution**

- [HC'91] says that, typically, 8 variables Q ⊆ N
  appear as query variables, and 16 variables E ⊆ N appear as evidence.
- Select Q ∈ Q uniformly.
- Use the same set of 7 evidence variables E′ ⊆ E.
- Assign a value e to E′, based on P_alarm(E′=e).
- Find the "value" v, based on P_alarm(Q=v | E′=e).
- Each run uses m such queries, m = 5, 10, …, 100, …

**Results (Alarm; ILQ; Small Sample)**

(Figure: CE and MSE learning curves.)

**Results (Alarm; ILQ; Large Sample)**

(Figure: CE and MSE learning curves.)

**Comments on Alarm Results**

- For small sample sizes: "ILQ, 1 structure" is better than "ILQ, k structures".
- For large sample sizes: "ILQ, 1 structure" ≈ "ILQ, k structures". ILQ-k has more parameters to fit, but… lots of data.
- APN is OK, but much slower (it did not converge within the time bounds).

**Empirical Study II: Insurance**

- Insurance belief net (simplified version): 27 variables (3 query, 8 evidence), 560 parameters.
- Distribution: select 1 query variable randomly from the 3; use all 8 evidence variables; …

**Results (Insurance; ILQ)**

(Figure: CE and MSE learning curves.)

**Summary of Results**

Learning the parameters for a given structure, to minimize CE_D(Θ) or MSE_D(Θ):

- Correct structure:
  - Small number of samples: ILQ-1 (APN-1) wins (over ILQ-k, APN-k).
  - Large number of samples: ILQ-k ≈ ILQ-1 win (over APN-1, APN-k).
- Incorrect structure (naïve Bayes): ILQ wins.

**Future Work**

- Best algorithm for learning an optimal BN?
  - Actually optimize CE-error (not LCL).
  - Learn the STRUCTURE as well as the CPTables.
  - Special cases where ILQ is efficient (complete data?).
- Other "learning environments":
  - Other prior knowledge: query forms.
  - Explicitly-labeled queries.
- Better understanding of sample complexity without the "γ" restriction.

**Related Work**

- Like (ML) classification, but…
  - probabilities, not discrete labels
  - different class variables, different evidence sets…
  - … see Caruana.
- "Learning to Reason" [KR'95]: "do well on tasks that will be encountered" … but a different performance system.
- Sample complexity [FY, Hoeffgen]: … a different learning model.
- Computational complexity [Kilian/Naor 95]: NP-hard to find ANY distribution with minimal L1-error w.r.t. unconditional queries for a BN; here, conditional queries and L2 error.

**Take-Home Messages**

- To maximize performance: use Conditional Likelihood (ILQ), not Likelihood (APN/EM, OFE), especially if the structure is wrong, the sample is small, … (…controversial…)
- To deal with MultiClassifiers: use 1 structure, not k.
  - If the sample is small, 1 structure gives better performance.
  - If the sample is large, performance is the same, but 1 structure is smaller. (…yes, of course…)
- Relation to "attribute vs relation": not "1 example for many classes of queries", but "1 example for 1 class of queries, BUT IN ONE COMMON STRUCTURE": Exploiting Common Relations.

**Contributions**

- An appropriate model for learning: learn a MultiClassifier that works well in practice.
- Extends standard learning environments: labeled queries, with different class variables.
- Sample complexity: need "few" labeled queries.
- Computational complexity / effective algorithm: NP-hard ⇒ gradient descent.
- Empirical evidence: it works well!
- http://www.cs.ualberta.ca/~greiner/BN-results.html

**Questions?**

- LCL vs LL: does the difference matter?
- ILQ vs APN
- Query forms
- See also http://www.cs.ualberta.ca/~greiner/BN-results.html

**Learning Model**

- If "What is p(jaun | btest−)?" is never asked, we don't care whether BN(jaun | btest−) ≠ p(jaun | btest−).
- Most belief-net learners try to maximize the LIKELIHOOD
  LL_D(Θ) = Σ_{x ∈ D} log P_Θ(x)
  … as their goal is "fit to the data" D.
- Our goal is different: we want to minimize error over the distribution of queries.

**Different Optimization**

- LL_D(Θ) = Σ_{[q*,e] ∈ D} log P_Θ(Q=q* | E=e) + Σ_{[q*,e] ∈ D} log P_Θ(E=e)
  = LCL_D(Θ) + Σ_{[q*,e] ∈ D} log P_Θ(E=e)
- As the term Σ_{[q*,e] ∈ D} log P_Θ(E=e) is non-trivial,
  Θ_LL = argmax_Θ { LL_D(Θ) } ≠ Θ_LCL = argmax_Θ { LCL_D(Θ) }.
- Cf. discriminant analysis: maximize overall likelihood vs minimize predictive error.
- Finding Θ_LCL is NP-hard, so… ILQ.

**Why an Alternative Model?**

- A belief net is both a representation of a distribution and a system for answering queries.
- Suppose the BN must answer "What is p(hep | jaun, btest−)?", but not "What is p(jaun | btest−)?"
- Then the BN is good if BN(hep | jaun, btest−) = p(hep | jaun, btest−), even if BN(jaun | btest−) ≠ p(jaun | btest−).

**Query Distr vs Tuple Distr**

- Distribution over tuples, p(·):
  p(hep, jaun, btest−, …) = 0.07; p(flu, cough, ¬headache, …) = 0.43.
- Distribution over queries, sq(q) = Prob(q asked):
  ask "What is p(hep | jaun, btest−)?" 30% of the time; ask "What is p(flu | cough, ¬headache)?" 22% of the time.
- They can be uncorrelated. E.g.: Prob[asking about Cancer] = sq("cancer") = 100%, even if Pr[Cancer] = p(cancer) = 0.

**Query Distr ≠ Tuple Distr**

- Suppose a GP asks all ADULT FEMALE patients: "Pregnant?"
- The data give P(Preg | Adult, Gender=F) = 2/3.
- Is this really the TUPLE distribution? Is P(Gender=F) = 1?
- NO: it only reflects the questions asked!
- The data provide info re: P(preg | Adult=+, Gender=F), but NOT about P(Adult), …

**Query Distr ⊥ Tuple Distr**

- The query probability Prob([Q, E=e] asked) is independent of the tuple probability P(Q=q, E=e):
  one could always ask about a 0-probability situation.
  E.g., always ask "[Pregnant=t, Gender=Male]": then sq(Pregnant=t, Gender=Male) = 1, but P(Pregnant=t, Gender=Male) = 0.
- It is likewise independent of P(E=e): sq(Q, E=e_i) need not track P(E=e_i).
  E.g., P(Gender=Female) = P(Gender=Male) need not imply sq(Pregnant; Gender=Female) = sq(Pregnant; Gender=Male).
- Note: the value of a query, the q* of ⟨[Q, E=e], q*⟩, IS based on P(Q=q | E=e).

**Does It Matter?**

- If all queries involve the same query variable, it is OK to pretend sq(·) ~ p(·), as no one ever asks about the EVIDENCE DISTRIBUTION.
  E.g., as no one asks "What is P(Gender)?", it doesn't matter…
- But this is problematic for a MultiClassifier, if there are other queries, e.g., sq(Gender; ·).

**ILQ (Conditional Likelihood) vs APN (Likelihood)**

- Wrong structure: ILQ better than APN/EM.
  Experiments: artificial data; naïve Bayes on UCI datasets.
- Correct structure: ILQ often better than OFE, APN/EM. Experiments: …
- Cf. discriminant analysis: maximize overall likelihood vs minimize predictive error.
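The identity behind the "Different Optimization" slide, LL_D(Θ) = LCL_D(Θ) + Σ log P_Θ(E=e), follows from log P(q, e) = log P(q | e) + log P(e) and can be checked numerically. The toy joint distribution and sample below are illustrative assumptions, not from the talk:

```python
# Numeric check of: LL_D = LCL_D + sum over D of log P(E=e),
# which follows from log P(q, e) = log P(q | e) + log P(e).
# The joint table P and the sample D are illustrative assumptions.
from math import log

# A toy joint P(Q, E) over binary Q and E (any normalized table works).
P = {(0, 0): 0.30, (0, 1): 0.10,
     (1, 0): 0.20, (1, 1): 0.40}

def p_e(e):
    """Marginal P(E=e)."""
    return sum(P[(q, e)] for q in (0, 1))

def p_q_given_e(q, e):
    """Conditional P(Q=q | E=e)."""
    return P[(q, e)] / p_e(e)

D = [(1, 1), (0, 0), (1, 0)]        # sample of (q*, e) pairs

LL     = sum(log(P[(q, e)]) for q, e in D)          # log-likelihood
LCL    = sum(log(p_q_given_e(q, e)) for q, e in D)  # conditional log-likelihood
log_pe = sum(log(p_e(e)) for q, e in D)             # evidence term

assert abs(LL - (LCL + log_pe)) < 1e-12   # LL = LCL + sum log P(e)
```

Because the evidence term Σ log P(E=e) depends on Θ, a model with limited capacity that maximizes LL trades off fitting P(E=e) against fitting P(Q | E=e), which is why the LL and LCL optima can differ.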