
Abduction, Uncertainty, and Probabilistic Reasoning

Chapters 13, 14, and more


Introduction

  • Abduction is a reasoning process that tries to form plausible explanations for abnormal observations

    • Abduction is distinct from deduction and induction

    • Abduction is inherently uncertain

  • Uncertainty has become an important issue in AI research

  • Some major formalisms for representing and reasoning about uncertainty

    • Mycin’s certainty factor (an early representative)

    • Probability theory (esp. Bayesian networks)

    • Dempster-Shafer theory

    • Fuzzy logic

    • Truth maintenance systems


Abduction

  • Definition (Encyclopedia Britannica): reasoning that derives an explanatory hypothesis from a given set of facts

    • The inference result is a hypothesis, which if true, could explain the occurrence of the given facts

  • Examples

    • Dendral, an expert system that constructs the 3D structure of chemical compounds

      • Facts: mass spectrometer data of the compound and the chemical formula of the compound

      • KB: chemistry, esp. the strength of different types of bonds

      • Reasoning: form a hypothetical 3D structure that matches the given chemical formula and would most likely produce the given mass spectrum if subjected to electron beam bombardment


    • Medical diagnosis

      • Facts: symptoms, lab test results, and other observed findings (called manifestations)

      • KB: causal associations between diseases and manifestations

      • Reasoning: find one or more diseases whose presence would causally explain the occurrence of the given manifestations

  • Many other reasoning processes (e.g., word sense disambiguation in natural language processing, image understanding, a detective’s work) can also be seen as abductive reasoning.


Comparing abduction, deduction and induction

Deduction:

    major premise: All balls in the box are black

    minor premise: This ball is from the box

    conclusion: This ball is black

    A => B
    A
    ---------
    B

Abduction:

    rule: All balls in the box are black

    observation: This ball is black

    explanation: This ball is from the box

    A => B
    B
    -------------
    Possibly A

Induction:

    case: These balls are from the box

    observation: These balls are black

    hypothesized rule: All balls in the box are black

    Whenever A then B, but not vice versa
    -------------
    Possibly A => B

Induction: from specific cases to general rules

Abduction and deduction: both go from one part of a specific case to another part of the case using general rules (in different ways)


Characteristics of abduction reasoning

  • Reasoning results are hypotheses, not theorems (they may be false even if the rules and facts are true)

    • e.g., misdiagnosis in medicine

  • There may be multiple plausible hypotheses

    • Given rules A => B and C => B, and fact B, both A and C are plausible hypotheses

    • Abduction is inherently uncertain

    • Hypotheses can be ranked by their plausibility (if it can be determined)

  • Reasoning is often a hypothesize-and-test cycle

    • Hypothesize phase: postulate possible hypotheses, each of which could explain the given facts (or explain most of the important facts)

    • Test phase: test the plausibility of all or some of these hypotheses

    • One way to test a hypothesis H is to check whether something that is currently unknown but can be predicted from H is actually true

      • If we also know A => D and C => E, then ask whether D and E are true

      • If it turns out that D is true and E is false, then hypothesis A becomes more plausible (support for A increases, support for C decreases)

    • Alternative hypotheses compete with each other (Occam’s razor prefers simpler hypotheses)

  • Reasoning is non-monotonic

    • Plausibility of hypotheses can increase/decrease as new facts are collected (deductive inference determines if a sentence is true but would never change its truth value)

    • Some hypotheses may be discarded/defeated, and new ones may be formed when new observations are made


    Source of Uncertainty in Intelligent Systems

    • Uncertain data (noise)

    • Uncertain knowledge (e.g., causal relations)

      • A disorder may cause any or all of its POSSIBLE manifestations in a specific case

      • A manifestation can be caused by more than one POSSIBLE disorder

    • Uncertain reasoning results

      • Abduction and induction are inherently uncertain

      • Default reasoning, even in deductive fashion, is uncertain

      • Incomplete deductive inference may be uncertain


    Probabilistic Inference

    • Based on probability theory (especially Bayes’ theorem)

      • Well established discipline about uncertain outcomes

      • Empirical science like physics/chemistry, can be verified by experiments

    • Probability theory is too rigid to apply directly in many knowledge-based applications

      • Some assumptions have to be made to simplify the reality

      • Different formalisms have been developed in which some aspects of the probability theory are changed/modified.

    • We will briefly review the basics of probability theory before discussing different approaches to uncertainty

    • The presentation uses the diagnostic process (an abductive and evidential reasoning process) as an example


    Probability of Events

    • Sample space and events

      • Sample space S: (e.g., all people in an area)

      • Events E1 ⊆ S: (e.g., all people having cough)

        E2 ⊆ S: (e.g., all people having cold)

    • Prior (marginal) probabilities of events

      • P(E) = |E| / |S| (frequency interpretation)

      • P(E) = 0.1 (subjective probability)

      • 0 <= P(E) <= 1 for all events

      • Two special events: ∅ and S: P(∅) = 0 and P(S) = 1.0

    • Boolean operators between events (to form compound events)

      • Conjunctive (intersection): E1 ^ E2 (E1 ∩ E2)

      • Disjunctive (union): E1 v E2 (E1 ∪ E2)

      • Negation (complement): ~E (E^C = S – E)

    [Figure: Venn diagrams showing an event E with its complement ~E, and two events E1 and E2 with their intersection E1 ^ E2]

    • Probabilities of compound events

      • P(~E) = 1 – P(E) because P(~E) + P(E) =1

      • P(E1 v E2) = P(E1) + P(E2) – P(E1 ^ E2)

      • But how to compute the joint probability P(E1 ^ E2)?

    • Conditional probability (of E1, given E2)

      • How likely E1 occurs in the subspace of E2:

        P(E1 | E2) = |E1 ^ E2| / |E2| = P(E1 ^ E2) / P(E2)

        so P(E1 ^ E2) = P(E1 | E2) P(E2)


    • Independence assumption

      • Two events E1 and E2 are said to be independent of each other if

        P(E1 | E2) = P(E1)

        (given E2 does not change the likelihood of E1)

      • Computation can be simplified with independent events: P(E1 ^ E2) = P(E1) P(E2)

    • Mutually exclusive (ME) and exhaustive (EXH) set of events

      • ME: Ei ^ Ej = ∅ (so P(Ei ^ Ej) = 0) for all i ≠ j

      • EXH: E1 v ... v En = S (so P(E1 v ... v En) = 1)


    Bayes’ Theorem

    • In the setting of diagnostic/evidential reasoning

      • Know the prior probability of each hypothesis, P(Hi), and the

        conditional probability P(Ej | Hi)

      • Want to compute the posterior probability P(Hi | Ej)

    • Bayes’ theorem (formula 1):

        P(Hi | Ej) = P(Hi) P(Ej | Hi) / P(Ej)

    • If the purpose is to find which of the n hypotheses H1, ..., Hn is more plausible given Ej, then we can ignore the denominator and rank them using the relative likelihood

        rel(Hi | Ej) = P(Ej | Hi) P(Hi)


    • P(Ej) can be computed from P(Ej | Hi) and P(Hi), if we assume all hypotheses are ME and EXH:

        P(Ej) = Σi P(Ej ^ Hi) = Σi P(Ej | Hi) P(Hi)

    • Then we have another version of Bayes’ theorem:

        P(Hi | Ej) = P(Ej | Hi) P(Hi) / Σk P(Ej | Hk) P(Hk)

      where Σk P(Ej | Hk) P(Hk), the sum of the relative likelihoods of all n hypotheses, is a normalization factor
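As a small illustration of the normalized form of Bayes’ theorem, the sketch below ranks three ME and EXH hypotheses for a single piece of evidence. The hypothesis names and all numbers are made up for this example, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Posterior P(Hi | E) for ME and EXH hypotheses, by normalizing P(E | Hi) P(Hi)."""
    rel = {h: priors[h] * likelihoods[h] for h in priors}  # relative likelihoods
    norm = sum(rel.values())                               # normalization factor, equals P(E)
    return {h: r / norm for h, r in rel.items()}

# Hypothetical numbers: priors P(Hi) and likelihoods P(cough | Hi).
priors = {"cold": 0.2, "flu": 0.1, "healthy": 0.7}
likelihoods = {"cold": 0.6, "flu": 0.8, "healthy": 0.05}

post = posterior(priors, likelihoods)
best = max(post, key=post.get)
print(post, best)
```

Ranking by `rel` alone gives the same ordering as ranking by `post`; the normalization only matters when absolute probabilities are needed.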


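    Probabilistic Inference for simple diagnostic problems

    • Knowledge base: hypotheses H1, ..., Hn (assumed ME and EXH) with priors P(Hi), and conditional probabilities P(Ej | Hi) for each piece of evidence Ej

    • Case input: observed evidence E1, ..., El

    • Find the hypothesis Hi with the highest posterior probability P(Hi | E1, ..., El)

    • By Bayes’ theorem

        P(Hi | E1, ..., El) = P(E1, ..., El | Hi) P(Hi) / P(E1, ..., El)

    • Assume all pieces of evidence are conditionally independent, given any hypothesis

        P(E1, ..., El | Hi) = Πj P(Ej | Hi)


    • The relative likelihood

        rel(Hi | E1, ..., El) = P(Hi) Πj P(Ej | Hi)

    • The absolute posterior probability

        P(Hi | E1, ..., El) = rel(Hi | E1, ..., El) / Σk rel(Hk | E1, ..., El)

    • Evidence accumulation (when new evidence is discovered)

      If El+1 is present: rel(Hi | E1, ..., El, El+1) = P(El+1 | Hi) rel(Hi | E1, ..., El)

      If El+1 is absent: rel(Hi | E1, ..., El, ~El+1) = (1 – P(El+1 | Hi)) rel(Hi | E1, ..., El)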
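Evidence accumulation under the conditional independence assumption can be sketched as a running product of relative likelihoods. The priors, conditional probabilities, and helper names below are hypothetical; only the update pattern (multiply by P(E | Hi) when the evidence is present, by 1 - P(E | Hi) when it is absent) follows the slide.

```python
def accumulate(rel, cond, present):
    """Update each relative likelihood with one new piece of evidence."""
    return {h: r * (cond[h] if present else 1.0 - cond[h]) for h, r in rel.items()}

def normalize(rel):
    """Turn relative likelihoods into absolute posteriors (hypotheses ME and EXH)."""
    total = sum(rel.values())
    return {h: r / total for h, r in rel.items()}

# Hypothetical knowledge base: priors P(Hi) and conditionals P(Ej | Hi).
priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
cond = {
    "E1": {"H1": 0.9, "H2": 0.2, "H3": 0.1},
    "E2": {"H1": 0.3, "H2": 0.8, "H3": 0.5},
}

rel = dict(priors)                                 # before any evidence, rel(Hi) = P(Hi)
rel = accumulate(rel, cond["E1"], present=True)    # E1 observed present
rel = accumulate(rel, cond["E2"], present=False)   # E2 observed absent
post = normalize(rel)
best = max(post, key=post.get)
print(rel, post, best)
```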


    Assessing the Assumptions

    • Assumption 1: hypotheses are mutually exclusive and exhaustive

      • Single fault assumption (one and only one hypothesis must be true)

      • Multi-faults do exist in individual cases

      • Can be viewed as an approximation of situations where hypotheses are independent of each other and their prior probabilities are very small

    • Assumption 2: pieces of evidence are conditionally independent of each other, given any hypothesis

      • Manifestations themselves are not independent of each other; they are correlated by their common causes

      • Reasonable under single fault assumption

      • Not so when multi-faults are to be considered


    Limitations of the simple Bayesian system

    • Cannot handle well hypotheses of multiple disorders

      • Suppose H1, ..., Hn are independent of each other

      • Consider a composite hypothesis H1 ^ H2

      • How to compute the posterior probability (or relative likelihood) P(H1 ^ H2 | E1, ..., El)?

      • Using Bayes’ theorem

          P(H1 ^ H2 | E1, ..., El) = P(E1, ..., El | H1 ^ H2) P(H1 ^ H2) / P(E1, ..., El)

      • One could assume that H1 and H2 remain independent given the evidence, i.e., P(H1 ^ H2 | E1, ..., El) = P(H1 | E1, ..., El) P(H2 | E1, ..., El), but this is a very unreasonable assumption

      • Example: B: burglar, E: earthquake, A: alarm set off

        • E and B are independent

        • But when A is given, they are (adversely) dependent because they become competitors to explain A: P(B|A,E) << P(B|A)

    • Cannot handle causal chaining

      • Ex. A: weather of the year

        B: cotton production of the year

        C: cotton price of next year

      • Observed: A influences C

      • The influence is not direct (A –> B –> C)

        P(C|B, A) = P(C|B): instantiation of B blocks influence of A on C

    • Need a better representation and a better assumption


    Bayesian Networks (BNs)

    • Definition: BN = (DAG, CPD)

      • DAG: directed acyclic graph (BN’s structure)

        • Nodes: random variables (typically binary or discrete, but methods also exist to handle continuous variables)

        • Arcs: indicate probabilistic dependencies between nodes (lack of link signifies conditional independence)

      • CPD: conditional probability distribution (BN’s parameters)

        • Conditional probabilities at each node, usually stored as a table (conditional probability table, or CPT)

      • Root nodes are a special case: they have no parents (πi = ∅), so the CPD is just the prior, P(xi | πi) = P(xi)


    Example BN

    [Figure: DAG with arcs A → B, A → C, B → D, C → D, C → E]

    P(a0) = 0.001

    P(b0|a0) = 0.3         P(b0|a1) = 0.001

    P(c0|a0) = 0.2         P(c0|a1) = 0.005

    P(d0|b0, c0) = 0.1     P(d0|b0, c1) = 0.01

    P(d0|b1, c0) = 0.01    P(d0|b1, c1) = 0.00001

    P(e0|c0) = 0.4         P(e0|c1) = 0.002

    Uppercase: variables (A, B, …)

    Lowercase: values/states of variables (A has two states a0 and a1)

    Note that we only specify P(a0) etc., not P(a1), since they have to add to one


    Netica

    • A commercial BN package by Norsys

    • Download a limited version for free from

      http://www.norsys.com/

    • APIs may also be downloaded


    Conditional independence and chaining

    • Conditional independence assumption

        P(xi | πi, q) = P(xi | πi)

      where πi is the set of parents of xi and q is any set of variables (nodes) other than xi and its successors

      • πi blocks the influence of other nodes on xi and its successors (q influences xi only through variables in πi)

      • With this assumption, the complete joint probability distribution of all variables in the network can be represented by (recovered from) local CPDs by chaining these CPDs:

        P(x1, ..., xn) = Πi P(xi | πi)


    Chaining: Example

    [Figure: the example BN with arcs A → B, A → C, B → D, C → D, C → E]

    Computing the joint probability for all variables is easy:

    The joint distribution of all variables

    P(A, B, C, D, E)

    = P(E | A, B, C, D) P(A, B, C, D) by the product rule

    = P(E | C) P(A, B, C, D) by cond. indep. assumption

    = P(E | C) P(D | A, B, C) P(A, B, C)

    = P(E | C) P(D | B, C) P(C | A, B) P(A, B)

    = P(E | C) P(D | B, C) P(C | A) P(B | A) P(A)

    For a particular state:

    P(a0, b0, c1, d1, e0) = P(a0)P(b0|a0)P(c1|a0)P(d1|b0, c1)P(e0| c1)

    = 0.001*0.3*0.8*0.99*0.002 = 4.752*10^(-7)
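The chained product can be checked numerically. The sketch below encodes the CPT entries listed for the example BN (reading the second entry on the C row as P(c0 | a1) = 0.005, which is an assumption about the slide) and multiplies the local CPD entries for one full state. The names `pick` and `joint` are introduced only for this example.

```python
from itertools import product

# CPT entries from the example BN; state 0 means x0, state 1 means x1.
P_A0 = 0.001
P_B0 = {0: 0.3, 1: 0.001}          # P(b0 | a)
P_C0 = {0: 0.2, 1: 0.005}          # P(c0 | a); second entry assumed to be P(c0 | a1)
P_D0 = {(0, 0): 0.1, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.00001}  # P(d0 | b, c)
P_E0 = {0: 0.4, 1: 0.002}          # P(e0 | c)

def pick(p0, state):
    """Return P(x0) for state 0, or the complement P(x1) = 1 - P(x0) for state 1."""
    return p0 if state == 0 else 1.0 - p0

def joint(a, b, c, d, e):
    """P(a, b, c, d, e) = P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)."""
    return (pick(P_A0, a) * pick(P_B0[a], b) * pick(P_C0[a], c)
            * pick(P_D0[(b, c)], d) * pick(P_E0[c], e))

p_state = joint(0, 0, 1, 1, 0)   # the state (a0, b0, c1, d1, e0) from the slide
total = sum(joint(*s) for s in product((0, 1), repeat=5))
print(p_state, total)
```

Summing `joint` over all 32 states should give 1 (up to rounding), which is a quick sanity check that the chained CPDs define a proper joint distribution.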


    B: burglar

    E: earthquake

    A: alarm set off

    P(E) = 0.002

    P(B) = 0.01

    P(E|A) = 0.167; P(B|A) = 0.835

    P(E|A, E) = 1.0; P(B|A, E) = 0.0112

    P(B|A, E) = P(B,A,E)/P(A,E)

    = P(B,A,E)/(P(B,A,E) + P(~B,A,E))

    = 0.01*0.002*0.9/(0.01*0.002*0.9 + 0.99*0.002*0.8)

    = 0.000018/(0.000018 + 0.001584)

    = 0.000018/0.001602

    = 0.01123
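The arithmetic for P(B | A, E) is easy to re-check in code. The values 0.9 and 0.8 are read off the slide's calculation and interpreted here as P(A | B, E) and P(A | ~B, E); that interpretation and the variable names are assumptions of this sketch.

```python
p_b = 0.01                # P(B)
p_e = 0.002               # P(E)
p_a_given_b_e = 0.9       # P(A | B, E), as used in the slide's arithmetic
p_a_given_notb_e = 0.8    # P(A | ~B, E), as used in the slide's arithmetic

# B and E are independent a priori, so P(B, A, E) = P(B) P(E) P(A | B, E).
p_b_a_e = p_b * p_e * p_a_given_b_e
p_notb_a_e = (1.0 - p_b) * p_e * p_a_given_notb_e

p_b_given_a_e = p_b_a_e / (p_b_a_e + p_notb_a_e)
print(p_b_given_a_e)
```

The result is close to the prior P(B) = 0.01 and far below P(B | A) = 0.835: once the earthquake is known, it explains the alarm and the burglar hypothesis loses most of its support.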


    Topological semantics

    [Figure: three connection types among nodes A, B, and C: chain, converging, and diverging]

    • A node is conditionally independent of its non-descendants given its parents

    • A node is conditionally independent of all other nodes in the network given its parents, children, and children’s parents (also known as its Markov blanket)

    • The method called d-separation can be applied to decide whether a set of nodes X is independent of another set Y, given a third set Z

    Chain: A and C are independent, given B

    Converging: B and C are independent only when A is NOT given

    Diverging: B and C are independent, given A


    Inference tasks

    • Simple queries: Compute posterior probability P(Xi | E=e)

      • E.g., P(NoGas | Gauge=empty, Lights=on, Starts=false)

      • Posteriors for ALL nonevidence nodes

      • Priors for all nodes (E = ∅)

    • Conjunctive queries:

      • P(Xi, Xj | E=e) = P(Xi | E=e) P(Xj | Xi, E=e)

    • Optimal decisions: decision networks or influence diagrams

      • include utility information and actions;

      • inference is to find P(outcome | action, evidence)

    • Value of information: Which evidence should we seek next?

    • Sensitivity analysis: Which probability values are most critical?

    • Explanation: Why do I need a new starter motor?


    • MAP (maximum a posteriori) problems (explanation)

      • Find the most probable joint instantiation of the unobserved variables, given the evidence

      • The solution provides a good explanation for your action

      • This is an optimization problem


    Approaches to inference

    • Exact inference

      • Enumeration

      • Variable elimination

      • Belief propagation in polytrees (singly connected BNs)

      • Clustering / junction tree algorithms

    • Approximate inference

      • Stochastic simulation / sampling methods

        • Markov chain Monte Carlo methods

      • Loopy propagation

      • Others

        • Mean field theory

        • Neural networks


    Inference by enumeration

    • To compute P(X | E=e), where X is a single variable and E is evidence (an instantiation of a set of variables)

    • Add all of the terms (atomic event probabilities) from the full joint distribution that are consistent with E

    • If Y are the other (unobserved) variables, excluding X, then the posterior distribution is

      P(X | E=e) = α P(X, e) = α Σy P(X, e, y)

      • The sum is over all possible instantiations y of the variables in Y, and α = 1 / P(e) is a normalization constant

    • Each P(X, e, y) term can be computed using the chain rule

    • Computationally expensive!


    Example: Enumeration

    [Figure: the example BN with arcs A → B, A → C, B → D, C → D, C → E]

    • P(xi) = Σπi P(xi | πi) P(πi)

    • Suppose we want P(D | e), where only the value of E is given (as true)

    • P(D | e) = α ΣA,B,C P(a, b, c, d, e) = α ΣA,B,C P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)

    • With simple iteration to compute this expression, there’s going to be a lot of repetition (e.g., P(e|c) has to be recomputed every time we iterate over C for all possible assignments of A and B)
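A minimal sketch of inference by enumeration on the example BN follows. It assumes the CPT values listed on the Example BN slide (with the second C entry read as P(c0 | a1)), treats "E given as true" as E = e0 purely for illustration, and uses helper names (`pick`, `joint`, `posterior_d_given_e`) that are not part of the presentation.

```python
from itertools import product

# CPT entries from the example BN; state 0 means x0, state 1 means x1.
P_A0 = 0.001
P_B0 = {0: 0.3, 1: 0.001}
P_C0 = {0: 0.2, 1: 0.005}          # second entry assumed to be P(c0 | a1)
P_D0 = {(0, 0): 0.1, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.00001}
P_E0 = {0: 0.4, 1: 0.002}

def pick(p0, state):
    """Return P(x0) for state 0, or 1 - P(x0) for state 1."""
    return p0 if state == 0 else 1.0 - p0

def joint(a, b, c, d, e):
    """Chain rule: P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)."""
    return (pick(P_A0, a) * pick(P_B0[a], b) * pick(P_C0[a], c)
            * pick(P_D0[(b, c)], d) * pick(P_E0[c], e))

def posterior_d_given_e(e):
    """P(D | E=e): sum the joint over A, B, C for each state of D, then normalize."""
    unnorm = [sum(joint(a, b, c, d, e) for a, b, c in product((0, 1), repeat=3))
              for d in (0, 1)]
    alpha = 1.0 / sum(unnorm)      # alpha = 1 / P(e)
    return [alpha * u for u in unnorm]

dist = posterior_d_given_e(0)      # [P(d0 | e0), P(d1 | e0)]
print(dist)
```

Every one of the 8 assignments of (A, B, C) recomputes the full five-term product, which is the repetition the last bullet refers to.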


    Exercise: Enumeration

    [Figure: BN over the variables smart, study, prepared, fair, and pass, with priors p(smart) = .8, p(study) = .6, p(fair) = .9]

    Query: What is the probability that a student studied, given that they pass the exam?


    Variable elimination

    • Basically just enumeration, but with caching of local calculations

    • Linear for polytrees

    • Potentially exponential for multiply connected BNs

      • Exact inference in Bayesian networks is NP-hard!


    Variable elimination

    General idea:

    • Write the query as a nested sum, over the unobserved variables, of a product of CPT entries:

        P(X, e) = Σy1 ... Σyk Πi P(xi | πi)

    • Iteratively

      • Move all irrelevant terms outside of innermost sum

      • Perform innermost sum, getting a new term

      • Insert the new term into the product


    Variable elimination

    Example:

    ΣA ΣB ΣC P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)          (8 x 4 = 32 multiplications)

    = ΣA ΣB P(a) P(b|a) ΣC P(c|a) P(d|b,c) P(e|c)

    = ΣA P(a) ΣB P(b|a) ΣC P(c|a) P(d|b,c) P(e|c)        (8 x 2 + 4 + 2 = 22 multiplications)

    for each state of A = a

        for each state of B = b

            compute fC(a, b) = ΣC P(c|a) P(d|b,c) P(e|c)    (variable C is summed out)

        compute fB(a) = ΣB P(b|a) fC(a, b)                  (variable B is summed out)

    compute result = ΣA P(a) fB(a)

    Here fC(a, b) and fB(a) are called factors, which are vectors or matrices
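The nested loops above translate almost directly into Python. The sketch below, which assumes the example BN's CPT values (second C entry read as P(c0 | a1)) and uses helper names invented for this example, computes P(d, e) both by brute-force enumeration and with the factored sums, so the two can be compared.

```python
from itertools import product
from math import isclose

# CPT entries from the example BN; state 0 means x0, state 1 means x1.
P_A0 = 0.001
P_B0 = {0: 0.3, 1: 0.001}
P_C0 = {0: 0.2, 1: 0.005}          # second entry assumed to be P(c0 | a1)
P_D0 = {(0, 0): 0.1, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.00001}
P_E0 = {0: 0.4, 1: 0.002}

def pick(p0, state):
    """Return P(x0) for state 0, or 1 - P(x0) for state 1."""
    return p0 if state == 0 else 1.0 - p0

def brute_force(d, e):
    """P(d, e) by summing the full five-term product over all (a, b, c)."""
    return sum(pick(P_A0, a) * pick(P_B0[a], b) * pick(P_C0[a], c)
               * pick(P_D0[(b, c)], d) * pick(P_E0[c], e)
               for a, b, c in product((0, 1), repeat=3))

def eliminate(d, e):
    """P(d, e) with the sums pushed inward: C is summed out first, then B, then A."""
    result = 0.0
    for a in (0, 1):
        f_b = 0.0
        for b in (0, 1):
            # f_C(a, b): sum out C
            f_c = sum(pick(P_C0[a], c) * pick(P_D0[(b, c)], d) * pick(P_E0[c], e)
                      for c in (0, 1))
            f_b += pick(P_B0[a], b) * f_c      # f_B(a): sum out B
        result += pick(P_A0, a) * f_b          # finally sum out A
    return result

print(brute_force(0, 0), eliminate(0, 0))
```

Both functions return the same number; the factored version simply avoids recomputing the inner products for every assignment of the outer variables.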


    Exercise: Variable elimination

    [Figure: BN over the variables smart, study, prepared, fair, and pass, with priors p(smart) = .8, p(study) = .6, p(fair) = .9]

    Query: What is the probability that a student is smart, given that they pass the exam?


    Belief Propagation

    [Figure: a polytree over nodes A, B, C, D, E, F with evidence E = e]

    • Singly connected network, SCN (also known as polytree)

      • There is at most one undirected path between any two nodes (i.e., the network is a tree if the directions of the arcs are ignored)

      • The influence of the instantiated variable (evidence) spreads to the rest of the network along the arcs

        • The instantiated variable influences its predecessors and successors differently (using the CPT along opposite directions)

        • Computation is linear in the diameter of the network (the longest undirected path)

        • The belief (posterior) of every non-evidence node is updated in one pass

    • For a multiply connected network: conditioning


    Conditioning

    [Figure: a multiply connected BN over nodes A, B, C, D, E]

    • Conditioning: find the network’s smallest cutset S (a set of nodes whose removal renders the network singly connected)

      • In this network, S = {A} or {B} or {C} or {D}

    • For each instantiation of S, compute the belief update with the belief propagation algorithm

    • Combine the results from all instantiations of S (each is weighted by P(S = s))

    • Computationally expensive (finding the smallest cutset is in general NP-hard, and the total number of possible instantiations of S is O(2^|S|))


    Junction Tree

    • Convert a BN to a junction tree

      • Moralization: add an undirected edge between every pair of parents of a common child, then drop the directions of all arcs: Moralized Graph

      • Triangulation: add chords (edges) so that no cycle of length > 3 is left without a chord: Triangulated Graph

      • A junction tree is a tree of cliques of the triangulated graph

      • Cliques are connected by links

        • A link stands for the set of all variables S shared by these two cliques

        • Each clique has a potential (similar to a CPT), constructed from the CPTs of variables in the original BN


    Junction Tree

    • Example

    [Figure: four panels: a simple BN over A, B, C, D, E; its moralized graph; its triangulated graph; and the resulting junction tree of 3 nodes, with cliques {A, B, C}, {B, C, D}, {C, D, E} linked by the shared sets (B, C) and (C, D)]

    Junction Tree

    • Reasoning

      • Since it is now a tree, the polytree algorithm can be applied, but now two cliques exchange P(S), the distribution over S, their shared variables

      • Complexity:

        • O(n) steps, where n is the number of cliques

        • Each step is expensive if cliques are large (CPT size is exponential in the clique size)

        • Construction of the CPTs of the JT is expensive as well, but they need to be computed only once


    Some comments on BN reasoning

    • Let X be the set of all variables in a BN. Any BN reasoning task can be expressed in the form of calculating P(U | V), where U and V are disjoint subsets of X

    • This can be done by marginalization of the joint distribution P(X) over Y = X \ U \ V:

        P(u | v) = P(u, v) / P(v) = Σy P(u, v, y) / Σu',y P(u', v, y)

      where each entry P(x) = P(u,v,y) can be calculated by the chain rule from the CPTs

    • Computation can be done more efficiently using, say, a junction tree, by exploiting variable interdependencies

    • BN reasoning was proved to be NP-hard by reducing 3SAT to BN reasoning (Cooper 1990)


    Approximate inference: Direct sampling

    • Suppose you are given values for some subset of the variables, E, and want to infer distributions for unknown variables, Z

    • Randomly generate a very large number of instantiations from the BN according to the distribution

      • Generate instantiations for all variables – start at root variables and work your way “forward” in topological order

    • Rejection sampling: Only keep those instantiations that are consistent with the values for E

    • Use the frequency of values for Z to get estimated probabilities

    • Accuracy of the results depends on the size of the sample (asymptotically approaches exact results)

    • Very expensive and inefficient
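Rejection sampling can be sketched on the example BN as follows. The CPT values are those listed on the Example BN slide (second C entry read as P(c0 | a1)); the query P(C | E = e0), the sample size, the seed, and all function names are choices made for this illustration only.

```python
import random

# CPT entries from the example BN; state 0 means x0, state 1 means x1.
P_A0 = 0.001
P_B0 = {0: 0.3, 1: 0.001}
P_C0 = {0: 0.2, 1: 0.005}          # second entry assumed to be P(c0 | a1)
P_D0 = {(0, 0): 0.1, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.00001}
P_E0 = {0: 0.4, 1: 0.002}

def draw(rng, p0):
    """Return state 0 with probability p0, otherwise state 1."""
    return 0 if rng.random() < p0 else 1

def forward_sample(rng):
    """Sample all variables in topological order: A, then B and C, then D and E."""
    a = draw(rng, P_A0)
    b = draw(rng, P_B0[a])
    c = draw(rng, P_C0[a])
    d = draw(rng, P_D0[(b, c)])
    e = draw(rng, P_E0[c])
    return {"A": a, "B": b, "C": c, "D": d, "E": e}

def rejection_sample(query, evidence, n, seed=0):
    """Estimate P(query | evidence), keeping only samples consistent with the evidence."""
    rng = random.Random(seed)
    counts = [0, 0]
    for _ in range(n):
        s = forward_sample(rng)
        if all(s[var] == val for var, val in evidence.items()):
            counts[s[query]] += 1
    kept = sum(counts)
    dist = [c / kept for c in counts] if kept else None
    return dist, kept

dist, kept = rejection_sample("C", {"E": 0}, n=200_000)
print(dist, kept)
```

Because e0 is a rare state in this network, only a small fraction of the generated samples survives the rejection step, which illustrates why the method is called expensive and inefficient.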


    Likelihood weighting

    • Idea: Don’t generate samples that need to be rejected in the first place!

    • Sample only from the unknown variables Z and X (E are fixed)

    • Weight each sample according to the likelihood that it would occur, given the evidence E

      • A weight w is associated with each sample (w initialized to 1)

      • When an evidence node (say E1 = e1-0) is selected for weighting, its parents are already instantiated (say parents A and B are assigned states a and b)

      • Modify w = w * P(e1-0 | a, b) based on E1’s CPT

      • Repeat for the other evidence nodes
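A minimal likelihood-weighting sketch on the example BN is given below. The CPT values come from the Example BN slide (second C entry read as P(c0 | a1)); the `NODES` table, the function names, the query P(C | E = e0), and the sample size are assumptions of this example.

```python
import random

# CPT entries from the example BN; state 0 means x0, state 1 means x1.
P_A0 = 0.001
P_B0 = {0: 0.3, 1: 0.001}
P_C0 = {0: 0.2, 1: 0.005}          # second entry assumed to be P(c0 | a1)
P_D0 = {(0, 0): 0.1, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.00001}
P_E0 = {0: 0.4, 1: 0.002}

# Nodes in topological order, each with a function giving P(x0 | parents in sample s).
NODES = [
    ("A", lambda s: P_A0),
    ("B", lambda s: P_B0[s["A"]]),
    ("C", lambda s: P_C0[s["A"]]),
    ("D", lambda s: P_D0[(s["B"], s["C"])]),
    ("E", lambda s: P_E0[s["C"]]),
]

def weighted_sample(rng, evidence):
    """Sample non-evidence nodes; fix evidence nodes and fold their likelihood into w."""
    s, w = {}, 1.0
    for name, p0_of in NODES:
        p0 = p0_of(s)
        if name in evidence:
            s[name] = evidence[name]
            w *= p0 if evidence[name] == 0 else 1.0 - p0   # w = w * P(e | parents)
        else:
            s[name] = 0 if rng.random() < p0 else 1
    return s, w

def likelihood_weighting(query, evidence, n, seed=0):
    """Estimate P(query | evidence) as a weight-normalized frequency."""
    rng = random.Random(seed)
    totals = [0.0, 0.0]
    for _ in range(n):
        s, w = weighted_sample(rng, evidence)
        totals[s[query]] += w
    z = sum(totals)
    return [t / z for t in totals]

est = likelihood_weighting("C", {"E": 0}, n=50_000)
print(est)
```

Every generated sample contributes to the estimate, so no work is thrown away, although samples with very small weights contribute little.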


    Markov chain Monte Carlo algorithm

    • So called because

      • Markov chain – each instance generated in the sample is dependent on the previous instance

      • Monte Carlo – statistical sampling method

    • Perform a random walk through variable assignment space, collecting statistics as you go

      • Start with a random instantiation, consistent with evidence variables

      • At each step, randomly select a non-evidence variable x and randomly sample its value from its conditional distribution given its Markov blanket, P(x | mb(x))

    • Given enough samples, MCMC gives an accurate estimate of the true distribution of values
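One common instance of this random walk is Gibbs sampling, sketched below on the example BN. The sketch resamples a randomly chosen non-evidence variable from its distribution given all other variables, which it obtains by evaluating the full joint for both states (equivalent to conditioning on the Markov blanket, just less efficient). The CPT values are from the Example BN slide (second C entry read as P(c0 | a1)); the burn-in length, step count, and names are assumptions.

```python
import random

# CPT entries from the example BN; state 0 means x0, state 1 means x1.
P_A0 = 0.001
P_B0 = {0: 0.3, 1: 0.001}
P_C0 = {0: 0.2, 1: 0.005}          # second entry assumed to be P(c0 | a1)
P_D0 = {(0, 0): 0.1, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.00001}
P_E0 = {0: 0.4, 1: 0.002}
VARS = ["A", "B", "C", "D", "E"]

def pick(p0, state):
    """Return P(x0) for state 0, or 1 - P(x0) for state 1."""
    return p0 if state == 0 else 1.0 - p0

def joint(s):
    """Joint probability of a full assignment s (dict from variable name to state)."""
    return (pick(P_A0, s["A"]) * pick(P_B0[s["A"]], s["B"]) * pick(P_C0[s["A"]], s["C"])
            * pick(P_D0[(s["B"], s["C"])], s["D"]) * pick(P_E0[s["C"]], s["E"]))

def gibbs(query, evidence, n, burn_in=1000, seed=0):
    """Estimate P(query | evidence) from a random-scan Gibbs chain."""
    rng = random.Random(seed)
    state = {v: rng.randint(0, 1) for v in VARS}
    state.update(evidence)                      # start consistent with the evidence
    free = [v for v in VARS if v not in evidence]
    counts = [0, 0]
    for step in range(n + burn_in):
        x = rng.choice(free)                    # pick a non-evidence variable
        weights = []
        for val in (0, 1):
            state[x] = val
            weights.append(joint(state))        # proportional to P(x = val | all others)
        state[x] = 0 if rng.random() < weights[0] / (weights[0] + weights[1]) else 1
        if step >= burn_in:
            counts[state[query]] += 1
    return [c / n for c in counts]

est = gibbs("C", {"E": 0}, n=40_000)
print(est)
```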


    Loopy Propagation

    • Belief propagation

      • Works only for polytrees (exact solution)

      • Each piece of evidence propagates once throughout the network

    • Loopy propagation

      • Let propagation continue until the network stabilizes (one hopes)

    • Experiments show

      • Many BNs stabilize with loopy propagation

      • If it stabilizes, it often yields exact or very good approximate solutions

    • Analysis

      • Conditions for convergence and quality of approximation are under intense investigation


    Noisy-Or BN

    • A special BN of binary variables (Peng & Reggia, Cooper)

      • Causation independence: parent nodes influence a child independently

    • Advantages:

      • One-to-one correspondence between causal links and causal strengths

      • Easy for humans to understand (acquire and evaluate KB)

      • Fewer # of probabilities needed in KB

      • Computation is less expensive

    • Disadvantage: less expressive (less general)
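The saving in the number of probabilities can be seen by expanding a noisy-or node into its full CPT. The sketch assumes the standard noisy-or combination without a leak term, P(child = 1 | parents) = 1 - Π over active parents of (1 - strength); the causal strengths and names below are hypothetical.

```python
from itertools import product

def noisy_or_cpt(strengths):
    """Expand per-link causal strengths into P(child = 1 | parent configuration)."""
    names = list(strengths)
    cpt = {}
    for config in product((0, 1), repeat=len(names)):
        p_no_effect = 1.0
        for name, active in zip(names, config):
            if active:
                # each active parent independently fails to cause the child
                p_no_effect *= 1.0 - strengths[name]
        cpt[config] = 1.0 - p_no_effect
    return cpt

# Hypothetical causal strengths, one number per causal link.
strengths = {"flu": 0.8, "cold": 0.5, "allergy": 0.3}
cpt = noisy_or_cpt(strengths)
print(len(strengths), len(cpt))
```

Three link strengths determine all eight rows of the table, which is the "fewer # of probabilities" advantage; the price is that interactions between causes other than this independent-failure pattern cannot be expressed.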


    Learning BN (from case data)

    • Needs for learning

      • Difficult to construct BN by humans (esp. CPT)

      • Experts’ opinions are often biased, inaccurate, and incomplete

      • Large databases of cases become available

    • What to learn

      • Parameter learning: learning CPT when DAG is known (easy)

      • Structural learning: learning DAG (hard)

    • Difficulties in learning DAG from case data

      • There are too many possible DAGs when the # of variables is large (more than exponential)

        n     # of possible DAGs
        3     25
        10    4*10^18

      • Missing values in database

      • Noisy data
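The DAG counts in the table can be reproduced with Robinson's recurrence for the number of labeled DAGs on n nodes, a(n) = Σ_{k=1..n} (-1)^(k+1) C(n, k) 2^(k(n-k)) a(n-k) with a(0) = 1. The recurrence is not part of the presentation; it is brought in here only to check the numbers.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))

for n in (3, 10):
    print(n, count_dags(n))
```

The count for n = 3 is 25 and the count for n = 10 is on the order of 4*10^18, matching the table, so exhaustive search over structures is hopeless beyond a handful of variables.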


    BN Learning Approaches

    • Early effort: Based on variable dependencies (Pearl)

      • Find all pairs of variables that are dependent on each other (applying standard statistical methods to the database)

      • Eliminate (as much as possible) indirect dependencies

      • Determine directions of dependencies

      • Learning results are often incomplete (learned BN contains indirect dependencies and undirected links)


    BN Learning Approaches

    • Bayesian approach (Cooper)

      • Find the most probable DAG, given database DB, i.e.,

        max(P(DAG|DB)) or max(P(DAG, DB))

      • Based on some assumptions, a formula is developed to compute P(DAG, DB) for a given pair of DAG and DB

      • A hill-climbing algorithm (K2) is developed to search for a (sub)optimal DAG

      • Compute CPTs after the DAG is determined

      • Extensions to handle some form of missing values


    BN Learning Approaches

    • Minimum description length (MDL) (Lam, etc.)

      • Sacrifices accuracy for simpler (less dense) structure

        • Case data not always accurate

        • Outliers are hard to model (needs more links)

        • Fewer links imply smaller CPD tables and less expensive inference

      • L = L1 + L2 where

        • L1: the length of the encoding of DAG (smaller for simpler DAG)

        • L2: the length of the encoding of the difference between DAG and DB (smaller for better match of DAG with DB)

        • Smaller L1 implies less accurate DAG, and thus larger L2

      • Find the DAG that minimizes L by heuristic best-first search


    BN Learning Approaches

    • Neural network approach (Neal, Peng)

      • For noisy-or BN

      • Change inter-node link strength locally, following gradient descent approach to maximize L.



    Current research in BN

    • Missing data

      • Missing value: EM (expectation maximization)

      • Missing (hidden) variables are harder to handle

    • BN with time

      • Dynamic BN: assuming temporal relations obey a Markov chain

    • Cyclic relations

      • Often found in social-economic analysis

      • Using dynamic BN?

    • Continuous variable

      • Some work on variables obeying Gaussian distribution

    • Connecting to other fields

      • Databases; Statistics; Symbolic AI (FOL); Semantic web

    • Reasoning with uncertain evidence

      • Virtual evidence

      • Soft evidence


    Other formalisms for Uncertainty

    Fuzzy sets and fuzzy logic

    • Ordinary set theory

      • There are sets that are described by vague linguistic terms (sets without hard, clearly defined boundaries), e.g., tall-person, fast-car

        • Continuous

        • Subjective (context dependent)

        • Hard to define a clear-cut 0/1 membership function


    • Fuzzy set theory

      height(john) = 6’5”     Tall(john) = 0.9

      height(harry) = 5’8”    Tall(harry) = 0.5

      height(joe) = 5’1”      Tall(joe) = 0.1

      • Examples of membership functions

        [Figure: membership functions (values from 0 to 1) plotted against age for the set of teenagers, the set of young people, and the set of mid-age people]


    • Fuzzy logic: many-valued logic

      • Fuzzy predicates (degree of truth)

      • Connectors/Operators

    • Compare with probability theory

      • Prob.: uncertainty of outcome

        • Based on large # of repetitions or instances

        • For each experiment (instance), the outcome is either true or false (without uncertainty or ambiguity): unsure before it happens but sure after it happens

      • Fuzzy: vagueness of conceptual/linguistic characteristics

        • Unsure even after it happens

        • E.g., whether a child of a tall mother and a short father is tall: unsure before the child is born, and still unsure after the child has grown up (height = 5’6”)


    • Empirical vs subjective (testable vs agreeable) , if we assume all hypotheses are ME and EXH

    • Fuzzy set connectors may lead to unreasonable results

      • Consider two events A and B with P(A) < P(B)

      • If A => B (or A  B) then

        P(A ^ B) = P(A) = min{P(A), P(B)}

        P(A v B) = P(B) = max{P(A), P(B)}

      • Not the case in general

        P(A ^ B) = P(A)P(B|A) ≤ P(A)

        P(A v B) = P(A) + P(B) – P(A ^ B) ≥ P(B)

        (equality holds only if P(B|A) = 1, i.e., A => B)
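The comparison above can be checked numerically. The sketch below uses a hypothetical sample space (one roll of a fair die) and exact fractions; it computes P(A ^ B) and P(A v B) directly and sets them beside the min/max values that fuzzy-style connectors would produce. The events are invented for illustration.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
OMEGA = frozenset(range(1, 7))


def prob(event: frozenset) -> Fraction:
    """Probability of an event under the uniform distribution on OMEGA."""
    return Fraction(len(event & OMEGA), len(OMEGA))


def compare(a: frozenset, b: frozenset) -> dict:
    """Exact P(A ^ B) and P(A v B) next to min/max of the two marginals."""
    return {
        "P(A^B)": prob(a & b),
        "min": min(prob(a), prob(b)),
        "P(AvB)": prob(a | b),
        "max": max(prob(a), prob(b)),
    }


# A is a subset of B (A => B): min and max give the exact answers.
nested = compare(frozenset({2}), frozenset({2, 4, 6}))

# A and B only overlap: min overestimates P(A ^ B), max underestimates P(A v B).
overlap = compare(frozenset({1, 2}), frozenset({2, 3, 4}))

print(nested)
print(overlap)
```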

    • Something prob. theory cannot represent

      • Tall(john) = 0.9, ~Tall(john) = 0.1

        Tall(john) ^ ~Tall(john) = min{0.1, 0.9} = 0.1

        john’s degree of membership in the fuzzy set of “medium-height people” (both Tall and not-Tall)

      • In prob. theory: P(john ∈ Tall ^ john ∉ Tall) = 0
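A minimal sketch of the fuzzy connectors used in the example above, assuming min for conjunction, max for disjunction, and 1 - x for negation (the values 0.9 and 0.1 for john are consistent with that choice). The function names are illustrative.

```python
def fuzzy_not(x: float) -> float:
    """Degree of truth of ~P, given the degree of truth x of P."""
    return 1.0 - x


def fuzzy_and(x: float, y: float) -> float:
    """Conjunction as the minimum of the two degrees."""
    return min(x, y)


def fuzzy_or(x: float, y: float) -> float:
    """Disjunction as the maximum of the two degrees."""
    return max(x, y)


tall_john = 0.9
not_tall_john = fuzzy_not(tall_john)

# Unlike in probability theory, "Tall and not Tall" need not be 0.
print(round(fuzzy_and(tall_john, not_tall_john), 2))
# Likewise, "Tall or not Tall" need not be 1.
print(round(fuzzy_or(tall_john, not_tall_john), 2))
```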


    Uncertainty in rule-based systems

    • Elements in Working Memory (WM) may be uncertain because

      • Case input (initial elements in WM) may be uncertain

        Ex: the CD-Drive does not work 70% of the time

      • Decision from a rule application may be uncertain even if the rule’s conditions are met by WM with certainty

        Ex: flu => sore throat with high probability

    • Combining symbolic rules with numeric uncertainty: Mycin’s

      Certainty Factor (CF)

      • An early attempt to incorporate uncertainty into KB systems

      • CF ∈ [-1, 1]

      • Each element in WM is associated with a CF: certainty of that assertion

      • Each rule C1,...,Cn => Conclusion is associated with a CF: certainty of the association (between C1,...Cn and Conclusion).


    • CF propagation:

      • Within a rule: if each condition Ci has certainty CFi, then the certainty of the rule’s Conclusion (Action) is

        min{CF1,...,CFn} * CF-of-the-rule

      • When more than one rule can apply to the current WM for the same Conclusion with different CFs, the largest of these CFs is assigned as the CF for Conclusion

      • Similar to fuzzy rule for conjunctions and disjunctions
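The two propagation steps described above can be sketched as follows. This covers only the simplified scheme stated here (min over the conditions times the rule's CF, then max over competing rules); the rules and CF values in the example are hypothetical.

```python
from typing import Iterable, Sequence


def rule_cf(condition_cfs: Sequence[float], cf_of_rule: float) -> float:
    """CF of a rule's conclusion: the weakest condition scaled by the rule's own CF."""
    return min(condition_cfs) * cf_of_rule


def combine_same_conclusion(conclusion_cfs: Iterable[float]) -> float:
    """When several rules support the same conclusion, keep the largest CF."""
    return max(conclusion_cfs)


# Hypothetical working memory: two rules that both conclude "flu".
cf_from_rule_1 = rule_cf([0.8, 0.6], cf_of_rule=0.9)  # conditions C1, C2
cf_from_rule_2 = rule_cf([0.7], cf_of_rule=0.5)       # single condition C3

cf_flu = combine_same_conclusion([cf_from_rule_1, cf_from_rule_2])
print(round(cf_from_rule_1, 2), round(cf_from_rule_2, 2), round(cf_flu, 2))
```

Taking the minimum inside a rule and the maximum across rules mirrors the fuzzy treatment of conjunction and disjunction.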

    • Strengths of Mycin’s CF method

      • Easy to use

      • CF operations are reasonable in many applications

      • Probably the only method for uncertainty used in real-world rule-based systems

    • Limitations

      • It is in essence an ad hoc method (it can be viewed as a probabilistic inference system with some strong, sometimes unreasonable assumptions)

      • May produce counter-intuitive results.


    Dempster-Shafer theory

    • A variation of Bayes’ theorem to represent ignorance

    • Uncertainty and ignorance

      • Suppose two events A and B are ME and EXH, and we are given evidence E

        A: having cancer B: not having cancer E: smoking

      • By Bayes’ theorem: our beliefs on A and B, given E, are measured by P(A|E) and P(B|E), and P(A|E) + P(B|E) = 1

      • In reality,

        I may have some belief in A, given E

        I may have some belief in B, given E

        I may have some belief not committed to either one

      • The uncommitted belief (ignorance) should not be given to either A or B, even though I know one of the two must be true, but rather it should be given to “A or B”, denoted {A, B}

      • Uncommitted belief may be given to A and B when new evidence is discovered


    • Representing ignorance

      • Ex: q = {A,B,C}

        m(.) assigned to each subset of q (lattice of subsets):

        {A,B,C} 0.15

        {A,B} 0.1   {A,C} 0.1   {B,C} 0.05

        {A} 0.1   {B} 0.2   {C} 0.3

        {} 0

    • Belief function



    • Plausibility (upper bound of belief of a node)

      • Lower bound of the belief interval: known belief

      • Upper bound of the belief interval: maximally possible belief
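A minimal sketch of how belief and plausibility can be computed from m(.), assuming the standard Dempster-Shafer definitions: the belief of a set S sums m over all subsets of S, and the plausibility of S sums m over all sets that intersect S. The mass values are those of the q = {A,B,C} example; the function and variable names are illustrative.

```python
from typing import Dict, FrozenSet, Iterable

Mass = Dict[FrozenSet[str], float]

# m(.) for q = {A, B, C}, taken from the lattice example.
M: Mass = {
    frozenset("ABC"): 0.15,
    frozenset("AB"): 0.1,
    frozenset("AC"): 0.1,
    frozenset("BC"): 0.05,
    frozenset("A"): 0.1,
    frozenset("B"): 0.2,
    frozenset("C"): 0.3,
}


def belief(m: Mass, s: Iterable[str]) -> float:
    """Lower bound: total mass committed to subsets of s."""
    target = frozenset(s)
    return sum(v for k, v in m.items() if k <= target)


def plausibility(m: Mass, s: Iterable[str]) -> float:
    """Upper bound: total mass not committed against s (sets that intersect s)."""
    target = frozenset(s)
    return sum(v for k, v in m.items() if k & target)


for node in ["A", "AB", "ABC"]:
    print(node, round(belief(M, node), 2), round(plausibility(M, node), 2))
```

Under these definitions the belief interval of a node is [belief, plausibility], and plausibility(S) equals 1 minus the belief of the complement of S.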


    q = {A, B}: A: having cancer; B: not having cancer

    m(.) from two pieces of evidence:

    E1: {A,B} 0.3   {A} 0.2   {B} 0.5   {} 0

    E2: {A,B} 0.1   {A} 0.7   {B} 0.2   {} 0

    • Evidence combination (how to use D-S theory)

      • Each piece of evidence has its own m(.) function for the same q

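      • Belief based on combined evidence can be computed from

        m(S) = [ sum over X ∩ Y = S of m1(X) m2(Y) ] / [ 1 – sum over X ∩ Y = {} of m1(X) m2(Y) ]

        (the denominator is the normalization factor; the sum over X ∩ Y = {} collects the incompatible combinations)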


    Example: m(.) for each piece of evidence and for the combined evidence

    E1:      {A,B} 0.3     {A} 0.2     {B} 0.5     {} 0

    E2:      {A,B} 0.1     {A} 0.7     {B} 0.2     {} 0

    E1 ^ E2: {A,B} 0.049   {A} 0.607   {B} 0.344   {} 0


    • Ignorance is reduced

      from m1({A,B}) = 0.3 to m({A,B}) = 0.049

    • Belief interval is narrowed

      A: from [0.2, 0.5] to [0.607, 0.656]

      B: from [0.5, 0.8] to [0.344, 0.393]
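The numbers in this example can be reproduced with a short sketch of Dempster's rule of combination: multiply the masses of every pair of sets, credit the product to their intersection, discard the products whose intersection is empty (the incompatible combinations), and renormalize. The mass values are those of E1 and E2 above; the function and variable names are illustrative.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]

M1: Mass = {frozenset("AB"): 0.3, frozenset("A"): 0.2, frozenset("B"): 0.5}  # E1
M2: Mass = {frozenset("AB"): 0.1, frozenset("A"): 0.7, frozenset("B"): 0.2}  # E2


def combine(m1: Mass, m2: Mass) -> Mass:
    """Combine two mass functions defined over the same q."""
    raw: Mass = {}
    conflict = 0.0
    for x, vx in m1.items():
        for y, vy in m2.items():
            z = x & y
            if z:
                raw[z] = raw.get(z, 0.0) + vx * vy
            else:
                conflict += vx * vy  # incompatible combination
    if conflict >= 1.0:
        raise ValueError("the two pieces of evidence are totally conflicting")
    # Dividing by (1 - conflict) is the normalization step.
    return {s: v / (1.0 - conflict) for s, v in raw.items()}


def belief(m: Mass, s: str) -> float:
    """Total mass committed to subsets of s."""
    return sum(v for k, v in m.items() if k <= frozenset(s))


def plausibility(m: Mass, s: str) -> float:
    """Total mass of the sets that intersect s."""
    return sum(v for k, v in m.items() if k & frozenset(s))


M12 = combine(M1, M2)
for node in ["A", "B"]:
    print(node, round(belief(M12, node), 3), round(plausibility(M12, node), 3))
```

Rounded to three decimals, the combined masses and the belief intervals for A and B agree with the values listed above.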

  • Advantages

    • The only formal theory about ignorance

    • Disciplined way to handle evidence combination

  • Disadvantages

    • Computationally very expensive (lattice size 2^|q|)

    • Assuming hypotheses are ME and EXH

    • How to obtain m(.) for each piece of evidence is not clear, except subjectively

