
Information Mining with Relational and Possibilistic Graphical Models


Presentation Transcript


  1. Information Mining with Relational and Possibilistic Graphical Models

  2. Example: Continuously Adapting Gear Shift Schedule in VW New Beetle

  3. Continuously Adapting Gear Shift Schedule: Technical Details • Mamdani controller with 7 rules • optimized program • 24 byte RAM on Digimat • 702 byte ROM • runtime 80 ms; 12 times per second a new sport factor is assigned • How to find suitable rules? (AG4)

  4. Information Mining • Information mining is the non-trivial process of identifying valid, novel, potentially useful, and understandable information and patterns in heterogeneous information sources. • Information sources include: databases, expert background knowledge, textual descriptions, images, sounds, ...

  5. Information Mining

  6. Example: Line Filtering • Extraction of edge segments (Burns' operator) • Production net: edges → lines → long lines → parallel lines → runways

  7. SOMAccess V1.0 • Available on CD-ROM: G. Hartmann, A. Nölle, M. Richards, and R. Leitinger (eds.), Data Utilization Software Tools 2 (DUST-2 CD-ROM), Copernicus Gesellschaft e.V., Katlenburg-Lindau, 2000 (ISBN 3-9804862-3-0)

  8. Current Research Topics
  • Multi-dimensional data analysis: data warehouses and OLAP (on-line analytical processing)
  • Association, correlation, and causality analysis
  • Classification: scalability and new approaches
  • Clustering and outlier analysis
  • Sequential patterns and time-series analysis
  • Similarity analysis: curves, trends, images, texts, etc.
  • Text mining, web mining, and weblog analysis
  • Spatial, multimedia, and scientific data analysis
  • Data preprocessing and database completion
  • Data visualization and visual data mining
  • Many others, e.g. collaborative filtering

  9. Fuzzy Methods in Information Mining (here: exploiting quantitative and qualitative information) • Fuzzy data analysis (projects with Siemens) • Dependency analysis (project with Daimler)

  10. Analysis of Imprecise Data • A database with attributes A, B, C containing imprecise values (e.g. "large", "very large", "about 7", the interval [3, 4]) is turned into a fuzzy database by linguistic modeling. • Computing with words: statistics with fuzzy sets, e.g. the mean of attribute A. • Linguistic approximation of the result: "the mean w.r.t. A is approximately 5".

  11. Fuzzy Data Analysis • Strong law of large numbers (Ralescu, Klement, Kruse, Miyakoshi, ...): Let {x_k | k ≥ 1} be independent and identically distributed fuzzy random variables such that E‖supp x₁‖ < ∞. Then the sample means (x₁ ⊕ ... ⊕ x_n) / n converge almost surely to the expected value of x₁ (with respect to a suitable metric on fuzzy sets). • Books: Kruse, Meyer: Statistics with Vague Data, Reidel, 1987; Bandemer, Näther: Fuzzy Data Analysis, Kluwer, 1992; Seising, Tanaka and Guo, Wolkenhauer, Viertl, ...

  12. Analysis of the Daimler/Chrysler Database • Database: ~18,500 passenger cars, > 100 attributes per car • Analysis of dependencies between special equipment and faults • Results are used as a starting point for technical experts looking for causes.

  13. Bayesian Networks

  14. Example: Genotype Determination of Jersey Cattle • 22 variables, state space ≈ 6 × 10¹³, 324 parameters • Graphical model: node ≙ random variable, edge ≙ conditional dependency • Decomposition of the joint distribution • Diagnosis: computing P(· | knowledge) • Example nodes: phenogroup 1 (3 values), phenogroup 2 (3 values), genotype (6 values)

  15. Learning Graphical Models (local models)

  16. The Learning Problem

  17. Information Mining • Imprecise data (18,500 passenger cars, 130 attributes per car) → fuzzy database via linguistic modeling • Computing with words • Learning a relational/possibilistic graphical model • Rule generation, e.g.: IF air conditioning AND electric roof top THEN more battery faults

  18. A Simple Example • Example world: a relation listing the color, shape, and size of ten simple geometric objects (shown as colored symbols on the slide; sizes: small, medium, large) • One object is chosen at random and examined. • Inferences are drawn about the unobserved attributes.

  19. The Reasoning Space • Geometric interpretation: the relation is embedded in the three-dimensional space spanned by color, shape, and size; each cube represents one tuple.

  20. Prior Knowledge and Its Projections (figure: the three-dimensional relation and its projections to the color-shape and shape-size subspaces)

  21. Cylindrical Extensions and Their Intersection • Intersecting the cylindrical extensions of the projection to the subspace formed by color and shape and of the projection to the subspace formed by shape and size yields the original three-dimensional relation.

  22. Reasoning • Let it be known (e.g. from an observation) that the given object is green. This information considerably reduces the space of possible value combinations. • From the prior knowledge it follows that the given object must be either a triangle or a square, and either medium or large.

  23. Reasoning with Projections • The same result can be obtained using only the projections to the subspaces, without reconstructing the original three-dimensional space: the evidence on color is combined with the color-shape projection, projected to shape, combined with the shape-size projection, and projected to size. • This justifies a network representation with the chain color - shape - size.
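To make the propagation scheme concrete, here is a minimal Python sketch of relational reasoning with projections. The ten tuples are hypothetical stand-ins for the objects on the slides (the original symbols are not reproduced); only the project / join / condition mechanism is the point.

```python
# Minimal sketch of relational reasoning with projections (cf. slides 18-23).
# The ten tuples are hypothetical stand-ins for the geometric objects on the
# slides; only the project / extend / intersect mechanism matters here.

ATTRS = ("color", "shape", "size")
relation = {
    ("green", "triangle", "medium"), ("green", "square", "large"),
    ("red", "circle", "small"),      ("red", "triangle", "medium"),
    ("blue", "square", "medium"),    ("blue", "circle", "small"),
    ("yellow", "triangle", "large"), ("yellow", "circle", "medium"),
    ("red", "square", "medium"),     ("blue", "triangle", "medium"),
}

def project(rel, attrs):
    """Project a relation (set of tuples over ATTRS) onto a subset of attributes."""
    idx = [ATTRS.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in rel}

def join(rel_cs, rel_ss):
    """Intersect the cylindrical extensions of the color-shape and shape-size
    projections, i.e. compute their natural join."""
    return {(c, sh, sz) for (c, sh) in rel_cs for (sh2, sz) in rel_ss if sh == sh2}

# Decomposition into the subspaces {color, shape} and {shape, size}.
cs = project(relation, ("color", "shape"))
ss = project(relation, ("shape", "size"))

# Evidence: the object is green. Reason with the projections only ...
cs_green = {t for t in cs if t[0] == "green"}
possible = join(cs_green, ss)
print(project(possible, ("shape",)), project(possible, ("size",)))

# ... and compare with conditioning the full three-dimensional relation:
direct = {t for t in relation if t[0] == "green"}
print(possible >= direct)  # the projection-based result never loses tuples
```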

  24. Interpretation of Graphical Models • Relational graphical model ≙ decomposition + local models (the decomposition can be described by a graph or hypergraph over the attributes color, shape, size) • Learning a relational graphical model ≙ searching for a suitable decomposition + local relations

  25. Genotype Determination of Danish Jersey Cattle

  26. Qualitative Knowledge

  27. Example: Genotype Determination of Jersey Cattle • 22 variables, state space ≈ 6 × 10¹³, 324 parameters • Graphical model: node ≙ random variable, edge ≙ conditional dependency • Decomposition of the joint distribution • Diagnosis: computing P(· | knowledge) • Example nodes: phenogroup 1 (3 values), phenogroup 2 (3 values), genotype (6 values)

  28. Learning Graphical Models from Data
  • Test whether a distribution is decomposable w.r.t. a given graph. This is the most direct approach. It is not bound to a graphical representation, but can also be carried out w.r.t. other representations of the set of subspaces used to compute the (candidate) decomposition of the given distribution. (A small relational sketch of this test follows below.)
  • Find an independence map by conditional independence tests. This approach exploits the theorems that connect conditional independence graphs and graphs that represent decompositions. It has the advantage that a single conditional independence test, if it fails, can exclude several candidate graphs.
  • Find a suitable graph by measuring the strength of dependences. This is a heuristic, but often highly successful approach, based on the frequently valid assumption that in a distribution that is decomposable w.r.t. a graph an attribute depends more strongly on adjacent attributes than on attributes that are not directly connected to it.
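A minimal sketch of the first approach in the relational setting, assuming a relation is given as a set of value tuples: a relation is decomposable w.r.t. two subspaces exactly if the natural join of its projections reproduces the relation. The example tuples are hypothetical.

```python
# Sketch of the direct decomposability test in the relational case: a relation
# over attributes (A, B, C) is decomposable w.r.t. the subspaces {A, B} and
# {B, C} iff the natural join of its projections onto these subspaces equals
# the relation itself.

def project(rel, idx):
    return {tuple(t[i] for i in idx) for t in rel}

def decomposable(rel):
    """Test decomposability w.r.t. the subspaces {A, B} and {B, C}."""
    ab, bc = project(rel, (0, 1)), project(rel, (1, 2))
    joined = {(a, b, c) for (a, b) in ab for (b2, c) in bc if b == b2}
    return joined == rel

print(decomposable({(1, 1, 1), (1, 1, 2), (2, 1, 1), (2, 1, 2)}))  # True
print(decomposable({(1, 1, 1), (2, 1, 2)}))  # False: the join adds (1,1,2) and (2,1,1)
```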

  29. Is Decomposition Always Possible? (figure: two relations over color, shape, and size, marked 1 and 2)

  30. Direct Test for Decomposability (figure: the eight candidate graphs over the attributes color, shape, and size, numbered 1-8)

  31. Evaluation Measures and Search Methods
  • An exhaustive search over all graphs is too expensive: there are 2^(n(n-1)/2) possible undirected graphs for n attributes, and the number of possible directed acyclic graphs grows even faster (super-exponentially in n).
  • Therefore all learning algorithms consist of an evaluation measure (scoring function), e.g. Hartley information gain or the relative number of occurring value combinations, and a (heuristic) search method, e.g. guided random search, greedy search (K2 algorithm), or conditional independence search.

  32. Measuring the Strengths of Marginal Dependences
  • Relational networks: find a set of subspaces for which the intersection of the cylindrical extensions of the projections to these subspaces contains as few additional states as possible.
  • The size of the intersection depends on the sizes of the cylindrical extensions, which in turn depend on the sizes of the projections.
  • Therefore it is plausible to use the relative number of occurring value combinations to assess the quality of a subspace.
  • The relational network can be obtained by interpreting the relative numbers as edge weights and constructing the minimum weight spanning tree (see the sketch below).

  subspace        possible combinations   occurring combinations   relative number
  color × shape           12                       6                    50%
  shape × size             9                       5                    56%
  size × color            12                       8                    67%
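A small Python sketch of this procedure, using a hypothetical relation over color, shape, and size: each two-attribute subspace is scored by its relative number of occurring value combinations, and Kruskal's algorithm keeps the lowest-weight edges as the spanning tree.

```python
# Sketch: score each two-attribute subspace by the relative number of occurring
# value combinations and keep the lowest-weight edges (minimum weight spanning
# tree) as the relational network. The relation is a hypothetical stand-in.

from itertools import combinations

ATTRS = ("color", "shape", "size")
relation = {
    ("green", "triangle", "medium"), ("green", "triangle", "large"),
    ("red", "square", "small"),      ("red", "square", "medium"),
    ("blue", "circle", "small"),     ("blue", "triangle", "medium"),
}

def relative_combinations(rel, i, j):
    """Occurring (value_i, value_j) combinations relative to all possible ones."""
    occurring = {(t[i], t[j]) for t in rel}
    possible = len({t[i] for t in rel}) * len({t[j] for t in rel})
    return len(occurring) / possible

# Score every two-attribute subspace ...
weights = {(ATTRS[i], ATTRS[j]): relative_combinations(relation, i, j)
           for i, j in combinations(range(len(ATTRS)), 2)}

# ... and build the minimum weight spanning tree (Kruskal's algorithm).
parent = {a: a for a in ATTRS}
def find(a):
    while parent[a] != a:
        a = parent[a]
    return a

tree = []
for u, v in sorted(weights, key=weights.get):
    if find(u) != find(v):
        parent[find(u)] = find(v)
        tree.append((u, v))

print({e: round(w, 2) for e, w in weights.items()})
print(tree)  # with this data: the chain color - shape - size, as on the slide
```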

  33. Conditional Independence Tests
  • Illustration: the Hartley information needed to determine the coordinates of a point separately is log₂ 4 + log₂ 3 = log₂ 12 ≈ 3.58 bit; to determine the coordinate pair within the occurring combinations only log₂ 6 ≈ 2.58 bit; the gain is log₂ 12 - log₂ 6 = log₂ 2 = 1 bit.
  • Definition: Let A and B be two attributes and R a discrete possibility measure with ∃a ∈ dom(A): ∃b ∈ dom(B): R(A = a, B = b) = 1. Then
    I_gain(A, B) = log₂ |{a ∈ dom(A) | R(A = a) = 1}| + log₂ |{b ∈ dom(B) | R(B = b) = 1}| - log₂ |{(a, b) | R(A = a, B = b) = 1}|
    is called the Hartley information gain of A and B w.r.t. R.
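The following sketch computes the Hartley information gain from the set of occurring value combinations of two attributes. The point configuration used to reproduce the 1 bit example is hypothetical; any 6 occurring combinations that still use all 4 x-values and all 3 y-values give the same gain.

```python
# Sketch: Hartley information gain of two attributes A and B from the set of
# occurring value combinations:
#   log2 |dom(A)| + log2 |dom(B)| - log2 |occurring (A, B) combinations|,
# where dom(.) contains the values that actually occur in the relation.

from math import log2

def hartley_gain(pairs):
    """pairs: set of occurring (a, b) value combinations."""
    a_vals = {a for a, _ in pairs}
    b_vals = {b for _, b in pairs}
    return log2(len(a_vals)) + log2(len(b_vals)) - log2(len(pairs))

# A full 4 x 3 grid of combinations: the attributes tell nothing about each other.
grid = {(x, y) for x in range(4) for y in range(3)}
print(hartley_gain(grid))  # 0.0

# Six occurring combinations on the same grid (hypothetical configuration):
occupied = {(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2)}
print(hartley_gain(occupied))  # log2 12 - log2 6 = 1.0 bit
```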

  34. Conditional Independence Tests (continued)
  • The Hartley information gain can be used directly to test for (approximate) marginal independence.
  • In order to test for (approximate) conditional independence: compute the Hartley information gain for each possible instantiation of the conditioning attributes, and aggregate the results over all possible instantiations, for instance by simply averaging them (see the sketch below).
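As announced above, a small sketch of the conditional variant: the Hartley information gain is computed for every value of the conditioning attribute and the results are averaged (an unweighted mean, one possible aggregation). The relation is hypothetical.

```python
# Sketch: (approximate) conditional independence test with the Hartley
# information gain. The gain of A and B is computed for every value c of the
# conditioning attribute C and the results are averaged (unweighted mean).

from math import log2

def hartley_gain(pairs):
    a_vals = {a for a, _ in pairs}
    b_vals = {b for _, b in pairs}
    return log2(len(a_vals)) + log2(len(b_vals)) - log2(len(pairs))

def conditional_hartley_gain(triples):
    """triples: set of occurring (a, b, c) combinations; C is the condition."""
    c_vals = {c for _, _, c in triples}
    gains = [hartley_gain({(a, b) for a, b, c2 in triples if c2 == c})
             for c in c_vals]
    return sum(gains) / len(gains)

# Hypothetical relation: A and B are dependent marginally but become
# independent once C is fixed.
rel = {(a, b, c) for c in (0, 1) for a in (c, c + 1) for b in (c, c + 1)}
print(conditional_hartley_gain(rel))  # 0.0 -> (approximately) cond. independent
```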

  35. Direct Test for Decomposability
  • Definition: Let p₁ and p₂ be two strictly positive probability distributions on the same set E of events. Then
    I_div(p₁, p₂) = Σ_{e ∈ E} p₁(e) log₂ ( p₁(e) / p₂(e) )
    is called the Kullback-Leibler information divergence of p₁ and p₂.
  • The Kullback-Leibler information divergence is non-negative, and it is zero if and only if p₁ ≡ p₂.
  • Therefore it is plausible that this measure can be used to assess the quality of the approximation of a given multi-dimensional distribution p₁ by the distribution p₂ that is represented by a given graph: the smaller the value of this measure, the better the approximation.
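A short sketch of this test for the probabilistic case: the Kullback-Leibler divergence between a joint distribution and the approximation induced by the chain graph A - B - C. The distribution values are hypothetical.

```python
# Sketch: Kullback-Leibler information divergence between a three-dimensional
# distribution p1 and the approximation p2 induced by the decomposition
# p2(a,b,c) = p1(a,b) * p1(b,c) / p1(b), i.e. the chain graph A - B - C.

from collections import defaultdict
from math import log2

def marginal(p, idx):
    m = defaultdict(float)
    for t, prob in p.items():
        m[tuple(t[i] for i in idx)] += prob
    return m

def kl_divergence(p1, p2):
    return sum(prob * log2(prob / p2[t]) for t, prob in p1.items())

# Hypothetical joint distribution over three binary attributes.
p1 = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.20, (1, 1, 1): 0.05,
}
ab, bc, b = marginal(p1, (0, 1)), marginal(p1, (1, 2)), marginal(p1, (1,))
p2 = {t: ab[t[:2]] * bc[t[1:]] / b[(t[1],)] for t in p1}

print(round(kl_divergence(p1, p2), 3))  # 0 iff p1 is decomposable w.r.t. A - B - C
```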

  36. Direct Test for Decomposability (continued)
  • For each of the eight candidate graphs over the attributes A, B, C the figure lists two numbers: the Kullback-Leibler information divergence of the original distribution and its approximation, and the binary logarithm of the probability of an example database (log-likelihood of the data).
  • The value pairs shown are: 0.137 / -4612, 0.429 / -4830, 0.566 / -5041, 0.540 / -4991, 0.111 / -4563, 0.402 / -4780, 0 / -4401, 0 / -4401.
  • The smaller the divergence, the higher the log-likelihood; the two graphs with divergence 0 represent exact decompositions.

  37. Evaluation Measures / Scoring Functions
  • Relational networks: relative number of occurring value combinations, Hartley information gain
  • Probabilistic networks: χ² measure, mutual information / cross entropy / information gain, (symmetric) information gain ratio, (symmetric/modified) Gini index, Bayesian measures (g-function, BDeu metric), other measures known from decision tree induction

  38. A Probabilistic Evaluation Measure
  • Mutual information / cross entropy / information gain, based on Shannon entropy.
  • Idea: measure how far the joint distribution deviates from the product of the marginals:
    I(A; B) = Σ_{a,b} P(A = a, B = b) log₂ ( P(A = a, B = b) / (P(A = a) P(B = b)) )
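A minimal sketch computing this measure from a joint probability table; the two example distributions are hypothetical and only illustrate that the measure vanishes for independent attributes.

```python
# Sketch: mutual information / Shannon information gain of two attributes,
#   I(A; B) = sum_{a,b} P(a,b) * log2( P(a,b) / (P(a) * P(b)) ).

from collections import defaultdict
from math import log2

def mutual_information(joint):
    """joint: dict mapping (a, b) to P(a, b)."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

print(round(mutual_information({(0, 0): 0.4, (0, 1): 0.1,
                                (1, 0): 0.1, (1, 1): 0.4}), 3))  # 0.278: dependent
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25,
                          (1, 0): 0.25, (1, 1): 0.25}))          # 0.0: independent
```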

  39. Possibility Theory • A fuzzy set induces a possibility distribution (possibility axioms).

  40. Possibility Distributions and the Context Model
  • Let W be the set of all possible states of the world and w₀ the actual (but unknown) state.
  • Let C = {c₁, ..., c_k} be a set of contexts (observers, frame conditions etc.) and (C, 2^C, P) a finite probability space (context weights).
  • Let Γ: C → 2^W be a set-valued mapping, assigning to each context the most specific correct set-valued specification of w₀. Γ is called a random set (since it is a set-valued random variable); the sets Γ(c) are also called focal sets.
  • The induced one-point coverage of Γ, or the induced possibility distribution, is π(w) = P({c ∈ C | w ∈ Γ(c)}).
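A small sketch of the context model, with hypothetical contexts, weights, and set-valued specifications (the mapping Γ is written gamma here): the induced possibility distribution is obtained as the one-point coverage.

```python
# Sketch of the context model: contexts with probability weights and a
# set-valued mapping (Gamma, written gamma here) from contexts to sets of
# world states; the induced possibility distribution is the one-point coverage
#   pi(w) = P({ c | w in gamma(c) }).

def one_point_coverage(weights, gamma, worlds):
    """weights: context -> probability, gamma: context -> focal set of states."""
    return {w: sum(p for c, p in weights.items() if w in gamma[c])
            for w in worlds}

# Hypothetical example: three observers give set-valued statements about w0.
worlds = ["small", "medium", "large"]
weights = {"c1": 0.5, "c2": 0.3, "c3": 0.2}
gamma = {"c1": {"medium"},
         "c2": {"medium", "large"},
         "c3": {"small", "medium", "large"}}

print(one_point_coverage(weights, gamma, worlds))
# {'small': 0.2, 'medium': 1.0, 'large': 0.5}
```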

  41. Database-Induced Possibility Distributions
  • Each imprecise tuple of an imprecise database, or, more precisely, the set of all precise tuples compatible with it, is interpreted as a focal set of a random set.
  • In the absence of other information, equal weights are assigned to the contexts.
  • In this way an imprecise database induces a possibility distribution (see the sketch below).
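As mentioned above, a sketch of how an imprecise database induces a possibility distribution: each imprecise tuple is expanded into its focal set of compatible precise tuples, the contexts receive equal weights, and the degrees of possibility are accumulated. The database is hypothetical.

```python
# Sketch: possibility distribution induced by an imprecise database. Each
# imprecise tuple lists, per attribute, the set of compatible values; its focal
# set is the Cartesian product of these sets. With equal context weights 1/N
# the degree of possibility of a precise tuple is the total weight of the
# focal sets containing it.

from collections import defaultdict
from itertools import product

database = [
    ({"green"}, {"medium"}),                   # precise tuple
    ({"green", "blue"}, {"medium", "large"}),  # imprecise in both attributes
    ({"red"}, {"small", "medium"}),
]

weight = 1.0 / len(database)
possibility = defaultdict(float)
for imprecise_tuple in database:
    for precise_tuple in product(*imprecise_tuple):  # enumerate the focal set
        possibility[precise_tuple] += weight

for t, degree in sorted(possibility.items()):
    print(t, round(degree, 2))
# ('green', 'medium') receives 2/3, every other compatible tuple 1/3
```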

  42. Reasoning (figure: the three-dimensional possibility distribution and its projections, all numbers in parts per 1000, after using the information that the given object is green)

  43. Reasoning with Projections • Again the same result can be obtained using only the projections to the subspaces, this time propagating maximal degrees of possibility: the evidence is combined with a projection by taking minima and transferred to the next attribute by taking row or column maxima (old and new marginals for color, shape, and size are shown on the slide). • This justifies a network representation with the chain color - shape - size.
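A minimal sketch of this max-min propagation; the two possibility tables are hypothetical (degrees in parts per 1000, following the slide's scale) and merely illustrate the mechanics of combining by minimum and projecting by maximum.

```python
# Sketch of max-min propagation in a possibilistic network with the chain
# color - shape - size: the observation "color = green" selects the green row
# of the color-shape table; these entries give the new marginal on shape, which
# is combined with the shape-size table by minimum and projected onto size by
# maximum. All degrees of possibility are hypothetical (parts per 1000).

shapes = ["triangle", "square", "circle"]
sizes = ["small", "medium", "large"]

pi_color_shape = {("green", "triangle"): 700, ("green", "square"): 400,
                  ("green", "circle"): 0,     ("red", "triangle"): 600,
                  ("red", "square"): 200,     ("red", "circle"): 800,
                  ("blue", "triangle"): 100,  ("blue", "square"): 900,
                  ("blue", "circle"): 300}
pi_shape_size = {("triangle", "small"): 100, ("triangle", "medium"): 700,
                 ("triangle", "large"): 300, ("square", "small"): 0,
                 ("square", "medium"): 400,  ("square", "large"): 900,
                 ("circle", "small"): 800,   ("circle", "medium"): 300,
                 ("circle", "large"): 0}

observed_color = "green"

# New marginal on shape after the observation.
pi_shape = {sh: pi_color_shape[(observed_color, sh)] for sh in shapes}

# Propagate to size: combine by minimum, project by maximum.
pi_size = {sz: max(min(pi_shape[sh], pi_shape_size[(sh, sz)]) for sh in shapes)
           for sz in sizes}

print(pi_shape)  # {'triangle': 700, 'square': 400, 'circle': 0}
print(pi_size)   # {'small': 100, 'medium': 700, 'large': 400}
```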

  44. POSSINFER

  45. Possibilistic Evaluation Measures / Scoring Functions • Specificity gain [Gebhardt and Kruse 1996, Borgelt et al. 1996] • (Symmetric) specificity gain ratio [Borgelt et al. 1996] • Analog of mutual information [Borgelt and Kruse 1997] • Analog of the χ² measure [Borgelt and Kruse 1997]

  46. Possibilistic Evaluation Measures
  • Reduction to the relational case via α-cuts: for each α-level (here 0.4, 0.3, 0.2, 0.1, 0) the α-cut of the possibility distribution is a relation, and its Hartley information gain can be computed, e.g.
    log₂ 1 + log₂ 1 - log₂ 1 = 0
    log₂ 2 + log₂ 2 - log₂ 3 ≈ 0.42
    log₂ 3 + log₂ 2 - log₂ 5 ≈ 0.26
    log₂ 4 + log₂ 3 - log₂ 8 ≈ 0.58
    log₂ 4 + log₂ 3 - log₂ 12 = 0
  • Usable relational measures: relative number of value combinations / Hartley information gain → specificity gain; number of additional value combinations in the Cartesian product of the marginal distributions

  47. Specificity Gain
  • Definition: Let A and B be two attributes and Π a possibility measure. Then
    S_gain(A, B) = ∫ [ log₂ |[Π_A]_α| + log₂ |[Π_B]_α| - log₂ |[Π_AB]_α| ] dα
    (the integral over the α-levels of the Hartley information gain of the α-cuts of the marginal and joint possibility distributions) is called the specificity gain of A and B w.r.t. Π.
  • Generalization of the Hartley information gain on the basis of the α-cut view of possibility distributions.
  • Analogous to the Shannon information gain.
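A sketch of one possible discretized reading of this definition: the Hartley information gain of the α-cuts is accumulated over the α-levels that actually occur, weighted by the width of each α step. The possibility distribution below is hypothetical.

```python
# Sketch: specificity gain via alpha-cuts. For every occurring alpha-level the
# alpha-cut of the joint possibility distribution is a relation; its Hartley
# information gain is accumulated, weighted by the width of the alpha step
# (a piecewise-constant evaluation of the integral in the definition).

from math import log2

def alpha_cut(pi, alpha):
    return {t for t, degree in pi.items() if degree >= alpha}

def specificity_gain(pi_joint):
    levels = sorted({d for d in pi_joint.values() if d > 0})
    gain, prev = 0.0, 0.0
    for alpha in levels:
        cut = alpha_cut(pi_joint, alpha)   # a relation for this alpha-level
        a_vals = {a for a, _ in cut}
        b_vals = {b for _, b in cut}
        hartley = log2(len(a_vals)) + log2(len(b_vals)) - log2(len(cut))
        gain += (alpha - prev) * hartley
        prev = alpha
    return gain

# Hypothetical two-dimensional possibility distribution (degrees in [0, 1]).
pi = {("a1", "b1"): 1.0, ("a1", "b2"): 0.4,
      ("a2", "b1"): 0.4, ("a2", "b2"): 0.8}
print(round(specificity_gain(pi), 3))  # 0.4
```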

  48. Specificity Gain in the Example (figure: for each of the three two-dimensional subspaces the projection of the possibility distribution, the minimum of the marginals, and the resulting specificity gain are shown: 0.055 bit, 0.048 bit, and 0.027 bit)

  49. Learning Graphical Models from Data (figure: the eight candidate graphs over the attributes color, shape, and size)

  50. Data Mining Tool Clementine
