
Understanding of complex data using Computational Intelligence methods

Understanding of complex data using Computational Intelligence methods. Włodzisław Duch, Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland, http://www.phys.uni.torun.pl/~duch. What am I going to say. Data and CI. What we hope for. Forms of understanding.


Presentation Transcript


  1. Understanding of complex data using Computational Intelligence methods Włodzisław Duch Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland http://www.phys.uni.torun.pl/~duch

  2. What am I going to say • Data and CI • What we hope for. • Forms of understanding. • Visualization. • Prototypes. • Logical rules. • Some knowledge discovered. • Expert system for psychometry. • Conclusions, or why am I saying this?

  3. Types of Data • Data was precious! Now it is overwhelming ... • Statistical data – clean, numerical, controlled experiments, vector space model. • Relational data – marketing, finances. • Textual data – Web, NLP, search. • Complex structures – chemistry, economics. • Sequence data – bioinformatics. • Multimedia data – images, video. • Signals – dynamic data, biosignals. • AI data – logical problems, games, behavior …

  4. Computational Intelligence: Data => Knowledge. [Diagram labels: evolutionary algorithms, pattern recognition, multivariate statistics, expert systems, fuzzy logic, machine learning, visualization, neural networks, probabilistic methods, Soft computing, Artificial Intelligence.]

  5. Turning data into knowledge What should CI methods do? • Provide descriptive and predictive non-parametric models of data. • Allow one to classify, approximate, associate, correlate, complete patterns. • Allow one to discover new categories and interesting patterns. • Help to visualize multi-dimensional relationships among data samples. • Allow one to understand the data in some way. • Help to model brains!

  6. Forms of useful knowledge AI/Machine Learning camp: Neural nets are black boxes. Unacceptable! Symbolic rules forever. But ... knowledge accessible to humans is in: • symbols, • similarity to prototypes, • images, visual representations. What type of explanation is satisfactory? Interesting question for cognitive scientists. Different answers in different fields.

  7. Data understanding • Humans remember examples of each category and refer to such examples – as similarity-based or nearest-neighbors methods do. • Humans create prototypes out of many examples – as Gaussian classifiers, RBF networks, neurofuzzy systems do. • Logical rules are the highest form of summarization of knowledge. Types of explanation: • visualization-based: maps, diagrams, relations ... • exemplar-based: prototypes and similarity; • logic-based: symbols and rules.

  8. Visualization: dendrograms All projections (cuboids) on 2D subspaces are identical, dendrograms do not show the structure. Normal and malignant lymphocytes.

  9. Visualization: 2D projections All projections (cuboids) on 2D subspaces are identical, dendrograms do not show the structure. 3-bit parity + all 5-bit combinations.

  10. Visualization: MDS mapping Results of pure MDS mapping + centers of hierarchical clusters connected. 3-bit parity + all 5-bit combinations.

  11. Visualization: 3D projections Only age is continuous, other values are binary. Fine Needle Aspirate of Breast Lesions, red=malignant, green=benign. A.J. Walker, S.S. Cross, R.F. Harrison, Lancet 1999, 394, 1518-1521

  12. Visualization: MDS mappings Try to preserve all distances in a 2D nonlinear mapping. MDS for large sets using LVQ + relative mapping: Antoine Naud + WD, this conference.

  13. Prototype-based rules C-rules (crisp) are a special case of F-rules (fuzzy rules). F-rules are a special case of P-rules (prototype rules). P-rules have the form: IF $P = \arg\min_R D(X,R)$ THEN Class(X) = Class(P). D(X,R) is a dissimilarity (distance) function, determining decision borders around prototype P. P-rules are easy to interpret! IF X=You are most similar to P=Superman THEN You are in the Super-league. IF X=You are most similar to P=Weakling THEN You are in the Failed-league. “Similar” may involve different features or D(X,P).
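
A P-rule of this form can be sketched in a few lines of Python. This is only an illustration: the two-dimensional prototype vectors, the dictionary layout, and the function names are hypothetical and do not come from the slides.

```python
import math

def euclidean(x, p):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))

def p_rule_classify(x, prototypes, distance):
    """P-rule: IF P = argmin_R D(X, R) THEN Class(X) = Class(P)."""
    nearest = min(prototypes, key=lambda proto: distance(x, proto["vector"]))
    return nearest["label"]

# Hypothetical prototypes standing in for "Superman" and "Weakling".
prototypes = [
    {"vector": (9.0, 9.0), "label": "Super-league"},
    {"vector": (1.0, 1.0), "label": "Failed-league"},
]

print(p_rule_classify((8.0, 7.0), prototypes, euclidean))
```

Passing a different `distance` function changes the decision borders without changing the rule itself.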

  14. P-rules Euclidean distance leads to Gaussian fuzzy membership functions + product as T-norm. Manhattan distance => $\mu(X;P) = \exp\{-|X-P|\}$. Various distance functions lead to different MF. Ex. data-dependent distance functions, for symbolic data:
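
The link between Euclidean distance and Gaussian membership functions combined by a product can be checked numerically. The sketch below assumes a membership of the form exp(-D²/2σ²) with a single width `sigma`; the slide does not state the scaling, so that choice is an assumption.

```python
import math

def gaussian_mf(x, p, sigma=1.0):
    """Membership from the squared Euclidean distance to prototype p."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, p))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def product_of_1d_gaussians(x, p, sigma=1.0):
    """Product (T-norm) of one-dimensional Gaussian memberships."""
    result = 1.0
    for a, b in zip(x, p):
        result *= math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))
    return result

def manhattan_mf(x, p):
    """Membership exp(-|X - P|) from the Manhattan distance."""
    return math.exp(-sum(abs(a - b) for a, b in zip(x, p)))

x, p = (0.5, 2.0, -1.0), (0.0, 1.5, 0.0)
print(gaussian_mf(x, p), product_of_1d_gaussians(x, p))
```

Both functions return the same value because the exponential of a sum of squared differences factorizes into a product over features.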

  15. Crisp P-rules New distance functions from info theory => interesting MF. Membership Functions => new distance function, with local D(X,R) for each cluster. Crisp logic rules: use the $L_\infty$ norm: $D_\infty(X,P) = \|X-P\|_\infty = \max_i W_i |X_i - P_i|$. $D_\infty(X,P) = \mathrm{const}$ => rectangular contours. The $L_\infty$ (Chebyshev) distance with threshold $\theta_P$: IF $D_\infty(X,P) \le \theta_P$ THEN $C(X)=C(P)$ is equivalent to a conjunctive crisp rule: IF $X_1 \in [P_1 - \theta_P/W_1,\ P_1 + \theta_P/W_1] \wedge \dots \wedge X_N \in [P_N - \theta_P/W_N,\ P_N + \theta_P/W_N]$ THEN $C(X)=C(P)$.
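
The equivalence between a thresholded, weighted Chebyshev distance and a conjunction of interval conditions can be illustrated as follows. This is a minimal sketch assuming strictly positive weights; the names `theta`, `distance_rule`, and `interval_rule` are chosen here for illustration.

```python
def chebyshev_distance(x, p, w):
    """Weighted Chebyshev (L-infinity) distance: max_i w_i * |x_i - p_i|."""
    return max(wi * abs(xi - pi) for xi, pi, wi in zip(x, p, w))

def distance_rule(x, p, w, theta):
    """Crisp P-rule: fires when the distance to prototype p is within theta."""
    return chebyshev_distance(x, p, w) <= theta

def interval_rule(x, p, w, theta):
    """Conjunctive rule: every x_i lies in [p_i - theta/w_i, p_i + theta/w_i]."""
    return all(pi - theta / wi <= xi <= pi + theta / wi
               for xi, pi, wi in zip(x, p, w))

p, w, theta = (0.0, 0.0), (1.0, 2.0), 1.0
for x in [(1.0, 0.5), (0.0, 0.75), (-0.5, -0.25)]:
    print(x, distance_rule(x, p, w, theta), interval_rule(x, p, w, theta))
```

The two rules agree on every point, which is why a prototype with this distance reads directly as a hyper-rectangular logical rule.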

  16. Decision borders D(P,X)=const and decision borders D(P,X)=D(Q,X). Euclidean distance from 3 prototypes, one per class. Minkowski (α=20) distance from 3 prototypes.
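
The slide does not define the Minkowski distance; the sketch below assumes the standard definition with an exponent `alpha`, and shows that a large exponent such as 20 already behaves almost like the Chebyshev (maximum) distance, which is why its contours look nearly rectangular.

```python
def minkowski(x, p, alpha):
    """Minkowski distance: (sum_i |x_i - p_i|^alpha)^(1/alpha)."""
    return sum(abs(a - b) ** alpha for a, b in zip(x, p)) ** (1.0 / alpha)

def chebyshev(x, p):
    """Limit of the Minkowski distance for alpha -> infinity."""
    return max(abs(a - b) for a, b in zip(x, p))

x, origin = (3.0, 4.0), (0.0, 0.0)
for alpha in (1, 2, 20):
    print(alpha, minkowski(x, origin, alpha))
print("max", chebyshev(x, origin))
```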

  17. P-rules for Wine $L_\infty$ distance (crisp rules): 15 prototypes kept, 5 errors, f2, f8, f10 removed. Euclidean distance: 11 prototypes kept, 7 errors. Manhattan distance: • prototypes kept, 4 errors, f2 removed. Many other solutions.

  18. Complex objects The vector space concept is not sufficient for complex objects. A common set of features is meaningless. AI: complex objects, states, problem descriptions. General approach: it is sufficient to evaluate similarity D(Oi,Oj). Compare Oi, Oj: define a transformation T built from elementary operators Wk, e.g. substring substitutions. Many transformations T connecting a pair of objects Oi and Oj exist. Cost of transformation = sum of Wk costs. Similarity: lowest transformation cost. Bioinformatics: sophisticated similarity functions for sequences. Dynamic programming finds similarities. Use adaptive costs and general framework for SBL methods. See Marczak et al. (this conference).
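
A familiar instance of "similarity as the lowest transformation cost" is the edit distance between strings, computed by dynamic programming. The sketch below uses single-character substitutions, insertions, and deletions as the elementary operators; the two cost parameters are illustrative stand-ins for adaptive costs and are not taken from the slides.

```python
def transformation_cost(a, b, sub_cost=1.0, indel_cost=1.0):
    """Cheapest sequence of substitutions/insertions/deletions turning a into b."""
    n, m = len(a), len(b)
    # cost[i][j] holds the lowest cost of transforming a[:i] into b[:j].
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * indel_cost
    for j in range(1, m + 1):
        cost[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            cost[i][j] = min(
                cost[i - 1][j] + indel_cost,        # delete a[i-1]
                cost[i][j - 1] + indel_cost,        # insert b[j-1]
                cost[i - 1][j - 1] + substitution,  # match or substitute
            )
    return cost[n][m]

print(transformation_cost("kitten", "sitting"))
```

Changing `sub_cost` or `indel_cost` changes which transformation is cheapest, which is the degree of freedom an adaptive scheme would tune.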

  19. Promoters DNA strings, 57 nucleotides, 53 + and 53 - samples. Example: tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt Euclidean distance: symbolic s = a, c, t, g replaced by x = 1, 2, 3, 4. PDF distance: symbolic s = a, c, t, g replaced by p(s|+).
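
Replacing symbols by p(s|+) could look like the sketch below. It assumes the probabilities are estimated separately at each sequence position from the positive samples; the slide does not say whether the estimate is per position or global, so this is one possible reading, and the toy sequences are made up.

```python
from collections import Counter

def fit_symbol_probabilities(positive_sequences, alphabet="acgt"):
    """Estimate p(s|+) at every position from the positive-class sequences."""
    length = len(positive_sequences[0])
    tables = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in positive_sequences)
        total = sum(counts.values())
        tables.append({s: counts[s] / total for s in alphabet})
    return tables

def encode(sequence, tables):
    """Replace each symbol by its estimated probability in the + class."""
    return [tables[pos][s] for pos, s in enumerate(sequence)]

# Toy positive samples (hypothetical, length 2 instead of 57).
positives = ["ac", "ag", "tc", "ac"]
tables = fit_symbol_probabilities(positives)
print(encode("ac", tables), encode("tg", tables))
```

After this encoding every string becomes a numeric vector, so any of the distance functions discussed earlier can be applied to it.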

  20. Logical rules Crisp logic rules: for continuous x use linguistic variables (predicate functions) $s_k(x) \equiv \mathrm{True}\ [X_k \le x \le X'_k]$, for example: small(x) = True{x | x < 1}, medium(x) = True{x | x ∈ [1,2]}, large(x) = True{x | x > 2}. Linguistic variables are used in crisp (propositional, Boolean) logic rules: IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
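
Linguistic variables of this kind are simply Boolean predicates over intervals, and a crisp rule is a conjunction of them. The sketch below uses the thresholds quoted on the slide; the function names and the "unknown" fallback, which stands in for the elided ELSE branches, are illustrative.

```python
def small(x):
    return x < 1

def medium(x):
    return 1 <= x <= 2

def large(x):
    return x > 2

def classify(height, has_hat, has_beard):
    """IF small-height AND has-hat AND has-beard THEN Brownie ELSE ..."""
    if small(height) and has_hat and has_beard:
        return "Brownie"
    return "unknown"  # placeholder for the remaining ELSE IF branches

print(classify(0.6, True, True))
print(classify(1.8, True, True))
```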

  21. Crisp logic is based on rectangular membership functions: Crisp logic decisions True/False values jump from 0 to 1. Step functions are used for partitioning of the feature space. Very simple hyper-rectangular decision borders. Severe limitation on the expressive power of crisp logical rules!

  22. Decision trees lead to specific decision borders. SSV tree on Wine data, proline + flavanoids content. DT decision borders.

  23. Logical rules - advantages Logical rules, if simple enough, are preferable. • Rules may expose limitations of black box solutions. • Only relevant features are used in rules. • Rules may sometimes be more accurate than NN and other CI methods. • Overfitting is easy to control, rules usually have a small number of parameters. • Rules forever!? A logical rule about logical rules is: IF the number of rules is relatively small AND the accuracy is sufficiently high THEN rules may be an optimal choice.

  24. Logical rules are preferred but ... Logical rules - limitations • Only one class is predicted, p(Ci|X,M) = 0 or 1; a black-and-white picture may be inappropriate in many applications. • A discontinuous cost function allows only non-gradient optimization. • Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules. • Reliable crisp rules may reject some cases as unclassified. • Interpretation of crisp rules may be misleading. • Fuzzy rules are not so comprehensible.

  25. Rules - choices Simplicity vs. accuracy. Confidence vs. rejection rate. p++ is a hit; p-+ false alarm; p+- is a miss.
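
The quantities p++, p-+, and p+- can be estimated from labeled predictions as in the sketch below. It assumes the first sign is the true class and the second the predicted class (consistent with "p+- is a miss"), and it represents a rejected case by `None`; both conventions are assumptions made for this illustration.

```python
def outcome_rates(true_labels, predictions):
    """Fractions of hits, misses, false alarms, correct rejections, rejections."""
    counts = {"p++": 0, "p+-": 0, "p-+": 0, "p--": 0, "rejected": 0}
    for truth, predicted in zip(true_labels, predictions):
        if predicted is None:          # the rule set left this case unclassified
            counts["rejected"] += 1
        else:
            counts["p" + truth + predicted] += 1
    n = len(true_labels)
    return {name: value / n for name, value in counts.items()}

rates = outcome_rates(["+", "+", "-", "-"], ["+", "-", "+", None])
print(rates)
```

Raising the confidence required before a rule fires typically lowers p+- and p-+ at the price of a higher rejection rate, which is the trade-off the slide names.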

  26. Neural networks and rules [Diagram: network with inputs Sex, Age, Smoking, ECG: ST Elevation, Pain Intensity, Pain Duration (example input values: -1, 65, 1, 5, 3, 1), input weights, output weights, and output Myocardial Infarction, ~p(MI|X) = 0.7.]

  27. Knowledge from networks Simplify networks: force most weights to 0, quantize remaining parameters, be constructive! • Regularization: mathematical technique improving predictive abilities of the network. • Result: MLP2LN neural networks that are equivalent to logical rules.

  28. MLP2LN Converts MLP neural networks into a network performing logical operations (LN). Input layer. Output: one node per class. Aggregation: better features. Linguistic units: windows, filters. Rule units: threshold logic.

  29. Learning dynamics Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.

  30. Neurofuzzy systems Fuzzy: $\mu(x) = 0, 1$ (no/yes) replaced by a degree $\mu(x) \in [0,1]$. Triangular, trapezoidal, Gaussian ... MF. Feature Space Mapping (FSM) neurofuzzy system. Neural adaptation, estimation of probability density distribution (PDF) using single hidden layer network (RBF-like) with nodes realizing separable functions. MFs in many dimensions:
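
A separable multidimensional membership function is a product of one-dimensional ones. The sketch below shows triangular and Gaussian one-dimensional functions and their product; the parameterization is a common textbook choice and is not claimed to be the one used in FSM.

```python
import math

def triangular(x, left, center, right):
    """Triangular MF, assuming left < center < right; peaks at 1 in the center."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

def gaussian(x, center, sigma):
    """Gaussian MF with value 1 at the center."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def separable_mf(x, one_dim_mfs):
    """Membership in many dimensions as a product of 1D memberships."""
    result = 1.0
    for xi, mf in zip(x, one_dim_mfs):
        result *= mf(xi)
    return result

node = [lambda v: triangular(v, 0.0, 1.0, 2.0), lambda v: gaussian(v, 3.0, 1.0)]
print(separable_mf((0.5, 3.0), node))
```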

  31. GhostMiner Philosophy GhostMiner, data mining tools from our lab. • Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer. • There is no free lunch: provide different types of tools for knowledge discovery. Decision tree, neural, neurofuzzy, similarity-based, committees. • Provide tools for visualization of data. • Support the process of knowledge discovery/model building and evaluating, organizing it into projects.

  32. Recurrence of breast cancer Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia. 286 cases: 201 no recurrence (70.3%), 85 recurrence (29.7%). Example record: no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes. 9 nominal features: age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1,2,3), breast, breast quad, radiation.

  33. Recurrence of breast cancer Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia. Many systems used, 65-78% accuracy reported. Single rule: IF (nodes-involved $\notin$ [0,2] $\wedge$ degree-malignant = 3) THEN recurrence, ELSE no-recurrence. 76.2% accuracy, only trivial knowledge in the data: highly malignant breast cancer involving many nodes is likely to strike back.
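
The single rule can be written as a small function. The sketch reads the rule as "more than 2 involved nodes and malignancy degree 3", which is what the closing remark about many involved nodes suggests, and it assumes the node count arrives as a bin label such as "0-2" or "3-5"; the helper names are hypothetical.

```python
def lower_bound(bin_label):
    """Lower edge of a bin label such as '3-5' (gives 3)."""
    return int(bin_label.split("-")[0])

def predict_recurrence(nodes_involved_bin, degree_malignant):
    """Single-rule classifier: many involved nodes AND highest malignancy."""
    many_nodes = lower_bound(nodes_involved_bin) > 2
    if many_nodes and degree_malignant == 3:
        return "recurrence"
    return "no-recurrence"

print(predict_recurrence("0-2", 2))
print(predict_recurrence("3-5", 3))
```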

  34. Recurrence - comparison.
      Method                     10xCV accuracy
      MLP2LN, 1 rule             76.2
      SSV DT, stable rules       75.7 ± 1.0
      k-NN, k=10, Canberra       74.1 ± 1.2
      MLP+backprop.              73.5 ± 9.4 (Zarndt)
      CART DT                    71.4 ± 5.0 (Zarndt)
      FSM, Gaussian nodes        71.7 ± 6.8
      Naive Bayes                69.3 ± 10.0 (Zarndt)
      Other decision trees       < 70.0

  35. Breast cancer diagnosis. Data from University of Wisconsin Hospital, Madison, collected by dr. W.H. Wolberg. 699 cases, 9 features quantized from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses Tasks: distinguish benign from malignant cases.

  36. Breast cancer rules. Data from University of Wisconsin Hospital, Madison, collected by dr. W.H. Wolberg. Simplest rule from MLP2LN, large regularization: IF uniformity of cell size < 3 THEN benign ELSE malignant. Sensitivity=0.97, Specificity=0.85. More complex NN solutions, from 10CV estimate: Sensitivity=0.98, Specificity=0.94.

  37. Breast cancer comparison.
      Method                       10xCV accuracy
      k-NN, k=3, Manh              97.0 ± 2.1 (GM)
      FSM, neurofuzzy              96.9 ± 1.4 (GM)
      Fisher LDA                   96.8
      MLP+backprop.                96.7 (Ster, Dobnikar)
      LVQ                          96.6 (Ster, Dobnikar)
      IncNet (neural)              96.4 ± 2.1 (GM)
      Naive Bayes                  96.4
      SSV DT, 3 crisp rules        96.0 ± 2.9 (GM)
      LDA (linear discriminant)    96.0
      Various decision trees       93.5-95.6

  38. Melanoma skin cancer • Collected in the Outpatient Center of Dermatology in Rzeszów, Poland. • Four types of Melanoma: benign, blue, suspicious, or malignant. • 250 cases, with almost equal class distribution. • Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5). • TDS (Total Dermatoscopy Score) - single index • Goal: hardware scanner for preliminary diagnosis.

  39. Melanoma results
      Method                       Rules   Training %   Test %
      MLP2LN, crisp rules          4       98.0 all     100
      SSV Tree, crisp rules        4       97.5±0.3     100
      FSM, rectangular f.          7       95.5±1.0     100
      knn + prototype selection    13      97.5±0.0     100
      FSM, Gaussian f.             15      93.7±1.0     95±3.6
      knn k=1, Manh, 2 features    --      97.4±0.3     100
      LERS, rough rules            21      --           96.2

  40. Antibiotic activity of pyrimidine compounds. Pyrimidines: which compound has stronger antibiotic activity? Common template, substitutions added at 3 positions, R3, R4 and R5. 27 features taken into account: polarity, size, hydrogen-bond donor or acceptor, pi-donor or acceptor, polarizability, sigma effect. Pairs of chemicals, 54 features, are compared, which one has higher activity? 2788 cases, 5-fold crossvalidation tests.

  41. Antibiotic activity - results. Pyrimidines: which compound has stronger antibiotic activity? Mean Spearman's rank correlation coefficient used: $-1 \le r_s \le +1$.
      Method                     Rank correlation
      FSM, 41 Gaussian rules     0.77±0.03
      Golem (ILP)                0.68
      Linear regression          0.65
      CART (decision tree)       0.50
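
For readers unfamiliar with the measure, Spearman's rank correlation is the Pearson correlation computed on ranks. The sketch below is a plain-Python version with average ranks for ties; it is a generic illustration and not the evaluation code behind the reported numbers.

```python
import math

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    return pearson(ranks(a), ranks(b))

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
```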

  42. Thyroid screening. [Network diagram: clinical findings (age, sex, …, TSH, T4U, T3, TT4, TBG) feed hidden units, which feed final diagnoses (normal, hypothyroid, hyperthyroid).] Garavan Institute, Sydney, Australia. 15 binary, 6 continuous features. Training: 93+191+3488. Validate: 73+177+3178. • Determine important clinical factors. • Calculate prob. of each diagnosis.

  43. Thyroid – some results. Accuracy of diagnoses obtained with different systems.
      Method                      Rules/Features   Training %   Test %
      MLP2LN optimized            4/6              99.9         99.36
      CART/SSV Decision Trees     3/5              99.8         99.33
      Best Backprop MLP           -/21             100          98.5
      Naïve Bayes                 -/-              97.0         96.1
      k-nearest neighbors         -/-              -            93.8

  44. Psychometry MMPI (Minnesota Multiphasic Personality Inventory) psychometric test. Printed forms are scanned or a computerized version of the test is used. • Raw data: 550 questions, e.g. I am getting tired quickly: Yes - Don’t know - No. • Results are combined into 10 clinical scales and 4 validity scales using fixed coefficients. • Each scale measures tendencies towards hypochondria, schizophrenia, psychopathic deviations, depression, hysteria, paranoia etc.

  45. Psychometry • There is no simple correlation between single values and final diagnosis. • Results are displayed in form of a histogram, called ‘a psychogram’. Interpretation depends on the experience and skill of an expert, takes into account correlations between peaks. Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level. Problem: agreement between experts only 70% of the time; alternative diagnosis and personality changes over time are important.

  46. Psychometric data 1600 cases for women, same number for men. 27 classes: norm, psychopathic, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to ... Extraction of logical rules: 14 scales = features. Define linguistic variables and use FSM, MLP2LN, SSV - giving about 2-3 rules/class.

  47. Psychometric data 10-CV for FSM is 82-85%, for C4.5 is 79-84%. Input uncertainty +Gx around 1.5% (best ROC) improves FSM results to 90-92%.

  48. Psychometric Expert Probabilities for different classes. For greater uncertainties more classes are predicted. Fitting the rules to the conditions: typically 3-5 conditions per rule, Gaussian distributions around measured values that fall into the rule interval are shown in green. Verbal interpretation of each case, rule and scale dependent.
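
One way to turn crisp rule conditions into probabilities under input uncertainty is to place a Gaussian around each measured value and integrate it over the rule interval. The sketch below does this with the error function; treating the conditions of a rule as independent, and the function and parameter names, are assumptions of this illustration rather than details given on the slides.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def condition_probability(x, sigma, low, high):
    """P(low <= X <= high) for X ~ N(x, sigma^2), x being the measured value."""
    return normal_cdf((high - x) / sigma) - normal_cdf((low - x) / sigma)

def rule_probability(measurements, sigma, intervals):
    """Probability that all interval conditions hold (independence assumed)."""
    probability = 1.0
    for x, (low, high) in zip(measurements, intervals):
        probability *= condition_probability(x, sigma, low, high)
    return probability

# A value sitting exactly on an interval edge satisfies the condition with p = 0.5.
print(condition_probability(1.0, 0.1, 1.0, 100.0))
print(rule_probability([5.0, 1.0], 0.1, [(0.0, 10.0), (1.0, 100.0)]))
```

With larger `sigma` the probability mass spreads over neighboring rule intervals, which is consistent with more classes being predicted for greater uncertainties.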

  49. Visualization Probability of classes versus input uncertainty. Detailed input probabilities around the measured values vs. change in the single scale; changes over time define the ‘patient’s trajectory’. Interactive multidimensional scaling: zooming on the new case to inspect its similarity to other cases.

  50. Conclusions Data understanding is a challenging problem. • Classification rules are frequently only the first step and may not be the best solution. • Visualization is always helpful. • P-rules may be competitive if complex decision borders are required, providing different types of rules. • Understanding of complex objects is possible, although difficult, using adaptive costs and distance as least expensive transformations (action principles in physics). • Why am I saying all this? Because we have hopes for great applications!
