
The future of computational intelligence & meta-learning.

Włodzisław Duch & Collaborators, Department of Informatics, Nicolaus Copernicus University, Toruń, Poland; School of Computer Engineering, Nanyang Technological University, Singapore. Google: Duch W


Presentation Transcript


  1. The future of computational intelligence & meta-learning. Włodzisław Duch & Collaborators Department of Informatics, Nicolaus Copernicus University, Toruń, Poland School of Computer Engineering, Nanyang Technological University, Singapore Google: Duch W IISA 2010, Ganzhou, Jiangxi University of Science & Technology

  2. Toruń

  3. Copernicus. Nicolaus Copernicus: born in Toruń in 1473.

  4. Norbert Tomek Marek Krzysztof

  5. DI NCU Projects: CI. Google "Duch W" => list of projects, talks, papers. Computational intelligence (CI), main themes:
  • Foundations of computational intelligence: transformation-based learning, k-separability, learning hard Boolean problems.
  • Meta-learning, or learning how to learn.
  • Understanding of data: prototype-based rules, visualization.
  • Novel learning: projection pursuit networks, QPC (Quality of Projected Clusters), search-based neural training, transfer learning or learning from others (ULM), aRPM, SFM ...
  • Similarity-based framework for meta-learning, heterogeneous systems, new transfer functions for neural networks.
  • Feature selection, extraction, creation.

  6. DI NCU Projects: NCI. Neurocognitive Informatics projects.
  • Computational creativity, insight, intuition, consciousness.
  • Neurocognitive approach to language, word games.
  • Medical information retrieval, analysis, visualization.
  • Global analysis of EEG, visualization of high-D trajectories.
  • Brain stem models and consciousness in artificial systems.
  • Autism and ADHD, comprehensive theory.
  • Imagery agnosia, especially imagery amusia.
  • Infants: observation, guided development.
  • A test-bed for integration of different Humanized Interface Technologies (HIT), with the Singapore C2I Center.
  • Free will, neural determinism and social consequences.

  7. In the year 1900, at the International Congress of Mathematicians in Paris, David Hilbert delivered what is now considered the most important talk ever given in the history of mathematics, proposing 23 major problems worth working on in the future. 100 years later the impact of this talk is still strong: some problems have been solved, new problems have been added, but the direction once set - identify the most important problems and focus on them - is still important. It became quite obvious that this new field (computational intelligence) also requires a series of challenging problems that will give it a sense of direction.

  8. Failures of AI. Many ambitious general AI projects failed, for example: A. Newell, H. Simon, General Problem Solver (1957); Eduardo Caianiello (1961): mnemonic equations explain everything; the 5th generation computer project, 1982-1994.
  AI has failed in many areas:
  • problem solving, reasoning
  • flexible control of behavior
  • perception, computer vision
  • language ...
  Why? Too naive? Not focused on applications? Not addressing real challenges?

  9. Ambitious approaches… CYC, started by Douglas Lenat in 1984, commercial since 1995. Developed by CyCorp, with 2.5 million assertions linking over 250,000 concepts and, in addition, thousands of micro-theories. Cyc-NL is still a "potential application"; knowledge representation in frames is quite complicated and thus difficult to use.
  HAL baby brain - developmental approach, www.a-i.com, failed.
  Open Mind Common Sense Project (MIT): a WWW collaboration with over 14,000 authors who contributed 710,000 sentences; used to generate ConceptNet, a very large semantic network. Some interesting projects are being developed now around this network, but no systematic knowledge has been collected.
  Other such projects: HowNet (Chinese Academy of Science), FrameNet (Berkeley), various large-scale ontologies, MindNet (Microsoft) project to improve translation. Mostly focused on understanding of all relations in the text/dialogue.

  10. Steps Toward an AGI Roadmap. Artificial General Intelligence (AGI): architectures that can solve many problems and transfer knowledge between the tasks.
  Roadmaps:
  • A Ten Year Roadmap to Machines with Common Sense (Push Singh, Marvin Minsky, 2002)
  • Euron (EU Robotics) Research Roadmap (2004)
  • Neuro-IT Roadmap (EU, A. Knoll, M. de Kamps, 2006)
  Challenges: word games of increasing complexity:
  • 20Q is the simplest, only object description.
  • Yes/No game to understand a situation.
  The Mind and Brain Model Project, 500 million $ EU grant (submitted Dec 2010): 13 groups, including 1 USA + 1 Singapore.

  11. Neuroimaging words. "Predicting Human Brain Activity Associated with the Meanings of Nouns," T. M. Mitchell et al., Science, 320, 1191, May 30, 2008.
  • Clear differences between fMRI brain activity when people read and think about different nouns.
  • Reading words and seeing the drawing invokes similar brain activations, presumably reflecting the semantics of concepts.
  • Although individual variance is significant, similar activations are found in brains of different people, so a classifier may still be trained on pooled data.
  • A model trained on ~10 fMRI scans + a very large corpus (10^12 words) predicts brain activity for over 100 nouns for which fMRI has been done.
  Overlaps between activation of the brain for different words may serve as expansion coefficients for a word-activation basis set. In the future: I may know what you'll think before you will know it yourself! Intentions may be known seconds before they become conscious!
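A minimal sketch (not the authors' code) of the kind of linear model used in such studies: each noun is described by corpus-derived semantic features, and a single linear map predicts every voxel's activation, evaluated by holding out one noun at a time. The arrays `corpus_features` and `fmri_activity` below are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical data: 60 nouns x 25 corpus co-occurrence features,
# and the fMRI activation of 500 voxels recorded for each noun.
rng = np.random.default_rng(0)
corpus_features = rng.standard_normal((60, 25))   # nouns x semantic features
fmri_activity = rng.standard_normal((60, 500))    # nouns x voxels

def predict_heldout(i):
    """Leave one noun out, fit features -> voxels on the rest, predict it."""
    train = np.delete(np.arange(len(corpus_features)), i)
    X, Y = corpus_features[train], fmri_activity[train]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # least-squares weight matrix
    return corpus_features[i] @ W                  # predicted activation map

print(predict_heldout(0).shape)                    # (500,) voxel predictions
```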

  12. Connectome Project Brain-based representations of concepts should be possible.

  13. Exponential growth of power From R. Kurzweil, “The Law of Accelerating Returns”. By 2020 PC computers will match the raw speed of all brain operations! What about organization of info flow? Memristor brain projects!

  14. Robot development. Darwin/Nomad, DB, Cog, Kismet, HAL, iCub, Thespian - develop robot minds in the same way as babies' minds, by social interactions.
  Cog: saccadic eye movements, sound localization, motor coordination, balance, auditory/visual signal coordination, eye, hand and head movement coordination, face recognition, eye contact, haptic (tactile) object recognition ... An interesting model of autism!
  Kismet: sociable humanoid with emotional responses, that seems to be alive and aware.
  DB: learning from demonstration, dance, pole balancing, tennis swing, juggling ... complex eye movements, visuo-motor tasks such as catching a ball.

  15. Semantic memory architecture (diagram). Components: query interface, semantic memory store, applications (search, 20 questions game), humanized interface, part-of-speech tagger & phrase extractor, parser, on-line dictionaries, manual verification, active search and dialogues with users.

  16. DREAM top-level architecture (diagram). Components: web/text/database interface, natural input modules, NLP functions, cognitive functions, affective functions, behavior control, talking head, text to speech, control of devices, specialized agents.
  The DREAM project is focused on perception (visual, auditory, text inputs), cognitive functions (reasoning based on perceptions), natural language communication in well defined contexts, and real-time control of the simulated/physical head.

  17. What is there to learn? Industry: what happens with our machines? Cognitive robotics: vision, perception, language. Bioinformatics, life sciences. Brains ... what is in EEG? What happens in the brain?

  18. Conscious machines: Haikonen. Haikonen has done some simulations based on a rather straightforward design, with neural models feeding the sensory information (with WTA associative memory) into the associative "working memory" circuits.

  19. What can we learn? What can we learn using pattern recognition, machine learning, and computational intelligence techniques? Everything? Neural networks are universal approximators and evolutionary algorithms solve global optimization problems - so everything can be learned? Not at all! All non-trivial problems are hard and need deep transformations.
  Duda, Hart & Stork, Ch. 9, No Free Lunch + Ugly Duckling Theorems:
  • Uniformly averaged over all target functions, the expected error for all learning algorithms [predictions by economists] is the same.
  • Averaged over all target functions, no learning algorithm yields generalization error that is superior to any other.
  • There is no problem-independent or "best" set of features.
  • "Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems."
  In practice: try as many models as you can, rely on your experience and intuition. There is no free lunch, but do we have to cook ourselves?
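A small illustration of the "try as many models as you can" advice, sketched with scikit-learn (an assumption, not a tool mentioned on the slide); the dataset is just a convenient stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# No Free Lunch in practice: no algorithm wins everywhere, so compare
# several model families under the same cross-validation protocol.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM (RBF)": SVC(),
    "naive Bayes": GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name:15s} {scores.mean():.3f} +/- {scores.std():.3f}")
```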

  20. Data mining packages. GhostMiner, data mining tools from our lab + Fujitsu: http://www.fqspl.com.pl/ghostminer/
  • Separate the process of model building (hackers) and knowledge discovery from model use (lamers) => GM Developer & Analyzer.
  • No free lunch => provide different types of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, SVM, committees, tools for visualization of data.
  • Support the process of knowledge discovery/model building and evaluation, organizing it into projects.
  • Many other interesting DM packages of this sort exist: Weka, Yale, Orange, Knime ... 168 packages on the-data-mine.com list!
  • We are building Intemi, radically new DM tools.

  21. What DM packages do? Hundreds of components ... transforming, visualizing ... Visual "knowledge flow" to link components, or script languages (XML) to define complex experiments.
  RapidMiner 5.0, type and number of components:
  • Process control: 34
  • Data transformations: 111
  • Data modeling: 231
  • Clustering & segmentation: 19
  • Performance evaluation: 30
  Plus text, series, web ... specific transformations; visualization, presentation, plugin extensions ... ~ billions of models!

  22. Press the button and wait for the truth! Computer power is with us; meta-learning should replace us in finding all interesting data models = sequences of transformations/procedures.
  Many considerations: optimal cost solutions, various costs of using feature subsets; simple & easy to understand vs. optimal accuracy; various representations of knowledge: crisp, fuzzy or prototype rules, visualization, confidence in predictions ...
  May the force be with you. Hundreds of components ... billions of combinations ... Our treasure box is full! We can publish forever! Specialized transformations are still missing in many packages. What would we really like to have?
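One way to make "data model = sequence of transformations/procedures" concrete is a pipeline object; a minimal sketch in scikit-learn terms (an assumed tool, not one of the packages named above):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# A single candidate model: a chain of data transformations ending in a predictor.
# Meta-learning then amounts to searching over such chains, their components
# and their parameters, instead of tuning one fixed algorithm by hand.
candidate_model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("project", PCA(n_components=5)),
    ("classify", KNeighborsClassifier(n_neighbors=3)),
])
```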

  23. With all these tools, are we really so good? Surprise! Almost nothing can be learned using such tools!

  24. Meta-learning. Meta-learning means different things to different people. Some will call "meta" learning of many models, ranking them, boosting, bagging, or creating an ensemble in many ways, so here meta = optimization of parameters to integrate models.
  Landmarking: characterize many datasets and remember which method worked best on each dataset. Compare a new dataset to the reference ones; define various measures (not easy) and use similarity-based methods.
  Regression models: created for each algorithm on parameters that describe data, to predict expected accuracy and rank potentially useful algorithms.
  Stacking: learn new models on errors of the previous ones.
  Deep learning: DARPA 2009 call - current methods are "flat", shallow; build a universal machine learning engine that generates progressively more sophisticated representations of patterns, invariants, correlations from data. Rather limited success so far ...
  Meta-learning: learning how to learn.
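A hedged sketch of the landmarking idea just described: describe every dataset by the cross-validated accuracies of a few cheap "landmark" learners, and for a new dataset recommend the algorithm that won on the most similar reference dataset. The names and the choice of landmarkers are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Cheap landmark learners whose accuracies serve as dataset meta-features.
landmarkers = [DecisionTreeClassifier(max_depth=1),   # decision stump
               GaussianNB(),
               KNeighborsClassifier(n_neighbors=1)]

def landmark_profile(X, y):
    return np.array([cross_val_score(m, X, y, cv=3).mean() for m in landmarkers])

def recommend(X_new, y_new, reference):
    """reference: list of (profile, best_algorithm_name) from past datasets."""
    profile = landmark_profile(X_new, y_new)
    distances = [np.linalg.norm(profile - p) for p, _ in reference]
    return reference[int(np.argmin(distances))][1]
```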

  25. Similarity-based framework. (Dis)similarity:
  • more general than feature-based description,
  • no need for vector spaces (structured objects),
  • more general than the fuzzy approach (F-rules are reduced to P-rules),
  • includes nearest neighbor algorithms, MLPs, RBFs, separable function networks, SVMs, kernel methods, specialized kernels, and many others!
  Similarity-Based Methods (SBMs) are organized in a framework: p(Ci|X;M) posterior classification probability or y(X;M) approximators; models M are parameterized in increasingly sophisticated ways. A systematic search (greedy, beam, evolutionary) in the space of all SBM models is used to select the optimal combination of parameters and procedures, opening different types of optimization channels and trying to discover the appropriate bias for a given problem. Results: several candidate models are created; even a very limited version gives the best results in 7 out of 12 Statlog problems.

  26. SBM framework components.
  • Pre-processing: objects O => features X, or (dis)similarities D(O,O').
  • Calculation of similarity between features d(xi,yi) and objects D(X,Y).
  • Reference (or prototype) vector R selection/creation/optimization.
  • Weighted influence of reference vectors G(D(Ri,X)), i=1..k.
  • Functions/procedures to estimate p(C|X;M) or y(X;M).
  • Cost functions E[DT;M] and model selection/validation procedures.
  • Optimization procedures for the whole model Ma.
  • Search control procedures to create more complex models Ma+1.
  • Creation of ensembles of (local, competent) models.
  M = {X(O), d(.,.), D(.,.), k, G(D), {R}, {pi(R)}, E[.], K(.), S(.,.)}, where S(Ci,Cj) is a matrix evaluating similarity of the classes; a vector of observed probabilities pi(X) may be used instead of hard labels.
  The kNN model p(Ci|X;kNN) = p(Ci|X;k,D(.),{DT}); the RBF model p(Ci|X;RBF) = p(Ci|X;D(.),G(D),{R}); MLP, SVM and many other models may all be "re-discovered" as part of the similarity-based framework.
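A toy sketch (not the actual SBM/Intemi code) of the greedy search over SBM components that slides 27-28 trace: start from 1-NN with Euclidean distance on all features, and accept any single change of distance function, k, or feature subset that improves cross-validated accuracy.

```python
from itertools import combinations
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def score(X, y, metric="euclidean", k=1, features=None):
    cols = list(features) if features is not None else list(range(X.shape[1]))
    knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
    return cross_val_score(knn, X[:, cols], y, cv=5).mean()

def greedy_sbm_search(X, y):
    best = {"metric": "euclidean", "k": 1, "features": None}   # plain 1-NN
    best_acc, improved = score(X, y, **best), True
    while improved:
        improved = False
        # Single-component modifications: distance function, k, drop one feature.
        candidates = [dict(best, metric=m) for m in ("manhattan", "canberra")]
        candidates += [dict(best, k=k) for k in (3, 5, 7)]
        n = X.shape[1]
        candidates += [dict(best, features=f) for f in combinations(range(n), n - 1)]
        for cand in candidates:
            acc = score(X, y, **cand)
            if acc > best_acc:
                best, best_acc, improved = cand, acc, True
    return best, best_acc
```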

  27. Meta-learning in SBM scheme. Start from kNN, k=1, all data & features, Euclidean distance; end with a model that is a novel combination of procedures and parameterizations. Candidate models explored:
  • k-NN: 67.5/76.6%
  • + ranking: 67.5/76.6%
  • + selection: 67.5/76.6%
  • + d(x,y), Canberra: 89.9/90.7%
  • + k opt: 67.5/76.6%
  • + si=(0,0,1,0,1,1): 71.6/64.4%
  • + d(x,y) + si=(1,0,1,0.6,0.9,1), Canberra: 74.6/72.9%
  • + d(x,y) + selection, Canberra: 89.9/90.7%
  • + d(x,y) + selection or optimized k, Canberra: 89.9/90.7%

  28. Meta-learning in SBM scheme (continued). k-NN 67.5/76.6%; + selection, 67.5/76.6%; + d(x,y), Canberra 89.9/90.7%; + k opt, 67.5/76.6%; + si=(0,0,1,0,1,1), 71.6/64.4%; + d(x,y) + si=(1,0,1,0.6,0.9,1), Canberra 74.6/72.9%; + d(x,y) + selection, Canberra 89.9/90.7%. Start from kNN, k=1, all data & features, Euclidean distance; end with a model that is a novel combination of procedures and parameterizations.

  29. Real meta-learning! Meta-learning: learning how to learn; replace experts who search for the best models by making a lot of experiments. The search space of models is too large to explore exhaustively, so design the system architecture to support knowledge-based search.
  • Abstract view, uniform I/O, uniform results management.
  • Directed acyclic graphs (DAG) of boxes representing scheme placeholders and particular models, interconnected through I/O.
  • Configuration level for meta-schemes, expanded at runtime level.
  • An exercise in software engineering for data mining!

  30. Intemi, Intelligent Miner. Meta-schemes: templates with placeholders.
  • May be nested; the role is decided by the input/output types.
  • Machine learning generators based on meta-schemes.
  • Granulation level allows creating novel methods.
  • Complexity control: length + log(time).
  • A unified meta-parameter description, defining the range of sensible values and the type of the parameter changes.

  31. Advanced meta-learning.
  • Extracting meta-rules, describing interesting search directions.
  • Finding the correlations occurring among different items in the most accurate results, identifying different machine (algorithmic) structures with similar behavior in an area of the model space.
  • Depositing the knowledge gained in a reusable meta-knowledge repository (for meta-learning experience exchange between different meta-learners).
  • A uniform representation of the meta-knowledge, extending expert knowledge, adjusting the prior knowledge according to performed tests.
  • Finding new successful complex structures and converting them into meta-schemes (which we call meta abstraction) by replacing proper substructures with placeholders.
  • Beyond transformations & feature spaces: actively search for information.
  Intemi software (N. Jankowski and K. Grąbczewski), incorporating these ideas and more, is coming "soon" ...

  32. Meta-learning architecture. Inside the meta-parameter search, a repeater machine composed of distribution and test schemes is placed.

  33. Generating machines. The search process is controlled by a variant of approximated Levin's complexity: an estimate of program complexity combined with time. Simpler machines are evaluated first; machines that work too long (approximations may be wrong) are put into quarantine.
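A toy sketch of this generation order (the real Intemi scheduler is considerably more elaborate): candidates are popped from a priority queue ordered by an approximated Levin-style cost, model complexity plus log(estimated time), and a machine that overruns its budget is flagged for quarantine instead of being ranked.

```python
import heapq, math, time

def levin_cost(model_complexity, expected_seconds):
    # Approximated Levin complexity: description length + log(running time).
    return model_complexity + math.log(max(expected_seconds, 1e-9))

def run_candidates(candidates, time_budget):
    """candidates: iterable of (name, complexity, expected_seconds, train_fn)."""
    queue = [(levin_cost(c, t), i, name, train)
             for i, (name, c, t, train) in enumerate(candidates)]
    heapq.heapify(queue)                     # simplest machines come out first
    results, quarantine = {}, []
    while queue:
        _, _, name, train = heapq.heappop(queue)
        start = time.time()
        result = train()                     # evaluate the machine
        if time.time() - start > time_budget:
            quarantine.append(name)          # too slow: set aside (a real system
        else:                                # would interrupt the run instead)
            results[name] = result
    return results, quarantine
```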

  34. Pre-compute what you can and use “machine unification” to get substantial savings!

  35. Complexities on vowel data (table of evaluated models).

  36. Simple machines on vowel data Left: final ranking, gray bar=accuracy, small bars: memory, time & total complexity, middle numbers = process id (models in previous table).

  37. Complex machines on vowel data Left: final ranking, gray bar=accuracy, small bars: memory, time & total complexity, middle numbers = process id (models in previous table).

  38. Principles: information compression. Neural information processing in perception and cognition: information compression, or algorithmic complexity. In computing: minimum length (message, description) encoding.
  Wolff (2006): all cognition and computation is information compression! Analysis and production of natural language, fuzzy pattern recognition, probabilistic reasoning and unsupervised inductive learning. He talks about multiple alignment, unification and search, but so far has only models for sequential data and 1D alignment.
  Information compression, encoding new information in terms of old, has been used to define a measure of syntactic and semantic information (Duch, Jankowski 1994), based on the size of the minimal graph representing a given data structure or knowledge-base specification; thus it goes beyond alignment.

  39. Knowledge transfer. Brains learn new concepts in terms of old ones; they use a large semantic network and add new concepts by linking them to the known ones. Knowledge should be transferred between tasks, not just learned from a single dataset. We need to discover good building blocks for higher-level concepts/features.

  40. How to become an expert? Textbook knowledge in medicine: detailed description of all possibilities. Effect: neural activation flows everywhere and correct diagnosis is impossible. Correlations between observations forming prototypes are not firmly established. An expert has correct associations.
  Example: 3 diseases, clinical case descriptions in spaces derived from texts and medical ontologies + MDS visualization.
  • A system that has been trained on textbook knowledge.
  • The same system that has learned on real cases, not the textbook.
  • An experienced expert that has learned on real cases (higher-order associations).
  Conclusion: abstract presentation of knowledge in complex domains leads to poor expertise; random real-case learning is a bit better; learning with real cases that cover the whole spectrum of different cases is the best.
  I hear and I forget. I see and I remember. I do and I understand. Confucius, c. 500 BC.

  41. Maximization of margin/regularization Among all discriminating hyperplanes there is one defined by support vectors that is clearly better.

  42. LDA in a larger space. Use LDA, but add new dimensions - functions of your inputs! Add to the input the squares Xi^2 and the products XiXj as new features. Example, 2D => 5D case: Z = {z1, ..., z5} = {X1, X2, X1^2, X2^2, X1X2}. The number of such tensor products grows exponentially - no good. Suppose that strongly non-linear borders are needed. (Fig. 4.1, Hastie et al.)
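A minimal sketch of this expansion using scikit-learn (an assumption; the figure reference is to Hastie et al.): generate the squared and product terms automatically and run ordinary LDA in the enlarged space, which yields a quadratic border in the original 2D space.

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# 2D input (X1, X2) -> 5D (X1, X2, X1^2, X1*X2, X2^2); a linear discriminant
# in the expanded space is a quadratic decision border in the original space.
quadratic_lda = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearDiscriminantAnalysis(),
)
# quadratic_lda.fit(X_train, y_train); quadratic_lda.predict(X_test)
```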

  43. Kernels = similarity functions. Gaussian kernels in SVM: zi(X) = G(X; Xi, σ) radial features, X => Z. Gaussian mixtures are close to optimal Bayesian errors. The solution requires continuous deformation of decision borders and is therefore rather easy. Example: Gaussian kernel, C=1. In the kernel space Z decision borders are flat, but in the X space highly non-linear!
  SVM is based on a quadratic solver, without explicit features, but using Z features explicitly has some advantages:
  • Multiresolution: different σ for different support features, or using several kernels zi(X) = K(X; Xi, σ) in one set of features.
  • Linear solvers, or Naive Bayes, or any other algorithms may be used.
  Support Feature Machines (SFM): construct features based on projections, restricted linear combinations, and kernel features; use feature selection.
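A hedged sketch of using kernel values as explicit features, in the spirit of the SFM construction described above (not the authors' implementation): compute Gaussian similarities zi(X) to a set of reference vectors and hand them to any linear solver.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.linear_model import LogisticRegression

def kernel_features(X, references, gamma=0.5):
    # z_i(X) = exp(-gamma * ||X - R_i||^2): one radial feature per reference vector.
    return rbf_kernel(X, references, gamma=gamma)

def fit_explicit_kernel_model(X_train, y_train, n_refs=50, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    refs = X_train[rng.choice(len(X_train), size=n_refs, replace=False)]
    Z = kernel_features(X_train, refs, gamma)          # explicit kernel space Z
    clf = LogisticRegression(max_iter=1000).fit(Z, y_train)
    return clf, refs

# Prediction maps new data into the same Z space, where the border is linear:
# clf.predict(kernel_features(X_test, refs, gamma))
```

Using different gamma values for different reference vectors, or mixing in projection-based features, gives the multiresolution and SFM-style variants mentioned above.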

  44. Neural networks: thyroid screening. Network diagram: clinical findings (age, sex, TSH, T3, TT4, T4U, TBG, ...) => hidden units => final diagnoses (normal, hypothyroid, hyperthyroid).
  Data from the Garavan Institute, Sydney, Australia: 15 binary and 6 continuous attributes. Training: 93+191+3488; validation: 73+177+3178.
  Goals: determine important clinical factors, calculate the probability of each diagnosis. Poor results of SBL or SVM ... see the summary at http://www.is.umk.pl/projects/datasets.html#Hypothyroid

  45. SVNT algorithm. Initialize the network parameters W, set De=0.01, emin=0, set SV=T (the training set).
  Until no improvement is found in the last Nlast iterations do:
  • Optimize network parameters for Nopt steps on the SV data.
  • Run a feedforward step on SV to determine the overall accuracy and errors; make the new SV = {X | e(X) ∈ [emin, 1-emin]}.
  • If the accuracy increases: compare the current network with the previous best one, choose the better one as the current best; increase emin = emin + De and make a forward step selecting SVs.
  • If the number of support vectors |SV| increases: decrease emin = emin - De; decrease De = De/1.2 to avoid large changes.
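A rough Python transcription of the loop above, to make the control flow explicit; `model.fit_steps`, `model.errors` and `model.copy` are hypothetical hooks for an MLP-style trainer, not an existing API.

```python
import numpy as np

def svnt(model, X, y, n_opt=10, n_last=5, d_e=0.01):
    e_min = 0.0
    sv = np.arange(len(X))                    # SV = T: start from the whole training set
    best, best_acc, stale, prev_nsv = None, -1.0, 0, len(sv)
    while stale < n_last:
        model.fit_steps(X[sv], y[sv], n_opt)  # optimize parameters on current SV
        errs = model.errors(X, y)             # per-example error e(X) in [0, 1]
        sv = np.where((errs >= e_min) & (errs <= 1.0 - e_min))[0]
        accuracy = float((errs < 0.5).mean())
        if accuracy > best_acc:               # keep the better network as current best
            best, best_acc, stale = model.copy(), accuracy, 0
            e_min += d_e                      # narrow the selection window
        else:
            stale += 1
        if len(sv) > prev_nsv:                # |SV| grew: back off and damp the step
            e_min -= d_e
            d_e /= 1.2
        prev_nsv = len(sv)
    return best
```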

  46. Hypothyroid data. 2 years of real medical screening tests for thyroid diseases: 3772 cases with 93 primary hypothyroid and 191 compensated hypothyroid; the remaining 3488 cases are healthy; 3428 test cases with a similar class distribution. 21 attributes (15 binary, 6 continuous) are given, but only two of the binary attributes (on thyroxine, and thyroid surgery) contain useful information, therefore the number of attributes has been reduced to 8.
  Method / % train / % test:
  • SFM, SSV + 2 B1 features: ------- / 99.6
  • SFM, SVMlin + 2 B1 features: ------- / 99.5
  • MLP+SCG, 4 neurons: 99.8 / 99.2
  • Cascade correlation: 100 / 98.5
  • MLP+backprop: 99.6 / 98.5
  • SVM Gaussian kernel: 99.8 / 98.4
  • SVM linear: 94.1 / 93.3

  47. Hypothyroid data

  48. Heterogeneous systems. Problems requiring different scales (multiresolution). 2-class problems, two situations:
  • C1 inside the sphere, C2 outside. MLP: at least N+1 hyperplanes, O(N^2) parameters. RBF: 1 Gaussian, O(N) parameters.
  • C1 in the corner defined by the (1,1,...,1) hyperplane, C2 outside. MLP: 1 hyperplane, O(N) parameters. RBF: many Gaussians, O(N^2) parameters, poor approximation.
  Combination: needs both a hyperplane and a hypersphere! The logical rule IF x1>0 & x2>0 THEN C1 ELSE C2 is not represented properly by either MLP or RBF! Different types of functions in one model: the first step beyond inspirations from single neurons => heterogeneous models.
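A small illustrative sketch of the point above: give one linear model both a projection (hyperplane-type) feature and a distance-to-centre (hypersphere-type) feature, so data enclosed in a sphere and data separated by a plane can each be handled with O(N) parameters. The particular direction and centre below are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def heterogeneous_features(X, direction=None, centre=None):
    """Two hand-picked heterogeneous features: one projection, one radius."""
    n = X.shape[1]
    w = direction if direction is not None else np.ones(n) / np.sqrt(n)  # (1,...,1)
    c = centre if centre is not None else np.zeros(n)
    projection = X @ w                              # hyperplane-type feature
    radius = np.linalg.norm(X - c, axis=1)          # hypersphere-type feature
    return np.column_stack([projection, radius])

# A linear classifier over these two features realizes either "inside the sphere"
# or "beyond the (1,...,1) hyperplane" class structures with O(N) parameters:
# clf = LogisticRegression().fit(heterogeneous_features(X_train), y_train)
```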

  49. Heterogeneous everything. Homogenous systems use one type of "building blocks" and the same type of decision borders; examples: neural networks, SVMs, decision trees, kNNs. Committees combine many models together, but lead to complex models that are difficult to understand. Ockham's razor: simpler systems are better. Discovering the simplest class structures, the inductive bias of the data, requires Heterogeneous Adaptive Systems (HAS).
  HAS examples: NN with different types of neuron transfer functions; k-NN with different distance functions for each prototype; decision trees with different types of test criteria.
  1. Start from large networks, use regularization to prune.
  2. Construct the network adding nodes selected from a candidate pool.
  3. Use very flexible functions, force them to specialize.

  50. Taxonomy - TF (taxonomy of transfer functions, figure).
