
Lifelong Machine Learning Systems : Beyond Learning Algorithms




  1. Lifelong Machine Learning Systems: Beyond Learning Algorithms Daniel L. Silver Acadia University, Wolfville, NS, Canada Qiang Yang and Lianghao Li Dept. of CS and Engineering Hong Kong University of Science and Technology, Clearwater Bay, Hong Kong

  2. Talk Outline • Position and Motivation • Rapid Review of Prior Work on LML • Moving Beyond Learning Algorithms • Framework for LML and Essential Ingredients • Challenges and Benefits • Next Steps

  3. Position • It is now appropriate to seriously consider the nature of systems that learn over a lifetime • Advocate a systems approach in the context of an agent that can: • Acquire new knowledge through learning • Retain and consolidate that knowledge • Use it in future learning and other aspects of AI

  4. Motivation • Placing ML in the context of goal-oriented systems is a step toward Big AI • LML is the logical next step for ML • Strong foundation in prior work • Investigate retention and use of inductive bias • Embraces non-stationary learning problems • Generate new theory where ML meets KR • Numerous practical apps in agents and robotics

  5. Prior Work - Supervised • Michalski (1980s) • Constructive inductive learning • Principle: New knowledge is easier to induce if search is done using the correct representation • Two interrelated searches during learning: • Search for the best representational space for hypotheses • Search for best hypothesis in the current representational space • Utgoff and Mitchell (1983) • Importance of inductive bias to learning - systems should be able to search for an appropriate inductive bias using prior knowledge • Proposed a system that shifted its bias by adjusting the operations of the modeling language

  6. Prior Work - Supervised • Solomonoff (1989) • Incremental learning • System primed on a small, incomplete set of primitive concepts; first learns to express the solutions to a set of simple problems • Then given more difficult problems and, if necessary, additional primitive concepts, etc • Thrun and Mitchell (1990s) • Explanation-based neural networks (EBNN) • Transfers knowledge across multiple learning tasks • Uses domain knowledge of previous learning tasks (back-prop. gradients) to guide the development of a new task

  7. Prior Work – MTL and Task Rehearsal • Virtual examples from related prior tasks provide knowledge transfer • Rehearsal of virtual examples for f2–f6 ensures knowledge retention • [Figure: a short-term learning network (inputs x1…xn, outputs f1(x)…f6) sends virtual examples of f1(x) to a long-term consolidated domain knowledge network] • Various researchers: Caruana, Baxter, Robins, French, Thrun, Silver, Naik

  8. Prior Work - Unsupervised • Grossberg and Carpenter (1987) • Stability-Plasticity problem • How to integrate new knowledge with old? • ART – Adaptive Resonance Theory • Strehl and Ghosh (2003) • Cluster ensemble framework • Reuses prior partitionings to cluster data for a new task • Three techniques for obtaining high quality ensemble combiners

  9. Prior Work - Unsupervised • Raina et al. (2007) • Self-taught Learning • Large body of random unlabeled data is used to create higher-level (more abstract) features • Labeled data is transformed to these features and used to train a classifier
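A minimal sketch of the self-taught pipeline described above, assuming a toy 2-D domain; a crude two-means clustering stands in for the real feature-learning step (sparse coding in Raina et al.), and all names and sizes here are illustrative:

```python
import math
import random

random.seed(3)

# Stand-in for a large body of random unlabeled data: two unknown 2-D clusters.
unlabeled = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(50)] + \
            [(random.gauss(3, 0.3), random.gauss(3, 0.3)) for _ in range(50)]

def two_means(points, iters=10):
    """Crude feature learning: two k-means centroids act as higher-level features."""
    centroids = [points[0], points[-1]]  # simple deterministic initialization
    for _ in range(iters):
        buckets = ([], [])
        for p in points:
            d = [math.dist(p, c) for c in centroids]
            buckets[d.index(min(d))].append(p)
        centroids = [
            (sum(q[0] for q in b) / len(b), sum(q[1] for q in b) / len(b)) if b else c
            for b, c in zip(buckets, centroids)
        ]
    return centroids

features = two_means(unlabeled)

def transform(x):
    """Re-express an input in the learned feature space (distance to each centroid);
    the labeled data would be transformed this way before training a classifier."""
    return [math.dist(x, c) for c in features]

print(transform((0.1, 0.0)))  # close to one learned feature, far from the other
```

The point of the sketch is only the two-stage structure: features are learned from unlabeled data first, and the (typically scarce) labeled data is then mapped into that feature space.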

  10. Related Work - Unsupervised • Hinton and Bengio (2007+) • Learning of deep architectures of neural networks • Layered networks of unsupervised auto-encoders efficiently develop hierarchies of features that capture regularities in their respective inputs • Used to develop models for families of tasks • Carlson et al. (2010) • NELL – Never-Ending Language Learner • Each day: extracts information from the web to populate a growing knowledge base • Learns to perform this task better than on the previous day • Uses a semi-supervised MTL approach in which a large number of different semantic functions are trained together

  11. Prior Work - Reinforcement • Ring (1997) • Continual learning - CHILD • Builds more complicated hypotheses on top of those already developed both incrementally and hierarchically using reinforcement learning methods. • Parr and Russell (1997) • Use prior knowledge to reduce the hypothesis space of a reinforcement learner • Tanaka and Yamamura (1999) • Lifelong reinforcement learning method for robots • Treats multiple environments as multiple tasks

  12. Prior Work - Reinforcement • Sutton et al. (2007) • ML has focused on “the results of learning and not the on-going process of learning” • Traditional ML “converges” to a good solution for a stationary problem from a set of training examples • This does not work for non-stationary problems in complex environments, which require “tracking” • Shows that tracking can work better even for stationary problems and provides a telltale sign of the need for meta-learning
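The tracking-versus-converging distinction can be illustrated with a toy drifting target; the step sizes and the drift below are assumptions for the sketch, not from the talk:

```python
import random

def track(stream, alpha):
    """Constant-step-size tracker: keeps adapting, never fully converges."""
    estimate = 0.0
    for y in stream:
        estimate += alpha * (y - estimate)  # move a fixed fraction toward each sample
    return estimate

def converge(stream):
    """Sample-average learner: step size 1/n shrinks, so all past data weighs equally."""
    estimate, n = 0.0, 0
    for y in stream:
        n += 1
        estimate += (y - estimate) / n
    return estimate

random.seed(0)
# Non-stationary problem: the target mean drifts from 0.0 to 5.0 halfway through.
stream = [random.gauss(0.0, 0.1) for _ in range(500)] + \
         [random.gauss(5.0, 0.1) for _ in range(500)]

print(track(stream, alpha=0.1))  # follows the current mean (near 5.0)
print(converge(stream))          # stuck near the lifetime average (near 2.5)
```

The converging learner is optimal if the problem really is stationary, but once the target drifts it averages over obsolete data, while the tracker stays close to the current regime.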

  13. Moving Beyond Learning Algorithms - Rationale 1. Inductive bias is essential to learning (Mitchell, Utgoff 1983; Wolpert 1996) • Learning systems should search for an appropriate inductive bias using all available knowledge • LML systems that retain and use prior knowledge as a source for shifting inductive bias promote this perspective • Many real-world problems are non-stationary and exhibit drift

  14. Moving Beyond Learning Algorithms - Rationale 2. Theoretical advances in AI: ML meets KR • “The acquisition, representation and transfer of domain knowledge are the key scientific concerns that arise in lifelong learning.” (Thrun 1997) • KR plays an important role in LML - the interaction between knowledge retention and transfer is key • LML has the potential to make serious advances on the learning of common background knowledge (example: CMU’s NELL project)

  15. Moving Beyond Learning Algorithms - Rationale 3. Practical Agents/Robots Require LML • Advances in autonomous robotics and intelligent agents that run on the web or in mobile devices present opportunities for employing LML systems. • The ability to retain and use learned knowledge is very attractive to the researchers designing these systems.

  16. Moving Beyond Learning Algorithms - Rationale 4. Increasing Capacity of Computers • Advances in modern computers provide the computational power for implementing and testing practical LML systems (e.g. product recommendation)

  17. Increasing Capacity of Computers • Andrew Ng’s work on Deep Learning Networks (ICML-2012) • Problem: Learn to recognize human faces, cats, etc. from unlabeled data • Dataset of 10 million images; each image has 200x200 pixels • 9-layered locally connected neural network (1B connections) • Parallel algorithm; 1,000 machines (16,000 cores) for three days • Building High-level Features Using Large Scale Unsupervised Learning. Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng. ICML 2012: 29th International Conference on Machine Learning, Edinburgh, Scotland, June 2012.

  18. Definition / Framework for LML • [Figure: training examples S = {(xi, y = f(xi))} drawn from instance space X feed an inductive learning system (short-term memory), which produces a model of classifier h and predictions/actions h(x) on testing examples; domain knowledge, retained from universal knowledge, supplies inductive bias BD through knowledge selection and knowledge transfer]

  19. Essential Ingredients of LML • The retention (or consolidation) of learned task knowledge • KR perspective • Effective and Efficient Retention • Resists the accumulation of erroneous knowledge • Maintains or improves model performance • Mitigates redundant representation • Allows the practice of tasks

  20. Essential Ingredients of LML • The selective transfer of prior knowledge when learning new tasks • ML perspective • Effective and Efficient Transfer Learning • Produce models that perform better • Knowledge transfer should reduce learning time

  21. Essential Ingredients of LML • A systems approach • Ensures the effective and efficient interaction of the retention and transfer components • Much to be learned from the writings of early cognitive scientists, AI researchers and neuroscientists such as Albus, Holland, Newell, Langley, Johnson-Laird and Minsky

  22. Challenges and Benefits • Which learning approach is best for LML? • Unsupervised, Supervised, Reinforcement • Semi-supervised, Self-taught … • Heterogeneous domains of tasks • Leverage feature correspondence across domains • Affective feature mapping (Yang et al. 2009) • Subject of greater attention over the next 10 years

  23. Challenges and Benefits • How do we weigh the relevance and accuracy of prior versus new knowledge (training examples)? • How do we select relevant prior knowledge? Task relatedness? • Need for meta-knowledge

  24. Challenges and Benefits • Method of knowledge retention? • Representational: Weights (ANN), Distance metric (kNN), Branches (IDT), Choice of kernel (SVM) • Functional: Examples (ANN), Hyper-priors (NB), Minimization guides (EBNN) • Method of knowledge transfer? • [Figure: representational versus functional transfer from task A to task B]

  25. Challenges and Benefits • Stability-Plasticity problem - How do we integrate new knowledge in with old? • No loss of new knowledge • No loss of prior knowledge • Efficient methods of storage and recall • LML methods that can efficiently and effectively retain learned knowledge will suggest approaches to “common knowledge” representation – a “Big AI” problem

  26. Challenges and Benefits • Practice makes perfect! • An ML3 system must be capable of learning from examples of tasks over a lifetime • Practice should increase model accuracy and overall domain knowledge • How can this be done? • Research important to AI, Psych, and Education

  27. Challenges and Benefits • Scalability • Often a difficult but important challenge • Must scale with increasing: • Number of inputs and outputs • Number of training examples • Number of tasks • Complexity of tasks, size of hypothesis representation • Preferably, polynomial growth

  28. Challenges and Benefits • Computational Curricula • Insight into curriculum and training sequences • Best practices for rapid, accurate learning • Best practices for knowledge consolidation • Of interest to AI and Education

  29. Challenges and Benefits • Applications in software agents and robots • Examples encountered periodically, intermittently • Practice is often necessary • Consolidation of new knowledge with old is needed for continual learning • Opportunity to test theories on curricula

  30. Next Steps • We call for a move beyond the development of learning algorithms – to systems that learn, retain and use knowledge over a lifetime • Suggest two action items: • Consider a grand challenge to help further define the field of LML and raise the profile of research • Establish an open source project similar to WEKA that allows researchers to share knowledge and collaborate on LML systems

  31. Thank You! QUESTIONS? • danny.silver@acadiau.ca • http://plato.acadiau.ca/courses/comp/dsilver/ • http://ml3.acadiau.ca

  32. EXTRA SLIDES

  33. Inductive Bias • Human learners use inductive bias • [Figure: street map (Ash St, Fir St, Second, Third, Elm St, Pine St, Oak St) illustrating navigation guided by prior knowledge] • Inductive bias depends upon: • Having prior knowledge • Selection of most related knowledge

  34. Inductive Biases • Universal heuristics - Occam’s Razor • Knowledge of intended use – Medical diagnosis • Knowledge of the source - Teacher • Knowledge of the task domain • Analogy with previously learned tasks

  35. A Framework for LML • L ∧ BD ∧ S ≻ h • (a hypothesis h is developed by the learning system L from the training examples S under the inductive bias supplied by domain knowledge BD)

  36. Machine Lifelong Learning Framework • [Figure: training examples (x, f(x)) from instance space X feed an inductive learning system (short-term memory), which outputs a model of classifier h with h(x) ~ f(x) on testing examples; domain knowledge (long-term memory) interacts through retention & consolidation, knowledge transfer, and inductive bias selection]

  37. Machine Lifelong Learning – One Implementation • [Figure: a multiple task learning (MTL) network (short-term memory) learns h(x) ~ f(x) from training examples (x, f(x)) drawn from instance space X; a consolidated MTL network (domain knowledge, long-term memory) holds tasks f1(x), f2(x), …, fk(x) and supports retention & consolidation, knowledge transfer, and inductive bias selection]

  38. Multiple Task Learning (MTL) • Multiple hypotheses develop in parallel within one back-propagation network [Caruana, Baxter 93-95] • An inductive bias occurs through shared use of common internal representation • Knowledge or inductive transfer to primary task f1(x) depends on the choice of secondary tasks • [Figure: inputs x1…xn feed a common feature layer (common internal representation), topped by task-specific representations for outputs f1(x)…fk(x)]
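A minimal sketch of the MTL architecture just described, assuming two toy related tasks (Boolean AND and OR) that share one hidden layer in a single back-propagation network; the layer size, learning rate, and epoch count are illustrative choices, not from the slides:

```python
import math
import random

random.seed(1)
H = 4  # shared hidden units (illustrative size)

# Two related tasks over the same inputs: task 0 = AND, task 1 = OR.
data = [((x1, x2), (x1 & x2, x1 | x2)) for x1 in (0, 1) for x2 in (0, 1)]

# One shared hidden layer plus a per-task output head.
W = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(H)]      # 2 inputs + bias
V = [[random.uniform(-0.5, 0.5) for _ in range(H + 1)] for _ in range(2)]  # H hidden + bias

def forward(x):
    h = [math.tanh(W[j][0] * x[0] + W[j][1] * x[1] + W[j][2]) for j in range(H)]
    out = [1.0 / (1.0 + math.exp(-(sum(V[t][j] * h[j] for j in range(H)) + V[t][H])))
           for t in range(2)]
    return h, out

lr = 0.5
for _ in range(3000):
    for x, ys in data:
        h, out = forward(x)
        deltas = [out[t] - ys[t] for t in range(2)]  # cross-entropy gradient at each head
        for j in range(H):
            # The hidden-unit error sums contributions from BOTH task heads:
            # this shared gradient on the common representation is the inductive transfer.
            dh = sum(deltas[t] * V[t][j] for t in range(2)) * (1.0 - h[j] ** 2)
            W[j][0] -= lr * dh * x[0]
            W[j][1] -= lr * dh * x[1]
            W[j][2] -= lr * dh
        for t in range(2):
            for j in range(H):
                V[t][j] -= lr * deltas[t] * h[j]
            V[t][H] -= lr * deltas[t]

preds = [[round(forward(x)[1][t]) for x, _ in data] for t in range(2)]
print(preds)  # per-task predictions on (0,0), (0,1), (1,0), (1,1)
```

The shared hidden layer is the "common internal representation": both heads backpropagate error into the same weights W, so what one task learns shapes the features available to the other.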

  39. Prior Work – MTL and Task Rehearsal • Rehearsal of virtual examples for f2–f6 ensures knowledge retention • Virtual examples from related prior tasks provide knowledge transfer • Lots of internal representation • Rich set of virtual training examples • Small learning rate = slow learning • Validation set to prevent growth of high-magnitude weights [Poirier04] • [Figure: a short-term learning network (inputs x1…xn, outputs f1(x)…f6) sends virtual examples of f1(x) to a long-term consolidated domain knowledge network] • Various researchers: Caruana, Baxter, Robins, French, Thrun, Silver, Naik
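The rehearsal mechanism itself can be sketched in a few lines; the prior-task function and the sample counts below are hypothetical stand-ins for a consolidated network and its virtual-example generation:

```python
import random

random.seed(2)

# Stand-in for a task already stored in the long-term consolidated network.
def prior_task(x):
    return 1.0 if x[0] + x[1] > 1.0 else 0.0

def virtual_examples(model, n):
    """Task rehearsal: probe the stored model at random inputs to generate
    (input, model-output) pairs -- no original training data is required."""
    xs = [(random.random(), random.random()) for _ in range(n)]
    return [(x, model(x)) for x in xs]

rehearsal_set = virtual_examples(prior_task, 100)

# Mixing virtual examples with the new task's real examples lets consolidation
# proceed without overwriting the prior task (knowledge retention).
real_examples = [((0.2, 0.1), 0.0), ((0.9, 0.8), 1.0)]
training_set = real_examples + rehearsal_set
print(len(training_set))  # 102 examples: 2 real + 100 rehearsed
```

Because the virtual targets come from the consolidated model itself, rehearsing them while training on the new task anchors the old functionality, which is exactly the retention role the slide assigns to rehearsal of f2–f6.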

  40. An Environmental Example • Stream flow rate prediction [Lisa Gaudette, 2006] • x = weather data, f(x) = flow rate

  41. csMTL and Tasks with Multiple Outputs • Liangliang Tu (2010) • Image Morphing: Inductive transfer between tasks that have multiple outputs • Transforms 30x30 grey scale images using inductive transfer

  42. csMTL and Tasks with Multiple Outputs

  43. csMTL and Tasks with Multiple Outputs Demo

  44. An ML3 based on csMTL • Addresses the Stability-Plasticity Problem • One output for all tasks • Representational transfer from CDK for rapid learning • Functional transfer (virtual examples) for consolidation • [Figure: a short-term learning network f1(c,x) and a long-term consolidated domain knowledge network f'(c,x), both over task-context inputs c1…ck and standard inputs x1…xn] • Work with Ben Fowler, 2010
