Advanced Algorithms for Biological Data Analysis

Advanced Algorithms for Biological Data Analysis Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory School of Computer Science and Engineering Seoul National University http://bi.snu.ac.kr/ http://cbit.snu.ac.kr/

Lecture Schedule • Day 1: Introduction to Machine Learning • Day 2: Neural Networks • Day 3: Hidden Markov Models • Day 4: Principal Component Analysis • Day 5: Clustering Analysis

Introduction to Machine Learning Algorithms in Bioinformatics Byoung-Tak Zhang Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory School of Computer Science and Engineering Seoul National University E-mail: btzhang@cse.snu.ac.kr http://bi.snu.ac.kr./ http://cbit.snu.ac.kr/

Outline • Part I • Concept of Machine Learning (ML) • Machine Learning Algorithms and Applications • Applications in Bioinformatics • Part II • Version Space Learning • Decision Tree Learning

What is Artificial Intelligence (AI)? • Design and study of computer programs that behave intelligently. • Designing computer programs to make computers smarter. • Study of how to make computers do things at which, at the moment, people are better. • (No satisfactory definition of AI)

Research Areas and Approaches (Artificial Intelligence) • Research: Learning Algorithms, Inference Mechanisms, Knowledge Representation, Intelligent System Architecture • Application: Intelligent Agents, Information Retrieval, Electronic Commerce, Data Mining, Bioinformatics, Natural Language Proc., Expert Systems • Paradigm: Rationalism (Logical), Empiricism (Statistical), Connectionism (Neural), Evolutionary (Genetic), Biological (Molecular)

Concept of Machine Learning

Context • Machine Learning in relation to: Computer Science (AI), Cognitive Science, Statistics, Information Theory

Why Machine Learning? • Recent progress in algorithms and theory • Growing flood of online data • Computational power is available • Budding industry Three niches for machine learning • Data mining: using historical data to improve decisions • Medical records --> medical knowledge • Software applications we can’t program by hand • Autonomous driving • Speech recognition • Self-customizing programs • Newsreader that learns user interests

Brief History of Machine Learning • 1950’s: Samuel’s checker player • 1960’s: Neural networks, perceptron; pattern recognition; learning in the limit theory; Minsky & Papert. • 1970’s: Symbolic concept induction; Winston’s arch learner; knowledge acquisition bottleneck; Quinlan’s ID3; Michalski’s AQ and soybean diagnosis results; scientific discovery with BACON; mathematical discovery with AM. • 1980’s: Continued progress on decision-tree and rule learning; explanation-based learning; speedup learning; utility problem; analogy; resurgence of connectionism (PDP, ANN); Valiant’s PAC learning; experimental evaluation. • 1990’s: Data mining; adaptive software agents & IR; reinforcement learning; theory refinement; inductive logic programming; voting, bagging, boosting, and stacking; learning Bayesian networks.

Learning: Definition • Definition • Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment. • Equivalently: the improvement of behavior on some performance task through the acquisition of knowledge based on partial task experience.

A Learning Problem: EnjoySport
Sky | Temp | Humid | Wind | Water | Forecast | EnjoySport
Sunny | Warm | Normal | Strong | Warm | Same | Yes
Sunny | Warm | High | Strong | Warm | Same | Yes
Rainy | Cold | High | Strong | Warm | Change | No
Sunny | Warm | High | Strong | Cool | Change | Yes
What is the general concept?
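One way to approach the question above is to generalize from the positive examples only. The following is a minimal sketch, assuming hypotheses are conjunctions of attribute values with "?" as a wildcard (a Find-S style procedure). The names `find_s` and `covers` are illustrative and do not come from the lecture.

```python
# Attribute order: Sky, Temp, Humid, Wind, Water, Forecast.
# The label is True when EnjoySport is "Yes".
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def find_s(examples):
    """Return the most specific conjunctive hypothesis covering all positives."""
    hypothesis = None
    for attributes, label in examples:
        if not label:
            continue  # this simple procedure ignores negative examples
        if hypothesis is None:
            hypothesis = list(attributes)  # start from the first positive example
        else:
            # Keep a value where it agrees, otherwise generalize to the wildcard.
            hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, attributes)]
    return tuple(hypothesis) if hypothesis is not None else None


def covers(hypothesis, attributes):
    """True if the hypothesis classifies the instance as positive."""
    return all(h == "?" or h == a for h, a in zip(hypothesis, attributes))


concept = find_s(EXAMPLES)
print(concept)
```

On the four training examples this yields a hypothesis that fixes Sky, Temp, and Wind and leaves the other attributes open; the single negative example is not covered by it.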

Possible Uses of Machine Learning • configuration and design • planning and scheduling • data mining and knowledge discovery • diagnostic reasoning • execution and control • language understanding • vision and speech

Metaphors and Methods • Neurobiology: Connectionist Learning • Biological Evolution: Genetic Learning • Heuristic Search: Tree / Rule Induction • Statistical Inference: Probabilistic Induction • Memory and Retrieval: Case-Based Learning

Learning: Components • Components of a learning system • Performance: accuracy, efficiency, understandability • Environment: external setting to the learner • Knowledge: internal data structure • Experience: perception, action, mental traces • Improvement: desirable change in performance

Learning System • Components: Environment, Performance, Knowledge, Learning • Figure labels: problem, solution, get data, get knowledge, acquired knowledge, improve behavior

What is the Learning Problem? • Learning = improving with experience at some task • Improve over task T, • With respect to performance measure P, • Based on experience E. E.g., Learn to play checkers • T: Play checkers • P: % of games won in world tournament • E: opportunity to play against self

Machine Learning: Tasks • Supervised Learning • Estimate an unknown mapping from known input-output pairs • Learn fw from training set D={(x,y)} s.t. • Classification: y is discrete • Regression: y is continuous • Unsupervised Learning • Only input values are provided • Learn fw from D={(x)} s.t. • Compression • Clustering • Reinforcement Learning

Machine Learning: Strategies • Rote learning • Concept learning • Learning from examples • Learning by instruction • Inductive learning • Deductive learning • Explanation-based learning (EBL) • Learning by analogy • Learning by observation

Supervised Learning • Given a sequence of input/output pairs of the form <xi, yi>, where xi is a possible input and yi is the output associated with xi. • Learn a function f that accounts for the examples seen so far, f(xi) = yi for all i, and that makes a good guess for the outputs of the inputs that it has not seen.
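The two requirements above (reproduce the seen pairs, guess sensibly on unseen inputs) can be illustrated with a nearest-neighbor rule. This is a minimal sketch, assuming numeric input vectors and squared Euclidean distance; the data and the names `fit_nearest_neighbor` and `training_pairs` are hypothetical, not from the lecture.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_nearest_neighbor(pairs):
    """Return a function f that labels x with the output of its closest stored input."""
    memory = list(pairs)  # the "learned" knowledge is simply the stored examples

    def f(x):
        _, y = min(memory, key=lambda pair: squared_distance(pair[0], x))
        return y

    return f


# Hypothetical training pairs <x_i, y_i>.
training_pairs = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

f = fit_nearest_neighbor(training_pairs)
print(f((1.2, 1.1)), f((5.5, 5.0)))
```

With distinct training inputs, f(xi) = yi holds for every stored pair because each input is its own nearest neighbor; unseen inputs receive the label of the closest stored example.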

Examples of Input-Output Pairs
Task | Inputs | Outputs
Recognition | Descriptions of objects | Classes that the objects belong to
Action | Descriptions of situations | Actions or predictions
Janitor robot problem | Descriptions of offices (floor, prof’s office) | Yes or No (indicating whether or not the office contains a recycling bin)

Classification and Concept Learning • Classification • If the function is discrete valued, then the outputs are called classes • Concept learning • Learned function has only two possible outputs

Unsupervised Learning • Clustering • A clustering algorithm partitions the inputs into a fixed number of subsets or clusters so that inputs in the same cluster are close to one another. • Discovery learning • The objective is to uncover new relations in the data. • Reinforcement learning • Uses a feedback signal (not the target output) that gives the learning program an indication of whether or not what it has learned is correct.
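The clustering bullet above can be made concrete with k-means, one common algorithm that partitions inputs into a fixed number of clusters. This is a minimal sketch, assuming 2-D points and taking the first k points as initial centers (a simplification chosen to keep the run deterministic); the data and names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=20):
    """Alternate between assigning points to the nearest center and re-centering."""
    centers = list(points[:k])  # assumption: first k points serve as initial centers
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return assignment, centers


# Hypothetical inputs forming two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.8, 1.1), (4.8, 5.1)]
assignment, centers = k_means(points, k=2)
print(assignment, centers)
```

No target outputs are used anywhere: the grouping comes only from distances between the inputs themselves.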

Online and Batch Learning • Batch methods • Process large sets of examples all at once. • Online (incremental) methods • Process examples one at a time.
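The difference between the two modes can be seen on a very small estimation task. This is a minimal sketch, assuming the quantity to learn is just the mean of a stream of numbers; `batch_mean` and `OnlineMean` are illustrative names.

```python
def batch_mean(values):
    """Batch method: needs the whole data set at once."""
    return sum(values) / len(values)


class OnlineMean:
    """Online (incremental) method: updates the estimate one example at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Move the current estimate toward the new value by a shrinking step.
        self.mean += (value - self.mean) / self.count
        return self.mean


data = [2.0, 4.0, 6.0, 8.0]

online = OnlineMean()
for value in data:
    online.update(value)

print(batch_mean(data), online.mean)
```

Both routes arrive at the same estimate here; the online version never stores the examples, which is the property that matters for data streams.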

Machine Learning Algorithms and Applications

Machine Learning Algorithms (1/2) • Symbolic Learning (covered on Day 1) • Version Space Learning • Case-Based Learning • Neural Learning (covered on Day 2) • Multilayer Perceptrons (MLPs) • Self-Organizing Maps (SOMs) • Support Vector Machines (SVMs) • Evolutionary Learning (very briefly explained on Day 1) • Evolution Strategies • Evolutionary Programming • Genetic Algorithms • Genetic Programming

Machine Learning Algorithms (2/2) • Probabilistic Learning (covered on Days 3 and 5) • Bayesian Networks (BNs) • Helmholtz Machines (HMs) • Latent Variable Models (LVMs) • Generative Topographic Mapping (GTM) • Other Machine Learning Methods (partially covered on Days 1 and 4) • Decision Trees (DTs) • Reinforcement Learning (RL) • Boosting Algorithms • Mixture of Experts (ME) • Independent Component Analysis (ICA)

Example Applications of ML (1/2) • Banking & Investment • Credit card fraud • Delinquent accounts • Authorization of purchases • Predict stock market • Health Care • Disease diagnosis • Managing resources • Look for causal relationships between environment and disease • Marketing • Credit card applications • Use past buying habits to predict likelihood of customer purchasing some new product • Textual Data Mining

Example Applications of ML (2/2) • Astronomy • Bioinformatics • Chemistry • Human resources: evaluating job performance • Insurance & Finance • Manufacturing: process control • Signal and image processing • Speech recognition • …

Neural Nets for Handwritten Digit Recognition [Figure: digit images pass through pre-processing into a network of input units, hidden units, and output units labeled 0, 1, 2, 3, …, 9; panels: Training, Test]

ALVINN System: Neural Network Learning to Steer an Autonomous Vehicle

Learning to Navigate a Vehicle by Observing a Human Expert (1/2) • Inputs • The images produced by a camera mounted on the vehicle • Outputs • The actions taken by the human driver to steer the vehicle or adjust its speed. • Result of learning • A function mapping images to control actions

Learning to Navigate a Vehicle by Observing a Human Expert (2/2)

Data Recorrection by a Hopfield Network [Figure panels: original target data; corrupted input data; recorrected data after 10 iterations; recorrected data after 20 iterations; fully recorrected data after 35 iterations]
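The recall behavior shown on this slide can be sketched in a few lines. This is a minimal sketch, assuming bipolar (+1/-1) patterns, Hebbian outer-product weights with a zero diagonal, and asynchronous updates in a fixed order; the tiny 8-unit patterns are hypothetical and do not reproduce the slide's data or its iteration counts.

```python
def train_hopfield(patterns):
    """Hebbian weights: sum of outer products of the stored patterns, zero diagonal."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights


def recall(weights, state, max_sweeps=10):
    """Update units one at a time until no unit changes (or max_sweeps is reached)."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            field = sum(weights[i][j] * state[j] for j in range(n))
            # A zero field leaves the unit unchanged.
            new_value = state[i] if field == 0 else (1 if field > 0 else -1)
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


# Two hypothetical stored patterns (orthogonal to each other).
target_a = [1, 1, 1, 1, -1, -1, -1, -1]
target_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hopfield([target_a, target_b])

corrupted = [-1, 1, 1, 1, -1, -1, -1, -1]  # target_a with its first unit flipped
restored = recall(weights, corrupted)
print(restored == target_a)
```

The stored patterns act as stable states, so a lightly corrupted input settles back onto the nearest stored pattern.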

Predicting the Sunspot Number with Neural Networks

ANN for Face Recognition 960 x 3 x 4 network is trained on gray-level images of faces to predict whether a person is looking to their left, right, ahead, or up.

Data Mining • Process: Database/data warehouse → (Selection & Sampling) → Target data → (Preprocessing & Cleaning) → Cleaned data → (Transformation & reduction) → Transformed data → (Data Mining) → Patterns/model → (Interpretation/Evaluation) → Knowledge → Performance system

Customer Relationship Management (CRM) • Increased Customer Lifetime Value • Increased Wallet Share • Improved Customer Retention • Segmentation of Customers by Profitability • Segmentation of Customers by Risk of Default • Integrating Data Mining into the Full Marketing Process

Hot Water Flashing Nozzle with Evolutionary Algorithms • Hans-Paul Schwefel performed the original experiments • [Figure labels: Start; hot water entering; at throat: Mach 1 and onset of flashing; steam and droplets at exit]

Case-Based Reasoning (Aamodt & Plaza, 1994) • Cycle: 1. Retrieve, 2. Reuse, 3. Revise, 4. Retain • Figure labels: Input, New Problem, Case Base, General Knowledge, Retrieved Cases, Retrieved Solution, Learned Case, Output
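The four-step cycle on this slide can be sketched as a single function. This is a minimal sketch, assuming a case is a (problem, solution) pair, retrieval picks the single nearest stored case, and revision is delegated to a caller-supplied function; `solve_with_cbr`, `distance`, and `revise` are hypothetical names, and the toy size-labeling task is invented for illustration.

```python
def solve_with_cbr(case_base, problem, distance, revise):
    """Run one Retrieve / Reuse / Revise / Retain pass and return the solution."""
    # 1. Retrieve: find the stored case whose problem is closest to the new one.
    _, retrieved_solution = min(case_base, key=lambda case: distance(case[0], problem))
    # 2. Reuse: propose the retrieved solution for the new problem.
    proposed = retrieved_solution
    # 3. Revise: let an external check confirm or correct the proposal.
    confirmed = revise(problem, proposed)
    # 4. Retain: store the solved problem as a new case.
    case_base.append((problem, confirmed))
    return confirmed


# Hypothetical toy task: label a number as "small" or "large".
case_base = [(2, "small"), (10, "large")]


def distance(a, b):
    return abs(a - b)


def revise(problem, proposed):
    # Stand-in for expert feedback; here it enforces a made-up threshold of 5.
    return "large" if problem >= 5 else "small"


first = solve_with_cbr(case_base, 5, distance, revise)   # retrieved "small", revised
second = solve_with_cbr(case_base, 6, distance, revise)  # now retrieves the case for 5
print(first, second, len(case_base))
```

The second query benefits from the case retained by the first, which is how the case base accumulates experience over time.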

Machine Learning Applications in Bioinformatics

Bioinformatics • What is Bioinformatics? Bioinformatics is a new term referring to the discipline that employs computers to store, retrieve, analyze and assist in understanding biological information. • The application of information technology and computer science to the study of biological systems. • The analysis of the massive (and constantly increasing) amount of genetic information. • Sophisticated computer technologies to enable discovery in all fields of life sciences.

Problems in Bioinformatics • Sequence analysis: sequence alignment, structure and function prediction, gene finding • Structure analysis: protein structure comparison, protein structure prediction, RNA structure modeling • Expression analysis: gene expression analysis, gene clustering • Pathway analysis: metabolic pathways, regulatory networks

Applications of Bioinformatics • Drug design • Identification of genetic risk factors • Gene therapy • Genetic modification of food crops and animals • Forensics • Biological warfare • Personalized Medicine • E-Doctor

Machine Learning and Bioinformatics • Figure labels: Bio DB, Machine learning, knowledge, Drug Development, Pharmacology, Ecology, Medical therapy research

Machine Learning Techniques for Bio Data Mining • Sequence Alignment • Simulated Annealing • Genetic Algorithms • Structure and Function Prediction • Hidden Markov Models • Multilayer Perceptrons • Decision Trees • Molecular Clustering and Classification • Support Vector Machines • Nearest Neighbor Algorithms • Expression (DNA Chip Data) Analysis • Self-Organizing Maps • Bayesian Networks

Structure and Function Prediction • Protein structure prediction • Protein modeling • Gene finding and gene prediction

Effect and Applications of Biological Data Mining • Biological Data Mining: store, retrieve, analyze and assist in understanding biological information • Biocomputing • Increase and Improvement of Farm Products • Renewable Energy • Diagnosis with Chip • SNP (Single Nucleotide Polymorphism) • Customized Drug
