network ensembles committees for improved classification and regression
Skip this Video
Download Presentation
Radwan E. Abdel-Aal Computer Engineering Department November 2006

Loading in 2 Seconds...

play fullscreen
1 / 37

Radwan E. Abdel-Aal Computer Engineering Department November 2006 - PowerPoint PPT Presentation

  • Uploaded on

Network Ensembles (Committees) for Improved Classification and Regression. Radwan E. Abdel-Aal Computer Engineering Department November 2006. Contents. Data-based Predictive Modeling - Approach, advantages, Scope, and Main tools

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about ' Radwan E. Abdel-Aal Computer Engineering Department November 2006' - toshi

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
network ensembles committees for improved classification and regression

Network Ensembles (Committees) for Improved Classification and Regression

Radwan E. Abdel-Aal

Computer Engineering Department

November 2006



  • Data-based Predictive Modeling

- Approach, advantages, Scope, and Main tools

  • Need for high prediction accuracy
  • The network ensemble (committee) approach

- Need for diversity among members and How to achieve it

  • Some Results

- Classification: Medical diagnosis

- Regression: Electric peak load forecasting

  • Summary

Data-based Predictive Modeling

  • The process of creating a statistical model of future behavior based on data collected on observed past behavior
  • The model uses a number of predictors (input variables that are likely to influence the output)
  • The model relationship between such inputs and behavior is determined using a machine learning algorithm

Input Vector, X




Y = F(X)

Output, Y


Advantages over other modeling


• Thorough theoretical knowledge is not necessary

  • Less user intervention (Let the data speak!)

(No biases or pre-assumptions on relationships)

  • Better handling of nonlinearities, complexities
  • Greater tolerance to noise, uncertainties

(Soft Computing)

  • Faster and easier to develop
  • Utilizes the loads of computerized historical data

available now in many disciplines




- Pollution monitoring, Weather forecasting

Finance and business:

- Loan assessment, Fraud detection, Market forecasting

- Basket analysis, Product targeting, Efficient mailing


- Process modeling and optimization, Load forecasting

- Machine diagnostics, Predictive maintenance

Medical and Bio Informatics

- Screening, Diagnosis, Prognosis, Therapy, Gene classification


- Web access analysis, Site personalization


How? Two basic steps

Develop Model

UsingKnown Cases

Use Model

For New Cases



Supervised Learning





Known O/P, Y

Unknown O/P, Y


Attributes, X





Y = F(X)

Determine F(X)

data based predictive modeling by supervised machine learning
Data-based Predictive Modeling by supervised Machine learning
  • Database of solved examples (input-output)
  • Preparation: cleanup, transform, derive new attributes
  • Split data into a training and a test set
  • Training:

Develop model on the training set

  • Evaluation:

See how the model fares on the test set

  • Actual use:

Use promising model on new input data to estimate unknown output

example medical screening


F(x) ?

 G(x)


Example: Medical Screening
  • Y=F(x): true function (usually not known) for population P
  • 1. Collect Data:“labeled” training sample drawn from P





  • 2. Training: Get G(x); model learned from training sample, Goal: E<(F(x)-G(x))2> ≈ 0 for future samples drawn from P – Not just data fitting!
  • 3. Test/Use:


data based modeling tools learning paradigms
Data-Based Modeling Tools(Learning Paradigms)
  • Decision Trees
  • Nearest-Neighbor Classifiers
  • Support Vector Machines
  • Neural Networks
  • Abductive Networks
neural networks nn




Neural Networks (NN)


Input Layer

Output Layer





Actual: 0.65















Error: 0.05

Transfer Function



Dependent Output Variable

Independent Input Variables (Attributes)

Error back-propagation

limitations of neural networks


F(x) ?

 G(x)


Limitations of Neural Networks
  • Ad hoc approach by user to determine network structure and training parameters- Trial & Error ?
  • Opacity or black-box nature gives poor explanation capabilities which are important, e.g. in medicine

G(x) is ‘distributed’

in a maze of network weights



Significant inputs are not immediately obvious

When to stop training to avoid over-fitting ?

Local Minima may hinder optimum solution


Self-Organizing Abductive (Polynomial) Networks


“Double” Element:

y = w0 + w1 x1 + w2 x2

+ w3 x12 + w4 x22

+ w5 x1 x2

+ w6 x13 + w7 x23

- Network of polynomial functional elements- not simple neurons

- No fixed a priori model structure. Model evolves with training

- Automatic selection of: Significant inputs, Network size, Element types, Connectivity, and Coefficients

- Automatic stopping criteria, with simple control on complexity

- Analytical input-output relationships

need for high prediction accuracy medical diagnosis
Need for high prediction accuracy:Medical Diagnosis

- Ideally FN = FP = 0

- FN: Actual positives missed as

negatives by classifier

- FP: Actual negatives mistaken

as positives by classifier

Both types of errors are costly!

“Cost” can be given

to each type of error

need for high prediction accuracy hourly electric load forecasting
Need for high prediction accuracy:Hourly Electric Load Forecasting

Overestimation: Spin up reserve units unnecessarily

Underestimation: Need to deploy expensive peaking units

or buy costly generation from other utilities

 higher operating costs

An extra 1% in forecast error increased operating cost of

a UK power utility by 10 million sterling pounds in 1985

how to ensure good predictive models
How to ensure good predictive models?
  • Use effective predictors
  • Use representative datasets for model training and evaluation
  • Large training and evaluation datasets
  • Pre-process datasets to remove outliers, errors, etc. and perform normalization, transformations, etc.
  • Avoid over-fitting during training (i.e. use parsimonious models)
  • Use proven learning algorithms
what if a single is not good enough the network ensemble approach
What if a single is not good enough? The Network Ensemble Approach

If member networks

are independent, diversity in

the decision making process

boosts generalization,

thus improving accuracy,

robustness, and reliability

of the overall prediction

Identical members

 No gain in performance

Improvement expected only

when members err in different ways

and directions, so errors can cancel out!

n is usually odd to

suit majority voting

methods of combining member outputs
Methods of combining member outputs

1. Simple combination of member outputs:

- Simple averaging of continuous outputs

- Weighed averaging of continuous outputs (fixed weights)

, where

e.g. where is the variance of member i output on its training set

- Majority voting of categorical outputs

methods of combining member outputs contd
Methods of combining member outputs, Contd.

2. A gating network uses the input vector to determine optimum

weights for member outputs for each case to be classified

3. Stacked generalization approach: the output combiner is

another higher-level network trained on the outputs of

individual members to generate the committee classification


network ensembles the need for diversification
Network Ensembles:The need for diversification

The committee error can be shown to have two components:

- One measuring the average generalization error

of individual members

- The other measuring the disagreement among

outputs of individual members


  • Individual members should ideally be uncorrelated or even

negatively correlated (Diversity)

  • An ideal committee would consist of:
    • Highly accurate members,
    • which disagree among themselves as much as possible

Possible tradeoffs between the above two requirements

ensuring diversity committee of experts
Ensuring diversity: Committee of Experts

Individual member networks belong to totally different learning paradigms, e.g. Neural networks, Nearest neighbor classifiers, Classification and Regression Trees (CART), etc.


  • Members can use the same full training dataset and the same full set of input features

 No conflict between individual quality and collective diversity


  • Requires the use of different tools and expertise
ensuring diversity same learning paradigm
Ensuring diversity: Same learning paradigm

Develop member networks with same paradigm using:

  • Different training subsets
  • or Different input features
  • or Different training conditions
    • Neural networks:

- Different: architectures (MLP, RBF), algorithms (BP, SA) topologies, initial random weights, neuron transfer functions, learning rate, momentum, stopping criteria, (Research topic)

    • Abductive networks (Self organizing- Limited user choices!):

- Different values for the model complexity parameter (CPM)

Sacrifice individual quality

for collective diversity!

cpm complexity penalty parameter 0 1 to 10
CPM (Complexity Penalty Parameter: 0.1 to 10)

Lower CPM  More Complex Model

Can be used as a method for ranking input features:

Those selected earlier are better predictors

some results classification for medical diagnosis
Some Results: Classification for Medical Diagnosis

1. The Pima Indians Diabetes Dataset from the UCI Machine Learning Repository

  • 768 cases: (669 for training and 99 for evaluation)
  • 8 numerical attributes on physiological measurements and medical test results
  • A binary class variable (Diabetic:1, Not Diabetic: 0)
  • Percentage of positives in the total set: 34.9%
  • Typical classification accuracies reported for C4.5 decision tree tool: 74.6%
classification of the diabetes dataset
Classification of the Diabetes Dataset
  • Optimum monolithic abductive network model using full training dataset and feature set
  • Two abductive network ensemble approaches:
    • A: 3 members trained on the same (full) training set at different CPM values (model complexity parameters) (NT = 669 cases)
    • B: 3 members trained on 3 mutually exclusives subsets of the training set at same CPM value (NT = 223 cases)




trained on same training data and Different CPMs


trained on different training data and same CPMs

Errors by different members are poorly correlated (i.e. err differently, more independent)

Errors by different members are highly correlated (i.e. err together, less independent)

classification of the diabetes dataset1
Classification of the Diabetes Dataset




some results classification for medical diagnosis1
Some Results: Classification for Medical Diagnosis

2. The Cleveland Heart Disease Dataset from the UCI Machine Learning Repository

  • 270 cases: (190 for training and 80 for evaluation)
  • 13 numerical attributes
  • A binary class variable: Presence 1/Absence 0 of heart disease
  • Percentage of positives in the total set: 44.4 %
  • Typical classification accuracies reported with neural networks: 81.8%
classification of the heart disease dataset
Classification of the Heart Disease Dataset
  • Optimum monolithic abductive network model using full training dataset and feature set
  • 3-member abductive network ensemble:
    • Training set available is small  Not practical to split
    • For diversity: Members trained on thesame (full) training set but using different (mutually exclusive) subsets of input features
classification of the heart disease dataset1
Classification of the Heart Disease Dataset
  • To ensure good (and uniform) quality of all member networks, good quality input features must be distributed uniformly amongst members
  • First, rank the input feature subset based on predictive quality
  • Then distribute the features on the 3 members fairly
some results regression electrical load forecasting



Some Results: Regression:Electrical Load Forecasting
  • Short term: Hours, a week
      • Medium term: Months, a year
          • Long term: Up to 20 years

Short term (ST) Forecasting:

- Hourly load profile

- Daily peak load ()

ST Forecasts useful for scheduling:

Generator unit commitment

Short-term maintenance

Fuel allocation

Evaluating interchange transactions in deregulated markets



General upward trend


Seasonal yearly


28 years

Short-Term Forecasting:

Factors affecting the load

  • Time, Calendar:

Hourly, daily, seasonal,

holidays, school year, …

  • Weather: (Heating/cooling loads)

Temperature, humidity, wind speed,

cloud cover, …

 Economic, Societal:

(Slow trending effect)Industrial growth,electricity pricing,Population growth,…

  • Events:

Start/stop of large loads: Sports and TV shows, …


Forecasting tomorrow’s Peak Load

  • 47 Inputs representing:

- Peak load,

- Max and min temperatures,


over the previous 7 days and the forecasted day

  • Training: 3 years (1987-89)

Evaluation: 4th year (1990)

  • Trend management (2 ways):

- Use an additional trend input 

- Normalize all training years to last training year and then denormalize model output

24 off


Peak Load


7 x 6 = 42






Total: 48 inputs


Forecasting tomorrow’s Peak Load:

Monolithic neural and abductive Networks

Abductive Model:

Only 8 of the 48 inputs available

are automatically selected during training




Actual, MW




Performance over evaluation year:

  • Mean Absolute Percentage Error (MAPE) = 2.52%
  • Correlation cofft. between true and predicted data = 0.986

Improving Forecasting Accuracy

Using Abductive Ensembles

  • Three-member ensemble to forecast load for the year 1990


  • Members are trained on raw data for the three preceding years 1987, 1988, 1989
  • No need for trend input when training on 1-year data

For evaluation:

  • For each model:
  • Normalize evaluation year load data to the model training year at input
  • Denormalize model output to evaluation year before combining the 3 outputs

Evaluation Year

abductive network ensembles results
Abductive Network EnsemblesResults

Statistically significant error reduction

  • Network ensembles (committees) can lead to significant performance gains in classification and regression
  • Members need to be both accurate and independent
  • Independence is more difficult to achieve with abductive compared to neural networks
  • Effective ways to achieve this are: Different training datasets, different input features, and different model complexity (CPMs)
  • Demonstrated the technique on medical data (classification) and electrical load forecasting (regression)