Presentation Transcript



Network Ensembles (Committees) for Improved Classification and Regression

Radwan E. Abdel-Aal

Computer Engineering Department

November 2006



Contents

  • Data-based Predictive Modeling

    - Approach, advantages, Scope, and Main tools

  • Need for high prediction accuracy

  • The network ensemble (committee) approach

    - Need for diversity among members and How to achieve it

  • Some Results

    - Classification: Medical diagnosis

    - Regression: Electric peak load forecasting

  • Summary



Data-based Predictive Modeling

  • The process of creating a statistical model of future behavior based on data collected on observed past behavior

  • The model uses a number of predictors (input variables that are likely to influence the output)

  • The model relationship between such inputs and behavior is determined using a machine learning algorithm

[Diagram: Input Vector X (Predictors / Attributes / Features) → model Y = F(X) → Output Y]


Advantages over other modeling approaches

  • Thorough theoretical knowledge is not necessary

  • Less user intervention (Let the data speak!)

    (No biases or pre-assumptions on relationships)

  • Better handling of nonlinearities, complexities

  • Greater tolerance to noise, uncertainties

    (Soft Computing)

  • Faster and easier to develop

  • Utilizes the wealth of computerized historical data now available in many disciplines



Scope

Environmental:

- Pollution monitoring, Weather forecasting

Finance and business:

- Loan assessment, Fraud detection, Market forecasting

- Basket analysis, Product targeting, Efficient mailing

Engineering:

- Process modeling and optimization, Load forecasting

- Machine diagnostics, Predictive maintenance

Medical and Bio Informatics

- Screening, Diagnosis, Prognosis, Therapy, Gene classification

Internet:

- Web access analysis, Site personalization



How? Two basic steps

[Diagram: Step 1 - Develop model using known cases (supervised learning): known inputs (attributes X) and known outputs Y are used to determine F(X). Step 2 - Use model for new cases: apply Y = F(X) to new inputs with unknown output to estimate Y]



Data-based Predictive Modeling by supervised Machine learning

  • Database of solved examples (input-output)

  • Preparation: cleanup, transform, derive new attributes

  • Split data into a training and a test set

  • Training:

    Develop model on the training set

  • Evaluation:

    See how the model fares on the test set

  • Actual use:

    Use promising model on new input data to estimate unknown output
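The steps above can be sketched end-to-end. The tiny numeric dataset and the simple threshold "learner" below are illustrative assumptions, standing in for a real learning algorithm:

```python
# Minimal sketch of the supervised-learning workflow (illustrative data and model).

def split(cases, train_frac=0.8):
    """Split the solved examples into a training set and a test set."""
    n_train = int(len(cases) * train_frac)
    return cases[:n_train], cases[n_train:]

def train(training_set):
    """Training: 'learn' G(x) - here just a threshold at the mean training input."""
    xs = [x for x, _ in training_set]
    mean_x = sum(xs) / len(xs)
    return lambda x: 1 if x >= mean_x else 0

def evaluate(model, test_set):
    """Evaluation: see how the model fares on the held-out test set."""
    correct = sum(1 for x, y in test_set if model(x) == y)
    return correct / len(test_set)

# Ten solved (input, output) examples: output is 1 when input >= 5.
cases = [(i, 1 if i >= 5 else 0) for i in range(10)]
train_set, test_set = split(cases)
g = train(train_set)               # develop model on the training set
accuracy = evaluate(g, test_set)   # then check it on unseen cases
```

In actual use, `g` would then be applied to new inputs whose output is unknown.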


Example: Medical Screening

[Diagram: input x fed to the unknown true function F(x) and the learned model G(x), both producing output Y]

  • Y=F(x): true function (usually not known) for population P

  • 1. Collect data: “labeled” training sample drawn from P

    57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,00

    78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,01

    69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,00

    18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,01

  • 2. Training: Get G(x), the model learned from the training sample. Goal: E⟨(F(x) − G(x))²⟩ ≈ 0 for future samples drawn from P – not just data fitting!

  • 3. Test/Use:

    71,M,160,1,132,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0?



Data-Based Modeling Tools(Learning Paradigms)

  • Decision Trees

  • Nearest-Neighbor Classifiers

  • Support Vector Machines

  • Neural Networks

  • Abductive Networks


Neural Networks (NN)

[Diagram: feed-forward network with an input layer of independent input variables (attributes: Age = 34, Gender = 2, Stage = 4), a hidden layer of neurons applying weighted sums and transfer functions, and an output layer producing the dependent output variable (predicted 0.60 vs. actual 0.65, error 0.05); weights are adjusted by error back-propagation]


Limitations of Neural Networks

  • Ad hoc approach by the user to determine network structure and training parameters - trial and error?

  • Opacity or black-box nature gives poor explanation capabilities, which are important, e.g. in medicine: G(x) is ‘distributed’ in a maze of network weights

  • Significant inputs are not immediately obvious

  • When to stop training to avoid over-fitting?

  • Local minima may hinder reaching the optimum solution



Self-Organizing Abductive (Polynomial) Networks

GMDH-based

“Double” Element:

y = w0 + w1 x1 + w2 x2 + w3 x1² + w4 x2² + w5 x1 x2 + w6 x1³ + w7 x2³

- Network of polynomial functional elements- not simple neurons

- No fixed a priori model structure. Model evolves with training

- Automatic selection of: Significant inputs, Network size, Element types, Connectivity, and Coefficients

- Automatic stopping criteria, with simple control on complexity

- Analytical input-output relationships
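The "double" element above can be written directly as a function. The coefficient values used below are placeholders for illustration; in GMDH training they are fitted to the data:

```python
# One GMDH "double" element: a third-degree polynomial in two inputs.

def double_element(x1, x2, w):
    """y = w0 + w1*x1 + w2*x2 + w3*x1^2 + w4*x2^2 + w5*x1*x2 + w6*x1^3 + w7*x2^3"""
    w0, w1, w2, w3, w4, w5, w6, w7 = w
    return (w0 + w1 * x1 + w2 * x2
            + w3 * x1 ** 2 + w4 * x2 ** 2
            + w5 * x1 * x2
            + w6 * x1 ** 3 + w7 * x2 ** 3)

# Placeholder coefficients (a trained network would determine these).
w = [1.0, 0.5, -0.5, 0.0, 0.0, 0.25, 0.0, 0.0]
y = double_element(2.0, 1.0, w)
```

An abductive network chains such elements, with training selecting which inputs and elements survive.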



Need for high prediction accuracy:Medical Diagnosis

- Ideally FN = FP = 0

- FN: Actual positives missed as negatives by the classifier

- FP: Actual negatives mistaken as positives by the classifier

Both types of errors are costly! A “cost” can be assigned to each type of error.
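Counting the two error types and attaching a cost to each can be sketched as follows; the label vectors and the cost values are illustrative assumptions:

```python
# Count false negatives and false positives, then weight them by cost.

def error_costs(actual, predicted, cost_fn=5.0, cost_fp=1.0):
    """Return (FN, FP, total cost) for binary labels (1 = positive).
    Cost values are illustrative; in screening, FN is usually costlier."""
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return fn, fp, fn * cost_fn + fp * cost_fp

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
fn, fp, cost = error_costs(actual, predicted)
```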



Need for high prediction accuracy:Hourly Electric Load Forecasting

Overestimation: Spin up reserve units unnecessarily

Underestimation: Need to deploy expensive peaking units or buy costly generation from other utilities → higher operating costs

An extra 1% in forecast error increased the operating cost of a UK power utility by £10 million in 1985



How to ensure good predictive models?

  • Use effective predictors

  • Use representative datasets for model training and evaluation

  • Large training and evaluation datasets

  • Pre-process datasets to remove outliers, errors, etc. and perform normalization, transformations, etc.

  • Avoid over-fitting during training (i.e. use parsimonious models)

  • Use proven learning algorithms


What if a single network is not good enough? The Network Ensemble Approach

If member networks are independent, diversity in the decision-making process boosts generalization, thus improving the accuracy, robustness, and reliability of the overall prediction.

Identical members → no gain in performance. Improvement is expected only when members err in different ways and directions, so errors can cancel out!

The number of members n is usually odd to suit majority voting.



Methods of combining member outputs

1. Simple combination of member outputs:

- Simple averaging of continuous outputs

- Weighted averaging of continuous outputs (fixed weights): ȳ = Σᵢ wᵢ yᵢ, where Σᵢ wᵢ = 1, e.g. wᵢ ∝ 1/σᵢ², where σᵢ² is the variance of member i’s output on its training set

- Majority voting of categorical outputs
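The three simple combination rules can be sketched as follows for a 3-member committee; the member outputs, variances, and vote labels are illustrative values:

```python
# Simple averaging, inverse-variance weighted averaging, and majority voting.
from collections import Counter

def simple_average(outputs):
    return sum(outputs) / len(outputs)

def weighted_average(outputs, variances):
    """Weight each member inversely to its training-set output variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, outputs)) / total

def majority_vote(labels):
    """Most common categorical output wins (n odd avoids ties for 2 classes)."""
    return Counter(labels).most_common(1)[0][0]

avg  = simple_average([0.9, 0.7, 0.8])
wavg = weighted_average([0.9, 0.7, 0.8], variances=[1.0, 2.0, 4.0])
vote = majority_vote([1, 0, 1])
```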



Methods of combining member outputs, Contd.

2. A gating network uses the input vector to determine optimum weights for the member outputs for each case to be classified

3. Stacked generalization approach: the output combiner is another higher-level network trained on the outputs of the individual members to generate the committee classification output
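A minimal sketch of stacked generalization for two regression members: the second-level "combiner" here is a least-squares weight fit (solved with the 2×2 normal equations), an illustrative stand-in for the higher-level network described above, and the member models and data are made up:

```python
# Level-1 combiner trained on the outputs of level-0 members.

def stack_two(members, X, t):
    """Fit w1, w2 minimizing sum((w1*g1(x) + w2*g2(x) - t)^2)
    via the 2x2 normal equations; return the combined model."""
    g1 = [members[0](x) for x in X]
    g2 = [members[1](x) for x in X]
    a = sum(u * u for u in g1)
    b = sum(u * v for u, v in zip(g1, g2))
    d = sum(v * v for v in g2)
    e = sum(u * y for u, y in zip(g1, t))
    f = sum(v * y for v, y in zip(g2, t))
    det = a * d - b * b
    w1 = (e * d - b * f) / det
    w2 = (a * f - b * e) / det
    return lambda x: w1 * members[0](x) + w2 * members[1](x)

# Two oppositely biased level-0 members; the true relationship is y = x.
members = [lambda x: x + 1.0, lambda x: x - 1.0]
X = [0.0, 1.0, 2.0, 3.0]
t = list(X)
committee = stack_two(members, X, t)   # learns w1 = w2 = 0.5
```

The combiner cancels the members' opposite biases, which is exactly the error cancellation the ensemble approach relies on.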



Network Ensembles:The need for diversification

The committee error can be shown to have two components:

- One measuring the average generalization error of the individual members

- The other measuring the disagreement among the outputs of the individual members

Therefore:

  • Individual members should ideally be uncorrelated or even negatively correlated (Diversity)

  • An ideal committee would consist of:

    • Highly accurate members,

    • which disagree among themselves as much as possible

Possible tradeoffs exist between the above two requirements
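The two-component decomposition above can be checked numerically for an averaging committee: the committee's squared error equals the members' average squared error minus their average disagreement about the committee output. The member outputs and target below are illustrative:

```python
# Numerical check of the error/disagreement decomposition for one case.

preds = [2.0, 4.0, 9.0]   # outputs of 3 members for one case (illustrative)
target = 5.0

committee = sum(preds) / len(preds)                                  # averaged output
avg_member_err = sum((p - target) ** 2 for p in preds) / len(preds)  # average member error
disagreement = sum((p - committee) ** 2 for p in preds) / len(preds) # average disagreement
committee_err = (committee - target) ** 2

# Identity: committee_err = avg_member_err - disagreement, so the
# committee can never do worse than the average individual member.
```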



Ensuring diversity: Committee of Experts

Individual member networks belong to totally different learning paradigms, e.g. Neural networks, Nearest neighbor classifiers, Classification and Regression Trees (CART), etc.

Advantages:

  • Members can use the same full training dataset and the same full set of input features

    → No conflict between individual quality and collective diversity

Disadvantages:

  • Requires the use of different tools and expertise



Ensuring diversity: Same learning paradigm

Develop member networks with same paradigm using:

  • Different training subsets

  • or Different input features

  • or Different training conditions

    • Neural networks:

      - Different architectures (MLP, RBF), algorithms (BP, SA), topologies, initial random weights, neuron transfer functions, learning rates, momentum, stopping criteria, … (Research topic)

    • Abductive networks (self-organizing - limited user choices!):

      - Different values for the model complexity parameter (CPM)

Sacrifice individual quality for collective diversity!



CPM (Complexity Penalty Parameter: 0.1 to 10)

Lower CPM → More complex model

Can be used as a method for ranking input features:

Those selected earlier are better predictors



Some Results: Classification for Medical Diagnosis

1. The Pima Indians Diabetes Dataset from the UCI Machine Learning Repository

  • 768 cases: (669 for training and 99 for evaluation)

  • 8 numerical attributes on physiological measurements and medical test results

  • A binary class variable (Diabetic:1, Not Diabetic: 0)

  • Percentage of positives in the total set: 34.9%

  • Typical classification accuracies reported for C4.5 decision tree tool: 74.6%



Classification of the Diabetes Dataset

  • Optimum monolithic abductive network model using full training dataset and feature set

  • Two abductive network ensemble approaches:

    • A: 3 members trained on the same (full) training set at different CPM values (model complexity parameters) (NT = 669 cases)

    • B: 3 members trained on 3 mutually exclusive subsets of the training set at the same CPM value (NT = 223 cases)



Ensemble-A: members trained on the same training data with different CPMs → errors by different members are highly correlated (i.e. they err together; less independent)

Ensemble-B: members trained on different training data with the same CPM → errors by different members are poorly correlated (i.e. they err differently; more independent)


Classification of the Diabetes Dataset

[Results table comparing the Monolithic model, Ensemble-A, and Ensemble-B]



Some Results: Classification for Medical Diagnosis

2. The Cleveland Heart Disease Dataset from the UCI Machine Learning Repository

  • 270 cases: (190 for training and 80 for evaluation)

  • 13 numerical attributes

  • A binary class variable: Presence 1/Absence 0 of heart disease

  • Percentage of positives in the total set: 44.4 %

  • Typical classification accuracies reported with neural networks: 81.8%



Classification of the Heart Disease Dataset

  • Optimum monolithic abductive network model using full training dataset and feature set

  • 3-member abductive network ensemble:

    • The available training set is small → not practical to split it

    • For diversity: members trained on the same (full) training set but using different (mutually exclusive) subsets of input features



Classification of the Heart Disease Dataset

  • To ensure good (and uniform) quality of all member networks, good quality input features must be distributed uniformly amongst members

  • First, rank the input features based on predictive quality

  • Then distribute the features on the 3 members fairly



Classification of the Heart Disease Dataset


Some Results: Regression: Electrical Load Forecasting

  • Short term: hours to a week

  • Medium term: months to a year

  • Long term: up to 20 years

Short-term (ST) Forecasting:

- Hourly load profile

- Daily peak load

ST forecasts are useful for scheduling:

- Generator unit commitment

- Short-term maintenance

- Fuel allocation

- Evaluating interchange transactions in deregulated markets



[Plot: 28 years of historical load data showing weekday and weekend profiles, seasonal yearly variations, and a general upward trend]

Short-Term Forecasting:

Factors affecting the load

  • Time, Calendar:

    Hourly, daily, seasonal,

    holidays, school year, …

  • Weather: (Heating/cooling loads)

    Temperature, humidity, wind speed,

    cloud cover, …

  • Economic, Societal: (Slow trending effect)

    Industrial growth, electricity pricing, population growth, …

  • Events:

    Start/stop of large loads: Sports and TV shows, …



Forecasting tomorrow’s Peak Load

  • 47 Inputs representing:

    - Peak load,

    - Max and min temperatures,

    - Day type: WRK, SAT, SUN/HOLIDAY

    over the previous 7 days and the forecasted day

  • Training: 3 years (1987-89)

    Evaluation: 4th year (1990)

  • Trend management (2 ways):

    - Use an additional trend input

    - Normalize all training years to the last training year and then denormalize the model output

[Diagram: tomorrow’s peak-load forecaster with 7 × 6 = 42 inputs over the previous 7 days, 5 inputs for the forecasted day, and 1 trend input (48 inputs in total) producing the peak-load output]



Forecasting tomorrow’s Peak Load: Monolithic Neural and Abductive Networks

Abductive model: only 8 of the 48 available inputs are automatically selected during training

[Plots: forecasted vs. actual load (MW) over the evaluation year; MAPE 2.61% (neural) and 2.52% (abductive)]

Performance over evaluation year:

  • Mean Absolute Percentage Error (MAPE) = 2.52%

  • Correlation coefficient between true and predicted data = 0.986
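The MAPE figure of merit quoted above can be computed as follows; the load values used here are illustrative, not the presentation's data:

```python
# Mean Absolute Percentage Error over paired actual/forecast series.

def mape(actual, predicted):
    """MAPE in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)

actual    = [100.0, 200.0, 400.0]   # illustrative loads, MW
predicted = [ 98.0, 205.0, 404.0]
err = mape(actual, predicted)       # percentage errors: 2%, 2.5%, 1%
```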



Improving Forecasting Accuracy

Using Abductive Ensembles

  • Three-member ensemble to forecast load for the year 1990

Training:

  • Members are trained on raw data for the three preceding years: 1987, 1988, 1989

  • No need for a trend input when training on 1-year data

For evaluation, for each model:

  • Normalize the evaluation-year load data to the model’s training year at the input

  • Denormalize the model output to the evaluation year before combining the 3 outputs
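The per-member normalize/denormalize step can be sketched as follows. Using mean-load ratios as the scale factors and a simple linear "member" model are illustrative assumptions:

```python
# Scale evaluation-year inputs to a member's training year, forecast,
# then scale the forecast back before combining with other members.

def forecast_with_member(member, x_eval, train_year_mean, eval_year_mean):
    """Normalize the input to the member's training year, forecast,
    then denormalize the output back to the evaluation year."""
    scale = train_year_mean / eval_year_mean
    y_train_scale = member(x_eval * scale)   # input normalized to training year
    return y_train_scale / scale             # output denormalized to eval year

# Illustrative member trained on a year with 900 MW mean load; it simply
# predicts 1.1x its (training-year-scale) input load.
member = lambda x: 1.1 * x
y = forecast_with_member(member, x_eval=1000.0,
                         train_year_mean=900.0, eval_year_mean=1000.0)
```

With three such members (one per training year), the three denormalized outputs would then be averaged.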



Abductive Network EnsemblesResults

Statistically significant error reduction



Summary

  • Network ensembles (committees) can lead to significant performance gains in classification and regression

  • Members need to be both accurate and independent

  • Independence is more difficult to achieve with abductive networks than with neural networks

  • Effective ways to achieve this are: Different training datasets, different input features, and different model complexity (CPMs)

  • Demonstrated the technique on medical data (classification) and electrical load forecasting (regression)

