Identifying Feature Relevance Using a Random Forest

##### Presentation Transcript

1. Identifying Feature Relevance Using a Random Forest • Jeremy Rogers & Steve Gunn

2. Overview • What is a Random Forest? • Why do Relevance Identification? • Estimating Feature Importance with a Random Forest • Node Complexity Compensation • Employing Feature Relevance • Extension to Feature Selection

3. Random Forest • Combination of base learners using Bagging • Uses CART-based decision trees

4. Random Forest (cont...) • Optimises each split using Information Gain • Selects a feature at random to perform each split • The implicit feature selection of CART is thereby removed
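The split rule described above can be sketched as a minimal single-node illustration (not the authors' code): one feature is chosen uniformly at random, and only the threshold is optimised by Information Gain.

```python
import math
import random

def entropy(labels):
    # Shannon entropy (bits) of a binary label sequence
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def random_feature_split(X, y, rng):
    # Pick ONE feature uniformly at random (CART's implicit feature
    # selection is removed), then optimise the threshold by Information Gain.
    f = rng.randrange(len(X[0]))
    parent = entropy(y)
    best_gain, best_t = 0.0, None
    for t in sorted({row[f] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[f] <= t]
        right = [lab for row, lab in zip(X, y) if row[f] > t]
        gain = parent - (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(y)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return f, best_t, best_gain
```

A full forest would grow many such trees on bootstrap samples (bagging) and aggregate their votes.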

5. Feature Relevance: Ranking • Analyse features individually • Measure correlation to the target • Feature Xi is relevant if P(Y | Xi) ≠ P(Y) • Assumes no feature interaction • Fails to identify relevant features in the parity problem
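The parity failure is easy to demonstrate: for a 2-bit XOR target, each feature alone has zero mutual information with the target, yet the pair determines it completely (a small sketch, not from the slides).

```python
import math
from itertools import product

def mutual_information(xs, ys):
    # I(X;Y) in bits, from paired discrete samples
    n = len(xs)
    mi = 0.0
    for xv in set(xs):
        for yv in set(ys):
            pxy = sum(1 for x, y in zip(xs, ys) if x == xv and y == yv) / n
            px = xs.count(xv) / n
            py = ys.count(yv) / n
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi

# 2-bit parity: the target is the XOR of the two features
data = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]
f1 = [d[0] for d in data]
f2 = [d[1] for d in data]
y = [d[2] for d in data]
pair = [(a, b) for a, b, _ in data]
```

Ranking each feature by `mutual_information(f1, y)` scores both parity bits as irrelevant, while the joint feature carries one full bit of information.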

6. Feature Relevance: Subset Methods • Use implicit feature selection of decision tree induction • Wrapper methods • Subset search methods • Identifying Markov Blankets • Feature Xi is relevant if there exists a subset S of the remaining features with P(Y | Xi, S) ≠ P(Y | S)

7. Relevance Identification using Average Information Gain • Can identify feature interaction • Reliability is dependent upon node composition • Irrelevant features give non-zero relevance
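A minimal sketch of the average-IG relevance estimate, with single random splits standing in for forest nodes (the dataset and function names are illustrative, not from the paper):

```python
import math
import random

def entropy(labels):
    # Shannon entropy (bits) of a binary label sequence
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def average_information_gain(X, y, n_trials=3000, seed=0):
    # Relevance estimate per feature: the mean Information Gain it
    # achieves over many random splits (a one-node stand-in for a forest).
    rng = random.Random(seed)
    m = len(X[0])
    totals, counts = [0.0] * m, [0] * m
    parent = entropy(y)
    for _ in range(n_trials):
        f = rng.randrange(m)
        t = rng.choice([row[f] for row in X])
        left = [lab for row, lab in zip(X, y) if row[f] <= t]
        right = [lab for row, lab in zip(X, y) if row[f] > t]
        gain = parent - (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(y)
        totals[f] += gain
        counts[f] += 1
    return [tot / max(c, 1) for tot, c in zip(totals, counts)]

# Feature 0 equals the label; feature 1 is pure noise.
data_rng = random.Random(1)
y = [data_rng.randrange(2) for _ in range(40)]
X = [[lab, data_rng.random()] for lab in y]
rel = average_information_gain(X, y)
```

The relevant feature scores far higher, but the noise feature's average is still non-zero, which is exactly the reliability problem the following slides address.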

8. Node Complexity Compensation • Some nodes are easier to split than others • Each sample is weighted by a measure of node complexity • Data are projected onto a one-dimensional space • For binary classification, the complexity is derived from the arrangement of class labels within the node

9. Unique & Non-Unique Arrangements • Some arrangements are reflections of one another (non-unique) • Some arrangements are symmetrical about their centre (unique)

10. Node Complexity Compensation (cont…) • Au: number of unique arrangements
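One plausible way to compute Au, assuming "unique" means distinct up to reflection: by Burnside's lemma, Au = (C(n, i) + number of palindromic arrangements) / 2, since an arrangement symmetrical about its centre is its own reflection. This counting rule is an interpretation of the slide's wording, not a formula taken from the paper.

```python
from math import comb

def unique_arrangements(n, i):
    # Arrangements of i positive examples among n points on a line,
    # identifying each sequence with its reflection (Burnside's lemma):
    # Au = (total + palindromes) / 2.
    # ASSUMPTION: "unique" means distinct up to reflection.
    total = comb(n, i)
    if n % 2 == 0:
        # even length: a palindrome needs an even number of positives
        pal = comb(n // 2, i // 2) if i % 2 == 0 else 0
    else:
        # odd length: the middle cell absorbs the parity of i
        pal = comb(n // 2, i // 2)
    return (total + pal) // 2
```

For example, with n = 4 and i = 2 the six raw arrangements collapse to four once 1100/0011 and 1010/0101 are identified with their mirrors.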

11. Information Gain Density Functions • Node complexity compensation improves the average-IG measure • The effect is visible when examining the IG density functions of each feature • These are constructed by building a forest and recording the frequencies of the IG values achieved by each feature

12. Information Gain Density Functions • RF used to construct 500 trees on an artificial dataset • IG density functions recorded for each feature

13. Employing Feature Relevance • Feature Selection • Feature Weighting • Random Forest uses a Feature Sampling distribution to select each feature. • Distribution can be altered in two ways • Parallel: Update during forest construction • Two-stage: Fixed prior to forest construction

14. Parallel • Control the update rate using confidence intervals • Assume the Information Gain values are normally distributed • The statistic (sample mean − μ) / (s / √n) then has a Student's t distribution with n − 1 degrees of freedom • Maintain the most uniform sampling distribution that lies within the confidence bounds
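The confidence bound on a feature's mean IG can be sketched with only the standard library; the t critical value is obtained here by numerically inverting the Student-t density, which is an implementation detail of this sketch, not something from the slides.

```python
import math

def t_pdf(x, df):
    # Student-t probability density with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_critical(df, conf=0.95):
    # Two-sided critical value t* with P(-t* < T < t*) = conf,
    # found by bisection on a Simpson-rule integral of the density.
    def central_mass(t):
        n = 1000                      # even number of Simpson intervals
        h = 2 * t / n
        s = t_pdf(-t, df) + t_pdf(t, df)
        for k in range(1, n):
            s += (4 if k % 2 else 2) * t_pdf(-t + k * h, df)
        return s * h / 3
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if central_mass(mid) < conf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def confidence_interval(sample, conf=0.95):
    # CI for the mean IG, assuming normally distributed IG values
    # (so the standardised mean is Student-t with n-1 dof).
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half = t_critical(n - 1, conf) * math.sqrt(var / n)
    return mean - half, mean + half
```

The sampling distribution would then be kept as close to uniform as these per-feature bounds allow.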

15. Convergence Rates

16. Results • 90% of data used for training, 10% for testing • Forests of 100 trees were tested and averaged over 100 trials

17. Irrelevant Features • Average IG is the mean of a non-negative sample • The expected IG of an irrelevant feature is therefore non-zero • Performance is degraded when there is a high proportion of irrelevant features
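The non-zero expectation can be computed exactly for a single node: an irrelevant feature sends nL examples left uniformly at random, so the left positive count iL is hypergeometric, and averaging IG over it gives a strictly positive value. This uses the nL/iL notation of the next slide; the enumeration itself is my sketch of how the expectation is taken, not the paper's derivation.

```python
import math
from math import comb

def H(p):
    # binary entropy in bits
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_ig_irrelevant(n, i, nL):
    # Node with n examples, i of them positive. A random (irrelevant)
    # split sends nL examples left, so iL ~ Hypergeometric(n, i, nL).
    parent = H(i / n)
    nR = n - nL
    exp_child = 0.0
    for iL in range(max(0, nL - (n - i)), min(i, nL) + 1):
        p = comb(i, iL) * comb(n - i, nL - iL) / comb(n, nL)
        iR = i - iL
        exp_child += p * (nL / n * H(iL / nL) + nR / n * H(iR / nR))
    return parent - exp_child
```

The expected spurious gain shrinks as the node grows, which is why small nodes are the main source of inflated relevance for irrelevant features.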

18. Expected Information Gain • nL: number of examples in the left descendant • iL: number of positive examples in the left descendant

19. Expected Information Gain (cont…) • Number of positive examples • Number of negative examples

20. Bounds on Expected Information Gain • The upper bound can be approximated • The lower bound is given exactly

21. Irrelevant Features: Bounds • 100 trees built on artificial dataset • Average IG recorded and bounds calculated

22. Friedman dataset: results for FS and CFS

23. Simple dataset: results for FS and CFS

24. Results • 90% of data used for training, 10% for testing • Forests of 100 trees were tested and averaged over 100 trials • 100 trees constructed for feature evaluation in each trial

25. Summary • Node complexity compensation improves the measure of feature relevance by accounting for node composition • The feature sampling distribution can be updated using confidence intervals to control the update rate • Irrelevant features can be removed by calculating their expected performance