Testing neural networks using the complete round robin method

  1. Testing neural networks using the complete round robin method Boris Kovalerchuk, Clayton Todd, Dan Henderson Dept. of Computer Science, Central Washington University, Ellensburg, WA, 98926-7520 borisk@tahoma.cwu.edu toddcl@cwu.edu danno@elltel.net

  2. Problem • Common practice • Patterns discovered by learning methods such as neural networks (NNs) are tested on the data before they are used for their intended purpose. • Reliability of testing • The conclusion about the reliability of discovered patterns depends on the testing methods used. • Problem • A testing method may not test critical situations.

  3. Result • New testing method • Tests a much wider set of situations than the traditional round robin method • Speeds up the required computations

  4. Novelty and Implementation • Novelty • A mathematical mechanism based on the theory of monotone Boolean functions and • multithreaded parallel processing. • Implementation • The method has been implemented for backpropagation neural networks and successfully tested on 1024 neural networks using SP500 data.

  5. 1. Approach and Method • Common approach • 1. Select subsets in the data set D: • subset Tr for training and • subset Tv for validating the discovered patterns. • 2. Repeat step 1 several times for different subsets. • 3. Compare: • if the results are similar to each other, then a discovered regularity can be called reliable for data D.

  6. Known Methods [Dietterich, 1997] • Random selection of subsets • bootstrap aggregation (bagging) • Selection of disjoint subsets • cross-validated committees • Selection of subsets according to a probability distribution • boosting

  7. Problems of sub-sampling • Different regularities • for different sub-samples of Tr • Rejecting and accepting regularities • depends heavily on the specific splitting of Tr • Example: • non-stationary financial time series • bear and bull market trends for different time intervals.

  8. Splitting-sensitive regularities for non-stationary data • Reg1 does not work on A′ = B • Reg2 does not work on B′ = A

  9. Round robin method • Eliminates arbitrary splitting by examining several groups of subsets of Tr. • The complete round robin method examines all groups of subsets of Tr. • Drawback • 2^n possible subsets, where n is the number of groups of objects in the data set. • Learning 2^n neural networks is a computational challenge.

  10. Implementation • The complete round robin method is applicable to both • attribute-based and • relational data mining methods. • The method is illustrated here for neural networks.

  11. Data • Let M be a learning method and • D be a data set of N objects, each represented by m attributes: • D = {d_i}, i = 1,…,N, d_i = (d_i1, d_i2,…, d_im). • Method M is applied to data D for knowledge discovery.

  12. Grouping data • Example -- a stock time series. • the first 250 data objects (days) belong to 1980, • the next 250 objects (days) belong to 1981, and so on. • Similarly, half years, quarters and other time intervals can be used. • Any of these subsets can be used as training data.

  13. Testing data • Example 1 (general data) • Training -- the odd groups #1, #3, #5, #7 and #9 • Testing -- #2, #4, #6, #8 and #10 • Example 2 (time series) • Training -- #1, #2, #3, #4 and #5 • Testing -- #6, #7, #8, #9 and #10 • A third path was also used: • completely independent later data C for testing.

  14. Hypothesis of monotonicity • D1 and D2 are training data sets and • Perform1 and Perform2 are • the performance indicators of the models learned from D1 and D2. • Binary Perform index: • 1 stands for appropriate performance and • 0 stands for inappropriate performance. • The hypothesis of monotonicity (HM): • If D1 ⊇ D2 then Perform1 ≥ Perform2. (1)

  15. Assumption • If data set D1 covers data set D2, then the performance of method M on D1 should be better than or equal to the performance of M on D2. • The assumption is that extra data bring more useful information than noise for knowledge discovery.

  16. Experimental testing of the hypothesis • The hypothesis P(M,Di) ≥ P(M,Dj) for performance is not always true. • Generated all 1024 subsets of the ten years and trained a NN on each. • Found 683 pairs of subsets such that Di ⊇ Dj. • Surprisingly, in this experiment monotonicity was observed for all 683 combinations of years.

  17. Discussion • The experiment • provides strong evidence for using monotonicity along with the complete round robin method. • The incomplete round robin method assumes some kind of independence among the training subsets used. • The discovered monotonicity shows that such independence should at least be tested.

  18. Formal notation • The error Er for data set D = {d_i}, i = 1,…,N, • is the normalized error over all its components d_i, where • T(d_i) is the actual target value for d_i ∈ D and • J(d_i) is the target value forecast delivered by the discovered model J, i.e., the trained neural network in our case.
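
The formula itself was displayed as an image and is lost from the transcript. A minimal sketch of one standard normalized error consistent with the definitions above (the exact norm used on the slide is an assumption):

```latex
% One plausible form of the normalized error over the components of D
% (assumption: the slide's exact formula is not recoverable)
Er(D) = \frac{1}{N} \sum_{i=1}^{N} \bigl| T(d_i) - J(d_i) \bigr|
```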

  19. Performance • Performance is measured by comparing the error Er with an error tolerance (threshold) Q0.
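
The slide's displayed formula is not preserved; from slides 21 and 27 (performance coded as 1 or 0), it is presumably the indicator below:

```latex
% Binary performance indicator (reconstruction consistent with slides 21 and 27)
Perform(M, D, Q_0) =
\begin{cases}
  1, & Er(D) \le Q_0, \\
  0, & Er(D) > Q_0.
\end{cases}
```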

  20. Binary hypothesis of monotonicity • Combinations of years are coded • as binary vectors v_i = (v_i1, v_i2,…, v_i10), like 0000011111, with 10 components, from (0000000000) to (1111111111) • total 2^10 = 1024 data subsets. • Hypothesis of monotonicity: • If v_i ≥ v_j then Perform_i ≥ Perform_j (2) • Here • v_i ≥ v_j ⇔ v_ik ≥ v_jk for all k = 1,…,10.
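
A minimal sketch of this coding and the component-wise order in Python (all names are illustrative, not the authors' code):

```python
YEARS = list(range(1980, 1990))  # the ten training years

def subset_to_vector(years):
    """Code a subset of training years as a 10-component binary vector."""
    return tuple(1 if y in years else 0 for y in YEARS)

def dominates(vi, vj):
    """Component-wise order: vi >= vj iff v_ik >= v_jk for all k."""
    return all(a >= b for a, b in zip(vi, vj))

# Example: {1986, 1988, 1989} is a superset of {1986, 1988},
# so its vector dominates the smaller subset's vector.
assert dominates(subset_to_vector({1986, 1988, 1989}),
                 subset_to_vector({1986, 1988}))
```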

  21. Monotone Boolean Functions • Not every v_i and v_j are comparable with each other by the "≥" relation. • Perform as a quality indicator Q: • Q(M, D, Q0) = 1 ⇔ Perform = 1, (3) • where Q0 is some performance limit. • Rewritten monotonicity (2): • If v_i ≥ v_j then Q(M, Di, Q0) ≥ Q(M, Dj, Q0) (4) • Q(M, D, Q0) is a monotone Boolean function of D. [Hansel, 1966; Kovalerchuk et al., 1996]

  22. Use of Monotonicity • A method M, • data sets D1 and D2, with D2 ⊆ D1. • Monotonicity: • if the method M does not perform well on the data D1, then it will not perform well on the data D2 either; • under this assumption, we do not need to test method M on D2.
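
A sketch of how this pruning works over all 2^n subset vectors (the evaluate function is a hypothetical stand-in for training and testing one neural network; the real system orders the evaluations with Hansel chains, as slides 39 and 44-46 describe):

```python
from itertools import product

def prune_with_monotonicity(evaluate, n=10):
    """Determine Q for every vector, skipping those decided by monotonicity.

    evaluate(v) -- hypothetical: trains a NN on the subset coded by v
    and returns its binary Perform value (0 or 1).
    """
    known = {}  # vector -> 0/1, measured or inferred
    for v in product((0, 1), repeat=n):
        if v in known:
            continue
        q = evaluate(v)
        known[v] = q
        for u in product((0, 1), repeat=n):
            if u in known:
                continue
            if q == 0 and all(a <= b for a, b in zip(u, v)):
                known[u] = 0  # u uses less data than v: cannot do better
            elif q == 1 and all(a >= b for a, b in zip(u, v)):
                known[u] = 1  # u uses more data than v: at least as good
    return known
```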

  23. Experiment • SP500 • running method M 250 times instead of the complete 1024 times • (see the tables on the following slides)

  24. Experiments with SP500 and Neural Networks • A backpropagation neural network for predicting the SP500 [Rao, Rao, 1993] • achieved a 0.6% prediction error on the SP500 test data (50 weeks) with 200 weeks (about four years) of training data. • Is this result reliable? Will it be sustained for wider training and testing data?

  25. Data for testing reliability • The same SP500 data: • Training data -- all trading weeks from 1980 to 1989 and • Independent testing data -- all trading weeks from 1990 to 1992.

  26. Complete round robin • Generated • all 1024 subsets of the ten training years (1980-1989). • Computed • the 1024 corresponding backpropagation neural networks and • their Perform values. • 3% error tolerance threshold (Q0 = 0.03) • (higher than the 0.6% used for the smaller data set in [Rao, Rao, 1993]).

  27. Performance • A satisfactory performance is coded as 1 and • an unsatisfactory performance is coded as 0.

  28. Analysis of performance • Consistent performance (training and testing): 67.05% + 28.25% = 95.3% with 3% error tolerance • sound performance on training data => sound performance on testing data (67.05%) • unsound performance on training data => unsound performance on testing data (28.25%) • A random choice of training data from the ten-year SP500 data will fail to produce a regularity in 32.95% of cases, although regularities useful for forecasting do exist.

  29. Specific Analysis • Table 2 • 9 nested data subsets of the 1024 possible subsets • Begin the nested sequence with a single year (1986). • This single year's data set is too small to train a neural network to a 3% error tolerance.

  30. Specific analysis • 9 nested data subsets of the 1024 possible subsets (table not reproduced in the transcript)

  31. Nested sets • A single year (1986) • is too small to train a neural network to a 3% error tolerance. • 1986, 1988 and 1989 • produce the same negative result. • 1986, 1988, 1989 and 1985: • the error moved below the 3% threshold. • Five or more years of data • also satisfy the error criterion. • The monotonicity hypothesis is confirmed.

  32. Analysis • Among all other 1023 combinations of years: • only a few combinations of four years satisfy the 3% error tolerance, • practically all five-year combinations satisfy the 3% threshold, and • all combinations of more than five years satisfy this threshold.

  33. Analysis • Four-year training data sets produce marginally reliable forecasts. • Example: • 1980, 1981, 1986, and 1987, corresponding to the binary vector (1100001100), do not satisfy the 3% error tolerance.

  34. Further analyses • Three levels of error tolerance Q0: 3.5%, 3.0% and 2.0%. • The number of neural networks • with sound performance drops from 81.82% to 11.24% as the error tolerance moves from 3.5% to 2%. • NNs with a 3.5% error tolerance • are much more reliable than networks with a 2.0% error tolerance.

  35. Three levels of error tolerance

  36. Performance

  37. Standard random choice method • A random choice of training data • with a 2% error tolerance will more often reject the training data as insufficient. • This standard approach does not even let us know how unreliable the result is without running all 1024 subsets in the complete round robin method.

  38. Reliability of the 0.6% error tolerance • Rao and Rao [1993] • obtained the 0.6% error for 200 weeks (about four years) out of the 10-year training data. • This result is unreliable (Table 3).

  39. Underlying mechanism • The number of computations depends on the sequence of testing data subsets Di. • To optimize the testing sequence, Hansel's lemma [Hansel, 1966; Kovalerchuk et al., 1996] from the theory of monotone Boolean functions is applied to so-called Hansel chains of binary vectors.
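
A minimal sketch of the standard doubling construction of Hansel chains (a symmetric chain decomposition of the binary cube); this is an illustration, not the authors' code:

```python
def hansel_chains(n):
    """Decompose {0,1}^n into Hansel chains.

    Each chain is a list of binary vectors (tuples) increasing under the
    component-wise <= order; a monotone Boolean function switches from 0
    to 1 at most once along each chain.
    """
    chains = [[(0,), (1,)]]  # the single chain for n = 1
    for _ in range(n - 1):
        doubled = []
        for chain in chains:
            # Append 0 to every vector, then top the chain with last-vector + 1.
            doubled.append([v + (0,) for v in chain] + [chain[-1] + (1,)])
            # Append 1 to all but the last vector (dropped when empty).
            rest = [v + (1,) for v in chain[:-1]]
            if rest:
                doubled.append(rest)
        chains = doubled
    return chains

# For n = 10 this yields 252 chains that together cover all 1024 vectors.
```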

  40. Computational challenge of the complete round robin method • Monotonicity and multithreading significantly speed up computing. • Using monotonicity with 1023 threads decreased the average runtime about 3.5 times, from 15-20 minutes to 4-6 minutes, for training 1023 neural networks in the case of mixed 1s and 0s for output.

  41. Error tolerance and computing time • Different error tolerance values can change the output and the runtime, • e.g., the extreme cases with all 1's or all 0's as outputs. • File preparation to train 1023 neural networks: • the largest share of file-preparation time (41.5%) is taken by files using five years in the data subset (Table 5).

  42. Runtime

  43. Time distribution

  44. General logic of the software • The exhaustive option: • generate 1024 subsets using a file preparation program and • compute backpropagation for all subsets. • The optimized option: • generate Hansel chains and store them; • compute backpropagation only for the specific subsets dictated by the chains.

  45. Implemented optimized option • 1. For a given binary vector, the corresponding training data are produced. • 2. Backpropagation is computed, generating the Perform value. • 3. The Perform value is used along with the stored Hansel chains to decide which binary vector (i.e., subset of data) will be used next for learning neural networks.
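
A simplified sketch of this driver loop, reusing hansel_chains from the earlier sketch; train_and_score is a hypothetical stand-in for steps 1-2 above, and the real method also propagates Perform values across chains via Hansel's lemma, which this within-chain version omits:

```python
def optimized_round_robin(train_and_score, n=10):
    """Walk the Hansel chains bottom-up, skipping vectors already decided.

    train_and_score(v) -- hypothetical: prepares the training file for
    vector v, runs backpropagation, and returns the binary Perform value.
    """
    decided = {}
    for chain in hansel_chains(n):
        for i, v in enumerate(chain):
            if v in decided:
                continue
            q = train_and_score(v)
            decided[v] = q
            if q == 1:
                # Monotone along the chain: every larger vector is also 1.
                for u in chain[i + 1:]:
                    decided.setdefault(u, 1)
                break
    return decided
```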

  46. System Implementation • The exhaustive set of Hansel chains is generated first and stored for the desired case. • The 1024 subsets of data are created using a file preparation program. • Following the sequence dictated by the Hansel chains, we produce the corresponding training data and compute backpropagation to generate the Perform value. The next vector to be used is chosen based on the stored Hansel chains and the Perform value.

  47. User Interface and Implementation • Built to take advantage of Windows NT threading. • Built with Borland C++ Builder 4.0. • Parameters for the neural networks and sub-systems are set up through the interface. • Various options, such as monotonicity, can be turned on or off through the interface.

  48. Multithreaded Implementation • Independent sub-processes • Learning of an individual neural network or a group of neural networks. • Each sub-process is implemented as an individual thread. • Threads run in parallel on several processors to further speed up computations.
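
A minimal sketch of the same one-task-per-network structure in Python (the original uses native Windows NT threads from Borland C++; train_network is a hypothetical stand-in, and for CPU-bound pure-Python training a ProcessPoolExecutor would parallelize better):

```python
from concurrent.futures import ThreadPoolExecutor

def train_all(vectors, train_network, workers=8):
    """Train one network per subset vector, each as an independent task.

    train_network(v) -- hypothetical: trains a backpropagation NN on the
    data subset coded by v and returns its binary Perform value.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(train_network, vectors))
    return dict(zip(vectors, results))
```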

  49. Threads • Two different types of threads: • worker threads and • communication threads. • Communication threads send and receive the data. • Worker threads are started by the servers to perform the calculations on the data.

  50. Client/Server Implementation • Decreases the workload per computer. • The client does the data formatting; the server performs all the data manipulation. • A client connects to multiple servers. • Each server performs work on the data and returns the results to the client. • The client formats the returned results and gives a visualization of them.
