Data Mining and Cross-Validation over distributed / grid enabled networks: current state of the art

Presentation Transcript


  1. Data Mining and Cross-Validation over distributed / grid enabled networks: current state of the art Presented by: Juan Bernal COT4930 - Introduction to Data Mining Instructor: Dr. Khoshgoftaar, Florida Atlantic University Spring, 2008

  2. Topics • Introduction • Cross-Validation: definition and importance • Why is Cross-Validation a computationally intensive task? • Distributing Data Mining processes over a computer network • WEKA and distributed Data Mining: how it is done • Other projects implementing grids/distributed networks • Weka-Parallel • Grid-Weka • Weka4WS • Inhambu • Conclusion

  3. Introduction • Data Mining today is performed on vast amounts of ever-growing data. The need to analyze and extract information from databases in different domains demands more computational resources, while results are expected in the minimum possible time. • Many different projects attempt to address Data Mining over distributed or grid-enabled networks. All of them try to exploit the available computing resources of a grid or networked environment to reduce the time it takes to obtain results, and even to improve the accuracy of the results obtained. • One of the most computationally intensive Data Mining tasks is Cross-Validation, which is therefore the focus of many grid/distributed-network Data Mining tools.

  4. Cross-Validation • Cross-Validation (CV) is the standard Data Mining method for evaluating the performance of classification algorithms, mainly to estimate the error rate of a learning technique. • In CV a dataset is partitioned into n folds; each fold is used once for testing while the remaining folds are used for training. The testing and training procedure is repeated n times so that every partition, or fold, is used exactly once for testing. • The standard way of predicting the error rate of a learning technique, given a single fixed sample of data, is to use stratified 10-fold cross-validation. • Stratification means making sure that each class is properly represented in both the training and test datasets. This is achieved by randomly sampling the dataset when creating the n folds.

  5. 10-Fold Cross-Validation • In a stratified 10-fold Cross-Validation the data is divided randomly into 10 parts, in each of which the class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme is trained on the remaining nine-tenths; its error rate is then calculated on the holdout set. The learning procedure is thus executed a total of 10 times on different training sets, and finally the 10 error rates are averaged to yield an overall error estimate. (Figure: graphical example of 3-fold cross-validation.)
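A minimal sketch of this procedure using the Weka API (assuming a recent Weka release and a placeholder dataset file data.arff; Evaluation.crossValidateModel randomizes the data and, for nominal classes, stratifies the folds):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidationDemo {
        public static void main(String[] args) throws Exception {
            // Load a dataset; "data.arff" is a placeholder file name.
            Instances data = DataSource.read("data.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Stratified 10-fold cross-validation of a J48 decision tree.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println("Error rate: " + eval.errorRate());
        }
    }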

  6. Why is Cross-Validation a computationally intensive task? • When seeking an accurate error estimate, it is standard procedure to repeat the CV process 10 times. This means invoking the learning algorithm 100 times, which is computationally and time intensive, as sketched below. • Given the nature of Cross-Validation, many researchers have worked on executing this process more efficiently over grid or networked computer environments.
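Continuing the sketch above (same data and J48), the 10 times 10-fold procedure looks roughly like this; the 100 learner invocations all happen inside the loop:

    // 10 repetitions of 10-fold CV = 100 invocations of the learner.
    double sum = 0.0;
    for (int seed = 1; seed <= 10; seed++) {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(seed));
        sum += eval.errorRate();
    }
    System.out.println("Mean error over 10 runs: " + (sum / 10));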

  7. Distributing Data Mining processes over a computer network • Different projects, including WEKA itself, have implemented ways to distribute Data Mining processes, and in particular Cross-Validation, over networked computers. Almost all projects use a client-server approach, with technologies such as Java RMI (Remote Method Invocation) and WSRF (Web Services Resource Framework) providing the network communication between clients and servers. • WEKA is the main tool on which the different projects are based, thanks to its easily accessible Java source code and its adaptability.
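As a hedged illustration of this client-server pattern, a hypothetical Java RMI interface for farming out single CV folds might look as follows (the interface and method names are invented for illustration and are not taken from any of the tools discussed):

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import weka.core.Instances;

    // Hypothetical remote service: a server evaluates one CV fold and
    // returns its error rate. Weka's Instances class is Serializable,
    // so it can be passed as an RMI argument.
    public interface FoldEvaluator extends Remote {
        double evaluateFold(Instances data, int numFolds, int fold)
                throws RemoteException;
    }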

  8. WEKA distribution of Data Mining processes over several computers • The WEKA tool contains a feature to split an experiment and distribute it across several processors. • Distributing an experiment involves splitting it into subexperiments that are sent to the hosts via RMI for execution. The experiment can be partitioned by dataset, where each subexperiment is self-contained and applies all schemes to a single dataset. On the other hand, with few datasets the experiment can be partitioned by run; for example, a 10 times 10-fold CV would be split into 10 subexperiments, one per run. • This feature is available from the Experimenter section of the WEKA tool, which is the main section used for research. • Within the Experimenter, the ability to distribute processes is found in the advanced version of the Setup panel.

  9. WEKA requirements for distributing experiments • Each host: • Needs Java installed • Needs access to the databases to be used • Needs to be running the weka.experiment.RemoteEngine experiment server (see the example after this list) • Distributing an experiment works best if the results are sent to a central database by selecting JDBC as the result destination. Otherwise, each host can save its results to a different ARFF file, and the files can be merged afterwards.
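For reference, starting the experiment server on a host looks roughly like the following, per the WEKA Experimenter tutorial (jar names and classpath entries vary by Weka version, so treat this as a hedged example):

    java -classpath remoteEngine.jar:weka.jar \
         -Djava.security.policy=remote.policy \
         weka.experiment.RemoteEngine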

  10. WEKA difficulties for distributed implementation • File and directory permissions can be difficult to set up. • Each host must be manually installed and configured with the Weka experiment server and the remote.policy file, which grants the remote engine permissions for network operations. • Each host must be manually initialized or started. • A centralized database server and its access must be set up. On the positive side, once all these configurations and preparations are done, the experiment can be executed and time can be saved by distributing the workload among the hosts. WEKA Experimenter Tutorial: http://sourceforge.net/project/downloading.php?groupname=weka&filename=ExperimenterTutorial-3.4.12.pdf&use_mirror=internap

  11. Other projects implementing grids / distributed networks for Data Mining and Cross-Validation • Based on Weka, several projects try to improve the process of performing data mining and cross-validation over numerous computers: • Weka-Parallel • Grid-Weka • Inhambu • Weka4WS

  12. Weka-Parallel: Machine Learning in Parallel • Weka-Parallel was created with the intention of running the cross-validation portion of any given classifier very quickly. This speed increase is accomplished by calculating the necessary information simultaneously on many different machines. • For communication from the computer running Weka (the client) to the other computers (the servers), Weka-Parallel uses a simple connection established with the Socket class in the java.net package. Each server starts a daemon that listens on a port; the socket then opens a Data stream and an Object stream to send and receive information. • RMI was not used to manage the client's calls to the servers for the methods that calculate specific folds of the CV. Instead, the client sends integer codes to the servers telling them which methods to run, as sketched below. • Each server receives a copy of the dataset and information on which fold it has to perform. The client computer maintains an index to assign folds to servers, using a Round Robin algorithm.
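A hedged sketch of such a socket daemon (the opcode value and class names are invented for illustration; Weka-Parallel's actual codes and classes differ):

    import java.io.DataInputStream;
    import java.io.ObjectOutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Hypothetical fold server: listens on a port, reads an integer code
    // naming the method to run plus a fold number, and returns the result.
    public class FoldServer {
        static final int EVAL_FOLD = 1; // illustrative opcode

        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(6789)) {
                while (true) {
                    try (Socket client = server.accept();
                         DataInputStream in =
                                 new DataInputStream(client.getInputStream());
                         ObjectOutputStream out =
                                 new ObjectOutputStream(client.getOutputStream())) {
                        if (in.readInt() == EVAL_FOLD) {
                            int fold = in.readInt();             // fold to evaluate
                            out.writeObject(evaluateFold(fold)); // send result back
                        }
                    }
                }
            }
        }

        static Double evaluateFold(int fold) {
            return 0.0; // placeholder: run the classifier on this fold
        }
    }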

  13. Weka-Parallel: Speedup performance analysis • An experiment was done running the J48 decision tree classifier with default parameters on the Waveform-5000 dataset from the UCI repository. The dataset contains 5300 points in 21 dimensions, and the goal is to find the classifier that correctly distinguishes between the 3 classes of waves. A 500-fold cross-validation was run using up to 14 computers with similar hardware. Weka-Parallel link: http://weka-parallel.sourceforge.net/

  14. Grid-Weka • In the Grid-enabled Weka, execution of the following tasks can be distributed across several computers in an ad-hoc Grid: • Building a classifier on a remote machine. • Testing a previously built classifier on several machines in parallel. • Labeling a dataset using a previously built classifier on several machines in parallel. • Using several machines to perform parallel cross-validation. • Labeling involves applying a previously learned classifier to an unlabeled data set to predict instance labels. • Testing takes a labeled data set, temporarily removes the class labels, applies the classifier, and then analyzes the quality of the classification algorithm by comparing the actual and the predicted labels. • Finally, for n-fold cross-validation a labeled data set is partitioned into n folds, and n training and testing iterations are performed. On each iteration, one fold is used as the test set and the rest of the data is used as the training set; a classifier is learned on the training set and then validated on the test data. • Grid-Weka is similar to the Weka-Parallel project, but allows more functions to be performed in parallel on remote machines (and also includes better load balancing, fault monitoring, and dataset management).

  15. Grid-Weka • The labeling function is distributed by partitioning the data set, labeling several partitions in parallel on different available machines, and merging the results into a single labeled data set. • The testing function is distributed in a similar way, with test statistics computed in parallel on several machines for different subsets of the test data. • Distributing cross-validation is also straightforward: the individual iterations for different folds are executed on different machines, as in the sketch below.
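In Weka terms, the train/test pair for fold i of an n-fold run can be produced with the Instances.trainCV/testCV methods and shipped to a worker machine. A minimal sketch of that per-fold split (the helper class is ours, not Grid-Weka's):

    import java.util.Random;
    import weka.core.Instances;

    public class FoldSplit {
        // Build the training and test sets for fold i of an n-fold CV.
        static Instances[] split(Instances data, int n, int i) {
            Instances copy = new Instances(data);
            copy.randomize(new Random(1)); // shuffle before folding
            copy.stratify(n);              // keep class proportions per fold
            return new Instances[] {
                copy.trainCV(n, i),        // all folds except fold i
                copy.testCV(n, i)          // fold i only
            };
        }
    }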

  16. Grid-Weka: Setup details • It uses a custom interface for communication between clients and servers, utilizing native Java object serialization for data exchange. • It is mainly operated in a Java command-line style. • It uses a .weka-parallel configuration file on the client computer to set up the list of servers, in the following format:

    PORT=<Port number>
    <Machine IP address or DNS name> <Number of Weka servers running on this machine> <Max. amount of memory on this machine in MBytes>
    <Machine IP address or DNS name> ...

• For each Weka server, a copy of the Weka software (the .jar file) is placed on the selected machine and the Weka server class is run as follows: java weka.core.DistributedServer <Port number> • If a machine is going to run more than one Weka server, each server should have its own directory so that the results they generate are not mixed.
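For instance, a hypothetical .weka-parallel file describing two machines (host names and values invented for illustration) might read:

    PORT=9000
    node1.example.org 2 1024
    node2.example.org 1 512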

  17. Performance analysis between Weka-Parallel and Grid-Weka • Grid-Weka sacrifices some performance in exchange for more features compared to Weka-Parallel. These features are load balancing, data recovery/fault monitoring, and more data mining functions than just cross-validation. Grid-Weka Development: http://userweb.port.ac.uk/~khusainr/weka/Xin_thesis.pdf Grid-Weka HowTo: http://userweb.port.ac.uk/~khusainr/weka/gweka-howto.html

  18. Inhambu • Inhambu is a distributed object-oriented system that supports the execution of data mining applications on clusters of PCs and workstations. • Inhambu exploits the idle resources in a cluster composed of commodity PCs, supporting the execution of DM applications based on the Weka tool. • Its goal is to improve scheduling and load sharing, overloading and contention avoidance, heterogeneity handling, and fault tolerance when performing Data Mining processes on grids or clusters of computers.

  19. Inhambu: architecture • The architecture of Inhambu implements: • An application layer: consists of a modified implementation of Weka, with specific components implemented and deployed on the client and server sides. The client component runs the user interface and generates DM tasks, while the server contains the core Weka classes that execute the DM tasks. • A resource management layer: provides support for the execution of Weka in a distributed environment. • The trader provides publishing and discovery mechanisms for clients and servers.

  20. Inhambu: Improvements • Scheduling and load sharing: implementation of static and dynamic performance indices. Static performance indices are usually implemented as static values that express or quantify amounts of resources and capacities; after an index is created, dynamic performance monitoring keeps it updated. • Overloading and contention avoidance: implementation of a "best effort" policy: to avoid overloading a computer, it can only be chosen to receive load if its load index is below a given threshold. The default threshold value is 0.7, based on the relationship between the utilization index and the response time of a computer system (see the sketch below). • Heterogeneity: based on the maintained Capacity State Index, the distribution of work can be improved in heterogeneous environments. • Fault tolerance: checkpointing and recovery were implemented on the client side.
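A hedged sketch of that best-effort selection rule (class and method names invented; Inhambu's real scheduler is more elaborate):

    import java.util.List;

    // Hypothetical best-effort selection: a host may receive new load
    // only if its load index is below the threshold (default 0.7).
    public class BestEffortScheduler {
        static final double LOAD_THRESHOLD = 0.7;

        static class Host {
            final String name;
            final double loadIndex; // dynamically monitored utilization
            Host(String name, double loadIndex) {
                this.name = name;
                this.loadIndex = loadIndex;
            }
        }

        static Host pickHost(List<Host> hosts) {
            for (Host h : hosts) {
                if (h.loadIndex < LOAD_THRESHOLD) {
                    return h; // eligible: below the overload threshold
                }
            }
            return null; // every host is overloaded: hold the task
        }
    }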

  21. Inhambu: performance against Weka-Parallel • Performance was evaluated by running experiments on 2 real-world databases: • Adult Census Income, and a dataset for diffuse large B-cell lymphoma (DLBCL). • The first performance test was done to determine scalability, as shown in the tables, when using the J48 and PART classifiers. • Inhambu and Weka-Parallel perform roughly the same for fine-granularity tasks, and Inhambu performs better than Weka-Parallel when running tasks whose granularity is coarser.

  22. Inhambu: Performance on non-dedicated and heterogeneous clusters • Notice that Weka-Parallel can achieve better performance in the presence of shorter tasks, such as J4.8, due to its low communication overhead (it uses sockets). Despite the higher overhead caused by its use of RMI, Inhambu performs better in the presence of longer tasks. Inhambu link: http://inhambu.incubadora.fapesp.br/portal

  23. Weka4WS • The goal of Weka4WS is to extend Weka to support remote execution of data mining algorithms through the Web Services Resource Framework (WSRF). • To enable remote invocation, all the data mining algorithms provided by the Weka library are exposed as a Web Service. • Weka4WS has been developed using the WSRF Java library provided by Globus Toolkit 4 (GT4), which follows the OGSA (Open Grid Services Architecture) model.

  24. Weka4WS structure • In the Weka4WS framework all nodes use the GT4 services for standard Grid functionality, such as security and data management. These nodes fall into two categories: • 1. user nodes, the local machines of the users, which provide the Weka4WS client software; • 2. computing nodes, which provide the Weka4WS Web Services allowing the execution of remote data mining tasks. • A storage node can be added when a centralized database is used.

  25. Weka4WS: Setup details • Weka4WS requires Globus Toolkit 4 on the computing nodes and only the Java WS Core (a subset of Globus Toolkit) on the user nodes. Since GT4 runs only on Unix platforms, the computing nodes need to be Unix or Linux machines. • The Weka4WS client can be installed in either a Unix or a Windows environment. • Due to the web-service-oriented approach there are security requirements: Weka4WS runs in a security context and uses grid-map authorization (only users listed in the service grid-map can execute it), with authentication performed using certificates. • On the client computer a machines file is needed listing all the computing nodes; this is the only setup/configuration Weka4WS needs. The format of this file:

    # ==================== computing node ====================
    # hostname             container port    gridFTP port
    pluto.deis.unical.it   8443              2811

  26. Weka4WS: performance • A performance analysis of Weka4WS was carried out for a typical data mining task in different network scenarios. In particular, the execution times of the different steps needed to perform the overall data mining task were evaluated to determine the overhead on LAN vs. WAN networks. • No performance comparisons were made against other Grid-enabled data mining tools. Weka4WS paper: http://grid.deis.unical.it/papers/pdf/PKDD2005.pdf

  27. Conclusion • The area of Data Mining and Cross-Validation over Grid-enabled environments is in constant development. • The latest efforts try to develop and implement standard frameworks, such as the OGSA (Open Grid Services Architecture), for data mining tools. • From the analysis of the presented tools, Weka4WS appears the most interesting overall. Still, the other projects have positive features that may eventually be consolidated into a single Grid Data Mining tool based on Weka. • Further research will focus on enhancing the performance of the current tools that use RMI and WSRF, to reduce the overhead introduced by communication. A further research topic is the use of available peer-to-peer or Internet networks to facilitate performing data mining tasks over an Internet-wide cluster available to everyone.
