MULTI-LAYERED SOFTWARE SYSTEM FRAMEWORK FOR DISTRIBUTED DATA MINING

Presentation Transcript


  1. MULTI-LAYERED SOFTWARE SYSTEM FRAMEWORK FOR DISTRIBUTED DATA MINING Masum Serazi, Amal Perera, Qiang Ding, Vasiliy Malakhov, William Perrizo. North Dakota State University, Computer Science Department

  2. Outline • Introduction to Distributed Data Mining • Demands • Existing Projects • Architecture • Importance of a Layered Architecture • A Prototype System • System Architecture (layered) • Server • Communication • Client & GUI • System Characteristics • Conclusion

  3. Demands of distributed data mining • Large dataset size • Diversity of data • Geographic distribution of users and resources • Computationally intensive result generation

  4. Large-scale distributed data mining projects • Kensington project • Mining enterprise data distributed across the Internet. • Papyrus project • Based on mobile agents implemented in Java. • PaDDMAS • A component-based tool set that integrates pre-developed or custom packages. • JAM • Agent-based distributed system developed to mine data stored at different sites. • BODHI • Collective data mining with emphasis on learning from vertically partitioned data.

  5. Architecture • Client-Server • Advantage: • Able to use high performance computing on the server side to do the data mining. • Agent based • Hybrid

  6. Importance of a Layered Architecture • A layered framework helps to manage complexity. • Provides the flexibility to add, remove, or modify layers and the components of a layer. • Allows better tracking of the progress of large, complex projects. • Human input is required to tune the data and the algorithms to suit the need (the mix of greyware versus software can be changed over time).

  7. System Architecture • DataMIME™ was developed as a proof of concept. • Based on the patent-pending "P-tree technology". • Efficient and scalable system. • Flexible plug-ins. • Conceptual view of the system [Figure labels: Capture dataset to DataMIME™; Integrate data (synchronize to existing); Mine on DataMIME™; System performance analysis; Internet; Client Side; Server Side; Master Server; One of the Slave Servers]

  8. Server Architecture • Data capture and integration layer (DCI/DII) • Data mining interface (DMI) • Distributed Ptree Management Interface (DPMI) • Uniform data structure • Data mining algorithms (DMA) • Client-server communication • Client interface [Figure labels: DCI/DII Layer (room for new feeder); DMI Layer; DMA Layer (already plugged algorithm; plugs for new algorithms); DPMI: Distributed Ptree Management Interface; Distributed Ptree database]

  9. The Distributed P-tree Database • The Distributed P-tree database (DPD) collects all data in vertical format (as opposed to the ubiquitous horizontal, record-based data structure used in DBMSs), as Predicate-trees (P-trees), based on the patent-pending P-tree technology. • P-trees can be 0-dimensional, 1-dimensional, 2-dimensional, etc. • Next slide shows the detailed construction of 1-D P-trees from a generic horizontal table of data. [Figure labels: DCI/DII Layer (room for new feeder); DMI Layer; DMA Layer (already plugged algorithm; plugs for new algorithms); DPMI: Distributed Ptree Management Interface; Distributed P-tree database (DPD)]
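
The vertical format described on this slide can be illustrated with a minimal Python sketch. It assumes each attribute is a 3-bit unsigned integer and uses the rows decoded from the binary table on the next slide; the function name `bit_slices` is hypothetical and is not part of DataMIME.

```python
from typing import List, Sequence

# Rows decoded from the binary table on the next slide (assumed 3 bits per attribute).
ROWS = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
        (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]


def bit_slices(records: Sequence[Sequence[int]], bits: int) -> List[List[List[int]]]:
    """Return slices[a][b]: the column holding bit b of attribute a for every record.

    Bit 0 is the most significant bit, so slices[0][0] corresponds to R11.
    """
    n_attrs = len(records[0])
    return [
        [[(rec[a] >> (bits - 1 - b)) & 1 for rec in records] for b in range(bits)]
        for a in range(n_attrs)
    ]


slices = bit_slices(ROWS, bits=3)
r11 = slices[0][0]  # first bit of attribute A1, read down the table
```

Each of the twelve columns produced this way is one bit slice; a P-tree is a compressed form of such a column.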

  10. Current practice: structure data into horizontal records and process them vertically (scans). Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic Ptree.

      Horizontally structured records R(A1 A2 A3 A4), three bits per attribute:

      R[A1] R[A2] R[A3] R[A4]
      010   111   110   001
      011   111   110   000
      010   110   101   001
      010   111   101   111
      101   010   001   100
      010   010   001   101
      111   000   001   100
      111   000   001   100

      Scanned vertically, the table yields the bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43, which compress into the basic Ptrees P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43. For example, R11 = 0 0 0 0 1 0 1 1.

      The top-down construction of the 1-dimensional Ptree representation of R11, denoted P11, records the truth of the universal predicate "pure 1" in a tree, recursively on halves, until purity is achieved. The compression of R11 into P11 goes as follows:
      1. Whole is pure1? false → 0
      2. Left half pure1? false → 0 (but it is pure (pure0), so this branch ends)
      3. Right half pure1? false → 0
      4. Left half of right half pure1? false → 0
      5. Right half of right half pure1? true → 1 (and it is pure, so this branch ends)
      6. Left half of left half of right half pure1? true → 1
      7. Right half of left half of right half pure1? false → 0

      To count occurrences of the tuple 7,0,1,4 (bit pattern 111 000 001 100), horizontally AND the basic Ptrees, complementing (′) those whose pattern bit is 0: P11 ^ P12 ^ P13 ^ P′21 ^ P′22 ^ P′23 ^ P′31 ^ P′32 ^ P33 ^ P41 ^ P′42 ^ P′43. The result is 0 at the 2^3 level, 0 0 at the 2^2 level, and 0 1 at the 2^1 level, so the count is 2.
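
The construction and the counting step on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the DataMIME implementation: a tree is represented as `1` (pure-1), `0` (pure-0), or a tuple `(0, left, right)` for a mixed node, the row count is assumed to be a power of two, and all function names are hypothetical.

```python
from functools import reduce

# Rows decoded from the binary table on this slide (3 bits per attribute).
ROWS = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
        (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]
BITS = 3


def bit_slice(attr, bit):
    """Column of bit `bit` (0 = most significant) of attribute `attr`."""
    return [(row[attr] >> (BITS - 1 - bit)) & 1 for row in ROWS]


def build(bits):
    """Record 'pure 1?' recursively on halves until a run is pure."""
    if all(bits):
        return 1                      # pure-1: branch ends
    if not any(bits):
        return 0                      # pure-0: branch ends
    mid = len(bits) // 2
    return (0, build(bits[:mid]), build(bits[mid:]))


def complement(tree):
    """Flip every leaf; mixed nodes stay mixed."""
    if tree in (0, 1):
        return 1 - tree
    return (0, complement(tree[1]), complement(tree[2]))


def ptree_and(a, b):
    """AND two trees built over the same number of rows."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    left, right = ptree_and(a[1], b[1]), ptree_and(a[2], b[2])
    return 0 if left == 0 and right == 0 else (0, left, right)


def root_count(tree, size):
    """Number of 1 bits represented by a tree covering `size` rows."""
    if tree == 1:
        return size
    if tree == 0:
        return 0
    return root_count(tree[1], size // 2) + root_count(tree[2], size // 2)


def count_tuple(values):
    """Count rows equal to `values` by ANDing one P-tree per bit position."""
    trees = []
    for attr, value in enumerate(values):
        for bit in range(BITS):
            tree = build(bit_slice(attr, bit))
            wanted = (value >> (BITS - 1 - bit)) & 1
            trees.append(tree if wanted else complement(tree))
    return root_count(reduce(ptree_and, trees), len(ROWS))


p11 = build(bit_slice(0, 0))
matches = count_tuple((7, 0, 1, 4))
```

Here `p11` follows the seven numbered steps for R11, and `count_tuple((7, 0, 1, 4))` performs the twelve-way AND without scanning the horizontal records.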

  11. 2-D P-tree Data Structure • Peano or Z-ordering • Pure (Pure-1/Pure-0/Mixed) quadrants • Root count (count of 1s in the tree) • Provides an efficient format for ANDing, ORing and complementing. • Lossless, compressed, count-computation-ready representations. [Figure: an 8×8 bit matrix with rows 11111100, 11111000, 11111100, 11111110, 11111111, 11111111, 11111111, 01111111, and its 2-D P-tree, whose nodes by level are: m; 1 m m 1; m 0 1 m and 1 1 m 1; 1 1 1 0, 0 0 1 0 and 1 1 0 1]
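
A short Python sketch of the 2-D structure follows. It assumes the 64 bits in this slide's figure are the rows of an 8×8 matrix read left to right, and that quadrants are visited in Z-order (top-left, top-right, bottom-left, bottom-right). The representation (`1`, `0`, or `("m", children)`) and the function names are illustrative choices, not taken from the presentation.

```python
# The 8x8 bit matrix from the figure, one string per row (assumed row-major).
GRID = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11111111",
    "11111111",
    "11111111",
    "01111111",
]


def build2d(grid, row=0, col=0, size=None):
    """Return 1 (pure-1), 0 (pure-0) or ("m", [four children in Z-order])."""
    if size is None:
        size = len(grid)
    cells = [grid[r][c]
             for r in range(row, row + size)
             for c in range(col, col + size)]
    if all(ch == "1" for ch in cells):
        return 1
    if all(ch == "0" for ch in cells):
        return 0
    half = size // 2
    return ("m", [
        build2d(grid, row, col, half),                # top-left
        build2d(grid, row, col + half, half),         # top-right
        build2d(grid, row + half, col, half),         # bottom-left
        build2d(grid, row + half, col + half, half),  # bottom-right
    ])


def root_count(tree, size):
    """Count the 1 bits in a quadrant of side `size` without decompressing it."""
    if tree == 1:
        return size * size
    if tree == 0:
        return 0
    return sum(root_count(child, size // 2) for child in tree[1])


tree = build2d(GRID)
ones = root_count(tree, len(GRID))
```

Pure quadrants are never expanded, which is where the compression comes from; the root count is obtained from the quadrant sizes alone.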

  12. DCI/DII (Data Capture and Data Integration Interface) layer • Allows the user to capture data and integrate it into the required format (P-tree format). • The main component of this layer is the feeder. • An individual feeder processes one particular format of incoming data. • Users can write their own feeders and plug them easily into this architecture. [Figure labels: DCI/DII Layer; already plugged feeder; room for new feeder]
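
One way such a feeder plug-in point could look is sketched below in Python. The presentation does not describe the actual feeder interface, so the registry, the decorator `register_feeder`, the `capture` function, and the `"csv"` format name are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# A feeder turns raw input in one format into rows of integer attributes.
Feeder = Callable[[str], List[List[int]]]

FEEDERS: Dict[str, Feeder] = {}  # hypothetical registry: format name -> feeder


def register_feeder(fmt: str):
    """Decorator that plugs a feeder in under the given format name."""
    def decorator(fn: Feeder) -> Feeder:
        FEEDERS[fmt] = fn
        return fn
    return decorator


@register_feeder("csv")
def csv_feeder(text: str) -> List[List[int]]:
    """Parse comma-separated integer rows, one record per line."""
    return [[int(x) for x in line.split(",")]
            for line in text.strip().splitlines()]


def capture(fmt: str, payload: str) -> List[List[int]]:
    """Dispatch incoming data to the feeder registered for its format."""
    try:
        feeder = FEEDERS[fmt]
    except KeyError:
        raise ValueError(f"no feeder plugged in for format {fmt!r}") from None
    return feeder(payload)


rows = capture("csv", "2,7,6,1\n3,7,6,0\n")
```

Supporting a new input format then only requires registering another function; the capture path itself does not change.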

  13. DMI (Data Mining Interface) layer • The DMI does counting, the most important operation for data mining provided by P-trees, including: • basic P-trees • value P-trees • tuple P-trees • interval P-trees • cube P-trees • The DMI also provides the P-tree algebra, which has four operations: • AND • OR • NOT (complement) • XOR. These implement the pointwise logical operations on P-trees for the data mining algorithms (DMA).
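
The four operations and the counting step can be illustrated with a small Python sketch. As a simplifying assumption it uses uncompressed bit vectors stored in Python integers as a stand-in for compressed P-trees; the class name `BitVector` and its methods are hypothetical. The sample slices are the three bit columns of attribute A1 from slide 10.

```python
class BitVector:
    """Fixed-length bit vector supporting AND, OR, NOT, XOR and a 1-bit count."""

    def __init__(self, bits):
        self.n = len(bits)
        self.word = 0
        for b in bits:                      # first element becomes the highest bit
            self.word = (self.word << 1) | (1 if b else 0)

    @classmethod
    def _from_word(cls, word, n):
        vec = cls([])
        vec.n, vec.word = n, word
        return vec

    def _check(self, other):
        if self.n != other.n:
            raise ValueError("bit vectors must have the same length")

    def __and__(self, other):
        self._check(other)
        return self._from_word(self.word & other.word, self.n)

    def __or__(self, other):
        self._check(other)
        return self._from_word(self.word | other.word, self.n)

    def __xor__(self, other):
        self._check(other)
        return self._from_word(self.word ^ other.word, self.n)

    def __invert__(self):
        # Mask to n bits so the complement stays a fixed-length vector.
        return self._from_word(~self.word & ((1 << self.n) - 1), self.n)

    def count(self):
        """Root count: the number of 1 bits."""
        return bin(self.word).count("1")


# Bit slices of attribute A1 from slide 10.
r11 = BitVector([0, 0, 0, 0, 1, 0, 1, 1])
r12 = BitVector([1, 1, 1, 1, 0, 1, 1, 1])
r13 = BitVector([0, 1, 0, 0, 1, 0, 1, 1])

value_7 = r11 & r12 & r13        # rows where A1 = 111
value_2 = ~r11 & r12 & ~r13      # rows where A1 = 010
```

A value P-tree is the AND of the basic slices (complemented where the value's bit is 0), and its root count is the number of rows holding that value.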

  14. Distributed Ptree Management Interface (DPMI) Layer • The DPMI layer provides: • access • location • and concurrency transparency • by hiding the fact that: • data representation may differ • resources may be located in different places • resources may be shared by several competing users. • By resource we mean data and its converted form, the Ptree.

  15. DMA (Data Mining Algorithms) layer • This layer is a collection of data mining tools (algorithms). • Upon receiving a request from the client side, an algorithm is started to do the mining. • This layer depends on the DMI for accessing the meta-information and counts needed by: • Ptree-based K Nearest Neighbor (PKNN) • Podium Incremental Neighbor Evaluator (PINE) • P-BAYESIAN • etc. • The architecture has the flexibility to plug in any new algorithm at this layer.

  16. Communication • The communication between different layers is designed to minimize the data flow over the network. • In the DCI and DMA communication protocols, a client creates a connection, sends a request, receives a response, and closes the connection. A client sends only one request in a single-threaded connection. The response to a request is a line with a message indicating the outcome of the request. • A DMA protocol request has a similar structure: a header and an optional set of binary files with checksums. The header in the DMA protocol is a set of key/value pairs (properties). The response to a DMA protocol request also contains key/value pairs.
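
The request structure described on this slide (key/value header plus optional binary files with checksums) might be encoded roughly as in the Python sketch below. The presentation does not give the wire format, so everything concrete here is an assumption: the `key=value` line syntax, the blank line ending the header, the `checksum.<name>` keys, the use of CRC-32, and the property names in the example.

```python
import zlib
from typing import Dict, Optional, Tuple


def encode_header(props: Dict[str, str]) -> str:
    """Serialize properties as key=value lines ended by a blank line (assumed syntax)."""
    lines = []
    for key, value in props.items():
        if "=" in key or "\n" in key or "\n" in value:
            raise ValueError(f"unsupported characters in property {key!r}")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n\n"


def decode_header(text: str) -> Dict[str, str]:
    """Parse key=value lines up to the first blank line."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        if not line:
            break
        key, _, value = line.partition("=")
        props[key] = value
    return props


def build_request(props: Dict[str, str],
                  files: Optional[Dict[str, bytes]] = None
                  ) -> Tuple[bytes, Dict[str, bytes]]:
    """Return the encoded header and the attachments, adding one checksum per file."""
    files = files or {}
    props = dict(props)                       # do not mutate the caller's dict
    for name, data in files.items():
        props[f"checksum.{name}"] = format(zlib.crc32(data), "08x")
    return encode_header(props).encode("utf-8"), files


def verify(props: Dict[str, str], files: Dict[str, bytes]) -> bool:
    """Check every attachment against the checksum carried in the header."""
    return all(props.get(f"checksum.{name}") == format(zlib.crc32(data), "08x")
               for name, data in files.items())


header, attachments = build_request(
    {"algorithm": "PKNN", "k": "5"},          # hypothetical property names
    {"unclassified.bin": b"\x01\x02\x03"},
)
received = decode_header(header.decode("utf-8"))
```

Keeping the header to a handful of properties and sending bulk data only as checksummed attachments is consistent with the stated goal of minimizing data flow over the network.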

  17. Client Structure • The two main functionalities are: • Capture: sends datasets along with their meta-information (description of the data) to the DII/DCI layer of the server for capturing. • Mining: sends requests to the DMA layer to apply data mining applications to previously captured datasets and presents the results. [Figure labels: Client Side DCI (Data; Meta-data generator; Meta Data); Client Side DMA (Unclassified data; Prediction Model; Visualization Tool)]

  18. Client and GUI • On the client side, DataMIME™ has a graphical user interface (GUI) for visual interaction with the user (http://midas.cs.ndsu.nodak.edu/~datasurg/datamime). [Figure labels: Data Capturing; Data Mining]

  19. System Characteristics • Ability to handle formatted, record-based, relational-like data with numerical and/or categorical attributes. The data can be in text format, relational format, or TIFF image format. • Conversion from any other machine-readable format can be provided through customized data feeders. • Users can do any data analysis and mining on data sets in the system, or on any new data they capture or integrate into the system. • Capable of handling large quantities of data and mining them in scalable time. • Clients of the system can run on UNIX and Microsoft Windows platforms; the server is designed to be a UNIX-based system.

  20. System Characteristics (cont.) • Supports major RDBMS platforms. • The server engine can run on a single machine or be distributed across multiple computers for better scalability and efficiency. • The system has an open architecture that provides a high degree of software extensibility and integration capabilities. • The system provides a high level of asynchronous background operation, performing most data-intensive operations in the background or offline and allowing users to continue their work. • The system minimizes the flow of data across the network.

  21. Conclusion • We have shown the importance of having a layered architecture for a distributed data mining system. • Key elements were identified in deciding on the different layers. • We identified a unique, efficient vertical data structure at the lowest layer that can take advantage of the latest hardware. • To facilitate data distribution, a management layer is also recognized. • Two other layers are defined: the data capture layer and the data mining layer. • A prototype system was developed as a proof of concept to show the feasibility of the approach.
