Distributed Indexed Outlier Detection Algorithm

Status Update as of March 11, 2014

A Parallel iOrca Algorithm
• iOrca is a very efficient serial algorithm for outlier detection in data
  • https://c3.nasa.gov/dashlink/resources/450/
• 3 key points for iOrca’s performance
  • Indexing the data prior to analysis to front-load potential outliers
    • Select random point R from data, order data based on decreasing distance from R. Given that R is much more likely to be an inlier, distance from R is a good predictor of outliers
  • Break indexed data into blocks for processing
    • Each data point in the processed block is compared to points in the full data set, navigating in a spiral fashion until k neighbors are located closer than the current cutoff. Spiraling finds neighbors more quickly, while the fast increase in cutoff threshold (by indexing relative to R) means fewer comparisons are required to identify k neighbors
  • Global early termination function
    • Once any point is processed where (the distance from that point to R) + (the distance from R to R’s k-th neighbor) < the current cutoff, all processing is complete
• New DIO/iOrca algorithm applies these same techniques, but distributes the analysis of blocks of data over many processors
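The indexing, blocking, and early-termination steps above can be sketched in a few lines of Python. This is an illustrative sketch, not iOrca's actual code: the function names are hypothetical, Euclidean distance is assumed, and the outlier score is assumed to be bounded by the distance to the k-th nearest neighbor.

```python
import math
import random


def index_by_reference(points, rng):
    """Pick a random reference point R and order the data by decreasing
    distance from R, so likely outliers are processed first."""
    r = rng.choice(points)
    dists = [math.dist(p, r) for p in points]
    order = sorted(range(len(points)), key=lambda i: dists[i], reverse=True)
    return r, [points[i] for i in order], [dists[i] for i in order]


def blocks(n_rows, block_size):
    """Yield (start, stop) row ranges covering the indexed data."""
    for start in range(0, n_rows, block_size):
        yield start, min(start + block_size, n_rows)


def can_terminate(dist_to_r, r_kth_neighbor_dist, cutoff):
    """Early-termination test as stated on the slide."""
    return dist_to_r + r_kth_neighbor_dist < cutoff


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.random(), rng.random()) for _ in range(1000)]
    r, ordered, dist_to_r = index_by_reference(data, rng)
    print(len(list(blocks(len(ordered), 150))), "blocks of up to 150 rows")
```

The termination test follows from the triangle inequality: R's k nearest neighbors all lie within (distance to R) + (distance from R to its k-th neighbor) of the point, so the point already has k neighbors closer than the cutoff. Because the data is ordered by decreasing distance from R, the same holds for every point that comes later.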

DIO Parallel Algorithm Overview
• Follows iOrca’s data indexing and block processing concepts, but adds a central control function which assigns blocks of data to available worker nodes
• Each worker node evaluates each data point in block until a sufficient number of neighbors closer than the cutoff are found, or the early termination condition is recognized
• Worker returns outlier candidates and requests another block of data
• Controller maintains running list of outliers and threshold, passes current threshold to workers along with data assignments
• Job ends at data EOF or when global termination condition occurs, controller outputs final list of outliers
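One way the controller's running outlier list and threshold could be kept is with a bounded min-heap, sketched below. The class name `OutlierList` and the choice of a cutoff of 0.0 until the list is full are assumptions for illustration, not details taken from DIO.

```python
import heapq


class OutlierList:
    """Running top-n outliers. The cutoff is the weakest score in a full list."""

    def __init__(self, n_outliers):
        self.n = n_outliers
        self._heap = []  # min-heap of (score, row_id); weakest outlier on top

    @property
    def cutoff(self):
        # Assumption: until n outliers are known, nothing can be pruned.
        return self._heap[0][0] if len(self._heap) == self.n else 0.0

    def merge(self, candidates):
        """Fold a worker's (score, row_id) candidates in; return the new cutoff."""
        for score, row_id in candidates:
            if len(self._heap) < self.n:
                heapq.heappush(self._heap, (score, row_id))
            elif score > self._heap[0][0]:
                heapq.heapreplace(self._heap, (score, row_id))
        return self.cutoff

    def ranked(self):
        """Outliers from strongest to weakest."""
        return sorted(self._heap, reverse=True)


if __name__ == "__main__":
    outliers = OutlierList(3)
    outliers.merge([(5.0, 1), (2.0, 2)])
    print(outliers.merge([(3.0, 3), (1.0, 4), (4.0, 5)]))  # cutoff after merge
    print(outliers.ranked())
```

The cutoff can only rise as candidates are merged, which is why passing the latest value to workers with each assignment lets them prune more aggressively over time.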

Indexed Distributed Method at Controller Node
initialization (open files, validate parameters)
send (synchronous) to each worker node: index to data block and cutoff threshold
loop until data EOF or early termination
    on message from worker
        receive (asynch) from worker: count of new potential outliers
        acknowledge receipt, wait for data
        receive (synchronous) from worker: details of each new outlier
        send (synchronous) to worker: new data index and cutoff threshold
        sort new candidates into master outlier list and recalculate threshold
    message processing end
loop end
process final outlier list and print output
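The controller loop above can be imitated in a single Python process, with threads and queues standing in for MPI ranks and messages. This is a simplified sketch under several assumptions: the two-step count/details exchange is collapsed into one message, global early termination is omitted, and `run_controller` and `process_block` are hypothetical names.

```python
import queue
import threading
from collections import deque


def run_controller(n_blocks, n_workers, process_block, n_outliers):
    """Hand out block indices to workers on demand and keep the top outliers."""
    inbox = queue.Queue()                                  # worker -> controller
    outboxes = [queue.Queue() for _ in range(n_workers)]   # controller -> worker

    def worker(rank):
        while True:
            task = outboxes[rank].get()
            if task is None:                               # stop signal at EOF
                return
            block, cutoff = task
            inbox.put((rank, process_block(block, cutoff)))

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(n_workers)]
    for t in threads:
        t.start()

    pending = deque(range(n_blocks))
    top, cutoff = [], 0.0

    def assign(rank, current_cutoff):
        """Send the next block and cutoff, or a stop signal when data runs out."""
        if pending:
            outboxes[rank].put((pending.popleft(), current_cutoff))
            return 1
        outboxes[rank].put(None)
        return 0

    busy = sum(assign(rank, cutoff) for rank in range(n_workers))
    while busy:
        rank, candidates = inbox.get()
        # As on the slide: reassign work first, then merge the candidates.
        busy += assign(rank, cutoff) - 1
        top = sorted(top + candidates, reverse=True)[:n_outliers]
        if len(top) == n_outliers:
            cutoff = top[-1][0]

    for t in threads:
        t.join()
    return top


if __name__ == "__main__":
    # Stub worker: each block reports one candidate scored by its block index.
    print(run_controller(10, 3, lambda block, cutoff: [(float(block), block)], 3))
```

Sending the next assignment before merging keeps the worker busy while the controller sorts, at the cost of that worker receiving a cutoff that is one update old.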

Indexed Distributed Method at Worker Node
initialization (open files, set parameters, only done once per worker process)
loop until finished
    receive (synchronous) from MPI control process: index to data block & current cutoff threshold
    check for EOF
    check for global early termination condition
    for each data point
        find neighbors, determine outlier candidates
    sort outliers, calculate new cutoff threshold
    if potential outliers are found
        send (asynch) to MPI controller: count of new outliers
        wait for receive signal from MPI
        send (synchronous) to MPI controller: outlier candidates with scores and neighbors
    end if
loop ends when data block index is at or past EOF, or early termination condition is observed
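The "for each data point" step in the worker can be illustrated with the cutoff-pruning idea in Python. This sketch makes assumptions that may differ from iOrca: it scans the data linearly rather than spiraling outward from the point's index position, it scores a point by the distance to its k-th nearest neighbor, and the name `score_block` is hypothetical.

```python
import heapq
import math


def score_block(data, rows, k, cutoff):
    """Return sorted (score, row) candidates for the rows of one block.

    A point is dropped as soon as k neighbors closer than the cutoff are
    found, since its score can then no longer exceed the cutoff.
    """
    candidates = []
    for i in rows:
        nearest = []  # negated distances: a max-heap of the k smallest so far
        pruned = False
        for j, q in enumerate(data):
            if j == i:
                continue
            d = math.dist(data[i], q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if not pruned and len(nearest) == k:
            candidates.append((-nearest[0], i))
    candidates.sort(reverse=True)
    return candidates


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    # With a cutoff of 2.0 the four clustered points are pruned early.
    print(score_block(data, range(len(data)), k=2, cutoff=2.0))
```

A higher cutoff prunes more points after fewer distance computations, which is where the benefit of receiving an up-to-date threshold with each block comes from.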

DIO/iOrca – Process Overview
• DIO works as a drop-in replacement for iOrca
• File formats, pre-processing and post-processing procedures are unchanged
[Diagram labels: CSV input data; dprep application; indexed data files; iOrca application; dio application; results parsing script; CSV output files. Legend: Blue = existing iOrca programs and scripts]

Performance Results - Overview
• Serial iOrca tested against parallel DIO/iOrca
• Test data is 1 million rows of 10 random floating point numbers
• Detect 1% of input as outliers, process blocks of 150 rows
• All testing was done on hotel.futuregrid.org (IBM iDataplex, RHEL 5.9)
• MPI for testing: OpenMPI version 1.4.5 (gnu-4.1 compiler)
• Largest run attempted, on 96 processor cores, ran in 1.09% of the time required for the serial iOrca job (12.5 minutes vs 19.1 hours)
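The figures quoted above can be cross-checked with a short calculation. The sketch below uses only the numbers on this slide; the function name is hypothetical, and the efficiency figure assumes all 96 cores are counted, since the slide does not say whether the controller is included.

```python
import math


def scaling_summary(serial_hours, parallel_minutes, cores):
    """Return (fraction of serial time, speedup, parallel efficiency)."""
    serial_minutes = serial_hours * 60
    fraction = parallel_minutes / serial_minutes
    speedup = serial_minutes / parallel_minutes
    return fraction, speedup, speedup / cores


if __name__ == "__main__":
    fraction, speedup, efficiency = scaling_summary(19.1, 12.5, 96)
    print(f"{fraction:.2%} of serial time")      # matches the quoted 1.09%
    print(f"speedup of about {speedup:.1f}x")
    print(f"efficiency of about {efficiency:.1%} on 96 cores")

    rows, block_size = 1_000_000, 150
    print(rows // 100, "outliers requested (1% of input)")
    print(math.ceil(rows / block_size), "blocks of up to 150 rows")
```

A speedup of roughly 92x on 96 cores is consistent with the near-linear scaling claimed for these runs.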

Performance Results (detail)

Performance Results – Near Linear Scaling

Future Work?
• Dynamically add controller node(s) as needed to allow scaling for much larger applications
• Develop DIO as a free-standing, open-source application (current version runs within iOrca framework)
