Chapter 7 - Preparing Scientific and Engineering Data for Mining

Chandrika Kamath, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, http://www.llnl.gov/casc/people/kamath UCRL-PRES-145087: The work of Chandrika Kamath in Chapters 5, 6, and 7 was performed under the auspices of the U.S. Department of Energy by the University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.

The input data cannot be fed directly to the pattern recognition algorithms • Input data: images/meshes that can be time-dependent, multi-sensor, compressed, spatio-temporal, massive, and in 2, 3, or 4 dimensions • Pattern recognition algorithms (classification, clustering, regression, ….) work on data items described by features The input data must be processed to make it suitable for the pattern recognition algorithms.

Science and engineering data are available in different formats • Different storage formats • FITS, AIPS in astronomy • netCDF, GRIB (grid in binary) in climate • Different ways of generating output • sea surface temps for each month in a file • sea surface temps for each year in a file • Depending on the problem, data can be • one-dimensional, usually time series, from sensors or processing of other data • two-dimensional (spatial) + time • three-dimensional (spatial) + time

Two-dimensional scientific data is available as images or as meshes • Can have spatial and temporal aspects • Images • pixel values: gray-scale or real • a scene obtained using different sensors, at different times, at different resolutions • can be noisy, with noise varying between images and within an image • Mesh • values at a mesh point are real • “cell centered”, “node centered” or “edge centered” [Figure labels: MACHO; Asteroid; Cell centered, Node centered, Edge centered]

Three dimensional scientific data comes from modeling objects in 3-D • Values at a mesh point are real • Can be “cell centered”, “node centered”, “edge centered”, or “face centered” • Often have a series of meshes in time: spatial and temporal aspect

The complexity of meshes makes it difficult to extract features [Figure panels: Cartesian; Structured; Structured; Unstructured; Unstructured]

The distribution of mesh points can change with time - need feature tracking • Composite meshes - locally structured, globally unstructured [Figure labels: Composed ‘Unstructured’ mesh; Hierarchy of regular meshes]

Science data is not often in a form ready for pattern recognition • Data available as pixels or values at mesh points • But, the patterns of interest are at a higher level The raw data must be transformed into features before we can apply pattern recognition. Extracting features that are robust, relevant to the problem, and invariant to scaling, rotation, and translation is non-trivial and time consuming - but, essential to the success of the pattern recognition algorithm.

Most of the work in data mining focuses on pattern recognition, BUT…. • … it is the data pre-processing which is • more influential and time consuming • domain specific and therefore less general • “perhaps as little as 10% effort was spent on classification aspects of the problem.” (Burl ‘98) • Langley/Simon ‘95: “… much of the power comes not from the specific induction method, but from proper formulation of the problems and from crafting the representation to make learning tractable.” • Brodley/Smyth ‘95: “… in practical applications, it is often the data and human issues which ultimately dictate success or failure of a project rather than algorithmic and model issues.”

The Sapphire view of the end-to-end data mining process (Kamath ‘01) • Data flow: Raw Data → Target Data → Preprocessed Data → Transformed Data → Patterns → Knowledge • Data Preprocessing: de-noising, object identification, feature extraction, normalization, dimension reduction, data fusion, sampling, multi-resolution analysis • Pattern Recognition: classification, clustering, regression, …. • Interpreting Results: visualization, validation

Let’s make a few ‘simple’ assumptions in our discussion of data preparation …. • We understand the problem and the data • We have formulated a solution approach • We have relatively easy access to the data • We have the software to read, write, and display the data • We have the software to bring the data into a consistent format Satisfying these ‘simple’ assumptions may require far more time than you expect!

Data fusion may be necessary when data from many sources is available • Combining information from more than one source to make a more accurate and better informed decision • Exploit complementary information from different sensors, at different wavelengths, from different viewpoints,…. [Figure: images of the Crab Nebula in X-ray, infrared, optical, and radio, from chandra.harvard.edu]

Data registration is an important part of data fusion • Registration: align images to relate information in one image to information in another image • Used in data fusion and change detection • Obtain a global or local transformation to match the input data to the reference data. [Figure: example transformations: translation, rigid body, rotation, horizontal shear]

There are four major components of data registration (Brown ‘92) • Feature space: what features do we use for matching? • pixels, edges, contours, corners,…. • Search space: what transformation to use to establish correspondence between input and reference data? • translations, rotations, scaling,…. • Search strategy: which transformations are computed and evaluated? • exhaustive search, multiresolution methods • Similarity metric: how to evaluate the match between input and reference data? • mean square error, sum of absolute differences
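
The four components can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the method of any cited paper: the feature space is raw pixels, the search space is integer translations, the search strategy is exhaustive, and the similarity metric is mean square error. The function name `register_translation` and the parameter `max_shift` are hypothetical, and the wrap-around at image borders is a simplification.

```python
import numpy as np

def register_translation(reference, image, max_shift=3):
    """Exhaustively try integer shifts and return the (dy, dx) with the lowest MSE."""
    best_shift, best_mse = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # np.roll wraps pixels around the border (a simplifying assumption).
            shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
            mse = np.mean((shifted - reference) ** 2)
            if mse < best_mse:
                best_shift, best_mse = (dy, dx), mse
    return best_shift, best_mse

rng = np.random.default_rng(0)
reference = rng.random((32, 32))
# Input image: the reference displaced by 2 rows down and 1 column left.
image = np.roll(reference, shift=(2, -1), axis=(0, 1))
shift, mse = register_translation(reference, image)
```

The recovered `shift` is the translation that maps the input back onto the reference. An exhaustive search grows quickly with the size of the search space, which is one motivation for the multiresolution strategies listed above.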

Some recent work in image registration • An excellent survey (Brown ‘92) • Use of wavelet-based multi-resolution techniques (Le Moigne ‘94) • Using evolutionary algorithms as a search strategy (Mandava ‘89) • Using the Levenberg-Marquardt optimization strategy (Thevenaz ‘98)

The data may need to be de-noised to better identify the objects • Noise in the data can be due to the data acquisition process or natural phenomena such as atmospheric turbulence • De-noising is difficult because we cannot always tell what is signal and what is noise • A simple approach: thresholding • drop all values below a threshold • how do we calculate the threshold?

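Simple filters can be used to “smooth” the data and minimize the noise effects • Convolve the image with a filter • Non-zero locations of a 3 by 3 filter, as offsets from the center: (-1,-1) (0,-1) (1,-1) / (-1,0) (0,0) (1,0) / (-1,1) (0,1) (1,1) • Convolution of filter f with image I: $(f * I)(x, y) = \sum_{i=-1}^{1} \sum_{j=-1}^{1} f(i, j)\, I(x - i, y - j)$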
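
A minimal sketch of convolving an image with a 3 by 3 filter, using the standard definition of discrete convolution. The function name `convolve3x3`, the array layout (`f[j + 1, i + 1]` holds the weight at offset (i, j)), and the replication of border pixels are assumptions made for illustration.

```python
import numpy as np

def convolve3x3(image, f):
    """Convolve a 2-D image with a 3x3 filter f, replicating border pixels."""
    rows, cols = image.shape
    padded = np.pad(image, 1, mode="edge")
    out = np.zeros((rows, cols), dtype=float)
    for j in (-1, 0, 1):          # row offset
        for i in (-1, 0, 1):      # column offset
            # out(x, y) += f(i, j) * I(x - i, y - j)
            window = padded[1 - j : 1 - j + rows, 1 - i : 1 - i + cols]
            out += f[j + 1, i + 1] * window
    return out

image = np.arange(16, dtype=float).reshape(4, 4)

# A filter with a single 1 at the center leaves the image unchanged.
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
same = convolve3x3(image, identity)

# A single 1 at offset (i, j) = (1, 0) shifts the image one column to the right.
shift_right = np.zeros((3, 3))
shift_right[1, 2] = 1.0
shifted = convolve3x3(image, shift_right)
```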

Examples of some simple filters • Filters can vary in width - a wide filter gives better noise reduction, but smooths the edges • Mean filters • Gaussian filters
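
The trade-off between filter width and noise reduction can be sketched as follows. The kernel definitions are the usual textbook ones and are assumptions here, since the slide does not preserve them; the helper names `mean_kernel`, `gaussian_kernel`, and `smooth` are hypothetical, widths are assumed odd, and border pixels are replicated.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mean_kernel(width):
    """Square mean filter: every weight equals 1 / width**2."""
    return np.full((width, width), 1.0 / width**2)

def gaussian_kernel(width, sigma):
    """Square Gaussian filter, normalized so the weights sum to 1."""
    half = width // 2
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def smooth(image, kernel):
    """Apply a symmetric kernel to an image (border pixels replicated)."""
    half = kernel.shape[0] // 2
    padded = np.pad(image, half, mode="edge")
    windows = sliding_window_view(padded, kernel.shape)
    # For symmetric kernels this weighted sum equals the convolution.
    return np.einsum("ijkl,kl->ij", windows, kernel)

rng = np.random.default_rng(0)
noise = rng.normal(size=(64, 64))
std_narrow = smooth(noise, mean_kernel(3)).std()
std_wide = smooth(noise, mean_kernel(7)).std()
```

On pure noise the wider filter leaves a smaller residual standard deviation; on a real image the same widening would also blur edges more.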

Multi-resolution analysis using wavelets • Using appropriate filters, decompose a signal • high frequency part (detail coefficients) • low frequency part (smooth coefficients) • In 2-dimensions, apply first along one dimension and then the other • Choice of wavelets, transforms, boundary conditions, number of levels
Example (smooth = pairwise averages, detail = pairwise half-differences):
Signal:  64 48 16 32 56 56 48 24
Level 1: Smooth: 56 24 56 36 | Detail: 8 -8 0 12
Level 2: Smooth: 40 46 | Detail: 16 10 8 -8 0 12
Level 3: Smooth: 43 | Detail: -3 16 10 8 -8 0 12
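
The numeric example on this slide is consistent with an averaging form of the Haar transform: each smooth coefficient is the average of a pair of values and each detail coefficient is half their difference. The sketch below reproduces those numbers; the function names are hypothetical and the signal length is assumed to be divisible by 2 at every level.

```python
import numpy as np

def haar_step(signal):
    """One level: pairwise averages (smooth) and pairwise half-differences (detail)."""
    s = np.asarray(signal, dtype=float)
    smooth = (s[0::2] + s[1::2]) / 2.0
    detail = (s[0::2] - s[1::2]) / 2.0
    return smooth, detail

def haar_decompose(signal, levels):
    """Repeatedly decompose the smooth part; details are kept in place."""
    coeffs = np.asarray(signal, dtype=float).copy()
    n = len(coeffs)
    for _ in range(levels):
        smooth, detail = haar_step(coeffs[:n])
        coeffs[:n] = np.concatenate([smooth, detail])
        n //= 2
    return coeffs

signal = [64, 48, 16, 32, 56, 56, 48, 24]
level1 = haar_decompose(signal, 1)  # 56 24 56 36 | 8 -8 0 12
level2 = haar_decompose(signal, 2)  # 40 46 | 16 10 | 8 -8 0 12
level3 = haar_decompose(signal, 3)  # 43 | -3 | 16 10 | 8 -8 0 12
```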

Wavelet multi-resolution analysis: Haar wavelet, periodic boundary conditions [Figure: 2 level decomposition with horizontal, vertical, and diagonal detail sub-images]

Wavelets can be used for removing noise from data • Useful when data is available compressed using wavelet transforms • Basic idea - drop detail coefficients below a threshold • Extensive study (Fodor/Kamath ‘01) • several wavelets, boundary treatments • several shrinkage rules, shrinkage functions • compare with linear and non-linear spatial filters • Pipeline: Noisy Image → Forward wavelet transform → Calculate threshold → Apply threshold → Inverse wavelet transform → Denoised Image
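
The pipeline can be sketched in one dimension. This is an illustration under stated assumptions, not the procedure of the cited study: it uses an orthonormal Haar transform, the universal threshold sigma * sqrt(2 ln n) as the shrinkage rule, soft thresholding as the shrinkage function, and a noise estimate from the median absolute value of the finest detail coefficients. All function names are hypothetical, and the signal length is assumed to be divisible by 2**levels.

```python
import numpy as np

def haar_forward(x, levels):
    """Orthonormal Haar transform; layout is [smooth, coarse detail, ..., fine detail]."""
    c = np.asarray(x, dtype=float).copy()
    n = len(c)
    for _ in range(levels):
        a = (c[0:n:2] + c[1:n:2]) / np.sqrt(2.0)
        d = (c[0:n:2] - c[1:n:2]) / np.sqrt(2.0)
        c[:n] = np.concatenate([a, d])
        n //= 2
    return c

def haar_inverse(c, levels):
    """Invert haar_forward."""
    x = np.asarray(c, dtype=float).copy()
    n = len(x) >> levels  # length of the coarsest smooth part
    for _ in range(levels):
        a, d = x[:n].copy(), x[n:2 * n].copy()
        x[0:2 * n:2] = (a + d) / np.sqrt(2.0)
        x[1:2 * n:2] = (a - d) / np.sqrt(2.0)
        n *= 2
    return x

def soft_threshold(c, t):
    """Shrink coefficients toward zero by t; values with |c| <= t become zero."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def denoise(x, levels=4):
    c = haar_forward(x, levels)                         # forward wavelet transform
    finest = c[len(x) // 2:]                            # finest-level detail coefficients
    sigma = np.median(np.abs(finest)) / 0.6745          # robust noise estimate
    t = sigma * np.sqrt(2.0 * np.log(len(x)))           # calculate threshold
    n_smooth = len(x) >> levels
    c[n_smooth:] = soft_threshold(c[n_smooth:], t)      # apply threshold to details only
    return haar_inverse(c, levels)                      # inverse wavelet transform

rng = np.random.default_rng(0)
clean = np.where(np.arange(256) < 128, 0.0, 4.0)        # a single step edge
noisy = clean + rng.normal(0.0, 0.5, size=clean.size)
denoised = denoise(noisy)
mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
```

Replacing `soft_threshold` with a hard threshold, or the universal rule with another shrinkage rule, changes only the two marked steps in `denoise`.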

Comparison of denoising: wavelets (symmlet12) vs. spatial filters [Figure panels: Original; Noisy, σ = 20, MSE=398; MMSE + Gaussian, MSE=65; SURE, MSE=69]

Results of study on wavelet-based statistical techniques for denoising data • Results independent of choice of wavelets • Soft thresholding better than hard or semi-soft • SURE and Bayes rules are consistently better • Wavelets preserve edges better; introduce artifacts • Wavelets are not good at structured noise • Combination of spatial filters may give smaller MSE • Spatial filters often blur edges • Other approaches - diffusion-based methods, level set methods, ENO and TVD schemes, non-decimated transforms, curvelets, ….

De-noising techniques applied to the FIRST images [Figure panels: Observed; SureShrink; Universal; HypTest; Unsharp mask; Simple threshold]

Wavelets can be a useful tool in several aspects of data mining (Fodor/Kamath ‘00) • Very effective in compression • astronomy, simulations, FBI fingerprints • JPEG 2000, MPEG-4 standards • progressive transmission of data • Mining compressed data: visualization approaches (Machiraju ‘01) • Feature extraction (at different scales) • texture analysis (Ma/Manjunath ‘95) • Image registration (Le Moigne ‘94) • Caveat: Recent work (Candes/Donoho ‘00) indicates that wavelets might not be good for > 1D data

Once the data has been de-noised, we need to identify the “objects” in it • Identifying the objects is non-trivial • tremendous variability of object shapes: man-made vs. natural objects • denoising may have smoothed the edges • variations in image quality (noise, boundary gaps)

Identifying objects in data is difficult, both in 2 and 3-D images and meshes • Challenges in traditional image algorithms • need many parameters for optimal performance • interactions between parameters are complex and non-linear • no universally accepted measure of quality of the segmented image • no single method can handle variations between images • Identifying “objects” in mesh data • mesh may move/change over time • in two/three spatial dimensions + time • irregular meshes • “objects” may split or merge

Several techniques are being used in the image processing community • Histogram the image, and threshold it based on the histogram: separate the foreground from the background • Segmentation techniques • split and merge (top-down) • region growing (bottom-up) • Edge detection: use a filter to identify an edge, for example the Laplacian or Sobel filters
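
A minimal sketch of filter-based edge detection. The kernel values are the commonly used 3 by 3 Sobel and Laplacian masks and are an assumption here, since the slide's own filters are not preserved; the helper names are hypothetical and border pixels are replicated.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def apply_filter(image, kernel):
    """Weighted sum over each 3x3 neighborhood (correlation form)."""
    padded = np.pad(image, 1, mode="edge")
    windows = sliding_window_view(padded, (3, 3))
    return np.einsum("ijkl,kl->ij", windows, kernel)

def sobel_magnitude(image):
    """Gradient magnitude; the sign convention of the kernels does not affect it."""
    gx = apply_filter(image, SOBEL_X)
    gy = apply_filter(image, SOBEL_Y)
    return np.hypot(gx, gy)

# Test image: dark left half, bright right half, so there is one vertical edge.
image = np.zeros((8, 8))
image[:, 4:] = 1.0
edges = sobel_magnitude(image)
edge_columns = np.flatnonzero(edges.max(axis=0) > 0)
```

Only the two columns on either side of the step respond. Thresholding `edges` then gives a binary edge map, which raises the same question of how to choose the threshold as in de-noising.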

Examples of some simple edge detection [Figure panels: Original; Sobel; Canny]

More sophisticated techniques for object identification • Combine traditional techniques with evolutionary algorithms to make them more adaptive (Bhanu/Lee ‘95, Cagnoni ‘97) • Deformable models for segmentation • parametric approach: snakes or active contours (Kass et al. ‘87) • geometric approach: level set methods (Malladi and Sethian ‘96) • Non-linear diffusion filters based on PDEs • smooth images while enhancing edges (Weickert et al. ‘98) PDE-based techniques are gaining popularity - they are robust, but expensive.

Once the objects have been identified, the features must be extracted • Features dependent on the problem • identifying relevant features • extracting robust features • extracting features invariant to scale, rotation, and translation • Features may include • distances, angles, areas • histograms • Fourier or wavelet coefficients • various moments • ….

May need to reduce the dimension or the number of features • Raw Data → Object recognition and Feature Extraction → Features (for each data item) → Dimension Reduction → Features (for each data item) → Pattern Recognition → Information

There are several reasons why dimension reduction may be helpful • Fewer features may make pattern recognition algorithms computationally tractable • Less time is spent in extracting features • Can minimize correlations between features, which may be a requirement of some algorithms (e.g. GLMs)

In the FIRST data, we need to reduce the 103 features for 3-entry sources • Input from domain experts • EDA techniques: parallel plots and box plots • Wrapper approach

There are also more complex techniques for dimension reduction • Principal component analysis • transform the features to be mutually uncorrelated • focus on directions that maximize the variance

Principal component analysis algorithm • N data items in d dimensions • find the d-dimensional mean vector • obtain the d x d covariance matrix • obtain the d eigenvalues and eigenvectors of the covariance matrix • keep the k eigenvectors with the largest eigenvalues (k << d) • project the (original data - mean) into the space spanned by these vectors The eigenvectors or principal components (PCs) are mutually orthogonal, and the mean-centered data is a linear combination of these PCs
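
The steps of the algorithm can be sketched directly with NumPy. The function name `pca` and the synthetic data are assumptions made for illustration; the data has five features that are, up to small noise, mixtures of two underlying variables.

```python
import numpy as np

def pca(X, k):
    """Project N x d data onto the k eigenvectors with the largest eigenvalues."""
    mean = X.mean(axis=0)                      # d-dimensional mean vector
    centered = X - mean
    cov = np.cov(centered, rowvar=False)       # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]                # d x k matrix of principal components
    projected = centered @ components          # N x k transformed features
    explained = eigvals[:k].sum() / eigvals.sum()
    return projected, components, explained

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
projected, components, explained = pca(X, k=2)
```

`explained` is the fraction of the total variance captured by the k components kept, the quantity quoted when a number of PCs is said to explain a percentage of the variance.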

We applied PCA to the problem of bent-double classification • The first 20 PCs explained about 90% of the variance • Eliminate unimportant variables • eliminate the variable with the largest coefficient in the eigenvector corresponding to the smallest eigenvalue • repeat with the eigenvector for the next smallest eigenvalue • continue until 20 variables are left Using the 31 features found through EDA and PCA lowers the error from 11.1% to 9.5%
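
The variable elimination procedure can be sketched as below. Several details are assumptions, because the slide does not specify them: the eigenvectors are taken from the covariance matrix, coefficients are compared by absolute value, and a variable that has already been eliminated is skipped when scanning the next eigenvector. The function name `eliminate_variables` and the toy data are hypothetical.

```python
import numpy as np

def eliminate_variables(X, n_keep):
    """Return the indices of the variables kept after PCA-based elimination."""
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending: smallest eigenvalue first
    remaining = list(range(X.shape[1]))
    for col in range(X.shape[1]):
        if len(remaining) <= n_keep:
            break
        # Largest absolute coefficient among the variables not yet eliminated.
        weights = np.abs(eigvecs[remaining, col])
        drop = remaining[int(np.argmax(weights))]
        remaining.remove(drop)
    return remaining

# Toy data: the third variable is almost a copy of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.01 * rng.normal(size=500)
X = np.column_stack([a, b, c])
kept = eliminate_variables(X, n_keep=2)
```

In the toy data the eigenvector with the smallest eigenvalue points along the difference between the two nearly identical variables, so one of them is dropped and the independent variable is kept.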

Need more appropriate techniques for dimension reduction • PCA may not always be appropriate • linear • orthogonal • Other options • independent component analysis • blind source separation • non-linear PCA • genetic algorithms • Need incremental techniques which are applied as the data is being collected (Kargupta ‘00)

It is difficult to find labeled data in science and engineering applications • Training set usually generated manually, not historically • Not all scientists may agree on a label • “Labeled” data vs “interesting” data • Often ground truth is unavailable, or difficult to find • Approach to labeling may be ad-hoc • the “yellow-sticky-pad” approach to identifying bent doubles [Figure panels: Non-bent double; Bent-double]

Sapphire experiences with a flexible system design for data mining • We address the needs of a diverse set of applications • Not all problems require the entire process • Not all algorithms are suitable for a problem • Algorithms typically depend on several parameters • Intermediate data must be handled properly • Domain dependent and independent parts must be clearly identified • Should be able to accommodate a growing data set

The Sapphire approach: a flexible, portable, scalable system architecture • Components linked by Python [Diagram labels: input formats (FITS, netCDF, . . .); Sample data; Multi-resolution analysis; De-noise data; Extract features; Data items and Features; RDB; Dimension reduction; Sample features; Decision Trees, Neural Networks, k-NN, k-means, Genetic algo., . . .; Display Patterns; View; Display; User Input & Feedback. Legend: Sapphire Software; Public Domain Software; Sapphire & Domain Software]

Other pointers that discuss system architecture issues • Data mining specific projects • ADAM, JARTool, Diamond Eye • Workshops of more general interest • mining scientific datasets (http://www.ahpcrc.umn.edu/conferences) • interfaces to scientific data archives (http://www.cacr.caltech.edu/SDA) • large scientific databases (http://www.cacr.caltech.edu/euus/) • issues in the application of data mining to scientific data (http://www.cs.uah.edu/NASA_Mining) • data fusion and data mining (http://ic-www.arc.nasa.edu/ic/data99-workshop)

Challenges in mining science and engineering data sets • Feature extraction is non-trivial • Labeled data is difficult to obtain • Data can be high dimensional • Need techniques to handle spatial and temporal aspects • System infrastructure issues are important • Data fusion and registration are required in some cases • Data may be compressed • May need to mine data as it is being generated • ….

Acknowledgements • The Sapphire project team: Erick Cantú-Paz, Imola K. Fodor, and Nu Ai Tang • Sisira Weeratunga (LLNL) for insights on simulations and PDEs • FIRST scientists: Bob Becker, Michael Gregg, Sally Laurent-Muehleisen, and Rick White • Sapphire web page: http://www.llnl.gov/casc/sapphire

Chapter 7 - References Credits for images used in Chapter 7 (if not provided with the image) • MACHO web page: http://wwwmacho.anu.edu.au/ • 3D meshes: http://www.llnl.gov/casc/overture • Structured and unstructured mesh around the front of an aircraft, http://www.nas.nasa.gov/Pubs/Docs/FAST/chp_16.serferu.html • 3D unstructured mesh with heterogeneous elements: http://cox.iwr.uni-heidelbeg.de/ug/Images/benchmark_grid.gif • Composite grid: SAMRAI project - http://www.llnl.gov/casc/samrai • Wavelet images: generated by Sapphire software - http://www.llnl.gov/casc/sapphire

Chapter 7 - References • Burl, M., L. Asker, P. Smyth, U. Fayyad, P. Perona, L. Crumpler, and J. Aubele, “Learning to recognize volcanoes on Venus”, Machine Learning, Volume 30, pages 165-195, 1998. • Langley, P. and H. A. Simon, “Applications of machine learning and rule induction”, Communications of the ACM, Volume 38, Number 11, pages 55-64, 1995. • Brodley, C. and P. Smyth, “The process of applying machine learning algorithms”, Workshop on applying machine learning in practice, IMLC 1995 (http://citeseer.nj.nec.com/722.html). • Kamath, C., E. Cantú-Paz, I. K. Fodor, and N. Tang, “Searching for bent-double galaxies in the FIRST survey”, in Data Mining for Scientific and Engineering Applications, R. Grossman, C. Kamath, W. Kegelmeyer, V. Kumar, and R. Namburu (eds.), Kluwer 2001.

• Brown, L., “A Survey of Image Registration Techniques”, ACM Computing Surveys, Volume 24, Number 4, December 1992. • Le Moigne, J., “Parallel Registration of Multi-sensor remotely sensed imagery using wavelet coefficients”, Proc. SPIE Wavelet Applications Conference, Orlando, 1994, pages 423-443. • Mandava, V., J. Fitzpatrick, and D. Pickens, “Adaptive search space scaling in digital image registration”, IEEE Transactions on Medical Imaging, Volume 8, pages 251-262, 1989. • Thevenaz, P., U. Ruttimann, and M. Unser, “A Pyramid Approach to Sub-pixel Registration based on intensity”, IEEE Transactions on Image Processing, Volume 7, Number 1, January 1998. • Fodor, I. K. and C. Kamath, “On denoising images using wavelet-based statistical techniques”, submitted for publication. See the Sapphire web page for details.

• Fodor, I. K. and C. Kamath, “The role of multi-resolution in mining massive image datasets”, Proceedings of the YES2000 Symposium on Advanced Multiscale and Multi-resolution Methods, Lecture Notes in Computational Science and Engineering, Springer-Verlag, 2001. • Machiraju, R., J. Fowler, D. Thompson, W. Schroeder, and B. Soni, “EVITA - A Prototype System for Efficient Visualization and Interrogation of Terascale Datasets”, to appear in Data Mining for Scientific and Engineering Applications, R. Grossman, C. Kamath, W. Kegelmeyer, V. Kumar, and R. Namburu (eds.), Kluwer 2001. • Ma, W. Y. and B. S. Manjunath, “A comparison of wavelet transform features for texture image annotation”, Proc. Second International Conference on Image Processing, ICIP 95, pages 256-259.

• Candes, E. and D. Donoho, “Curvelets, multiresolution representation, and scaling laws”, Proc. Wavelet Applications in Signal and Image Processing VIII, SPIE 2000, Volume 4119. • Bhanu, B. and S. Lee, “Adaptive image segmentation using a genetic algorithm”, IEEE Transactions on Systems, Man, and Cybernetics, Volume 25, pages 1543-1567, 1995. • Cagnoni, S., A. Dobrzeniecki, R. Poli, and J. Yanch, “Segmentation of 3D medical images through genetically optimized contour tracking algorithms”, U. Birmingham School of Computer Science Technical Report, CSRP-97-28, 1997. • Kass, M., A. Witkin, and D. Terzopoulos, “Snakes: active contour models”, Int’l J. Computer Vision, Volume 1, Number 4, pages 321-331, 1987.

• Malladi, R. and J. Sethian, “A unified approach to noise removal, image enhancement, and shape recovery”, IEEE Transactions on Image Processing, Volume 5, pages 1154-1168, 1996. • Weickert, J., B. ter Haar Romeny, and M. Viergever, “Efficient and Reliable Schemes for Nonlinear Diffusion Filtering”, IEEE Transactions on Image Processing, Volume 7, Number 3, March 1998. • Jolliffe, I., Principal Component Analysis, Springer Verlag, 1986. • Kargupta, H., W. Huang, S. Krishnamoorthy, and E. Johnson, “Distributed clustering using collective principal component analysis”, ACM SigKDD Workshop on Distributed and Parallel Knowledge Discovery, 2000.
