
Juan M. Banda

Framework for creating large-scale content-based image retrieval (CBIR) system for solar data analysis. Juan M. Banda. Agenda. Project Objectives Datasets Framework Description Feature Extraction Attribute Evaluation Dimensionality Reduction Dissimilarity Measures Component



  1. Framework for creating large-scale content-based image retrieval (CBIR) system for solar data analysis Juan M. Banda

  2. Agenda • Project Objectives • Datasets • Framework Description • Feature Extraction • Attribute Evaluation • Dimensionality Reduction • Dissimilarity Measures Component • Indexing Component

  3. Project Objectives

  4. Creation of a CBIR system building framework • Creation of a composite multi-dimensional data indexing technique • Creation of a CBIR system for Solar Dynamics Observatory

  5. Contributions • Framework is the first of its kind • Custom solution for high-dimensional data indexing and retrieval • First domain-specific CBIR system for solar data • Motivation • Lack of simple CBIR system creation tools • High-dimensional data indexing and retrieval has proven to be very domain-specific • SDO (with AIA) produces around 69,120 images per day, roughly 700 gigabytes of image data per day

  6. Datasets

  7. TRACE Dataset • Created using the Heliophysics Events Knowledgebase (HEK) portal • Contains 8 classes: Active Region, Coronal Jet, Emerging Flux, Filament, Filament Activation, Filament Eruption, Flare, and Oscillation • 200 images per class, available on the web: http://www.cs.montana.edu/angryk/SDO/data/TRACEbenchmark/

  8. Sample Images from subset of classes Active Region Oscillation Flare Filament Filament Eruption Filament Activation

  9. INDECS Database • Images of indoor environments under changing conditions • Contains 8 classes: Corridor (Cloudy and Night), Kitchen (Cloudy, Night, and Sunny), Two-persons Office (Cloudy, Night, and Sunny) • 200 images per class, available on the web: http://cogvis.nada.kth.se/INDECS/

  10. Sample Images from subset of classes Corridor - Cloudy Corridor - Night Kitchen - Cloudy Kitchen - Night Kitchen - Sunny Two-persons Office - Cloudy

  11. ImageCLEFmed Dataset • The 2005 dataset contains 9,000 radiograph images divided into 57 classes • The 2006 and 2007 datasets grew to 116 classes, adding 1,000 images each year • The 2010 dataset contains over 77,000 images (perfect for scalability evaluation)

  12. Sample Images from subset of classes Head Profile Lungs Hand Vertebrae

  13. Labeling • TRACE Dataset • One label per image (as a whole) • One label per cell (several per image) • INDECS Database • One label per image (as a whole) • ImageCLEFmed • One label per image (as a whole)

  14. Classifiers

  15. Classifiers for Comparative Evaluation Purposes • Naïve Bayes • C4.5 • Support Vector Machines (SVM) • AdaBoost with C4.5 • Future work: tune parameters better

  16. Refereed publications from this work
  • 2010
  • J.M. Banda and R. Angryk, “Selection of Image Parameters as the First Step Towards Creating a CBIR System for the Solar Dynamics Observatory.” TO APPEAR. International Conference on Digital Image Computing: Techniques and Applications (DICTA), Sydney, Australia, December 1-3, 2010.
  • J.M. Banda and R. Angryk, “Usage of dissimilarity measures and multidimensional scaling for large scale solar data analysis.” TO APPEAR. NASA Conference on Intelligent Data Understanding (CIDU 2010), Computer History Museum, Mountain View, CA, October 5-6, 2010. (Invited for submission to the Best of CIDU 2010 issue of Statistical Analysis and Data Mining, the official journal of the ASA.)
  • J.M. Banda and R. Angryk, “An Experimental Evaluation of Popular Image Parameters for Monochromatic Solar Image Categorization.” Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference (FLAIRS-23), Daytona Beach, Florida, USA, May 19-21, 2010, pp. 380-385.
  • 2009
  • J.M. Banda and R. Angryk, “On the effectiveness of fuzzy clustering as a data discretization technique for large-scale classification of solar images.” Proceedings of the 18th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE ’09), Jeju Island, Korea, August 2009, pp. 2019-2024.

  17. Framework Description

  18. Feature Extraction

  19. Image Parameters

  20. Image Segmentation / Feature Extraction 8 by 8 grid segmentation (128 x 128 pixels per cell)
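A grid segmentation like the one above can be sketched in a few lines; the image size, grid parameter, and `grid_segment` helper here are illustrative, not the framework's actual code:

```python
import numpy as np

def grid_segment(image, grid=8):
    """Split a square image into grid x grid equal cells.

    For a hypothetical 1024 x 1024 solar image and an 8 x 8 grid,
    this yields 64 cells of 128 x 128 pixels each, matching the
    cell size quoted on the slide.
    """
    rows = np.array_split(image, grid, axis=0)
    return [cell for row in rows for cell in np.array_split(row, grid, axis=1)]

cells = grid_segment(np.zeros((1024, 1024)))
print(len(cells), cells[0].shape)  # 64 (128, 128)
```

Each cell then gets its own image-parameter vector, which is what makes per-cell labeling (several labels per image) possible.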

  21. Image Parameter Extraction Times for 1,600 Images

  22. Comparative Evaluation Average classification accuracy with cell labeling Some of these results are part of the paper accepted for publication in the FLAIRS-23 conference (2010)

  23. Attribute Evaluation

  24. Motivation for this stage • By selecting only the most relevant image parameters, we save processing and storage costs for every parameter we remove • The SDO image parameter vector will grow by 6 gigabytes per day

  25. Unsupervised Attribute Evaluation Average correlation maps for the Active Region class, with one image as a query against: a) the same class (intra-class correlation: 1 image vs. 199 images) b) other classes (inter-class correlation: 1 image vs. 1,400 images)

  26. Better Visualization? MDS maps for the Active Region class, with one image as a query against: a) the same class (intra-class correlation: 1 image vs. 199 images) b) other classes (inter-class correlation: 1 image vs. 1,400 images) Multidimensional Scaling (MDS) allows us to better visualize these correlations
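As a sketch of that MDS visualization step, assuming hypothetical parameter vectors and a correlation-based dissimilarity (1 minus Pearson correlation); scikit-learn's `MDS` stands in here for whatever implementation the framework actually uses:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical parameter vectors: 1 query image plus 199 same-class images.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 64))

# Correlation-based dissimilarity between images (1 - Pearson correlation);
# the diagonal is exactly zero, as MDS requires.
dissim = 1.0 - np.corrcoef(vectors)

# Embed the 200 images into 2-D so the intra-class structure can be plotted.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)
print(coords.shape)  # (200, 2)
```

Plotting `coords` as a scatter plot gives the kind of 2-D map shown on the slide.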

  27. Supervised Attribute Evaluation • Chi Squared • Gain Ratio • Info Gain User Extendable (WEKA has more than 15 other methods that the user can select)
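The slide's attribute evaluators come from WEKA, so any sketch here is only an analogy; in scikit-learn, chi-squared and information-gain-style scoring of hypothetical image parameters might look like:

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

# Hypothetical data: 100 cells x 10 image-parameter features, 2 classes.
rng = np.random.default_rng(1)
X = rng.random((100, 10))          # chi2 requires non-negative features
y = rng.integers(0, 2, size=100)

chi_scores, _ = chi2(X, y)                             # chi-squared statistic
ig_scores = mutual_info_classif(X, y, random_state=1)  # mutual information,
                                                       # akin to info gain

ranking = np.argsort(chi_scores)[::-1]  # best-scoring features first
print(ranking)
```

Ranking the attributes this way is what lets the experiments remove the lowest-scoring parameters in batches of three.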

  28. Supervised Attribute Evaluation

  29. Experimental Set-up • Objective: 30% dimensionality reduction • Remove 3 parameters for each set of experiments

  30. Attribute Evaluation – Preliminary Experimental Results

  31. Attribute Evaluation - Preliminary Conclusions • Removal of some image parameters maintains comparable classification accuracy • Saving up to 30% of storage and processing costs • Paper: Accepted for publication in DICTA 2010 conference

  32. Dimensionality Reduction

  33. Motivation • By eliminating redundant dimensions we save retrieval and storage costs • In our case: 540 kilobytes per dimension per day, since we will have a 10,240-dimensional image parameter vector per image (5.27 GB per day)

  34. Linear dimensionality reduction methods • Principal Component Analysis (PCA) • Singular Value Decomposition (SVD) • Locality Preserving Projections (LPP) • Factor Analysis (FA)
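As a minimal sketch of the first two linear methods, on hypothetical 640-dimensional parameter vectors (the framework's actual pipeline may differ):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Hypothetical 640-dimensional image parameter vectors for 300 images.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 640))

pca = PCA(n_components=50).fit(X)           # projection onto the top 50 PCs
svd = TruncatedSVD(n_components=50).fit(X)  # SVD without mean-centering

X_pca = pca.transform(X)
X_svd = svd.transform(X)
print(X_pca.shape, X_svd.shape)  # (300, 50) (300, 50)
```

LPP and FA follow the same fit/transform pattern, just with different projection objectives.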

  35. Non-linear Dimensionality Reduction Methods • Kernel PCA • Isomap • Locally-Linear Embedding (LLE) • Laplacian Eigenmaps (LE)

  36. Experimental Set-up • We selected 67% of our data as the training set and the remaining 33% for evaluation • Full image labeling • For comparative evaluation we use the number of components returned by the standard PCA and SVD algorithms, with a variance threshold between 96% and 99%
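That set-up can be sketched end to end; the data, the 98% threshold, and the Naïve Bayes stand-in classifier are illustrative choices within the slide's 96-99% range, not the paper's exact configuration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Hypothetical parameter vectors for 400 images across 8 classes.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 64))
y = rng.integers(0, 8, size=400)

# 67% training / 33% evaluation split, as in the experimental set-up.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=3)

# Fit PCA on the training set, keeping components for 98% of the variance.
pca = PCA(n_components=0.98).fit(X_tr)

clf = GaussianNB().fit(pca.transform(X_tr), y_tr)
acc = accuracy_score(y_te, clf.predict(pca.transform(X_te)))
print(f"accuracy: {acc:.3f}")
```

Passing a float to `n_components` makes PCA pick the smallest number of components whose cumulative explained variance meets the threshold.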

  37. Dimensionality Reduction - Preliminary Experimental Results Average classification accuracy per method

  38. Dimensionality Reduction - Preliminary Experimental Results Average classification accuracy per method

  39. Dimensionality Reduction - Preliminary Experimental Results Average classification accuracy per number of generated dimensions

  40. Dimensionality Reduction – Preliminary Conclusions • Selecting anywhere between 42 and 74 dimensions provided stable results • For our current benchmark dataset we can reduce dimensionality by around 90% from the 640 dimensions we started with • For the SDO mission, a 90% reduction would imply savings of up to 4.74 gigabytes per day (out of 5.27 gigabytes of data per day) • Paper: under review

  41. Dissimilarity Measures Component

  42. Motivation for this stage • Literature reports very interesting results for different measures in different scenarios • The need to identify peculiar relationships between image parameters and different measures

  43. Dissimilarity Measures • 1) Euclidean distance [30]: Defined as the distance between two points given by the Pythagorean theorem. Special case of the Minkowski metric where p=2. • 2) Standardized Euclidean distance [30]: Defined as the Euclidean distance calculated on standardized data, in this case standardized by the standard deviations.

  44. Dissimilarity Measures • 3) Mahalanobis distance [30]: Defined as the Euclidean distance normalized based on a covariance matrix to make the distance metric scale-invariant. • 4) City block distance [30]: Also known as Manhattan distance, it represents distance between points in a grid by examining the absolute differences between coordinates of a pair of objects. Special case of the Minkowski metric where p=1.

  45. Dissimilarity Measures • 5) Chebychev distance [30]: Measures distance assuming only the most significant dimension is relevant. Special case of the Minkowski metric where p = ∞. • 6) Cosine distance [26]: Measures the dissimilarity between two vectors by finding the cosine of the angle between them.

  46. Dissimilarity Measures • 7) Correlation distance [26]: Measures the dissimilarity of the sample correlation between points as sequences of values. • 8) Spearman distance [25]: Measures the dissimilarity of the sample’s Spearman rank [25] correlation between observations as sequences of values.
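All eight measures above are available in SciPy; a sketch on two made-up toy vectors (`u` and `v` are purely illustrative):

```python
import numpy as np
from scipy.spatial import distance
from scipy.stats import rankdata

u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([2.0, 1.0, 1.0, 5.0])

print(distance.euclidean(u, v))    # 1) Minkowski, p = 2
print(distance.cityblock(u, v))    # 4) Minkowski, p = 1
print(distance.chebyshev(u, v))    # 5) Minkowski, p = infinity
print(distance.cosine(u, v))       # 6) 1 - cosine of the angle
print(distance.correlation(u, v))  # 7) 1 - Pearson correlation

# 8) Spearman distance: correlation distance computed on the ranks.
print(distance.correlation(rankdata(u), rankdata(v)))

# 2) Standardized Euclidean: weight each dimension by its variance
# (here estimated from the pair itself, purely for illustration).
V = np.var(np.vstack([u, v]), axis=0, ddof=1)
print(distance.seuclidean(u, v, V))

# 3) Mahalanobis needs an inverse covariance matrix estimated from the
# data; with the identity matrix it reduces to the Euclidean distance.
print(distance.mahalanobis(u, v, np.eye(4)))
```

In practice the standard deviations for the standardized Euclidean distance and the covariance matrix for the Mahalanobis distance would be estimated from the whole image-parameter dataset, not from a single pair.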
