
Presentation Transcript


  1. DDM Kirk

  2. LSST-VAO discussion: Distributed Data Mining (DDM) Kirk Borne, George Mason University, March 24, 2011

  3. The LSST Data Challenges • 100,000 events every night • 100 PB image archive • 50 billion object database • 20 PB science catalog

  4. The LSST Data Mining Challenges • #1 Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years). • #2 Massive 20-Petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time. • #3 Massive event stream: knowledge extraction in real time for 100,000 events each night. • Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3. • Look at #2 and #3 in more detail ...

  5. LSST data mining challenge # 2 • Accurately characterize and classify 50 billion objects and 20 trillion source observations • Requires VO-accessible multi-wavelength data • Szalay’s Law: Astrophysical discovery potential grows as (number of data sources)² • Benefits of very large datasets: • best statistical analysis of “typical” events • automated search for “rare” events
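A toy numerical rendering of Szalay’s Law as quoted on this slide (Python; not part of the original talk, purely illustrative):

```python
# Toy illustration of Szalay's Law as stated on this slide:
# discovery potential scales as (number of federated data sources)**2.
def discovery_potential(n_sources: int) -> int:
    """Relative discovery potential for n_sources federated archives."""
    return n_sources ** 2

# Adding VO-accessible multi-wavelength archives to the LSST catalog
# grows the potential quadratically, not linearly.
for n in (1, 2, 5, 10):
    print(n, "sources ->", discovery_potential(n), "x relative potential")
```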

  6. LSST data mining challenge # 3 • Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data:

  7. LSST data mining challenge # 3 • Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data: [light-curve plot: flux vs. time]

  8. LSST data mining challenge # 3 • Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data: more data points help! [light-curve plot: flux vs. time, with additional epochs]

  9. LSST data mining challenge # 3 • Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data: more data points help! Characterize first, then classify. [light-curve plot: flux vs. time]
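The “characterize first, then classify” workflow built up on slides 6–9 can be sketched in a few lines of Python; the features, the threshold, and the sample light curve below are illustrative assumptions, not the LSST pipeline:

```python
import numpy as np

def characterize(time, flux):
    """Extract simple, model-free light-curve features (characterization step).

    `time` and `flux` are 1-D arrays for one sky event; more data points
    sharpen every one of these statistics, as the slides note.
    """
    flux = np.asarray(flux, dtype=float)
    return {
        "n_points":  flux.size,
        "amplitude": float(np.max(flux) - np.min(flux)),
        "median":    float(np.median(flux)),
        "mad":       float(np.median(np.abs(flux - np.median(flux)))),
        "slope":     float(np.polyfit(time, flux, 1)[0]),  # crude linear trend
    }

def classify(features):
    """Placeholder classification step: label the event from its features.

    A real pipeline would feed the feature vector to a trained classifier;
    the 5-sigma-like threshold here is purely illustrative.
    """
    if features["n_points"] < 3:
        return "unclassified (need more data points)"
    return "variable" if features["amplitude"] > 5 * features["mad"] else "quiet"

# Hypothetical observations of one event:
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
f = np.array([10.0, 10.2, 14.8, 10.1, 9.9])
print(classify(characterize(t, f)))
```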

  10. Characterization Use Case #1 • Feature detection and extraction: • Automated pipelines’ tasks: Characterize! • Identify and describe features in the data • Extract feature descriptors from the data • Curating these features for scientific re-use • Human experts’ tasks: Categorize and Classify! • Associate features with astrophysical processes • Find boundaries between feature sets and label them • Example: Star-Galaxy Separation
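To illustrate the division of labor on this slide (pipeline characterizes, experts supply labeled boundaries), here is a hedged sketch of star–galaxy separation with a supervised classifier; the feature columns, labels, and data are made up for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature matrix produced by the automated pipeline
# (characterization): each row is one detected object, columns are
# extracted feature descriptors such as concentration and colors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))            # e.g. [psf_minus_model_mag, g_r, r_i] (illustrative)
y_train = (X_train[:, 0] > 0.3).astype(int)     # 0 = star, 1 = galaxy (toy expert labels)

# Human experts supply the labeled boundary; the pipeline applies it at scale.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

X_new = rng.normal(size=(5, 3))                 # newly characterized objects
print(clf.predict(X_new))                       # 0/1 star-galaxy labels
```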

  11. Characterization Use Case #2 • The clustering problem: • Finding clusters of objects within a data set • Pipeline: apply an optimal algorithm for finding friends-of-friends or nearest neighbors • N is > 10¹⁰, so what is the most efficient way to sort? • Number of dimensions ~ 1000 – therefore, we have an enormous subspace search problem • Scientist: determine the significance of the clusters (statistically and scientifically) – categorize!
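A minimal friends-of-friends sketch (KD-tree plus union-find); the linking length and the tiny 2-D sample are stand-ins, since at LSST scale (N > 10¹⁰, ~10³ dimensions) the search must be partitioned across subspaces and data nodes:

```python
import numpy as np
from scipy.spatial import cKDTree

def friends_of_friends(points, linking_length):
    """Group points whose chain of pairwise separations stays below
    `linking_length` (friends-of-friends), via a KD-tree + union-find."""
    tree = cKDTree(points)
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in tree.query_pairs(r=linking_length):
        parent[find(i)] = find(j)

    return np.array([find(i) for i in range(len(points))])  # group label per point

# Toy 2-D subspace; real catalogs have ~10^3 attributes and N far beyond memory,
# so the search runs on partitions/subspaces and the results are merged.
pts = np.vstack([np.random.default_rng(1).normal(c, 0.1, size=(50, 2)) for c in (0.0, 3.0)])
labels = friends_of_friends(pts, linking_length=0.5)
print(len(set(labels)), "groups found")
```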

  12. Characterization Use Case #3 • Outlier detection: (unknown unknowns) • Finding the objects and events that are outside the bounds of our expectations (outside known clusters) • These may be real scientific discoveries or garbage • Outlier detection is therefore useful for: • Novelty Discovery – is my Nobel prize waiting? • Anomaly Detection – is the detector system working? • Data Quality Assurance – is the data pipeline working? • How does one optimally find outliers in 10³-D parameter space? or in interesting subspaces (in lower dimensions)? • How do we measure their “interestingness”?
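One possible implementation of outlier detection and an “interestingness” ranking (not the talk’s own method) uses an Isolation Forest on a hypothetical feature subspace:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical catalog feature matrix: rows = objects, columns = a chosen
# subspace of the ~10^3 measured attributes (full-dimension searches are
# usually done per subspace because of the curse of dimensionality).
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 20))
X[:10] += 8.0                                    # plant a few "unknown unknowns"

iso = IsolationForest(n_estimators=200, random_state=0).fit(X)
score = -iso.score_samples(X)                    # higher = more anomalous

# One possible "interestingness" ranking: most isolated objects first.
# They may be discoveries, detector faults, or pipeline bugs; triage follows.
print(np.argsort(score)[::-1][:10])
```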

  13. Characterization Use Case #4 • The dimension reduction problem: • Finding correlations and “fundamental planes” of parameters • Number of attributes can be hundreds or thousands • The Curse of High Dimensionality ! • Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another? • Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
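A short PCA-via-SVD sketch of the dimension-reduction idea: find the eigenvectors (a condensed basis set) that capture most of the variance in a hypothetical attribute table; the data here are simulated, and non-linear correlations would need other methods:

```python
import numpy as np

# Hypothetical attribute table: rows = objects, columns = hundreds of
# observational parameters. PCA (one linear option) finds the orthogonal
# combinations that carry most of the variance (candidate "fundamental planes").
rng = np.random.default_rng(7)
latent = rng.normal(size=(5000, 3))                      # 3 true degrees of freedom
X = latent @ rng.normal(size=(3, 200)) + 0.01 * rng.normal(size=(5000, 200))

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

print("variance in first 5 components:", np.round(explained[:5], 3))
# Vt[:3] are the leading eigenvectors: a condensed representation that
# reproduces most of the 200 measured properties. Non-linear structure
# would need other techniques (e.g., manifold learning).
```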

  14. The LSST Data Mining Challenges: What’s the common theme? • Need multi-wavelength data in all use cases! • VO-accessible ancillary information is essential.

  15. The LSST Data Mining Challenges: What’s the common theme? • Need multi-wavelength data in all use cases! • VO-accessible ancillary information is essential. Requirements for success: • Discovery of distributed data sources • Access to distributed data sources • Applying characterization and clustering (data mining) algorithms on distributed data: • Unsupervised and Supervised Machine Learning
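A hedged sketch of the “discover and access distributed, VO-accessible data” requirement using the pyvo package; the TAP endpoint URL, table name, and columns are placeholders, not a real archive’s schema:

```python
# Minimal sketch of the "access VO ancillary data" step, assuming the pyvo
# package is available. The service URL and schema below are hypothetical.
import pyvo

tap = pyvo.dal.TAPService("https://example.org/tap")     # hypothetical VO TAP endpoint

# Cone-style ADQL query for multi-wavelength ancillary data around one
# LSST event position (RA/Dec in degrees, radius in degrees).
adql = """
SELECT TOP 100 ra, dec, mag_infrared
FROM ancillary.catalog
WHERE 1 = CONTAINS(POINT('ICRS', ra, dec),
                   CIRCLE('ICRS', 150.1, 2.2, 0.01))
"""
result = tap.search(adql)           # synchronous ADQL query
print(result.to_table())            # astropy Table of matched ancillary sources
```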

  16. Data Bottleneck • Mismatch: • Data volumes increase 1000x in 10 years • I/O bandwidth improves ~3x in 10 years • Therefore ... Distributed Data Mining
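Back-of-envelope arithmetic behind the mismatch (illustrative only):

```python
# Back-of-envelope version of the mismatch on this slide.
data_growth = 1000   # data volume grows ~1000x over 10 years
io_growth   = 3      # I/O bandwidth grows only ~3x over the same decade

# The relative time to move a full copy of the data grows by this factor,
# which is why the code must move to the data instead.
print(f"Effective transfer gap after 10 years: ~{data_growth / io_growth:.0f}x")
```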

  17. Distributed Data Mining (DDM) • DDM comes in 2 types: • Mining of Distributed Data (MDD) • Distributed Mining of Data (DMD) • Type 1 takes many forms, with data being centralized (in whole or in partitions) • Type 2 requires sophisticated algorithms that operate with data in situ … • Ship the Code to the Data • The computations are done on the data locally, with partial results shipped around to the different data nodes, and the DDM algorithm iterates until a solution is converged upon. • This can be pipeline-initiated or scientist end-user-initiated. • References: http://www.cs.umbc.edu/~hillol/DDMBIB/ • Ultimate goal: Knowledge Discovery through Data Discovery
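A minimal “ship the code to the data” sketch: a simulated distributed k-means in which each node computes only local sufficient statistics (the partial results mentioned above) and a coordinator iterates to convergence; the node data, initialization, and cluster count are toy assumptions, not the algorithms in the cited DDM bibliography:

```python
import numpy as np

def local_stats(node_data, centroids):
    """Partial result computed where the data lives: per-centroid sums and counts.
    Only this small summary is shipped, never the raw node data."""
    labels = np.argmin(
        np.linalg.norm(node_data[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    sums = np.array([node_data[labels == k].sum(axis=0) for k in range(len(centroids))])
    counts = np.array([(labels == k).sum() for k in range(len(centroids))])
    return sums, counts

def distributed_kmeans(nodes, k, iters=20):
    """'Ship the code to the data': each node runs local_stats on its own
    partition; the coordinator merges partial results and re-broadcasts
    centroids until the solution converges."""
    centroids = nodes[0][:k].copy()                              # crude initialization
    for _ in range(iters):
        partials = [local_stats(d, centroids) for d in nodes]    # runs at each node
        sums = sum(p[0] for p in partials)
        counts = sum(p[1] for p in partials)
        new = np.where(counts[:, None] > 0,
                       sums / np.maximum(counts, 1)[:, None], centroids)
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Toy "data nodes" standing in for distributed archives.
rng = np.random.default_rng(3)
nodes = [rng.normal(c, 0.2, size=(200, 2)) for c in (0.0, 2.0, 5.0)]
print(distributed_kmeans(nodes, k=3))
```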
