
Presentation Transcript


  1. DDM Kirk

  2. LSST-VAO discussion: Distributed Data Mining (DDM) Kirk Borne, George Mason University, March 24, 2011

  3. The LSST Data Challenges • 100,000 events every night • 100 PB image archive • 50 billion object database • 20 PB science catalog

  4. The LSST Data Mining Challenges • #1 Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years). • #2 Massive 20-Petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time. • #3 Massive event stream: knowledge extraction in real time for 100,000 events each night. • Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3. • Look at #2 and #3 in more detail ...

  5. LSST data mining challenge # 2 • Accurately characterize and classify 50 billion objects and 20 trillion source observations • Requires VO-accessible multi-wavelength data • Szalay’s Law: Astrophysical discovery potential grows as (number of data sources)² • Benefits of very large datasets: • best statistical analysis of “typical” events • automated search for “rare” events
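A toy numerical rendering of Szalay’s Law as quoted on this slide (Python; not part of the original talk, purely illustrative):

```python
# Toy illustration of Szalay's Law as stated on this slide:
# discovery potential scales as (number of federated data sources)**2.
def discovery_potential(n_sources: int) -> int:
    """Relative discovery potential for n_sources federated archives."""
    return n_sources ** 2

# Adding VO-accessible multi-wavelength archives to the LSST catalog
# grows the potential quadratically, not linearly.
for n in (1, 2, 5, 10):
    print(n, "sources ->", discovery_potential(n), "x relative potential")
```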

  6. LSST data mining challenge # 3 • Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data:

  7. LSST data mining challenge # 3 • Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data: [light-curve plot: flux vs. time]

  8. LSST data mining challenge # 3 • Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data: more data points help! [light-curve plot: flux vs. time, with additional epochs]

  9. LSST data mining challenge # 3 • Approximately 100,000 times each night for 10 years LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data: more data points help! Characterize first, then classify. [light-curve plot: flux vs. time]
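The “characterize first, then classify” workflow built up on slides 6–9 can be sketched in a few lines of Python; the features, the threshold, and the sample light curve below are illustrative assumptions, not the LSST pipeline:

```python
import numpy as np

def characterize(time, flux):
    """Extract simple, model-free light-curve features (characterization step).

    `time` and `flux` are 1-D arrays for one sky event; more data points
    sharpen every one of these statistics, as the slides note.
    """
    flux = np.asarray(flux, dtype=float)
    return {
        "n_points":  flux.size,
        "amplitude": float(np.max(flux) - np.min(flux)),
        "median":    float(np.median(flux)),
        "mad":       float(np.median(np.abs(flux - np.median(flux)))),
        "slope":     float(np.polyfit(time, flux, 1)[0]),  # crude linear trend
    }

def classify(features):
    """Placeholder classification step: label the event from its features.

    A real pipeline would feed the feature vector to a trained classifier;
    the 5-sigma-like threshold here is purely illustrative.
    """
    if features["n_points"] < 3:
        return "unclassified (need more data points)"
    return "variable" if features["amplitude"] > 5 * features["mad"] else "quiet"

# Hypothetical observations of one event:
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
f = np.array([10.0, 10.2, 14.8, 10.1, 9.9])
print(classify(characterize(t, f)))
```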

  10. Characterization Use Case #1 • Feature detection and extraction: • Automated pipelines’ tasks: Characterize! • Identify and describe features in the data • Extract feature descriptors from the data • Curating these features for scientific re-use • Human experts’ tasks: Categorize and Classify! • Associate features with astrophysical processes • Find boundaries between feature sets and label them • Example: Star-Galaxy Separation
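To illustrate the division of labor on this slide (pipeline characterizes, experts supply labeled boundaries), here is a hedged sketch of star–galaxy separation with a supervised classifier; the feature columns, labels, and data are made up for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature matrix produced by the automated pipeline
# (characterization): each row is one detected object, columns are
# extracted feature descriptors such as concentration and colors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))            # e.g. [psf_minus_model_mag, g_r, r_i] (illustrative)
y_train = (X_train[:, 0] > 0.3).astype(int)     # 0 = star, 1 = galaxy (toy expert labels)

# Human experts supply the labeled boundary; the pipeline applies it at scale.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

X_new = rng.normal(size=(5, 3))                 # newly characterized objects
print(clf.predict(X_new))                       # 0/1 star-galaxy labels
```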

  11. Characterization Use Case #2 • The clustering problem: • Finding clusters of objects within a data set • Pipeline: apply an optimal algorithm for finding friends-of-friends or nearest neighbors • N is > 10¹⁰, so what is the most efficient way to sort? • Number of dimensions ~ 1000 – therefore, we have an enormous subspace search problem • Scientist: determine the significance of the clusters (statistically and scientifically) – categorize!
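A minimal friends-of-friends sketch (KD-tree plus union-find); the linking length and the tiny 2-D sample are stand-ins, since at LSST scale (N > 10¹⁰, ~10³ dimensions) the search must be partitioned across subspaces and data nodes:

```python
import numpy as np
from scipy.spatial import cKDTree

def friends_of_friends(points, linking_length):
    """Group points whose chain of pairwise separations stays below
    `linking_length` (friends-of-friends), via a KD-tree + union-find."""
    tree = cKDTree(points)
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in tree.query_pairs(r=linking_length):
        parent[find(i)] = find(j)

    return np.array([find(i) for i in range(len(points))])  # group label per point

# Toy 2-D subspace; real catalogs have ~10^3 attributes and N far beyond memory,
# so the search runs on partitions/subspaces and the results are merged.
pts = np.vstack([np.random.default_rng(1).normal(c, 0.1, size=(50, 2)) for c in (0.0, 3.0)])
labels = friends_of_friends(pts, linking_length=0.5)
print(len(set(labels)), "groups found")
```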

  12. Characterization Use Case #3 • Outlier detection: (unknown unknowns) • Finding the objects and events that are outside the bounds of our expectations (outside known clusters) • These may be real scientific discoveries or garbage • Outlier detection is therefore useful for: • Novelty Discovery – is my Nobel prize waiting? • Anomaly Detection – is the detector system working? • Data Quality Assurance – is the data pipeline working? • How does one optimally find outliers in 10³-D parameter space? or in interesting subspaces (in lower dimensions)? • How do we measure their “interestingness”?
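One possible implementation of outlier detection and an “interestingness” ranking (not the talk’s own method) uses an Isolation Forest on a hypothetical feature subspace:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical catalog feature matrix: rows = objects, columns = a chosen
# subspace of the ~10^3 measured attributes (full-dimension searches are
# usually done per subspace because of the curse of dimensionality).
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 20))
X[:10] += 8.0                                    # plant a few "unknown unknowns"

iso = IsolationForest(n_estimators=200, random_state=0).fit(X)
score = -iso.score_samples(X)                    # higher = more anomalous

# One possible "interestingness" ranking: most isolated objects first.
# They may be discoveries, detector faults, or pipeline bugs; triage follows.
print(np.argsort(score)[::-1][:10])
```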

  13. Characterization Use Case #4 • The dimension reduction problem: • Finding correlations and “fundamental planes” of parameters • Number of attributes can be hundreds or thousands • The Curse of High Dimensionality ! • Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another? • Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
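A short PCA-via-SVD sketch of the dimension-reduction idea: find the eigenvectors (a condensed basis set) that capture most of the variance in a hypothetical attribute table; the data here are simulated, and non-linear correlations would need other methods:

```python
import numpy as np

# Hypothetical attribute table: rows = objects, columns = hundreds of
# observational parameters. PCA (one linear option) finds the orthogonal
# combinations that carry most of the variance (candidate "fundamental planes").
rng = np.random.default_rng(7)
latent = rng.normal(size=(5000, 3))                      # 3 true degrees of freedom
X = latent @ rng.normal(size=(3, 200)) + 0.01 * rng.normal(size=(5000, 200))

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

print("variance in first 5 components:", np.round(explained[:5], 3))
# Vt[:3] are the leading eigenvectors: a condensed representation that
# reproduces most of the 200 measured properties. Non-linear structure
# would need other techniques (e.g., manifold learning).
```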

  14. The LSST Data Mining Challenges: What’s the common theme? • Need multi-wavelength data in all use cases! • VO-accessible ancillary information is essential.

  15. The LSST Data Mining Challenges: What’s the common theme? • Need multi-wavelength data in all use cases! • VO-accessible ancillary information is essential. Requirements for success: • Discovery of distributed data sources • Access to distributed data sources • Applying characterization and clustering (data mining) algorithms on distributed data: • Unsupervised and Supervised Machine Learning
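A hedged sketch of the “discover and access distributed, VO-accessible data” requirement using the pyvo package; the TAP endpoint URL, table name, and columns are placeholders, not a real archive’s schema:

```python
# Minimal sketch of the "access VO ancillary data" step, assuming the pyvo
# package is available. The service URL and schema below are hypothetical.
import pyvo

tap = pyvo.dal.TAPService("https://example.org/tap")     # hypothetical VO TAP endpoint

# Cone-style ADQL query for multi-wavelength ancillary data around one
# LSST event position (RA/Dec in degrees, radius in degrees).
adql = """
SELECT TOP 100 ra, dec, mag_infrared
FROM ancillary.catalog
WHERE 1 = CONTAINS(POINT('ICRS', ra, dec),
                   CIRCLE('ICRS', 150.1, 2.2, 0.01))
"""
result = tap.search(adql)           # synchronous ADQL query
print(result.to_table())            # astropy Table of matched ancillary sources
```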

  16. Data Bottleneck • Mismatch: • Data volumes increase 1000x in 10 years • I/O bandwidth improves ~3x in 10 years • Therefore ... Distributed Data Mining
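Back-of-envelope arithmetic behind the mismatch (illustrative only):

```python
# Back-of-envelope version of the mismatch on this slide.
data_growth = 1000   # data volume grows ~1000x over 10 years
io_growth   = 3      # I/O bandwidth grows only ~3x over the same decade

# The relative time to move a full copy of the data grows by this factor,
# which is why the code must move to the data instead.
print(f"Effective transfer gap after 10 years: ~{data_growth / io_growth:.0f}x")
```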

  17. Distributed Data Mining (DDM) • DDM comes in 2 types: • Mining of Distributed Data (MDD) • Distributed Mining of Data (DMD) • Type 1 takes many forms, with data being centralized (in whole or in partitions) • Type 2 requires sophisticated algorithms that operate with data in situ … • Ship the Code to the Data • The computations are done on the data locally, with partial results shipped around to the different data nodes, and the DDM algorithm iterates until a solution is converged upon. • This can be pipeline-initiated or scientist end-user-initiated. • References: http://www.cs.umbc.edu/~hillol/DDMBIB/ • Ultimate goal: Knowledge Discovery through Data Discovery
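A minimal “ship the code to the data” sketch: a simulated distributed k-means in which each node computes only local sufficient statistics (the partial results mentioned above) and a coordinator iterates to convergence; the node data, initialization, and cluster count are toy assumptions, not the algorithms in the cited DDM bibliography:

```python
import numpy as np

def local_stats(node_data, centroids):
    """Partial result computed where the data lives: per-centroid sums and counts.
    Only this small summary is shipped, never the raw node data."""
    labels = np.argmin(
        np.linalg.norm(node_data[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    sums = np.array([node_data[labels == k].sum(axis=0) for k in range(len(centroids))])
    counts = np.array([(labels == k).sum() for k in range(len(centroids))])
    return sums, counts

def distributed_kmeans(nodes, k, iters=20):
    """'Ship the code to the data': each node runs local_stats on its own
    partition; the coordinator merges partial results and re-broadcasts
    centroids until the solution converges."""
    centroids = nodes[0][:k].copy()                              # crude initialization
    for _ in range(iters):
        partials = [local_stats(d, centroids) for d in nodes]    # runs at each node
        sums = sum(p[0] for p in partials)
        counts = sum(p[1] for p in partials)
        new = np.where(counts[:, None] > 0,
                       sums / np.maximum(counts, 1)[:, None], centroids)
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Toy "data nodes" standing in for distributed archives.
rng = np.random.default_rng(3)
nodes = [rng.normal(c, 0.2, size=(200, 2)) for c in (0.0, 2.0, 5.0)]
print(distributed_kmeans(nodes, k=3))
```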
