
4 Data Reduction








  1. 4 Data Reduction Department of Applied Chemistry, 송상옥

  2. Outline • Why data reduction is needed • Roles and forms of dimension reduction • Specific methods of dimension reduction

  3. Why Is It Needed? • When there is too much data • the prediction program exceeds its capacity • finding a solution takes too long • The right amount of data • depends on the complexity of the concept contained in the data (model complexity) • cannot be known before mining • e.g., random data

  4. The Role of Dimension Reduction

  5. Forms of Dimension Reduction • Delete a column (feature) • Delete a row (case) • Reduce the number of values in a column (smooth a feature) • Transform to a new data set (PCA)

  6. Best Features Selection • Exhaustive search: impossible! • search space and computational time grow too fast • Approximation • evaluate promising subsets • use a simple distance measure • use only training error

  7. Mean and Variance • Cases: a sample from some distribution • Spreadsheet → mean and variance • BUT the distribution is unknown → guidance for heuristic feature selection
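The per-column computation the slide alludes to can be sketched with Python's standard library; the feature values below are hypothetical, not from the presentation.

```python
import statistics

# Hypothetical feature column: one measurement per case.
feature = [4.1, 3.8, 5.0, 4.6, 3.9, 4.4]

mean = statistics.mean(feature)          # sample mean
variance = statistics.variance(feature)  # unbiased sample variance (n - 1)

print(f"mean={mean:.3f}, variance={variance:.3f}")
```

Because the underlying distribution is unknown, these sample statistics serve only as heuristic guidance for feature selection, as the slide notes.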

  8. Independent Features • Classification problem • k-class classification → k pairwise comparisons • Regression = pseudo-classification

  9. Distance-Based Selection • Independence analysis + correlation analysis → detect redundancy • Distance measure • independent features • Branch-and-bound algorithm

  10. Heuristic Feature Selection • Comparison measures • significance test • Dm • F-test
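One common form of the mean-difference significance measure (an assumption on my part; the slide does not define Dm or its test) compares a feature's class means scaled by their standard errors. The class samples below are hypothetical.

```python
import math

def mean_significance(a, b):
    """Significance of a feature's mean difference between two classes:
    |mean(a) - mean(b)| / sqrt(var(a)/n_a + var(b)/n_b).
    Larger values suggest the feature separates the classes better."""
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    return abs(mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

# Hypothetical feature values for cases of class A and class B.
class_a = [1.0, 1.2, 0.9, 1.1]
class_b = [2.0, 2.1, 1.9, 2.2]
score = mean_significance(class_a, class_b)
```

Features are then ranked by this score, and low-scoring columns become candidates for deletion.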

  11. Principal Components • Merging features • a new set of fewer columns: the first k components • First principal component • minimum Euclidean distance • A feature with large variance • excellent chance of separating classes or groups of case values
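A minimal sketch of finding the first principal component, using power iteration on the covariance matrix (my choice of method; real code would use a linear-algebra library's SVD). The two-feature data set is hypothetical.

```python
import math

def first_principal_component(rows, iters=200):
    """Direction of maximum variance: power iteration on the covariance
    matrix converges to its dominant eigenvector."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: two strongly correlated features.
data = [[1.0, 1.1], [2.0, 2.0], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
pc1 = first_principal_component(data)
# Projecting each case onto pc1 merges two columns into one.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in data]
```

Keeping only the first k such components gives the "new set of fewer columns" the slide describes, at minimum Euclidean reconstruction error.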

  12. Decision Trees • Dynamic logic approach • coordinated with the search for a solution • advantageous in large feature spaces • recursive partitioning

  13. Reducing Values Problem • a clustering problem

  14. Rounding
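The slide gives only the heading; as a minimal illustration (my own, not from the presentation), rounding smooths a feature by collapsing nearby measurements onto a few shared values.

```python
def smooth_by_rounding(values, ndigits=0):
    """Reduce the number of distinct feature values by rounding."""
    return [round(v, ndigits) for v in values]

# Hypothetical noisy feature column.
raw = [10.31, 10.28, 9.97, 10.02, 10.33]
smoothed = smooth_by_rounding(raw, ndigits=1)
distinct = sorted(set(smoothed))  # fewer distinct values remain
```

The trade-off is resolution: coarser rounding means fewer values but more information lost per column.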

  15. K-Mean Clustering
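A sketch of one-dimensional k-means for value reduction (Lloyd's algorithm, my implementation; the deterministic quantile initialization is a simplifying assumption). Each value can afterwards be replaced by its cluster mean, leaving only k distinct values in the column.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's k-means on a single feature column.
    Centers start at evenly spaced points of the sorted values."""
    s = sorted(values)
    centers = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assign each value to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Hypothetical column with three natural groups of values.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.8, 10.1, 10.0]
centers = kmeans_1d(values, k=3)
```

Replacing each value by its nearest center smooths the feature exactly as the "reduce the number of values in a column" form of dimension reduction requires.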

  16. Class Entropy
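The slide gives only the heading; a minimal sketch of class entropy as a split-quality criterion follows (my own example with hypothetical labels). Lower weighted entropy after a split means the split separates the classes better.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy (in bits) of a class-label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

# Hypothetical class labels on either side of a candidate split point.
left = ["A", "A", "A", "B"]
right = ["B", "B", "B", "B"]
n = len(left) + len(right)
weighted = (len(left) / n) * class_entropy(left) \
         + (len(right) / n) * class_entropy(right)
```

Candidate split points for a feature are compared by this weighted entropy, and the lowest-entropy cut points define the reduced set of values.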

  17. How Many Cases? • The right sample size → depends on complexity • closely tied to the prediction method • a good solution in a short time → case reduction!! • Basic approaches (random sampling) • incremental samples • average samples

  18. A Single Sample

  19. Incremental Samples

  20. Average Samples • variance error can be reduced without adding bias • best-solution approach
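The variance-reduction idea above can be sketched numerically (a toy demonstration of my own, not the presentation's code): estimates averaged over several random samples scatter less around the true value than a single-sample estimate, with no added bias.

```python
import random

def sample_mean_estimate(population, sample_size, rng):
    """Estimate the population mean from one random sample."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

rng = random.Random(42)
# Hypothetical population of 10,000 numeric case values.
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]
true_mean = sum(population) / len(population)

# One estimate from a single sample vs. the average of 25 samples.
single = sample_mean_estimate(population, 100, rng)
averaged = sum(sample_mean_estimate(population, 100, rng)
               for _ in range(25)) / 25
```

The same principle applies to predictions: averaging models built on several random samples of the cases reduces variance error while each sample stays small enough to mine quickly.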

  21. Specialized Techniques • Sequential sampling over time • time-dependent data • optimize the trade-off between the sampling period and feature measuring • Strategic sampling of key events • net change > threshold (regression) • Adjusting prevalence • repeat cases of the low-prevalence class
