
4 Data Reduction








  1. 4 Data Reduction Department of Applied Chemistry, 송상옥

  2. Outline • Why data reduction is needed • Roles and forms of dimension reduction • Specific methods of dimension reduction

  3. Why Is It Needed? • When there is too much data • the prediction program exceeds its capacity • finding a solution takes too long • The right amount of data • depends on the complexity of the concept contained in the data (model complexity) • cannot be known before mining • e.g., random data

  4. The Role of Dimension Reduction

  5. Forms of Dimension Reduction • Delete a column (feature) • Delete a row (case) • Reduce the number of values in a column (smooth a feature) • Transform to a new data set (PCA)

  6. Best Features Selection • Exhaustive search: impossible! • search space and computational time grow too fast • Approximation • evaluate promising subsets • use a simple distance measure • use only training error

  7. Mean and Variance • Cases: a sample from some distribution • Spreadsheet → mean and variance • BUT the distribution is unknown → guidance for heuristic feature selection
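The per-column computation the slide alludes to can be sketched with Python's standard library; the feature values below are hypothetical, not from the presentation.

```python
import statistics

# Hypothetical feature column: one measurement per case.
feature = [4.1, 3.8, 5.0, 4.6, 3.9, 4.4]

mean = statistics.mean(feature)          # sample mean
variance = statistics.variance(feature)  # unbiased sample variance (n - 1)

print(f"mean={mean:.3f}, variance={variance:.3f}")
```

Because the underlying distribution is unknown, these sample statistics serve only as heuristic guidance for feature selection, as the slide notes.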

  8. Independent Features • Classification problem • k-class classification → k pairwise comparisons • Regression = pseudo-classification

  9. Distance-Based Selection • Independence analysis + correlation analysis → detect redundancy • Distance measure • independent features • Branch-and-bound algorithm

  10. Heuristic Feature Selection • Comparison measures • significance test • Dm • F-test
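One common form of the mean-difference significance measure (an assumption on my part; the slide does not define Dm or its test) compares a feature's class means scaled by their standard errors. The class samples below are hypothetical.

```python
import math

def mean_significance(a, b):
    """Significance of a feature's mean difference between two classes:
    |mean(a) - mean(b)| / sqrt(var(a)/n_a + var(b)/n_b).
    Larger values suggest the feature separates the classes better."""
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    return abs(mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

# Hypothetical feature values for cases of class A and class B.
class_a = [1.0, 1.2, 0.9, 1.1]
class_b = [2.0, 2.1, 1.9, 2.2]
score = mean_significance(class_a, class_b)
```

Features are then ranked by this score, and low-scoring columns become candidates for deletion.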

  11. Principal Components • Merging features • a new set of fewer columns: the first k components • First principal component • minimum Euclidean distance • A feature with large variance • excellent chance of separating classes or groups of case values
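A minimal sketch of finding the first principal component, using power iteration on the covariance matrix (my choice of method; real code would use a linear-algebra library's SVD). The two-feature data set is hypothetical.

```python
import math

def first_principal_component(rows, iters=200):
    """Direction of maximum variance: power iteration on the covariance
    matrix converges to its dominant eigenvector."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: two strongly correlated features.
data = [[1.0, 1.1], [2.0, 2.0], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
pc1 = first_principal_component(data)
# Projecting each case onto pc1 merges two columns into one.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in data]
```

Keeping only the first k such components gives the "new set of fewer columns" the slide describes, at minimum Euclidean reconstruction error.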

  12. Decision Trees • Dynamic logic approach • coordinated with the search for a solution • advantageous in large feature spaces • recursive partitioning

  13. Reducing Values Problem • a clustering problem

  14. Rounding
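The slide gives only the heading; as a minimal illustration (my own, not from the presentation), rounding smooths a feature by collapsing nearby measurements onto a few shared values.

```python
def smooth_by_rounding(values, ndigits=0):
    """Reduce the number of distinct feature values by rounding."""
    return [round(v, ndigits) for v in values]

# Hypothetical noisy feature column.
raw = [10.31, 10.28, 9.97, 10.02, 10.33]
smoothed = smooth_by_rounding(raw, ndigits=1)
distinct = sorted(set(smoothed))  # fewer distinct values remain
```

The trade-off is resolution: coarser rounding means fewer values but more information lost per column.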

  15. K-Mean Clustering
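A sketch of one-dimensional k-means for value reduction (Lloyd's algorithm, my implementation; the deterministic quantile initialization is a simplifying assumption). Each value can afterwards be replaced by its cluster mean, leaving only k distinct values in the column.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's k-means on a single feature column.
    Centers start at evenly spaced points of the sorted values."""
    s = sorted(values)
    centers = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assign each value to its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Hypothetical column with three natural groups of values.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.8, 10.1, 10.0]
centers = kmeans_1d(values, k=3)
```

Replacing each value by its nearest center smooths the feature exactly as the "reduce the number of values in a column" form of dimension reduction requires.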

  16. Class Entropy
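The slide gives only the heading; a minimal sketch of class entropy as a split-quality criterion follows (my own example with hypothetical labels). Lower weighted entropy after a split means the split separates the classes better.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy (in bits) of a class-label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

# Hypothetical class labels on either side of a candidate split point.
left = ["A", "A", "A", "B"]
right = ["B", "B", "B", "B"]
n = len(left) + len(right)
weighted = (len(left) / n) * class_entropy(left) \
         + (len(right) / n) * class_entropy(right)
```

Candidate split points for a feature are compared by this weighted entropy, and the lowest-entropy cut points define the reduced set of values.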

  17. How Many Cases? • The right sample size → depends on complexity • closely tied to the prediction method • a good solution in a short time → case reduction!! • Basic approaches (random sampling) • incremental samples • average samples

  18. A Single Sample

  19. Incremental Samples

  20. Average Samples • variance error can be reduced without adding bias • best-solution approach
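The variance-reduction idea above can be sketched numerically (a toy demonstration of my own, not the presentation's code): estimates averaged over several random samples scatter less around the true value than a single-sample estimate, with no added bias.

```python
import random

def sample_mean_estimate(population, sample_size, rng):
    """Estimate the population mean from one random sample."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

rng = random.Random(42)
# Hypothetical population of 10,000 numeric case values.
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]
true_mean = sum(population) / len(population)

# One estimate from a single sample vs. the average of 25 samples.
single = sample_mean_estimate(population, 100, rng)
averaged = sum(sample_mean_estimate(population, 100, rng)
               for _ in range(25)) / 25
```

The same principle applies to predictions: averaging models built on several random samples of the cases reduces variance error while each sample stays small enough to mine quickly.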

  21. Specialized Techniques • Sequential sampling over time • time-dependent data • optimize the trade-off between the sampling period and feature measuring • Strategic sampling of key events • net change > threshold (regression) • Adjusting prevalence • repeat cases of the low-prevalence class
