Instance Construction via Likelihood-Based Data Squashing

Madigan, D., et al.

(Ch. 12, Instance Selection and Construction for Data Mining (2001), Kluwer Academic Publishers)

Summarized by: Jinsan Yang, SNU Biointelligence Lab

- Abstract
- Data Compression Method: Squashing
- LDS: Likelihood-based data squashing

- Keywords
Instance Construction, Data Squashing

- Introduction
- The LDS Algorithm
- Evaluation: Logistic Regression
- Evaluation: Neural Networks
- Iterative LDS
- Discussion

- Massive data examples
- Large-scale retailing
- Telecommunications
- Astronomy
- Computational biology
- Internet logging

- Some computational challenges
- Algorithms need multiple passes over the data
- Disk access: 10^5 to 10^6 times slower than main memory
- Current solution: scaling up existing algorithms
- Here: scaling down the data

- Data squashing: 750,000 records squashed to 8,443 (DuMouchel et al., 1999)
- Outperforms a random sample of size 7,543 by a factor of 500 in MSE

- Motivation: Bayes' rule
- Given three data points d1, d2, d3, estimate the parameter θ
- Cluster the data points by likelihood profile
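
The reasoning behind this motivation can be sketched as follows. The notation is assumed, since the slide's formulas did not survive: a prior p(θ), conditionally independent data points, and a hypothetical pseudo point d* that stands in for d1 and d2 with weight 2.

```latex
\begin{align*}
p(\theta \mid d_1, d_2, d_3)
  &\propto p(\theta)\, p(d_1 \mid \theta)\, p(d_2 \mid \theta)\, p(d_3 \mid \theta) \\
  &\approx p(\theta)\, p(d^{*} \mid \theta)^{2}\, p(d_3 \mid \theta)
  \quad \text{if } p(d_1 \mid \theta) \approx p(d_2 \mid \theta) \approx p(d^{*} \mid \theta)
  \text{ for all relevant } \theta .
\end{align*}
```

Under these assumptions, points with similar likelihood profiles can be merged into one weighted pseudo point with little change to the posterior.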

- Details of LDS Algorithm
- [Select] Choose values of θ by a central composite design

(Figure: central composite design for 3 factors)
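
A minimal sketch of how the points of a central composite design could be generated in coded units: one center point, 2^k factorial corners at ±1, and 2k axial points at ±alpha. The function names, the rotatable choice of alpha, and the `theta_grid` helper that shifts and scales the design around an initial estimate are assumptions for illustration, not taken from the slides.

```python
import itertools


def central_composite_design(k, alpha=None):
    """Return CCD points in coded units: center, 2^k corners, 2k axial points."""
    if alpha is None:
        # Rotatable design; this choice of alpha is an assumption.
        alpha = (2 ** k) ** 0.25
    center = [tuple(0.0 for _ in range(k))]
    corners = list(itertools.product((-1.0, 1.0), repeat=k))
    axial = []
    for axis in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[axis] = sign * alpha
            axial.append(tuple(point))
    return center + corners + axial


def theta_grid(theta_hat, scale, alpha=None):
    """Hypothetical helper: place the design around an initial estimate."""
    design = central_composite_design(len(theta_hat), alpha)
    return [
        tuple(t + s * c for t, s, c in zip(theta_hat, scale, row))
        for row in design
    ]


design = central_composite_design(3)
print(len(design))  # 15 = 1 center + 8 corners + 6 axial points
```

For 3 factors this gives 15 values of θ, so each data point's likelihood profile would have 15 entries.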

- [Profile] Evaluate the likelihood profile of each data point at the selected values of θ
- [Cluster] Cluster the mother data in a single pass
- Select n' random samples as initial cluster centers
- Assign each remaining data point to one of the clusters

- [Construct] Construct the pseudo data:
- cluster center
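
The Profile, Cluster, and Construct steps above can be sketched for a logistic regression model as follows. Several details are assumptions rather than the authors' method: points are assigned to the nearest seed by squared Euclidean distance between log-likelihood profiles, seed centers stay fixed during the pass, and each pseudo point is the mean of its cluster members with the cluster size as its weight. The θ values in the usage part are hand-picked placeholders.

```python
import math
import random


def log_lik(record, theta):
    """Log-likelihood of one (x, y) record under logistic regression."""
    x, y = record
    eta = theta[0] + sum(b * xi for b, xi in zip(theta[1:], x))
    # y*eta - log(1 + exp(eta)), computed in a numerically stable way
    return y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))


def profile(record, thetas):
    """[Profile] Log-likelihood of the record at each selected theta."""
    return [log_lik(record, th) for th in thetas]


def squash(data, thetas, n_clusters, seed=0):
    rng = random.Random(seed)
    # [Cluster] random records serve as initial cluster centers
    seeds = rng.sample(range(len(data)), n_clusters)
    centers = [profile(data[i], thetas) for i in seeds]
    members = [[] for _ in range(n_clusters)]
    for record in data:  # single pass over the mother data
        p = profile(record, thetas)
        nearest = min(
            range(n_clusters),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
        )
        members[nearest].append(record)
    # [Construct] one weighted pseudo point per non-empty cluster (assumed: mean)
    pseudo = []
    for group in members:
        if not group:
            continue
        w = len(group)
        dim = len(group[0][0])
        x_mean = tuple(sum(r[0][d] for r in group) / w for d in range(dim))
        y_mean = sum(r[1] for r in group) / w
        pseudo.append((x_mean, y_mean, w))
    return pseudo


# Usage with simulated data; the thetas below are illustrative placeholders.
rng = random.Random(1)
data = []
for _ in range(200):
    x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    p1 = 1.0 / (1.0 + math.exp(-(0.5 + x[0] - x[1])))
    data.append((x, 1 if rng.random() < p1 else 0))
thetas = [(0.5, 1.0, -1.0), (0.0, 1.0, -1.0), (1.0, 1.0, -1.0),
          (0.5, 0.5, -1.0), (0.5, 1.0, -0.5)]
pseudo = squash(data, thetas, n_clusters=20)
```

The weights of the pseudo points sum to the size of the mother data, so a weighted fit on the squashed set keeps the original sample size.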

- Small-scale simulations:
- Initial estimate of θ
- Plot: log(error ratio)
- Three methods of initial parameter estimation
- 100 data points / 48 squashed data points

- Medium scale: 100,000 records; baseline: 1% simple random sample

- Large scale: 744,963 records; baseline: 1% simple random sample

- Feed-forward network: two input nodes, one hidden layer with 3 units, single binary output

- Mother data: 10,000; squashed data: 1,000; repetitions: 30; test data: 1,000, generated from the same network
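
A minimal sketch of how such a simulation could be set up: a network with two inputs, three hidden units, and one binary output generates both the mother data and the test data. The logistic activations, Gaussian inputs, and randomly drawn weights are assumptions; the slides do not give these details.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """P(y = 1 | x) for a 2-3-1 feed-forward network."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


def simulate(n, params, rng):
    """Draw n (x, y) records from the network defined by params."""
    records = []
    for _ in range(n):
        x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        y = 1 if rng.random() < predict_proba(x, *params) else 0
        records.append((x, y))
    return records


rng = random.Random(0)
params = (
    [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(3)],  # 3 hidden units x 2 inputs
    [rng.gauss(0.0, 1.0) for _ in range(3)],                      # hidden biases
    [rng.gauss(0.0, 1.0) for _ in range(3)],                      # output weights
    rng.gauss(0.0, 1.0),                                          # output bias
)
mother_data = simulate(10000, params, rng)
test_data = simulate(1000, params, rng)
```

Repeating this with fresh weights 30 times would match the number of repetitions listed above.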

- Comparisons for P(whole) - P(reduced)

- When the estimate of θ is not accurate:
1. Set θ from a simple random sample
2. Squash by LDS
3. Estimate θ from the squashed data
4. Go to 2
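
The four steps above amount to a simple loop, sketched here with hypothetical stand-in callables. The fixed iteration count is an assumption, since the slides give no stopping rule, and the toy model (a normal mean with unit variance, squashed by bucketing rounded log-likelihood profiles) is chosen only to keep the example short.

```python
import random


def iterative_lds(data, initial_estimate, squash, fit, n_iter=3):
    theta = initial_estimate(data)        # step 1
    for _ in range(n_iter):               # fixed count is an assumption
        pseudo = squash(data, theta)      # step 2
        theta = fit(pseudo)               # step 3, then back to step 2
    return theta


# Toy stand-ins for a N(theta, 1) model (illustrative only).
def initial_estimate(data, frac=0.01, seed=0):
    """Mean of a simple random sample of the data."""
    rng = random.Random(seed)
    sample = rng.sample(data, max(1, int(frac * len(data))))
    return sum(sample) / len(sample)


def squash(data, theta, width=0.25):
    """Group points whose rounded log-likelihood profiles coincide."""
    groups = {}
    for x in data:
        key = tuple(
            round(-0.5 * (x - t) ** 2 / width)
            for t in (theta - 1.0, theta, theta + 1.0)
        )
        groups.setdefault(key, []).append(x)
    # One (mean, weight) pseudo point per group
    return [(sum(g) / len(g), len(g)) for g in groups.values()]


def fit(pseudo):
    """Weighted mean of the pseudo points."""
    total = sum(w for _, w in pseudo)
    return sum(x * w for x, w in pseudo) / total


rng = random.Random(42)
data = [rng.gauss(2.0, 1.0) for _ in range(5000)]
theta = iterative_lds(data, initial_estimate, squash, fit)
```

In this toy case the weighted mean of the pseudo points reproduces the full-data mean, so the loop mainly illustrates the control flow; with a real model each pass would re-center the likelihood profiles on the improved estimate.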