Grid Computing in Data Mining and Data Mining on Grid Computing

David Cieslak (dcieslak@cse.nd.edu) Advisor: Nitesh Chawla (nchawla@cse.nd.edu) University of Notre Dame

Grid Computing in Data Mining How you help me

Data Mining Primer • Data Mining: "The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." - Fayyad, Piatetsky-Shapiro & Smyth, 1996 • Classifier: Learning algorithm which trains a predictive model from data • Ensemble: A set of classifiers working together to improve prediction

Applications of Data Mining • Network Intrusion Detection • Categorizing Adult Income • Finding Calcifications in Mammography • Looking for Oil Spills • Identifying Handwritten Digits • Predicting Job Failure on a Computing Grid • Anticipating Successful Companies

Condor Makes DM Tractable • I use a small set of algorithms in high volume. Ex: Run the same classifier on many datasets • A single data mining operation may have easily parallelized segments. Ex: Learn an ensemble of 10 classifiers on a dataset • Introducing simple parallelism into data mining saves significant time
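The ensemble example can be sketched locally. This is a minimal illustration, not the speaker's setup: a thread pool stands in for the Condor pool, the toy threshold learner stands in for a real classifier, and the names `train_one`, `learn_ensemble`, and `predict` are hypothetical. Each member is trained on its own bootstrap sample (a bagging-style choice made here for illustration), so the ten training runs are independent and can run side by side.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_one(task):
    """Train one toy classifier on a bootstrap sample of (x, label) pairs."""
    data, seed = task
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]      # sample with replacement
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:                         # degenerate sample: constant model
        label = 1 if pos else 0
        return lambda x: label
    # Threshold halfway between the two class means (assumes class 1 has larger x).
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= threshold else 0

def learn_ensemble(data, n_members=10, max_workers=4):
    """Train the members independently; the pool here is a local stand-in."""
    tasks = [(data, seed) for seed in range(n_members)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_one, tasks))

def predict(ensemble, x):
    """Combine the members by majority vote."""
    votes = Counter(model(x) for model in ensemble)
    return votes.most_common(1)[0][0]

# Made-up data: class 0 has small x, class 1 has large x.
data = [(float(i), 0) for i in range(10)] + [(float(i), 1) for i in range(20, 30)]
ensemble = learn_ensemble(data)
```

Because no member depends on another, the same structure maps onto one pool job per ensemble member.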

Common DM Task: 10 Fold CV • Original Data: Network Traffic Dataset, ~30 MB Data

Common DM Task: 10 Fold CV • Original Data: ~30 MB Data • 10 Training Folds: ~27 MB Data each • 10 Testing Folds: ~3 MB Data each

Common DM Task: 10 Fold CV • Training Fold i (~27 MB Data) -> Learning Algorithm (RIPPER) -> Train Classifier: ~2 Hours • Testing Fold i (~3 MB Data) -> Evaluate Classifier: < 1 min

Common DM Task: 10 Fold CV • Average and aggregate various statistics and measures across folds
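The 10-fold procedure on these slides can be sketched in a few lines. This is a generic illustration under stated assumptions, not the code used in the talk: `k_fold_splits` and `cross_validate` are hypothetical names, and `learn` and `evaluate` are placeholders for a real learning algorithm (such as RIPPER) and a real metric.

```python
import random

def k_fold_splits(records, k=10, seed=0):
    """Yield (train, test) pairs; every record lands in exactly one test fold."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

def cross_validate(records, learn, evaluate, k=10):
    """Train and evaluate once per fold, then average the per-fold scores."""
    scores = []
    for train, test in k_fold_splits(records, k):
        model = learn(train)
        scores.append(evaluate(model, test))
    return sum(scores) / len(scores)

records = list(range(100))
splits = list(k_fold_splits(records))
```

Each training fold holds about 90% of the data and each testing fold about 10%, which matches the ~27 MB and ~3 MB sizes quoted for a ~30 MB dataset.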

Using Condor on 10 Folds • Local Host (~5 mins): Splits Data; Upload Data and Task to Pool • Condor Pool (~2 Hours): Learn Classifier; Evaluate Classifier; Return results • Local Host (~5 mins): Receive Results; Aggregate/Average • If there is ~2 Hours of Learn/Eval time per fold, Condor saves up to 18 hours in real time
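The saving can be checked with a little arithmetic. The sketch below takes the ~2 hours per fold and the two ~5 minute local steps from the slide as given, and assumes (this is not stated in the talk) that all ten folds run in the pool at the same time and that the serial run has negligible splitting and aggregation overhead. The function names are hypothetical.

```python
def serial_hours(n_folds, fold_hours):
    """Wall-clock time when one machine works through the folds in sequence."""
    return n_folds * fold_hours

def pooled_hours(fold_hours, upload_hours, collect_hours):
    """Wall-clock time when every fold runs at once in the pool (assumption)."""
    return upload_hours + fold_hours + collect_hours

serial = serial_hours(10, 2.0)                # 10 folds x ~2 hours each
pooled = pooled_hours(2.0, 5 / 60, 5 / 60)    # ~5 min upload, ~2 hours, ~5 min collect
saved = serial - pooled
print(f"serial={serial:.1f} h, pooled={pooled:.2f} h, saved={saved:.2f} h")
```

Under these assumptions the saving is a little under 18 hours. More generally it scales as roughly (folds - 1) times the per-fold time, less the transfer overhead.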

A More Complex DM Task Over/Under Sampling Wrapper • Split data into 50 folds (single) • Generate 10 undersamplings and 20 oversamplings per fold (pool) • Learn classifier on each undersampling (pool) • Evaluate and select best undersampling (single) • Learn classifier combining best undersampling with each oversampling (pool) • Evaluate best combination (single) • Obtain results on test folds (pool) • Aggregate/Average results (single)
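To get a feel for the scale of this wrapper, the phases can be written down as data and the pool jobs counted. The counts below rest on assumptions not stated in the talk: one pool job per fold for the sampling and test phases, and one pool job per classifier learned. The `phases` layout is hypothetical.

```python
FOLDS, UNDERSAMPLINGS, OVERSAMPLINGS = 50, 10, 20

# (phase, where it runs, number of jobs under the assumptions above)
phases = [
    ("split data into folds",                 "single", 1),
    ("generate under/oversamplings",          "pool",   FOLDS),
    ("learn on each undersampling",           "pool",   FOLDS * UNDERSAMPLINGS),
    ("select best undersampling",             "single", 1),
    ("learn best under + each oversampling",  "pool",   FOLDS * OVERSAMPLINGS),
    ("evaluate best combination",             "single", 1),
    ("obtain results on test folds",          "pool",   FOLDS),
    ("aggregate/average results",             "single", 1),
]

pool_jobs = sum(n for _, where, n in phases if where == "pool")
for name, where, n in phases:
    print(f"{where:>6}  {n:>5}  {name}")
print("total pool jobs:", pool_jobs)
```

The single-machine phases act as synchronization points: each pool phase must finish before the next local selection step can run.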

Condor Speed-Ups & Usage
• 10 Fold CV Evaluation
  • Single Machine: roughly one day
  • Using Condor: under one hour
• Over/Under Sampling Wrapper
  • Single Machine: days to weeks
  • Using Condor: under a day
• In 2006, I used 471,126 CPU hours via Condor
• I am “slacking” in 2007: 13,235 CPU hours

A Data Miner’s Wishlist
• User specifies task to system
  • Outlines serial task phases
• System “smartly” divides labor
  • What is the logical task granule, based on:
    • Condor Pool Performance
    • Upload/download latency
    • Data size
    • Algorithm Complexity

Data Mining on Grid Computing How I help you

It’s Ugly in the Real World
• Machine related failures:
  • Power outages, network outages, faulty memory, corrupted file system, bad config files, expired certs, packet filters...
• Job related failures:
  • Crash on some args, bad executable, missing input files, mistake in args, missing components, failure to understand dependencies...
• Incompatibilities between jobs and machines:
  • Missing libraries, not enough disk/cpu/mem, wrong software installed, wrong version installed, wrong memory layout...
• Load related failures:
  • Slow actions induce timeouts; kernel tables: files, sockets, procs; router tables: addresses, routes, connections; competition with other users...
• Non-deterministic failures:
  • Multi-thread/CPU synchronization, event interleaving across systems, random number generators, interactive effects, cosmic rays...

A “Grand Challenge” Problem:
• A user submits one million jobs to the grid.
• Half of them fail.
• Now what?
  • Examine the output of every failed job?
  • Log in to every site to examine the logs?
  • Resubmit and hope for the best?
• We need some way of getting the big picture.
• We need to identify problems not seen before.

An Idea:
• We have lots of structured information about the components of a grid.
• Can we perform some form of data mining to discover the big picture of what is going on?
  • User: Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2.
  • Admin: User “joe” is running 1000s of jobs that transfer 10 TB of data that fail immediately; perhaps he needs help.
• Can we act on this information to improve the system?
  • User: Avoid resources that are not working for you.
  • Admin: Assist user in understanding and fixing the problem.

Job ClassAd
MyType = "Job"
TargetType = "Machine"
ClusterId = 11839
QDate = 1150231068
CompletionDate = 0
Owner = "dcieslak"
JobUniverse = 5
Cmd = "ripper-cost-can-9-50.sh"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
ExitStatus = 0
ImageSize = 40000
DiskUsage = 110000
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
ExitBySignal = FALSE
PoolName = "ccl00.cse.nd.edu"
CondorVersion = "6.7.19 May 10 2006"
…

Machine ClassAd
MyType = "Machine"
TargetType = "Job"
Name = "ccl00.cse.nd.edu"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
MachineGroup = "ccl"
MachineOwner = "dthain"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
VirtualMachineID = 1
ExecutableSize = 20000
JobUniverse = 1
NiceUser = FALSE
VirtualMemory = 962948
Memory = 498
Cpus = 1
Disk = 19072712
CondorLoadAvg = 1.000000
LoadAvg = 1.130000
…

User Job Log
Job 1 submitted.
Job 2 submitted.
Job 1 placed on ccl00.cse.nd.edu
Job 1 evicted.
Job 1 placed on smarty.cse.nd.edu.
Job 1 completed.
Job 2 placed on dvorak.helios.nd.edu
Job 2 suspended
Job 2 resumed
Job 2 exited normally with status 1.
...
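Before any mining, ads like these have to become attribute/value records. The sketch below is a simplified reader for the flat `Name = value` layout shown on the slide, not the Condor ClassAd library: `parse_classad` and `convert` are hypothetical helpers, and expressions such as `CpuBusy` are kept as plain text rather than evaluated.

```python
def convert(raw):
    """Turn a ClassAd literal into a Python value; leave expressions as text."""
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1]
    if raw in ("TRUE", "FALSE"):
        return raw == "TRUE"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # e.g. ((LoadAvg - CondorLoadAvg) >= 0.500000)

def parse_classad(text):
    """Parse 'Name = value' lines into a dict, skipping anything else."""
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if " = " not in line:
            continue
        key, _, raw = line.partition(" = ")   # split at the first ' = ' only
        ad[key.strip()] = convert(raw.strip())
    return ad

machine_text = """
MyType = "Machine"
Name = "ccl00.cse.nd.edu"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
NiceUser = FALSE
Memory = 498
LoadAvg = 1.130000
"""
ad = parse_classad(machine_text)
```

Numeric attributes such as `Memory` and `Disk` come out as numbers, which is what a rule learner needs in order to propose thresholds on them.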

[Diagram: the User Job Log is combined with paired Job Ads and Machine Ads; the records are divided into a Success Class and a Failure Class; DATA MINING over the two classes yields findings such as "Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2."]
Failure Criteria: exit != 0, core dump, evicted, suspended, bad output
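One way to read this slide is as a join followed by labeling: each placement from the user log is paired with its job ad and machine ad, then tagged as success or failure. The sketch below assumes a hypothetical record layout (the `placements` dictionaries and their field names are made up for illustration) and treats "bad output" as a flag supplied by the user.

```python
def is_failure(placement):
    """Apply the failure criteria: exit != 0, core dump, evicted, suspended, bad output."""
    return bool(
        placement.get("exit_status", 0) != 0
        or placement.get("core_dump")
        or placement.get("evicted")
        or placement.get("suspended")
        or placement.get("bad_output")
    )

def build_examples(placements, job_ads, machine_ads):
    """Join each placement with its job and machine ad and attach a class label."""
    examples = []
    for p in placements:
        features = {}
        features.update({"Job." + k: v for k, v in job_ads[p["job"]].items()})
        features.update({"Machine." + k: v for k, v in machine_ads[p["machine"]].items()})
        label = "failure" if is_failure(p) else "success"
        examples.append((features, label))
    return examples

# Made-up records, loosely modeled on the log lines shown earlier.
job_ads = {1: {"ImageSize": 40000}, 2: {"ImageSize": 40000}}
machine_ads = {"ccl00": {"Memory": 498}, "dvorak": {"Memory": 2048}}
placements = [
    {"job": 1, "machine": "ccl00", "exit_status": 0},
    {"job": 2, "machine": "dvorak", "exit_status": 1, "suspended": True},
]
examples = build_examples(placements, job_ads, machine_ads)
```

The resulting (features, label) pairs are the kind of table a rule learner can consume directly.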

------------------------- run 1 -------------------------
Hypothesis:
exit1 :- Memory>=1930, JobStart>=1.14626e+09, MonitorSelfTime>=1.14626e+09 (491/377)
exit1 :- Memory>=1930, Disk<=555320 (1670/1639).
default exit0 (11904/4503).
Error rate on holdout data is 30.9852%
Running average of error rate is 30.9852%
------------------------- run 2 -------------------------
Hypothesis:
exit1 :- Memory>=1930, Disk<=541186 (2076/1812).
default exit0 (12090/4606).
Error rate on holdout data is 31.8791%
Running average of error rate is 31.4322%
------------------------- run 3 -------------------------
Hypothesis:
exit1 :- Memory>=1930, MonitorSelfImageSize>=8.844e+09 (1270/1050).
exit1 :- Memory>=1930, KeyboardIdle>=815995 (793/763).
exit1 :- Memory>=1927, EnteredCurrentState<=1.14625e+09, VirtualMemory>=2.09646e+06, LoadAvg>=30000, LastBenchmark<=1.14623e+09, MonitorSelfImageSize<=7.836e+09 (94/84).
exit1 :- Memory>=1927, TotalLoadAvg<=1.43e+06, UpdatesTotal<=8069, LastBenchmark<=1.14619e+09, UpdatesLost<=1 (77/61).
default exit0 (11940/4452).
Error rate on holdout data is 31.8111%
Running average of error rate is 31.5585%
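Rules in this form are easy to apply back to new records. The sketch below parses one `label :- condition, ...` line and classifies a record by taking the first rule whose conditions all hold, falling back to the default class. It is an illustration, not part of RIPPER: `parse_rule` and `classify` are hypothetical helpers, only the `>=` and `<=` tests seen in the output are handled, and the trailing counts in parentheses are discarded without being interpreted.

```python
import operator
import re

OPS = {">=": operator.ge, "<=": operator.le}
CONDITION = re.compile(r"^(\w+)(>=|<=)(.+)$")
TRAILING_COUNTS = re.compile(r"\(\d+/\d+\)\.?\s*$")

def parse_rule(line):
    """Parse 'label :- A>=1, B<=2 (n/m).' into (label, [(name, op, value), ...])."""
    head, _, body = line.partition(":-")
    body = TRAILING_COUNTS.sub("", body).strip()
    conditions = []
    for part in body.split(","):
        name, op, value = CONDITION.match(part.strip()).groups()
        conditions.append((name, OPS[op], float(value)))
    return head.strip(), conditions

def classify(rules, default, record):
    """Return the label of the first rule whose conditions all hold, else the default."""
    for label, conditions in rules:
        if all(name in record and op(record[name], value)
               for name, op, value in conditions):
            return label
    return default

# The single rule from run 2, with exit0 as the default class.
rules = [parse_rule("exit1 :- Memory>=1930, Disk<=541186 (2076/1812).")]
big_memory = classify(rules, "exit0", {"Memory": 2048, "Disk": 500000})
small_memory = classify(rules, "exit0", {"Memory": 512, "Disk": 500000})
```

A filter of this kind is one way an end user could steer jobs away from machines that match a learned failure rule.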

Unexpected Discoveries
• Purdue Teragrid (91343 jobs on 2523 CPUs)
  • Jobs fail on machines with (Memory>1920MB)
  • Diagnosis: Linux machines with > 3GB have a different memory layout that breaks some programs that do inappropriate pointer arithmetic.
• UND & UW (4005 jobs on 1460 CPUs)
  • Jobs fail on machines with less than 4MB disk.
  • Diagnosis: Condor failed in an unusual way when the job transfers input files that don’t fit.

Many Open Problems
• Strengths and Weaknesses of Approach
  • Correlation != Causation -> could be enough?
  • Limits of reported data -> increase resolution?
  • Not enough data points -> direct job placement?
• Acting on Information
  • Steering by the end user.
  • Applying learned rules back to the system.
  • Evaluating (and sometimes abandoning) changes.
  • Creating tools that assist with “digging deeper.”
• Data Mining Research
  • Continuous intake + incremental construction.
  • Creating results that non-specialists can understand.

Acknowledgements
• Dr. Thain (University of Notre Dame)
  • Local Condor expert
  • Use of some slides for this presentation
• Cooperative Computing Lab
  • Maintain/Improve local Condor Pool
  • Provide computing resources

Condor Related Publications
• D. Cieslak, D. Thain, N. Chawla, "Troubleshooting Distributed Systems via Data Mining," (HPDC-15), June 2006
• N. Chawla, D. Cieslak, "Evaluating Calibration of Probability Estimation Trees," AAAI Workshop on the Evaluation Methods in Machine Learning, July 2006
• N. Chawla, D. Cieslak, L. Hall, A. Joshi, "Killing Two Birds with One Stone: Countering Cost and Imbalance," Data Mining and Knowledge Discovery, Under Revision

Questions?
