
SCAN-Lite: Enterprise-wide analysis on the cheap



Presentation Transcript


  1. SCAN-Lite: Enterprise-wide analysis on the cheap
     Craig Soules, Kimberly Keeton, Brad Morrey

  2. Enterprise information management
     • Search
     • Clustering
     • Provenance
     • Classification
     • IT trending
     • Virus scanning
     (diagram: metadata server)

  3. Enterprise information management (diagram: metadata server and clients)
     Data is duplicated across machines! Duplicate analysis is wasted work.

  4. Issues
     • Analysis programs conflict on clients
       • They contend for system resources (memory, disk)
     • Clients repeat work
       • Duplicate files exist on multiple clients
     • Client foreground workloads are impacted
       • Work exceeds available idle time on busy clients

  5. Approaches
     • Reduce resource contention
     (diagram: a single client)

  6. Approaches
     • Avoid duplicate work
     (diagram: multiple clients)

  7. Approaches
     • Leverage duplication to balance client load
     • Delay analysis to identify all duplicates
     (diagram: clients and a global scheduler)

  8. Solutions
     • Local scheduler
       • Coordinates analyses to reduce resource contention
       • Up to 60% improvement
     • Global scheduler
       • Identifies duplicates to remove work
       • Balances load
       • 40% reduction in impact on foreground tasks

  9. Local scheduling
     • Traditionally, analyses are separate programs
       • Scheduling is left to the operating system
       • Programs potentially run at different times
     • Each program identifies files to scan
     • Each program opens and reads file data
     (diagram: analysis programs contending for the disk)

  10. Unified local scheduling
     • Each analysis routine is a separate thread
     • A control thread manages the shared tasks
       • Identifies files to scan, and opens/reads file data
     • A shared memory buffer distributes file data to the analysis plugins
     (diagram: control thread, disk, shared memory, analysis plugins)
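The control-thread idea on this slide can be sketched in Python. This is a minimal illustration, not SCAN-Lite's implementation: the plugin functions (`scan_word_count`, `scan_byte_count`) are invented stand-ins for real routines like virus scanning and search indexing, and the paper's shared memory buffer is approximated here with per-plugin queues.

```python
import queue
import threading

# Illustrative analysis plugins; the real SCAN-Lite routines
# (search indexing, virus scanning, etc.) would go here.
def scan_word_count(name, data):
    return (name, "words", len(data.split()))

def scan_byte_count(name, data):
    return (name, "bytes", len(data))

PLUGINS = [scan_word_count, scan_byte_count]

def run_unified_scan(files, results):
    """Control-thread logic: read each file once, then fan the
    data out to one worker thread per analysis plugin."""
    queues = [queue.Queue() for _ in PLUGINS]

    def worker(plugin, q):
        while True:
            item = q.get()
            if item is None:          # sentinel: no more files
                break
            results.append(plugin(*item))

    threads = [threading.Thread(target=worker, args=(p, q))
               for p, q in zip(PLUGINS, queues)]
    for t in threads:
        t.start()

    # The key point: each file is read exactly once, instead of
    # once per analysis program.
    for name, data in files.items():
        for q in queues:
            q.put((name, data))
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()

results = []
run_unified_scan({"a.txt": "one two three", "b.txt": "x"}, results)
```

The queues give each plugin its own backlog, so a slow routine does not stall the others the way lockstep sharing would.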

  11. Local scheduling performance
     • Ran a fitness test using 7 analysis routines
       • 42 data sets, each containing files of a fixed size
       • Ran both approaches over each data set
       • Calculated per-file elapsed scan time
       • Dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1
     • Seven-at-once: run each analysis routine separately at the same time
     • Unified: SCAN-Lite's unified local scheduling approach

  12. Elapsed time vs. CPU time
     • The original fitness test used CPU time
       • It gave less variable performance curves for modeling
     • Disk contention shows up in elapsed time, not CPU time
       • CPU time is multiplexed across programs; elapsed time is not
     (diagram: for two contending apps, the sum of CPU times falls well short of the max of elapsed times; relying on CPU time here is very bad)

  13. Local scheduling results
     • 17% - 60% improvement
     • Seven-at-once benefits from deep disk queues, but deep queues hurt foreground apps
     • Small random I/Os interact worse than larger ones

  14. Global scheduler
     • Two goals:
       • Reduce additional work from duplicate files
       • Utilize duplication to schedule work on the "best" client
     • Two-phase scanning
       • Phase one: identify duplicate files using content hashing
       • Phase two: analyze one copy at the appropriate client
     • Delaying between phases one and two provides opportunity for additional duplication and deletion

  15. Traditional scanning (diagram: clients and server)

  16. Phase one: duplicate detection (diagram: clients and server)

  17. Phase two: scheduling (diagram: clients and server)

  18. When to schedule
     • Clients upload hashes each scheduling period
     • The freshness specifies a deadline by which new data must be analyzed
     (timeline diagram: scheduling immediately gives one option; waiting until just before the freshness deadline gives three options)

  19. How to schedule
     • Scheduling is a bin packing problem
       • Files are balls, clients are bins
       • The size of a bin is the client's available idle time
       • The size of a ball is the time required for analysis
       • The color of balls/bins encodes the location of duplicates
     (diagram: files packed into the idle time of clients A-D)

  20. How to schedule
     • We use a greedy heuristic for scheduling
       • It considers idle time and machine priorities
       • See the paper for details
     (diagram: files packed into the idle time of clients A-D)
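A much-simplified version of such a greedy packing follows. This is not the paper's heuristic (which also weighs machine priorities); it only illustrates the shape of the problem: place the costliest analyses first, each on a replica-holding client with the most remaining idle time. All names and numbers are illustrative.

```python
def greedy_schedule(unique_files, idle_time):
    """Simplified greedy bin packing: files are balls, clients are
    bins sized by idle time, and a file may only be placed on a
    client that holds a copy of it."""
    remaining = dict(idle_time)
    assignment = {}
    # unique_files: {content_hash: (scan_cost, [clients holding a copy])}
    for h, (cost, holders) in sorted(unique_files.items(),
                                     key=lambda kv: -kv[1][0]):
        best = max(holders, key=lambda c: remaining[c])
        assignment[h] = best
        remaining[best] -= cost   # may go negative: client impact
    return assignment, remaining

assignment, left = greedy_schedule(
    {"h1": (5, ["A", "B"]), "h2": (3, ["A"]), "h3": (2, ["B", "C"])},
    {"A": 6, "B": 6, "C": 6},
)
```

Note that "h2" has only one replica, so client A is overcommitted; its negative remaining idle time is exactly the kind of overflow the Client Impact metric later charges against foreground workloads.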

  21. Work ahead
     • Start by scheduling all work that meets freshness
     • Schedule additional work on still-idle machines
       • Any remaining idle time can be used for additional work
       • We refer to this as work ahead
     (diagram: files packed into the idle time of clients A-D)

  22. Two-phase scanning: trade-offs (diagram: one-phase vs. two-phase cost across clients)

  23. Two-phase scanning: trade-offs (diagram: one-phase vs. two-phase cost across clients)

  24. Two-phase scanning: trade-offs
     • If the cost of hashing exceeds the work saved by skipping duplicates, one-phase scanning is better
     • Analysis of hashing costs using SHA-1 indicates that at least 3% data duplication is needed to break even
     • Do we see that in practice?
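A break-even figure like the 3% above falls out of a simple cost model, sketched here with illustrative numbers rather than the paper's measurements:

```python
def breakeven_duplication(hash_cost, analysis_cost):
    """Per-file costs, with d = fraction of files that are duplicates:
         one-phase:  analysis_cost
         two-phase:  hash_cost + (1 - d) * analysis_cost
       Two-phase wins when hash_cost < d * analysis_cost,
       i.e. when d > hash_cost / analysis_cost."""
    return hash_cost / analysis_cost

# Illustrative: if SHA-1 hashing costs 3% of a full analysis pass,
# any duplication rate above 3% makes two-phase scanning cheaper.
d = breakeven_duplication(hash_cost=0.03, analysis_cost=1.0)
```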

  25. Duplication in enterprise data
     • Examined two data sources:
       • 100 user home directories from a central server
       • 12 user productivity machines
     • Both data sets showed ~10% duplication
       • Even more with system files, email servers, sharepoints, etc.
     • This is sufficient duplication for work reduction
     (diagram: hash groups across two data sets; 4 of 7 copies are duplicates, i.e. 4/7 duplication)

  26. Global scheduling policies
     • Traditional: one-phase scanning, scan all copies
     • Rand: two-phase scanning, random scheduling
     • BestPlace: two-phase scanning, greedy scheduling
     • BestPlaceTime: two-phase scanning, greedy scheduling + work ahead
     • Opt: unreplicated data only, delayed + work ahead

  27. Client metrics
     • Total Work: total elapsed time spent on analysis and hashing
     • Client Impact: time spent that exceeded the client's idle time
     (diagram: total work vs. idle time, with the excess shown as client impact)

  28. Client metrics
     • Metrics are calculated for each day
     • Then summed over the entire simulation period
     (diagram: total work vs. idle time, as on the previous slide)
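Under these definitions, the two metrics for a single client reduce to a few lines. The daily numbers below are illustrative, not from the paper's traces:

```python
def client_metrics(daily_work, daily_idle):
    """Total Work: all analysis + hashing time over the period.
    Client Impact: the portion of each day's work that exceeded
    that day's idle time, summed over the period."""
    total_work = sum(daily_work)
    impact = sum(max(0, w - i) for w, i in zip(daily_work, daily_idle))
    return total_work, impact

# Illustrative 3-day window: only day 2 overruns its idle time.
total, impact = client_metrics(daily_work=[4, 7, 2], daily_idle=[5, 5, 5])
```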

  29. Experimental setup
     • Implemented a simulator to test a variety of machine configurations and scheduling policies
     • Configuration: 50 high-priority blades, 50 low-priority laptops
       • Blades modeled after a dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1
       • Laptops modeled after a 2GHz Pentium M, 1.5GB RAM, 60GB SATA
     • Simulated 30 days
       • Daily creation rates and layouts from traced workloads
       • Freshness of 3 days, scheduling period of 1 day

  30. Total work (results chart)
     • Removing duplicate work reduces the total work done
     • Preferring faster blade machines over laptops increases the blades' total work in order to reduce client impact
     • Doing work ahead of the freshness delay means analyzing some files that would otherwise have been deleted

  31. Client impact (results chart): 40% improvement
     • Less work means less impact
     • Choosing the best place helps hit the idle-time targets, reducing average client impact
     • By doing work ahead of the freshness deadline, SCAN-Lite takes better advantage of idle time
     • The theoretical Opt is only 8% better than BestPlaceTime

  32. Summary
     • Reducing local scanning interference is critical
       • 17% - 60% improvement from reduced contention
     • Two-phase scanning reduces analysis overheads
       • Reduces total work to near single-copy cost
       • Reduced client impact by up to 40% on our workload

  33. Future work
     • This is an initial system for reducing analysis costs; many improvements remain!
     • Vary freshness delays
       • Different applications may have different requirements
     • Provide freshness and scan priorities to clients
       • Could prioritize scan order so as not to exceed client idle time
     • Try more workloads
       • May need better bin packing algorithms

  34. Summary
     • Ever-increasing number of analyses in the enterprise
       • Search, provenance, trending, clustering, classification, etc.
     • Local scheduling reduces resource contention on clients
       • Up to 60% performance improvement
     • Two-phase scanning reduces work and balances load
       • Delays analysis to identify duplicate work
     • Global scheduling balances load
       • Reduced client impact by up to 40% on our workload
