
File Classification in self-* storage systems


Presentation Transcript


  1. File Classification in self-* storage systems Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer

  2. Introduction • Self-* infrastructure needs information about • Users • Applications • Policies • This information is not readily provided, and the system cannot depend on users or applications to supply it • So? It must be learned

  3. Self-* storage systems • A sub-problem of self-* infrastructure • Key: obtain hints from the attributes creators associate with their files • File size • File names • Lifetimes • Once intentions are determined, policy decisions can be made • Result: better file organization and performance

  4. Classifying Files • Current practice: rule-of-thumb policy selection • Generic, not optimized • Better: distinguish classes of files • Enables finer-grained policies • Ideally assigned at file creation • So classes must be determined at creation time • A self-* system must learn this association • Either 1) offline from traces or 2) online in a running file system

  5. So, how? • Build a model that classifies files based on a subset of attributes • Name • Owner • Permissions • Irrelevant attributes must be filtered out • The classifier must learn rules for doing so • Trained on a sample of files • Then inference can begin
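
  To make the attribute idea concrete, here is a minimal Python sketch of one possible creation-time attribute vector; the helper name and the exact attributes chosen are illustrative assumptions, not the paper's feature set.

      import os
      import stat

      def extract_attributes(path):
          """Build a creation-time attribute vector for one file.

          The attributes used here (extension, depth, owner, mode)
          are stand-ins; any creation-time metadata could be used.
          """
          st = os.lstat(path)
          name = os.path.basename(path)
          return {
              "extension": os.path.splitext(name)[1].lower(),  # e.g. ".log"
              "name_depth": path.count(os.sep),                # directory depth
              "owner_uid": st.st_uid,                          # file owner
              "mode": stat.S_IMODE(st.st_mode),                # permission bits
          }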

  6. The right model • The model must be • Scalable • Dynamic • Cost-sensitive (aware of mis-prediction cost) • Interpretable (by humans) • Model selected: decision trees
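
  Interpretability is straightforward to demonstrate: a trained tree's rules print as plain threshold tests. A toy sketch using scikit-learn as the learner (the attribute names and labels below are invented):

      from sklearn.tree import DecisionTreeClassifier, export_text

      # Two made-up binary attributes and a toy label, just to show
      # that the learned rules read as human-friendly if/else tests.
      X = [[0, 1], [1, 0], [1, 1], [0, 0]]
      y = ["keep", "evict", "evict", "keep"]
      tree = DecisionTreeClassifier().fit(X, y)
      print(export_text(tree, feature_names=["ext_is_tmp", "owner_is_root"]))
      # prints nested tests such as "|--- ext_is_tmp <= 0.50 ... class: keep"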

  7. ABLE • Attribute-Based Learning Environment • 1. Obtain traces • 2. Build a decision tree • 3. Make predictions • The tree is grown top-down until all attributes are used • Samples are split until each leaf holds files with similar attributes • Once the tree is built, querying begins
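
  A sketch of steps 2 and 3 with scikit-learn's decision tree standing in for ABLE's learner; the trace samples and the lifetime labels are invented for illustration.

      from sklearn.feature_extraction import DictVectorizer
      from sklearn.tree import DecisionTreeClassifier

      # Step 1 stand-in: a toy trace of creation-time attributes
      # paired with an observed property (lifetime class).
      samples = [
          {"extension": ".o",   "owner_uid": 1000},
          {"extension": ".c",   "owner_uid": 1000},
          {"extension": ".tmp", "owner_uid": 1001},
          {"extension": ".pdf", "owner_uid": 1001},
      ]
      labels = ["short-lived", "long-lived", "short-lived", "long-lived"]

      vec = DictVectorizer(sparse=False)  # one-hot encodes string attributes
      X = vec.fit_transform(samples)

      tree = DecisionTreeClassifier().fit(X, labels)  # step 2: build the tree

      # Step 3: query the tree when a new file is created.
      new_file = {"extension": ".tmp", "owner_uid": 1000}
      print(tree.predict(vec.transform([new_file])))  # likely ['short-lived']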

  8. Tests • Run against several systems to ensure the approach is workload-independent • DEAS03 • EECS03 • CAMPUS • LAB • The baseline: the MODE algorithm, which places all files in a single cluster
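
  A minimal sketch of what such a MODE baseline could look like; the class and its interface are invented here, not taken from the paper.

      from collections import Counter

      class ModePredictor:
          """Baseline that ignores attributes entirely: every file lands
          in one cluster, labeled with the most common training class."""

          def fit(self, labels):
              self.mode = Counter(labels).most_common(1)[0][0]
              return self

          def predict(self, files):
              return [self.mode for _ in files]

      baseline = ModePredictor().fit(["short-lived", "long-lived", "short-lived"])
      print(baseline.predict(["a.c", "b.tmp"]))  # ['short-lived', 'short-lived']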

  9. Results • Prediction results are quite good • 90%–100% accuracy claimed • Clusterings of files by attribute are clear • The authors predict that a model’s ruleset will converge over time

  10. Benefits of incremental learning • Dynamically refines the model as samples become available • Generally better than one-shot learners • One-shot learners sometimes perform poorly • Rulesets of incremental learners are smaller
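
  The slides do not show ABLE's incremental algorithm itself; a simple stand-in, sketched below, is to retrain periodically on the growing sample.

      from sklearn.feature_extraction import DictVectorizer
      from sklearn.tree import DecisionTreeClassifier

      class IncrementalTree:
          """Refines the model as samples arrive by periodically
          retraining on everything seen so far; an approximation of
          incremental learning, not ABLE's actual algorithm."""

          def __init__(self, retrain_every=1000):
              self.retrain_every = retrain_every
              self.samples, self.labels = [], []
              self.vec = DictVectorizer(sparse=False)
              self.tree = DecisionTreeClassifier()

          def observe(self, attrs, label):
              """Fold one observed (attributes, outcome) pair into the model."""
              self.samples.append(attrs)
              self.labels.append(label)
              if len(self.labels) % self.retrain_every == 0:
                  X = self.vec.fit_transform(self.samples)  # re-encode attrs
                  self.tree.fit(X, self.labels)             # rebuild the tree

          def predict(self, attrs):
              return self.tree.predict(self.vec.transform([attrs]))[0]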

  11. On accuracy • More attributes mean a greater chance of over-fitting • More rules mean a smaller samples-to-rules ratio • Which loses the compression benefit of a compact ruleset • Predictive models can make false predictions • These can hurt performance • e.g., files that should be in RAM end up on disk instead • Solution: cost functions • Penalize errors • Produce a deliberately biased tree • System goals must be translated into these costs
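
  One way to realize such a cost function is through class weights, sketched here with scikit-learn; the "hot"/"cold" classes and the 10:1 cost ratio are invented for illustration.

      from sklearn.tree import DecisionTreeClassifier

      # Assume mis-classifying a "hot" file (it lands on disk instead of
      # RAM) costs 10x more than mis-classifying a "cold" one.
      tree = DecisionTreeClassifier(class_weight={"hot": 10.0, "cold": 1.0})

      X = [[0], [0], [1], [1]]      # a single toy attribute
      y = ["cold", "cold", "hot", "cold"]
      tree.fit(X, y)                # with the weights, the ambiguous x=1
                                    # leaf now predicts "hot", avoiding the
                                    # expensive hot-as-cold mistake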

  12. Conclusion • These trees provide prediction accuracies in the 90% range • The model adapts via incremental learning • Continuing work: integration into the self-* infrastructure

  13. Questions?
