File Classification in self-* storage systems

Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer


Presentation Transcript


  1. File Classification in self-* storage systems Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer

  2. Introduction • Self-* infrastructure needs information about • Users • Applications • Policies • This information is not readily provided, and the system cannot depend on users to supply it • So it must be learned

  3. Self-* storage systems • A sub-problem of the self-* infrastructure • Key: get hints from the attributes creators associate with their files • File size • File names • Lifetimes • Once intentions are determined, decisions can be made • Results: better file organization and performance

  4. Classifying Files • Current: rule-of-thumb policy selection • Generic, not optimized • Better: distinguish classes of files • Finer-grained policies • Ideally assigned at file creation • So classes must be determined at creation • Self-* must learn this association from 1) traces or 2) a running file system

  5. So, how? • Create a model that classifies files based on (some) attributes • Name • Owner • Permissions • Irrelevant attributes must be filtered out • The classifier must learn rules to do so • Rules are learned from a training set • Then inference happens

  6. The right model • The model must be • Scalable • Dynamic • Cost-sensitive (mis-prediction cost) • Interpretable (by humans) • Model selected: decision trees

  7. ABLE • Attribute-based learning environment • 1. Obtain traces • 2. Build a decision tree • 3. Make predictions • The tree is built top-down • Samples are split until the leaves hold files with similar attributes, or until all attributes are used • Once the tree is built, it can be queried
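
The tree-building steps on this slide can be sketched roughly as follows. This is a minimal illustration, not ABLE itself: it assumes categorical attributes, uses an entropy-based split criterion (the slides do not name one), and the names `build_tree`, `predict`, and the toy `trace` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(samples, attributes):
    """Build a tree top-down from (attrs_dict, label) pairs.

    Returns a label (leaf) or a tuple (attribute, {value: subtree}, default_label).
    Splitting stops when a node is pure or no attributes remain.
    """
    labels = [label for _, label in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority

    def split_entropy(attr):
        # Weighted entropy of the partitions produced by splitting on attr.
        groups = {}
        for attrs, label in samples:
            groups.setdefault(attrs[attr], []).append(label)
        return sum(len(g) / len(samples) * entropy(g) for g in groups.values())

    best = min(attributes, key=split_entropy)
    remaining = [a for a in attributes if a != best]
    partitions = {}
    for attrs, label in samples:
        partitions.setdefault(attrs[best], []).append((attrs, label))
    children = {v: build_tree(part, remaining) for v, part in partitions.items()}
    return (best, children, majority)

def predict(tree, attrs):
    """Walk the tree; fall back to the node's majority label for unseen values."""
    while isinstance(tree, tuple):
        attr, children, default = tree
        tree = children.get(attrs[attr], default)
    return tree

# Hypothetical trace: file attributes known at creation, plus an observed class.
trace = [
    ({"ext": ".lock", "owner": "alice", "mode": "rw"}, "short-lived"),
    ({"ext": ".lock", "owner": "bob", "mode": "rw"}, "short-lived"),
    ({"ext": ".tmp", "owner": "alice", "mode": "rw"}, "short-lived"),
    ({"ext": ".c", "owner": "alice", "mode": "rw"}, "long-lived"),
    ({"ext": ".c", "owner": "bob", "mode": "r"}, "long-lived"),
    ({"ext": ".pdf", "owner": "bob", "mode": "r"}, "long-lived"),
]
tree = build_tree(trace, ["ext", "owner", "mode"])
new_file = {"ext": ".lock", "owner": "carol", "mode": "rw"}
print(predict(tree, new_file))
```

In this toy trace the extension alone separates the classes, so the tree splits on `ext` and never consults `owner` or `mode`, which is one way irrelevant attributes end up filtered out.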

  8. Tests • Run on traces from several systems to check that the approach is workload-independent • DEAS03 • EECS03 • CAMPUS • LAB • The control: the MODE algorithm, which places all files in a single cluster
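
One reading of the MODE control is a predictor that ignores attributes entirely and answers with the most common class seen in training. The sketch below assumes that interpretation; `mode_predictor`, `accuracy`, and the sample data are hypothetical.

```python
from collections import Counter

def mode_predictor(training_labels):
    """Return a predictor that always answers with the most common training label."""
    most_common = Counter(training_labels).most_common(1)[0][0]
    return lambda _attrs: most_common

def accuracy(predict, samples):
    """Fraction of (attrs, label) samples that the predictor gets right."""
    return sum(predict(attrs) == label for attrs, label in samples) / len(samples)

# Hypothetical samples: attributes are ignored by the baseline.
samples = [
    ({"ext": ".c"}, "small"),
    ({"ext": ".h"}, "small"),
    ({"ext": ".log"}, "small"),
    ({"ext": ".iso"}, "large"),
]
baseline = mode_predictor([label for _, label in samples])
print(accuracy(baseline, samples))
```

A classifier that uses attributes is only worthwhile if it beats this kind of baseline on the same samples.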

  9. Results • Prediction results are quite good • 90%-100% accuracy claimed • Files cluster clearly by attributes • The authors predict that a model’s ruleset will converge over time

  10. Benefits of incremental learning • Dynamically refines the model as samples become available • Generally better than one-shot learners • One-shot learners sometimes perform poorly • Rulesets of incremental learners are smaller
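
The difference between one-shot and incremental learning can be illustrated with a deliberately simple stand-in model. The sketch below swaps the decision tree for a per-extension majority table, purely to keep the example short; `ExtensionModel` and the two "days" of samples are hypothetical.

```python
from collections import Counter, defaultdict

class ExtensionModel:
    """Per-extension majority table, used here as a stand-in for a decision tree."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # extension -> label counts
        self.overall = Counter()            # label counts across all files

    def update(self, samples):
        """Fold newly observed (extension, label) samples into the model."""
        for ext, label in samples:
            self.counts[ext][label] += 1
            self.overall[label] += 1

    def predict(self, ext):
        """Use the extension's majority label, or the overall majority if unseen."""
        seen = self.counts.get(ext)
        source = seen if seen else self.overall
        return source.most_common(1)[0][0]

day1 = [(".c", "long-lived"), (".c", "long-lived"), (".tmp", "short-lived")]
day2 = [(".log", "short-lived"), (".log", "short-lived"), (".c", "long-lived")]

one_shot = ExtensionModel()
one_shot.update(day1)          # trained once, never refreshed

incremental = ExtensionModel()
incremental.update(day1)
incremental.update(day2)       # refined as new samples arrive

print(one_shot.predict(".log"), incremental.predict(".log"))
```

The one-shot model has never seen `.log` files and falls back to the overall majority, while the incrementally updated model has learned the new pattern.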

  11. On accuracy • More attributes = greater chance of over-fitting • More rules -> smaller compression ratios • Compression benefits are lost • Predictive models can make false predictions • These can hurt performance • E.g. data that should be in RAM is placed on disk instead • Solution: cost functions • Penalize errors • Create a biased tree • System goals will need to be translated into cost functions
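
One simple way to apply a cost function is at the leaves: instead of predicting the majority class, predict the class with the lowest expected mis-prediction cost. This is a sketch of that general idea under assumed costs, not necessarily how the authors bias their trees; `min_cost_label`, the class names, and the cost values are hypothetical.

```python
def min_cost_label(class_counts, cost):
    """Pick the label with the lowest expected mis-prediction cost at a leaf.

    class_counts: {label: number of training files with that label at the leaf}
    cost: {(actual, predicted): penalty} for every pair with actual != predicted
    """
    total = sum(class_counts.values())

    def expected_cost(predicted):
        return sum(
            n / total * cost[(actual, predicted)]
            for actual, n in class_counts.items()
            if actual != predicted
        )

    return min(class_counts, key=expected_cost)

# A leaf where most files are "cold", but placing a "hot" file on disk is costly.
leaf = {"cold": 6, "hot": 4}
biased = {("hot", "cold"): 5.0, ("cold", "hot"): 1.0}
uniform = {("hot", "cold"): 1.0, ("cold", "hot"): 1.0}

print(min_cost_label(leaf, uniform))  # majority wins when errors cost the same
print(min_cost_label(leaf, biased))   # the expensive error flips the decision
```

With uniform costs the leaf predicts "cold" (expected cost 0.4 versus 0.6). With the biased costs, predicting "cold" has expected cost 0.4 × 5 = 2.0 and predicting "hot" has 0.6 × 1 = 0.6, so the leaf predicts "hot".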

  12. Conclusion • Decision trees provide prediction accuracies in the 90% range • They are adaptable via incremental learning • Continued work: integration into self-* infrastructure

  13. Questions?
