
Polonium: Tera -Scale Graph Mining for Malware Detection



Presentation Transcript


  1. Polonium: Tera-Scale Graph Mining for Malware Detection (Patent Pending). Polo Chau, Machine Learning Dept; Carey Nachenberg, Vice President & Fellow; Jeffrey Wilhelm, Principal Software Engr (my boss); Prof. Christos Faloutsos, Computer Science Dept (my advisor); Adam Wright, Software Engineer.

  2. Anti-Virus Software… Polo Chau. Machine Learning Department, Carnegie Mellon University

  3. Detecting Malware
     Traditional malware detection approaches rely on signatures:
     • Collect malware samples
     • Security experts generate signatures from samples
     • Signatures are distributed to users’ computers as updates
     How to handle the increasingly common “zero-day” malware? With so much new or unknown malware, the signature-based approach does not work: no samples → no signatures → no detection.

  4. Symantec’s New Reputation-Based Approach
     • Symantec is the world’s leading personal security software provider
     • Computes a reputation score for every application and protects users from those with poor reputation
     • Leverages terabytes of data anonymously contributed by the millions of participants in the worldwide Norton Community Watch program

  5. Symantec’s New Reputation-Based Approach
     Uses an ensemble of machine learning and data mining algorithms, plus many other detection modules, to compute application reputations.
     Polonium is a new malware detection technology:
     • I helped create it in Fall 2009 as an intern at Symantec
     • It is being incorporated into their products
     • Patent pending

  6. Related Work (briefly): Existing research has used many familiar techniques, e.g., Naïve Bayes, SVM, decision trees.

  7. POLONIUM: Propagation Of Leverage Of Network Influence Unearths Malware

  8. The Data
     60+ terabytes of data anonymously contributed by participants of the worldwide Norton Community Watch program:
     • >50 million machines
     • >900 million executable files
     Constructed a machine-file bipartite graph (0.2 TB+):
     • ~1 billion nodes (machines and files)
     • ~37 billion edges

  9. Terminology

  10. The Malware Detection Problem
      Outline: first the domain knowledge to be incorporated, then the Polonium algorithm that computes file goodness.
      Given:
      • a billion-node machine-file bipartite graph
      • prior knowledge about the goodness of some files and machines
      Treat each file i as a random variable X_i ∈ {x_g, x_b}:
      • x_g is the good label; P(x_g) is file goodness
      • x_b is the bad label; P(x_b) is file badness
      Goal: find the file goodness P(X_i = x_g) for each file i. Since goodness + badness = 1, it suffices to consider goodness.

  11. 1. Prior file reputation
      • Files are either “known” or “unknown”
      • Symantec maintains a ground truth database of known-good and known-bad files
      • Prior file reputation is correlated with file prevalence
      • e.g., set a known-good file’s prior to 0.9
      • Intuition: good files appear on many machines; bad files appear on few machines
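A minimal sketch of how such priors might be assigned. Only the 0.9 value for known-good files comes from the slide; the 0.1 and 0.5 values mirror the worked example later in the talk, and the function name `file_prior` is hypothetical. The prevalence-based adjustment is not specified in the slides and is left out.

```python
# Prior values: 0.9 for known-good is from the slide; the other two are
# assumptions borrowed from the worked example (files with priors 0.1 and 0.5).
KNOWN_GOOD_PRIOR = 0.9
KNOWN_BAD_PRIOR = 0.1
UNKNOWN_PRIOR = 0.5


def file_prior(label=None):
    """Return [P(good), P(bad)] for a file.

    `label` is "good" or "bad" for files in the ground truth database,
    and None (or anything else) for unknown files.
    """
    if label == "good":
        good = KNOWN_GOOD_PRIOR
    elif label == "bad":
        good = KNOWN_BAD_PRIOR
    else:
        good = UNKNOWN_PRIOR
    return [good, 1.0 - good]
```

Storing both entries keeps the prior in the same two-state form that the messages and beliefs use.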

  12. 2. Prior machine reputation
      • Computed using Symantec’s proprietary formula, which takes into account multiple anonymous aspects of a machine’s usage and behavior
      • Machine reputation is a value between 0 and 1
      • Intuitively, files associated with a high-reputation machine are more likely to be good

  13. 3. “Homophilic” machine-file relationships
      Also known as “guilt-by-association”:
      • Bad files are more likely to appear on low-reputation machines
      • Good files are more likely to appear on high-reputation machines

  14. Recap: Incorporating Domain Knowledge. How can we infer the reputation of an unknown file from the reputations of its neighbors (and their neighbors)? Polonium adapts the Belief Propagation algorithm.

  15. Details: Computing Node Reputation/Belief (same for file nodes and machine nodes)
      The belief equation combines:
      • Node belief ≈ P(x_i)
      • Normalization constant
      • Prior node belief
      • Messages from neighboring nodes (each neighbor’s opinion about the node’s reputation)
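The slide's equation did not survive in the transcript. The sketch below assumes the standard Belief Propagation form, in which a node's belief is its prior multiplied by every incoming message and then normalized. The function name `node_belief` and the two-element list representation are illustrative choices, not the authors' code.

```python
def node_belief(prior, incoming_messages):
    """Belief over [good, bad]: k * prior(x) * product of incoming messages(x).

    `prior` is [P(good), P(bad)]; each message has the same two-entry shape.
    The normalization constant k makes the two entries sum to 1.
    """
    unnormalized = list(prior)
    for msg in incoming_messages:
        unnormalized = [b * m for b, m in zip(unnormalized, msg)]
    total = sum(unnormalized)
    return [b / total for b in unnormalized]


# An unknown file (prior 0.5) that two neighbors both lean "good" about.
belief = node_belief([0.5, 0.5], [[0.6, 0.4], [0.6, 0.4]])
```

Here the two mildly positive opinions reinforce each other, so the resulting goodness is higher than either message alone suggests.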

  16. Details: Generating the message sent from node i → node j (same for file → machine and machine → file)
      The message equation combines:
      • i’s message to j
      • Propagation function
      • ~Node i’s belief
      We choose ε = 0.001 to preserve minute probability differences (example function).
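The message equation and the example propagation table are missing from the transcript, so this is a sketch under stated assumptions: the standard Belief Propagation message update, and a symmetric 2×2 propagation table with entries 0.5 + ε for matching labels and 0.5 − ε otherwise. Only ε = 0.001 comes from the slide; `PSI` and `message` are hypothetical names.

```python
EPSILON = 0.001  # value from the slide

# Assumed propagation function PSI[x_i][x_j] (index 0 = good, 1 = bad):
# a sender that is good makes "good" slightly more likely for the receiver.
PSI = [[0.5 + EPSILON, 0.5 - EPSILON],
       [0.5 - EPSILON, 0.5 + EPSILON]]


def message(prior_i, incoming_except_j):
    """Message from node i to node j over [good, bad].

    m_ij(x_j) = sum over x_i of PSI[x_i][x_j] * prior_i(x_i)
                * product of messages into i from neighbors other than j.
    The result is normalized, a common practice assumed here.
    """
    partial = list(prior_i)
    for msg in incoming_except_j:
        partial = [p * m for p, m in zip(partial, msg)]
    out = [sum(PSI[xi][xj] * partial[xi] for xi in range(2)) for xj in range(2)]
    total = sum(out)
    return [v / total for v in out]


# A known-good file (prior 0.9) with no other neighbors nudges its
# machine toward "good" by a small amount.
m = message([0.9, 0.1], [])
```

With ε this small, a single message moves the receiver only slightly; it is the accumulation over many neighbors that shifts a belief noticeably.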

  17. Example: Assigning Prior Probabilities
      • Machine nodes use (proprietary) machine reputations; e.g., [0.6, 0.4] means the machine reputation is 0.6
      • Machines: A = 0.6, B = 0.45, C = 0.35
      • Files: 1 = 0.9, 2 = 0.5, 3 = 0.5, 4 = 0.1
      • All messages are initialized to [0.5, 0.5]; e.g., m_A1 = [0.5, 0.5], m_1A = [0.5, 0.5]

  18. Example: Propagate Machine → File Messages
      • Machines: A = 0.6, B = 0.45, C = 0.35
      • Files: 1: 0.9 → 0.92; 2: 0.5 → 0.58; 3: 0.5 → 0.38; 4: 0.1 → 0.06

  19. Example: Propagate File → Machine Messages
      • Machines: A: 0.6 → 0.87; B: 0.45 → 0.81; C: 0.35 → 0.1
      • Files: 1 = 0.92, 2 = 0.58, 3 = 0.38, 4 = 0.06

  20. Algorithm Termination
      • Ideally, the algorithm stops when the reputations converge
      • Theoretically, there is NO guarantee this will happen
      • Empirically, run for a fixed number of iterations (we used 7)
      • Upon completion, we have reputation scores for all files and machines; we only want the file reputations
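The fixed-iteration loop can be sketched as follows. This is an illustration on a toy graph, not the production implementation: it assumes the standard Belief Propagation updates, a symmetric propagation table built from ε = 0.001, synchronous message updates, and an in-memory dictionary for messages. The names `run_bp` and `PSI` are hypothetical.

```python
EPSILON = 0.001
# Assumed propagation table PSI[x_i][x_j]; index 0 = good, 1 = bad.
PSI = [[0.5 + EPSILON, 0.5 - EPSILON],
       [0.5 - EPSILON, 0.5 + EPSILON]]


def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def run_bp(priors, edges, iterations=7):
    """Run belief propagation for a fixed number of iterations.

    priors: node -> [P(good), P(bad)]; edges: list of (machine, file) pairs.
    Returns node -> belief [P(good), P(bad)].
    """
    neighbors = {n: [] for n in priors}
    for m, f in edges:
        neighbors[m].append(f)
        neighbors[f].append(m)
    # messages[(i, j)] is the message from i to j, initialized to [0.5, 0.5].
    messages = {}
    for m, f in edges:
        messages[(m, f)] = [0.5, 0.5]
        messages[(f, m)] = [0.5, 0.5]
    for _ in range(iterations):  # no convergence test, just a fixed count
        new_messages = {}
        for (i, j) in messages:
            partial = list(priors[i])
            for k in neighbors[i]:
                if k != j:
                    partial = [p * x for p, x in zip(partial, messages[(k, i)])]
            out = [sum(PSI[xi][xj] * partial[xi] for xi in range(2))
                   for xj in range(2)]
            new_messages[(i, j)] = normalize(out)
        messages = new_messages
    beliefs = {}
    for n in priors:
        b = list(priors[n])
        for k in neighbors[n]:
            b = [p * x for p, x in zip(b, messages[(k, n)])]
        beliefs[n] = normalize(b)
    return beliefs


# Toy graph: one machine with reputation 0.6 reporting one unknown file.
beliefs = run_bp({"A": [0.6, 0.4], "f1": [0.5, 0.5]}, [("A", "f1")])
file_goodness = beliefs["f1"][0]
```

Only the file entries of `beliefs` would be kept, matching the slide's note that machine reputations are a by-product.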


  22. Experiments
      • Evaluated with the full machine-file bipartite graph: ~1 billion nodes (>900M files, >50M machines), ~37 billion edges
      • Largest file-submission graph constructed and analyzed
      • Evaluated with 1/10 of the ground truth files; the other 9/10 were used for setting file priors
      • Run on 64-bit Red Hat Linux with 4 quad-core processors and 256 GB RAM

  23. One-Iteration Results (for files reported by four or more machines)
      • 84.9% True Positive Rate (TPR): % of malware correctly identified
      • 1% False Positive Rate (FPR): % of non-malware wrongly labeled as malware
      • In the computer security industry, high TPR is important. Low FPR is critical!
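The two rates follow directly from their definitions. The sketch below assumes a file is flagged as malware when its goodness score falls below a threshold; the threshold, the sample scores, and the function name `tpr_fpr` are illustrative and not taken from the talk.

```python
def tpr_fpr(goodness, is_malware, threshold):
    """Return (TPR, FPR) when files with goodness < threshold are flagged.

    TPR = flagged malware / all malware
    FPR = flagged non-malware / all non-malware
    """
    tp = fp = positives = negatives = 0
    for score, bad in zip(goodness, is_malware):
        flagged = score < threshold
        if bad:
            positives += 1
            tp += flagged
        else:
            negatives += 1
            fp += flagged
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


# Made-up scores for four held-out ground-truth files.
rates = tpr_fpr([0.1, 0.4, 0.8, 0.9], [True, True, False, False], threshold=0.3)
```

Sweeping the threshold trades one rate against the other, which is how an operating point such as 1% FPR would be selected.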

  24. Multi-Iteration Results (for files reported by four or more machines)
      • Iterations 1 to 7: 2.2% increase in TPR at the same 1% FPR
      • Diminishing returns

  25. Scalability: Running Time Per Iteration. 3 hours for the full data with 37 billion edges.

  26. Optimization #1: Doubles speed by computing half of the messages
      File → Machine messages depend ONLY on Machine → File messages from the previous iteration.
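One reading of this optimization is an alternating schedule: because the graph is bipartite, each iteration recomputes messages in one direction only. The sketch below illustrates that reading on in-memory dictionaries. The update rule, the propagation table, and the names `update_from` and `run_alternating` are assumptions, not the authors' implementation.

```python
EPSILON = 0.001
# Assumed propagation table PSI[x_i][x_j]; index 0 = good, 1 = bad.
PSI = [[0.5 + EPSILON, 0.5 - EPSILON],
       [0.5 - EPSILON, 0.5 + EPSILON]]


def update_from(senders, messages, priors, neighbors):
    """Recompute only the messages whose sender is in `senders`."""
    updated = dict(messages)
    for (i, j) in messages:
        if i not in senders:
            continue  # the other direction is left untouched this round
        partial = list(priors[i])
        for k in neighbors[i]:
            if k != j:
                partial = [p * m for p, m in zip(partial, messages[(k, i)])]
        out = [sum(PSI[xi][xj] * partial[xi] for xi in range(2))
               for xj in range(2)]
        total = sum(out)
        updated[(i, j)] = [v / total for v in out]
    return updated


def run_alternating(machines, files, priors, neighbors, messages, iterations=7):
    """Even rounds send machine -> file, odd rounds send file -> machine."""
    for t in range(iterations):
        senders = machines if t % 2 == 0 else files
        messages = update_from(senders, messages, priors, neighbors)
    return messages


# Toy graph: one machine, one file, all messages start at [0.5, 0.5].
priors = {"A": [0.6, 0.4], "f1": [0.5, 0.5]}
neighbors = {"A": ["f1"], "f1": ["A"]}
start = {("A", "f1"): [0.5, 0.5], ("f1", "A"): [0.5, 0.5]}
after_one = run_alternating({"A"}, {"f1"}, priors, neighbors, start, iterations=1)
```

After the first round only the machine → file message has changed, so half of the per-edge work is skipped.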

  27. Optimization #2: Externalize the “Edge File”
      • Observation: random access to graph edges or edge messages is NOT necessary; sequential access is sufficient
      • Use an adjacency list layout to store messages, e.g., [FM0] [FM0] [FM1] [FM1] [FM2] [FM2]…
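To illustrate why sequential access suffices, the sketch below streams an edge file once and computes each file's belief as it goes. The record format (`file_id machine_id good bad`, one per line, with records for the same file stored contiguously) is a hypothetical layout chosen for the example; the slides do not specify the on-disk format.

```python
import io
from itertools import groupby


def stream_file_beliefs(edge_file, priors):
    """Yield (file_id, [P(good), P(bad)]) in a single sequential pass.

    Assumes all records for a given file are adjacent in `edge_file`, so only
    one file's running product is held in memory at a time.
    """
    records = (line.split() for line in edge_file if line.strip())
    for file_id, group in groupby(records, key=lambda r: r[0]):
        belief = list(priors.get(file_id, [0.5, 0.5]))
        for _, _machine, good, bad in group:
            belief = [belief[0] * float(good), belief[1] * float(bad)]
        total = sum(belief)
        yield file_id, [b / total for b in belief]


# An in-memory stand-in for the external edge file.
edge_file = io.StringIO(
    "f1 A 0.6 0.4\n"
    "f1 B 0.6 0.4\n"
    "f2 A 0.5 0.5\n"
)
beliefs = dict(stream_file_beliefs(edge_file, {"f1": [0.5, 0.5]}))
```

Because `groupby` only groups adjacent records, the contiguity assumption is what lets the file be read front to back without seeking.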

  28. Scaling Up Computation Further
      Belief Propagation, and hence Polonium, can be implemented as matrix-vector multiplication, which leverages research on parallel computation, architecture, etc.:
      • “Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining” (Xintian Yang)
      • “Inference of Beliefs on Billion-Scale Graphs” (U Kang)

  29. Conclusions
      • Polonium is a new and effective reputation-based malware detection technology that adapts the Belief Propagation algorithm: 87% TPR at 1% FPR
      • Evaluated on a 37-billion-edge machine-file bipartite graph, the largest file submissions dataset ever published: 60 TB of raw data, 0.2 TB for the derived graph
      • Scalable and fast: optimization doubles speed and reduces storage

  30. Thanks. Polonium: Tera-Scale Graph Mining for Malware Detection (Patent Pending). Polo Chau, Machine Learning Dept; Carey Nachenberg, Vice President & Fellow; Jeffrey Wilhelm, Principal Software Engr (my boss); Prof. Christos Faloutsos, Computer Science Dept (my advisor); Adam Wright, Software Engineer.


  32. Data Statistics: Machine-Submission Distribution

  33. Data Statistics: File-Prevalence Distribution

  34. The “Right” Algorithm
      • Easy to incorporate domain knowledge
      • Must be effective: high TPR at low FPR
      • Easy to understand (a “whitebox” method)

  35. Domain Knowledge to Incorporate
      • Prior file reputation
      • Prior machine reputation
      • “Homophilic” machine-file relationships

  36. The Polonium Algorithm: An adaptation of the Belief Propagation algorithm
      Given:
      • a billion-node machine-file bipartite graph
      • prior knowledge about the goodness of some files and machines
      • the intuition of “guilt-by-association”
      Treat each node i as a two-state random variable X_i ∈ {x_g, x_b}:
      • x_g is the good label; P(x_g) is node goodness
      • x_b is the bad label; P(x_b) is node badness
      Goal: find the file goodness P(X_i = x_g) for each file i; we don’t care about machines.

  37. Symantec
      • World’s leading security software provider
      • Released 1.8 million signatures in 2008, resulting in 200 million detections
      • Estimated that the release rate of malicious or unwanted software would exceed that of legitimate software (2008 Symantec Security Threat Report)
