Data Mining for Malicious Traffic Dr. Latifur Khan (NASA, AFOSR)

Cyber Security Research at the University of Texas at Dallas
Sample Projects
Prof. Bhavani Thuraisingham, PhD, CISSP
Prof. Latifur Khan, PhD
Prof. Murat Kantarcioglu, PhD
Prof. Kevin Hamlen, PhD
Prof. Edwin Sha, PhD
August 2010

Data Mining for Malicious Traffic
Dr. Latifur Khan (NASA, AFOSR)
• Motivation
  • Network traffic is a continuous flow of data that evolves over time
  • How can we detect intrusions by mining the network traffic when
    • the intrusions themselves evolve?
    • only a small fraction of the traffic is analyzed and labeled by human experts?
    • new kinds of intrusions appear?
• Technical Approach
  • Idea: Build a classification model from past data and predict intrusions using the model.
  • The model must be able to
    • keep itself up to date so that it can detect intrusions even if their characteristics change over time
    • use the limited amount of labeled data to update itself efficiently
    • detect new kinds of intrusions in the traffic
  • Strategy:
    • Semi-supervised learning to compensate for the shortage of labeled training data
    • Ensemble classification to cope with changes in the traffic
    • Novel class detection to detect new kinds of intrusions in the traffic
• System Architecture (figure labels): Network traffic; Older chunks; Newer chunks; Last partially labeled chunk; Last unlabeled chunk; Ensemble of models; New model; numbered steps 1 to 4 labeled Classification (Intrusion?), Training, Refinement, Update

Reactively Adaptive Malware
Dr. Kevin W. Hamlen and Dr. Latifur Khan (AFOSR)
• Motivation
  • Design and study malware immune to conventional antivirus technologies
  • Important for AF active defense project
  • Important for developing adequate defenses in anticipation of next-generation attacks
• Technical Approach
  • Data Mining
    • use machine learning to discover signatures dynamically
    • adapt to new malware in the field
    • share learned signatures amongst mutually trusting attackers
  • Reactively Adaptive Malware
    • discover false negatives in protection system
    • self-obfuscate to defeat defenses
• Figure labels: Malware Binary; Obfuscation Function; Obfuscated Binary; Antivirus Signature Database; Signature Query Interface; Signature Inference Engine; Signature Approximation Model; Obfuscation Generation

AFOSR: Assured Information Sharing: 2005-2008 (Dr. Bhavani Thuraisingham)
• Integrate the Medicaid claims data and mine the data; next, enforce policies and determine how much information has been lost (trustworthy partners); prototype system; application of Semantic Web technologies
• Apply game theory and probing to extract information from semi-trustworthy partners
• Conduct active defence and determine the actions of an untrustworthy partner
  • Defend ourselves from our partners using data mining techniques
  • Conduct active defence: find out what our partners are doing by monitoring them so that we can defend ourselves in dynamic situations
• Trust for Peer to Peer Networks (Infrastructure security)
• Figure labels: Data/Policy for Coalition; Export Data/Policy (one per agency); Component Data/Policy for Agency A, Agency B and Agency C; Trustworthy Partners; Semi-Trustworthy Partners; Untrustworthy Partners

Incentive Issues in Assured Information Sharing
Dr. Murat Kantarcioglu (DoD MURI Project 2008-2013, AFOSR)
• Motivation
  • Misaligned incentives could be a significant problem in Information Security.
    • Software bugs vs. software companies’ incentives
  • Incentive issues in information sharing have been explored to some extent
    • Incentive issues in file sharing p2p networks
  • Assured information sharing creates new challenges
    • Security considerations vs. utility
• Technical Approach
  • Verify that the other participants do not lie about their data.
  • If the data is revealed as it is
    • Trust but verify (Our initial results: DKE ’08 paper)
  • If the data is not revealed (e.g., SMC techniques are used)
    • Non-cooperative computing
    • Mechanism design
    • SMC with rational adversaries.

Scalable Social Network Mining
Dr. Murat Kantarcioglu (NSF)
• Motivation
  • Mining social network data could provide important insights.
  • Recently many different data mining techniques have been suggested for mining social network data.
  • These techniques require many iterations (e.g., collective inference techniques) and expensive computations (e.g., maximum likelihood methods) over large social networks.
• Initial Results
  • Partitioning techniques based on various social network centrality metrics have been implemented
    • Degree centrality (DC)
    • Clustering coefficient (CC)
    • Closeness centrality (CloC)
    • Betweenness centrality (BC)
    • Random partitioning
    • Domain specific
  • Our initial results indicate that intelligent partitioning can increase accuracy and reduce running time.
• Technical Approach
  • Our goal is to scale the existing social network mining techniques to very large social network data by using cloud computing.
  • To achieve this goal, we are exploring
    • Intelligent data partition techniques based on social network concepts
    • Caching of some important queries
    • Efficient update of cached query results using cloud computing
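
Two of the metrics named on this slide, degree centrality and clustering coefficient, have standard definitions that can be computed directly from an adjacency map. The sketch below does that and then splits the nodes into partitions. The partitioning rule (rank nodes by score and deal them out round-robin) is purely an assumption for illustration, since the slide does not say how the metrics drive partitioning; the function names and the toy graph are hypothetical as well.

```python
from itertools import combinations


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def clustering_coefficient(adj):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    cc = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cc[v] = 0.0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        cc[v] = 2 * links / (k * (k - 1))
    return cc


def partition_by_score(scores, k):
    """Assumed rule: rank nodes by score (ties broken by name) and deal them
    round-robin so that each partition receives a mix of high and low scores."""
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    parts = [[] for _ in range(k)]
    for i, v in enumerate(ranked):
        parts[i % k].append(v)
    return parts


# Undirected toy graph as a symmetric adjacency map.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
dc = degree_centrality(graph)
cc = clustering_coefficient(graph)
print(partition_by_score(dc, 2))
```

Closeness and betweenness centrality need shortest-path computations and are left out to keep the sketch short.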

Language-based Security
Dr. Kevin W. Hamlen (AFOSR)
• Motivation
  • Mobile code security (web scripts, patches, etc.)
  • How to enforce application-specific security policies over these untrusted software extensions?
    • Policy #1: Untrusted code must not create or modify any file whose name ends in “.exe”
    • Policy #2: Untrusted code must not access the network after reading a confidential file
    • Policy #3: Untrusted code must relinquish the thread after at most 1000 instruction cycles
• Technical Approach
  • Idea: Automatically rewrite the code prior to execution
  • Two constraints on rewritten code:
    • rewritten code must satisfy security policy
    • rewritten code behaves exactly like original (except with regard to policy violations)
  • One simple rewriting strategy:
    • insert guard instructions before every potentially dangerous instruction
    • use compiler optimizations to eliminate or streamline unnecessary guards
• System Architecture (figure labels): Trusted Computing Base; untrusted code; security policy; Rewriter; self-monitoring code + proof; verifier; reject / accept
• Example Code (inserted code shown in green on the original slide)
    …
    eax := “filename.exe”
    if (eax == “*.exe”) abort();
    call System.open(eax, “w”);
    …
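
The guard-insertion strategy for Policy #1 can be sketched on a toy instruction list. This is a minimal illustration under stated assumptions, not the project's rewriter: the tuple-based instruction format, the opcode names set, open and guard_not_exe, and the PolicyViolation exception are all hypothetical, and the interpreter only records open calls instead of touching the file system.

```python
class PolicyViolation(Exception):
    """Raised when a guard detects that the security policy would be violated."""


WRITE_FLAGS = "wax+"  # open modes that can create or modify a file


def rewrite(program):
    """Insert a guard before every potentially dangerous instruction,
    here every open() whose mode allows creating or modifying the file."""
    out = []
    for instr in program:
        if instr[0] == "open" and any(f in instr[2] for f in WRITE_FLAGS):
            out.append(("guard_not_exe", instr[1]))
        out.append(instr)
    return out


def run(program):
    """Toy interpreter: returns the list of (filename, mode) opens performed."""
    regs, opened = {}, []
    for op, *args in program:
        if op == "set":              # ("set", register, value)
            regs[args[0]] = args[1]
        elif op == "guard_not_exe":  # ("guard_not_exe", register)
            if regs[args[0]].lower().endswith(".exe"):
                raise PolicyViolation(f"write to {regs[args[0]]!r} blocked")
        elif op == "open":           # ("open", register, mode)
            opened.append((regs[args[0]], args[1]))
    return opened


unsafe = [("set", "eax", "filename.exe"), ("open", "eax", "w")]
safe = [("set", "eax", "notes.txt"), ("open", "eax", "w")]

try:
    run(rewrite(unsafe))
    blocked = False
except PolicyViolation:
    blocked = True

print(blocked, run(rewrite(safe)))
```

The guard is evaluated at run time because the file name lives in a register; when a name is a known constant, an optimizer could remove or pre-evaluate the guard, which corresponds to the slide's point about streamlining unnecessary guards.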

Privacy-preserving Distributed Data Mining
Dr. Murat Kantarcioglu (NSF)
• Motivation
  • Privacy-sensitive data that is needed for many critical tasks is distributed among different organizations.
    • Statistical analysis of hospital discharge data for detecting biological weapons attacks.
  • Privacy concerns may hinder sharing such data for legitimate purposes.
  • Our goal is to develop techniques to enable distributed data mining without sacrificing individual privacy
• Technical Approach
  • Idea: Combine sanitization and cryptographic techniques to enable efficient and accurate privacy-preserving distributed data mining.
    • Each data source sanitizes its own data.
    • Sanitized data is shared directly.
    • Cryptographic algorithms use the sanitized data along with the original data to get the data mining results.
  • Our initial results indicate that this idea is more efficient than pure cryptographic approaches and more accurate than pure sanitization approaches.
• Figure labels: Source Data 1 (Private); Source Data 2 (Private); Data Sanitization; Sanitized Data 1 (Public); Sanitized Data 2 (Public); Sanitized Data Processing; Cryptographic Protocols; Result
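
As background for the cryptographic side of this slide, the sketch below shows one standard building block of secure multi-party computation: a secure sum based on additive secret sharing. It is not the project's protocol, and the combination with sanitized data described above is not modeled here. The function names, the modulus and the example values are assumptions, and all parties are simulated inside one process.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus; inputs must lie in [0, MODULUS)


def make_shares(value, n_parties):
    """Split value into n_parties random shares that add up to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate the protocol: shares[i][j] is the share party i sends to party j.
    Each party publishes only the sum of the shares it received, so no single
    published number reveals another party's input."""
    n = len(private_values)
    shares = [make_shares(v, n) for v in private_values]
    partials = [sum(shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partials) % MODULUS


# Hypothetical per-organization counts that should stay private.
counts = [12, 30, 7]
print(secure_sum(counts))
```

Protocols of this kind give exact results but cost extra communication between the parties, which is the trade-off against pure sanitization that the slide refers to.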

WWW Disambiguation & Geo-tagging
Dr. L. Khan (NGA)
• WWW problems as a source of geo-information
  • Geographic context embedded in natural language descriptions
  • Place names are ambiguous and confused with names of organisations, people, buildings and streets
  • Web queries depend on exact match of text terms
• Applications:
  • Location-based services
  • Locally targeted web advertising
  • Mining geographic properties; market research
  • Geo-information Web services
• Geo-Tagging = Geo-parsing + Geo-coding
  • Geo-parsing
    • Recognising geographic references (ignoring non-geographic uses of place terminology)
  • Geo-coding
    • Attaching a unique quantitative location (footprint) to geographic references
• Example:
  • Geo / geo ambiguity: {City} Columbia / {S_C} California / U.S. versus {City} Columbia / {S_C} Pennsylvania / U.S.
  • Geo / non-geo ambiguity: in “Samuel Lancaster”, Lancaster is a last name, not {City} Lancaster / Texas / U.S.
• Figure labels: Webpage; Text; Info. Retrieval; NN, NNS, NNP, NNPS; gazetteer; Update; Ranking Based Disambiguation
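
The two kinds of ambiguity in the example can be illustrated with a toy gazetteer lookup. The sketch below is a simplified stand-in, not the project's ranking method: the gazetteer holds only the entries from the slide, the given-name list and the context-overlap score are assumptions, and the geo-coding step (attaching a footprint) is left out because the slide gives no coordinates.

```python
import re

# Toy gazetteer built from the slide's examples: name -> candidate places.
GAZETTEER = {
    "Columbia": [("Columbia", "California", "U.S."),
                 ("Columbia", "Pennsylvania", "U.S.")],
    "Lancaster": [("Lancaster", "Texas", "U.S.")],
}
GIVEN_NAMES = {"Samuel"}  # hypothetical list used for the geo / non-geo check


def geotag(text):
    """Return (mention, best candidate) pairs for place names found in text."""
    tokens = re.findall(r"[A-Za-z]+", text)
    context = set(tokens)
    tags = []
    for i, tok in enumerate(tokens):
        if tok not in GAZETTEER:
            continue
        # Geo / non-geo: a place name right after a given name is read as a surname.
        if i > 0 and tokens[i - 1] in GIVEN_NAMES:
            continue
        # Geo / geo: rank candidates by how many of their parent regions are
        # mentioned in the text; sorted() is stable, so ties keep gazetteer order.
        ranked = sorted(GAZETTEER[tok],
                        key=lambda cand: -sum(part in context for part in cand[1:]))
        tags.append((tok, ranked[0]))
    return tags


print(geotag("Samuel Lancaster moved from Columbia to Pennsylvania"))
print(geotag("The fair in Lancaster, Texas"))
```

Multi-word region names and part-of-speech tags such as NNP are not handled here; a fuller geo-parser would use them to find candidate mentions before ranking.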

Other Projects
• Secure Cloud Computing
  • http://www.wpafb.af.mil/news/story.asp?id=123209377
• Secure Social and Private Networks
• Security and Privacy preserving ontology alignment
• Secure Peer to Peer Data Management
• Risk modeling and analysis of Botnets
• Policy interoperability of geospatial data
• Data provenance and Attribution of Attacks
• Accountability of Secure Systems
