
Graph-based Pattern Learning






Presentation Transcript


  1. Graph-based Pattern Learning Dr. Larry Holder School of EECS, WSU

  2. Graphs • Social Network • Protein-protein Interaction • Internet • Power Grid • Web

  3. Some Graph Statistics • Web • 10B pages, 1T hyperlinks • Topology storage: 10TB • Google PageRank: Eigenvector on 10Bx10B adjacency matrix (sparse) • MySpace • 100M users, 10B friendship links • Clique/community detection • 300K new users per day

  4. Graph Problems • Degree • Diameter • Centrality • Shortest path • Cycles/tours • Minimum spanning tree • Traversals/search • Connectivity • Clustering • Partitioning • Cliques • Motifs • Subgraph isomorphism • Frequent subgraphs • Pattern learning • Dynamics

  5. Graph-based Pattern Learning • Unsupervised pattern discovery • Hierarchical conceptual clustering • Supervised pattern learning • Anomaly detection • Dynamic graph pattern learning

  6. Unsupervised Pattern Discovery • Frequency-based (AGM, gSpan, FSG, Gaston) • “Graph-based Data Mining” • Find all subgraphs g within a set of graph transactions G = {G1, …, Gn} such that |{Gi ∈ G : g ⊑ Gi}| / |G| ≥ t • where ⊑ denotes subgraph isomorphism and t is the minimum support • Focus on pruning and fast, code-based graph matching • Still requires subgraph isomorphism
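The minimum-support test at the heart of these miners can be sketched in a few lines. As a loud simplification, graphs here are frozensets of labeled edges, so set containment stands in for subgraph isomorphism; the names `support` and `frequent` are illustrative, not from any of the cited systems.

```python
# Minimal sketch of minimum-support filtering (the core test in AGM/gSpan/FSG).
# Simplification (assumption): graphs are frozensets of labeled edges, so
# "subgraph" is set containment; real miners need subgraph isomorphism.

def support(pattern, transactions):
    """Fraction of graph transactions containing the pattern."""
    return sum(pattern <= g for g in transactions) / len(transactions)

def frequent(candidates, transactions, t):
    """Keep candidate patterns g with support(g) >= t (minimum support)."""
    return [g for g in candidates if support(g, transactions) >= t]

graphs = [frozenset({("A", "B"), ("B", "C")}),
          frozenset({("A", "B"), ("C", "D")}),
          frozenset({("B", "C"), ("C", "D")})]
cands = [frozenset({("A", "B")}), frozenset({("A", "B"), ("B", "C")})]
print(frequent(cands, graphs, t=0.5))  # only {("A","B")} meets 2/3 support
```

The real systems differ mainly in how they enumerate `candidates` without redundant isomorphism checks (e.g., gSpan's canonical DFS codes).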

  7. Unsupervised Pattern Discovery • Graph compression and the minimum description length (MDL) principle • The best theory minimizes the description length of the theory plus the description length of the data given the theory • The best graphical pattern S minimizes the description length of S plus the description length of the graph G compressed with pattern S (SUBDUE) • where description length DL(G) is the minimum number of bits needed to represent G • Compression can be based on inexact matches to the pattern • Figure: graph G containing instances of patterns S1 and S2
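A toy version of the MDL trade-off can make the heuristic concrete. The bit model below (vertices and edges costed with simple log2 terms) is my own simplification, not SUBDUE's actual encoding; it only shows how a pattern's definition cost DL(S) trades against the compression it buys in DL(G|S).

```python
import math

# Toy MDL score for a pattern S in graph G (sketch of the DL(S) + DL(G|S)
# idea; the bit model here is a simplifying assumption, not SUBDUE's).

def dl(num_vertices, num_edges, num_labels):
    """Bits to list the vertices and edges of a labeled graph."""
    bits_per_label = math.log2(max(num_labels, 2))
    return num_vertices * bits_per_label + num_edges * (
        2 * math.log2(max(num_vertices, 2)) + bits_per_label)

def mdl_value(g_v, g_e, s_v, s_e, instances, labels):
    """DL(S) + DL(G|S): compress G by replacing each instance of S
    with a single new vertex, then add the definition of S."""
    comp_v = g_v - instances * (s_v - 1)   # each instance collapses to 1 vertex
    comp_e = g_e - instances * s_e         # instance-internal edges removed
    return dl(s_v, s_e, labels) + dl(comp_v, comp_e, labels + 1)

# A 5-vertex/6-edge pattern with 10 instances compresses a 100-vertex,
# 200-edge graph better than a 3-vertex/2-edge pattern with 2 instances.
print(mdl_value(100, 200, 5, 6, 10, 10), mdl_value(100, 200, 3, 2, 2, 10))
```

The pattern minimizing this value is reported as the best substructure; lossy (inexact-match) compression adds a cost term for the differences between each instance and S.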

  8. Hierarchical Conceptual Clustering • Use iterative process on input graph G • Repeat • Find best pattern S in graph G • Add S to hierarchy • G = G compressed with S • Until no more compression • Clustering is a lattice • Clusters described by pattern • Not just instances as in traditional clustering techniques

  9. DHS Insight Project: Terrorist Group Data • Pipeline: SRA TEES Text Extraction System → Mock Terrorist Scenario Event Generator → Observables: Message Traffic Reports (142) → Convert to Graph: Entities and Relationships → SUBDUE Pattern Learner → Patterns • Scenario activities: fund raising, recruitment, training, reconnaissance, ... • Figure: hierarchical pattern discovered at 7th iteration of SUBDUE

  10. Supervised Learning • Given positive graph G+ and negative graph G- • Find pattern S minimizing DL(G+ | S) / DL(G- | S) • When |G+|, |G-| >> 1, find pattern S maximizing classification accuracy • Figure: positive graphs and negative graphs input to SUBDUE, which outputs pattern(s)
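For the many-examples case, the accuracy criterion can be sketched directly. As before, edge-set containment is a stand-in for subgraph isomorphism (a simplifying assumption): the pattern classifies a graph as positive exactly when the graph contains it.

```python
# Sketch of the |G+|,|G-| >> 1 case: score a pattern by classification
# accuracy. Edge-set containment stands in for subgraph isomorphism
# (simplifying assumption).

def accuracy(pattern, positives, negatives):
    tp = sum(pattern <= g for g in positives)      # positives covered
    tn = sum(not pattern <= g for g in negatives)  # negatives excluded
    return (tp + tn) / (len(positives) + len(negatives))

pos = [frozenset({("A", "B"), ("B", "C")}), frozenset({("A", "B")})]
neg = [frozenset({("B", "C")}), frozenset({("C", "D")})]
print(accuracy(frozenset({("A", "B")}), pos, neg))  # 1.0
```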

  11. DARPA/AFRL Evidence Assessment, Grouping, Linking and Evaluation (EAGLE) Program • Evidence DB (EDB) contains simulated data on threat and non-threat activity • Persons, targets, capabilities, resources, transfers, and communications • Pipeline: convert EDB to SUBDUE graph format → positive (threat) & negative (non-threat) examples → SUBDUE → patterns → evaluate

  12. Graph Regression (with Nikhil Ketkar, WSU) • Learn a model Yi = f(Gi ), where Yi is a real number and Gi is a graph • E.g., solubility or binding activity of chemical compounds • One approach • Apply frequent-graph miner to set of training graphs Gi • Frequent subgraphs form a feature vector V • Input {(Yi, Vi)} to linear support-vector machine • gRegress approach • Prune feature set based on correlation with other features and lack of correlation with Y • Learn model using non-linear SVM or piece-wise regression
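The correlation-based pruning step can be illustrated with plain Pearson correlation over the subgraph feature vectors. The thresholds `min_target` and `max_mutual`, and the greedy keep-first policy, are assumptions for illustration, not gRegress's actual parameters.

```python
import math

# Sketch of gRegress-style feature pruning: drop a subgraph feature if it
# correlates weakly with the target Y (uninformative) or strongly with an
# already-kept feature (redundant). Thresholds are illustrative assumptions.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def prune(features, y, min_target=0.3, max_mutual=0.9):
    kept = []
    for name, col in features:
        if abs(pearson(col, y)) < min_target:
            continue  # barely tracks the target
        if any(abs(pearson(col, kcol)) > max_mutual for _, kcol in kept):
            continue  # redundant with a kept feature
        kept.append((name, col))
    return [name for name, _ in kept]

feats = [("f1", [1, 2, 3, 4]),   # tracks y perfectly
         ("f2", [1, 2, 3, 4]),   # duplicate of f1 -> pruned
         ("f3", [1, -1, 1, -1])] # weaker but non-redundant signal
print(prune(feats, [1, 2, 3, 4]))  # ['f1', 'f3']
```

The surviving feature vectors would then feed the non-linear SVM or piecewise regression stage.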

  13. Anomaly Detection (with Bill Eberle, TTU) • Learn normative patterns of activity • Detect small, unlikely deviations from normative patterns • Present anomalies and their context to analyst • Figure: activity data → convert to graph → SUBDUE (normative pattern) → Graph-Based Anomaly Detection (GBAD) → anomaly

  14. GBAD Approach • Determine normative pattern S using SUBDUE minimum description length (MDL) heuristic that minimizes: M(S,G) = DL(G|S) + DL(S) • Three algorithms for handling each of the different anomaly categories • GBAD-MDL finds anomalous modifications • GBAD-P (Probability) finds anomalous insertions • GBAD-MPS (Maximum Partial Substructure) finds anomalous deletions
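The three anomaly categories map naturally onto set differences against the normative pattern. This is a deliberate simplification (GBAD works on full labeled graphs and scores candidate anomalies probabilistically); it only shows which kind of deviation each algorithm targets.

```python
# Sketch of GBAD's three anomaly categories, on edge-set instances
# (simplifying assumption; GBAD operates on full labeled graphs).

def classify(normative, instance):
    missing, extra = normative - instance, instance - normative
    if missing and extra:
        return "modification"  # GBAD-MDL territory
    if extra:
        return "insertion"     # GBAD-P territory
    if missing:
        return "deletion"      # GBAD-MPS territory
    return "normal"

norm = frozenset({("A", "B"), ("B", "C")})
print(classify(norm, frozenset({("A", "B"), ("B", "C"), ("C", "D")})))  # insertion
print(classify(norm, frozenset({("A", "B")})))                          # deletion
print(classify(norm, frozenset({("A", "B"), ("B", "D")})))              # modification
```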

  15. DHS Insight Project: Cargo Data • Shipment data from PIERS (Port Import Export Reporting Service) • Only North American imports (U.S., Puerto Rico, Canada) • 65,535 records (shipments) • Information categories: • General • Commodity codes • Countries and ports • U.S. company names and locations • Foreign shipper names and locations • Notification party names and locations • Shipping line, vessel and packaging • Container • Weight and shipment • Financial

  16. Anomaly Detection in Cargo Data • Marijuana seized at a port in Florida [U.S. Customs Service 2000]. • The smuggler did not disclose some financial information, and the ship traversed an extra port. • GBAD-P discovers the extra traversed port; GBAD-MPS discovers the missing financial information.

  17. DHS CyberSecurity R&D Program: Insider Threat Detection using Graphs • Domain: gov't ID request processing • Insider threat scenarios (CERT Insider Threat Documents): • Frontline staff reviews case (invasion of privacy) • Frontline staff submits case directly to a case officer (bypassing the approval officer) • Frontline staff recommends or decides case • Approval officer reverses accept/reject recommendation from assigned case officer • Unassigned case officer updates or recommends case • Applicant communicates with approval officer or case officer • Unassigned case officer communicates with applicant • Database access from an external source or after hours • Results (GBAD on Scenarios 1 and 4): 1000 cases, multiple normative patterns, 1-3 anomalies detected, no false positives

  18. Dynamic Graph Pattern Learning (with Chang hun You, WSU) • Dynamic graph DG = {G1, G2, …, Gn} • Find graph rewrite rules between pairs of graphs Gi / Gi+1 • Find common subgraph between Gi and Gi+1 • Remainder of Gi to be removed (GR) • Remainder of Gi+1 to be added (GA) • Find transformation rules of temporal patterns in rewrite rules • Remove (GR) at time t, then add (GA) at time t+k
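The rewrite-rule extraction between consecutive snapshots can be sketched as set operations on edges: the common subgraph stays, the rest of Gi becomes the removal set GR, and the rest of Gi+1 becomes the addition set GA. The edge-set view and the gene-style labels below are illustrative assumptions, not the actual BioNet representation.

```python
# Sketch of extracting a graph rewrite rule between consecutive
# snapshots G_i and G_{i+1} of a dynamic graph. Edge-set view (assumption).

def rewrite_rule(g_i, g_next):
    common = g_i & g_next
    return {"common": common,
            "GR": g_i - common,     # remainder of G_i, to be removed
            "GA": g_next - common}  # remainder of G_{i+1}, to be added

# Illustrative snapshots with circadian-gene-style edge labels.
g1 = frozenset({("per", "tim"), ("tim", "clk")})
g2 = frozenset({("tim", "clk"), ("clk", "cyc")})
rule = rewrite_rule(g1, g2)
print(sorted(rule["GR"]), sorted(rule["GA"]))
```

Temporal transformation rules then generalize over sequences of such (GR, GA) pairs, e.g. "remove GR at time t, add GA at time t+k".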

  19. Dynamic Graph (BioNet)

  20. Graph Rewriting Rule

  21. Example: Circadian Rhythm in Drosophila (Fruit Fly)

  22. Example: Circadian Rhythm in Drosophila (Fruit Fly) • Transformation rule (Sub 1): structure appearing and disappearing in the network • Full temporal transformation rule: boxes are removals (after 5 hours), and ellipses are additions (after 7 hours) of Sub 1 • Cycles every 12 hours • Time 6-47 is training; time 54-66 is prediction

  23. Graph-based Pattern Learning • Algorithms • Pattern discovery and clustering • Supervised learning • Anomaly detection • Dynamic graphs • Applications • Social networks • Biological networks • Computer networks • Process flows • (Semantic) Web (linkeddata.org) • …

  24. High Performance Computing Issues • Memory bottleneck • Most real-world graphs do not fit in main memory • Patterns of access to graph not sequential • Computational bottleneck • Graph and subgraph isomorphism

  25. High Performance Computing Issues • Functional parallelism • Parallel search over space of candidate subgraph patterns • High communication to avoid redundancy • Child patterns rely on embeddings kept with parent • Hinders parallelism • Computing embeddings from scratch is NP-complete • Data parallelism • Partition graphs, find patterns in each partition, evaluate patterns in other partitions • Edge cuts may break patterns • May require NP-complete subgraph isomorphism

  26. Data-Intensive Scalable Computing • MapReduce [Google] • Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004. • Hadoop [Yahoo] • MapReduce • Distributed filesystem
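A single-process toy of the map/shuffle/reduce pattern, applied to a small graph task (counting vertex degrees from an edge list), shows the programming model; a real Hadoop job would distribute the map and reduce phases across machines and the distributed filesystem.

```python
from collections import defaultdict

# Toy single-process simulation of the MapReduce model from the
# Dean & Ghemawat paper, counting vertex degrees in an edge list.

def map_phase(edges):
    for u, v in edges:        # map: emit a (key, 1) pair per endpoint
        yield u, 1
        yield v, 1

def reduce_phase(pairs):
    groups = defaultdict(list)  # shuffle: group values by key
    for key, val in pairs:
        groups[key].append(val)
    return {k: sum(vs) for k, vs in groups.items()}  # reduce: sum per key

edges = [("A", "B"), ("B", "C"), ("A", "C")]
print(reduce_phase(map_phase(edges)))  # degrees: A=2, B=2, C=2
```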

  27. Multiscale Issues • Hierarchical networks • Higher-level hyper-nodes summarize detail at lower levels • E.g., Netflix prize (www.netflixprize.com): 17K movies, 400K users, 100M reviews • E.g., user’s average rating vs. specific ratings • E.g., movie’s average rating vs. specific ratings • Figure: user and movie nodes (with average ratings) linked by review nodes (with specific ratings, e.g. “Matrix”)
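The hyper-node idea can be sketched on Netflix-style data: collapse individual (user, movie, rating) reviews into a per-user average that a higher-level node would carry instead of the full review detail. The data below is made up for illustration.

```python
# Sketch of multiscale summarization: per-user hyper-nodes replace
# individual review edges with an average rating. Data is illustrative.

def summarize(reviews):
    """Collapse (user, movie, rating) reviews into per-user averages."""
    totals = {}
    for user, movie, rating in reviews:
        t = totals.setdefault(user, [0.0, 0])
        t[0] += rating
        t[1] += 1
    return {user: s / n for user, (s, n) in totals.items()}

reviews = [("u1", "Matrix", 5), ("u1", "Memento", 4), ("u2", "Matrix", 3)]
print(summarize(reviews))  # {'u1': 4.5, 'u2': 3.0}
```

Pattern learning can then run at the coarse level first and descend into the detailed reviews only where the summary looks interesting.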

  28. Conclusions • Graph representation of relational data • Graph-based pattern learning improves understanding of modeled behavior • Massive, dynamic graphs • Numerous application domains • Graph problems computationally and memory intensive • HPC (data-intensive computing) and multiscale approaches

  29. For More Information • Larry Holder, School of EECS, WSU • Email: holder@wsu.edu • URL: www.eecs.wsu.edu/~holder • SUBDUE • Source code in C • Datasets • www.subdue.org • D. Cook and L. Holder (2006). Mining Graph Data, Wiley. (www.eecs.wsu.edu/mgd)
