310 likes | 547 Views
Neighborhood Formation and Anomaly Detection in Bipartite Graphs. Jimeng Sun, Huiming Qu , Deepayan Chakrabarti & Christos Faloutsos. Presented By Bhavana Dalvi. Outline. Motivation Problem Definition Neighborhood formation Anomaly detection Experiments Related work
E N D
Neighborhood Formation and Anomaly Detection in Bipartite Graphs Jimeng Sun, HuimingQu, DeepayanChakrabarti & Christos Faloutsos Presented By BhavanaDalvi
Outline • Motivation • Problem Definition • Neighborhood formation • Anomaly detection • Experiments • Related work • Conclusion and future work
Bipartite graphs and interesting questions Author Paper graph Papers Authors a
Bipartite graphs and interesting questions Author Paper graph Papers Authors Which authors are most related to ‘a’ ? a
Bipartite graphs and interesting questions Author Paper graph Papers Authors Which authors are most related to ‘a’ ? a
Bipartite graphs and interesting questions Author Paper graph Papers Authors Which authors are most related to ‘a’ ? a b 0.8
Bipartite graphs and interesting questions Author Paper graph Papers Authors Which authors are most related to ‘a’ ? a 0.4 b 0.8 0.6 0.2
Bipartite graphs and interesting questions Author Paper graph Papers Authors a 0.4 0.8 0.6 Which is the uncommon paper written by ‘a’ ? 0.2
Bipartite graphs and interesting questions Author Paper graph Papers Authors a 0.4 0.8 0.6 Which is the uncommon paper written by ‘a’ ? 0.2
Bipartite graphs and interesting questions P2P Network Which users have similar preferences as a particular user? users files • Which files are downloaded by users with very different preferences? Jimeng Sun’s presentation at ICDM 2005
Outline • Motivation • Problem Definition • Neighborhood formation • Anomaly detection • Experiments • Related work • Conclusion and future work
Problem definition V1 V2 • Neighborhood formation (NF) • Input : query node q in V1 • Output : relevance scores of all the nodes in V1 to q • Anomaly detection (AD) • Input : query node q in V1, • Output : normality scores for nodes in V2 that link to q E q
Outline • Motivation • Problem Definition • Neighborhood formation • Anomaly detection • Experiments • Related work • Conclusion and future work
Neighborhood formation • Relevance (b, q) (# short length paths from q to b) q q b b The connection that links only b and q brings more relevance than the connection which links b, q and other nodes.
c c q c c c Exact NF algorithm : random walk with restart Input : a graph G and a query node q Output : relevance scores to q • Construct the transition matrix where • every node in the graph becomes a state • every state has a restart probability c to jump back to the query node q. • transition probability • Find the steady-state probability u which is the relevance score of all the nodes to q Jimeng Sun’s presentation at ICDM 2005
Finding Steady State Probabilities • |V1| = k , |V2| = n • M : k*n matrix representing weighted graph G • Adjacency matrix : • PA= col_norm(MA) • qA : transform query node ‘a’ to (k+n)*1 vector where only ath column has 1 and rest are 0. • uA : steady state probability vector with restart probability c • Bipartite structure : • k << n then savings are significant
Extensions to NF Algorithm • Parallel NF • If multiple queries, computation can be done in parallel. • Approximate NF • Cluster the nodes in to k partitions (preprocessing) • Given query node q, find partition Gi it belongs to • Run Exact NF algorithm only on Gi • Set relevance = 0 for nodes not in Gi
Outline • Motivation • Problem Definition • Neighborhood formation • Anomaly detection • Experiments • Related work • Conclusion and future work
Anomaly Detection • A node x in V2 is normal if • Nodes in V1 that links to x • are in same neighbourhood. • e.g. V1 V2 • V1 V2 x x high normality low normality
Anomaly Detection Algorithm • Input : node t in V2, Bipartite transition matrix P, • Output : Normality score(t) • Set St = neighbours of t in V1 • RSt : Pairwise relevance scores for nodes in St • Normality score ns(t) = function (RSt) e.g. mean over non-diagonal elements in RSt
Outline • Motivation • Problem Definition • Neighborhood formation • Anomaly detection • Experiments • Related work • Conclusion and future work
relevance score Do the neighborhoods make sense? relevance score relevance score most relevant neighbors most relevant neighbors The nodes (x-axis) with the highest relevance scores (y-axis) are indeed very relevant to the query node.
How accurateis the approximate NF? neighborhood size = 20 num of partitions = 10 • Precision = fraction of overlaps between ApprNF and • NF among top k neighbors • The precision drops slowly while increasing the number of partition • The precision remain high for a wide range of neighborhood size
Do the anomalies make sense? avg. normality score • Injection : • Inject 100 nodes in V2connecting k nodes each in V1 • where k = avg. degree of nodes in V2 • Nodes in V1 are randomly picked such that • degree = 10 * avg. degree of nodes in V1 • Assumption : will induce connections across neighbourhoods
What about the computational cost? Computational cost drops significantly even with small increment in number of partitions
Outline • Motivation • Problem Definition • Neighborhood formation • Anomaly detection • Experiments • Related work • Conclusion and future work
RelatedWork • Random walk on Graphs • Page-Rank [ISDN 1998], • Topic Sensitive Page-Rank [WWW 2002] • Outlier detection • Outlier detection in high dimensional data : Aggarwal and Yu [SIGMOD 2001] • Outlier Detection Using Random Walks [ICTAI 2006] • Find outlier clusters • Graph partitioning : • METIS package • Spectral clustering methods • Neighbourhoods can become personalized clusters
Outline • Motivation • Problem Definition • Neighborhood formation • Anomaly detection • Experiments • Related work • Conclusion and future work
Conclusions and Future Work • Solution to two problems for Bipartite Graphs • Neighborhood Formation (NF) • Anomaly Detection (AD) • Random walk with restart along with graph partitioning can be used to solve NF efficiently. • AD can be done based on relevance scores generated by NF • Experiments on real datasets show good results. • Proximity Tracking on Time-Evolving Graphs (SIAM 2008 paper) • Defines proximity scores in dynamic setting. • Efficient incremental updates