
Information Network Analysis and Discovery

Cuiping Li, Guoming He. Information School, Renmin University of China.


Presentation Transcript


  1. Information Network Analysis and Discovery. Cuiping Li, Guoming He. Information School, Renmin University of China

  2. Related Work • Whole-graph level: macro properties (laws, generators); summary/visualization; index • Sub-graph level: frequent pattern mining; clustering (community/group detection); connected sub-graph, central piece; pattern match • Node or link level: ranking; proximity/similarity; node classification; outlier detection (abnormal nodes/links)

  3. Node Proximity/Similarity: Why? • Link prediction [Liben-Nowell+], [Tong+] • Ranking [Haveliwala], [Chakrabarti+] • Email management [Minkov+] • Image captioning [Pan+] • Neighborhood formulation [Sun+] • Connection subgraphs [Faloutsos+], [Tong+], [Koren+] • Pattern match [Tong+] • Collaborative filtering [Fouss+] • Many more…

  4. Node Similarity: Related Work (1) • Computer Networks '99: Finding Related Pages in the World Wide Web, Jeffrey Dean, Monika R. Henzinger (adapted from HITS) • KDD'02: SimRank: A Measure of Structural-Context Similarity, Glen Jeh, Jennifer Widom (adapted from PageRank) • Exploiting Hierarchical Domain Structure to Compute Similarity, P. Ganesan, H. Garcia-Molina, J. Widom, Transactions on Information Systems, 21(1): 64-93, January 2003 • Vertex Similarity in Networks, Phys. Rev. E 73, 026120 (2006) • Optimizations of SimRank: WWW'05: Scaling Link-Based Similarity Search, D. Fogaras, B. Racz (approximate); VLDB'08: Accuracy Estimate and Optimization Techniques for SimRank Computation, Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov

  5. Node Similarity: Related Work (2) • Domain-integrated variant of SimRank: VLDB'08: Simrank++: Query Rewriting through Link Analysis of the Click Graph, Ioannis Antonellis (Stanford University), Hector Garcia-Molina (Stanford University), Chi-Chao Chang (Yahoo!) (keywords, ads) • Clustering using SimRank: SIGIR'03: ReCoM: Reinforcement Clustering of Multi-Type Interrelated Data Objects, J. Wang, H.J. Zeng, Z. Chen, H.J. Lu, L. Tao; VLDB'06: LinkClus: Efficient Clustering via Heterogeneous Semantic Links, Xiaoxin Yin, Jiawei Han, Philip Yu

  6. Existing Research: Limitation 1 • Not dynamic • Static algorithms • Iterative • Challenge of dynamic networks • Re-computation is needed even when a single node or edge changes • Our solution • Non-iterative • Incremental computation • Reference: Cuiping Li, Jiawei Han, Guoming He, Xin Jin, Yizhou Sun, Yintao Yu, Tianyi Wu, "Fast Computation of SimRank for Static and Dynamic Information Networks", Int. Conf. on Extending Database Technology (EDBT'10), Lausanne, Switzerland, March 2010

  7. Existing Research: Limitation 2 • Not efficient • Our solution: exploit modern hardware resources • GPU (Graphics Processing Unit) • Multi-processor

  8. Compute Node Similarity for Dynamic Network • SimRank formula • Or • Intuition • Two objects are similar if they are referenced by similar objects.
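For reference, the standard SimRank definition from the Jeh and Widom paper cited on slide 4 can be written as below. The notation is an assumption here rather than a copy of the slide: $I(a)$ denotes the set of in-neighbors of node $a$, and $c \in (0,1)$ is the decay constant. The second line is the matrix form used on slide 13.

```latex
s(a,b) = \frac{c}{|I(a)|\,|I(b)|} \sum_{i \in I(a)} \sum_{j \in I(b)} s(i,j),
\qquad s(a,a) = 1, \qquad s(a,b) = 0 \text{ if } I(a) = \emptyset \text{ or } I(b) = \emptyset

X = c\,A^{\top} X A + (1-c)\,e
```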

  9. How to Compute SimRank Incrementally • First glance at the SimRank formula • It is iterative, so it seems to offer no chance of incremental computation • Key observation • The SimRank iteration formula has the same form as the well-known Sylvester equation; based on this, SimRank can be computed without iteration.

  10. Vec-Operator and Kronecker Products • Vec-operator • vec flattens an $n \times n$ matrix A into an $n^2 \times 1$ vector • It stacks the columns of the matrix on top of each other, from left to right • Kronecker product • Product of two matrices A and B, written $A \otimes B$ • Each element of A is multiplied with the full matrix B: $A \otimes B = [\,a_{ij} B\,]$
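The two operators can be tried out on small matrices. The snippet below is a minimal sketch using NumPy; the helper name `vec` and the example matrices are illustrative and not taken from the slides.

```python
import numpy as np

def vec(M):
    """Stack the columns of M on top of each other (column-major flatten)."""
    return M.flatten(order="F")

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

print(vec(A))         # columns of A stacked: [1 3 2 4]
print(np.kron(A, B))  # 4 x 4 block matrix [[1*B, 2*B], [3*B, 4*B]]
```

Note that NumPy flattens row by row by default, so `order="F"` is needed to match the column-stacking convention on the slide.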

  11. Sylvester Equations • Sylvester equation: $X = SXT + X_0$ • Given three $n \times n$ matrices S, T, and $X_0$ • We want to determine X • Solvable in $O(n^3)$

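  12. Sylvester Equations • Rewrite the Sylvester equation as $\mathrm{vec}(X) = \mathrm{vec}(SXT) + \mathrm{vec}(X_0)$ • Exploit the well-known fact $\mathrm{vec}(SXT) = (T^{\top} \otimes S)\,\mathrm{vec}(X)$ • This gives $\mathrm{vec}(X) = (T^{\top} \otimes S)\,\mathrm{vec}(X) + \mathrm{vec}(X_0)$ • Hence $(I - T^{\top} \otimes S)\,\mathrm{vec}(X) = \mathrm{vec}(X_0)$ • Now we have to solve $\mathrm{vec}(X) = (I - T^{\top} \otimes S)^{-1}\,\mathrm{vec}(X_0)$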
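The derivation above can be checked numerically. The following is a minimal sketch, assuming small dense matrices; the function name `solve_sylvester_vec` and the example values are hypothetical. It builds the full $n^2 \times n^2$ system explicitly, so it illustrates the algebra only and is not the $O(n^3)$ method mentioned on slide 11.

```python
import numpy as np

def solve_sylvester_vec(S, T, X0):
    """Solve X = S X T + X0 through vec(X) = (I - T^T kron S)^{-1} vec(X0)."""
    n = X0.shape[0]
    K = np.kron(T.T, S)                          # T^T (x) S, an n^2 x n^2 matrix
    x = np.linalg.solve(np.eye(n * n) - K,       # solve instead of forming the inverse
                        X0.flatten(order="F"))   # vec(X0), columns stacked
    return x.reshape((n, n), order="F")          # undo vec

# Illustrative inputs (not from the slides), chosen so that I - T^T (x) S is invertible.
S = np.array([[0.2, 0.1],
              [0.0, 0.3]])
T = np.array([[0.1, 0.4],
              [0.2, 0.0]])
X0 = np.eye(2)

X = solve_sylvester_vec(S, T, X0)

# Check the identity vec(S X T) = (T^T (x) S) vec(X) and the equation itself.
identity_ok = np.allclose((S @ X @ T).flatten(order="F"),
                          np.kron(T.T, S) @ X.flatten(order="F"))
residual = np.abs(X - (S @ X @ T + X0)).max()
print(identity_ok, residual)
```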

  13. SimRank • SimRank has the same form as the Sylvester equation: $X = cA^{\top} X A + (1-c)e$ (A is the normalized adjacency matrix, e is the identity matrix) • Similarly, for SimRank, we have to solve $\mathrm{vec}(X) = (I - c\,(A^{\top} \otimes A^{\top}))^{-1}\,\mathrm{vec}((1-c)e)$, that is, $\mathrm{vec}(X) = (1-c)\,(I - c\,(A^{\top} \otimes A^{\top}))^{-1}\,\mathrm{vec}(e)$ • This can be solved in $O(n^3)$ • More importantly, when A is sparse or skewed, the efficiency can be improved further.
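As a sanity check of the closed form, the sketch below compares it with plain fixed-point iteration of $X = cA^{\top}XA + (1-c)e$ on a toy graph. The graph, the function names, and the choice of column normalization by in-degree are assumptions made for illustration. In this matrix form the diagonal entries are not pinned to 1, unlike the original node-pair definition of SimRank.

```python
import numpy as np

def normalized_adjacency(n, edges):
    """Column-normalized adjacency: A[i, j] = 1/|I(j)| if there is an edge i -> j."""
    A = np.zeros((n, n))
    for src, dst in edges:
        A[src, dst] = 1.0
    col_sums = A.sum(axis=0)
    nonzero = col_sums > 0                     # nodes without in-neighbors keep a zero column
    A[:, nonzero] /= col_sums[nonzero]
    return A

def simrank_closed_form(A, c):
    """vec(X) = (1 - c) (I - c (A^T kron A^T))^{-1} vec(e), solved as a linear system."""
    n = A.shape[0]
    K = np.kron(A.T, A.T)
    rhs = (1 - c) * np.eye(n).flatten(order="F")
    x = np.linalg.solve(np.eye(n * n) - c * K, rhs)
    return x.reshape((n, n), order="F")

def simrank_iterative(A, c, iters=200):
    """Fixed-point iteration of X = c A^T X A + (1 - c) e."""
    n = A.shape[0]
    X = np.eye(n)
    for _ in range(iters):
        X = c * A.T @ X @ A + (1 - c) * np.eye(n)
    return X

# Toy graph (hypothetical): nodes 2 and 3 share the in-neighbors 0 and 1.
edges = [(0, 2), (1, 2), (0, 3), (1, 3), (2, 0)]
A = normalized_adjacency(4, edges)
c = 0.8

X_closed = simrank_closed_form(A, c)
X_iter = simrank_iterative(A, c)
print(np.allclose(X_closed, X_iter))
print(np.round(X_closed, 4))
```

The dense $n^2 \times n^2$ solve is only practical for very small graphs; it is used here to make the correspondence with the formula explicit.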

  14. Advantages of the Non-Iterative Method $\mathrm{vec}(X) = (1-c)\,(I - c\,(A^{\top} \otimes A^{\top}))^{-1}\,\mathrm{vec}(e)$ • It can be solved approximately • It can be computed incrementally • It can be computed pairwise

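  15. Approximation $\mathrm{vec}(X) = (1-c)\,(I - c\,(W \otimes W))^{-1}\,\mathrm{vec}(e)$ • Use the singular value decomposition (SVD) and the Sherman-Morrison formula to compute the inverse of L • W =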

  16. Approximation • Low-rank SVD of W • Choice of the rank k • The larger k is, the longer the computation takes and the higher the accuracy • Error bound

  17. Approximation • Pre-computation • Computing the SimRank of a given node pair (i, j)
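The approximation slides can be illustrated with a small sketch. It assumes a rank-k SVD $W \approx U \Sigma V^{\top}$ and applies the Woodbury identity (the matrix generalization of the Sherman-Morrison formula) to $(I - c\,(W \otimes W))^{-1}$; the exact formulas on the slides may differ. The names `precompute` and `simrank_pair`, the toy in-neighbor lists, and the choice of W as a row-normalized in-neighbor matrix are all hypothetical.

```python
import numpy as np

def precompute(W, k, c=0.8):
    """Pre-computation: rank-k SVD of W plus a k^2-vector reused by every pair query.

    With Ku = U kron U, Kv = V kron V, Ks = Sigma kron Sigma, Woodbury gives
    (I - c Ku Ks Kv^T)^{-1} = I + c Ku (Ks^{-1} - c Kv^T Ku)^{-1} Kv^T.
    """
    U, s, Vt = np.linalg.svd(W)
    U, s, V = U[:, :k], s[:k], Vt[:k, :].T           # keep the k largest singular values
    sig_inv = np.diag(1.0 / s)                       # assumes these k values are nonzero
    VtU = V.T @ U
    Lam = np.linalg.inv(np.kron(sig_inv, sig_inv) - c * np.kron(VtU, VtU))
    z = (V.T @ V).flatten(order="F")                 # equals Kv^T vec(e)
    return U, Lam @ z

def simrank_pair(U, y, i, j, c=0.8):
    """Approximate SimRank of one node pair (i, j) from the pre-computed pieces."""
    row = np.kron(U[j], U[i])                        # row j*n + i of U kron U
    return (1 - c) * ((1.0 if i == j else 0.0) + c * (row @ y))

# Toy in-neighbor lists (hypothetical); row a of W averages over the in-neighbors of a.
in_neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 3], 3: [0, 1, 2]}
W = np.zeros((4, 4))
for a, nbrs in in_neighbors.items():
    W[a, nbrs] = 1.0 / len(nbrs)

c = 0.8
U, y = precompute(W, k=3, c=c)
print(simrank_pair(U, y, 0, 1, c))
```

In this sketch the pre-computation depends only on W, k, and c, while each pair query costs one Kronecker product of two length-k rows and one dot product of length $k^2$. With k equal to the full rank of W the result coincides with the exact closed form.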

  18. Incremental Computation • Only U, Σ, and V need to be maintained

  19. Applications • Similarity Tracking: return the N most similar nodes of i at each time step t. • Centrality Tracking: return the N most central nodes at each time step t.
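Similarity tracking can be sketched as a top-N query against the similarity matrix of each time step. The snippet below is a minimal illustration; the function name `top_n_similar` and the example matrix are made up, and the centrality measure is not specified in the transcript, so only the similarity case is shown.

```python
import numpy as np

def top_n_similar(X, i, n_top):
    """Return the indices of the n_top nodes most similar to node i, excluding i itself."""
    scores = X[i].astype(float).copy()
    scores[i] = -np.inf                          # never report the query node
    order = np.argsort(-scores, kind="stable")   # descending similarity
    return order[:n_top].tolist()

# Hypothetical similarity matrix for one time step.
X = np.array([[1.0, 0.2, 0.7, 0.1],
              [0.2, 1.0, 0.3, 0.6],
              [0.7, 0.3, 1.0, 0.4],
              [0.1, 0.6, 0.4, 1.0]])

print(top_n_similar(X, 0, 2))  # [2, 1]
```

For tracking over time, the same function would be called on the similarity scores available at each time step t.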

  20. Experimental Results on DBLP: Top-10 Most Similar Terms for ‘Prof. Jennifer Widom’ up to Each Time Step

  21. Experimental Results: Top-10 Most Similar Authors for ‘Prof. Jennifer Widom’ up to Each Time Step

  22. Pre-computation Time

  23. Computation Time for Different Numbers of Node Pairs

  24. Wikipedia Data • We set the threshold T to 1.0e-6 • For k = 15: • the pre-computation time on the Wikipedia dataset is approx. 5.68 hours • the query time per 1000 node pairs is 3.718 seconds
