
DBconnect: Mining Research Community on DBLP Data

DBconnect: Mining Research Community on DBLP Data. Osmar R. Zaïane, Jiyang Chen, Randy Goebel. Web Mining and Social Network Analysis Workshop, held in conjunction with ACM SIGKDD (SNA-KDD'07). Presenter: 吳建良.


Presentation Transcript


  1. DBconnect: Mining Research Community on DBLP Data • Osmar R. Zaïane, Jiyang Chen, Randy Goebel • Web Mining and Social Network Analysis Workshop, held in conjunction with ACM SIGKDD (SNA-KDD'07) • Presenter: 吳建良

  2. Outline • Community • Motivation • Understand the research community: recommend collaborations • Proposed Approach • Rank relevance with a random walk approach • DBconnect • A navigational system to investigate community relations • Conclusion

  3. What is a community? • In Graph Theory: • Densely connected groups of vertices, with sparser connections between groups • In Social Network Analysis: • Groups of entities that share similar properties or connect to each other via certain relations

  4. Why is community important? • Interesting data with community structure: • Researcher collaboration, friendship networks, the WWW, massively multiplayer online games, electronic communications… • Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.

  5. Motivation • Understand the research network between authors, conferences and topics (rank entities by relevance for given entities) • Find and recommend research collaborators for given authors • Explore the academic social network

  6. Proposed Approach • Build bipartite graph in the author-conference space • Limitation of traditional bipartite graph model • Extend the bipartite model to include co-authorship information • Further extend the model to tripartite to include topic information • Use random walk with restart on such models

  7. An example • Author publication records in conferences • a, b, c, d, e are authors • ac(3) means that authors a and c published three papers together in the KDD(y) conference

  8. Bipartite model for conference-author social network • Weight(edge) = publishing frequency of an author in a certain conference • Limitation: fails to represent any co-authorships • To capture the co-author relations: • Add a link between a and c → misses the role of KDD • Make the link connecting a and c to KDD → makes the random walk infeasible • Add additional nodes to represent each co-author relation → impractical, since there is a huge number of such relations

  9. Extend the bipartite model to include co-authorship information • Add a virtual level of nodes (C') to replace the conference partition, and add direction to the edges • Each A node connects to its own split relation nodes with the original weight • Each C' node connects to all author nodes • If the A node and the C' node have a co-author relation → edge weight: co-author frequency × a parameter f • Otherwise, the edge keeps its original weight • Set f = k (k is the total number of authors of a conference) • [Figure: directed bipartite graph with edge weights such as 3, 3f and 7]

  10. Further extend the model to tripartite to include topic information • Research topic is an important component to differentiate any research community • Authors that attend the same conferences might work on various topics

  11. Adding topic information • Very few conference proceedings have their tables of contents included in DBLP • Tables of contents include session titles • Extract relevant topics from DBLP • Use paper titles, and find frequent co-locations in the title text • Method • Manually select a list of stopwords to remove frequently used but non-topic-related words • Ex: Towards, Understanding, Approach, …

  12. Adding topic information (cont.) • Count the frequency of every co-located pair of stemmed words • Select the top 1000 most frequent bi-grams as topics • Manually add several tri-grams • Ex: World Wide Web, Support Vector Machine, …
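
The bi-gram counting described on slides 11 and 12 can be sketched as follows. This is only an illustration under stated assumptions: the stopword set, the crude `naive_stem` helper and the sample titles are hypothetical, and the slides do not say which stemmer was used or whether stopwords were removed before pairing.

```python
import re
from collections import Counter

# Hypothetical stopword list; the slides only say it was selected by hand.
STOPWORDS = {"towards", "understanding", "approach", "a", "an", "the",
             "of", "for", "in", "on", "and", "to", "with", "using"}

def naive_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def top_bigrams(titles, k=1000):
    """Return the k most frequent pairs of adjacent stemmed title words."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        # Assumption: stopwords are dropped before pairing, so words
        # separated only by a stopword become adjacent.
        tokens = [naive_stem(w) for w in words if w not in STOPWORDS]
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(k)

# Hypothetical paper titles, for illustration only.
titles = [
    "Mining Association Rules in Large Databases",
    "Fast Algorithms for Mining Association Rules",
    "Towards Understanding Association Rule Mining",
]
best = top_bigrams(titles, k=1)
```

Tri-grams such as "World Wide Web" would still need to be added by hand, as the slide notes.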

  13. Random walk on DBLP social network • Problem to be solved: • Given an author node a ∈ A, compute a relevance score for each author b ∈ A • Simple example: conference-author network G • Relational matrix M of size 3×5

  14. Random walk on DBLP social network (cont.) • Normalize M such that every column sums to 1: Q(M) = col_norm(M), Q(M^T) = col_norm(M^T) • Construct the adjacency matrix J of G after normalization

  15. Random walk on DBLP social network (cont.) • Normalized adjacency matrix J of G, with conference nodes ordered before author nodes: $J = \begin{pmatrix} 0 & Q(M) \\ Q(M^T) & 0 \end{pmatrix}$

  16. Random walk on DBLP social network (cont.) • A random walk on this graph moves from one node to one of its neighbors according to a transition probability • Probability: the weight of the edge divided by the sum of the weights of all edges that connect to this node • Ex: if we start from node SIGMOD, then build u as the start vector • u is a one-column vector consisting of (3+5) elements, one per conference and author • The value of the element corresponding to SIGMOD is set to 1

  17. Random walk on DBLP social network (cont.) • u = Ju • After step 1 of the first iteration, the random walk hits the author nodes with b = 1×0.44, d = 1×0.33, e = 1×0.22 • After step 2 of the first iteration, the chance that the random walk goes back to SIGMOD is 0.44×0.8 + 0.33×1 + 0.22×0.22 = 0.73, and the other 0.27 goes to the other two conference nodes
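
The normalization and the two steps of slides 14 to 17 can be reproduced with a short NumPy sketch. The slide's matrix M is not preserved in this transcript, so the weights below are hypothetical, chosen only so that the SIGMOD row yields the quoted probabilities (0.44, 0.33, 0.22 outward and 0.8, 1, 0.22 back); the names of the other two conferences are left unspecified.

```python
import numpy as np

def col_norm(X):
    """Scale each column to sum to 1; all-zero columns stay zero."""
    sums = X.sum(axis=0, keepdims=True)
    return np.divide(X, sums, out=np.zeros_like(X, dtype=float), where=sums != 0)

# Hypothetical weights: rows are two other conferences and SIGMOD (row 2),
# columns are authors a, b, c, d, e.
M = np.array([[3, 1, 3, 0, 0],
              [0, 0, 0, 0, 7],
              [0, 4, 0, 3, 2]], dtype=float)
n, m = M.shape  # n conferences, m authors

# Block adjacency matrix with conference nodes first, then author nodes.
J = np.block([[np.zeros((n, n)), col_norm(M)],
              [col_norm(M.T), np.zeros((m, m))]])

u = np.zeros(n + m)
u[2] = 1.0          # start at SIGMOD

step1 = J @ u       # probability mass now sits on the author nodes
step2 = J @ step1   # mass returns to the conference nodes

print(np.round(step1[n:], 2))  # author probabilities after step 1
print(np.round(step2[:n], 2))  # conference probabilities after step 2
```

With exact fractions the return probability to SIGMOD is about 0.738; the slide's 0.73 comes from multiplying the rounded two-digit values.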

  18. Random walk on DBLP social network (cont.) • After a few iterations, the vector converges and gives a stable score to every node • However, these scores are always the same no matter where the walk begins • Solved by random walk with restart • Given a restarting probability c • Use another vector v, in which the element corresponding to SIGMOD is set to 1 • In each random walk iteration, the walker goes back to the start node with restart probability c: u = (1 − c)u + cv
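
As a side note that is not on the slides: combining the two updates of one iteration, u = Ju followed by u = (1 − c)u + cv, gives a linear fixed-point equation, so the stationary scores also have a closed form. Assuming J is column-stochastic and 0 < c ≤ 1, the matrix I − (1 − c)J is invertible and

```latex
\begin{aligned}
u &= (1-c)\,J\,u + c\,v \\
\bigl(I - (1-c)\,J\bigr)\,u &= c\,v \\
u &= c\,\bigl(I - (1-c)\,J\bigr)^{-1} v
\end{aligned}
```

The dependence on v is what makes the scores specific to the start node, unlike the plain walk.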

  19. Random walk on DBLP social network (cont.) • Random walk with restart, Algorithm (1)
      Input: node α ∈ A, a bipartite graph model G, restarting probability c, convergence threshold ε.
      Output: relevance score vector B for author nodes.
      1. Compute the adjacency matrix J, of size (n+m)×(n+m), of G. /* n conferences and m authors */
      2. Initialize v_α = 0, then set the element for α to 1: v_α(α) = 1.
      3. While (Δu_α > ε): u_α = J u_α; u_α = (1 − c) u_α + c v_α
      4. Set vector B = u_α(n+1 : n+m).
      5. Return B.
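
Algorithm (1) translates almost line by line into NumPy. The sketch below is a minimal version under assumptions the slide leaves open: u is initialized to the restart vector v, Δu is measured as the L1 change between iterations, the default c = 0.15 is an arbitrary illustrative value, and the small matrix M is hypothetical.

```python
import numpy as np

def random_walk_with_restart(J, start, n, c=0.15, eps=1e-9, max_iter=10_000):
    """Return relevance scores of the author nodes for the node `start`.

    J     : column-normalized adjacency matrix, conferences first, then authors
    start : index of the start node in J
    n     : number of conference nodes
    """
    v = np.zeros(J.shape[0])
    v[start] = 1.0                  # restart vector: all mass on the start node
    u = v.copy()                    # assumption: the walk starts at the start node
    for _ in range(max_iter):
        u_next = (1 - c) * (J @ u) + c * v
        delta = np.abs(u_next - u).sum()
        u = u_next
        if delta <= eps:
            break
    return u[n:]                    # scores for the author block only

# Hypothetical toy graph: 2 conferences, 3 authors.
M = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0]])
n, m = M.shape
J = np.block([[np.zeros((n, n)), M / M.sum(axis=0)],
              [M.T / M.T.sum(axis=0), np.zeros((m, m))]])

B = random_walk_with_restart(J, start=n + 0, n=n)  # scores relative to author 0
```

Because each iteration shrinks the difference by a factor of at most (1 − c), the loop terminates for any c in (0, 1]; a larger c keeps the scores more local to the start node.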

  20. Random walk on DBLP social network (cont.) • Extend the bipartite model into a directed bipartite graph G' = (C', A, E') • A has m author nodes, and C has n conference nodes • C' is generated from C and has n·m nodes • Assume every node in C is split into m nodes • First generate a matrix M of size (n·m)×m for the directed edges from C' to A • Then form a matrix N of size m×(n·m) for the edges from A to C'

  21. Random walk on DBLP social network (cont.) • The adjacency matrix J of G' • Algorithm (2): the random walk with restart algorithm for the directed bipartite model

  22. Random walk on DBLP social network (cont.) • Extend to the tripartite graph model G'' = (C, A, T, E'') • Assume n conferences, m authors and l topics in G'' • Three corresponding matrices: U (n×m), V (m×l) and W (n×l) • The adjacency matrix of G'' after normalization (shown as a figure on the slide)
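
The normalized tripartite matrix itself is not preserved in this transcript. One plausible way to assemble it from U, V and W is sketched below. The node ordering (conferences, authors, topics) and the choice to column-normalize the whole block matrix at once are assumptions, not the authors' stated method, and the small matrices are hypothetical.

```python
import numpy as np

def tripartite_adjacency(U, V, W):
    """Build a column-normalized adjacency matrix for G'' = (C, A, T, E'').

    U : n x m conference-author weights
    V : m x l author-topic weights
    W : n x l conference-topic weights
    Node order is conferences, then authors, then topics (assumption).
    """
    n, m = U.shape
    l = V.shape[1]
    A = np.block([[np.zeros((n, n)), U,                W],
                  [U.T,              np.zeros((m, m)), V],
                  [W.T,              V.T,              np.zeros((l, l))]]).astype(float)
    sums = A.sum(axis=0, keepdims=True)
    # Assumption: normalize each full column so the walk leaves a node
    # toward both other partitions in proportion to edge weight.
    return np.divide(A, sums, out=np.zeros_like(A), where=sums != 0)

# Hypothetical toy sizes: 2 conferences, 3 authors, 2 topics.
U = np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]])
V = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0]])
J3 = tripartite_adjacency(U, V, W)
```

The resulting matrix can be passed to the same restart iteration as in the bipartite case, reading off the conference, author or topic block of the score vector as needed.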

  23. Random walk on DBLP social network (cont.) • Algorithm (3): the random walk with restart algorithm for the tripartite model

  24. DBLP dataset • The publication data for conferences was downloaded from the DBLP website in July 2007 • It contains more than 300,000 authors, about 3,000 conferences and the selected 1,000 N-gram topics • The entire adjacency matrix becomes too big for the random walk to run efficiently • Use the METIS algorithm to partition the large graph into ten subgraphs of about the same size

  25. The DBconnect System • http://kingman.cs.ualberta.ca/research/demos/content/dbconnect/ • A navigational system to investigate the community connections and relations • Displaying researcher statistics from academic search engines • Providing lists of recommended entities to given authors, topics and conferences

  26. The DBconnect System (cont.) • Academic Information • Conference contributions, earliest publication year and average publications per year • H-index is calculated based on information retrieved from Google Scholar • Approximate citation numbers • Related Conferences • Based on the author-conference-topic model • Related Topics • Based on the author-conference-topic model
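
For reference, the H-index mentioned above is the largest h such that the author has at least h papers with at least h citations each. A minimal sketch follows; the citation counts are made up, and how DBconnect actually retrieved counts from Google Scholar is not described on the slides.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break
    return h

# Hypothetical per-paper citation counts.
example = h_index([10, 8, 5, 4, 3])
```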

  27. The DBconnect System (cont.) • Co-authors • Co-author names and number of papers • Related Researchers • Based on the directed bipartite graph model • Recommended Collaborators • Based on the author-conference-topic model • Co-authors' names are not shown here • The result implies that the given author shares similar topics and conference experiences with the listed researchers, hence the recommendation

  28. The DBconnect System (cont.) • Recommended To • The given author has been recommended to the listed authors • The recommendation is not symmetric • Author A may be recommended as a possible future collaborator to author B but not vice versa • Ex: Jiawei Han has been recommended as a collaborator for 6201 authors, but apparently only a few of them are recommended as collaborators to him • Symmetric Recommendations • The listed authors have also been recommended to the given author
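
The difference between "Recommended To" and "Symmetric Recommendations" can be illustrated by inverting per-author recommendation lists. This is a sketch of the idea only, not the DBconnect implementation; the function names and the author labels are hypothetical.

```python
from collections import defaultdict

def recommended_to(recommendations):
    """Invert 'author -> recommended collaborators' into
    'author -> authors who receive them as a recommendation'."""
    inverse = defaultdict(set)
    for author, recs in recommendations.items():
        for candidate in recs:
            inverse[candidate].add(author)
    return inverse

def symmetric_recommendations(recommendations):
    """Keep only pairs where each author is recommended to the other."""
    inverse = recommended_to(recommendations)
    return {author: set(recs) & inverse[author]
            for author, recs in recommendations.items()}

# Hypothetical top-k lists: H is recommended to both A and B,
# but neither of them appears in H's own list.
recs = {"A": ["H", "B"], "B": ["H", "A"], "H": ["X"]}
inverse = recommended_to(recs)
mutual = symmetric_recommendations(recs)
```

Here H plays the role of a highly connected researcher: many authors receive H as a recommendation, yet H's own list points elsewhere, so H has no symmetric recommendations.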

  29. Conclusion • Extend a bipartite graph model to incorporate co-authorship • Propose a random walk with restart approach • Find related conferences, authors, and topics for a given entity • Present DBconnect system • Help explore the relational structure and discover implicit knowledge within the DBLP data collection
