Link Prediction in Co-Authorship Network

Le Nhat Minh ( A0074403N)

Supervisor: Dongyuan Lu


Introduction

  • Link prediction

    • Predict future connections within the scope of the network

  • Co-authorship network

    • A network of collaborations among researchers, scientists, and academic writers


Introduction

  • Potential applications

    • Recommend experts or groups of researchers to an individual researcher.


Outline

  • Problem Background

  • Related Work

  • Workflow

  • Conclusion

    • Result Analysis

    • Research plan


Problem Background

  • What connects researchers?

  • Given an instance of a co-authorship network:

    • A researcher is connected to another if they have collaborated on at least one paper.

[Figure: example co-authorship network with researchers X and Y; year labels 2001 and 2004]


Problem Background

  • How to predict a link?

  • Based on these criteria:

    • Co-authorship network topology

    • Researcher’s personal information

    • Researcher’s papers

  • Improve link prediction performance

    • A recommended link should be relevant to the authors’ interests, or at least represent a feasible collaboration.


Related Work

  • Link prediction problems in social networks

    • Liben‐Nowell, D., & Kleinberg, J., 2007

    • Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S., 2013

  • In a social network, interactions among users are very dynamic, with:

    • Creation of new links within a few days

    • Deletion or replacement of existing links

  • The two kinds of network present different features

    • Characteristics of an individual researcher: citations, affiliations, institutions, ...

    • Characteristics of a person: marital status, age, workplace, …


  • Three mainstream approaches for link prediction:

    • Similarity based estimation

      • Liben‐Nowell, D., & Kleinberg, J., 2007

    • Maximum likelihood estimation

      • Murata, T., & Moriyasu, S., 2008

      • Guimerà, R., & Sales-Pardo, M., 2009

    • Supervised learning model

      • Pavlov, M., & Ichise, R., 2007

      • Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006


Similarity Based Estimation

  • Use metrics to estimate the proximity of each pair of researchers

  • Rank the pairs of researchers by proximity

  • The top-ranked pairs are the likely recommendations.


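Similarity Based Estimation

  • Network structure based measurement

Some conventions (standard notation, as in Liben‐Nowell & Kleinberg, 2007):

    • $\Gamma(x)$: the set of neighbors of node $x$

    • $|\Gamma(x)|$: the number of neighbors (degree) of node $x$


Similarity Based Estimation

  • Common Neighbor: $\mathrm{CN}(x, y) = |\Gamma(x) \cap \Gamma(y)|$


Similarity Based Estimation

  • Jaccard’s coefficient: $\mathrm{JC}(x, y) = \dfrac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$


Similarity Based Estimation

  • Preferential Attachment: $\mathrm{PA}(x, y) = |\Gamma(x)| \cdot |\Gamma(y)|$


Similarity Based Estimation

  • Adamic/Adar: $\mathrm{AA}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \dfrac{1}{\log |\Gamma(z)|}$, where $z$ ranges over the common neighbors of $x$ and $y$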


Similarity Based Estimation

  • Shortest Path:

    • The minimum number of edges on a path connecting two nodes.

  • PageRank:

    • A random walk on the graph assigns each node the probability that it is reached. The proximity of a pair of nodes can be taken as the sum of the two nodes' PageRank scores.
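The neighborhood-based metrics and the shortest-path distance can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook definitions, not the presenter's implementation; the toy graph and all function names are assumptions made for the example.

```python
import math
from collections import deque

# Toy undirected co-authorship graph as an adjacency map (hypothetical data).
graph = {
    "X": {"A", "B"},
    "Y": {"A", "B", "C"},
    "A": {"X", "Y"},
    "B": {"X", "Y", "C"},
    "C": {"Y", "B"},
}

def common_neighbors(g, x, y):
    """Number of neighbors shared by x and y."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def preferential_attachment(g, x, y):
    """Product of the two degrees."""
    return len(g[x]) * len(g[y])

def adamic_adar(g, x, y):
    """Sum of 1/log(degree) over common neighbors.

    A common neighbor of two distinct nodes has degree >= 2, so the log is positive.
    """
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

def shortest_path_length(g, x, y):
    """Breadth-first search; returns None if y is unreachable from x."""
    if x == y:
        return 0
    seen = {x}
    queue = deque([(x, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in g[node]:
            if nb == y:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None
```

Ranking every unconnected pair by one of these scores and keeping the top pairs gives the similarity-based recommendation list described earlier.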


Maximum Likelihood Estimation

  • Predefine specific rules for the network

  • Requires prior knowledge of the network

  • The likelihood of each non-existent link is calculated according to those rules.


Supervised Learning Model

  • Construct multi-dimensional feature vectors

  • Feed these vectors to classifiers to optimize a target function (training the model)

  • Link prediction becomes a binary classification problem


Supervised Learning Model

  • Related work (Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006) using:

    • Decision Tree

    • SVM (Linear Kernel)

    • K-nearest neighbors

    • Multilayer Perceptron

    • Naive Bayes

    • Bagging

  • Combining many classifiers (Pavlov, M., & Ichise, R., 2007)

    • Decision stump + AdaBoost

    • Decision Tree + AdaBoost

    • SMO + AdaBoost


    Summary

    • Similarity based estimation

      • Does not perform very well

    • Maximum likelihood

      • Depends on the network

    • Supervised learning model

      • Performs better than similarity based estimation


    Workflow

    [Figure: workflow diagram; recoverable labels: Features, Classifier Model]


    Graph Description

    • Co-authorship graph:

      • Undirected graph G(V, E)

    • Node or vertex (author)

      • Author ID

      • Author name

    • Link or edge (co-authorship)

      • Pair of author IDs

      • List of publication years, each followed by the paper title

        (e.g., 2004: “Introduction to …”)
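One possible in-memory representation of this graph is sketched below. The paper records, author names, and the `build_coauthorship_graph` helper are hypothetical; the only part taken from the slides is the shape of the data: nodes carry an ID and a name, and each undirected edge carries a list of (year, title) entries.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input records: (publication year, title, list of author IDs).
papers = [
    (2001, "Paper A", [1, 2]),
    (2004, "Paper B", [1, 2, 3]),
]

# Node attributes: author ID -> author name (placeholder names).
author_names = {1: "Author One", 2: "Author Two", 3: "Author Three"}

def build_coauthorship_graph(papers):
    """Return {(id_a, id_b): [(year, title), ...]} with id_a < id_b.

    Storing each pair in sorted order makes the graph undirected.
    """
    edges = defaultdict(list)
    for year, title, authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)].append((year, title))
    return dict(edges)

edges = build_coauthorship_graph(papers)
```

Keeping the year on every edge entry is what later allows the network to be cut into training and testing periods.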


    Setting up data

    • The dataset is split into two time spans: 2000 – 2010 and 2010 – 2013

    • The first is for training; the second is for testing.

    • The 2000 – 2013 network currently contains 134,307 researchers.

    • Authors who do not appear in the testing period are removed, leaving 104,265 researchers


    Setting up data

    • Choose a subset of the 104,265 researchers

    • Experiment on 937 researchers
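The data preparation can be sketched as follows. The slides list 2010 in both spans, so the cut-off used here (years up to and including 2010 go to training) is an assumption, as are the function names and the toy edge map.

```python
def split_by_time(edges, train_end=2010):
    """Split {pair: [(year, title), ...]} into training and testing edge maps.

    Assumption: records with year <= train_end are training data.
    """
    train, test = {}, {}
    for pair, records in edges.items():
        early = [r for r in records if r[0] <= train_end]
        late = [r for r in records if r[0] > train_end]
        if early:
            train[pair] = early
        if late:
            test[pair] = late
    return train, test

def authors_in(edge_map):
    """Set of author IDs that appear on at least one edge."""
    return {author for pair in edge_map for author in pair}

def restrict_to_authors(edge_map, keep):
    """Keep only edges whose two endpoints are both in `keep`."""
    return {p: r for p, r in edge_map.items() if p[0] in keep and p[1] in keep}

# Hypothetical toy data.
edges = {
    (1, 2): [(2003, "P1"), (2012, "P2")],
    (1, 3): [(2005, "P3")],
    (2, 4): [(2011, "P4")],
    (3, 5): [(2006, "P5")],
}

train, test = split_by_time(edges)
active = authors_in(test)                   # authors present in the testing period
train = restrict_to_authors(train, active)  # drop authors absent from testing
new_links = set(test) - set(train)          # pairs that first collaborate in testing
```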


    Baseline Features

    • Extract features from the network structure:

      • Local similarity

        • Common Neighbor

        • Adamic/Adar

        • Preferential Attachment

        • Jaccard’s coefficient

      • Global similarity

        • Shortest Path

        • PageRank


    Baseline Features

    • Feature for co-authorship network

      • Keyword matching (Cohen, S., & Ebel, L., 2013)

        A suggested metric of textual relevance between authors, computed with a TF-IDF based function.
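To illustrate what a TF-IDF based relevance score looks like, the sketch below compares authors by the cosine similarity of TF-IDF vectors built from their paper titles. This is a generic formulation with smoothed IDF, not the exact function of Cohen & Ebel (2013); representing an author by concatenated titles and all names used are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each key to a {term: weight} vector.

    Weight = term frequency * smoothed inverse document frequency.
    """
    tokenized = {key: text.lower().split() for key, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized.values() for t in set(tokens))
    vectors = {}
    for key, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[key] = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical author "documents": the concatenated titles of each author's papers.
docs = {
    "A": "link prediction social networks",
    "B": "link prediction co-authorship networks",
    "C": "protein folding simulation",
}
vectors = tfidf_vectors(docs)
score_ab = cosine(vectors["A"], vectors["B"])
score_ac = cosine(vectors["A"], vectors["C"])
```

Authors A and B share several title terms and score higher than A and C, which share none.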


    Proposed Features

    • Productivity of the authors

      Observe the “history” of an author

    • For example, at a particular node A:

      • i=0: T0 = 2000, n=3, m=1

      • i=1: T1 = 2004, n=4, m=2

      • i=2: T2 = 2005, n=6, m=2

      • i=3: T3 = 2006, n=7, m=3

      n: no. of shared papers

      m: no. of collaborators


    Proposed Features

    • Productivity of the authors

      Observe the “history” of an author

      The “productivity” of node A:

    α: a constant that sets the weight of each time period


    Training set

    • Set up training data

      • With n nodes, there are n(n−1)/2 possible links.

      • Separate these into two classes

        • Positive links: links that appear in the training years.

        • Negative links: the remaining links, which do not exist in the training years.

          Note: Avoid biased training by balancing the number of instances with true and false labels.

      • Classify all the non-existent links

      • Compare with the testing data
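A balanced training set of this kind can be assembled as sketched below. Random undersampling of the negative class is one common way to balance the labels and is an assumption here, as are the function name, the fixed seed, and the toy data; the slides do not say how the balancing was done.

```python
import random
from itertools import combinations

def build_training_pairs(nodes, train_links, seed=0):
    """Label node pairs for training.

    train_links: set of (a, b) tuples with a < b that exist in the training years.
    Returns (labeled, negatives): labeled is a shuffled list of (pair, label)
    with equally many positive and negative examples when enough negatives
    exist; negatives is the full list of non-existent links to classify later.
    """
    positives = sorted(train_links)
    all_pairs = combinations(sorted(nodes), 2)  # n(n-1)/2 candidate pairs
    negatives = [p for p in all_pairs if p not in train_links]
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    labeled = [(p, 1) for p in positives] + [(p, 0) for p in sampled]
    rng.shuffle(labeled)
    return labeled, negatives

# Hypothetical toy network.
nodes = [1, 2, 3, 4, 5]
train_links = {(1, 2), (2, 3), (4, 5)}
labeled, candidates = build_training_pairs(nodes, train_links)
```

After training, every pair in `candidates` is scored by the classifier, and the predicted links are compared with the links that appear in the testing period.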


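    Experimental Results

    • New links to predict: 57 links

    • Measurement of performance

      • Precision: $P = \dfrac{TP}{TP + FP}$, the fraction of predicted links that actually appear

      • Recall: $R = \dfrac{TP}{TP + FN}$, the fraction of actual new links that were predicted

      • Harmonic mean: $F = \dfrac{2PR}{P + R}$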
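These three measures can be computed directly from the set of predicted links and the set of links that actually appear in the testing period. The sketch uses the standard definitions; the function name and the example pairs are hypothetical and do not reproduce the experiment's numbers.

```python
def evaluate(predicted, actual):
    """Return (precision, recall, harmonic mean) for two sets of node pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Hypothetical example: 4 predicted links, 3 actual new links, 2 in common.
predicted = {(1, 2), (1, 3), (2, 4), (3, 5)}
actual = {(1, 2), (2, 4), (4, 5)}
precision, recall, f_score = evaluate(predicted, actual)
```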


    Result Analysis

    • Possible reasons

      • Features

      • Small set of data – sampling problem

      • Instances of the negative links used for training


    Research Plan

    • Use a weighted graph with parameters:

      • No. of papers

      • No. of neighbors

      • No. of citations

    • Focus on features that specifically target the co-authorship network:

      • Citations

      • Institutions

    • Enlarge the experiment dataset size

    Thank you


    References

    • Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211-230.

    • Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006). Link prediction using supervised learning. In SDM’06: Workshop on Link Analysis, Counter-terrorism and Security.

    • Liben‐Nowell, D., & Kleinberg, J. (2007). The link‐prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031.

    • Pavlov, M., & Ichise, R. (2007). Finding experts by link prediction in co-authorship networks. FEWS, 290, 42-55.

    • Murata, T., & Moriyasu, S. (2008). Link prediction based on structural properties of online social networks. New Generation Computing, 26(3), 245-257.

    • Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52), 22073-22078.

    • Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S. (2013). An evolutionary algorithm approach to link prediction in dynamic social networks. arXiv preprint arXiv:1304.6257.

    • Cohen, S., & Ebel, L. (2013). Recommending collaborators using keywords. In Proceedings of the 22nd International Conference on World Wide Web Companion (pp. 959-962).


    Link prediction in co authorship network

    • The number of links per year in the training set is greater than in the testing set:

      • In the testing period, only “new” collaborations are considered.

      • Any collaboration between researchers who already share a link is disregarded.


    Results with different classifiers


    Proposed Feature

    • The reasons for proposing this feature:

      • Keep track of each researcher's collaboration tendency

      • Give a “bonus” to researchers who tend to collaborate with “new” colleagues rather than “old” ones

      • Also give a high score to prolific researchers (based on the number of published papers)


    Stochastic Block Model

    • Guimerà, R., & Sales-Pardo, M., 2009


    Stochastic Block Model

    [Figure: example network with nodes 1–7, X, and Y]

    The reliability of an individual link is:

