
OAG: Toward Linking Large-scale Heterogeneous Entity Graphs


Presentation Transcript


  1. OAG: Toward Linking Large-scale Heterogeneous Entity Graphs. Fanjin Zhang, Xiao Liu, Jie Tang, Yuxiao Dong, Peiran Yao, Jie Zhang, Xiaotao Gu, Yan Wang, Bin Shao, Rui Li, and Kuansan Wang. Tsinghua University & Microsoft Research

  2. OAG overview. Open Academic Graph (OAG) is a large knowledge graph unifying two web-scale academic graphs: Microsoft Academic Graph (MAG) and AMiner. Linking large-scale heterogeneous academic graphs.

  3. OAG: Open Academic Graph https://www.openacademic.ai/oag/

  4. Problem & Challenges. Input: two heterogeneous entity graphs G1 and G2. Output: entity linkings (v1, v2) such that v1 in G1 and v2 in G2 represent exactly the same entity.

  5. Challenges • Entity heterogeneity • Different types of entities • Heterogeneous attributes • Entity ambiguity • Long-standing name ambiguity problem • Large-scale entity linking • Hundreds of millions of publications in each source.

  6. Related work • Rule-based method: DiscR [TKDE’15] • Traditional ML method: RiMOM [JWS’06], Rong et al. [ISWC’12], Wang et al. [WWW’12], COSNET [KDD’15]. • Embedding-based method: IONE [IJCAI’16], REGAL [CIKM’18], MEgo2Vec [CIKM’18].

  7. Framework: LinKG • Venue linking module • Paper linking module • Author linking module

  8. Framework: LinKG • Venue linking — Sequence-based Entities • An LSTM-based method to capture the dependencies • Paper linking • locality-sensitive hashing and convolutional neural networks for scalable and precise linking. • Author linking • heterogeneous graph attention networks to model different types of entities.

  9. Linking venues — sequence-based entities • Input: venue names in each graph • Output: linked venue pairs • Idea: an LSTM-based method; direct name matching handles the easy cases, fuzzy-sequence linking handles the rest
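The two-stage idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the 0.9 threshold are my own, and `difflib.SequenceMatcher` stands in for the learned LSTM similarity scorer used in the actual fuzzy-sequence stage.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase a venue name and collapse punctuation/whitespace."""
    return " ".join("".join(c if c.isalnum() else " " for c in name.lower()).split())

def link_venues(names_a, names_b, threshold=0.9):
    """Two-stage venue linking: exact match on normalized names first
    (easy cases), then fuzzy sequence matching for the remainder."""
    linked, unmatched_a = [], []
    index_b = {normalize(n): n for n in names_b}
    for a in names_a:
        key = normalize(a)
        if key in index_b:                       # easy case: direct name matching
            linked.append((a, index_b[key]))
        else:
            unmatched_a.append(a)
    for a in unmatched_a:                        # hard case: fuzzy-sequence linking
        best, best_score = None, 0.0
        for b in names_b:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score > best_score:
                best, best_score = b, score
        if best_score >= threshold:
            linked.append((a, best))
    return linked
```

Routing the easy exact matches first is what keeps the expensive fuzzy stage tractable at web scale.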

  10. Venue linking characteristics • Word order matters • E.g. 'Diagnostic and Interventional Imaging' vs. 'Journal of Diagnostic Imaging and Interventional Radiology' • Fuzzy matching for varied-length venue names • Extra or missing prefix or suffix • E.g. 'Proceedings of the Second International Conference on Advances in Social Network Mining and Analysis'

  11. Venue linking model • Input: raw word sequence, plus keywords extracted from the integral sequence • Two-layer LSTM • Output: similarity score

  12. Framework: LinKG • Venue linking — Sequence-based Entities • An LSTM-based method to capture the dependencies • Paper linking • locality-sensitive hashing and convolutional neural networks for scalable and precise linking. • Author linking • heterogeneous graph attention networks to model different types of entities.

  13. Linking papers — large-scale entities • Problem setting: to link paper entities, we fully leverage the heterogeneous information, including a paper's title and authors • Leverage locality-sensitive hashing (LSH) for fast processing • Adopt Doc2Vec to transform titles to real-valued vectors • Use LSH to map real-valued paper features to binary codes • Use a convolutional neural network for effective linking
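The LSH step above can be sketched with random-hyperplane hashing: each random hyperplane contributes one bit (the sign of the projection), so nearby real-valued vectors get binary codes with small Hamming distance. This is a generic SimHash sketch under my own function names, not the paper's exact hashing configuration.

```python
import numpy as np

def lsh_codes(vectors, n_bits=16, seed=0):
    """Map real-valued vectors (e.g. Doc2Vec title embeddings) to n_bits-bit
    binary codes: one random hyperplane per bit, bit = sign of projection."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    return (vectors @ planes > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.sum(a != b))
```

Candidate paper pairs are then those whose codes fall within a small Hamming radius, which turns an all-pairs comparison into cheap bucket lookups.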

  14. Paper linking characteristics • Large-scale entities • Hundreds of millions of academic publications in each graph • Local and hierarchical matching patterns • Paper titles are often truncated if they contain punctuation marks, such as ':' and '?' • Different author name formats: Jing Zhang, J. Zhang & Zhang, J.

  15. Paper linking model — CNN model • Convolution on the input word-level similarity matrix, followed by MLP layers
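The CNN's input can be built as follows: a word-level similarity matrix whose entry (i, j) compares word i of one title with word j of the other, over which convolutions can pick up the local and hierarchical matching patterns mentioned earlier. A minimal sketch, with my own scoring scheme (1.0 for exact match, 0.5 for a prefix relation as a crude stand-in for sub-word similarity):

```python
import numpy as np

def similarity_matrix(title_a, title_b):
    """Word-level similarity matrix between two paper titles.
    Entry (i, j): 1.0 for an exact word match, 0.5 if one word is a
    prefix of the other (e.g. abbreviations), else 0.0."""
    wa, wb = title_a.lower().split(), title_b.lower().split()
    m = np.zeros((len(wa), len(wb)))
    for i, x in enumerate(wa):
        for j, y in enumerate(wb):
            if x == y:
                m[i, j] = 1.0
            elif x.startswith(y) or y.startswith(x):
                m[i, j] = 0.5
    return m
```

A matching pair of titles produces a bright (near-)diagonal band in this matrix, which is exactly the local pattern a small convolution filter can detect.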

  16. Framework: LinKG • Venue linking — Sequence-based Entities • An LSTM-based method to capture the dependencies • Paper linking • locality-sensitive hashing and convolutional neural networks for scalable and precise linking. • Author linking • heterogeneous graph attention networks to model different types of entities.

  17. Linking authors — ambiguous entities • Problem setting: to link author entities, we generate a heterogeneous subgraph for each author • One author's subgraph is composed of his or her coauthors, papers, and publication venues • Also incorporate the venue and paper linking results • Present a heterogeneous graph attention network based technique for author linking

  18. Author linking characteristics • Name ambiguity • 16,392 'Jing Zhang's in AMiner and 7,170 in MAG • Attribute sparsity • Missing affiliations, homepages, … • Already linked papers and venues! • View author linking as a subgraph matching problem • Aggregate needed information from neighbors

  19. Graph neural networks • Neighborhood aggregation: aggregate neighbor information and pass it into a neural network • It can be viewed as a center-surround filter in CNN---graph convolutions!

  20. GCN: graph convolutional networks • GCN is one way of doing neighborhood aggregation • Others: GraphSage, graph attention, …
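The neighborhood aggregation described above can be written as one GCN-style layer in a few lines of NumPy. A minimal sketch for illustration (names and the row-normalization choice are mine, not the paper's exact layer):

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One GCN-style layer: average each node's feature with its
    neighbors' (A + I, row-normalized), then apply a linear
    transform and a ReLU nonlinearity."""
    a_hat = adj + np.eye(adj.shape[0])                  # add self-loops
    a_norm = a_hat / a_hat.sum(axis=1, keepdims=True)   # row-normalize
    return np.maximum(a_norm @ features @ weight, 0.0)  # aggregate + transform
```

Stacking such layers lets each node see progressively larger neighborhoods, which is the "center-surround filter" analogy on the previous slide.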

  21. LinKG step 1: paired subgraph construction • Subgraph nodes • direct (heterogeneous) neighbors, including coauthors, papers, and venues • coauthors' papers and venues (2-hop ego networks) • Merge pre-linked entities (papers or venues) • Construct fixed-size graph
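The 2-hop, fixed-size ego-network construction above can be sketched as a truncated BFS over an adjacency dict. This is an illustrative sketch with my own names and size budget, not the paper's exact sampling procedure:

```python
from collections import deque

def ego_subgraph(adj, center, hops=2, max_nodes=50):
    """Collect the `hops`-hop ego network of `center` by BFS over an
    adjacency dict {node: [neighbors]}, truncated to max_nodes so
    every author gets a same-budget subgraph."""
    seen = {center}
    order = [center]
    queue = deque([(center, 0)])
    while queue and len(order) < max_nodes:
        node, depth = queue.popleft()
        if depth == hops:
            continue
        for nb in adj.get(node, []):
            if nb not in seen:
                seen.add(nb)
                order.append(nb)
                queue.append((nb, depth + 1))
                if len(order) >= max_nodes:
                    break
    return order
```

In LinKG the neighbors are heterogeneous (coauthors, papers, venues), and pre-linked paper/venue nodes from the earlier modules are merged across the two subgraphs of a candidate pair.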

  22. Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT) • Input node features (in subgraphs) • Semantic embedding: average word embedding of author attributes • Structure embedding: network embedding trained on a large heterogeneous graph (e.g. LINE)

  23. Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT) • Encoder layers • Attention coefficient attn learned by a self-attention mechanism • Normalized attention coefficient differentiates different types of entities: it is the aggregation weight of a source entity's embedding on the target entity

  24. Step 2: linking based on Heterogeneous Graph Attention Networks (HGAT) • Encoder layers (cont.) • Multi-head attention • Two graph attention layers in the encoder • Decoder layers • Fuse embeddings of candidate pairs (concatenation and element-wise multiplication), and use fully-connected layers to produce the final matching score
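The decoder's fusion step can be sketched as follows: concatenate the two candidate-author embeddings together with their element-wise product, then score the pair with a fully-connected layer and a sigmoid. A minimal single-layer sketch (the paper uses deeper fully-connected layers; names and shapes here are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def match_score(h1, h2, w, b):
    """Decoder sketch: fuse two candidate-author embeddings
    [h1; h2; h1*h2], then one fully-connected layer + sigmoid
    yields the matching probability."""
    fused = np.concatenate([h1, h2, h1 * h2])
    return sigmoid(fused @ w + b)
```

The element-wise product term lets the scorer react directly to dimension-wise agreement between the two embeddings, which pure concatenation would have to learn.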

  25. Author linking model — heterogeneous graph attention • Heterogeneous subgraph for a candidate author pair • Different attention parameters for different entity types • Attention coefficient

  26. Experiment Setup • Datasets • Baselines • Rule-based method: Keyword • Traditional ML method: SVM & Dedupe • SOTA author linking model • COSNET: based on factor graph model • MEgo2Vec: based on graph neural networks

  27. Experimental results • LSTM-based method • CNN-based method

  28. Model variants of paper linking • Table 2: paper linking performance • Table 3: running time of different methods for paper linking (in seconds) • 100x prediction speed-up

  29. OAG: Open Academic Graph https://www.openacademic.ai/oag/

  30. Applications • Data integration • Graph mining • collaboration and citation • Text mining • title and abstract • Science of science … • Citation Network Dataset https://www.aminer.cn/citation

  31. Thank You • Code: https://github.com/zfjsail/OAG • Data: https://www.openacademic.ai/oag/
