Automated Scientific Paper Classification

Linlin Jia

Outline • Motivation • Related Work • Problem Setting • Basic Idea

Motivation • Search and organize papers into the required categories according to different needs • Improving the precision of Web search • Community Information Management (DBLife / Libra / DBRef) • Personal Information Management • Paper-reviewer dispatch • Any application requiring paper organization or selective and adaptive document dispatching • Mining topic trends and key factors in the research evolution process

Related Work • Knowledge Engineering (1960s) • Machine learning (since the 1990s) • Naive Bayes • K-nearest neighbors • Support vector machines • Maximum entropy • Neural networks • Decision trees • Similarity measures • Bag-of-words • Cosine • Okapi • Drawback of content-based methods
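As an illustration of the bag-of-words and cosine measures listed above, here is a minimal sketch. The whitespace tokenizer and the sample texts are assumptions made for the example; they are not taken from the slides.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count term occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample texts, for illustration only.
doc1 = bag_of_words("citation analysis for paper classification")
doc2 = bag_of_words("paper classification with citation links")
doc3 = bag_of_words("protein folding simulation")

print(cosine_similarity(doc1, doc2))  # shares three of five terms
print(cosine_similarity(doc1, doc3))  # shares no terms
```

Raw counts are used here for brevity; a weighting scheme such as TF-IDF or Okapi would replace the counts without changing the cosine step.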

Related Work • Measures of the relationship between two documents (web pages/papers) • Small1973 • Co-citation: papers A and B are associated because they are both cited by C, D, E and F. • Kessler1963 • Bibliographic coupling: A and B are related because they both cite papers C, D, E and F. • Amsler1972 • Amsler: A and B are related if (1) A and B are cited by the same paper, or (2) A and B cite the same paper, or (3) A cites a third paper C that cites B. • DeanH1999 • Companion Algorithm (extends HITS) • [Figures: citation diagrams illustrating co-citation and bibliographic coupling for papers A and B]
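The three citation-based measures can be sketched on a small citation graph as follows. The edge list is hypothetical, and the Jaccard-style normalization of the Amsler measure is one common choice, assumed here rather than taken from the slides.

```python
def build_indexes(citations):
    """Build lookup tables from an iterable of (citing, cited) pairs."""
    cites = {}     # paper -> set of papers it cites
    cited_by = {}  # paper -> set of papers that cite it
    for src, dst in citations:
        cites.setdefault(src, set()).add(dst)
        cited_by.setdefault(dst, set()).add(src)
    return cites, cited_by


def cocitation(a, b, cited_by):
    """Number of papers that cite both a and b."""
    return len(cited_by.get(a, set()) & cited_by.get(b, set()))


def bibliographic_coupling(a, b, cites):
    """Number of papers cited by both a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def amsler(a, b, cites, cited_by):
    """Overlap of the citation neighborhoods (citing and cited papers) of a and b.

    Assumption: normalized by the size of the union of both neighborhoods.
    """
    na = cites.get(a, set()) | cited_by.get(a, set())
    nb = cites.get(b, set()) | cited_by.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Hypothetical edges: C and D cite both A and B; A and B both cite G;
# A cites H, which cites B; B also cites I.
edges = [("C", "A"), ("C", "B"), ("D", "A"), ("D", "B"),
         ("A", "G"), ("B", "G"), ("A", "H"), ("H", "B"), ("B", "I")]
cites, cited_by = build_indexes(edges)

print(cocitation("A", "B", cited_by))
print(bibliographic_coupling("A", "B", cites))
print(amsler("A", "B", cites, cited_by))
```

Paper H contributes to the Amsler score through condition (3): it is cited by A and cites B, so it sits in both neighborhoods.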

Related Work • Hybrid methods • PMENBM03 • Combining link-based and content-based methods using a Bayesian network • CaladoCMZNG • Combining the decisions of linkage and text classifiers using a belief network strategy • Fusion of evidence • JoachimsCT2001 • Studies linear combinations of support vector machine kernel functions representing co-citation and textual information
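A linear combination of a textual kernel and a co-citation kernel, in the spirit of the last bullet, might look like the sketch below. The matrices, the function name `combine_kernels`, and the single mixing weight are illustrative assumptions, not the formulation of the cited work.

```python
def combine_kernels(k_text, k_link, weight):
    """Return weight * k_text + (1 - weight) * k_link for two square matrices.

    A convex combination of two valid kernel matrices is again a valid kernel,
    so the result can be passed to any kernel-based learner.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    n = len(k_text)
    if len(k_link) != n:
        raise ValueError("kernel matrices must have the same size")
    return [
        [weight * k_text[i][j] + (1.0 - weight) * k_link[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical similarity matrices over three papers.
k_text = [[1.0, 0.2, 0.0],
          [0.2, 1.0, 0.4],
          [0.0, 0.4, 1.0]]
k_link = [[1.0, 0.8, 0.1],
          [0.8, 1.0, 0.0],
          [0.1, 0.0, 1.0]]

combined = combine_kernels(k_text, k_link, weight=0.5)
print(combined[0][1])
```

In practice the weight would be tuned on held-out data; with a low link density, a larger weight on the textual kernel is a plausible starting point.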

Related Work • ZhangGFCFCC2004 • ZhangCFFGCC2005 • Non-linear similarity functions found through Genetic Programming techniques • VelosoMCGZ2006 • Rule-based combination • Drawbacks of the above methods • Low precision when the data set has low link density • Not multi-label • Only high-level categories • Need a large test set

Problem Setting • Definition • C = {c1, c2, c3, …, cn} is a set of predefined categories. • D = {d1, d2, d3, …, dm} is a set of scientific papers. • Φ: D × C → {T, F}, where Φ(d, c) = T means that paper d belongs to category c. • The metadata of the papers is stored in a database. • The categories are not just symbolic labels; their meaning is available. • Some exogenous knowledge (i.e., data provided for classification purposes by an external source) is available. In particular, metadata such as publication date, document type, and publication source is assumed to be available.
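The definition above treats classification as a Boolean function over (paper, category) pairs, which naturally allows a paper to receive several labels. A minimal sketch follows; the sample papers, the category descriptions, and the keyword-overlap rule used for Φ are all hypothetical stand-ins for a real classifier.

```python
# Hypothetical category descriptions ("their meaning is available").
categories = {
    "c1": "database systems query",
    "c2": "machine learning classification",
}

# Hypothetical paper metadata, as it might be stored in a database.
papers = {
    "d1": {"title": "Query optimization in database systems", "year": 2005},
    "d2": {"title": "Learning classification rules from database query logs",
           "year": 2007},
}


def phi(paper_id, category_id):
    """Toy decision function D x C -> {True, False}.

    Assumption: a paper belongs to a category when its title shares at least
    two terms with the category description.
    """
    title_terms = set(papers[paper_id]["title"].lower().split())
    category_terms = set(categories[category_id].split())
    return len(title_terms & category_terms) >= 2


def labels_for(paper_id):
    """All categories c for which phi(paper, c) is True (multi-label)."""
    return {c for c in categories if phi(paper_id, c)}


print(labels_for("d1"))
print(labels_for("d2"))
```

Because Φ is evaluated per pair, adding a category only requires evaluating the new column of the D × C table.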

Analysis • Shortcomings of existing work • Results cannot be interpreted • No use of network-based machine learning methods • Need a large data set and high link density • Extend the source • Authors with different backgrounds • Cross topics • Multi-label • Topic evolution • Time factor

Basic Idea • Ci = <L, Di> • L: label • Di: the set of papers classified under L (known papers of user i and other papers in directories named L) • [Figure: papers d1 to d5 mapped to categories c1 to c4, which are user directories in DBRef]

Basic Idea • Step 1: extended content-based method • Extend the text content using CiteSeer to overcome the limitation of a small data set • Step 2: extended link-based method • Add extra links to overcome the limitation of a low-density data set • Step 3: combine

Basic Idea • [Figure: graph over papers A, B, C, D, E and F]

Author Information • Social network (co-author network) • How to combine the social network and the citation network? • Method 1 • Compute the distance (dist) between P1(A, B, C, D, E) and P2(A, C, B, D, E) • Compute P(ci | dist)

Time Information • MourãoRA2008 • How to express the effect of the temporal factor? • Does the temporal factor affect the result of the link-based method?

Citation Text Information • CiteSeer • Citation text from papers external to our collection will be added

Location Information • One word at different locations • Experiment: abstract • A word that occurs frequently should be deleted • Experiment: keywords/general terms • The main content of the paper is exp. • One citation at different locations • Citing A in the introduction/background section • Citing A in the experiments section
