This paper outlines the motivation, related work, problem setting, and basic idea of utilizing automated methods to classify scientific papers. By organizing papers into categories based on user needs, improving web search accuracy, and managing community information, it aims to enhance document organization and reviewer dispatch. It explores mining topic trends and research evolution factors, addressing drawbacks of content-based methods through hybrid approaches and fusion of evidence. The goal is to interpret results, utilize network-based machine learning, and address cross-topic categorization needs for evolving research fields.
Automated Scientific Paper Classification Linlin Jia
Outline • Motivation • Related Work • Problem Setting • Basic Idea
Motivation • Search and organize papers into the categories required by different needs • Improve the precision of Web search • Community information management (DBLife / Libra / DBRef) • Personal information management • Paper-reviewer dispatch • Any application requiring paper organization or selective and adaptive document dispatching • Mining topic trends and key factors in the research evolution process
Related Work • Knowledge engineering (1960s) • Machine learning (since the 1990s) • Naive Bayes • K-nearest neighbors • Support vector machines • Maximum entropy • Neural networks • Decision trees • Similarity measures • Bag-of-words • Cosine • Okapi • Drawbacks of content-based methods
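The bag-of-words and cosine measures listed on this slide can be illustrated with a minimal sketch. The whitespace tokenizer and the function names `bag_of_words` and `cosine_similarity` are assumptions made for illustration; the slides do not specify a tokenizer or a term weighting scheme.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term counts from a lowercased, whitespace-split text (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("citation graph mining")
doc2 = bag_of_words("mining the citation graph")
similarity = cosine_similarity(doc1, doc2)
```

Because the measure looks only at shared terms, two papers on the same topic that use different vocabulary score low, which is one reason the later slides turn to citation links.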
Related Work • Measures of the relationship between two documents (web pages/papers) • Small1973: co-citation • Kessler1963: bibliographic coupling • Amsler1972: Amsler measure • DeanH1999: Companion algorithm (extends HITS) • Co-citation: papers A and B are associated because they are both cited by C, D, E and F • Bibliographic coupling: A and B are related because they both cite papers C, D, E and F • Amsler: A and B are related if (1) A and B are cited by the same paper, or (2) A and B cite the same paper, or (3) A cites a third paper C that cites B
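The three citation-based measures can be sketched over a small citation map. This is an illustrative sketch, not the cited authors' code: the `citations` dictionary layout (citing paper mapped to the set of papers it cites) and the Jaccard-style normalization are assumptions.

```python
def cited_by(citations, paper):
    """Set of papers that cite `paper`."""
    return {src for src, refs in citations.items() if paper in refs}


def jaccard(x, y):
    """Overlap of two sets, normalized to [0, 1] (assumed normalization)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def co_citation(citations, a, b):
    """High when a and b are cited by the same papers."""
    return jaccard(cited_by(citations, a), cited_by(citations, b))


def bibliographic_coupling(citations, a, b):
    """High when a and b cite the same papers."""
    return jaccard(citations.get(a, set()), citations.get(b, set()))


def amsler(citations, a, b):
    """Compares the combined citing and cited neighbors of a and b,
    which covers all three conditions listed for the Amsler measure."""
    neighbors_a = cited_by(citations, a) | citations.get(a, set())
    neighbors_b = cited_by(citations, b) | citations.get(b, set())
    return jaccard(neighbors_a, neighbors_b)


# A and B are both cited by C, D, E and F, as in the co-citation example.
example = {p: {"A", "B"} for p in ("C", "D", "E", "F")}
score = co_citation(example, "A", "B")
```

All three measures return 0 for papers with no links at all, which is the low-link-density weakness named later in the deck.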
Related Work • Hybrid methods • PMENBM03 • Combining link-based and content-based methods using a Bayesian network • CaladoCMZNG • Combining the decisions of linkage and text classifiers using a belief network strategy • Fusion of evidence • JoachimsCT2001 • Studies linear combinations of support vector machine kernel functions representing co-citation and textual information
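A linear combination of a text kernel and a co-citation kernel can be sketched as a convex combination of two Gram matrices. This is only a sketch of the general idea: the function name `combine_kernels`, the single mixing parameter `weight`, and the toy matrices are assumptions, not details taken from the cited work.

```python
def combine_kernels(k_text, k_cocite, weight):
    """Entry-wise weight * K_text + (1 - weight) * K_cocite.

    A convex combination of two positive semidefinite kernel matrices is
    itself positive semidefinite, so the result is still a valid SVM kernel.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * t + (1.0 - weight) * c for t, c in zip(row_t, row_c)]
        for row_t, row_c in zip(k_text, k_cocite)
    ]


# Toy 2x2 Gram matrices for two papers (made-up values for illustration).
k_text = [[1.0, 0.2], [0.2, 1.0]]
k_cocite = [[1.0, 0.8], [0.8, 1.0]]
k_combined = combine_kernels(k_text, k_cocite, weight=0.5)
```

In practice the weight would be tuned on held-out data; with `weight=1.0` the combination falls back to the purely content-based kernel.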
Related Work • ZhangGFCFCC2004 • ZhangCFFGCC2005 • Non-linear similarity functions found through Genetic Programming techniques • VelosoMCGZ2006 • Rule-based combination • Drawbacks of the above methods • Low precision when the data set has low link density • Not multi-label • Only high-level categories • Need a large test set
Problem Setting • Definition • C = {c1, c2, c3, …, cn} is a set of predefined categories • D = {d1, d2, d3, …, dm} is a set of scientific papers • Φ: D × C → {T, F} • The metadata of the papers is stored in a database • The categories are not just symbolic labels; their meaning is available • Some exogenous knowledge (i.e., data provided for classification purposes by an external source) is available; in particular, metadata such as publication date, document type and publication source is assumed to be available
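The definition above can be written down as a small data model. The sketch below is illustrative only: the `Paper` fields, the `make_phi` helper and the example category names are assumptions, chosen to show that a decision function over D × C lets one paper carry several labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    paper_id: str
    title: str
    year: int   # exogenous metadata assumed to be available
    venue: str  # exogenous metadata assumed to be available


def make_phi(assignments):
    """Build Phi: D x C -> {True, False} from known (paper_id, category) pairs."""
    def phi(paper, category):
        return (paper.paper_id, category) in assignments
    return phi


def labels_of(phi, paper, categories):
    """All categories for which Phi(paper, c) is true; may be more than one."""
    return [c for c in categories if phi(paper, c)]


categories = ["databases", "machine learning"]
d1 = Paper("d1", "Learning to index", 2008, "hypothetical venue")
phi = make_phi({("d1", "databases"), ("d1", "machine learning")})
d1_labels = labels_of(phi, d1, categories)
```

A classifier in this setting is an approximation of `phi` learned from the text, links and metadata of the papers, rather than a lookup in a fixed table as in the sketch.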
Analysis • Shortcomings of existing work • Cannot interpret the results • Do not use network-based machine learning methods • Need a large data set and high link density • Extend the source • Authors with different backgrounds • Cross topics • Multi-label • Topic evolution • Time factor
Basic Idea • Ci = <L, Di> • L: label • Di: the set of papers classified under L (the known papers of user i and other papers in directories named L) • [Figure: papers d1 to d5 linked to categories c1 to c4; labels "User directory in DBRef" and "papers"]
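One way to read the definition Ci = <L, Di> is that directories with the same label are pooled across users. The sketch below assumes exactly that, and also assumes that labels match by exact name; the `user_directories` layout and the function name `build_categories` are hypothetical.

```python
from collections import defaultdict


def build_categories(user_directories):
    """Pool the papers of all directories that share a label L.

    user_directories: {user: {label: iterable of paper ids}}
    Returns {label: set of paper ids}, i.e. one Di per label L.
    """
    merged = defaultdict(set)
    for directories in user_directories.values():
        for label, papers in directories.items():
            merged[label] |= set(papers)
    return dict(merged)


directories = {
    "user1": {"XML": {"d1", "d2"}},
    "user2": {"XML": {"d3"}, "IR": {"d4"}},
}
categories = build_categories(directories)
```

Pooling gives each label more training papers than a single user's directory would, at the cost of assuming that two users mean the same thing by the same directory name.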
Basic Idea • Step 1: extended content-based method • Extend the text content with CiteSeer to overcome the limitation of a small data set • Step 2: extended link-based method • Add extra links to overcome the limitation of a low-link-density data set • Step 3: combine the two methods
Basic Idea • [Figure: graph with nodes A, B, C, D, E and F]
Author Information • Social network (co-author network) • How can the social network and the citation network be combined? • Method 1 • Compute the distance between P1(A,B,C,D,E) and P2(A,C,B,D,E) • Compute P(ci|dist)
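The slide does not say which distance is meant. If P1 and P2 are read as two orderings of the same items, one candidate is the Kendall tau distance, which counts the pairs that appear in opposite order. The sketch below is an assumption about the intended measure, not the author's stated method.

```python
from itertools import combinations


def kendall_tau_distance(p1, p2):
    """Number of item pairs ordered differently in p1 and p2."""
    if len(set(p1)) != len(p1) or sorted(p1) != sorted(p2):
        raise ValueError("p1 and p2 must be orderings of the same distinct items")
    position = {item: index for index, item in enumerate(p2)}
    # combinations() yields (x, y) with x before y in p1; count the pairs
    # where p2 places x after y.
    return sum(1 for x, y in combinations(p1, 2) if position[x] > position[y])


p1 = ["A", "B", "C", "D", "E"]
p2 = ["A", "C", "B", "D", "E"]
dist = kendall_tau_distance(p1, p2)  # only the pair (B, C) is swapped
```

Estimating P(ci|dist) would then need labeled pairs from which to count how often each category occurs at each distance value; the slides leave that step open.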
Time Information • MourãoRA2008 • How can the effect of the temporal factor be expressed? • Does the temporal factor affect the results of the link-based method?
Citation Text Information • CiteSeer • Citation text from papers external to our collection will be added
Location Information • The same word at different locations • Experiment: abstract • A word that occurs frequently should be deleted • Experiment: keywords/general terms • The main content of the paper is exp. • The same citation at different locations • Citing A in the introduction/background section • Citing A in the experiments section