This paper proposes a (semi-)automatic method for term identification to improve bibliometric mapping accuracy. Traditional approaches rely heavily on manual expert judgment, which can be subjective and labor-intensive. Our method utilizes part-of-speech tagging, lemmatization, and statistical techniques to identify significant terms from scientific texts. We evaluated its effectiveness through a case study in operations research, analyzing over 7,000 abstracts. While the results are promising, manual verification remains essential for ensuring precision in identified terms, paving the way for improved science policy decision-making.
Automatic Term Identification for Bibliometric Mapping
Nees Jan van Eck, Ludo Waltman
Erasmus University Rotterdam, The Netherlands
{nvaneck,lwaltman}@few.eur.nl
Ed Noyons, Renald Buter
Centre for Science and Technology Studies, Leiden University, The Netherlands
{noyons,buter}@cwts.leidenuniv.nl
10th International Conference on Science and Technology Indicators, Vienna, September 18, 2008
Research problem
• Important authors or journals in a field can be identified relatively easily based on the number of citations (i.e., frequency of occurrence in reference lists)
• Identifying important terms based on frequency of occurrence gives poor results, yielding many very general terms
• Terms are therefore usually identified manually based on expert judgment. This has the disadvantage of being:
• subjective
• labor-intensive
• We propose a method for (semi-)automatic term identification
Method (1)
• General overview of the proposed method: corpus → Step 1: calculation of unithood → linguistic units → Step 2: calculation of termhood → terms
• Step 1 involves:
• part-of-speech tagging
• lemmatization (stemming)
• identifying noun phrases (linguistic filter)
• identifying linguistic units (statistical filter; Dunning, 1993)
• Step 1 results in a list of linguistic units (noun phrases) that may or may not be terms
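The statistical filter in Step 1 cites Dunning's (1993) log-likelihood ratio. As a sketch of how such a filter can score a candidate two-word unit (the exact setup used by the authors may differ), assuming a 2×2 contingency table of bigram counts:

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11 = count of the bigram (w1 w2), k12 = w1 without w2,
    k21 = w2 without w1, k22 = neither. A high score suggests the
    two words co-occur far more often than chance, i.e. they may
    form a genuine multiword linguistic unit.
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22

    def term(k, e):
        # Observed count k against expected count e; 0*log(0) -> 0.
        return k * math.log(k / e) if k > 0 else 0.0

    return 2.0 * (term(k11, row1 * col1 / n)
                  + term(k12, row1 * col2 / n)
                  + term(k21, row2 * col1 / n)
                  + term(k22, row2 * col2 / n))

# Hypothetical counts: "supply chain" 80 times; "supply" 120 times,
# "chain" 100 times, 10,000 bigrams in total.
collocation_score = llr(80, 40, 20, 9860)
chance_score = llr(1, 119, 99, 9781)  # same margins, near-chance co-occurrence
```

Units whose score exceeds a chosen threshold pass the statistical filter; the threshold itself is a tuning parameter not specified on the slide.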
Method (2)
• Step 2 is based on the following idea: a linguistic unit whose occurrences in a corpus of scientific texts are biased toward one or more topics is likely to refer to a domain-specific concept and, consequently, to be a term
Method (3)
• How can different topics be identified in a corpus of scientific texts?
• We use a statistical latent class model called probabilistic latent semantic analysis (PLSA; Hofmann, 2001)
• PLSA provides a kind of fuzzy clustering of the linguistic units occurring in a corpus
• Each cluster corresponds with a topic
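A minimal EM implementation of PLSA illustrates the fuzzy clustering the slide refers to; this is a generic sketch of Hofmann's (2001) model, not the authors' code, and the toy corpus and topic count are assumptions.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Minimal PLSA via EM (Hofmann, 2001).

    counts: (n_docs, n_terms) matrix of occurrence counts.
    Returns P(w|z) with shape (n_topics, n_terms) and P(z|d) with
    shape (n_docs, n_topics). Each linguistic unit's distribution
    over topics is its fuzzy cluster membership.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = counts.shape
    p_w_z = rng.random((n_topics, n_terms))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (docs, topics, terms)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        weighted = counts[:, None, :] * joint  # n(d,w) * P(z|d,w)
        # M-step: re-estimate P(w|z) and P(z|d) from the responsibilities
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# Toy corpus: two documents about one topic, two about another.
toy_counts = np.array([[5., 5., 0., 0.],
                       [4., 6., 0., 0.],
                       [0., 0., 5., 5.],
                       [0., 0., 6., 4.]])
p_w_z, p_z_d = plsa(toy_counts, n_topics=2)
```

Because the cluster memberships are probabilities rather than hard assignments, a unit can belong partially to several topics, which is what makes the clustering "fuzzy".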
Method (4)
• The termhood of a linguistic unit is determined using an entropy-like criterion
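The slide does not give the formula, so the following is one plausible instantiation of an entropy-like termhood criterion (an assumption, not the authors' exact definition): a unit whose occurrences concentrate on few topics has low entropy over topics and therefore high termhood.

```python
import math

def termhood(topic_probs):
    """Entropy-based termhood sketch (assumed form, not the paper's formula).

    topic_probs: the unit's probability distribution over topics.
    Returns a value in [0, 1]: 1 for a unit tied to a single topic,
    0 for a unit spread uniformly over all topics (a general term).
    """
    h = -sum(p * math.log(p) for p in topic_probs if p > 0)
    h_max = math.log(len(topic_probs))
    return 1.0 - h / h_max if h_max > 0 else 1.0

# A topic-specific unit versus a general-language unit (4 topics)
specific = termhood([0.85, 0.05, 0.05, 0.05])
general = termhood([0.25, 0.25, 0.25, 0.25])
```

Under this sketch, very general words such as "approach" or "result", which occur across all topics, score near zero and are filtered out.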
Application
• The proposed method is used to construct a term map of the operations research (OR) field
• The map is based on 7492 abstracts of papers published in OR journals between 2001 and 2005
• A two-step approach is taken:
• First, terms are identified using the proposed method
• Second, the relations between terms are visualized using the VOS method
• The proposed method is evaluated in two ways:
• Evaluation of the terms based on the criteria of precision and recall
• Evaluation of the term map based on a survey among OR experts
Precision and recall
• The proposed method (‘PLSA’) outperforms both a simplified variant without PLSA (‘No PLSA’) and a naïve method based on frequency of occurrence (‘Frequency’)
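The evaluation criteria can be stated concretely: precision is the fraction of identified units that are true terms, and recall is the fraction of gold-standard terms that the method identifies. A minimal sketch, with hypothetical term lists:

```python
def precision_recall(identified, gold):
    """Precision and recall for term identification.

    identified: terms produced by the method.
    gold: the reference (e.g. expert-curated) term list.
    """
    identified, gold = set(identified), set(gold)
    tp = len(identified & gold)  # correctly identified terms
    return tp / len(identified), tp / len(gold)

# Hypothetical example: 2 of 3 identified units are real terms,
# and 2 of 4 gold-standard terms were found.
p, r = precision_recall(["heuristic", "supply chain", "paper"],
                        ["heuristic", "supply chain",
                         "vehicle routing", "queueing"])
```

A frequency-based baseline tends to score well on recall but poorly on precision, since very general high-frequency units dilute the identified list.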
Survey
• So far, three OR experts have responded (two assistant professors and one full professor)
Conclusions
• The results of the proposed method for (semi-)automatic term identification seem promising
• For accurate results, manual verification of the identified terms remains necessary
• The proposed method should be seen as a first step toward more accurate term maps for science-policy decision-making