Metadata characteristics as predictors for editor selectivity in a current awareness service

Thomas Krichel & Nisa Bakkalbasi 2005-10-31

outline • Background to work that we did • RePEc (Research Papers in Economics) • NEP: New Economics Papers • The research • Theory • Method • Results • Other work done for NEP.

RePEc • Digital library for academic Economics. It collects descriptions of • economics documents (working papers, articles etc) • collections of those documents • economists • collections of economists • Pioneering effort to create a relational dataset describing an academic discipline as a whole. • The data is freely available.

RePEc principle • Many archives • Archives offer metadata about digital objects or authors and institutions data. • One database • Many services • Users can access the data through many interfaces. • Providers of archives offer their data to all interfaces at the same time. This provides for an optimal distribution.

it's the incentives, stupid • RePEc applies the ideas of open source to the construction of a bibliographic dataset. It provides an open library. • The entire system is constructed in such a way as to be sustainable without monetary exchange between participants.

some history • Thomas Krichel in the early 1990s dreamed about a current awareness service for working papers. It would later have electronic papers. • In 1993 he made the first economics working paper available online. • In 1997 he wrote the key protocols that “govern” RePEc.

RePEc is based on 500+ archives • WoPEc • EconWPA • DEGREE • S-WoPEc • NBER • CEPR • Elsevier • US Fed in Print • IMF • OECD • MIT • University of Surrey • CO PAH • Blackwell

to form a 340+k item dataset • 161,000 working papers • 180,000 journal articles • 1,300 software components • 1,200 book and chapter listings • 8,000 author contact & publication listings • 9,100 institutional contact listings • more records than arXiv.org

RePEc is used in many services • EconPapers • NEP: New Economics Papers • Inomics • RePEc author service • Z39.50 service by the DEGREE partners • IDEAS • RuPEc • EDIRC • LogEc • CitEc

NEP: New Economics Papers • This is a set of current awareness reports on new additions to the working paper stock only. Journal articles would be too old. • Founded by Thomas Krichel in 1998. • Supported by the Economics department at WUStL. • Initial software was written by Jose Manuel Barrueco Cruz. • First general editor was John S. Irons.

why NEP • Public aim: Current awareness, if well done, can be an important service in its own right. It is sheltered from the competition of general search engines. • Private aim: It is useful to have some classification information, even if it is limited. This should be useful in performance measures within subject areas.

modus operandi: stage 1 • The general editor uses a computer program that gathers all the new additions to the working paper stock. This is usually done weekly. • S/he filters out new descriptions of old papers, using • the date field • handle heuristics • The result is an issue of the nep-all report.

modus operandi: stage 2 • Editors consider the papers in the nep-all report to filter out papers that belong to the subject. This forms an issue of a subject report nep-???. • nep-all and the subject reports are circulated via email. • A special arrangement makes the data of NEP available to other RePEc services.

some numbers • There are now 60+ NEP lists. • Over 37k subscriptions. • Close to 16k subscribers. • Over 50k papers announced. • Over 100k announcements. • Homepage at http://nep.repec.org • All this is a fantastic success!!

problem with the private aim • We would need to have all the papers classified, not only the working papers. • We would need to have 100% coverage of NEP. • This means every paper in nep-all appears in at least one subject report.

coverage ratio • We call the coverage ratio the share of papers in nep-all that have been announced in at least one subject report. • We can define this ratio • for each nep-all issue • for a subset of nep-all issues • for NEP as a whole
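The definition can be sketched in a few lines, reading the coverage ratio as the share of nep-all papers that appear in at least one subject report. This is only an illustration: representing an issue as a list of paper handles and the subject reports as a set of announced handles is an assumption, and the sample handles are made up.

```python
def coverage_ratio(nep_all_papers, announced_papers):
    """Share of nep-all papers announced in at least one subject report."""
    papers = set(nep_all_papers)
    if not papers:
        raise ValueError("nep-all issue is empty")
    return len(papers & set(announced_papers)) / len(papers)


# Hypothetical handles, for illustration only.
issue = ["paper:a", "paper:b", "paper:c", "paper:d"]
announced = {"paper:a", "paper:c", "paper:d", "paper:z"}
ratio = coverage_ratio(issue, announced)  # 3 of the 4 papers are covered
```

Passing the pooled papers of several issues instead of a single issue would give the ratio for a subset of nep-all issues or for NEP as a whole.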

coverage ratio theory & evidence • Over time more and more NEP reports have been added. As this happens, we expect the coverage ratio to increase. • However, the evidence, from research by Barrueco Cruz, Krichel and Trinidad is • The coverage ratio of different nep-all issues varies a great deal. • Overall, it remains at around 70%. • We need some theory as to why.

two theories • Target-size theory • Quality theory • descriptive quality • substantive quality

theory 1: target size theory • When editors compose a report issue, they have a size of the issue in mind. • If the nep-all issue is large, editors will take a narrow interpretation of the report subject. • If the nep-all issue is small, editors will take a wide interpretation of the report subject.

target size theory & static coverage • There are two things going on • The opening of new subject reports improves the coverage ratio. • The expansion of RePEc implies that the size of nep-all, though varying in the short run, grows in the long run. Target size theory implies that the coverage ratio deteriorates. • The static coverage ratio that we observe is the result of both effects canceling out.

theory 2: quality theory • George W. Bush version of quality theory • Some papers are rubbish. They will not get announced. • The amount of rubbish in RePEc remains constant. • This implies constant coverage. • Reality is slightly more subtle.

two versions of quality theory • Descriptive quality theory: papers that are badly described • misleading titles • no abstract • languages other than English • Substantive quality theory: papers that are well described, but not good • from unknown authors • issued by institutions with an unenviable research reputation

practical importance • We do care whether one or the other theory is true. • Target size theory implies that NEP should open more reports to achieve perfect coverage. • Quality theory suggests that opening more reports will have little to no impact on coverage. • Since operating more reports is costly, there should be an optimal number of reports.

overall model • We need an overall model that explains subject editors' behavior. • We can feed this model with variables that represent theoretical determinants of behavior. • We can then assess the strength of various factors empirically.

method • The dependent variable is announced. It is 1 if the paper has been announced, 0 otherwise. • Since we are explaining a binary variable, we can use binary logistic regression analysis (BLRA). This is a fairly flexible technique, useful when the probability distributions governing the independent variables are not well known. • That's why BLRA is popular in the life sciences.

independent variables: size • size is the size of the nep-all issue in which the paper appeared. • This is the critical indicator of target size theory. We expect it to have a negative impact on announced.

independent variables: position • position is the position of the paper in the nep-all issue. • The presence of this variable can be justified by the combined assumption of target size and editor myopia. • If editors are myopic, they will be more liberal at the start of nep-all than at the end of nep-all.

independent variables: title • title is the length of the title of the paper, measured by the number of characters. • This variable is motivated by descriptive quality theory. A longer title will say more about the paper than a short title. This makes it less likely that a paper is overlooked.

independent variables: abstract • abstract is the presence/absence of an abstract to the paper. • This is also motivated by descriptive quality theory. • Note that we do not use the length of the abstract because that would be a highly skewed variable.

independent variables: language • language is an indicator of whether the metadata is in English or not. • This variable is motivated by descriptive quality theory and the idea that English is the most commonly understood language. • While there are a lot of multilingual editors, customizing this variable would have been rather hard.

independent variables: series • series is the size of the series in which the paper appears. • This variable is motivated by substantive quality theory. • The larger a series is, the higher, usually, is its reputation. We can roughly classify series by size and quality • multi-institution series (NBER, CEPR) • large departments • small departments

independent variables: author • author is the prolificacy of the authors of the paper. • It is justified by substantive quality theory. • This is the most difficult variable to measure. We use the number of papers written by the registered author with the highest number. • Since about 50% of the papers have no registered author, a lot of them are excluded. But there should be no bias by the exclusion.
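A small sketch of how the author variable could be computed from the description above. The data layout (each paper mapped to the paper counts of its registered authors) and the sample values are hypothetical; only the rule "take the registered author with the highest number, exclude papers with no registered author" comes from the slide.

```python
def author_prolificacy(registered_paper_counts):
    """Return the highest paper count among a paper's registered authors.

    Returns None when no author is registered, so the paper can be excluded.
    """
    if not registered_paper_counts:
        return None
    return max(registered_paper_counts)


# Hypothetical papers: each maps to the paper counts of its registered authors.
papers = {"paper:a": [12, 40, 3], "paper:b": [], "paper:c": [7]}
author = {h: author_prolificacy(c) for h, c in papers.items()}

# Papers without a registered author are dropped from the sample.
usable = {h: v for h, v in author.items() if v is not None}
```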

create categorical variables • size_1 [179, 326) • size_2 [326, 835] • title_1 [55, 77) • title_2 [77, 1945] • position_1 [0.357, 0.704) • position_2 [0.704, 1.000] • series_1 [98, 231) • series_2 [231, 3654]
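The intervals above can be turned into 0/1 indicator variables as in the following sketch. The interval bounds are those listed on the slide; the function name and the treatment of values below the first interval (all indicators zero, that is, an unlisted reference category) are assumptions.

```python
# Interval bounds as listed above: (name, lower, upper, upper_inclusive).
BINS = {
    "size": [("size_1", 179, 326, False), ("size_2", 326, 835, True)],
    "title": [("title_1", 55, 77, False), ("title_2", 77, 1945, True)],
    "position": [("position_1", 0.357, 0.704, False),
                 ("position_2", 0.704, 1.000, True)],
    "series": [("series_1", 98, 231, False), ("series_2", 231, 3654, True)],
}


def dummies(variable, value):
    """Return a 0/1 indicator for each listed interval of `variable`.

    Assumption: a value outside every listed interval belongs to the
    reference category, so all of its indicators are 0.
    """
    out = {}
    for name, low, high, closed in BINS[variable]:
        if closed:
            inside = low <= value <= high
        else:
            inside = low <= value < high
        out[name] = int(inside)
    return out


example = dummies("size", 326)  # 326 is the closed lower bound of size_2
```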

results • P(announced=1 | x) = exp(g(x)) / (1 + exp(g(x))) • g(x) = 0.2401 - 0.2774*size_1 - 0.4657*size_2 + 0.1512*title_1 + 0.2469*title_2 + 0.3874*abstract + 0.0001*author + 0.7667*language - 0.1159*series_1 + 0.1958*series_2 • position is not significant. author just makes the cut.
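To make the fitted model concrete, the sketch below evaluates g(x) and the announcement probability for one paper. The coefficients are those reported above; the example paper is hypothetical, and coding language as 1 for English metadata and abstract as 1 when an abstract is present are assumptions about how the indicators were coded.

```python
import math

# Coefficients as reported on the results slide.
INTERCEPT = 0.2401
COEF = {
    "size_1": -0.2774, "size_2": -0.4657,
    "title_1": 0.1512, "title_2": 0.2469,
    "abstract": 0.3874, "author": 0.0001, "language": 0.7667,
    "series_1": -0.1159, "series_2": 0.1958,
}


def g(x):
    """Linear predictor; variables missing from x count as 0."""
    return INTERCEPT + sum(COEF[k] * x.get(k, 0) for k in COEF)


def p_announced(x):
    """P(announced = 1 | x) via the logistic function."""
    return math.exp(g(x)) / (1 + math.exp(g(x)))


# Hypothetical paper: issue size in the size_1 band, long title, abstract
# present, English metadata (assumed coding), large series, and a most
# prolific registered author with 20 papers.
paper = {"size_1": 1, "title_2": 1, "abstract": 1, "language": 1,
         "series_2": 1, "author": 20}
p = p_announced(paper)
```

For this hypothetical paper g(x) is 1.5615, which corresponds to a probability of roughly 0.83 of being announced.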

odds ratio • size_1 1.32 [1.22, 1.44] • size_2 0.83 [0.76, 0.90] • title_1 1.16 [1.07, 1.26] • title_2 1.28 [1.18, 1.39] • abstract 1.47 [1.34, 1.62] • language 2.15 [1.85, 2.51] • series_1 1.11 [1.02, 1.20] • series_2 1.37 [1.26, 1.49] • author 1.05 [1.01, 1.09]

scandal! • Substantive quality theory cannot be rejected. That means that the editors are selecting for quality as well as for the subject. • The editors have rejected our findings. Almost all protest that there is no quality filtering.

consequences • There has been no program to expand the list of reports. • There has to be a concerted effort to help editors really find the subject-specific papers. This can be done by • the use of a more efficient interface • the use of automated resource discovery methods.

ernad • editing reports on new academic documents. It is a purpose-built software system for current awareness reports. • It has been designed by Thomas Krichel, http://openlib.org/home/krichel/work/altai.html • The system was written by Roman D. Shapiro.

statistical learning • The idea is that a computer may be able to make decisions on the current nep-all reports based on the observation of earlier editorial decisions. • ernad now works using support vector machines (SVM), with titles, abstracts, author names, classification values and series as features.

performance criteria • We are not aware of performance criteria for the sorting of papers in a report. • Precision and recall appear useless. • Expected search length and average search length don't appear very attractive. • Thus research into precise criteria is required.

SVM performance • If we use average search length, we can do performance evaluations. • It turns out that reports have very different forecastability. Some are almost perfect, others are weak. • Again, this raises a few eyebrows!
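The slides do not define average search length. As a rough sketch, it is read here as the mean rank at which the papers the editor actually selected appear in the pre-sorted list, so lower values mean a more forecastable report. This reading, the function name and the sample data are assumptions, not ernad's implementation.

```python
def average_search_length(ranked_handles, selected_handles):
    """Mean 1-based rank of the selected papers in the pre-sorted list."""
    selected = set(selected_handles)
    ranks = [i for i, h in enumerate(ranked_handles, start=1) if h in selected]
    if not ranks:
        raise ValueError("no selected paper appears in the ranking")
    return sum(ranks) / len(ranks)


# Hypothetical pre-sorted issue of six papers.
ranking = ["p1", "p2", "p3", "p4", "p5", "p6"]
good = average_search_length(ranking, {"p1", "p2"})  # selected papers on top
weak = average_search_length(ranking, {"p3", "p6"})  # selected papers scattered
```

With two selected papers out of six, the best possible value is 1.5, which the first ranking achieves, while the second yields 4.5.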

what is the value of an editor? • If the forecast is perfect, we don't need the editor. • If the forecast is very weak the editor may be a prankster.

pre-sorting reconceived • We should not think of pre-sorting via SVM as something to replace the editor. • We should not think of it as encouraging editors to be lazy. • Instead, we should think of it as an invitation to examine some papers more closely than others.

headline vs. bottomline data • The editors really have a three-stage decision process. • They read the title and author names. • They read the abstract. • They read the full text. • A lot of papers fail at the first hurdle. • SVM can read the abstract and prioritize papers for abstract reading. • Editors are happy with the pre-sorting system.

Thank you for your attention! nisa.bakkalbasi@yale.edu http://openlib.org/home/krichel/
