
Quality assessment of a current awareness system


Presentation Transcript


  1. Thomas Krichel LIU & НГУ 2007-10-23 Quality assessment of a current awareness system

  2. acknowledgments • Thanks to the organizers. • I am grateful for comments by • Bernardo Bátiz-Lazo • Joanna P. Davies • Marco Novarese • Christian Zimmermann • I thank everybody involved in RePEc and NEP, as well as JISC.

  3. current awareness • Current awareness, aka “selective dissemination of information”, is a simple idea: a user is informed about new documents in her area of interest. • Current awareness generates a double classification • in subject matter • in time

  4. why bother? • It is a niche activity that has been neglected by the search engines. • I have registered with Google and Amazon. They give me tips, but these are generally poor. • We cannot trust computers to do it, • neither on subject matter • nor on time.

  5. types of current awareness • personal (Amazon) vs collective (Google News) • machine generated vs human generated • Actually, I claim the only human-generated current awareness service for academic documents is NEP.

  6. computer-based + keywords • In computer-generated current awareness one can filter for keywords. • In academic digital libraries, since the papers describe new research results, they contain “ideas” that have not previously been seen. • Therefore getting the keywords right in advance is impossible.

  7. computer-based + categories • It is possible to classify documents into categories, say “football” vs “tennis”. • This works fine when the vocabulary used in the different categories is quite different. • For some academic areas the differences are just too subtle, as the sketch below illustrates.
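
To make the vocabulary point concrete, here is a minimal sketch of category-based filtering with a bag-of-words classifier; it is not from the talk, and the training snippets and category names are invented for illustration. It works here only because “football” and “tennis” share almost no vocabulary; two subfields of economics would share most of theirs.

    # Minimal sketch of category-based filtering; the snippets and
    # categories are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "penalty kick goal striker midfield",    # football vocabulary
        "serve ace baseline backhand tiebreak",  # tennis vocabulary
    ]
    train_labels = ["football", "tennis"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    # Distinct vocabularies make this easy...
    print(model.predict(["the striker scored a late goal"]))  # ['football']
    # ...but for subtly different academic areas the shared
    # vocabulary defeats this kind of classifier.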

  8. computers and time problem • In a digital library, the “date” of a document can mean anything. • The metadata may carry a date in some implicit form. • Recently arrived records can be computed. • But record handles may be unstable. • Recently arrived records do not automatically mean new documents.

  9. we need humans • Catalogers are expensive. • We need volunteers to do the work. • Junior researchers have good incentives. They • need to be aware of the latest literature; • are absent from the informal circulation channels of top-level academics; • need to “get their name around” among researchers in the field.

  10. introducing NEP • There is only one large, freely available, human-based current awareness service. • It is “NEP: New Economics Papers” at http://nep.repec.org • The remainder of this talk is about NEP.

  11. NEP: New Economics Papers • NEP is a current awareness system for the working papers in RePEc, a digital library for economics. • Published articles are excluded because they are way too old.

  12. NEP service model • There is a basic model behind this service that we could call the “NEP service model” • two stages... • flat report space...

  13. general two stage setup • First stage: a general editor compiles a list of all new papers. This forms an issue of the “allport”. • Second stage: a group of subject editors filter the new allport issue into subject reports. Each editor does this independently for the subject reports she looks after.

  14. a flat space • There is a series of reports. Each report has a number of issues over time. • There is an “allport”, a report that contains all papers that are new in the period covered by the issue.

  15. first stage in NEP • General editor compiles a list of recent additions to the RePEc working papers data. • Computer generated • Journal articles are excluded • Examined by the General Editor (GE, a person) • This list forms an issue of nep-all.

  16. first stage in NEP • nep-all contains all new papers • Circulated to • nep-all subscribers • Editors of subject-reports

  17. second stage • Each editor creates, independently, a subject report for her subject. She does this by removing papers from nep-all (see the sketch below). • A subject report issue (sri) is the result of this process. • There have been over 47,000 sris issued through the lifetime of NEP to date.
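
As a minimal sketch of this two-stage model (the data structures, names and handles are mine, for illustration only): an allport issue is a list of paper handles, and a subject report issue is the subset an editor keeps.

    # Sketch of the two-stage NEP service model; all names and
    # handles are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Issue:
        report: str    # e.g. "nep-all" or a subject report code
        date: str      # issue date, e.g. "2007-10-23"
        handles: list  # RePEc handles of the announced papers

    def compose_subject_issue(allport, report, keep):
        """Second stage: an editor filters nep-all down to the
        papers relevant to her subject report."""
        kept = [h for h in allport.handles if keep(h)]
        return Issue(report=report, date=allport.date, handles=kept)

    # The editor's judgment is modeled here as a predicate `keep`.
    allport = Issue("nep-all", "2007-10-23",
                    ["RePEc:aaa:series:0001", "RePEc:bbb:series:0002"])
    sri = compose_subject_issue(allport, "nep-xyz",
                                keep=lambda h: h.startswith("RePEc:bbb"))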

  18. history • There are basically two phases in NEP: the pre-ernad phase, 1998 to 2004, and the post-ernad phase. • I will deal with the pre-ernad history here. • Some research on NEP was conducted in the pre-ernad phase. • This has informed the work that went into ernad.

  19. early history • System was conceived by Thomas Krichel • Name “NEP” by Sune Karlsson • Implemented by José Manuel Barrueco Cruz. • Started to run in May 1998.

  20. starting setup • At first the system was all email based. • The nep-all was composed as an email. • It was sent to editors as an email. • Editors used whatever tool they liked to compose their report emails.

  21. web interface • John S. Irons issued the first web interface for report composition on 2000-02-01. • This would just compose the report. • Editors would still cut and paste the results of the form into email clients.

  22. historic mail support • First mail support was given by mailbase.ac.uk. • When this was closed in 2000-11, NEP moved to jiscmail.ac.uk. • Since that mailing list service was only supposed to be for the UK academic community, it was deemed not sustainable. • Thomas Krichel started hosting the lists on 2002-11-16. It is a nightmare.

  23. Aeroflot document • The Aeroflot document was a thinking piece that Thomas Krichel wrote as early as 2001. http://openlib.org/home/krichel/work/aeroflot.html • This paper already sets out ideas for what would be ernad. • At that time the Siberian RePEc team promised help with building such a system.

  24. discover disaster • In 2002-2003 Jeremiah Cochise Trinidad-Christensen and Thomas Krichel were the first people to try to get a systematic picture of how NEP works. • They discovered that this is exceedingly difficult.

  25. mail log parsing • Logs were not moved from Mailbase to JISCMail. • Mailbase removed the logs in 2002-11. Thomas Krichel got them just before they were destroyed. • The mail logs were the only source of historic NEP information.

  26. parsing targets • handles: severely compromised by cut-n-paste operations, editor locales, etc. • date of issue: editors were free to set dates, nep-all dates may not be preserved • time of issue: an email is almost impossible to time.

  27. state of pre-ernad data • After an orgy of regular expressions, we can get some approximate idea of the handles that were used (a sketch of this kind of parsing follows). • Thus the thematic component is roughly intact. • We have a problem with a bug in the discovery program that made many papers appear several times in nep-all. This makes it difficult to associate subject and allport issues.
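
For flavor, a minimal sketch of the kind of regular-expression work involved; the pattern is my approximation of RePEc handle syntax, not the actual code used, and it will still miss handles mangled by cut-and-paste or locale issues.

    # Sketch: pull RePEc handles out of messy mail-log text.
    # The pattern approximates handle syntax (RePEc:archive:series:item)
    # and is not the actual parser used.
    import re

    HANDLE = re.compile(r"repec:[a-z0-9]+:[a-z0-9-]+:[^\s,;]+",
                        re.IGNORECASE)

    def extract_handles(log_text):
        return HANDLE.findall(log_text)

    line = "new this week: RePEc:xxx:series:0001, RePEc:yyy:wpaper:0002"
    print(extract_handles(line))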

  28. state of pre-ernad data • Timing of emails is extremely difficult, even with full headers. • The logs of the Mailbase system only have times for when the email client said it sent the mail. This is the editor's local PC time, which can be years out of whack! • We still have some data for research...

  29. research conducted on NEP • Most of the research conducted on NEP has been done in the pre-ernad phase. • The difficulties of some of this work have informed the construction of ernad.

  30. Chu and Krichel (2003) • Heting Chu & Thomas Krichel (2003) “NEP: Current awareness service of the RePEc digital library” http://www.dlib.org/dlib/december03/chu/12chu.html vaguely talks about NEP. It notes that there is a problem of timeliness in the subject report issues, despite the very shaky data.

  31. Barrueco Cruz et al. (2003) • Jose Manuel Barrueco Cruz, Thomas Krichel and Jeremiah Cochise Trinidad-Christensen “Organizing Current Awareness in a Large Digital Library” http://openlib.org/home/krichel/papers/espoo.pdf has two themes • overlap between reports... • coverage ratio... • as well as history and suggestions.

  32. overlap • Barrueco Cruz et al. (2003) argue that overlap occurs not when the same paper appears in two reports, but when those two reports are read by the same readers. • They have data on pairwise overlap between reports, based on crude membership data (see the sketch below).
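
A minimal sketch of how such pairwise reader overlap might be computed from subscriber lists; the measure (Jaccard) and the data are illustrative, and the paper's exact method may differ.

    # Sketch: pairwise reader overlap between two reports, computed
    # from subscriber lists. Measure and data are illustrative.
    def reader_overlap(subs_a, subs_b):
        """Share of subscribers common to both reports (Jaccard)."""
        if not subs_a or not subs_b:
            return 0.0
        return len(subs_a & subs_b) / len(subs_a | subs_b)

    report_1 = {"alice@example.edu", "bob@example.edu", "carol@example.edu"}
    report_2 = {"bob@example.edu", "carol@example.edu", "dave@example.edu"}
    print(reader_overlap(report_1, report_2))  # 0.5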

  33. overlap puzzle • Here is a puzzle to think about. • If a person is interested in two subject areas because they are close, she will subscribe to both reports. • But since the reports are thematically close, she will sometimes receive the same papers twice. • With mail technology and asynchronous issue generation, this appears difficult to solve.

  34. coverage ratio • We call the coverage ratio the proportion of papers in nep-all that have been announced in at least one subject report. • We can define this ratio • for each nep-all issue • for a subset of nep-all issues • for NEP as a whole • A formal statement follows below.
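
In symbols, a minimal formalization of the definition above (the notation is mine): for a nep-all issue with paper set A, from which subject report issues with paper sets S_1, ..., S_n are drawn,

    \text{coverage ratio} \;=\; \frac{\bigl|\bigcup_{r=1}^{n} S_r\bigr|}{|A|},
    \qquad S_r \subseteq A .

For a subset of nep-all issues, or for NEP as a whole, sum the numerators and denominators over the issues concerned before dividing.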

  35. coverage ratio theory & evidence • Over time more and more NEP reports have been added. As this happens, we expect the coverage ratio to increase. • However, the evidence from the research by Barrueco Cruz, Krichel and Trinidad-Christensen is • The coverage ratio of different nep-all issues varies a great deal. • Overall, it remains at around 70%. • We need some theory as to why.

  36. Krichel & Bakkalbaşı (2005) • Thomas Krichel and Nisa Bakkalbaşı “Developing a predictive model of editor selectivity in a current awareness service of a large digital library” http://openlib.org/home/krichel/papers/boston.pdf

  37. coverage ratio theories • Krichel & Bakkalbaşı (2005) build two theories to explain the observations of Barrueco Cruz et al. (2003) • They are • Target-size theory • Quality theory • descriptive quality • substantive quality

  38. theory 1: target size theory • When editors compose a report issue, they have a target size for the issue in mind. • If the nep-all issue is large, editors will take a narrow interpretation of the report subject. • If the nep-all issue is small, editors will take a wide interpretation of the report subject.

  39. target size theory & static coverage • There are two things going on. • The opening of new subject reports improves the coverage ratio. • The expansion of RePEc implies that the size of nep-all, though varying in the short run, grows in the long run. Target size theory implies that the coverage ratio then deteriorates. • The static coverage ratio is the result of both effects canceling out.

  40. theory 2: quality theory • George W. Bush version of quality theory • Some papers are rubbish. They will not get announced. • The amount of rubbish in RePEc remains constant. • This implies constant coverage. • Reality is slightly more subtle.

  41. 2 versions of quality theory • Descriptive quality theory: papers that are badly described • misleading titles • no abstract • languages other than English • Substantive quality theory: papers that are well described, but not good • from unknown authors • issued by institutions with unenviable research reputation

  42. practical importance • We do care whether one or the other theory is true. • Target size theory implies that NEP should open more reports to achieve perfect coverage. • Quality theory suggests that opening more reports will have little to no impact on coverage. • Since operating more reports is costly, there should be an optimal number of reports.

  43. results • Krichel & Bakkalbaşı (2005) build a binary logistic regression model (see the sketch below). • They find positive evidence for both target size and quality theory. • The NEP editors don't like the results. They insist that they only filter by topic.
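
A rough sketch of what such a model might look like; the predictors are my guesses at the kinds of target-size and quality proxies used, the data are toys, and the actual specification is in the paper.

    # Sketch of a binary logistic regression for editor selectivity.
    # Outcome: 1 if a paper was announced in at least one subject
    # report. Predictors are illustrative proxies; data are toys.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Columns: size of the nep-all issue the paper appeared in
    # (target-size proxy), has_abstract (descriptive-quality proxy).
    X = np.array([[250, 1], [250, 0], [400, 1],
                  [900, 1], [900, 0], [950, 0]])
    y = np.array([1, 1, 1, 1, 0, 0])  # announced or not

    model = LogisticRegression().fit(X, y)
    print(model.coef_, model.intercept_)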

  44. Bátiz-Lazo & Krichel (2005) • Bernardo Bátiz-Lazo and Thomas Krichel “On-line distribution of working papers through NEP: A Brief Business History” http://openlib.org/home/krichel/papers/kassel.pdf has an early history of NEP that covers organizational details I don't talk about here.

  45. ernad • stands for editing reports on new academic documents. • Software system designed by Thomas Krichel at http://openlib.org/home/krichel/work/altai.html • Software written in Perl by Roman D. Shapiro. Cost $2000. • Started to work after 2004-12.

  46. cut editor freedom I • Editors no longer send mail to lists. • Only one email address sends mail. • But the mail appears to come from the editor: From: Marcus Desjardin <ernad@nep.repec.org> Reply-To: Marcus Desjardin <desjardin@econ.louvain.be>

  47. cut editor freedom II • Editors can no longer edit report issue emails, e.g. to add announcements of conferences. • The emails are generated from XML files into standardized text and HTML parts, bound together by MIME multipart/alternative (see the sketch below). • Editors cannot change dates of issue.
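
A minimal sketch of producing such a message, combining the From/Reply-To convention of the previous slide with multipart/alternative; the addresses are the slide's example, the subject line is a placeholder, and the rendering of the XML into the two bodies is elided.

    # Sketch: a report issue email as MIME multipart/alternative,
    # with the From/Reply-To convention of slide 46. Generating the
    # text and HTML bodies from the XML report file is elided.
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = "Marcus Desjardin <ernad@nep.repec.org>"
    msg["Reply-To"] = "Marcus Desjardin <desjardin@econ.louvain.be>"
    msg["Subject"] = "nep-xxx 2005-01-01"  # placeholder
    msg.set_content("plain-text rendering of the report issue")
    msg.add_alternative("<html><body>HTML rendering of the report "
                        "issue</body></html>", subtype="html")
    # msg is now multipart/alternative; hand it to an SMTP client.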

  48. help editors • Provide a simple-to-use interface for the composition of reports • provide an easy-to-scroll input • allow for easy sorting of reports • do a better job at pretty-printing • Get ready for the introduction of pre-sorting • Actually, presorting was only introduced in 2005-08.

  49. statistical learning • The idea is that a computer may be able to make decisions on the current nep-all issue based on observation of earlier editorial decisions. This is known as pre-sorting. • Thomas Krichel “Information retrieval performance measures for a current awareness report composition aid” http://openlib.org/home/krichel/sendai.pdf deals with the evaluation of presorting.

  50. presorting • When an allport issue is created, it is presorted. • In the allport rif each paper has a number in document order. That number is still reported in the presorted rif. • The method is support vector machines (SVM), using svm_light. • A sketch follows.
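
A rough sketch of presorting in this spirit; it uses scikit-learn's linear SVM rather than svm_light itself, and the abstracts and editorial decisions are invented.

    # Sketch: presort a new allport issue for one subject report by
    # training a linear SVM on the editor's past decisions.
    # Uses scikit-learn instead of svm_light; data are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    past_abstracts = ["wage bargaining and unions",
                      "option pricing under volatility",
                      "labour supply elasticity estimates",
                      "portfolio risk and hedging"]
    editor_kept = [1, 0, 1, 0]  # 1 = announced in the subject report

    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(past_abstracts, editor_kept)

    new_issue = ["minimum wage and employment", "bond yield curves"]
    scores = model.decision_function(new_issue)
    # Show papers in decreasing order of score, keeping each paper's
    # original document-order number alongside, as the rif does.
    print(sorted(zip(scores, range(1, len(new_issue) + 1)), reverse=True))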
