
Developing a Web Integrated Database ~Data mining, data quality and performance evaluation~

國立中山大學 (National Sun Yat-Sen University), Kaohsiung, December 2006. Y. Adachi (足立泰)


Presentation Transcript


  1. Developing a Web Integrated Database ~Data mining, data quality and performance evaluation~ 國立中山大學 (National Sun Yat-Sen University), Kaohsiung, December 2006. Y. Adachi (足立泰), Regional Manager, Asia Pacific, Elsevier

  2. Agenda • Developing a web integrated citation database • User Centered Design • Why a publisher? • Searching – four domains of information space • Integrating our all-science search engine • Ensuring quality of citations • Technology behind citation count and citation matching • Data flow and processing • References matching • Evaluating scientific research output • Why is evaluation so important? • Case study – evaluating an author

  3. Define: data mining Intro

  4. Developing a web integrated database

  5. Researching Research: research carried out on behalf of Elsevier by Redesign Research at the University of Toronto, Department of Pharmacology & Pharmaceutical Sciences

  6. User Centered Design Approach… Focus on what they do, not on what they say… Users "think aloud"

  7. Starting from the users’ needs • If we understand the researcher workflow we can design better products • Understand: users, their tasks, and their work environments • Design: user interfaces that enable users to achieve their goals efficiently • Evaluate: product designs with users throughout the product lifecycle

  8. Why a publisher?

  9. Hundreds of new editors per year • 10-20 new journals per year • Article submissions: 500,000+ Organize editorial boards Launch new specialist journals • 200,000 referees • 1 million referee reports per year Solicit and manage submissions • 1,800+ journals • 7.5 million articles Archive and promote Manage peer review • 40-90% of articles rejected • 7,000 editors • 70,000 editorial board members • 6.5 million author/publisher communications per year Publish and disseminate Edit and prepare • 20 million researchers • 6,000+ institutions • 180+ countries • 240 million+ downloads per year • 2.5 million print pages per year Production • 250,000 new articles produced per year • 180 years of back issues scanned, processed and data-tagged

  10. Technologies that drive the process: Organize editorial boards • Launch new specialist journals • Solicit and manage submissions • Manage peer review • Edit and prepare • Production • Publish and disseminate • Archive and promote. Systems: Electronic Warehouse, eJournal Backfiles, eReference Works, Production Tracking System

  11. How do users cope with this complex environment?

  12. Searching the four domains Websites and digital archives Patents Peer reviewed literature Institutional repositories Science Medicine Technology Social sciences

  13. Increased use of web documents • Increase in number of Web citations • 2.4% of references in all biomedical journals published between Aug ’97 and April ’05 are pure URL citations* • The percentage of articles in oncology journals which include one or more web citations increased from 9% in 2001 to 11% in 2002 and 16% in 2003** • Type of Web Content cited in Scopus abstracts *Web citations archived with WebCite: going, going, still there, Gunter Eysenbach **Internet Citations in Oncology Journals: A Vanishing Resource?, Eric J. Hester, Journal of the National Cancer Institute, Vol. 96, No. 12

  14. Web search engine - Scirus • Scientific Web pages (200M+): .edu, .ac.uk, .org, .gov, .com • Proprietary Content (50M+): Repositories (NDLTD, DiVA), Pre-print servers (ArXiv, CogPrints), Other publishers (AIP, BioMedCentral), Societies (SIAM), Elsevier ScienceDirect, Patents (JPO, USPTO), others - only when relevant to research • Functionality: Searching (author, journal), Ranking (optimised for science content), Classification (document and subject) • “Best Directory or Search Engine” WebAward - won 3 consecutive years from the Web Marketing Association.

  15. Pinpointing Results: The Inverted Pyramid • Seed List Creation & Maintenance • Focused Crawling / OAI Harvesting • Database Load • Classification • Scirus Index • Query • Ranking • Results

  16. Seed list creation and maintenance • Automatic URL extractor tool identifies new scientific seeds • Link analysis of the most popular sites in specific subject areas • Elsevier publishing units supply a list of sites in their subject area • Scientific, Library and Technical Advisory Boards provide input • Webmasters and Scirus users regularly submit suggestions for new sites • Easily identifiable URLs are added on a regular basis • Example: www.newscientist.com

  17. Focused crawling • The Scirus robot crawls the Web to find new documents and update existing documents. • A scheduler coordinates the crawl. The job of the scheduler is to prioritize documents for crawling, track the rules set by webmasters in their robots.txt files, and limit the number of requests a robot sends to a server. • Independent machine nodes crawl the Web. They work in tandem and share link and meta information. • The robot collects documents and sends them to the Index. • A copy of the page is stored so that Scirus can show the portion of the document that actually contains the search query term.
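The scheduler described in slide 17 (prioritizing documents, honoring webmaster rules, limiting per-server requests) can be sketched roughly as below. This is an illustrative toy, not Scirus's implementation: the disallow rules, priority values and per-host limit are all invented for the example.

```python
import heapq
from collections import defaultdict

class CrawlScheduler:
    """Toy crawl scheduler: prioritize URLs, honor per-host disallow rules
    (robots.txt-style), and cap the number of fetches per host."""

    def __init__(self, disallowed, per_host_limit=2):
        self.queue = []                       # min-heap of (-priority, seq, url)
        self.seq = 0                          # tie-breaker keeps heap ordering stable
        self.disallowed = disallowed          # host -> list of disallowed path prefixes
        self.per_host_limit = per_host_limit  # max fetches allowed per host
        self.host_counts = defaultdict(int)

    def add(self, url, priority):
        heapq.heappush(self.queue, (-priority, self.seq, url))
        self.seq += 1

    def allowed(self, url):
        # crude split of "host/path"; a real crawler would parse full URLs
        host, _, path = url.partition("/")
        return not any(path.startswith(p) for p in self.disallowed.get(host, []))

    def next_batch(self):
        """Drain the queue in priority order, skipping disallowed URLs and
        deferring URLs whose host is already at its request limit."""
        batch, deferred = [], []
        while self.queue:
            neg_p, seq, url = heapq.heappop(self.queue)
            host = url.partition("/")[0]
            if not self.allowed(url):
                continue                       # disallowed by webmaster rules: drop
            if self.host_counts[host] >= self.per_host_limit:
                deferred.append((neg_p, seq, url))  # revisit in a later cycle
                continue
            self.host_counts[host] += 1
            batch.append(url)
        for item in deferred:
            heapq.heappush(self.queue, item)
        return batch
```

With a limit of one request per host, a high-priority but disallowed page is skipped and a second page on the same host is deferred until the next cycle.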

  18. Results ranking – terms and links • Term frequency • Is the term in the title? • Is the term in the text in a link? • Where is the term located in the text (top, bottom)? • How many times is the term used? • Link analysis • The number of links to a page is analyzed • Importance of a page is determined by calculating the number of links to a page • Scirus analyses the anchor text – the text of a link or hyperlink – to determine the relevance of a site
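A minimal sketch of how the term and link signals from slide 18 could be combined into a single relevance score. The weights and the formula are invented for illustration; the actual Scirus ranking function is not public.

```python
import math

def score_page(page, term, inlinks):
    """Toy relevance score: term frequency, term-in-title, term-in-anchor-text,
    term position (top of page counts more), and inlink count (link analysis)."""
    term = term.lower()
    words = page["text"].lower().split()
    tf = words.count(term)                          # how many times the term is used
    in_title = term in page["title"].lower()        # is the term in the title?
    in_anchor = term in page["anchor_text"].lower() # is the term in link text?
    # position bonus: occurrences near the top of the text score higher
    first = words.index(term) if term in words else len(words)
    position = 1.0 - first / max(len(words), 1)
    link_score = math.log1p(inlinks)                # damped number of links to the page
    return 2.0 * in_title + 1.5 * in_anchor + 1.0 * tf + 0.5 * position + link_score
```

A page that carries the term in its title and anchor text and has many inlinks outranks a page with a single late mention and few inlinks.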

  19. Do a search on ‘nanotube’ using GoogleTM What do you do with 2,170,000 results?

  20. ‘nanotube’ search on Google ScholarTM What do you do with 77,900 results?

  21. Search ‘nanotube’ using an integrated database

  22. Results overview from peer reviewed literature

  23. Results overview from selected web sources

  24. Results overview from patent offices

  25. Results overview from NSYSU selected sources: NSYSU + NDLTD; NSYSU eThesys (Electronic Theses Harvestable and Extensible System)

  26. Limit documents to NSYSU NSYSU Only Link to eThesys

  27. Web citations: linking back to all four domains from the Abstract page; web search on article title, author name, keywords

  28. Ensuring the quality of citations Technology behind citation count and citation matching

  29. Our database figures • 15,670 titles • 13,500+ academic journals • 750+ conference proceedings • 600+ trade publications • 28 million abstracts • 250 million references How do we maintain quality?

  30. Why is accurate citation so important? • Citation navigation • Accurate forward citation (cited by) and backward citation links (reference) • Citation count • Accuracy in the number of times an article is cited The accuracy of the references determines the quality of a citation database

  31. Flow of bibliographic data • Receipt of FTP’d e-issue or printed issue from Publisher (a 40%/60% split) • Registration • Scanning, OCR (text reader) or retyping (for print issues) • Content indexing with controlled vocabularies; enriching records • OPSbank: Elsevier’s content repository for A&I; quality checks • Database Warehouse: citation matching • Dayton Server: ultimate storage • FAST: search engine indexing

  32. We have a data processing function specifically for citation matching: data is first exported from OPSbank to the Database Warehouse, where the data is processed (de-duplication, citation matching), and then exported to our database server.

  33. Matching references: when new reference data comes in, it is compared against existing clusters. If a match is found, the item is added to the cluster; if no match is found, a new cluster is created.
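The slide-33 loop can be sketched as follows. The matching rule here (same first author and year plus a title-word overlap threshold) is an invented stand-in for Elsevier's actual matching logic, which tolerates the kinds of discrepancies shown on the next slide.

```python
import re

def norm_title(title):
    """Normalize a title for comparison: lowercase, alphanumeric words only."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def assign_to_cluster(ref, clusters, threshold=0.7):
    """A reference joins the first cluster whose representative shares the
    same first author and year and enough title words (Jaccard overlap);
    otherwise it starts a new cluster. Each cluster is a list of references."""
    words = norm_title(ref["title"])
    for cluster in clusters:
        rep = cluster[0]
        if rep["author"].lower() != ref["author"].lower():
            continue
        if rep["year"] != ref["year"]:
            continue
        rep_words = norm_title(rep["title"])
        overlap = len(words & rep_words) / max(len(words | rep_words), 1)
        if overlap >= threshold:      # match found: add the item to the cluster
            cluster.append(ref)
            return cluster
    clusters.append([ref])            # no match found: create a new cluster
    return clusters[-1]
```

Normalizing case before comparison is what lets an all-caps variant of the same title fall into the same cluster.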

  34. Matching examples. The system overcomes a missing volume number and uses the title to confirm the match: REF: Aracil R. et al., "Multirate sampling technique in digital control systems simulation." IEEE Trans. Systems Man Cybernet., p. 776, 1984. ITEM: Aracil R. et al., "MULTIRATE SAMPLING TECHNIQUE IN DIGITAL CONTROL SYSTEMS SIMULATION." IEEE Trans Syst Man Cybern, v. SMC-14, p. 776, 1984. There are page, author, article title and journal discrepancies, but still a match is found: REF: Keller-Wood M.K., Stenstrom B., Shinsako J. et al., "Interaction between CRF and AngII in control ACTH and adrenal steroids." Am. J. Physiol., v. 250, pp. 306-402, 1986. ITEM: Keller-Wood, M., Kimura B., Shinsako J. et al., "Interaction between CRF and angiotensin II in control of ACTH and adrenal steroids." American Journal of Physiology - Regulatory Integrative and Comparative Physiology, v. 250, pp. 19/3, 1986.

  35. Linking references to records (items) • Volume/issue number tagging, journal abbreviation, author initial: • ref: R. Oliver, "The spots and stains of plate tectonics" Earth Sci. Rev. v. 2, p. 77-106, 1992 • item: J. Oliver, "The spots and stains of plate tectonics" Earth-Science Reviews. v. 32, n. 1-2, p. 77-106, 1992 • Author typo, incomplete page info: • ref: X. Malague, "Pipe inspection by infrared thermography" Mater Eval. v. 57, n. 9, p. 899-902, 1999. • item: Xavier Maldague, "Pipe inspection by infrared thermography" Materials Evaluation. (Mater Eval) v. 57, n. 9, (6 pp), 1999. Reference linking results: • Over 95% of possible links were found • Over 99.9% of links are correct

  36. Bridging clusters. These two original clusters/dummy items couldn't be merged previously. Original dummy item for a cluster with 4 refs: Naragan R. et al., In: Supercritical Fluid Science and Technology, pp. 226-241, 1989. (Joohnston, K. P., Penninger, J. M. L., Eds.; American Chemical Society: Washington, DC). Original dummy item for a cluster with 6 refs: Narayan R. et al., "Kinetic elucidation of the acid-catalyzed mechanism of 1-propanol dehydration in supercritical water." In: ACS Symposium Series, v. 406, pp. 226-241, 1989 (Johnston, K. P., Penninger, J. M. L., Eds.; American Chemical Society: Washington, DC). The following new reference came in with both article title and book title, which is sufficient to bridge the two clusters (despite the omitted word in the book title): Narayan R. et al., "A Kinetic Elucidation of the Acid-Catalyzed Mechanism of 1-Propanol Dehydration in Supercritical Water." In: Supercritical Science and Technology, pp. 226-241, 1989. (Johnston, K. P., Penninger, J. M. L., Eds.; ACS Symposium Series 406; American Chemical Society: Washington, DC)
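The bridging mechanism on slide 36 amounts to: a new reference that matches more than one existing cluster causes those clusters to be merged. A minimal sketch, with the matching predicate `matches` supplied by the caller (it stands in for whatever title/author/year comparison the real system uses):

```python
def bridge_clusters(new_ref, clusters, matches):
    """Add new_ref to the cluster set. If it matches several clusters,
    merge them all into one (the 'bridging' case from slide 36)."""
    hits = [c for c in clusters if matches(new_ref, c)]
    if not hits:
        clusters.append([new_ref])     # no match: start a new cluster
        return clusters[-1]
    merged = hits[0]
    for other in hits[1:]:
        merged.extend(other)           # fold each bridged cluster in
        clusters.remove(other)
    merged.append(new_ref)
    return merged
```

With a reference rich enough to match two previously separate clusters, both collapse into a single cluster containing all the references.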

  37. Dummy records. A cluster may not match any record in the database; such clusters are called "dummy records". A dummy record contains all the information of the cluster taken from its references. In our database, you will see them as: dummy record (no link to an abstract) or real record (link to an abstract). A dummy record also has a "cited times" count.

  38. As a result you will see… • More accurate references • More citations • References that seem different (e.g. typo, missing volume/issue/page) but cite the same item

  39. Highly cited records 71846 citations: Laemmli U.K., "Cleavage of structural proteins during the assembly of the head of bacteriophage T4." Nature, v. 227, pp. 680-685, 1970. 61429 citations: Bradford M.M., "A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein dye binding." Analytical Biochemistry, v. 72, pp. 248-254, 1976. 37823 citations: Chomczynski P., Sacchi N., "Single-step method of RNA isolation by acid guanidinium thiocyanate-phenol-chloroform extraction." Analytical Biochemistry, v. 162, pp. 156-159, 1987.

  40. Highly cited dummy records 75452 citations: Sambrook J. et al., "Molecular Cloning: A Laboratory Manual." Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, 1989. 39227 citations: Lowry O.H. et al., "Protein measurement with the Folin phenol reagent." Journal of Biological Chemistry, v. 193, pp. 265-275, 1951. 37571 citations: "SAS/STAT User's Guide." SAS Institute, Cary, NC, 1989. 37405 citations: "Diagnostic and Statistical Manual of Mental Disorders." Washington, DC: American Psychiatric Association, 1994. 35659 citations: Sheldrick, G.M., "SHELXL-97 crystal structure refinement program." University of Göttingen, Germany, 1997.

  41. Conclusions: data mining • Data mining tools are in the hands of the end users • Technology is enabling researchers to carry out complicated tasks in a matter of a few clicks • General search engines are not the answer Intro

  42. Evaluating scientific research output Why is evaluation so important? Case study – evaluating an author

  43. Why do we evaluate scientific output? • Who evaluates: Government, Funding Agencies, Institutions, Faculties, Libraries, Researchers • What for: Funding allocations, Grant allocations, Policy decisions, Benchmarking, Promotion, Collection management

  44. Criteria for effective evaluation • Objective • Quantitative • Relevant variables • Independent variables (avoid bias) • Globally comparative

  45. Data requirements for evaluation • Citation counts • Article counts • Usage counts • Broad title coverage • Affiliation names • Author names (including co-authors) • References • Subject categories • ISSN (e and print) • Article length (page numbers) • Publication year • Language • Keywords • Article type • Etcetera… There are limitations that complicate author evaluation

  46. Data limitations • Author disambiguation • Normalising Affiliations • Subject allocations may vary • Matching authors to affiliations • Deduplication/grouping • Etcetera Finding/matching all relevant information to evaluate authors is difficult

  47. The Challenge: finding an author • How to distinguish results between those belonging to one author and those belonging to other authors who share the same name? • How to be confident that your search has captured all results for an author when their name is recorded in different ways? • How to be sure that names with unusual characters such as accents have been included – including all variants?

  48. The Solution: Author Disambiguation. We have approached these problems by using the data available in the publication records, such as: • Author names • Affiliation • Co-authors • Self-citations • Source title • Subject area …and used this data to group articles that belong to a specific author
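The grouping idea on slide 48 can be sketched as below: articles sharing an author name are merged into one author group only when the other record fields (affiliation, co-authors, subject area) provide enough corroborating evidence. The scoring weights and the threshold are invented for illustration; the production disambiguation logic is more elaborate.

```python
def same_author(a, b):
    """Toy evidence score for 'these two articles are by the same person':
    same affiliation counts most, then shared co-authors, then subject area."""
    score = 0
    score += 2 * (a["affiliation"] == b["affiliation"])
    score += len(set(a["coauthors"]) & set(b["coauthors"]))  # shared co-authors
    score += a["subject"] == b["subject"]
    return score >= 2          # illustrative threshold

def group_articles(articles):
    """Greedy grouping: each article joins the first compatible group,
    otherwise it starts a group of its own."""
    groups = []
    for art in articles:
        for g in groups:
            if same_author(art, g[0]):
                g.append(art)
                break
        else:
            groups.append([art])
    return groups
```

Two "C.C. Wang" articles from the same institution in the same subject area end up in one group, while an unrelated namesake with a different affiliation, co-authors and subject gets a group of its own.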

  49. Case Study 1: An approach to Author Searching

  50. Step 1: Searching for an author. Enter the name in the Author Search box. Example: Professor Chua-Chin Wang, National Sun Yat-sen University. Group: System-on-Chip Group. Specialties: integrated circuit design, communication interface circuit design, neural networks. Laboratory: VLSI Design Lab. Office extension: 4144
