Online Autonomous Citation Management for CiteSeer

CSE598B course project by Huajing Li. Introduction to CiteSeer: a software package developed at NEC-Labs; domain-independent software for Automatic Citation Indexing (ACI).

Presentation Transcript


  1. Online Autonomous Citation Management for CiteSeer
     CSE598B Course Project, by Huajing Li

  2. Introduction to CiteSeer
     • Software package developed at NEC-Labs
     • Domain-independent software for Automatic Citation Indexing (ACI)
     • Focus is on scholarly publications in electronic format (PS / PDF and variants)
     • Performs:
       • Document discovery / retrieval / parsing
       • Automatic citation extraction
       • Document & citation indexing / search

  3. [Figure: untitled diagram. Labels: Crawler; Document URL; Retrieval; Document (PDF/PS); Conversion; Document (Plain Text); Web Server; Parsing & Meta-Data Extraction; Document Database; File System; Meta-Data Database; PDBM_File & Chunk Tables; Indexes; Document Body Text; N Citation Texts; N Citation Meta-data Sets (CID, GID, Title, Authors, etc.); Document Meta-data Set (DID, Title, Authors, etc.); Indexing]

  4. Submitting Documents
     • The output of a crawl or user submission is the URL of a page linking to the document
     • These URLs are stored in the Paper Table
     • The Paper Table maintains a status for each document:
       • Downloaded / undownloaded
       • Processed / unprocessed
       • Other processing errors (tooshort/noreference/etc.)
     • CiteSeer regularly scans this table to start downloading new documents
     • Only documents matching the typical pattern of scholarly publications are eventually added to the collection
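
The status tracking described on this slide can be pictured with a small sketch. The row layout, the field names (`url`, `downloaded`, `processed`, `error`), and the rule that rows with a recorded error are skipped are all assumptions made for illustration; the slides do not show the actual Paper Table schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PaperRow:
    """One hypothetical Paper Table row (field names are assumptions)."""
    url: str                      # page linking to the document
    downloaded: bool = False
    processed: bool = False
    error: Optional[str] = None   # e.g. "tooshort" or "noreference"


def pending_downloads(table: List[PaperRow]) -> List[PaperRow]:
    """Rows a periodic scan would pick up: not yet downloaded, no recorded error."""
    return [row for row in table if not row.downloaded and row.error is None]


table = [
    PaperRow("http://example.org/a.html"),
    PaperRow("http://example.org/b.html", downloaded=True, processed=True),
    PaperRow("http://example.org/c.html", downloaded=True, error="tooshort"),
]
print([row.url for row in pending_downloads(table)])
```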

  5. Document Structure Identification
     • Metadata sources: document header, system info, citation graph
     • Fields:
       • Title
       • Subject (keywords)
       • Description (abstract)
       • Author names
       • Author affiliations
       • Author address, email, phone, homepage URL
       • Publication date, publication number
       • Archive date
       • Contributor
       • Type
       • Format
       • Identifier
       • Source
       • Publisher
       • Journal/Conference
       • Pages
       • Relation
       • References
       • Is Referenced By

  6. Citations Grouping
     • Citations to the same document share a common Group ID
     • Each Group ID has a set of keys associated with it, derived from citation information
       • {authorkey1-titlekey; … authorkey2-titlekey}
     • There is one authorkey for every word in the author information
     • For a given citation, the titlekey is unique: it is the concatenation of all title words

  7. Citations Grouping
     • For a newly discovered citation:
       • Extract
         • Authors: C. Lee Giles, S. Lawrence
         • Title: “Good Paper Title”
       • Generate keys {giles-goodpapertitle; lee-goodpapertitle; lawrence-goodpapertitle}
       • Try to match at least one of them against an existing Group ID key
       • If there is a match, add this citation (Citation ID) to the group
       • Otherwise, create a new Group ID for this citation
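
The key generation and grouping steps on this slide can be sketched as follows. This is a minimal illustration, not CiteSeer's code: the names `generate_keys` and `GroupIndex` are hypothetical, and dropping single-letter initials is an assumption inferred from the example, where "C." and "S." produce no key. The ASCII-only tokenization is a further simplification.

```python
import re


def generate_keys(authors, title):
    """Build authorkey-titlekey strings for one citation."""
    # titlekey: all title words, lowercased and concatenated
    title_key = "".join(re.findall(r"[a-z0-9]+", title.lower()))
    keys = set()
    for author in authors:
        for word in re.findall(r"[a-z]+", author.lower()):
            if len(word) > 1:  # assumption: initials produce no key
                keys.add(f"{word}-{title_key}")
    return keys


class GroupIndex:
    """In-memory stand-in for the Group ID key store (hypothetical)."""

    def __init__(self):
        self.key_to_gid = {}   # key -> Group ID
        self.members = {}      # Group ID -> list of Citation IDs
        self._next_gid = 1

    def add_citation(self, cid, authors, title):
        keys = generate_keys(authors, title)
        # A single matching key is enough to join an existing group.
        gid = next((self.key_to_gid[k] for k in sorted(keys)
                    if k in self.key_to_gid), None)
        if gid is None:        # no match: create a new group
            gid = self._next_gid
            self._next_gid += 1
            self.members[gid] = []
        self.members[gid].append(cid)
        for k in keys:
            self.key_to_gid.setdefault(k, gid)
        return gid


index = GroupIndex()
g1 = index.add_citation("c1", ["C. Lee Giles", "S. Lawrence"], "Good Paper Title")
g2 = index.add_citation("c2", ["S. Lawrence"], "Good paper title")
g3 = index.add_citation("c3", ["A. Author"], "Another Title")
print(g1 == g2, g3 != g1)
```

Because one shared key suffices, the second citation joins the first group even though it lists only one of the two authors.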

  8. Linking Citations to Documents
     • Citation ID -> Group ID
       • We just saw that …
     • Document ID -> Group ID
       • Based on the document’s metadata, generate authorkey-titlekey keys in the same way and try to match a Group ID key generated from the citations
       • Document metadata can be erroneous, so successful mapping often happens only AFTER correction by users
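
A small sketch of the Document ID -> Group ID step, showing why erroneous metadata blocks the mapping until it is corrected. The helper names (`keys_for`, `link_document`), the dictionary layout, and the Group ID value 7 are hypothetical; the key format follows the earlier slides, with initials skipped as an assumption.

```python
import re


def keys_for(authors, title):
    """authorkey-titlekey strings, built the same way as for citations."""
    title_key = "".join(re.findall(r"[a-z0-9]+", title.lower()))
    words = [w for a in authors
             for w in re.findall(r"[a-z]+", a.lower()) if len(w) > 1]
    return {f"{w}-{title_key}" for w in words}


def link_document(doc_meta, key_to_gid):
    """Return the Group ID whose key matches the document, or None."""
    for key in sorted(keys_for(doc_meta["authors"], doc_meta["title"])):
        if key in key_to_gid:
            return key_to_gid[key]
    return None


# Keys previously generated from citations, all mapped to Group ID 7.
key_to_gid = {k: 7 for k in keys_for(["C. Lee Giles", "S. Lawrence"],
                                     "Good Paper Title")}

extracted = {"authors": ["S. Lawrence"], "title": "Good Paper Titel"}  # typo
corrected = {**extracted, "title": "Good Paper Title"}                 # user fix

print(link_document(extracted, key_to_gid))  # no key matches
print(link_document(corrected, key_to_gid))  # matches after correction
```

Exact key matching means a single misplaced letter in the extracted title is enough to prevent the link.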

  9. Problems of the Current Approach
     • There is no guarantee that the most similar citation contains the best metadata
     • Building the citation graph is a time-intensive, offline task
     • Because clustering is done in batch, adding a single citation requires rebuilding the entire citation graph to include the new instance
     • The so-called canonical metadata is fixed to the document record

  10. Goals of the New Citation Management System
      • Provide better document metadata
      • Reduce the cost of maintenance
      • Use online citation matching so that the citation graph can be adjusted immediately based on a single new citation
      • Provide a fluid framework for building canonical metadata in which all evidence is always considered
      • Allow the development of flexible APIs into the CiteSeer citation graph system
      • Maintain data security despite an open, wiki-like approach to user-contributed metadata changes
      • Provide better citation matching than the current system

  11. Prototype Overview
      [Figure: prototype component diagram. Labels: Query; Query Handler; Edge DB (SQL); Document Metadata Index; Document Metadata (XML); Citation Metadata Index; Citation Metadata (XML); Citation Resolver. Note in figure: “May ultimately be located in separate service”]

  12. Edge DB
      • One simple table containing one edge per row:
        • Id: citation handle (equivalent to CID)
        • citingDoc: citing document handle
        • citedDoc: cited document handle
      • Row-level locking
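
The edge table can be sketched with Python's built-in `sqlite3` module. The column names come from the slide; the table name `edge`, the column types, the index, and the helper functions are assumptions. SQLite is used only because it ships with Python: it does not offer the row-level locking the slide calls for, so a real deployment would need a database engine that does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE edge (
        id        TEXT PRIMARY KEY,  -- citation handle (equivalent to CID)
        citingDoc TEXT NOT NULL,     -- citing document handle
        citedDoc  TEXT NOT NULL      -- cited document handle
    )
""")
# Hypothetical index to speed up "is referenced by" lookups.
conn.execute("CREATE INDEX edge_cited ON edge (citedDoc)")


def add_edge(cid, citing, cited):
    """Insert one citation edge; no rebuild of the rest of the graph."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO edge VALUES (?, ?, ?)", (cid, citing, cited))


def cited_by(doc):
    """Handles of all documents that cite `doc`."""
    rows = conn.execute(
        "SELECT citingDoc FROM edge WHERE citedDoc = ? ORDER BY citingDoc",
        (doc,))
    return [r[0] for r in rows]


add_edge("c1", "docA", "docC")
add_edge("c2", "docB", "docC")
add_edge("c3", "docA", "docB")
print(cited_by("docC"))
```

Each new citation becomes a single-row insert, which is what lets the graph be updated online rather than rebuilt in batch.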

  13. Matching Citations and Docs
      • Exact string matching across disparate metadata fields is far too optimistic; better matching criteria are needed
      • Lucene provides two methods out of the box:
        • Match based on Levenshtein distance
          • Specify an arbitrary distance cut-off per field
          • Choose the most similar match out of the returned set
        • Cut out the middleman: similarity-based matching
          • Specify an arbitrary similarity threshold
          • Choose the most similar match out of the returned set that is over the threshold
      • Criteria to be determined through empirical tests using the prototype system
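
To make the first option concrete, here is a plain-Python sketch of matching with a per-field Levenshtein cut-off. It does not use Lucene or reproduce its API; it only illustrates the idea. The field names, the cut-off values in `CUTOFFS`, and the rule of ranking candidates by summed distance are assumptions, since the slide leaves the criteria to empirical testing.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Hypothetical per-field distance cut-offs.
CUTOFFS = {"title": 3, "authors": 2}


def best_match(citation, documents, cutoffs=CUTOFFS):
    """Return the closest document within every field's cut-off, else None."""
    best = None
    for doc in documents:
        dists = {f: levenshtein(citation[f].lower(), doc[f].lower())
                 for f in cutoffs}
        if all(dists[f] <= cutoffs[f] for f in cutoffs):
            total = sum(dists.values())
            if best is None or total < best[0]:
                best = (total, doc)
    return None if best is None else best[1]


documents = [
    {"id": "d1", "title": "Good Paper Title", "authors": "giles lawrence"},
    {"id": "d2", "title": "Another Paper", "authors": "smith"},
]
noisy = {"title": "Good Paper Titel", "authors": "giles lawrence"}
match = best_match(noisy, documents)
print(match["id"] if match else None)
```

Unlike exact key matching, the misspelled title still resolves to `d1` here, because the transposed letters cost only two edits.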
