
CoBib

CoBib. a collaborative bibliographic database Beth Trushkowsky. Objectives. Create a web repository of research papers Allow users to obtain literature reviews of particular subjects Allow users to group and download citations Suggest citations that are


Presentation Transcript


  1. CoBib a collaborative bibliographic database Beth Trushkowsky

  2. Objectives • Create a web repository of research papers • Allow users to obtain literature reviews of particular subjects • Allow users to group and download citations • Suggest citations that are • currently recommended (explicit) • currently “hot” (implicit) • Limit scope of community to research groups • Use open standards for interoperability

  3. Presentation Outline • Motivation • Methodology • Related work • Architecture • Design • Tagging • Recommending • Reference Reconciliation • String-distance • K-means clustering • DPMM • Reflection and Other Ideas • User Studies

  4. Motivation • Example: someone new to database systems wants to learn about prominent ideas and people in that area • Many resources online to locate research papers • Missing the interaction between users • Academic peers have the potential to help one another find relevant information • Want to promote easy exchange of ideas

  5. Methodology • Database functions • Searching: a direct query • Browsing: an indirect survey • Information integration • Merge users’ personal repositories • Improve utility of citations for community • Web 2.0 • User involvement and interaction • Immediate results via asynchronous processing

  6. Related Work Also: CiteULike, Bibsonomy, etc…

  7. Related Work • Rexa • Similar project at UMass Amherst • Introduces tagging of citations for personal bookmarking • Tackles object coreference problems • How CoBib is different • Users share tags • Users contribute to system’s integrity via corrections • Citation recommendation • Small academic community

  8. Architecture Relationship

  9. Design: Tagging • Folksonomy: user-generated taxonomy for classifying web content with tags • Self-bookmarking is an obvious application • Social tagging systems connect users’ tagging activities to the rest of the community • Increases success in citation browsing • Vocabulary problem • Want vocabulary to be consistent • Ex: databases and database systems • Improve consistency with suggestions
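
The "improve consistency with suggestions" idea can be sketched as follows. This is a minimal illustration, not CoBib's implementation: the function name `suggest_tags`, the 0.8 similarity cutoff, and the use of Python's standard `difflib` are all assumptions made for the example.

```python
from difflib import get_close_matches


def suggest_tags(typed, community_tags, limit=3):
    """Suggest existing community tags for what the user has typed so far."""
    typed = typed.strip().lower()
    known = sorted({t.lower() for t in community_tags})
    # Tags that extend the typed text cover the completion case,
    # e.g. "database" -> "database systems", "databases".
    prefix_hits = [t for t in known if t.startswith(typed) and t != typed]
    # Near-misses catch misspellings such as "databses" (assumed cutoff 0.8).
    fuzzy_hits = get_close_matches(typed, known, n=limit, cutoff=0.8)
    suggestions = []
    for tag in prefix_hits + fuzzy_hits:
        if tag not in suggestions:
            suggestions.append(tag)
    return suggestions[:limit]
```

Offering tags that other users already chose nudges a new tagger toward one of "databases" or "database systems" instead of coining a third variant.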

  10. Design: Recommendation • Explicit • Recommend citation or citation groups to particular individuals or research areas • Share bibliography for particular project • Implicit: questions to answer • Track user interaction: how do citations viewed indicate interests? • How can the community’s actions be used to recommend new citations to particular users? • Collaborative filtering • Try to find a neighborhood metric • Have advantage of direct input from users about interests
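
One way to picture the implicit, collaborative-filtering route is a user-based neighborhood over viewing histories. The slide leaves the neighborhood metric open, so the Jaccard overlap used here is only a stand-in, and the names `recommend`, `views`, and `k` are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Overlap between two sets of citation ids, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(target, views, k=2, limit=5):
    """Recommend citations that the target's nearest neighbors viewed.

    views maps a user name to the set of citation ids that user viewed.
    """
    seen = views[target]
    # Rank the other users by similarity of viewing history (assumed metric).
    neighbors = sorted(
        ((jaccard(seen, other), user)
         for user, other in views.items() if user != target),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for similarity, user in neighbors:
        # Only citations the target has not seen yet are candidates.
        for citation in views[user] - seen:
            scores[citation] += similarity
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [citation for citation, score in ranked if score > 0][:limit]
```

Explicit input from users about their interests, which the slide lists as an advantage, could replace or reweight the viewing sets without changing the structure of the sketch.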

  11. Reference Reconciliation • Multiple citations, single entity • Aha, D. W. & Kibler D. Noise-Tolerant Instance-Based Leanring Algorithms. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence. • D. Aha and D. Kibler. Noise-tolerant instance-based learning algorithms. In Proceedings of IJCAI-89. • Need single thread of discussion

  12. String Distance Metrics • String matching algorithms: Cohen et al. • Edit-distance methods: “F O R B E S” vs. “F O _ V E S” • Token-based methods: “hands on labs without computers” vs. “use computers with your hands” • Hybrid methods
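
The two families of metrics on this slide can be illustrated with textbook representatives: Levenshtein distance for the edit-distance family and Jaccard similarity over word sets for the token-based family. These are generic choices for illustration, not necessarily the specific metrics evaluated by Cohen et al. or used in CoBib.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def token_jaccard(a, b):
    """Jaccard similarity over lowercase word sets; ignores word order."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

On the slide's own strings, `edit_distance("FORBES", "FOVES")` is 2 (one deletion, one substitution), and the two "hands"/"computers" phrases share 2 of 8 distinct words, for a token similarity of 0.25 despite their different word order.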

  13. K-Means Clustering • Represent set as mixture model • X is set of observed citations • Z is latent, true citation entity • Group into k clusters using string distance • Want to discern number k • Allow number to change while processing • Use McCallum’s canopies to avoid too many pair-wise operations • Complete pass to generate canopies • Compute distance only to points in same canopy
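
The canopy step can be sketched roughly as below, under stated assumptions: the cheap metric is a word-set Jaccard distance chosen for the example, the `loose` and `tight` thresholds are hypothetical parameters, and the canopy center is taken in list order rather than at random so the result is reproducible.

```python
def cheap_distance(a, b):
    """Cheap stand-in metric: 1 - Jaccard overlap of lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def make_canopies(points, loose, tight, distance=cheap_distance):
    """Group points into possibly overlapping canopies (0 <= tight <= loose)."""
    if not 0 <= tight <= loose:
        raise ValueError("need 0 <= tight <= loose")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]  # first remaining point, for determinism
        canopy = [i for i in range(len(points))
                  if distance(points[center], points[i]) <= loose]
        canopies.append(canopy)
        # Points very close to the center can no longer start their own canopy.
        remaining = [i for i in remaining
                     if distance(points[center], points[i]) > tight]
    return canopies


def candidate_pairs(canopies):
    """Pairs that share a canopy; only these need the expensive metric."""
    pairs = set()
    for canopy in canopies:
        for x in range(len(canopy)):
            for y in range(x + 1, len(canopy)):
                pairs.add((canopy[x], canopy[y]))
    return pairs
```

The expensive string distance is then computed only for the pairs returned by `candidate_pairs`, which is where the savings over all pair-wise comparisons come from.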

  14. Dirichlet Process Mixture Model • Chinese restaurant process • Represent each citation as vector of counts on vocabulary • θ ~ Dirichlet prior over vocabulary size • Hyperparameters (α1, α2, …, αv) and α0 • π ~ “Stick-breaking” Dirichlet prior over number of clusters • Gibbs Sampling • Choose a citation • Compute probability that it belongs to each cluster • Nonzero probability of creating new cluster
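
A single Gibbs step of the kind listed above might look like the following sketch. It assumes a collapsed sampler with a symmetric Dirichlet prior over words; the parameter names `concentration` and `word_prior` and their default values are illustrative and do not come from the slides.

```python
from collections import Counter


def predictive(doc, counts, vocab_size, word_prior):
    """Dirichlet-multinomial predictive of doc given a cluster's word counts."""
    counts = Counter(counts)  # copy so the caller's counts are untouched
    total = sum(counts.values())
    p = 1.0
    for word in doc:
        p *= (counts[word] + word_prior) / (total + vocab_size * word_prior)
        counts[word] += 1
        total += 1
    return p


def assignment_probabilities(doc, clusters, vocab_size,
                             concentration=1.0, word_prior=0.5):
    """Probabilities of placing doc in each cluster, or in a new one.

    doc is a list of tokens; clusters is a list of clusters, each a list of
    token lists for the citations currently assigned to it (doc excluded).
    The last entry of the result is the probability of a new cluster.
    """
    weights = []
    for members in clusters:
        counts = Counter(token for member in members for token in member)
        # Chinese restaurant process: existing clusters weighted by size.
        weights.append(len(members) * predictive(doc, counts, vocab_size, word_prior))
    # A new cluster always keeps nonzero weight via the concentration parameter.
    weights.append(concentration * predictive(doc, Counter(), vocab_size, word_prior))
    total = sum(weights)
    return [w / total for w in weights]
```

A sampler would draw the citation's new cluster from these probabilities and repeat over all citations; for long documents the products should be computed in log space to avoid underflow.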

  15. Reflection and Other Ideas • Both methods performed poorly • Information loss • String-distance: between groups of citations • Bag-of-words: between citation fields • Other ideas • Explore different feature spaces of citations • Active learning • User feedback for reconciling citations • Order of reconciliation tasks • Determine users most capable of reconciling

  16. User Studies • Currently in data-gathering stage • Want to measure effectiveness of CoBib • Did users find new citations in their research areas of interest? • Were users more likely to interact with citations in their areas? • How often did searching turn into browsing? • Which types of recommendations were more relevant?

  17. Thank you! Suggestions?

  18. Design Principles • Adhere to information standards • Z39.50 protocol from Library of Congress • Future ability to query other university databases • Established profiles for searching bibliographic data • Metadata Object Description Schema (MODS) • BibTeX and EndNote • Portable Document Format for Archives (PDF/A) • CoBib’s documents will be self-contained

  19. User Interface Issues • User-Centered Design • Give attention to needs and wants of end user during the design process • Decrease processing time • Increase productivity • Use AJAX to give users immediate feedback on possible word completions • Java paper submission • Making tasks easier increases likelihood that users will perform them… thus more data • Give a lot, take a little: users may be more inclined to help with citation modifying if the rest of the process is very easy

  20. Object Coreference • What we’ve done • Take advantage of structured XML data: treat different types of data differently • Author fields: Soft Jaro-Winkler • Title fields: Soft Monge-Elkan • What needs to be improved • Most effective weight for respective fields • Appropriate thresholds for the above algorithms
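
The open questions on this slide (field weights and thresholds) can be made concrete with a small sketch. Note the substitution: instead of Jaro-Winkler or Monge-Elkan it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity, and the weights and threshold in any call are hypothetical values to be tuned.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Stand-in similarity in [0, 1]; not Jaro-Winkler or Monge-Elkan."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def citation_similarity(c1, c2, weights):
    """Weighted average of per-field similarities over the weighted fields."""
    total = sum(weights.values())
    score = sum(
        weight * field_similarity(c1.get(field, ""), c2.get(field, ""))
        for field, weight in weights.items()
    )
    return score / total


def same_entity(c1, c2, weights, threshold):
    """Decide coreference by comparing the weighted score to a threshold."""
    return citation_similarity(c1, c2, weights) >= threshold
```

Swapping in a different metric per field, as the slide describes for author and title fields, would only change `field_similarity` into a lookup keyed by field name.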

  21. Storing Citations with Zebra • Zebra • Structured text indexing and retrieval engine • Supports various portable file formats • Scalable to large datasets • Automates indexing • Zebra uses Z39.50 protocol • Client-server protocol for retrieving information from remote databases • Syntax of the client’s search query is independent of the server’s database structure • Future possibility to incorporate other libraries’ data into CoBib

  22. Searching Citations with Zebra • Attribute sets • “Characteristics of a query” • Maps constraints to numeric values • Attribute types in the Bib-1 attribute set • Use: access point in search query (e.g., author) • Relation: relationship of search terms to those in database (less than, greater than, etc.) • Position: search terms’ position in field • Structure: format of the query (word, numeric, etc.) • Truncation: where in query search terms can match • Completeness: how “full” is the query in its field • Example • @attr 1=1003 @attr 3=3 @attr 5=1 forbes
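
The example query uses the prefix notation that the YAZ toolkit behind Zebra calls Prefix Query Format (PQF), where each `@attr type=value` pair sets one Bib-1 attribute. The helper below is a hypothetical convenience for assembling such strings; `build_pqf` is not part of Zebra or YAZ, and only the six attribute type numbers are taken from the Bib-1 set.

```python
# Bib-1 attribute types, numbered in the order the slide lists them.
BIB1_TYPES = {
    "use": 1,
    "relation": 2,
    "position": 3,
    "structure": 4,
    "truncation": 5,
    "completeness": 6,
}


def build_pqf(term, **attributes):
    """Build a PQF query string from named Bib-1 attribute values."""
    parts = []
    # Emit attributes in attribute-type order for a stable result.
    for name in sorted(attributes, key=lambda n: BIB1_TYPES[n]):
        parts.append(f"@attr {BIB1_TYPES[name]}={attributes[name]}")
    # Multi-word terms are wrapped in double quotes so they stay one term.
    if " " in term:
        term = f'"{term}"'
    parts.append(term)
    return " ".join(parts)
```

Calling `build_pqf("forbes", use=1003, position=3, truncation=1)` reproduces the slide's example; as far as I read the Bib-1 tables, that asks for "forbes" as an author (use 1003), anywhere in the field (position 3), with right truncation (truncation 1).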

  23. Storing Other Information • Relational database captures the connections between users and citations • MySQL database • Most tables describe what user x did with citation y • Ex. Tags, UserActions, Recommends
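
A rough picture of the "user x did something with citation y" tables follows. Only the table names Tags, UserActions, and Recommends come from the slide; every column name is an assumption, and SQLite (through Python's standard `sqlite3` module) stands in for MySQL so the sketch runs without a server.

```python
import sqlite3

# Hypothetical columns; only the three table names appear in the slides.
SCHEMA = """
CREATE TABLE Tags (
    user_id     TEXT NOT NULL,
    citation_id TEXT NOT NULL,
    tag         TEXT NOT NULL,
    PRIMARY KEY (user_id, citation_id, tag)
);
CREATE TABLE UserActions (
    user_id     TEXT NOT NULL,
    citation_id TEXT NOT NULL,
    action      TEXT NOT NULL,
    happened_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE Recommends (
    from_user   TEXT NOT NULL,
    to_user     TEXT NOT NULL,
    citation_id TEXT NOT NULL
);
"""


def open_store():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def tags_for(conn, citation_id):
    """Return (tag, number of users) pairs for a citation, most used first."""
    rows = conn.execute(
        "SELECT tag, COUNT(*) FROM Tags WHERE citation_id = ? "
        "GROUP BY tag ORDER BY COUNT(*) DESC, tag",
        (citation_id,),
    )
    return rows.fetchall()
```

Keeping the citation records themselves in Zebra and only the user-to-citation relationships here matches the split the slides describe.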

  24. Object Coreference • Reference Reconciliation: Dong et al. • Partition into sets of classes with atomic and association attributes • Goal: each instance of a class will represent a single real-world entity and vice versa • Objectives • Use associations between references to assist reconciliation decisions • Propagate reconciliations for different pairs of references • Enrich references after reconciliation • Construct dependency graph • Node represents similarity between two references • Edge represents dependency between similarities
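
The dependency-graph idea can be sketched in a few lines. This is a loose illustration rather than the algorithm of Dong et al.: the additive `boost` update rule, the `threshold`, and the function name `propagate` are all hypothetical.

```python
from collections import deque


def propagate(scores, dependents, boost=0.2, threshold=0.8):
    """Propagate reconciliation decisions through a dependency graph.

    scores maps a node (a pair of references) to its similarity in [0, 1].
    dependents maps a node to the nodes whose similarity depends on it.
    Returns the set of nodes judged to refer to the same entity.
    """
    scores = dict(scores)  # work on a copy
    merged = set()
    queue = deque(node for node, s in scores.items() if s >= threshold)
    while queue:
        node = queue.popleft()
        if node in merged:
            continue
        merged.add(node)
        for other in dependents.get(node, ()):
            if other in merged:
                continue
            # Hypothetical rule: a reconciled neighbor counts as extra evidence.
            scores[other] = min(1.0, scores[other] + boost)
            if scores[other] >= threshold:
                queue.append(other)
    return merged
```

For instance, once two author references are reconciled, a pair of paper references that lists those authors gains evidence and may cross the threshold in turn.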

  25. Object Coreference • Identity uncertainty: Pasula et al. • Proposes a declarative approach using a formal language • Relational Probability Models (RPMs) • Sets of classes, named instances, attributes, conditional probability models, and instance statements • For each citation, construct Bayesian network • Use Markov chain Monte Carlo to make approximate inferences • Results: doesn’t scale well • Use cheap distance metric to make canopies
