
JSTOR

JSTOR. Tools for Linguists 22nd June 2009 Michael Krot Clare Llewellyn Matt O’Donnell. Tools for Linguists. Aim: To create a set of workflows that can extract data from JSTOR, then process or visualize this data in ways that are useful for linguists. Participants: JSTOR Michael Krot


Presentation Transcript


  1. JSTOR: Tools for Linguists. 22nd June 2009. Michael Krot, Clare Llewellyn, Matt O’Donnell

  2. Tools for Linguists
     Aim: To create a set of workflows that can extract data from JSTOR, then process or visualize this data in ways that are useful for linguists.
     Participants:
     • JSTOR: Michael Krot, Clare Llewellyn
     • U. Michigan: Matthew Brook O’Donnell

  3. Data for Research Service
     • The JSTOR archive:
       • 4.8M journal articles
       • 2.4M research articles
       • 1.6M review articles
       • ~14 billion words
       • 31M+ pages of OCR’d text
     • Multidisciplinary
       • Content is organized into 50 disciplines
     • High-quality bibliographic and structural metadata
       • Including 40M+ parsed reference citations
     • The Data for Research service brings much of this content into easy reach of researchers
       • Powerful search tools
       • Convenient data retrieval options

  4. Data for Research Service
     • A self-serve tool for obtaining research data from the JSTOR archive
       • Provided by a web interface enabling researchers to identify content of interest in the JSTOR archive and to retrieve associated datasets for research purposes
     • A researcher-oriented exploration tool complementing the search and browse capabilities offered by the JSTOR main site
       • Exposes additional fields for enhanced searching and results filtering
       • Provides data visualizations for viewing aggregate and document-level data
       • Links to JSTOR main site are provided for documents in search results
         • Authentication and authorization are required to view article contents

  5. Data for Research – Explore Tool

  6. Data for Research Service
     • Application Programming Interface (API)
       • Provides support for programmatic searching and data retrieval
       • Utilizes RESTful protocols for ease of use
         • Plain URL requests, XML responses
     • Standards-based search protocol
       • SRU (Search/Retrieve via URL)
         • Lightweight successor to the Z39.50 protocol
       • CQL (Contextual Query Language)
         • Formal language defining search syntax
     • Data retrieval using simple REST protocol
       • Provides access to back-end content repository
       • Resource Oriented Architecture (ROA)
         • Stateless: requests contain all relevant information
         • Uses HTTP methods (GET, POST) for operations
       • http://dfr.jstor.org/resource/<resource-id>?view=<view-id>
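
The request shapes on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the resource URL template and host come from the slide, but the SRU endpoint path passed in by the caller, the resource and view identifiers, the SRU version, and the `dc.title` index are placeholders that the slides do not confirm. The parameter names `operation`, `version`, `query` and `maximumRecords` are the standard SRU ones.

```python
from urllib.parse import quote, urlencode

DFR_BASE = "http://dfr.jstor.org"  # host as shown on the slide


def resource_url(resource_id: str, view_id: str) -> str:
    """Fill in the slide's template: /resource/<resource-id>?view=<view-id>."""
    path = f"/resource/{quote(resource_id, safe='')}"
    return f"{DFR_BASE}{path}?{urlencode({'view': view_id})}"


def sru_search_url(endpoint: str, cql_query: str, maximum_records: int = 10) -> str:
    """Build a stateless SRU searchRetrieve request carrying a CQL query.

    `endpoint` is supplied by the caller; the real DfR search endpoint
    is not given in the slides.
    """
    params = {
        "operation": "searchRetrieve",  # standard SRU operation name
        "version": "1.1",               # assumed SRU version
        "query": cql_query,             # CQL string, URL-encoded below
        "maximumRecords": maximum_records,
    }
    return f"{endpoint}?{urlencode(params)}"
```

Because every request carries all of its parameters in the URL, either function can be called repeatedly without any session state, which is the "stateless" property the slide refers to.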

  7. Data for Research Service
     • Data views available in DfR Beta3
       • Bibliographic metadata
         • Dublin Core
       • Word frequencies
         • List of distinct words and their occurrence counts
       • N-grams (specifically, word n-grams)
         • An n-gram is a sub-sequence of n items from a given sequence
         • Bigrams, trigrams, and quadgrams are provided by DfR
       • Keywords
         • Auto-extracted keywords based on their TF*IDF weight
         • TF*IDF (Term Frequency * Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus
       • References (citations out)
         • Raw text for identified references
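
To make the n-gram and TF*IDF definitions above concrete, here is a minimal Python sketch. It is not DfR's implementation: the slides do not say how DfR tokenizes text or which TF*IDF variant it uses, so this assumes pre-tokenized input and one common weighting, relative term frequency times log(N / document frequency).

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Score each word of each tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df),
    where df is the number of documents containing the word.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores
```

With this weighting, a word that appears in every document scores zero, which is why TF*IDF tends to surface terms that are distinctive to a document rather than merely frequent.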

  8. Components for API Interaction
     • ** Need to clarify the stuff from Bernie
     • Primary component: JSTOR API interface
       • Persistent SEASR webservice
       • HTTP Listener
       • HTTP Responder

  9. Tools and resulting data most likely to be of interest to:
     • Computational linguists
       • For use in a range of NLP applications; large discipline-specific datasets open up many options in computational semantics, tagging, parsing, text mining, etc.
       • Numerous applications for a JSTOR-derived academic n-gram set (the 1-million-word Brown Corpus from the 1960s is still used as a source of frequency information!)
     • Corpus and applied linguists
       • The study of distinctive vocabulary and phraseology (lexical patterns of 2+ grams) in and across academic disciplines is currently limited by the lack, and the small size, of available data
       • Finding words and phrases distinctive to or strongly associated with specific disciplines (statistically identified ‘key words’) requires frequency information from large samples
       • Need for discipline-specific frequency lists in teaching and testing of English for Academic Purposes (EAP)
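
As an illustration of what "statistically identified key words" can mean in practice, the sketch below computes a log-likelihood keyness score for a word from its frequency in two corpora. The slides do not name a statistic; log-likelihood is swapped in here as one measure commonly used in corpus linguistics, and the function name and arguments are illustrative.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness of a word across two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total word counts of corpus A and corpus B.
    Higher values mean the word's frequency differs more between corpora.
    """
    total = freq_a + freq_b
    # Expected counts if the word were equally likely in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    if freq_a > 0:
        g2 += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        g2 += freq_b * math.log(freq_b / expected_b)
    return 2 * g2
```

The score is reliable only when both corpora are large, which is the slide's point: discipline-level keyness needs frequency information from large samples.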

  10. Workflow
     • Define the search terms to create the data set(s)
     • Submit a query to the JSTOR API and receive a response
     • Download the data set(s) for one or more of the data views
     • Conduct analysis using SEASR components
     • Create visualizations using SEASR components

  11. Comparing the Data
     • Different data sets:
       • Different searches in JSTOR, varying by:
         • Journal
         • Discipline
         • Dates
       • Compare your own data set with one from JSTOR
     • Use components to analyze or compare the data
       • Calculate differences between sets
       • Extract specific entities (for example, proper nouns)
       • Extract key differences
     • Different data views:
       • Word counts
       • Bigrams
       • Trigrams
       • Quadgrams
       • Key terms
       • References
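
"Calculate differences between sets" could look something like the following Python sketch, which compares two word-count data sets (for instance, your own corpus against one retrieved from JSTOR). The function name, the input shape (word-to-count mappings) and the per-million normalization are assumptions made for illustration; this is not a SEASR component.

```python
from collections import Counter


def compare_frequency_lists(own, reference):
    """Compare two word-count mappings.

    Returns the words found only in `own`, the words found only in
    `reference`, and, for shared words, the difference in frequency
    per million words (own minus reference).
    """
    own_total = sum(own.values())
    ref_total = sum(reference.values())
    only_own = sorted(set(own) - set(reference))
    only_ref = sorted(set(reference) - set(own))
    shared = {
        word: (own[word] / own_total - reference[word] / ref_total) * 1_000_000
        for word in set(own) & set(reference)
    }
    return only_own, only_ref, shared
```

Normalizing to a per-million rate keeps the comparison meaningful when the two data sets differ greatly in size. The same function works for bigram, trigram or quadgram counts if the keys are n-gram tuples instead of single words.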

  12. Visualizing the Data
     • Use the visualization capabilities already in SEASR to display results:
       • Tables
       • Graphs
       • Clustering
       • Dendrograms

  13. Progress
     • Defining what we wanted to do
     • Looking at what is already available
     • Discussions with SEASR folks
     • Producing a shared area for work at UIUC
     • Work on making the JSTOR API accessible
     • Re-defining what we want to do!

  14. Experience
     • SEASR staff very knowledgeable, helpful and responsive
     • Learning curve
       • Easy to do the simple stuff
       • Can see the benefits of building our own components but cannot find the time to learn the skills
     • Difficult to assign time; really need to build it into another project

  15. Any questions / feedback?
     • Contact details
       • michael.krot@jstor.org
       • clare.llewellyn@jstor.org
       • mbod@umich.edu
