
Data Mining at Duke

(“What to do with all of those hard drives”)
Molly Tamarkin, Associate University Librarian for Information Technology Services
Joel Herndon, Head, Data & GIS Services



  1. Data Mining at Duke (“What to do with all of those hard drives”) • Molly Tamarkin, Associate University Librarian for Information Technology Services • Joel Herndon, Head, Data & GIS Services

  2. Today’s Talk • Rise of text analysis questions • Challenges in providing text analysis services • Duke University Libraries’ response

  3. Brandaleone Center for Data and GIS Services

  4. The Rise of Text as Data

  5. New Questions for Research Libraries • How has the North American press covered environmental issues over the last 20 years? • Can we analyze all (17,000) journal articles on German studies in the 20th century? • What might tweets reveal about the Arab Spring in social media?

  6. http://sites.duke.edu/digital/

  7. Challenges in Providing Text Analysis Services

  8. Challenges • Collections • Licensing • Infrastructure • Service model

  9. Open (or mostly open) Access

  10. Licensing http://chronicle.com/article/Hot-Type-Elsevier-Experiments/131789/

  11. “We found some … text mining in fields such as biomedical sciences and chemistry and some early adoption within the social sciences and humanities… however… most text mining in UKFHE is based on Open Access documents or bespoke arrangements.” – key findings (p.2) http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx

  12. Licensing

  13. Photo from editorsweblog.org

  14. ECCO Project

  15. “Big Data” ~63 Drives ~63 terabytes >40 Topics

  16. Gale Backup Drive Collection

  17. Infrastructure

  18. Six Methods of Text Analysis • Reading • Counting Words • Human Coding (researchers coding events/texts) • Dictionary Methods (sentiment analysis) • Supervised machine learning (using corpora) • Unsupervised Machine Learning (topic modeling) http://aeshin.org/textmining/ http://dx.doi.org/10.1111/j.1540-5907.2009.00427.x
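Two of the methods on this slide, counting words and dictionary methods, can be illustrated with a minimal Python sketch. This is not the presenters' code. The word lists `POSITIVE` and `NEGATIVE`, the function names, and the sample sentence are all hypothetical and exist only for illustration; real dictionary-based sentiment analysis relies on curated lexicons.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, for illustration only (an assumption, not from
# the slides); real dictionary methods use curated word lists.
POSITIVE = {"good", "great", "progress", "clean"}
NEGATIVE = {"bad", "crisis", "pollution", "decline"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_words(text):
    """Counting words: return a Counter mapping each token to its frequency."""
    return Counter(tokenize(text))


def dictionary_score(text):
    """Dictionary method: (positive hits - negative hits) / total tokens.

    Returns 0.0 for text with no tokens.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return (pos - neg) / len(tokens)


if __name__ == "__main__":
    sample = "Pollution is bad, but clean energy shows great progress."
    print(count_words(sample).most_common(3))
    print(dictionary_score(sample))
```

The same two functions could be mapped over every article in a corpus, for example to track how a score changes by year of publication. The supervised and unsupervised methods on the slide need labeled data or a modeling library and are not shown here.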

  19. Infrastructure Issues • Storage/ scratch space • Processing power • Tools for analytics

  20. Our Workstations • 16 GB of memory • 1 TB of storage • 64-bit computing • Intel Xeon 3.5 GHz, 4-core • Scanner available • Fast networking

  21. Swappable Drives?

  22. General Software

  23. Specialized Software

  24. Service Model

  25. Services - Staffing

  26. Expert on Visualization

  27. Services - Staffing http://aeshin.org/textmining/

  28. Services – Guides http://library.duke.edu/data/guides/index.html

  29. Services – Workshops

  30. In Summary • Lots of research potential • Licensing may be an issue for some • An easy way to get started with text mining with little investment, but maybe some risk?

  31. Questions? Joel Herndon – joel.herndon@duke.edu Molly Tamarkin – tamarkin@duke.edu
