Text Mining


  1. Text Mining with R and the tm package

  2. Agenda
     • Motivation
     • Preliminaries
     • Operations
     • Demo
     • Thoughts
     • System prerequisites
     • Resources
     • References

  3. Motivation
     • Exciting new possibilities to deal with unstructured text data (tweets, news articles/feeds, customer complaints)
     • Research categories:
       • Machine learning
       • Data mining
       • Sentiment analysis
       • ...

  4. Preliminaries
     • Some terminology
       • Document
       • Corpus
       • Term document matrix
       • Dissimilarity matrix
     • We will see some of these in the demo

  5. Typical TM Operations
     • Import
     • Preprocessing
       • Stop words
       • White space
       • Punctuation
       • (to) Lower case
       • Numeric removal
       • ... Other “mappings”
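The import and preprocessing steps listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the demo from the talk: the three documents are made up, and the calls assume a recent tm release (0.6 or later), where `content_transformer()` is available. The 0.5-x series cited in the references applied `tolower` through `tm_map()` directly. Lower-casing runs before stop word removal because the built-in English stop word list is lower case.

```r
library(tm)

# Hypothetical toy documents; the data used in the original demo is not
# part of the transcript.
texts <- c(
  "The 3 cats sat on the mat!",
  "Dogs   chase the cats, often.",
  "Customer complaint #42: the mat is dirty."
)

# Import: build a corpus from a character vector
corpus <- VCorpus(VectorSource(texts))

# Preprocessing: each step is a "mapping" applied to every document
corpus <- tm_map(corpus, content_transformer(tolower))       # (to) lower case
corpus <- tm_map(corpus, removePunctuation)                  # punctuation
corpus <- tm_map(corpus, removeNumbers)                      # numeric removal
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop words
corpus <- tm_map(corpus, stripWhitespace)                    # white space

# Inspect the cleaned text of each document
cleaned <- sapply(corpus, as.character)
print(cleaned)
```

Other mappings, such as stemming with `stemDocument()`, would slot into the same `tm_map()` chain.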

  6. Typical TM Operations (cont’d)
     • Metadata management
       • per document
       • per corpus
     • Term document matrix preparation
     • Distance/nearness calculations
     • Plotting
     • ...
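A rough sketch of the remaining operations, again on made-up documents and assuming a recent tm release. The distance step here uses base R's `dist()` (Euclidean by default) on the transposed term document matrix as a stand-in; it is not necessarily the dissimilarity function used in the original demo. The metadata tags and values are hypothetical.

```r
library(tm)

# Hypothetical toy documents
texts <- c("cats chase mice", "dogs chase cats", "customers file complaints")
corpus <- VCorpus(VectorSource(texts))

# Metadata management: per document and per corpus (tags are illustrative)
meta(corpus[[1]], tag = "author") <- "A. Example"
meta(corpus, tag = "source", type = "corpus") <- "toy example"

# Term document matrix preparation: rows are terms, columns are documents
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)

# Distance/nearness calculations between documents (columns of m)
d <- dist(t(m))
print(d)

# Plotting: hierarchical clustering dendrogram of the documents
hc <- hclust(d)
plot(hc)
```

Documents 1 and 2 share two of their three terms, so they should come out closer to each other than either is to document 3.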

  7. DEMO

  8. Thoughts
     • Package documentation
     • Overlap/misalignment with other packages
     • Integration with “big data” facilities

  9. System Prerequisites
     • Suggested
       • Weka (for lazy classifiers)
       • GraphViz (for plot())
       • Snowball (for stemDocument())
       • Seriation (for dissplot())
     • Optional
       • Antiword (to read Word documents)
       • pdftotext (to read PDF documents)
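One way to check which of these prerequisites are present on a machine is sketched below, using base R only. The R package names are assumptions mapped from the slide's list (for example, `RWeka` for Weka and `SnowballC` for Snowball); the names that applied at the time of the talk may differ.

```r
# Assumed R package names for the suggested components; adjust as needed.
suggested <- c("RWeka", "Rgraphviz", "SnowballC", "seriation")

# TRUE where the package can be loaded, FALSE otherwise
installed <- vapply(suggested, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# External command-line tools: an empty string means "not found on PATH"
tools <- Sys.which(c("antiword", "pdftotext"))
print(tools)
```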

  10. Resources
     • Antiword: http://www.winfield.demon.nl/
     • pdftotext: poppler.freedesktop.org
     • Rgraphviz: http://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html
     • Seriation: http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=seriation:dissplot
     • Weka: http://sourceforge.net/projects/weka/

  11. References
     • Ingo Feinerer (2012). tm: Text Mining Package. R package version 0.5-7.1.
     • Jeff Gentry, Li Long, Robert Gentleman, Seth, Florian Hahne, Deepayan Sarkar and Kasper Hansen (). Rgraphviz: Provides plotting capabilities for R graph objects. R package version 1.32.0.
     • Ingo Feinerer. An introduction to text mining in R. R News, 8(2):19-22, October 2008.
     • Ingo Feinerer, Kurt Hornik, and David Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008.

  12. Contact Information
     • Kent Manley
     • GMU STAT 763, Spring 2012
