Text Mining

Presentation Transcript

  1. Text Mining with R and the tm package

  2. Agenda
     • Motivation
     • Preliminaries
     • Operations
     • Demo
     • Thoughts
     • System prerequisites
     • Resources
     • References

  3. Motivation
     • Exciting new possibilities to deal with unstructured text data (tweets, news articles/feeds, customer complaints)
     • Research categories:
         • Machine learning
         • Data mining
         • Sentiment analysis
         • ...

  4. Preliminaries
     • Some terminology
         • Document
         • Corpus
         • Term document matrix
         • Dissimilarity matrix
     • We will see some of these in the demo
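
The four terms above can be made concrete with a minimal sketch. It assumes a recent tm release (0.6 or later); the deck cites version 0.5-7.1, where some function names may differ. The three sample sentences are made up for illustration, and base R's dist() stands in for whatever distance routine the demo used.

```r
library(tm)  # assumption: tm 0.6 or later is installed

# Each character string is one document (made-up sample text).
texts <- c("text mining finds patterns in text",
           "r makes text mining approachable",
           "customer complaints arrive as free text")

# A corpus is a collection of documents.
corpus <- VCorpus(VectorSource(texts))

# A term-document matrix has one row per term and one column per document;
# each cell counts how often the term occurs in that document.
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)

# A dissimilarity matrix holds pairwise distances between documents.
# dist() compares rows, so transpose to put the documents in rows.
d <- dist(t(m), method = "euclidean")

print(m)
print(d)
```

Note that the default term-frequency settings drop terms shorter than three characters, so short words such as "in" and "as" do not appear as rows.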

  5. Typical TM operations
     • Import
     • Preprocessing
         • Stop words
         • White space
         • Punctuation
         • (to) Lower case
         • Numeric removal
         • ... Other “mappings”
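
The import and preprocessing steps listed above might look like the following sketch. It assumes tm 0.6 or later, where content_transformer() wraps plain string functions; the input sentence is invented, and the order of the mappings is one reasonable choice rather than a prescribed one.

```r
library(tm)  # assumption: tm 0.6 or later is installed

# Import: VectorSource() reads documents from a character vector.
# (DirSource() would read them from a directory of files instead.)
raw <- "  The 2 QUICK brown foxes,   and the lazy dog!  "
corpus <- VCorpus(VectorSource(raw))

# Preprocessing: each tm_map() call applies one mapping to every document.
corpus <- tm_map(corpus, content_transformer(tolower))       # (to) lower case
corpus <- tm_map(corpus, removeNumbers)                      # numeric removal
corpus <- tm_map(corpus, removePunctuation)                  # punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stop words
corpus <- tm_map(corpus, stripWhitespace)                    # white space

# stripWhitespace collapses runs of spaces to one; trimws() drops the ends.
cleaned <- trimws(as.character(corpus[[1]]))
print(cleaned)
```

Lower-casing comes before stop word removal here because the built-in English stop word list is lower case.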

  6. Typical TM Operations (cont’d)
     • Metadata management
         • per document
         • per corpus
     • Term document matrix preparation
     • Distance/nearness calculations
     • Plotting
     • ...
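
A rough sketch of these operations follows, again assuming tm 0.6 or later. The sample texts and the metadata value are made up. For plotting, a base R dendrogram from hclust() is swapped in for the Rgraphviz and seriation plots named in the deck, so that the example needs no extra packages.

```r
library(tm)  # assumption: tm 0.6 or later is installed

texts <- c("stocks fall as markets react",
           "markets rally and stocks rise",
           "new recipe for apple pie")
corpus <- VCorpus(VectorSource(texts))

# Metadata per document: attach a tag to the first document.
meta(corpus[[1]], tag = "comment") <- "A short comment."  # made-up value
print(meta(corpus[[1]], tag = "comment"))

# Metadata per corpus: inspect the corpus-level tags.
print(meta(corpus, type = "corpus"))

# Term-document matrix preparation.
tdm <- TermDocumentMatrix(corpus)
frequent <- findFreqTerms(tdm, lowfreq = 2)  # terms occurring at least twice

# Distance calculation between documents (Euclidean, via base R).
d <- dist(t(as.matrix(tdm)))
dm <- as.matrix(d)

# Plotting: a hierarchical clustering dendrogram of the documents.
hc <- hclust(d)
plot(hc, main = "Document clustering (illustrative)")
```

With these sample texts the two finance sentences share terms and so sit closer to each other than either does to the recipe sentence.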

  7. DEMO

  8. Thoughts
     • Package documentation
     • Overlap/misalignment with other packages
     • Integration with “big data” facilities

  9. System Prerequisites
     • Suggested
         • Weka (for lazy classifiers)
         • GraphViz (for plot())
         • Snowball (for stemDocument())
         • Seriation (for dissplot())
     • Optional
         • Antiword (to read Word documents)
         • pdftotext (to read PDF documents)
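
One way to check which of these prerequisites are present on a machine is sketched below, using base R only. The R package names are assumptions about how the listed tools are packaged today (for example, SnowballC as the stemming backend and RWeka as the Weka interface) and may not match the 2012 setup; the command names antiword and pdftotext are taken from the slide.

```r
# Assumed R package names for the suggested components.
pkgs <- c("tm", "SnowballC", "Rgraphviz", "seriation", "RWeka")

# requireNamespace() returns TRUE if the package can be loaded, else FALSE.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Sys.which() returns the full path of each command, or "" if not on the PATH.
tools <- Sys.which(c("antiword", "pdftotext"))

print(installed)
print(tools)
```

A FALSE entry or an empty path only means the related feature (stemming, graph plots, Word or PDF import) will be unavailable; the core tm functions do not need them.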

  10. Resources
     • Antiword: http://www.winfield.demon.nl/
     • pdftotext: poppler.freedesktop.org
     • Rgraphviz: http://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html
     • Seriation: http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=seriation:dissplot
     • Weka: http://sourceforge.net/projects/weka/

  11. References
     • Ingo Feinerer (2012). tm: Text Mining Package. R package version 0.5-7.1.
     • Jeff Gentry, Li Long, Robert Gentleman, Seth Falcon, Florian Hahne, Deepayan Sarkar and Kasper Hansen. Rgraphviz: Provides plotting capabilities for R graph objects. R package version 1.32.0.
     • Ingo Feinerer. An introduction to text mining in R. R News, 8(2):19-22, October 2008.
     • Ingo Feinerer, Kurt Hornik, and David Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008.

  12. Contact Information
     • Kent Manley
     • GMU STAT 763, Spring 2012