Text as Data in the Social Sciences
Introduction to Computing for Complex Systems (Session XVI)
ICPSR – August 11, 2010
Abe Gong
www-personal.umich.edu/~agong

Agenda
Big Picture
The field of NLP
Automated text classification
A census of the political web
Supervised learning, but the computer selects or generates training examples
Blend of supervised and unsupervised learning
In all of these applications, a large degree of control is turned over to the computer.
“Data Mining”
Bad: Re-run statistical models until p < .05
Good: Tap all the data available for patterns and inference
Perl, C++, Java, Ruby…
If you’re going to learn a language, make it Python
Train a classifier to recognize the difference between Twain’s Huck Finn and Stoker’s Dracula.
Get Python here:
Download the script here:
Download the books here:
Assume words are drawn independently, conditional on document class. Infer each document’s class from its words.
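The independence assumption above is the core of a naive Bayes classifier. A minimal sketch in Python follows; it is not the script from the session. The function names (tokenize, train, log_scores, classify), the add-one smoothing, and the four training sentences are all assumptions made for illustration. The sentences are invented stand-ins, not quotations from either book.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip('.,;:!?"\'()[]') for w in text.lower().split())
    return [w for w in words if w]

def train(labeled_docs):
    """labeled_docs: iterable of (label, text) pairs. Returns a model dict."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def log_scores(model, text):
    """Per class: log P(class) + sum over words of log P(word | class)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    words = [w for w in tokenize(text) if w in model["vocab"]]
    scores = {}
    for label, counts in model["word_counts"].items():
        n_words = sum(counts.values())
        score = math.log(model["doc_counts"][label] / total_docs)
        for word in words:
            # Add-one (Laplace) smoothing keeps unseen words from zeroing a class.
            score += math.log((counts[word] + 1) / (n_words + vocab_size))
        scores[label] = score
    return scores

def classify(model, text):
    """Infer the document's class from its words: pick the highest score."""
    scores = log_scores(model, text)
    return max(scores, key=scores.get)

# Invented toy training data (hypothetical, not text from the novels).
training = [
    ("twain", "the raft drifted down the river past the town"),
    ("twain", "we floated on the river all night on the raft"),
    ("stoker", "the count stood in the castle at night"),
    ("stoker", "blood and shadow filled the castle crypt"),
]
model = train(training)
print(classify(model, "the raft on the river"))
print(classify(model, "the castle at night"))
```

Working in log space avoids numeric underflow when a document has thousands of words, which matters once the training data is whole chapters rather than single sentences.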
Reliability
Automated classification is just as accurate and reliable as human classification.
Estimating the gray area
Classifier predictions have known certainty.
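One way to make the gray area concrete is to turn per-class log scores into probabilities and set aside any document whose top probability falls below a cutoff. The sketch below assumes the classifier produces log scores such as log P(class) plus summed log word likelihoods; the function names and the 0.9 threshold are hypothetical choices, not values from the session.

```python
import math

def posterior(log_scores):
    """Normalize per-class log scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtracting the maximum before exponentiating avoids underflow.
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: e / total for c, e in exp_scores.items()}

def label_or_defer(log_scores, threshold=0.9):
    """Return (label, confidence); label is None when the case is in the gray area."""
    probs = posterior(log_scores)
    best = max(probs, key=probs.get)
    confidence = probs[best]
    return (best if confidence >= threshold else None), confidence

# Hypothetical scores for two documents.
clear_case = {"twain": math.log(0.95), "stoker": math.log(0.05)}
gray_case = {"twain": math.log(0.6), "stoker": math.log(0.4)}
print(label_or_defer(clear_case))
print(label_or_defer(gray_case))
```

Documents that come back with a label of None can be routed to a human coder. Because the word-independence assumption tends to push naive Bayes probabilities toward 0 or 1, the threshold is best checked against a hand-labeled sample rather than taken at face value.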