

Text as Data in the Social Sciences. Introduction to Computing for Complex Systems (Session XVI), ICPSR – August 11, 2010. Abe Gong, [email protected], www-personal.umich.edu/~agong. Agenda: Big Picture; The field of NLP; Automated text classification; A census of the political web.

Presentation Transcript

Text as Data in the Social Sciences

Introduction to Computing for Complex Systems (Session XVI)

ICPSR – August 11, 2010

Abe Gong

[email protected]

www-personal.umich.edu/~agong


Agenda

  • Big Picture

  • The field of NLP

  • Automated text classification

  • A census of the political web

Learning modes

Supervised learning

Using a large set of labeled data, the computer learns to mimic humans on some task

Applications

  • Handwriting, speech, and pattern recognition

  • Spam filtering

  • Bioinformatics

Learning modes

Supervised learning

Using a large set of labeled data, the computer learns to mimic humans on some task

Strengths

  • Very flexible

  • Easy to adapt to existing theory

Weaknesses

  • Specifying ontologies can be time-consuming

  • Requires substantial training data

Learning modes

Unsupervised learning

Using raw, unlabeled data, the computer looks for patterns and regularities

Applications

  • Clustering

  • Neural networks

  • Algorithmic stock trading

  • Data-driven marketing
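As one illustration of looking for patterns in unlabeled data, here is a minimal k-means clustering sketch. It is not taken from the talk: the function name `kmeans`, the toy points, and the fixed iteration count are all assumptions chosen for illustration.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters by repeatedly moving centers to cluster means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Toy data (hypothetical): two well-separated groups, no labels supplied.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```

No labels are given to the algorithm; the two groups emerge from the distances alone, which is also why relating the resulting clusters to existing theory takes extra work.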

Learning modes

Unsupervised learning

Using raw, unlabeled data, the computer looks for patterns and regularities

Strengths

  • Does not require labeled data

  • Discovers new patterns

Weaknesses

  • Often difficult to relate to existing theory

Learning modes

Active learning

Supervised learning, but the computer selects or generates training examples

  • Optimal experimental design

  • Performance boost for supervised learning
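One common way for the computer to select training examples is uncertainty sampling: ask a human to label the items the current model is least sure about. The sketch below assumes a binary task and a model that returns a probability; `most_uncertain`, `query_round`, and the `oracle` callback (standing in for a human coder) are hypothetical names, not part of the talk.

```python
def most_uncertain(unlabeled, predict_prob, k=1):
    """Return the k examples whose predicted probability is closest to 0.5."""
    return sorted(unlabeled, key=lambda x: abs(predict_prob(x) - 0.5))[:k]

def query_round(pool, predict_prob, oracle, k=1):
    """Label the k most uncertain examples and remove them from the pool."""
    picks = most_uncertain(pool, predict_prob, k)
    newly_labeled = [(x, oracle(x)) for x in picks]
    remaining = [x for x in pool if x not in picks]
    return newly_labeled, remaining

# Toy usage: examples are numbers, and the "model" treats the number as a probability.
pool = [0.1, 0.48, 0.9, 0.7]
newly_labeled, remaining = query_round(
    pool, predict_prob=lambda x: x, oracle=lambda x: int(x >= 0.5), k=2
)
```

After each round the model would be retrained on all labels gathered so far, so labeling effort goes where it is expected to help most.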

Semi-supervised learning

Blend of supervised and unsupervised learning

  • Algorithmic forecasting, stock trading

  • Topic maps

  • Machine summarization


Data mining

In all of these applications, a large degree of control is turned over to the computer.

  • “Data Mining” is not always a dirty word.

    Bad: Re-run statistical models until p < .05

    Good: Tap all the data available for patterns and inference

Data mining

Google Image Search: “data mining books”

Current applications

Topic tracking and sentiment analysis

Track trends in attention and opinion over time.

http://www.google.com/trends

http://memetracker.org

http://textmap.com

http://www.ccs.neu.edu/home/amislove/twittermood/

Current applications

Data visualization

Clever ways to make data accessible

http://manyeyes.alphaworks.ibm.com/manyeyes

http://flowingdata.com

http://morningside-analytics.com

Current applications

Machine translation

Translate text from one language to another.

http://babelfish.yahoo.com/

Machine summarization

Summarize the most important points from a document or group of related documents.

http://newsblaster.cs.columbia.edu/

http://www.newsinessence.com/

Current applications

Miscellaneous

  • Language detection http://www.google.com/uds/samples/language/detect.html

  • Part-of-speech tagging

  • Word-sense disambiguation

  • Probabilistic parsing

  • Spell checking

  • Grammar checking

  • Spam filtering

Data sources

  • Speeches

  • Legislation

  • Amendments

  • Hearings

  • Rules

  • Floor debate

  • Public comments

  • Judicial opinions

  • Legal Briefs

  • Party Manifestos

  • Media coverage

  • Blogs

  • Treaties

  • Reports

  • Anything on the public record…

Data sources

http://bulk.resource.org/


Software

Two options

  • Out-of-the-box software

    • Nice for getting started

    • Methodology is constrained

    • Lags the development curve

  • Build it yourself

    • High overhead

    • Requires skill development

    • Extremely flexible

       Make sure to use existing libraries!

Software

Ex: Provalis WordStat

  • Out of Box, Plug and Play

  • Software Package Developed by Provalis

    • http://www.provalisresearch.com/

  • Booth at Midwest & APSA -- 2008, 2009

  • The Full Package: WordStat, QDA Miner, SimStat

Software

Programming languages

Perl, C++, Java, Ruby…

Python

If you’re going to learn a language, make it Python.

  • Free, open source

  • Intuitive syntax

  • Enormous code and user base

  • Well-documented, with excellent references

  • Multiplatform, mature distribution

  • Strong NLP capability

    • Ex: nltk, lxml, numpy, scipy, scikits libraries

5-minute demo

Train a classifier to recognize the difference between Twain’s Huck Finn and Stoker’s Dracula.

Get Python here:

http://www.python.org/download/

Download the script here:

http://www-personal.umich.edu/~agong/temp/text_classifier_demo.zip

Download the books here:

http://www.gutenberg.org/files/32325/32325-h/32325-h.htm

http://www.gutenberg.org/files/345/345-h/345-h.htm
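The demo script itself is not reproduced in this transcript. As a rough sketch of the preparation such a demo needs, the code below turns two long texts into labeled passages and holds some out for testing. Everything here is an assumption for illustration: the function names, the 200-word passage size, the 20% test fraction, and the crude regex tag stripping (a real HTML parser such as lxml would be more robust).

```python
import random
import re

def strip_html(html):
    """Crudely remove tags; the book files linked above are HTML."""
    return re.sub(r"<[^>]+>", " ", html)

def passages(text, words_per_passage=200):
    """Cut a long text into consecutive passages of a fixed number of words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

def labeled_split(texts_by_label, words_per_passage=200, test_fraction=0.2, seed=0):
    """Return (train, test) lists of (passage, label) pairs."""
    examples = [
        (passage, label)
        for label, text in texts_by_label.items()
        for passage in passages(strip_html(text), words_per_passage)
    ]
    random.Random(seed).shuffle(examples)  # mix the two books before splitting
    n_test = int(len(examples) * test_fraction)
    return examples[n_test:], examples[:n_test]

# Placeholder strings stand in for the downloaded book files.
books = {"twain": "<p>" + "raft " * 1000 + "</p>", "stoker": "<p>" + "castle " * 1000 + "</p>"}
train, test = labeled_split(books)
```

A classifier trained on `train` can then be scored on `test`, which it has never seen.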

Automated text classification


Terminology

Goal: Sort documents into predefined categories, based on their text.

  • Task

  • Document

  • Corpus

  • Token

  • Feature

  • Feature string

  • Feature vector

  • Bag-of-words classifiers
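To make the vocabulary above concrete, here is a small sketch that takes a two-document corpus from raw text to tokens to bag-of-words feature vectors. The tokenizing rule (lowercase runs of letters and apostrophes) and the example sentences are assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(document):
    """Split a document into lowercase word tokens."""
    return re.findall(r"[a-z']+", document.lower())

def build_vocabulary(corpus):
    """Map each distinct token in the corpus to a column index."""
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def feature_vector(document, vocabulary):
    """Bag of words: one count per vocabulary token; word order is discarded."""
    counts = Counter(tokenize(document))
    return [counts.get(tok, 0) for tok in vocabulary]

corpus = ["The senator voted for the bill.", "The vampire left the castle."]
vocabulary = build_vocabulary(corpus)
vectors = [feature_vector(doc, vocabulary) for doc in corpus]
```

Each document becomes a vector with one entry per vocabulary word, which is the input format the classifiers discussed next expect.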

Algorithms and estimators

Naïve Bayes Classifiers

Assume words are drawn independently, conditional on document class. Infer each document’s class from its words.

Strengths

  • Clear statistical foundation

  • Fast to train and implement

  • Lightweight

Weaknesses

  • Noticeably less effective than other approaches

  • Statistical foundation is based on false assumptions
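A minimal multinomial Naïve Bayes sketch follows. The add-one (Laplace) smoothing, the class name `NaiveBayes`, and the toy training sentences are assumptions added for illustration; the talk does not specify these details.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, document):
        """Log prior plus summed log word likelihoods, per class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)
            for tok in tokenize(document):
                if tok in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, document):
        scores = self.log_scores(document)
        return max(scores, key=scores.get)

docs = ["the raft drifted down the river", "huck and jim slept on the raft",
        "the count rose from his coffin", "the vampire feared the crucifix"]
labels = ["twain", "twain", "stoker", "stoker"]
model = NaiveBayes().fit(docs, labels)
```

Summing log probabilities word by word is exactly where the independence assumption enters: each word contributes on its own, regardless of its neighbors.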

Algorithms and estimators

Support Vector Machines (SVM)

Vectorize documents, then find the maximum-margin separating hyperplane.

Strengths

  • High accuracy

  • Intuitive explanation

  • Work with little training data

Weaknesses

  • No explicit statistical foundation

  • Training is slow with large data sets

Algorithms and estimators

Logistic regression

Maximum likelihood estimator

Algorithms and estimators

Decision Trees

Like playing 20 questions.

Strengths

  • Able to capture subtle details

Weaknesses

  • Require large amounts of training data

  • Classification is often “brittle”


Evaluation

Percent agreement

Precision

Recall

F-measure

Cohen’s kappa

Krippendorff’s alpha
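Several of these measures are short enough to compute by hand. The sketch below covers percent agreement, precision, recall, the balanced F-measure (F1, assumed here as the variant meant), and Cohen's kappa for two coders. Krippendorff's alpha is omitted. The function names and the toy codings are illustrative assumptions.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which two coders gave the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def precision_recall_f1(truth, predicted, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(truth, predicted))
    fp = sum(t != positive and p == positive for t, p in zip(truth, predicted))
    fn = sum(t == positive and p != positive for t, p in zip(truth, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def cohens_kappa(a, b):
    """Agreement between two coders, corrected for agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

human = [1, 1, 1, 0, 0, 0, 1, 0]
machine = [1, 1, 0, 0, 0, 1, 1, 0]
agreement = percent_agreement(human, machine)          # 0.75
precision, recall, f1 = precision_recall_f1(human, machine)
kappa = cohens_kappa(human, machine)                    # 0.5
```

In this toy case the coders agree on 75% of items, but half of that agreement is expected by chance alone, so kappa is only 0.5.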


Motivation

Why study politics online?

  • Impact of new technology on politics

    • Barack Obama did 60% of his record-breaking fundraising online

    • Trent Lott, Dan Rather, Howard Dean

  • New data on age-old political behavior

    • Examples to follow shortly

Goal: A complete census of the political web


Web sites v. web pages

Web site: http://domain

Web page: http://domain/path

Examples (3 sites and 1 page)

http://www.yahoo.com

http://www.yahoo.com/politics

http://www.dailykos.com

http://abegong.dailykos.com
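The site/page distinction can be applied mechanically with Python's standard `urllib.parse`. In this sketch the helper names `site_of` and `is_site_root` are hypothetical, and treating each distinct hostname (including subdomains) as its own site is an assumption that matches the examples above.

```python
from urllib.parse import urlparse

def site_of(url):
    """Return the web site (scheme plus domain) that a URL belongs to."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc.lower()}"

def is_site_root(url):
    """True if the URL names a site itself rather than a page within it."""
    return urlparse(url).path in ("", "/")

urls = [
    "http://www.yahoo.com",
    "http://www.yahoo.com/politics",
    "http://www.dailykos.com",
    "http://abegong.dailykos.com",
]
sites = sorted({site_of(u) for u in urls})
pages = [u for u in urls if not is_site_root(u)]
```

Run on the four example URLs, this yields three distinct sites and one page, as in the slide.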

Why web sites?


Automated snowball census

  • Train an automated text classifier to recognize political content.

  • Start from a seed batch of political sites.

  • Download and classify each site in the batch.

  • For political sites:

    • Harvest all outbound hyperlinks.

    • Add previously unvisited links to the next batch.

  • Repeat until no new links are found.
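The loop described above can be sketched as follows. This is not the author's crawler: `fetch` and `is_political` are hypothetical callbacks (downloading and classifying are left to the caller), and the toy "web" is an in-memory dictionary so the example runs without network access.

```python
def snowball_census(seed_sites, fetch, is_political):
    """Crawl outward from seed sites, following links only from political sites.

    fetch(site) returns (text, outbound_links), or None if the site is unreachable.
    is_political(text) returns True or False.
    """
    visited, political = set(), set()
    batch = set(seed_sites)
    while batch:  # stop when a batch turns up no new links
        next_batch = set()
        for site in batch:
            visited.add(site)
            result = fetch(site)
            if result is None:
                continue
            text, links = result
            if is_political(text):
                political.add(site)
                # Harvest outbound links; queue only sites not already seen or queued.
                for link in links:
                    if link not in visited and link not in batch:
                        next_batch.add(link)
        batch = next_batch
    return political, visited

# Toy web (hypothetical): site -> (text, outbound links).
toy_web = {
    "a.org": ("vote election", ["b.org", "c.com"]),
    "b.org": ("senate vote", ["a.org", "d.net"]),
    "c.com": ("cat pictures", ["e.com"]),
    "d.net": ("campaign vote", []),
    "e.com": ("vote now", []),
}
political, visited = snowball_census(
    ["a.org"], fetch=toy_web.get, is_political=lambda text: "vote" in text
)
```

Note that `e.com` is never visited: it is linked only from a non-political site, so political sites that are reachable only through non-political ones are missed by this design.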

Evaluation

How can we know if the automated classifier is working properly?

The same way we know if a human coder is working properly: compare its coding with that of other coders.

  • Hand-code a training set (n=1,000 x 1)

  • Train the classifier

  • Hand-code a testing set (n=200 x 4)

  • Compare results

    • Human-human

    • Human-computer

Coding protocol


Reliability

Human-human coding

Ordinal Krippendorff’s alpha: .733

  • Even with minimal training, our shared definition of political content is quite strong.

    Sites in the gray area: www.msnbc.com, www.rff.org, …

Training a text classifier

Prob(political) ≈ logit⁻¹(α + βX)

X = Vector of word counts

α = Bias term

β = Word weights

  • Maximum likelihood estimator

  • Asymptotically unbiased

  • Asymptotically efficient
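A small sketch of fitting this kind of model: logistic regression on word counts, with α and β estimated by gradient ascent on the log-likelihood. The optimizer, learning rate, step count, and the four-word toy vocabulary are assumptions for illustration; the talk does not say how the estimator was computed.

```python
import math

def sigmoid(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_prob(alpha, beta, x):
    """Prob(political) for one document with word-count vector x."""
    return sigmoid(alpha + sum(b * xi for b, xi in zip(beta, x)))

def fit_logistic(X, y, learning_rate=0.1, steps=2000):
    """Estimate the bias term alpha and word weights beta by gradient ascent."""
    alpha, beta = 0.0, [0.0] * len(X[0])
    for _ in range(steps):
        grad_alpha, grad_beta = 0.0, [0.0] * len(beta)
        for x, label in zip(X, y):
            error = label - predict_prob(alpha, beta, x)  # gradient of the log-likelihood
            grad_alpha += error
            for j, xj in enumerate(x):
                grad_beta[j] += error * xj
        alpha += learning_rate * grad_alpha / len(X)
        beta = [b + learning_rate * g / len(X) for b, g in zip(beta, grad_beta)]
    return alpha, beta

# Columns are counts of the (hypothetical) words: vote, senate, recipe, garden.
X = [[3, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 3]]
y = [1, 1, 0, 0]  # 1 = political, 0 = not political
alpha, beta = fit_logistic(X, y)
```

The step count is capped because this toy data is perfectly separable, in which case the maximum likelihood weights grow without bound; real corpora are rarely that clean, and regularization is a common safeguard.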

Thresholds

Threshold: .4

Precision: [.95]

Recall: [.90]
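The threshold turns predicted probabilities into yes/no codes, and moving it trades precision against recall. The sketch below shows that trade-off on made-up numbers; the probabilities, the true codes, and the function names are illustrative assumptions, not the data behind the figures above.

```python
def classify(probabilities, threshold=0.4):
    """Code a site as political (1) when its probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(truth, predicted):
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, predicted))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, predicted))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def sweep_thresholds(probabilities, truth, thresholds):
    """Precision and recall at each candidate threshold."""
    return {t: precision_recall(truth, classify(probabilities, t)) for t in thresholds}

# Hypothetical classifier outputs and hand codes for six sites.
probs = [0.95, 0.80, 0.45, 0.42, 0.30, 0.10]
truth = [1, 1, 1, 0, 0, 0]
results = sweep_thresholds(probs, truth, [0.4, 0.5])
```

Here the lower threshold catches every political site at the cost of one false positive, while the higher threshold makes no false positives but misses one political site.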

Results

Runtime: 23 hrs

Hard drive: 120 GB

Sites visited: 1.8 million

Political sites: 650,000

Est. false positives: 112,000

Est. false negatives: 60,000

Limitations

  • Stability across time

    • Is the political web today the same as the web last year?

  • Clutter

    • Advertising, spam, etc.

  • Private sites

    • Password protection: Facebook, MySpace, Twitter

  • Improved classifier

    • Other predictors of political-ness (esp. links)

Uses

  • Survey

  • Content analysis

    • By author

    • Over time

    • In panel

  • Network analysis


Are estimates really unbiased?


Estimating the gray area

Allows us to estimate the gray area in our definition.

Classifier predictions have known certainty.

Sentiment analysis

http://textmap.org/

Abe Gong - Evaluating text classifiers and text generators


The Bayesian approach to content coding

[Figure: density plots for “An easy task” and “A hard task”; x-axis: Prob(X|f); y-axis: Density]


Markov text generation

  • Intuition

  • Applications

    • Data compression

    • Telecommunications

    • Cryptography

      Example:

      http://www.in-vacua.com/markov_gen.html
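The intuition can be shown in a few lines: record which words follow each word (or run of words) in a source text, then generate new text by repeatedly sampling a follower of the current state. This is a generic sketch, not the code behind the linked example; the function names, the `order` parameter, and the sample sentence are assumptions.

```python
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, length=20, seed=None):
    """Random walk through the chain, starting from a randomly chosen state."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:  # dead end: this state was never followed by anything
            break
        output.append(rng.choice(followers))  # frequent followers are picked more often
        state = tuple(output[-len(state):])
    return " ".join(output)

source = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(source)
generated = generate(chain, length=12, seed=0)
```

Every adjacent word pair in the output also occurs in the source, which is why the result looks locally plausible even though the whole is not meaningful. A higher `order` makes the output closer to the source text.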
