Text as Data in the Social Sciences Introduction to Computing for Complex Systems (Session XVI) ICPSR – August 11, 2010. Abe Gong [email protected] www-personal.umich.edu/~agong. Big Picture The field of NLP Automated text classification A census of the political web. Agenda.

Presentation Transcript
slide1

Text as Data in the Social Sciences

Introduction to Computing for Complex Systems (Session XVI)

ICPSR – August 11, 2010

Abe Gong

[email protected]

www-personal.umich.edu/~agong

agenda

Big Picture

  • The field of NLP
  • Automated text classification
  • A census of the political web
Agenda
learning modes

Supervised learning

Using a large set of labeled data, the computer learns to mimic humans on some task

Applications

  • Handwriting, speech, and pattern recognition
  • Spam filtering
  • Bioinformatics
Learning Modes
learning modes1

Supervised learning

Using a large set of labeled data, the computer learns to mimic humans on some task

Strengths

  • Very flexible
  • Easy to adapt to existing theory

Weaknesses

  • Specifying ontologies can be time-consuming
  • Requires substantial training data
Learning Modes
learning modes2

Unsupervised learning

Using raw, unlabeled data, the computer looks for patterns and regularities

Applications

  • Clustering
  • Neural networks
  • Algorithmic stock trading
  • Data-driven marketing
Learning Modes
learning modes3

Unsupervised learning

Using raw, unlabeled data, the computer looks for patterns and regularities

Strengths

  • Does not require labeled data
  • Discovers new patterns

Weaknesses

  • Often difficult to relate to existing theory
Learning Modes
learning modes4

Active learning

Supervised learning, but the computer selects or generates training examples

  • Optimal experimental design
  • Performance boost for supervised learning

Semi-supervised learning

Blend of supervised and unsupervised learning

  • Algorithmic forecasting, stock trading
  • Topic maps
  • Machine summarization
Learning Modes
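One common way for the computer to select its own training examples is uncertainty sampling: ask a human to label the items the current classifier is least sure about. The slides do not name a selection strategy, so this choice, and the names `most_uncertain` and `predict_proba`, are assumptions made for illustration.

```python
def most_uncertain(unlabeled, predict_proba, k=1):
    """Return the k unlabeled items whose predicted probability is closest to 0.5.

    predict_proba is a hypothetical callable mapping an item to the current
    classifier's probability that the item belongs to the positive class.
    """
    return sorted(unlabeled, key=lambda item: abs(predict_proba(item) - 0.5))[:k]
```

The selected items would then be hand-labeled and added to the training set before the classifier is retrained.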
data mining

In all of these applications, a large degree of control is turned over to the computer.

  • “Data Mining” is not always a dirty word.

Bad: Re-run statistical models until p < .05

Good: Tap all the data available for patterns and inference

“Data Mining”
current applications

Topic tracking and sentiment analysis

Track trends in attention and opinion over time.

http://www.google.com/trends

http://memetracker.org

http://textmap.com

http://www.ccs.neu.edu/home/amislove/twittermood/

Current applications
current applications1

Data visualization

Clever ways to make data accessible

http://manyeyes.alphaworks.ibm.com/manyeyes/

http://flowingdata.com

http://morningside-analytics.com

Current applications
current applications2

Machine translation

Translate text from one language to another.

http://babelfish.yahoo.com/

Machine summarization

Summarize the most important points from a document or group of related documents.

http://newsblaster.cs.columbia.edu/

http://www.newsinessence.com/

Current applications
current applications3

Miscellaneous

  • Language detection http://www.google.com/uds/samples/language/detect.html
  • Part-of-speech tagging
  • Word-sense disambiguation
  • Probabilistic parsing
  • Spell checking
  • Grammar checking
  • Spam filtering
Current applications
data sources

Speeches

  • Legislation
  • Amendments
  • Hearings
  • Rules
  • Floor debate
  • Public comments
  • Judicial opinions
  • Legal Briefs
  • Party Manifestos
  • Media coverage
  • Blogs
  • Treaties
  • Reports
  • Anything on the public record…
Data sources
software

Two options

  • Out-of-the-box software
    • Nice for getting started
    • Methodology is constrained
    • Lags the development curve
  • Build it yourself
    • High overhead
    • Requires skill development
    • Extremely flexible

 Make sure to use existing libraries!

Software
software1

Ex: Provalis WordStat

  • Out of Box, Plug and Play
  • Software Package Developed by Provalis
    • http://www.provalisresearch.com/
  • Booth at Midwest & APSA -- 2008, 2009
  • The Full Package: WordStat, QDA Miner, SimStat
Software
software2

Programming languages

Perl, C++, Java, Ruby…

Python

If you’re going to learn a language, make it Python

  • Free, open source
  • Intuitive syntax
  • Enormous code and user base
  • Well-documented, with excellent references
  • Multiplatform, mature distribution
  • Strong NLP capability
    • Ex: nltk, lxml, numpy, scipy, scikits libraries
Software
slide27

5-minute demo

Train a classifier to recognize the difference between Twain’s Huck Finn and Stoker’s Dracula.

Get python here:

http://www.python.org/download/

Download the script here:

http://www-personal.umich.edu/~agong/temp/text_classifier_demo.zip

Download the books here:

http://www.gutenberg.org/files/32325/32325-h/32325-h.htm

http://www.gutenberg.org/files/345/345-h/345-h.htm

Demo
terminology

Goal: Sort documents into predefined categories, based on their text.

  • Task
  • Document
  • Corpus
  • Token
  • Feature
  • Feature string
  • Feature vector
  • Bag-of-words classifiers
Terminology
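A minimal sketch of how these terms fit together, assuming a very simple lowercase regex tokenizer. The function names are illustrative and do not come from the slides. Each document in a corpus is split into tokens, each distinct token becomes a feature, and a bag-of-words feature vector records how often each feature occurs, ignoring word order.

```python
import re
from collections import Counter

def tokenize(document):
    """Split a document into lowercase word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z']+", document.lower())

def build_vocabulary(corpus):
    """Map each distinct token in the corpus to a column index, in sorted order."""
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def feature_vector(document, vocabulary):
    """Bag-of-words: count each vocabulary token in the document."""
    counts = Counter(tokenize(document))
    # Iterating a dict yields keys in insertion order, i.e. column order.
    return [counts.get(tok, 0) for tok in vocabulary]
```

Words that are not in the vocabulary are simply dropped from the vector.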
algorithms and estimators

Naïve Bayes Classifiers

Assume words are drawn independently, conditional on document class. Infer each document’s class from its words.

Strengths

  • Clear statistical foundation
  • Fast to train and implement
  • Lightweight

Weaknesses

  • Noticeably less effective than other approaches
  • Statistical foundation is based on false assumptions
Algorithms and Estimators
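The independence assumption above can be sketched as a small multinomial Naïve Bayes classifier. This is not the demo script referenced earlier. The tokenizer, the add-one (Laplace) smoothing, and all function names are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Assumed simple tokenizer: lowercase alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, label) pairs. Returns a model tuple."""
    class_docs = Counter()   # number of training documents per class
    word_counts = {}         # per-class token counts
    vocab = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab

def classify(text, model):
    """Pick the class with the highest log prior plus summed log word likelihoods."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(class_docs[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.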
algorithms and estimators1

Support Vector Machines (SVM)

Vectorize documents, then find the maximum-margin separating hyperplane.

Strengths

  • High accuracy
  • Intuitive explanation
  • Work with little training data

Weaknesses

  • No explicit statistical foundation
  • Training is slow with large data sets
Algorithms and Estimators
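As a rough illustration of the idea, the sketch below trains a linear soft-margin SVM by stochastic subgradient descent on the regularized hinge loss. This only approximates the maximum-margin hyperplane, and it is a stand-in technique: production SVM libraries use dedicated solvers. The hyperparameters `lam`, `lr`, and `epochs` are arbitrary assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """X: list of feature vectors; y: labels in {-1, +1}. Returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 regularizer shrinks the weights on every step.
            w = [wj - lr * lam * wj for wj in w]
            # The hinge loss contributes only when the example is inside the margin.
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Classify by which side of the hyperplane x falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

In a text setting, each row of `X` would be a document's feature vector, such as its word counts.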
algorithms and estimators4

Decision Trees

Like playing 20 questions.

Strengths

  • Able to capture subtle details

Weaknesses

  • Require large amounts of training data
  • Classification is often “brittle”
Algorithms and Estimators
evaluation

Percent agreement

Precision

Recall

F-measure

Cohen’s kappa

Krippendorff’s alpha

Evaluation
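Several of these measures are short enough to compute by hand. The sketch below covers percent agreement, precision, recall, F-measure (the balanced F1 variant), and Cohen's kappa for two label sequences. Krippendorff's alpha is omitted because it needs a distance metric for the measurement level. Function names are illustrative.

```python
def percent_agreement(a, b):
    """Share of items on which two coders (or a coder and a classifier) agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def precision_recall_f1(truth, predicted, positive=1):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def cohens_kappa(a, b):
    """Agreement corrected for chance; undefined if chance agreement is 1."""
    n = len(a)
    observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)
```

Kappa is lower than raw agreement whenever some of the agreement could have occurred by chance, which is why it is preferred for skewed categories.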
motivation

Why study politics online?

  • Impact of new technology on politics
    • Barack Obama did 60% of his record-breaking fundraising online
    • Trent Lott, Dan Rather, Howard Dean
  • New data on age-old political behavior
    • Examples to follow shortly
Motivation
motivation1

“No complete index of political websites exists.”

  • Unable to use sampling theory
    • Size, representativeness, generalizability, etc.
    • Possible bias, error in existing methods
Motivation
web sites v web pages

Web site http://domain

Web page http://domain/path

Examples (3 sites and 1 page)

http://www.yahoo.com

http://www.yahoo.com/politics

http://www.dailykos.com

http://abegong.dailykos.com

Web sites v. web pages
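The distinction can be made mechanical with Python's standard `urllib.parse` module. In this sketch a site is identified by scheme plus domain, so a subdomain counts as its own site. The helper names are illustrative.

```python
from urllib.parse import urlparse

def site_of(url):
    """Return the site (scheme plus domain) that a URL belongs to."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"

def is_site_root(url):
    """True when the URL names a site itself rather than a page within it."""
    return urlparse(url).path in ("", "/")
```

For the examples above, `http://www.yahoo.com/politics` is a page belonging to the site `http://www.yahoo.com`, while the other three URLs are sites.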
why web sites

  • Sites correspond with human beings
  • Feasibility
    • ~ 230 million web sites
    • ~ 30 billion web pages

Why web sites?
automated snowball census

Train an automated text classifier to recognize political content.

  • Start from a seed batch of political sites.
  • Download and classify each site in the batch.
  • For political sites:
    • Harvest all outbound hyperlinks.
    • Add previously unvisited links to the next batch.
  • Repeat until no new links are found.
Automated snowball census
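The steps above amount to a breadth-first crawl that expands only through sites classified as political. The sketch below keeps downloading, classification, and link extraction as caller-supplied functions (`fetch`, `is_political`, and `extract_links` are hypothetical names), so it shows only the control flow and is not the crawler used for the census.

```python
from urllib.parse import urlparse

def site_of(url):
    """Reduce an absolute URL to its site (scheme plus domain)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"

def snowball_census(seed_sites, fetch, is_political, extract_links):
    """Return (political_sites, visited_sites) reached from the seed batch."""
    visited, political = set(), set()
    batch = {site_of(s) for s in seed_sites}
    while batch:                      # repeat until no new links are found
        visited |= batch
        next_batch = set()
        for site in sorted(batch):
            content = fetch(site)     # download the site
            if not is_political(content):
                continue              # links are harvested from political sites only
            political.add(site)
            for link in extract_links(content):  # assumed to be absolute URLs
                target = site_of(link)
                if target not in visited:
                    next_batch.add(target)
        batch = next_batch
    return political, visited
```

Because non-political sites are dead ends, the crawl stays close to the political web instead of spreading across the whole internet.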
evaluation2

How can we know if the automated classifier is working properly?

The same way we know if a human coder is working properly: compare its coding with that of others.

  • Hand-code a training set (n = 1,000 sites × 1 coder)
  • Train the classifier
  • Hand-code a testing set (n = 200 sites × 4 coders)
  • Compare results
    • Human-human
    • Human-computer
Evaluation
coding protocol

Intuitive definition

  • Minimal training

Amazon Mechanical Turk

Coding protocol
reliability

Human-human coding

Ordinal Krippendorff’s alpha: .733

  • Even with minimal training, our shared definition of political content is quite strong.

Sites in the gray area: www.msnbc.com, www.rff.org, …

Reliability
training a text classifier

Prob(political) ≈ logit⁻¹(α + βX)

X = Vector of word counts

α = Bias term

β = Word weights

  • Max. Likelihood Estimator
  • Asymp. unbiased
  • Asymp. efficient
Training a text classifier
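Once α and β have been estimated, scoring a site is a weighted sum of its word counts passed through the logistic (inverse-logit) function, which maps any real number to a probability between 0 and 1. A minimal sketch, with illustrative names and with word counts and weights stored as dictionaries:

```python
import math

def prob_political(word_counts, alpha, beta):
    """Logistic model: inverse-logit of alpha + sum(beta[w] * count[w]).

    word_counts: dict of word -> count for one site (the vector X)
    alpha: bias term
    beta: dict of word -> weight; words without a weight contribute 0
    """
    z = alpha + sum(beta.get(w, 0.0) * c for w, c in word_counts.items())
    # Numerically stable logistic function for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

Fitting α and β by maximum likelihood is left to a statistics or machine learning library.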
thresholds

.4 Threshold

[.95] Precision

[.90] Recall

Thresholds
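Choosing a threshold trades precision against recall: raising it admits fewer false positives but misses more true positives. The sketch below computes both at a given cutoff from predicted probabilities and hand-coded labels. The function name and the toy numbers in the usage note are assumptions, not results from the census.

```python
def precision_recall_at(threshold, scores, truth):
    """scores: predicted probabilities; truth: booleans from hand coding.

    Returns (precision, recall); an entry is None when it is undefined.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall
```

Sweeping the threshold over a grid of values and comparing the resulting pairs is one way to pick a cutoff that balances the two kinds of error.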
results

  • Runtime: 23 hrs
  • Hard drive: 120 GB
  • Sites visited: 1.8 million
  • Political sites: 650,000
  • Est. false positives: 112,000
  • Est. false negatives: 60,000

Results
limitations

  • Stability across time
    • Is the political web today the same as the web last year?
  • Clutter
    • Advertising, spam, etc.
  • Private sites
    • Password protection: Facebook, MySpace, Twitter
  • Improved classifier
    • Other predictors of political-ness (esp. links)
Limitations
slide57

  • Survey
  • Content analysis
    • By author
    • Over time
    • In panel
  • Network analysis
Uses
sentiment analysis

http://textmap.org/

Sentiment analysis

Abe Gong - Evaluating text classifiers and text generators

the bayesian approach to content coding

[Figure: two panels, "An easy task" and "A hard task"; axis labels: Density and Prob(X|f)]

The Bayesian approach to content coding
markov text generation

Intuition

  • Applications
    • Data compression
    • Telecommunications
    • Cryptography

Example:

http://www.in-vacua.com/markov_gen.html

Markov text generation
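The intuition can be sketched in a few lines: record which words follow each word (or each run of `order` words) in a source text, then walk those transitions at random. This is a generic word-level Markov chain, not the generator at the link above, and the function names are illustrative.

```python
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each state (a tuple of `order` consecutive words) to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, length, seed=None):
    """Random walk over the chain, producing at most `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))   # random starting state
    output = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:               # dead end: this state was never followed
            break
        nxt = rng.choice(followers)     # repeated followers are proportionally likelier
        output.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(output)
```

Higher values of `order` produce text that follows the source more closely, at the cost of needing more source text to avoid dead ends.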