Loading in 2 Seconds...

Probability Theory Bayes Theorem and Naïve Bayes classification

Loading in 2 Seconds...

- 454 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Probability Theory Bayes Theorem and Naïve Bayes classification' - teenie

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Probability Theory Bayes Theorem and Naïve Bayes classification

Definition of Probability

- Probability theory encodes our knowledge or belief about the collective likelihood of the outcome of an event.
- We use probability theory to try to predict which outcome will occur for a given event.

Slide adapted from Dan Jurafsky's

Sample Spaces

- We think about the “sample space” as being set of all possible outcomes:
- For tossing a coin, the possible outcomes are Heads or Tails.
- For competing in the Olympics, the set of outcomes for a given contest are {gold, silver, bronze, no_award}.
- For computing part-of-speech, the set of outcomes are {JJ, DT, NN, RB, … etc.}
- We use probability theory to try to predict which outcome will occur for a given event.

Slide adapted from Dan Jurafsky's

Axioms of Probability Theory

- Probabilities are real numbers between 01 representing the a priori likelihood that a proposition is true.
- Necessarily true propositions have probability 1, necessarily false have probability 0.

P(true) = 1 P(false) = 0

Slides adapted from Mary Ellen Califf

Probability Axioms

- Probability law must satisfy certain properties:
- Nonnegativity
- P(A) >= 0, for every event A
- Additivity
- If A and B are two disjoint events, then the probability of their union satisfies:
- P(A U B) = P(A) + P(B)
- Normalization
- The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.

P(Cold) = 0.1 (the probability that it will be cold)

P(¬Cold) = 0.9 (the probability that it will be not cold)

Slide adapted from Dan Jurafsky's

An example

- An experiment involving a single coin toss
- There are two possible outcomes, Heads and Tails
- Sample space S is {H,T}
- If coin is fair, should assign equal probabilities to 2 outcomes
- Since they have to sum to 1
- P({H}) = 0.5
- P({T}) = 0.5
- P({H,T}) = P({H})+P({T}) = 1.0

Slide adapted from Dan Jurafsky's

Another example

- Experiment involving 3 coin tosses
- Outcome is a 3-long string of H or T
- S ={HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
- Assume each outcome is equally likely (equiprobable)
- “Uniform distribution”
- What is probability of the event A that exactly 2 heads occur?
- A = {HHT,HTH,THH}
- P(A) = P({HHT})+P({HTH})+P({THH})
- = 1/8 + 1/8 + 1/8
- =3/8

Slide adapted from Dan Jurafsky's

Probability definitions

- In summary:

number of outcomes corresponding to event E

P(E) = ---------------------------------------------------------

total number of outcomes

- Probability of drawing a spade from 52 well-shuffled playing cards:
- 13/52 = ¼ = 0.25

Slide adapted from Dan Jurafsky's

How about non-uniform probabilities? An example:

- A biased coin,
- twice as likely to come up tails as heads,
- is tossed twice
- What is the probability that at least one head occurs?
- Sample space = {hh, ht, th, tt} (h = heads, t = tails)
- Sample points/probability for the event:
- (each head has probability 1/3; each tail has prob 2/3):
- ht 1/3 x 2/3 = 2/9 hh 1/3 x 1/3= 1/9
- th 2/3 x 1/3 = 2/9 tt 2/3 x 2/3 = 4/9
- Answer: 5/9 = 0.56
- (sum of weights in red since want outcome with at least one heads)
- By contrast, probability of event for the unbiased coin = 0.75

Slide adapted from Dan Jurafsky's

Moving toward language

- What’s the probability of drawing a 2 from a deck of 52 cards?
- What’s the probability of a random word (from a random dictionary page) being a verb?

Slide adapted from Dan Jurafsky's

Probability and part of speech tags

- What’s the probability of a random word (from a random dictionary page) being a verb?
- How to compute each of these?
- All words = just count all the words in the dictionary
- # of ways to get a verb: number of words which are verbs!
- If a dictionary has 50,000 entries, and 10,000 are verbs…. P(V) is 10000/50000 = 1/5 = .20

Slide adapted from Dan Jurafsky's

Independence

- Most things in life depend on one another, that is, they are dependent:
- If I drive to SF, this may effect your attempt to go there.
- It may also effect the environment, the economy of SF,etc.
- If 2 events are independent, this means that the occurrence of one event makes it neither more nor less probable that the other occurs.
- If I flip my coin, it won’t have an effect on the outcome of your coin flip.

P(A,B)= P(A) ·P(B) if and only if A and B are independent.

P(heads,tails) = P(heads) · P(tails) = .5 · .5 = .25

Note: P(A|B)=P(A) iff A,B independent

P(B|A)=P(B) iff A,B independent

Slide adapted from Dan Jurafsky's

Conditional Probability

- A way to reason about the outcome of an experiment based on partial information
- How likely is it that a person has a disease given that a medical test was negative?
- In a word guessing game the first letter for the word is a “t”. What is the likelihood that the second letter is an “h”?
- A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?

Slide adapted from Dan Jurafsky's

Conditional Probability

- Conditional probability specifies the probability given that the values of some other random variables are known.

P(Sneeze | Cold) = 0.8

P(Cold | Sneeze) = 0.6

- The probability of a sneeze given a cold is 80%.
- The probability of a cold given a sneeze is 60%.

Slides adapted from Mary Ellen Califf

More precisely

- Given an experiment, a corresponding sample space S, and the probability law
- Suppose we know that the outcome is within some given event B
- The first letter was ‘t’
- We want to quantify the likelihood that the outcome also belongs to some other given event A.
- The second letter will be ‘h’
- We need a new probability law that gives us the conditional probability of A given B
- P(A|B) “the probability of A given B”

Slide adapted from Dan Jurafsky's

Joint Probability Distribution

- The joint probability distribution for a set of random variables X1…Xn gives the probability of every combination of values

P(X1,...,Xn)

Sneeze ¬Sneeze

Cold 0.08 0.01

¬Cold 0.01 0.9

- The probability of all possible cases can be calculated by summing the appropriate subset of values from the joint distribution.
- All conditional probabilities can therefore also be calculated
- P(Cold |¬Sneeze)
- BUT it’s often very hard to obtain all the probabilities for a joint distribution

Slides adapted from Mary Ellen Califf

Conditional Probability Example

- Let’s say A is “it’s raining”.
- Let’s say P(A) in dry California is .01
- Let’s say B is “it was sunny ten minutes ago”
- P(A|B) means
- “what is the probability of it raining now if it was sunny 10 minutes ago”
- P(A|B) is probably way less than P(A)
- Perhaps P(A|B) is .0001
- Intuition: The knowledge about B should change our estimate of the probability of A.

Slide adapted from Dan Jurafsky's

Conditional Probability

- Let A and B be events
- P(A,B) and P(A B) both means “the probability that BOTH A and B occur”
- p(B|A) = the probability of event B occurring given event A occurs
- definition: p(A|B) = p(A B) / p(B)
- P(A, B) = P(A|B) * P(B) (simple arithmetic)
- P(A, B) = P(B, A) (AND is symmetric)

Slide adapted from Dan Jurafsky's

Bayes Theorem

- We start with conditional probability definition:
- So say we know how to compute P(A|B). What if we want to figure out P(B|A)? We can re-arrange the formula using Bayes Theorem:

Deriving Bayes Rule

Slide adapted from Dan Jurafsky's

How to compute probabilities?

- We don’t have the probabilities for most NLP problems
- We can try to estimate them from data
- (that’s the learning part)
- Usually we can’t actually estimate the probability that something belongs to a given class given the information about it
- BUT we can estimate the probability that something in a given class has particular values.

Slides adapted from Mary Ellen Califf

Simple Bayesian Reasoning

- If we assume there are n possible disjoint rhetorical zones, z1 … zn

P(zi | w) = P(w | zi) P(zi) P(w)

- Want to know the probability of the zone given the word.
- Can count how often we see the word in a sentence from a given zone in the training set, and how often the zone itself occurs.
- P(w| zi ) = number of times we see this zone with this word divided by how often we see the zone
- This is the learning part.
- Since P(w) is always the same, we can ignore it.
- So now only need to compute P(w | zi) P(zi)

Slides adapted from Mary Ellen Califf

Bayes Independence Example

- Imagine there are diagnoses ALLERGY, COLD, and WELL and symptoms SNEEZE, COUGH, and FEVER

Prob Well Cold Allergy

P(d) 0.9 0.05 0.05

P(sneeze|d) 0.1 0.9 0.9

P(cough | d) 0.1 0.8 0.7

P(fever | d) 0.01 0.7 0.4

Slides adapted from Mary Ellen Califf

Bayes Independence Example

- By assuming independence, we ignore the possible interactions.
- If symptoms are: sneeze & cough & no fever:

P(well | e) = (0.9)(0.1)(0.1)(0.99)/P(e) = 0.0089/P(e)

P(cold | e) = (.05)(0.9)(0.8)(0.3)/P(e) = 0.01/P(e)

P(allergy | e) = (.05)(0.9)(0.7)(0.6)/P(e) = 0.019/P(e)

P(e) = .0089 + .01 + .019 = .0379

P(well | e) = .23

P(cold | e) = .26

P(allergy | e) = .50

- Diagnosis: allergy

Slides adapted from Mary Ellen Califf

Kupiec et al. Feature Representation

- Fixed-phrase feature
- Certain phrases indicate summary, e.g. “in summary”
- Paragraph feature
- Paragraph initial/final more likely to be important.
- Thematic word feature
- Repetition is an indicator of importance
- Uppercase word feature
- Uppercase often indicates named entities. (Taylor)
- Sentence length cut-off
- Summary sentence should be > 5 words.

Training

- Hand-label sentences in training set (good/bad summary sentences)
- Train classifier to distinguish good/bad summary sentences
- Model used: Naïve Bayes
- Can rank sentences according to score and show top n to user.

Naïve Bayes Classifier

- The simpler version of Bayes was:
- P(B|A) = P(A|B)P(B)/P(A)
- P(Sentence | feature) = P(feature | S) P(S)/P(feature)
- Using Naïve Bayes, we expand the number of feaures by defining a joint probability distribution:
- P(Sentence, f1, f2, … fn) = P(Sentence)P(fi| Sentence)/ P(fi )
- We learn P(Sentence) and P(fi| Sentence) in training
- Test: we need to state P(Sentence | f1, f2, … fn)
- P(Sentence| f1, f2, … fn) =

P(Sentence, f1, f2, … fn) / P(f1, f2, … fn)

Probability of feature-value pair

occurring in a source sentence

which is also in the summary

compression

rate

Probability that sentence s is included

in summary S, given that sentence’s

feature value pairs

Probability of feature-value pair

occurring in a source sentence

Details: Bayesian Classifier- Assuming statistical independence:

Bayesian Classifier (Kupiec at el 95)

- Each Probability is calculated empirically from a corpus
- See how often each feature is seen with a sentence selected for a summary, vs how often that feature is seen in any sentence.
- Higher probability sentences are chosen to be in the summary
- Performance:
- For 25% summaries, 84% precision

From lecture notes by Nachum Dershowitz & Dan Cohen

How to compute this?

- For training, for each feature f:
- For each sentence s:
- Is the sentence in the target summary S? Increment T
- Increment F no matter which sentence it is in.
- P(f|S) = T/N
- P(f) = F/N
- For testing, for each document:
- For each sentence
- Multiply the probabilities of all of the features occurring in the sentence times the probability of any sentence being in the summary (a constant). Divide by the probability of the features occurring in any sentence.

Learning Sentence Extraction Rules (Kupiec et al. 95)

- Results
- About 87% (498) of all summary sentences (568) could be directly matched to sentences in the source (79% direct matches, 3% direct joins, 5% incomplete joins)
- Location was best feature at 163/498 = 33%
- Para+fixed-phrase+sentence length cut-off gave best sentence recall performance … 217/498=44%
- At compression rate = 25% (20 sentences), performance peaked at 84% sentence recall

J. Kupiec, J. Pedersen, and F. Chen. "A Trainable Document Summarizer", Proceedings of the 18th ACM-SIGIR Conference, pages 68--73, 1995.

Language identification

- Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti. Essi sono dotati di ragione e di coscienza e devono agire gli uni verso gli altri in spirito di fratellanza.
- Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen.

Universal Declaration of Human Rights, UN, in 363 languages

http://www.unhchr.ch/udhr/navigate/alpha.htm

Language identification

- égaux
- eguali
- iguales
- edistämään
- Ü
- ¿
- How to do determine, for a stretch of text, which language it is from?

Language Identification

- Turns out to be really simple
- Just a few character bigrams can do it (Sibun & Reynar 96)
- Based on language models for sample languages

Language model basics

- Unigram and bigram models
- Evaluating N-gram language models
- Perplexity
- The intuition of perpexity is that given two probabilistic models, the better model is the one that better predicts new data (not used to train the model)
- We can measure better prediction by comparing the probability the model assigns to the test data
- The better probability will assign a higher probability

Perplexity

minimazing perplexity is equivalent to maximazing the test set probability according to the language model

Entropy

- X---a random variable with probability distribution p(x)

Measures how predictive a given N-gram model is about what the next word could be

KL divergence (relative entropy)

Basis of comparing two probability distributions

Language Identification

- Turns out to be really simple
- Just a few character bigrams can do it (Sibun & Reynar 96)
- Used Kullback Leibler distance (relative entropy)
- Compare probability distribution of the test set to those for the languages trained on
- Smallest distance determines the language
- Using special character sets helps a bit, but barely

Language Identification

(Sibun & Reynar 96)

Confusion Matrix

- A table that shows, for each class, which ones your algorithm got right and which wrong

Gold standard

Algorithm’s guess

Author Identification

- Also called Stylometry in the humanities
- An example of a Classification Problem
- Classifiers:
- Decide which of N buckets to put an item in
- (Some classifiers allow for multiple buckets)

The Disputed Federalist Papers

- In 1787-1788, Jay, Madison, and Hamilton wrote a series of anonymous essays to convince the voters of New York to ratify the new U. S. Constitution.
- Scholars have consensus that:
- 5 authored by Jay
- 51 authored by Hamilton
- 14 authored by Madison
- 3 jointly by Hamilton and Madison
- 12 remain in dispute … Hamilton or Madison?

Author identification

- Federalist papers
- In 1963 Mosteller and Wallace solved the problem
- They identified function words as good candidates for authorships analysis
- Using statistical inference they concluded the author was Madison
- Since then, other statistical techniques have supported this conclusion.

Function vs. Content Words

High rates for “by” favor M, low favor H

High rates for “from” favor M, low says little

High rates for “to” favor H, low favor M

Function vs. Content Words

No consistent pattern for “war”

Federalist Papers Problem

Fung, The Disputed Federalist Papers: SVM Feature Selection

Via Concave Minimization, ACM TAPIA’03

Classification

Goal: Assign ‘objects’ from a universe to two or more classes or categories

Examples:

Problem Object Categories

Tagging Word POS

Sense Disambiguation Word The word’s senses

Information retrieval Document Relevant/not relevant

Sentiment classification Document Positive/negative

Author identification Document Authors

From: Foundations of Statistical Natural Language Processing. Manning and Schutze

Text Categorization Applications

- Web pages organized into category hierarchies
- Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.)
- Responses to Census Bureau occupations
- Patents archived using International Patent Classification
- Patient records coded using international insurance categories
- E-mail message filtering
- News events tracked and filtered by topics
- Spam

Slide adapted from Paul Bennet

Cost of Manual Text Categorization

- Yahoo!
- 200 (?) people for manual labeling of Web pages
- using a hierarchy of 500,000 categories
- MEDLINE (National Library of Medicine)
- $2 million/year for manual indexing of journal articles
- using MEdical Subject Headings (18,000 categories)
- Mayo Clinic
- $1.4 million annually for coding patient-record events
- using the International Classification of Diseases (ICD) for billing insurance companies
- US Census Bureau decennial census (1990: 22 million responses)
- 232 industry categories and 504 occupation categories
- $15 million if fully done by hand

Slide adapted froml Paul Bennett

Knowledge Statistical Engineering Learning

vs.

- For US Census Bureau Decennial Census 1990
- 232 industry categories and 504 occupation categories
- $15 million if fully done by hand
- Define classification rules manually:
- Expert System AIOCS
- Development time: 192 person-months (2 people, 8 years)
- Accuracy = 47%
- Learn classification function
- Nearest Neighbor classification (Creecy ’92: 1-NN)
- Development time: 4 person-months (Thinking Machine)
- Accuracy = 60%

Slide adapted froml Paul Bennett

Text Topic categorization

- Topic categorization: classify the document into semantics topics

The Reuters collection

- A gold standard
- Collection of (21,578) newswire documents.
- For research purposes: a standard text collection to compare systems and algorithms
- 135 valid topics categories

Reuters

- Top topics in Reuters

Reuters Document Example

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

</BODY></TEXT></REUTERS>

Classification vs. Clustering

- Classificationassumes labeled data: we know how many classes there are and we have examples for each class (labeled data).
- Classification is supervised
- In Clusteringwe don’t have labeled data; we just assume that there is a natural division in the data and we may not know how many divisions (clusters) there are
- Clustering is unsupervised

Categories (Labels, Classes)

- Labeling data
- 2 problems:
- Decide the possible classes (which ones, how many)
- Domain and application dependent
- Label text
- Difficult, time consuming, inconsistency between annotators

Reuters Example, revisited

Why not topic = policy ?

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

</BODY></TEXT></REUTERS>

Binary vs. multi-way classification

- Binary classification: two classes
- Multi-way classification: more than two classes
- Sometime it can be convenient to treat a multi-way problem like a binary one: one class versus all the others, for all classes

Features

- >>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
- >>> label = “sport”
- >>> labeled_text = LabeledText(text, label)

- Here the classification takes as input the whole string
- What’s the problem with that?
- What are the features that could be useful for this example?

Feature terminology

- Feature: An aspect of the text that is relevant to the task
- Some typical features
- Words present in text
- Frequency of words
- Capitalization
- Are there NE?
- WordNet
- Others?

Feature terminology

- Feature: An aspect of the text that is relevant to the task
- Feature value: the realization of the feature in the text
- Words present in text : Kerry, Schumacher, China…
- Frequency of word: Kerry(10), Schumacher(1)…
- Are there dates? Yes/no
- Are there PERSONS? Yes/no
- Are there ORGANIZATIONS? Yes/no
- WordNet: Holonyms (China is part of Asia), Synonyms(China, People's Republic of China, mainland China)

Feature Types

- Boolean (or Binary) Features
- Features that generate boolean (binary) values.
- Boolean features are the simplest and the most common type of feature.
- f1(text) = 1 if text contain “Kerry”

0 otherwise

- f2(text) = 1 if text contain PERSON

0 otherwise

Feature Types

- Integer Features
- Features that generate integer values.
- Integer features can be used to give classifiers access to more precise information about the text.
- f1(text) = Number of times text contains “Kerry”
- f2(text) = Number of times text contains PERSON

Feature selection

- How do we choose the “right” features?
- Next lecture

Download Presentation

Connecting to Server..