Semi-Supervised Learning over Text

Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
September 2006

Statistical learning methods require LOTS of training data. Can we use all that unlabelled text?

Outline: Maximizing likelihood in probabilistic models
Document represented as word counts over a fixed vocabulary:
aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0
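The word-count vector above can be computed with a simple bag-of-words routine. A minimal sketch; the example sentence and the truncated vocabulary here are illustrative, not from the slides:

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Represent a document as a vector of word counts over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical document chosen to reproduce the counts shown above
vocab = ["aardvark", "about", "all", "africa", "apple", "gas", "oil", "zaire"]
doc = "all about all the oil and gas reserves in africa about which we wrote"
print(bag_of_words(doc, vocab))  # → [0, 2, 2, 1, 0, 1, 1, 0]
```

Words outside the vocabulary (here "the", "reserves", …) are simply ignored, which is what makes the representation fixed-length.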
[Figure: Accuracy vs. # training examples]
For code and data, see www.cs.cmu.edu/~tom/mlbook.html (click on “Software and Data”).
[Figure: graphical model over inputs X1, X2, X3, X4]
What if we have labels for only some documents?
Learn P(Y|X)
Using one labeled example per class
Words sorted by P(w|course) / P(w|¬course)
Chosen by cross validation
New M step:
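The full EM loop for semi-supervised naive Bayes can be sketched as follows. This is an illustrative sketch under assumed conventions (multinomial word counts, Laplace smoothing), not the slide's exact algorithm: labeled documents keep fixed one-hot class responsibilities, while the M step folds in the expected counts of the unlabeled documents:

```python
import numpy as np

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2, n_iter=10):
    """Semi-supervised multinomial naive Bayes trained with EM.
    X_lab, X_unl: word-count matrices; y_lab: labels for the labeled rows.
    E step: compute class posteriors for the unlabeled documents.
    M step: re-estimate priors and word probabilities from labeled counts
    plus the expected counts contributed by the unlabeled documents."""
    n_feat = X_lab.shape[1]
    R_lab = np.eye(n_classes)[y_lab]                  # fixed one-hot responsibilities
    R_unl = np.full((len(X_unl), n_classes), 1.0 / n_classes)
    for _ in range(n_iter):
        R = np.vstack([R_lab, R_unl])
        X = np.vstack([X_lab, X_unl])
        # M step (with Laplace smoothing)
        prior = (R.sum(axis=0) + 1) / (len(X) + n_classes)
        counts = R.T @ X                              # expected word counts per class
        theta = (counts + 1) / (counts.sum(axis=1, keepdims=True) + n_feat)
        # E step: posterior over classes for each unlabeled document
        log_post = np.log(prior) + X_unl @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        R_unl = np.exp(log_post)
        R_unl /= R_unl.sum(axis=1, keepdims=True)
    return prior, theta, R_unl
```

With one labeled document per class and a few unlabeled documents, the unlabeled posteriors sharpen over the iterations toward the class whose word distribution they match.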
Key idea: Classifier1 and Classifier2 must:
1. Correctly classify the labeled examples
2. Agree on the classification of the unlabeled examples
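These two requirements lead to the standard co-training loop, sketched below in the spirit of Blum & Mitchell 1998. The rote (memorizing) view learners and the "move the single most confident example per round" policy are assumptions made for illustration:

```python
def rote_trainer(pairs):
    """Rote learner for one view: memorize view value -> label; confidence 1
    for values it has seen before, 0 otherwise."""
    table = {v: y for v, y in pairs}
    return lambda v: (table.get(v), 1.0 if v in table else 0.0)

def co_train(labeled, unlabeled, train1, train2, rounds=10, k=1):
    """Co-training loop: two classifiers, one per view, repeatedly label the
    unlabeled examples they are most confident about and add them to the
    labeled pool, so each classifier teaches the other."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        h1 = train1([(x1, y) for (x1, _), y in labeled])
        h2 = train2([(x2, y) for (_, x2), y in labeled])
        scored = []
        for x in unlabeled:
            y1, c1 = h1(x[0])
            y2, c2 = h2(x[1])
            if max(c1, c2) > 0:
                scored.append((max(c1, c2), x, y1 if c1 >= c2 else y2))
        scored.sort(key=lambda t: -t[0])
        if not scored:
            break
        for _, x, y in scored[:k]:      # move the most confident examples
            labeled.append((x, y))
            unlabeled.remove(x)
    return labeled
```

For example, one seed pair (("New York", "I flew to _ today"), "LOC") lets the view-1 learner label a new occurrence of "New York", whose context then lets the view-2 learner label "Beijing" in the same context.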
[Diagram: Classifier1 produces Answer1; Classifier2 produces Answer2; the answers should agree]
Typical run:
[Riloff&Jones 98; Collins et al., 98; Jones 05]
Example: the sentence “I flew to New York today.” splits into X1 = “New York” (the noun phrase) and X2 = “I flew to ____ today” (its context).
One result [Blum & Mitchell 1998]: if X1 and X2 are conditionally independent given Y, and each view alone is sufficient to learn the target, then an initial weakly useful predictor can be improved using only additional unlabeled examples.

[Figure: labeled (+) examples shown in the two views x1 and x2]
E[error] = Σ_j P(g_j) (1 − P(g_j))^m, where g_j is the jth connected component of the graph of L+U and m is the number of labeled examples.
Expected rote co-training error given m examples: we want to assure that connected components in the underlying distribution, G_D, are also connected components in the observed sample, G_S.
O(log(N)/α) examples assure that, with high probability, G_S has the same connected components as G_D [Karger, 94], where N is the size of G_D and α is the minimum cut over all connected components of G_D.
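The connected-component picture can be made concrete: build a graph whose nodes are observed view values, connect x1 and x2 whenever they co-occur in an example, and propagate each seed label through its component. A minimal sketch using union-find; it assumes the two views draw values from disjoint namespaces (e.g. noun phrases vs. contexts):

```python
def rote_cotrain(examples, seeds):
    """Label propagation over the co-occurrence graph G_S.
    examples: (x1, x2) pairs observed together; seeds: {view value: label}.
    Every node in a connected component containing a seed gets its label;
    nodes in unlabeled components get None."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        root = v
        while parent[root] != root:
            root = parent[root]
        while parent[v] != root:            # path compression
            parent[v], v = root, parent[v]
        return root

    for x1, x2 in examples:
        parent[find(x1)] = find(x2)         # union the two views' components
    component_label = {find(v): y for v, y in seeds.items()}
    return {v: component_label.get(find(v)) for v in parent}
```

With one seed ("New York" is a location) and examples linking "New York" to two contexts and "Beijing" to one of them, the label reaches "Beijing" through the shared context.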
[Dasgupta et al., NIPS 2001]
This theorem assumes X1 and X2 are conditionally independent given Y
How can we tune the learning environment to enhance the effectiveness of co-training?
Factors: # labeled examples, # redundantly predictive inputs, # unlabeled examples, dependencies among input features.
Outcomes: final accuracy, correctness of confidence assessments.
Best case: inputs conditionally independent given class, increased number of redundant inputs, …
What if the CoTraining Assumption Is Not Perfectly Satisfied?

location?
I arrived in Beijing on Saturday.
If: “I arrived in <X> on Saturday.”
Then: Location(X)
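A rule of this form can be applied with simple pattern matching. An illustrative sketch only; the regex-based matcher and the choice of capitalized word sequences for <X> are my assumptions, not the paper's implementation:

```python
import re

def apply_context_rule(pattern, text):
    """Apply an extraction rule like 'I arrived in <X> on Saturday.' by
    turning <X> into a capture group (capitalized word sequences) and
    collecting the matches."""
    regex = re.escape(pattern).replace(re.escape("<X>"),
                                       r"([A-Z]\w+(?:\s[A-Z]\w+)*)")
    return re.findall(regex, text)

print(apply_context_rule("I arrived in <X> on Saturday.",
                         "I arrived in Beijing on Saturday."))  # → ['Beijing']
```

Multi-word fillers work too: the same machinery extracts "New York" from "I flew to New York today." given the pattern "I flew to <X> today."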
[Riloff&Jones 98; Collins et al., 98; Jones 05]
Example: the sentence “I arrived in Beijing Saturday.” splits into X1 = “Beijing” and X2 = “I arrived in __ Saturday”; Classifier1 labels X1 (Answer1) and Classifier2 labels X2 (Answer2).
Initialization (seed locations): Australia, Canada, China, England, France, Germany, Japan, Mexico, Switzerland, United_states, …

Iterations learn context patterns (locations in ?x, operations in ?x, republic of ?x) and new locations: South Africa, United Kingdom, Warrenton, Far_East, Oregon, Lexington, Europe, U.S._A., Eastern Canada, Blair, Southwestern_states, Texas, States, Singapore, …, Thailand, Maine, production_control, northern_Los, New_Zealand, eastern_Europe, Americas, Michigan, New_Hampshire, Hungary, south_america, district, Latin_America, Florida, …
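The initialize-then-iterate procedure above (seed locations → context patterns → new locations) is mutual bootstrapping. A minimal sketch over (context, entity) co-occurrence pairs; scoring by raw co-occurrence counts is a simplifying assumption, real systems use more careful pattern-scoring heuristics:

```python
from collections import Counter

def bootstrap(corpus, seeds, rounds=2, top_patterns=2, top_entities=2):
    """Mutual bootstrapping over (context pattern, entity) co-occurrences.
    Starting from seed entities, promote the context patterns that co-occur
    most with known entities, then the entities that occur most with the
    promoted patterns, and repeat."""
    entities, patterns = set(seeds), set()
    for _ in range(rounds):
        ctx_scores = Counter(c for c, e in corpus if e in entities)
        patterns |= {c for c, _ in ctx_scores.most_common(top_patterns)}
        ent_scores = Counter(e for c, e in corpus if c in patterns)
        entities |= {e for e, _ in ent_scores.most_common(top_entities)}
    return entities, patterns
```

Starting from the single seed "Texas", a pattern like "locations in ?x" gets promoted and then pulls in "Maine"; a pattern that never co-occurs with a known entity stays out, so its entities are not (yet) learned.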
Idea: define a single differentiable objective combining each classifier's error on the labeled examples, their disagreement on the unlabeled examples, and a fit to the class priors, and minimize it by gradient descent.

Update rules: gradient steps on this combined objective.
Can use this for active learning...
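Disagreement between the two view classifiers is a natural query-selection criterion for active learning: ask a human to label exactly the examples the views argue about. A sketch; the classifier interface (a plain predict function per view) is assumed:

```python
def disagreement_queries(unlabeled, h1, h2, budget=5):
    """Active-learning query selection: return up to `budget` unlabeled
    (x1, x2) pairs on which the two view classifiers disagree."""
    return [x for x in unlabeled if h1(x[0]) != h2(x[1])][:budget]
```

Examples where the views agree are left to co-training itself; human effort is spent only where agreement fails.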
+

+
What if the CoTraining Assumption Is Not Perfectly Satisfied?

Gradient CoTraining: classifying capitalized sequences as person names. E.g., “Company president Mary Smith said today…” (views x1 and x2: the capitalized sequence and its surrounding context).

Error rates (with 5000 unlabeled examples):
- Using labeled data only: .24 (25 labeled), .13 (2300 labeled)
- Cotraining: .11*
- Cotraining without fitting class priors (E4): .27*

* sensitive to weights of error terms E3 and E4
Successfully learns “context → word sense” rules when a word occurs multiple times in a document.
X1: HTML preceding the target
X2: HTML following the target
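Splitting a page around the target into these two views can be done directly on the HTML string. A minimal sketch; real wrapper-induction systems use richer tokenization of the surrounding markup:

```python
def html_views(html, target):
    """Split a page into the two co-training views: the HTML preceding the
    first occurrence of the target (X1) and the HTML following it (X2)."""
    i = html.index(target)
    return html[:i], html[i + len(target):]

x1, x2 = html_views("<td><b>Mary Smith</b></td>", "Mary Smith")
print(x1)  # → <td><b>
print(x2)  # → </b></td>
```

Each view (the preceding tags, the following tags) is often individually predictive of the target's type, which is exactly the redundancy co-training needs.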