
Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data


Presentation Transcript


  1. Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data Frank Lin and William W. Cohen School of Computer Science, Carnegie Mellon University KDD-MLG 2011 2011-08-21, San Diego, CA, USA

  2. Overview • Graph-based SSL • The Problem with Text Data • Implicit Manifolds • Two GSSL Methods • Three Similarity Functions • A Framework • Results • Related Work

  3. Graph-based SSL • Graph-based semi-supervised learning methods can often be viewed as methods for propagating labels along the edges of a graph • Naturally, they are a good fit for network data (Figure: a graph with one node labeled class A and one labeled class B; how to label the rest of the nodes in the graph?)
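The propagation idea above can be sketched in a few lines. A minimal toy example (illustrative only, not the paper's implementation), assuming a row-normalized adjacency matrix and two clamped labeled nodes:

```python
import numpy as np

# Toy label propagation: clamp two labeled nodes and let labels
# flow along the edges of a small 5-node graph (illustrative only).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)   # row-normalized transition matrix

f = np.array([1.0, 0, 0, 0, -1.0])     # node 0: class A (+1), node 4: class B (-1)
labeled = np.array([True, False, False, False, True])

for _ in range(100):
    f = W @ f                  # propagate labels to neighbors
    f[labeled] = [1.0, -1.0]   # clamp the labeled nodes each round

# Unlabeled nodes take the class whose sign dominates their score.
```

Here nodes 1 and 2 end up positive (class A) and node 3 negative (class B), since labels flow from the nearest labeled node.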

  4. The Problem with Text Data • Documents are often represented as feature vectors of words. Example documents: "The importance of a Web page is an inherently subjective matter, which depends on the readers…" "In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…" "You're not cool just because you have a lot of followers on twitter, get over yourself…"

  5. The Problem with Text Data • Feature vectors are often sparse: mostly zeros, since any document contains only a small fraction of the vocabulary • But the similarity matrix is not! It is mostly non-zero, since any two documents are likely to have a word in common

  6. The Problem with Text Data • A similarity matrix is the input to many GSSL methods • But it is too expensive and does not scale up to big datasets: O(n²) time to construct, O(n²) space to store, and at least O(n²) time to operate on • How can we apply GSSL methods efficiently?

  7. The Problem with Text Data • Solutions: • Make the matrix sparse: a lot of cool work has gone into how to do this (if you do SSL on large-scale non-network data, there is a good chance you are using this already…) • Implicit Manifold: this is what we'll talk about!

  8. Implicit Manifolds What do you mean by…? • A pair-wise similarity matrix is a manifold under which the data points “reside”. • It is implicit because we never explicitly construct the manifold (pair-wise similarity), although the results we get are exactly the same as if we did. Sounds too good. What’s the catch?

  9. Implicit Manifolds • Two requirements for using implicit manifolds on text (-like) data: • The GSSL method can be implemented with matrix-vector multiplications • We can decompose the dense similarity matrix into a series of sparse matrix-matrix multiplications As long as they are met, we can obtain the exact same results without ever constructing a similarity matrix! Let’s look at some specifics, starting with two GSSL methods
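The second requirement is essentially associativity of matrix multiplication. A quick sketch, assuming inner-product similarity S = FFᵀ for a sparse document-term matrix F (sizes here are made up for illustration):

```python
import numpy as np

# Implicit manifold trick: compute S @ v for S = F F^T without
# ever materializing the dense n x n similarity matrix.
rng = np.random.default_rng(0)
n, m = 500, 2000                                # documents x vocabulary
F = (rng.random((n, m)) < 0.01).astype(float)   # sparse binary term matrix
v = rng.random(n)

explicit = (F @ F.T) @ v    # O(n^2) space: builds the dense similarity matrix
implicit = F @ (F.T @ v)    # two sparse matrix-vector products, same result
```

The two results are identical, but the implicit version never touches anything larger than F itself.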

  10. Two GSSL Methods If you've never heard of them, you might have heard of one of their cousins: • Harmonic functions method (HF): propagation via backward random walks • MultiRankWalk (MRW): propagation via forward random walks (w/ restart) HF's cousins: • Gaussian fields and harmonic functions classifier (Zhu et al. 2003) • Weighted-voted relational network classifier (Macskassy & Provost 2007) • Weakly-supervised classification via random walks (Talukdar et al. 2008) • Adsorption (Baluja et al. 2008) • Learning on diffusion maps (Lafon & Lee 2006) • and others… MRW's cousins: • Partially labeled classification using Markov random walks (Szummer & Jaakkola 2001) • Learning with local and global consistency (Zhou et al. 2004) • Graph-based SSL as a generative model (He et al. 2007) • Ghost edges for classification (Gallagher et al. 2008) • and others…

  11. Two GSSL Methods Let's look at their implicit manifold qualification… • Harmonic functions method (HF) • MultiRankWalk (MRW) In both of these iterative implementations, the core computations are matrix-vector multiplications!
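As an illustration of such an iterative implementation, here is a hedged sketch of MultiRankWalk as a random walk with restart from the seed nodes; the damping factor and iteration count are assumptions, not the paper's exact parameters:

```python
import numpy as np

def mrw_scores(W, seeds, d=0.85, iters=200):
    """Random walk with restart: v = (1-d)*u + d*W^T v, where u is the
    normalized seed vector. Only matrix-vector products are needed."""
    u = seeds / seeds.sum()
    v = u.copy()
    for _ in range(iters):
        v = (1 - d) * u + d * (W.T @ v)
    return v

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], float)
W = A / A.sum(axis=1, keepdims=True)                 # row-stochastic transitions

scores_A = mrw_scores(W, np.array([1.0, 0, 0, 0]))   # class A seeded at node 0
scores_B = mrw_scores(W, np.array([0, 0, 0, 1.0]))   # class B seeded at node 3
labels = np.where(scores_A > scores_B, "A", "B")     # label by highest score
```

One walk is run per class, and each node takes the class whose walk assigns it the highest stationary score.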

  12. Three Similarity Functions • Inner product similarity: simple, good for binary features • Cosine similarity: often used for document categorization • Bipartite graph walk: can be viewed as a bipartite graph walk, or as feature reweighting by relative frequency; related to TF-IDF term weighting… Side note: perhaps all the cool work on matrix factorization could be useful here too…

  13. Putting Them Together How about a similarity function we actually use for text data? • Example: HF + inner product similarity • The diagonal matrix D can be calculated as D(i,i) = d(i), where d = F(Fᵀ1) • Iteration update becomes f ← D⁻¹(F(Fᵀf)): the parentheses are important! • Construction: O(n), Storage: O(n), Operation: O(n)
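A sketch of this update in code (my own minimal reading of the slide, with self-similarity included in S = FFᵀ; function and variable names are illustrative):

```python
import numpy as np

def hf_inner_product(F, f, labeled, clamp, iters=50):
    """HF iteration f <- D^{-1} (F (F^T f)) with labels clamped,
    never forming the dense similarity matrix S = F F^T."""
    d = F @ (F.T @ np.ones(F.shape[0]))   # d = F F^T 1, the degree vector
    for _ in range(iters):
        f = (F @ (F.T @ f)) / d           # parentheses keep it sparse-friendly
        f[labeled] = clamp                # clamp the labeled points
    return f

# Toy document-term matrix with two labeled documents:
F = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [0, 0, 1]], float)
labeled = np.array([True, False, False, True])
f = hf_inner_product(F, np.array([1.0, 0, 0, -1.0]), labeled, [1.0, -1.0])
```

Running the same iteration against an explicitly built S = FFᵀ gives exactly the same scores, which is the whole point of the implicit manifold.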

  14. Putting Them Together • Example: HF + cosine similarity, using the diagonal cosine normalizing matrix Nc • The diagonal matrix D can be calculated as D(i,i) = d(i), where d = NcFFᵀNc1 • Iteration update: f ← D⁻¹(Nc(F(Fᵀ(Nc f)))) • Construction: O(n), Storage: O(n), Operation: O(n) • Compact storage: we don't need a cosine-normalized version of the feature vectors
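The cosine variant looks almost the same; in the sketch below, Nc is applied as a vector of inverse row norms (illustrative code, not the authors' implementation), which is why no cosine-normalized copy of F is ever stored:

```python
import numpy as np

def hf_cosine(F, f, labeled, clamp, iters=50):
    """HF iteration with cosine similarity S = Nc F F^T Nc, where Nc is
    the diagonal matrix of inverse row norms, applied as a vector."""
    nc = 1.0 / np.linalg.norm(F, axis=1)     # diagonal of Nc
    d = nc * (F @ (F.T @ nc))                # d = Nc F F^T Nc 1  (Nc 1 = nc)
    for _ in range(iters):
        f = (nc * (F @ (F.T @ (nc * f)))) / d
        f[labeled] = clamp
    return f

F = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]], float)
labeled = np.array([True, False, False, True])
f = hf_cosine(F, np.array([1.0, 0, 0, -1.0]), labeled, [1.0, -1.0])
```

Again, the result matches what you would get by explicitly cosine-normalizing F and building the dense similarity matrix.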

  15. A Framework • So what's the point? • Towards a principled way for researchers to apply GSSL methods to text data, and the conditions under which they can do this efficiently • Towards a framework on which researchers can develop and discover new methods (and recognize old ones) • Building an SSL tool set: pick one that works for you, according to all the great work that has been done

  16. A Framework Choose your SSL method… "Hmm… I have a large dataset with very few training labels, what should I try?" "How about MultiRankWalk with a low restart probability?" • Harmonic Functions • MultiRankWalk

  17. A Framework … and pick your similarity function "But the documents in my dataset are kinda long…" "Can't go wrong with cosine similarity!" • Inner Product • Cosine Similarity • Bipartite Graph Walk

  18. Results • On 2 commonly used document categorization datasets • On 2 less common NP classification datasets • Goal: to show they do work on large text datasets, with results consistent with what we know about these SSL methods and similarity functions

  19. Results • Document categorization

  20. Results • Noun phrase classification dataset:

  21. Questions?

  22. Additional Information

  23. MRW: RWR for Classification We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks

  24. MRW: Seed Preference • Obtaining labels for data points is expensive • We want to minimize the cost of obtaining labels • Observations: • Some labels are inherently more useful than others • Some labels are easier to obtain than others Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?

  25. Seed Preference • Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label • The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data • We consider 3 preferences: • Random • Link Count: nodes with the highest link counts make the list • PageRank: nodes with the highest PageRank scores make the list
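The three preferences are easy to sketch. Toy code follows; the PageRank damping factor and iteration count here are the usual defaults, an assumption rather than the paper's settings:

```python
import numpy as np

def seed_lists(A, k, d=0.85, iters=100, seed=0):
    """Return k candidate seeds under each preference: uniformly at
    random, by link count (degree), and by PageRank score."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    random_seeds = rng.choice(n, size=k, replace=False)
    degree_seeds = np.argsort(-A.sum(axis=1))[:k]        # highest link counts
    W = A / A.sum(axis=1, keepdims=True)
    pr = np.full(n, 1.0 / n)
    for _ in range(iters):
        pr = (1 - d) / n + d * (W.T @ pr)                # uniform-restart PageRank
    pagerank_seeds = np.argsort(-pr)[:k]
    return random_seeds, degree_seeds, pagerank_seeds
```

On a star graph, for example, both the link-count and PageRank preferences put the hub at the top of the list, while the random preference may miss it.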

  26. MRW: The Question • What really makes MRW and wvRN different? • Network-based SSL often boils down to label propagation • MRW and wvRN represent two general propagation methods; note that they are called by many names Great… but we still don't know why they behave differently on these network datasets!

  27. MRW: The Question • It’s difficult to answer exactly why MRW does better with a smaller number of seeds. • But we can gather probable factors from their propagation models:

  28. MRW: The Question • An example from a political blog dataset: MRW vs. wvRN scores for how much a blog is politically conservative (seed labels underlined on the slide) • We still don't really understand it yet, but probable factors are: 1. Centrality-sensitive: seeds have different scores, and not necessarily the highest 2. Exponential drop-off: much less sure about nodes further away from seeds 3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?) MRW scores: 0.020 firstdownpolitics.com, 0.019 neoconservatives.blogspot.com, 0.017 jmbzine.com, 0.017 strangedoctrines.typepad.com, 0.013 millers_time.typepad.com, 0.011 decision08.blogspot.com, 0.010 gopandcollege.blogspot.com, 0.010 charlineandjamie.com, 0.008 marksteyn.com, 0.007 blackmanforbush.blogspot.com, 0.007 reggiescorner.blogspot.com, 0.007 fearfulsymmetry.blogspot.com, 0.006 quibbles-n-bits.com, 0.006 undercaffeinated.com, 0.005 samizdata.net, 0.005 pennywit.com, 0.005 pajamahadin.com, 0.005 mixtersmix.blogspot.com, 0.005 stillfighting.blogspot.com, 0.005 shakespearessister.blogspot.com, 0.005 jadbury.com, 0.005 thefulcrum.blogspot.com, 0.005 watchandwait.blogspot.com, 0.005 gindy.blogspot.com, 0.005 cecile.squarespace.com, 0.005 usliberals.about.com, 0.005 twentyfirstcenturyrepublican.blogspot.com. wvRN scores: 1.000 neoconservatives.blogspot.com, 1.000 strangedoctrines.typepad.com, 1.000 jmbzine.com, 0.593 presidentboxer.blogspot.com, 0.585 rooksrant.com, 0.568 purplestates.blogspot.com, 0.553 ikilledcheguevara.blogspot.com, 0.540 restoreamerica.blogspot.com, 0.539 billrice.org, 0.529 kalblog.com, 0.517 right-thinking.com, 0.517 tom-hanna.org, 0.514 crankylittleblog.blogspot.com, 0.510 hasidicgentile.org, 0.509 stealthebandwagon.blogspot.com, 0.509 carpetblogger.com, 0.497 politicalvicesquad.blogspot.com, 0.496 nerepublican.blogspot.com, 0.494 centinel.blogspot.com, 0.494 scrawlville.com, 0.493 allspinzone.blogspot.com, 0.492 littlegreenfootballs.com, 0.492 wehavesomeplanes.blogspot.com, 0.491 rittenhouse.blogspot.com, 0.490 secureliberty.org, 0.488 decision08.blogspot.com, 0.488 larsonreport.com

  29. MRW: Related Work • MRW is very much related to: • "Local and global consistency" (Zhou et al. 2004) • "Web content categorization using link information" (Gyongyi et al. 2006) • "Graph-based semi-supervised learning as a generative model" (He et al. 2007) (slide annotations on these: random walk without restart, heuristic stopping; RWR ranking as features to SVM; similar formulation, different view) • Seed preference is related to the field of active learning • Active learning chooses which data point to label next based on previous labels; the labeling is interactive • Seed preference is a batch labeling method • Authoritative seed preference is a good baseline for active learning on network data!

  30. Results • How much better is MRW using authoritative seed preference? • y-axis: MRW F1 score minus wvRN F1 score • x-axis: number of seed labels per class • The gap between MRW and wvRN narrows with authoritative seeds, but it is still prominent on some datasets with a small number of seed labels

  31. A Web-Scale Knowledge Base • Read the Web (RtW) project: Build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.

  32. Noun Phrase and Context Data • As a part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced: • NP-context co-occurrence data • NP-NP co-occurrence data • These datasets can be treated as graphs

  33. Noun Phrase and Context Data • NP-context data: "… know that drinking pomegranate juice may not be a bad …" (before context, NP, after context) (Figure: an NP-context graph linking NPs such as "pomegranate juice", "JELL-O", and "Jagermeister" to contexts such as "know that drinking _", "_ may not be a bad", "_ is made from", and "_ promotes responsible", with co-occurrence counts on the edges)
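A sketch of what such an NP-context graph looks like as a matrix; the counts below are made up for illustration, not the slide's actual figures. A walk NP → context → NP then only needs B(Bᵀv), the same implicit-manifold idea (called "path folding" later in the deck):

```python
import numpy as np

# Toy NP-context co-occurrence counts (illustrative, not the slide's data).
nps = ["pomegranate juice", "JELL-O", "Jagermeister"]
contexts = ["know that drinking _", "_ may not be a bad", "_ is made from"]
counts = {("pomegranate juice", "know that drinking _"): 3,
          ("pomegranate juice", "_ may not be a bad"): 2,
          ("JELL-O", "_ is made from"): 5,
          ("Jagermeister", "_ is made from"): 2}

B = np.zeros((len(nps), len(contexts)))
for (noun, ctx), c in counts.items():
    B[nps.index(noun), contexts.index(ctx)] = c

# One step of NP -> context -> NP, without forming the dense B B^T:
v = np.array([1.0, 0.0, 0.0])      # mass on "pomegranate juice"
folded = B @ (B.T @ v)
```

NPs sharing a context with the start NP pick up mass; NPs that share none (here, JELL-O and Jagermeister) get zero.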

  34. Noun Phrase and Context Data • NP-NP data: "… French Constitution Council validates burqa ban …" (NP, context, NP) • Context can be used for weighting edges or making a more complex graph (Figure: an NP-NP graph linking NPs such as "French Constitution Council", "burqa ban", "veil", "hot pants", "French Court", "JELL-O", and "Jagermeister")

  35. Noun Phrase Categorization • We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs. • Challenges: • Large, noisy dataset (10m NPs, 8.6m contexts from 500m web pages). • What’s the right function for NP-NP categorical similarity? • Which learned category assignment should we “promote” to the knowledge base? • How to evaluate it?

  37. Noun Phrase Categorization • Preliminary experiment: • Small subset of the NP-context data • 88k NPs • 99k contexts • Find category “city” • Start with a handful of seeds • Ground truth set of 2,404 city NPs created using
