
New Models for Relational Classification

Ricardo Silva (Statslab)

Joint work with Wei Chu and Zoubin Ghahramani


The talk

  • Classification with non-iid data

  • A source of non-iidness: relational information

  • A new family of models, and what is new

  • Applications to classification of text documents


The prediction problem

[Figure: graphical model with nodes X and Y]


Standard setup

[Figure: graphical model with a plate of N training pairs (X, Y) and a new pair (Xnew, Ynew)]


Prediction with non-iid data

[Figure: graphical model over X1, X2, Xnew and Y1, Y2, Ynew]


Where does the non-iid information come from?

  • Relations

    • Links between data points

      • Webpage A links to Webpage B

      • Movie A and Movie B are often rented together

  • Relations as data

    • “Linked webpages are likely to present similar content”

    • “Movies that are rented together often have correlated personal ratings”


The vanilla relational domain: time-series

  • Relations: “Yi precedes Yi + k”, k > 0

  • Dependencies: “Markov structure G”

[Figure: graph G over Y1, Y2, Y3]


A model for integrating link data

  • How to model the dependencies among class labels?

  • Movies that are often rented together might share all sorts of common, unmeasured factors

  • These hidden common causes affect the ratings


Example

[Figure: MovieFeatures(M1), MovieFeatures(M2), Rating(M1) and Rating(M2), with possible hidden common causes of the two ratings]

  • Same director?

  • Same genre?

  • Both released in same year?

  • Target same age groups?


Integrating link data

  • Of course, many of these common causes will be measured

  • Many will not

  • Idea:

    • Postulate a hidden common cause structure, based on relations

    • Define a model Markov to this structure

    • Design an adequate inference algorithm


Example: Political Books database

  • A network of books about recent US politics sold by the online bookseller Amazon.com

    • Valdis Krebs, http://www.orgnet.com/

  • Relations: frequent co-purchasing of books by the same buyers

    • Political inclination factors as the hidden common causes


Political Books relations


Political Books database

  • Features:

    • I collected the Amazon.com front page for each of the books

    • Bag-of-words, tf-idf features, normalized to unity

  • Task:

    • Binary classification: “liberal” or “not-liberal” books

    • 43 liberal books out of 105
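
The feature construction above can be sketched as follows. This is a minimal illustration, not the code used for the talk: the toy documents, the whitespace tokenizer and the idf variant log(N / df) are all assumptions.

```python
import math
from collections import Counter


def tfidf_unit_vectors(docs):
    """Return one {word: weight} dict per document, scaled to unit Euclidean norm."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw term count times log(N / df).
        vec = {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {w: v / norm for w, v in vec.items()}
        vectors.append(vec)
    return vectors


# Made-up documents standing in for the collected book pages.
docs = ["liberal politics book", "conservative politics book", "liberal essay"]
vectors = tfidf_unit_vectors(docs)
for vec in vectors:
    print({w: round(v, 3) for w, v in vec.items()})
```

Words that occur in fewer documents (here "essay" and "conservative") receive the larger weights, and every vector has length one.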


Contribution

  • We will

    • show a classical multiple linear regression model

    • build a relational variation

    • generalize with a more complex set of independence constraints

    • generalize it using Gaussian processes


Seemingly unrelated regression (Zellner, 1962)

  • Y = (Y1, Y2), X = (X1, X2)

  • Suppose you regress Y1 ~ X1, X2 and

    • X2 turns out to be useless

    • Analogously for Y2 ~ X1, X2 (X1 vanishes)

  • Suppose you regress Y1 ~ X1, X2, Y2

    • And now every variable is a relevant predictor
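
The effect described above can be reproduced on simulated data. This is only a sketch: the coefficients, the error correlation of 0.8 and the sample size are assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two regressions whose errors are correlated (assumed correlation 0.8).
X = rng.normal(size=(n, 2))
errors = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)
y1 = 2.0 * X[:, 0] + errors[:, 0]   # Y1 depends on X1 only
y2 = -1.0 * X[:, 1] + errors[:, 1]  # Y2 depends on X2 only


def ols(design, target):
    """Ordinary least-squares coefficients."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Y1 ~ X1, X2: the coefficient of X2 is close to zero.
coef_x = ols(X, y1)

# Y1 ~ X1, X2, Y2: now X2 and Y2 both get clearly non-zero coefficients.
coef_xy = ols(np.column_stack([X, y2]), y1)

print("Y1 ~ X1, X2     :", np.round(coef_x, 2))
print("Y1 ~ X1, X2, Y2 :", np.round(coef_xy, 2))
```

Once Y2 is in the regression, X2 becomes informative because Y2 + X2 recovers the error of the second equation, which is correlated with the error of the first.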

[Figure: regression graphs over X1, X2, Y1 and Y2]


Graphically, with latents

[Figure: X: Capital(GE), Capital(Westinghouse); Y: Stock price(GE), Stock price(Westinghouse); latent common causes: Industry factor 1, Industry factor 2, ..., Industry factor k?]


The Directed Mixed Graph (DMG)

[Figure: directed mixed graph with X: Capital(GE), Capital(Westinghouse) and Y: Stock price(GE), Stock price(Westinghouse)]

Richardson (2003), Richardson and Spirtes (2002)


A new family of relational models

  • Inspired by SUR

  • Structure: DMG graphs

    • Edges postulated from given relations

[Figure: DMG over inputs X1, ..., X5 and labels Y1, ..., Y5]


Model for binary classification

  • Nonparametric Probit regression

  • Zero-mean Gaussian process prior over f( . )

$P(y_i = 1 \mid x_i) = P(y^*(x_i) > 0)$

$y^*(x_i) = f(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, 1)$
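
For a fixed value of f(xi), the model above gives P(yi = 1 | f(xi)) as the standard normal CDF evaluated at f(xi). A minimal sketch; the function names are hypothetical.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_positive(f_value):
    """P(y = 1 | f(x)) = P(f(x) + eps > 0) with eps ~ N(0, 1)."""
    return std_normal_cdf(f_value)


for f_value in (-2.0, 0.0, 2.0):
    print(f_value, round(prob_positive(f_value), 3))
```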


Relational dependency model

  • Make $\{\epsilon_i\}$ dependent multivariate Gaussian

  • For convenience, decouple it into two error terms

$\epsilon = \epsilon^* + \zeta$


Dependency model: the decomposition

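$\epsilon = \epsilon^* + \zeta$

  • $\epsilon^*$ and $\zeta$ are independent from each other

  • The components of $\epsilon^*$ are marginally independent

  • The components of $\zeta$ are dependent according to relations

$\Sigma_\epsilon = \Sigma_{\epsilon^*} + \Sigma_\zeta$

  • $\Sigma_{\epsilon^*}$ is diagonal

  • $\Sigma_\zeta$ is not diagonal, with 0s only on unrelated pairs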
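
The decomposition above can be checked numerically. The sketch below assumes a toy setting with three points where only the first two are related; the covariance values and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Independent part: diagonal covariance.
sigma_star = np.diag([0.5, 0.5, 0.5])
# Relational part: zero entries for the unrelated pairs (0, 2) and (1, 2).
sigma_zeta = np.array([[0.5, 0.3, 0.0],
                       [0.3, 0.5, 0.0],
                       [0.0, 0.0, 0.5]])

n_samples = 200_000
eps_star = rng.multivariate_normal(np.zeros(3), sigma_star, size=n_samples)
zeta = rng.multivariate_normal(np.zeros(3), sigma_zeta, size=n_samples)
eps = eps_star + zeta  # total error term

# The empirical covariance of the sum matches the sum of the covariances.
empirical = np.cov(eps, rowvar=False)
print(np.round(empirical, 2))
print(np.round(sigma_star + sigma_zeta, 2))
```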


Dependency model: the decomposition

$y^*(x_i) = f(x_i) + \epsilon_i = f(x_i) + \zeta_i + \epsilon_i^* = g(x_i) + \epsilon_i^*$

  • If K is the original kernel matrix for f(.), the covariance of g(.) is simply

$\Sigma_g = K + \Sigma_\zeta$


Approximation

  • Posterior for f(.), g(.) is a truncated Gaussian, hard to integrate

  • Approximate posterior with a Gaussian

    • Expectation-Propagation (Minka, 2001)

  • The reason for $\epsilon^*$ becomes apparent in the EP approximation


Approximation

  • Likelihood does not factorize over f( . ), but factorizes over g( . )

$p(g \mid x, y) \propto p(g \mid x) \prod_i p(y_i \mid g(x_i))$

  • Approximate each factor p(yi | g(xi)) with a Gaussian

    • if $\epsilon^*$ were 0, yi would be a deterministic function of g(xi)
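
In EP, approximating a factor with a Gaussian comes down to matching the mean and variance of the factor times a Gaussian "cavity" distribution. The sketch below uses the standard closed-form moments for a probit factor. It is an illustration, not the talk's implementation: labels are assumed to be coded as +1 / -1, noise_var stands for the variance of the independent error term, and all function and argument names are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_moments(y, cavity_mean, cavity_var, noise_var=1.0):
    """Mean and variance of the density proportional to
    Phi(y * g / sqrt(noise_var)) * N(g; cavity_mean, cavity_var),
    where y is the label coded as +1 or -1.
    """
    total_var = noise_var + cavity_var
    z = y * cavity_mean / math.sqrt(total_var)
    ratio = norm_pdf(z) / norm_cdf(z)
    mean = cavity_mean + y * cavity_var * ratio / math.sqrt(total_var)
    var = cavity_var - cavity_var ** 2 * ratio * (z + ratio) / total_var
    return mean, var


mean, var = probit_moments(y=+1, cavity_mean=0.0, cavity_var=1.0)
print(round(mean, 4), round(var, 4))
```

As noise_var shrinks towards 0 the factor approaches a step function of g(xi), which is the deterministic case mentioned above.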


Generalizations

  • This can be generalized for any number of relations

[Figure: graph over Y1, Y2, Y3, Y4, Y5]

$\epsilon = \epsilon^* + \zeta_1 + \zeta_2 + \zeta_3$


But how to parameterize $\Sigma$?

  • Non-trivial

  • Desiderata:

    • Positive definite

    • Zeroes on the right places

    • Few parameters, but broad family

    • Easy to compute


But how to parameterize $\Sigma$?

  • “Poking zeroes” on a positive definite matrix doesn’t work

[Figure: a positive definite covariance matrix over Y1, Y2, Y3, and the same matrix with zeroes inserted, which is not positive definite]
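
A small numerical check of this point. The matrix below is an assumed example, not the one shown on the slide.

```python
import numpy as np

# A valid (positive definite) covariance matrix for Y1, Y2, Y3.
sigma = np.array([[1.0, 0.9, 0.9],
                  [0.9, 1.0, 0.9],
                  [0.9, 0.9, 1.0]])

# "Poke" a zero for the unrelated pair (Y1, Y3).
poked = sigma.copy()
poked[0, 2] = poked[2, 0] = 0.0

# A symmetric matrix is positive definite iff its smallest eigenvalue is positive.
print(np.linalg.eigvalsh(sigma).min())   # positive
print(np.linalg.eigvalsh(poked).min())   # negative
```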


Approach #1

  • Assume we can find all cliques for the bi-directed subgraph of relations

  • Create a “factor analysis model”, where

    • for each clique Ci there is a latent variable Li

    • members of each clique are the only children of Li

    • Set of latents {L} is a set of N(0, 1) variables

    • coefficients in the model are equal to 1


Approach #1

[Figure: graph over Y1, Y2, Y3, Y4 and the corresponding latent variable model with latents L1, L2]

  • $Y_1 = L_1 + \epsilon_1$

  • $Y_2 = L_1 + L_2 + \epsilon_2$


Approach #1

  • In practice, we set the variance of each $\epsilon$ to a small constant ($10^{-4}$)

  • Covariance between any two Ys is

    • proportional to the number of cliques they belong to together

    • inversely proportional to the number of cliques they belong to individually
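
The clique-based construction can be sketched as below. The clique list is a hypothetical example (two cliques over four variables) and the variable names are assumptions.

```python
import numpy as np

# Hypothetical example: 4 variables and 2 cliques, {Y1, Y2} and {Y2, Y3, Y4}.
cliques = [[0, 1], [1, 2, 3]]
n_vars = 4

# Z[i, c] = 1 if variable i belongs to clique c (all coefficients equal to 1).
Z = np.zeros((n_vars, len(cliques)))
for c, members in enumerate(cliques):
    Z[members, c] = 1.0

noise_var = 1e-4  # small variance of each error term
# Latents are independent N(0, 1), so Cov(Y) = Z Z^T + noise_var * I.
cov = Z @ Z.T + noise_var * np.eye(n_vars)

# Rescale the covariance into a correlation matrix U.
std = np.sqrt(np.diag(cov))
U = cov / np.outer(std, std)
print(np.round(U, 3))
```

Pairs that share no clique, such as Y1 and Y3, get an exact zero, and the result is positive definite by construction.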


Approach #1

  • Let U be the correlation matrix obtained from the proposed procedure

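  • To define the error covariance, use a single hyperparameter $\rho \in [0, 1]$

$\Sigma_\epsilon = (I - \rho U_{diag}) + \rho U$

where $\Sigma_{\epsilon^*} = I - \rho U_{diag}$ and $\Sigma_\zeta = \rho U$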


Approach #1

  • Notice: if everybody is connected, model is exchangeable and simple
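
A sketch of the fully connected case. It assumes the error covariance takes the form (1 - rho) I + rho U with U a correlation matrix, and rho = 0.5 is an arbitrary value.

```python
import numpy as np

n_vars, rho = 4, 0.5  # rho is an assumed value in [0, 1]

# One clique containing every variable: a single shared latent.
Z = np.ones((n_vars, 1))
cov = Z @ Z.T + 1e-4 * np.eye(n_vars)
std = np.sqrt(np.diag(cov))
U = cov / np.outer(std, std)

# Assumed form of the error covariance: (1 - rho) * I + rho * U.
sigma_eps = (1.0 - rho) * np.eye(n_vars) + rho * U

# Exchangeability: every pair of variables shares the same covariance.
off_diagonal = sigma_eps[~np.eye(n_vars, dtype=bool)]
print(np.round(sigma_eps, 3))
print(np.allclose(off_diagonal, off_diagonal[0]))
```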

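[Figure: fully connected graph over Y1, Y2, Y3, Y4, the corresponding model with a single latent L1, and the resulting covariance matrix]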


Approach #1

  • Finding all cliques is “impossible”, what to do?

  • Triangulate and then extract cliques

    • Can be done in polynomial time

  • This is a relaxation of the problem, since constraints are thrown away

  • Can have bad side effects: the “Blow-Up” effect


Political Books dataset


Political Books dataset: the “Blow-up” effect


Approach #2

  • Don’t look for cliques: create a latent for each pair of variables

  • Very fast to compute, zeroes respected

[Figure: graph over Y1, Y2, Y3, Y4 and the corresponding model with one latent per related pair, e.g. L13]


Approach #2

  • Correlations, however, are given by

$\mathrm{Corr}(i, j) \propto \frac{1}{\sqrt{\#\mathrm{neigh}(i) \cdot \#\mathrm{neigh}(j)}}$

  • Penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common

  • We call this the “pulverization” effect
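
The pairwise construction and the resulting correlations can be sketched as follows. The relation graph is a hypothetical example, and the small error variance is left out to keep the arithmetic exact.

```python
import numpy as np

# Hypothetical relation graph: Y1 is related to Y2, Y3 and Y4; Y2 is related to Y3.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
n_vars = 4

# One latent per related pair; B[i, e] = 1 if variable i is an endpoint of edge e.
B = np.zeros((n_vars, len(edges)))
for e, (i, j) in enumerate(edges):
    B[i, e] = B[j, e] = 1.0

# Independent N(0, 1) latents with unit coefficients: Cov(Y) = B B^T.
cov = B @ B.T
degree = np.diag(cov)  # number of neighbors of each variable
corr = cov / np.sqrt(np.outer(degree, degree))
print(np.round(corr, 3))
```

Y1 and Y2 share the neighbor Y3, yet their correlation is only 1 / sqrt(3 * 2), lower than that of Y1 and Y4, which have no neighbor in common.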


Political Books dataset


Political Books dataset: the “pulverization” effect


WebKB dataset: links of pages in University of Washington


Approach #1


Approach #2


Comparison: undirected models

  • Generative stories

    • Conditional random fields (Lafferty, McCallum, Pereira, 2001)

    • Chu et al., 2006 / Richardson and Spirtes, 2002

[Figure: graphs over X1, X2, X3 and Y1, Y2, Y3]


Chu Wei’s model

  • Dependency family equivalent to a pairwise Markov random field

[Figure: inputs X1, X2, X3; latent Y1*, Y2*, Y3*; labels Y1, Y2, Y3; relation indicators R12 = 1, R23 = 1]


Properties of undirected models

  • MRFs propagate information among “test” points

[Figure: graph over Y1, ..., Y12]


Properties of DMG models

  • DMGs propagate information among “training” points

[Figure: graph over Y1, ..., Y12]


Properties of DMG models

  • In a DMG, each “test” point will have a whole “training component” in its Markov blanket

[Figure: graph over Y1, ..., Y12]


Properties of DMG models

  • It seems acceptable that a typical relational domain will not have an “extrapolation” pattern

    • Like typical “structured output” problems, e.g., NLP domains

  • Ultimately, the choice of model concerns the question:

    • “Hidden common causes” or “relational indicators”?


Experiment #1

  • A subset of the CORA database

    • 4,285 machine learning papers, 7 classes

    • Links: citations between papers

      • “hidden common cause” interpretation: particular ML subtopic being treated

    • Experiment: 7 binary classification problems, Class 5 vs. others

    • Criterion: AUC
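
For reference, AUC can be computed as the fraction of (positive, negative) pairs that the classifier ranks correctly. A minimal sketch with made-up scores and labels; the function name is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a random positive
    example is scored above a random negative one (ties count one half).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores and labels (made up for illustration).
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 0, 1, 1, 0]
print(auc(scores, labels))
```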


Experiment #1

  • Comparisons:

    • Regular GP

    • Regular GP + citation adjacency matrix

    • Chu Wei’s Relational GP (RGP)

    • Our method, miXed graph GP (XGP)

  • Fairly easy task

  • Analysis of low-sample tasks

    • Uses 1% of the data (roughly 10 data points for training)

    • Not that useful for XGP, but more useful for RGP


Experiment #1

  • Chu Wei’s method gets up to 0.99 in several of those…


Experiment #2

  • Political Books database

    • 105 datapoints, 100 runs using 50% for training

  • Comparison with standard Gaussian processes

    • Linear kernels

  • Results

    • 0.92 for regular GP

    • 0.98 for XGP (using pairwise kernel generator)

      • Hyperparameters optimized by grid search

    • Difference: 0.06 with std 0.02

    • Chu Wei’s method does the same…


Experiment #3

  • WebKB

    • Collections of webpages from 4 different universities

  • Task: “outlier classification”

    • Identify which pages are not student, course, project or faculty pages

    • 10% for training data (still not that hard)

      • However, an order of magnitude more data than in CORA


Experiment #3

  • As far as I know, XGP easily gets the best results on this task


Future work

  • Tons of possibilities on how to parameterize the output covariance matrix

    • Incorporating relation attributes too

  • Heteroscedastic relational noise

  • Mixtures of relations

  • New approximation algorithms

  • Clustering problems

  • On-line learning


Thank You

