
New Models for Relational Classification

Ricardo Silva (Statslab)

Joint work with Wei Chu and Zoubin Ghahramani



The talk

  • Classification with non-iid data

  • A source of non-iidness: relational information

  • A new family of models, and what is new

  • Applications to classification of text documents



The prediction problem

[Figure: a single input X and its label Y]



Standard setup

[Figure: plate diagram with N i.i.d. training pairs (X, Y) and a new test pair (Xnew, Ynew)]



Prediction with non-iid data

[Figure: training pairs (X1, Y1), (X2, Y2) and a test pair (Xnew, Ynew), with dependencies among the Y variables]



Where does the non-iid information come from?

  • Relations

    • Links between data points

      • Webpage A links to Webpage B

      • Movie A and Movie B are often rented together

  • Relations as data

    • “Linked webpages are likely to present similar content”

    • “Movies that are rented together often have correlated personal ratings”



The vanilla relational domain: time-series

  • Relations: “Yi precedes Yi+k”, k > 0

  • Dependencies: “Markov structure G”

[Figure: Markov chain over Y1, Y2, Y3]



A model for integrating link data

  • How to model the dependencies among the class labels?

  • Movies that are often rented together might share common, unmeasured factors

  • These hidden common causes affect the ratings


Example

[Figure: MovieFeatures(M1) → Rating(M1) and MovieFeatures(M2) → Rating(M2), with hidden common causes of the two ratings: same director? same genre? both released in the same year? target the same age groups?]



Integrating link data

  • Of course, many of these common causes will be measured

  • Many will not

  • Idea:

    • Postulate a hidden common cause structure, based on relations

    • Define a model that is Markov with respect to this structure

    • Design an adequate inference algorithm



Example: Political Books database

  • A network of books about recent US politics sold by the online bookseller Amazon.com

    • Valdis Krebs, http://www.orgnet.com/

  • Relations: frequent co-purchasing of books by the same buyers

    • Political inclination factors as the hidden common causes



Political Books relations



Political Books database

  • Features:

    • I collected the Amazon.com front page for each of the books

    • Bag-of-words, tf-idf features, normalized to unit length (see the sketch after this list)

  • Task:

    • Binary classification: “liberal” or “not-liberal” books

    • 43 liberal books out of 105
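Since the slide describes the feature construction (bag-of-words tf-idf vectors, each normalized to unit length), here is a minimal sketch of how such features could be produced with scikit-learn. The placeholder document strings and variable names are illustrative, not the original preprocessing pipeline.

```python
# Hypothetical sketch of the feature construction described above:
# bag-of-words tf-idf vectors, each normalized to unit length.
from sklearn.feature_extraction.text import TfidfVectorizer

# One document per book, e.g. the scraped Amazon.com front page (placeholder text here).
pages = [
    "scraped Amazon.com front page text for book 1 ...",
    "scraped Amazon.com front page text for book 2 ...",
]

vectorizer = TfidfVectorizer(norm="l2")    # "l2" makes every feature vector unit length
X = vectorizer.fit_transform(pages)        # rows: books, columns: vocabulary terms
print(X.shape)
```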



Contribution

  • We will

    • show a classical multiple linear regression model

    • build a relational variation

    • generalize with a more complex set of independence constraints

    • generalize it using Gaussian processes



Seemingly unrelated regression (Zellner, 1962)




  • Y = (Y1, Y2), X = (X1, X2)

  • Suppose you regress Y1 ~ X1, X2 and

    • X2 turns out to be useless

    • Analogously for Y2 ~ X1, X2 (X1 vanishes)

  • Suppose you regress Y1 ~ X1, X2, Y2

    • And now every variable is a relevant predictor

[Figure: regression graphs over X1, X2, Y1, Y2 for the two seemingly unrelated regressions]



Graphically, with latents

[Figure: X = Capital(GE), Capital(Westinghouse); Y = Stock price(GE), Stock price(Westinghouse); latent industry factors 1, 2, ..., k acting as hidden common causes of the stock prices]



The Directed Mixed Graph (DMG)

[Figure: the same example as a directed mixed graph: X = Capital(GE), Capital(Westinghouse); Y = Stock price(GE), Stock price(Westinghouse); the latent industry factors are summarized by a bi-directed edge between the stock prices]

Richardson (2003), Richardson and Spirtes (2002)



A new family of relational models

  • Inspired by SUR

  • Structure: DMG graphs

    • Edges postulated from given relations

[Figure: DMG over inputs X1, ..., X5 and outputs Y1, ..., Y5, with bi-directed edges among the Ys postulated from the given relations]



Model for binary classification

  • Nonparametric Probit regression

  • Zero-mean Gaussian process prior over f( . )

P(yi = 1 | xi) = P(y*(xi) > 0)

y*(xi) = f(xi) + εi,   εi ~ N(0, 1)
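To make the probit model above concrete: with εi ~ N(0, 1), P(yi = 1 | xi) = Φ(f(xi)), where Φ is the standard normal CDF and f has a zero-mean GP prior. The sketch below simply draws f from an assumed RBF-kernel prior (the kernel choice and toy inputs are illustrative, not from the talk) and evaluates the class probabilities.

```python
# Minimal sketch of the probit model: P(y_i = 1 | x_i) = Phi(f(x_i)) when eps_i ~ N(0, 1).
# The RBF kernel and the toy inputs are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def rbf_kernel(X, lengthscale=1.0):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                   # five toy input points
K = rbf_kernel(X) + 1e-8 * np.eye(5)          # zero-mean GP prior covariance for f
f = rng.multivariate_normal(np.zeros(5), K)   # one draw of the latent function values
p_positive = norm.cdf(f)                      # P(y_i = 1 | x_i) under the probit link
print(p_positive)
```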



Relational dependency model

  • Make {} dependent multivariate Gaussian

  • For convenience, decouple it into two error terms

 = * + 



Dependency model: the decomposition

ε = ε* + ξ

  • ε* and ξ are independent from each other

  • ε* is marginally independent across data points (Σ* diagonal)

  • ξ is dependent according to the relations (Σξ not diagonal, with 0s only on unrelated pairs)

Σ = Σ* + Σξ



Dependency model: the decomposition

  • If K is the original kernel matrix for f(·), the covariance of g(·) is simply K + Σξ

y*(xi) = f(xi) + εi = f(xi) + ξi + ε*i = g(xi) + ε*i

Cov(g(·)) = K + Σξ
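A small sketch of the covariance composition above: the kernel matrix K for f(·) plus a relational error covariance Σξ that is zero on unrelated pairs. Both matrices below are toy values chosen only to illustrate the sum; nothing here comes from the talk's data.

```python
# Toy illustration of Cov(g) = K + Sigma_xi; all numbers are made up.
import numpy as np

rng = np.random.default_rng(1)
Xfeat = rng.normal(size=(4, 3))
K = Xfeat @ Xfeat.T + 1e-6 * np.eye(4)     # any valid kernel matrix for f(.); linear here

# Relational error covariance: nonzero only for related pairs (1,2), (2,3); Y4 is unrelated.
Sigma_xi = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

Cov_g = K + Sigma_xi     # covariance of g(.) = f(.) + xi
# The independent term eps* is kept separate and enters through the probit likelihood.
print(Cov_g)
```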



Approximation

  • Posterior for f(.), g(.) is a truncated Gaussian, hard to integrate

  • Approximate posterior with a Gaussian

    • Expectation-Propagation (Minka, 2001)

  • The reason for ε* becomes apparent in the EP approximation



Approximation

  • Likelihood does not factorize over f( . ), but factorizes over g( . )

  • Approximate each factor p(yi | g(xi)) with a Gaussian

    • if * were 0, yi would be a deterministic function of g(xi)

p(g | x, y)  p(g | x) p(yi | g(xi))

i
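A quick numeric illustration of the last point: with the independent term ε* ~ N(0, v*), each likelihood factor is a smooth probit, P(yi = 1 | g(xi)) = Φ(g(xi) / sqrt(v*)); as v* shrinks toward 0 the factor approaches a hard step, i.e. yi becomes deterministic given g(xi). The values of v* below are purely illustrative.

```python
# Each factor: P(y_i = 1 | g(x_i)) = Phi(g(x_i) / sqrt(v_star)) when eps*_i ~ N(0, v_star).
# As v_star -> 0 the factor tends to a step function (deterministic label given g).
import numpy as np
from scipy.stats import norm

g = np.array([-2.0, -0.1, 0.1, 2.0])          # toy latent values g(x_i)
for v_star in [1.0, 0.01, 1e-6]:
    print(v_star, norm.cdf(g / np.sqrt(v_star)))
```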



Generalizations

  • This can be generalized for any number of relations

[Figure: outputs Y1, ..., Y5 linked by several types of relations, each contributing its own bi-directed edges]

ε = ε* + ξ1 + ξ2 + ξ3



But how to parameterize Σξ?

  • Non-trivial

  • Desiderata:

    • Positive definite

    • Zeroes in the right places

    • Few parameters, but broad family

    • Easy to compute



But how to parameterize Σξ?

  • “Poking zeroes” into a positive definite matrix doesn’t work (see the numeric check below)

[Figure: a positive definite matrix over Y1, Y2, Y3 becomes non-positive-definite after zeroing the entry for the unrelated pair]
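A tiny numeric check of this claim; the matrix entries are ours, chosen only to show that zeroing one off-diagonal pair of a positive definite matrix can destroy positive definiteness.

```python
# "Poking zeroes": start from a positive definite matrix, zero one off-diagonal pair.
import numpy as np

A = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, 0.9],
              [0.9, 0.9, 1.0]])
print(np.linalg.eigvalsh(A))   # all eigenvalues positive: A is positive definite

B = A.copy()
B[0, 2] = B[2, 0] = 0.0        # zero the entry for the "unrelated" pair (Y1, Y3)
print(np.linalg.eigvalsh(B))   # smallest eigenvalue is now negative: not positive definite
```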



Approach #1

  • Assume we can find all cliques for the bi-directed subgraph of relations

  • Create a “factor analysis model”, where

    • for each clique Ci there is a latent variable Li

    • members of each clique are the only children of Li

    • Set of latents {L} is a set of N(0, 1) variables

    • coefficients in the model are equal to 1



Approach #1

  • Y1 = L1 + ε1

  • Y2 = L1 + L2 + ε2

[Figure: bi-directed relation graph over Y1, ..., Y4 and the corresponding factor analysis model with clique latents L1, L2]



Approach #1

  • In practice, we set the variance of each ε to a small constant (10⁻⁴)

  • The covariance between any two Ys is

    • proportional to the number of cliques to which they belong together

    • inversely proportional to the number of cliques each belongs to individually



Approach #1

  • Let U be the correlation matrix obtained from the proposed procedure

  • To define the error covariance, use a single hyperparameter γ ∈ [0, 1]

Σ = (I − γ Udiag) + γ U = (1 − γ) I + γ U

(the diagonal first term plays the role of Σ*; the second term is Σξ)



Approach #1

  • Notice: if everybody is connected, the model is exchangeable and simple

[Figure: a fully connected bi-directed graph over Y1, ..., Y4 collapses to a single latent L1, giving a Σξ with all off-diagonal entries equal]



Approach #1

  • Finding all cliques is “impossible” in general; what to do?

  • Triangulate and then extract cliques

    • Can be done in polynomial time

  • This is a relaxation of the problem, since constraints are thrown away

  • Can have bad side effects: the “Blow-Up” effect



Political Books dataset



Political Books dataset: the “Blow-up” effect



Approach #2

  • Don’t look for cliques: create a latent for each pair of variables

  • Very fast to compute, zeroes respected

[Figure: one latent per related pair, e.g. L13 for the pair (Y1, Y3), in the bi-directed graph over Y1, ..., Y4]



Approach #2

  • Correlations, however, are given by

  • Penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common

  • We call this the “pulverization” effect

Corr(i, j) ∝ 1 / sqrt(#neigh(i) · #neigh(j))
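A matching sketch of Approach #2 under the description above: one N(0, 1) latent per related pair, unit coefficients, and a tiny independent noise term; normalizing the implied covariance to a correlation matrix recovers the formula just shown. The helper name and the toy graph are illustrative.

```python
# Sketch of Approach #2: one latent per related pair; correlations fall off as
# 1 / sqrt(#neigh(i) * #neigh(j)), which produces the "pulverization" effect.
import numpy as np
import networkx as nx

def approach2_correlation(relation_graph, noise_var=1e-4):
    nodes = sorted(relation_graph.nodes())
    index = {v: i for i, v in enumerate(nodes)}
    edges = list(relation_graph.edges())

    # Loading matrix A: one column (latent) per related pair, entries equal to 1.
    A = np.zeros((len(nodes), len(edges)))
    for e, (u, v) in enumerate(edges):
        A[index[u], e] = 1.0
        A[index[v], e] = 1.0

    cov = A @ A.T + noise_var * np.eye(len(nodes))
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

# Toy chain Y1-Y2-Y3: Corr(1, 2) = 1/sqrt(1*2) ~ 0.71 and Corr(1, 3) = 0.
print(approach2_correlation(nx.Graph([(1, 2), (2, 3)])).round(3))
```

Note how Y2 having two neighbors already shrinks its correlations; in a dense graph this shrinkage is what the next slides call the “pulverization” effect.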



Political Books dataset



Political Books dataset: the “pulverization” effect



WebKB dataset: links among pages from the University of Washington



Approach #1



Approach #2



Comparison: undirected models

  • Generative stories

    • Conditional random fields (Lafferty, McCallum, Pereira, 2001)

    • Chu et al., 2006; Richardson and Spirtes, 2002

[Figure: undirected model over inputs X1, X2, X3 and labels Y1, Y2, Y3]



Chu Wei’s model

  • Dependency family equivalent to a pairwise Markov random field

[Figure: inputs X1, X2, X3, latent values Y1*, Y2*, Y3*, labels Y1, Y2, Y3, and observed relation indicators R12 = 1, R23 = 1]



Properties of undirected models

  • MRFs propagate information among “test” points

[Figure: relational network over Y1, ..., Y12, with information propagating among the test points]



Properties of DMG models

  • DMGs propagate information among “training” points

[Figure: the same network over Y1, ..., Y12, with information propagating among the training points]



Properties of DMG models

  • In a DMG, each “test” point will have a whole “training component” in its Markov blanket

[Figure: the same network over Y1, ..., Y12; each test point's Markov blanket includes an entire training component]



Properties of DMG models

  • It seems acceptable that a typical relational domain will not have an “extrapolation” pattern

    • of the kind found in typical “structured output” problems, e.g., NLP domains

  • Ultimately, the choice of model concerns the question:

    • “Hidden common causes” or “relational indicators”?



Experiment #1

  • A subset of the CORA database

    • 4,285 machine learning papers, 7 classes

    • Links: citations between papers

      • “hidden common cause” interpretation: particular ML subtopic being treated

    • Experiment: 7 binary classification problems, Class 5 vs. others

    • Criterion: AUC



Experiment #1

  • Comparisons:

    • Regular GP

    • Regular GP + citation adjacency matrix

    • Chu Wei’s Relational GP (RGP)

    • Our method, miXed graph GP (XGP)

  • Fairly easy task

  • Analysis of low-sample tasks

    • Uses 1% of the data (roughly 10 data points for training)

    • Not that useful for XGP, but more useful for RGP



Experiment #1

  • Chu Wei’s method get up to 0.99 in several of those…



Experiment #2

  • Political Books database

    • 105 datapoints, 100 runs using 50% for training

  • Comparison with standard Gaussian processes

    • Linear kernels

  • Results

    • 0.92 for regular GP

    • 0.98 for XGP (using pairwise kernel generator)

      • Hyperparameters optimized by grid search

    • Difference: 0.06 with std 0.02

    • Chu Wei’s method does the same…



Experiment #3

  • WebKB

    • Collections of webpages from 4 different universities

  • Task: “outlier classification”

    • Identify which pages are not student, course, project, or faculty pages

    • 10% for training data (still not that hard)

      • However, an order of magnitude more data than in CORA



Experiment #3

  • As far as I know, XGP easily gets the best results on this task



Future work

  • Tons of possibilities for how to parameterize the output covariance matrix

    • Incorporating relation attributes too

  • Heteroscedastic relational noise

  • Mixtures of relations

  • New approximation algorithms

  • Clustering problems

  • On-line learning



Thank You

