New Models for Relational Classification

Ricardo Silva (Statslab)

Joint work with Wei Chu and Zoubin Ghahramani

- Classification with non-iid data
- A source of non-iidness: relational information
- A new family of models, and what is new
- Applications to classification of text documents

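[Figure: graphical models for iid classification, with X and Y repeated over N data points plus Xnew and Ynew, and the unrolled version over X1, X2, Xnew, Y1, Y2, Ynew]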

- Relations
- Links between data points
- Webpage A links to Webpage B
- Movie A and Movie B are often rented together

- Relations as data
- “Linked webpages are likely to present similar content”
- “Movies that are rented together often have correlated personal ratings”

- Relations: “$Y_i$ precedes $Y_{i+k}$”, $k > 0$
- Dependencies: “Markov structure G”

[Figure: chain over Y1, Y2, Y3, …]

- How to model the dependencies among class labels?
- Movies that are often rented together might share all sorts of common, unmeasured factors
- These hidden common causes affect the ratings

[Figure: MovieFeatures(M1), MovieFeatures(M2), Rating(M1), Rating(M2), with candidate hidden common causes of the ratings: Same director? Same genre? Both released in same year? Target same age groups?]

- Of course, many of these common causes will be measured
- Many will not
- Idea:
- Postulate a hidden common cause structure, based on relations
- Define a model that is Markov with respect to this structure
- Design an adequate inference algorithm

- A network of books about recent US politics sold by the online bookseller Amazon.com
- Valdis Krebs, http://www.orgnet.com/

- Relations: frequent co-purchasing of books by the same buyers
- Political inclination factors as the hidden common causes

- Features:
- I collected the Amazon.com front page for each of the books
- Bag-of-words, tf-idf features, normalized to unity
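A minimal sketch of the feature step described above. It assumes a plain whitespace tokenizer, the common log(N/df) idf weighting, and that "normalized to unity" means unit Euclidean length; the slides do not state the exact weighting, and the toy documents and the function name `tfidf_unit_vectors` are hypothetical.

```python
import math
from collections import Counter

def tfidf_unit_vectors(docs):
    """Return (vocabulary, unit-length tf-idf vectors), one vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Assumed idf form: log(N / df); words present in every document get weight 0.
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        # Normalize to unit length; an all-zero vector is left untouched.
        vectors.append([v / norm for v in vec] if norm > 0 else vec)
    return vocab, vectors

# Hypothetical toy "front pages".
docs = [
    "tax cuts and small government",
    "health care and workers rights",
    "small government and free markets",
]
vocab, X = tfidf_unit_vectors(docs)
for row in X:
    print(round(sum(v * v for v in row), 6))  # squared norm of each document vector
```

With unit-length vectors, a linear kernel between two documents is just their cosine similarity.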

- Task:
- Binary classification: “liberal” or “not-liberal” books
- 43 liberal books out of 105

- We will
- show a classical multiple linear regression model
- build a relational variation
- generalize with a more complex set of independence constraints
- generalize it using Gaussian processes

- Y = (Y1, Y2), X = (X1, X2)
- Suppose you regress Y1 ~ X1, X2 and
- X2 turns out to be useless
- Analogously for Y2 ~ X1, X2 (X1 vanishes)

- Suppose you regress Y1 ~ X1, X2, Y2
- And now every variable is a relevant predictor
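The two regressions above can be checked with a small simulation. This is only a sketch under stated assumptions: Y1 = X1 + e1 and Y2 = X2 + e2, with the error terms sharing a hidden common cause so that Corr(e1, e2) = rho. The value rho = 0.8, the sample size, and the helper names `solve` and `ols` are all hypothetical choices for illustration.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(columns, target):
    """Least-squares coefficients (no intercept) via the normal equations."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * t for p, t in zip(columns[i], target)) for i in range(k)]
    return solve(xtx, xty)

random.seed(0)
n, rho = 20000, 0.8  # hypothetical sample size and error correlation
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
e2 = [random.gauss(0, 1) for _ in range(n)]
# e1 is correlated with e2 (hidden common cause): Corr(e1, e2) = rho, Var(e1) = 1.
e1 = [rho * b + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1) for b in e2]
y1 = [a + e for a, e in zip(x1, e1)]
y2 = [a + e for a, e in zip(x2, e2)]

coef_without_y2 = ols([x1, x2], y1)      # regress Y1 ~ X1, X2
coef_with_y2 = ols([x1, x2, y2], y1)     # regress Y1 ~ X1, X2, Y2
print([round(c, 2) for c in coef_without_y2])
print([round(c, 2) for c in coef_with_y2])
```

Under these assumptions the X2 coefficient is near zero in the first regression, while in the second the coefficients on X2 and Y2 are close to -rho and +rho: once Y2 is included, X2 helps by isolating the part of Y2 that is error shared with Y1.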

[Figure: graphs over X1, X2, Y1, Y2 for the regressions above]

[Figure: X: Capital(GE), Capital(Westinghouse); Y: Stock price(GE), Stock price(Westinghouse). First panel includes hidden common causes Industry factor 1, Industry factor 2, …, Industry factor k?; second panel shows the same X and Y variables only.]

Richardson (2003), Richardson and Spirtes (2002)

- Inspired by SUR
- Structure: directed mixed graphs (DMGs)
- Edges postulated from given relations

[Figure: DMG over X1, X2, X3, X4, X5 and Y1, Y2, Y3, Y4, Y5]

- Nonparametric Probit regression
- Zero-mean Gaussian process prior over f( . )

$P(y_i = 1 \mid x_i) = P(y^*(x_i) > 0)$

$y^*(x_i) = f(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, 1)$
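As a sketch of the generative side of this model: draw f from a zero-mean GP prior, then threshold the noisy latent y*. The squared-exponential kernel, its lengthscale, the input points, and the helper names are assumptions for illustration; the slides do not fix a kernel at this point.

```python
import math
import random

def rbf_kernel(xs, lengthscale=1.0, jitter=1e-8):
    """Squared-exponential kernel matrix (an assumed kernel choice) with diagonal jitter."""
    n = len(xs)
    return [[math.exp(-0.5 * ((xs[i] - xs[j]) / lengthscale) ** 2)
             + (jitter if i == j else 0.0) for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(1)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]          # hypothetical 1-D inputs
chol = cholesky(rbf_kernel(xs))
z = [random.gauss(0, 1) for _ in xs]
# f ~ N(0, K): multiply the Cholesky factor by standard normal draws.
f = [sum(chol[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]
# P(y_i = 1 | f) = P(f(x_i) + eps_i > 0) = Phi(f(x_i)) for eps_i ~ N(0, 1).
p = [std_normal_cdf(v) for v in f]
# Sample labels by thresholding the noisy latent y*(x_i).
y = [1 if v + random.gauss(0, 1) > 0 else 0 for v in f]
print([round(v, 3) for v in p], y)
```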

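- Make $\{\epsilon_i\}$ a dependent multivariate Gaussian
- For convenience, decouple it into two error terms

$\epsilon = \epsilon^* + \zeta$, with $\epsilon^*$ and $\zeta$ independent of each other

- $\epsilon^*$: marginally independent
- $\zeta$: dependent according to relations

$\Sigma = \Sigma^* + \Sigma_\zeta$

- $\Sigma^*$: diagonal
- $\Sigma_\zeta$: not diagonal, with 0s only on unrelated pairs

$y^*(x_i) = f(x_i) + \epsilon_i = f(x_i) + \zeta_i + \epsilon^*_i = g(x_i) + \epsilon^*_i$

- If K was the original kernel matrix for f(.), the covariance of g(.) is simply

$\Sigma_g = K + \Sigma_\zeta$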

- Posterior for f(.), g(.) is a truncated Gaussian, hard to integrate
- Approximate posterior with a Gaussian
- Expectation-Propagation (Minka, 2001)

- The reason for $\epsilon^*$ becomes apparent in the EP approximation

- Likelihood does not factorize over f( . ), but factorizes over g( . )
- Approximate each factor p(yi | g(xi)) with a Gaussian
- if $\epsilon^*$ were 0, yi would be a deterministic function of g(xi)

$p(g \mid x, y) \propto p(g \mid x) \prod_i p(y_i \mid g(x_i))$
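The core step of such a Gaussian approximation is moment matching for a single probit factor. The sketch below is not the authors' implementation; it uses the standard closed form for a Gaussian "cavity" N(m, v) multiplied by a probit factor, assuming labels coded as y in {-1, +1} and an independent-noise variance `s2` (a hypothetical name), and checks the closed form against brute-force numerical integration.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_moment_match(m, v, y, s2):
    """Mean and variance of the density proportional to N(g; m, v) * Phi(y * g / sqrt(s2))."""
    denom = math.sqrt(v + s2)
    z = y * m / denom
    ratio = normal_pdf(z) / normal_cdf(z)
    mean = m + y * v * ratio / denom
    var = v - (v * v / (v + s2)) * ratio * (z + ratio)
    return mean, var

def numeric_moments(m, v, y, s2, n_grid=20001):
    """Same two moments by summing the unnormalized density on a fine grid."""
    sd = math.sqrt(v)
    lo, hi = m - 10.0 * sd, m + 10.0 * sd
    step = (hi - lo) / (n_grid - 1)
    z0 = z1 = z2 = 0.0
    for k in range(n_grid):
        g = lo + k * step
        w = normal_pdf((g - m) / sd) / sd * normal_cdf(y * g / math.sqrt(s2))
        z0 += w
        z1 += w * g
        z2 += w * g * g
    mean = z1 / z0
    return mean, z2 / z0 - mean * mean

# Hypothetical cavity parameters and label.
m, v, y, s2 = 0.3, 2.0, -1, 0.5
matched = probit_moment_match(m, v, y, s2)
numeric = numeric_moments(m, v, y, s2)
print(matched, numeric)
```

As `s2` shrinks toward zero the probit factor approaches a hard step in g, which is one way to read the remark that the label would become a deterministic function of g(xi).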

- This can be generalized for any number of relations

[Figure: graph over Y1, Y2, Y3, Y4, Y5]

$\epsilon = \epsilon^* + \zeta_1 + \zeta_2 + \zeta_3$

- Non-trivial
- Desiderata:
- Positive definite
- Zeroes in the right places
- Few parameters, but broad family
- Easy to compute

- “Poking zeroes” on a positive definite matrix doesn’t work

[Figure: a positive definite covariance matrix over Y1, Y2, Y3, and the same matrix with an entry set to zero, which is not positive definite]
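A small numeric instance of this failure, as a sketch: the 3 x 3 matrix and the 0.9 correlations are hypothetical values chosen for illustration, and positive definiteness is tested by attempting a Cholesky factorization.

```python
import math

def is_positive_definite(a):
    """A symmetric matrix is positive definite iff its Cholesky factorization succeeds."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True

# Hypothetical covariance over (Y1, Y2, Y3): all pairs strongly correlated.
full = [[1.0, 0.9, 0.9],
        [0.9, 1.0, 0.9],
        [0.9, 0.9, 1.0]]

# "Poke" a zero for the pair (Y1, Y3), as if that pair were unrelated.
poked = [row[:] for row in full]
poked[0][2] = poked[2][0] = 0.0

print(is_positive_definite(full))   # True
print(is_positive_definite(poked))  # False
```

The poked matrix asks Y2 to be correlated 0.9 with both Y1 and Y3 while Y1 and Y3 are uncorrelated, which no joint Gaussian can satisfy.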

- Assume we can find all cliques for the bi-directed subgraph of relations
- Create a “factor analysis model”, where
- for each clique Ci there is a latent variable Li
- members of each clique are the only children of Li
- Set of latents {L} is a set of N(0, 1) variables
- coefficients in the model are equal to 1

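[Figure: latent variables L1 and L2 as parents of the members of their cliques, for a graph over Y1, Y2, Y3, Y4]

- $Y_1 = L_1 + \epsilon_1$
- $Y_2 = L_1 + L_2 + \epsilon_2$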

- In practice, we set the variance of each $\epsilon_i$ to a small constant ($10^{-4}$)
- Covariance between any two Ys is
- proportional to the number of cliques they share
- inversely proportional to the number of cliques they belong to individually
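The construction above can be sketched directly: one N(0, 1) latent per clique, unit coefficients, and a small independent noise variance. The clique list below is a hypothetical example (variables are 0-indexed, so clique {0, 1} stands for {Y1, Y2}), and `clique_correlation` is an illustrative name, not a function from the original work.

```python
import math

def clique_correlation(n, cliques, noise_var=1e-4):
    """Correlation matrix implied by Y_i = (sum of latents of cliques containing i) + noise."""
    # Loading matrix: lam[i][c] = 1 if variable i is a child of the latent for clique c.
    lam = [[1.0 if i in c else 0.0 for c in cliques] for i in range(n)]
    # Covariance = lam lam^T + noise_var * I: off-diagonal entries count shared cliques.
    cov = [[sum(lam[i][c] * lam[j][c] for c in range(len(cliques)))
            + (noise_var if i == j else 0.0) for j in range(n)] for i in range(n)]
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]

# Hypothetical cliques over four variables: {Y1, Y2} and {Y2, Y3, Y4}.
cliques = [{0, 1}, {1, 2, 3}]
U = clique_correlation(4, cliques)
for row in U:
    print([round(v, 3) for v in row])
```

In this example Y1 and Y3 share no clique, so their entry is exactly zero, while Y2 sits in two cliques and its correlations are scaled down by roughly the square root of 2.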

- Let U be the correlation matrix obtained from the proposed procedure
- To define the error covariance, use a single hyperparameter $\rho \in [0, 1]$

$\Sigma = (I - \rho\, U_{\mathrm{diag}}) + \rho\, U$, where the first term is $\Sigma^*$

- Notice: if everybody is connected, model is exchangeable and simple

[Figure: fully connected graph over Y1, Y2, Y3, Y4, equivalent to a single latent L1 as common parent of all four]

- Finding all cliques is “impossible”, what to do?
- Triangulate and then extract cliques
- Can be done in polynomial time

- This is a relaxation of the problem, since constraints are thrown away
- Can have bad side effects: the “Blow-Up” effect
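One standard way to triangulate is greedy vertex elimination; the sketch below uses a min-degree ordering, which is an assumption (the slides do not say which triangulation was used). On a hypothetical 4-cycle it shows how a fill-in edge puts two originally unrelated variables into a common clique, which is the sense in which constraints are thrown away.

```python
def triangulate(n, edges):
    """Greedy min-degree elimination. Returns (fill_in_edges, maximal_cliques)."""
    work = {i: set() for i in range(n)}
    for a, b in edges:
        work[a].add(b)
        work[b].add(a)
    fill_in, candidates = [], []
    while work:
        # Eliminate the remaining vertex with the fewest neighbors (ties: lowest index).
        v = min(work, key=lambda u: (len(work[u]), u))
        nbrs = work.pop(v)
        candidates.append(frozenset(nbrs | {v}))
        for a in nbrs:
            work[a].discard(v)
            for b in nbrs:
                # Connect every pair of neighbors of v that is not yet connected.
                if a < b and b not in work[a]:
                    work[a].add(b)
                    work[b].add(a)
                    fill_in.append((a, b))
    # Keep only elimination cliques that are not contained in a larger one.
    maximal = [c for c in candidates if not any(c < d for d in candidates)]
    return fill_in, maximal

# Hypothetical relation graph: a 4-cycle 0 - 1 - 2 - 3 - 0 (pairs (0, 2) and (1, 3) unrelated).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
fill_in, cliques = triangulate(4, edges)
print(fill_in)
print(sorted(sorted(c) for c in cliques))
```

Here the fill-in edge joins variables 1 and 3, so a clique-based covariance built from the triangulated graph would give that pair a nonzero entry even though the original relations say it should be zero.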

- Don’t look for cliques: create a latent for each pair of variables
- Very fast to compute, zeroes respected

[Figure: one latent per related pair, e.g. L13 as the common parent of Y1 and Y3, in a graph over Y1, Y2, Y3, Y4]

- Correlations, however, are given by

$\mathrm{Corr}(i, j) \propto \dfrac{1}{\sqrt{\#\mathrm{neigh}(i) \cdot \#\mathrm{neigh}(j)}}$

- Penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common
- We call this the “pulverization” effect
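A sketch of the pairwise construction and the effect just described, assuming one N(0, 1) latent per related pair, unit coefficients, and a small independent noise variance of 1e-4; the example graphs and the name `pairwise_correlation` are hypothetical.

```python
import math
from itertools import combinations

def pairwise_correlation(n, edges, noise_var=1e-4):
    """Correlation matrix implied by one latent per related pair with unit coefficients."""
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    related = {frozenset(e) for e in edges}
    # Var(Y_i) = number of pairwise latents touching i, plus the noise variance.
    var = [d + noise_var for d in degree]
    corr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                corr[i][j] = 1.0
            elif frozenset((i, j)) in related:
                # Each related pair shares exactly one latent, so Cov(Y_i, Y_j) = 1.
                corr[i][j] = 1.0 / math.sqrt(var[i] * var[j])
    return corr

# Zeroes are respected: in the chain 0 - 1 - 2, the pair (0, 2) stays at zero.
chain = pairwise_correlation(3, [(0, 1), (1, 2)])
print(chain[0][2])

# Pulverization: in a fully connected graph on 6 nodes every pair is related,
# yet each correlation is only about 1/5.
complete = pairwise_correlation(6, list(combinations(range(6), 2)))
print(round(complete[0][1], 3))
```

A single clique latent for the same fully connected graph would give correlations close to 1, which is the contrast between the two constructions.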

- Generative stories
- Conditional random fields (Lafferty, McCallum, Pereira, 2001)
- Chu et al., 2006 / Richardson and Spirtes, 2002

[Figure: graphical models over X1, X2, X3 and Y1, Y2, Y3]

- Dependency family equivalent to a pairwise Markov random field

[Figure: latent variables Y1*, Y2*, Y3* and labels Y1, Y2, Y3, with relational indicators R12 = 1 and R23 = 1]

- MRFs propagate information among “test” points

- DMGs propagate information among “training” points

- In a DMG, each “test” point will have in the Markov blanket a whole “training component”

- It seems acceptable that a typical relational domain will not have an “extrapolation” pattern
- Like typical “structured output” problems, e.g., NLP domains

- Ultimately, the choice of model concerns the question:
- “Hidden common causes” or “relational indicators”?

- A subset of the CORA database
- 4,285 machine learning papers, 7 classes
- Links: citations between papers
- “hidden common cause” interpretation: particular ML subtopic being treated

- Experiment: 7 binary classification problems, Class 5 vs. others
- Criterion: AUC
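For reference, AUC can be computed as the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counted as one half. The sketch below uses that rank-based definition; the scores and labels are made-up values, not results from the experiments.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores (higher means "more likely in the class") and true labels.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
print(round(auc(scores, labels), 3))  # 0.833
```

Because it depends only on the ranking of the scores, AUC is unaffected by the class imbalance of a one-vs-others setup.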

- Comparisons:
- Regular GP
- Regular GP + citation adjacency matrix
- Wei Chu’s Relational GP (RGP)
- Our method, miXed graph GP (XGP)

- Fairly easy task
- Analysis of low-sample tasks
- Uses 1% of the data (roughly 10 data points for training)
- Not that useful for XGP, but more useful for RGP

- Wei Chu’s method gets up to 0.99 in several of those…

- Political Books database
- 105 datapoints, 100 runs using 50% for training

- Comparison with standard Gaussian processes
- Linear kernels

- Results
- 0.92 for regular GP
- 0.98 for XGP (using pairwise kernel generator)
- Hyperparameters optimized by grid search

- Difference: 0.06 with std 0.02
- Wei Chu’s method does the same…

- WebKB
- Collections of webpages from 4 different universities

- Task: “outlier classification”
- Identify which pages are not student, course, project or faculty pages
- 10% for training data (still not that hard)
- However, an order of magnitude more data than in Cora

- As far as I know, XGP easily gets the best results on this task

- Tons of possibilities for how to parameterize the output covariance matrix
- Incorporating relation attributes too

- Heteroscedastic relational noise
- Mixtures of relations
- New approximation algorithms
- Clustering problems
- On-line learning