## New Models for Relational Classification


**New Models for Relational Classification**
Ricardo Silva (Statslab). Joint work with Wei Chu and Zoubin Ghahramani.

**The talk**
- Classification with non-iid data
- A source of non-iid-ness: relational information
- A new family of models, and what is new about it
- Applications to classification of text documents

**Standard setup**
*(Diagram: the iid setting — features X generate labels Y for N training points, and Xnew generates Ynew independently.)*

**Prediction with non-iid data**
*(Diagram: the labels Y1, Y2 and Ynew are mutually dependent given X1, X2 and Xnew.)*

**Where does the non-iid information come from?**
- Relations: links between data points
  - Webpage A links to webpage B
  - Movie A and movie B are often rented together
- Relations as data
  - "Linked webpages are likely to present similar content"
  - "Movies that are often rented together tend to have correlated personal ratings"

**The vanilla relational domain: time series**
- Relations: "Yi precedes Yi+k", k > 0
- Dependencies: "Markov structure G"
*(Diagram: a Markov chain Y1 → Y2 → Y3 → …)*

**A model for integrating link data**
- How should we model the dependencies among class labels?
- Movies that are often rented together might share common, unmeasured factors
- These hidden common causes affect the ratings

*(Diagram: MovieFeatures(M1) → Rating(M1) and MovieFeatures(M2) → Rating(M2), with a hidden common cause of both ratings.)*
Examples of such factors:
- Same director?
- Same genre?
- Both released in the same year?
- Target the same age groups?

**Integrating link data**
- Of course, many of these common causes will be measured; many will not
- Idea:
  - postulate a hidden common cause structure, based on the relations
  - define a model that is Markov with respect to this structure
  - design an adequate inference algorithm

**Example: Political Books database**
- A network of books about recent US politics sold by the online bookseller Amazon.com (Valdis Krebs, http://www.orgnet.com/)
- Relations: frequent co-purchasing of books by the same buyers
- Political inclination factors act as the hidden common causes

**Political Books database**
- Features:
  - I collected the Amazon.com front page for each of the books
  - bag-of-words tf-idf features, normalized to unit length
- Task:
  - binary classification: "liberal" or "not-liberal" books
  - 43 liberal books out of 105

**Contribution**
We will:
- show a classical multiple linear regression model
- build a relational variation of it
- generalize it with a more complex set of independence constraints
- generalize it further using Gaussian processes

**Seemingly unrelated regression (Zellner, 1962)**
- Y = (Y1, Y2), X = (X1, X2)
- Suppose you regress Y1 ~ X1, X2 and X2 turns out to be useless
- Analogously for Y2 ~ X1, X2 (X1 vanishes)
- Now suppose you regress Y1 ~ X1, X2, Y2: every variable becomes a relevant predictor

**Graphically, with latents**
*(Diagram: X = Capital(GE), Capital(Westinghouse); Y = Stock price(GE), Stock price(Westinghouse); hidden "industry factors" 1, 2, …, k are common causes of both stock prices.)*

**The Directed Mixed Graph (DMG)**
*(Diagram: the same model with the latent industry factors summarized by a bi-directed edge between the two stock prices.)*
Richardson (2003); Richardson and Spirtes (2002)

**A new family of relational models**
- Inspired by SUR
- Structure: DMG graphs
- Edges postulated from the given relations
*(Diagram: X1, …, X5 → Y1, …, Y5, with bi-directed edges among related Y's.)*

**Model for binary classification**
- Nonparametric probit regression
- Zero-mean Gaussian process prior over f(·)

P(yi = 1 | xi) = P(y∗(xi) > 0), where y∗(xi) = f(xi) + εi, εi ~ N(0, 1)

**Relational dependency model**
- Make {εi} a dependent multivariate Gaussian
- For convenience, decouple it into two error terms: ε = ε∗ + ε̄

**Dependency model: the decomposition**
ε = ε∗ + ε̄, where the two terms are independent of each other:
- ε∗: components marginally independent (covariance Σ∗ is diagonal)
- ε̄: dependent according to the relations (covariance Σ̄ is not diagonal, with zeros only on unrelated pairs)

**Dependency model: the decomposition**
- Write y∗(xi) = f(xi) + εi = f(xi) + ε̄i + ε∗i = g(xi) + ε∗i
- If K was the original kernel matrix for f(·), the covariance of g(·) is simply K + Σ̄

**Approximation**
- The posterior over f(·), g(·) is a truncated Gaussian, hard to integrate
- Approximate the posterior with a Gaussian: Expectation Propagation (Minka, 2001)
- The reason for ε∗ becomes apparent in the EP approximation

**Approximation**
- The likelihood does not factorize over f(·), but it does factorize over g(·):
  p(g | X, y) ∝ p(g | X) ∏i p(yi | g(xi))
- Approximate each factor p(yi | g(xi)) with a Gaussian
- If ε∗ were 0, yi would be a deterministic function of g(xi)

**Generalizations**
- This can be generalized to any number of relations: ε = ε∗ + ε̄1 + ε̄2 + ε̄3 + …

**But how to parameterize Σ̄?**
- Non-trivial. Desiderata:
  - positive definite
  - zeros in the right places
  - few parameters, but a broad family
  - easy to compute

**But how to parameterize Σ̄?**
- "Poking zeros" into a positive definite matrix doesn't work: zeroing entries of a positive definite matrix over Y1, Y2, Y3 can yield a matrix that is no longer positive definite

**Approach #1**
- Assume we can find all cliques of the bi-directed subgraph of relations
- Create a "factor analysis" model where:
  - for each clique Ci there is a latent variable Li
  - the members of each clique are the only children of Li
  - the set of latents {L} are independent N(0, 1) variables
  - all coefficients in the model are equal to 1

**Approach #1**
- Example: Y1 = L1 + ε1, Y2 = L1 + L2 + ε2, …
*(Diagram: latents L1, L2 over cliques of Y1, Y2, Y3, Y4.)*

**Approach #1**
- In practice, we set the variance of each εi to a small constant (10⁻⁴)
- The covariance between any two Y's is
  - proportional to the number of cliques they
belong to jointly
  - inversely proportional to the number of cliques they belong to individually

**Approach #1**
- Let U be the correlation matrix obtained from the proposed procedure
- To define the error covariance, use a single hyperparameter ρ ∈ [0, 1]: Σ̄ = ρU + (1 − ρ)I

**Approach #1**
- Notice: if everybody is connected, the model is exchangeable and simple — a single latent L1 is a parent of every Y

**Approach #1**
- Finding all cliques is "impossible" in general. What can we do?
- Triangulate the graph and then extract the cliques
  - can be done in polynomial time
  - this is a relaxation of the problem, since some constraints are thrown away
  - can have bad side effects: the "blow-up" effect

**Approach #2**
- Don't look for cliques: create one latent for each pair of related variables
- Very fast to compute, and the zeros are respected
*(Diagram: a latent Lij for each bi-directed edge Yi ↔ Yj.)*

**Approach #2**
- The correlations, however, are given by
  Corr(εi, εj) = 1 / sqrt(#neigh(i) · #neigh(j))
- This penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common
- We call this the "pulverization" effect

**Comparison: undirected models**
- Generative stories
- Conditional random fields (Lafferty, McCallum & Pereira, 2001)
- Chu et al. (2006); Richardson and Spirtes (2002)
*(Diagram: X1, X2, X3 with an undirected structure over Y1, Y2, Y3.)*

**Wei Chu's model**
- Dependency family equivalent to a pairwise Markov random field
*(Diagram: X1, X2, X3 → Y1∗, Y2∗, Y3∗, with relation indicators R12 = 1 and R23 = 1 linking Y1–Y2 and Y2–Y3.)*

**Properties of undirected models**
- MRFs propagate information among "test" points
*(Diagram: a relational network over points Y1, …, Y12.)*

**Properties of DMG models**
- DMGs propagate information among "training" points

**Properties of DMG models**
- In a DMG, each "test" point has a whole "training component" inside its Markov blanket

**Properties of DMG models**
- It seems acceptable that a typical relational domain will not have an "extrapolation" pattern like typical "structured output" problems (e.g., NLP domains)
- Ultimately, the choice of model concerns the question: "hidden common causes" or "relational indicators"?

**Experiment #1**
- A subset of the CORA database: 4,285 machine learning papers, 7 classes
- Links: citations between papers
- "Hidden common cause" interpretation: the particular ML subtopic being treated
- Experiment: 7 binary classification problems, one class vs. the others
- Criterion: AUC

**Experiment #1**
- Comparisons:
  - regular GP
  - regular GP + citation adjacency matrix
  - Wei Chu's Relational GP (RGP)
  - our method, the miXed graph GP (XGP)
- A fairly easy task
- Analysis of low-sample tasks:
  - uses 1% of the data (roughly 10 data points for training)
  - not that useful for XGP, but more useful for RGP
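The Approach #2 construction is concrete enough to sketch in code. Below is a minimal NumPy illustration (not from the talk; `pairwise_latent_cov` and `rbf_kernel` are made-up names, and the RBF kernel is just a stand-in for whatever kernel K one would use) of the pairwise-latent covariance, the "pulverization" of the correlations, and the combined covariance K + Σ̄ for g(·):

```python
import numpy as np

def pairwise_latent_cov(n, edges, var_star=1e-4):
    """Approach #2: one independent N(0, 1) latent L_ij per related
    pair (i, j), entering Y_i and Y_j with coefficient 1, plus a
    small independent noise term of variance var_star.

    Then Cov(Y_i, Y_j) = 1 for related pairs, 0 for unrelated ones,
    and Var(Y_i) = #neighbors(i) + var_star.
    """
    cov = np.zeros((n, n))
    deg = np.zeros(n)
    for i, j in edges:
        cov[i, j] = cov[j, i] = 1.0  # the shared latent L_ij
        deg[i] += 1.0
        deg[j] += 1.0
    cov[np.diag_indices(n)] = deg + var_star
    return cov

def rbf_kernel(X, lengthscale=1.0):
    """Standard squared-exponential kernel, standing in for K."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

# A chain of four related items: Y0 -- Y1 -- Y2 -- Y3.
edges = [(0, 1), (1, 2), (2, 3)]
cov = pairwise_latent_cov(4, edges)

# Normalize to correlations: Corr(i, j) ~ 1 / sqrt(#neigh(i) * #neigh(j)).
d = np.sqrt(np.diag(cov))
corr = cov / np.outer(d, d)
# "Pulverization": Corr(0, 1) ~ 1/sqrt(1*2), but Corr(1, 2) ~ 1/sqrt(2*2),
# penalized simply because nodes 1 and 2 have two neighbors each.

# Covariance of g(.) is K + Sigma_bar; positive definite by construction,
# since Sigma_bar is a factor-analysis covariance plus var_star * I.
X = np.random.default_rng(0).normal(size=(4, 2))
Sigma_g = rbf_kernel(X) + cov
assert np.all(np.linalg.eigvalsh(Sigma_g) > 0)
```

Note how the zeros are respected exactly (Corr(0, 2) = 0, since Y0 and Y2 are unrelated), while the non-zero correlations shrink with the node degrees — the "pulverization" effect the slides warn about.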