
Probabilistic Reasoning and Learning with Permutations: Thesis Defense, 7/29/2011


Presentation Transcript


  1. Probabilistic Reasoning and Learning with Permutations. Thesis Defense, 7/29/2011. Jonathan Huang. Collaborators: Carlos Guestrin (CMU), Leonidas Guibas (Stanford), Xiaoye Jiang (Stanford), Ashish Kapoor (Microsoft).

  2. Political Elections in Ireland. “Recent polling … indicates Doherty [Sinn Fein Party] is leading the race.” “But Ireland's complicated [election] system of proportional representation … could upset the front-runner and help … the Fianna Fail candidate running second in the polls, to snatch victory.” [Figure: sample ballot with candidates numbered by preference.]

  3. Proportional Representation. Used by: the Irish Parliament, the Maltese Parliament, the Australian Senate, the Iceland Constitutional Assembly, the Academy Awards, the University of Cambridge, Scottish local governments, Cambridge (Mass.) local elections, ...

  4. 2002 Irish Election Data: 64,081 votes, 14 candidates [Gormley, Murphy, 2006]. Statistical analysis of voting data can: predict winners; identify “voting blocs”; help formulate campaign strategies; engender an informed, effective democracy.

  5. Distributions over Permutations. A distribution assigns a probability to each way of matching candidates to ranks, e.g., “With probability 1/10: Candidate A ranked first, Candidate B ranked third, Candidate C ranked second, Candidate D ranked last.”

  6. Permutations are Ubiquitous! They arise in politics (rankings of candidates), in preferences (rankings of items), and in multi-object tracking (assignments of identities to tracks). [Figure: example rankings and a tracking scenario.]

  7. Problem #1: Representation How can we tractably represent distributions over n! permutations in storage?

  8. First-order summary [Shin et al., '03]. For each (j, i) pair, store P(candidate j is in rank i): an n x n matrix of marginals. [Figure: 14 x 14 heatmap over candidates (FF, FG, Independents, GP, CS, SF, L) and ranks; e.g., 10% of voters rank Sinn Fein first, while 25% of voters rank Sinn Fein last.]
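A minimal sketch of how such a first-order matrix could be estimated from ballots (not the thesis code; the function name and toy data are made up for illustration):

```python
# Hedged sketch (not the thesis code): estimate the first-order summary
# M[j, i] = P(candidate j is placed at rank i) from sample ballots.
# Each ballot is a tuple whose i-th entry is the candidate at rank i.
import numpy as np

def first_order_summary(ballots, n):
    M = np.zeros((n, n))                     # rows: candidates, cols: ranks
    for ballot in ballots:
        for rank, candidate in enumerate(ballot):
            M[candidate, rank] += 1.0
    return M / len(ballots)

# Toy usage with n = 3 candidates and three full ballots.
ballots = [(0, 1, 2), (0, 2, 1), (1, 0, 2)]
print(first_order_summary(ballots, 3))       # doubly stochastic: each row
                                             # and each column sums to 1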

  9. Decomposable Distributions. Additive decomposition: write functions on permutations as sums of simpler functions. Multiplicative decomposition: write functions on permutations as products of simpler functions.

  10. Additive (Fourier) Decompositions. Approximate distributions over permutations with low-frequency basis functions ([Kondor 2007; Huang 2007; Huang 2009]): expand f as a weighted sum of Fourier basis functions, ordered from low to high frequency, and store only the low-frequency Fourier coefficients to approximate f. [Figure: a signal written as f = .6ψ1 + .2ψ2 + .1ψ3 + …, with coefficients decaying at higher frequencies.]

  11. Fourier coefficients for permutations. Fourier coefficients for distributions on permutations are matrix-valued [Diaconis, '88]. The first two coefficient matrices exactly reconstruct all first-order probabilities; the first three exactly reconstruct all second-order probabilities; the complete set reconstructs all n! original probabilities.
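For reference, the matrix-valued coefficients come from the Fourier transform on the symmetric group, which evaluates f against each irreducible matrix representation ρ (as in [Diaconis, '88]):

```latex
\hat{f}_{\rho} \;=\; \sum_{\sigma \in S_n} f(\sigma)\, \rho(\sigma),
\qquad \rho \ \text{an irreducible representation of } S_n .
```

Keeping only the first few ρ's (the low-frequency ones) is what recovers the first- and second-order probabilities named above.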

  12. Second-order summary (submatrix). For each pair of ranks and pair of candidates, store the probability that the candidate pair occupies the rank pair; this captures higher-order dependencies with O(n^4) storage. [Figure: heatmap over rank pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) and candidate pairs; e.g., 7% of voters placed two Fianna Fail candidates consecutively in ranks 1 and 2.]

  13. Accuracy/Storage Trade-off. Problem #1 (Representation): storing a low-frequency Fourier approximation is equivalent to storing low-order probabilities, and can be done in polynomial space. Low-frequency Fourier approximations generalize the first-order summary!

  14. Contributions. Additive (Fourier) decomposition: polynomial storage for approximate distributions; low frequency = maintaining probabilities over small sets [NIPS07, JMLR09]. Multiplicative decomposition and inference: still to come.

  15. Problem #2: Probabilistic Inference in Ranking What are the odds that someone will rank Sinn Fein first if he ranks Fianna Fail second? If I prefer Titanic to Star Wars, am I likely to also prefer The English Patient to Jurassic Park? If a voter ranks Labour first, is he more likely to prefer Fine Gael over Fianna Fail?

  16. Problem #2: Inference. Compute P(candidate ranking σ | z), where, e.g., z = “Fianna Fail ranked second”. Bayes rule: posterior ∝ prior × likelihood. How can we efficiently compute a posterior based on a new observation? Naively, the update touches every ranking (ABCD, BACD, ACBD, CABD, …), so the complexity is O(n!).
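As a concrete baseline, here is a hedged brute-force sketch of the naive update the slide is pointing at (the uniform prior and the specific observation are toy choices, not from the talk):

```python
# Hedged brute-force sketch of the O(n!) Bayes update: enumerate every
# ranking, reweight by the likelihood, renormalize.
from itertools import permutations

n = 4                                         # candidates 0..3 ("A".."D")
perms = list(permutations(range(n)))
prior = {p: 1.0 / len(perms) for p in perms}  # uniform toy prior

def likelihood(p):
    """P(z | ranking) for z = 'candidate 0 is ranked second'."""
    return 1.0 if p[1] == 0 else 0.0          # p[rank] = candidate at rank

unnorm = {p: prior[p] * likelihood(p) for p in perms}
evidence = sum(unnorm.values())               # P(z)
posterior = {p: w / evidence for p, w in unnorm.items()}
print(len(perms), round(sum(posterior.values()), 6))   # 24 1.0
```

Exact, but the cost explodes with n (14! is roughly 87 billion for the Irish data), which is what motivates the Fourier approach on the next slides.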

  17. Inference with Fourier coefficients. Given: the prior P(ranking) and the likelihood P(“Sinn Fein is first” | ranking). Compute: the posterior P(ranking | “Sinn Fein is first”). From signal processing: pointwise products correspond to convolutions of Fourier coefficients.

  18. Inference with Fourier coefficients. Pointwise products (prior × likelihood → posterior) correspond to (generalized) convolution in the Fourier domain. Our algorithm applies to arbitrary distributions defined over arbitrary finite groups. [Huang et al., NIPS 2007]
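The signal-processing fact being invoked, written for a general finite group G: the Fourier transform turns convolution into matrix products, so pointwise products in the primal domain correspond dually to a (generalized) convolution of the Fourier coefficient matrices:

```latex
(f * g)(\sigma) \;=\; \sum_{\pi \in G} f(\sigma \pi^{-1})\, g(\pi),
\qquad
\widehat{f * g}_{\rho} \;=\; \hat{f}_{\rho}\, \hat{g}_{\rho}.
```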

  19. Bandlimiting. Discard “high-frequency” coefficients after conditioning; equivalently, maintain low-order probabilities. Theorem: given the r-th order terms of the prior and an s-th order likelihood, the (r-s)-th order terms of the posterior can be exactly computed. (Fourier methods work best on low-order observations.) [Huang et al., NIPS 2009]

  20. Dealing with the Impossible. Infeasible approximations (e.g., negative probabilities) can arise due to bandlimiting. Solution [Huang, 2007]: project the infeasible approximation to the nearest point in the space of coefficients corresponding to feasible probabilities. Efficient projection (to a relaxed polytope) is possible using a quadratic program.
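The thesis projects Fourier coefficients via a quadratic program; as a loose stand-in for intuition, the sketch below repairs an infeasible first-order matrix with clipping plus Sinkhorn-style normalization toward the doubly stochastic matrices (a relaxed polytope for first-order marginals). This is a cheap heuristic, not the exact nearest-point QP projection:

```python
# Loose stand-in for the slide's QP projection: push an infeasible
# first-order matrix toward the doubly stochastic polytope (nonnegative,
# rows and columns summing to 1). Heuristic only.
import numpy as np

def repair_first_order(M, iters=200, eps=1e-12):
    M = np.clip(M, 0.0, None)                         # no negative "probabilities"
    for _ in range(iters):
        M = M / (M.sum(axis=1, keepdims=True) + eps)  # fix row sums
        M = M / (M.sum(axis=0, keepdims=True) + eps)  # fix column sums
    return M

infeasible = np.array([[ 0.70, 0.40, -0.10],
                       [ 0.20, 0.30,  0.50],
                       [ 0.10, 0.30,  0.60]])
print(repair_first_order(infeasible))   # rows and columns now sum to ~1
```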

  21. Permutations in Tracking. Applications: monitoring for assisted living, video analysis for sports, video surveillance for crowds. [Figure: Tracks 1-4 through a scene.]

  22. Probabilistic Inference in Tracking. Inference problem: where is Alice? Track identities become uncertain when people pass close to each other, e.g., mixing at tracks (1,2), (1,3), and (1,4). [Figure: Tracks 1-4 with mixing events.]

  23. Simulated tracking data: projection to the marginal polytope versus no projection (n = 6). [Figure: approximation error (lower is better) for 1st-, 2nd-, and 3rd-order approximations, with and without projection; baseline: approximation by a uniform distribution.]

  24. Tracking with a camera network. Camera network data: 8 cameras, multi-view, with occlusion effects; 11 individuals in a lab; identity observations obtained from color histograms; mixing events declared when people walk close to each other. [Figure: % of tracks correctly identified (higher is better) for an omniscient tracker, 2nd-order inference with projection, without projection, and time-independent classification.] Summary of Problem #2 (Inference): inference can be formulated in the Fourier domain as (generalized) convolution and approximated via bandlimiting/projections; low-order observations = polytime, accurate inference.

  25. Contributions. Additive (Fourier) decomposition: polynomial storage for approximate distributions; low frequency = maintaining probabilities over small sets [NIPS07, JMLR09]; polytime Fourier-domain conditioning algorithm for finite groups, with an approximation guarantee for low-order observations [NIPS07, JMLR09]. Multiplicative decomposition: to come.

  26. Even polynomial is too slow… [Figure: running time in seconds vs. n (4 to 8) for exact inference and 1st-, 2nd-, and 3rd-order approximations; lower is better.] Can we achieve more compact representations?

  27. Riffled Independence. Idea: assume a ranking is created by “shuffling” smaller, independent rankings: rank the fruits, rank the veggies, then interleave (riffle shuffle) the veggie/fruit rankings to form a complete ranking (e.g., Artichoke > Cherry > Broccoli > Dates). Riffle independent distributions can be represented with a reduced set of parameters! [Huang, Guestrin, 2009]
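A hedged sketch of this generative story, with uniform factor distributions and made-up item names:

```python
# Sketch of riffled independence as a sampler: rank each set independently,
# then riffle the two rankings together with a random interleaving.
# Uniform factor distributions here are a toy choice for illustration.
import random

def riffle_sample(set_a, set_b):
    rank_a = random.sample(set_a, len(set_a))   # relative ranking of set A
    rank_b = random.sample(set_b, len(set_b))   # relative ranking of set B
    n = len(set_a) + len(set_b)
    a_slots = set(random.sample(range(n), len(set_a)))  # ranks given to A
    it_a, it_b = iter(rank_a), iter(rank_b)
    return [next(it_a) if i in a_slots else next(it_b) for i in range(n)]

veggies = ["artichoke", "broccoli"]
fruits = ["cherry", "dates"]
print(riffle_sample(veggies, fruits))  # e.g. artichoke > cherry > dates > broccoli
```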

  28. American Psych. Assoc. (APA) Election (1980): 5,738 full ballots, 5 candidates (William Bevan, Max Siegle, Logan Wright, Ira Iscoe, Charles Kiesler), so 5! = 120 possible rankings; dataset from [Diaconis, '89]. Empirically, we can find approximate riffled independence in real datasets. [Figure: empirical probability of each of the 120 permutations; blue line: approximation with candidate {2} riffle independent of candidates {1,3,4,5}.]

  29. Parameter Counting. Problem #1 (Representation): can we do better? Item set decomposition of {1,2,3,4,5}: first split off {2} from {1,3,4,5}, then split {1,3,4,5} into {4,3} and {1,5}. With only the top split, the factors cost: relative ranking of candidates {1,3,4,5}: 4! = 24; relative ranking of candidate {2}: 1! = 1; interleaving candidate {2} with the remaining candidates: 5; total 30. Splitting further replaces the 24 with: relative ranking of candidates {4,3}: 2! = 2; relative ranking of candidates {1,5}: 2! = 2; interleaving candidates {4,3} with candidates {1,5}: 6; total 2 + 2 + 1 + 6 + 5 = 16. Both totals are far below #(rankings) = 5! = 120: distributions which decompose into riffle independent factors can be represented using exponentially fewer parameters.
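The counting generalizes to any hierarchy. A small sketch (the nested-tuple encoding of hierarchies is my own shorthand, not the thesis code) that reproduces the 30, 16, and 120 above:

```python
# Sketch of the slide's parameter counting. A leaf of size m costs m!
# relative-ranking parameters; an internal node over child sets of sizes
# p and q costs C(p+q, p) interleaving parameters.
from math import comb, factorial

def n_params(node):
    """Return (parameter count, number of items) for a hierarchy node."""
    if isinstance(node, list):                        # leaf
        return factorial(len(node)), len(node)
    (pl, sl), (pr, sr) = n_params(node[0]), n_params(node[1])
    return pl + pr + comb(sl + sr, sl), sl + sr       # children + interleaving

flat = ([1, 3, 4, 5], [2])           # one split:  24 + 1 + 5          = 30
deep = (([4, 3], [1, 5]), [2])       # two splits: (2 + 2 + 6) + 1 + 5 = 16
print(n_params(flat)[0], n_params(deep)[0], factorial(5))   # 30 16 120
```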

  30. Hierarchical Decompositions: Drawing a Ranking. Food preferences example: rank the fruits; rank the vegetables; interleave fruits/vegetables; rank the junk food; interleave the healthy foods with the junk food. Problem: for the APA data, we don't know the hierarchy!

  31. Reverse Engineering the Hierarchy. Machine learning approach: use the structure that best explains the data. Ranked data (DCAB, ABCD, CBDA, ADBC, BACD, CABD, BDCA, CBAD, BADC, DCBA, CDBA, …) goes into a structure learning algorithm, which outputs a hierarchy (e.g., {A,B,C,D} splits into {C,D,A} and {B}; {C,D,A} splits into {C,D} and {A}). Core problem: given ranked data, determine whether subsets are riffle independent.

  32. Measuring riffled independence. Under riffled independence, absolute rankings of Fruits are not informative about relative rankings in Vegetables. Idea: measure the mutual information between singleton rankings (the preference over Fruit i) and pairwise rankings (the relative preference over Vegetables j, k). If i and (j, k) lie on opposite sides of the split, mutual information = 0. [Huang, Guestrin, 2010]
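A minimal sketch of such an independence measure: empirical mutual information between the absolute rank of one item and the relative order of a pair, estimated from sample rankings (function names and toy data are illustrative, not from the thesis code):

```python
# Illustrative sketch: empirical mutual information between the absolute
# rank of item i and the binary relative order of items j and k. Under
# riffled independence with i and (j, k) on opposite sides of the split,
# this quantity is zero in expectation.
from collections import Counter
from math import log

def triplet_mutual_info(rankings, i, j, k):
    """rankings: list of lists of items, best first."""
    joint = Counter()
    for r in rankings:
        joint[(r.index(i), r.index(j) < r.index(k))] += 1
    n = len(rankings)
    px = Counter(x for x, _ in joint.elements())     # marginal of rank(i)
    py = Counter(y for _, y in joint.elements())     # marginal of [j before k]
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

rankings = [["a", "j", "k"], ["j", "a", "k"], ["j", "k", "a"], ["a", "k", "j"]]
print(triplet_mutual_info(rankings, "a", "j", "k"))
```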

  33. Tripletwise objective function. Measure the departure from riffled independence for a candidate split (A, B) by summing such tripletwise mutual informations across the split, and minimize this objective over splits; items internal to a single set play no role in the objective. There are exponentially many possible splits, but there is an efficient minimization algorithm that works with high probability.

  34. Learning Structure from APA Data. Learned hierarchy: {1,2,3,4,5} splits into {1,3,4,5} and {2}; {1,3,4,5} splits into {1,3} and {4,5}. # model parameters: 11. The hierarchy respects the political coalition structure of the APA: {1,3} research psychologists, {4,5} clinical psychologists, {2} community psychologists. (Candidates: 1. William Bevan, 2. Ira Iscoe, 3. Charles Kiesler, 4. Max Siegle, 5. Logan Wright.) [Figure: “true” first-order matrix vs. hierarchical first-order approximation, ranks x candidates.]

  35. Structure learning with synthetic data: 16 items, 4 items in each leaf. [Figure: log-likelihood (higher is better) vs. log10(# samples) for: true structure known, learned structure, and a random 1-chain ([Doignon et al., 2004]).] Theorem: our algorithm recovers the riffle independent split with high probability given polynomially many samples (under mild assumptions on connectivity). [Huang, Guestrin, 2011]

  36. Irish Election (No Structure Learning). Hypothesis: are the major parties riffle independent of the minor parties? [Figure: “true” first-order matrix vs. the riffle independent approximation, 14 candidates x 14 ranks.] The Sinn Fein and Christian Solidarity columns are not well captured by a single split!

  37. Structure Learning on the Irish Election. Learned hierarchy: {1,…,14} first splits off {12} (Sinn Fein), then {11} (Christian Solidarity), then {1,4,13} (Fianna Fail), and finally {2,5,6} (Fine Gael) from {3,7,8,9,10,14} (Independents, Labour, Green). # parameters: full model ~87 billion; hierarchical model ~1,000. Running time: brute-force optimization 70.2 s; our method 2.3 s. [Figure: “true” vs. learned first-order matrices, ranks x candidates.]

  38. Preference Analysis (for Sushi): 5,000 preference rankings of 10 types of sushi: 1. Ebi (shrimp), 2. Anago (sea eel), 3. Maguro (tuna), 4. Ika (squid), 5. Uni (sea urchin), 6. Sake (salmon roe), 7. Tamago (egg), 8. Toro (fatty tuna), 9. Tekka-maki (tuna roll), 10. Kappa-maki (cucumber roll). [Figure: first-order matrix, ranks (first to last) x sushi types.] Fatty tuna (toro) is a favorite; no one likes cucumber roll!

  39. Sushi Hierarchy. Learned hierarchy over {1,…,10}: split off {2} (sea eel); then {4} (squid); then {5,6} (sea urchin, salmon roe); then {1} (shrimp); the remainder splits into {3,8,9} (tuna, fatty tuna, tuna roll) and {7,10} (egg, cucumber roll).

  40. Contributions. Additive (Fourier) decomposition: polynomial storage for approximate distributions; low frequency = maintaining probabilities over small sets [NIPS07, JMLR09]; polytime Fourier-domain conditioning algorithm for finite groups, with an approximation guarantee for low-order observations [NIPS07, JMLR09]. Multiplicative (riffle independent) decomposition: introduction of hierarchical riffled independence models; structure learning algorithm with polynomial time/sample guarantees [NIPS09, ICML10, EJS11].

  41. Top-k Inference Problem. In the Irish data, most voters rank just the top 3 or top 4 candidates. [Figure: histogram of number of votes (0 to 20,000) vs. number of candidates ranked, k = 1 to 14; the mass concentrates at small k.] Inference problem: given an observation of a voter's top-k ranking, infer his preferences over the remaining candidates.

  42. Inference in Riffled Independent Models. Bayes rule (posterior ∝ likelihood × prior) is naively an O(n!) operation. Answer: efficient inference is possible if and only if the observations take the form of partial rankings (including top-k observations)!

  43. The Top-1 Inference Problem. Sometimes we can decompose an observation into smaller observations, one per node of the hierarchical model; top-1 inference always decomposes this way. Example: with the split {1,2,3} (Fianna Fail) vs. {4,5,6,7,8} (other candidates), the observation “candidate 3 (FF party) ranked in first place overall” decomposes as: interleaving observation: a Fianna Fail candidate is ranked in first place overall; relative ranking observation: candidate 3 is ranked first among the FF candidates. Bayes rule complexity drops from factorial in the number of items to linear in the number of parameters.
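A toy sketch of this decomposition for a single split (A, B), with hypothetical factor tables; the point is that Bayes rule runs per factor rather than over n! joint outcomes:

```python
# Toy sketch of top-1 decomposition for one split (A, B): the model factors
# as P_int(interleaving) * P_A(ranking of A) * P_B(ranking of B), and the
# observation "a1 is first overall" touches only two factors. All factor
# tables below are hypothetical, chosen for illustration.
def condition(dist, event):
    """Bayes rule on a dict distribution given a 0/1 event."""
    post = {x: p for x, p in dist.items() if event(x)}
    z = sum(post.values())                     # probability of the event
    return {x: p / z for x, p in post.items()}

p_int = {"AAB": 0.5, "ABA": 0.3, "BAA": 0.2}   # which ranks A's items occupy
p_a = {("a1", "a2"): 0.7, ("a2", "a1"): 0.3}   # relative ranking of A
p_b = {("b1",): 1.0}                           # relative ranking of B

# Observation "a1 ranked first overall" decomposes into:
p_int_post = condition(p_int, lambda m: m[0] == "A")   # first slot is A's
p_a_post = condition(p_a, lambda r: r[0] == "a1")      # a1 first within A
# p_b is untouched; total cost is linear in #(factor parameters), not n!.
print(p_int_post, p_a_post)
```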

  44. Efficient inference for partial rankings. In general, there are many forms of partial ranking, all of which allow items to be tied. Examples with candidates A-H (Sinn Fein, Independent, Green, Fianna Fail, Fine Gael, Labour, Socialist, …): first-place observations, e.g., G|ABCDEFH (“G in first place”); top-k observations, e.g., G|F|A|BCDEH (“G in first place, F in second, A in third”); approval voting observations, e.g., ACFG|BDEH (“approve of the candidates in {A,C,F,G}”).

  45. Main Theorem: any partial ranking observation is decomposable with respect to any hierarchy; i.e., inference for partial rankings is efficient, with running time linear in #(parameters). Converse to Main Theorem: every observation that decomposes with respect to all hierarchies takes the form of some partial ranking. [Figure: Venn diagram of observations decomposable with respect to hierarchies H1, H2, H3; partial rankings sit in the intersection. But what's out here, outside the intersection?] [Huang, Kapoor, 2011]

  46. Learning with Top-k Votes (Irish Data). Using this inference, we can efficiently build accurate, interpretable models from partial rankings. [Figure: negative log-likelihood (lower is better) for the riffle independent model vs. the nonparametric Mallows model [Lebanon, 2008], trained on full rankings only vs. full rankings + partial rankings.]

  47. Contributions. Additive decomposition: polynomial storage for approximate distributions; low frequency = maintaining probabilities over small sets; polytime Fourier-domain conditioning algorithm for finite groups, with an approximation guarantee for low-order observations [NIPS07, JMLR09]. Multiplicative decomposition: introduction of hierarchical riffled independence models; structure learning algorithm with polynomial time/sample guarantees [NIPS09, ICML10, EJS11]; decomposability theorem for partial rankings; learning distributions with partial rankings [NIPS-CSS10]. Algorithms for exploiting both decompositions for scalable inference [AISTATS08, NIPS09, EJS11].

  48. Main Technical Contributions • Fourier theoretic conditioning algorithm with projection to the marginal polytope [NIPS07, JMLR09] • Fourier theoretic characterization of probabilistic independence [AISTATS07] • Definition of riffled independence [NIPS09] • Polynomial sample/time complexity structure learning algorithms [ICML10] • Theoretical connection between efficient inference in riffle independent models and partial ranking [UAI11] • Tractable model estimation algorithm with partial rankings [UAI11]

  49. Thank You. Carlos Guestrin; Leo Guibas, John Lafferty, Drew Bagnell, Alex Smola; Ashish Kapoor, Eric Horvitz, Ali Rahimi; Risi Kondor, Marina Meila, Guy Lebanon, Tiberio Caetano, Xiaoye Jiang; the SELECT Lab, Michelle Martin; friends; Lucia Castellanos; Billy, Farn-lin, and Jonah Huang.
