This talk focuses on learning multidimensional (noisy) data efficiently for tasks like categorization, classification, and novelty detection. Combining probability and algorithms, the talk covers tree statistical models, mixtures of trees, Bayesian learning, and accelerated learning experiments.
Efficient Learning in High Dimensions with Trees and Mixtures
Marina Meila, Carnegie Mellon University
Learning multidimensional data
• Multidimensional (noisy) data
• Learning tasks - intelligent data analysis
  • categorization (clustering)
  • classification
  • novelty detection
  • probabilistic reasoning
• Data is changing and growing
• Tasks change, so learning needs to be automatic and efficient
Combining probability and algorithms
• Automatic: probability and statistics
• Efficient: algorithms
• This talk: the tree statistical model
Talk overview
• Perspective: generative models and decision tasks
• Introduction: statistical models
• The tree model
• Mixtures of trees
• Accelerated learning
• Bayesian learning
• Experiments
Statistical model: a multivariate domain
• Variables: Smoker, Bronchitis, Lung cancer, Cough, X ray
• Data: patient records (Patient 1, Patient 2, . . .)
• Queries
  • diagnose a new patient
  • is smoking related to lung cancer?
  • understand the "laws" of the domain
[Figure: the five variables drawn as a graph, with example queries asking about Lung cancer given the observed variables]
Probabilistic approach
• Smoker, Bronchitis, . . . are (discrete) random variables
• A statistical model (joint distribution) P( Smoker, Bronchitis, Lung cancer, Cough, X ray ) summarizes knowledge about the domain
• Queries:
  • inference, e.g. P( Lung cancer = true | Smoker = true, Cough = false )
  • structure of the model
  • discovering relationships
  • categorization
Probability table representation
  v1v2:    00    01    11    10
  v3 = 0:  .01   .14   .22   .01
  v3 = 1:  .23   .03   .33   .03
• Query: P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1) = (.14 + .03) / (.14 + .03 + .22 + .33) ≈ .23
• Curse of dimensionality: if v1, v2, . . . vn are binary variables, P_{V1 V2 . . . Vn} is a table with 2^n entries!
• How to represent? How to query? How to learn from data? What about structure?
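As a concrete illustration, here is a minimal numpy sketch of the query above, assuming the slide's column order is v1v2 = 00, 01, 11, 10; the array name and layout are mine, used only for this example:

```python
import numpy as np

# Joint table P(v1, v2, v3), indexed as P[v1, v2, v3]; values from the slide
P = np.zeros((2, 2, 2))
P[0, 0, 0], P[0, 1, 0], P[1, 1, 0], P[1, 0, 0] = .01, .14, .22, .01
P[0, 0, 1], P[0, 1, 1], P[1, 1, 1], P[1, 0, 1] = .23, .03, .33, .03
assert np.isclose(P.sum(), 1.0)

# P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1)
num = P[0, 1, :].sum()    # .14 + .03
den = P[:, 1, :].sum()    # .14 + .03 + .22 + .33
print(num / den)          # ~0.236, the .23 quoted on the slide
```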
Graphical models
• Structure: vertices = variables, edges = "direct dependencies"
• Parametrization by local probability tables
  • compact parametric representation
  • efficient computation
  • learning parameters by a simple formula
  • but learning structure is NP-hard
[Figure: example graphical model for galaxy observations - distance, galaxy type, size, spectrum, Z (red-shift), dust, observed size, observed spectrum, photometric measurement]
The tree statistical model
• Structure: a tree (graph with no cycles)
• Parameters: probability tables associated to the edges, e.g. T_3, T_34, T_4|3
• T(x) factors over the tree edges:
  T(x) = T_r(x_r) ∏_{uv∈E} T_{v|u}(x_v | x_u)   (directed form, for any root r)
       = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_{v∈V} T_v(x_v)^(deg v - 1)   (equivalent undirected form)
[Figure: directed and undirected drawings of the same 5-node tree, labeled "equivalent"]
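To make the factorization concrete, here is a minimal Python sketch of evaluating T(x) from pairwise and single-variable marginals; the containers (edges, edge_tables, node_marginals) and the function name are mine, not the talk's:

```python
import numpy as np

def tree_likelihood(x, edges, edge_tables, node_marginals):
    """Evaluate T(x) = prod_{uv in E} T_uv(x_u, x_v) / prod_{v in V} T_v(x_v)^(deg v - 1).

    x              -- dict: variable -> observed value (an index into the tables)
    edges          -- list of (u, v) pairs forming the tree E
    edge_tables    -- dict: (u, v) -> 2-D array, the pairwise marginal T_uv
    node_marginals -- dict: v -> 1-D array, the single-variable marginal T_v
    """
    deg = {v: 0 for v in node_marginals}
    numerator = 1.0
    for u, v in edges:
        numerator *= edge_tables[(u, v)][x[u], x[v]]
        deg[u] += 1
        deg[v] += 1
    denominator = 1.0
    for v, Tv in node_marginals.items():
        denominator *= Tv[x[v]] ** (deg[v] - 1)
    return numerator / denominator

# toy 3-variable chain 1 - 2 - 3 with uniform binary marginals
T12 = T23 = np.full((2, 2), 0.25)
Tv = {v: np.array([0.5, 0.5]) for v in (1, 2, 3)}
print(tree_likelihood({1: 0, 2: 1, 3: 0}, [(1, 2), (2, 3)],
                      {(1, 2): T12, (2, 3): T23}, Tv))   # 0.125, uniform over 8 states
```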
Examples
• Splice junction domain: junction type plus the sequence positions -7 . . . +8 around the junction
• Premature babies' broncho-pulmonary disease (BPD): BPD, Gestation, Weight, Temperature, PulmHemorrh, Coag, HyperNa, Thrombocyt, Hypertension, Acidosis, Neutropenia, Suspect, Lipid
[Figure: learned tree structures for the two domains]
Trees - basic operations ( |V| = n )
T(x) = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_{v∈V} T_v(x_v)^(deg v - 1)
• Querying the model
  • computing the likelihood T(x) ~ n
  • conditioning T_{V-A|A} (junction tree algorithm) ~ n
  • marginalization T_uv for arbitrary u, v ~ n
  • sampling ~ n
• Estimating the model
  • fitting to a given distribution ~ n^2
  • learning from data ~ n^2 N_data
• The tree is a simple model
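One of the ~ n operations above, sampling, can be done by ancestral sampling down a rooted version of the tree. A minimal sketch, where the dictionaries children and cond_tables are hypothetical containers of my own choosing:

```python
import numpy as np

def sample_tree(root, children, root_marginal, cond_tables, rng=None):
    """Ancestral sampling from a tree rooted at `root`, in O(n):
    draw x_root from T_root, then each child v from T_{v|u}( . | x_u).

    children      -- dict: u -> list of children of u in the rooted tree
    root_marginal -- 1-D array T_root
    cond_tables   -- dict: (u, v) -> 2-D array whose row x_u is T_{v|u}( . | x_u)
    """
    rng = rng or np.random.default_rng()
    x = {root: rng.choice(len(root_marginal), p=root_marginal)}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            row = cond_tables[(u, v)][x[u]]
            x[v] = rng.choice(len(row), p=row)
            stack.append(v)
    return x
```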
The mixture of trees (Meila '97)
Q(x) = Σ_{k=1..m} λ_k T_k(x)
• h = "hidden" variable, P( h = k ) = λ_k, k = 1, 2, . . . m
• NOT a graphical model
• computational efficiency preserved
Learning - problem formulation
• Maximum likelihood learning
  • given a data set D = { x^1, . . . x^N }
  • find the model that best predicts the data: T_opt = argmax_T T(D)
• Fitting a tree to a distribution
  • given a data set D = { x^1, . . . x^N } and a distribution P that weights each data point
  • find T_opt = argmin_T KL( P || T )
  • KL is the Kullback-Leibler divergence
  • includes maximum likelihood learning as a special case
Fitting a tree to a distribution (Chow & Liu '68)
T_opt = argmin_T KL( P || T )
• optimization over structure + parameters
• sufficient statistics
  • probability tables P_uv = N_uv / N for u, v ∈ V
  • mutual informations I_uv = Σ_{x_u, x_v} P_uv(x_u, x_v) log [ P_uv(x_u, x_v) / ( P_u(x_u) P_v(x_v) ) ]
Fitting a tree to a distribution - solution
• Structure: E_opt = argmax_E Σ_{uv∈E} I_uv
  • found by the Maximum Weight Spanning Tree algorithm with edge weights I_uv
• Parameters: copy the marginals of P, i.e. T_uv = P_uv for uv ∈ E
[Figure: a graph with edge weights I_12, I_23, I_34, I_45, I_56, I_61, I_63]
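A minimal numpy sketch of the Chow & Liu procedure just described: empirical mutual informations as edge weights, then a maximum weight spanning tree (here a naive Prim-style loop; Kruskal works as well). The function name and the assumption that X is an N x n array of small integer codes are mine:

```python
import numpy as np
from itertools import combinations

def chow_liu_edges(X, n_values=2):
    """Chow & Liu sketch: pairwise empirical mutual informations I_uv,
    then a maximum weight spanning tree over those weights."""
    N, n = X.shape

    def mutual_information(u, v):
        Puv = np.zeros((n_values, n_values))
        for a, b in zip(X[:, u], X[:, v]):
            Puv[a, b] += 1.0
        Puv /= N
        Pu, Pv = Puv.sum(axis=1), Puv.sum(axis=0)
        nz = Puv > 0
        return float((Puv[nz] * np.log(Puv[nz] / np.outer(Pu, Pv)[nz])).sum())

    I = {(u, v): mutual_information(u, v) for u, v in combinations(range(n), 2)}

    # grow a maximum weight spanning tree, one heaviest boundary edge at a time
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        u, v = max((e for e in I if (e[0] in in_tree) ^ (e[1] in in_tree)),
                   key=lambda e: I[e])
        edges.append((u, v))
        in_tree |= {u, v}
    return edges   # parameters: copy the empirical marginals P_uv on these edges
```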
Learning mixtures by the EM algorithm (Meila & Jordan '97)
• Initialize randomly
• E step: which x^i come from T^k? compute the posterior distribution P^k(x)
• M step: fit T^k to its (weighted) set of points, min KL( P^k || T^k )
• converges to a local maximum of the likelihood
Remarks
• Learning a tree
  • solution is globally optimal over structures and parameters
  • tractable: running time ~ n^2 N
• Learning a mixture by the EM algorithm
  • both E and M steps are exact and tractable
  • running time: E step ~ mnN, M step ~ mn^2 N
  • assumes m known
  • converges to a local optimum
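A skeleton of the EM loop just described, with the tree-fitting and tree-likelihood routines left as hypothetical hooks (fit_tree would be a weighted Chow & Liu step, tree_likelihood an evaluation like the one sketched earlier); this is a sketch of the loop structure, not the talk's implementation:

```python
import numpy as np

def em_mixture_of_trees(X, m, fit_tree, tree_likelihood, n_iter=20, seed=0):
    """Skeleton of EM for a mixture of m trees.

    fit_tree(X, weights)     -- returns a tree fitted to the weighted data
                                (a weighted Chow & Liu step); hypothetical hook
    tree_likelihood(tree, X) -- returns the array [T_k(x_i)] over all data points;
                                hypothetical hook
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    lam = np.full(m, 1.0 / m)                   # mixture coefficients lambda_k
    gamma = rng.dirichlet(np.ones(m), size=N)   # random soft responsibilities
    trees = [fit_tree(X, gamma[:, k]) for k in range(m)]

    for _ in range(n_iter):
        # E step: gamma_ik proportional to lambda_k * T_k(x_i)
        lik = np.column_stack([tree_likelihood(t, X) for t in trees])   # N x m
        gamma = lam * lik
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: new lambda_k and one weighted Chow & Liu fit per component
        lam = gamma.mean(axis=0)
        trees = [fit_tree(X, gamma[:, k]) for k in range(m)]
    return lam, trees
```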
Finding structure - the bars problem
• Data: n = 25
• Structure recovery: 19 out of 20 trials
• Hidden variable accuracy: 0.85 +/- 0.08 (ambiguous), 0.95 +/- 0.01 (unambiguous)
• Data likelihood [bits/data point]: true model 8.58, learned model 9.82 +/- 0.95
[Figure: sample data and the learned structure]
Experiments - density estimation
• Digits and digit pairs: N_train = 6000, N_valid = 2000, N_test = 5000
• Digits: n = 64 variables ( m = 16 trees )
• Digit pairs: n = 128 variables ( m = 32 trees )
[Figure: test performance of mixtures of trees on the two data sets]
DNA splice junction classification
• n = 61 variables
• class = Intron/Exon, Exon/Intron, Neither
[Figure: classification accuracy of the Tree vs. TANB, NB, and supervised methods (DELVE)]
Discovering structure
IE junction (Intron | Exon), positions 15 16 . . . 25 26 27 28 29 30 31
• Tree:  -  CT CT CT  -  -  CT  A  G  G
• True: CT  CT CT CT  -  -  CT  A  G  G
EI junction (Exon | Intron), positions 28 29 30 31 32 33 34 35 36
• Tree: CA  A  G  G  T  AG  A  G  -
• True: CA  A  G  G  T  AG  A  G  T
(True consensus sequences from Watson, "The molecular biology of the gene", '87)
[Figure: learned tree adjacency matrix, including the class variable]
Irrelevant variables
• 61 original variables + 60 "noise" variables
[Figure: results on the original data vs. the data augmented with irrelevant variables]
Accelerated tree learning (Meila '99)
• Running time of the tree learning algorithm ~ n^2 N
• Quadratic running time may be too slow. Example: document classification
  • document = data point --> N = 10^3 - 10^4
  • word = variable --> n = 10^3 - 10^4
  • sparse data --> # words per document ~ s, with s << n, N
• Can sparsity be exploited to create faster algorithms?
Sparsity
• assume a special value "0" that occurs frequently
• sparsity s = # non-zero variables in each data point, with s << n, N
• Idea: "do not represent / count zeros"
• Sparse data: store each data point as a linked list of its ~ s non-zero variables
[Figure: a mostly-zero binary data matrix and its linked-list representation]
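A minimal sketch of the "do not count zeros" idea for binary data: each data point is kept as the list of its non-zero variables, so the single-variable and cooccurrence counts needed by the algorithm cost ~ s^2 N rather than ~ n^2 N. The function and variable names are mine:

```python
from collections import Counter
from itertools import combinations

def sparse_counts(sparse_rows):
    """Single-variable and cooccurrence counts for sparse binary data,
    computed without touching the zeros: each data point is the sorted list
    of its non-zero variables (length ~ s)."""
    Nv = Counter()    # Nv[u]      = number of data points where u is non-zero
    Nuv = Counter()   # Nuv[(u,v)] = number of data points where u and v cooccur
    for row in sparse_rows:
        Nv.update(row)
        Nuv.update(combinations(row, 2))
    return Nv, Nuv

# three "documents" over a large vocabulary, each listing only the words present
Nv, Nuv = sparse_counts([[2, 7, 31], [7, 31], [2, 31, 40]])
print(Nv[31], Nuv[(7, 31)])   # 3 and 2
```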
Presort mutual informations
Theorem (Meila '99). If v, v' are variables that do not cooccur with u (i.e. N_uv = N_uv' = 0), then N_v > N_v' ==> I_uv > I_uv'.
• Consequences
  • sort the N_v => all edges uv with N_uv = 0 are implicitly sorted by I_uv
  • these edges need not be represented explicitly
  • construct a "black box" that outputs the next "largest" edge
The black box data structure
• for each variable v: a list of the u with N_uv > 0, sorted by I_uv, plus a (virtual) list of the u with N_uv = 0, sorted by N_v
• an F-heap of size ~ n over these lists outputs the next edge uv
• Total running time: n log n + s^2 N + nK log n (standard algorithm: n^2 N)
Experiments - sparse binary data
• N = 10,000
• s = 5, 10, 15, 100
[Figure: running time of the standard vs. the accelerated algorithm]
Remarks
• Realistic assumption
• Exact algorithm, with provably efficient time bounds
• Degrades gracefully to the standard algorithm if the data is not sparse
• General:
  • non-integer counts
  • multi-valued discrete variables
Bayesian learning of trees (Meila & Jaakkola '00)
• Problem
  • given a prior distribution over trees P_0(T) and data D = { x^1, . . . x^N }
  • find the posterior distribution P(T|D)
• Advantages
  • incorporates prior knowledge
  • regularization
• Solution: Bayes' formula P(T|D) = (1/Z) P_0(T) ∏_{i=1..N} T(x^i)
• Practically hard
  • a distribution over structures E and parameters θ_E is hard to represent
  • computing Z is intractable in general
  • exception: conjugate priors
Decomposable priors
• want priors that factor over tree edges: P_0(T) ∝ ∏_{uv∈E} f( u, v, θ_{u|v} )
• prior for the structure E: P_0(E) ∝ ∏_{uv∈E} β_uv
• prior for the tree parameters: P_0(θ_E) = ∏_{uv∈E} D( θ_{u|v} ; N'_uv ), a (hyper-)Dirichlet with hyper-parameters N'_uv(x_u, x_v), u, v ∈ V
• the posterior is also Dirichlet, with hyper-parameters N_uv(x_u, x_v) + N'_uv(x_u, x_v), u, v ∈ V
Decomposable posterior
• Posterior distribution P(T|D) ∝ ∏_{uv∈E} W_uv, with W_uv = β_uv D( θ_{u|v} ; N'_uv + N_uv )
  • factored over edges
  • same form as the prior
• Remains to compute the normalization constant
The Matrix Tree theorem (discrete: graph theory; continuous: Meila & Jaakkola '99)
• If P_0(E) = (1/Z) ∏_{uv∈E} β_uv with β_uv ≥ 0, and M(β) is the matrix with entries
  M_vv = Σ_{v'} β_{vv'} and M_uv = -β_uv for u ≠ v,
  then Z = det M(β), where one row and the matching column of M(β) are removed (any first minor).
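A minimal numpy sketch of computing Z with the matrix tree theorem: build the weighted Laplacian from β and take the determinant of a first minor. The function name is mine, and the Cayley-formula check is only a sanity test:

```python
import numpy as np

def matrix_tree_Z(beta):
    """Normalization constant Z of P_0(E) = (1/Z) prod_{uv in E} beta_uv,
    via the matrix tree theorem: Z is the determinant of the weighted
    Laplacian of beta with one row and the matching column removed.
    beta: symmetric n x n array of non-negative weights, zero diagonal."""
    L = np.diag(beta.sum(axis=1)) - beta
    return np.linalg.det(L[1:, 1:])      # any first minor gives the same value

# sanity check: with all beta_uv = 1, Z counts spanning trees, n^(n-2) by Cayley
n = 5
beta = np.ones((n, n)) - np.eye(n)
print(matrix_tree_Z(beta))               # ~125.0 = 5**3, up to floating point
```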
Remarks on the decomposable prior
• It is a conjugate prior for the tree distribution
• It is tractable
  • defined by ~ n^2 parameters
  • computed exactly in ~ n^3 operations
  • posterior obtained in ~ n^2 N + n^3 operations
  • derivatives w.r.t. parameters, averaging, . . . ~ n^3
• Mixtures of trees with decomposable priors: MAP estimation with the EM algorithm is tractable
• Other applications
  • ensembles of trees
  • maximum entropy distributions on trees
So far . . .
• Trees and mixtures of trees are structured statistical models
• Algorithmic techniques enable efficient learning
  • mixture of trees
  • accelerated algorithm
  • matrix tree theorem & Bayesian learning
• Examples of usage
  • structure learning
  • compression
  • classification
Generative models and discrimination
• Trees are generative models
  • descriptive
  • can perform many tasks, but suboptimally
• Maximum Entropy discrimination (Jaakkola, Meila, Jebara '99)
  • optimize for specific tasks
  • use generative models
  • combine simple models into ensembles
  • complexity control by an information theoretic principle
• Discrimination tasks
  • detecting novelty
  • diagnosis
  • classification
Bridging the gap
[Figure: diagram connecting Tasks with Descriptive learning and Discriminative learning]
Future . . .
• Tasks have structure
  • multi-way classification
  • multiple indexing of documents
  • gene expression data
  • hierarchical, sequential decisions
• Learn structured decision tasks
  • sharing information between tasks (transfer)
  • modeling dependencies between decisions