
Efficient Learning in High Dimensions with Trees and Mixtures


Presentation Transcript


  1. Efficient Learning in High Dimensions with Trees and Mixtures
Marina Meila, Carnegie Mellon University

  2. Learning
• Multidimensional (noisy) data
• Learning tasks - intelligent data analysis
  • categorization (clustering)
  • classification
  • novelty detection
  • probabilistic reasoning
• Data is changing, growing; tasks change ⇒ need to make learning automatic, efficient

  3. Combining probability and algorithms
• Automatic: probability and statistics
• Efficient: algorithms
• This talk: the tree statistical model

  4. Talk overview
• Perspective: generative models and decision tasks
• Introduction: statistical models
• The tree model
• Mixtures of trees
• Learning: accelerated learning, Bayesian learning
• Experiments

  5. Statistical model
• A multivariate domain: Smoker, Bronchitis, Lung cancer, Cough, X ray
• Data: Patient 1, Patient 2, . . .
• Queries
  • diagnose a new patient (e.g. predict Lung cancer from observed Smoker, Cough, X ray)
  • is smoking related to lung cancer?
  • understand the "laws" of the domain

  6. Probabilistic approach
• Smoker, Bronchitis, . . . are (discrete) random variables
• Statistical model (joint distribution) P( Smoker, Bronchitis, Lung cancer, Cough, X ray ) summarizes knowledge about the domain
• Queries
  • inference, e.g. P( Lung cancer = true | Smoker = true, Cough = false )
  • structure of the model
  • discovering relationships
  • categorization

  7. Probability table representation

  v1 v2:    00    01    11    10
  v3 = 0   .01   .14   .22   .01
  v3 = 1   .23   .03   .33   .03

• Query: P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1) = (.14 + .03) / (.14 + .03 + .22 + .33) = .17 / .72 ≈ .23
• Curse of dimensionality: if v1, v2, . . . vn are binary variables, P_{V1 V2 ... Vn} is a table with 2^n entries!
• How to represent? How to query? How to learn from data? Structure?
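For concreteness, a minimal sketch of this query on the full-table representation (Python with numpy assumed; the probabilities are the table entries above, the code itself is illustrative):

```python
import numpy as np

# Full-table representation of the joint on the slide: P[v1, v2, v3].
# In general such a table has 2^n entries -- the curse of dimensionality.
P = np.zeros((2, 2, 2))
P[0, 0, 0], P[0, 1, 0], P[1, 1, 0], P[1, 0, 0] = .01, .14, .22, .01
P[0, 0, 1], P[0, 1, 1], P[1, 1, 1], P[1, 0, 1] = .23, .03, .33, .03

# Query P(v1=0 | v2=1): marginalize out v3, then normalize by P(v2=1).
P12 = P.sum(axis=2)                  # P(v1, v2)
print(P12[0, 1] / P12[:, 1].sum())   # .17 / .72 ~ 0.236, the .23 on the slide
```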

  8. Graphical models
• Structure: vertices = variables, edges = "direct dependencies"
• Parametrization by local probability tables
• compact parametric representation
• efficient computation
• learning parameters by a simple formula
• learning structure is NP-hard
[Figure: example graphical model relating galaxy type, distance, size, spectrum, Z (red-shift), dust, observed size, observed spectrum, photometric measurement]

  9. The tree statistical model
• Structure: a tree (graph with no cycles) over the variables
• Parameters: probability tables associated to the edges (e.g. T_3, T_34, T_4|3)
• T(x) factors over the tree edges:
    T(x) = Π_{uv ∈ E} T_uv(x_u, x_v) / Π_{v ∈ V} T_v(x_v)^(deg v - 1)
  or, equivalently, directing the edges away from a root r,
    T(x) = T_r(x_r) Π_{uv ∈ E} T_{v|u}(x_v | x_u)
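A small numeric sketch of this factorization; the edge tables below are made up (not from the talk), chosen only so that their node marginals agree on the shared variable:

```python
import numpy as np

# Undirected tree factorization on a 3-node chain 0 - 1 - 2:
#   T(x) = prod_{uv in E} T_uv(x_u, x_v) / prod_{v in V} T_v(x_v)^(deg v - 1)
edges = [(0, 1), (1, 2)]
T_edge = {
    (0, 1): np.array([[.30, .20], [.10, .40]]),   # T_01(x0, x1)
    (1, 2): np.array([[.25, .15], [.20, .40]]),   # T_12(x1, x2)
}
T_node = {
    0: T_edge[(0, 1)].sum(axis=1),                # [.5, .5]
    1: T_edge[(0, 1)].sum(axis=0),                # [.4, .6], matches T_12.sum(axis=1)
    2: T_edge[(1, 2)].sum(axis=0),                # [.45, .55]
}
deg = {0: 1, 1: 2, 2: 1}

def tree_prob(x):
    num = np.prod([T_edge[(u, v)][x[u], x[v]] for u, v in edges])
    den = np.prod([T_node[v][x[v]] ** (deg[v] - 1) for v in T_node])
    return num / den

# Sanity check: the factorization defines a proper distribution (sums to 1).
print(sum(tree_prob((a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1)))
```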

  10. Examples
• Splice junction domain: junction type and the sequence positions -7 . . . -1, +1 . . . +8 around the junction
• Premature babies' Broncho-Pulmonary Disease (BPD): BPD, Gestation, Weight, Temperature, Hypertension, Acidosis, Coag, HyperNa, Thrombocyt, PulmHemorrh, Neutropenia, Suspect, Lipid

  11. Trees - basic operations (|V| = n)
    T(x) = Π_{uv ∈ E} T_uv(x_u, x_v) / Π_{v ∈ V} T_v(x_v)^(deg v - 1)
Querying the model
• computing the likelihood T(x) ~ n
• conditioning T_{V-A|A} (junction tree algorithm) ~ n
• marginalization T_uv for arbitrary u, v ~ n
• sampling ~ n
Estimating the model
• fitting to a given distribution ~ n^2
• learning from data ~ n^2 N_data
⇒ the tree is a simple model

  12. The mixture of trees (Meila '97)
• h = "hidden" variable, P( h = k ) = λ_k, k = 1, 2, . . . m
• Q(x) = Σ_{k=1}^{m} λ_k T_k(x)
• NOT a graphical model
• computational efficiency preserved
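As a sketch, evaluating the mixture only needs the component likelihoods and the weights λ_k; the per-component likelihood below is a hypothetical placeholder standing in for a fitted tree T_k(x):

```python
import numpy as np

lam = np.array([0.6, 0.4])          # mixture weights P(h = k), sum to 1

def component_prob(k, x):
    # placeholder "trees" over a single binary variable, for illustration only
    tables = [np.array([0.1, 0.9]), np.array([0.7, 0.3])]
    return tables[k][x]

def mixture_prob(x):
    # Q(x) = sum_k lambda_k * T_k(x)
    return sum(lam[k] * component_prob(k, x) for k in range(len(lam)))

print(mixture_prob(0) + mixture_prob(1))   # 1.0
```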

  13. Learning - problem formulation
• Maximum Likelihood learning
  • given a data set D = { x^1, . . . x^N }
  • find the model that best predicts the data: T^opt = argmax_T T(D)
• Fitting a tree to a distribution
  • given a data set D = { x^1, . . . x^N } and a distribution P that weights each data point
  • find T^opt = argmin_T KL( P || T )
  • KL is the Kullback-Leibler divergence
  • includes Maximum Likelihood learning as a special case

  14. Fitting a tree to a distribution (Chow & Liu '68)
    T^opt = argmin_T KL( P || T )
• optimization over structure + parameters
• sufficient statistics
  • probability tables P_uv = N_uv / N, for u, v ∈ V
  • mutual informations I_uv = Σ_{x_u, x_v} P_uv log( P_uv / (P_u P_v) )
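A minimal sketch of these sufficient statistics for binary data (the toy data below are random, not one of the talk's datasets):

```python
import numpy as np

def mutual_information_matrix(X):
    """X: (N, n) array of 0/1 observations; returns the n x n matrix of I_uv."""
    N, n = X.shape
    I = np.zeros((n, n))
    for u in range(n):
        for v in range(u + 1, n):
            # P_uv = N_uv / N, estimated by counting joint configurations
            P_uv = np.array([[np.mean((X[:, u] == a) & (X[:, v] == b))
                              for b in (0, 1)] for a in (0, 1)])
            P_u, P_v = P_uv.sum(axis=1), P_uv.sum(axis=0)
            mask = P_uv > 0
            I[u, v] = I[v, u] = np.sum(
                P_uv[mask] * np.log(P_uv[mask] / np.outer(P_u, P_v)[mask]))
    return I

X = np.random.randint(0, 2, size=(1000, 5))
print(np.round(mutual_information_matrix(X), 3))
```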

  15. Fitting a tree to a distribution - solution
• Structure: E^opt = argmax_E Σ_{uv ∈ E} I_uv
  • found by the Maximum Weight Spanning Tree algorithm with edge weights I_uv
• Parameters: copy the marginals of P, i.e. T_uv = P_uv for uv ∈ E
[Figure: candidate edges weighted by I_12, I_23, I_61, I_63, I_34, I_56, I_45]
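The structure step can be sketched as a maximum weight spanning tree over the I_uv matrix; the code below is a plain Prim's variant on a tiny hand-made weight matrix (in practice the weights would be the mutual informations computed above):

```python
import numpy as np

def max_weight_spanning_tree(I):
    """Prim's algorithm on edge weights I; returns the edges of E_opt."""
    n = I.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        best = None
        for u in in_tree:
            for v in range(n):
                if v not in in_tree and (best is None or I[u, v] > I[best]):
                    best = (u, v)
        edges.append(best)
        in_tree.add(best[1])
    return edges

I = np.array([[0.0, 0.5, 0.1],
              [0.5, 0.0, 0.4],
              [0.1, 0.4, 0.0]])
print(max_weight_spanning_tree(I))   # [(0, 1), (1, 2)]
```

The parameter step then simply copies the pairwise marginals P_uv along the chosen edges.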

  16. Learning mixtures by the EM algorithm (Meila & Jordan '97)
• Initialize randomly
• E step: which x^i come from T^k? ⇒ a distribution P^k(x) over the data for each component
• M step: fit T^k to its weighted data, min KL( P^k || T^k )
• converges to a local maximum of the likelihood
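A schematic version of this E/M loop is sketched below. To keep it self-contained, each component is a fully factorized ("empty tree") Bernoulli model standing in for a real tree; in the actual algorithm the M step runs the weighted Chow-Liu fit of the previous slides for every component:

```python
import numpy as np

def fit_component(X, w):
    """Weighted ML fit of the stand-in factorized model: per-variable P(x_v=1)."""
    w = w / w.sum()
    return np.clip(w @ X, 1e-6, 1 - 1e-6)

def component_likelihood(theta, X):
    return np.prod(np.where(X == 1, theta, 1 - theta), axis=1)

def em(X, m=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    N, n = X.shape
    lam = np.full(m, 1.0 / m)
    thetas = rng.uniform(0.2, 0.8, size=(m, n))
    for _ in range(iters):
        # E step: responsibilities gamma[i, k] ~ lam_k * T_k(x^i)
        L = np.column_stack([component_likelihood(t, X) for t in thetas])
        gamma = lam * L
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: re-estimate mixture weights and refit each component
        lam = gamma.mean(axis=0)
        thetas = np.array([fit_component(X, gamma[:, k]) for k in range(m)])
    return lam, thetas

X = np.random.randint(0, 2, size=(500, 10))   # toy binary data
print(em(X)[0])                               # learned mixture weights
```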

  17. Remarks
• Learning a tree
  • the solution is globally optimal over structures and parameters
  • tractable: running time ~ n^2 N
• Learning a mixture by the EM algorithm
  • both E and M steps are exact, tractable
  • running time: E step ~ mnN, M step ~ mn^2 N
  • assumes m known
  • converges to a local optimum

  18. Finding structure - the bars problem
• Data: n = 25 [Figure: sample bar images and the learned structure]
• Structure recovery: 19 out of 20 trials
• Hidden variable accuracy: 0.85 +/- 0.08 (ambiguous), 0.95 +/- 0.01 (unambiguous)
• Data likelihood [bits/data point]: true model 8.58, learned model 9.82 +/- 0.95

  19. Experiments - density estimation
• Digits and digit pairs: N_train = 6000, N_valid = 2000, N_test = 5000
• digits: n = 64 variables (m = 16 trees)
• digit pairs: n = 128 variables (m = 32 trees)
[Figure: density estimation results for mixtures of trees on the two datasets]

  20. DNA splice junction classification
• n = 61 variables
• class = Intron/Exon, Exon/Intron, Neither
[Figure: classification performance of Tree, TANB, NB and supervised methods from DELVE]

  21. Discovering structure
• IE junction (Intron -> Exon), positions 15 16 . . . 25 26 27 28 29 30 31
    Tree:  -  CT CT CT  -  -  CT  A  G  G
    True: CT  CT CT CT  -  -  CT  A  G  G
• EI junction (Exon -> Intron), positions 28 29 30 31 32 33 34 35 36
    Tree: CA  A  G  G  T  AG  A  G  -
    True: CA  A  G  G  T  AG  A  G  T
  (Watson, "The molecular biology of the gene", '87)
[Figure: tree adjacency matrix, with the class variable among the variables]

  22. Irrelevant variables
• 61 original variables + 60 "noise" variables
[Figure: results on the original variable set vs. the set augmented with irrelevant variables]

  23. Accelerated tree learning (Meila '99)
• Running time of the tree learning algorithm ~ n^2 N
• Quadratic running time may be too slow. Example: document classification
  • document = data point -> N = 10^3 - 10^4
  • word = variable -> n = 10^3 - 10^4
  • sparse data -> # words in a document ≈ s, with s << n, N
• Can sparsity be exploited to create faster algorithms?

  24. Sparsity
• assume a special value "0" that occurs frequently
• sparsity s = # non-zero variables in each data point, with s << n, N
• Idea: "do not represent / count zeros"
• store each data point as a linked list of its ~s non-zero entries
[Figure: a sparse binary data matrix and its linked-list representation]

  25. Presort mutual informations
• Theorem (Meila '99): if v, v' ∈ V are variables that do not cooccur with u (i.e. N_uv = N_uv' = 0), then N_v > N_v' ⇒ I_uv > I_uv'
• Consequences
  • sort the counts N_v ⇒ all edges uv with N_uv = 0 are implicitly sorted by I_uv
  • these edges need not be represented explicitly
  • construct a "black box" that outputs the next "largest" edge

  26. The black box data structure
• for each variable v: a list of the u with N_uv > 0, sorted by I_uv, plus a (virtual) list of the u with N_uv = 0, sorted by N_v
• an F-heap of size ~ n over the heads of these lists outputs the next edge uv
• Total running time ~ n log n + s^2 N + nK log n (standard algorithm: ~ n^2 N)

  27. Experiments - sparse binary data
• N = 10,000
• s = 5, 10, 15, 100
[Figure: running times of the standard vs. accelerated algorithm]

  28. Remarks
• Realistic assumption
• Exact algorithm, provably efficient time bounds
• Degrades slowly to the standard algorithm if the data are not sparse
• General
  • non-integer counts
  • multi-valued discrete variables

  29. Bayesian learning of trees (Meila & Jaakkola '00)
• Problem
  • given a prior distribution over trees P_0(T) and data D = { x^1, . . . x^N }
  • find the posterior distribution P(T|D)
• Advantages
  • incorporates prior knowledge
  • regularization
• Solution: Bayes' formula  P(T|D) = (1/Z) P_0(T) Π_{i=1..N} T(x^i)
• practically hard
  • a distribution over structures E and parameters θ_E is hard to represent
  • computing Z is intractable in general
  • exception: conjugate priors

  30. Decomposable priors
• want priors that factor over tree edges: P_0(T) = Π_{uv ∈ E} f( u, v, θ_{u|v} )
• prior for the structure E: P_0(E) ∝ Π_{uv ∈ E} β_uv
• prior for the tree parameters: P_0(θ_E) = Π_{uv ∈ E} D( θ_{u|v} ; N'_uv ), a (hyper-)Dirichlet with hyper-parameters N'_uv(x_u, x_v), u, v ∈ V
• the posterior is also Dirichlet, with hyper-parameters N_uv(x_u, x_v) + N'_uv(x_u, x_v), u, v ∈ V

  31. Decomposable posterior
• Posterior distribution P(T|D) ∝ Π_{uv ∈ E} W_uv
  • factored over edges
  • same form as the prior: W_uv = β_uv D( θ_{u|v} ; N'_uv + N_uv )
• Remains to compute the normalization constant Z

  32. The Matrix tree theorem (discrete: graph theory; continuous: Meila & Jaakkola '99)
• If P_0(E) = (1/Z) Π_{uv ∈ E} β_uv with β_uv ≥ 0, define the matrix M(β) with entries
    M_vv = Σ_{v'} β_vv'   (diagonal),   M_uv = -β_uv   (u ≠ v),
  with one row and the corresponding column deleted
• Then Z = det M(β)
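The identity can be sanity-checked numerically on a small complete graph; the sketch below uses arbitrary random weights β (purely illustrative) and compares brute-force enumeration of spanning trees with the determinant of the reduced Laplacian:

```python
import numpy as np
from itertools import combinations

n = 4
rng = np.random.default_rng(0)
beta = rng.uniform(0.5, 2.0, size=(n, n))
beta = (beta + beta.T) / 2
np.fill_diagonal(beta, 0.0)

def is_spanning_tree(edge_set):
    """n-1 edges form a spanning tree iff they create no cycle (union-find)."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for u, v in edge_set:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False
        parent[ru] = rv
    return True

all_edges = list(combinations(range(n), 2))
Z_brute = sum(np.prod([beta[u, v] for u, v in tree])
              for tree in combinations(all_edges, n - 1) if is_spanning_tree(tree))

L = np.diag(beta.sum(axis=1)) - beta     # Laplacian of the weights
Z_det = np.linalg.det(L[1:, 1:])         # delete one row and the matching column
print(Z_brute, Z_det)                    # the two values agree
```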

  33. Remarks on the decomposable prior
• Is a conjugate prior for the tree distribution
• Is tractable
  • defined by ~ n^2 parameters
  • computed exactly in ~ n^3 operations
  • posterior obtained in ~ n^2 N + n^3 operations
  • derivatives w.r.t. parameters, averaging, . . . ~ n^3
• Mixtures of trees with decomposable priors: MAP estimation with the EM algorithm is tractable
• Other applications: ensembles of trees, maximum entropy distributions on trees

  34. So far . . .
• Trees and mixtures of trees are structured statistical models
• Algorithmic techniques enable efficient learning
  • mixture of trees
  • accelerated algorithm
  • matrix tree theorem & Bayesian learning
• Examples of usage: structure learning, compression, classification

  35. Generative models and discrimination
• Trees are generative models
  • descriptive
  • can perform many tasks, but suboptimally
• Maximum Entropy discrimination (Jaakkola, Meila, Jebara '99)
  • optimize for specific tasks
  • use generative models
  • combine simple models into ensembles
  • complexity control by an information-theoretic principle
• Discrimination tasks: detecting novelty, diagnosis, classification

  36. Bridging the gap
[Diagram: descriptive learning and discriminative learning, connected through tasks]

  37. Future . . .
• Tasks have structure
  • multi-way classification
  • multiple indexing of documents
  • gene expression data
  • hierarchical, sequential decisions
• Learn structured decision tasks
  • sharing information between tasks (transfer)
  • modeling dependencies between decisions
