This journal club deck covers four topics: restricted Boltzmann forest (RBForest) models, which generalize RBMs by replacing the binary hidden variables with groups of tree-structured binary variables so that density estimation for multivariate binary data stays tractable; a normalization scheme that improves detection of drug signals; Emtree, EMBASE's equivalent of MeSH; and eHive, a robust and scalable system for computational pipelines, illustrated with a sample pipeline for multiplying large numbers and well suited to short, highly repetitive jobs.
Butte Lab Journal Club 10/25/2010
Boltzmann machines are able to solve difficult combinatorial problems
• Estimating the density function of multivariate binary data is typically done with mixture models or factor models
• Problem: too computationally expensive for many multivariate binary density modeling problems
• Solution: the authors describe a generalization of the restricted Boltzmann machine (RBM), the restricted Boltzmann forest (RBForest)
    • replaces the binary hidden variables of the RBM with groups of tree-structured binary variables
    • by varying the size of the trees, the number of model parameters can be increased while the computation of the density function stays tractable
    • basically, "structured" binning of variables
• Example application: automated diagnosis involving a large number of feature types
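To make "tractable" concrete, here is a minimal sketch for a plain RBM with only a few hidden units, which is the starting point the slide describes. It does not reproduce the tree-structured constraint of the RBForest. The energy convention, the function names, and the parameter layout (`W[j][i]`, `b`, `c`) are assumptions chosen for illustration. The point it shows: the partition function can be computed exactly by enumerating hidden configurations only, because the sum over visible units factorizes.

```python
import itertools
import math


def log_partition(W, b, c):
    """Exact log Z of a small RBM, enumerating hidden configurations only.

    Assumed energy: E(x, h) = -(b.x + c.h + sum_ji h_j W[j][i] x_i).
    Cost is O(2^H * H * V) instead of O(2^(H+V)).
    """
    n_hidden, n_visible = len(c), len(b)
    log_terms = []
    for h in itertools.product((0, 1), repeat=n_hidden):
        log_term = sum(c[j] * h[j] for j in range(n_hidden))
        for i in range(n_visible):
            # Summing out visible unit i gives a factor 1 + exp(activation).
            act = b[i] + sum(h[j] * W[j][i] for j in range(n_hidden))
            log_term += math.log1p(math.exp(act))
        log_terms.append(log_term)
    # Log-sum-exp over the 2^H hidden configurations.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))


def log_density(x, W, b, c, log_z):
    """log p(x) for a binary vector x, with the hidden units summed out."""
    value = sum(b[i] * x[i] for i in range(len(b)))
    for j in range(len(c)):
        act = c[j] + sum(W[j][i] * x[i] for i in range(len(b)))
        value += math.log1p(math.exp(act))
    return value - log_z
```

The cost grows with the number of hidden configurations, not with the number of visible configurations, which is why keeping the set of hidden configurations small matters when adding parameters.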
Computational pipelines are essential, yet there is a paucity of "good" tools for designing pipelines
• eHive has many design features for robustness and scalability:
    • Fault tolerance
    • Agents ("bees")
    • Graph-based
    • Cloud/GRID-friendly
    • Generic infrastructure: Perl, MySQL
• Normalization scheme enables better detection of drug signals
• Less susceptible to known confounders
[Figure labels: vorinostat, trichostatin A; drug classes: asthma drugs, antifungal drugs, calmodulin inhibitors, anti-neoplastic drugs]
Emtree = EMBASE's MeSH equivalent; much more comprehensive in certain areas, e.g., pharmacology
Caveat: SCOPUS is not EMBASE
SCOPUS does not support the kinds of complex Emtree queries that EMBASE supports, nor some other features (e.g., no thesaurus explosion in SCOPUS)
Example Pipeline for Multiplying Large Numbers
• Pipeline defined in 4 files:
    • Start.pm splits a multiplication job into sub-tasks and creates the corresponding jobs
    • PartMultiply.pm performs a partial multiplication and stores the intermediate result in a table
    • AddTogether.pm waits for the partial multiplication results to be computed and adds them together into the final result
    • LongMult_conf.pm is the pipeline configuration module that links the previous Runnables into one pipeline
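The division of labor among the three Runnables can be sketched in a few lines of Python. This is an illustration of the idea only, not the eHive Perl modules: the slides do not say how the job is split, so the sketch assumes one sub-task per decimal digit of the second factor, non-negative integers, and hypothetical function names that mirror the file names.

```python
def start(a, b):
    """Split a * b into sub-tasks, one per decimal digit of b (assumed split)."""
    digits = [int(d) for d in reversed(str(b))]
    return [(a, digit, position) for position, digit in enumerate(digits)]


def part_multiply(a, digit, position):
    """One partial product: a times a single digit, shifted to its position."""
    return a * digit * 10 ** position


def add_together(partial_products):
    """Combine all partial products into the final result."""
    return sum(partial_products)


def long_multiply(a, b):
    # In the real pipeline these results would sit in a database table;
    # here a plain list stands in for it.
    intermediate_result = [part_multiply(*task) for task in start(a, b)]
    return add_together(intermediate_result)
```

Each `part_multiply` call is independent of the others, which is what lets a pipeline system run the sub-tasks in parallel before the final addition.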
Features Used in Example Pipeline
• A pipeline can have multiple analyses (e.g., 'start', 'part_multiply' and 'add_together').
• A job of one analysis can create jobs of other analyses by 'flowing the data' down branches. These branches are then assigned specific analysis names in the pipeline configuration file.
    • One 'start' job flows partial multiplication sub-tasks down branch #2, and a task of adding them together down branch #1.
• Execution of one analysis can be blocked until all jobs of another analysis have been successfully completed ('add_together' is blocked by 'part_multiply').
• eHive processes store intermediate and final results in a database (in this pipeline, the 'intermediate_result' and 'final_result' tables are used).
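The branch and blocking behavior described above can be imitated with a toy, single-process scheduler. This is a sketch under stated assumptions, not eHive's API: `FLOW`, `WAIT_FOR`, `run_pipeline`, and the in-memory `tables` dictionary are all hypothetical stand-ins, and the sub-task split (one per decimal digit) is assumed because the slides do not specify it.

```python
from collections import deque

# Hypothetical wiring: which analysis each branch of 'start' feeds.
FLOW = {("start", 2): "part_multiply", ("start", 1): "add_together"}
# Hypothetical blocking rule: analysis -> analysis that must finish first.
WAIT_FOR = {"add_together": "part_multiply"}


def run_pipeline(a, b):
    # Dictionaries of lists stand in for the database tables.
    tables = {"intermediate_result": [], "final_result": []}
    queue = deque([("start", {"a": a, "b": b})])

    def flow(analysis, branch, params):
        # 'Flowing data' down a branch creates a job of the wired analysis.
        queue.append((FLOW[(analysis, branch)], params))

    while queue:
        analysis, params = queue.popleft()
        blocker = WAIT_FOR.get(analysis)
        if blocker and any(name == blocker for name, _ in queue):
            queue.append((analysis, params))  # still blocked, retry later
            continue
        if analysis == "start":
            flow("start", 1, {})  # queued first on purpose, to exercise blocking
            for pos, digit in enumerate(reversed(str(params["b"]))):
                flow("start", 2, {"a": params["a"], "digit": int(digit), "pos": pos})
        elif analysis == "part_multiply":
            product = params["a"] * params["digit"] * 10 ** params["pos"]
            tables["intermediate_result"].append(product)
        elif analysis == "add_together":
            tables["final_result"].append(sum(tables["intermediate_result"]))
    return tables
```

Even though the 'add_together' job is created before any 'part_multiply' job, it only runs once no 'part_multiply' jobs remain, so `run_pipeline(12345, 6789)["final_result"]` holds the full product.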
Other Worthy Features
• eHive performs well for jobs that each run for a very short time but are repeated millions of times
• This is the converse of typical job scheduling systems, which have high latency