
Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Shannon Quinn, CPCB (squinn@cmu.edu | spq1@pitt.edu). November 10, 2011. Punchline: cloud computing for biological and clinical data analysis. Problem: the data are high-dimensional and noisy.


Presentation Transcript


  1. Dr. Mahout: Analyzing clinical data using scalable and distributed computing • Shannon Quinn, CPCB • squinn@cmu.edu | spq1@pitt.edu • November 10, 2011

  2. Punchline • Cloud computing for biological and clinical data analysis • Problem: high-dimensional, noisy! (image credits: tech2date.com; heart tissue: biomedcentral; fMRI: wikipedia; segmentation: biodynamics UCSD)

  3. Disclaimer • Biology jargon • Academic jargon

  4. My Background • 2nd year Ph.D. student in CPCB Program • Research in bioimage informatics

  5. My Background • Other (images: http://collegefootballbelt.com/Logos/, http://s3.amazonaws.com/data.tumblr.com/)

  6. Computational biology and… the cloud? • Biological data is BIG • It requires repetitive analysis in chunks • Modeling it involves linear algebra and statistics
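
The "repetitive analysis in chunks" point is what makes a MapReduce-style platform a natural fit. As a minimal sketch (plain Python, not Mahout code; the function names `chunk_gram`, `add_matrices`, and `gram_in_chunks` are made up for illustration), a Gram matrix X^T X can be computed one chunk of rows at a time and the partial results summed:

```python
from functools import reduce


def chunk_gram(rows):
    """Map step: partial Gram matrix (X^T X) for one chunk of rows."""
    d = len(rows[0])
    g = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            for j in range(d):
                g[i][j] += r[i] * r[j]
    return g


def add_matrices(a, b):
    """Reduce step: element-wise sum of two partial results."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gram_in_chunks(rows, chunk_size):
    """Split the rows into chunks, map each chunk, then reduce.

    Assumes `rows` is non-empty; each chunk could run on a different worker.
    """
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    return reduce(add_matrices, map(chunk_gram, chunks))


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy data, 3 rows x 2 columns
    print(gram_in_chunks(data, chunk_size=2))
```

The result does not depend on how the rows are chunked, which is why this kind of linear algebra distributes well.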

  7. Use case 1: protein behavior • [Figure: timescale of relevant motions, with axis ticks at 10^-15, 10^-12, 10^-9, 10^-6, 10^-3, and 10^0; labeled motions: bond vibration, side-chain rotation, domain shifts / max. catalysis, protein folding, global conformational shifts] • Sampling vs. detail: a common tradeoff…

  8. Molecular dynamics

  9. “The curse of [MD] dimensionality” • MD := • for every atom • for every t • … (images: http://icanhascheezburger.files.wordpress.com/, http://www.pdb.org/pdb/explore/explore.do?structureId=3fxi)
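
To make the dimensionality concrete, here is a rough size estimate, assuming one (x, y, z) triple is stored per atom per time step. The atom and frame counts are hypothetical and not taken from the talk.

```python
# Back-of-the-envelope size of an MD trajectory.
# All numbers below are hypothetical, chosen only for illustration.


def trajectory_values(n_atoms, n_frames, dims=3):
    """Number of coordinates stored: one (x, y, z) triple per atom per frame."""
    return n_atoms * dims * n_frames


def trajectory_bytes(n_atoms, n_frames, bytes_per_value=4):
    """Raw storage if every coordinate is kept as a 4-byte float."""
    return trajectory_values(n_atoms, n_frames) * bytes_per_value


if __name__ == "__main__":
    n_atoms = 10_000       # hypothetical system size
    n_frames = 1_000_000   # hypothetical number of saved time steps
    print(trajectory_values(n_atoms, n_frames))        # 30000000000 coordinates
    print(trajectory_bytes(n_atoms, n_frames) / 1e9)   # 120.0 (gigabytes)
```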

  10. Pipeline for MD trajectory analysis • Find a “surface” of protein shapes • MD output • Define surface (graph!) • Partition surface (image: http://www.dillgroup.ucsf.edu/)
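
The "define surface (graph), then partition" steps can be sketched on a toy example. This is plain Python, not Mahout code, and it swaps in a simplified two-way spectral cut (the sign of the Fiedler vector of the unnormalized Laplacian, found by power iteration) in place of distributed spectral k-means or Eigencuts. The Gaussian kernel, `sigma`, and the one-dimensional "frames" are assumptions made for illustration.

```python
import math


def affinity_matrix(frames, sigma=1.0):
    """Gaussian affinity between every pair of frames (coordinate vectors)."""
    n = len(frames)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(frames[i], frames[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w


def fiedler_vector(w, iterations=500):
    """Second-smallest eigenvector of the graph Laplacian L = D - W.

    Uses power iteration on (c*I - L) while projecting out the constant
    vector, which is the eigenvector of L with eigenvalue zero.
    """
    n = len(w)
    degree = [sum(row) for row in w]
    c = 2.0 * max(degree)  # upper bound on the eigenvalues of L (Gershgorin)
    v = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant component
        # (c*I - L) v = c*v - D*v + W*v
        v = [c * v[i] - degree[i] * v[i]
             + sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:  # degenerate graph (single node or no edges)
            return v
        v = [x / norm for x in v]
    return v


def two_way_partition(frames, sigma=1.0):
    """Label each frame 0 or 1 by the sign of its Fiedler-vector entry."""
    v = fiedler_vector(affinity_matrix(frames, sigma))
    return [0 if x < 0 else 1 for x in v]


if __name__ == "__main__":
    # Toy "conformations": two well-separated groups on a line.
    frames = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
    print(two_way_partition(frames))
```

On real trajectories each frame would be a long coordinate vector and the affinity matrix would be far too large for one machine, which is where the distributed jobs come in.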

  11. Mahout implementation • Defining surface/graph: MatrixMultiplicationJob (matrixmult), TransposeJob (transpose), DistributedLanczosSolver (svd), StochasticSVD (ssvd) • Partitioning surface/graph: SpectralKMeans (spectralkmeans), Eigencuts (eigencuts), Kmeans (kmeans), …

  12. MD in Mahout conclusion • MD simulations (x@Home projects) • Existing Mahout functionality • Additional algorithms (image: http://folding.stanford.edu/)

  13. Use case 2: diseases affecting cilia • What are cilia? • Hairlike structures • Keep things moving • Diseased cilia = (image: http://fc06.deviantart.net/fs71/f/2010/177/d/5/Sad_Panda_by_jinxii24.jpg)

  14. Importance of correct diagnoses • Symptoms look familiar • Consequences do not

  15. The beat pattern of cilia tells a lot! • What is the motion called? • Can we create a database of motions? • Clinicians look at cilia motion in making their diagnoses

  16. Clinicians’ ultimate goal • [Figure: unlabeled samples (? ? ?) alongside Category 1, Category 2, Category 3]

  17. Cilia as dynamic textures • Properties • Computer vision (Saisan et al. 2001)

  18. The [proposed] pipeline • Step 1 • Clinician captures video and uploads it (http://googolplex.dyndns.org/cilia/)

  19. The [proposed] pipeline • Step 2 • Mahout job: autoregressive modeling • Appearance model • Dynamic model (image: http://web.media.mit.edu/~tristan/phd/dissertation/figures/manifold2.jpg)
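
In the dynamic-texture literature a video is commonly modeled as y_t = C x_t (appearance) and x_{t+1} = A x_t + noise (dynamics). As a minimal sketch of the dynamic half only, assuming the frames have already been reduced to two-dimensional states (for example by an SVD), the transition matrix A can be fitted by least squares. The function name `fit_transition_2d` and the restriction to 2-D states are illustrative choices, not the talk's implementation.

```python
def fit_transition_2d(states):
    """Least-squares fit of A in x[t+1] ~ A x[t] for a list of 2-D states.

    Solves A = S10 * inverse(S00), where
      S10 = sum over t of x[t+1] x[t]^T
      S00 = sum over t of x[t]   x[t]^T
    """
    s10 = [[0.0, 0.0], [0.0, 0.0]]
    s00 = [[0.0, 0.0], [0.0, 0.0]]
    for prev, nxt in zip(states, states[1:]):
        for i in range(2):
            for j in range(2):
                s10[i][j] += nxt[i] * prev[j]
                s00[i][j] += prev[i] * prev[j]
    det = s00[0][0] * s00[1][1] - s00[0][1] * s00[1][0]
    if abs(det) < 1e-12:
        raise ValueError("states do not span two dimensions; A is not identifiable")
    inv = [[s00[1][1] / det, -s00[0][1] / det],
           [-s00[1][0] / det, s00[0][0] / det]]
    # 2x2 matrix product S10 * inv
    return [[sum(s10[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


if __name__ == "__main__":
    # Toy state sequence: each step rotates the state by 90 degrees.
    states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [1.0, 0.0]]
    print(fit_transition_2d(states))
```

For higher-dimensional states the same normal equations apply, but the inverse would be replaced by a general linear solver.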

  20. The [proposed] pipeline • Step 3 • Add the transition matrices (A) to cloud library

  21. The [proposed] pipeline • Step 4 • Recompute network with added videos • [Figure: scatter plot with axes “Axis 1” and “Axis 2”; one point marked “?”]

  22. One more thing… • What’s really cool about AR models: • Can you spot the fake? • [Figure: two videos, labeled “Synthetic” and “Original”]
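
The "spot the fake" demo relies on the fact that a fitted AR model can be run forward to generate new frames. A minimal sketch, assuming a transition matrix `a`, an appearance matrix `c`, and an initial state `x0` are already available (all names and the toy matrices here are hypothetical):

```python
import random


def synthesize(a, c, x0, n_frames, noise_std=0.0, seed=0):
    """Generate frames from x[t+1] = A x[t] + noise and y[t] = C x[t].

    a: state transition matrix (list of rows), c: appearance matrix,
    x0: initial state. Returns a list of frames (lists of pixel values).
    """
    rng = random.Random(seed)
    x = list(x0)
    frames = []
    for _ in range(n_frames):
        # Appearance model: map the low-dimensional state to pixel space.
        frames.append([sum(row[k] * x[k] for k in range(len(x))) for row in c])
        # Dynamic model: advance the state, with optional Gaussian noise.
        x = [sum(row[k] * x[k] for k in range(len(x))) + rng.gauss(0.0, noise_std)
             for row in a]
    return frames


if __name__ == "__main__":
    a = [[0.0, -1.0], [1.0, 0.0]]               # toy dynamics: 90-degree rotation
    c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy appearance: 3 "pixels"
    for frame in synthesize(a, c, [1.0, 0.0], n_frames=4):
        print(frame)
```

With `noise_std` above zero, each seed yields a different but statistically similar sequence.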

  23. Mahout implementation • Learning autoregressive models: MatrixMultiplicationJob (matrixmult), TransposeJob (transpose), DistributedLanczosSolver (svd), StochasticSVD (ssvd) • Comparing autoregressive parameters: SpectralKMeans (spectralkmeans), Eigencuts (eigencuts), Frobenius norm, Tensors, ? ? ?
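
One of the comparison options listed, the Frobenius norm, is easy to sketch. The library contents and labels below are hypothetical. The sketch also assumes all matrices are expressed in a comparable state basis; AR parameters are only determined up to a change of basis, so this naive distance is a starting point at best.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of (a - b) for two matrices of the same shape."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


def nearest_label(query, library):
    """Return the label of the library matrix closest to `query`.

    library: list of (label, matrix) pairs.
    """
    return min(library, key=lambda item: frobenius_distance(query, item[1]))[0]


if __name__ == "__main__":
    # Hypothetical library of transition matrices with made-up labels.
    library = [
        ("pattern A", [[0.9, 0.0], [0.0, 0.9]]),
        ("pattern B", [[0.0, -1.0], [1.0, 0.0]]),
    ]
    query = [[0.8, 0.1], [0.0, 0.9]]
    print(nearest_label(query, library))
```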

  24. Cilia on Mahout conclusions • Autoregressive modeling uses linear algebra that is already implemented • Maintaining the AR library requires new functionality • The Mahout framework gives us elbow room

  25. Final Thoughts • Biological / biomedical data is large, high-dimensional, and noisy • We extend Mahout’s current linear algebra framework (spectral clustering, autoregressive models) • We provide a cloud framework!

  26. Research Group • University of Pittsburgh: Dr. Chakra Chennubhotla Lab (advisor) • CMU@Qatar: Dr. Majd Sakr Lab (collaborator) • University of Pittsburgh Medical Center: Dr. Cecilia Lo Lab (collaborator)

  27. Sources • Resources: Apache Mahout; Spectrally Clustered • Links: Categorizing ciliary motion defects (BSEC 2011); Eigencuts spectral clustering algorithm; Technical report (coming soon!)

  28. Contact • Shannon Quinn • squinn@cmu.edu | spq1@pitt.edu • http://www.magsolweb.net/

  29. Thank you! (image: http://icanhascheezburger.files.wordpress.com/)
