Integrative Analysis of multiple large-scale molecular biological data

Sri Priya Ponnapalli, Genomic Signal Processing Laboratory, The University of Texas at Austin

Project Objectives
Specimen under analysis: the National Cancer Institute's 60 cell lines (NCI60).
• Dataset #1: RNA expression profiles [Ross et al., 2000]
• Dataset #2: Proteomic profiles [Nishizuka et al., 2003]
• Dataset #3: Drug activity levels [Scherf et al., 2000]
The aim is to identify relationships among the three datasets, each of which measures a different attribute of the NCI60: genome-scale expression, sensitivities to more than 70,000 chemical compounds and chemotherapeutics, and proteomic profiles.
CHIEF OBJECTIVE: DEVELOP A METHOD TO ANALYSE THE RELATIONSHIPS BETWEEN MULTIPLE DATASETS.

Initial Analysis: SVD
• All three datasets were processed using singular value decomposition (SVD) [Alter et al., 2000]. The results look interesting, but they are difficult to interpret for a single dataset, and harder still to integrate across all three datasets.
[Figure: plot of the first 5 sorted eigengenes; axes: tumor samples, eigengenes]
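The SVD step on a single dataset can be sketched as follows. This is a minimal illustration, not the code used for the slides: the function name `top_eigengenes`, the synthetic 200 x 60 matrix, and the convention that rows are genes and columns are cell lines are all assumptions made for the example. The eigengenes are taken to be the rows of V^T, sorted by decreasing singular value, which is the order `numpy.linalg.svd` already returns.

```python
import numpy as np

def top_eigengenes(data, k=5):
    """Return the k leading eigengenes of a (genes x samples) matrix.

    Eigengenes are the rows of V^T in data = U @ diag(s) @ V^T.
    Also returns the matching singular values and the fraction of
    the total squared signal that each one captures.
    """
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    fractions = s**2 / np.sum(s**2)
    return vt[:k], s[:k], fractions[:k]

# Synthetic stand-in for one dataset (hypothetical sizes): 200 genes x 60 cell lines.
rng = np.random.default_rng(0)
toy = rng.standard_normal((200, 60))

eigengenes, singular_values, fractions = top_eigengenes(toy, k=5)
print(eigengenes.shape)        # one row per eigengene, one column per cell line
print(np.round(fractions, 3))  # relative weight of each of the 5 eigengenes
```

Each dataset gets its own set of eigengenes this way, with no shared basis between datasets, which is one way to see why the three separate SVD results are hard to integrate.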

Analysis using GSVD
• Every pair of datasets was then processed using generalized singular value decomposition (GSVD), which factors both datasets with a shared matrix X:
• Dataset1 = U1 E1 X
• Dataset2 = U2 E2 X
• If a dataset is thought of as a line, the GSVD of two datasets represents the point of intersection of these lines, i.e. it highlights the similarities and dissimilarities between the two datasets. This suggests a method for studying the similarities and differences between multiple datasets.
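The pairwise factorization Dataset1 = U1 E1 X, Dataset2 = U2 E2 X can be sketched with plain NumPy. NumPy has no built-in GSVD routine, so this example uses a textbook construction (a QR factorization of the stacked matrices followed by an SVD of the top block); it is not necessarily the routine used for the slides. The function name `gsvd_pair` and the toy sizes are hypothetical, and the sketch assumes that both matrices have at least as many rows as columns, that the stacked matrix has full column rank, and that no diagonal entry of E2 is zero.

```python
import numpy as np

def gsvd_pair(a, b):
    """Factor a = u1 @ diag(c) @ x and b = u2 @ diag(s) @ x with a shared x.

    Assumes a and b have the same number of columns n, at least n rows each,
    and that every entry of s comes out strictly positive.
    """
    m1 = a.shape[0]
    q, r = np.linalg.qr(np.vstack([a, b]))       # reduced QR of the stacked data
    q1, q2 = q[:m1], q[m1:]
    u1, c, wt = np.linalg.svd(q1, full_matrices=False)
    q2w = q2 @ wt.T                               # columns are orthogonal, norms s_i
    s = np.linalg.norm(q2w, axis=0)
    u2 = q2w / s                                  # normalize columns to unit length
    x = wt @ r                                    # shared right factor
    return u1, c, u2, s, x

# Synthetic stand-ins (hypothetical sizes) sharing the same 6 columns.
rng = np.random.default_rng(1)
d1 = rng.standard_normal((100, 6))
d2 = rng.standard_normal((80, 6))

u1, c, u2, s, x = gsvd_pair(d1, d2)
print(np.allclose(d1, u1 @ np.diag(c) @ x))
print(np.allclose(d2, u2 @ np.diag(s) @ x))
```

In this construction the diagonal entries satisfy c_i^2 + s_i^2 = 1, so the ratio c_i / s_i indicates whether the i-th row of X is more significant in the first dataset or in the second.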

Consider the case of finding similarities and dissimilarities between the 3 pairs formed by 3 datasets (this can be extended to more datasets).
• These 3 datasets may be thought of as representing 3 lines.
• Any two non-parallel lines intersect at a point.
• Three non-parallel lines form a triangle (unless they all pass through a common point, in which case all three vertices of the triangle converge to that point).
• The goal is to express the three datasets in the form
• Dataset1 = U1 E1 X
• Dataset2 = U2 E2 X
• Dataset3 = U3 E3 X

If we compute the GSVD of every pair of datasets (that is, find the point of intersection of every two lines), we get three matrices, each corresponding to a vertex of a triangle. We want a single matrix that best approximates these three matrices, i.e. a point that is closest to all three vertices simultaneously. This point is the centroid of the triangle, and given the coordinates of the vertices it is easily computed. All of these statements have to be interpreted in terms of matrices, which can be done by measuring the distance between matrices with the Frobenius distance.
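The centroid step can be illustrated with a short sketch. It relies on one standard fact: the entrywise mean of a set of matrices minimizes the sum of squared Frobenius distances to them, which is the matrix analogue of the centroid of a triangle. The function names and the random 6 x 6 "vertices" are hypothetical, and the sketch assumes the three pairwise matrices are already directly comparable (same shape, same row order and sign conventions), which the slides do not spell out.

```python
import numpy as np

def frobenius_centroid(matrices):
    """Entrywise mean: minimizes the sum of squared Frobenius distances."""
    return np.mean(np.stack(matrices), axis=0)

def total_squared_distance(point, matrices):
    """Sum of squared Frobenius distances from `point` to each matrix."""
    return sum(np.linalg.norm(point - m, "fro") ** 2 for m in matrices)

# Three hypothetical "vertices", standing in for the pairwise GSVD results.
rng = np.random.default_rng(2)
vertices = [rng.standard_normal((6, 6)) for _ in range(3)]

centroid = frobenius_centroid(vertices)
best = total_squared_distance(centroid, vertices)

# Any other candidate, such as a vertex itself, has a larger total.
at_vertex = total_squared_distance(vertices[0], vertices)
print(best < at_vertex)
```

Other notions of "closest to all three vertices" exist, for example minimizing the sum of unsquared distances; the mean is the one that corresponds to the geometric centroid described above.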

This method is an approximation, but it is the best possible approximation in the following sense.
• It minimizes the error between each original dataset and the dataset obtained as the product of its three matrices.
• It has been tried on the three datasets under study, and the results look promising.
• Please read the paper for further details.
