
Analysis and Management of Microarray Data



Presentation Transcript


  1. Analysis and Management of Microarray Data Dr G. P. S. Raghava

  2. Major Applications
     - Identification of differentially expressed genes in diseased tissues (or in the presence of a drug)
     - Classification of differentially expressed genes, or clustering/grouping of genes that behave similarly under different conditions
     - Use of the expression profile of a known disease for diagnosis and for classifying unknown genes

  3. Management of Microarray Data
     - Magnitude of data
       - Experiments
         - 50 000 genes in human
         - 320 cell types
         - 2000 compounds
         - 3 time points
         - 2 concentrations
         - 2 replicates
       - Data volume
         - 4 × 10^11 data points
         - 10^15 bytes = 1 petabyte of data
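
The data-point count on this slide is the product of the listed design factors. A minimal sketch of that arithmetic follows; the dictionary keys are illustrative names, and the bytes-per-point figure is only what the two slide numbers imply when divided, since the slide does not state its storage assumption.

```python
# Experimental design factors as listed on the slide.
factors = {
    "genes": 50_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}


def count_data_points(design):
    """Multiply all design factors to get the number of measurements."""
    total = 1
    for size in design.values():
        total *= size
    return total


data_points = count_data_points(factors)
print(f"{data_points:.2e} data points")  # 3.84e+11, roughly 4 x 10^11

# Ratio of the slide's two headline numbers (not an assumption stated on the slide).
implied_bytes_per_point = 10**15 / (4 * 10**11)
print(implied_bytes_per_point)  # 2500.0
```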

  4. Gene expression database: a conceptual view
     [Figure labels: samples, sample annotations, gene expression matrix, genes, gene annotations, gene expression levels]
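
One way to picture the conceptual view is a matrix of expression levels with annotations attached to each axis. The sketch below is a hypothetical in-memory layout, not the schema of any real database. It assumes genes are rows and samples are columns, and all names and values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionDataset:
    """Gene expression matrix plus annotations for its rows and columns."""
    genes: list     # row labels (assumed orientation: one row per gene)
    samples: list   # column labels (one column per sample)
    levels: list    # levels[i][j] = expression of genes[i] in samples[j]
    gene_annotations: dict = field(default_factory=dict)
    sample_annotations: dict = field(default_factory=dict)

    def __post_init__(self):
        if len(self.levels) != len(self.genes):
            raise ValueError("one row of levels is required per gene")
        for row in self.levels:
            if len(row) != len(self.samples):
                raise ValueError("each row needs one value per sample")

    def level(self, gene, sample):
        """Look up a single expression level by gene and sample name."""
        return self.levels[self.genes.index(gene)][self.samples.index(sample)]


# Made-up example data.
dataset = ExpressionDataset(
    genes=["geneA", "geneB"],
    samples=["tumor_1", "normal_1"],
    levels=[[8.2, 5.1], [3.3, 3.4]],
    gene_annotations={"geneA": {"function": "hypothetical kinase"}},
    sample_annotations={"tumor_1": {"tissue": "skin"}},
)
print(dataset.level("geneA", "normal_1"))  # 5.1
```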

  5. Management of Microarray Data: Major Issues
     - Large volume of microarray data generated in the last few years
       - Storage and efficient access
       - Comparison and integration of data
     - Problem of data access and exchange
       - Data scattered around the Internet
       - Supplementary material of publications
       - Difficult for users to access relevant data
     - Problems with existing databases
       - Diverse purposes
       - Developed for a specific purpose

  6. Management of Microarray Data
     - Specific databases
       - Platform (e.g. Stanford Microarray Database; SMD)
       - Organism (Yeast Microarray Global Viewer)
       - Project (life cycle database of Drosophila)
     - Problems with supplementary material and microarray databases
       - Lack of direct access
       - Quality not checked
       - No standard format
       - Incomplete data

  7. BASE (http://bioinformatics1.uams.edu:8081:/)
     - Comprehensive database server to manage massive amounts of microarray data
       - Biomaterial information
       - Raw data and images
       - Web tools (normalization; data viewing; analysis)
     - Running on local servers allows full management, including permission to add and view data
     - Minimum Information About a Microarray Experiment (MIAME)

  8. Public Databases
     - Gene expression data is an essential aspect of annotating the genome
     - Publication and data exchange for microarray experiments
     - Data mining / meta-studies
     - Common data format: XML
     - MIAME (Minimum Information About a Microarray Experiment)

  9. GEO at the NCBI

  10. Microarray Data Mining Challenges
      - Too few records (samples), usually < 100
      - Too many columns (genes), usually > 1,000
      - Too many columns are likely to lead to false positives
      - For exploration, a large set of all relevant genes is desired
      - For diagnostics or identification of
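
The false-positive problem can be made concrete with a small calculation: testing every gene separately at a fixed significance level produces chance hits in proportion to the number of genes. The sketch below assumes a per-gene level of 0.05 and, as an illustration that is not taken from the slides, shows the Bonferroni-adjusted threshold as one simple countermeasure.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of genes passing a per-gene test at level alpha
    when no gene is truly differentially expressed."""
    return n_genes * alpha


def bonferroni_threshold(n_genes, alpha):
    """Per-gene threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_genes


for n_genes in (1_000, 50_000):
    print(
        n_genes,
        expected_false_positives(n_genes, 0.05),
        bonferroni_threshold(n_genes, 0.05),
    )
```

With 1,000 genes and a 0.05 level, about 50 genes would be expected to look significant by chance alone.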

  11. Analysis of Microarray Data
      - Analysis of images
      - Preprocessing of gene expression data
      - Normalization of data
        - Subtraction of background noise
        - Global/local normalization
        - Housekeeping genes (or same gene)
        - Expression as the log of a ratio (test/reference)
      - Differential gene expression
        - Repeat measurements and calculate significance (t-test)
        - Significance of fold change assessed with a statistical method
      - Clustering
        - Supervised/unsupervised (hierarchical, K-means, SOM)
      - Prediction or supervised machine learning (SVM)
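
The differential-expression step can be sketched for a single gene: subtract background, take the log of the test/reference ratio for each replicate, and compute a t statistic against a mean log ratio of zero. This is a simplified illustration under stated assumptions: one shared background value for both channels, a one-sample t-test, and made-up intensities. Converting the statistic to a p-value needs a t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def log2_ratios(test, reference, background=0.0):
    """Background-subtract both channels and return log2(test / reference)."""
    return [
        math.log2((t - background) / (r - background))
        for t, r in zip(test, reference)
    ]


def one_sample_t(values, mu=0.0):
    """t statistic for H0: mean(values) == mu, with n - 1 degrees of freedom."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return (mean - mu) / (sd / math.sqrt(n))


# Hypothetical replicate intensities for one gene (made-up numbers).
test_intensity = [820.0, 900.0, 760.0, 880.0]
ref_intensity = [210.0, 240.0, 200.0, 215.0]

ratios = log2_ratios(test_intensity, ref_intensity, background=20.0)
print("mean log2 ratio:", statistics.mean(ratios))
print("t statistic:", one_sample_t(ratios))
```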

  12. Low-Level Analysis or Preprocessing of Gene Expression Data
      - Scale transformation
      - Normalization and scaling
      - Replicate handling
      - Missing value handling
      - Flat pattern filtering
      - Pattern standardization
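
Several of these preprocessing steps can be chained in a few lines. The sketch below is one possible reading of the list, not a prescribed pipeline. It assumes a log2 scale transformation, row-mean imputation for missing values, a standard-deviation cutoff of 0.5 for flat patterns, and z-score standardization. Replicate handling and normalization are left out, and the data are made up.

```python
import math
import statistics


def impute_row_mean(row):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in row if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in row]


def is_flat(row, min_sd=0.5):
    """A profile is treated as flat if it barely varies across samples."""
    return statistics.pstdev(row) < min_sd


def standardize(row):
    """Shift and scale a profile to mean 0 and standard deviation 1."""
    mean = statistics.mean(row)
    sd = statistics.pstdev(row)
    return [(v - mean) / sd for v in row]


# Made-up raw intensities; None marks a missing spot.
raw = {
    "geneA": [100.0, 400.0, None, 1600.0],
    "geneB": [50.0, 52.0, 51.0, 49.0],
}

processed = {}
for gene, row in raw.items():
    logged = [None if v is None else math.log2(v) for v in row]  # scale transformation
    filled = impute_row_mean(logged)                             # missing value handling
    if is_flat(filled):                                          # flat pattern filtering
        continue
    processed[gene] = standardize(filled)                        # pattern standardization

print(processed)
```

Here geneB is dropped as flat, and geneA is kept with a standardized profile.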

  13. Normalization Techniques
      - Global normalization
        - Divide channel values by the channel mean
      - Control spots
        - Common spots in both channels
        - Housekeeping genes
        - Ratio of the intensities of the same gene in the two channels is used for correction
      - Iterative linear regression
      - Parametric nonlinear normalization
        - log(Cy3/Cy5) vs log(Cy5)
        - Fitted log ratio - observed log ratio
      - General nonlinear normalization
        - LOESS
        - Curve fitted between log(R/G) and log(sqrt(R·G))
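
Two of these techniques are easy to sketch: global normalization by the channel mean, and intensity-dependent correction of the log ratio. In the second, a straight least-squares line stands in for the LOESS curve, which is a deliberate simplification. The code subtracts the fitted value from the observed log ratio. The function names and the example intensities are illustrative assumptions.

```python
import math
import statistics


def global_normalize(channel):
    """Divide every intensity in a channel by that channel's mean."""
    mean = statistics.mean(channel)
    return [v / mean for v in channel]


def ma_values(red, green):
    """Return (M, A) pairs with M = log2(R/G) and A = log2(sqrt(R*G))."""
    return [
        (math.log2(r / g), math.log2(math.sqrt(r * g)))
        for r, g in zip(red, green)
    ]


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def intensity_dependent_normalize(red, green):
    """Subtract a fitted trend M(A) from each observed log ratio.
    A straight line is used here in place of a LOESS curve."""
    pairs = ma_values(red, green)
    ms = [m for m, _ in pairs]
    a_vals = [a for _, a in pairs]
    intercept, slope = fit_line(a_vals, ms)
    return [m - (intercept + slope * a) for m, a in pairs]


# Made-up intensities with a constant two-fold dye bias in the red channel.
red = [200.0, 800.0, 3200.0, 12800.0]
green = [100.0, 400.0, 1600.0, 6400.0]
corrected = intensity_dependent_normalize(red, green)
print(corrected)
```

Because the bias in this toy data is the same at every intensity, the corrected log ratios all come out at zero.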

  14. Classification
      - Task: assign objects to classes (groups) on the basis of measurements made on the objects
      - Unsupervised: classes are unknown and are to be discovered from the data (cluster analysis)
      - Supervised: classes are predefined, and a training (learning) set of labeled objects is used to form a classifier for future observations
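
The supervised case can be illustrated with a very small classifier. The sketch below uses a nearest-centroid rule, chosen only for brevity and not named in the slides: it learns one mean feature vector per class from labeled training objects and assigns a new observation to the closest one. The labels and measurements are made up.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def train_nearest_centroid(training_set):
    """training_set: list of (feature_vector, class_label) pairs."""
    by_class = {}
    for features, label in training_set:
        by_class.setdefault(label, []).append(features)
    return {label: centroid(vectors) for label, vectors in by_class.items()}


def classify(model, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Made-up labeled training objects (two measurements each).
training = [
    ([1.0, 1.2], "normal"),
    ([0.8, 1.0], "normal"),
    ([3.0, 3.1], "tumor"),
    ([3.2, 2.9], "tumor"),
]
model = train_nearest_centroid(training)
print(classify(model, [2.9, 3.0]))  # tumor
print(classify(model, [1.1, 0.9]))  # normal
```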

  15. Cluster Analysis
      - Used to find groups of objects when the groups are not already known
      - "Unsupervised learning"
      - Associated with each object is a set of measurements (the feature vector)
      - Aim is to identify groups of similar objects on the basis of the observed measurements
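
What counts as "similar" depends on the distance measure applied to the feature vectors. As an illustration, the sketch below compares Euclidean distance with a correlation-based distance (1 minus the Pearson correlation). Both are common choices, but the slides do not say which one is intended, and the two profiles are made up.

```python
import math
import statistics


def euclidean(u, v):
    """Straight-line distance between two feature vectors."""
    return math.dist(u, v)


def pearson_distance(u, v):
    """1 - Pearson correlation: near 0 for profiles with the same shape."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)


gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape, ten times the magnitude

print(euclidean(gene_a, gene_b))         # large: magnitudes differ
print(pearson_distance(gene_a, gene_b))  # about 0: shapes agree
```

The two genes are far apart in Euclidean terms but essentially identical under the correlation-based measure, so the choice of distance changes which genes end up grouped together.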

  16. Unsupervised Learning
      - Hierarchical clustering: merges two branches at a time until all variables (genes) are in one tree. [It does not answer the question "how many gene clusters are there?"]
      - K-means clustering: assumes there are K clusters. [What if this assumption is incorrect?]
      - Self-organizing maps (SOM)
        - Split all genes into similar sub-groups
        - Finds its own groups (machine learning)
      - Principal component analysis: every gene is a dimension (vector); find a single dimension that best represents the differences in the data
      - Model-based clustering: the number of clusters is determined dynamically [could be one of the most promising methods]
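
The K-means item can be sketched in plain Python. This is a minimal version of Lloyd's algorithm under simplifying assumptions: the first K points seed the centroids, distance is Euclidean, and the profiles are made up. The caller has to supply K, which is the assumption the slide questions.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def k_means(points, k, iterations=100):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return assignment, centroids


# Made-up two-condition expression profiles for five genes.
profiles = [[1.0, 1.1], [5.0, 5.2], [0.9, 1.0], [5.1, 4.9], [1.2, 0.8]]
assignment, centroids = k_means(profiles, k=2)
print(assignment)  # [0, 1, 0, 1, 0]
```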

  17. Average linkage hierarchical clustering, melanoma only
      [Figure panels: unclustered; 'cluster']
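
Average linkage merges, at each step, the two clusters with the smallest mean pairwise distance between their members. The sketch below is a naive illustration on made-up one-dimensional data. It is not the analysis behind the melanoma figure, and it recomputes all linkages at every step, so it only suits small inputs.

```python
import math
from itertools import combinations


def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(
        math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clustering(points):
    """Agglomerative clustering; returns the merges in the order they happen.
    Each merge is (members_of_a, members_of_b, linkage_distance)."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        merges.append((a, b, average_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Made-up one-dimensional measurements for four objects.
points = [[0.0], [0.1], [5.0], [5.3]]
merges = hierarchical_clustering(points)
for a, b, distance in merges:
    print(a, b, round(distance, 3))
```

The merge list is the tree itself. Deciding how many clusters there are means choosing where to cut it, which the algorithm does not do on its own.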

  18. Supervised Analysis
      - Fisher's linear discriminant analysis
      - Quadratic discriminant analysis
      - Logistic regression (a linear discriminant analysis)
      - Neural networks
      - Support vector machine
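
As an illustration of the first method on this list, the sketch below implements Fisher's linear discriminant for two classes with two features each. The projection direction is the inverse within-class scatter matrix applied to the difference of the class means. Placing the decision threshold at the projected midpoint of the two means is an assumption (equal priors), and the training data are made up.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter_matrix(rows, mean):
    """2x2 within-class scatter: sum of (x - mean)(x - mean)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in rows:
        d = [x[0] - mean[0], x[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_lda(class0, class1):
    """Return (w, threshold) for two classes of 2-D feature vectors."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_matrix(class0, m0), scatter_matrix(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [
        [sw[1][1] / det, -sw[0][1] / det],
        [-sw[1][0] / det, sw[0][0] / det],
    ]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [
        inv[0][0] * diff[0] + inv[0][1] * diff[1],
        inv[1][0] * diff[0] + inv[1][1] * diff[1],
    ]
    # Assumed decision rule: threshold at the projected midpoint of the means.
    midpoint = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    threshold = w[0] * midpoint[0] + w[1] * midpoint[1]
    return w, threshold


def predict(w, threshold, x):
    """Return 1 if x projects above the threshold, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0


# Made-up training data: two expression measurements per sample.
class0 = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.8], [2.2, 2.1]]
class1 = [[5.0, 6.0], [6.0, 5.0], [5.5, 5.8], [6.2, 6.1]]
w, threshold = fisher_lda(class0, class1)
print(predict(w, threshold, [1.2, 1.5]))  # 0
print(predict(w, threshold, [5.8, 5.5]))  # 1
```

With thousands of genes and fewer than 100 samples, the scatter matrix becomes singular, so in practice a method like this needs gene selection or regularization first.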

  19. Example: Tumor Classification
      - Reliable and precise classification is essential for successful cancer treatment
      - Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
      - Uncertainties in diagnosis remain; existing classes are likely to be heterogeneous
      - Characterize molecular variations among tumors by monitoring gene expression (microarray)
      - Hope: microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)

  20. Higher-Level Microarray Data Analysis
      - Clustering and pattern detection
      - Data mining and visualization
      - Controls and normalization of results
      - Statistical validation
      - Linkage between gene expression data and gene sequence/function/metabolic pathway databases
      - Discovery of common sequences in co-regulated genes
      - Meta-studies using data from multiple experiments

  21. Thanks
