
Outline

Marianna De Santis, Francesco Rinaldi, Emmanuela Falcone, Stefano Lucidi, Giulia Piaggio, Aymone Gurtner and Lorenzo Farina BIOINFORMATICS - Systems biology Vol. 30 no. 2 2014, pages 228–233.


  1. Marianna De Santis, Francesco Rinaldi, Emmanuela Falcone, Stefano Lucidi, Giulia Piaggio, Aymone Gurtner and Lorenzo Farina. BIOINFORMATICS - Systems biology, Vol. 30 no. 2 2014, pages 228–233. Combining optimization and machine learning techniques for genome-wide prediction of human cell cycle-regulated genes

  2. Outline • Background • Motivation • Problem • Proposed Solution • Experiment & Results • Comment

  3. Background • The cell division cycle has 4 phases: G1, S, G2 and M

  4. Background • The cell division cycle is driven by the expression of various genes • The genes are transcribed: • In an interdependent sequence • At peak levels when they are needed • To discover these genes, microarrays are applied to synchronously dividing cells

  5. Motivation • Identifying cycling genes on a genome-wide scale is difficult because of: • Loss of cell synchronization • Intrinsic microarray noise • Computational methods have been developed to filter out the noise. • Even when noise is reduced, the results still have low reproducibility

  6. Problem Definition • Input: • A microarray time series dataset with n time samples. • Each dataset has m genes • Each gene expression profile is represented as an n-dimensional vector • Output: • A cyclicity score for each gene (m scores in total) • Range: [0, 1]

  7. Proposed Solution • Solution proposed by the authors: • LEON (LEarning and OptimizatioN) algorithm: • Learning Step • Identify cycling genes with a pre-trained SVM classifier • Features used: intensity of expression levels at different time samples • Optimization Step • For each gene, approximate the expression level intensity as a function of time by optimization • Evaluate cyclicity based on the time between minima and between maxima • Used to address the cell synchronization problem • Combine the two results above to calculate the final cyclicity score

  8. Details of learning step • Step 1: Train the SVM: • Instances used in training: • Positive instances: 50 cycling genes known from the literature • Regulated throughout the cell cycle through changes in mRNA level • Verified by traditional experimental methods • Negative instances: 50 genes whose expression level profiles are randomly shuffled in time • Features used in training: • 17 expression levels recorded at 17 different time points
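A minimal sketch of how a negative instance could be built by random time shuffling, as described on this slide. The 17-point sinusoidal profile and the function name `shuffled_negative` are illustrative assumptions, not the authors' data or code.

```python
import math
import random

def shuffled_negative(profile, rng):
    """Return a copy of `profile` with its time points in random order."""
    shuffled = list(profile)
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical positive profile: 17 time points covering two cycles.
positive = [math.sin(2 * math.pi * t / 8.0) for t in range(17)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
negative = shuffled_negative(positive, rng)
```

Shuffling keeps the same set of expression values but destroys their order in time, so the negative profile has the same intensity distribution as the positive one and differs only in its temporal structure.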

  9. Details of learning step (1) • Software package used: • LIBSVM (an SVM library developed by the Machine Learning Group at National Taiwan University) • Kernel used to map the data into a higher-dimensional space: • Radial basis function (RBF) kernel. • k-fold cross-validation (k=5) and grid search are used to prevent over-fitting • Step 2: Evaluate each gene with the SVM above • If the gene is classified as cycling, its partial cyclicity score is 1; otherwise its score is 0
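The two building blocks named on this slide can be sketched in plain Python: the RBF kernel value between two expression profiles, and the index split behind 5-fold cross-validation. This is not LIBSVM's code; the function names and the interleaved fold assignment are assumptions made for illustration. In a grid search, each candidate parameter setting (for an RBF SVM, typically the penalty C and the kernel width gamma) would be scored on these folds.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation

# 100 training instances (50 positive + 50 negative), 5 folds.
splits = list(k_fold_indices(100, k=5))
```

Each instance appears in exactly one validation fold, so every parameter setting is evaluated on data it was not trained on.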

  10. Details of learning step(2)

  11. Details of optimization step • How cyclicity can be measured from the expression data profile: • Naive approach: first Fourier coefficient • Problems with this approach: • The profile may not be a pure sinusoid. • It is not robust to cell synchronization error • The function shape is not consistent between two subsequent cycles of the same transcript. • This approach is therefore unreliable
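For reference, the naive Fourier measure can be sketched as follows. The sketch assumes that "first Fourier coefficient" means the coefficient at the cell-cycle frequency, and that the number of cycles in the sampled window (`n_cycles`) is known; both are assumptions, as is the function name.

```python
import cmath
import math

def first_fourier_magnitude(profile, n_cycles):
    """Magnitude of the Fourier coefficient at `n_cycles` per window,
    computed on the mean-centred profile and divided by its length."""
    n = len(profile)
    mean = sum(profile) / n
    coefficient = sum(
        (x - mean) * cmath.exp(-2j * math.pi * n_cycles * t / n)
        for t, x in enumerate(profile)
    )
    return abs(coefficient) / n

# A pure sinusoid with two cycles over 16 samples, and a flat profile.
sinusoid = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
flat = [3.0] * 16
```

A pure sinusoid of unit amplitude scores 0.5 and a flat profile scores 0. A profile that is periodic but not sinusoidal, or whose cycles differ in shape, spreads its energy over other frequencies and scores lower, which is the weakness listed on this slide.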

  12. Details of optimization step • The new approach proposed by the authors: • Plot the expression level over time • Measure the two features below: • Distance in time dmin between two subsequent minima • Distance in time dmax between two subsequent maxima • If dmin and dmax are close to the duplication period, the gene is likely to be cyclic. • The duplication period is measured by flow cytometry (Fluorescence-Activated Cell Sorting) analysis
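The two features can be illustrated on a sampled curve. The sketch below finds strict interior local extrema and takes the time between the first two minima and the first two maxima. The triangular test profile, the function names and the `relative_deviation` closeness measure are hypothetical; the transcript does not preserve the authors' scoring formula.

```python
def local_extrema_times(times, values):
    """Return (minima_times, maxima_times) for strict interior extrema."""
    minima, maxima = [], []
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] < values[i + 1]:
            minima.append(times[i])
        elif values[i] > values[i - 1] and values[i] > values[i + 1]:
            maxima.append(times[i])
    return minima, maxima

def subsequent_distance(extrema_times):
    """Time between the first two extrema, or None if fewer than two."""
    if len(extrema_times) < 2:
        return None
    return extrema_times[1] - extrema_times[0]

def relative_deviation(distance, period):
    """Hypothetical closeness measure: |distance - period| / period."""
    return abs(distance - period) / period

# Hypothetical profile: triangular wave with a period of 8 time units.
times = list(range(20))
values = [abs((t % 8) - 4) for t in times]

minima, maxima = local_extrema_times(times, values)
d_min = subsequent_distance(minima)
d_max = subsequent_distance(maxima)
```

For this profile both distances equal the period of 8, so the deviation from an assumed duplication period of 8 is zero. On real data the extrema would be read from the fitted expression function rather than from the raw samples.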

  13. Details of optimization step

  14. Details of optimization step • Step 1: The data are preprocessed to: • Reduce noise • Extrapolate values outside the sampling time range • Step 2: Plot the chart • Approximate the expression level function, since there are not enough data points • Express the expression level function as a linear combination of radial basis functions
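A minimal sketch of approximating an expression profile with a linear combination of Gaussian radial basis functions. It swaps in closed-form regularized least squares (ridge regression) for the fit, which is not the optimization method used by the authors. The Gaussian basis shape, the centers, the width, the regularization weight and all function names are assumptions.

```python
import numpy as np

def rbf_design_matrix(times, centers, width):
    """Matrix whose (i, j) entry is exp(-(t_i - c_j)^2 / (2 * width^2))."""
    t = np.asarray(times, dtype=float)[:, None]
    c = np.asarray(centers, dtype=float)[None, :]
    return np.exp(-((t - c) ** 2) / (2.0 * width ** 2))

def fit_rbf_weights(times, values, centers, width, reg):
    """Weights minimizing ||Phi w - y||^2 + reg * ||w||^2."""
    phi = rbf_design_matrix(times, centers, width)
    lhs = phi.T @ phi + reg * np.eye(phi.shape[1])
    rhs = phi.T @ np.asarray(values, dtype=float)
    return np.linalg.solve(lhs, rhs)

def evaluate_rbf(times, centers, width, weights):
    """Evaluate the fitted linear combination at the given times."""
    return rbf_design_matrix(times, centers, width) @ weights

# Hypothetical profile: 17 samples of a sinusoid with period 8.
sample_times = np.arange(17)
sample_values = np.sin(2 * np.pi * sample_times / 8.0)

weights = fit_rbf_weights(sample_times, sample_values,
                          centers=sample_times, width=1.0, reg=1e-6)

# The continuous approximation can be evaluated between the samples.
dense_times = np.linspace(0.0, 16.0, 161)
dense_values = evaluate_rbf(dense_times, sample_times, 1.0, weights)
```

Once a smooth function is available, minima and maxima can be located on a dense time grid instead of being limited to the 17 sampling times.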

  15. Details of optimization step • Minimize the error in the function approximation by formulating the regularized problem below: • Solve the optimization problem with the PRICE algorithm • Search for the optimal parameters that give the best ROC curve. • dmin and dmax are measured. • Step 3: Evaluate cyclicity:

  16. Details of optimization step

  17. Final Cyclicity Score Calculation • The final cyclicity score of a gene can be calculated by: • Where: • c is the partial cyclicity score calculated by the learning step • p is the partial cyclicity score calculated by the optimization step

  18. Validation Experiment • To validate their approach, the authors use the data below: • Synthetic data • Generated using the algorithm developed by Zhao et al. • The genes are transcribed at one invariant time • The cells de-synchronize and the peaks are smoothed over time • Multiplicative white Gaussian noise is added, with a noise standard deviation of 10% for positive samples and 20% for negative samples • Real data • Microarray data from Bar-Joseph et al. • Cell line used: synchronized foreskin fibroblast cells • Synchronized using double-thymidine block arrest
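The noise model on this slide can be sketched as below. The sketch assumes that "multiplicative" means each value x is replaced by x * (1 + s * e), with e drawn from a standard normal distribution and s the noise standard deviation. The clean sinusoidal profile stands in for the output of the synthetic data generator, which is not reproduced here; both it and the function name are assumptions.

```python
import math
import random

def add_multiplicative_noise(profile, noise_std, rng):
    """Scale each value by (1 + noise_std * e), with e ~ N(0, 1)."""
    return [x * (1.0 + noise_std * rng.gauss(0.0, 1.0)) for x in profile]

# Hypothetical clean profile: positive values, two cycles over 17 samples.
clean = [1.0 + 0.5 * math.sin(2 * math.pi * t / 8.0) for t in range(17)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
noisy_positive = add_multiplicative_noise(clean, 0.10, rng)  # 10% noise
noisy_negative = add_multiplicative_noise(clean, 0.20, rng)  # 20% noise
```

Because the noise scales with the signal, larger expression values receive larger absolute perturbations, and a value of zero stays zero.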

  19. Validation Experiment (Synthetic) • Dataset generated: • Positive instances: • 1000 synthetic time courses covering two cell cycles • Negative instances: • 1000 randomly fluctuating profiles, obtained by random time shuffling of cyclic data • Training the SVM in the learning step used: • 50 extra positive examples • 50 extra negative examples

  20. Validation Experiment (Synthetic) • Evaluation of the results on synthetic data • Ratios of genes scored > 0.5 and < 0.5: • Cyclic: 1 / 0.997 = 1.003 • Non-cyclic: 1 / 0.993 = 1.007 • Conclusion • A high differentiating power is observed • Cyclic genes are slightly favored.

  21. Validation Experiment (Real) • 480 cycling genes reported in the literature are considered: • Their pcomb scores are calculated and the distribution is plotted below:

  22. Validation Experiment (Real) • From the graph in the previous slide: • The frequency of flat-profile genes is uniformly distributed • The p score adds no information and is consistent with the c score. • The scores of fluctuating-profile genes are skewed toward the maximum value • The p score provides additional information about cyclicity. • The combination of the two scores performs better than either score alone. • These results indicate that the authors' approach performs well

  23. Validation Experiment (Real) • The Cyclebase database is used: • It classifies genes as cyclic based on experiments on HeLa cells • Among the 91 genes with low pcomb, Cyclebase classifies: • 18 as cyclic, 44 as non-cyclic, 29 as not classified • Excluding the unclassified genes, 71% of the genes (44 of 62) are classified by Cyclebase as non-cyclic: • This matches the analysis of the authors. • One gene, PRKD1, is confirmed to be non-cyclic
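The 71% figure follows directly from the counts on this slide, as this short check shows. The variable names are illustrative.

```python
# Cyclebase classification of the 91 genes with low pcomb.
cyclic, non_cyclic, unclassified = 18, 44, 29

# Leave out the genes Cyclebase does not classify.
classified = cyclic + non_cyclic

# Share of the classified genes that Cyclebase calls non-cyclic.
share_non_cyclic = non_cyclic / classified
print(f"{non_cyclic} of {classified} = {share_non_cyclic:.1%}")
```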

  24. Discovery • The 50 genes with the highest pcomb are selected • 5 known cycling genes were excluded [Bar-Joseph et al. (2008)] • 9 of them are chosen to be validated by experiment • Experiment on cell cycle-dependent expression: • Human fibroblasts are prepared from human foreskin • They were grown to 50% confluence • They were synchronized in G0 by serum deprivation. • Cultures were then released from arrest • Sampling (RT-PCR) is done at regular intervals to cover a cell cycle.

  25. Discovery (2) • In the experiment: • All 9 genes are cell cycle-regulated • Their expression levels peak in S phase for six and in G1 phase for four of them, respectively. • NCOR1 and EDF-1 are already known from other literature to be cell cycle-regulated. • The predictive power of the LEON algorithm is confirmed

  26. Discovery (3) • Four genes with a low combined score are considered: • Their expression is not regulated during the cell cycle • The LEON algorithm identifies cell cycle-regulated gene expression only. • Further analysis of another 4 genes is performed: • Cyclin A and B1 genes as positive controls • GAPDH and aldolase genes as negative controls • Cyclin A and B1 are cell cycle-dependent • The expression of the GAPDH and aldolase genes is constant • Cell cycling genes are expressed along cell division only. • These results demonstrate that the approach is successful in identifying cell cycle-regulated genes.

  27. END
