Genomic Signal Processing: Issues in Engineering Molecular ...

1. Genomic Signal Processing: Issues in Engineering Molecular Medicine
Edward R. Dougherty
Department of Electrical and Computer Engineering, Texas A&M University
Division of Computational Biology, Translational Genomics Research Institute
Department of Pathology, University of Texas, M.D. Anderson Cancer Center
10/21/2011, http://gsp.tamu.edu

3. Goals of Translational Genomics
Screen for key genes and gene families that explain specific cellular phenotypes (disease).
Use genomic signals to classify disease on a molecular level.
Build model networks to study dynamical genome behavior and derive intervention strategies to alter undesirable behavior.

4. Translational Genomics Tools
Signal processing, pattern recognition, information theory, control theory, network theory, communication theory.

5. Genomic Signal Processing
GSP: the analysis, processing, and use of genomic signals for gaining biological knowledge, and the translation of that knowledge into systems-based applications.
Signals generated by the genome must be processed to characterize their regulatory effects and their relationship to changes at both the genotypic and phenotypic levels.

6. Central Dogma of Molecular Biology

7. Transcription Factors

10. Microarrays
Expression microarrays result from a complex biochemical-optical system incorporating robotic spotting and computer image formation and analysis.
They facilitate large-scale surveys of gene expression in which transcript levels can be determined for thousands of genes simultaneously.
cDNA arrays: expressed sequence tags (ESTs).
Oligo arrays: synthetic oligonucleotides.
Microarrays involve image processing and signal extraction.

11. Microarray Process

12. Classification of Diseases
Find a feature set of expression profiles to classify disease.
Diagnose cancer: type, stage, prognosis.

13. BRCA Classification

15. Classifier Design
From a sample of size n, form an estimate ψ_n of the optimal classifier ψ_opt.
Design cost: Δ_n = ε_n − ε_opt, where ε_n and ε_opt are the errors of ψ_n and ψ_opt.
Key issue: good design often requires large samples, and it is often impossible to obtain samples large enough to sufficiently reduce E[Δ_n].

16. Overfitting
If we apply a complex classification rule to a small sample, the rule is likely to conform to the data too closely.
We constrain classifier complexity to avoid overfitting, thereby restricting ourselves to "easy" problems.

17. Constraint
To lower design cost, optimization is constrained to a subclass C of classifiers.
Constraint cost: Δ_C = ε_C − ε_d, where ε_C is the error of the best classifier in C and ε_d is the error of the unconstrained optimal classifier.
The savings in design error must exceed the cost of constraint.
Key problem: find appropriate constraints.
A constraint may be defined in accordance with a model, or experience may have shown that a certain constraint works well in a given setting.

18. Classifier Design Error

23. Bolstered Error Estimation
Estimate classifier error by spreading the data via bolstering kernels.
The error estimate results from integrating each kernel over the part of the domain where its point would be misclassified.
Braga-Neto, U. M., and E. R. Dougherty, "Bolstered Error Estimation," Pattern Recognition, 37 (6), 1267-1281, 2004.

24. Bolstering Properties
The error can be computed via integration: in closed form for LDA, and by Monte Carlo integration otherwise.
Choosing the variance of the bolstering kernel is key because it affects both the bias and the variance of the bolstered estimator.
A method for choosing the variance has been proposed.
Resubstitution results from zero bolstering variance.
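
The bolstering idea can be sketched in one dimension, where the kernel integral has a closed form. This is an illustrative sketch only, not the estimator from the cited paper: it assumes a hypothetical threshold classifier (predict 1 when x > threshold), a Gaussian kernel with a user-chosen standard deviation `sigma`, and made-up data. Setting `sigma` to zero recovers plain resubstitution, as the slide notes.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bolstered_resub_error(points, labels, threshold, sigma):
    """Bolstered resubstitution for the (assumed) rule: predict 1 if x > threshold.

    Each sample point carries a Gaussian kernel of standard deviation sigma;
    its contribution is the kernel mass lying on the misclassified side.
    """
    total = 0.0
    for x, y in zip(points, labels):
        if sigma == 0.0:
            # Degenerate kernel (point mass): ordinary resubstitution count.
            predicted = 1 if x > threshold else 0
            total += 1.0 if predicted != y else 0.0
        else:
            mass_above = 1.0 - normal_cdf((threshold - x) / sigma)
            # A label-0 point errs above the threshold, a label-1 point below it.
            total += mass_above if y == 0 else 1.0 - mass_above
    return total / len(points)

# Hypothetical sample: every point is classified correctly by threshold 0.
points = [-1.0, -0.2, 0.1, 0.9]
labels = [0, 0, 1, 1]
resub = bolstered_resub_error(points, labels, threshold=0.0, sigma=0.0)
bolstered = bolstered_resub_error(points, labels, threshold=0.0, sigma=0.5)
print(resub, round(bolstered, 3))
```

Resubstitution reports zero error here, while the bolstered estimate is positive because the two points near the threshold spread part of their kernel mass across the decision boundary.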

25. Deviation Distributions: CART, 5 Genes

26. Feature Selection Impacts Cross-Validation
Feature selection increases the already large deviation variance of cross-validation.
Coefficient of Relative Increase in Deviation Dispersion.
?opt : true error using best features.
?cv : true error using selected features.
Xiao, Y., Hua, J. and E. R. Dougherty, "Quantification of the Impact of Feature Selection on Cross-validation Error Estimation," EURASIP J. Bioinformatics and Systems Biology, 2007.

27. How Many Features?
Peaking phenomenon: overfitting.

28. Feature-Selection Problem

29. Optimal Number of Features
The optimal number of features depends on sample size, classification rule, and feature-label distribution.
Top: LDA, linear model, slightly correlated features.
Bottom: LDA, linear model, highly correlated features.
Hua, J., Xiong, Z., Lowey, J., Suh, E., and E. R. Dougherty, "Optimal Number of Features as a Function of Sample Size for Various Classification Rules," Bioinformatics, 21(8), 1509-1515, 2005.

30. Peaking Phenomenon is Nontrivial
Peaking can be later for smaller samples.
Top: 3NN, nonlinear model, modestly correlated features.
Bottom: Linear SVM, nonlinear model, modestly correlated features.
Hua, J., Xiong, Z., Lowey, J., Suh, E., and E. R. Dougherty, "Optimal Number of Features as a Function of Sample Size for Various Classification Rules," Bioinformatics, 21(8), 1509-1515, 2005.

31. Impact of Error Estimation on Feature Selection
Choice of error estimator can be more important than choice of algorithm.
LDA, Gaussian model, n = 50, 5 features from 20.
Sima, C., Attoor, S., Braga-Neto, U., Lowey, J., Suh, E., and E. R. Dougherty, "Impact of Error Estimation on Feature-Selection Algorithms," Pattern Recognition, 38 (12), 2472-2482, 2005.

32. What Can We Expect from Feature Selection?
Top: Regression of selected FS error on best FS error.
Bottom: Regression of best FS error on selected FS error.
Sima, C., and E. R. Dougherty, "What Should One Expect from Feature Selection in Small-Sample Settings," Bioinformatics, 22 (19), 2430-2436, 2006.

33. Decorrelation of True and Estimated Errors
With feature selection, the problem is decorrelation of the error estimate from the true error, not increased estimator variance.
Selecting 5 features from 200 with sample size 50.
Panels: with feature selection; without feature selection.
Hanczar, B., Hua, J., and E. R. Dougherty, "Is There Correlation between the Estimated and True Classification Errors in Small-Sample Settings?" IEEE Statistical Signal Processing Workshop, Madison, August, 2007.

34. Error Bounds
Distribution-free bounds exist on the RMS deviation between the true error and the error estimate.
Typically, they are useless for small samples.
For n = 100, RMS ≤ 0.435.
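
The slide does not say which bound produces the quoted figure. One well-known distribution-free bound that gives a value close to it is the leave-one-out bound for the k-nearest-neighbor rule, RMS ≤ sqrt((6k + 1)/n). The sketch below assumes that bound and k = 3; both are assumptions made for illustration, not statements from the slide.

```python
import math

def knn_loo_rms_bound(k, n):
    """Assumed bound: RMS of leave-one-out deviation for kNN <= sqrt((6k + 1) / n)."""
    return math.sqrt((6 * k + 1) / n)

def sample_size_for_bound(k, target_rms):
    """Smallest n for which the assumed bound guarantees RMS <= target_rms."""
    return math.ceil((6 * k + 1) / target_rms ** 2)

bound = knn_loo_rms_bound(k=3, n=100)
n_needed = sample_size_for_bound(k=3, target_rms=0.05)
print(round(bound, 3), n_needed)
```

Under these assumptions, guaranteeing an RMS of 0.05 would take a sample in the thousands, which is why such bounds say little in small-sample settings.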

35. Salient Points for Small Samples
Beware of complex classifiers.
Keep feature sets small.
Avoid cross-validation where possible.
Recognize the heavy influence of the feature-label distribution and classification rule.
Report a list of classifiers and feature sets for analysis.
Issues: analysis of classifier and feature-selection performance; better error estimation; mathematical analysis of error estimators.
Braga-Neto, U., and E. R. Dougherty, "Exact Performance of Error Estimators for Discrete Classifiers," Pattern Recognition, 38 (11), 1799-1814, 2005.

37. Apparent Clusters in Microarray Data

38. Example: 2 or 3 clusters?
What is the best separation?

39. The Clustering Problem
Apply a clustering algorithm to data and form clusters, as every clustering algorithm does.
Say, "Gee Whiz!" There are known related genes in a cluster.
Where is the possibility for verification by prediction?
Indeed, what is to be verified?

41. Probabilistic Theory of Clustering
Clustering theory in the context of random sets.
Probabilistic error measure based on points being clustered correctly.
Bayes clusterer (optimal clustering algorithm).
Learning theory for clustering algorithms.
Dougherty, E. R., and M. Brun, "A Probabilistic Theory of Clustering," Pattern Recognition, 37 (5), 917-925, 2004.

42. Example of Clustering Error
Left: realization of point process.
Right: output of hierarchical clustering.
Error: 40%.
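
An error measure "based on points being clustered correctly" can be sketched as the fraction of misassigned points, minimized over all ways of matching cluster names to the true labels. This is a minimal illustration under that assumption; the label vectors are made up and are not the point process shown on the slide.

```python
from itertools import permutations

def clustering_error(true_labels, cluster_labels, n_clusters):
    """Fraction of points assigned to the wrong cluster, minimized over
    relabelings, since cluster names produced by an algorithm are arbitrary."""
    n = len(true_labels)
    best = n
    for perm in permutations(range(n_clusters)):
        mismatches = sum(
            1 for t, c in zip(true_labels, cluster_labels) if perm[c] != t
        )
        best = min(best, mismatches)
    return best / n

# Hypothetical ground truth and algorithm output for ten points.
truth = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
output = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
error = clustering_error(truth, output, n_clusters=2)
print(error)  # 0.4
```

Enumerating permutations is only practical for a handful of clusters; it is used here for clarity.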

44. Kendall's Correlation for Indices
Top: realization of point process.
Bottom: Kendall's correlation for different indices across different clustering algorithms.

45. Regulatory Modeling
Find analytical tools for genomic data that can detect multivariate influences on decision-making produced by complex genetic networks.
Construct the minimal-complexity network that can model sufficient information transfer to achieve the goal: less computation, and less data required for inference.
Given a model, discover ways to intervene in its dynamics to obtain desired behavior.

46. Gene Interaction
Genes interact via multi-protein complexes, feedback regulation, and pathway networks.
Complex molecular networks underlie biological function.
Most diseases do not result from a single gene product.
These interrelationships among genes constitute gene regulatory networks.

47. Muscle Network (Drosophila)
A gene network shows regulatory interaction.
msp-300 is a hub gene that regulates genes encoding motor proteins responsible for muscle contraction.
Zhao, W., Serpedin, E., and E. R. Dougherty, "Inferring Gene Regulatory Networks from Time Series Data Using the Minimum Description Length Principle," Bioinformatics, 22 (17), 2129-2135, 2006.

48. Desirable Model Properties
Incorporate rule-based dependencies between genes. Rule-based dependencies may constitute important biological information.
Allow systematic study of global network dynamics, in particular individual gene effects on long-run network behavior.
Cope with uncertainty: small sample size, noisy measurements, robustness.
The system must be open to external latent variables.

49. Infer Regulatory Genetic Function?

50. Inference From Data
Key issues: complex model; limited data; lack of appropriate time-course data for dynamics.
Fundamental principle: use the simplest model that provides sufficient information to accomplish the task at hand and that is compatible with the data.
Formalize inference by postulating criteria that constitute a solution space for the inverse problem.
Constraint criteria are composed of restrictions on the form of the network (biological, complexity).
Operational criteria are composed of relations that must be satisfied between the model and the data.

51. Regulatory Logic
Jacques Monod: "The logic of biological regulatory systems abides … like the workings of computers, by the propositional algebra of George Boole."
Shmulevich I., and E. R. Dougherty, Genomic Signal Processing, Princeton University Press, Princeton, 2007.

52. Boolean Predictive Relationships
Boolean relationships in the NCI 60 ACDS (Anti-Cancer Drug Screen).
MRC1 = VSNL1 ? HTR2C
SCYA7 = CASR ? MU5SAC
Capture switch-like (ON/OFF) behavior.
Pal, R., Datta, A., Fornace, A. J., Bittner, M. L., and E. R. Dougherty, "Boolean Relationships Among Genes Responsive to Ionizing Radiation in the NCI 60 ACDS," Bioinformatics, 21(8), 1542-1549, 2005.

53. Basic Structure of Boolean Networks

54. Network Dynamics

55. State Space of Boolean Networks
Similar gene activity profiles (GAPs) lie close together.
There is an inherent directionality in the state space.
Some states are attractors (or limit-cycle attractors).
The system may alternate between several attractors.
Other states are transient.
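
The attractor and transient structure described above can be explored by exhaustive enumeration when the network is small. The three-gene update rules below are hypothetical, chosen only so that the state space contains one fixed-point attractor, one limit cycle, and a few transient states.

```python
from itertools import product

def step(state):
    """One synchronous update of a hypothetical 3-gene Boolean network:
    a' = a, b' = c AND a, c' = (NOT b) AND a."""
    a, b, c = state
    return (a, c & a, (1 - b) & a)

def find_attractors(update, n_genes):
    """Follow every start state until a state repeats; the repeated tail is an attractor."""
    attractors = set()
    for state in product((0, 1), repeat=n_genes):
        position = {}
        path = []
        while state not in position:
            position[state] = len(path)
            path.append(state)
            state = update(state)
        cycle = path[position[state]:]
        # Rotate so the smallest state comes first, giving one canonical form per cycle.
        i = cycle.index(min(cycle))
        attractors.add(tuple(cycle[i:] + cycle[:i]))
    return attractors

attractors = find_attractors(step, n_genes=3)
attractor_states = {s for cycle in attractors for s in cycle}
transients = sorted(set(product((0, 1), repeat=3)) - attractor_states)
for cycle in sorted(attractors, key=len):
    print(len(cycle), cycle)
print("transient:", transients)
```

Enumeration costs on the order of 2^n updates, so this approach is limited to networks with few genes.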

58. Properties of Probabilistic Boolean Networks (PBNs)
Share the rule-based properties of Boolean networks (BNs).
Model uncertainty.
Dynamic behavior is studied via Markov chains.
Close relationship to Bayesian networks.
Attractors of a PBN are the attractors of the constituent BNs.
The system can leave a BN attractor cycle when the BN switches.
Brun, M., Dougherty, E. R., and I. Shmulevich, "Steady-State Probabilities for Attractors in Probabilistic Boolean Networks," Signal Processing, in press, 2005.
Lahdesmaki, H., Hautaniemi, S., Shmulevich, I., and Yli-Harja, O., "Relationships Between Probabilistic Boolean Networks and Dynamic Bayesian Networks as Models of Gene Regulatory Networks," Signal Processing, in press, 2005.

59. Various Design Methods Proposed
Find genes with predictive capability for the target gene (CoD).
Use mutual information to find related genes.
Use the MDL principle.
Optimize connectivity in a Bayesian framework relative to the gene profiles in the data.
Find networks satisfying biologically related constraints such as limited attractor structure, transient time, and connectivity.
Assuming steady-state data, require data states to be attractors.
Assuming biological determinism within a given cellular context, design a PBN under the assumption that constituent BNs produce consistent data subsets in the sample data.
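
The first design method relies on the coefficient of determination (CoD). The slide only names it; the sketch below assumes the usual definition, CoD = (ε_0 − ε_opt)/ε_0, where ε_0 is the error of the best constant guess of the target gene and ε_opt is the error of the best predictor built from the chosen genes. Errors are estimated by plug-in counts on a made-up binary sample, which is optimistic for small samples.

```python
from collections import Counter, defaultdict

def coefficient_of_determination(predictors, target):
    """Plug-in CoD estimate for binary data (assumed definition, see lead-in).

    predictors: list of tuples of 0/1 values, one tuple per sample.
    target:     list of 0/1 target values, one per sample.
    """
    n = len(target)
    # Error of the best constant predictor: guess the majority target value.
    eps0 = 1.0 - max(Counter(target).values()) / n
    if eps0 == 0.0:
        return 0.0  # convention used here: a constant target cannot be improved on
    # Error of the best predictor: majority target value per observed input pattern.
    table = defaultdict(Counter)
    for x, y in zip(predictors, target):
        table[tuple(x)][y] += 1
    errors = sum(sum(c.values()) - max(c.values()) for c in table.values())
    return (eps0 - errors / n) / eps0

# Hypothetical sample in which the target equals x1 AND x2.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)] * 2
target = [0, 0, 0, 1] * 2
cod_pair = coefficient_of_determination(pairs, target)
cod_single = coefficient_of_determination([(x1,) for x1, _ in pairs], target)
print(cod_pair, cod_single)  # 1.0 0.0
```

In this example neither gene alone improves on the constant guess, while the pair predicts the target perfectly, which is the kind of multivariate relationship a univariate screen would miss.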

61. Intervention
A key goal of network modeling is to determine intervention targets (genes) such that the network can be "persuaded" to transition into desired states.
We desire genes that are the best potential "lever points" in the sense of having the greatest possible impact on desired network behavior.
Shmulevich, I., Dougherty, E. R., and W. Zhang, "Gene Perturbation and Intervention in Probabilistic Boolean Networks," Bioinformatics, 18, 1319-1331, 2002.

62. Dynamics
Dynamics of PBNs can be studied using Markov chain theory.
We can ask the question: "In the long run, what is the probability that some given gene(s) will be ON/OFF?"
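
The long-run question can be answered numerically from the Markov chain of a small PBN. The sketch below is illustrative only: the two constituent networks, the selection probabilities, and the per-gene perturbation probability `p` are all assumptions. The perturbation is included so that the chain is ergodic and has a unique steady-state distribution.

```python
from itertools import product

# Two hypothetical constituent Boolean networks on genes (a, b).
def bn_swap(state):
    a, b = state
    return (b, a)

def bn_and_or(state):
    a, b = state
    return (a & b, a | b)

def transition_matrix(networks, selection_probs, p, n_genes):
    """Markov chain of a PBN with random gene perturbation (assumed model).

    With probability (1 - p)^n no gene is perturbed and a randomly selected
    network updates the state; otherwise the perturbed genes are flipped.
    """
    states = list(product((0, 1), repeat=n_genes))
    index = {s: i for i, s in enumerate(states)}
    P = [[0.0] * len(states) for _ in states]
    no_perturbation = (1 - p) ** n_genes
    for s in states:
        row = P[index[s]]
        for f, c in zip(networks, selection_probs):
            row[index[f(s)]] += no_perturbation * c
        for t in states:
            d = sum(x != y for x, y in zip(s, t))  # Hamming distance
            if d > 0:
                row[index[t]] += p ** d * (1 - p) ** (n_genes - d)
    return states, P

def steady_state(P, iterations=5000):
    """Power iteration on the row vector w(k + 1) = w(k) P."""
    w = [1.0 / len(P)] * len(P)
    for _ in range(iterations):
        w = [sum(w[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return w

states, P = transition_matrix([bn_swap, bn_and_or], [0.5, 0.5], p=0.05, n_genes=2)
pi = steady_state(P)
prob_a_on = sum(prob for s, prob in zip(states, pi) if s[0] == 1)
print(round(prob_a_on, 3))
```

The state space grows as 2^n, so explicit matrices like this are only feasible for small networks.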

63. Medical Benefits of Network Intervention
Prediction of new targets based on pathway context.
Stress and toxic response mechanisms.
Off-target effects of therapeutic compounds.
Characterization of disease states by dynamic behavior.
Gene- and protein-expression signatures for diagnostics.
Regulatory analysis for therapeutic intervention.

64. Possible Intervention Goals
Minimize the mean first passage time to a desirable state.
Maximize the probability of reaching a desirable state before a certain fixed time.
Minimize the time needed to reach a desirable state with a given fixed probability.
Shmulevich, I., Dougherty, E. R., and W. Zhang, "Gene Perturbation and Intervention in Probabilistic Boolean Networks," Bioinformatics, Vol. 18, 1319-1331, 2002.
Shmulevich, I., Dougherty, E. R., and W. Zhang, "Control of Stationary Behavior in Probabilistic Boolean Networks by Means of Structural Intervention," Biological Systems, Vol. 10, 431-446, 2002.
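
The first goal can be made concrete with a small calculation. Mean first passage times m(x) to a target set satisfy m(x) = 0 on the target and m(x) = 1 + Σ_y P(x, y) m(y) elsewhere. The sketch below solves this by fixed-point iteration on a hypothetical three-state chain; treating each candidate intervention as a choice of start state is an assumption made for illustration.

```python
def mean_first_passage_times(P, targets, iterations=500):
    """Iterate m(x) = 1 + sum_y P[x][y] * m(y), holding m = 0 on the target set."""
    n = len(P)
    m = [0.0] * n
    for _ in range(iterations):
        m = [
            0.0 if i in targets else 1.0 + sum(P[i][j] * m[j] for j in range(n))
            for i in range(n)
        ]
    return m

# Hypothetical chain; state 2 is the desirable state.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
m = mean_first_passage_times(P, targets={2})

# Assumed setup: each candidate intervention moves the network to one of these states.
candidate_starts = [0, 1]
best_start = min(candidate_starts, key=lambda s: m[s])
print([round(x, 6) for x in m], best_start)  # [4.0, 2.0, 0.0] 1
```

The iteration converges when the target is reachable from every state; for larger chains a direct linear solve would be the more usual choice.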

65. Where and How to Intervene?

67. Optimal Control
Key objective: optimally manipulate the external controls to move the GAP from an undesirable pattern to a desirable pattern.
Use available information, e.g., phenotypic responses, tumor size, etc.
Require a paradigm for modeling the evolution of the GAP under different controls. The PBN is one such paradigm.
Use the associated Markov chain.

68. Control in PBNs
Transition probabilities depend on external control inputs, e.g., chemotherapy, radiation, etc.
Assume m control inputs u1, u2, ..., um.
Each input can take on the value 0 (not applied) or 1 (applied).
The values of the control inputs can be changed with time.

69. Control Setting
Control input vector at time k: [u1(k), ..., um(k)].
Change both GAP and control vector to integers, z(k) and v(k).
Then, for n genes, z(k) can take on 2^n values and v(k) can take on 2^m values.
We have a system w(k + 1) = w(k)A(v(k)), where w(k) is the probability distribution over states at time k.
A is a stochastic matrix dependent on the control input.
We have a controlled homogeneous Markov chain.
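
The update w(k + 1) = w(k)A(v(k)) can be traced by hand on a toy system. The sketch below assumes one gene (two states) and one control input, with made-up matrices; the zero-based integer encoding in `to_index` is also an assumption, since the slide does not give the encoding.

```python
def to_index(bits):
    """Encode a binary vector as a zero-based integer (assumed convention)."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value

def propagate(w0, A, controls):
    """Apply w(k + 1) = w(k) A(v(k)) for a sequence of control values."""
    w = list(w0)
    for v in controls:
        w = [sum(w[i] * A[v][i][j] for i in range(len(w))) for j in range(len(w))]
    return w

# Hypothetical transition matrices: A[0] without control, A[1] with control.
A = {
    0: [[0.5, 0.5], [0.0, 1.0]],
    1: [[1.0, 0.0], [0.5, 0.5]],
}
w0 = [0.0, 1.0]  # start in state 1 with certainty
w3 = propagate(w0, A, controls=[1, 1, 0])
print(to_index((1, 0, 1)), w3)  # 5 [0.375, 0.625]
```

Two controlled steps move most of the probability to state 0, and the uncontrolled third step lets part of it drift back.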

70. Costs of Applying Control
Choose v(0), v(1), ... to minimize a particular cost function.
Choice of cost function?
Consider a finite treatment horizon: k = 0, 1, ..., M − 1.
Let C_k(z(k), v(k)) denote the cost of applying control v(k) at state z(k). (Input from biologists.)
Cost of control over M − 1 time steps: [equation not preserved]

71. Terminal Costs
Net result of control action: the system ends up in z(M).
Penalize z(M) in the cost to reduce the chances of ending up in an undesirable state.
Define C_M(z(M)) to be the terminal cost of ending up in state z(M).
Partition states into equivalence classes.
Assign higher penalties to states associated with rapid cell proliferation or reduced apoptosis and lower penalties to states associated with the normal cell cycle. (Input from biologists.)

72. Total Cost

73. Optimal Control Problem
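
The equations on the total-cost and optimal-control slides are not preserved in this transcript. A standard formulation consistent with the surrounding slides, assumed here, minimizes the expected sum of the per-step costs C_k(z(k), v(k)) plus the terminal cost C_M(z(M)), and solves it by backward dynamic programming. The sketch uses hypothetical matrices and costs for a two-state, single-control system.

```python
def optimal_control(A, step_cost, terminal_cost, horizon):
    """Backward dynamic programming over a finite horizon (assumed formulation).

    J_M(z) = C_M(z)
    J_k(z) = min over v of [ C_k(z, v) + sum_y A[v][z][y] * J_{k+1}(y) ]
    Returns the cost-to-go at k = 0 and the policy, policy[k][z] = control at step k.
    """
    n_states = len(terminal_cost)
    J = list(terminal_cost)
    policy = []
    for k in range(horizon - 1, -1, -1):
        new_J, mu = [], []
        for z in range(n_states):
            costs = [
                step_cost(k, z, v) + sum(A[v][z][y] * J[y] for y in range(n_states))
                for v in range(len(A))
            ]
            best_v = min(range(len(A)), key=costs.__getitem__)
            new_J.append(costs[best_v])
            mu.append(best_v)
        J = new_J
        policy.insert(0, mu)
    return J, policy

# Hypothetical system: state 0 is desirable, state 1 undesirable.
A = [
    [[0.5, 0.5], [0.0, 1.0]],  # v = 0: no control applied
    [[1.0, 0.0], [0.5, 0.5]],  # v = 1: control applied
]
step_cost = lambda k, z, v: float(v)  # applying control costs 1 per step
terminal_cost = [0.0, 5.0]            # heavy penalty for ending in state 1
J0, policy = optimal_control(A, step_cost, terminal_cost, horizon=2)
print(J0, policy)  # [2.0, 3.25] [[1, 1], [1, 1]]
```

With these numbers the terminal penalty outweighs the treatment cost, so the computed policy applies control in every state at every step; lowering the penalty or raising the treatment cost changes that balance.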

74. WNT5A Network
Up-regulated WNT5A is associated with increased metastasis.
The cost function penalizes WNT5A being up-regulated.
Optimal control policy with pirin as control gene.

75. Shift of Steady-State Distribution
Optimal (infinite-horizon) control with pirin has shifted the steady-state distribution to states with WNT5A down-regulated: (a) with control; (b) without control.
