Exploratory Failure Time Analysis and Copy Number Variation Inference

Presentation Transcript


  1. Exploratory Failure Time Analysis and Copy Number Variation Inference. Cheng Cheng, Department of Biostatistics, St. Jude Children’s Research Hospital

  2. Outline Part I Background Part II Exploratory Failure Time Analysis Part III Copy Number Variation Inference

  3. I. Background • Nucleus, nucleotides, DNA, chromosomes, SNP • SNP arrays • Genome Wide Association Study (GWAS) • Multiple tests • Cause-specific failure and competing risk • Cumulative incidence function, Gray's test, Fine-Gray hazard rate regression model • Censoring at the time of a competing event: OK for testing stochastic independence, biased for estimation

  4. Animal Cell Organelles: Nucleus, Nucleolus, Endoplasmic Reticulum, Centriole, Centrosome, Golgi, Cytoskeleton, Cytosol, Mitochondrion, Secretory Vesicle, Lysosome, Peroxisome, Vacuole

  5. Nucleus Functions The cell nucleus is an organelle that forms the package for our genes and their controlling factors. • Store genes on chromosomes • Organize genes into chromosomes to allow cell division. • Transport regulatory factors & gene products via nuclear pores • Produce messages (messenger Ribonucleic acid or mRNA) that code for proteins • Produce ribosomes in the nucleolus • Organize the uncoiling of DNA to replicate key genes

  6. Chromosome inside nucleus DNA = deoxyribonucleic acid • What is a chromosome? • In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes. • Each chromosome is made up of DNA tightly coiled many times around proteins called histones that support its structure.

  7. Human chromosomes • In humans, each cell normally contains 23 pairs of chromosomes, for a total of 46. • Twenty-two of these pairs, called autosomes, look the same in both males and females. • The 23rd pair, the sex chromosomes, differs between males and females. • Females have two copies of the X chromosome. • Males have one X and one Y chromosome.

  8. Chromosome Structure • Each chromosome has a constriction point called the centromere, which divides the chromosome into two sections, or “arms.” • The short arm of the chromosome is labeled the “p arm.” The long arm of the chromosome is labeled the “q arm.” • Each chromosome has two chromatids as a result of duplication of the DNA which took place during interphase. The two chromatids are linked together at a centromere.

  9. DNA structure DNA is a double-stranded molecule twisted into a helix (think of a spiral staircase). Each spiraling strand, comprised of a sugar-phosphate backbone and attached bases, is connected to a complementary strand by non-covalent hydrogen bonding between paired bases. The bases are adenine (A), thymine (T), cytosine (C) and guanine (G).

  10. The genetic code is specified by the four nucleotide "letters" A (adenine), C (cytosine), T (thymine), and G (guanine). A Single Nucleotide Polymorphism (SNP) is a change of a single nucleotide within a person's DNA sequence, such as a T replacing one of the other three nucleotide letters (A, C, or G). SNPs occur in human DNA at a frequency of about one every 1,000 bases. These variations can be used to track inheritance in families.

  11. SNP Array Design • Genomic sequence (5´ to 3´) containing a T/G SNP • SNP probe = 25 bases • Probe quartet: Perfect Match and Mismatch for allele ‘A’, Perfect Match and Mismatch for allele ‘B’

  12. Hundreds of Millions of Pixel Intensities…..

  13. Genotype Calling AA AB BB

  14. Genome Wide Association Study (GWAS) • Typically 400,000 to 900,000 SNPs are investigated in a single study • The number of subjects in a study typically ranges from a few hundred to 20,000 • Each SNP takes three possible (generic) values “AA”, “AB”, “BB”, often coded as 0, 1, 2 • Each SNP in each individual has a unique value, which is one of 0, 1, or 2 • A small number of phenotypes: disease status (yes/no), or quantitative trait • This lecture: time to a cause-specific failure • n subjects, n observed trait values Y_1, …, Y_n, n observed SNP values for the ith SNP X_i1, …, X_in • Inference (test) for stochastic dependence of the ith SNP with the trait based on the dataset (X_ij, Y_j), j = 1, …, n; do this for each SNP; thus many tests of the null hypothesis of stochastic independence.
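The data layout described on this slide can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, and the mapping AA→0, AB→1, BB→2 (a count of B alleles) is one common convention assumed here, not something the slide fixes.

```python
# Assumed coding convention: number of B alleles per genotype call.
GENOTYPE_CODE = {"AA": 0, "AB": 1, "BB": 2}


def code_genotypes(calls):
    """Map generic genotype calls ("AA", "AB", "BB") to 0, 1, 2."""
    return [GENOTYPE_CODE[c] for c in calls]


def pair_with_trait(snp_calls, trait):
    """Build the per-SNP dataset (X_ij, Y_j), j = 1..n, for one SNP i."""
    if len(snp_calls) != len(trait):
        raise ValueError("one genotype call is needed per subject")
    return list(zip(code_genotypes(snp_calls), trait))


# Toy data for n = 4 subjects (made up for illustration).
calls = ["AA", "AB", "BB", "BB"]
trait = [2.5, 1.0, 4.2, 3.3]
dataset = pair_with_trait(calls, trait)
```

A test of stochastic independence would then be applied to one such dataset per SNP, which is what produces the large number of tests.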

  15. Massive Multiple Tests • “Genome-wide significance” Bonferroni-type adjustment: declare statistical significance if P ≤ 10^-7 (0.05/500K) • FDR and q value: Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS-B, 57, 289–300. Storey, J. D., Taylor, J. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. JRSS-B, 66, 187–205. • Profile information criteria: Cheng, C., Pounds, S., Boyett, J. M. et al (2004). Statistical significance threshold criteria for analysis of microarray gene expression data. Statistical Applications in Genetics and Molecular Biology 3, Article 36. URL //www.bepress.com/sagmb/vol3/iss1/art36 Cheng, C. (2006). An adaptive significance threshold criterion for massive multiple hypotheses testing. IMS Lecture Notes - Monograph Series, 2nd Lehmann Symposium – Optimality, 49, 51–76.
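As a rough illustration of the first two criteria on this slide, the sketch below computes the Bonferroni-type threshold 0.05/500K and applies the Benjamini-Hochberg (1995) step-up procedure to a small list of P values. The function names and the toy P values are assumptions made for the example; the profile information criteria are not shown.

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance threshold for m tests at family-wise level alpha."""
    return alpha / m


def benjamini_hochberg(pvalues, q):
    """Indices rejected by the BH step-up procedure at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose ordered P value falls under the BH line
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k = rank
    return sorted(order[:k])


threshold = bonferroni_threshold(0.05, 500_000)  # about 1e-7

# Toy P values (made up for illustration).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, q=0.05)
```

With these toy values only the two smallest P values are rejected at q = 0.05, while none would pass the genome-wide Bonferroni threshold.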

  16. Cause-specific failure and competing risk • Alive • Relapse: failure type 1 (of interest) • 2nd cancer: failure type 2 (competing risk/event) • Die in remission: failure type 3 (competing risk) • Klein, J. P. (2010) Competing risks. WIREs Comp Stat, www.wiley.com/wires/compstats, DOI: 10.1002/wics.83

  17. Cumulative incidence function (CIN) • (T, δ); F_j(t) = Pr(T ≤ t and δ = j) • Gray’s test: compare CIN across K groups; analog of the weighted log-rank test. Gray, R. J. (1988) A class of K-sample tests for comparing the cumulative incidence of a competing risk. Ann. Statist. 16, 1141-1154. • Fine-Gray’s CIN hazard rate regression model: analog of Cox’s hazard rate regression model. Fine, J. P., Gray, R. J. (1999) A proportional hazards model for the subdistribution of a competing risk. JASA, 94, 496-509. • Censor at the time of competing event
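To make the definition F_j(t) = Pr(T ≤ t and δ = j) concrete, here is a minimal sketch of the standard nonparametric estimate of the cumulative incidence function, alongside the naive "1 minus Kaplan-Meier" estimate obtained by censoring at the time of a competing event. The function names, the coding of censoring as cause 0, and the toy data are assumptions for illustration; Gray's test and the Fine-Gray model are not implemented here.

```python
def cumulative_incidence(times, causes, cause, t):
    """Estimate F_cause(t) = Pr(T <= t, delta = cause).

    causes[i] == 0 marks a censored observation (assumed coding).
    """
    data = sorted(zip(times, causes))
    n = len(data)
    surv = 1.0  # event-free survival (any cause) just before the current time
    cif = 0.0
    i = 0
    while i < n and data[i][0] <= t:
        tk = data[i][0]
        at_risk = n - i
        d_any = d_cause = 0
        j = i
        while j < n and data[j][0] == tk:  # handle tied times together
            if data[j][1] != 0:
                d_any += 1
            if data[j][1] == cause:
                d_cause += 1
            j += 1
        cif += surv * d_cause / at_risk
        surv *= 1.0 - d_any / at_risk
        i = j
    return cif


def one_minus_km(times, causes, cause, t):
    """Naive estimate: recode competing events as censored, then 1 - KM."""
    recoded = [c if c == cause else 0 for c in causes]
    return cumulative_incidence(times, recoded, cause, t)


# Toy data: cause 1 = event of interest, 2 = competing event, 0 = censored.
times = [1, 2, 3, 4, 5]
causes = [2, 1, 2, 1, 0]
cif = cumulative_incidence(times, causes, cause=1, t=5)
naive = one_minus_km(times, causes, cause=1, t=5)
```

On this toy sample the naive estimate (0.625) exceeds the cumulative incidence estimate (0.4), which illustrates why censoring at the competing event is biased for estimation even though it can be acceptable for testing.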

  18. II. Exploratory Failure Time Analysis • Large-scale Genomic Association Analysis • Feature (variable) screening and feature extraction • A Motivating Example from a GWAS • Correlation Profile Test (CPT) • Hypotheses • Correlation profile function • CPT statistic • Hybrid permutation test of significance • A Simulation Study: Strength and Weakness • Example: Analysis of SNPs on Chromosome 9 • Summary and Remarks • Feature Extraction (sparse regression) • Example: “Prognostic” Gene (RNA) expression • Summary and remarks

  19. Large-scale Genomic Association Analysis • Feature (variable) screening • Find individual genomic features (factor/predictor variables) associated with one or more phenotypes (response variables) • GWAS • Association: stochastic dependence • Parametric/semi-parametric approaches: linear models, GLMs, hazard rate (Cox) regression • Feature extraction • Find (linear) combinations (or sets) of genomic features (variables) associated with one or more phenotypes • Determine sets of variables using biological knowledge (gene signaling pathways, functional/ontology groups, etc.): GSEA • Variable/Model selection methods: ridge regression, LASSO, SCAD, SEAMLESS, sparse regression

  20. A Motivating Example • GWAS to screen SNP markers for risk of relapse in childhood leukemia patients

  21. A Motivating Example Need: a more omnibus and algorithmically robust test procedure

  22. Correlation Profile Test (CPT) • Model, Null and alternative hypotheses (classical survival setting)

  23. Correlation Profile Test (CPT) • Sample correlation profile function • Observed event point process of individual i • Can do rank transformation for continuous X

  24. Correlation Profile Test (CPT)

  25. Correlation Profile Test (CPT) • CPT statistic, hybrid permutation test

  26. Back to the SNP Example

  27. A Simulation Study • A model mimicking the SNP example • Generate X: Pr(X=0)=0.98, Pr(X=1)=0.015, Pr(X=2)=0.005 • Generate censoring time T_C ~ Exp(0.2) • Generate failure indicator I_F | X ~ Bernoulli(π_F); π_F = 0.2 exp{-θ(X-2)} • If I_F = 1, generate failure time T_F | X ~ LogNormal(βX, 1); else set T_F = ∞ • Generate competing risk indicator I_R ~ Bernoulli(0.1) • If I_R = 1, generate competing failure time T_R ~ Unif(0, 7); else set T_R = ∞ • Observed failure time T = min{T_C, T_F, T_R} • Repeat the above n times to simulate n individuals
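The data-generating steps on this slide could be sketched as follows. Several details are assumptions rather than facts from the slide: Exp(0.2) is read as an exponential with rate 0.2, LogNormal(βX, 1) as log-mean βX and log-standard-deviation 1, the failure probability is capped at 1 in case 0.2 exp{-θ(X-2)} exceeds it, and the status coding (0 = censored, 1 = failure of interest, 2 = competing event) and function names are hypothetical.

```python
import math
import random


def simulate_subject(theta, beta, rng):
    """Simulate one subject: genotype x, observed time t, status code."""
    u = rng.random()
    x = 0 if u < 0.98 else (1 if u < 0.995 else 2)  # Pr = 0.98, 0.015, 0.005

    t_c = rng.expovariate(0.2)  # censoring time; 0.2 taken as the rate (assumption)

    # Failure indicator; probability capped at 1 (assumption).
    pi_f = min(1.0, 0.2 * math.exp(-theta * (x - 2)))
    t_f = rng.lognormvariate(beta * x, 1.0) if rng.random() < pi_f else math.inf

    # Competing-risk indicator and time.
    t_r = rng.uniform(0.0, 7.0) if rng.random() < 0.1 else math.inf

    t = min(t_c, t_f, t_r)
    if t == t_c:
        status = 0  # censored
    elif t == t_f:
        status = 1  # failure of interest
    else:
        status = 2  # competing event
    return x, t, status


def simulate(n, theta, beta, seed=0):
    """Repeat the single-subject simulation n times."""
    rng = random.Random(seed)
    return [simulate_subject(theta, beta, rng) for _ in range(n)]


sample = simulate(n=1000, theta=0.5, beta=0.3, seed=1)
```

The values θ = 0.5 and β = 0.3 are placeholders chosen only to make the sketch runnable; the slide does not state which values were used.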

  28. A Simulation Study • A model mimicking the SNP example • Results table: power estimate (Pwr est.) and s.e.

  29. A Simulation Study Exact Proportional Hazard, continuous predictor

  30. A Simulation Study Exact Proportional Hazard, continuous predictor

  31. A Simulation Study Continuous predictor, deviation from proportional hazard

  32. A Simulation Study Ordinal predictor (AA, AB, BB), deviation from proportional hazard • Opposite scenario of the SNP example

  33. Example: Germline SNPs on Chr 9 and risk of relapse in childhood Acute Lymphoblastic Leukemia (ALL) • Alive • Relapse: failure type 1 (of interest) • 2nd cancer: failure type 2 (competing risk/event) • Die in remission: failure type 3 (competing risk) • 21,909 SNPs on Chr 9 obtained by Affy 100K and 500K SNP arrays were tested for association with relapse of childhood ALL

  34. Example: Germline SNPs on Chr 9 and risk of relapse in childhood Acute Lymphoblastic Leukemia (ALL) • n=707 subjects from the two most recent clinical trials at SJCRH • 21,909 SNPs • CPT test performed on each SNP, with 200 permutations in the hybrid permutation test • Significance determined by the profile info criteria Ip (Cheng et al. 2004); 200 SNPs were considered statistically significant, estimated FDR=48.7%

  35. ρ̂(t_j), j = 1, …, J = 9; test stat = -3.478

  36. Genotype frequencies: AA 5.1%, AB 28.7%, BB 66.2% • Gray’s test: P = 0.0451 • Fine-Gray regression: P = 0.0380; coeff = -0.3905

  37. ABL1 Gene Germline SNP
      Genotype counts (proportion): AA 36 (0.051), AB 201 (0.287), BB 464 (0.662), Tot 701 (1.00)
      Allele counts (proportion): A 273 (0.195), B 1129 (0.805)
      Genotype counts by protocol and risk group:
                                    AA   AB   BB
      T13B intermediate/high risk   12   27   75   (0.152)
      T13B Low risk                  7   33   67   (0.065)
      T15 standard/high risk        11   74  161   (0.047)
      T15 Low risk                   6   67  161   (0.026)
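The allele counts on slide 37 follow directly from the genotype counts, since each AA subject carries two A alleles and each AB subject carries one. A small sketch of that arithmetic, with a hypothetical function name, is:

```python
def allele_frequencies(n_aa, n_ab, n_bb):
    """Return (count_A, count_B, freq_A, freq_B) from genotype counts."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)  # two alleles per subject
    count_a = 2 * n_aa + n_ab
    count_b = 2 * n_bb + n_ab
    return count_a, count_b, count_a / total_alleles, count_b / total_alleles


# Genotype counts taken from the slide: AA 36, AB 201, BB 464.
count_a, count_b, freq_a, freq_b = allele_frequencies(36, 201, 464)
```

This reproduces the slide's allele counts of 273 (A) and 1129 (B), with proportions of about 0.195 and 0.805.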

  38. Extension to Recurrent Events Multiple event times • Model, Null and alternative hypotheses # events occurred ≤ t

  39. Extension to Recurrent Events • N = # events occurred ≤ t

  40. Summary and Remarks • Correlation Profile Test: • Computationally more robust • More omnibus: covers certain deviations from the semi-parametric hazard regression model • Highly competitive with other non-parametric procedures (Gray’s test, Jung’s test) • Relative deficiency vs. Cox model under PH ?? • Extension to recurrent-event phenotypes • Informative censoring in the presence of competing risk

  41. Feature Extraction (Sparse regression) • Identify (linear) combinations of covariate variables that are associated with the failure phenotype

  42. Feature Extraction (Sparse regression) • Sparse regression by the General Path Seeking (GPS) algorithm (Friedman 2008) • Exploratory failure time analysis by weighted least squares -- the association criteria • The modified GPS algorithm to find a solution • A small simulation study • Example: Gene (RNA) expression “prognostic” for relapse of childhood ALL

  43. Lasso (Tibshirani 1996), grouped lasso (Yuan and Lin 2006), SCAD (Fan and Li 2001), Elastic net (Zou and Hastie 2005), SEAL (Xihong Lin, 2009 JSM) • Sparse Regression by General Path Seeking (GPS, Friedman 2008) http://www-stat.stanford.edu/~jhf//ftp/GPSpub.pdf • General Setup

  44. Feature Extraction (Sparse regression) The general GPS algorithm

  45. Feature Extraction (Sparse regression) • Exploratory failure time analysis: setup

  46. Feature Extraction (Sparse regression) • Association criteria: Penalized weighted least squares

  47. Feature Extraction (Sparse regression) • The power penalty function |β|^γ, 0 < γ ≤ 1 • Panels: γ = 0.0001, γ = 0.5, γ = 1
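To show how the power penalty behaves across the three γ values on this slide, the sketch below evaluates it on a small coefficient vector. Summing |β_k|^γ over the coefficients is an assumption about how the penalty is applied to a vector, and the function name and example coefficients are hypothetical.

```python
def power_penalty(beta, gamma):
    """Sum of |beta_k| ** gamma over coefficients, for 0 < gamma <= 1."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must satisfy 0 < gamma <= 1")
    return sum(abs(b) ** gamma for b in beta)


# Toy coefficient vector with one exact zero (made up for illustration).
beta = [0.0, -2.0, 0.5]
penalty_lasso = power_penalty(beta, 1.0)      # gamma = 1: the lasso (L1) penalty
penalty_half = power_penalty(beta, 0.5)
penalty_tiny = power_penalty(beta, 0.0001)    # close to the number of non-zero coefficients
```

At γ = 1 the penalty is the L1 norm, while as γ approaches 0 it approaches a count of the non-zero coefficients, which is why small γ favors sparser solutions.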
