
Real data and GWAS Case Study



Presentation Transcript


  1. Real data and GWAS Case Study CSCI2820 – Medical Bioinformatics

  2. Outline • Introduction to Biology • Introduction to CS • Data Generation • Data Acquisition and Databases • A closer look: Linkage Disequilibrium • GWAS Case Study

  3. DNA DNA:  the chemical inside the nucleus of a cell that carries the genetic instructions for making living organisms.

  4. DNA Organization in the Human Genome Genome facts • The pair of sex chromosomes determines sex • 2 copies of each autosome • ~3.2 billion base pairs • Around 2.9 billion bases organized into scaffolds • Only about 90% of the genome has been sequenced!

  5. Gene • A gene is the functional and physical unit of heredity passed from parent to offspring. • Genes are pieces of DNA, and most genes contain the information for making a specific protein. http://en.wikipedia.org/wiki/Gene

  6. Central Dogma http://www.dnalc.org/resources/3d/ • gene • Unit of inheritance • Transcribed into mRNA • mRNA • messenger RNA • blueprint for protein • proteins • Essential molecules that are active in practically all cellular processes • Genes – RNA – Proteins • Useful video: http://www.dnalc.org/resources/3d/central-dogma.html
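The gene to mRNA to protein flow on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the input is the coding (sense) strand of a gene and using the standard genetic code; the function names `transcribe` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon ('*') is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGCAGTAA")
print(mrna, translate(mrna))
```

The toy input encodes methionine followed by glutamine and a stop codon, which happens to match the first two residues (MET, GLN) of the ubiquitin structure shown later in the deck.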

  7. Variation • Single base mutation, indels • Structural Variation • Deletion • Duplication • Translocation • Inversion • Recombination http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism https://sites.google.com/site/lifesciencesinmaine/5-cell-division-reproduction-and-dna

  8. Recombination

  9. Intro to CS • Algorithm • “a procedure for solving a mathematical problem in a finite number of steps…” • Input -> Computation -> Output • E.g. sorting n numbers • Theory • Analysis of algorithms • Application • For biologists: Mathematica programming! • Mathematica demo http://commons.wikimedia.org/wiki/File:Selection-Sort-Animation.gif
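The sorting example and the linked selection sort animation can be made concrete with a short sketch. It is written in Python rather than Mathematica, purely as an illustration of the input, computation, output pattern.

```python
def selection_sort(values):
    """Return a sorted copy of `values` using selection sort (O(n^2) comparisons)."""
    items = list(values)  # work on a copy so the input is left unchanged
    for i in range(len(items) - 1):
        # Find the index of the smallest element in the unsorted tail items[i:].
        smallest = i
        for j in range(i + 1, len(items)):
            if items[j] < items[smallest]:
                smallest = j
        # Move it to the front of the unsorted tail.
        items[i], items[smallest] = items[smallest], items[i]
    return items


print(selection_sort([5, 2, 4, 1, 3]))
```

The nested loops are what the "analysis of algorithms" bullet refers to: for n numbers the inner comparison runs about n²/2 times, regardless of the input order.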

  10. Bioinformatics • Regardless of your profession, it is important to study both the biological and computational aspects of the problem • Understanding the biology may help computational researchers create more accurate models, more accurate solutions, help identify biases, etc… • Understanding the computation may help biologists compute better results, create a better study design, develop fine-tuned solutions to unresolved problems, etc…

  11. Data Generation • Types of Data • Variation (SNPs, structural) • Genotype • Haplotype • Sequence Reads • Protein Structure • Genes • Technologies • SNP Array • Sequencing • Example genotype: {A,C} C {G,T} {C,T} {C,A} {T,A} • Example haplotypes: ACGCCT and CCTTAA (drawn with their complementary strands TGCGGA and GGAATT) • Algorithmic Opportunity! Input: Genotypes Output: Haplotypes

  12. Haplotype Phasing • Haplotype phasing: separate an individual’s paired chromosomes (genotypes) into the maternal and paternal chromosomes (haplotypes) • Example genotype: 100121200 (0 and 1 mark homozygous sites, 2 marks a heterozygous site) • Explanation 1: hap 1 = 100111100, hap 2 = 100101000 • Explanation 2: hap 1 = 100111000, hap 2 = 100101100
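The two explanations for genotype 100121200 can be reproduced by brute-force enumeration. The sketch below assumes the encoding implied by the slide example (0 and 1 are homozygous sites, 2 is a heterozygous site); `compatible_phasings` is a hypothetical helper name, and this is not a practical phasing algorithm, since the number of candidates doubles with every heterozygous site.

```python
from itertools import product


def compatible_phasings(genotype: str):
    """List every haplotype pair that could explain a 0/1/2-encoded genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    if not het_sites:
        return [(genotype, genotype)]

    phasings = []
    # Fix allele 1 on hap 1 at the first heterozygous site so that swapping
    # hap 1 and hap 2 is not counted as a different explanation.
    for rest in product("10", repeat=len(het_sites) - 1):
        hap1, hap2 = list(genotype), list(genotype)
        for site, allele in zip(het_sites, ("1",) + rest):
            hap1[site] = allele
            hap2[site] = "0" if allele == "1" else "1"
        phasings.append(("".join(hap1), "".join(hap2)))
    return phasings


for hap1, hap2 in compatible_phasings("100121200"):
    print(hap1, hap2)
```

With k heterozygous sites there are 2^(k-1) distinct explanations, which is why real phasing methods rely on population data or family structure instead of enumeration.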

  13. SNP Arrays SNP array intensity allele calls Probe intensities Allele 0 Allele 1 http://www.sanger.ac.uk/resources/software/illuminus/ http://www-microarrays.u-strasbg.fr/base.php?page=affySNPsE.php

  14. Sanger Sequencing Long reads: ~500-1000bp Low error rates Very slow

  15. High-throughput Sequencing • Also termed next-generation sequencing • Illumina • 454 • SOLiD • DNA is fragmented, amplified, and fixed onto an array; bases are then added • Single molecule or 3rd generation technologies • Source of bias • Error signature • Short reads: ~50-200bp (454 can get up to 1kb) • Generally higher error rates than Sanger • Extremely fast and parallel

  16. NCBI • http://www.ncbi.nlm.nih.gov/

  17. EBI • http://www.ebi.ac.uk/

  18. HapMap • http://hapmap.ncbi.nlm.nih.gov/

  19. GWAS Data • International Multiple Sclerosis Genetics Consortium • MS Data: • 931 Trios (mother, father, affected child) • ~350k SNPs • Wellcome Trust Case-Control Consortium • Covers many diseases • dbGaP • Repository for association studies

  20. 1000 Genomes • Aims to sequence the genomes of 1000 individuals • Many individuals taken from HapMap samples • Data available from 3 pilot studies • High coverage, full genome sequencing of 2 trios • Low coverage, genome sequencing on several individuals • High coverage, exome sequencing on several individuals

  21. Protein Data Bank

  22. PDB File
      HEADER    CHROMOSOMAL PROTEIN                     02-JAN-87   1UBQ
      TITLE     STRUCTURE OF UBIQUITIN REFINED AT 1.8 ANGSTROMS RESOLUTION
      COMPND    MOL_ID: 1;
      COMPND   2 MOLECULE: UBIQUITIN;
      COMPND   3 CHAIN: A;
      … … …
      ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
      ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
      ATOM      3  C   MET A   1      26.913  26.639   3.531  1.00  9.62           C
      ATOM      4  O   MET A   1      27.886  26.463   4.263  1.00  9.62           O
      ATOM      5  CB  MET A   1      25.112  24.880   3.649  1.00 13.77           C
      ATOM      6  CG  MET A   1      25.353  24.860   5.134  1.00 16.29           C
      ATOM      7  SD  MET A   1      23.930  23.959   5.904  1.00 17.17           S
      ATOM      8  CE  MET A   1      24.447  23.984   7.620  1.00 16.11           C
      ATOM      9  N   GLN A   2      26.335  27.770   3.258  1.00  9.27           N
      ATOM     10  CA  GLN A   2      26.850  29.021   3.898  1.00  9.07           C
      ATOM     11  C   GLN A   2      26.100  29.253   5.202  1.00  8.72           C
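To show how the ATOM records above can be read programmatically, here is a minimal sketch that parses three of them and measures the distance between two atoms. It splits each record on whitespace, which works for the records shown on this slide; real PDB files are fixed-column, and adjacent fields can run together, so a production parser should slice by column or use an established library. The field names in the dictionary are chosen for this example.

```python
import math

# Three ATOM records copied from the 1UBQ excerpt on the slide.
PDB_ATOM_LINES = """\
ATOM 1 N MET A 1 27.340 24.430 2.614 1.00 9.67 N
ATOM 2 CA MET A 1 26.266 25.413 2.842 1.00 10.38 C
ATOM 9 N GLN A 2 26.335 27.770 3.258 1.00 9.27 N
"""


def parse_atom(line: str) -> dict:
    """Parse one whitespace-separated ATOM record (simplification, see lead-in)."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "res_name": f[3],
        "chain": f[4],
        "res_seq": int(f[5]),
        "xyz": (float(f[6]), float(f[7]), float(f[8])),  # coordinates in angstroms
        "occupancy": float(f[9]),
        "b_factor": float(f[10]),
        "element": f[11],
    }


def distance(a: dict, b: dict) -> float:
    """Euclidean distance between two atoms, in angstroms."""
    return math.dist(a["xyz"], b["xyz"])


atoms = [parse_atom(l) for l in PDB_ATOM_LINES.splitlines() if l.startswith("ATOM")]
print(round(distance(atoms[0], atoms[1]), 3))  # N to CA distance within MET 1
```

The N to CA distance of the first residue comes out close to 1.47 Å, a typical covalent bond length, which is a quick sanity check that the coordinates were read in the right order.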

  23. Linkage Disequilibrium • D’ in real data • HLA-DRA: Chromosome 6 bases 32515-32520kb • Surrounding area: 32400-32600kb • LD in different populations • LD in different phasings • LD in different regions of the genome

  24. Linkage Disequilibrium heat maps • The markers are distributed along the x-axis. • Each cell represents a pair of SNPs; the darker the red color, the higher the LD between the two markers. • CEU = Utah residents of northern and western European ancestry • YRI = Yoruba in Ibadan, Nigeria (30 trios)
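Each cell of such a heat map is a pairwise LD statistic. As a hedged sketch, the usual definitions of D, D' and r² for two biallelic SNPs can be computed from phased haplotypes as follows; the function name `ld_stats` and the 0/1 allele coding are assumptions made for this example, and the slides do not say which software produced the plots.

```python
def ld_stats(haplotypes):
    """Return (D, D', r^2) for two biallelic SNPs.

    `haplotypes` is a list of (a, b) pairs, where a and b are 0 or 1 and
    give the allele carried at SNP A and SNP B on one phased chromosome.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b
    # D' rescales D by the largest magnitude it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0         # signed; often reported as |D'|

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


print(ld_stats([(1, 1), (1, 1), (0, 0), (0, 0)]))  # alleles always travel together
print(ld_stats([(1, 1), (1, 0), (0, 1), (0, 0)]))  # alleles are independent
```

Because the input is phased haplotypes, the result depends on how the genotypes were phased, which is the point of the "LD in different phasings" bullet.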

  25. A GWAS Case Study: Risk Alleles for Multiple Sclerosis Identified by a Genomewide Study

  26. The Biology of Multiple Sclerosis • A chronic inflammatory disease of the central nervous system (CNS), the brain and the spinal cord. • A malfunction of the immune system which leads to attacks against, and causes destruction of the myelin sheath. • Symptoms range from mild muscle weakness to partial or complete paralysis.

  27. Previous Associations • In 1972, the association between multiple sclerosis and the HLA region of the genome was established. • HLA-DRB1 gene on chromosome 6p21 was identified. The human leukocyte antigen system (HLA) is the name of the human major histocompatibility complex (MHC). This group of genes resides on chromosome 6 and includes genes encoding cell-surface antigen-presenting proteins, among many others. The major HLA antigens are essential elements in immune function.

  28. Genome-wide Association Studies (GWAS) • GWAS Goal • Identify patterns of polymorphisms that vary systematically between individuals with different disease states (in particular, healthy and disease) and could therefore represent the effect of risk-enhancing or protective alleles. • Let’s follow the paper Risk Alleles for Multiple Sclerosis Identified by a Genomewide Study

  29. GWAS Workflow

  30. GWAS Workflow

  31. Genotypes • Critical Issues • SNP tagging • Include other types of polymorphism? • microsatellites • copy number variation • How is the data collected? • What types of data? Sequencing? SNP array? Which platform? • MS Study • 334,923 single-nucleotide polymorphisms • 931 trios (screening phase)

  32. GWAS Workflow

  33. Quality Control • Critical Issues • Hardy-Weinberg equilibrium: significant deviation from HW needs to be addressed/scrutinized (carried out using Pearson χ² or Fisher exact test) • Sampling Bias? • Population stratification (substructure) • Genotyping efficiency (missing data)? • Inference of missing data • MS Study • 72 trios removed • Around 150k SNPs not used • STRUCTURE used to remove individuals with non-European ancestry
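The Pearson χ² check for Hardy-Weinberg equilibrium mentioned above can be sketched for a single biallelic SNP. This is an illustration only: the genotype counts in the usage lines are invented, `hwe_chi_square` is a hypothetical name, and for small counts the Fisher exact test named on the slide is generally preferred.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Pearson chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium.

    Arguments are the observed counts of the three genotypes AA, AB and BB.
    Returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


print(hwe_chi_square(25, 50, 25))  # counts match HW proportions exactly
print(hwe_chi_square(50, 0, 50))   # no heterozygotes at all: strong deviation
```

In a QC pipeline a SNP would be flagged when the p-value falls below a chosen cutoff; the cutoff used in the MS study is not given in these slides.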

  34. Quality Control MAF: Minor Allele Frequency HW: Hardy Weinberg Equilibrium ME: Mendelian Errors
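Two of the abbreviations above, MAF and ME, correspond to simple per-SNP computations. The sketch below assumes a biallelic SNP and represents each genotype as a pair of allele labels; both function names are made up for this example.

```python
from collections import Counter


def minor_allele_frequency(genotypes) -> float:
    """Frequency of the rarer allele at a biallelic SNP.

    `genotypes` is a list of 2-tuples of allele labels, one per individual.
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) < 2:
        return 0.0                      # monomorphic in this sample
    return min(counts.values()) / sum(counts.values())


def is_mendelian_consistent(mother, father, child) -> bool:
    """True if the child's genotype can be formed from one allele of each parent."""
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


print(minor_allele_frequency([("A", "A"), ("A", "B"), ("B", "B"), ("A", "A")]))
print(is_mendelian_consistent(("A", "A"), ("A", "B"), ("A", "B")))
print(is_mendelian_consistent(("A", "A"), ("A", "B"), ("B", "B")))  # Mendelian error
```

In a trio design such as the MS study, a Mendelian error usually points to a genotyping mistake or a mislabeled sample, so SNPs or families with many errors are candidates for removal.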

  35. Population Substructure Example
      Individual   Locus 1   Locus 2   Locus 3   Locus 4
      1            A,A       A,A       A,C       A,A
      2            A,B       A,A       A,B       A,A
      3            B,B       A,B       A,A       A,A
      4            C,C       D,E       D,E       B,C
      5            C,C       C,D       D,D       B,D
      6            B,C       E,E       A,E       C,E
      7            A,C       D,D       C,D       A,D
      {A,B,C,D,E} are labels for the different gene alleles for 4 different loci. These genotypes might suggest that individuals 1,2,3 draw their alleles from a different gene pool than do individuals 4,5,6,7, suggesting the presence of 2 distinct populations.
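One rough way to see the two groups in this example is to count how many alleles each pair of individuals shares. The sketch below is a simple allele-sharing score invented for illustration; it is not the model-based clustering that STRUCTURE performs.

```python
from collections import Counter

# Genotypes from the slide: individual -> genotype at loci 1 to 4.
GENOTYPES = {
    1: ["AA", "AA", "AC", "AA"],
    2: ["AB", "AA", "AB", "AA"],
    3: ["BB", "AB", "AA", "AA"],
    4: ["CC", "DE", "DE", "BC"],
    5: ["CC", "CD", "DD", "BD"],
    6: ["BC", "EE", "AE", "CE"],
    7: ["AC", "DD", "CD", "AD"],
}


def allele_sharing(ind_a: int, ind_b: int) -> float:
    """Fraction of alleles shared between two individuals, averaged over loci."""
    shared = 0
    for g_a, g_b in zip(GENOTYPES[ind_a], GENOTYPES[ind_b]):
        # Multiset intersection: AA vs AB shares one allele, AA vs AA shares two.
        shared += sum((Counter(g_a) & Counter(g_b)).values())
    return shared / (2 * len(GENOTYPES[ind_a]))


print(allele_sharing(1, 2))  # two individuals from the first suggested group
print(allele_sharing(1, 4))  # individuals from different suggested groups
```

Individuals 1 and 2 share three quarters of their alleles, while individuals 1 and 4 share none, which is the kind of pattern the slide describes.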

  36. GWAS Workflow

  37. Statistical Analysis • Critical Issues • Inference of phase and missing data • Single SNP test of association • Multi SNP test of association • What if individual SNPs do not contribute additively to disease? • MS Study • Transmission disequilibrium test (TDT) • UNPHASED program used for genetic association analysis with missing data and unknown phase
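The TDT used in the MS study has a simple textbook form for one biallelic SNP in trios: among heterozygous parents, compare how often the candidate allele is transmitted to the affected child versus not transmitted. The sketch below implements that basic McNemar-style statistic; it is not the UNPHASED implementation, and the counts in the usage line are invented.

```python
import math


def tdt(transmitted: int, untransmitted: int):
    """Basic transmission disequilibrium test for one biallelic SNP.

    `transmitted` counts heterozygous parents who passed the candidate allele
    to the affected child; `untransmitted` counts those who passed the other one.
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no heterozygous parents, so the test is undefined")
    chi2 = (transmitted - untransmitted) ** 2 / total
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value


print(tdt(60, 40))  # hypothetical counts
```

Under no association each allele is transmitted half the time, so a large imbalance suggests the allele is linked to disease. Because only within-family transmissions are compared, the test is less sensitive to population stratification than a case-control comparison.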

  38. MS Study Statistics • P values (shown as –log values) for results of transmission disequilibrium testing are plotted across the genome. • The classic HLA-DR risk locus on chromosome 6p21 stands out with strong statistical significance (P < 1×10⁻⁸¹).

  39. Screening Analysis WTCCC: Wellcome Trust Case Control Consortium NIMH: National Institute of Mental Health IMSGC: International Multiple Sclerosis Genetics Consortium

  40. GWAS Workflow

  41. Rankings, Filter, Results • Critical Issues • Multiple Testing Correction • SNP Arrays • The hope is that by typing a dense set of markers, we will observe markers in direct association with an unobserved causal locus, and in indirect association with disease phenotypes. • Is common-disease common-variant the correct model for this disease? • MS Study • SNPs in loci: HLA-DRA, IL2RA, IL2RA, IL7R
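To give a sense of scale for multiple testing correction, the sketch below applies a Bonferroni correction using the 334,923 SNPs listed for the screening phase. Bonferroni is chosen here only because it is the simplest correction; the slides do not say which method the study used, the 0.05 significance level is an assumption, and fewer SNPs were actually tested after quality control.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def passes_correction(p_value: float, alpha: float, n_tests: int) -> bool:
    """True if a single p-value survives the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


N_SNPS = 334_923  # SNP count from the slides; alpha = 0.05 is an assumed level
print(bonferroni_threshold(0.05, N_SNPS))
print(passes_correction(1e-81, 0.05, N_SNPS))  # order of the HLA-DR signal
print(passes_correction(1e-4, 0.05, N_SNPS))   # nominally small, not genome-wide
```

A p-value that looks convincing in a single test (say 1e-4) falls far short of the corrected threshold of roughly 1.5e-7, which is why replication in additional samples matters for the weaker signals.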

  42. GWAS Workflow

  43. Analysis • Critical Issues • Alleles of IL2RA and IL7RA and those in the HLA locus are identified as heritable risk factors for multiple sclerosis • Environmental factors? • Where are the associative SNPs found? • MS Study • Association found and LD used to identify markers • More trios and controls recruited for replication (targeted SNPs)

  44. The Biology: IL2RA and IL7RA • Both are important in T-cell mediated immunity • IL2RA • The interleukin-2 receptor (IL-2R) is a heterotrimeric protein expressed on the surface of certain immune cells that binds and responds to a cytokine called interleukin 2. • Linked to two other autoimmune diseases: type 1 diabetes and autoimmune thyroid disease. • IL7RA • The protein encoded by this gene is a receptor for interleukin 7 • Helps to control the activity of a class of immune cells called regulatory T cells. • The IL7RA variant indicates an effect on gene expression, with a change in the ratio of soluble to cell-bound interleukin-7 receptor

  45. Replication Analysis

  46. Odds Ratios • Measure of effect size • Odds of carrying the allele in the case group divided by the odds of carrying it in the control group • Example 100 cases, 100 controls • 75 cases with allele 0 (25 without) • 25 controls with allele 0 (75 without) • Odds ratio = (75/25)/(25/75) = 9.00 • Note: the ratio of proportions, (75/100)/(25/100) = 3.00, is a relative frequency, not an odds ratio • Very few studies have implicated SNPs with odds ratios > 3
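The distinction between odds and proportions is easy to blur, so here is a small sketch that computes both for the counts in the example (75 of 100 cases and 25 of 100 controls carry the allele). Under the standard definition, odds are carriers divided by non-carriers; the function names are chosen for this illustration.

```python
def odds_ratio(case_with: int, case_total: int,
               control_with: int, control_total: int) -> float:
    """Odds of carrying the allele among cases divided by the odds among controls."""
    case_odds = case_with / (case_total - case_with)
    control_odds = control_with / (control_total - control_with)
    return case_odds / control_odds


def proportion_ratio(case_with: int, case_total: int,
                     control_with: int, control_total: int) -> float:
    """Ratio of carrier proportions (a relative frequency, not an odds ratio)."""
    return (case_with / case_total) / (control_with / control_total)


print(odds_ratio(75, 100, 25, 100))        # (75/25) / (25/75), about 9
print(proportion_ratio(75, 100, 25, 100))  # (75/100) / (25/100) = 3
```

The two measures are close only when the allele is rare in both groups; for a common allele like the one in this example they differ substantially.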

  47. Regional Plots for Associations in IL2RA

  48. Regional Plots for Associations in IL7RA
