Genetic analysis of human disorders

Genetic analysis of human disorders
Tom Scerri
Quantitative association analysis and next generation sequencing

Quantitative association exercise and Exercise 4
• Results are in the folder “answers”

Quantitative association analysis: singletons
• How to calculate the association?
• Consider Marker 1:
  • $r = -1.56 / (6.9 \times 0.424)^{0.5} = -0.91205$
  • $t_{n-2} = -0.91205 \times ((10 - 2) / (1 - 0.91205^2))^{0.5} = -6.29066$
  • $b = -1.56 / 6.9 = -0.2261$
  • $r^2 = 0.83183$
  • $p = 0.000235$

Quantitative association analysis: singletons
• How to calculate the association?
• Consider Marker 2:
  • $r = -0.04 / (6.9 \times 0.424)^{0.5} = -0.02339$
  • $t_{n-2} = -0.02339 \times ((10 - 2) / (1 - 0.02339^2))^{0.5} = -0.06616$
  • $b = -0.04 / 6.9 = -0.0058$
  • $r^2 = 0.00055$
  • $p = 0.948871$
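The two marker calculations can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the slides do not name the three input numbers, so the argument names `s_xy`, `s_xx` and `s_yy` (cross-product term, genotype term, trait term) are my own labels. The p-value helper is a shortcut that works only for even degrees of freedom (here n - 2 = 8), not a general t-distribution routine.

```python
import math


def two_sided_p_even_df(t, df):
    """Two-sided p-value of Student's t, using a finite series valid for EVEN df only."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this shortcut only handles even degrees of freedom")
    cos2 = 1.0 / (1.0 + t * t / df)  # cos^2(theta), where tan(theta) = t / sqrt(df)
    sin_theta = math.sqrt(1.0 - cos2)
    term, series = 1.0, 1.0
    for k in range(1, df // 2):
        term *= (2 * k - 1) / (2 * k) * cos2
        series += term
    return 1.0 - sin_theta * series


def quantitative_association(s_xy, s_xx, s_yy, n):
    """Return r, r^2, t, b and p; argument names are assumed labels, not from the slides."""
    r = s_xy / math.sqrt(s_xx * s_yy)           # correlation coefficient
    t = r * math.sqrt((n - 2) / (1.0 - r * r))  # t statistic with n - 2 df
    b = s_xy / s_xx                             # regression slope
    p = two_sided_p_even_df(t, n - 2)
    return {"r": r, "r2": r * r, "t": t, "b": b, "p": p}


# Input numbers taken from the Marker 1 and Marker 2 slides (n = 10 individuals).
marker1 = quantitative_association(-1.56, 6.9, 0.424, 10)
marker2 = quantitative_association(-0.04, 6.9, 0.424, 10)
for name, result in (("Marker 1", marker1), ("Marker 2", marker2)):
    print(name, {key: round(value, 5) for key, value in result.items()})
```

Because the code carries r at full precision rather than the rounded value used on the slides, expect agreement to roughly three decimal places rather than digit for digit.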

Exercise 4a: Performing quantitative association analysis of singletons with PLINK
• Confirming the calculations we made during the lecture using PLINK
• Example command lines:
  • plink --ped CC_quant.ped --map CC_quant.map --out CC_quant --assoc
  • plink --ped CC_quant.ped --map CC_quant.map --out CC_quant --assoc --qt-means
• PLINK output: [Figure: PLINK output, with the results for Marker 1 and Marker 2 labelled]

Exercise 4a: Performing quantitative association analysis of singletons with PLINK
• Do you agree we got the same results?
• Our estimates:
  • Marker 1: $r^2$ = 0.83183, t = -6.29066, p = 0.000235, b = -0.2261
  • Marker 2: $r^2$ = 0.00055, t = -0.06616, p = 0.948871, b = -0.0058
• PLINK output: [Figure: PLINK output, with the results for Marker 1 and Marker 2 labelled]
• The difference in sign (+/-) depends on the allele you are referencing.

Exercise 4b: Selecting SNPs for genotyping using HaploView and Tagger
• The task was to use Tagger to select SNPs for genotyping with these criteria:
  • base-pair positions 13,500,000 to 13,980,000
  • HWE p-value > 0.002
  • minimum genotype call rate > 80%
  • minimum minor allele frequency > 0.05
  • $r^2$ threshold > 0.9
  • pairwise tagging


Exercise 4b: Selecting SNPs for genotyping using HaploView and Tagger
• Results: if you used the file I provided, the output should be something like:
  • 138 SNPs to genotype.
  • [Figure: Tagger output] This 1 SNP “tags” all 11 of these SNPs (including itself) with an $r^2$ > 0.9.
• This means we can genotype 1 SNP and still get most of the information about the genetic variability of 11 SNPs in total.
• Scale this up to a whole genome, and we can genotype 1,000,000 SNPs and get the information about ~10,000,000 SNPs, for only ~10% of the time, resources and cost!
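To make the idea of one SNP “tagging” several others concrete, here is a minimal sketch of greedy pairwise tagging. It is not Tagger's actual implementation, and the SNP names and $r^2$ values are invented purely for illustration.

```python
def greedy_pairwise_tags(snps, pairwise_r2, threshold=0.9):
    """Pick tag SNPs greedily until every SNP is captured by some tag."""
    # Each SNP captures itself plus every SNP it is in strong LD with.
    captures = {snp: {snp} for snp in snps}
    for (a, b), r2 in pairwise_r2.items():
        if r2 > threshold:
            captures[a].add(b)
            captures[b].add(a)

    untagged = set(snps)
    tags = {}
    while untagged:
        # Choose the SNP capturing the most still-untagged SNPs
        # (ties go to the SNP listed first).
        best = max(snps, key=lambda s: len(captures[s] & untagged))
        tags[best] = captures[best] & untagged
        untagged -= tags[best]
    return tags


# Hypothetical SNP names and r^2 values, for illustration only.
snps = ["snpA", "snpB", "snpC", "snpD", "snpE"]
pairwise_r2 = {
    ("snpA", "snpB"): 0.95,
    ("snpB", "snpC"): 0.97,
    ("snpA", "snpC"): 0.85,
    ("snpD", "snpE"): 0.50,
}
tags = greedy_pairwise_tags(snps, pairwise_r2)
for tag, captured in tags.items():
    print(tag, "tags", sorted(captured))
print(f"genotype {len(tags)} of {len(snps)} SNPs")
```

In this toy data set snpB alone captures snpA and snpC, while snpD and snpE are not in strong LD with anything and each has to be genotyped directly.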

Exercise 4b: Selecting SNPs for genotyping using HaploView and Tagger
• Questions and answers from the exercise:
  • How many SNPs do you need to genotype to cover the region? Answer: 138
  • How many SNPs will they capture in total? Answer: 247
  • Compare your results with your neighbour. Answer: They are different.
  • Email your results to clicker@well.ox.ac.uk. Reply: Find 2 alternative tagging SNPs.
  • Wait for a confirmation reply before going home.
• Why look for alternative tagging SNPs? For example, a SNP may be impossible to genotype with my genotyping technology, e.g. due to close proximity to another SNP or issues with repetitive DNA in the region. We can then genotype another SNP instead that still captures the same genetic variability, and then perform an association study.

Exercise 4c: Results using in-silico PCR
• [Figure: in-silico PCR form] Must “check” this box. Why?
• First primer pair:
  • ATAATTAAAAGGCTAATCAAGTGTGCAT
  • TTGCCATAGGTCTCATAATAGCCTAAC

Exercise 4c: Results using in-silico PCR
• First primer pair: [Figure: in-silico PCR results, with these annotations:]
  • Very good program for designing primers. In our exercise, I did not use this program, hence my primers are not very good!
  • For a good PCR, these temperatures should be in the range of ~52-58°C and should be similar in value (e.g. within ~5°C of each other).
  • Forward primer and reverse primer.
  • A SNP in the primer sequence: this is not good.
  • Two SNPs that should be amplified by our PCR and so can be genotyped.

Exercise 4c: Results using in-silico PCR
• Second primer pair:
  • TGCCCGGCTACTCATTTTTTAAAATGTG
  • GTAATACCTTTAAAACATTTTTGCATTTTTT
• [Figure: in-silico PCR results, with these annotations:]
  • A SNP in the primer sequence.
  • The presence of multiple SNPs in the PCR product might be problematic for genotyping. This depends on the genotyping platform (e.g. Sequenom or Illumina).
  • Repetitive DNA indicates our primer is not specific to this region of the genome.

Exercise 4c: Results using in-silico PCR
• Third primer pair:
  • GGTTGGTCTTTCAAAATGATCAGTAGA
  • ATTATAAAGAATTATAAATGAATTATTAAA
• For a good PCR, these temperatures should be in the range of ~52-58°C and should be similar in value (e.g. within ~5°C of each other).
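As a rough illustration of the temperature rule, the sketch below estimates a melting temperature from GC content alone and checks a primer pair against the ~52-58°C window and the ~5°C difference. The formula is a common rule of thumb that I have swapped in. It is an assumption that it resembles what the in-silico PCR tool computes, so the numbers will generally differ from those shown in the exercise.

```python
def basic_tm(primer):
    """Rule-of-thumb melting temperature (degrees C) from GC content and length.

    Assumption: Tm = 64.9 + 41 * (G + C - 16.4) / N, a simple approximation,
    not the model used by the in-silico PCR tool in the exercise.
    """
    seq = primer.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def pair_looks_ok(forward, reverse, low=52.0, high=58.0, max_diff=5.0):
    """Check both estimates fall in [low, high] and differ by at most max_diff."""
    tm_f, tm_r = basic_tm(forward), basic_tm(reverse)
    in_range = low <= tm_f <= high and low <= tm_r <= high
    return in_range and abs(tm_f - tm_r) <= max_diff


# Primer pairs copied from the exercise slides.
first_pair = ("ATAATTAAAAGGCTAATCAAGTGTGCAT", "TTGCCATAGGTCTCATAATAGCCTAAC")
third_pair = ("GGTTGGTCTTTCAAAATGATCAGTAGA", "ATTATAAAGAATTATAAATGAATTATTAAA")

for label, (fwd, rev) in (("first", first_pair), ("third", third_pair)):
    print(label, round(basic_tm(fwd), 1), round(basic_tm(rev), 1), pair_looks_ok(fwd, rev))
```

A GC-only estimate ignores SNPs in the primer and repetitive sequence, which the slides flag as separate problems, so passing this check does not make a primer pair good.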

Exploring the UCSC Genome Browser further… association studies

Genetic association studies… SNPs and genes associated with diseases

Something very cool…

Neanderthal DNA
• [Figure annotations:]
  • Neanderthal sequence aligning to human DNA
  • DNA sequence from six Neanderthal samples

Neanderthal DNA
• [Figure annotation:] A difference between human DNA and Neanderthal DNA

Quantitative association analysis: family-based
• Collect hundreds or thousands of family samples:
  • Including mothers, fathers and all available children.
  • Measure these for a quantitative trait (everyone or just the children).
  • Genotype samples.
  • Test for association, e.g. with QTDT.
• QTDT:
  • A program that can perform many different types of quantitative association analyses, e.g.:
    • Within families.
    • Between families.
      • This is more powerful but is prone to population stratification.
  • Can use a variance components (VC) framework to allow for different components:
    • environmental
    • polygenic
    • additive major gene

Using QTDT
• File formats are essentially the same as for Merlin.
• However, rather than a map file, QTDT uses an IBD file.
  • IBD = identity-by-descent
  • The IBD file is created using Merlin.

Genotype problems: individual errors
• Hypothetical example:
  • designed to be easily understood
  • in reality it is more complicated
  • usually requires a computer to solve these issues
    • e.g. with Merlin
• Identified through:
  • simple Mendelian inheritance
  • double recombinants
• [Figure: haplotypes of A and G alleles across SNP 1 to SNP 8, comparing the true genotypes with two genotype errors:]
  • Genotype error: easily detected.
  • Genotype error: detected because 2 recombination events occurring so close to each other is highly unlikely.
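The two kinds of check can be sketched in Python. This is a deliberately naive illustration: Merlin uses a likelihood-based approach, and the data layout here (genotypes as allele pairs, and a per-marker label for which parental haplotype was transmitted) is my own assumption.

```python
def mendelian_consistent(father, mother, child):
    """True if the child could have received one allele from each parent.

    Each genotype is a pair of alleles at one SNP, e.g. ("A", "G").
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def isolated_switches(origins):
    """Return marker indices that look like a double recombinant.

    `origins` labels, marker by marker, which parental haplotype was
    transmitted (hypothetical input, e.g. "1" or "2"). A single marker that
    differs from both of its neighbours implies two recombinations right
    next to each other, which is more likely to be a genotype error.
    """
    flagged = []
    for i in range(1, len(origins) - 1):
        if origins[i - 1] == origins[i + 1] != origins[i]:
            flagged.append(i)
    return flagged


# Hypothetical data, for illustration only.
print(mendelian_consistent(("A", "A"), ("G", "G"), ("A", "G")))  # consistent
print(mendelian_consistent(("A", "A"), ("G", "G"), ("A", "A")))  # inconsistent
print(isolated_switches(list("11112111")))  # index 4 corresponds to SNP 5
```

The first check needs only one trio and one marker, which is why such errors are easy to spot. The second needs phased data across neighbouring markers, which is where a program such as Merlin comes in.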

Exercise 5a: Using MERLIN to look for double recombinants
• Data and 5th lecture available here:
  • www.well.ox.ac.uk/~clicker/Bologna/Lecture5/
• Use pedstats to check for connectivity of the family members and for simple Mendelian inheritance errors:
  • pedstats -p family_quant.ped -d family_quant.dat
• Use pedstats to check for Hardy-Weinberg equilibrium of the markers:
  • pedstats -p family_quant.ped -d family_quant.dat --hardyWeinberg
• Use MERLIN to check the families for double recombinants:
  • merlin -p family_quant.ped -d family_quant.dat -m family_quant.map --error
• How many individuals have unlikely genotypes?

Exercise 5b: Using MERLIN to create an IBD file for QTDT, and performing an association analysis with QTDT
• Use PEDWIPE to “wipe” or “cleanse” the pedfile, i.e. it will remove any unlikely genotypes:
  • pedwipe -p family_quant.ped -d family_quant.dat
• Use MERLIN to create an IBD file using the new “wiped” data files:
  • merlin -p wiped.ped -d wiped.dat -m family_quant.map --ibd --markerNames
• Run QTDT to perform the quantitative association analysis:
  • qtdt -p wiped.ped -d wiped.dat -i merlin.ibd -at -wega > at_wega.txt
• Two files are generated:
  • at_wega.txt - this contains the p-values for each SNP and trait combination
  • regress.tbl - this contains the “effect size” of any associations

DNA Sequencing
• Old method:
  • Sanger sequencing
• Next generation sequencing:
  • 454 Life Sciences pyrosequencing
  • Illumina/Solexa sequencing

454 Life Sciences pyrosequencing
• Sequencing by synthesis.
• Clonally amplified DNA fragments in a water-in-oil mixture.
• Obtain millions of long sequence reads (400 - 1000 bp).
• http://454.com/products-solutions/how-it-works/index.asp
• …I can try to explain it if you like, just ask!

Illumina/Solexa sequencing
• Sequencing by synthesis.
• Clonally amplified DNA fragments on a flow cell.
• Uses 4 different fluorescently labelled nucleotides.
• Obtain millions of short sequence reads (30 - 100 bp).
• Short sequence reads need aligning to a reference genome.
• Enables very deep coverage (hundreds-fold).
• Enables sequencing multiple individuals/samples simultaneously.
• http://www.illumina.com/
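“Coverage” here is the average number of reads overlapping each base. A back-of-the-envelope estimate is simply total sequenced bases divided by genome length. The sketch below uses made-up numbers and assumes every read aligns uniquely and evenly, which real data never quite does.

```python
def mean_coverage(n_reads, read_length, genome_length):
    """Average depth = total sequenced bases / genome length."""
    return n_reads * read_length / genome_length


# Hypothetical run: 3 million reads of 100 bp over a 1 Mb target region.
depth = mean_coverage(3_000_000, 100, 1_000_000)
print(f"approx. {depth:.0f}x coverage")
```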

Illumina/Solexa sequencing Figure taken from the Illumina .pdf (it’s in the Lecture 5 folder)

Illumina/Solexa sequencing
• Types of sequencing strategies:
  • DNAseq
    • Whole genome
      • Low coverage
    • Exome sequencing
      • High coverage
    • Paired-end
      • Identify genomic structural variations
        • CNVs
        • Insertions/Deletions
        • Inversions
  • Transcriptomes
    • RNAseq

Illumina/Solexa sequencing
• Generic workflow:
  • Sequencing machine
  • Save raw data (gigabytes!) → qseq file
  • Filter data (e.g. on quality scores), e.g. using Novoalign → FASTQ file
  • Align to reference genome, e.g. using Casava, MAQ, Bowtie, Novoalign or BWA → SAM file
    • SAM = Sequence Alignment/Map format
  • Compress file, e.g. using SAMtools → BAM file
    • BAM is a binary version of the SAM file
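A SAM file is plain tab-separated text, so it is easy to peek inside with a short script. The sketch below reads the mandatory columns and counts mapped versus unmapped reads using the FLAG field (bit 0x4 means “unmapped”). The example records are invented for illustration, and for real work SAMtools is the better tool.

```python
def parse_sam_line(line):
    """Split one SAM alignment line into its main mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],       # read name
        "flag": int(fields[1]),   # bitwise flag
        "rname": fields[2],       # reference sequence name
        "pos": int(fields[3]),    # 1-based leftmost mapping position
        "mapq": int(fields[4]),   # mapping quality
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
    }


def count_mapped(lines):
    """Return (mapped, unmapped) read counts, skipping '@' header lines."""
    mapped = unmapped = 0
    for line in lines:
        if line.startswith("@"):
            continue
        if parse_sam_line(line)["flag"] & 0x4:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Hypothetical SAM records, for illustration only.
example = [
    "@SQ\tSN:ref\tLN:100",
    "read1\t0\tref\t7\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
    "read3\t16\tref\t20\t60\t4M\t*\t0\t0\tGGCA\tIIII",
]
print(count_mapped(example))
```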

Exercise 5c: Looking at Next Generation Sequence Data
• Warning: This is a very crude way of looking at the data and is by no means the best way. It is for demonstration only.
• For this exercise it will be easier if I log you into a server in Oxford. To do this, you need either PuTTY or SSH, which should be easy to download and install onto your computers. They are also already on the computers in the lab.
• Otherwise, you could try installing the sequence analysis programs “novoalign” and “samtools” onto your own computers (but don’t ask me for help!).
• In the Lecture 5 folder are two files:
  • phi.fa = reference genome for phi
  • s_4_1_0045_qseq.txt = next generation sequence data in qseq format
• Now, you could write your own simple perl or awk scripts to convert the file s_4_1_0045_qseq.txt from qseq format into fastq format.
  • It would involve shifting columns and data around.
• However, due to time constraints I have generated the fastq file for you:
  • s_4_1_0045_fastq.txt
• Compare the qseq and fastq files using the “more” command (hit “q” to quit).
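For the curious, the qseq-to-fastq conversion could look roughly like the Python sketch below (instead of perl or awk). It assumes the usual 11 tab-separated qseq columns (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter flag) and writes “N” where qseq uses “.” for an uncalled base. The read-name layout and the example line are my own choices, and this is not the script used to generate s_4_1_0045_fastq.txt.

```python
def qseq_line_to_fastq(line):
    """Convert one qseq line to a 4-line FASTQ record (assumed 11-column layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated columns, got {len(fields)}")
    machine, run, lane, tile, x, y, index, read, seq, qual, _passed = fields
    # Read-name layout is an assumption; any unique name would do.
    name = f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read}"
    # qseq marks uncalled bases with '.', FASTQ conventionally uses 'N'.
    # Quality characters are copied unchanged.
    return f"@{name}\n{seq.replace('.', 'N')}\n+\n{qual}\n"


def convert_qseq_file(qseq_path, fastq_path):
    """Stream a qseq file into a FASTQ file, one record at a time."""
    with open(qseq_path) as src, open(fastq_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(qseq_line_to_fastq(line))


# Hypothetical qseq line, for illustration only.
example_line = "MACHINE1\t7\t4\t45\t100\t200\t0\t1\tAC.T\tabcd\t1"
record = qseq_line_to_fastq(example_line)
print(record, end="")
```

To try it on the course data you would call something like convert_qseq_file("s_4_1_0045_qseq.txt", "my_fastq.txt") and compare the result with the provided fastq file using “more”.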

Exercise 5c: Looking at Next Generation Sequence Data
• Step 1: Run “novoindex” to index the reference file:
  • Type: novoindex - you can see options available
  • Type: novoindex newindex.nix phi.fa - this will create the file newindex.nix
• Step 2: Run “novoalign” to align your sequences to the reference sequence:
  • Type: novoalign - you can see options available
  • Type: novoalign -d newindex.nix -f s_4_1_0045_fastq.txt -F ILMFQ -o SAM -l 12 > sam_file.sam
• Step 3: Run “samtools” to convert your SAM file into a BAM file:
  • Type: samtools - you can see options available
  • Type: samtools view - you can see new options available
  • Type: samtools view -bS sam_file.sam > bam_file.bam

Exercise 5c: Looking at Next Generation Sequence Data
• Step 4: Run “samtools” to sort your BAM file:
  • Type: samtools sort - you can see new options available
  • Type: samtools sort bam_file.bam bam_sorted - this will create the file bam_sorted.bam
  • (Newer samtools releases, 1.3 onwards, expect instead: samtools sort -o bam_sorted.bam bam_file.bam)
• Step 5: Run “samtools” to index your sorted BAM file:
  • Type: samtools index - you can see new options available
  • Type: samtools index bam_sorted.bam
• Step 6: Run “samtools” to view your sequences:
  • Type: samtools tview - you can see new options available
  • Type: samtools tview bam_sorted.bam phi.fa
  • Type “?” for help.
  • Type “.” to toggle the sequences.
  • Type “q” to quit.
