Genomics Lecture Topics Genetic mapping studies: two approaches

Genomics Lecture Topics • Genetic mapping studies: two approaches • Classical linkage map/genome-wide association study • Physical map • Cloning and isolating genes the old-fashioned way using positional cloning • Search for the cystic fibrosis gene • Modern genome sequencing • Shotgun sequencing an entire genome • Sequencing the human genome • Functional/comparative genomics

Classical genetic linkage studies: Genetic linkage mapping involves determining the statistical association of specific traits with genetic markers on chromosomes using pedigrees and crosses. • Use recombination frequencies to determine relative distances between markers on a chromosome. • Genome-wide association studies = GWAS • Humans require 24 different maps, one for each of the 22 autosomes and one each for the X and Y chromosomes. • Marker alleles are used to determine the rate of recombination (e.g., crossing over) between linked genes. • Linked = on the same chromosome or genome • The unit measured for each linkage map is the recombination frequency = # recombinants/total progeny • Reported as map units (mu) or centiMorgans (cM). • Will talk about this more starting with chapter 11.
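The recombination frequency formula above can be sketched in a few lines of Python. This is only an illustration: the function names and the test-cross counts are hypothetical, and the conversion assumes the usual convention that 1% recombination corresponds to 1 map unit (1 cM), which is a reasonable approximation only for closely linked markers.

```python
def recombination_frequency(recombinants: int, total_progeny: int) -> float:
    """Return # recombinants / total progeny as a fraction."""
    if total_progeny <= 0:
        raise ValueError("total_progeny must be positive")
    if not 0 <= recombinants <= total_progeny:
        raise ValueError("recombinants must be between 0 and total_progeny")
    return recombinants / total_progeny


def map_distance_cm(recombinants: int, total_progeny: int) -> float:
    """Convert a recombination frequency to map units (1% = 1 cM)."""
    return 100.0 * recombination_frequency(recombinants, total_progeny)


# Hypothetical test cross: 120 recombinant offspring out of 1,000 scored.
distance = map_distance_cm(120, 1000)
print(f"{distance:.1f} cM")
```

With these made-up counts the two markers would be placed about 12 map units apart.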

Different types of markers used in genetic mapping: • Genes can be used as genetic markers, but they are not ideal choices because they occur infrequently (ca. every 100 kb in humans). • Greater marker density is usually required. • 3 major types of markers are used: • RFLPs = restriction fragment length polymorphisms (e.g., substitutions at a restriction site) • Microsatellites (STRs) = short tandem repeats • Single nucleotide polymorphisms (SNPs) • STS = sequence tagged site

Genome-wide mapping: • High-density genetic mapping was revolutionized in the 1980s by the discovery of abundant polymorphic genetic markers like microsatellites. • Research teams collaborated and added to a common database. • By 1994, the human genetic map had localized 5,264 microsatellites to 2,335 chromosome loci (average density of one marker every 599 kb). • In the process, thousands of sequence tagged sites (STSs) were identified. • STS = a couple hundred base pairs of known sequence

High-density genetic map of 5,264 microsatellites localized to each of 23 chromosomes.

From genome-wide mapping to genome sequencing… • For many species with small genomes, such a map would have provided enough landmarks to begin sequencing the entire genome using a conventional map and sequence approach. • The human map still lacked resolution; large stretches of uncharted DNA remained. • Average distance between markers was 600 kb. • Physical mapping was required to assist with the sequencing. Physical map = map of physically identifiable regions of genomic DNA constructed without recombination analysis. • Time and effort could be minimized by targeting sequencing efforts to a specific chromosome (or smaller regions).

Two types of physical maps useful for sequencing a genome: 1. Low resolution: cytogenetic/FISH maps • Stained chromosomes produce banding patterns composed of bands that average 6 Mb. • Regions are designated by their position relative to the centromere. “q” = long arm “p” = short arm Numbered from the centromere starting with “1” • Genes and other sequences are localized to chromosome maps with probes and by using a technique called fluorescence in situ hybridization (FISH). • Various types of radioactive probes and stains also can be used to mark specific regions of chromosomes. • Provides a physical map of the overall structure of each chromosome/region.

http://www.mun.ca/biology/scarr/FISH_chromosome_painting.htm http://www.euchromatin.org/E09.htm

Two types of physical maps useful for sequencing a genome: 2. High Resolution-YAC/BAC Clone Contig Maps • Mechanically shear or partially digest genomic DNA with restriction enzymes and clone large 200-500 kb overlapping fragments to YACs or BACs. • An entire genome or single chromosome can be represented in a YAC or BAC clone library (depends on starting point). • Overlapping YAC/BAC clones can be assembled into a scaffold without sequencing by DNA fingerprinting using markers like microsatellites. • BAC vectors with a capacity of 300 kb and ability to replicate in E. coli have become popular for genome sequencing (now routinely sequenced using the shotgun approach).
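The idea of assembling overlapping clones without sequencing them can be sketched as follows. This is a simplified illustration, not the fingerprinting procedure itself: it assumes each clone has already been scored for a set of markers, treats any two clones that share at least one marker as overlapping, and groups them into contigs. The clone and marker names are hypothetical.

```python
from typing import Dict, List, Set


def build_contigs(clone_markers: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group clones into contigs; clones sharing >= 1 marker are assumed to overlap."""
    unvisited = set(clone_markers)
    contigs = []
    while unvisited:
        seed = unvisited.pop()
        contig, frontier = {seed}, [seed]
        while frontier:
            clone = frontier.pop()
            # Every unvisited clone that shares a marker with this one joins the contig.
            linked = {other for other in unvisited
                      if clone_markers[clone] & clone_markers[other]}
            unvisited -= linked
            contig |= linked
            frontier.extend(linked)
        contigs.append(contig)
    return contigs


# Hypothetical marker content for four clones.
clones = {
    "YAC1": {"m1", "m2"},
    "YAC2": {"m2", "m3"},
    "YAC3": {"m3", "m4"},
    "YAC4": {"m7", "m8"},
}
for contig in build_contigs(clones):
    print(sorted(contig))
```

Here YAC1, YAC2, and YAC3 chain together through shared markers, while YAC4 shares none and stays in its own contig, which corresponds to a gap in the map.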

Fig. 10.1 2nd edition, YAC contig physical map assembled by microsatellite mapping (combination YACs + microsatellite mapping)

Cloning, isolating, and sequencing genes: • Locating a gene is easy if the gene product (protein) is identified. • Create a cDNA library using an expression vector. • Probe with antibodies that bind the gene product. • Isolate and sequence positive clones. • If the gene product is unknown, locating and sequencing a gene is more difficult. • Identify a marker (microsatellite, RFLP, SNP) that: • Shows a strong statistical association with the disease phenotype in test crosses or a genome-wide association study (GWAS). • Is physically linked to the gene on the same chromosome. • Use linkage map + physical map and a technique called positional cloning to home in on the gene and actually sequence it. e.g., cloning and discovery of the cystic fibrosis (CF) gene.

Positional cloning - identification of the cystic fibrosis (CF) gene: • Most common lethal genetic disease in the U.S. (~1 in 2,000). • First human gene identified by positional cloning. • Required 4 years and the work of many laboratories.

Overview of cystic fibrosis: CF results from a defect in a protein that regulates the movement of salt and water in and out of cells. Causes thick mucus secretions in the lungs, pancreas, and intestines. Causes lung disease and organ failure; patients experience chronic bacterial infections. Life expectancy is about 40 years.

First steps to identifying the CF gene by positional cloning: Many hundreds of individuals from CF pedigrees were screened with a large number of RFLPs. A single recurring RFLP showed weak linkage (statistical association) to the cystic fibrosis trait. The CF gene was next localized to chromosome 7 using a labeled RFLP probe and in situ hybridization to condensed chromosomes. All other known RFLPs from chromosome 7 were then screened for linkage to CF. Two more linked RFLPs were discovered on a 500,000 bp subregion (31-32) of the long arm of chromosome 7 (7q31-q32). The data indicated that the CF locus is within a 500,000 bp region of chromosome 7.

Steps to identifying the CF gene (cont.): • Section (500 kb) of chromosome 7 containing the CF gene was cut, cloned, and mapped using a technique called chromosome walking. • End of a cloned sequence is used as a probe to find adjacent overlapping fragments in a genomic library. • Clones that overlap are mapped with RFLPs to determine the extent of overlap. • A new labeled probe designed for the second clone is used to screen the library once again. • Repeat… • Technique does not work well with highly repetitive sequence that is scattered throughout the genome. • Length of each step in the chromosome walk is limited by the size of inserts in the library and the size of the overlap.
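The walking loop described above can be sketched as a toy simulation. This is an illustration under stated assumptions, not the laboratory protocol: clones are modeled as intervals with hypothetical kb coordinates (which a real walk would not know in advance), the probe is the last few kb of the current clone, and a library clone counts as a hit only if it contains the whole probe and extends further to the right.

```python
from typing import List, Optional, Tuple

Clone = Tuple[str, int, int]  # (name, start_kb, end_kb), hypothetical coordinates


def walk(library: List[Clone], start: Clone, target_kb: int,
         probe_kb: int = 5) -> Optional[List[str]]:
    """Walk rightward from `start` until a clone covers `target_kb`."""
    path = [start[0]]
    current = start
    while not (current[1] <= target_kb <= current[2]):
        # The probe is the right-hand end of the current clone.
        probe_start, probe_end = current[2] - probe_kb, current[2]
        hits = [c for c in library
                if c[1] <= probe_start and c[2] >= probe_end and c[2] > current[2]]
        if not hits:
            return None  # the walk stalls: no overlapping clone extends it
        current = max(hits, key=lambda c: c[2])  # take the longest step
        path.append(current[0])
    return path


library = [
    ("A", 0, 40), ("B", 30, 70), ("C", 35, 60),
    ("D", 62, 105), ("E", 100, 140),
]
print(walk(library, library[0], target_kb=130))
```

Each step is bounded by the insert size and the overlap, as the last bullet notes: removing clone D from this toy library leaves no clone that overlaps the end of B, and the walk returns `None`.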

Fig. 9.10, 2nd edition Illustration showing how chromosome walking was used to identify a candidate gene for a disease like cystic fibrosis.

Technique called chromosome jumping also was used: Use partial restriction digestion to cut a large section of chromosomal DNA into large overlapping fragments. Circularize fragments with DNA ligase, bringing ends of DNAs that previously were distant close together. Cut the circles with a restriction enzyme yet again to release the junction region (ends are now inverted). Clone junction regions to form a jumping library. Subclone a small fragment of DNA and use as a probe to find the next junction fragment occurring in the library (same technique as chromosome walking). Repeat… and/or start chromosome walking. Chromosome jumping reaches the target gene faster than walking. A similar technique called “mate pair” is used in today’s next-generation sequencing.

Chromosome jumping

Preparation of next-generation mate-pair library: http://www.investigativegenetics.com

Summary of the search for the CF gene: • 7 chromosome jumps were made for CF. • Chromosome walks were made from each jump site to identify overlapping clones. • Clones spanning a total of 500 kb eventually were characterized. • Next, cloned DNA was used as a probe against other species using a restriction digest + Southern blot. *Genes are more conserved than non-coding sequences and similar sequences should be found in other species. • Five subclones (or candidates) hybridized with other organisms. • Two of the subclones were ruled out by linkage analysis, and a third was a pseudogene (gene-like sequence lacking expression signals). • Remaining two clones were hybridized with mRNA on a Northern blot to test whether their sequences are transcribed. • One more candidate was eliminated, and the 5th candidate was sequenced…

Characteristics of the CF gene: cDNA (mature mRNA of same size) is 6,500 bp. Genomic DNA: CF gene spans 250 kb and contains 24 exons. 68% of Caucasians with cystic fibrosis show a 3-bp deletion that results in the loss of phenylalanine (Phe). Sixty other mutations described. Fig. 4.13, CFTR Structure Cystic Fibrosis Transmembrane Conductance Regulator Protein

Shotgun DNA sequencing: • Sequence the entire genome rapidly. • No requirement for a high resolution linkage or physical map. • Just break the genome up into small pieces, sequence it, and find the gene of interest/do the analysis later. • Reverses the way genetic studies proceed. • It used to be that we had to find the gene first to study the cause of the disease. • Now we can study effects of genes we didn’t even know existed.

Shotgun DNA sequencing (dideoxy method): • Begin with genomic DNA and/or a 200-300 kb BAC clone library. • Mechanically shear DNA into ~2 kb overlapping fragments. • Isolate on agarose, purify, and clone into standard plasmid vectors. • Sequence ~500 bp from each end of each 2 kb insert. • Sequence from the middle 1,000 bp of each insert is obtained from overlapping clones. • Repeat the process so that 4-5x the total length of the genome is sequenced (dideoxy sequencing is 99.99% accurate). • Results in a contig library with ~97% genome coverage (the missing 3% is composed mostly of repeated DNA sequence). • Assemble hundreds of thousands of overlapping ~500 bp sequences with fast computers operating in parallel (supercomputer).
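To get a feel for what 4-5x coverage means, the sketch below computes how many ~500 bp reads are needed for a given fold coverage and estimates the fraction of bases covered at least once. The estimate uses an idealized model that is not part of the lecture: reads are assumed to land uniformly at random, so the chance that a base is missed is roughly e^(-coverage). The 200 kb target size is a hypothetical BAC insert.

```python
import math


def reads_needed(genome_bp: int, read_bp: int, coverage: float) -> int:
    """Number of reads so that reads * read_bp >= coverage * genome_bp."""
    return math.ceil(coverage * genome_bp / read_bp)


def expected_fraction_covered(coverage: float) -> float:
    """Idealized model: uniformly random reads, P(base missed) = e^-coverage."""
    return 1.0 - math.exp(-coverage)


bac_bp = 200_000  # hypothetical BAC insert
for c in (4, 5):
    n = reads_needed(bac_bp, 500, c)
    print(f"{c}x: {n} reads, ~{expected_fraction_covered(c):.1%} of bases covered")
```

The idealized model predicts slightly more than the ~97% quoted above; it ignores repeated sequence and cloning bias, which is consistent with the missing fraction being mostly repeats.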

2 kb clones present a problem, solved with 10 kb clones: • Many repeated sequences in the genome are in regions spanning ~5 kb in size. • So many 2 kb clones contain entirely repeated DNA. • Results in a dead stop in the assembly, because there is ambiguity about where each clone goes. • Repeated sequences occur all over the genome. • On average, 10 kb clones contain less repeated DNA sequence. • Solution is to create and sequence a 10 kb clone library derived from the same genomic DNA or BAC library. • Complete genome coverage requires combining the sequences from the 2 kb & 10 kb libraries.
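The ambiguity caused by repeats can be shown with a toy example. The sequences here are made up and far shorter than real clones: a read that lies entirely inside a repeat matches the genome in more than one place, while a longer read that reaches into unique flanking sequence matches in exactly one place.

```python
def placements(genome: str, read: str) -> list:
    """Return every start position at which `read` matches `genome` exactly."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome.startswith(read, i)]


repeat = "AGGCTTACCA"                       # hypothetical 10-base repeat
genome = "TTGCA" + repeat + "GGATC" + repeat + "CCTAG"

short_read = repeat[2:8]                    # lies entirely inside the repeat
long_read = "GCA" + repeat + "GGA"          # spans the repeat plus unique flanks

print(placements(genome, short_read))       # two possible positions
print(placements(genome, long_read))        # one position
```

This is the same reason the larger 10 kb clones help: they are more likely to include unique sequence on either side of a repeat.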

Fig. 8.13, Shotgun sequencing a genome

Sequencing the human genome: Two major players: Human Genome Project (HGP): • Publicly funded international consortium (NIH, DOE, etc.) • Francis Collins, National Human Genome Res. Inst. (NHGRI) • Began in U.S. in 1990 with a goal of 15 years • Genetic and physical mapping approach + dideoxy sequencing Celera Genomics Corporation (CRA): • Spin-off of Applied Biosystems (ABI) • J. Craig Venter, CEO • Created in 1998 with a goal of 3 years • Direct shotgun approach + dideoxy sequencing (+ HGP’s maps for validation) • Both groups collected blood and sperm samples from anonymous male and female donors of different ethnic backgrounds.

J. Craig Venter Celera Genomics Francis Collins Human Genome Project

Milestone: 26 June 2000 - White House press conference with Bill Clinton: HGP: Started 1990 ~22.1 billion nucleotides of sequence data 7-fold coverage Unfinished (24% completely finished, 50% near-finished) Celera: Started 1998 ~14.5 billion nucleotides of sequence data 4.6-fold coverage Complete assembled genome with >99% coverage First assembled draft of human genome was simultaneously published in Nature & Science 15 & 16 February 2001 (Nature published 1 day earlier).

How did Celera et al. assemble the sequences? Two methods: Method A: Assembly of 26.4 million 550 bp sequences (4.6-fold coverage), without reference to a physical map of any kind. Covered >99% of the genome. 500 million trillion base-to-base comparisons. 20,000 CPU hours (833 CPU days) on a supercomputer. Method B: Used BAC clone scaffold (combined lots of smaller maps) to validate the whole genome direct shotgun assembly approach. Also helped resolve ambiguities resulting from the assembly of short repeated DNA fragments.
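The Method A figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted on the slides; the implied genome size is a derived value (total sequence divided by fold coverage), not a figure stated in the lecture.

```python
reads = 26_400_000        # Method A: number of sequences assembled
read_len_bp = 550         # length of each sequence
coverage = 4.6            # reported fold coverage

total_bp = reads * read_len_bp              # total nucleotides sequenced
implied_genome_bp = total_bp / coverage     # derived, not quoted in the slides
cpu_days = 20_000 / 24                      # 20,000 CPU hours in days

print(f"total sequence: {total_bp / 1e9:.2f} billion bp")
print(f"implied genome size: {implied_genome_bp / 1e9:.2f} billion bp")
print(f"CPU days: {cpu_days:.0f}")
```

The total agrees with the ~14.5 billion nucleotides reported for Celera at the June 2000 milestone, and 20,000 CPU hours does work out to about 833 days.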

Features of the human genome: • 32,000 genes estimated (50,000-100,000 were predicted). • Not many more genes than Drosophila, and only 50% more genes than Caenorhabditis elegans (nematode worm). • Only 1-1.5% of the genome codes for protein. • 50% of the sequence is repeated DNA. • Humans share 223 genes found in bacteria, but not yeast, nematodes, or fruit flies.

Next-generation genome sequencing: • The shotgun method is fundamentally the same. • The throughput has increased and the cost has decreased. • Not uncommon to assemble trillions of sequence reads. • Some things to consider: • If error rates are high (454, Illumina) 30-50x genome sequencing is required. • If error rates are low (SOLiD, Ion Torrent) 4-5x coverage is sufficient. • Costs are falling from $10K to $1K.

Sequencing is no longer the primary need; data storage/retrieval and computational needs are outpacing everything else. How much data storage does 1 human genome require? • About 1.5 GB (2 CDs) if you stored only one copy of each letter. • For the raw format 2-30 TB are required. • Less accurate platforms that need 30-50x coverage require more data storage capacity.

Post-genome sequencing era is very different: • Classical genetics studies started with a phenotype and set out to identify the gene. • But we now have the ability to start with a gene sequence and set out to identify the phenotype. • Large data sets require many mathematical tools, which has given rise to the field of bioinformatics. • Lots of applications: • Identify genes within genomic DNA sequences. • Align and match homologous gene sequences in databases and seek to determine function. • Predict structure of gene products. • Describe interactions between genes and gene products. • Study gene expression.

1. Identifying genes in DNA sequences: • First step is annotation = identification and description of putative genes and other important sequences. • Open reading frames (ORFs) ORF = potential protein coding sequence that begins with a start codon and ends with a stop codon. • ORFs come in all sizes. • Not all ORFs encode proteins (6-7% do not in yeast). • ORFs with introns can require sophisticated computer algorithms to detect.
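The ORF definition above (start codon through stop codon) translates into a short scan over the three forward reading frames. This is a minimal sketch under simplifying assumptions: it uses the standard stop codons, looks only at the forward strand, and ignores introns, which is why real gene finders need the more sophisticated algorithms mentioned in the last bullet. The function name, the `min_codons` parameter, and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Yield (start, end, frame) for ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield (start, i + 3, frame)
                start = None                   # look for the next ORF in this frame


example = "CCATGAAATTTTAGGG"                   # made-up sequence
for start, end, frame in find_orfs(example):
    print(frame, example[start:end])
```

For the made-up sequence, the only ORF is `ATGAAATTTTAG` in the third frame (frame index 2).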

2. Homology searches to assign gene function: • Homology search = identify gene function by searching database. • Similarities reflect evolutionary relationships and shared function. • Homology searches are performed for nucleotides and amino acids. • GenBank’s BLAST search: http://www.ncbi.nlm.nih.gov/BLAST/ • Example, human mtDNA control region sequence: • TTCTCTGTTCTTCATGGGGAAGCAGATTTGGGTACCACCCAAGTATTGACTCACCCACAACAACCGCTATGTATTTCGTACATTACTGCCAGCCACCATGAATATTGCACGGTACCATAAATACTTGACCACCTGTAGTACATAAAAACCCAATCCACATCAAAA
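A homology search scores how well a query matches each database sequence. The sketch below is not BLAST, which uses a faster seeded heuristic; it swaps in a plain Smith-Waterman local alignment score to show the underlying idea. The scoring values and the two "database" entries are assumptions for illustration; the query is the first 20 bases of the control region example above.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


query = "TTCTCTGTTCTTCATGGGGA"
database = {                                   # hypothetical entries
    "hit": "CCC" + query + "GGG",
    "unrelated": "ACACACACACACACACACAC",
}
ranked = sorted(database, key=lambda name: local_alignment_score(query, database[name]),
                reverse=True)
print(ranked)
```

The entry containing the query scores highest, which is the basic ranking a database search reports; real tools also attach a statistical significance to each hit.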

Fig. 9.2, Summary of genes in the yeast genome.

3. Gene function can be identified and studied in other ways: • Gene knockout approach = systematically delete different genes and observe the phenotypes (PCR + cloning is one method). • Study the transcriptome = complete set of mRNAs in a cell • mRNAs are not stable, but types and levels change with different experimental conditions. • Sample mRNA at experimental intervals and convert to cDNA using reverse transcriptase. • Probe unknown cDNAs with DNA microarray of PCR-generated ORF sequences (requires known sequence for each probe). • Or sequence the entire transcriptome using Next Generation Sequencing (e.g., Pyrosequencing).

Fig. 9.7b, Microarray study of gene expression

“Proteomics”: Proteome = complete set of expressed proteins in a cell Major goals of proteomics: • Identify every protein. • Determine the sequence and structure of each protein (and its function). • Create a database with the sequence of each protein. • Analyze protein levels and interactions in different cell types, at different times, and at different stages of development. Rationale: • Genes are two steps removed from disease (DNA → mRNA → protein). • Most gene products involved in disease are composed of protein. • Understanding protein means understanding disease.
