TrAPSS Transcript Annotation Prioritization and Screening System

TrAPSSTranscript Annotation Prioritization and Screening System Terry Braun1,2,5, Todd Scheetz2,4, Hakeem Abdulkawy5, Bart Brown5, Steve Davis5, Matt Kemp5, Brian O’Leary5, John Ritchison5, Mike Smith5, Suma Shankar2, Abe Clark6, Val Sheffield3, Edwin Stone2, Thomas Casavant1,4,5 1Department of Biomedical Engineering, 2Department of Ophthalmology, 3Department of Pediatrics, 4Center for Bioinformatics and Computational Biology, 5Coordinated Laboratory for Computational Genomics. The University of Iowa, Iowa City, Iowa, USA. 52242. 6Alcon Research, Fort Worth, Texas. http://putt.eng.uiowa.edu/trapss_research.html

Outline: Disease gene identification: A case study of approaches and system design • Mutation identification and disease gene discovery • markers and PCR, linkage, linkage disequilibrium • candidates and intervals • TrAPSS overview • Hypothesis • History, design, and architecture, data models • Results • Machine learning and optimization • Future (coding and sequence based) • Demos

Genetic Marker • A genetic marker allows for the observation of the genetic state at a particular genomic location (locus). • A genotype is the measured state of a genetic marker. • A tool for observing inheritance patterns (Mendel's rules and meiosis) • May never be feasible to sequence cases directly (however, with the current decreasing rate of cost, in 10 years it may cost about $1000 to do a genome) • An “informative” marker is often “heterogenous", or “polymorphic” and enables the observation of the inheritance of genetic material.

Example -- genotypes Pedigree male female parents offspring 1 2 3 4 1 1 1 1 2 4 1 4 1 1 1 1 uninformative heterogeneous These labels (markers) are a measure of the genetic state of each individual. Recall from "Rule of Segregation", offspring get one gene from each parent. Markers are not genes, but they are regions on chromosomes (meiosis).

What a marker looks like in the Genome Geneticists assign numerical values to different versions of markers

“SNPs” • Single-Nucleotide Polymorphisms • 1 every 1000 bp (estimated) • 2,972,052 SNPs submitted to dbSNP • dbSNP summary link • 50% of all SNPs are in question • 10% of UTRs have SNPs • 100,000 - 500,000 SNPs needed • Why don’t we do this? • $$$

Duplicating DNA – to Use Markers to "Probe" Genomes of Individuals • mitosis is process that copies DNA in biology • the first step is to "unzip" the 2 DNA strands of the double helix • an enzyme called DNA polymerase makes a copy by using each strand as a template • two other components • nucleotides (A, G, T, C) (A-T, G-C, etc) • a short stretch of DNA called a "primer" (to prime the process)

PCR – Polymerase Chain Reaction • PCR is a process that copies DNA exponentially • mimics the process by organisms, but in vitro (in a test tube) • relies on the ability of DNA-copying enzymes to remain stable at high temperatures • Necessary components (in a vial) • piece of DNA to be copied • large quantities of four nucleotides • large quantities of primer sequence • DNA polymerase (Taq – named for Thermus aquaticus, a bacterium that lives in hot springs) Cetus Corporation

PCR Reaction • The reaction can be carried out entirely in a vial simply by changing the temperature • separate the 2 DNA strands • heat to 75-90 C (165 F) for 30 seconds • this "melts" the DNA apart – the base pairing comes undone • "anneal" the primers • primers cannot bind to the template strands at such high temp – cooled to 55 C for 20 seconds • make complete copy of template (and thus new templates for the next cycle) • Taq polymerase works best at 75 C (hot springs) • nucleotides are added (complement – if template has A, T is added, etc)

PCR Reaction • Three steps • separation of strands • annealing of primers to template • synthesis of new strands • Takes approx. 2 minutes • Each reaction is carried out in the same vial, and after every cycle, each piece of DNA is duplicated (exponential copying) • Cycle can be repeated 30 times (2^30 = 1,073,741,824) • 1 million copies can be made in approximately 3 hours from a single copy of DNA • this is why very minute samples can be used to identify individuals (e.g. crime scene investigations • Valuable tool to multiply unique regions of DNA so they can be detected in LARGE genomes • Note, we need to know the flanking sequence to be able to design primers • Also, this flanking sequence needs to be unique otherwise the reaction could amplify sequence from multiple regions of the genome

Marker D15S643, Genotypes, and primers – (min, max)143,159; CEPH 153,145; 149;149 AAATTGCTCTGAGTTCTGAGGC (forward primer) >chr15:72,091,076-72,091,409 CAGCTGATCTTTAGGAAACATTTAGGGGGAGGAGGCACTCCTTTCAAATA ACCTTTCTTTAGACAGGTTTCTGATCTGATTCAAGGCCACATCCTGGCCA TCTGGTTTCTGTAACTCAGAGAATTACTGCTCCTGAT AAATTGCTCTGAG TTCTGAGGC (22) TACTGCTGTCATATTGCATTCTCCGACCATTTTCCAGGTCT (41) CTCAAG (6) acacacacacacacacacacacacacacacacacacacacacacacacac(50) TCCTCAAGC (9) CGTTAGACTCCATTCCCATGTAGTA (25) TCCAAATAAGTTTTACAGCAAGACACACTGGAGAGATTGAAGCT (reverse primer) TACTACATGGGAATGGAGTCTAACG (actual sequence) ATGATGTACCCTTACCTCAGATTGC

D15S160 >chr15:72091076-72091409 CAGCTGATCTTTAGGAAACATTTAGGGGGAGGAGGCACTCCTTTCAAATA GTCGACTAGAAATCCTTTGTAAATCCCCCTCCTCCGTGAGGAAAGTTTAT ACCTTTCTTTAGACAGGTTTCTGATCTGATTCAAGGCCACATCCTGGCCA TGGAAAGAAATCTGTCCAAAGACTAGACTAAGTTCCGGTGTAGGACCGGT TCTGGTTTCTGTAACTCAGAGAATTACTGCTCCTGAT AAATTGCTCTGAG AGACCAAAGACATTGAGTCTCTTAATGACGAGGACTA TTTAACGAGACTC TTCTGAGGC TACTGCTGTCATATTGCATTCTCCGACCATTTTCCAGGTCT AAGACTGGC ATGACGACAGTATAACGTAAGAGGCTGGTAAAAGGTCCAGA CTCAAGacacacacacacacacacacacacacacacacacacacacacac GAGTTCtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtg acacacTCCTCAAGC CGTTAGACTCCATTCCCATGTAGTA TCCAAATAAG tgtgtgAGGAGTTCG GCAATCTGAGGTAAGGGTACATCAT AGGTTTATTC TTTTACAGCAAGACACACTGGAGAGATTGAAGCT AAAATGTCGTTCTGTGTGACCTCTCTAACTTCGA

Basis for Inheritance of Disease: Examples Aa Pedigree male female parents offspring Aa Aa Aa AA AA Aa 1/2 1/2 A a A from mom/dad? a from mom/dad? P(AA) = 1/4 P(Aa) = 1/2 P(aa) = 1/4 1/2 1/4 1/4 A AA Aa 1/2 1/4 1/4 Aa aa a

Examples • 232 • 234 • 236 • 238 • 240 • 242 234 238 232 238 1 4 2 4 234, 238 238, 232 234, 232 234, 238 238, 238 2 4 If you "genotype" an individual at enough markers, you can calculate the probability of uniquely identifying an individual. Note that the lawyers for OJ Simpson argued that "recoded" allele numbers increased the likelihood of contamination and false identification.

Examples Affected individuals

Examples Dominant model Geneticists then look for genes that mimic this pattern of inheritance

Example Recessive model. Very unlikely, because "founders" marrying in also carry the disease, which by definition is a rare genetic disorder.

BBS4 Pedigree

Monogenic and Polygenic Diseases • monogenic (Mendelian) -- one gene • “simple” (dominant and recessive) Mendelian inheritance • direct correspondence between one gene mutation and one disorder • majority of disease genes found are monogenic (~1400 - OMIM 2005) • polygenic -- (complex) multiple genes • heterogeneity – disease caused by multiple genes • epistasis – disease caused by multiple interacting genes • combinatorics • no longer have direct correspondence between one gene and disorder • majority of disorders are probably polygenic • complexity of organisms and observed pathways

...Mongenic and Polygenic Diseases • phenocopy • reduced penetrance • Example -- sickle cell anemia • “classic” recessive disorder • defect in red blood cells (hemoglobin) • but… infant hemoglobin gene can “leak” • wide range of phenotypes

Bardet-Biedl Syndrome (BBS) • Obesity • Diabetes/ hypertension • Retinopathy • Hypogenitalism • Polydactyly • Mental Retardation • Renal Anomalies • Heart defects Rare disorder, but common phenotypes

Molecular Analysis of BBS • BBS1 - 11q13 Novel* • BBS2 - 16q22 Novel* • BBS3 - 3p13 • BBS4 - 15q21 Novel†, TPR Repeats • BBS5 - 2q31 • BBS6 - 20p12 Type II Chaperonins • BBS7 - 4q27 Novel* • BBS8 - 14q31 Novel†, TPR Repeats *,† - Some Similarity

Some Useful Properties of DNA • fragments of DNA have a minute negative charge • if you apply an electric field to DNA in a matrix, it will migrate to the positive pole • DNA is a linear molecule, but it tends to fold up (similar to a knot) • this bound up molecule of DNA will have a unique cross-sectional area profile that is dependent on its sequence • Gel electrophoresis – DNA is placed in a polyacrylamide gel and a voltage is applied • polyacrylamide gel and pool analogy • applied charge will cause DNA to migrate dependent on its size, and its sequence

Differential Display(http://www.genhunter.com/products/rnaimage/schematic.gif)

BBS4 Deletion (by PCR)Example of Usage exons3 4

Gene Structure and Primers

Disease Gene Identification • SSCP -- single strand conformational polymorphism • PCR -- polymerase chain reaction • primers amplify template sequence • direct sequencing • BBS2 (Bardet-Biedl Syndrome)

Homozygosity Mapping

BBS2 genetic mapping C16 1 2 3 4 5 6 7 8 9 10 11 12

BBS2 genetic mapping unaffected affected C16 1 2 3 4 5 6 7 8 9 10 11 12

BBS4 Gene (Direct Sequencing) (Hs.26471)

BBS4 Deletion (by PCR) exons3 4

BBS4 Mutations (direct sequencing) (R295P)

Identifying disease-causing variations • To find mutations -- you search in individuals with disease for sequence variations • You compare those variations to normal individuals for validation • Find enough to be convincing. • Some are easy -- stops and deletions, some are hard -- AMD

TrAPSS TrAPSS (Transcript Annotation Prioritization and Screening System) is an integrated system that utilizes annotated features to accelerate the mutation screening process by predicting the potential of gene sub-sequences to contain disease-causing mutations. Evolutionarily conserved functional domains and motifs across species are potentially the most powerful information available to determine the functional significance of gene sequences. This research explores the contributions of several types of annotation. These include: functional domains, secondary structure prediction, similarity to model organism sequences, SNPs, and conserved intronic and regulatory elements. TrAPSS utilizes annotation to prioritize both candidate genes and sub-regions of genes. Genes may be prioritized with respect to each other, and also between sub-regions within each gene. Screening resources may then be spread across multiple genes in parallel, while still covering entire genes when mutations are detected.

TrAPSS • TrAPSS is a system that was built to accelerate the process of disease gene identification • Hypothesis is that sequence-based annotation may be used to direct where to search for mutations • conserved protein functional domains • transmembrane domains • secondary structure prediction • SNPs • repeats • GC content • …

Shopping List • homology (model organism and gene family) • other secondary structures • dinucleotides and codon bias • 3D structure info • pseudo genes • exonic splice enhancers • splice site consensus deviations (5’, 3’, branch point) • GC content and failed primer info • TFBS and promoter regions (prediction and homology) • repeats • N vs heterozygosity • large scale repeats (Alus) • small scale repeats • absence of repeats • conservation of exonic structure • CPG regions in promoters • existence of alternative transcripts • linkage information • expression (ubiquitously expressed, not ubiquitously expressed, selectively expressed) • expression (hybridizations, differential display) • ontological info (pathways) • Soares eye libraries • pathways

Problem Statement The Search Space in the context of mutation identification describes three dimensions: • the number of genes • the number of individuals • the number of assays (typically PCR-based reactions for Single Stranded Conformational Polymorphism – SSCP, or for direct sequencing). The Mutation Screening Dilemmastates that the search space needs to be large enough to ensure that the variations of interest are located, but small enough that these variations can be found with a realistic expenditure of time and reagents. The multi-dimensional nature of the search space requires that most experiments must be constrained to a relatively small number in at least one dimension to make disease gene identification feasible.

Problem Statement Entire Search Space Mutations Screening Units Number of Genes Constrained Search Space Number of Subjects Modest Search Space Example 100 genes * 100 Subjects * 20 primer pairs = 200,000 assays

The Solution • TrAPSS Transcript Annotation Prioritization and Screening System • Primary Purpose:To Accelerate Mutation Identification

History (prior to genome browsers) • ~1997 • Used known UniGene clusters to identify recently sequenced genomic fragments (and then search with ESTs) • ~1997 (Interval Search) • Acquire mapped UniGene clusters within genomic intervals (and linkage analysis) • ~1998 • Candidate genes via differential display • Cut spots out of gels and sequence • BLAST sequence in attempt to identify UniGene clusters and recently sequenced genomic fragments • ~1999 • Key word search of UniGene tissue library sources (heart, brain, limb, adipose, eye, kidney)

Interval Search -- finding UniGene clusters in raw sequence

Identified UniGene Clusters

Interval Search

Knowledge Discovery Manual “Pipeline” Full Length Transcript Identified MGC Genomic Contigs BLAST Manually Entered ESTs/ accessions Transcript UniGene BLAST Gene Structure RPS-BLAST NT PfamA, SMART Flank, and Intron Identification Domains Manually Select Primers /w Primer3 Sequence Verified Shift Order Primers SSCP 1 Person-day per Gene Manual Layout of Genomic, Gene Structure, Domains, and Primers Manual Curation of SSCP Results and Sequence

GeneScreen (TrAPSS V1.0)

TrAPSS (current version) • Automate • disease gene identification and mutation screening • acquisition of gene structure and genomic context • identification of domains, secondary structures, SNPs, repeats, cross-species homologies • prioritization of genes • prioritization of sub-regions of genes • selection of assay reagents (ex. primers) • data management (ex. track genes, primers, annotation) • Determine which types of information and heuristics may improve the ability to predict regions of genes that would be more likely to harbor phenotype-altering variations

TrAPSS Transcript Annotation Prioritization and Screening System

TrAPSS Transcript Annotation Prioritization and Screening System

Presentation Transcript

Distributed Annotation System Version 2

HyLighter Interactive Annotation System

Critical Thinking and prioritization

Project Selection and Prioritization

Regional Transit System Prioritization Process

Prioritization and Delegation

Transcript

TRANSCRIPT

Prioritization

Transcript Maps

B. Target specific screening – Target prioritization

Planning and Prioritization Strategies

Prioritization of Avian GO Annotation

Prioritization

Proteomics and annotation

Regional Transit System Prioritization Process

Transcript Presentation

Physical and transcript mapping

Transcript Evaluation

Prioritization AND Disciplined Process

Annotation and Visualization