Gene Structure and Function

Genetic Code • All genomes, from viruses to humans, are built on linear sequences of nucleotides and share a (nearly) universal genetic code. • An mRNA specifies an amino acid sequence through the genetic code. • One nucleotide per amino acid could specify only 4 amino acids. • Two-nucleotide combinations could specify only 16 amino acids. • Three nucleotides (64 possibilities), called a codon, are enough to specify each of the 20 amino acids. • Each group of 3 nucleotides codes for one amino acid. • The first codon is the start codon and usually codes for the amino acid methionine (M, codon ‘ATG’). • The last codon is the stop codon and does NOT code for an amino acid. It is sometimes represented by ‘*’. • A coding region (abbreviated CDS) starts at the START codon and ends at the STOP codon.
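
The codon-by-codon reading described above can be sketched in a few lines. This is a minimal illustration that assumes the standard genetic code; the function name `translate_cds` and the example sequence are made up for this sketch and do not come from the slides.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a CDS (DNA, 5'->3') codon by codon; '*' marks the stop codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    protein = []
    for i in range(0, len(cds), 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        protein.append(aa)
        if aa == "*":  # stop codon: translation ends here
            break
    return "".join(protein)


# Hypothetical 8-codon CDS: ATG GCC ATT GTA ATG GGC CGC TGA
print(translate_cds("ATGGCCATTGTAATGGGCCGCTGA"))  # prints MAIVMGR*
```

The table has 4 x 4 x 4 = 64 entries, which is the counting argument from the slide: one or two nucleotides (4 or 16 combinations) would not cover 20 amino acids.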

Codon table • Each amino acid is specified by between one and six codons. • A handful of species deviate from the standard codon assignments and use some codons for different amino acids.
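
The "up to six codons" claim can be checked by counting entries in the codon table. The sketch below assumes the standard genetic code only (the variant codes mentioned above are not modelled), and the helper name `codons_per_amino_acid` is illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, T/C/A/G ordering; '*' marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_per_amino_acid(table):
    """Count how many codons specify each amino acid (stop codons excluded)."""
    return Counter(aa for aa in table.values() if aa != "*")


degeneracy = codons_per_amino_acid(CODON_TABLE)
# Leucine (L), serine (S) and arginine (R) have six codons each;
# methionine (M) and tryptophan (W) have only one.
for aa, n in sorted(degeneracy.items(), key=lambda item: (-item[1], item[0])):
    print(aa, n)
```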

RNA • RNA consists of a sugar-phosphate backbone, with bases attached to the 1' carbon of each sugar. • RNA differs from DNA in two ways: • RNA has a hydroxyl group on the 2' carbon of the sugar. • Where DNA uses thymine (T), RNA uses uracil (U). • Because of the extra hydroxyl group on the sugar, RNA does not readily form a stable DNA-like double helix and usually exists as a single-stranded molecule. However, regions of double helix can form where there is base-pair complementarity (U with A, G with C), resulting in hairpin loops. An RNA molecule with its hairpin loops is said to have a secondary structure. • An RNA molecule can form many different stable three-dimensional tertiary structures, because it is not restricted to a rigid double helix.
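
A small sketch of the two ideas above: the T-to-U substitution, and the base-pair complementarity that lets two arms of one strand fold into a hairpin stem. It considers only U-A and G-C pairs; the function names and the example hairpin are assumptions made for illustration.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def transcribe(coding_strand):
    """Write the RNA version of a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def reverse_complement_rna(rna):
    """Return the sequence that would pair with `rna`, read 5'->3'."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))


def forms_stem(left_arm, right_arm):
    """True if the two arms are fully complementary (U-A and G-C pairs only)."""
    return right_arm == reverse_complement_rna(left_arm)


print(transcribe("ATGC"))  # prints AUGC

# Hypothetical hairpin: stem GCAUC, loop UUCG, stem GAUGC
print(forms_stem("GCAUC", "GAUGC"))  # prints True
```

Real secondary-structure prediction also scores loops, mismatches and pairing energies; this check only tests perfect complementarity.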

Open Reading Frames (ORF) • On a given piece of DNA, there are 6 possible reading frames: an ORF can be on either the + or the - strand, in any of 3 frames per strand. • Frame 1: the 1st base of the start codon falls at base 1, 4, 7, 10, ... • Frame 2: the 1st base of the start codon falls at base 2, 5, 8, 11, ... • Frame 3: the 1st base of the start codon falls at base 3, 6, 9, 12, ... • Frames -1, -2 and -3 are on the minus strand. • An open reading frame starts with ATG in most species and ends with a stop codon (TAA, TAG or TGA). • The SIXFRAME program can be used directly at http://searchlauncher.bcm.tmc.edu/seq-util/Options/sixframe.html
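
The six-frame scan can be sketched as follows. This is a simplified illustration, not the SIXFRAME program: it assumes an ORF runs from the first in-frame ATG to the next in-frame stop codon, and the function name `find_orfs`, its `min_codons` parameter and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the minus strand, read 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Return (frame, orf_sequence) pairs for all six reading frames.

    Frames +1..+3 are on the given strand, -1..-3 on its reverse complement.
    Each ORF includes its start (ATG) and stop codon.
    """
    dna = dna.upper()
    orfs = []
    for sign, strand in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open an ORF at the first ATG in this frame
                elif start is not None and codon in STOP_CODONS:
                    orf = strand[start:i + 3]
                    if len(orf) // 3 >= min_codons:
                        orfs.append((sign * (offset + 1), orf))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence with one ORF on each strand.
print(find_orfs("ATGAAATAGCTTACCCCAT"))
# prints [(1, 'ATGAAATAG'), (-1, 'ATGGGGTAA')]
```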

Eukaryotic Nuclear Gene Structure • Gene prediction for Pol II transcribed genes. • Upstream enhancer elements. • Upstream promoter elements: GC box (-90 nt, 20 bp), CAAT box (-75 nt, 22 bp). • TATA promoter (-30 nt; 70%; 15 nt consensus, Bucher et al. (1990)). • 14-20 nt spacer DNA. • CAP site (8 bp). • Transcription initiation. • Transcript region, interrupted by introns. • Translation initiation (Kozak signal, 12 bp consensus), 6 bp prior to the initiation codon. • polyA signal (AATAAA 99%, other).

Introns • The transcript region is interrupted by introns. Each intron: • starts with a donor site consensus (G 100%, T 100%, A 62%, A 68%, G 84%, T 63%, ...) • has a branch site near the 3’ end of the intron (one not very conserved consensus: UACUAAC) • ends with an acceptor site consensus (12 pyrimidines, N, C 65%, A 100%, G 100%) • Schematic of the intron in RNA: GU ... UACUAAC ... AG
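
The invariant parts of these consensus sequences (GT at the donor site, AG at the acceptor site) give the simplest possible intron rule. The sketch below lists every GT...AG span above a minimum length; the function name `candidate_introns`, the `min_length` parameter and the toy sequence are assumptions, and the branch site and the less conserved positions are ignored.

```python
def candidate_introns(dna, min_length=10):
    """List (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and end-exclusive, so dna[start:end] is the
    candidate intron. Only the invariant dinucleotides are checked.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptor_ends = [j + 2 for j in range(len(dna) - 1) if dna[j:j + 2] == "AG"]
    return [
        (d, a)
        for d in donors
        for a in acceptor_ends
        if a - d >= min_length
    ]


# Toy gene: exon ATGGCC, intron GTAAGTTACTAACCTTTTCAG, exon GGCTGA
gene = "ATGGCCGTAAGTTACTAACCTTTTCAGGGCTGA"
for start, end in candidate_introns(gene):
    print(start, end, gene[start:end])
```

Even in this short sequence the rule yields two candidates, which is why gene finders combine it with the weighted consensus positions and other evidence.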

Exons • The exons of the transcript region are composed of: • 5’ UTR (mean length of 769 bp, with a specific base composition that depends on the local G+C content of the genome) • AUG (or other start codon) • Remainder of coding region • Stop codon • 3’ UTR (mean length of 457 bp, with a specific base composition that depends on the local G+C content of the genome)

Non-Coding Eukaryotic DNA • Untranslated regions (UTRs) • Introns (there can be genes within the introns of another gene!) • Intergenic regions: • repetitive elements • pseudogenes (dead genes that may or may not have been retroposed back into the genome as a single-exon “gene”)

Repeats • Each repeat family has many subfamilies. • Alu: ~300 nt long; 600,000 elements in the human genome. Can cause false homology with mRNA. Many have an AluI restriction site. • Retroposons (can get copied back into the genome). • LINEs (Long INterspersed Elements): L1, 1-7 kb long, 50,000 copies. • SINEs (Short INterspersed Elements).

Low-Complexity Elements • When analyzing sequences, one often relies on the fact that two stretches are similar to infer that they are homologous (and therefore related). But sequences with repeated patterns will match without there being any phylogenetic relation! • Sequences like ATATATACTTATATA, which consist mostly of two letters, are called low-complexity. • Triplet repeats (particularly CAG) have a tendency to make the replication machinery stutter, so they are amplified. • Low-complexity sequence can also be hidden at the translated protein level.
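
One common way to make "low-complexity" measurable is the Shannon entropy of the base composition in a sliding window. The sketch below uses that measure as an illustrative stand-in; it is not the algorithm of any particular masking program, and the window size and threshold are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy in bits of the letter composition (0 = one letter, 2 = uniform ACGT)."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, threshold=1.5):
    """Return start positions of windows whose entropy is below the threshold."""
    seq = seq.upper()
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) < threshold
    ]


print(round(shannon_entropy("ATATATACTTATATA"), 2))   # example from the slide: well below 2 bits
print(round(shannon_entropy("ACGTACGTACGTACGT"), 2))  # all four bases equally used: 2.0 bits
print(low_complexity_windows("ACGTACGTACGTATATATATATATAT"))
```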

Structure of the Eukaryotic Genome • ~6-12% of human DNA encodes proteins. • ~10% of human DNA codes for UTR • ~90% of human DNA is non-coding.

Masking • To avoid finding spurious matches in alignment programs, you should always mask repeats and low-complexity regions in the query sequence. • Before predicting genes it is a good idea to mask out repeats (at least those containing ORFs). • Before running blastn against a genomic record, you must mask out the repeats. • Most used programs: • GenScan: http://genes.mit.edu/GENSCAN.html • RepeatMasker: http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
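
What masking does to a sequence can be shown with a short sketch. It assumes the repeat coordinates are already known (for example from a repeat-finding program) and are given as 0-based, end-exclusive intervals; the function name `mask_intervals` and the example coordinates are hypothetical, and no real tool's output format is parsed here.

```python
def mask_intervals(seq, intervals, soft=False):
    """Mask the given (start, end) intervals of a sequence.

    Hard masking replaces bases with 'N'; soft masking (soft=True) keeps the
    bases but writes them in lower case.
    """
    masked = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower() if soft else "N"
    return "".join(masked)


query = "ACGTACGTAC"
print(mask_intervals(query, [(2, 5)]))             # prints ACNNNCGTAC
print(mask_intervals(query, [(2, 5)], soft=True))  # prints ACgtaCGTAC
```

Hard-masked positions can no longer seed a match in an alignment search, which is how masking suppresses spurious hits to repeats.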

Chromosomal structure • Located in the nucleus • Each chromosome consists of a single molecule of DNA and its associated proteins • The DNA and protein complex found in eukaryotic chromosomes is called chromatin • 1/3 DNA and 2/3 protein • Complex interactions between proteins and nucleic acids in the chromosomes regulate gene and chromosomal function

Ideogram • Diagrammatic representation of a karyotype • Individual chromosomes are recognized by: • arm lengths (p, short; q, long) • centromere position (metacentric, sub-metacentric, acrocentric, telocentric) • staining (banding) patterns • From Miller & Therman (2001) Human Chromosomes, Springer

Chromosome banding • Q (quinacrine) and G (Giemsa) banding preferentially stain AT-rich regions • R (reverse) banding preferentially stains GC-rich regions • C-banding (denaturation and staining) preferentially stains constitutive heterochromatin, found in the centromere regions and distal Yq

June 26, 2000 at the White House

Initial Analysis of the Human Genome

http://www.sanger.ac.uk/HGP/draft2000/gfx/fig2.gif

Genome Mapping • STS: sequence-tagged sites (short segments of unique DNA on every chromosome, each defined by a pair of PCR primers that amplifies only one segment of the genome) • BAC: bacterial artificial chromosome, 100-400 kb • YAC: yeast artificial chromosome, 150 kb-1.5 Mb • Contig: assembled contiguous overlapping segments of DNA from BACs and YACs • ESTs: expressed sequence tags • UniGene database: a database for ESTs

Shotgun Sequencing • Segments are short (~2 kb) • Problem with repeated segments or genes • Figure from Concepts in Biochemistry, 2nd Ed., R. Boyer
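
The idea of rebuilding a sequence from overlapping short segments can be sketched with a toy greedy assembler. This is an illustration under strong assumptions (error-free reads, same strand, very short toy reads rather than ~2 kb segments) and not the method of any real assembler; the function names and reads are made up.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_length)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:  # no remaining overlaps: leave separate contigs
            break
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
# prints ['ATGCGTACGTTAGC']
```

When a repeat is longer than the reads, several reads share the same overlap and the greedy choice becomes ambiguous, which is the repeat problem noted above.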

History of the Human Genome Project • 1956 Physical map: 24 types and a total set of 46 chromosomes • 1977 Sanger publishes dideoxy sequencing method • 1980 Botstein proposes human genetic map using RFLPs • 1987 US DOE publishes report discussing HGP • 1988 HUGO is established • 1990 Official start of HGP with $3 billion and a 15-year horizon • 1991 Genome Database (GDB) is established • 1992 Genethon publishes map based on microsatellites • 1995 Lander et al. detailed map based on sequence-tagged sites • 1998 Comprehensive map based on gene markers • 1999 Sanger Centre publishes chromosome 22 • 2001 Draft genome published: Celera & public • 2003 Completion (almost) of the human genome • Strachan and Read, HMG3 p213

The Human Genome I • [Figure: the human genome at successive magnifications: chromosomes 1-22, X, Y and the mitochondrial genome with their sizes in Mb (3.2×10^9 bp in total), with myoglobin and α-globin marked; the β-globin region on chromosome 11 (6×10^4 bp); a single gene with exons 1-3 and 5’ and 3’ flanking regions (3×10^3 bp); and 30 bp of DNA (ATTGCCATGTCGATAATTGGACTATTTGGA) with the amino acids it encodes.] • http://www.sanger.ac.uk/HGP/ & R. Harding & HMG (2004) p 245

The Human Genome II • Gene families • Clustered: α-globins (7), growth hormone (5), Class I HLA heavy chain (20), ... • Dispersed: pyruvate dehydrogenase (2), aldolase (5), PAX (>12), ... • Clustered and dispersed: HOX (38 genes in 4 clusters), histones (61 in 2), olfactory receptors (>900 in 25), ... • Strachan and Read (2004) Chapter 9 + Lander et al. (2001), http://www.sanger.ac.uk/HGP/

Human Genes and Gene Structures I • Presently estimated gene number: 24,000 • Average gene size: 27 kb • Largest gene: dystrophin, 2.4 Mb, 0.6% coding, 16 hours to transcribe. • Shortest gene: tRNA-Tyr, 100% coding • Largest exon: ApoB exon 26, 7.6 kb; smallest: <10 bp • Average exon number: 9 • Largest exon number: titin, 363; smallest: 1 • Largest intron: WWOX intron 8, 800 kb; smallest: tens of bp • Largest polypeptide: titin, 38,138 amino acids; smallest: tens of amino acids (small hormones). • Intronless genes: mitochondrial genes, many RNA genes, interferons, histones, ... • Jobling, Hurles & Tyler-Smith (2004) HEG p 29 + HMG chapt. 9

How do we differ? – Let me count the ways • Single nucleotide polymorphisms (e.g. TGCATTGCGTAGGC vs. TGCATTCCGTAGGC): 1 every few hundred bp, mutation rate* ≈ 10^-9 • Short indels (= insertions/deletions; e.g. TGCATT---TAGGC vs. TGCATTCCGTAGGC): 1 every few kb, mutation rate very variable • Microsatellite (STR) repeat number (e.g. TGCTCATCATCATCAGC vs. TGCTCATCA------GC; ≤100 bp): 1 every few kb, mutation rate ≤ 10^-3 • Minisatellites (1-5 kb): 1 every few kb, mutation rate ≤ 10^-1 • Repeated genes: rRNA, histones • Large inversions, deletions: rare, e.g. Y chromosome • *per generation
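
The aligned example sequences on this slide can be compared column by column to separate substitutions from gaps. The sketch below assumes the two sequences are already aligned with '-' as the gap character; the function name `compare_aligned` and its return format are illustrative choices.

```python
def compare_aligned(a, b):
    """Compare two aligned sequences of equal length.

    Returns (snps, indels): snps is a list of (position, base_a, base_b),
    indels is a list of (start, end) gap runs, 0-based and end-exclusive.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    snps, indels = [], []
    gap_start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x == "-" or y == "-":
            if gap_start is None:
                gap_start = i  # a gap run opens here
        else:
            if gap_start is not None:
                indels.append((gap_start, i))
                gap_start = None
            if x != y:
                snps.append((i, x, y))
    if gap_start is not None:  # gap run reaching the end of the alignment
        indels.append((gap_start, len(a)))
    return snps, indels


print(compare_aligned("TGCATTGCGTAGGC", "TGCATTCCGTAGGC"))        # ([(6, 'G', 'C')], [])
print(compare_aligned("TGCATT---TAGGC", "TGCATTCCGTAGGC"))        # ([], [(6, 9)])
print(compare_aligned("TGCTCATCATCATCAGC", "TGCTCATCA------GC"))  # ([], [(9, 15)])
```

The third pair differs by two copies of the CAT repeat unit, so a 6-base gap there reflects a change in microsatellite repeat number rather than an ordinary indel.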

Gene Number • Walter Gilbert [1980s] 100k • Antequera & Bird [1993] 70-80k • John Quackenbush et al. (TIGR) [2000] 120k • Ewing & Green [2000] 30k • Tetraodon analysis [2001] 35k • Human Genome Project (public) [2001] ~ 31k • Human Genome Project (Celera) [2001] 24-40k • Mouse Genome Project (public) [2002] 25k -30k • Lee Rowen [2003] 25,947

Gene finding • Rules: • ATG • TAA, TGA, TAG • GT ... AG • Compositional features: • Exon lengths • Intron lengths • Codon bias • General genomic properties • Homology
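
Of the compositional features listed, codon bias is the easiest to compute: it is the relative frequency of each codon in a set of coding sequences. The sketch below tabulates it for a single in-frame sequence; the function name `codon_usage` and the toy sequence are assumptions, and a real gene finder would compare such frequencies with those of known genes from the same genome.

```python
from collections import Counter


def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    total = len(codons)
    return {codon: n / total for codon, n in Counter(codons).items()}


# Toy example: leucine encoded three times by CTG and once by TTA.
print(codon_usage("CTGCTGCTGTTA"))  # prints {'CTG': 0.75, 'TTA': 0.25}
```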
