Skip this Video
Download Presentation
Assigned reading: Stein, L. 2001. Genome annotation: from sequence to biology. Nat Rev Genet 2: 493-503.

Loading in 2 Seconds...

play fullscreen
1 / 36

Assigned reading: Stein, L. 2001. Genome annotation: from sequence to biology. Nat Rev Genet 2: 493-503. DHaeseleer, P. - PowerPoint PPT Presentation

  • Uploaded on

Unit 2.5: Genome Annotation, Gene Prediction, and DNA motifs Objectives: -learn what is meant by “genome annotation” and why this is important -understand why some genomic features are relatively easy to annotate, and others are not

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Assigned reading: Stein, L. 2001. Genome annotation: from sequence to biology. Nat Rev Genet 2: 493-503. DHaeseleer, P. ' - iman

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Unit 2.5: Genome Annotation, Gene Prediction, and DNA motifs


-learn what is meant by “genome annotation” and why this is important

-understand why some genomic features are relatively easy to annotate, and others are not

-learn the major ways of representing transcription factor binding sites and the advantages and disadvantages of each

Assigned reading:

Stein, L. 2001. Genome annotation: from sequence to biology. Nat Rev Genet 2: 493-503.

D\'Haeseleer, P. 2006. What are DNA sequence motifs? Nat Biotechnol 24: 423-425.


Genome Sequencing

(As of 6/25/04) (As of 7/25/05) (As of 7/20/07)As of 6/29/08

(1128)(1496)(2680) (3825) genome projects:

199(274)(579)827complete (includes (28) (36)(49)94 eukaryotes)

508 (728)(1285) 1932 prokaryotic genomes in progress

421(494)(721) 936eukaryotic genomes in progress

small: archaebacterium Nanoarchaeum equitans 500 kb

Bacillus anthracis (anthrax)5228 kb

S. cerivisiae (yeast)12,069 kb

Arabidopsis thaliana 115,428 kb

Drosophila melanogaster (fruit fly)137,000 kb

Anopheles gambiae (malaria mosquito)278,000 kb

Oryza sativa (rice)420,000 kb

Mus musculus (mouse)2,493,000 kb

Homo sapiens (human)2,900,000 kb


Genome sequencing helps in:

    • identifying new genes (“gene discovery”)
    • looking at chromosome organization and structure
    • finding gene regulatory sequences
    • comparative genomics
  • These in turn lead to advances in:
    • medicine
    • agriculture
    • biotechnology
    • understanding evolution and other basic science questions

Because of the vast amounts of data that are generated, we need new approaches

  • high throughput assays
  • robotics
  • high speed computing
  • statistics
  • bioinformatics

Understanding the genome

We know the sequence—but can we understand it?

Anna Pavlovna\'s drawing room was gradually filling. The highest Petersburg society was assembled there: people differing widely in age and character but alike in the social circle to which they belonged. Prince Vasili\'s daughter, the beautiful Helene, came to take her father to the ambassador\'s entertainment; she wore a ball dress and her badge as maid of honor. The youthful little Princess Bolkonskaya, known as la femme la plus seduisante de Petersbourg, was also there. She had been married during the previous winter, and being pregnant did not go to any large gatherings, but only to small receptions. Prince Vasili\'s son, Hippolyte, had come with Mortemart, whom he introduced. The Abbe Morio and many others had also come.

To each new arrival Anna Pavlovna said, "You have not yet seen my aunt," or "You do not know my aunt?" and very gravely conducted him or her to a little old lady, wearing large bows of ribbon in her cap, who had come sailing in from another room as soon as the guests began to arrive; and slowly turning her eyes from the visitor to her aunt, Anna Pavlovna mentioned each one\'s name and then left them.

--Tolstoy, War and Peace


Understanding the genome

We don’t know the language:

Гостиная Анны Павловны начала понемногу наполняться. Приехала высшая знать Петербурга, люди самые разнородные по возрастам и характерам, но одинаковые по обществу, в каком все жили; приехала дочь князя Василия, красавица Элен, заехавшая за отцом, чтобы с ним вместе ехать на праздник посланника. Она была в шифре и бальном платье. Приехала и известная, как la femme la plus séduisante de Pétersbourg1, молодая, маленькая княгиня Болконская, прошлую зиму вышедшая замуж и теперь не выезжавшая в большой свет по причине своей беременности, но ездившая еще на небольшие вечера. Приехал князь Ипполит, сын князя Василия, с Мортемаром, которого он представил; приехал и аббат Морио и многие другие.

— Вы не видали еще, — или: — вы не знакомы с ma tante?2 — говорила Анна Павловна приезжавшим гостям и весьма серьезно подводила их к маленькой старушке в высоких бантах, выплывшей из другой комнаты, как скоро стали приезжать гости, называла их по имени, медленно переводя глаза с гостя на ma tante, и потом отходила.

Все гости совершали обряд приветствования никому не известной, никому не интересной и не нужной тетушки. Анна Павловна с грустным, торжественным участием следила за их приветствиями, молчаливо одобряя их. Ma tante каждому говорила в одних и тех же выражениях о его здоровье, о своем здоровье и о здоровье ее величества, которое нынче было, слава Богу, лучше. Все подходившие, из приличия не выказывая поспешности, с чувством облегчения исполненной тяжелой обязанности отходили от старушки, чтоб уж весь вечер ни

--Tolstoy, War and Peace


Understanding the genome

Even if we did, we don’t know the grammar, or punctuation:


--Tolstoy, War and Peace


In order to make use of the genome sequence, we need to understand all of its components. Assigning identities and functions to sequences within the genome is called genome annotation.


“With the complete human genome sequence now in hand, we face the enormous challenge of interpreting it and learning how to use that information to understand the biology of human health and disease. The ENCyclopedia Of DNA Elements (ENCODE) Project is predicated on the belief that a comprehensive catalog of the structural and functional components encoded in the human genome sequence will be critical for understanding human biology well enough to address those fundamental aims of biomedical research. Such a complete catalog, or "parts list," would include protein-coding genes, non–protein-coding genes, transcriptional regulatory elements, and sequences that mediate chromosome structure and dynamics; undoubtedly, additional, yet-to-be-defined types of functional sequences will also need to be included.”


What’s in a genome?

  • Genes (i.e., protein coding)
  • But. . . only <2% of the human genome encodes proteins
  • Other than protein coding genes, what is there?
    • genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
    • structural sequences (scaffold attachment regions)
    • regulatory sequences
    • non-functional “junk” ?
  • It’s still uncertain/controversial how much of the genome is composed of any of these classes
  • The answers will come from experimentation and bioinformatics.

Current human genome annotations can be viewed using the UCSC genome browser, as we saw in Unit 2-4.


The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence.

  • pilot phase focused on 30 Mb (~ 1%) of the genome
  • international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function
  • now in its second phase, extending study to entire human genome

Functional genomic elements being identified by the ENCODE pilot phase

The ENCODE Project Consortium Science 306, 636 -640 (2004)

Published by AAAS


protein-coding genes, non–protein-coding genes

  • easier to find than other functional elements
  • why?
  • genes are transcribed—which means that we can identify them by looking at RNA
    • traditionally this has been done by cDNA or EST sequencing, more recently by microarray, SAGE, MPSS, etc.

protein-coding genes, non–protein-coding genes

  • we can also find genes ab initio using computational methods
  • this is most suited to protein-coding genes
  • why?
  • protein-coding genes have recognizable features
    • open reading frames (ORFs)
    • codon bias
    • known transcription and translational start and stop motifs (promoters, 3’ poly-A sites)
    • splice consensus sequences at intron-exon boundaries

ab initio gene discovery

  • Protein-coding genes have recognizable features
  • We can design software to scan the genome and identify these features
  • Some of these programs work quite well, especially in bacteria and simpler eukaryotes with smaller and more compact genomes
  • It’s a lot harder for the higher eukaryotes where there are a lot of long introns, genes can be found within introns of other genes, etc.
    • We tend to do OK finding protein coding regions, but miss a lot of non-coding 5’ exons and the like

ab initio gene discovery—validating predictions and refining gene models

  • Standard types of evidence for validation of predictions include:
    • match to previously annotated cDNA
    • match to EST from same organism
    • similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank
      • (translation works better—why?)
    • protein structure prediction match to a PFAM domain
    • associated with recognized promoter sequences, ie TATA box, CpG island
    • known phenotype from mutation of the locus

Finding non–protein-coding genes

  • e.g., tRNA, rRNA, snoRNA, miRNA, various other ncRNAs
  • Harder to find than protein-coding genes
  • Why?
    • often not poly-A tailed—don’t end up in cDNA libraries
    • no ORF
    • constraint on sequence divergence at nucleotide not protein level, so homology is harder to detect
  • So, how do we find these?

Finding non–protein-coding genes

  • secondary structure
  • homology, especially alignment of related species
  • experimentally
    • isolation through non-polyA dependent cloning methods
    • microarrays

ab initio gene discovery—approaches

Most gene-discovery programs makes use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to “learn” how to find a pattern.

Two common machine learning approaches used in gene discovery (and many other bioinformatics applications) are artificial neural networks (ANNs) and hidden Markov models (HMMs).


begin gene region











end gene region

ab initio gene discovery—HMMs

An example state diagram for an HMM for gene discovery is this simplified version of one used by Genescan:





5’ UTR


3’ UTR



single exon

Each box and arrow has associated transition probabilities, and emission probabilities for emission of nucleotides (dotted arrow). These are learned from examples of known gene models and provide the probability that a stretch of sequence is a gene.

adapted from Gibson and Muse, A Primer of Genome Science


What about other genomic features?

  • Other than protein coding genes, what is there?
    • genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
    • structural sequences (scaffold attachment regions)
    • regulatory sequences
    • non-functional “junk” ?
  • We can begin to annotate regulatory sequences such as transcription factor binding sites and cis-regulatory modules.

Remember from Unit 2-2:

Control of Gene Expression—Transcription Factors

Transcription factors (TFs) are proteins that bind to the DNA and help to control gene expression. We call the sequences to which they bind transcription factor binding sites (TFBSs), which are a type of cis-regulatory sequence.


Isalan et al. Biochemistry 37:12026

Transcription factors bind to specific DNA sequences

Usually, binding sites are first determined empirically.

Most transcription factors can bind to a range of similar sequences. We can represent these in either of two ways, as a consensus sequence, or as a position weight matrix (PWM).

Once we know the binding site, we can search the genome to find all of the (predicted) binding sites.


Wasserman, W. W. and A. Sandelin (2004). Nat Rev Genet5(4): 276-287.

Control of Gene Expression—Transcription Factors

Most transcription factors can bind to a range of similar sequences. We call this a binding “motif.”

We can represent these motifs either as a consensus sequence or as a frequency (or weight) matrix.


Binding site (motif) representations








7 characterized binding sites for a certain transcription factor:


consensus sequence:

A 111007200

T 302000502

G 110770060

C 254000015

Frequency matrix and its graphical depiction, a sequence logo:


Binding site (motif) representations

A consensus sequence is a one-line description of the TFBS, based on a column-by-column alignment of the individual known binding sites. The usual rule is:

A single base is shown if it occurs in more than half

the sites and at least twice as often as the second

most frequent base. Otherwise, a double degenerate

symbol (e.g., G/C= S) is used if two bases occur

in more than 75% of the sites, or a triple degenerate

symbol when one base does not occur at all.

A frequency matrix shows the actual frequencies of each base in each column. This can be easily converted to a position weight matrix (PWM), which is a normalized version of the frequency matrix that is therefore not dependent on the number of sites in the alignment.


Finding binding sites in the genome


Consensus sequences make searching easy—it’s a simple text search that can even be done using a word processor, or very simply programmed in a computer language such as Perl:


if ($_ =~ /[T|C]C[T|C]GGATGC/)

{do something;}


All positions in the motif are treated the same.


Identifying transcription factor binding sites

  • But PWMs are generally more useful:
    • they allow us to assign more importance to more invariant positions
    • they are related to the binding energy of the DNA-protein interaction
    • we can compare PWMs and we can score PWMs
  • Scores are based on the probability of a given nucleotide being in a given position.

Identifying transcription factor binding sites

A 1 1 1 0 0 7 2 0 0

T 3 0 2 0 0 0 5 0 2

G 1 1 0 7 7 0 0 6 0

C 2 5 4 0 0 0 0 1 5





Example 1:

TCCGGAAGC scores higher than TCCGGAACT scores higher than TCCGGAAAA as GC > CT > AA in the last two positions. Note that the latter two sequences would score the same if using only the consensus representation.


Identifying transcription factor binding sites

A 111007200

T 302000502

G 110770060

C 254000015


Example 2:

TCGGGAAGC and TCCAGATCT both have a single mismatch compared to the consensus. But the first is a much better binding site when scored using a PWM due to the strong conservation of the G in position 4 versus the weak requirement for the C or T in position 3.


Issues with finding binding sites in the genome

But it’s important to use caution: just because a sequence in the genome is a reasonable match to a known TFBS, this doesn’t necessarily mean that the TF is binding there in vivo. By crude calculation:

The probability of finding a 7 bp motif is 4-7 = 1/16,384

i.e., expect only about 1 motif every 16 kb.

So in human genome, this sequence should be present over 183,000 times! (>7x per gene!) Even in a 10 Mb genome, the sequence would occur over 600 times.

And this calculation does not even take into account motif degeneracy!

So we need to consider additional factors in deciding what predicited binding sites are important—such as how regulatory regions are organized


Empirical methods, such as ChIP-chip (see Unit 2-3) are a good alternative for looking at in vivo binding; bioinformatics methods can be combined with this to determine the transcription factor binding motifs.


Genome Annotation—Transcription Factors

Because of the difficulty in accurately predicting bona fide, functional TFBSs, most current genome annotation focuses on empirically determined sites. Several databases curate these data, e.g. the Open Regulatory Annotation database (ORegAnno) and the Regulatory Element Database for Drosophila (REDfly). Tracks displaying these data can be found in the UCSC Genome Browser. These databases also curate cis-regulatory module sequences, which at present can only reliably be determined by empirical methods.


Genome Annotation—much work remains

Despite good progress in identifying both protein coding and non-protein coding genes, much work remains to be done before even the best-studied genomes are fully annotated. For the higher eukaryotes, only a tiny percentage of features such as TFBSs, CRMs, and other non-gene features have so far been indentified.