
Slide1 l.jpg

GENERAL STUFF

subject: Genome-based Functional Annotation (bacteria)

workload: 14 hrs

- 2 hrs lecture

- 12 hrs assignment

(in 4 parts; so on average 3 hrs per part; not ready yet)

hand in: rtf-file, pdf-file or ppt-file before 8 November

(later -1 point per day)

Christof Francke (Post-Doc/Scientist; TI Food and Nutrition)


Slide2 l.jpg

Genome sequence annotation

From DNA to function

Bioinformatics Seminar, Nijmegen 16 10 2007

Christof Francke

(Jos Boekhorst/ Michiel Wels)


Slide3 l.jpg

Promised you a miracle

promises, promises


Slide4 l.jpg

Answering biological questions

Why does Bacillus anthracis kill humans? (anthrax; Dutch: miltvuur)

B. anthracis

We have the genomes, so now we know...?


Slide5 l.jpg

Once the genome has been sequenced, what do we know, and what can we do?

Inventory:

- predict functionality of encoded proteins

- defects in genes (disease)

- lineage



Slide6 l.jpg

The quest for an appropriate translation of sequence to knowledge

DNA

sequencing (assembly)

identifying genes

Part I

protein

function prediction

function

reconstruction / modeling

biology


Slide7 l.jpg

Bacterial Genomics in Nijmegen

Biological questions of interest to the Dutch food industry

How can we improve the cell as a factory?

- produce compounds

- improve taste

How can we prevent spoilage?

- spores, biofilms, fungi

How can we improve health?

- interaction between bacteria and host (probiotics)



Slide9 l.jpg

The organization of genetic information in bacteria

Most Open Reading Frames are preceded by regulatory elements (cis-acting elements).

[Figure: DNA sequence with a promoter and an ORF; RNA polymerase transcribes the ORF into mRNA. Regulators are marked A (+) and R (-).]

AACGTTGACTGACGTGTCACGTCCCGTATATCGATGTCGTAGCTGATGGCGCGAAATCGATCGGTCGATATAGCGGCCGGATATCGCGATAGC

RNA polymerase binding is affected by regulatory proteins (trans-acting elements; activation, repression).


Slide10 l.jpg

The organization of genetic information in bacteria

Operon: Gene 1, Gene 2 and Gene 3 are transcribed into a single mRNA. Each gene has its own translation start and yields its own protein (Protein 1, Protein 2, Protein 3).

Regulon: multiple operons regulated by the same transcription factor.
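The operon/regulon hierarchy can be sketched as a small data structure. This is only an illustration: the operon, gene and transcription-factor names below are hypothetical placeholders and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical operons: each lists its genes (in transcription order)
# and the transcription factor assumed to regulate it.
OPERONS = {
    "operonA": {"genes": ["gene1", "gene2", "gene3"], "regulator": "TF1"},
    "operonB": {"genes": ["gene4", "gene5"], "regulator": "TF1"},
    "operonC": {"genes": ["gene6"], "regulator": "TF2"},
}

def build_regulons(operons):
    """Group operons by shared regulator: one regulon per transcription factor."""
    regulons = defaultdict(list)
    for name, info in operons.items():
        regulons[info["regulator"]].append(name)
    return dict(regulons)

def regulon_genes(operons, regulator):
    """Return all genes belonging to the regulon of the given transcription factor."""
    return [gene
            for info in operons.values() if info["regulator"] == regulator
            for gene in info["genes"]]
```

With this toy data, TF1 controls a regulon of two operons (five genes), while TF2 controls a single operon.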


Slide11 l.jpg

DNA sequencing


Slide12 l.jpg

Whole genome shotgun sequencing

Fraser et al., Nature 2000, 406: 799-803.


Slide13 l.jpg

I) The sequencing and assembly process

Wet lab: Raw Data Production

- 4 x ABI 3700 sequencer

- >1.5 million nucleotides per day

Data Transfer

Bio-informatics

- Genome assembly

- Automated genome annotation

- In-house database, >5000 BLASTs per day


Slide14 l.jpg

Genome assembly

Initially there are a lot of gaps


Slide15 l.jpg

Methods for mapping contigs

Figure 3 Sources of linking information between contigs. (A) overlaps, (B) clone mates, (C) alignments to reference genome, (D) alignments to physical maps, (E) conservation of gene synteny.
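Linking source (A), overlaps, can be illustrated with a minimal sketch: find the longest suffix of one contig that matches a prefix of another and join them there. Real assemblers tolerate sequencing errors and use quality scores; this exact-match version and its `min_len` parameter are simplifying assumptions.

```python
def suffix_prefix_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0

def merge_contigs(left, right, min_len=3):
    """Join two contigs on their overlap, or return None if they do not overlap."""
    k = suffix_prefix_overlap(left, right, min_len)
    if k == 0:
        return None
    return left + right[k:]
```

Contigs without such an overlap stay separated by a gap, which is where the other linking sources (clone mates, reference genomes, physical maps, synteny) come in.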


Slide16 l.jpg

The first Dutch bacterial genome sequence

(2003) Proc Natl Acad Sci USA 100, 1990


Slide17 l.jpg

New technology: 454 sequencing

Advantage: relatively fast, reliable and no sequence preference

Disadvantage: short reads, difficult assembly

Nowadays most sequencing efforts are hybrid


Slide18 l.jpg

Identifying genes

AGCGGTGTCGATCGGCGCTATAGCGCATGCGTATAGCGTATATCGATGTCGTAGCTGATGGCGCGAAATCGATCGGTCGATATAGCGGCCGGATATCGCGATATGCTATAGC


Slide19 l.jpg

The identification of Open Reading Frames

AGCGGTGTCGATCGGCGCTATAGCGCATGCGTATAGCGTATATCGATGTCGTAGCTGATGGCGCGAAATCGATCGGTCGATATAGCGGCCGGATATCGCGATATGCTATAGC

TGTCGATCGGCGCTATAGCGCATGCGTATAGCGTATATCGATGTCGTAGCTGATGGCGCGAAATCGATCGGTCGATATAGCGGCCGGATATCGCATATGCTATAGCACGTTTG

Different visualization: look at possible reading frames
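Looking at the possible reading frames can be sketched as a scan of all six frames (three per strand) for stretches without a stop codon. This is a simplification: it reports stop-free stretches rather than ORFs bounded by a start codon, and the `min_codons` cut-off is an assumed parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def stop_free_stretches(seq, min_codons=10):
    """Yield (strand, frame, start, end) for stop-free stretches in all six frames.

    Coordinates are 0-based, end-exclusive, and relative to the strand scanned.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            for pos in range(frame, len(s) - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        yield strand, frame, start, pos
                    start = pos + 3
            # the last stretch runs up to the final complete codon of the frame
            end = frame + ((len(s) - frame) // 3) * 3
            if (end - start) // 3 >= min_codons:
                yield strand, frame, start, end
```

Long stop-free stretches are candidate coding regions; short ones are expected by chance in every frame.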


Slide20 l.jpg

Coding sequences characterized by:

a) the lack of stop codons
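Why the lack of stop codons is informative can be shown with a short calculation. The sketch below assumes a random sequence in which all 64 codons are equally likely, which real genomes only approximate.

```python
def prob_no_stop(n_codons, n_stop=3, n_total=64):
    """Chance that n_codons random codons contain no stop codon.

    Assumes every codon is equally likely (3 of the 64 codons are stops).
    """
    return ((n_total - n_stop) / n_total) ** n_codons
```

Under this assumption a stop-free run of 100 codons occurs by chance in under 1% of cases, so long open frames are unlikely to be accidental.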


Slide21 l.jpg

Characteristics of coding sequences:

b) Codon usage

         Leu : Ala : Trp
random     6 :   4 :   1
coding     7 :   7 :   1

In addition: codon bias!
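Codon usage and codon bias can be measured by counting codons in a candidate coding sequence. The sketch below is a minimal illustration; the function names are made up for this example, and only the four alanine codons of the standard genetic code are listed.

```python
from collections import Counter

def codon_counts(cds):
    """Count codons in a coding sequence, read in frame from its first base."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def relative_usage(counts, synonymous_codons):
    """Fraction of each synonymous codon among all codons for one amino acid."""
    total = sum(counts[c] for c in synonymous_codons)
    if total == 0:
        return {c: 0.0 for c in synonymous_codons}
    return {c: counts[c] / total for c in synonymous_codons}

# The four alanine codons of the standard genetic code.
ALA_CODONS = ("GCT", "GCC", "GCA", "GCG")
```

Fractions that differ strongly between the synonymous codons of one amino acid indicate codon bias.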


Slide22 l.jpg

Coding sequences characterized by:

c) Signals in the promoter region

Translation start: ATG (GTG, CTG)

Ribosome Binding Site: GGGAAGG
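These signals can be combined in a simple scan: look for a start codon preceded by a site resembling the ribosome binding site. The start codons and motif are the ones listed on the slide; the spacer range (`max_gap`) and the allowed mismatches are illustrative assumptions, and real gene finders use weighted models instead of a fixed motif.

```python
START_CODONS = ("ATG", "GTG", "CTG")   # as listed on the slide
RBS_MOTIF = "GGGAAGG"                  # as listed on the slide

def candidate_starts(seq, max_gap=12, max_mismatches=1):
    """Return (start_pos, rbs_pos) pairs where an RBS-like site precedes a start codon.

    The site must lie 1 to max_gap spacer bases upstream of the start codon.
    max_gap and max_mismatches are assumed values, not taken from the slides.
    """
    seq = seq.upper()
    hits = []
    for pos in range(len(seq) - 2):
        if seq[pos:pos + 3] not in START_CODONS:
            continue
        for gap in range(1, max_gap + 1):
            rbs_pos = pos - gap - len(RBS_MOTIF)
            if rbs_pos < 0:
                break  # ran off the 5' end of the sequence
            window = seq[rbs_pos:rbs_pos + len(RBS_MOTIF)]
            mismatches = sum(a != b for a, b in zip(window, RBS_MOTIF))
            if mismatches <= max_mismatches:
                hits.append((pos, rbs_pos))
                break  # keep only the closest matching site per start codon
    return hits
```

Positions are 0-based; a start codon with no RBS-like site upstream is not reported.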


Slide23 l.jpg

Problems associated with coding sequence recognition

[Figure labels: GI_000001, GI_000002]

Problems:

- many small putative CDS (cut-off)

- deviations in start site

- sequencing errors (frameshifts)


Slide24 l.jpg

Strategies to find Coding sequences

In practice, most gene-finding programs use HMMs to predict protein-encoding genes.

  • Train on a set of known genes:

      - Genes with a good database hit

      - Large genes with no overlap

      - Experimentally identified genes


Slide25 l.jpg

Strategies to find Coding sequences

Many different tools available:

Glimmer2, GeneMark, EasyGene, FrameD, ……

“Protein-coding regions in the genome sequence were identified using a combination of software tools including EasyGene [42], Glimmer [43] and FrameD [44].”



Slide27 l.jpg

What is function?

Inventory:

- What can it do?

- which conversions are catalyzed

- which metabolites are transported

- relates to physiology

- depends on environment

- with which component can it interact



Slide28 l.jpg

The attribute "function" is ambiguous

context independent (molecular function or properties); Chemistry/physics

- catalyze certain reactions

- interact with certain proteins

- bind to a specific DNA sequence

context dependent (role); Biology/physiology

- act in a certain pathway

- be a member of a certain protein complex(es)

- act as a transcription factor


Slide29 l.jpg

Annotation using a controlled vocabulary (ontologies)

In library and information science, a controlled vocabulary is a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search.

Descriptors of molecular function

- Enzymatic conversions: EC-number (IUPAC)

- Transport: TC-number (Saier)

Gene Ontology

BioPAX
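An EC number is a small controlled vocabulary in itself: four hierarchical levels, from enzyme class down to the individual reaction. The sketch below parses such numbers and compares them level by level; the helper names are made up for this example. Class 7 (translocases) was added to the scheme in 2018, after this lecture.

```python
# Top-level EC classes; class 7 was added in 2018.
EC_CLASSES = {
    1: "Oxidoreductases", 2: "Transferases", 3: "Hydrolases",
    4: "Lyases", 5: "Isomerases", 6: "Ligases", 7: "Translocases",
}

def parse_ec(ec_number):
    """Split an EC number such as '2.4.1.8' into its four hierarchical levels.

    Unassigned levels written as '-' are returned as None.
    """
    parts = ec_number.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {ec_number!r}")
    return tuple(None if p == "-" else int(p) for p in parts)

def shares_prefix(ec_a, ec_b, depth):
    """True if two EC numbers agree on their first `depth` assigned levels."""
    a, b = parse_ec(ec_a)[:depth], parse_ec(ec_b)[:depth]
    return None not in a and a == b
```

Two enzymes that agree on the first three levels catalyze the same type of reaction but differ in substrate, which is the kind of distinction a free-text annotation often blurs.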


Slide30 l.jpg

Genome Sequence and how it relates to function

There are several properties of the translated and non-translated genome sequence that are identifiers of the function/role of a protein

  • Evolutionary conservation of sequence

  • Operon composition

  • Regulatory connections

  • Connections in the cellular network

(molecular function)

(biological role)


Slide31 l.jpg

Evolutionary conservation of sequence

Homology as an indicator of functional similarity

homologs:

- Orthologs: supposed identical molecular function

- Paralogs: supposed similar molecular function

- In-Paralogs: diverged (similar molecular function)

[Figure: tree of homologs A1, B1, C1 and A2, B2, C2a, C2b]


Slide32 l.jpg

Evolutionary conservation of sequence

Strategy: transfer the annotation from an experimentally verified ortholog/equivalent

-> identify orthologs/equivalents


Slide33 l.jpg

Determining evolutionary relations:

Retrieving homologs

BLAST: yields similar sequences from a database

Example: map2 of L. plantarum

In a simple case: one good hit per genome
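The "one good hit per genome" situation can be checked by reducing BLAST output to the best hit per genome. The sketch assumes the standard 12-column tabular output; how a subject ID maps to a genome depends on the database, so that mapping is passed in as a function and is an assumption of this example.

```python
import csv

def best_hit_per_genome(tabular_lines, genome_of):
    """Keep the highest-bitscore hit per genome from BLAST tabular output.

    tabular_lines: lines in the standard 12-column tabular format
                   (qseqid sseqid pident length mismatch gapopen
                    qstart qend sstart send evalue bitscore).
    genome_of:     function mapping a subject ID to a genome name
                   (the ID convention is an assumption of this sketch).
    """
    best = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        subject, bitscore = row[1], float(row[11])
        genome = genome_of(subject)
        if genome not in best or bitscore > best[genome][1]:
            best[genome] = (subject, bitscore)
    return best
```

When a genome contributes several strong hits instead of one, the query probably has paralogs there, and a tree is needed to sort out which one is the ortholog.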


Slide34 l.jpg

Determining evolutionary relations

Procedure:

#Collect sequences and make multiple sequence alignment

MUSCLE: muscle -in FASTA.txt -out FASTA.aln


Slide35 l.jpg

Determining evolutionary relations:

Alignments and Trees

#Visualize multiple sequence alignment in CLUSTAL-X and check homogeneity (conserved features, few gaps)

#Create bootstrapped NJ-tree (corrected for multiple substitutions)
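The correction for multiple substitutions can be illustrated on a pair of aligned protein sequences. The sketch uses Kimura's empirical formula, one commonly used correction for protein distances; that this is the exact setting used in the lecture is an assumption. Tree building and bootstrapping are left to the alignment software.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of differing residues over alignment columns without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def kimura_protein_distance(p):
    """Kimura's empirical correction for multiple substitutions: -ln(1 - p - p^2/5)."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)
```

The corrected distance is always at least as large as the observed fraction of differences, because some positions have changed more than once.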


Slide36 l.jpg

Determining evolutionary relations:

Use tree and gene context to infer orthology/equivalency

Example: Lactobacillus plantarum has 4 maltose phosphorylase homologs

kojibiose (Chaen et al., J. Appl. Glycosci. 1999)

trehalose (Inoue et al., Biosci. Biotechnol. Biochem. 2002)

maltose (Huwel et al., Enzyme Microb. Techn. 1997)

maltose (Inoue et al., Biosci. Biotechnol. Biochem. 2001)

LOFT: R. vd Heijden et al., BMC Bioinformatics


Slide37 l.jpg

Evolutionary conservation of sequence

Gene order conservation to identify functional equivalents

[Figure: gene context of the maltose phosphorylase homologs (map2, map3, map2/3) in Lactobacillus plantarum, Lactobacillus gasseri, Bacillus subtilis, Bacillus licheniformis, Lactobacillus brevis, Pediococcus pentosaceus and Leuconostoc mesenteroides. Further labels in the figure: P1, P2, A, S, lacI, PGPH and locus numbers.]
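Gene order conservation can be quantified by comparing which gene families flank the gene of interest in each genome. The sketch below is a minimal version; the family labels used to call it are hypothetical, and the `window` size is an assumed parameter.

```python
def flanking_genes(context, target, window=2):
    """Set of gene families within `window` positions of `target`.

    `context` is an ordered list of gene-family labels along the chromosome.
    Raises ValueError if `target` is not present.
    """
    i = context.index(target)
    return set(context[max(0, i - window):i]) | set(context[i + 1:i + 1 + window])

def shared_context(context_a, context_b, target, window=2):
    """Gene families that flank `target` in both genomes."""
    return (flanking_genes(context_a, target, window)
            & flanking_genes(context_b, target, window))
```

Homologs that share neighbouring gene families across genomes are more likely to be functional equivalents than homologs found in unrelated contexts.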


Slide38 l.jpg

Molecular function versus Biological role

Map2 and Map3: identical molecular function but distinct biological roles


Slide39 l.jpg

Coffee Break

DNA

sequencing (assembly)

identifying genes

Part I

protein

function prediction

function

reconstruction / modeling

biology

