Biological Motivation: Gene Finding

Anne R. Haake

Rhys Price Jones


Gene Finding

Why do it?

  • Find and annotate all the genes within the large volume of DNA sequence data

    • how many genes in an organism? homologies?

  • Gain understanding of problems in basic science

    • e.g. gene regulation: what are the mechanisms involved in transcription, splicing, etc.?

  • The relative emphasis on these goals affects the design of computational approaches to gene finding.


Gene Finding by Biological Methods

  • Extract mRNA -> reverse transcribe to cDNA -> label cDNA -> probe a DNA library with the labeled cDNA -> gene found


Gene Finding by Computational Methods

  • Dependent on good experimental data to build reliable predictive models

  • Various aspects of gene structure/function provide information used in gene finding programs


Figure 12.3


The Informatics View of Genes

  • Genes are character strings embedded in much larger strings called the genome

  • Genes are composed of ordered elements associated with the fundamental genetic processes including transcription, splicing, and translation.


Gene Finding

  • Cells recognize genes from DNA sequence

    • find genes via their bioprocesses

  • Not so easy for us...


CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAGACTGGATATATCTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAGACTGGATATAT...




Types of Genes

  • Protein coding

    • most genes

  • RNA genes

    • rRNA

    • tRNA

    • snRNA (small nuclear RNA)

    • snoRNA (small nucleolar RNA)


3 Major Categories of Information Used in Gene Finding Programs

  • Signals/features: sequence patterns with functional significance, e.g. splice donor and acceptor sites, start and stop codons, and promoter features such as TATA boxes, TF binding sites, and CpG islands

  • Content/composition: statistical properties of coding vs. non-coding regions

    • e.g. codon bias; length of ORFs in prokaryotes; GC content

  • Similarity: compare the DNA sequence to known sequences in databases

    • Not only known proteins but also ESTs, cDNAs


Looking for Protein Coding Genes

  • Look for an ORF (begins with a start codon, ends with a stop codon, no internal stops)

    • Long (usually more than 60-100 aa)

    • More likely to be a real gene if homologous to a “known” protein

  • Look for basal signals

    • Transcription, splicing, translation

  • Look for regulatory signals

    • Depends on organism

      • Prokaryotes vs Eukaryotes

      • Vertebrate vs fungi
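The ORF criterion above (start codon, no internal stops, stop codon, minimum length) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_orfs`, the default threshold of 60 codons, and the restriction to ATG as the start codon are choices made here, not part of the slides. It scans only the forward strand; the reverse strand would be handled by running it on the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # assumption: GTG/TTG could be added for some species


def find_orfs(seq, min_codons=60):
    """Yield (start, end, frame) for ORFs on the forward strand.

    start/end are 0-based, end is exclusive and includes the stop codon.
    min_codons counts the coding codons (start included, stop excluded).
    """
    seq = seq.upper()
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon in START_CODONS:
                orf_start = i                      # first start codon opens the ORF
            elif orf_start is not None and codon in STOP_CODONS:
                if (i - orf_start) // 3 >= min_codons:
                    yield orf_start, i + 3, frame  # long enough: report it
                orf_start = None                   # stop codon closes the ORF


# Toy example: a 6-codon ORF in frame 2 (threshold lowered for the demo).
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(list(find_orfs(toy, min_codons=3)))
```

Taking the first start codon in each frame gives the longest candidate; choosing among alternative starts would need further evidence such as the signals discussed in later slides.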


Easier Problem: Gene Finding in Bacterial Genomes

Why?

  • Dense Genomes

  • Short intergenic regions

  • Uninterrupted ORFs

  • Conserved signals

  • Abundant comparative information

    • Complete genomes are available for many bacterial species


What Do Prokaryotic Genes Look Like?

5’ to 3’: promoter region (maybe), ribosome binding site (maybe), start codon, open reading frame, stop codon, termination sequence (maybe)


Prokaryotic Gene Expression

[Figure: DNA drawn as Promoter, Cistron1, Cistron2, ..., CistronN, Terminator. Transcription by RNA polymerase produces one mRNA (5’ to 3’) carrying cistrons 1, 2, ..., N, with Shine-Dalgarno (SD) sites in the polycistronic message. Translation (ribosome, tRNAs, protein factors) produces separate polypeptides, each with its own N and C terminus.]

Slide modified from: http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt


Open Reading Frame (ORF)

  • Any stretch of DNA that potentially encodes a protein

  • The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene


Open Reading Frames

Each grouping of the nucleotides into consecutive triplets constitutes a reading frame. There are three different reading frames in the 5’->3’ direction and a further three in the reverse direction on the opposite strand.

A sequence of triplets that contains no stop codon is an Open Reading Frame (ORF)

Example sequence (5’->3’): ACGTAACTGACTAGGTGAAT

Frame 1: ACG TAA CTG ACT AGG TGA

Frame 2: CGT AAC TGA CTA GGT GAA

Frame 3: GTA ACT GAC TAG GTG AAT
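As a rough illustration of the six reading frames, the sketch below splits the slide's example sequence into complete triplets for each frame on both strands. The function names and the frame labels ("+1" to "+3" for the given strand, "-1" to "-3" for the reverse complement) are conventions assumed here.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def reading_frames(seq):
    """Split a sequence into complete codons for all six reading frames."""
    frames = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Step through the strand three bases at a time, dropping any
            # incomplete triplet at the end.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames


for name, codons in reading_frames("ACGTAACTGACTAGGTGAAT").items():
    print(name, " ".join(codons))
```

Each frame in this short example contains at least one of TAA, TAG, or TGA on the forward strand, which is what one would expect of random sequence rather than a coding region.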


ORFs as Gene Candidates

  • A gene candidate is an open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent)

  • Most prokaryotic genes code for proteins that are 60 or more amino acids in length

  • The probability that a random sequence of n codons contains no stop codon is (61/64)^n

  • When n is 50, there is a probability of about 91% that the random sequence contains a stop codon

  • When n is 100, this probability exceeds 99%
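The arithmetic behind these figures can be checked directly. The sketch below assumes that all 64 codons are equally likely and independent, which real genomes only approximate; the function name is chosen here for illustration.

```python
def prob_contains_stop(n_codons):
    """Probability that n random codons include at least one stop codon.

    Assumes each of the 64 codons is equally likely and independent,
    so any one codon is a non-stop with probability 61/64.
    """
    return 1.0 - (61 / 64) ** n_codons


for n in (50, 100):
    print(n, round(prob_contains_stop(n), 3))
```

Under this model a run of 50 stop-free codons arises by chance roughly 9% of the time, while a run of 100 arises less than 1% of the time, which is why long ORFs are treated as evidence of a gene.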


Codon Bias

  • The genetic code is degenerate

    • Several equivalent (synonymous) triplet codons can code for the same amino acid

    • http://www.pangloss.com/seidel/Protocols/codon.html

  • Codon usage varies

    • organism to organism

    • gene to gene

  • Biological basis

    • Avoidance of codons similar to stop

    • Preference for codons that correspond to abundant tRNAs within the organism


Codon Bias: Gene Differences

Fraction of glycine codons used in two genes:

Amino acid   Codon   GAL4   ADH1
Gly          GGG     0.21   0
Gly          GGA     0.17   0
Gly          GGT     0.38   0.93
Gly          GGC     0.24   0.07

Slide modified from: http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt


Codon Bias: Organism Differences

  • Yeast genome: Arg specified by AGA 48% of the time (other five equivalent codons ~10% each)

  • Fruit fly genome: Arg specified by CGC 33% of the time (other five ~13% each)

  • Complete set of codon usage biases can be found at:

http://www.kazusa.or.jp/codon/
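Codon usage fractions like the ones quoted in these slides can be computed from any in-frame coding sequence. The following is a minimal sketch, assuming the standard genetic code and a sequence that starts in frame; `codon_usage` and the compact table construction are illustrative choices made here, not taken from the slides.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order ('*' = stop).
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_usage(cds):
    """Return {amino_acid: {codon: fraction}} for an in-frame coding sequence."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    by_aa = defaultdict(dict)
    for codon, n in counts.items():
        aa = CODON_TABLE.get(codon)
        if aa is not None:  # skip codons with ambiguous bases such as N
            by_aa[aa][codon] = n
    # Normalize counts so the synonymous codons of each amino acid sum to 1.
    return {
        aa: {codon: n / sum(codons.values()) for codon, n in codons.items()}
        for aa, codons in by_aa.items()
    }


print(codon_usage("GGTGGTGGTGGC")["G"])
```

Comparing such per-gene tables against a genome-wide table is one simple way to express the "gene to gene" and "organism to organism" differences described above.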


GC Content

  • GC content (the proportion of G and C relative to A and T) is a distinguishing feature of bacterial genomes

  • Varies dramatically across species

    • Serves as a means to identify bacterial species

  • For various biological reasons

    • Mutational bias of particular DNA polymerases

    • DNA repair mechanisms

    • horizontal gene transfer (transformation, transduction, conjugation)


GC Content

  • GC content in recently acquired genes may differ from that of the rest of the genome

  • This can lead to variations in the frequency of codon usage within coding regions

    • There may be significant differences in codon bias within different genes of a single bacterium’s genome
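A simple way to look for regions whose GC content differs from the rest of a genome is to compute it in sliding windows. The sketch below is only an illustration; the function names and the default window and step sizes are arbitrary assumptions, and ambiguous bases such as N are ignored.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def windowed_gc(seq, window=1000, step=500):
    """Return (start, gc_fraction) for sliding windows along the sequence."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, max(len(seq) - window, 0) + 1, step)
    ]


print(gc_content("GGCCAATT"))
print(windowed_gc("GGGGAAAA", window=4, step=4))
```

Windows whose value departs strongly from the genome-wide average would be candidates for recently acquired DNA, though a real analysis would also need a statistical threshold.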


Ribosome Binding Sites

  • The RBS, also known as the Shine-Dalgarno sequence (species-dependent), should bind well with the 3’ end of the 16S rRNA (part of the ribosome)

  • Usually found within 4-18 nucleotides of the start codon of a true gene


Shine-Dalgarno Sequence

  • A nucleotide sequence (consensus = AGGAGG) that is present in the 5'-untranslated region of prokaryotic mRNAs.

  • This sequence serves as a binding site for ribosomes and is thought to influence the reading frame.

  • If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4-18 nucleotides of an ORF’s start codon, that improves the ORF’s candidacy.
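One way to turn this rule into a check is to slide the AGGAGG consensus over the region upstream of a candidate start codon and keep the best ungapped match. The sketch below is an assumption-laden illustration: the function name, the match-count score, and the reading of "within 4-18 nucleotides" as the number of bases between the end of the 6-mer and the start codon are all choices made here. Sequences are written as DNA.

```python
SD_CONSENSUS = "AGGAGG"


def best_sd_match(seq, start_codon_pos, min_gap=4, max_gap=18):
    """Find the best ungapped match to AGGAGG upstream of a start codon.

    gap = number of nucleotides between the end of the 6-mer and the
    first base of the start codon (an assumed reading of "4-18").
    Returns (matches, gap) with matches in 0..6, or None if no window fits.
    """
    seq = seq.upper()
    k = len(SD_CONSENSUS)
    best = None
    for gap in range(min_gap, max_gap + 1):
        begin = start_codon_pos - gap - k
        if begin < 0:
            break  # ran off the 5' end of the sequence
        window = seq[begin:begin + k]
        matches = sum(a == b for a, b in zip(window, SD_CONSENSUS))
        if best is None or matches > best[0]:
            best = (matches, gap)
    return best


# Toy example: a perfect AGGAGG placed 7 nt upstream of the ATG at index 17.
toy = "TTTT" + "AGGAGG" + "ACTAACT" + "ATGAAA"
print(best_sd_match(toy, toy.index("ATG")))
```

A high match count would raise confidence in the ORF, in line with the last bullet; since the motif is species-dependent, the consensus string would need adjusting per organism.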


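Bacterial Promoter

-35 region: T82 T84 G78 A65 C54 A45 ...

(16-18 bp spacer) ...

-10 region: T80 A95 T45 A60 A50 T96 ... +1 (A or G)

The number after each base is the percentage of promoters that have that base at that position.

Not so simple: remember, these are consensus sequences.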
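Because these are consensus sequences, a promoter search has to score partial matches rather than look for an exact string. The sketch below is one hedged possibility: it reads the numbers in the slide as percent conservation of each consensus base, uses them as weights, and tries spacers of 16 to 18 bp. The scoring scheme and all names are assumptions made for illustration, not a method from the slides.

```python
# Consensus base and assumed percent conservation at each position.
MINUS_35 = [("T", 82), ("T", 84), ("G", 78), ("A", 65), ("C", 54), ("A", 45)]
MINUS_10 = [("T", 80), ("A", 95), ("T", 45), ("A", 60), ("A", 50), ("T", 96)]


def hexamer_score(hexamer, consensus):
    """Sum the percent conservation of every position matching the consensus."""
    return sum(pct for base, (cons, pct) in zip(hexamer, consensus) if base == cons)


def best_promoter(seq, spacers=(16, 17, 18)):
    """Return (score, pos35, spacer) for the best -35/-10 pair, or None."""
    seq = seq.upper()
    best = None
    for pos in range(len(seq)):
        for spacer in spacers:
            pos10 = pos + 6 + spacer  # start of the -10 hexamer
            if pos10 + 6 > len(seq):
                continue
            score = (hexamer_score(seq[pos:pos + 6], MINUS_35)
                     + hexamer_score(seq[pos10:pos10 + 6], MINUS_10))
            if best is None or score > best[0]:
                best = (score, pos, spacer)
    return best


# Toy example: perfect TTGACA and TATAAT hexamers separated by 17 bp.
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGA"
print(best_promoter(toy))
```

Weighting by conservation means a mismatch at a weakly conserved position (such as the third base of the -10 hexamer) costs less than one at a strongly conserved position.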


Termination Sequences

  • 3’ U tail: a run of uracils

  • Stem/loop

    • Inverted repeat immediately preceding the run of uracils
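The two features above suggest a simple pattern search: find a run of T (U in the transcript) and check whether the bases just before it form an inverted repeat. The sketch below is a deliberately naive illustration that requires a perfect stem of fixed length; real terminator predictors score imperfect stems and folding energy. All names and default values are assumptions made here.

```python
import re


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def find_terminators(seq, stem=6, loops=range(3, 9), min_t=4):
    """Return (stem_start, t_run_start) pairs for perfect stem-loops
    that end immediately before a run of at least min_t T's."""
    seq = seq.upper()
    hits = []
    for match in re.finditer(r"T{%d,}" % min_t, seq):
        t_start = match.start()
        if t_start < stem:
            continue
        right_arm = seq[t_start - stem:t_start]
        target = reverse_complement(right_arm)  # what the left arm must be
        for loop in loops:
            left_start = t_start - stem - loop - stem
            if left_start >= 0 and seq[left_start:left_start + stem] == target:
                hits.append((left_start, t_start))
                break
    return hits


# Toy example: GCCGCC ... GGCGGC forms a 6-bp stem with a 4-nt loop.
toy = "AA" + "GCCGCC" + "AAAA" + "GGCGGC" + "TTTTTT" + "AA"
print(find_terminators(toy))
```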

