Computational genomics
This presentation is the property of its rightful owner.
Sponsored Links
1 / 149

Computational Genomics PowerPoint PPT Presentation


  • 83 Views
  • Uploaded on
  • Presentation posted in: General

Computational Genomics. Izabela Makalowska July 15, 2006. The main task in modern biology is to find out how this…. TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCG TGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGG CGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAG

Download Presentation

Computational Genomics

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Computational genomics

Computational Genomics

Izabela Makalowska

July 15,2006


The main task in modern biology is to find out how this

The main task in modern biology is to find out how this…

TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCG

TGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGG

CGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAG

CGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAA

GGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTC

TCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGT

AGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCT

GCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTA

AAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACT

AGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTC

CTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCT

AAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC


Becomes this

…becomes this


Dna sequence contains all information but we need to decipher it

DNA sequence contains all information but we need to decipher it.

TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCG

TGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGG

CGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAG

CGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAA

GGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTC

TCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGT

AGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCT

GCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTA

AAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACT

AGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTC

CTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCT

AAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC


Program for life

Program for life

  • DNA in our cells store information in a way that is very similar to the way computers do.

  • Instead of being a binary memory, where everything is either 0 or 1, DNA is a 4 letter alphabet: A, C, G, T

  • Using computer metaphor we can say that:

    • Plant cell do not look like a mouse cell because their “programs” are different

    • Liver cells work differently than lung cells because of different input to the program

    • Children look like parents because their program is a “revision” of parents program

    • Many diseases are caused by “bugs” in program:

      • Tay-Sach’s disease: A simple mistake in one line of code

      • Huntington’s disease: A “line” of code gets repeated a bunch of times by accident

  • Different ways to solve the same problem:

    • Plants: photosynthesis = turn light into sugar

    • Animals: eat plant or other animals


What exactly are we looking for in the dna sequence

What exactly are we looking for in the DNA sequence?

  • Genes

    • Protein coding

    • RNA genes

    • Retrogenes

  • Regulatory elements

    • Promotors

    • Enhancers

    • siRNA

  • Repetitive elements

    • LINES

    • SINES

    • Simple repeats


Are genes just protein or rna coding elements

Are genes just protein or RNA coding elements?

Makorin1-p1 is a non-coding pseudogene of Makorin1. Makorin1-p1 regulates the expression of its related coding gene. It acts by stabilizing the Makorin1 gene by blocking of a cis-acting RNA decay element within the 5’ region of Makorin1.


Are repeats just a junk dna

Are repeats just a junk DNA?

Translation of mRNA containing Alu-cassette results in soluble form of the protein

Caras, I.W., Davitz, M.A., Rhee, L., Weddell, G., Martin Jr., D.W., Namba, T., Sugimoto, Y., Negishi, M., Irie, A., Ushikubi, F., Kaki-Nussenzweig,V., Cloning of decay-accelerating factor suggests novel use of splicing to generate two proteins.

Nature. 1987 Feb 5-11;325(6104):545-9.


Getting all genes

Getting all genes

  • The most direct way to identify a gene is to document the transcription of a fragment of the genome -EST sequencing

    • Requires less sequencing since it is focused on coding sequence only

    • Small rate of false positives, although even 10% of EST sequences could be artifacts

    • Genes with very restricted expression may newer be discovered

    • In most cases gives only partial sequences

  • Genome sequencing

    • Access to entire genome, allows to learn more about genome organization

    • Regulatory elements

    • Only small percentage of the genome codes for genes

    • Hard to identify less typical genes

    • High rate of false positives


Constructing est

Constructing EST

Cell or tissue

Isolate mRNA and

reverse transcribe into cDNA

Analyze

Clone cDNA into a vector

to make a cDNA library

5' EST

3' EST

cDNA

Sequence the

5' and 3' ends

of cDNA inserts

Pick individual

clones

vector


Problems with est data

Problems with EST data

  • Contamination

  • Low quality – the error rates are high in individual ESTs

  • Highly redundant, for highly expressed genes we can have hundreds of ESTs representing a single gene

  • The databases are skewed for sequences near 3’ end of mRNA

  • For most ESTs there is no indication as to the gene from which it was derived

  • Overlapping genes

  • Splice variants


Chromatogram

Chromatogram

An example of a good chromatogram showing well-resolved peaks and no ambiguities

This is a region of a chromatogram fairly far along the sequence where some bases in runs of 2 or more are no longer visible as single peaks. Many peaks are beginning to broaden and smear into one another, interpretation of the peaks has become more difficult, and the basecalling software has begun to use 'N's.

This is a region of a chromatogram where the traces have become too ambiguous for accurate basecalling. While some parts of this region of the chromatogram can be useful for linking to existing sequences following manual editing, it should not be considered accurate.


Sequence quality screening

Sequence quality screening

  • ABI sequencing software contains a program for quality screening

  • PHRED - reads DNA sequence data, calls bases, and writes the base calls and quality values to output files


Quality file

Quality file


Phred quality scores

Phred quality scores


Contamination

Contamination

  • Vectors - DNA/cDNAs from the biological source organism/organelle are usually inserted into a cloning vector so that they can be cloned, propagated, and manipulated. Sequencing of such constructs frequently produces raw sequences that include segments derived from vector.

  • Adapters, linkers, and PCR primers - Various oligonucleotides can be attached to the DNA/RNA under investigation as part of the cloning or amplification process.

  • Impurities in the DNA/RNA - Nucleic acid preparations may contain DNA/RNA from sources other than the intended one.

    • nucleic acids from an organelle

    • mRNA/DNA present in a reagent used in the isolation, purification, or cloning procedures

    • nucleic acids from other organisms present in the material from which the DNA/RNA was isolated

    • other DNAs/RNAs used in the laboratory (e.g., from accidental mixing of samples or cross contamination from dirty pipettes, tips, tubes, or equipment)


Consequences of contamination

Consequences of contamination

  • Time and effort wasted on meaningless analyses

  • Erroneous conclusions drawn about the biological significance of the sequence

  • Misassembly of sequence contigs and false clustering of Expressed Sequence Tags (ESTs)


Why contamination is causing problems

Assembly

Contigs

Why contamination is causing problems

Gene A

Gene B


Computational genomics

VecScreen

http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen_docs.html


Cross match

Cross_match

  • Cross_match is a general purpose application for comparing any two DNA sequence sets. For example, it can be used to compare a set of reads to a set of vector sequences and produce vector-masked versions of the reads. It is slower but more sensitive than BLAST.

    GCACGCACAACCAGACCATGCTCGGACGACCCGCTGTACATCGGCCTGCGGCAGAGGCGCGTGCGCGGCGCCGCG

    TACGACGAGTTCGTCGACGAGTTCATGCAGGCGGTCGTCAAGCGCTTCGGGCAGAACTGCCTCATACAGTTCGAG

    GACTTCGCCAACGCGAACGCGTTCCGCCTGCTCGAGAAGTACCGCGGCAGGTACTGCACGTTCAACGACGACCTC

    CAGGGCACGGCGGCGGTGGCGGTGGCCGGGCTGCTCGCGTCGCTGCGCATCACCGGCAAGCGGCTCTCCGACAAC

    GTGTTCGTGTTCCAGGGAGCCGGCGAGGCATCTCTGGGTATCGCCGAGCTGTGCGTGATGGCGATGAAGAACGAG

    GGTACATCGGACGCCGATGCCCGCTGCAAGATTTGGATGGTGGACTCCAAGGGTCTCATCGTGAAGAACCGTCCT

    GAAGGTGGACTGAACGAACACAAGGAGAAGTTTGCCCAGAACTGCTCCCCCATTCGGACACTTGCCGAAGTTATA

    AATGTTGCTAAGCCTTCTGTACTGATTGGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    XXXXXXXXXXXXXXXXXXXXXXXXXXXGCTCGACGGCGGAAGGAAAA


Qscreen

Qscreen

  • Sequencing data management and quality checking system

    • Quality screening

    • Relational database for sequence data management and archive

    • Easy data access via web interface

    • Easy data sharing

    • Sequence and trace view

    • Project statistics

    • Password protected


Qscreen project and user management

Qscreen: project and user management


Gene discovery strategy

Gene discovery strategy

  • Cluster and assemble EST sequences to lower redundancy and to increase the length of transcripts

  • Find coding regions and reading frame

  • Use deducted protein to search databases and assign function to the gene


Cluster and assemble est resources

Cluster and assemble EST - resources

  • Assembly tools:

    • TGICL http://www.tigr.org/tdb/tgi/software/

    • Cap 3 http://genome.cs.mtu.edu/sas.html

    • Phrap http://www.phrap.org/

  • EST clusters databases

    • UniGenehttp://www.ncbi.nlm.nih.gov/

    • TGIhttp://www.tigr.org/

  • EST analysis pipeline

    • SMAPihttp://smapi.cbio.psu.edu


Assembling ests

  • Assembling inside each cluster

Assembling ESTs

  • Two stage process:

    • Clustering ESTs based on the similarity and clone ID


Important parameters

Important parameters

  • Criteria too stringent = many ESTs will not be assembled and genes will stay fragmented

  • Criteria too loose = ESTs from genes from the same family will be assembled into one gene

Length and similarity level

of overlapping fragments

Length of overhanging fragments


Contig quality

Contig quality

ATGTCTCTNTCACTGA

TCTGTCCC-CAGTCACGATCGAN

ATGTCTCTGTCNCTNAGTCACGATCGAN

ATGTCTCTNTCACTGA

TCTGTCCC-CAGTCACGATCGAN

ATGTCTCGGTCAC-CAGTCACGATCGAT

ATGTCTCGGTCAC-CAGTCACGATCGAT

TTGTCTGGGTCAC-CTCC

GGTGGC-CAGTCACGATNGAN

ATGTCTCGGTCAC-CAGTCACGATCGAT

ATGTCTCTGTCNCTNAGTCACGATCGAN

ATGTCTCGGTCAC-CAGTCACGATCGAT


One gene one cluster

Joined based on clone ID

Joined based on similarity

Joined based on similarity

One gene one cluster?

3’ESTs

5’ESTs


One cluster one gene

One cluster one gene?


Computational genomics

Cap3

  • Use of forward-reverse constraints to correct assembly errors and link contigs.

  • Use of base quality values in alignment of sequence reads.

  • Automatic clipping of 5' and 3' poor regions of reads.

  • Generation of assembly results in ‘ace’ file format for Consed.


Input files

Input files

  • CAP3 takes as input a file of sequence reads in FASTA format. CAP3 takes two optional files: a file of quality values in FASTA format and a file of forward-reverse constraints. The file of quality values must be named "xyz.qual", and the file of forward-reverse constraints must be named "xyz.con", where "xyz" is the name of the sequence file. CAP3 uses the same format of a quality file as Phrap.


Web interface

http://bio.ifom-firc.it/ASSEMBLY/assemble.html

Web interface


Output files

Output files


Tgicl tigr gene indices clustering tool

TGICL – TIGR Gene Indices clustering tool

  • Clustering – uses modified megablast program to cluster sequences together

  • Assembly – CAP3 is used to assemble sequences inside each cluster


Tgicl tigr gene indices clustering tool1

TGICL – TIGR Gene Indices clustering tool

  • Sequences need to be cleaned before using TGICL (Lucy, UniVEc, SeqClean)

  • mRNA sequences may be used as ‘seeds’ for clustering. Caution: partial mRNAs mislabeled as complete can prevent cluster extension beyond the seed.

  • Difficulty with highly expressed genes that have several thousand ESTs in a single cluster (assembly program may run out of memory)


Phrap

Phrap

  • part of the Phred/Phrap/Consed

  • program for assembling shotgun DNA sequence data.

  • allows use of the entire read and not just the trimmed high quality part

  • uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy

  • constructs the contig sequence as a mosaic of the highest quality read segments rather than a consensus

  • provides extensive assembly information to assist in trouble-shooting assembly problems

  • handles large datasets

  • It is strongly recommended that phrap be used in conjunction with the base calls and base quality values produced by the basecaller, phred; and with the sequence editor/assembly viewer, consed.


Phrap output in consed

Phrap output in Consed


Apple est assembly

Apple EST assembly

183, 732 ESTs

  • Phrap:

    • 24,199 contigs; 9,765 singletons

  • CAP3

    • 19,791 contigs; 18,927 singletons

  • TGICL

    • 22,481 contigs; 28,279 singletons


Ncbi unigene

NCBI UniGene


Clustering at ncbi

Clustering at NCBI

  • EST must have at least 100bp after removing contaminants

  • The overlap between similar ESTs must be at least 70 bp

  • Similarity between overlapping area must be at least 96% over the 70% of overlapping region

>100bp

>70% of overlap with at

least 96% of similarity

>100bp

>70bp


Tigr gene indices

TIGR Gene Indices

www.tigr.org


What are tigr gene indices

What are TIGR gene indices?

  • Clustered and assembled ESTs and mRNA sequences

  • Each gene and each splice variant is represented by a single consensus sequence

  • Provides ORF annotation, genome mapping, expression profiles, domain annotation, unique oligomers

>40bp of overlap with at

least 95% of similarity

<30bp

<30bp


Tgi annotations

TGI annotations


Tgi sequence annotations

TGI sequence annotations


Blast search

BLAST search


Smapi s equence m anagement and a nalysis pi peline

  • The Parasitic Plant Genome Project

    • Comparative genomics of related parasitic genera

    • Host-parasite lateral transfer

SMAPi – Sequence Management and Analysis Pipeline

  • Online access to data

  • Easy sequence management

  • Custom as well as automated annotation

  • Sharing among researchers

  • Integrated usage of various analysis tools (BLAST, Phred/Phrap.)

Supported by grant from

Pennsylvania Greenhouse for Life Sciences


Ncbi orf finder

NCBI ORF finder


Mapping contigs to the genome

Mapping contigs to the genome

  • Splign http://www.ncbi.nlm.nih.gov/sutils/splign/


Genome sequencing

Genome sequencing

  • Access to entire genome

  • Regulatory elements

  • Only small percentage of the genome codes for genes

  • Hard to identify less typical genes

  • High rate of false positives

TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCG

TGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGG

CGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAG

CGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAA

GGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTC

TCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGT

AGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCT

GCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTA

AAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACT

AGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTC

CTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCT

AAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC


Genome sequencing strategies

Genome sequencing strategies


Gene identification methods

Gene identification methods

  • Molecular techniques

    • Very laborious

    • Time consuming

    • Expensive

    • Low rate of false positives

  • Computational methods

    • Fast

    • Relatively low cost

    • High rate of false positives

    • Poor performance on less typical genes


Before we start analysis

Before we start analysis…

  • We have to:

    • Check sequences quality

    • Remove contamination

    • Assembly sequence reads into longer contigs

    • Close gaps (in perfect situation)


Gene finding strategies

Gene finding strategies

Genomic Sequence

Model based

Comparative

Site-Based

Content-Based

  • Inferences based on sequence homology:

    • Protein sequence with similarity to translated product of query

    • Similarity to EST sequences

    • Conserved regions between species (comparative genomics)

  • Absolute properties of

  • sequence:

    • Consensus sequences

    • Donor and acceptor splice sites

    • Transcription factor binding sites

    • Polyadenylation signals

    • Start codon

    • Stop codon

  • Bulk properties of

  • sequence:

    • Open reading frame

    • Codon usage

    • Repeat periodicity

    • Compositional complexity


Gene finding methods classification

Gene finding methods classification

Similarity based predictors: make use of similarity to already known genes and proteins coded by these genes as well as expression data including sequences from cDNAs and data from hybridization experiments (tiling arrays for example)

Dual- and multi-genome predictors: rely on the fact that functional regions of a genome sequence are more conserved during evolution

Model based predictors: use a single genome sequence and exon/intron structure is predicted based on absolute and bulk properties of the sequence


Comparative genomics multipipmaker

Comparative genomics - MultiPipmaker

http://pipmaker.bx.psu.edu/pipmaker/


Model based methods

Model based methods

  • We take advantage of what we already learned about gene structures and features of coding sequences. Based on this knowledge we can build theoretical model, develop an algorithm to search for important features, train it on known data and use to search for coding sequences in anonymous genomic fragments


Pattern recognition and matching

Pattern recognition and matching

  • The ability of a program to compare novel and known patterns and determine the degree of similarity forms the basis of sequence analysis including gene identification. In similarity based methods we search the genome directly for nucleotide or amino acid pattern observed in one or more already known genes; in comparative genomics we look for similar sequence pattern in two or more genomes, and in method based prediction we look for patterns in sequence composition and signals.

  • One of the major challenges associated with using pattern matching is in that, in most cases, we need to identify patterns that are ‘similar’ to a target pattern, but the concept of similarity isn’t well defined from programmatic and biological sense. Also, only already known pattern may be used for searches, therefore genes with unusual patterns may not be discovered using these methods.


Coding regions in prokaryotes

INTERGENIC REGION

START

CODING SEQUENCE

STOP

Coding regions in Prokaryotes


Genetic code

Genetic code


Gene model in prokaryotes

Gene Model in Prokaryotes

T

A

A

A

T

1

2

G

3

T

A

G

A

G

T

START

CODON

STOP


Open reading frames

Open Reading Frames

+1

+2

+3

-1

-2

-3


Properties of the sequence

Properties of the sequence

  • We can check if sequence in particular ORF has some other features which could tell us if this is a putative coding sequence or the ORF is false positive. We can look at the sequence content and compare it with known coding sequence and non-coding sequence and check to which of these two the ORF sequence is more similar to.


Markov chain

Markov chain

  • A Markov chain describes at successive times the states of a system. At these times the system may have changed from the state it was in the moment before to another or stayed in the same state. The changes of state are called transitions. The Markov property means the system is memoryless, i.e. it does not "remember" the states it was in before, just "knows" its present state, and hence bases its "decision" to which future state it will transit purely on the present, not considering the past.


Examples of markov chain

Examples of Markov chain

  • A game of Monopoly, snakes and ladders or any other game whose moves are determined entirely by dice is a Markov chain. This is in contrast to card games such as poker or blackjack, where the cards represent a 'memory' of the past moves. To see the difference, consider the probability for a certain event in the game. In the above mentioned dice games, the only thing that matters is the current state of the board. The next state of the board depends on the current state, and the next roll of the dice. It doesn't depend on how things got to their current state. In a game such as poker or blackjack, a player can gain an advantage by remembering which hands have already been played (and hence which cards are no longer in the deck), so the next state (or hand) of the game is not independent of the past states.


Markov chain weather model

0.1

0.9 0.1

0.5 0.5

P =

0.9

0.5

0.5

Markov chain – weather model

  • The matrix P represents the weather model in which a sunny day is 90% likely to be followed by another sunny day, and a rainy day is 50% likely to be followed by another rainy day.


Markov chain sequence properties

Markov chain – sequence properties

A

G

C

T


How can we do this

How can we do this?

CGATCGATCGATCGGCCGATGCTGATGAGTCTCCTACGTCTCGTAGCTCGCGTCGTATATCGCGATCGCATCGCGTACGCGATCGCTATCTCGCATCGCGTACTGCAT

CGATCGATCGATCGGCCGATGCTGATGAGTCTCCTACGTCTCGTAGCTCGCGTCGTATATCGCGATCGCATCGCGTACGCGATCGCATCTCGCATCGCGTACTGCAT

CGATCGATCGATCGGCCGATGCTGATGAGTCTCCTACGTCTCGTAGCTCGCGTCGTATATCGCGATCGCATCGCGTACGCGATCGCATCTCGCATCGCGTACTGCAT

CGATCGATCGATCGGCCGATGCTGATGAGTCTCCTACGTCTCGTAGCTCGCGTCGTATATCGCGATCGCATCGCGTACGCGATCGCATCTCGCATCGCGTACTGCAT

20 G

7 GA

1 GG

5 GT

7 GC

For non-coding sequence we assume that probability of each is equal. The more ‘popular’ in coding sequence transition, the higher probability the sequence is coding

7/20

1/20

5/20

7/20


Markov chains

Markov Chains

GCGCTAGCGCCGATCATCTACTCG

}

GCGCTAGCGCCGATCATCTACTCG

first order

GCGCTAGCGCCGATCATCTACTCG

}

GCGCTAGCGCCGATCATCTACTCG

second order

GCGCTAGCGCCGATCATCTACTCG

}

GCGCTAGCGCCGATCATCTACTCG

fifth order

GCGCTAGCGCCGATCATCTACTCG


How far can we go

How far can we go?

  • Order of our model will have influence on specificity and sensitivity of our program. Too short sequences may not be specific enough and program may return a lot of false positives. Long chains may be to specific and our program will not be sensitive enough and will return a lot of false negatives.


Probability matrix

Probability matrix

3

2+1

4

= 4 =64

4

3+1

4

= 4 =256

first order Markov Model - matrix of 16 probabilities

p(A/A),p(A/T),p(A/C),p(A/G)

p(T/A),p(T/T),p(T/C),p(T/G)

p(C/A),p(C/T),p(C/C),p(C/G)

p(G/A),p(G/T),p(G/C),p(G/G)

K+1

4

2

1+1

4

= 4 =16


Probabilities in different reading frames

Probabilities in different reading frames

GCGCTAGCGCCGATCATCTACTCG

GCG CTA GCG CCG ATC ATC TAC TCG

G CGC TAG CGC CGA TCA TCT ACT CG

GC GCT AGC GCC GAT CAT CTA CTC G

Does G more often follow GC when it is in third

position in codon or if it is in second or first

position?


Number of probabilities

Number of probabilities

2

1+1

4

= 4 = 16

2

1+1

4

= 4 = 3x16 = 48


Estimating coding potential

Estimating coding potential

To estimate if the sequence is coding we have to calculate the probability that the sequence is coding and the probability the sequence is non-coding. Next we calculate the logarithm of the ratio of these two probability values

i

P (S)

LP (S) = log

P (S)

0

If the calculated value is > 0 the likelihood that the sequence is coding is higher than the sequence is not coding. If the value is < 0 there is higher likelihood that sequence is not coding.


Markov models probabilities

Markov Models - probabilities

i

P (S)

LP(S) = log

P (S)

0

S=AGGACG

1

1

1

2

2

3

P(S) =f(A,1)F(G,A)F(G,G)F(A,G)F(C,A)F(G,C)

P(S) = 0.27 x 0.19 x 0.27 x 0.24 x 0.21 x 0.12 = 0.00008377

P(S) = 0.25 x 0.25 x 0.25 x 0.25 x 0.25 x 0.25 =0.0002441

LP(S) = log(0.00008377/0.0002441) = -0.4644


Calculating lp

Calculating LP

i

P (S)

LP(S) = log

P (S)

0

0.27

0.21

0.27

0.19

0.24

0.12

LP(S) = log + log + log + log + log + log

0.25

0.25

0.25

0.25

0.25

0.25

LP(S) = log 1.08 + log 0.76 + log 1.08 + log 0.96 + log 0.84 + log 0.48

LP(S) = 0.0334 + (-0.1191) +0.0334 + (-0.0177) + (-0.0757) + (-0.3187)

LP(S) = -0.4644


Glimmer

GLIMMER

  • Gene finding program for Prokaryotes

  • Saltzberg et. al, 1998

  • For prediction uses:

    • Start

    • Stop

    • Sequence composition

    • Interpolated Markov Models


The glimmer system

The GLIMMER system

  • Part 1 – Program is trained for given data set (species)

  • Part 2 – Program identifies putative genes in the genomic sequence

    • Identify all ORFs longer than a threshold

    • Score each ORF in each reading frame and select these which gets highest scores in correct reading frame

    • Score overlapping genes in each frame separately to see which frame score the highest


Running the program

Running the program

  • First run build-imm on a set of sequences to make the Markov models (long ORFs from the same or closely related species)

    • build-imm train.seq

  • Then run GLIMMER to find genes in your sequence

    • glimmer your.seq train.seq <options>


Glimmer options

GLIMMER options

  • -gset minimum gene length

  • -oset minimum overlap

  • -p set minimum overlap percentage

  • +r/-r independent probability score ON/OFF

  • -tset threshold score for calling as gene


Glimmer output

GLIMMER output

Minimum gene length = 180

Minimum overlap length = 30

Minimum overlap percent = 10.0%

Threshold score = 90

Use independent scores = True

Use first start codon = True

Orf Gene Lengths Gene -- Frame Scores - Indep

ID# Fr Start Start End Orf Gene Score F1 F2 F3 R1 R2 R3 Score

F2 302 305 616 315 312 0 _ 0 _ 99 _ _ 0

1 R1 660 633 220 441 414 99 _ _ _ 99 _ _ 0

F2 620 650 901 282 252 0 _ 0 _ _ _ 99 0

2 R3 1114 1105 638 477 468 99 _ _ _ _ _ 99 0

F3 1119 1140 1466 348 327 0 _ _ 0 _ _ 99 0

3 R3 2026 1999 1118 909 882 99 _ _ _ _ _ 99 0

4 F3 1815 1830 2054 240 225 99 _ _ 99 _ _ _ 0

*** Overlaps #3 by 170 Overlap Region Scores: _ _ 0 _ _ 99 0

Reject #4

5 R2 2600 2597 1935 666 663 99 _ _ _ _ 99 _ 0

*** Overlaps #4 by 120 Overlap Region Scores: _ _ 0 _ 99 _ 0

6 F1 2710 2719 3399 690 681 99 99 _ _ _ _ _ 0

R3 4153 4153 3962 192 192 0 99 _ _ _ _ 0 0

7 F1 3403 3403 4230 828 828 99 99 _ _ _ _ _ 0

R2 4700 4679 4455 246 225 0 _ _ _ _ 0 _ 99

R2 68906 68897 68670 237 228 13 _ _ _ _ 13 _ 86

8 R1 101574 101544 101296 279 249 96 _ _ _ 96 _ _ 3

R3 193228 193204 193022 207 183 56 _ _ _ _ _ 56 43


List of putative genes

List of putative genes

Putative Genes:

1 633 220

2 1105 638

3 1999 1118

5 2597 1935

6 2719 3399

7 3403 4230

...

39 38472 38741 [Shorter 40 80 74]

40 38662 39450 [Bad Overlap 39 80 25]

...

482 464206 464424 [Shadowed by 483]

...

636 616213 615965 [Delay by 33 637 50 0]


Output description

Output description

39 38472 38741 [Shorter 40 80 74]

40 38662 39450 [Bad Overlap 39 80 25]

[Bad Overlap a b c] means that gene number a overlapped this one and

was shorter but scored higher on the overlap region. b is the length

of the overlap region and c is the score of *this* gene on the overlap

region. There should be a [Shorter ...] notation with gene a

giving its score.

[Shorter a b c] means that gene number a overlapped this one and

was longer but scored lower on the overlap region. b is the length

of the overlap region and c is the score of *this* gene on the overlap

region. There should be a [Bad overlap ...] notation with gene a

giving its score.


Output description 2

Output description - 2

482 464206 464424 [Shadowed by 483]

...

636 616213 615965 [Delay by 33 637 50 0]

[Shadowed by a] means that this gene was completed contained as part

of gene a 's region, but in another frame.

[Delay by a b c d] means that this gene was tentatively rejected

because of an overlap with gene b , but if the start codon is postponed

by a positions, then this would be a valid gene. The start position

reported for this gene includes the delay. c is the length of the overlap

region that caused the rejection and d is the score in this gene's frame

on that overlap region.


Prokaryotic vs eukaryotic genes

Prokaryotes

small genomes

high gene density

no introns (or splicing)

no RNA processing

similar promoters

terminators important

overlapping genes

Eukaryotes

large genomes

low gene density

introns (splicing)

RNA processing

heterogeneous promoters

terminators not important

polyadenylation

Prokaryotic vs. Eukaryotic Genes


Coding regions in prokaryotes1

INTERGENIC REGION

START

CODING SEQUENCE

STOP

Coding regions in Prokaryotes


From dna to protein

From DNA to protein


Gene structure

Gene structure


Pseudogenes and repetitive elements

Pseudogenes and repetitive elements


Complicated gene structures

Complicated gene structures


Overlapping genes

Overlapping genes


Eukaryotic gene structure

Eukaryotic gene structure


Computational genomics

3’ UTR

E0

E1

E2

Eterm

Einit

I2

I0

I1

5’ UTR

3’ UTR

Single exon gene

promoter

Poly A

Signal

Intergenic region

5’ UTR

G T

AG

Markov Models


Searching for coding sequences using markov chain

Searching for coding sequences using Markov chain

In this case we do not want check if given sequence fragment is coding or not but we rather want to identify coding fragments in a long sequence. In most cases this is done by calculating statistics in overlapping windows.

AGTACGATATTAGCGGCAATCGTATGACTACGTCTTGCTACGTCTTCTCTCGTCTGCTCTAG

Windows coding potentials are used to create profiles. This example shows a profile for a sequence analyzed using 120bp windows with 10 bp distance between them.


Codon usage

Codon usage


Codon usage1

Codon usage


Probability that sequence is coding

Probability that sequence is coding


Probability that sequence is non coding

Probability that sequence is non-coding


Log likelihood ratio

Log-likelihood ratio


Rule based method

Rule based method

Neural network


Grail

GRAIL

  • Gene recognition and Analysis Internet Link

  • Uberbacher and Mural, 1991

  • First gene prediction program

  • GRAIL1

    • Neural network recognizing coding potential within fixed-size window (100 bp)

    • Evaluates coding potential without looking for additional features (e.g. splice junctions, start and stop codons)

  • GRAIL2

    • Variable size of windows

    • Incorporated genomic context information (splice sites, start and stop, polyadenylation signals)

  • GRAILEXP

  • http://compbio.ornl.gov/grailexp/


Grailexp

GRAILEXP


Computational genomics

MZEF

  • M. Zhang 1997

  • Predicts exons only, does not build gene structure

  • Uses ‘quadratic discriminant analysis”

  • Variable measures:

    • Exon length

    • Intron-exon transition

    • Branch site scores

  • http://rulai.cshl.org/tools/genefinder/


Computational genomics

MZEF


Fgenes

FGENES

  • Solovyev et al., 1994

  • Predicts internal exons

  • Linear discriminant analysis

    • Donor and acceptor splice sites

    • Putative coding regions

    • Intronic regions both 5’ and 3’ to the putative exon

    • Passes results to a dynamic programming algorithm to come up with coherent gene model

  • http://www.softberry.com/berry.phtml


Fgenesh

FGENESH


Fgenesh1

FGENESH


Genscan

GENSCAN

  • Burge and Karlin

  • Search for general and specific compositional properties of distinct functional units in eukaryotic genes

  • General fifth-order markov model of coding regions

  • Analyze both DNA strands

  • Sequences may contain multiple and/or partial genes

  • http://genes.mit.edu/GENSCAN


Genscan options

GENSCAN options


Genscan output

GENSCAN output


Graphical output

Graphical output


Are predictions the same

Are predictions the same?

FGENESH

GRAILEXP

MZEF

GenScan


Evaluation statistic

Evaluation statistic

TN

TP

TP

FN

FP

TP - true positive

FP - false positive

FN - false negative

TN - true negative


Experimental validation of predicted genes

Experimental validation of predicted genes

  • 20 not annotated human BAC clones

    • 3 finished

    • 17 unfinished

  • Genes that had at least two exons, each predicted by at least two programs

    • the overlap of the predicted exons did not have to be perfect

    • similarity to ESTs or known genes was used as supporting evidence but was not required

    • 40 genes (number of exons 2-11)

  • Six single exons predicted by three or four programs

  • Three two-exon genes predicted by one program only but strongly supported by similarities to EST sequences

  • 12 genes were eliminated from further studies as they contained repetitive elements and were most likely false positives

  • Total: 37 putative transcripts


Prediction programs performance

Prediction programs performance

37 genes were tested, 16 of them (43%) were confirmed.

At the exon level there were 159 exons and 58 (36%)

were found to be real

predicted exons

specificity

sensitivity

MZEF

34

0.51

0.56

GRAIL

11

0.48

0.19

GENSCAN

52

0.46

0.91

FGENES

45

0.37

0.75


Gene structure and alternative splicing

Gene structure and alternative splicing


Repetitive elements

Repetitive elements

12 genes containing large fragments of repetitive elements were eliminated from experimental validation. However some proteins are partially coded by Alu or MER (Makalowski 1994, 2000)

Gene gm121 contains MER at 5' end of cDNA and this is experimentally confirmed gene.

Gene gm124 was eliminated from our experimental validation based on the presence of repetitive element. Further analysis showed that this gene is a part of transcript FLJ23129, GenBank accession AK026782


Repetitive elements1

Repetitive elements

  • 50% of mammalian genome are repeats:

    • DNA transposons

    • retrotransposons

    • LINEs

    • SINEs

    • tandem repeats

  • masking before similarity search - helps avoid getting similarities caused by the presence of repetitive elements, not because of sequences homology

  • predicted gene with repetitive elements are less likely to be real, although sometimes repeats are true parts of coding sequence


Repeatmasker

RepeatMasker

  • Searches for Alu, MIR, LINE, LTR and other repeats by comparison to sequences in RepBase library

  • RepBase is a database of repetitive DNA sequence elements found in a variety of eukaryotic organisms including primates, rodents, cow, dog, chicken, fugu, drosophila, arabidopsis, rice

  • Accepts local databases with repetitive elements

  • Repetitive elements are masked by replacing nucleotides with string of letters N

  • Masking can be limited to certain types of repeats only


Computational genomics

RepeatMasker

http://www.repeatmasker.org/


Repeatmasker outputs

RepeatMasker outputs


Functional annotation gene ontology

Functional annotation- Gene Ontology

  • Gene Ontology: A controlled vocabulary to describe gene products - proteins and RNA - in any organism.

  • Gene is described in three categories:

    • Cellular component: where a gene

      product acts

    • Molecular function: activities or “jobs”

      of a gene product

    • Biological process: a commonly

      recognized series of events


Annotation source

Annotation source

ISSInferred from Sequence/Structural Similarity

IDAInferred from Direct Assay

IPIInferred from Physical Interaction

TASTraceable Author Statement

NASNon-traceable Author Statement

IMPInferred from Mutant Phenotype

IGIInferred from Genetic Interaction

IEPInferred from Expression Pattern

ICInferred by Curator

NDNo Data available

IEAInferred from electronic annotation


Gene annotation

Gene annotation


Annotating genes

Annotating genes

  • Model organism: look at curated databases


Annotating novel genes

Annotating novel genes

  • Similarity search against curated and well annotated database: functional annotations deducted from close and significant match

  • Functional domains identification: mapping domains and terms to Gene Ontology


Annotating novel genes1

Annotating novel genes

  • Similarity search against curated and well annotated database: functional annotations deducted from close and significant match

  • Identification of functional domains: mapping domain to GO term


Regulatory sequences

Regulatory sequences

  • Difficult to identify using computer programs

    • Most regulatory elements are still not identified

    • They are usually very short: 6-10 base pairs

    • They may differ in one or more places


Finding regulatory elements

Finding regulatory elements

  • Phylogenetic footprinting – comparative genomics used to identify conserved non-coding sequences (MultiPipMaker)

  • Phylogenetic shadowing – conceptually similar to phylogenetic footprinting, with the difference that closely related species are compared (eShadow)

  • Gibbs sampling – random iterations of the promoter regions alignments to identify blocks of conserved residues (Gibbs Motif Sampler)

  • Hidden Markov Models – searching for known regulatory elements using probability matrices (Cister)


Phylogenetic shadowing

Phylogenetic shadowing


Gibbs sampling

Gibbs sampling

ACGGTACGTTGG

GGTACGTAGGAC

TGGCTACCTTGG

CGGTACGTAGGT

******* **


Building model from existing alignment

Building model from existing alignment

ACA - - - ATG

TCA ACT ATC

ACA C - - AGC

AGA - - - ATC

ACC G - - ATC

insertion

Transition probabilities

Output Probabilities

A HMM model for a DNA motif alignments, The transitions are shown with arrows whose thickness indicate their probability. In each state, the histogram shows the probabilities of the four bases.


Cister

Cister

  • Detects cis-elements clusters by using Hidden Markov Model

  • For each element uses separate matrix with frequencies of each nucleotide in each position; user can input matrix for elements not included in the basic option

  • User can specify:

    • distance between neighboring cis-elements within a cluster

    • number of cis-elements in the cluster

    • distance between clusters

    • half-width of the sliding window


Frequency matrix

Frequency matrix

NA AML-1a

XX

DE runt-factor AML-1

XX

BF T02256; AML1a; Species: human, Homo sapiens.

XX

P0 A C G T

01 5 1 2 49 T

02 2 2 52 1 G

03 4 14 1 38 T

04 0 0 57 0 G

05 1 0 55 1 G

06 1 4 0 52 T

TGTGGT

TGCGGT

TGTGGT

AGTGGT

TGTGGC

Sequences of experimentaly identified elements are aligned

and frequencies in each position are calculated


Cister1

Cister

http://zlab.bu.edu/~mfrith/cister.shtml


  • Login