Comparative Genomics: Comparative Gene Prediction in the Human Genome

Comparative Genomics: Comparative Gene Prediction in the Human Genome. Maribel Hernandez Rosales. What is Comparative Genomics? Comparative genomics is the analysis and comparison of genomes from different species.


Comparative Genomics: Comparative Gene Prediction in the Human Genome

Maribel Hernandez Rosales


What is Comparative Genomics?

  • Comparative genomics is the analysis and comparison of genomes from different species.

  • The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and noncoding regions of the genome.

  • Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse.

  • Genome researchers look at many different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of noncoding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans.

  • Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them.
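As a rough illustration of "looking for regions of similarity", the sketch below slides a fixed window over two sequences that are assumed to be already aligned (gaps written as "-") and reports the windows whose identity reaches a threshold. The function names, window size, and threshold are hypothetical choices for this example, not part of any tool named in the slides.

```python
def window_identity(aligned_a, aligned_b, window=10):
    """Fraction of identical, non-gap columns in every full window."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 for a column where both sequences carry the same base, else 0.
    matches = [
        1 if x == y and x != "-" else 0
        for x, y in zip(aligned_a, aligned_b)
    ]
    return [
        sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]


def conserved_windows(aligned_a, aligned_b, window=10, threshold=0.8):
    """Start positions of windows whose identity is at least `threshold`."""
    scores = window_identity(aligned_a, aligned_b, window)
    return [i for i, score in enumerate(scores) if score >= threshold]


if __name__ == "__main__":
    print(conserved_windows("ACGTACGT--AC", "ACGTACGAGGAC", window=4))
```

Real genome comparison tools use far more elaborate scoring, but the idea of scanning an alignment for locally conserved stretches is the same.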



Eukaryotic Gene Finding


Comparative Gene Prediction

  • GenScan: ab initio gene prediction.

  • GeneWise, Procrustes: homology guided.

  • Rosetta, SGP-1 (Syntenic Gene Prediction), CEM (Conserved Exon Method): gene prediction and sequence alignment are clearly separated.

  • GenomeScan: ab initio prediction modified by BLAST homologies.

  • SGP-2, TwinScan, SLAM, DoubleScan: modification of the GenScan scoring scheme to incorporate similarity to known proteins.


GenScan

  • A general probabilistic model for the gene structure of human genomic sequences.

  • Gene identification by identifying complete exon/intron structures of genes in genomic DNA.

  • Includes the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands.

  • Markov Model of coding regions: predictions do not depend on presence of a similar gene in the protein sequence databases and complement the information provided by homology-based gene identification methods (BLASTX).

  • Maximal Dependence Decomposition (MDD): a new statistical model of donor and acceptor splice sites that captures important dependencies between signal positions.


Pre-mRNA Splicing

[Figure: pre-mRNA splicing. Labeled elements: SR proteins; 5’ splice signal; branch signal; polyY; 3’ splice signal; exonic enhancers and exonic repressor; intronic enhancers and intronic repressor. Stages: exon definition, intron definition, ... (assembly of spliceosome, catalysis) ...]


Hidden semi-Markov Model (HMM)


GenScan HMM

  • N - intergenic region

  • P - promoter

  • F - 5’ untranslated region

  • Esngl – single exon (intronless) (translation start -> stop codon)

  • Einit – initial exon (translation start -> donor splice site)

  • Ek – phase k internal exon (acceptor splice site -> donor splice site)

  • Eterm – terminal exon (acceptor splice site -> stop codon)

  • Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
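The phase definition above can be made concrete with a little arithmetic: the phase of an intron is the number of coding bases seen before it, modulo 3. The sketch below assumes the input is the list of coding lengths of a gene's exons in transcript order; the function name is hypothetical.

```python
def intron_phases(coding_exon_lengths):
    """Phase of each intron, given coding lengths of the exons in order.

    Phase 0: the intron falls between two codons.
    Phase 1: the intron falls after the first base of a codon.
    Phase 2: the intron falls after the second base of a codon.
    """
    phases = []
    coding_bases_so_far = 0
    # There is one intron after every exon except the last.
    for length in coding_exon_lengths[:-1]:
        coding_bases_so_far += length
        phases.append(coding_bases_so_far % 3)
    return phases


if __name__ == "__main__":
    # Three coding exons of 100, 50 and 30 bases (180 bases = 60 codons).
    print(intron_phases([100, 50, 30]))  # [1, 0]
```

In the state list above, the first intron of this toy gene would be an I1 state and the second an I0 state.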


GenScan Features

  • Model both strands at once

  • Each state may output a string of symbols (according to some probability distribution).

  • Explicit intron/exon length modeling

  • Advanced splice site modeling

  • Parameters learned from annotated genes

  • Prediction of multiple genes in a sequence (partial or complete).


GenomeScan

  • We can enhance our gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons.

  • Combine probabilistic ‘extrinsic’ information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan).

  • Focus on ‘typical case’ when homologous but not identical proteins are available.
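To make the idea of combining extrinsic and intrinsic evidence concrete, here is a deliberately simplified sketch: a candidate exon's ab initio log-odds score is raised when a BLAST hit overlaps it and lowered otherwise. This is not GenomeScan's actual probabilistic formula. The tuple layout, `bonus_per_bit`, and `no_hit_penalty` are assumptions made up for the example.

```python
def rescore_exon(exon, hits, bonus_per_bit=0.1, no_hit_penalty=1.0):
    """Adjust an ab initio exon score using overlapping similarity hits.

    exon: (start, end, log_odds) with half-open coordinates.
    hits: iterable of (start, end, bit_score) on the same sequence.
    """
    start, end, log_odds = exon
    overlapping = [
        bits
        for hit_start, hit_end, bits in hits
        if hit_start < end and start < hit_end  # intervals intersect
    ]
    if overlapping:
        # Homology support: reward in proportion to the strongest hit.
        return log_odds + bonus_per_bit * max(overlapping)
    # No homology support: the exon becomes somewhat less likely.
    return log_odds - no_hit_penalty


if __name__ == "__main__":
    blast_hits = [(150, 250, 50.0)]
    print(rescore_exon((100, 200, 2.0), blast_hits))  # supported exon
    print(rescore_exon((300, 400, 2.0), blast_hits))  # unsupported exon
```

The rescored exons would then be fed back into the gene-structure model, so that homology influences which exons are chained into the final prediction.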


Ab Initio modified by BLAST homologies




GeneWise

  • Motivation: use a good database of the protein world (Pfam) to help annotate genomic DNA.

  • GeneWise algorithm aligns a profile HMM directly to the DNA


GeneWise

  • Start with a PFAM domain HMM

  • Replace AA emissions with codon emissions

  • Allow for sequencing errors (deletions/insertions)

  • Add a 3-state intron model
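The step "replace AA emissions with codon emissions" can be sketched as follows. The example spreads each amino-acid emission probability uniformly over its synonymous codons in the standard genetic code. The uniform split is an assumption made for illustration; how GeneWise actually weights codons is not stated in the slides, and `codon_emissions` is a hypothetical name.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_emissions(aa_emissions):
    """Turn {amino acid: probability} into {codon: probability}."""
    synonyms = {}
    for codon, aa in CODON_TABLE.items():
        synonyms.setdefault(aa, []).append(codon)
    return {
        codon: prob / len(synonyms[aa])  # uniform split (assumption)
        for aa, prob in aa_emissions.items()
        for codon in synonyms[aa]
    }


if __name__ == "__main__":
    # A toy match state that emits Met or Leu with equal probability.
    emissions = codon_emissions({"M": 0.5, "L": 0.5})
    for codon, prob in sorted(emissions.items()):
        print(codon, round(prob, 4))
```

With codon emissions in place, a match state of the protein profile consumes three DNA bases at a time, which is what lets the profile be aligned directly to genomic sequence.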


GeneWise Model


GeneWise Intron Model

[Figure labels: 5’ site, central, PY tract, spacer, 3’ site]


GeneWise Features & Problems

  • “Best” alignment of DNA to protein domain

  • Alignment gives exact exon-intron boundaries

  • Parameters learned from species-specific statistics

  • Only provides partial prediction, and only where the homology lies

    • Does not find “more” genes

  • Pseudogenes and retrotransposons are picked up

  • CPU intensive

    • Solution: Pre-filter with BLAST


Rosetta

  • Gene prediction is separated from sequence alignment.

  • First, an alignment between two homologous genomic sequences is obtained with the global alignment program GLASS. Then, gene structures (splice sites, exon number and length, etc.) are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.
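The compatibility constraint described above, that predicted exons must fall in aligned regions, amounts to an interval-containment check. The sketch below assumes half-open (start, end) coordinates on one of the two sequences. It illustrates only the filtering step, not the alignment or the gene-structure search, and the function name is hypothetical.

```python
def exons_in_aligned_regions(candidate_exons, aligned_blocks):
    """Keep only candidate exons that lie entirely inside an aligned block.

    Both arguments are lists of half-open (start, end) intervals.
    """
    return [
        (start, end)
        for start, end in candidate_exons
        if any(
            block_start <= start and end <= block_end
            for block_start, block_end in aligned_blocks
        )
    ]


if __name__ == "__main__":
    blocks = [(100, 300), (500, 650)]
    exons = [(120, 200), (280, 320), (510, 640), (700, 800)]
    print(exons_in_aligned_regions(exons, blocks))
```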


Syntenic Gene Prediction

  • This approach does not require the comparison of two homologous genomic sequences.

  • A query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by underlying "ab initio" gene prediction algorithms.

  • Gene prediction and sequence alignment are separated.
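A hedged sketch of the rescoring idea described above: similarity hits against the informant genome are first projected onto the query as a per-position score profile, and each ab initio exon score is then increased by a weighted sum of the profile over the exon. The exact projection and weighting used by SGP are not given in the slides, so taking the per-position maximum and the `weight` parameter are assumptions, and both function names are hypothetical.

```python
def project_hsps(hsps, query_length):
    """Per-position similarity profile along the query.

    hsps: list of half-open (start, end, score) hits on the query.
    Where hits overlap, the highest score is kept (assumption).
    """
    profile = [0.0] * query_length
    for start, end, score in hsps:
        for i in range(max(start, 0), min(end, query_length)):
            profile[i] = max(profile[i], score)
    return profile


def rescore_with_projection(exon, profile, weight=0.01):
    """Add weighted similarity support to an ab initio exon score.

    exon: (start, end, ab_initio_score) with half-open coordinates.
    """
    start, end, ab_initio_score = exon
    return ab_initio_score + weight * sum(profile[start:end])


if __name__ == "__main__":
    profile = project_hsps([(0, 5, 10.0), (3, 8, 20.0)], query_length=10)
    print(profile)
    print(rescore_with_projection((2, 6, 1.0), profile, weight=0.5))
```

Keeping the projection separate from the rescoring mirrors the separation of alignment and prediction noted above: the similarity search runs once, and any ab initio exon predictor can consume its output.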


SGP-2

[Figure tracks: tblastx HSPs; HSP projections; query sequence; geneid exons; SGP exons]


Gene prediction programs predict a large number of genes

Almost every mouse gene has a human orthologous counterpart.


Orthologous human-mouse genes have conserved exonic structure

  • 85% of the orthologous pairs have an identical number of exons

  • 91% of the orthologous exons have identical length

  • 99.5% of the orthologous exons have identical phase

  • There are a few cases of intron insertion/deletion (22)
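Statistics of this kind could be computed from a table of orthologous gene pairs along the following lines. The data layout (each gene as a list of (length, phase) exons) and the toy values are invented for the example and are unrelated to the percentages above. Exon-level comparisons are restricted here to pairs with the same exon count, which is one possible convention among several.

```python
def exon_structure_conservation(ortholog_pairs):
    """Fractions of conserved exon count, exon length and exon phase.

    ortholog_pairs: list of (human_exons, mouse_exons), where each gene
    is a list of (length, phase) tuples in transcript order.
    """
    same_count = [
        (human, mouse)
        for human, mouse in ortholog_pairs
        if len(human) == len(mouse)
    ]
    # Compare exons position by position within same-count pairs only.
    exon_pairs = [
        (h, m)
        for human, mouse in same_count
        for h, m in zip(human, mouse)
    ]

    def fraction(hits, total):
        return hits / total if total else 0.0

    return {
        "same_exon_count": fraction(len(same_count), len(ortholog_pairs)),
        "same_length": fraction(
            sum(h[0] == m[0] for h, m in exon_pairs), len(exon_pairs)
        ),
        "same_phase": fraction(
            sum(h[1] == m[1] for h, m in exon_pairs), len(exon_pairs)
        ),
    }


if __name__ == "__main__":
    toy_pairs = [
        ([(100, 0), (50, 1)], [(100, 0), (53, 1)]),
        ([(90, 0)], [(90, 0), (30, 0)]),
    ]
    print(exon_structure_conservation(toy_pairs))
```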


Summary

  • Genes are complex structures that are difficult to predict with the required level of accuracy/confidence

  • Different approaches to gene finding improve accuracy/confidence of the predictions:

    • Ab Initio : GenScan

    • Ab Initio modified by BLAST homologies: GenomeScan

    • Homology guided: GeneWise

    • Gene prediction and sequence alignment performed separately: Rosetta

    • Ab initio with similarity to known proteins: SGP-2


