


CSCE555 Bioinformatics

Lecture 9 Gene Finding & Comparative genomics

Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun Hu

Course page:


University of South Carolina

Department of Computer Science and Engineering



  • Performance Evaluation of Gene Finding programs

  • Comparative genomics:

    • What to do

    • Tools

    • Databases

    • Application case

Accuracy Measures of Gene-Finding Programs

Sensitivity vs. Specificity (adapted from Burset & Guigo 1996)

[Figure: Coding / No Coding regions, with outcome labels including FN and TN]
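The nucleotide-level measures can be sketched in a few lines. This is a minimal illustration, assuming the definitions commonly used in the gene-finding literature (Sn = TP/(TP+FN), Sp = TP/(TP+FP), and CC as the correlation between predicted and actual coding labels). The function name and the counts are hypothetical and are not taken from the slides.

```python
from math import sqrt

def nucleotide_accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from per-nucleotide counts.

    tp: coding nucleotides predicted as coding
    fp: non-coding nucleotides predicted as coding
    tn: non-coding nucleotides predicted as non-coding
    fn: coding nucleotides predicted as non-coding
    """
    sn = tp / (tp + fn)  # share of true coding nucleotides that were found
    sp = tp / (tp + fp)  # share of predicted coding nucleotides that are real
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return sn, sp, cc

# Hypothetical counts for a 1,000-nucleotide test sequence
sn, sp, cc = nucleotide_accuracy(tp=180, fp=20, tn=760, fn=40)
print(f"Sn={sn:.3f} Sp={sp:.3f} CC={cc:.3f}")
```

Note that Sp as defined here is what other fields call precision; it does not use TN, which is why CC is reported alongside it.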

Test Datasets

  • Sample tests reported in the literature

    • Test on the set of 570 vertebrate gene seqs (Burset & Guigo 1996) as a standard for comparison of gene finding methods.

    • Test on the set of 195 seqs of human, mouse or rat origin (named HMR195) (Rogic 2001).

Results: Accuracy Statistics

Table: Relative Performance (adapted from Rogic 2001)

  • Complicating Factors for Comparison

    • Gene finders were trained on data that had genes homologous to the test seqs; the percentage of overlap varies

    • Some gene finders were able to tune their methods for particular data

    • Methods continue to be developed

# of seqs - number of seqs effectively analyzed by each program; in parentheses is the number of seqs where the absence of a gene was predicted

Sn - nucleotide-level sensitivity; Sp - nucleotide-level specificity

CC - correlation coefficient

ESn - exon-level sensitivity; ESp - exon-level specificity
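The exon-level measures can be illustrated the same way. The sketch below assumes the usual convention that a predicted exon counts as correct only when both boundaries match an annotated exon exactly; the coordinates are made up for illustration.

```python
def exon_accuracy(actual_exons, predicted_exons):
    """Return (ESn, ESp) for exons given as (start, end) tuples.

    An exon is counted as correct only if both coordinates match exactly.
    """
    actual = set(actual_exons)
    predicted = set(predicted_exons)
    correct = actual & predicted
    esn = len(correct) / len(actual)     # share of real exons recovered exactly
    esp = len(correct) / len(predicted)  # share of predictions that are exact
    return esn, esp

# Hypothetical annotation and prediction for one gene
actual = [(100, 250), (400, 520), (800, 950), (1200, 1300)]
predicted = [(100, 250), (400, 530), (800, 950)]  # second exon has a wrong 3' end
esn, esp = exon_accuracy(actual, predicted)
print(f"ESn={esn:.2f} ESp={esp:.2f}")
```

The second predicted exon overlaps a real exon almost completely but still counts as wrong, which is why exon-level scores are typically lower than nucleotide-level scores for the same prediction.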

  • Needed

    • Train and test methods on the same data.

    • Do cross-validation (10% leave-out)
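A 10% leave-out scheme can be sketched as 10-fold cross-validation: the sequences are shuffled once, split into ten folds, and each fold is held out in turn while the gene finder is trained on the rest. The helper name, the seed, and the sequence identifiers below are hypothetical; only the count of 195 echoes the HMR195 set mentioned above.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists; each item lands in exactly one test fold."""
    items = list(items)
    random.Random(seed).shuffle(items)        # fixed seed for reproducibility
    folds = [items[i::k] for i in range(k)]   # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Placeholder identifiers standing in for a 195-sequence test set
sequences = [f"seq{i:03d}" for i in range(195)]
splits = list(k_fold_splits(sequences, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

For gene finding, a further precaution would be to keep homologous sequences in the same fold, since homology between training and test data is one of the complicating factors listed above.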

GenScan compared to other gene-finding programs

Why not Perfect?

  • Gene Number

    usually approximately correct, but not always

  • Organism

    primarily for human/vertebrate seqs; accuracy may be lower for non-vertebrates. ‘Glimmer’ & ‘GeneMark’ for prokaryotic or yeast seqs

  • Exon and Feature Type

    Internal exons: predicted more accurately than Initial or Terminal exons;

    Exons: predicted more accurately than Poly-A or Promoter signals

  • Biases in Test Set (Resulting statistics may not be representative)

Eukaryotic Gene Finding Tools

  • Genscan (ab initio), GenomeScan (hybrid)

  • Twinscan (hybrid)

  • FGENESH (ab initio)

  • GeneMark.hmm (ab initio)

  • MZEF (ab initio)

  • GrailEXP (hybrid)

  • GeneID (hybrid)

Outline for Comparative Genomics

  • Overview

  • Why do comparative genomic analysis?

  • Assumptions/Limitations

  • Genome Analysis and Annotation Standard Procedure

  • General Purposes Databases for Comparative Genomics

  • Organism Specific Databases

  • Genome Analysis Environments

  • Genome Sequence Alignment Programs

  • Genomic Comparison Visualization Tools

What is comparative genomics?

  • Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease

  • Understand what makes each species unique

What is compared?

  • Gene location

  • Gene structure

    • Exon number

    • Exon lengths

    • Intron lengths

    • Sequence similarity

  • Gene characteristics

    • Splice sites

    • Codon usage

    • Conserved synteny
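As a small illustration of the gene-structure features listed above, the sketch below derives exon number, exon lengths, and intron lengths from exon coordinates so that two homologous genes can be compared side by side. The coordinates are invented (1-based, inclusive) and the function name is hypothetical.

```python
def gene_structure(exons):
    """Summarize a gene given its exons as (start, end), 1-based inclusive."""
    exons = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in exons]
    # An intron spans the bases strictly between consecutive exons
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]
    return {
        "exon_number": len(exons),
        "exon_lengths": exon_lengths,
        "intron_lengths": intron_lengths,
    }

# Invented coordinates for a pair of homologous genes
human = gene_structure([(1, 120), (501, 650), (1201, 1300)])
mouse = gene_structure([(1, 120), (421, 570), (1001, 1100)])
print(human)
print(mouse)
```

In this made-up pair the exon number and exon lengths agree while the intron lengths differ, the kind of pattern such a comparison is meant to surface.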

Figure 1: Regions of the human and mouse homologous genes: coding exons (white), noncoding exons (gray), introns (dark gray), and intergenic regions (black). Corresponding strong (white) and weak (gray) alignment regions of GLASS are shown connected with arrows. Dark lines connecting the alignment regions denote very weak or no alignment. The predicted coding regions of ROSETTA in human, and the corresponding regions in mouse, are shown (white) between the genes and the alignment regions.

Sequenced prokaryotic genomes

[Table: organism, associated disease or note, and sequencing status; many entries are marked "In progress"]

  • Bacteroides fragilis

  • Bordetella bronchiseptica

  • Bordetella parapertussis: whooping cough

  • Bordetella pertussis: whooping cough

  • Burkholderia cepacia: lung infections in CF

  • Burkholderia pseudomallei

  • Chlamydophila abortus

  • Clostridium botulinum

  • Clostridium difficile

  • Corynebacterium diphtheriae

  • Erwinia carotovora: plant pathogen

  • Escherichia/Shigella spp.

  • Mycobacterium bovis

  • Mycobacterium marinum

  • Neisseria meningitidis (serogroup C): bacterial meningitis

  • Salmonella typhi: typhoid fever

  • Salmonella spp.

  • Staphylococcus aureus: various (nosocomial)

  • Staphylococcus aureus: various (community acquired)

  • Streptococcus pneumoniae: bacterial meningitis

  • Streptococcus pyogenes: various (ARF

  • Streptococcus suis

  • Streptococcus uberis

  • Streptomyces coelicolor

  • Tropheryma whipplei: Whipple’s disease

  • Wolbachia (Culex quinquefasciatus): vector (Bancroftian filariasis)

  • Wolbachia (Onchocerca volvulus): river blindness

  • Yersinia enterocolitica: food poisoning

  • Yersinia pestis

Sequenced eukaryotic genomes

  • Aspergillus fumigatus: farmer’s lung (in progress)

  • Dictyostelium discoideum: soil amoeba (in progress)

  • Entamoeba histolytica: amoebic dysentery (in progress)

  • Leishmania major (in progress)

  • Plasmodium falciparum (in progress)

  • Schistosoma mansoni (in progress)

  • Schizosaccharomyces pombe: fission yeast

  • Theileria annulata (in progress)

  • Toxoplasma gondii (in progress)

  • Trypanosoma brucei: sleeping sickness (in progress)

Bioinformatics Flow Chart

1a. Sequencing

1b. Analysis of nucleic acid seq.

2. Analysis of protein seq.

3. Molecular structure prediction

4. Molecular interaction

5. Metabolic and regulatory networks

6. Gene & Protein expression data

7. Drug screening

Ab initio drug design OR drug compound screening in database of molecules

8. Genetic variability

Genome Sequencing Process

Genomic DNA


Subclone and Sequence

Shotgun reads



Finishing read


Complete sequence

Genome Sequencing - Review

  • Clone by clone vs whole genome shotgun

  • Subcloning; generate small insert libraries

  • Assembly: Process of taking raw single-pass reads into contiguous consensus sequence (Phred/Phrap)

  • Closure: Process of ordering and merging consensus sequences into a single contiguous sequence

  • Annotation:

    • DNA features (repeats/similarities)

    • Gene finding

    • Peptide features

    • Initial role assignment

    • Others - regulatory regions

  • Release data to the public, e.g. EMBL or GenBank

  • Most genomes will be sequenced and can be sequenced; few problems are unsolvable.

  • Problem lies in understanding what you have:

    • Gene prediction/gene finding

    • Annotation

Annotation of eukaryotic genomes

Genomic DNA

ab initio gene prediction


Unprocessed RNA

RNA processing

Mature mRNA



Comparative gene prediction


Nascent polypeptide


Active enzyme

Functional identification


Reactant A

Product B

Why do comparative genomics?

  • Many of the genes encoded in each genome from the genome projects had no known or predictable function

  • Analysis of protein set from completely sequenced genomes

  • Uniform evolutionary conservation of proteins in microbial genomes: 70% of gene products from sequenced genomes have homologs in distant genomes (Koonin et al., 1997)

  • Function of many of these genes can be predicted by comparing different genomes of known functional annotation and transferring functional annotation of proteins from better studied organisms to their orthologs in lesser studied organisms.

  • Cross species comparison to help reveal conserved coding regions

  • No prior knowledge of the sequence motif is necessary

  • Complement to algorithmic analysis

Assumptions/Limitations

  • Homologous genes are relatively well preserved, while noncoding regions tend to show varying degrees of conservation. Conserved noncoding regions are believed to be important in regulating gene expression and maintaining the structural organization of the genome, and most likely have other functions as well.

  • Cross species comparative genomics is influenced by the evolutionary distance of the compared species.

Genome Analysis and Annotation: General Procedure

Basic procedure to determine the functional and structural annotation of uncharacterized proteins:

  • Use a sequence similarity search program such as BLAST or FASTA to identify all the functional regions in the sequence. If greater sensitivity is required, programs based on the Smith-Waterman algorithm are preferred, at the cost of greater analysis time.

  • Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.

  • Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity

  • Generate a secondary and tertiary (if possible) structure prediction

  • Annotation:

    • Transfer of function information from a well-characterized organism to a lesser studied organism and/or

    • Use phylogenetic patterns (or profiles) and/or

    • Use the phylogenetic pattern search tools (e.g. through COGs) to perform systematic formal logical operations (AND, OR, NOT) on gene sets -- differential genome display (Huynen et al., 1997).
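The logical operations on gene sets mentioned in the last step map directly onto set operations. The sketch below is only an illustration: the family names, genome names, and presence/absence data are invented, and it does not call any COG tool.

```python
# Hypothetical presence/absence data: protein family -> genomes that contain it
patterns = {
    "family_1": {"pathogen_A", "pathogen_B", "commensal_C"},
    "family_2": {"pathogen_A", "pathogen_B"},
    "family_3": {"commensal_C"},
    "family_4": {"pathogen_A"},
}

def families_in(genome, patterns):
    """Return the set of families present in one genome."""
    return {fam for fam, genomes in patterns.items() if genome in genomes}

a = families_in("pathogen_A", patterns)
b = families_in("pathogen_B", patterns)
c = families_in("commensal_C", patterns)

shared_pathogen_only = (a & b) - c   # A AND B AND NOT C
in_either = a | c                    # A OR C

print(sorted(shared_pathogen_only))
print(sorted(in_either))
```

Families that are present in both pathogens but absent from the related commensal are natural candidates for follow-up, which is the idea behind differential genome display.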

Automated Genome Annotation

  • GeneQuiz – limited number of searches/day

  • MAGPIE – outside users cannot submit own seq

  • PEDANT – commercial version allows for full capacity

  • SEALS – semi automated

General Databases Useful for Comparative Genomics

  • Locus Link/RefSeq:

  • PEDANT -Protein Extraction Description ANalysis Tool

  • MIPS –

  • COGs - Cluster of Orthologous Groups (of proteins)

  • KEGG - Kyoto Encyclopedia of Genes and Genomes

  • MBGD - Microbial Genome Database

  • GOLD - Genome OnLine Database

  • TOGA –

Problems with existing sequence alignment algorithms for genomic analysis

  • Most algorithms were developed for comparing single protein sequences or DNA sequences containing a single gene

  • Most algorithms were based on assigning a score to all the possible alignments (usually by the sum of the similarity/identity values for each aligned residue minus a penalty for the introduction of gaps) and then finding the optimal or near-optimal alignment based on the chosen scoring scheme.

  • Unfortunately, most of these programs cannot accurately handle long alignments.

  • Linear-space Smith-Waterman variants are computationally intensive: they either require specialized hardware (memory-limited) or are very time-consuming. The trade-off is higher speed vs increased sensitivity.
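The scoring scheme described above (similarity values for aligned residues minus gap penalties, then taking the optimum) can be made concrete with a small local-alignment scorer. This is a minimal sketch, assuming a simple match/mismatch score and a linear gap penalty with illustrative default values; it keeps only two rows of the matrix, so memory is linear in the shorter dimension, but time is still proportional to the product of the two lengths.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (score only)."""
    best = 0
    prev = [0] * (len(b) + 1)            # scores for the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                        # start a new local alignment
                prev[j - 1] + s,          # align a[i-1] with b[j-1]
                prev[j] + gap,            # gap in b
                curr[j - 1] + gap,        # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

For two sequences of 1 Mbp each, the inner loop would run about 10^12 times, which illustrates why this approach does not scale to whole-genome comparisons without heuristics.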

Genome-size comparative alignment tools

  • ASSIRC - Accelerated Search for SImilarity Regions in Chromosomes

    • (Vincens et al. 1998)

  • BLAT –

    • (Kent xxx)

  • DIALIGN - DIagonal ALIGNment

    • (Morgenstern et al. 1998; Morgenstern 1999)

  • DBA - DNA Block Aligner

    • (Jareborg et al. 1999)

  • GLASS - GLobal Alignment SyStem

    • (Batzoglou et al. 2000)

  • LSH-ALL-PAIRS - Locality-Sensitive Hashing in ALL PAIRS

  • MegaBlast

    • (Zhang 2000)

  • MUMmer - Maximal Unique Match (mer)

    • (Delcher et al. 1999)

  • PIPMaker - Percent Identity Plot MAKER

    • (Schwartz et al. 2000)

  • SSAHA – Sequence Search and Alignment by Hashing Algorithm


  • WABA - Wobble Aware Bulk Aligner

    • (Kent & Zahler 2000)

SSAHA

  • Sequence Search and Alignment by Hashing Algorithm

  • Software tool for very fast matching and alignment of DNA sequences.

  • Achieves fast search speed by converting sequence information into a hash table data structure which can then be searched very rapidly for matches


  • Run from the Unix command line

  • Need > 1GB RAM (needs a lot of memory)

  • SSAHA algorithm best for application requiring exact or “almost exact” matches between two sequences – e.g. SNP detection, fast sequence assembly, ordering and orientation of contigs
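The hash-table idea can be sketched as a k-mer index. This is a simplified illustration and not the SSAHA implementation: it assumes the subject sequence is indexed by non-overlapping words of length k while the query is scanned with overlapping words, and all names and sequences are hypothetical.

```python
from collections import defaultdict

def build_index(subject, k):
    """Hash table mapping each non-overlapping k-mer to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(subject) - k + 1, k):
        index[subject[pos:pos + k]].append(pos)
    return index

def find_hits(index, query, k):
    """Return (query_pos, subject_pos) for every query k-mer found in the index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, spos))
    return hits

subject = "ACGTTGCAAGGCTTAC"
query = "GTTGCAAGGCT"            # an exact substring of the subject
index = build_index(subject, k=4)
hits = find_hits(index, query, k=4)
print(hits)
```

Hits that share the same offset (subject position minus query position) lie on one diagonal and support a single exact or near-exact match. The whole index is held in memory, which is consistent with the RAM requirement noted above.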

Genome Analysis Environment

  • MAGPIE - Automated Genome Project Investigation Environment



Problems with Visualizing Genomes

  • Alignment program output was often viewed as text files, which are difficult to interpret intuitively when comparing genomes.

  • Visualization tools are needed to handle the complexity and volume of data and to present the information in a comprehensive and comprehensible manner to a biologist for interpretation.

  • Genome Alignment Visualization tools need to provide:

    • interpretable alignments,

    • gene prediction and database homologies from different sources

    • Interactive features: real time capabilities, zooming, searching specific regions of homologies

    • Represent breaks in synteny

    • Multiple alignments display

    • Displaying contigs of unfinished genomes with finished genomes

    • Handle various data formats

    • Software availability (no black box)

Genome Comparison Visualization Tools

  • ACT - Artemis Comparison Tool (displays parsed BLAST alignments; based on Artemis – an annotation tool)


  • Alfresco (displays DBA alignments and ...)

    • (Jareborg & Durbin 2000)

  • PipMaker (displays BlastZ alignments)

    • (Schwartz et al. 2000)

  • Enteric/Menteric/Maj (displays Blastz alignments)

    • (Florea et al. 2000; McClelland et al. 2000)

  • Intronerator (displays WABA alignments and ...)

    • (Kent & Zahler 2000b)

  • VISTA (Visualization Tool for Alignment) (displays GLASS alignments)


  • SynPlot (displays DIALIGN and GLASS alignments)


Artemis Comparison Tool (ACT)

  • ACT is a DNA sequence comparison viewer based on Artemis

  • Can read complete EMBL and GenBank entries or sequence in FASTA or raw format

  • Additional sequence features can be in EMBL, GenBank, or GFF format

  • ACT is free software and is distributed under the GNU General Public License

  • Java-based software

  • The latest release, 2.0, better supports eukaryotic genome comparison

Salmonella typhi vs. E. coli – SPI-2

[Figure labels: phage/IS genes; Blast hits]

Neisseria meningitidis - A vs. B comparison - ACT

A Case Study: Comparison of mouse chromosome 16 and the human genome

  • Mural et al., Science, 2002, 296:1661

  • Celera group

  • Synteny with human chr.’s 3,8,12,16,21,22 and rat chr.’s 10,11

    Q: Why more breakpoints in mouse-human than in mouse-rat?

    Q: Why more conserved genes in human than in rat?

  • This also can occur between chromosomes

  • The longer the divergence time between 2 species, the more recombination has occurred

  • 100 million years since human-mouse divergence

  • 40 million years since rat-mouse divergence

  • Whole-genome shotgun sequencing:

  • Genome is cut into small sections

  • Each section is hundreds or a few thousand bp of DNA

  • Each section is sequenced and put in a database

  • A computer aligns all sequences together (millions of them from each chromosome) to form contigs

  • Contigs are arranged (using markers, etc) to form scaffolds

  • Q: What are the advantages of this over the traditional method?

  • Q: What are the potential sources of error?
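The contig-forming step can be illustrated with a toy greedy assembler that repeatedly merges the two reads with the longest suffix-prefix overlap. This is a sketch under strong assumptions (error-free reads, a single strand, no repeats) and is not how a production assembler works; the reads and the minimum overlap are invented.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads by largest overlap until no overlap of min_len remains."""
    contigs = list(dict.fromkeys(reads))      # drop exact duplicate reads
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs                    # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]  # fragments of ATGGCGTACGTTAGC
print(greedy_assemble(reads))
```

Repeated sequence is the main way this goes wrong: two reads from different copies of a repeat overlap just as well as true neighbours, which is one potential source of error behind the question above.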

1. Assembly of Mmu16

  • Total size: 99Mbp

  • Not one contiguous sequence (contig)

  • 8,635 contigs on 20 “scaffolds”

  • Average scaffold size: 10Mbp

  • Number of gaps: 8615

  • Total size of gaps: ~6Mbp

  • Total coverage: ~93Mbp
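Two of these figures follow from the others, and the short check below makes that explicit. It assumes that every gap lies between two neighbouring contigs on the same scaffold, so a scaffold with n contigs has n - 1 gaps; the variable names are ours, the numbers are from the list above.

```python
contigs = 8635
scaffolds = 20
total_size_mbp = 99
gap_size_mbp = 6          # approximate, as stated in the list above

# Each scaffold with n contigs contains n - 1 internal gaps
gaps = contigs - scaffolds
# Sequence actually covered is the total span minus the gaps
coverage_mbp = total_size_mbp - gap_size_mbp

print(gaps, coverage_mbp)
```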

2. Identify genes in Mmu16

  • Scaffolds of >10kbp were examined (scaffolds larger than 1Mbp were chopped)

  • Regions with repeat motifs were ignored using RepeatMasker

  • Several gene prediction engines were used (GenScan, Grail, Fgenes)

  • Amino acid sequences from open reading frames searched against nr protein db (NCBI)

  • Nucleotide searches (using DNA from across scaffolds) performed against:

    • Celera’s gene clusters

    • Mmu, Rno, & Hsa EST db’s

    • NCBI’s RefSeq mRNA db

    • Celera’s dog genomic db

    • Public pufferfish genomic db

2. Identify genes in Mmu16

  • 1055 genes with high & medium confidence were predicted

  • Other efforts have identified 1142 genes

  • After visual inspection of the annotation, pseudogenes and annotation errors were removed, leaving 731 homologous genes

  • The genes found were mostly orthologues because they were reciprocal best matches by BLAST searches.
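The reciprocal-best-match criterion can be sketched as follows. The input is assumed to be two tables of similarity scores, one per search direction, such as might be parsed from BLAST output; the gene identifiers and scores are invented, and real pipelines would also apply score or coverage thresholds.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict of (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)

# Invented scores for mouse genes (m*) and human genes (h*)
mouse_vs_human = {("m1", "h1"): 950, ("m1", "h2"): 400,
                  ("m2", "h2"): 700, ("m3", "h2"): 650}
human_vs_mouse = {("h1", "m1"): 940, ("h2", "m2"): 710, ("h2", "m3"): 640}

print(reciprocal_best_hits(mouse_vs_human, human_vs_mouse))
```

Here m3 finds h2 as its best match, but h2 prefers m2, so the pair is not reciprocal and m3 is left without an assigned orthologue.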

3. Identify regions of conserved synteny between Mmu16 and Hsa

  • Regions of conserved synteny predicted by sequence similarity and by protein comparisons

  • Synteny based on sequence comparisons:

  • Syntenic anchors were located - regions with high (80%) similarity over short distances (~200bp or more).

  • Average distance between anchors is 8kbp, but there are gaps as large as 707kbp in the mouse and 3.4Mbp in the human

3. Identify regions of conserved synteny between Mmu16 and Hsa

  • 56% of anchors were in mouse genes - exons mostly

  • 44% in intergenic regions

  • Relative density is independent of coding/noncoding status, making the anchors an important marker of synteny (in addition to genes)

Human chr.     Mmu len.   Hsa len.   No. anchors   Bad anch. (% incon.)   Orthologues
16               10,461     12,329         1,429   21 (1.5)                        87
8                 1,284      1,491           121   1 (0.8)                          6
12                  363        306            31   3 (9.7)                          3
22                2,081      2,273           418   8 (1.9)                         30
3q27-29          13,557     16,461         1,714   18 (1.0)                       107
3q11.1-13.3      41,660     46,493         5,485   63 (1.1)                       165
21               22,327     28,421         2,127   27 (1.3)                       111

Summary

  • Performance evaluation of gene-finding programs

  • Comparative genomics

  • Comparative genomics analysis example

Acknowledgement

  • Chuong Huynh (NIH)