slide1 l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Carlo Colantuoni & Rafael Irizarry April 19, 2006 ccolantu@jhsph PowerPoint Presentation
Download Presentation
Carlo Colantuoni & Rafael Irizarry April 19, 2006 ccolantu@jhsph

Loading in 2 Seconds...

play fullscreen
1 / 56

Carlo Colantuoni & Rafael Irizarry April 19, 2006 ccolantu@jhsph - PowerPoint PPT Presentation


  • 165 Views
  • Uploaded on

Gene Annotation in Genomics Experiments With a Focus on Tools in BioConductor. Carlo Colantuoni & Rafael Irizarry April 19, 2006 ccolantu@jhsph.edu. Biological Setup. Every cell in the human body contains the entire human genome: 3.3 Gb or ~30K genes.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Carlo Colantuoni & Rafael Irizarry April 19, 2006 ccolantu@jhsph' - paul2


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

Gene Annotation in Genomics Experiments With a Focus on Tools in BioConductor

Carlo Colantuoni

&

Rafael Irizarry

April 19, 2006

ccolantu@jhsph.edu

slide2

Biological Setup

Every cell in the human body contains the entire human genome: 3.3 Gb or ~30K genes.

The investigation of gene expression is meaningful because different cells, in different environments, doing different jobs express different genes.

Tasks necessary for gene expression analysis:

Define what a gene is.

Identify genes in a sea of genomic DNA where <3% of DNA is contained in genes.

Design and implement probes that will effectively assay expression of ALL (most? many?) genes simultaneously. Cross-reference these probes.

slide4

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

~30K genes

Sequence is a Necessity

DNA

Probe

protein coding

START

STOP

mRNA

AAAAA

5’ UTR

3’ UTR

Genomic

DNA

3.3 Gb

slide5

Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.

From Genomic DNA to mRNA Transcripts

EXONS

INTRONS

~30K

>30K

Alternative splicing

Alternative start & stop sites in same RNA molecule

RNA editing & SNPs

Transcript coverage

Homology to other transcripts

Hybridization dynamics

3’ bias

slide6

Unsurpassed as source of expressed sequence

Sequence Quality!

Redundancy!

Completeness?

Chaos?!?

slide8

Transcript-Based

Gene-Centered Information

slide9

Design of Gene Expression Probes

Content: UniGene, Incyte, Celera Expressed vs. Genomic

Source: cDNA libraries, clone collections, oligos

Cross-referencing of array probes (across platforms):

Sequence <> GenBank <> UniGene <> HomoloGene

Possible mis-referencing:

Genomic GenBank Acc.#’s

Referenced ID has more NT’s than probe

Old DB builds

DB or table errors – copying and pasting 30K rows in excel …

Using RefSeq’s can help.

slide16

Functional Annotation of Lists of Genes

KEGG

PFAM

SWISS-PROT

GO

DRAGON

DAVID

BioConductor

slide24

One of the largest challenges in analyzing genomic data is associating the experimental data with the available metadata, e.g. sequence, gene annotation, chromosomal maps, literature.

  • The annotate and AnnBuilder packages provides some tools for carrying this out.
  • Using AnnBuilder. It is possible to build associations with specific gene lists, eg. hgu95apackage for Affymetrix HGU95A GeneChips.
  • The annotate package maps to GenBank accession number, LocusLink LocusID, gene symbol, gene name, UniGene cluster, chromosome, cytoband, physical distance (bp), orientation, Gene Ontology Consortium (GO), PubMed PMID.
slide26

Functional Gene/Protein Networks

DIP

BIND

MINT

HPRD

PubGene

Predicted Protein Interactions

slide34

Predicted Human Protein Interactions

Used high-throughput protein interaction experiments from fly, worm, and yeast to predict human protein interactions.

Human protein interaction is predicted if both proteins in an interaction pair from other organism have high sequence homology to human proteins.

>70K Hs interactions predicted

>6K Hs genes

slide36

Thanks to …

Rafael Irizarry

Scott Zeger

Jonathan Pevsner

Carlo Colantuoni

Clinical Brain Disorders Branch, NIMH, NIH

Dept. Biostatistics, JHSPH

ccolantu@jhsph.edu

colantuc@intra.nimh.nih.gov

slide38

NCBI Web Links

http://www.ncbi.nlm.nih.gov

http://www.ncbi.nlm.nih.gov/Entrez/

http://www.ncbi.nih.gov/Genbank/

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide

http://www.ncbi.nlm.nih.gov/dbEST/

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene

http://www.ncbi.nlm.nih.gov/LocusLink/

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

http://www.ncbi.nlm.nih.gov/PubMed/

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd

http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp

http://www.ncbi.nlm.nih.gov/SNP/

http://eutils.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html

http://www.ncbi.nlm.nih.gov/geo/

http://www.ncbi.nlm.nih.gov/RefSeq/

FTP:

ftp://ftp.ncbi.nlm.nih.gov/

ftp://ftp.ncbi.nlm.nih.gov/repository/UniGene

ftp://ftp.ncbi.nih.gov/pub/HomoloGene/

slide39

PROTEIN:

http://us.expasy.org/

ftp://us.expasy.org/

http://www.sanger.ac.uk/Software/Pfam/

http://www.sanger.ac.uk/Software/Pfam/ftp.shtml

http://smart.embl-heidelberg.de/

http://www.ebi.ac.uk/interpro/

http://us.expasy.org/prosite/

ftp://us.expasy.org/databases/prosite/

NUCLEOTIDE:

http://genome.ucsc.edu/

http://www.embl-heidelberg.de/

http://www.ensembl.org/

http://www.ebi.ac.uk/

http://www.gdb.org/

http://bioinfo.weizmann.ac.il/cards/index.html

http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl

PATHWAYS and NETWORKS:

http://www.genome.ad.jp/kegg/

ftp://ftp.genome.ad.jp/pub/kegg/ (http://www.genome.ad.jp/anonftp/)

http://dip.doe-mbi.ucla.edu

http://dip.doe-mbi.ucla.edu/dip/Download.cgi

http://www.blueprint.org/bind/

http://www.blueprint.org/bind/bind_downloads.html

http://160.80.34.4/mint/index.php

http://160.80.34.4/mint/release/main.php

http://www.hprd.org/

http://www.hprd.org/FAQ?selectedtab=DOWNLOAD+REQUESTS

http://www.pubgene.org/ (also .com)

More Web Links

http://www.bioconductor.org/

http://apps1.niaid.nih.gov/david/

http://www.geneontology.org/

http://discover.nci.nih.gov/gominer/index.jsp

http://pubmatrix.grc.nia.nih.gov/

http://pevsnerlab.kennedykrieger.org/dragon.htm

slide40

SAVAGE:

Detection of More Subtle Functionally Related Groups of Gene Expression Changes

slide41

KEGG

EXP#1

PFAM

Swiss-Prot

~40K annotations

Differential Expression of Functional

Gene Groups within One Experiment

DRAGON

SAVAGE

30K

10K

~3K

slide42

BioDB

EXP#1

EXP#4

EXP#3

EXP#2

Differential Expression of a Single Functional

Gene Group Across Multiple Experiments

DRAGON

SAVAGE

slide43

3

X

2

1

EXP#1

EXP#1

5

4

4

3

EXP#2

EXP#2

5

X

1

2

Similar Differential Expression Patterns Across Multiple Experiments

p value

<0.1

0.0

CN

CN

CN

ALL

CN

CN

The distribution of gene expression values for each gene group in each sample is plotted as a single point in low dimensional space. This is achieved using Principal Components Analysis along with Non-Metric Multi-Dimensional Scaling.

slide44

PING:

Detection of Differential Expression in Functional Networks of Proteins

slide46

Large Protein Interaction Network

Network Regulated in Sample #1

slide47

Large Protein Interaction Network

Network Regulated in Sample #2

Network Regulated in Sample #1

slide48

Large Protein Interaction Network

Network Regulated in Sample #3

Network Regulated in Sample #1

Network Regulated in Sample #2

slide49

Large Protein Interaction Network

Network

of Interest

PING

Network Regulated in Sample #1

Network Regulated in Sample #2

Network Regulated in Sample #3

slide51

Genomic DNA Content

  • Interspersed repeats (~1/2 Hs. genome)
  • (Processed) pseudogenes
  • Simple sequence repeats
  • Segmental duplications (~5% Hs. genome)
  • Blocks of tandem repeats (can be very large)
  • Genes: Promoters - Exons – Introns <3%

prokaryotes eukaryotes

defining what a gene is - protein coding unit of genomic DNA with an mRNA intermediate

identifying genes within genomic DNA

protein-coding genes (mRNA)

functional RNA genes - tRNA, rRNA, snoRNA, snRNA, miRNA

slide52

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

Protein

protein coding

START

STOP

mRNA

AAAAA

5’ UTR

3’ UTR

Genomic

DNA

3.3 Gb

slide53

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

~30K genes

Sequence is a Necessity

Protein

protein coding

START

STOP

mRNA

AAAAA

5’ UTR

3’ UTR

Genomic

DNA

3.3 Gb

slide54

How is a gene defined

in “wet” biology and in silico?

Seq. from mRNA sample

Array probe design:

Source – cDNA libraries, oligos, clone collections

Content – UniGene, Celera, Incyte

Transcript coverage

Homology to other transcripts

Hybridization dynamics – hyper-multiplex hyb rxn

Empirical validation

3’ bias

Cross-referencing of array probes:

GenBank <> UniGene <> HomoloGene

Possible mis-referencing:

Genomic GenBank Acc.#’s

Referenced ID has more NT’s than probe

Old DB builds

DB or table errors

Alt. splicing - known and not

Alt. start / stop site in same RNA molecule

Less important: RNA editing, SNPs

Seq. on array

slide55

Finding genes in eukaryotic DNA

ORF identification – Three Letter Genetic Code (codons) 4*4*4. It is possible to translate any stretch of genomic DNA into protein, but that doesn’t mean we have identified a protein coding gene!

There are several kinds of exons:

-- non-coding

-- initial coding exons

-- internal exons

-- terminal exons

-- some single-exon genes are intronless

slide56

What We Are Going To Cover

Cells, Genes, Transcripts –> Genomics Experiments

Sequence Knowledge Behind Genomics Experiments

Annotation of Genes in Genomics Experiments