Decoding ENCODE

Decoding ENCODE Jim Kent University of California Santa Cruz

ENCODE Timeline • ENCyclopedia Of DNA Elements. • Attempt to catalog as many functional elements in the human genome as possible using current technologies. • Pilot project - finished 2007, covered 1% of genome. • Production project - ramping up now. Genome-wide. Should have major amounts of data in 6 months.

ENCODE Experiments • Chromatin state: • DNase hypersensitivity assays • Chromatin Immunoprecipitation (ChIP) • Histones in various methylation states • Sequence-specific transcription factors • DNA methylation • Chromatin conformation capture (5C) • Functional RNA discovery • Nuclear & cytoplasmic, short & long • RNA Immunoprecipitation • Comparative Genomics • Human curated gene annotation

Role of UCSC • Display data in context of what else is known on the UCSC Genome Browser and in other tools. • Facilitate analysis of the data with both Web-based and command line tools.

A Peek at the Pilot Project

ENCODE pilot data at genome.ucsc.edu

Correlation at gene starts in enr221

Transcription at enm221

ENCODE Chromatin Immunoprecipitation

Scientific Highlights of Pilot • Transcription: • Lots of transcription outside of known genes. • Transcribed areas outside of known genes are not very well conserved across species. • Lots of rare splice variants, also poorly conserved. • DNA/Protein Interactions • Good correlation between histone marks, gene starts, and _active_ transcription. • Lots of “occupied transcription factor binding sites” that are not conserved, not near promoters, etc. • Biological noise? • Main controversy was whether to explain much of the data as “biological noise” that was tolerated but not necessary for function.

From Pilot to Production Phase

ENCODE Production Phase • Moving from microarray based assays to assays based on next-generation sequencing. (ChIP-chip to ChIP-seq) • Genome-wide rather than regional. • Broader set of cell lines used more consistently between labs. • Broader set of antibodies. • Some new technology development continues.

ENCODE Cell Lines • Tier 1 - used in ALL experiments • GM12878 (lymphoblastoid cell line) • K562 (chronic myeloid leukemia) • Tier 2 - used in most experiments • HepG2 (hepatocellular carcinoma) • HeLa-S3 (cervical carcinoma) • HUVEC (umbilical vein endothelial cells) • Keratinocyte (normal epidermal cells) • Likely will do an embryonic stem cell line too. • Tier 3 - used in one or two experiments • Many of these are for assays such as DNase hypersensitivity and RNA measurements, where a separate experiment is not needed for each antibody.

Simple Model of Eukaryotic Transcription Regulation • Initially chromatin is “opened” to allow transcription factors to access DNA. • Multiple transcription factors bind to DNA in combination. • Most factors have such small DNA binding sites that by themselves they bind neither specifically nor stably. • The right combination of factors in open chromatin leads to active transcription starting at the initiation complex. • With the ENCODE experiments we can directly test most aspects of this model.

Chromatin Experiments • In general applied across a large number of cell lines. • DNase I hypersensitivity • Formaldehyde Assisted Isolation of Regulatory Elements (FAIRE) • Methylation of CpG Islands • ChIP-seq of relevant factors • H3K4me1,2,3, H3K9me3, H4K20me3, H3K27me3, H3K36me3, RNA Pol II, etc.

Transcription Factor ChIP • Many antibodies in modest number of cell lines. • Limited by good antibodies, hope for 100 or more. • Current good antibodies include • E2F1, E2F4, E2F6, KAP1, L3MBTL2, STAT1, CtBP1, CtBP2, SETDB1, ZNF180, ZNF239, ZNF263, ZNF266, ZNF317, ZNF342 • Part of project pipeline for raising and testing antibodies.

RNA measurement • RNA-seq of poly-A selected RNA to measure mRNA levels in many cell lines. • Sequencing of G-cap selected tags (CAGE) • Sequencing 5’ and 3’ ends (paired end tags) • Measurement of RNAs of several types in several cell compartments of a few cell lines. • Long/short, polyA/nonPolyA, associated with proteins/not associated with proteins • Nucleus, cytosol, polysomes, chromatin, nucleolus

New Pilot Projects Starting to Sprout

New Pilot Projects • Immunoprecipitation of RNA binding proteins/RNA sequencing. • Mapping silencers and enhancers with transient transfection assays • Computational identification of active promoters • Deep comparative sequencing in targeted regions and conservation analysis. • Chromatin Conformation Capture Carbon Copy (5C) to capture long range regulatory elements and their targets.

ENCODE Timeline • Grants funded for 4 years starting Sept 2007. • First production data just now starting to roll into UCSC, not quite ready for public display. • Data should accumulate quickly over next few years.

Data Release Policy • Once data are reproducible (at least 2 of 3 replicates agree), they should be released to the public within a month. • Data is still considered pre-publication! • OK to publish a paper using data on a few genes. • Please wait for consortium papers before publishing papers doing full genome analysis. • Anyone can join the ENCODE consortium analysis group to help us write the papers. • The consortium has only ~1 year after data release to write papers; after that, full genome analysis is fair game to publish. • If in doubt please contact the consortium via UCSC.

Web Works for Mice and Men

Mouse ES Cell Chromatin IP • Brad Bernstein lab ChIP-seq based experiment on methylated histones now on UCSC Genome Browser. • Shows some of the user interfaces that will be used for the ENCODE data

List of mouse chromatin subtracks….

Signal densities of entire mouse chromatin data set.

The unending quest for genes

Gencode Project • Project to define structure (exons and introns) for all common splice variants of all genes. • Human curators merge many lines of evidence including • Computational gene predictions • RNA/DNA alignments • Paired end tags • Cross-species alignments • Possibly chromatin state data • PI is Tim Hubbard • Much of the work done by Havana group

Data Mining with Table Browser

Table Browser • Complete access to UCSC Database with results in tab-delimited format • Method for creating “custom tracks” by combining and filtering existing tracks. • Sample query - getting a table of Ensembl gene coordinates and associated Superfamily annotations.

Results of selecting fields from related tables: Ensembl Gene (ensGene) and Superfamily Description (sfDescription).
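The sample query pairs Ensembl gene coordinates with Superfamily annotations. As a rough offline illustration, the same pairing can be sketched on two tab-delimited exports. The column names (`name`, `chrom`, `txStart`, `txEnd`, `description`), the join key, and every identifier and value in the sample data are assumptions made up for this sketch, not actual Table Browser output.

```python
import csv
import io

# Hypothetical tab-delimited exports; all IDs, coordinates, and
# descriptions below are invented for illustration only.
ENS_GENE_TSV = (
    "name\tchrom\ttxStart\ttxEnd\n"
    "GENE_A\tchr1\t1000\t5000\n"
    "GENE_B\tchr2\t200\t900\n"
)
SF_DESCRIPTION_TSV = (
    "name\tdescription\n"
    "GENE_A\tSH3-domain\n"
    "GENE_A\tProtein kinase-like\n"
)


def read_tsv(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_on_name(genes, descriptions):
    """Inner-join gene rows to description rows on the shared 'name' column."""
    by_name = {}
    for row in descriptions:
        by_name.setdefault(row["name"], []).append(row["description"])
    joined = []
    for gene in genes:
        for desc in by_name.get(gene["name"], []):
            joined.append(
                (gene["name"], gene["chrom"],
                 int(gene["txStart"]), int(gene["txEnd"]), desc)
            )
    return joined


rows = join_on_name(read_tsv(ENS_GENE_TSV), read_tsv(SF_DESCRIPTION_TSV))
for row in rows:
    print("\t".join(str(field) for field in row))
```

A gene with several annotations yields one output row per annotation, and a gene with none is dropped, which is the usual inner-join behavior.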

Table Browser Filters • Getting list of Ensembl genes that have SH3 domains.
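The filter step amounts to keeping only rows whose description mentions the domain of interest. A minimal sketch of that idea follows, assuming rows are already available as (gene name, description) pairs; the gene names and descriptions are hypothetical, and the case-insensitive substring match is a simplification of whatever pattern the Table Browser filter form is given.

```python
def genes_with_domain(rows, keyword):
    """Return sorted unique gene names whose description contains keyword.

    The match is a case-insensitive substring test.
    """
    keyword = keyword.lower()
    return sorted({name for name, desc in rows if keyword in desc.lower()})


# Hypothetical (gene name, description) pairs.
rows = [
    ("GENE_A", "SH3-domain"),
    ("GENE_A", "Protein kinase-like"),
    ("GENE_B", "Immunoglobulin"),
    ("GENE_C", "SH3-domain"),
]

sh3_genes = genes_with_domain(rows, "SH3")
print(sh3_genes)
```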

Table Browser Intersection • Getting list of Ensembl genes that don’t intersect UCSC Known Genes
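The "don't intersect" query can be pictured as an interval test: keep each gene from one set that overlaps no gene in the other set. The sketch below assumes half-open (chrom, start, end, name) tuples with made-up coordinates and uses a simple quadratic scan. It illustrates the idea only and says nothing about how the Table Browser computes intersections internally.

```python
def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share any base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def not_intersecting(query, reference):
    """Return query features that overlap no reference feature."""
    return [
        q for q in query
        if not any(overlaps(q[:3], r[:3]) for r in reference)
    ]


# Hypothetical features: (chrom, start, end, name).
ensembl_genes = [
    ("chr1", 100, 200, "ENS_1"),
    ("chr1", 500, 600, "ENS_2"),
    ("chr2", 100, 200, "ENS_3"),
]
known_genes = [
    ("chr1", 150, 300, "KG_1"),
    ("chr1", 600, 700, "KG_2"),
]

novel = not_intersecting(ensembl_genes, known_genes)
print([name for _, _, _, name in novel])
```

With half-open coordinates, `ENS_2` (ending at 600) and `KG_2` (starting at 600) are adjacent rather than overlapping, so `ENS_2` is kept.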

Custom Track Output • Useful for visualizing results of queries in genome browser • The way to produce more complex queries. • Here we look at how well genes that are in Ensembl but not in UCSC Known Genes are conserved across species.
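A custom track is, at its simplest, a text file with a `track` line followed by BED-style rows of chromosome, start, end, and name. As a hedged sketch, query results held in memory could be written out this way; the track name, description, and features are hypothetical, and only the first four BED columns are emitted.

```python
def to_custom_track(name, description, features):
    """Format (chrom, start, end, label) tuples as a BED custom track string."""
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical query results: (chrom, start, end, label).
features = [
    ("chr1", 500, 600, "ENS_2"),
    ("chr2", 100, 200, "ENS_3"),
]

track_text = to_custom_track(
    "ensNotKnown", "Ensembl genes not in Known Genes", features
)
print(track_text, end="")
```

The resulting text can be saved to a file or pasted into the browser's custom track form.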
