ChIP Sequencing BMI/IBGP 730

Victor Jin, Ph.D.

(Slides from Dr. H. Gulcin Ozer)

Department of Biomedical Informatics


What is ChIP-Sequencing?

  • ChIP-Sequencing is a new frontier technology for analyzing protein interactions with DNA.

  • ChIP-Seq

    • Combines chromatin immunoprecipitation (ChIP) with ultra-high-throughput massively parallel sequencing

    • Allows mapping of protein–DNA interactions in vivo on a genome scale


Workflow of ChIP-Seq

Mardis, E.R. Nat. Methods 4, 613-614 (2007)


Workflow of ChIP-Seq


Johnson et al, 2007

  • ChIP-Seq technology is used to understand in vivo binding of the neuron-restrictive silencer factor (NRSF)

  • Results are compared to known binding sites

    • ChIP-Seq signals agree strongly with existing knowledge

  • Sharp resolution of binding position

  • New noncanonical NRSF binding motifs are identified


Robertson et al, 2007

  • ChIP-Seq technology is used to study genome-wide profiles of STAT1 DNA association

  • STAT1 targets in interferon-γ-stimulated and unstimulated human HeLa S3 cells are compared

  • The performance of ChIP-Seq is compared to the alternative protein-DNA interaction methods of ChIP-PCR and ChIP-chip.

  • 41,582 and 11,004 putative STAT1 binding regions are identified in stimulated and unstimulated cells, respectively.


Why ChIP-Sequencing?

  • Current microarray and ChIP-chip designs require knowing the sequence of interest in advance (for example a promoter, enhancer, or RNA-coding domain).

  • Lower cost

  • Less work in ChIP-Seq

  • Higher accuracy

  • Alterations in transcription-factor binding in response to environmental stimuli can be evaluated for the entire genome in a single experiment.


Bioinformatics


Sequencers

  • Solexa (Illumina)

    • 1 GB of sequences in a single run

    • 35 bases in length

  • 454 Life Sciences (Roche Diagnostics)

    • 25-50 MB of sequences in a single run

    • Up to 500 bases in length

  • SOLiD (Applied Biosystems)

    • 6 GB of sequences in a single run

    • 35 bases in length


Illumina Genome Analysis System

  • 8 lanes

  • 100 tiles per lane


Sequencing


Sequencer Output

  • Sequence Files

  • Quality Scores


Sequence Files

  • ~10 million sequences per lane

  • ~500 MB files


Quality Score Files

  • Quality scores describe the confidence of bases in each read

  • Solexa pipeline assigns a quality score to the four possible nucleotides for each sequenced base

  • 9 million sequences (500 MB sequence file) produce a ~6.5 GB quality score file
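
As a rough illustration of what a quality score encodes, the sketch below converts scores to base-call error probabilities. It assumes the commonly used definitions, Phred Q = -10 log10(p) and the odds-based Solexa Q = -10 log10(p / (1 - p)). The function names are made up for this example and are not part of the Solexa pipeline.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred-scaled score: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)

def solexa_to_error_prob(q):
    """Error probability for a Solexa-scaled score: Q = -10 * log10(p / (1 - p))."""
    odds = 10 ** (-q / 10)  # p / (1 - p)
    return odds / (1 + odds)

def solexa_to_phred(q):
    """Convert a Solexa-scaled score to the equivalent Phred-scaled score."""
    return -10 * math.log10(solexa_to_error_prob(q))
```

The two scales nearly coincide for high-quality bases and diverge for low-quality ones, where a Solexa score can be negative.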


Bioinformatics Challenges

  • Rapid mapping of these short sequence reads to the reference genome

  • Visualize mapping results

    • Thousands of enriched regions

  • Peak analysis

    • Peak detection

    • Finding exact binding sites

  • Compare results of different experiments

    • Normalization

    • Statistical tests


Mapping of Short Oligonucleotides to the Reference Genome

  • Mapping Methods

    • Need to allow mismatches and gaps

      • SNP locations

      • Sequencing errors

      • Reading errors

    • Indexing and hashing

      • genome

      • oligonucleotide reads

  • Use of quality scores

  • Use of SNP knowledge

  • Performance

    • Partitioning the genome or sequence reads


Mapping Methods: Indexing the Genome

  • Fast sequence similarity search algorithms (like BLAST)

    • Not specifically designed for mapping millions of query sequences

    • Take a very long time

      • e.g. 2 days to map half a million sequences to a 70 MB reference genome (using BLAST)

    • Indexing the genome is memory expensive


SOAP (Li et al, 2008)

  • Both reads and reference genome are converted to numeric data type using 2-bits-per-base coding

  • Load reference genome into memory

    • For human genome, 14GB RAM required for storing reference sequences and index tables

  • 300 (gapped) to 1,200 (ungapped) times faster than BLAST
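
The 2-bits-per-base coding mentioned above can be sketched as follows. This is a minimal illustration, not SOAP's actual implementation: the bit assignment (A=00, C=01, G=10, T=11) and all function names are assumptions made for this example.

```python
# Hypothetical 2-bit code; the bit assignment SOAP actually uses may differ.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"

def encode_2bit(seq):
    """Pack a DNA string into one integer, 2 bits per base (first base in the highest bits)."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value

def decode_2bit(value, length):
    """Unpack an integer produced by encode_2bit back into a DNA string."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))

def count_mismatches(a, b, length):
    """Count differing bases between two encoded sequences of equal length."""
    diff = a ^ b
    # Collapse each 2-bit pair to one flag bit that is set if either bit differs.
    mask = int("01" * length, 2)
    flags = (diff | (diff >> 1)) & mask
    return bin(flags).count("1")
```

At 2 bits per base, a genome of roughly 3 billion bases needs about 750 MB for the sequence alone, and comparing two encoded reads reduces to an XOR plus a bit count.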


SOAP (Li et al, 2008)

  • Allows up to 2 mismatches or one continuous gap of 1-3 bp

  • Errors accumulate during the sequencing process

    • Much higher number of sequencing errors at the 3’-end (sometimes making reads unalignable to the reference genome)

    • Iteratively trim several base pairs at the 3’-end and redo the alignment

    • This improves sensitivity
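
The trim-and-retry idea can be sketched as below. This is an illustration only: the naive scan stands in for a real aligner, and `trim_step` and `min_length` are made-up values, not SOAP's defaults.

```python
def find_hits(read, reference, max_mismatches=2):
    """Return start positions where read matches reference with at most max_mismatches (naive scan)."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for r, g in zip(read, window) if r != g)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits

def align_with_trimming(read, reference, trim_step=3, min_length=12, max_mismatches=2):
    """Retry alignment after trimming bases from the 3' end until a hit is found.

    Returns (trimmed_read, hit_positions), or (None, []) if nothing aligned
    before the read became shorter than min_length.
    """
    current = read
    while len(current) >= min_length:
        hits = find_hits(current, reference, max_mismatches)
        if hits:
            return current, hits
        # No hit: drop trim_step bases from the error-prone 3' end and try again.
        current = current[:-trim_step]
    return None, []
```

A read whose last three bases are wrong fails to align at full length but aligns once those bases are trimmed, at the cost of a shorter and therefore less specific match.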


Mapping Methods: Indexing the Oligonucleotide Reads

  • ELAND (Cox, unpublished)

    • “Efficient Large-Scale Alignment of Nucleotide Databases” (Solexa Ltd.)

  • SeqMap (Jiang, 2008)

    • “Mapping massive amount of oligonucleotides to the genome”

  • RMAP (Smith, 2008)

    • “Using quality scores and longer reads improves accuracy of Solexa read mapping”

  • MAQ (Li, 2008)

    • “Mapping short DNA sequencing reads and calling variants using mapping quality scores”


Mapping Algorithm (2 mismatches)

Read: GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG

Seeds: GATGCATTG CTATGCCTC CCAGTCCGC AACTTCACG

[Figure: seeds are matched exactly against the genome to build an indexed table of exactly matching seeds, followed by an approximate search around the exactly matching seeds.]


Mapping Algorithm (2 mismatches)

  • Partition reads into 4 seeds {A,B,C,D}

    • At least 2 seeds must map with no mismatches (2 mismatches can affect at most 2 of the 4 seeds)

  • Scan genome to identify locations where the seeds match exactly

    • 6 possible combinations of the seeds to search

      • {AB, CD, AC, BD, AD, BC}

    • 6 scans to find all candidates

  • Do approximate matching around the exactly-matching seeds.

    • Determine all targets for the reads

    • Ins/del can be incorporated

  • The reads are indexed and hashed before scanning genome

  • Bit operations are used to accelerate mapping

    • Each nt encoded into 2-bits
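
The steps above can be sketched in a few lines. This is a simplified illustration rather than the code of ELAND or SeqMap: it handles substitutions only, probes all 6 seed pairs in a single pass over the genome instead of making 6 separate scans, and uses plain strings instead of 2-bit encoding. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def split_seeds(read, n_seeds=4):
    """Split a read into n_seeds equal parts; returns (offset, seed) pairs."""
    size = len(read) // n_seeds
    return [(i * size, read[i * size:(i + 1) * size]) for i in range(n_seeds)]

def map_reads(reads, genome, max_mismatches=2):
    """Map equal-length reads to genome, allowing up to max_mismatches substitutions.

    With 4 seeds and at most 2 mismatches, at least 2 seeds are error-free,
    so every true hit is reachable through one of the 6 seed pairs.
    """
    read_len = len(reads[0])
    seed_len = read_len // 4
    # Index and hash the reads: key = (offsets of a seed pair, the two seed sequences).
    index = defaultdict(list)
    for read in reads:
        for (off_a, seed_a), (off_b, seed_b) in combinations(split_seeds(read), 2):
            index[(off_a, off_b, seed_a, seed_b)].append(read)
    pair_offsets = list(combinations([i * seed_len for i in range(4)], 2))

    hits = defaultdict(set)
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for off_a, off_b in pair_offsets:
            key = (off_a, off_b,
                   window[off_a:off_a + seed_len],
                   window[off_b:off_b + seed_len])
            for read in index.get(key, ()):
                # Approximate matching around the exactly matching seeds.
                mismatches = sum(1 for r, g in zip(read, window) if r != g)
                if mismatches <= max_mismatches:
                    hits[read].add(pos)
    return {read: sorted(positions) for read, positions in hits.items()}
```

Indexing the reads rather than the genome keeps the hash table proportional to the number of reads, which is the trade-off this family of methods makes against genome-indexing tools.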


ELAND (Cox, unpublished)

  • Commercial sequence mapping program that comes with the Solexa machine

  • Allows at most 2 mismatches

  • Maps sequences up to 32 nt in length

  • All sequences must be the same length


RMAP (Smith et al, 2008)

  • Improve mapping accuracy

    • Possible sequencing errors at 3’-ends of longer reads

    • Base-call quality scores

  • Use of base-call quality scores

    • Quality cutoff

      • High quality positions are checked for mismatches

      • Low quality positions always induce a match

    • Quality control step eliminates reads with too many low quality positions

  • Allow any number of mismatches
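
The quality cutoff idea can be sketched as follows. This is a sketch of the concept on this slide, not RMAP's code; the cutoff of 20 and the limit of 5 low-quality positions are arbitrary values chosen for the example.

```python
def quality_aware_mismatches(read, quals, target, quality_cutoff=20):
    """Count mismatches only at positions whose quality is at or above the cutoff.

    Positions below the cutoff act as wildcards: they always count as a match.
    """
    return sum(
        1
        for base, qual, ref in zip(read, quals, target)
        if qual >= quality_cutoff and base != ref
    )

def passes_quality_control(quals, quality_cutoff=20, max_low_quality=5):
    """Reject reads that have too many low-quality positions."""
    low = sum(1 for q in quals if q < quality_cutoff)
    return low <= max_low_quality
```

The quality control step matters because a read made up mostly of wildcard positions would match almost anywhere in the genome.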


[Figure: read-mapping flowchart starting from 12 M reads, with the steps "Quality filter" and "Map to reference genome" (shown twice). Outcome categories shown: mapped to a unique location, mapped to multiple locations, no mapping, low quality. Counts shown: 7.2 M, 1.8 M, 2.5 M, 0.5 M, 3 M.]


Visualization

  • BED files are built to summarize mapping results

  • BED files can be easily visualized in the UCSC Genome Browser

    http://genome.ucsc.edu
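
As an illustration of the first bullet, the sketch below formats mapped reads as BED lines. It assumes the standard six-column BED layout (chrom, chromStart, chromEnd, name, score, strand) with 0-based, end-exclusive coordinates. The input tuple layout and the function name are assumptions for this example.

```python
def reads_to_bed_lines(mapped_reads, track_name="chipseq_reads"):
    """Format mapped reads as BED lines for upload as a custom track.

    mapped_reads: iterable of (chrom, start, length, name, strand),
    where start is 0-based. BED ends are exclusive, so end = start + length.
    """
    lines = [f'track name="{track_name}"']
    for chrom, start, length, name, strand in mapped_reads:
        end = start + length
        # Columns: chrom, chromStart, chromEnd, name, score, strand
        lines.append(f"{chrom}\t{start}\t{end}\t{name}\t0\t{strand}")
    return lines

if __name__ == "__main__":
    example = [("chr1", 100, 35, "read1", "+"), ("chr1", 220, 35, "read2", "-")]
    print("\n".join(reads_to_bed_lines(example)))
```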


Visualization: Genome Browser

Robertson, G. et al. Nat. Methods 4, 651-657 (2007)


Visualization: Custom

300 kb region from mouse ES cells

Mikkelsen, T.S. et al. Nature 448, 553-562 (2007)


Visualization

Huang, 2008 (unpublished)


Huang, 2008 (unpublished)


Peak Analysis

Peak Detection

  • ChIP-Peak Analysis Module (Swiss Institute of Bioinformatics)

  • ChIPSeq Peak Finder (Wold Lab, Caltech)
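
To give a feel for what peak detection involves, here is a deliberately simple fixed-window sketch. It is not the algorithm used by either tool listed above. The window size, read threshold, and function name are arbitrary choices for the example, and no control sample or statistical model is used.

```python
from collections import Counter

def call_enriched_windows(read_starts, window_size=200, min_reads=10):
    """Return (start, end, read_count) regions built from fixed windows.

    A window is enriched if it holds at least min_reads read start positions;
    adjacent enriched windows are merged into one region.
    """
    counts = Counter(pos // window_size for pos in read_starts)
    enriched = sorted(w for w, c in counts.items() if c >= min_reads)
    regions = []
    for w in enriched:
        start, end, count = w * window_size, (w + 1) * window_size, counts[w]
        if regions and regions[-1][1] == start:
            # Extend the previous region when windows are adjacent.
            prev_start, _, prev_count = regions[-1]
            regions[-1] = (prev_start, end, prev_count + count)
        else:
            regions.append((start, end, count))
    return regions
```

Real peak callers refine this basic idea, for example by modeling background read density and by using strand information.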


Peak Analysis

Finding Exact Binding Site

  • Determining the exact binding sites from short reads generated from ChIP-Seq experiments

    • SISSRs (Site Identification from Short Sequence Reads) (Jothi 2008)

    • MACS (Model-based Analysis of ChIP-Seq) (Zhang et al, 2008)


Compare Samples

Huang, 2008 (unpublished)


Compare Samples

  • Fold change

  • HPeak: An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data

    • Xu et al, 2008

  • Advanced statistics
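
A fold change between two samples is only meaningful after normalizing for sequencing depth. The sketch below shows one simple way to do that, assuming reads-per-million scaling and a pseudocount of 1 to avoid division by zero. Both choices and the function names are assumptions for this example, not a method taken from the slides.

```python
import math

def reads_per_million(count, total_mapped_reads):
    """Scale a region's read count by sequencing depth."""
    return count * 1_000_000 / total_mapped_reads

def log2_fold_change(count_a, total_a, count_b, total_b, pseudocount=1.0):
    """log2 ratio of depth-normalized read counts for one region in samples A and B."""
    rpm_a = reads_per_million(count_a, total_a)
    rpm_b = reads_per_million(count_b, total_b)
    return math.log2((rpm_a + pseudocount) / (rpm_b + pseudocount))
```

For example, a region with 70 reads out of 10 million in one sample and 30 reads out of 10 million in the other gives normalized values of 7 and 3, and a log2 fold change of 1 with this pseudocount.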


QUESTIONS?