Maximum Likelihood Estimation of Incomplete Genomic Spectrum from HTS Data
This presentation is the property of its rightful owner.
Sponsored Links
1 / 39

Serghei Mangul Department of Computer Science Georgia State University PowerPoint PPT Presentation


  • 92 Views
  • Uploaded on
  • Presentation posted in: General

Maximum Likelihood Estimation of Incomplete Genomic Spectrum from HTS Data. Serghei Mangul Department of Computer Science Georgia State University Joint work with Irina Astrovskaya , Marius Nicolae , Bassam Tork , Ion Mandoiu and Alex Zelikovsky. Outline. Introduction ML Model

Download Presentation

Serghei Mangul Department of Computer Science Georgia State University

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Serghei mangul department of computer science georgia state university

  • Maximum Likelihood Estimation of Incomplete Genomic Spectrum from HTS Data

SergheiMangul

Department of Computer Science

Georgia State University

  • Joint work with Irina Astrovskaya, Marius Nicolae, BassamTork, Ion Mandoiu and Alex Zelikovsky


Serghei mangul department of computer science georgia state university

Outline

  • Introduction

  • ML Model

  • EM Algorithm

  • VSEM Algorithm

  • Experimental Results

    • RNA-Seq

    • 454

  • Conclusions and future work

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Serghei mangul department of computer science georgia state university

Outline

  • Introduction

  • ML Model

  • EM Algorithm

  • VSEM Algorithm

  • Experimental Results

    • RNA-Seq

    • 454

  • Conclusions and future work

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Advances in high throughput sequencing hts

Advances in High-Throughput Sequencing (HTS)

Roche/454 FLX Titanium

400-600 million reads/run

400bp avg. length

Illumina HiSeq 2000

Up to 6 billion PE reads/run

35-100bp read length

http://www.economist.com/node/16349358

SOLiD 4/5500

1.4-2.4 billion PE reads/run

35-50bp read length

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Serghei mangul department of computer science georgia state university

Outline

  • Introduction

  • ML Model

  • EM Algorithm

  • VSEM Algorithm

  • Experimental Results

    • RNA-Seq

    • 454

  • Conclusions and future work

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Ml model

MLModel

reads

R1

strings

S1

R2

S2

R3

S3

R4

  • Panel : bipartite graph

    • LEFT: genomic sequences (strings)

      • unknown frequencies

    • RIGHT: reads

      • observed frequencies

    • EDGES: probability of the read to be emitted by the string

      • weights are calculated based on the mapping of the reads to the strings

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Serghei mangul department of computer science georgia state university

Outline

  • Introduction

  • ML Model

  • EM Algorithm

  • VSEM Algorithm

  • Experimental Results

    • RNA-Seq

    • 454

  • Conclusions and future work

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Ml estimates of string frequencies

ML estimates of string frequencies

  • Probability that a read is sampled from string is proportional with its frequency f(j)

  • ML estimates for f(j)is given by n(j)/(n(1) + . . . + n(N))

    • n(j) - number of reads sampled from string j

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Em algorithm

EM algorithm

Initialization

E-step: Compute the expected number n(j) of reads that come from string j under the assumption that string frequencies f(j) are correct

M-step: For each string j, set the new value of f(j) equal to the portion of reads being originated by string j among all reads in the sample

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Serghei mangul department of computer science georgia state university

Outline

  • Introduction

  • ML Model

  • EM Algorithm

  • VSEM Algorithm

  • Experimental Results

    • RNA-Seq

    • 454

  • Conclusions and future work

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Ml model quality

ML Model Quality

  • How well the maximum likelihood model explain the reads

  • Measured by deviationbetween expected and observed read frequencies

    • expected read frequency:

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Vsem virtual string em

VSEM : Virtual String EM

(Incomplete) Panel

+ Virtual String

with 0-weights

in virtual string

Update weights

of reads in

virtual string

ML estimates

of string

frequencies

EM

EM

Virtual String frequency change>ε?

Output string

frequencies

and weights

Compute

expected read

frequencies

YES

NO

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Example 1 st iteration

Example : 1st iteration

Incomplete Panel

Full Panel

reads

reads

R1

R1

strings

strings

S1

S1

R2

R2

S2

S2

R3

R3

S3

R4

R4

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Example 1 st iteration1

Example : 1st iteration

Incomplete Panel

Full Panel

reads

reads

R1

R1

strings

strings

S1

S1

R2

R2

S2

S2

R3

R3

VS

S3

R4

R4

VS

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

14


Example 1 st iteration2

Example : 1st iteration

Incomplete Panel

Full Panel

reads

reads

R1

R1

strings

strings

S1

S1

R2

R2

S2

S2

R3

R3

VS

S3

R4

R4

VS

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Example 1 st iteration3

Example : 1st iteration

Incomplete Panel

Full Panel

reads

reads

R1

R1

strings

strings

S1

S1

R2

R2

S2

S2

R3

R3

VS

S3

R4

R4

VS

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Example last iteration

Example : last iteration

Incomplete Panel

Full Panel

reads

reads

R1

R1

strings

strings

S1

S1

R2

R2

S2

S2

R3

R3

VS

S3

R4

R4

VS

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Vsem virtual string em1

VSEM : Virtual String EM

  • Decide if the panel is likely to be incomplete

  • Estimate total frequency of missing strings

  • Identify read spectrum emitted by missing strings

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Vsem applications

VSEM : Applications

  • RNA-Seq

    • inferring isoform expressions from RNA-Seq

  • Viral Quasispecies Sequencing by 454 pyrosequencing

    • inferring viral quasispecies spectrum from pyrosequencing shotgun reads

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Serghei mangul department of computer science georgia state university

Outline

  • Introduction

  • ML Model

  • EM Algorithm

  • VSEM Algorithm

  • Experimental Results

    • RNA-Seq

    • 454

  • Conclusions and future work

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Rna seq

RNA-Seq

Make cDNA & shatter into fragments

Sequence fragment ends

Map reads

A

B

C

D

E

Isoform Expression (IE)

Gene Expression (GE)

Isoform Discovery (ID)

A

B

C

A

C

D

E

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Previous approach

Previous Approach

  • IsoEM [Nicolae et al. 2011] – novel expectation-maximization algorithm for inference of alternative splicing isoform frequencies from RNA-Seq data

    • Single and/or paired reads

    • Fragment length distribution

    • Strand information

    • Base quality scores

    • Insert sizes (library preparation)

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Simulation setup

Simulation Setup

  • Human genome UCSC/CCDS known isoforms

    • UCSC : 66803 isoforms, 19372 genes

    • CCDS : 20829 isoforms , 17373 genes

  • GNFAtlas2 gene expression levels

    • geometric expression of gene isoforms

  • Normally distributed fragment lengths

    • Mean 250, std. dev. 25

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Exp1 reduced transcriptome data

EXP1 : Reduced transcriptome data

  • Comparison between IsoEM and IsoVSEM on reduced transcriptome data

    • in every gene 25% of isoforms is missing

    • isoforms inside the gene - geometric distribution(p=0.5)

    • select genes with number of isoforms inside the gene is less or equal to 3.

    • removed isoforms with frequency 0.25

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Exp2 ccds panel

EXP2 : CCDS panel

  • UCSC database represents the full panel

  • CCDS represents the incomplete panel

    • reads were generated from UCSC library of isoforms

    • only frequencies of known isoforms(CCDS) were estimated

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Error fraction curves

Error Fraction Curves

EXP1, 30M reads of length 25

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Serghei mangul department of computer science georgia state university

Outline

  • Introduction

  • ML Model

  • EM Algorithm

  • VSEM Algorithm

  • Experimental Results

    • RNA-Seq

    • 454

  • Conclusions and future work

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


454 pyrosequencing

454 Pyrosequencing

  • Pyrosequencing =Sequencing by Synthesis.

  • GS FLX Titanium :

    • Divides the source genetic material into reads (300-800 bp)

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Previous approach1

Previous Approach

  • ViSpA [Astrovskaya et al. 2011] – viral spectrum assembling tool for inferring viral quasispecies sequences and their frequencies from pyroseqencing shotgun reads

    • align reads

    • built a read graph :

      • V – reads

      • E – overlap between reads

      • each path – candidate sequence

    • filter based on ML frequencies

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Vispa vsem

reads

removing duplicated & rare qsps

ViSpA-VSEM

assembled Qsps

Qsps Library

ViSPA

Weighted assembler

Stopping condition

reads, weights

VSEM

Virtual String EM

NO

YES

ViSpA

ML estimator

Viral Spectrum

+Statistics

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Simulation setup1

Simulation Setup

  • Real quasispecies sequences data from [von Hahn et al. 2006]

    • 44 sequences (1739 bp long) from the E1E2 region of Hepatitis C virus

    • populations sizes: 10, 20, 30, and 40 sequences

    • population distributions: geometric, skewed normal, uniform

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Experimental validation of vsem

Experimental Validation of VSEM

  • Detection of panel incompleteness

    • VSEM can detect >1% of missing strings

  • Improving quasispecies frequencies estimations

  • Detection of reads emitted by missing string

    • Correlation between predicted reads and reads emitted by missing strings >65%

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Vsem improving frequencies estimates

VSEM improving frequencies estimates

r - Correlation between real and predicted frequencies; err - average prediction error

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Vispa vs vispa vsem

ViSpAvsViSpA-VSEM

  • 100K reads from 10 QSPS

  • average length 300

r - Correlation between real and predicted frequencies; err - average prediction error

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Serghei mangul department of computer science georgia state university

Outline

  • Introduction

  • ML Model

  • EM Algorithm

  • VSEM Algorithm

  • Experimental Results

    • RNA-Seq

    • 454

  • Conclusions and future work

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Conclusions

Conclusions

  • We propose VSEM, a novel modification of EM algorithm

    • improves the ML frequency estimations of multiple genomic sequences

    • identifies reads that belong to unassembled(missing) sequences

  • We applied VSEM to improve two tools:

    • IsoEM

    • ViSpA

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Future work

Future work

Assemble strings from the set of reads emitted by missing strings

Improve other metagenomics tools

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Acknowledgments

Acknowledgments

NSF awards IIS-0546457 IIS-0916948, and DBI-0543365.

NSF award IIS-0916401

Agriculture and Food Research Initiative Competitive Grant no. 2011-67016-30331 from the USDA National Institute of Food and Agriculture.

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany


Serghei mangul department of computer science georgia state university

Thanks


  • Login