Maximum Likelihood Estimation of Incomplete Genomic Spectrum from HTS Data

Serghei Mangul

Department of Computer Science

Georgia State University

  • Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu and Alex Zelikovsky


Outline

  • Introduction

  • ML Model

  • EM Algorithm

  • VSEM Algorithm

  • Experimental Results

    • RNA-Seq

    • 454

  • Conclusions and future work

WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany




Advances in High-Throughput Sequencing (HTS)

  • Roche/454 FLX Titanium

    • 400-600 million bases/run

    • 400 bp average read length

  • Illumina HiSeq 2000

    • Up to 6 billion PE reads/run

    • 35-100 bp read length

  • SOLiD 4/5500

    • 1.4-2.4 billion PE reads/run

    • 35-50 bp read length

http://www.economist.com/node/16349358



ML Model

[Figure: bipartite graph linking strings S1, S2, S3 to reads R1, R2, R3, R4]

  • Panel: bipartite graph

    • LEFT: genomic sequences (strings)

      • unknown frequencies

    • RIGHT: reads

      • observed frequencies

    • EDGES: probability that the read is emitted by the string

      • weights are calculated from the mapping of the reads to the strings



ML estimates of string frequencies

  • The probability that a read is sampled from string j is proportional to its frequency f(j)

  • The ML estimate for f(j) is given by n(j)/(n(1) + . . . + n(N))

    • n(j): number of reads sampled from string j
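
Written out as a formula, the estimate above reads as follows. N is taken to be the number of strings in the panel, as the sum n(1) + . . . + n(N) suggests. The read counts in the worked line are made up for illustration only.

```latex
\hat{f}(j) = \frac{n(j)}{\sum_{k=1}^{N} n(k)}, \qquad j = 1, \dots, N

% Hypothetical counts: n(1) = 30, n(2) = 10, n(3) = 10
\hat{f}(1) = \frac{30}{50} = 0.6, \quad
\hat{f}(2) = \frac{10}{50} = 0.2, \quad
\hat{f}(3) = \frac{10}{50} = 0.2
```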



EM algorithm

Initialization

E-step: Compute the expected number n(j) of reads that come from string j, assuming the current string frequencies f(j) are correct

M-step: For each string j, set the new value of f(j) to the fraction of all reads in the sample that originate from string j
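
The E-step and M-step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `em_frequencies`, the uniform initialization, the convergence tolerance, and the layout of `weights` (one row per string, one column per read) are all assumptions.

```python
def em_frequencies(weights, read_counts, iterations=500, tol=1e-12):
    """Estimate string frequencies with EM.

    weights[j][i]  : weight of the edge between string j and read i (assumed layout)
    read_counts[i] : observed number of copies of read i
    """
    n_strings = len(weights)
    # Initialization (assumed): all strings start equally frequent.
    f = [1.0 / n_strings] * n_strings
    for _ in range(iterations):
        # E-step: expected number n[j] of reads coming from string j,
        # treating the current frequencies f as correct.
        n = [0.0] * n_strings
        for i, count in enumerate(read_counts):
            denom = sum(f[j] * weights[j][i] for j in range(n_strings))
            if denom > 0.0:  # reads with no compatible string are skipped
                for j in range(n_strings):
                    n[j] += count * f[j] * weights[j][i] / denom
        assigned = sum(n)
        if assigned == 0.0:
            break  # no read is compatible with any string
        # M-step: new frequency is the fraction of assigned reads per string.
        new_f = [x / assigned for x in n]
        converged = max(abs(a - b) for a, b in zip(f, new_f)) < tol
        f = new_f
        if converged:
            break
    return f
```

With two strings that each explain a different read, for example `em_frequencies([[1, 0], [0, 1]], [3, 1])`, the reads are unambiguous and the estimate reduces to the simple ratio n(j)/(n(1) + n(2)).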



ML Model Quality

  • How well does the maximum likelihood model explain the reads?

  • Measured by the deviation between expected and observed read frequencies

    • expected read frequency:



VSEM: Virtual String EM

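[Flowchart]

  • Input: (incomplete) panel + virtual string, with 0-weights for the reads in the virtual string

  • EM: ML estimates of string frequencies

  • Compute expected read frequencies

  • Update weights of reads in virtual string

  • EM: ML estimates of string frequencies

  • Virtual string frequency change > ε?

    • YES: compute expected read frequencies again and repeat

    • NO: output string frequencies and weights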
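
The loop in this flowchart can be sketched as follows. The slides do not state how the virtual string's read weights are updated or how expected read frequencies are computed, so both rules here are assumptions made only to get a runnable example: each string's weights are normalized over reads to give emission probabilities, and each virtual-string weight is increased by (observed minus expected) read frequency and clipped at zero. The names `em`, `expected_read_freqs` and `vsem` are hypothetical.

```python
def em(weights, counts, iters=500, tol=1e-12):
    """Plain EM for string frequencies; weights[j][i] links string j to read i."""
    n_strings = len(weights)
    f = [1.0 / n_strings] * n_strings  # assumed uniform initialization
    for _ in range(iters):
        n = [0.0] * n_strings
        for i, c in enumerate(counts):
            denom = sum(f[j] * weights[j][i] for j in range(n_strings))
            if denom > 0.0:
                for j in range(n_strings):
                    n[j] += c * f[j] * weights[j][i] / denom
        assigned = sum(n)
        if assigned == 0.0:
            break
        new_f = [x / assigned for x in n]
        done = max(abs(a - b) for a, b in zip(f, new_f)) < tol
        f = new_f
        if done:
            break
    return f


def expected_read_freqs(weights, f):
    """Expected frequency of each read given string frequencies f.

    Assumption: a string emits read i with probability weights[j][i]
    divided by the sum of that string's weights.
    """
    e = [0.0] * len(weights[0])
    for row, fj in zip(weights, f):
        row_sum = sum(row)
        if row_sum > 0.0:
            for i, w in enumerate(row):
                e[i] += fj * w / row_sum
    return e


def vsem(panel_weights, counts, eps=1e-4, max_rounds=100):
    """Return (frequencies including the virtual string last, virtual-string weights)."""
    total = sum(counts)
    observed = [c / total for c in counts]
    vs = [0.0] * len(counts)  # virtual string starts with 0-weights
    weights = [list(r) for r in panel_weights] + [vs]
    f = em(weights, counts)
    for _ in range(max_rounds):
        prev_vs_freq = f[-1]
        e = expected_read_freqs(weights, f)
        # Assumed update rule: add the observed-minus-expected gap, clip at 0.
        vs = [max(0.0, w + o - x) for w, o, x in zip(vs, observed, e)]
        weights = [list(r) for r in panel_weights] + [vs]
        f = em(weights, counts)
        if abs(f[-1] - prev_vs_freq) <= eps:  # virtual string frequency stable
            break
    return f, vs
```

As a toy input, `vsem([[1, 1, 0, 0], [0, 1, 1, 0]], [10, 20, 10, 10])` describes a panel of two strings in which the fourth read maps to neither. Under the assumed update rule that read is assigned to the virtual string, whose frequency settles at about 0.2 (10 of 50 reads).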



Example: 1st iteration

[Figure: incomplete panel (strings S1, S2; reads R1-R4) beside the full panel (strings S1, S2, S3; reads R1-R4)]


Example: 1st iteration

[Figure: incomplete panel extended with a virtual string VS (strings S1, S2, VS; reads R1-R4) beside the full panel (strings S1, S2, S3; reads R1-R4)]




Example: last iteration

[Figure: incomplete panel with virtual string VS (strings S1, S2, VS; reads R1-R4) beside the full panel (strings S1, S2, S3; reads R1-R4)]


VSEM: Virtual String EM

  • Decide if the panel is likely to be incomplete

  • Estimate total frequency of missing strings

  • Identify read spectrum emitted by missing strings



VSEM: Applications

  • RNA-Seq

    • inferring isoform expressions from RNA-Seq

  • Viral Quasispecies Sequencing by 454 pyrosequencing

    • inferring viral quasispecies spectrum from pyrosequencing shotgun reads



RNA-Seq

  • Make cDNA & shatter into fragments

  • Sequence fragment ends

  • Map reads

  • Tasks: Gene Expression (GE), Isoform Expression (IE), Isoform Discovery (ID)

[Figure: gene with exons A, B, C, D, E and two isoforms, A-B-C and A-C-D-E]



Previous Approach

  • IsoEM [Nicolae et al. 2011] – novel expectation-maximization algorithm for inference of alternative splicing isoform frequencies from RNA-Seq data

    • Single and/or paired reads

    • Fragment length distribution

    • Strand information

    • Base quality scores

    • Insert sizes (library preparation)



Simulation Setup

  • Human genome UCSC/CCDS known isoforms

    • UCSC: 66803 isoforms, 19372 genes

    • CCDS: 20829 isoforms, 17373 genes

  • GNFAtlas2 gene expression levels

    • geometric expression of gene isoforms

  • Normally distributed fragment lengths

    • Mean 250, std. dev. 25
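
The two distributional choices in this setup can be illustrated with a short sketch. It is not the simulator used in the experiments: the function names, the renormalization of the truncated geometric series, the rounding of fragment lengths to whole bases, and the fixed seed are assumptions. The value p = 0.5 is borrowed from the EXP1 description rather than from this slide.

```python
import random


def geometric_isoform_frequencies(n_isoforms, p=0.5):
    """Relative isoform frequencies within one gene, geometric with ratio (1 - p).

    The truncated series is renormalized to sum to 1 (assumed convention).
    """
    raw = [p * (1.0 - p) ** k for k in range(n_isoforms)]
    total = sum(raw)
    return [x / total for x in raw]


def sample_fragment_lengths(n, mean=250.0, std=25.0, seed=0):
    """Draw n fragment lengths from a normal distribution, rounded to whole bases."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [max(1, round(rng.gauss(mean, std))) for _ in range(n)]
```

For a gene with three isoforms, `geometric_isoform_frequencies(3)` yields frequencies in the ratio 4 : 2 : 1 under this renormalization.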



EXP1: Reduced transcriptome data

  • Comparison between IsoEM and IsoVSEM on reduced transcriptome data

    • in every gene, 25% of isoforms are missing

    • isoform frequencies inside each gene follow a geometric distribution (p = 0.5)

    • only genes with at most 3 isoforms are selected

    • isoforms with frequency 0.25 are removed



EXP2: CCDS panel

  • UCSC database represents the full panel

  • CCDS represents the incomplete panel

    • reads were generated from the UCSC library of isoforms

    • only frequencies of known isoforms (CCDS) were estimated



Error Fraction Curves

EXP1, 30M reads of length 25



454 Pyrosequencing

  • Pyrosequencing = Sequencing by Synthesis

  • GS FLX Titanium:

    • Divides the source genetic material into reads (300-800 bp)



Previous Approach

  • ViSpA [Astrovskaya et al. 2011] – viral spectrum assembling tool for inferring viral quasispecies sequences and their frequencies from pyrosequencing shotgun reads

    • align reads

    • build a read graph:

      • V – reads

      • E – overlaps between reads

      • each path – candidate sequence

    • filter based on ML frequencies



ViSpA-VSEM

[Flowchart: reads → ViSpA weighted assembler → assembled Qsps → removing duplicated & rare qsps → Qsps library → VSEM (Virtual String EM) → reads, weights → stopping condition; NO: back to the ViSpA weighted assembler; YES: ViSpA ML estimator → viral spectrum + statistics]



Simulation Setup

  • Real quasispecies sequences data from [von Hahn et al. 2006]

    • 44 sequences (1739 bp long) from the E1E2 region of Hepatitis C virus

    • population sizes: 10, 20, 30, and 40 sequences

    • population distributions: geometric, skewed normal, uniform



Experimental Validation of VSEM

  • Detection of panel incompleteness

    • VSEM can detect >1% of missing strings

  • Improving quasispecies frequency estimates

  • Detection of reads emitted by missing strings

    • Correlation between predicted reads and reads emitted by missing strings >65%



VSEM: improving frequency estimates

r - Correlation between real and predicted frequencies; err - average prediction error



ViSpA vs ViSpA-VSEM

  • 100K reads from 10 QSPS

  • average length 300

r - Correlation between real and predicted frequencies; err - average prediction error



Conclusions

  • We propose VSEM, a novel modification of the EM algorithm

    • improves the ML frequency estimates of multiple genomic sequences

    • identifies reads that belong to unassembled (missing) sequences

  • We applied VSEM to improve two tools:

    • IsoEM

    • ViSpA



Future work

Assemble strings from the set of reads emitted by missing strings

Improve other metagenomics tools



Acknowledgments

NSF awards IIS-0546457, IIS-0916948, and DBI-0543365.

NSF award IIS-0916401

Agriculture and Food Research Initiative Competitive Grant no. 2011-67016-30331 from the USDA National Institute of Food and Agriculture.


Thanks

