slide1
The Integrated Microbial Genome (IMG) systems


The Integrated Microbial Genome (IMG) systems. Nikos Kyrpides. Reddy. Bahador. Iain. Denis. Amrita. Billis. Peter. Marcel. OMICS GROUP. STANDARDS GROUP. ANNOTATION GROUP. Natalia. Dino. Kostas. Ioanna. Biological Data Management. Victor Markowitz. Yuri Grechkin. Ken Chu.

Presentation Transcript
slide2

Reddy

  • Bahador
  • Iain
  • Denis
  • Amrita
  • Billis
  • Peter
  • Marcel
  • OMICS GROUP
  • STANDARDS GROUP
  • ANNOTATION GROUP
  • Natalia
  • Dino
  • Kostas
  • Ioanna

Biological Data Management

  • Victor Markowitz
  • Yuri Grechkin
  • Ken Chu
  • Ernest Szeto
  • Krishna Palaniappan
  • Amy Chen
  • Biju Jacob

slide3

Science driven

data generation and analysis

ANALYSIS

  • User
  • Facility
slide5

Data analysis

Comparative Analysis

Data Integration

slide6

What is the Matrix?

Data management system for comparative analysis of biological data

IMG

  • Genomes
  • Functions
  • Genes
  • Clusters
  • Metadata
  • SNPs
  • Proteomics
  • Regulons
  • Transcriptomes

slide7

Become the HOME of

Microbial Genomes and Metagenomes

  • support comparative genome analysis
  • support community functional annotation

provide a user friendly interface

IMG’s Mission

Integrated Microbial Genomes (IMG) [It’s easier to analyze 1000 genomes than a single one]

Bacteria: 2780

Archaea: 107

Eukarya: 121

Plasmids: 1186

Viruses: 2697

http://img.jgi.doe.gov/

  • What is IMG:
    • IMG is a data management system for comparative analysis and annotation of all publicly available genomes from the three domains of life in a uniquely integrated context.
  • Mission:
    • To become the Home of Microbial Genome and Metagenome Analysis
  • Background:
    • Launched in March 2005
    • 3 releases/year, 20 releases so far
    • >5,000 unique visitors per month
    • >350 citations
  • Current Status:
    • 6891 Genomes
    • 11.6 Million Genes
  • USERS CAN
    • Search data
    • Browse data
    • Compare data
    • Export data
Why more data are needed: faster and more accurate function prediction

Fructokinase family

Ribokinase family

2-dehydro-3-deoxyglucokinase family

Metagenomic Analysis

Binning

?

Soil

Sargasso Sea

Termite Hindgut

Human Gut

Acid Mine Drainage

Reference Genomes

Species complexity

1 10 100 1000 1000s 10000

The road to success in Metagenomics is through Microbial Genomics

Source: Susannah Tringe, JGI

Availability of Reference Genomes

?

Soil

Human gut

Termite Gut

Marine

Acid Mine Drainage

Reference Genomes

100% 60% 50% 40% 20% 1%

Genes present in G1 and absent from G2, G3, G4 and G5

Gene occurrence profile across genomes

Gene occurrence profiles across pathways

      G1  G2  G3  G4  G5
g1    +   +   +   +   +
g2    +   +   -   +   +
g3    +   -   -   -   -

Pathways shared by genomes

Data Model Abstraction Example: IMG Operations

Genes

Genomes

Functions/ Pathways
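
The presence/absence query on this slide (genes present in G1 and absent from G2, G3, G4 and G5) can be sketched as a set operation. This is a minimal illustration in Python, not IMG's implementation: the `occurrence` table copies the +/- grid from the slide, and the function names `present_absent` and `profile` are assumptions made for the example.

```python
# Toy data copied from the slide's +/- grid: gene -> genomes in which it occurs.
occurrence = {
    "g1": {"G1", "G2", "G3", "G4", "G5"},
    "g2": {"G1", "G2", "G4", "G5"},
    "g3": {"G1"},
}


def present_absent(occurrence, present_in, absent_from):
    """Genes found in every genome of `present_in` and in none of `absent_from`."""
    present_in, absent_from = set(present_in), set(absent_from)
    return sorted(
        gene
        for gene, genomes in occurrence.items()
        if present_in <= genomes and not (absent_from & genomes)
    )


def profile(occurrence, gene, genomes):
    """Occurrence profile of one gene across an ordered list of genomes."""
    return "".join("+" if g in occurrence[gene] else "-" for g in genomes)


genomes = ["G1", "G2", "G3", "G4", "G5"]
hits = present_absent(occurrence, ["G1"], ["G2", "G3", "G4", "G5"])
print(hits)
print(profile(occurrence, "g2", genomes))
```

The same query shape applies when the columns are pathways rather than genomes, since only the keys of the occurrence table change.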

IMG Data Integration

Genes

  • RNAs, Proteins
  • Sequence Clusters
  • Positional clusters
  • Regulatory clusters
  • Fusions
  • Operons
  • Expression

Genomes

  • Groupings
    • Phylogenetic
    • Phenotypic
    • Ecotypic
    • Disease
    • Geographical
    • Isolation

Functions

  • COG
  • GO
  • Pfam
  • TIGRfam
  • InterPro
  • KEGG
  • BioCyc
  • SEED
  • Protein product
  • MyIMG
  • IMG Terms
  • IMG Pathways
  • IMG Networks

11.6M

6891

1.1M

IMG Toolkit

  • Gene Synteny
  • Functional Categories
  • Projects Map
  • Function Profile
  • Abundance Profiles
  • Chromosome Map
  • Genome Clustering
  • IMG Pathway Profile
  • Metadata Search
  • Compare Annotations
  • Phylogenetic Profile
  • VISTA
  • KEGG Maps
  • Phylogenetic Distribution
  • Chromosomal Map
  • Recruitment Plot
  • Fragment Recruitment
  • Artemis

WRITE PAPER

slide15

USERS CAN

  • Search data
  • Browse data
  • Compare data
  • Export data

UNIQUE VISITS

~ 5,000 / month

USERS CAN

  • Submit data
  • Annotate data
Informatics Steps & Services: support of a new user community

INTEGRATION & COMPARATIVE ANALYSIS

2012

ASSEMBLY

2005

IMG

2008

IMG-ER

slide18

Data Challenges & Opportunities

  • Metadata
  • Gene calling
  • Annotation
  • Quantity
  • Quality

Data

Analysis

Integration

  • Number of Genes
    • All vs all Blast
  • Number of Datasets
    • How do we navigate through a sea of data
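
To make the all vs all Blast concern concrete, the number of pairwise comparisons can be worked out from the gene count. The sketch below is plain arithmetic rather than anything taken from the IMG pipeline; it uses the 11.6 million genes quoted earlier in the deck, and the function name `all_vs_all_pairs` is an assumption made for the example.

```python
def all_vs_all_pairs(n_genes: int) -> int:
    """Number of unordered gene pairs in an all vs all comparison: n(n-1)/2."""
    return n_genes * (n_genes - 1) // 2


n = 11_600_000  # gene count quoted in the deck
pairs = all_vs_all_pairs(n)
growth = all_vs_all_pairs(2 * n) / pairs  # cost ratio when the gene count doubles

print(f"{pairs:,} pairs")
print(f"doubling the genes multiplies the pairs by about {growth:.2f}")
```

Because the pair count grows with the square of the number of genes, doubling the data roughly quadruples the work.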
slide19

Challenges we face

  • DATA SIZE
  • DATA QUALITY
  • DATA STANDARDS
slide20

Challenges we face

  • 1. DATA SIZE
  • Number of Genes
  • Number of Datasets
    • How do we compare data
    • How do we find data
    • How do we navigate through data
slide21

ii. Method development for data reduction & comparison: Computation of Similarities

Use clusters

2. Computation of similarities

Reference genomes

Metagenome

Metagenome

Metagenome

Clusters

  • Common/unique genes
  • Rapid identification of best hit(s)
  • ….
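
The idea of using clusters to find common and unique genes can be sketched as follows. This is an illustrative outline under stated assumptions, not the method used in IMG: the cluster assignments in `gene_to_cluster`, the dataset names, and the helper functions are all hypothetical. Each dataset is reduced to the set of cluster identifiers its genes fall into, and the comparison is then done on those sets instead of on individual gene pairs.

```python
# Hypothetical gene -> cluster assignments (in practice these come from a clustering step).
gene_to_cluster = {
    "a1": "C1", "a2": "C2",             # genes from metagenome A
    "b1": "C1", "b2": "C3",             # genes from metagenome B
    "r1": "C1", "r2": "C2", "r3": "C4", # genes from a reference genome
}


def cluster_set(genes, gene_to_cluster):
    """Reduce a list of genes to the set of clusters they belong to."""
    return {gene_to_cluster[g] for g in genes if g in gene_to_cluster}


def common_and_unique(datasets):
    """Return clusters shared by all datasets and clusters unique to each one."""
    common = set.intersection(*datasets.values())
    unique = {}
    for name, clusters in datasets.items():
        others = set().union(*(c for n, c in datasets.items() if n != name))
        unique[name] = clusters - others
    return common, unique


datasets = {
    "metagenome_A": cluster_set(["a1", "a2"], gene_to_cluster),
    "metagenome_B": cluster_set(["b1", "b2"], gene_to_cluster),
    "reference": cluster_set(["r1", "r2", "r3"], gene_to_cluster),
}
common, unique = common_and_unique(datasets)
print(common)
print(unique)
```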
slide24

10

Prochlorococcus marinus Pangenome

17

Listeria monocytogenes Pangenome

Staphylococcus aureus Pangenome

15

Pangenomes

  • We need better ways to
    • represent and browse through thousands of genomes
    • represent an organism
slide25

Metagenome Analysis with Pangenomes

Best Blast Hit

Reference Genome

Pangenome
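
A small sketch of why a pangenome can recruit more of a metagenome than a single reference genome. The assumptions are loud ones: the strain names and gene family identifiers are invented, and a "hit" is simplified to sharing a family identifier, whereas the slide refers to best Blast hits. The pangenome is taken as the union of the gene families of all strains, and the core as their intersection.

```python
# Hypothetical gene families per strain of one species.
strains = {
    "strain_1": {"F1", "F2", "F3"},
    "strain_2": {"F1", "F2", "F4"},
    "strain_3": {"F1", "F3", "F5"},
}


def build_pangenome(strains):
    """Return (pangenome, core, accessory) gene family sets."""
    sets = list(strains.values())
    pan = set().union(*sets)         # families seen in any strain
    core = set.intersection(*sets)   # families seen in every strain
    return pan, core, pan - core


def recruited(metagenome_families, reference_families):
    """Metagenome families that find a match in the reference (simplified hit)."""
    return metagenome_families & reference_families


pan, core, accessory = build_pangenome(strains)
metagenome = {"F1", "F4", "F5", "F9"}  # hypothetical metagenome content

single = recruited(metagenome, strains["strain_1"])
with_pan = recruited(metagenome, pan)
print(sorted(single), sorted(with_pan))
```

In this toy case the single reference recruits one family while the pangenome recruits three, because the accessory families carried by other strains become visible.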

slide26

Challenges we face

  • 2. DATA QUALITY
  • Did we generate enough data to support biological conclusions?
  • Did we introduce any biases during sequencing?
  • Is the quality of assembly comparable between different datasets?
  • Is the quality of predicted genes comparable between different datasets?
  • Is the quality of functional annotation comparable between different datasets?
slide27

Microbial Genomes

Gene Prediction Quality Assurance

GenePRIMP

http://geneprimp.jgi-psf.org

Gene Prediction Improvement Pipeline

GenePRIMP is a pipeline that consists of a series of computational units that identify erroneous gene calls and missed genes and correct a subset of the identified defective features.

APPLICATIONS

  • Identify gene prediction anomalies
  • Benchmark the quality of gene prediction algorithms
  • Benchmark the quality of combination / coverage of sequencing platforms
  • Improve the sequence quality

Pati A. et al. (2010) Nature Methods

Amrita

Natalia

slide28

Challenges we face

  • 3. DATA STANDARDS
    • Assembly
    • Gene Finding
    • Functional Annotation
    • Metadata
slide29

Project Catalog & Metadata

Genomes OnLine Database

I. Pagani

D. Liolios

slide30

COMPUTATIONS

M5: Pilot Project with ANL

innovation through collaboration

Building a roadmap for a scalable and sustainable computing MetaInfrastructure for the metagenomics community

  • Develop standards to share and process data more effectively
  • Run data-intensive workflows once (reduce wasted cycles)
  • Develop a single QC data processing pipeline
  • Develop a single data submission entry
  • Develop a single data processing pipeline
  • Develop a common project catalog
slide32

Ongoing Developments

New Data & Tools for Visualization & Analysis of

  • Integration of Expression data
  • Integration of Regulatory Data
  • Resequencing data (strain variation)
  • Pangenomes

Data Processing

  • Short Read annotation
  • Bypass the all vs all Blast bottleneck