
Pavel Morozov March 3

Legionella Functional Genomics Project.

[Figure: stages of Legionella pneumophila infection of a host cell. Labels: modulation of host-cell gene expression; adhesion, invasion; inhibition of lysosome fusion; evasion; recruitment of ER; replication.]

Legionella pneumophila
  • An intracellular pathogen that can invade and replicate inside human macrophages and causes the potentially fatal human infection Legionnaires' disease.
  • Transmitted through inhaling mist droplets containing the bacteria.
  • Has an extraordinary ability to survive in many different ecological niches (axenic cultures, biofilms with other organisms, and intracellular vacuoles of amoebae, ciliates and human cells).
  • In order to replicate, Legionella must be inside protozoa (amoebae, Acanthamoeba), which are single-cell eukaryotes, or inside human lung macrophages or monocytes.
Complete genome of Legionella pneumophila (strain Philadelphia 1).

[Figure: Legionella pneumophila (strain Philadelphia 1) genome map. Tracks: genes in direct chain; genes in reverse chain; G+C content; GC skew. Highlighted examples: region 3 (efflux); region 7 (tra/trb region, F-plasmid).]

The highlighted regions are noteworthy because their G+C content and GC skew differ from the genome average and because their ORFs show a skewed strand preference. These computationally determined regions turn out to contain gene clusters that belong to specific categories (e.g., the ribosomal protein cluster), that correspond to points of genome rearrangement, or that were acquired by horizontal transfer. Some examples are shown in more detail below.
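
The slide does not say how the G+C content and GC skew tracks were computed. A minimal sketch, assuming fixed-size sliding windows and the common definition of GC skew as (G - C) / (G + C); the function names, window size and tolerance are illustrative choices, not the authors' settings:

```python
def gc_windows(seq, window, step):
    """Yield (start, gc_content, gc_skew) for each full window of seq."""
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        gc_content = (g + c) / window
        # Skew is undefined for a window without G or C; report 0.0 there.
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        yield start, gc_content, gc_skew


def flag_atypical(seq, window, step, tolerance):
    """Return starts of windows whose G+C content deviates from the mean
    over all windows by more than `tolerance` (an assumed cutoff)."""
    rows = list(gc_windows(seq, window, step))
    if not rows:
        return []
    mean_gc = sum(gc for _, gc, _ in rows) / len(rows)
    return [start for start, gc, _ in rows if abs(gc - mean_gc) > tolerance]
```

On a real genome the window would be in the kilobase range; flagged windows are only candidates for the kind of regions highlighted in the figure.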

Project goals
  • Study molecular mechanisms (genetics and regulation) of
    • Legionella ability to survive in different ecological niches.
    • Legionella infection.
  • Extended genome annotation of Legionella species (Philadelphia, Paris, Lens strains).
  • Custom whole-genome microarrays.
  • Network reconstruction and modeling.
Microarray Design. History of Legionella Microarrays.

  • September 2005: whole-genome array; 2,997 70-mer oligos; 3,005 genes in duplicate; 640 reference controls
  • October 2003: 3,230 clones; 90% of the genome
  • June 2001: 1,344 clones in triplicate; 40% of the genome

Requirements for Microarray Probes.

The goal was to design 70-mer probes covering all protein- and RNA-coding genes, plus control probes for testing background and array properties.

Requirements common to all probes:

  • should not contain short nucleotide stretches that are too abundant;
  • should be free from secondary structure elements;
  • should have approximately the same melting temperature.

Requirements specific to gene probes:

  • 70-mers should be unique (occur once) in the experimental system (Legionella, human, E. coli).

Requirements specific to array control probes:

  • should not exist in the experimental system (Legionella, human, E. coli).

Microarray probe design using unique oligonucleotides of particular length.

[Figure: a CDS or genomic sequence (5’ to 3’) annotated with 14-mer oligonucleotides, overrepresented 8-mers, and the resulting 70-mer microarray probe.]

In simplified form, probe selection can be described as selecting the regions with the maximum number of unique oligonucleotides (in this case of length 14 bp) and the minimum number of overrepresented shorter oligonucleotides (in this case 8 bp).

In the actual study we have to use oligonucleotides of several lengths and also check the probe melting temperature.

Using unique oligonucleotides to design probes automatically reduces secondary structure issues.
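
The simplified selection rule can be sketched as a window scan. This is only an illustration under stated assumptions: a single pair of lengths, forward strand only, no melting-temperature check, and a score defined as the number of unique long oligonucleotides minus the number of overrepresented short ones. The names `kmer_counts`, `select_probe` and `over_threshold` are hypothetical, not taken from the authors' programs.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every k-mer over all sequences of the experimental system."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def select_probe(target, others, probe_len, uniq_k, short_k, over_threshold):
    """Return (start, probe, score) for the best window of `target`.

    A uniq_k-mer is unique if it occurs exactly once in the whole system;
    a short_k-mer is overrepresented if it occurs at least `over_threshold`
    times (an assumed cutoff).
    """
    system = [target] + list(others)
    uniq_counts = kmer_counts(system, uniq_k)
    short_counts = kmer_counts(system, short_k)
    best = None
    for start in range(len(target) - probe_len + 1):
        window = target[start:start + probe_len]
        n_unique = sum(
            1 for i in range(probe_len - uniq_k + 1)
            if uniq_counts[window[i:i + uniq_k]] == 1
        )
        n_over = sum(
            1 for i in range(probe_len - short_k + 1)
            if short_counts[window[i:i + short_k]] >= over_threshold
        )
        score = n_unique - n_over
        if best is None or score > best[2]:  # keep the first best window
            best = (start, window, score)
    return best
```

For the setting described on the slide the call would use `probe_len=70, uniq_k=14, short_k=8`; the toy sizes in a test run only keep the example readable.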

Choosing length of oligonucleotides

[Figure: DNA or RNA (genomic or mRNA sequence). Distribution of ancestors and descendants of various lengths.]

For each position we can define the length L at which the oligonucleotide starting at this position becomes unique. All oligonucleotides at this position that are longer than L are also unique. There are also two types of unique oligonucleotides: those that contain a unique oligonucleotide of smaller length and those that do not; we name them ancestors and descendants. It is enough to keep, for each position, information about the first occurrence of a unique oligonucleotide in order to have complete information about the distribution of unique oligonucleotides in a particular sequence region.
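
A minimal sketch of the per-position length L, assuming a single sequence, forward strand only, and a brute-force count that is fine for short examples but not for genomes. The value -1 marks positions where no unique oligonucleotide is found up to `max_len`. The second function only tests whether a unique oligonucleotide contains a shorter unique one; which of the two classes is called ancestor and which descendant is left to the text above. Both function names are hypothetical.

```python
from collections import Counter


def first_unique_lengths(seq, max_len):
    """For each position, the smallest L <= max_len such that
    seq[pos:pos + L] occurs exactly once in seq, or -1 if none."""
    result = [-1] * len(seq)
    for length in range(1, max_len + 1):
        n_windows = len(seq) - length + 1
        counts = Counter(seq[i:i + length] for i in range(n_windows))
        for i in range(n_windows):
            if result[i] == -1 and counts[seq[i:i + length]] == 1:
                result[i] = length
    return result


def contains_shorter_unique(first_len, pos, length):
    """True if the oligonucleotide of `length` at `pos` contains a
    strictly shorter unique oligonucleotide (possibly its own prefix)."""
    end = pos + length
    return any(
        first_len[q] != -1 and first_len[q] < length and q + first_len[q] <= end
        for q in range(pos, min(end, len(first_len)))
    )
```

Because every extension of a unique oligonucleotide is also unique, the list returned by `first_unique_lengths` is enough to answer both questions without storing any oligonucleotide explicitly.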

Distributions of ancestral and descendant unique oligonucleotides by length. The solid line denotes the sum of the two distributions, the dotted line the distribution of ancestral oligonucleotides, and the dashed line the distribution of descendants. A) Results of simulation for genomes of size 1 Mb. B) Real data for human chromosome X.


Design of probes using unique oligonucleotides positional information.

Sequence region and ancestors for each position (-1 if not known):

a t g c a c t a g c t a g c t a g t c g …

For each potential probe Pi we can define a vector of the numbers of unique oligonucleotides of various lengths (both ancestors and descendants): Vi = {0,0,0,0,0,0,0,2,3,4,2,3,5,6,7}.

A gold standard vector can be defined as G = {0,0,0,0,0,0,n1,n1-1,n1-2,n1-3,…}.

The Euclidean distance is a reliable choice of measure for the distance between Vi and G:

D(V_i, G) = \sqrt{ \sum_{j \in L} ( V_i(j) - G(j) )^2 }

where L is the set of oligonucleotide lengths used.

The probes with the minimal distance to the gold standard were chosen.
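
The ranking step can be sketched as follows. The shape of the reference vector G (zeros, then n1, n1-1, ...) follows the slide; flooring its entries at zero and the helper names `reference_vector`, `distance` and `closest_probe` are assumptions made for this illustration.

```python
import math


def reference_vector(n_lengths, first_index, n1):
    """G: zeros before `first_index`, then n1, n1-1, n1-2, ...
    Entries are floored at 0 (an assumption for long vectors)."""
    return [
        0 if j < first_index else max(n1 - (j - first_index), 0)
        for j in range(n_lengths)
    ]


def distance(v, g):
    """Euclidean distance between two equally long count vectors."""
    if len(v) != len(g):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, g)))


def closest_probe(candidates, g):
    """Return the name of the candidate whose vector is nearest to G.
    `candidates` maps a probe name to its vector Vi."""
    return min(candidates, key=lambda name: distance(candidates[name], g))
```

A candidate whose counts of unique oligonucleotides match G at every length has distance 0; the further the counts fall short, the larger the distance.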


Finding unique oligonucleotides.

Oligo length   Space without coding   Space with coding
4              256                    32
5              1,024                  128
6              4,096                  512
7              16,384                 2,048
8              65,536                 8,192
9              262,144                32,768
10             1,048,576              131,072
11             4,194,304              524,288
12             16,777,216             2,097,152
13             67,108,864             8,388,608
14             268,435,456            33,554,432
15             1,073,741,824          134,217,728
16             4,294,967,296          536,870,912
17             17,179,869,184         2,147,483,648
18             68,719,476,736         8,589,934,592
19             274,877,906,944        34,359,738,368
20             1,099,511,627,776      137,438,953,472

  • Enumerating oligonucleotides
    • Binary arithmetic: 00 stands for A, 01 for T, 10 for C and 11 for G.
    • Example: T G A T is 01 11 00 01 in binary, which is 113 in decimal.
    • Enumeration is complete, dense, and nonredundant.
  • Counting oligonucleotides
    • Direct counting.
    • The complete space of possible oligonucleotides of length n grows as 4^n.
    • The memory size of current computers allows handling oligonucleotides of length up to 16 on a PC and up to 18 on Sun Solaris. With algorithm enhancements we can go up to 24 (but there is no need). The best resolution is provided by length 18 for the human genome and by lengths 12-14 for most bacterial genomes.
  • Legend for the table of oligonucleotide lengths: computable on desktop; computable on workstation with big memory; computable on workstation with big memory with enhanced algorithm; hardly computable.
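
The enumeration and direct counting can be sketched as below. The 2-bit codes are the ones given on the slide; everything else is an assumption made for illustration: one saturating byte per oligonucleotide as the counter, a rolling update of the code, and a reset on any symbol other than A, T, C, G. The authors' programs (u_find.exe, u_findm.exe) may work differently.

```python
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASES = "ATCG"  # index in this string equals the 2-bit code


def encode(oligo):
    """Map an oligonucleotide to its integer index (first base = high bits)."""
    value = 0
    for base in oligo:
        value = (value << 2) | CODE[base]
    return value


def decode(value, n):
    """Inverse of encode for an oligonucleotide of length n."""
    return "".join(BASES[(value >> (2 * (n - 1 - i))) & 3] for i in range(n))


def count_oligos(seq, n):
    """Direct counting: one counter per possible n-mer (4**n entries).
    Counters saturate at 255, which is enough to tell 0, 1 and 'many'."""
    counts = bytearray(4 ** n)
    mask = 4 ** n - 1
    value = 0
    valid = 0  # number of consecutive valid bases seen so far
    for base in seq.upper():
        if base not in CODE:
            value, valid = 0, 0  # restart after an ambiguous symbol
            continue
        value = ((value << 2) | CODE[base]) & mask
        valid += 1
        if valid >= n and counts[value] < 255:
            counts[value] += 1
    return counts


def unique_oligos(seq, n):
    """Return all n-mers that occur exactly once in seq."""
    counts = count_oligos(seq, n)
    return [decode(i, n) for i, c in enumerate(counts) if c == 1]
```

The table of size 4^n is what limits the usable length in practice, and the single pass over the sequence keeps the running time linear in the sequence length.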
Program realization and data formats.

[Figure: processing pipeline. Input: list of FASTA files (genomes etc.). Programs: u_find.exe and u_findm.exe. Step labels: for minimal oligonucleotide length; for all desired oligonucleotide lengths. Output: FASTA file marked for unique oligonucleotides.]

Storing data in Rich FASTA format

Results of the search for unique oligonucleotides are stored in “rich” FASTA format. Essentially it is a linear record of positional information, like a regular FASTA file, but with additional coded information. Fields recorded for each position: symbol; length of the first unique oligonucleotide at this position; overrepresented flag.


Microarray probes


Design of control probes using non-existing oligonucleotides information.

Goal: a sequence that has no homology to any genome (no BLAST hits over threshold).

  • Selecting nonexistent oligonucleotides
  • Overlapping and merging oligonucleotides
  • Choosing probes from merged sequences

[Figure: nonexistent oligonucleotides are merged into a nonexistent sequence, followed by probe selection (temperature, secondary structure).]
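
The first two steps of control probe design can be sketched as follows. This is a toy version under stated assumptions: forward strand only, exhaustive enumeration of all 4^k oligonucleotides (practical only for small k), and a greedy merge that extends the sequence one base at a time so that every k-mer window of the result is absent from all genomes. The function names are hypothetical, and the final filtering by temperature and secondary structure is not shown.

```python
from itertools import product


def absent_kmers(genomes, k):
    """Return all k-mers over ACGT that occur in none of the genomes."""
    present = set()
    for g in genomes:
        for i in range(len(g) - k + 1):
            present.add(g[i:i + k])
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [kmer for kmer in all_kmers if kmer not in present]


def merge_overlapping(kmers, k, target_len):
    """Greedily chain absent k-mers that overlap by k-1 bases.
    Requires k >= 2. Stops early at a dead end, so the result may be
    shorter than target_len."""
    if k < 2 or not kmers:
        return ""
    absent = set(kmers)
    seq = kmers[0]
    while len(seq) < target_len:
        suffix = seq[-(k - 1):]
        nxt = next((suffix + b for b in "ACGT" if suffix + b in absent), None)
        if nxt is None:
            break
        seq += nxt[-1]
    return seq
```

A greedy chain like this can produce low-complexity repeats, so a real design would still need the probe selection step shown in the figure.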


Properties of proposed probe design method.

  • Finding unique and nonexistent oligonucleotides takes computational time linear in the size of the genomes used.
  • Once the system is represented in “rich” FASTA format, the design of new probes becomes extremely fast and can be repeated as often as needed in order to create probes for a new set of CDS or genomic regions.
  • Probes selected by using unique oligonucleotides automatically reduce the presence of hairpins in RNA secondary structure.
  • The method can be applied to experimental systems with multiple unrelated genomes (genomes can be as far from each other as eukaryotes and prokaryotes).
  • The method is efficient for control probe selection.
  • Problem: the method does not provide a robust estimate of sequence homology between a probe and the rest of the genomes; at the same time, the selected probes have the lowest possible homology to the rest of the genome.
  • The method provides valuable statistics about oligonucleotide usage in particular genomes and genome sets.
Legionella in Microbial Communities.
  • Biofilms are not just a collection of microbes; they are a special environment, protected from harsh outside conditions by a polysaccharide layer that is produced by other microbes in the community.
  • Microbial communities in biofilms have shared metabolic and regulatory networks.
  • Biofilms provide an excellent environment for horizontal gene transfer.
  • Since biofilms prevent antibiotics and other biocides from reaching the pathogens, biofilms are a significant reservoir of health-hazardous pathogens.
  • Legionella can survive in biofilms but cannot form them by itself, only as part of the microbial community.

Similar applications and potential use of proposed method.

  • Evolutionary studies (traces of ancient events?).
    • Hsieh, Minimal model for genome evolution and growth. Phys Rev Lett. 2003 Jan 10;90(1):018101.
    • Jordan, A universal trend of amino acid gain and loss in protein evolution. Nature. 2005 Feb 10;433(7026):633-8. Epub 2005 Jan 19.
  • Use in organism and sequence identification (metagenomics).
    • Metagenomics: "the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species." (Chen and Pachter, University of California, Berkeley)
    • Bailey & Ulrich, Molecular profiling approaches for identifying novel biomarkers. Expert Opin Drug Saf. 2004 Mar;3(2):137-51. Review.
    • Palmer, Rapid quantitative profiling of complex microbial populations. Nucleic Acids Res. 2006 Jan 10;34(1):e5.


[Figure: system architecture. Client side: clickable interactive interface. Server side: adapters layer (converting and performing requests, formatting output; UNIX web server, Perl scripts, Java, C); local databases; remote databases; local methods; remote methods; memory engine; SQL engine. Data transfer connects the client side and the server side.]


Solved technical problems:

  • Setting up the server side
    • Setting up MySQL server and services
  • Tools for importing and parsing external databases
    • Scripts to process flat files (Perl, MySQL):
      • extracting related information
      • formatting into SQL database
      • formatting into static HTML
    • Scripts to parse remote databases (Perl, Java, MySQL):
      • extracting related information
      • formatting into SQL database
      • formatting into static HTML
    • Update engine (under construction)
  • Web page development (HTML, JavaScript, CSS)
    • Testing with Explorer, Firefox, Opera, Safari.
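
The flat-file step (extract, load into SQL, render static HTML) can be illustrated with a short sketch. The project used Perl and MySQL; Python with the standard-library sqlite3 module is swapped in here only to keep the example self-contained. The three-column tab-separated layout, the `genes` table and the function names are all hypothetical.

```python
import html
import sqlite3


def load_flat_file(lines, conn):
    """Parse tab-separated records (gene_id, name, product) into SQL.
    Comment lines starting with '#' and malformed lines are skipped."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS genes "
        "(gene_id TEXT PRIMARY KEY, name TEXT, product TEXT)"
    )
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue
        rows.append(tuple(fields))
    conn.executemany("INSERT OR REPLACE INTO genes VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def render_html(conn):
    """Render the genes table as a static HTML table, escaping all values."""
    query = "SELECT gene_id, name, product FROM genes ORDER BY gene_id"
    body = []
    for record in conn.execute(query):
        cells = "".join(f"<td>{html.escape(value)}</td>" for value in record)
        body.append(f"<tr>{cells}</tr>")
    return "<table>\n" + "\n".join(body) + "\n</table>"
```

A run would open a connection with `sqlite3.connect(":memory:")`, pass an open file to `load_flat_file`, and write the string from `render_html` to a static page.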


Sources of Information

[Figure: sources of information. Categories: proprietary data; publicly available data; results of computations. Labels: NCBI, EMBL, TIGR, individual genomes; functional domains; function and annotation; GeneNet, MetaCyc.]



Current list of integrated databases

Parsed for Legionella-related information, organized and stored locally:

  • NCBI
  • EMBL
  • UniProt
  • InterPro
  • PIRSF (PIR superfamily/family)
  • Pfam
  • HSSP
  • MedLine/PubMed
  • MetaCyc
  • KEGG
WEB site scheme

  • Static interactive tables
  • Static interactive gene descriptions
  • Dynamic (by user requests)
  • Interactive data retrieval into interactive tables
  • Interactive genome map
  • Semi-dynamic
  • Search history


Legionella Genome Browser.

  • Interactive.
  • You can:
    • Choose scale and region
    • Follow links to tables and annotation data
    • Choose annotation tracks to display and track parameters
    • Choose various color schemes
    • Add custom annotation tracks
Interactive tables

  • Row operations: select/unselect, show/hide rows
  • Column (field) operations: show/hide columns
  • Sorting columns

Snapshots of the NMPDR annotation pages

Region comparisons in other genomes by sequence homology:

[Figure: region comparison. Labels: icmP, Phil1, Coxiella burnetii.]

Visualization of the gene expression in NMPDR system

[Figure: screenshot panels: pathway reactions; expression ratios; Legionella gene info.]


Study gene expression:

  • during intracellular growth and under various environmental stresses
  • axenically- and protozoan-grown Legionella
  • in Legionella-containing biofilms

4. Develop models (gene networks and reporter genes) that describe relevant patterns of gene expression:

(gene networks = expressed genes + their regulators)


[Figure: gene function assignment workflow. Labels: ORF finders; ~3000 genes; original gene function assignments; expressed genes; KEGG pathways; LegCyc: 181 pathways; 560 assignments; 678 assignments; plus missing.]



  • Use lower stringency search
  • BLAST expected genes to Legionella genome sequence
  • Search for probable motif combinations

Confirm absence of these genes:


histidine biosynthesis


[Figure: Legionella metabolic pathway overview (a portion).]

Search for transcription factor binding sites


Predicted operons

Clusters of co-expressed genes



TF site prediction (in silico).

  • Promoter manipulations
  • Co-expressed gene sets
  • Regulatory networks

Experimental confirmation of the predicted promoters. Transcription start sites.

Use of confirmed motifs to identify additional co-regulated genes.
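
The last step, scanning for further genes that carry a confirmed motif, can be sketched with a consensus search. This is only an illustration: it assumes the motif is given as an IUPAC consensus string, that upstream regions have already been extracted, and that only the forward strand is searched. The consensus and gene names in any run are placeholders, not project results.

```python
import re

# Standard IUPAC nucleotide codes expressed as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_pattern(consensus):
    """Translate an IUPAC consensus into a regular-expression string."""
    return "".join(IUPAC[symbol] for symbol in consensus.upper())


def genes_with_motif(upstream, consensus):
    """Return {gene: [match positions]} for upstream regions that contain
    the motif. A lookahead is used so overlapping matches are reported."""
    scanner = re.compile(f"(?=({motif_pattern(consensus)}))")
    hits = {}
    for gene, seq in upstream.items():
        positions = [m.start() for m in scanner.finditer(seq.upper())]
        if positions:
            hits[gene] = positions
    return hits
```

Genes returned by such a scan are only candidates for co-regulation; the experimental confirmation named above is still needed.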


Columbia Genome Center

  • Jing Ju lab, S. Kalachikov, S. Pompu
    • Gene expression microarrays
    • Clusters of coexpressed genes
    • Regulatory genes knockout results (expression)
  • Molecular biology methods
    • Gene expression microarrays
    • RT-PCR
    • Transcriptional factors
    • Promoter verification
  • Microbiology Department
    • Prof. Shuman
    • Gene knockout
    • Phenotypic analysis
  • Computational Analysis
    • Morozov Pavel, Morozova Irina
    • Operon structures
    • Putative promoters and transcriptional regulation sites
    • Detailed gene annotation
    • Regulatory network reconstruction