  • Uploaded on 18-11-2011

The STRING database

Michael Kuhn

EMBL Heidelberg


protein interactions


example

  • Tryptophan synthase beta chain

  • E. coli K12


many sources


genomic context


curated knowledge


experimental evidence


literature


373 genomes

  • (only completely sequenced genomes)


1.5 million genes

  • (not proteins)


Genome Reviews


RefSeq


Ensembl


model organism databases


data integration


genomic context methods


gene fusion


gene neighborhood


phylogenetic profiles


Cell

Cellulosomes

Cellulose


automatic inference of interactions


correct interactions


wrong associations


gene fusion

  • score: sequence similarity


gene neighborhood

  • score: sum of intergenic distances
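
As a rough illustration of this raw score, the sketch below sums the gaps between two genes over every genome in which both occur. The coordinate layout, the toy coordinates, and the function names `intergenic_distance` and `neighborhood_raw_score` are assumptions made for this example; they are not taken from STRING itself.

```python
from typing import Dict, Tuple

# Hypothetical layout: genome name -> gene name -> (start, end) on one replicon.
Interval = Tuple[int, int]
Coords = Dict[str, Dict[str, Interval]]


def intergenic_distance(a: Interval, b: Interval) -> int:
    """Gap in base pairs between two intervals (0 if they touch or overlap)."""
    (_, first_end), (second_start, _) = sorted([a, b])
    return max(0, second_start - first_end)


def neighborhood_raw_score(genomes: Coords, gene_a: str, gene_b: str) -> int:
    """Sum the intergenic distances over all genomes that contain both genes."""
    total = 0
    for genes in genomes.values():
        if gene_a in genes and gene_b in genes:
            total += intergenic_distance(genes[gene_a], genes[gene_b])
    return total


# Toy coordinates, invented for the example.
genomes: Coords = {
    "genome_1": {"trpA": (100, 900), "trpB": (950, 2100)},
    "genome_2": {"trpA": (5000, 5800), "trpB": (5810, 7000)},
    "genome_3": {"trpA": (300, 1100)},  # trpB absent: contributes nothing
}

score = neighborhood_raw_score(genomes, "trpA", "trpB")
print(score)  # prints 60 (gaps of 50 and 10)
```

A small sum over many genomes suggests the two genes are consistently close together; the number is a raw score and only becomes interpretable after calibration.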


phylogenetic profiles


SVD

  • singular value decomposition

  • (removes redundancy)


score: Euclidean distance
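
A minimal sketch of the distance step, assuming each gene is described by a presence/absence vector across genomes. The profiles and gene names are invented, and the SVD step from the previous slide is deliberately left out to keep the example within the standard library.

```python
import math
from typing import Dict, List


def euclidean_distance(p: List[float], q: List[float]) -> float:
    """Euclidean distance between two profiles of equal length."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


# Hypothetical profiles: 1 = gene present in that genome, 0 = absent.
profiles: Dict[str, List[float]] = {
    "gene_a": [1, 1, 0, 1, 0],
    "gene_b": [1, 1, 0, 1, 0],  # same pattern as gene_a
    "gene_c": [0, 1, 1, 0, 1],  # mostly complementary pattern
}

d_ab = euclidean_distance(profiles["gene_a"], profiles["gene_b"])
d_ac = euclidean_distance(profiles["gene_a"], profiles["gene_c"])
print(d_ab, d_ac)  # prints 0.0 2.0
```

Genes with similar presence/absence patterns end up close together; as with the other methods, the distance is a raw score.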


all scores are “raw scores”


not comparable

  • sequence similarity

  • sum of intergenic distances

  • Euclidean distance


benchmarking

  • calibrate against “gold standard”

  • (KEGG)


raw scores


probabilistic scores

  • e.g. “70% chance for an association”
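
One simple way to picture the calibration is to bin the raw scores and ask what fraction of pairs in each bin is confirmed by the gold standard. The sketch below does exactly that; the binning approach, the toy data, and the function name `calibrate` are assumptions for illustration, and the real benchmark may well fit a smooth curve instead.

```python
from typing import List, Tuple


def calibrate(pairs: List[Tuple[float, bool]], bin_edges: List[float]) -> List[float]:
    """Per bin [edge_i, edge_i+1), return the fraction of pairs in the gold standard."""
    n_bins = len(bin_edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for raw_score, in_gold_standard in pairs:
        for i in range(n_bins):
            if bin_edges[i] <= raw_score < bin_edges[i + 1]:
                totals[i] += 1
                hits[i] += int(in_gold_standard)
                break
    # Empty bins get 0.0 rather than raising a division error.
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]


# Toy data: (raw score, both proteins share a gold-standard pathway?)
pairs = [
    (0.1, False), (0.2, False), (0.3, True), (0.4, False),
    (0.6, True), (0.7, True), (0.8, False), (0.9, True),
]

probabilities = calibrate(pairs, bin_edges=[0.0, 0.5, 1.0])
print(probabilities)  # prints [0.25, 0.75]
```

Read this way, a raw score falling in the upper bin would be reported as a 75% chance of an association, whatever method produced it.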


curated knowledge


KEGG

  • Kyoto Encyclopedia of Genes and Genomes


Reactome


GO

  • Gene Ontology


primary experimental data


many sources


many parsers


BIND

  • Biomolecular Interaction Network Database


GRID

  • General Repository for Interaction Datasets


HPRD

  • Human Protein Reference Database


co-expression

  • microarray data


GEO

  • Gene Expression Omnibus


correlation coefficient
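
The slide does not say which correlation coefficient is used; the sketch below assumes Pearson's r, computed over expression values of two genes across the same set of microarray experiments. The expression values and gene names are invented.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long expression vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / (spread_x * spread_y)


# Toy expression levels across four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # rises together with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]  # falls as gene_a rises

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
print(round(r_ab, 3), round(r_ac, 3))
```

Values near 1 indicate co-expression, values near -1 indicate opposite behaviour; like the genomic context scores, the coefficient is a raw score.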


literature mining


different gene identifiers


synonyms list


Medline


SGD

  • Saccharomyces Genome Database


The Interactive Fly


OMIM

  • Online Mendelian Inheritance in Man


simple scheme


co-mentioning
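
A minimal sketch of the simple scheme: map every name in an abstract to a canonical gene via a synonyms list, then count how often two genes appear in the same abstract. The synonyms table, the second and third toy abstracts, and the function names are assumptions for illustration only.

```python
import re
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set

# Hypothetical synonyms list: lower-cased name -> canonical gene identifier.
synonyms: Dict[str, str] = {
    "cyc1": "CYC1",
    "cyc7": "CYC7",
    "hap1": "HAP1",
    "hap1p": "HAP1",  # protein-style spelling mapped to the same gene
}


def genes_in(text: str, names: Dict[str, str]) -> Set[str]:
    """Return the canonical genes mentioned in one abstract."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return {names[token] for token in tokens if token in names}


def co_mention_counts(abstracts: Iterable[str], names: Dict[str, str]) -> Counter:
    """Count, for every gene pair, the abstracts that mention both genes."""
    counts: Counter = Counter()
    for text in abstracts:
        for pair in combinations(sorted(genes_in(text, names)), 2):
            counts[pair] += 1
    return counts


abstracts = [
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1.",
    "Hap1p binds upstream of CYC1.",      # invented toy abstract
    "CYC7 is expressed at low levels.",   # invented toy abstract
]

counts = co_mention_counts(abstracts, synonyms)
print(counts[("CYC1", "HAP1")])  # prints 2
```

Counting alone cannot tell what kind of relation holds between the two genes, which is where the more advanced NLP approach comes in.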


more advanced


NLP

  • Natural Language Processing


  • Gene and protein names

  • Cue words for entity recognition

  • Verbs for relation extraction

The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1


calibrate against gold standard


combine all evidence


Bayesian scoring scheme


e.g.: two scores of 0.7

combined probability: ?


e.g.: two scores of 0.7

combined probability: 0.91

  • 1 - (1 - 0.7)^2 = 0.91
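
The same arithmetic generalises to any number of evidence types: multiply the chances that each one is wrong and subtract the product from 1. The sketch below implements only that formula from the slide, treating the scores as independent; the function name `combine_scores` is made up for this example, and any further corrections the real scoring scheme applies are not modelled.

```python
from typing import Iterable


def combine_scores(scores: Iterable[float]) -> float:
    """Combine independent probabilistic scores as 1 - product(1 - p_i)."""
    probability_all_wrong = 1.0
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score {p} is not a probability")
        probability_all_wrong *= 1.0 - p
    return 1.0 - probability_all_wrong


combined = combine_scores([0.7, 0.7])
print(round(combined, 2))  # prints 0.91
```

Adding a further piece of evidence can only raise the combined score, for example `combine_scores([0.7, 0.7, 0.5])` gives about 0.955.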


evidence transfer


evidence spread over many species


transfer by orthology

  • (or “fuzzy orthology”)


von Mering et al., Nucleic Acids Research, 2005


von Mering et al., Nucleic Acids Research, 2005


two modes


COG mode


von Mering et al., Nucleic Acids Research, 2005


higher coverage, lower specificity

  • includes all available evidence

  • some orthologous groups are too large to be meaningful


proteins mode


von Mering et al., Nucleic Acids Research, 2005


maximum specificity, lower coverage

  • information will be relevant for selected species


Demo


outlook


take home message

  • STRING integrates information and predicts interactions

  • You can always go to the sources

  • Proteins mode: specific species

  • COG mode: more coverage, especially for prokaryotic genes


Acknowledgements

  • The STRING team

  • Lars Jensen

  • Peer Bork

  • Christian von Mering & group in Zurich

  • Berend Snel

  • Martijn Huynen


Thank you for your attention


Exercises: tinyurl.com/36twzq (or via course wiki)

Alternative server: xi.embl.de


Bork et al., Current Opinion in Structural Biology, 2004