1 / 89

Havana, Cuba, 21.11.2003

Mining the functional genomics data III Data integration: Gene Ontology, PPI, URLMAP Jaak Vilo vilo@egeen.ee. Havana, Cuba, 21.11.2003. Components of Expression Profiler http://ep.ebi.ac.uk/. External data, tools pathways, function, etc. EP:PPI Prot-Prot ia. EP:GO GeneOntology.

teddy
Download Presentation

Havana, Cuba, 21.11.2003

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mining the functional genomics data IIIData integration:Gene Ontology, PPI, URLMAPJaak Vilovilo@egeen.ee Havana, Cuba, 21.11.2003

  2. Components of Expression Profiler http://ep.ebi.ac.uk/ External data, tools pathways, function, etc. EP:PPI Prot-Prot ia. EP:GO GeneOntology Expression data EPCLUST Expression data GENOMES sequence, function, annotation URLMAP provide links SEQLOGO SPEXS discover patterns PATMATCH visualisepatterns

  3. GeneOntology Pathways Databases SPEXS Other tools Expression Profiler: EPCLUST FOLDER DATA SELECT/ FILTER ANALYZE A “CLUSTER” URLMAP

  4. URLMAP • Given a cluster of genes - many web based tools and databases to consult/follow up. How to link to them? • How to manage many links, many tools? • Answer: Centralize that linking

  5. URLMAP - no need to “cut & paste” SRS/InterPro KEGG: • Generates all links/forms dynamically • Maintain links in one place • Handle renaming of gene id’s by synonyms • Allow domain-specific link pages

  6. A Simple Metabolic Pathway Shoshanna Wodak, Jacques van Helden

  7. Links for each item type • Yeast S. cerevisiae gene ID-s (ORFname, SP id, SGD ID, …) • Pattern collections, e.g. substrings to profile generation by SEQLOGO • Keyword searches from web based search engines

  8. Management of links • Hierarchies of link collections • One can point to any (sub)hierarchy directly • LINK = • URL, title, • form parameters • modifications/code • DB lookups for synonyms

  9. “Screen scraping” – doable with a little perl programming g1,g2,g356 g1,g2,g356 g1 g1,g2,g356 g2 g356 g1,g2,g356 Report

  10. Gene OntologyTMwww.geneontology.org • GO is a systematic effort for data annotation • Three independent ontologies • Molecular Function • Biological Process • Cellular component • How to integrate that into analysis tools?

  11. DAG Structure mitosis S.c. NNF1 mitotic chromosome condensation S.c. BRN1, D.m. barren Annotate to any level within DAG

  12. GO Annotation: Data • Database object: gene or gene product • GO term ID • Reference • publication or computational method • Evidence supporting annotation

  13. GO Evidence Codes IDA-Inferred from Direct Assay IMP-Inferred from Mutant Phenotype IGI-Inferred from Genetic Interaction IPI-Inferred from Physical Interaction IEP-Inferred from Expression Pattern TAS-Traceable Author Statement NAS-Non-traceable Author Statement IC - Inferred by Curator ISS-Inferred from Sequence or structural Similarity IEA-Inferred from Electronic Annotation ND-Not Determined

  14. GO Evidence Codes From reviews or introductions IDA-Inferred from Direct Assay IMP-Inferred from Mutant Phenotype IGI-Inferred from Genetic Interaction IPI-Inferred from Physical Interaction IEP-Inferred from Expression Pattern TAS-Traceable Author Statement NAS-Non-traceable Author Statement IC - Inferred by Curator ISS-Inferred from Sequence or structural Similarity IEA-Inferred from Electronic Annotation ND-Not Determined automated From primary literature

  15. Example (GoMiner)

  16. EP:GO tool for GeneOntology • Browse • Search by keywords; EC, term. etc.. • Get associated genes • Submit associated genes to URLMAP • Annotate gene clusters using GO terms

  17. EP:GO EPCLUST URLMAP => Look up expression data

  18. Annotate Clusters (EP:GO) F,G,H 1 J F,G F,G,I 2 3 4 B,E A,D I E B,A F,G,H B,C 5 6 B,E,F,I

  19. Set overlap N genes CLUSTER GO term G ∩ C A: |G ∩ C| / min( |G|, |C|) B: P( choose |C| from N with |G|, observe |G ∩ C|+)

  20. Annotation of clusters • GO:0042254 <U:L> Process: ribosome biogenesis and assembly (+2:15) (depth=7) [sgd:2:187]GO:0042254: 47 from cluster (size 98) vs 187 in this class (including subclasses)GO:0006364 <U:L> Process: rRNA processing (+3:3) (depth=8) [sgd:50:126]GO:0006364: 35 from cluster (size 98) vs 126 in this class (including subclasses)GO:0006360 <U:L> Process: transcription from Pol I promoter (+6:14) (depth=8) [sgd:23:155]GO:0006360: 38 from cluster (size 98) vs 155 in this class (including subclasses)GO:0005730 <U:L> Component: nucleolus (+10:17) (depth=6) [sgd:154:210]GO:0005730: 45 from cluster (size 98) vs 210 in this class (including subclasses)GO:0030515 <U:L> Function: snoRNA binding (depth=6) [sgd:23:23]GO:0030515: 17 from cluster (size 98) vs 23 in this class (including subclasses)GO:0030490 <U:L> Process: processing of 20S pre-rRNA (depth=9) [sgd:33:33]GO:0030490: 18 from cluster (size 98) vs 33 in this class (including subclasses)GO:0005732 <U:L> Component: small nucleolar ribonucleoprotein complex (depth=6) [sgd:30:30]GO:0005732: 16 from cluster (size 98) vs 30 in this class (including subclasses)GO:0006396 <U:L> Process: RNA processing (+7:52) (depth=7) [sgd:7:370]GO:0006396: 40 from cluster (size 98) vs 370 in this class (including subclasses) • …

  21. 101 Sequences relative to ORF start YGR128C + 100 >YAL036C chromo=1 coord=(76154-75048(C)) start=-600 end=+2 seq=(76152-76754) TGTTCTTTCTTCTTCTGCTTCTCCTTTTCCTTTTTTTCCTTCTCCTTTTCCTTCTTGGACTTTAGTATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTTTCTTGCTTCTTCTTCGATTGCTTCAAAGTAGACATGAAGTCGCCTTCAATGGCCTCAGCACCTTCAGCACTTGCACTTGCTTCTCTGGAAGTGTCATCTGCACCTGCGCTGCTTTCTGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGTTCTGGGCGGCGTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGCAAGATCTCTTGCTTTCAATGGGCTTAAAGCTTGAAAAATTTTTTCACATCACAAGCGACGAGGGCCCGTTTTTTTCATCGATGAGCTATAAGAGTTTTCCACTTTTAAGATGGGATATTACGGTGTGATGAGGGCGCAATGATAGGAAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTTGTTGATTGAGCAAA_ATG_ >YAL025C chromo=1 coord=(101147-100230(C)) start=-600 end=+2 seq=(101145-101747) CTTAGAAGATAAAGTAGTGAATTACAATAAATTCGATACGAACGTTCAAATAGTCAAGAATTTCATTCAAAGGGTTCAATGGTCCAAGTTTTACACTTTCAAAGTTAACCACGAATTGCTGAGTAAGTGTGTTTATATTAGCACATTAACACAAGAAGAGATTAATGAACTATCCACATGAGGTATTGTGCCACTTTCCTCCAGTTCCCAAATTCCTCTTGTAAAAAACTTTGCATATAAAATATACAGATGGAGCATATATAGATGGAGCATACATACATGTTTTTTTTTTTTTAAAAACATGGACTCGAACAGAATAAAAGAATTTATAATGATAGATAATGCATACTTCAATAAGAGAGAATACTTGTTTTTAAATGAGAATTGCTTTCATTAGCTCATTATGTTCAGATTATCAAAATGCAGTAGGGTAATAAACCTTTTTTTTTTTTTTTTTTTTTTTTGAAAAATTTTCCGATGAGCTTTTGAAAAAAAATGAAAAAGTGATTGGTATAGAGGCAGATATTGCATTGCTTAGTTCTTTCTTTTGACAGTGTTCTCTTCAGTACATAACTACAACGGTTAGAATACAACGAGGAT_ATG_ ... >YBR084W chromo=2 coord=(411012-413936) start=-600 end=+2 seq=(410412-411014) CCATGTATCCAAGACCTGCTGAAGATGCTTACAATGCCAATTATATTCAAGGTCTGCCCCAGTACCAAACATCTTATTTTTCGCAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCAAAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGTCTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTAGTAAGAACCGTCACAGATATGATTGAGCACGGTACAATTATGTATTAGTCAAACATTACCAGTTCTCGAACAAAACCAAAGCTACTCCTGCAACACTCTTCTATCGCACATGTATGGTTCTTATTGTTTCCCGAGTTCTTTTTTACTGACGCGCCAGAACGAGTAAGAAAGTTCTCTAGCGCCATGCTGAAATTTTTTTCACTTCAACGGACAGCGATTTTTTTTCTTTTTCCTCCGAAATAATGTTGCAGCGGTTCTCGATGCCTCAAGAATTGCAGAAGTAAACCAGCCAATACACATCAAAAAACAACTTTCATTACTGTGATTCTCTCAGTCTGTTCATTTGTCAGATATTTAAGGCTAAAAGGAA_ATG_ GATGAG.T 1:52/70 2:453/508 R:7.52345 BP:1.02391e-33G.GATGAG.T 1:39/49 2:193/222 R:13.244 BP:2.49026e-33AAAATTTT 1:63/77 2:833/911 R:4.95687 BP:5.02807e-32TGAAAA.TTT 1:45/53 2:333/350 R:8.85687 BP:1.69905e-31TG.AAA.TTT 1:53/61 2:538/570 R:6.45662 BP:3.24836e-31TG.AAA.TTTT 1:40/43 2:254/260 R:10.3214 BP:3.84624e-30TGAAA..TTT 1:54/65 2:608/645 R:5.82106 BP:1.0887e-29 ... GATGAG.T TGAAA..TTT

  22. EP:PPI Protein-protein interaction • There are high-throughput technologies for identifying hypothetical protein-protein interactions • Which ones of these are more likely to be true? • Can these predictions help predicting gene function?

  23. PPI pairs

  24. We have expression data

  25. Cluster

  26. Trust those within the same cluster

  27. PPI are enriched within clusters Ge, Liu, Church, Vidal: Nature Genetics Nov. 2001

  28. Protein-protein interactions: which to trust more? Answer: Use the distance measure alone

  29. Kemmeren et.al. Randomized expression data Yeast 2-hybrid studies Known (literature) PPI MPK1 YLR350w SNF4 YCL046W SNF7 YGR122W Molecular Cell, Vol. 9, 1133–1143, May, 2002

  30. Interacting pairs of proteins A and B; C and D Which would you trust? A 0 1 d B 0 C 12 D d 7 13

  31. EP:PPI – combine PPI and expression

  32. Results • Confidence in 973 out of 5342 putative two-hybrid interactions from S. cerevisiae is increased. • Besides verification, integration of expression and interaction data is employed to provide functional annotation for over 300 previously uncharacterized genes. • The robustness of these approaches is demonstrated by experiments that test the in silico predictions made. • This study shows how integration improves the utility of different types of functional genomic data and how well this contributes to functional annotation.

  33. promoter coding DNA Gene regulation by transcription factors GENE 1 GENE 2 GENE 3 GENE 4 DNA transcription factors G1 G3 G2 G4

  34. Networks • Graphical models • Directed labelled graph • Nodes  genes • Arcs/Edges  relationships • Labels  types of relationships

  35. Graph drawing W A B Start node (gene) Connection weight, w End node (gene)

  36. Different interpretation of arcs • Edges can have different meanings, hence different networks • Binding site for A is in front of B • Proteins A and B interact • Deletion of gene A affects expression of B (is somewhere in regulation cascade) • “Literature” mentions genes together

  37. promoter coding DNA Gene regulation by transcription factors GENE 1 GENE 2 GENE 3 GENE 4 DNA transcription factors G1 G3 G2 G4

  38. Deletion mutants (gene knockouts) DA DC DB gene A gene B gene C gene D A D B C

  39. Hughes, T. R. et al: “Functional Discovery via a Compendium of Expression Profiles”, Cell 102 (2000), 109-126.

  40. Green arrows - upregulation Red arrows - downregulation Thickness of arrow represents certainty of direction (up/down)

  41. A complete graph

  42. Features/distributions that do not depend on discretisation thresholds • Visual inspection, biological interpretation • General statistics and features of the graphs • Indegree/Outdegree • Complexity of the networks • What is the modularity? • How many components? • Deletion of hot-spots, does it break the net?

  43. Filter • choose a list of genes • (MATING, marked in red) • filter for these genes plus • neighbouring genes from the graph Mutation network Dg=4

  44. Mutation network Dg=2

  45. + Repressor Lactose Galactose Glucose Galactosidase Promoter lacI Promoter Operator lacZ ... Activator Glucose Lac-Operon Thomas Schlitt

  46. Gene regulatory networks • What formalisms to use to describe them? • When does model correspond to biological reality? • How to simulate models on computer • Is it possible to verify models by experiments? • How to restore networks from raw data without knowing the structure or parameters?

  47. Number of incoming/outgoing edges Most genes have only a few incoming / outgoing edges, but some have high numbers (>500) count ... number of outgoing edges

  48. Rank of outdegree ARG5,6(108,28) SST2(60,25) Rank of indegree TEC1 ERG3(164,15) GCN4 GAS1 ERG28 QCR2 YER083C FUS3 GLN3 CLB2 HPT1 SPF1 YHL029C MRT4

More Related