
EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context

Enzyme Function Initiative (EFI). Gordon Research Conference on Enzymes, Coenzymes, and Metabolic Pathways, July 15, 2014.


Presentation Transcript


  1. EFI-Genome Neighborhood Tool: a web tool for large-scale analysis of genome context
     Enzyme Function Initiative (EFI)
     Gordon Research Conference on Enzymes, Coenzymes, and Metabolic Pathways, July 15, 2014

  2. What is a Genome Neighborhood Network?
     High sequence homology → enzyme function
     Low/medium sequence homology + genome context → enzyme function

  3. What is a Genome Neighborhood Network?
     Genes << Operon << Regulon: gene products forming a biological pathway.
     (Figure: neighboring genes R, A, B, C.)
     Genome neighborhood information facilitates enzyme function discovery via contextual evidence.

  4. What is a Genome Neighborhood Network?
     The GNN organizes genome neighborhood information for thousands of query genes rapidly and in high throughput. The resulting network allows a user to quickly identify the protein families encoded by the genes in close proximity to the query genes of the SSN dataset.

  5. GNN Generation
     • European Nucleotide Archive (ENA) is queried with each SSN sequence
     • Protein-encoding genes are compared to Pfam
     • Additional annotation information is gathered
     • Network xgmml file written
       • Query sequences and neighbor sequences = nodes
       • Genome proximity = edge
     • SSN network file parsing
       • Singletons excluded
       • Clusters assigned number and unique color
     The entire process is fast and computationally inexpensive.
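The node and edge definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EFI-GNT implementation: the input dictionary, the accession names (`Q1`, `N1`, ...), the function names, and the `offset` attribute name are all hypothetical, and the XGMML written here is a minimal skeleton rather than the full attribute set the tool produces.

```python
import xml.etree.ElementTree as ET


def build_gnn(neighborhoods):
    """Collect nodes and edges from query -> [(neighbor, offset), ...].

    Query and neighbor sequences become nodes; each query/neighbor
    pair found in the same genome neighborhood becomes an edge.
    """
    nodes = set()
    edges = []
    for query, neighbors in neighborhoods.items():
        nodes.add(query)
        for neighbor, offset in neighbors:
            nodes.add(neighbor)
            edges.append((query, neighbor, offset))
    return sorted(nodes), edges


def to_xgmml(nodes, edges, label="GNN sketch"):
    """Serialize nodes and edges as a minimal XGMML graph string."""
    graph = ET.Element(
        "graph", {"label": label, "xmlns": "http://www.cs.rpi.edu/XGMML"}
    )
    for accession in nodes:
        ET.SubElement(graph, "node", {"id": accession, "label": accession})
    for source, target, offset in edges:
        edge = ET.SubElement(
            graph,
            "edge",
            {"source": source, "target": target, "label": f"{source}-{target}"},
        )
        # Hypothetical attribute name; stores the gene offset on the edge.
        ET.SubElement(
            edge, "att", {"name": "offset", "type": "integer", "value": str(offset)}
        )
    return ET.tostring(graph, encoding="unicode")


# Made-up accessions: two queries that share one neighbor (N2).
example = {"Q1": [("N1", -1), ("N2", 2)], "Q2": [("N2", 1)]}
nodes, edges = build_gnn(example)
xgmml = to_xgmml(nodes, edges)
```

A shared neighbor such as `N2` ends up connected to both queries, which is the kind of shared context the later slides use to infer a common pathway.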

  6. GNNs: query families
     (Figure label: query families)

  7. GNNs: bacterial proteins in gene clusters
     (Figure labels: genome neighbors; query families)

  8. GNNs: collect neighbors
     (Figure labels: genome neighbors; query families)

  9. GNNs: cluster neighbors
     (Figure labels: genome neighbors; network for neighbors; query families)

  10. GNNs: deduce function
     (Figure labels: genome neighbors; network for neighbors; query families)
     Shared context → same pathway → same function
     Unique context → unique pathway → unique function

  11. Example: proline racemase superfamily
     E-value < 10^-120, > 60% ID
     Zhao et al. 2014, eLife: http://dx.doi.org/10.7554/eLife.03275

  12. GNN: “BLAST” network

  13. GNN: Pfam network
     (Figure panels: full GNN; Pfam GNN)

  14. GNN: pathway “parts”
     DAO, ALDH, DHDPS, LDH/MDH, OCD

  15. From GNN: complete pathways
     DAO, DHDPS, OCD, LDH/MDH, ALDH

  16. GNN Format The GNN visually organizes genome neighborhood information into multiple hub-and-spoke clusters.

  17. Hub Nodes
     • Hub node = Pfam family in neighborhood
     • Node attribute Neighbor_Accessions = list of all Pfam members found in the genome context of the SSN, with the following additional information:
       • EC number
       • PDB code
       • PDB-hit
       • Swiss-Prot status (reviewed/unreviewed)
     • Additional node attributes:
       • Num_neighbors = the number of neighbor sequences belonging to this Pfam family
       • pfam = Pfam number, e.g., PF13365
       • Pfam description = a short description of the family, e.g., Trypsin-like peptidase domain

  18. PDB-Hit
     PDB-hit: a sequence shares significant homology (BLASTp e-value < 1e-15) with a protein that has an X-ray crystal structure in the RCSB Protein Data Bank. The format of this information is “PDB code:e-value”.
     Related structure → homology model for docking
     For users who are new to homology modeling, see the resources by the Sali Lab at the University of California, San Francisco.
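As a small illustration of the “PDB code:e-value” format, the sketch below splits one such value and applies the 1e-15 significance threshold quoted on the slide. The function names and the example value `1ABC:3e-42` are made up for the example; how EFI-GNT itself stores or delimits multiple hits is not specified here.

```python
def parse_pdb_hit(value):
    """Split one 'PDB code:e-value' string into (code, e-value as float)."""
    code, evalue = value.rsplit(":", 1)
    return code.strip(), float(evalue)


def is_pdb_hit(evalue, cutoff=1e-15):
    """Return True if the e-value is below the significance cutoff."""
    return evalue < cutoff


# Hypothetical attribute value, not taken from a real network.
code, evalue = parse_pdb_hit("1ABC:3e-42")
significant = is_pdb_hit(evalue)
```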

  19. Spoke Nodes
     Spoke node = a single cluster from the SSN with ≥1 neighbor in the hub
     • Node attributes:
       • Cluster Number = number assigned to the SSN cluster
       • Query_Accessions = a list of UniProt accessions for the query sequences
       • Distance = a list of distances between query and neighbor, formatted “UniprotID-query:UniprotID-neighbor:(-)N”, where query = 0, next gene = 1, etc., and a negative N value indicates an upstream position
       • SSN Cluster Size = the size of the SSN cluster
       • Num_neighbors = number of neighbor sequences retrieved by the spoke node
       • Num_queries = number of query sequences in the spoke node
       • Num_ratio = % co-occurrence as a ratio
       • ClusterFraction = % co-occurrence as fraction, 0-3
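The Distance format can be read back with a few lines of Python. This is a sketch that parses a single “UniprotID-query:UniprotID-neighbor:(-)N” entry; the accessions below are placeholders, and the way the tool separates multiple entries within the list is not assumed here.

```python
from typing import NamedTuple


class NeighborDistance(NamedTuple):
    query: str      # UniProt accession of the query sequence
    neighbor: str   # UniProt accession of the neighbor sequence
    offset: int     # gene offset from the query; negative = upstream


def parse_distance(entry):
    """Parse one 'query:neighbor:N' entry from the Distance attribute."""
    query, neighbor, offset = entry.split(":")
    return NeighborDistance(query.strip(), neighbor.strip(), int(offset))


def is_upstream(distance):
    """A negative offset indicates an upstream position."""
    return distance.offset < 0


# Placeholder accessions used only for illustration.
d = parse_distance("P12345:Q67890:-3")
```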

  20. Spoke Nodes
     Spoke node size depends on the % co-occurrence of that Pfam in the neighborhood of that SSN cluster.
     % co-occurrence = (# neighbors retrieved / SSN cluster size) × 100
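The co-occurrence formula is simple enough to check by hand, but a short sketch makes the arithmetic explicit. The function names are illustrative, and treating the co-occurrence cutoff from the input form as an inclusive lower bound is an assumption, not documented tool behavior.

```python
def percent_cooccurrence(num_neighbors, ssn_cluster_size):
    """% co-occurrence = (# neighbors retrieved / SSN cluster size) * 100."""
    if ssn_cluster_size <= 0:
        raise ValueError("SSN cluster size must be positive")
    return 100.0 * num_neighbors / ssn_cluster_size


def passes_cutoff(num_neighbors, ssn_cluster_size, cutoff):
    """Assumed rule: keep a spoke if its % co-occurrence is >= cutoff."""
    return percent_cooccurrence(num_neighbors, ssn_cluster_size) >= cutoff


# Illustrative numbers: 3 neighbors retrieved for a 15-member SSN cluster.
value = percent_cooccurrence(3, 15)
```

With these numbers the spoke has 20% co-occurrence, so it would pass a cutoff of 20 but not one of 25 under the assumed rule.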

  21. Pfam and the GNN
     Highly represented in SSN cluster: more universal
     Lowly represented in SSN cluster: unique
     www.pfam.xfam.org

  22. Pfam and the GNN
     Identify the general classes of enzymes present in the genome context of an SSN cluster. E.g., the presence of a kinase Pfam family and an isomerase Pfam family may indicate that the proteins of this particular SSN cluster carry out an aldolase-type reaction for a catabolic pathway.
     (Figure labels: isomerase Pfam; kinase Pfam)
     www.pfam.xfam.org

  23. Neighborhood Size
     EFI-GNT default neighborhood size = +10 and -10 genes
     Users may lower this to +/- 3 to 9 genes
     (Figure: neighboring genes R, A, B, C)
     Zheng et al. 2002, Genome Research 12, 1221
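The window selection can be sketched as follows, assuming the genes of one genome record are available as an ordered list. The function name and gene labels are hypothetical, and strand orientation is ignored: the offset is simply the position relative to the query in the list.

```python
def neighborhood(genes, query_index, size=10):
    """Return (gene, offset) pairs within +/- size genes of the query.

    Offsets follow the GNN convention: query = 0, next gene = 1,
    and negative values are upstream positions.
    """
    if not 3 <= size <= 10:
        raise ValueError("neighborhood size must be between 3 and 10")
    if not 0 <= query_index < len(genes):
        raise IndexError("query index is outside the gene list")
    start = max(0, query_index - size)
    stop = min(len(genes), query_index + size + 1)
    return [
        (genes[i], i - query_index)
        for i in range(start, stop)
        if i != query_index
    ]


# Made-up gene labels g0..g29 standing in for one genome record.
genes = [f"g{i}" for i in range(30)]
full_window = neighborhood(genes, 15)          # default +/- 10
edge_window = neighborhood(genes, 1, size=3)   # query near the start
```

A query near the start or end of the record returns a truncated window, which is one of the reasons a query can yield fewer than 20 neighbors at the default setting.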

  24. GNN Signal-to-Noise: Added Noise
     The utility of the GNN is limited primarily by its signal-to-noise ratio.
     Signal = proximal and functionally related genes
     Noise = proximal and irrelevant genes

  25. GNN Signal-to-Noise: Lost Signal
     Why did my query sequence return fewer than 20 neighbors?
     • Query sequence does not match the ENA sub-databases
     • Non-coding RNA
     • Query sequence is located near the beginning or end of the ENA file
     • The neighbor entry does not have an associated EMBL accession number
     • The neighbor entry has not been incorporated into a current Pfam family
     (Figure: neighboring genes R, A, B, C with two positions marked X)

  26. EFI-GNT Web tool www.enzymefunction.org

  27. EFI-GNT Input
     www.efi.igb.illinois.edu/efi-gnt
     1. Upload xgmml network, full or rep-node
     2. Pick neighborhood size: 3-10 +/- genes
     3. Enter co-occurrence cutoff (1-100)
     4. Enter email address
     5. Hit “go”
     (Figure label: upload status bar)

  28. EFI-GNT Output
     The EFI-GNT output is a pair of .xgmml files:
     • Genome neighborhood network (GNN)
     • Colored version of the original SSN

  29. EFI-GNT Output A download link will be sent to the e-mail address provided. Data stored on server for 7 days.

  30. EFI-GNT Output
     NOTE: Depending on your browser, the files may download with an additional file extension, such as .xgmml.txt or .xgmml.xml. You must delete the .txt or .xml extension in order to open these files in Cytoscape! Cytoscape opens .xgmml files.
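Renaming by hand works fine for two files; for a folder of downloads, a short script along these lines could do the same thing. The function names and example file names are hypothetical; only a trailing .txt or .xml that directly follows .xgmml is removed.

```python
from pathlib import Path

# Extra extensions that some browsers append to .xgmml downloads.
EXTRA_SUFFIXES = {".txt", ".xml"}


def cytoscape_name(filename):
    """Return the file name with a trailing .txt/.xml after .xgmml removed."""
    path = Path(filename)
    suffixes = [s.lower() for s in path.suffixes]
    if (
        len(suffixes) >= 2
        and suffixes[-2] == ".xgmml"
        and suffixes[-1] in EXTRA_SUFFIXES
    ):
        return path.with_suffix("").name
    return path.name


def fix_download(path):
    """Rename a downloaded file on disk if it carries an extra extension."""
    path = Path(path)
    target = path.with_name(cytoscape_name(path.name))
    if target != path:
        path.rename(target)
    return target


fixed = cytoscape_name("my_gnn.xgmml.txt")
```

Files that already end in .xgmml, and unrelated .txt or .xml files, are left untouched.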

  31. Network Visualization
     (Figure label: Cytoscape version 3.1.0)
     GNN files must be viewed in Cytoscape 3.0 (or more recent).
     Best layouts: Organic or Prefuse Force Directed
     Opening both the GNN and the colored SSN in a single instance of Cytoscape allows fast comparison between the two networks (see the slide figure).
     www.cytoscape.org

  32. Network Visualization
     NOTE: In Cytoscape, the automatic rendering and coloring of the colorized SSN is size dependent. Cytoscape settings include a “Threshold View” that needs to be adjusted in the following manner in order to automatically view your colored SSN:
     • In any version 3.X, go to Edit -> Preferences -> Properties
     • With “cytoscape 3” selected in the pull-down menu at the top, scroll to the bottom of the Property list and select “viewThreshold”
     • Click “Modify” and append five zeros to the end of the displayed number
     • Click “OK”
     • Restart Cytoscape (this should only need to be done once per version of Cytoscape installed on your machine)

  33. Network Manipulation
     Generally, the full +/-10 neighbor GNN presents an overwhelming amount of information. Filter GNN networks by SSN Cluster Number in order to assign enzyme function to subgroups of homologous sequences.

  34. Network Manipulation
     After filtering, only the hubs connected to the designated SSN cluster remain (e.g., the cyan cluster 5). Analyze the genome neighborhood Pfams specific to this SSN cluster.
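In Cytoscape this filtering is done interactively, but the underlying selection is easy to express in code. The sketch below assumes the spoke nodes have already been read into plain dictionaries; the keys `cluster_number` and `pfam`, the function name, and the example records are hypothetical and do not reflect the tool's file layout.

```python
def filter_by_cluster(spokes, cluster_number):
    """Keep spokes of one SSN cluster and list the hubs they connect to."""
    kept = [s for s in spokes if s["cluster_number"] == cluster_number]
    hubs = sorted({s["pfam"] for s in kept})
    return kept, hubs


# Illustrative spoke records: each links one SSN cluster to one Pfam hub.
spokes = [
    {"cluster_number": 5, "pfam": "PF00480"},
    {"cluster_number": 5, "pfam": "PF00701"},
    {"cluster_number": 7, "pfam": "PF00480"},
    {"cluster_number": 9, "pfam": "PF13365"},
]
cluster5_spokes, cluster5_hubs = filter_by_cluster(spokes, 5)
```

The hubs returned for cluster 5 are the Pfam families to inspect when assigning a function to that subgroup.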

  35. Network Manipulation
     Spoke length is arbitrary. Click, drag, and drop overlapping spoke nodes until all are visible.

  36. Tutorial Pages Tutorial pages containing content similar to this presentation

  37. Test Case: Predicted Novelties of the Sialic Acid Degradation Pathway

  38. Protein SSN
     Bacterial extracellular solute-binding protein family 1 (SBP_bac_1, PF01547)
     100% rep-node network; BLAST E-value 10^-80 (40% identical); 21833 sequences; 11073 nodes
     Cluster 164: 15 members
     EFI ID 510644: ThermoFluor hit on N-acetyl-neuraminate
     J. Bouvier, UIUC

  39. Genome Neighborhood Network for Cluster 164
     (Figure labels: Permease ABC transporter, Epimerase, Regulator, Kinase, DHDPS, DUF)
     J. Bouvier, UIUC

  40. EFI ID 510644 gene neighborhood
     Streptococcus uberis Diernhofer (strain 0140J, ATCC BAA-854)
     (Figure: gene positions +6 to -4 relative to the query)
     J. Bouvier, UIUC

  41. N-acetylneuraminate degradation pathway
     (Pathway figure. Pfam families shown: PF00480, PF00701, PF04131, PF01979, PF01182. Metabolites shown: N-acetyl neuraminate, N-acetyl-D-mannosamine, N-acetyl-D-mannosamine 6-phosphate, N-acetyl-D-glucosamine 6-phosphate, D-glucosamine 6-phosphate, β-D-fructofuranose 6-phosphate, pyruvate, acetate, NH4+, ATP, ADP, H+, H2O; the pathway feeds into glycolysis. Legend: enzyme Pfam family ID; found in GNN; found alternative Pfam; orphan EC.)
     J. Bouvier, UIUC

  42. Three sources of unknown enzymes
     • Orphan enzyme activity (EC number with no enzyme): In vivo evidence suggests that an enzyme from PF04131 converts N-acetyl-D-mannosamine 6-phosphate to N-acetyl-D-glucosamine 6-phosphate in the third step of the pathway, but no biochemical work has been done on this putative epimerase.
     • Non-orthologous gene replacement: The deacetylase from PF01979, known to convert N-acetyl-D-glucosamine 6-phosphate to D-glucosamine 6-phosphate in the fourth step of this pathway, is located elsewhere in the genome (locus tag Sub1443). However, Sub1651, which is located four genes downstream, is a member of PF05448, and other members of PF05448 have known deacetylase activity. Is this a non-orthologous gene replacement, and does its low occurrence (7%) in the neighborhoods of the queries suggest it to be a relic?
     • Domain of unknown function: The deaminase/isomerase from PF01182, known to convert α-D-glucosamine 6-phosphate to β-D-fructofuranose 6-phosphate in the fifth step of the pathway, is located elsewhere in the genome (locus tag Sub1239). However, Sub1654, which is located one gene downstream, has been suggested to be a sugar isomerase. Sub1654 is a member of PF04074 (DUF386) and is a good candidate for docking.
     J. Bouvier, UIUC

  43. Hands-on Portion of Workshop
     Feel free now to download Cytoscape 3.1, run EFI-EST, and run EFI-GNT for your protein (family) of interest.
     Please see posters by Katie Whalen (#55) and Daniel Wichelecki (#56) for further examples of EFI-EST/EFI-GNT use.
     Tutorials for using Cytoscape: http://enzymefunction.org/resources/tutorials/efi-and-cytoscape3
     Feel free to contact us throughout the conference with questions/comments.

     Acknowledgements
     GNN Development
     • Suwen Zhao (UCSF)
     • Alan Barber (Pythoscape, UCSF)
     • Shoshana Brown (Pythoscape, UCSF)
     • Eyal Akiva (Pythoscape, UCSF)
     • Jason Bouvier (UIUC)
     Website Build
     • Daniel Davidson (UIUC)
     • David Slater (UIUC)
     Documentation
     • Katie Whalen (UIUC)
     Principal Investigators
     • Matthew Jacobson (UCSF)
     • Patricia Babbitt (Pythoscape, UCSF)
     • John Gerlt (UIUC)
