
Glycan database


menefer

Presentation Transcript


  1. Glycan database

  2. Database of molecules • Two models (of vocabularies) • Proteins / Nucleic Acids • Residues (+ modifications) • GenBank / Swiss-Prot • Compounds • Atoms & covalent bonds (SMILES/SMARTS language) • PubChem / ACS • Glycans • Residues: monosaccharides (+ many modifications) • Branching, nonlinear structure

  3. Simplified Molecular Input Line Entry System (SMILES) • Glucose • OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
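To make the atom-level vocabulary concrete, here is a minimal sketch that tokenizes the glucose string above with a regular expression and counts heavy atoms, stereocenters and rings. The helper names `tokenize_smiles` and `summarize` are hypothetical, and the tokenizer covers only the small SMILES subset that appears in this one string; a real cheminformatics toolkit would be needed for general input.

```python
import re
from collections import Counter

GLUCOSE = "OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1"

# Illustrative token set: bracket atoms, bare organic atoms,
# branch parentheses, and single-digit ring-closure labels.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[()]|\d")

def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; reject anything outside the subset."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unsupported SMILES syntax")
    return tokens

def summarize(smiles):
    """Return (heavy-atom counts, number of stereocenters, number of rings)."""
    tokens = tokenize_smiles(smiles)
    elements = Counter()
    stereocenters = 0
    for tok in tokens:
        if tok.startswith("["):
            # Bracket atom such as [C@@H]: the element symbol comes first.
            elements[re.match(r"\[([A-Z][a-z]?)", tok).group(1)] += 1
            if "@" in tok:
                stereocenters += 1
        elif tok[0].isalpha():
            elements[tok] += 1
    # Each ring is opened and closed by the same digit, so digits come in pairs.
    rings = sum(1 for tok in tokens if tok.isdigit()) // 2
    return elements, stereocenters, rings

elements, stereocenters, rings = summarize(GLUCOSE)
print(dict(elements), stereocenters, rings)
```

The counts agree with the C6H12O6 formula of glucose (hydrogens are implicit in SMILES and are not counted here), and the single ring-closure pair corresponds to the pyranose ring.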

  4. Representation of glycans • Vocabulary • monosaccharides rather than atoms • Two challenges • Controlled vocabulary of monosaccharides • GlycoCT • From residues to molecules: glycan exchange format • GLYDE-II

  5. Searching the glycan database: comparison • Glycan representation • tree vs. sequences • Glycan matching • exact vs. non-exact • Graph theoretic algorithm • alignment? Mutations are natural events. • Multiple glycan matching • Glycan pattern searching • Significance estimation

  6. GlycoCT: controlled vocabulary

  7. GLYDE standard • An XML-based representation format for glycan structures • Inter-convertible with existing data represented using IUPAC or LINUCS • GLYDE II: incorporation of probability-based representation • Visualization: structures using GLYDE (XML) files • Reference: GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340 (18), Dec 30, 2005.

  8. Collaborative GlycoInformatics • Enable querying and export of query results in GLYDE format • Using GLYDE representation for disambiguation, mapping and matching • [Diagram labels: GLYDE, MonosaccharideDB, SweetDB, KEGG, QUERY, RESULT; both the query and the result are shown as <glyde> <residue> . . </residue> </glyde> documents]

  9. Semantic GlycoInformatics - Ontologies • GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolism of glycans) • Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy • URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco • ProPreO: a comprehensive process ontology modeling experimental proteomics • Contains 330 classes, 6 million+ instances • Models three phases of experimental proteomics • URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo

  10. GlycO taxonomy • [Figure: the first levels of the GlycO taxonomy] • [Figure: most relationships and attributes in GlycO] • GlycO exploits the expressiveness of OWL-DL. Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.

  11. <Glycan>
        <aglycon name="Asn"/>
        <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
            <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
              <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
                <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"> </residue>
                <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"> </residue>
              </residue>
              <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
                <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"> </residue>
              </residue>
            </residue>
          </residue>
        </residue>
      </Glycan>
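The nested `<residue>` elements on this slide already form a rooted, labeled tree, so they can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a copy of the slide's XML; the helper names `composition`, `depth` and `outline` are hypothetical and are not part of GLYDE or of any glycan toolkit.

```python
import xml.etree.ElementTree as ET
from collections import Counter

R = 'anomeric_carbon="1" chirality="D"'  # attributes shared by every residue
GLYCAN_XML = f"""
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomer="b" monosaccharide="GlcNAc" {R}>
    <residue link="4" anomer="b" monosaccharide="GlcNAc" {R}>
      <residue link="4" anomer="b" monosaccharide="Man" {R}>
        <residue link="3" anomer="a" monosaccharide="Man" {R}>
          <residue link="2" anomer="b" monosaccharide="GlcNAc" {R}/>
          <residue link="4" anomer="b" monosaccharide="GlcNAc" {R}/>
        </residue>
        <residue link="6" anomer="a" monosaccharide="Man" {R}>
          <residue link="2" anomer="b" monosaccharide="GlcNAc" {R}/>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
"""

def composition(root):
    """Count residues by monosaccharide name."""
    return Counter(r.get("monosaccharide") for r in root.iter("residue"))

def depth(elem):
    """Length of the longest residue chain below `elem`."""
    return max((1 + depth(child) for child in elem.findall("residue")), default=0)

def outline(elem, level=0):
    """One indented line per residue, e.g. 'Man(a1-3)'."""
    lines = []
    for r in elem.findall("residue"):
        name, anomer = r.get("monosaccharide"), r.get("anomer")
        carbon, link = r.get("anomeric_carbon"), r.get("link")
        lines.append("  " * level + f"{name}({anomer}{carbon}-{link})")
        lines.extend(outline(r, level + 1))
    return lines

glycan = ET.fromstring(GLYCAN_XML.strip())
print(composition(glycan))
print("\n".join(outline(glycan)))
```

The outline makes the branching visible: the β-mannose carries two α-mannose arms (links 3 and 6), which is the nonlinear structure that sequence-style formats cannot express directly.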

  12. ProPreO • ProPreO: A process ontology to capture proteomics experimental lifecycle: • Separation • Mass spectrometry • Analysis • 330 classes • 110 properties • 6 million+ instances

  13. Usage: Mass spectrometry analysis Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated. Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

  14. Semantic Annotation of Experimental Data • Enables Ontology-mediated Disambiguation • Allows correlation between disparate entities using Semantic Relations P(S | M = 3461.57) = 0.6 P(T | M = 3461.57) = 0.4 Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

  15. Graph Theoretic Basics • Tree: an acyclic connected graph, whose vertices we refer to as nodes; • Rooted tree: a tree having a specific node called the root, from which the rest of the tree extends; • Children: nodes that extend from a node x by one edge are called the children of x; conversely, x is called the parent of these children; • Leaf: a node with no children; • Subtree: a subtree of a tree T is a tree whose nodes and edges are subsets of those of T; • Ordered tree: a rooted tree in which the children of each node are ordered; • Labeled tree: a tree in which a label is attached to each node; • Forest: a set of trees • Oligosaccharides can be represented as labeled (monosaccharides), ordered (if linkages are specified) and rooted trees.
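The definitions above map directly onto a small data structure. The following is a minimal sketch, assuming a hypothetical `Node` class in which the label is the monosaccharide name and the order of `children` stands in for the linkage order; the example tree is the common N-glycan core written by hand for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                     # monosaccharide name
    children: list = field(default_factory=list)   # ordered children

def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)

def leaves(node):
    """Leaves (nodes with no children) in left-to-right order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def height(node):
    """Number of edges on the longest root-to-leaf path."""
    return max((1 + height(child) for child in node.children), default=0)

# Rooted at the reducing end: GlcNAc - GlcNAc - Man - (Man, Man)
core = Node("GlcNAc", [Node("GlcNAc", [Node("Man", [Node("Man"), Node("Man")])])])
print(size(core), height(core), [leaf.label for leaf in leaves(core)])
```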

  16. Maximum Common Subtree Problem (MCST) • Input: Two labeled rooted trees T1 and T2. • Output: A tree that is a subtree of both T1 and T2 and whose number of edges is the maximum among all such subtrees. • Variants: Each of T1 and T2 can be ordered or unordered. Aoki et al. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: 134-143 (2003).

  17. A bottom-up dynamic programming algorithm • Let {u1, …, un} and {v1, …, vm} be the sets of nodes in T1 and T2, respectively; • R[ui, vj]: the size of the maximum common subtree of T1(ui) and T2(vj), the subtrees of T1 and T2 rooted at ui and vj, respectively; • Computed from leaves to roots (bottom-up) • MCST of T1 and T2 → R[root(T1), root(T2)] • R[ui, ∅] = R[∅, vj] = 0; • M(u, v) is a matching in a bipartite graph between the children of u and the children of v; if both T1 and T2 are ordered trees, M(u, v) = 1. Aoki et al. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: 134-143 (2003). Implemented in KEGG glycan matching and many other services.
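The recurrence itself does not survive in the transcript, so the following is only a sketch of one way to fill such a table, not the algorithm of Aoki et al. It assumes that `score[u, v]` holds the size (in nodes) of the largest common subtree that uses both u and v as its top node, that children are paired by brute force in the unordered case (feasible because a monosaccharide has only a few children) and by an LCS-style table in the ordered case, and that the answer is the best entry over all node pairs. `Node`, `best_pairing` and `mcst_size` are hypothetical names.

```python
from dataclasses import dataclass, field
from itertools import permutations

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def postorder(node):
    """Yield children before parents, so tables fill from the leaves up."""
    for child in node.children:
        yield from postorder(child)
    yield node

def best_pairing(cu, cv, score, ordered):
    """Best total score when pairing children of u with children of v."""
    if ordered:
        # Pairs must respect sibling order (weighted LCS-style table).
        dp = [[0] * (len(cv) + 1) for _ in range(len(cu) + 1)]
        for i, u in enumerate(cu, 1):
            for j, v in enumerate(cv, 1):
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                               dp[i - 1][j - 1] + score[id(u), id(v)])
        return dp[-1][-1]
    # Unordered: try every assignment of the shorter child list.
    flip = len(cu) > len(cv)
    small, large = (cv, cu) if flip else (cu, cv)
    best = 0
    for perm in permutations(large, len(small)):
        pairs = zip(perm, small) if flip else zip(small, perm)
        best = max(best, sum(score[id(u), id(v)] for u, v in pairs))
    return best

def mcst_size(t1, t2, ordered=False):
    """Node count of a maximum common subtree (edges = nodes - 1)."""
    score, best = {}, 0
    for u in postorder(t1):
        for v in postorder(t2):
            if u.label == v.label:
                r = 1 + best_pairing(u.children, v.children, score, ordered)
            else:
                r = 0
            score[id(u), id(v)] = r
            best = max(best, r)
    return best

G = lambda: Node("GlcNAc")
T1 = Node("Man", [Node("Man", [G()]), Node("Man", [G(), G()])])
T2 = Node("Man", [Node("Man", [G(), G()]), Node("Man")])
print(mcst_size(T1, T2, ordered=False), mcst_size(T1, T2, ordered=True))
```

In this example the two branches of T2 appear in the opposite order from T1, so the unordered variant finds a larger common subtree than the ordered one.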

  18. Alignment algorithm? • Complexity: unordered tree ~O(4!mn) ~ O(24mn); ordered tree ~ O(mn). Typically m, n < 25. • Extended to the MCST problem in multiple trees • Is the MCST of T1, T2 and T3 the MCST between MCST(T1, T2) and T3, where MCST(T1, T2) is the maximum common subtree of T1 and T2? • The multi-MCST problem is NP-hard (Akutsu, 2002) • Reducible from the Longest Common Subsequence problem (LCS) • Finding substructures, motif finding problem → profile models • Should we consider indels, as in DNA/protein alignments? • Indels are not natural changes, but mutations might be. • Profile HMM may not be appropriate

  19. Maximum Common Approximate Subtree Problem (MCAST) • Input: Two labeled rooted trees T1 and T2. • Output: A tree that is a k-approximate subtree of both T1 and T2 and whose number of edges is the maximum among all such subtrees. • T is a k-approximate subtree of U if one of U's subtrees can be transformed into T by replacing at most k labels.

  20. Subtree finding problem (pattern matching problem) • Input: a labeled rooted tree P and a set (database) S of labeled rooted trees. • Output: all trees in S that have a subtree matching P. • Variants: (1) P can be ordered or unordered; (2) P must be at the root; (3) P must be at the leaves • A bottom-up DP algorithm modified from the MCST algorithm; complexity O(|P|*|T|) for each T in the database.

  21. A bottom-up dynamic programming algorithm • Let {u1, …, un} and {v1, …, vm} be the sets of nodes in P and T, respectively. • R[ui, vj]: indicator of whether the subtree of P rooted at ui is a subtree of the subtree of T rooted at vj • Output → subtrees rooted at vj for which R[root(P), vj] = 1; • R[x, ∅] = R[∅, y] = 0. • R[x, y] = 1, if x = y (same label) and x or y is a leaf of P or T, respectively. • For ordered trees, match edges rather than nodes. • Variants: (1) leaves: R[x, y] = 1, if x = y and x and y are both leaves; (2) root: Output → tree T which has R[root(P), root(T)] = 1;
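The indicator R[x, y] can be illustrated with a short sketch. It is written top-down as a recursion rather than as the bottom-up table described on the slide, it covers the unordered, ordered and root-anchored variants but not the leaf-anchored one, and it assumes that a pattern node with no children matches any target node with the same label. `Node`, `embeds`, `find_matches` and `search` are hypothetical names.

```python
from dataclasses import dataclass, field
from itertools import permutations

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def embeds(p, t, ordered=False):
    """R[p, t]: does the pattern rooted at p match with its root placed on t?"""
    if p.label != t.label or len(p.children) > len(t.children):
        return False
    if not p.children:
        return True
    if ordered:
        # Greedily match pattern children to target children, left to right.
        j = 0
        for pc in p.children:
            while j < len(t.children) and not embeds(pc, t.children[j], True):
                j += 1
            if j == len(t.children):
                return False
            j += 1
        return True
    # Unordered: try every assignment of pattern children to distinct children.
    return any(all(embeds(pc, tc) for pc, tc in zip(p.children, perm))
               for perm in permutations(t.children, len(p.children)))

def all_nodes(t):
    yield t
    for child in t.children:
        yield from all_nodes(child)

def find_matches(p, t, ordered=False, root_only=False):
    """Nodes v of T with R[root(P), v] = 1 (only root(T) if root_only)."""
    candidates = [t] if root_only else all_nodes(t)
    return [v for v in candidates if embeds(p, v, ordered)]

def search(p, database, **options):
    """Names of database trees that contain the pattern."""
    return [name for name, t in database.items() if find_matches(p, t, **options)]

pattern = Node("Man", [Node("Man"), Node("GlcNAc")])
database = {
    "A": Node("GlcNAc", [Node("Man", [Node("GlcNAc"), Node("Man")])]),
    "B": Node("Man", [Node("Man"), Node("GlcNAc")]),
    "C": Node("Man", [Node("Man"), Node("Man")]),
}
print(search(pattern, database))
print(search(pattern, database, ordered=True))
print(search(pattern, database, root_only=True))
```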

  22. Significance of matching glycans • MCST of T1 and T2 has k nodes (monosaccharides) • N(T, k): # of subtrees of T with k nodes • Can be counted by a DP algorithm (how?) • P = a^(-k) × N(T1, k) × N(T2, k)
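One possible answer to the "(how?)" above is a knapsack-style DP over the tree: for each node, keep the number of connected subtrees that have that node on top and contain j nodes, and merge the children's tables one at a time. The sketch below is an illustration under assumptions, not the slide author's method. In particular, it reads `a` as the number of distinct monosaccharide labels, which the transcript does not define, and the names `Node`, `count_subtrees` and `match_significance` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def count_subtrees(root, k):
    """N(T, k): number of connected subtrees of T with exactly k nodes."""
    if k < 1:
        raise ValueError("k must be at least 1")
    total = 0

    def rooted(node):
        nonlocal total
        # c[j] = number of connected subtrees with `node` on top and j nodes.
        c = [0] * (k + 1)
        c[1] = 1
        for child in node.children:
            cc = rooted(child)
            merged = c[:]                      # case: this child is left out
            for j in range(2, k + 1):
                for i in range(1, j):          # case: child contributes i nodes
                    merged[j] += c[j - i] * cc[i]
            c = merged
        total += c[k]
        return c

    rooted(root)
    return total

def match_significance(t1, t2, k, alphabet_size):
    """P = a^(-k) * N(T1, k) * N(T2, k), with a = alphabet_size (assumption)."""
    return alphabet_size ** (-k) * count_subtrees(t1, k) * count_subtrees(t2, k)

star = Node("Man", [Node("GlcNAc"), Node("GlcNAc"), Node("GlcNAc")])
path = Node("GlcNAc", [Node("GlcNAc", [Node("Man")])])
print([count_subtrees(star, k) for k in (1, 2, 3, 4)])
print([count_subtrees(path, k) for k in (1, 2, 3)])
print(match_significance(star, path, 2, alphabet_size=10))
```

Because N(T1, k) × N(T2, k) can exceed a^k for small k, the value is not guaranteed to stay below 1, so it may be better read as an expected number of chance matches than as a strict probability.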

  23. Motif retrieval from glycans • PSTMM (Probabilistic Sibling-dependent Tree Markov Model) • Learns patterns from glycan structures • Profile PSTMM • Extracts patterns (as profiles) from glycan structures • Kernel methods • Classification of glycans • Extraction of “features” to predict glycan biomarkers

  24. Kernel method • Extracted glycan structures from CarbBank • Pre-analysis showed that the trisaccharide structure was most effective for classification • Furthermore, since the non-reducing end is usually the portion being recognized, this information was included in the kernel model

  25. Kernel method

  26. Other kernels • Q-gram distribution kernel: • Wanted to be able to analyze any data regardless of marker structure or size • Definition of q-gram: A sub-tree containing q nodes • All of the q-grams for a particular glycan were included in the kernel • Multiple kernel: • A kernel of kernels
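A q-gram distribution kernel in the sense described above can be sketched as follows: enumerate every connected q-node subtree of each glycan, encode it as a canonical string, and take the dot product of the two count vectors. This is a simplified illustration under assumptions (labels only, no linkage information, sibling order ignored, no weighting) and not the kernel used in the cited work. `Node`, `rooted_qgrams`, `qgram_profile` and `qgram_kernel` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def all_nodes(node):
    yield node
    for child in node.children:
        yield from all_nodes(child)

def child_choices(children, budget):
    """Yield lists of encoded child subtrees that use exactly `budget` nodes."""
    if budget == 0:
        yield []
        return
    if not children:
        return
    first, rest = children[0], children[1:]
    yield from child_choices(rest, budget)           # leave `first` out
    for size in range(1, budget + 1):                # give `first` `size` nodes
        for enc in rooted_qgrams(first, size):
            for tail in child_choices(rest, budget - size):
                yield [enc] + tail

def rooted_qgrams(node, q):
    """Canonical encodings of the q-node subtrees that have `node` on top."""
    grams = []
    for parts in child_choices(node.children, q - 1):
        inner = ",".join(sorted(parts))              # sorting ignores sibling order
        grams.append(f"{node.label}({inner})" if parts else node.label)
    return grams

def qgram_profile(tree, q):
    """Counts of every q-gram occurring anywhere in the tree."""
    return Counter(g for n in all_nodes(tree) for g in rooted_qgrams(n, q))

def qgram_kernel(t1, t2, q):
    """Dot product of the two q-gram count vectors."""
    p1, p2 = qgram_profile(t1, q), qgram_profile(t2, q)
    return sum(count * p2[gram] for gram, count in p1.items())

a = Node("Man", [Node("GlcNAc"), Node("GlcNAc")])
b = Node("GlcNAc", [Node("Man", [Node("GlcNAc")])])
print(qgram_profile(a, 2), qgram_profile(b, 2))
print(qgram_kernel(a, b, 2), qgram_kernel(a, b, 3))
```

Enumeration is exponential in the worst case, which is tolerable here only because glycans are small and q is small.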

  27. Using a q-gram distribution, potential biomarkers of the appropriate size can be extracted from the data

  28. Data mining for glycobiology • Kernels can be utilized in many ways • Feature retrieval methods for detecting putative biomarkers • Cell-specific glycan structures can be extracted • Sequences of glycan binding proteins can be included in a new kernel to predict binding domains • Many more possibilities, depending on the data
