Glycan database

Database of molecules

  • Two models (of vocabularies)

    • Proteins / Nucleic Acids

      • Residues (+ modifications)

      • GenBank / Swiss-Prot

    • Compounds

      • Atoms & covalent bonds (SMILES/SMARTS languages)

      • PubChem / ACS

  • Glycans

    • Residues: monosaccharides (+ many modifications)

    • Branched, nonlinear structure



Representation of glycans

  • Vocabulary

    • monosaccharides rather than atoms

  • Two challenges

    • Controlled vocabulary of monosaccharides

      • GlycoCT

    • From residues to molecules: glycan exchange format

      • GLYDE-II


Searching the glycan database: comparison

  • Glycan representation

    • tree vs. sequences

  • Glycan matching

    • exact vs. non-exact

      • Graph theoretic algorithm

    • alignment? Mutations are natural events.

    • Multiple glycan matching

  • Glycan pattern searching

    • Significance estimation



GLYDE standard

  • An XML-based representation format for glycan structures

  • Interconvertible with existing data represented in IUPAC or LINUCS notation

  • GLYDE-II: incorporation of a probability-based representation

  • Visualization: rendering structures from GLYDE (XML) files

GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340 (18), Dec 30, 2005.


Collaborative GlycoInformatics

  • Enable querying and export of query results in GLYDE format

  • Using GLYDE representation for disambiguation, mapping and matching

Diagram: GLYDE as the common exchange format among MonosaccharideDB, SweetDB, and KEGG. Both the QUERY and the RESULT are carried as GLYDE documents of the form <glyde><residue> ... </residue></glyde>.


Semantic GlycoInformatics - Ontologies

  • GlycO: A domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolisms of glycans)

    • Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy

    • URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco

  • ProPreO: a comprehensive process ontology modeling experimental proteomics

    • Contains 330 classes, 6 million+ instances

    • Models three phases of experimental proteomics

    • URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo


GlycO taxonomy

The first levels of the GlycO taxonomy

Most relationships and attributes in GlycO

GlycO exploits the expressiveness of OWL-DL.

Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.


<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
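A nested <residue> listing like the one above maps directly onto a rooted, labeled tree. The sketch below reads a shortened copy of that structure into nested (label, children) tuples using Python's standard xml.etree.ElementTree. It assumes only the attribute names visible on the slide (monosaccharide, link); the helper names residue_to_tree and parse_glycan are hypothetical and are not part of GLYDE.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's structure (same element and attribute names).
GLYCAN_XML = """
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
      <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man"/>
      <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man"/>
    </residue>
  </residue>
</Glycan>
"""


def residue_to_tree(elem):
    """Convert a <residue> element into a (label, children) tuple.

    Children are sorted by their link attribute so the result is an
    ordered tree (an assumption of this sketch, not a GLYDE rule).
    """
    label = elem.get("monosaccharide")
    kids = sorted(elem.findall("residue"), key=lambda e: int(e.get("link")))
    return (label, tuple(residue_to_tree(k) for k in kids))


def parse_glycan(xml_text):
    """Return the tree rooted at the first residue under <Glycan>."""
    root = ET.fromstring(xml_text.strip())
    return residue_to_tree(root.find("residue"))


tree = parse_glycan(GLYCAN_XML)
print(tree)
```

Nested tuples are hashable, which makes them convenient keys for the memoized tree-matching routines discussed later in the deck.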


ProPreO

  • ProPreO: a process ontology that captures the proteomics experimental lifecycle:

    • Separation

    • Mass spectrometry

    • Analysis

    • 330 classes

    • 110 properties

    • 6 million+ instances


Usage: Mass spectrometry analysis

Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.

Goldberg et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875


P(S | M = 3461.57) = 0.6

P(T | M = 3461.57) = 0.4



Graph Theoretic Basics

  • Tree: an acyclic connected graph, whose vertices we refer to as nodes;

  • Rooted tree: a tree with a designated node called the root, from which the rest of the tree extends;

  • Children: the nodes that extend from a node x by one edge are the children of x; conversely, x is the parent of these children;

  • Leaf: a node with no children;

  • Subtree: a subtree of a tree T is a tree whose nodes and edges are subsets of those of T;

  • Ordered tree: a rooted tree in which the children of each node are ordered;

  • Labeled tree: a tree in which a label is attached to each node;

  • Forest: a set of trees.

  • Oligosaccharides can be represented as labeled (by monosaccharide), ordered (if linkages are specified), rooted trees.
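The definitions above translate into a very small data structure. The following is a minimal sketch, not a format used by any of the databases mentioned: the class name Node, the link field (standing in for the linkage position that orders siblings), and the helper leaves are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # monosaccharide name, e.g. "Man"
    link: int = 0                   # hypothetical: linkage position used to order siblings
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child and keep siblings sorted by link (ordered tree)."""
        self.children.append(child)
        self.children.sort(key=lambda c: c.link)
        return self

    def is_leaf(self):
        """A leaf is a node with no children."""
        return not self.children

    def nodes(self):
        """Yield this node and all of its descendants (pre-order)."""
        yield self
        for child in self.children:
            yield from child.nodes()


def leaves(root):
    """Labels of all leaves below (and including) root."""
    return [n.label for n in root.nodes() if n.is_leaf()]


# A branching mannose with two arms, added out of order on purpose.
core = Node("Man", 4).add(Node("Man", 6)).add(Node("Man", 3))
print([c.link for c in core.children], leaves(core))
```

Sorting siblings by linkage is one way to obtain the "ordered" property; when linkages are unknown the tree is simply treated as unordered.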


Maximum Common Subtree Problem (MCST)

  • Input: Two labeled rooted trees T1 and T2.

  • Output: A tree that is a subtree of both T1 and T2 and has the maximum number of edges among all such common subtrees.

  • Variants: Each of T1 and T2 can be ordered or unordered.

Aoki et al. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: 134-143 (2003).


A bottom-up dynamic programming algorithm

  • Let {u1, …, un} and {v1, …, vm} be the sets of nodes in T1 and T2, respectively;

  • R[ui, vj]: the size of the maximum common subtree of T1(ui) and T2(vj), the subtrees of T1 and T2 rooted at ui and vj, respectively;

    • Computed from leaves to roots (bottom-up)

    • MCST of T1 and T2 → R[root(T1), root(T2)]

  • R[ui, ∅] = R[∅, vj] = 0;

  • M(u, v) is a matching in the bipartite graph between the children of u and the children of v; if both T1 and T2 are ordered trees, there is a single candidate matching (written M(u, v) = 1).

Aoki et al. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: 134-143 (2003). Implemented in KEGG glycan matching and many other services.
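The bottom-up idea can be sketched in a few lines of Python. This is an illustrative variant for unordered trees, not the exact recurrence of Aoki et al.: it first computes, for every node pair, the largest common subtree that maps u onto v (trying every child matching by brute force, which is feasible because a residue has only a handful of children), and then takes the best pair. Trees are nested (label, children) tuples; the function names are hypothetical, and sizes are counted in nodes (edges = nodes - 1).

```python
from functools import lru_cache
from itertools import permutations


@lru_cache(maxsize=None)
def rooted_common(u, v):
    """Nodes in the largest common subtree that maps root u onto root v."""
    (label_u, kids_u), (label_v, kids_v) = u, v
    if label_u != label_v:
        return 0
    # Make kids_u the shorter list so each of its children gets a distinct partner.
    if len(kids_u) > len(kids_v):
        kids_u, kids_v = kids_v, kids_u
    best = 0
    # Brute-force bipartite matching between the two child sets.
    for perm in permutations(kids_v, len(kids_u)):
        best = max(best, sum(rooted_common(a, b) for a, b in zip(kids_u, perm)))
    return 1 + best


def nodes(tree):
    """Yield every subtree (node) of a nested-tuple tree."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)


def mcst_size(t1, t2):
    """Size in nodes of a maximum common subtree of t1 and t2."""
    return max(rooted_common(u, v) for u in nodes(t1) for v in nodes(t2))


t1 = ("GlcNAc", (("Man", (("Man", ()), ("Man", ()))),))
t2 = ("GlcNAc", (("Man", (("Man", ()), ("Gal", ()))),))
print(mcst_size(t1, t2))
```

The memoization cache plays the role of the table R: each pair of nodes is evaluated once, and the permutation loop is the source of the constant factor for unordered trees.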


Alignment algorithm?

  • Complexity: unordered tree ~O(4!mn) ~ O(24mn); ordered tree ~ O(mn). Typically m, n < 25.

  • Extended to MCST problem in multiple trees

    • Is the MCST of T1, T2 and T3 the same as the MCST of MCST(T1, T2) and T3, where MCST(T1, T2) is the maximum common subtree of T1 and T2?

    • Multi-MCST problem is NP-hard (Akutsu, 2002)

      • Reducible from the Longest Common Subsequence problem (LCS)

    • Finding substructures, motif finding problem → profile models

  • Should we consider indels, as in DNA/protein alignments?

    • Indels are not natural changes, but mutations might be.

    • Profile HMM may not be appropriate


Maximum Common Approximate Subtree Problem (MCAST)

  • Input: Two labeled rooted trees T1 and T2.

  • Output: A tree that is a k-approximate subtree of both T1 and T2 and has the maximum number of edges among all such trees.

  • T is a k-approximate subtree of U if one of U’s subtrees can be transformed into T by replacing at most k labels.
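The k-approximate subtree definition can be checked directly on small trees. The sketch below is one possible reading, assuming ordered trees stored as nested (label, children) tuples and assuming that T's children must map to U's children in order; min_replacements and is_k_approx_subtree are hypothetical names, and the brute-force choice of children is only meant for glycan-sized inputs.

```python
from itertools import combinations
from math import inf


def min_replacements(t, u):
    """Fewest label replacements that turn a subtree of u, rooted at u's root, into t.

    Returns inf when t cannot be embedded at this root at all.
    """
    (label_t, kids_t), (label_u, kids_u) = t, u
    if len(kids_t) > len(kids_u):
        return inf
    cost_here = 0 if label_t == label_u else 1
    best = inf
    # combinations() keeps the left-to-right order of u's children.
    for picked in combinations(kids_u, len(kids_t)):
        best = min(best, sum(min_replacements(a, b) for a, b in zip(kids_t, picked)))
    return cost_here + best


def nodes(tree):
    """Yield every subtree (node) of a nested-tuple tree."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)


def is_k_approx_subtree(t, u, k):
    """True if some subtree of u becomes t after at most k label replacements."""
    return min(min_replacements(t, v) for v in nodes(u)) <= k


U = ("GlcNAc", (("Man", (("Man", ()), ("Man", ()))),))
T = ("GlcNAc", (("Gal", (("Man", ()),)),))
print(is_k_approx_subtree(T, U, 0), is_k_approx_subtree(T, U, 1))
```

Here T differs from a subtree of U only in one label (Gal in place of Man), so it qualifies for k = 1 but not for k = 0.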


Subtree finding problem (pattern matching problem)

  • Input: a labeled rooted tree P and a set (database) S of labeled rooted trees.

  • Output: all trees in S that contain a subtree matching P.

  • Variants: (1) P can be ordered or unordered; (2) P must match at the root; (3) P must match at the leaves

  • A bottom-up DP algorithm modified from the MCST algorithm; complexity O(|P|*|T|) for each T in the database.


A bottom-up dynamic programming algorithm

  • Let {u1, …, un} and {v1, …, vm} be the sets of nodes in P and T, respectively.

  • R[ui, vj]: indicator of whether the subtree of P rooted at ui matches a subtree of T rooted at vj

    • Output → the subtrees rooted at each vj with R[root(P), vj] = 1;

  • R[x, ∅] = R[∅, y] = 0.

  • R[x, y] = 1 if x and y have the same label and x is a leaf of P.

  • For ordered trees, match edges rather than nodes.

  • Variants: (1) leaves: R[x, y] = 1 if x and y have the same label and are both leaves; (2) root: output → each tree T with R[root(P), root(T)] = 1.
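A database search built on the indicator R could look like the sketch below. It handles the unordered case and the "root" variant only, with trees as nested (label, children) tuples; the names matches_at, contains and search, and the toy database, are hypothetical. Child matchings are tried by brute force, which stands in for the bipartite matching step.

```python
from itertools import permutations


def matches_at(p, t):
    """R[x, y]: does pattern p embed in t with p's root mapped onto t's root?"""
    (label_p, kids_p), (label_t, kids_t) = p, t
    if label_p != label_t or len(kids_p) > len(kids_t):
        return False
    # A leaf of the pattern has no children, so the empty assignment succeeds.
    return any(
        all(matches_at(a, b) for a, b in zip(kids_p, perm))
        for perm in permutations(kids_t, len(kids_p))
    )


def contains(p, t, root_only=False):
    """True if p matches at t's root, or (unless root_only) anywhere below it."""
    if matches_at(p, t):
        return True
    if root_only:
        return False
    return any(contains(p, child) for child in t[1])


def search(p, database, root_only=False):
    """Names of all database trees that contain the pattern."""
    return [name for name, t in database.items() if contains(p, t, root_only)]


database = {
    "A": ("GlcNAc", (("Man", (("Man", ()), ("Man", ()))),)),
    "B": ("GlcNAc", (("Gal", ()),)),
}
pattern = ("Man", (("Man", ()),))
print(search(pattern, database), search(pattern, database, root_only=True))
```

The "leaves" variant would add one more condition in matches_at: a pattern leaf may only be mapped onto a leaf of the target tree.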


Significance of matching glycans

  • MCST of T1 and T2 has k nodes (monosaccharides)

  • N(T, k): number of subtrees of T with k nodes

    • Can be counted by a DP algorithm (how?)

  • P = a^(-k) × N(T1, k) × N(T2, k)
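One way to answer the "how?" is a knapsack-style DP over children: for each node, count the connected subtrees of each size that have that node as their top node, then sum over all nodes. The sketch below assumes that "subtree" means any connected set of nodes, reads the slide's expression as a^(-k) × N(T1, k) × N(T2, k), and treats a as a caller-supplied parameter (taking it to be the number of distinct monosaccharide labels is an assumption, since the slide does not define it). Function names are hypothetical.

```python
def rooted_counts(tree, k):
    """counts[j] = number of connected subtrees with j nodes whose top node is tree's root."""
    _, children = tree
    counts = [0] * (k + 1)
    if k >= 1:
        counts[1] = 1  # the root on its own
    for child in children:
        sub = rooted_counts(child, k)
        merged = counts[:]  # option 1: leave this child out entirely
        for i in range(1, k + 1):
            for j in range(1, k - i + 1):
                # option 2: attach a j-node subtree hanging from this child
                merged[i + j] += counts[i] * sub[j]
        counts = merged
    return counts


def nodes(tree):
    """Yield every subtree (node) of a nested-tuple tree."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)


def count_subtrees(tree, k):
    """N(T, k): connected k-node subtrees of T, summed over every possible top node."""
    return sum(rooted_counts(v, k)[k] for v in nodes(tree))


def slide_score(t1, t2, k, a):
    """The slide's quantity a^(-k) * N(T1, k) * N(T2, k); a is an assumed alphabet size."""
    return a ** (-k) * count_subtrees(t1, k) * count_subtrees(t2, k)


star = ("A", (("B", ()), ("C", ()), ("D", ())))
path = ("A", (("B", (("C", ()),)),))
print([count_subtrees(star, k) for k in (1, 2, 3, 4)])
```

For clarity the sketch recomputes rooted_counts at every node; a single post-order pass that stores the per-node tables would avoid the repeated work.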


Motif retrieval from glycans

  • PSTMM (Probabilistic Sibling-dependent Tree Markov Model)

    • Learns patterns from glycan structures

  • Profile PSTMM

    • Extracts patterns (as profiles) from glycan structures

  • Kernel methods

    • Classification of glycans

    • Extraction of “features” to predict glycan biomarkers


Kernel method

  • Extracted glycan structures from CarbBank

  • Pre-analysis showed that the trisaccharide structure was most effective for classification

  • Furthermore, since the non-reducing end is usually the portion being recognized, this information was included in the kernel model



Other kernels

  • Q-gram distribution kernel:

    • Wanted to be able to analyze any data regardless of marker structure or size

    • Definition of q-gram: A sub-tree containing q nodes

    • All of the q-grams for a particular glycan were included in the kernel

  • Multiple kernel:

    • A kernel of kernels


Using a q-gram distribution, potential biomarkers of the appropriate size can be extracted from the data
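To make the q-gram idea concrete, here is a deliberately simplified sketch. The slide defines a q-gram as any subtree with q nodes; the code below restricts q-grams to parent-to-descendant chains of q labels, which is an assumption made to keep the example short, and it uses a plain dot product of q-gram counts as the kernel value. Trees are nested (label, children) tuples and both function names are hypothetical.

```python
from collections import Counter


def path_qgrams(tree, q):
    """Count downward label chains of length q (a simplified notion of q-gram)."""
    grams = Counter()

    def walk(node, trail):
        label, children = node
        trail = (trail + (label,))[-q:]  # keep only the last q labels on the path
        if len(trail) == q:
            grams[trail] += 1
        for child in children:
            walk(child, trail)

    walk(tree, ())
    return grams


def qgram_kernel(t1, t2, q):
    """Dot product of the two q-gram count vectors."""
    g1, g2 = path_qgrams(t1, q), path_qgrams(t2, q)
    return sum(n * g2[gram] for gram, n in g1.items())


t1 = ("GlcNAc", (("Man", (("Man", ()), ("Man", ()))),))
t2 = ("GlcNAc", (("Man", (("Man", ()), ("Gal", ()))),))
print(path_qgrams(t1, 2))
print(qgram_kernel(t1, t2, 2))
```

Because the feature vector is an explicit table of q-grams, the q-grams with the largest weights in a trained classifier can be read off directly as candidate biomarkers.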


Data mining for glycobiology

  • Kernels can be utilized in many ways

    • Feature retrieval methods for detecting putative biomarkers

    • Cell-specific glycan structures can be extracted

    • Sequences of glycan binding proteins can be included in a new kernel to predict binding domains

    • Many more possibilities, depending on the data

