Visualization and Classification of DNA sequences using Pareto learning Self Organizing Maps based on Frequency and Correlation Coefficient

Hiroshi Dozono

Saga University

Introduction (1)
  • The first step of genome analysis is DNA sequencing, which determines the order of the nucleotides in a DNA molecule.
  • About 10 years ago, DNA sequencing was expensive and slow.
  • Recently, Next Generation Sequencing (NGS) has made it possible to read sequences very rapidly and at low cost.
    • $100 to $1000 in 1 hour.
  • NGS produces large amounts of sequence data at once.
    • Gigabytes to terabytes
Introduction (2)
  • After reading the sequences, further analyses are conducted.
    • Identify the organisms
    • Identify the functions of the genome
    • Remap the sequences onto reference sequences
    • Compare the genomes among organisms
  • Comparing genomes precisely requires a large amount of computation.
    • The sequence alignment method is generally used.
    • Sequence alignment is effective for pairwise comparison or for comparing a small number of sequences.
    • Comparing a large number of sequences requires a large amount of computation.
  • Statistical information about the sequences can serve as an indicator of the similarity among the sequences.
DNA sequencing
  • DNA sequence
    • A sequence of 4 types of nucleotide: A, G, T, and C
    • Complementary nucleotides hybridize with each other.
  • DNA sequencing - Genome analysis
    • Next generation sequencers can read all the DNA sequences of one or several organisms at once.
    • A large amount of sequencing data (from several gigabytes to terabytes) is produced.
      • The result of sequencing is obtained as a collection of short fragments of the nucleotides A, G, T, and C.
    • An effective method for identifying the features of the sequences is required.
Conventional DNA analysis
  • Sequencing
  • Reconstruction of the sequence
  • Identification of the coding regions, which encode genes
  • Identification of the function of genes
    • These steps incur large computational costs after sequencing.
    • Our approach aims to extract global features of the DNA sequences without precise analysis.
Frequency based SOM
  • A SOM that uses the frequencies of N-tuples in DNA sequences as the input vector was proposed in:

T. Abe, T. Ikemura, Informatics for unveiling hidden genome signatures, Genome Res., vol. 13, pp. 693-702

  • For N-tuples, the dimension of the input vector is 4^N.
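
As an illustration of the frequency-based input vector, the following is a minimal Python sketch that counts all N-tuples in a sequence and normalizes the counts. The function name ntuple_frequencies, the A, C, G, T ordering of the tuples, and the choice to skip windows containing ambiguous bases are assumptions made for this example; they are not taken from the slides or from the cited paper.

```python
from itertools import product

def ntuple_frequencies(seq, n):
    """Return the normalized frequencies of all 4**n N-tuples in seq."""
    alphabet = "ACGT"
    # Fixed ordering of all 4**n possible tuples (AA, AC, AG, ... for n = 2).
    index = {"".join(t): i for i, t in enumerate(product(alphabet, repeat=n))}
    counts = [0] * len(index)
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if window in index:  # assumption: skip windows with ambiguous bases such as N
            counts[index[window]] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

For 4-tuples the returned vector has 4^4 = 256 elements, so the input dimension grows quickly with N.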
SOM based on correlation coefficients of nucleotides.

Correlation Coefficients (CC) of a DNA sequence

Base  Indicator sequence  Correlation coefficient
A     1000010010          ρAA(n): CC between A and n-shifted A
C     0101001000          ρAC(n): CC between A and n-shifted C
G     0010000001          ...
T     0000100100          ρTT(n): CC between T and n-shifted T

For all combinations of A, G, T, and C and for shifts from 1 to n, 4 x 4 x n correlation coefficients are calculated and used as the input vector of the SOM.

Compared with the dimension of the n-tuple frequencies (4^n), the dimension of the CC vector is much smaller.

Using these equations, correlation coefficients can be calculated without converting DNA sequences to binary sequences.
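
The equations from the slide are not reproduced in this transcript. As a hedged illustration, the sketch below assumes the CC is the Pearson correlation between the indicator sequence of one base and the shifted indicator sequence of another, which for 0/1 sequences can be written purely in terms of counts taken directly from the DNA string. The function name cc_vector and the ordering of the output are assumptions for this example.

```python
from math import sqrt

ALPHABET = "ACGT"

def cc_vector(seq, max_shift):
    """Return 4 x 4 x max_shift correlation coefficients for seq.

    Assumption: rho_XY(n) is the Pearson correlation between
    [seq[i] == X] and [seq[i + n] == Y], computed from counts only.
    """
    seq = seq.upper()
    vec = []
    for n in range(1, max_shift + 1):
        m = len(seq) - n  # number of (i, i + n) position pairs
        if m <= 0:
            vec.extend([0.0] * 16)  # sequence too short for this shift
            continue
        head, tail = seq[:m], seq[n:]
        pair_counts = {}
        for a, b in zip(head, tail):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
        for x in ALPHABET:
            px = head.count(x) / m
            for y in ALPHABET:
                py = tail.count(y) / m
                pxy = pair_counts.get((x, y), 0) / m
                denom = sqrt(px * (1 - px) * py * (1 - py))
                # A base that never (or always) occurs has zero variance; use 0.
                vec.append((pxy - px * py) / denom if denom > 0 else 0.0)
    return vec
```

With shifts 1 to 4 the vector has 4 x 4 x 4 = 64 elements, compared with 256 for 4-tuple frequencies.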
Experimental results of SOM based on correlation coefficients
  • Settings of the experiments
  • Set 1: genes from the amino acid metabolism of 6 species
  • Set 2: genes from 7 metabolic pathways of Homo sapiens
  • The sequences are segmented into fragments of 1000 bases.
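
The segmentation step can be sketched as follows. The slides do not say whether the fragments overlap or how a trailing remainder is handled, so this example assumes non-overlapping fragments and drops any final fragment shorter than the requested length; the function name segment is hypothetical.

```python
def segment(seq, length=1000):
    """Split seq into consecutive non-overlapping fragments of `length` bases.

    Assumption: a trailing fragment shorter than `length` is discarded.
    """
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]
```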
Experimental results of Set 1(1)
  • The resolution and topology of these maps are almost comparable.
  • Map of frequencies of 4-tuples from 6 species, L=256
  • Map of CC of 1-4 shifts from 6 species, L=64
Experimental results of Set 1(2)
  • For small dimensions, CC shows better separation.
Experimental results of Set 2
  • The genes from the metabolic pathways of Homo sapiens cannot be clearly clustered.
Experiments of identification of sequences
  • 70% of the sequence fragments are used for learning, and the remainder are used for testing.
  • The experiments are conducted using a conventional SOM and the Supervised Pareto learning SOM proposed by Dozono, which combines the integration of multi-modal vectors, visualization, and supervised learning.
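
A 70/30 split of the fragments could look like the sketch below. The slides do not state how the fragments were assigned to the two groups, so the random shuffle, the fixed seed, and the function name split_fragments are assumptions for illustration only.

```python
import random

def split_fragments(fragments, train_ratio=0.7, seed=0):
    """Shuffle the fragments and split them into learning and test sets.

    Assumption: a random split with a fixed seed for reproducibility.
    """
    shuffled = list(fragments)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```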
Winner and updated units
  • Conventional SOM
  • Pareto learning SOM
  • Units in overlapping neighborhoods are updated more strongly.
  • This plays an important role in the integration of multi-modal vectors.
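
The difference between the two update rules can be sketched in Python. In this sketch, which is an illustration under stated assumptions rather than the author's implementation, every unit that is Pareto optimal with respect to the per-modality distances is treated as a winner, and the Gaussian neighborhoods of all winners are summed so that overlapping neighbors receive a larger update. The function names, the Gaussian neighborhood, the learning rate, and the cap on the step size are all assumptions.

```python
from math import dist, exp

def pareto_winners(dists):
    """dists[u][k] is the distance of unit u to the input in modality k.

    Returns the indices of units not dominated by any other unit.
    """
    def dominates(v, u):
        # v is no worse in every modality and strictly better in at least one.
        return all(a <= b for a, b in zip(v, u)) and any(a < b for a, b in zip(v, u))
    return [u for u, du in enumerate(dists)
            if not any(dominates(dv, du) for dv in dists)]

def psom_update(weights, grid, x, lr=0.1, sigma=1.0):
    """One learning step; modifies weights in place and returns the winners.

    weights[u][k]: reference vector (list of floats) of unit u for modality k.
    grid[u]: (row, col) map position of unit u.
    x[k]: input vector for modality k.
    """
    dists = [[dist(w_k, x_k) for w_k, x_k in zip(w_u, x)] for w_u in weights]
    winners = pareto_winners(dists)
    for u, w_u in enumerate(weights):
        # Sum the neighborhoods of all winners; overlapping neighborhoods add up.
        h = sum(exp(-dist(grid[u], grid[w]) ** 2 / (2 * sigma ** 2)) for w in winners)
        rate = min(lr * h, 1.0)  # assumption: cap the step to avoid overshooting
        for w_k, x_k in zip(w_u, x):
            for i, xi in enumerate(x_k):
                w_k[i] += rate * (xi - w_k[i])
    return winners
```

A conventional SOM corresponds to the special case in which there is a single modality, so exactly one unit (or a set of tied units) is the winner.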
Supervised Pareto learning SOM(SP-SOM)
  • In P-SOM, a category vector can be attached to each input vector as an independent vector.
    • The category vector, in cooperation with the other input vectors, attracts input vectors of the same category close together on the map.
    • The P-SOM learning algorithm thereby becomes supervised.
  • The category of a test vector xt is determined as follows.
    • where P(xt) is the Pareto optimal set of units for xt
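
The decision rule itself is not reproduced in this transcript. One plausible reading, offered only as an assumption, is that the category vectors stored in the units of P(xt) are summed and the category with the largest total is chosen. The sketch below implements that reading; the function name classify and the data layout are hypothetical.

```python
def classify(category_vectors, pareto_set):
    """Return the index of the category with the largest summed score.

    category_vectors[u][k]: learned score of category k stored in unit u.
    pareto_set: indices of the units in P(xt).
    Assumption: the category is the argmax of the summed category vectors.
    """
    n_categories = len(category_vectors[0])
    totals = [sum(category_vectors[u][k] for u in pareto_set)
              for k in range(n_categories)]
    return max(range(n_categories), key=totals.__getitem__)
```

Here P(xt) would be computed from the input modalities only, since the category of a test vector is unknown.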
Conclusions (1)
  • We proposed a preprocessing method for DNA sequences by using correlation coefficients of the occurrence of the nucleotides.
  • Using this method, the clustering results for the sequences were nearly comparable to those obtained using the frequencies of the N-tuples, despite the difference in the length of the input vectors.
Conclusions (2)
  • The Pareto learning SOM method was applied to the classification of DNA sequences using correlation coefficients and frequencies as input vectors.
  • The Pareto learning SOM using CC as the input vector shows good classification performance compared with conventional SOMs and with frequencies as the input vector.
Future work
  • Application of this method to additional types of sequence data, such as coding and non-coding regions, and to large data sets such as whole genomes.
  • Reduction of the computational cost of P-SOMs, which is 5 times more than that of conventional SOMs.
  • This work was supported by JSPS KAKENHI Grant Number 24500279.