Visualization and Classification of DNA sequences

Hiroshi Dozono, Saga University


Visualization and Classification of DNA sequences using Pareto learning Self Organizing Maps based on Frequency and Correlation Coefficient

Hiroshi Dozono

Saga University

Introduction (1)

  • The first step of genome analysis is DNA sequencing, which identifies the order of nucleotides in DNA.

  • About 10 years ago, DNA sequencing required high cost and a long time.

  • Recently, Next Generation Sequencing (NGS) has made it possible to read sequences very rapidly at low cost.

    • $100〜$1000 in 1 hour.

  • NGS produces large amounts of sequence data at once.

    • Gbytes〜Tbytes

Introduction (2)

  • After reading the sequences, further analyses are conducted.

    • Identify the organisms

    • Identify the functions of genome

    • Remap the sequences on reference sequences

    • Comparison of the genomes among organisms

  • Comparing genomes precisely requires a large amount of computation.

    • The sequence alignment method is generally used.

    • Sequence alignment is effective for pairwise comparison or for comparing a small number of sequences.

    • It requires a large amount of computation when comparing a large number of sequences.

  • Statistical information about the sequences can serve as an indicator of the similarity among the sequences.

DNA sequencing

  • DNA sequence

    • Sequence of 4 types of nucleotides: A, G, T, C

    • Complementary nucleotides hybridize with each other.

      A-T, G-C




  • DNA sequencing - Genome analysis

    • Next generation sequencers can read all DNA sequences of an organism, or of several organisms, at once.

    • A large amount of sequencing data (from several Gbytes to Tbytes) is produced.

      • The result of sequencing is obtained as a collection of short fragments of the nucleotides A, G, T and C.

    • An effective method for identifying the features of the sequences is required.
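The pairing rule on this slide (A-T, G-C) can be written as a small lookup. This is only an illustrative sketch: the function name `complement` is hypothetical and not part of the presentation.

```python
# Pairing rule from the slide: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Return the complementary nucleotide for every position of seq."""
    return seq.upper().translate(COMPLEMENT)
```

For example, `complement("ACGCTACTAG")` gives `"TGCGATGATC"`.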

Conventional DNA analysis

  • Sequencing

  • Reconstruction of the sequence

  • Identification of coding regions, which encode genes

  • Identification of the functions of genes

    • These steps require large computational costs after sequencing.

    • Our approach aims to extract global features of DNA sequences without precise analysis.

Frequency-based SOM

  • A SOM that uses the frequencies of N-tuples in DNA sequences as the input vector is proposed in

    T. Abe, T. Ikemura, Informatics for unveiling hidden genome signatures, Genome Res., vol.13, p.693-702

  • For N-tuples, the dimension of the input vector is 4^N.
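A minimal sketch of such an N-tuple frequency vector follows. Several details are assumptions, since the slide does not state them: tuples are counted in overlapping windows, counts are normalized to relative frequencies, tuples are ordered lexicographically over A, C, G, T, and windows containing other symbols are skipped. The name `ntuple_frequencies` is hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"

def ntuple_frequencies(seq, n):
    """Return relative frequencies of all N-tuples as a list of length 4**n."""
    seq = seq.upper()
    # Fixed ordering of the 4**n possible tuples (assumed: lexicographic over ACGT).
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=n))}
    counts = [0] * len(index)
    total = 0
    # Assumed: overlapping windows that advance one base at a time.
    for i in range(len(seq) - n + 1):
        pos = index.get(seq[i:i + n])  # None for windows with symbols outside ACGT
        if pos is not None:
            counts[pos] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

With `n=4` the returned vector has 4^4 = 256 entries, which illustrates how quickly the input dimension grows with N.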

SOM based on correlation coefficients of nucleotides

Correlation Coefficients (CC) of DNA sequence

A 1000010010    ρAA(n): CC between A and n-shifted A

C 0101001000    ρAC(n): CC between A and n-shifted C

G 0010000001    ...

T 0000100100    ρTT(n): CC between T and n-shifted T

For all combinations of A, G, T, C and for shifts from 1 to n, 4×4×n correlation coefficients are calculated and used as the input vector of the SOM.

Compared with the dimension for n-tuples (4^n), the dimension of the CC vector is much smaller.
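The construction above can be sketched as follows. The four indicator rows on the slide correspond to the sequence ACGCTACTAG. The slide does not give the exact formula, so this sketch makes several assumptions: the CC is taken to be the Pearson correlation between one indicator sequence and a second one shifted by n positions, the non-overlapping ends are truncated rather than wrapped around, and a coefficient is set to 0 when one of the indicators is constant. The ordering of the output vector and all function names are likewise hypothetical.

```python
from math import sqrt

ALPHABET = "ACGT"

def indicator(seq, base):
    """0/1 sequence marking the positions where `base` occurs."""
    return [1 if ch == base else 0 for ch in seq]

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    m = len(x)
    if m == 0:
        return 0.0
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumed convention for a constant indicator
    return cov / sqrt(vx * vy)

def cc_vector(seq, max_shift):
    """4 x 4 x max_shift correlation coefficients for shifts 1..max_shift."""
    seq = seq.upper()
    ind = {b: indicator(seq, b) for b in ALPHABET}
    vec = []
    for x in ALPHABET:
        for y in ALPHABET:
            for n in range(1, max_shift + 1):
                # Pair position i of x with position i + n of y (truncated shift).
                vec.append(pearson(ind[x][:-n], ind[y][n:]))
    return vec
```

For shifts 1 to 4 the vector has 4×4×4 = 64 entries, compared with 256 entries for frequencies of 4-tuples.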

Experimental results of SOM based on correlation coefficients

  • Settings of the experiments

  • Set 1: genes from amino acid metabolism of 6 species

  • Set 2: genes from 7 metabolic pathways of Homo sapiens

  • The sequences are segmented into fragments of 1000 bases.
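The segmentation step might look like the sketch below. The slide does not say whether fragments overlap or what happens to a final fragment shorter than 1000 bases; here they are assumed to be non-overlapping, and a short tail is dropped. The name `segment` is hypothetical.

```python
def segment(seq, size=1000):
    """Split seq into consecutive, non-overlapping fragments of `size` bases.

    A trailing piece shorter than `size` is discarded (an assumption).
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]
```

A 2500-base sequence yields two 1000-base fragments under these assumptions.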

Experimental results of Set 1 (1)

  • The resolution and topology of these maps are nearly comparable.

  • Map of frequencies of 4-tuples from 6 species (L=256)

  • Map of CC of 1-4 shifts from 6 species (L=64)

Experimental results of Set 1 (2)

  • For small dimensions, CC shows better separation.

Experimental results of Set 2

  • The genes from the metabolic pathways of Homo sapiens cannot be clearly clustered.

Experiments on identification of sequences

  • 70% of the sequence fragments are used for learning, and the remainder are used for testing.

  • The experiments are conducted using a conventional SOM and the Supervised Pareto learning SOM proposed by Dozono, which combines the integration of multi-modal vectors, visualization, and supervised learning.

  • Conventional SOM

  • Pareto learning SOM

  • Overlapped neighbors are updated more strongly.

  • This plays an important role in the integration of multi-modal vectors.
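For reference, one training step of a conventional SOM, the baseline in this comparison, can be sketched as below. This is the textbook update, not the Pareto learning variant, whose update rule is not given in the transcript. The Gaussian neighborhood, the learning rate `lr`, the width `sigma`, and the function names are all assumptions chosen for illustration.

```python
from math import exp

def best_matching_unit(weights, x):
    """Grid position (row, col) of the unit whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def som_step(weights, x, lr=0.5, sigma=1.0):
    """Move every unit toward x, weighted by its grid distance to the winner."""
    br, bc = best_matching_unit(weights, x)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            # Gaussian neighborhood around the best matching unit (assumed form).
            h = exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
            for k in range(len(w)):
                w[k] += lr * h * (x[k] - w[k])
    return (br, bc)
```

Here `weights` is a nested list indexed as `weights[row][col][dimension]` and is updated in place. In a conventional SOM there is a single winner per input, whereas the slide indicates that in the Pareto learning SOM several neighborhoods can overlap and reinforce each other.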

Supervised Pareto learning SOM (SP-SOM)

  • The category vector can be introduced into the P-SOM as an independent vector attached to each input vector.

    • The category vector, cooperating with the other input vectors, attracts input vectors of the same category close together on the map.

    • The P-SOM learning algorithm becomes supervised.

  • Category of test vector xt is determined as follows.

    • where P(xt) is the Pareto optimal set of units for xt
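The notion of a Pareto optimal set of units can be illustrated with the standard non-dominance test. This sketch assumes that each unit is scored by one distance per modality (for example, one for the frequency vector and one for the CC vector) and that smaller distances are better. It does not reproduce how the category is then determined from the set, since that formula is not present in the transcript. The function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every modality and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_optimal_units(distances):
    """Indices of units whose distance tuples are not dominated by any other unit.

    `distances[i]` holds the per-modality distances between unit i and the input.
    """
    return [
        i for i, d in enumerate(distances)
        if not any(dominates(o, d) for j, o in enumerate(distances) if j != i)
    ]
```

With distances `[(0.1, 0.9), (0.5, 0.5), (0.9, 0.1), (0.6, 0.6)]`, the first three units are Pareto optimal and the fourth is dominated by the second.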

Conclusions (1)

  • We proposed a preprocessing method for DNA sequences by using correlation coefficients of the occurrence of the nucleotides.

  • Using this method, the clustering results of the sequences were nearly comparable to those obtained using the frequencies of the N-tuples, despite the difference in the length of the input vectors.

Conclusions (2)

  • The Pareto learning SOM method is applied to the classification of DNA sequences by using correlation coefficients and frequencies as input vectors.

  • The Pareto learning SOM using CC as the input vector shows good classification performance compared with conventional SOMs and with frequencies as the input vector.

Future work

  • Application of this method to additional types of sequence data, such as coding and non-coding regions, and to large data sets such as whole genomes.

  • Reduction of the computational cost of P-SOMs, which is 5 times higher than that of conventional SOMs.

Acknowledgements

  • This work was supported by JSPS KAKENHI Grant Number 24500279.