
Hiroshi Dozono Saga University


Visualization and Classification of DNA sequences using Pareto learning Self Organizing Maps based on Frequency and Correlation Coefficient

Hiroshi Dozono

Saga University

- The first step of genome analysis is DNA sequencing, which identifies the order of nucleotides in DNA.
- About 10 years ago, DNA sequencing was expensive and took a long time.
- Today, Next Generation Sequencing (NGS) can read sequences very rapidly and at low cost.
- $100 to $1000 in 1 hour.

- NGS produces large amounts of sequence data at once.
- Gigabytes to terabytes

- After reading the sequences, further analyses are conducted.
- Identify the organisms
- Identify the functions of the genome
- Map the sequences onto reference sequences
- Compare the genomes among organisms

- Comparing genomes precisely requires a large amount of computation.
- The sequence alignment method is generally used.
- Sequence alignment is effective for pairwise comparison or for comparing a small number of sequences.
- It requires a large amount of computation when comparing a large number of sequences.

- Statistical information about the sequences can serve as an indicator of the similarity among them.

- DNA sequence
- Sequence of 4 types of nucleotides: A, G, T, C
- Complementary nucleotides hybridize with each other.
A-T, G-C

AGTCTTATCGATTAG

|||||||||||||||

TCAGAATAGCTAATC
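
The base pairing shown above can be checked with a few lines of Python. This is only an illustrative sketch; the names `COMPLEMENT` and `complement` are made up for this example and do not come from the slides.

```python
# Translation table for the pairing rules A-T and G-C (illustrative names).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Return the strand that hybridizes with `seq`, position by position."""
    return seq.translate(COMPLEMENT)


if __name__ == "__main__":
    top = "AGTCTTATCGATTAG"
    print(top)
    print("|" * len(top))
    print(complement(top))
```

The function pairs each position with its partner and does not reverse the strand, which matches the way the two strands are drawn on the slide.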

- DNA sequencing - Genome analysis
- Next generation sequencers can read all DNA sequences of one organism, or of several organisms, at once.
- A large amount of sequencing data (from gigabytes to terabytes) is produced.
- The result of sequencing is obtained as a collection of short fragments made up of the nucleotides A, G, T and C.

- An effective method for identifying the features of the sequences is required.

- Sequencing
- Reconstruction of the sequence
- Identification of coding regions, which code for genes
- Identification of the functions of genes
- The steps after sequencing incur large computational costs.
- Our approach aims to extract global features of the DNA sequences without precise analysis.

- An SOM that uses the frequencies of N-tuples in DNA sequences as the input vector is proposed in
T. Abe, T. Ikemura, et al., Informatics for unveiling hidden genome signatures, Genome Res., vol. 13, pp. 693-702

- For N-tuples, the dimension of the input vector is 4^N
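
As a rough illustration of the N-tuple input vector, the sketch below counts overlapping N-tuples and returns 4^N relative frequencies. The function name `ntuple_frequencies`, the alphabetical A, C, G, T ordering, and the normalisation to relative frequencies are assumptions made for this example; the slides do not state these details.

```python
from itertools import product


def ntuple_frequencies(seq: str, n: int) -> list[float]:
    """Relative frequencies of all 4**n N-tuples, counted with overlap.

    Assumption: windows containing characters other than A, C, G, T are skipped.
    """
    keys = ["".join(p) for p in product("ACGT", repeat=n)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in keys]


if __name__ == "__main__":
    vec = ntuple_frequencies("ACGCTACTAG", 2)
    print(len(vec))  # 4**2 entries
```

With n = 4 the vector has 256 entries, which is the input length quoted for the 4-tuple map.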

Correlation Coefficients (CC) of DNA sequence

Sequence: ACGCTACTAG

A: 1000010010    ρAA(n): CC between A and n-shifted A

C: 0101001000    ρAC(n): CC between A and n-shifted C

G: 0010000001    ...

T: 0000100100    ρTT(n): CC between T and n-shifted T

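For all combinations of A, G, T, C and for shifts from 1 to n, 4 x 4 x n correlation coefficients are calculated and used as the input vector of the SOM.

Compared with the dimension of the N-tuple frequency vector (4^N), the dimension of the CC vector is much smaller, because it grows only linearly with the number of shifts.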

- Using these equations, correlation coefficients can be calculated without converting DNA sequences to binary sequences.
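
The equations from this slide are not reproduced in the text, so the following is only a sketch under an assumed definition: ρab(n) is taken to be the Pearson correlation between the 0/1 indicator sequence of nucleotide a and the indicator sequence of nucleotide b shifted by n, restricted to the overlapping positions. For 0/1 sequences this reduces to a formula on three counts, so no binary sequence has to be built. The names `shifted_cc` and `cc_vector` and the A, C, G, T ordering are hypothetical.

```python
from math import sqrt

NUCLEOTIDES = "ACGT"  # assumed ordering of the vector entries


def shifted_cc(seq: str, a: str, b: str, shift: int) -> float:
    """Pearson correlation between 'is a at i' and 'is b at i + shift'.

    Computed directly from counts on the DNA string (assumed definition).
    """
    m = len(seq) - shift  # number of overlapping positions
    if m <= 0:
        return 0.0
    n_a = n_b = n_ab = 0
    for i in range(m):
        is_a = seq[i] == a
        is_b = seq[i + shift] == b
        n_a += is_a
        n_b += is_b
        n_ab += is_a and is_b
    denom = sqrt(n_a * (m - n_a) * n_b * (m - n_b))
    # If either indicator is constant the correlation is undefined; use 0.0.
    return (m * n_ab - n_a * n_b) / denom if denom else 0.0


def cc_vector(seq: str, max_shift: int) -> list[float]:
    """All 4 x 4 x max_shift coefficients, ordered by shift, then a, then b."""
    return [shifted_cc(seq, a, b, s)
            for s in range(1, max_shift + 1)
            for a in NUCLEOTIDES
            for b in NUCLEOTIDES]


if __name__ == "__main__":
    print(shifted_cc("ACGCTACTAG", "A", "C", 1))
    print(len(cc_vector("ACGCTACTAG", 4)))
```

For shifts 1 to 4 this gives a 64-dimensional vector, compared with 256 dimensions for 4-tuple frequencies.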

- Settings of the experiments
- Set 1: genes from amino acid metabolism of 6 species
- Set 2: genes from 7 metabolic pathways of Homo sapiens
- The sequences are segmented into fragments of 1000 bases.
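
The segmentation step might look like the sketch below. The slides do not say how a trailing remainder shorter than 1000 bases is handled; dropping it is an assumption of this example, as is the function name `segment`.

```python
def segment(seq: str, size: int = 1000) -> list[str]:
    """Split `seq` into consecutive, non-overlapping fragments of `size` bases.

    Assumption: a trailing remainder shorter than `size` is discarded.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


if __name__ == "__main__":
    fragments = segment("ACGT" * 600)  # 2400 bases
    print(len(fragments), [len(f) for f in fragments])
```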

- The resolution and topology of these maps are almost comparable.

- Map of frequencies of 4-tuples
- From 6 species L=256

- Map of CC of 1-4 shifts
- from 6 species L=2

- L=64

- For small dimensions, CC shows better separation.

- The genes from the metabolic pathways of Homo sapiens cannot be clearly clustered.

- 70% of the sequence fragments are used for learning, and the remainder are used for testing.
- The experiments are conducted using a conventional SOM and the supervised Pareto learning SOM proposed by Dozono, which combines the integration of multi-modal vectors, visualization, and supervised learning.

- Winner and updated units

- Conventional SOM

- Pareto learning SOM

- Overlapped neighbors are updated more strongly.
- This plays an important role in the integration of multi-modal vectors.
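
The winner selection can be pictured with a small sketch. It assumes that each unit holds one weight vector per modality, that each modality contributes its own Euclidean distance as a separate objective, and that the winners are the units not dominated by any other unit. The helper names and the toy data are hypothetical, and the neighborhood update itself is not shown.

```python
from math import dist


def modality_distances(unit, x):
    """One Euclidean distance per modality; `unit` and `x` are tuples of vectors."""
    return [dist(w, v) for w, v in zip(unit, x)]


def dominates(p, q):
    """True if p is no worse than q in every objective and better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_winners(units, x):
    """Indices of units whose distance vectors are not dominated by any other unit."""
    d = [modality_distances(u, x) for u in units]
    return [i for i, di in enumerate(d)
            if not any(dominates(dj, di) for j, dj in enumerate(d) if j != i)]


if __name__ == "__main__":
    # Toy map: 5 units, 2 modalities, 1-dimensional weights per modality.
    units = [((0.1,), (0.9,)), ((0.5,), (0.5,)), ((0.9,), (0.1,)),
             ((0.6,), (0.6,)), ((0.9,), (0.9,))]
    x = ((0.0,), (0.0,))
    print(pareto_winners(units, x))
```

Because several units can win for a single input, their neighborhoods can overlap, which is where the stronger update of overlapped neighbors comes from.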

- For P-SOM, the category vector can be attached to each input vector as an independent vector.
- The category vector, working cooperatively with the other input vectors, pulls input vectors of the same category close together on the map.
- The P-SOM learning algorithm thereby becomes supervised.

- Category of test vector xt is determined as follows.
- where P(xt) is the Pareto optimal set of units for xt

- We proposed a preprocessing method for DNA sequences that uses correlation coefficients of the occurrences of the nucleotides.
- Using this method, the clustering results of the sequences were nearly comparable with those obtained using the frequencies of the N-tuples, despite the difference in the length of the input vectors.

- The Pareto learning SOM method is applied to the classification of DNA sequences, using correlation coefficients and frequencies as input vectors.
- Pareto learning SOM using CC as the input vector shows good classification performance compared with conventional SOMs and with frequencies as the input vector.

- Application of this method to additional types of sequence data, such as coding and non-coding regions, and to large data sets such as whole genomes.
- Reduction of the computational cost of P-SOMs, which is 5 times that of conventional SOMs.

- This work was supported by JSPS KAKENHI Grant Number 24500279.