Finding repeat pattern in human genome by TEIRESIAS algorithm

Finding repeat pattern in human genome by TEIRESIAS algorithm Xiaojun Hu

Human Genome

TEIRESIAS Algorithm

Define Pattern Σ(Σ{‘.’} * )Σ Σ: the alphabet of residues, like ACTG in DNA, the set of all amino acids ‘.’ : wild card character Pattern: Any string that begins and ends with a residue, and contains an arbitrary combination of residues and ‘.’ characters

Define Pattern • The pattern is defined by three parameters • L: the minimal number of literals • W: the pattern length • K: the minimal repeat times • Example: • Pattern: A.CH..E L = 4, W= 7 <4, 7> pattern

Define Pattern • P: pattern ‘A.CH..E’ • S: a set of sequences S={LFAADCHFFEDTR, LKLALCHESESDR, AFAGCHADELFT} • Ls(P) = {(i,j) | sequence si matches P at offset j} Ls(‘A.CH..E’) = {(1,4), (2,4), (3,3)}

Problem Definition • Given a set S={s1,s2, …, sn} of input sequences and parameters L, W, K, find all maximal <L, W> patterns that have support at least K repeat times.

TEIRESIAS Algorithm • The algorithm contains two steps: Scanning and convolution • 1. Scanning the sequences in the input set S and locating all elementary patterns with support at least K. • 2. Combining together the elementary pattern to recover the maximal pattern

Scanning Process

Convolution Process

Convolution Process • Prefix(P) is the prefix sub pattern of P that has exactly (L-1) residues Example: Prefix(‘F.ASTS’) =‘F.A’ if L = 3 • Suffix(P) is the suffix subpattern of P with exactly (L-1) residues Example: Prefix(‘F.A…S’) =‘A..S’ if L = 3

Convolution Process • P, Q are the arbitrary patterns with at least L residues each • R denotes a new pattern • Q’denotes what remains of Q after the prefix(Q) is thrown away • Φ denotes the empty string

Convolution Process • Forward extension If Suffix(P) = Prefix(Q), P is extended to right with Q’(Pattern Q leftover except prefix) Example: ‘F.AS’ convolute ‘AST’ to form ‘F.AST’ • Backward extension If Prefix(P) = Suffix(Q), P is extended to the left with Q’(Pattern Q leftover except suffix) Example: ‘F.AS’ convolute ‘TF.A’ to form ‘TF.AS’

Application 1: Finding repeat pattern in ChrM • Define the elementary pattern: L=W=5 • Scan the ChrM sequence Option 1. Generate all possible elementary patterns (4^5 = 1024), implement in R Option 2. Use Hash function to count elementary patterns (1017), implement in Perl

Application 1: Finding repeat pattern in ChrM

Application 1: Finding repeat pattern in ChrM • Scan the sequence Save the pattern Offset (index) ChrM: G1A2T3CACAGGTCT … Pattern: GATCA Offset: 1 Pattern: ATCAC Offset: 2 Pattern: TCACA Offset: 3

Application 1: Finding repeat pattern in ChrM • Convolution Simple repeat pattern offset example: Repeat Pattern1: G1ATCACAG Offset: 1 Repeat Pattern2: G10ATCACAG Offset: 10 Repeat Pattern3: G20ATCACAG Offset: 20 Elementary pattern and their offset: GATCA 1 10 20 ATCAC 2 11 21 TCACA 3 12 22 CACAG 4 13 23 From the offset, if the pattern forward extend 1 base, the offset increase 1

Application 1: Finding repeat pattern in ChrM Repeat Pattern: GATC4ACAG Elementary pattern and their offset: CACAG 4 13 23 TCACA 3 12 22 ATCAC 2 11 21 GATCA 1 10 20 From the offset, if the pattern backward extend 1 base, the offset decrease 1

Application 1: Finding the repeat pattern in ChrM • Convolution Pattern: CACAACAA, length 8, repeat 5 times, offset: 6399 7724 12487 12930 15824 L=W=5, K=5 Elementary patterns and their offset: ACAAC 492 1712 3303 3400 3445 380640404073 5057 5396 6065 6197 640071857725 8643 8709 8768 9198 9668 10040 10134 10300 10839 10897 10900 10944 11263 11335 11997 12031121781225012488 12545 12554 12741 12858 12931 13380 1374613775 14331 14478 14691 15364 15680 15767 15825 16079 CAACA 1110 1122 2214 2399 2430 2969 3304 3762 3807 3940 40414074 4698 5581 6198 6401 6549 6598 71867726 8347 8641 8663 87109199 9354 9560 9791 10041 10086 10617 10716 10780 10835 10895 10898 11264 11333 11954 11998 120321217912251 12489 12829 129321374713776 13861 13906 14188 14651 14822 15121 1568115826 16077 16281 Forward extension: Offset difference = pattern length (5) – pattern suffix length (4) = 1 Repeat position (repeat times) =20 >5 , accept New pattern: ACAACA (elementary pattern ACAAC forward extends 1 base A) New Offset: 3303 3806 4040 4073 6400 7185 7725 8709 9198 10040 10897 12031 12178 12250 12488 12931 13746 13775 15680 15825

Application 1: Finding the repeat pattern in ChrM • Convolution patterns and their offset: ACAACA 3303 3806 4040 4073 6400 7185 7725 8709 9198 10040 10897 12031 12178 12250 1248812931 13746 137751568015825 AACAA 240 283 362 646 1123 2400 2766 2920 2970 3302 3721 4039 4699 5173 5287 5395 6196 64027727 8508 8642 8664 8691 9667 10299 10420 10781 10896 10899 10943 11334 11573 12030 12249 12416 12490 12553 12740 12933 13388 13395 13745 13774 13777 13862 14029 14189 14193 14573 14658 15363 15603 15679 1568215827 15867 16078 16282 Forward extension: Offset difference = new pattern length (6) – pattern suffix length (4) = 2 Repeat position (repeat times) =8 >5 , accept New pattern: ACAACAA (pattern ACAAC forward extends 1 base A) New Offset: 6400 7725 10897 12488 12931 13775 15680 15825

Application 1: Finding the repeat pattern in ChrM • Convolution patterns and their offset: ACAACAA64007725 10897 1248812931 13775 15680 15825 CACAA 1011 1052 1619 3156 3805 3810 3991 4072 5280 5680 5938 6399 7184 7724 8053 8456 8563 8708 8767 8916 9106 10133 10838 10985 11262 11357 11996 12001 12217 12487 12544 12930 13218 13379 13546 13942 14330 14624 15419 15581 15824 16431 Backward extension: Offset difference = pattern length (5) – pattern suffix length (4) = 1 Repeat position (repeat times) =5 >=5 , accept New pattern: CACAACAA (pattern ACAACAA backward extends 1 base C) New Offset: 6399 7724 12487 12930 15824 • Forward and backward extension finished, repeat times >=K , report it

Application 1: Finding the repeat pattern in ChrM Pattern report is saved in the ChrM_pattern.txt file. Here is a part of result(L=5, K=10).

Application 1: Finding the repeat pattern in ChrM

Application 2: Finding the repeat pattern in Chr1 • Define the elementary pattern: L=W=17 • Scan the Chr1 sequence • Problem: 1. Ch1 file size: 252,194,721 bytes, very large, can’t been read by R 2. All possible elementary patterns (4^17 = 17,179,869,184), over the R memory limitation

Application 2: Finding the repeat pattern in Chr1 • Solution: Use another computer language, like C or perl The C program gcat and the scanning result were got from Dr. Steven. Parallel program were used for running the program. The elementary patterns over 2000 repeat times are used for convolution. I also wrote perl script: scan.pl Usage: perl scan.pl <chr file> <output file> <L value>

Application 2: Finding the repeat pattern in Chr1 • Convolution Problem: The index file is too large, it can’t been read by R My Solution: perl convolute.pl <index file> <output file> <pattern length(L)> <Repeat times(K)> It’s working, but still very slow. The PVM (parralle virtual machine) techniques should be used for increasing speed.

Application 2: Finding the repeat pattern in Chr1

There are two more patterns, namely,AGCAATTCTCCTGCCTCA: 2858CTCCTGCCTCAGCCTCC: 14168can convolute with gctcactgcaacctccgcctcccaggttcaagcaattctc : 108to form the maximal patternGCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCAATTCTCCTGCCTCAGCCTCC : 46

Application 2: Finding the repeat pattern in Chr1 • Discussion 1. For the human genome pattern discovery, the high performance computer and parallel program should be used. R is a not good tools for solving this problem. 2. The algorithm should be carefully implemented for the better performance, like reducing the loop time.

Reference • Isodore Rigoutsos, Aris Floratos Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm, Bioinformatics, 1998, Vol(14) no.1:55-67 • Isodore Rigoutsos et al. Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological process, PNAS, 2006, Vol(103) no.17:6605-6610

Finding repeat pattern in human genome by TEIRESIAS algorithm

Finding repeat pattern in human genome by TEIRESIAS algorithm

Presentation Transcript

Human Genome Project

Pattern Finding and Pattern Discovery in Time Series

Detecting Inversions in Human Genome

HUMAN GENOME

Teiresias

Human Genome Project

Design Brief Option 1 - Repeat Pattern

Human Genome Lecture

Human Genome Project

Finding genes in the genome

Finding Genes in the Rice Genome

Human Genome

Finding Genes In a Genome

HUMAN GENOME

HUMAN GENOME

Root-Finding Algorithm

Human Genome

$Human Genome$

The Human Genome

An Implementation of The Teiresias Algorithm

Chapter 6 Gene Prediction: Finding Genes in the Human Genome