Error Correction for Deep Viral Sequencing (Shotgun, Amplicons)

Bassam Tork, Dr. Alexander Zelikovsky. Georgia State University, USA.

Introduction • Viruses such as Hepatitis C virus (HCV) show a very high level of sequence heterogeneity, which is responsible for • Its escape from immune responses • Its rapid development of drug resistance. • Deep sequencing presents a novel opportunity for understanding HCV evolution, drug resistance and immune escape.

Deep sequencing requires extensive computational processing with error correction algorithms in order to obtain high-quality reads for genetic analysis. • The key purpose of such algorithms is to differentiate between artifacts and actual sequences. • Common error correction approaches are very efficient in shotgun experiments but have problems in amplicon experiments.

Common approaches for error correction • Frequency cut-off • Based on reported 454 error rates. • The nucleotide error rate is small, but insertions and deletions (indels) are common. • Error rates are likely sequence-specific. • Clustering: Pyronoise, SHORAH, AVA • We found that these approaches find too many false sequences in amplicon experiments. • K-mer correction: EDAR (Zhao et al., 2010) • In contrast to other methods, it does not require a reference sequence. • Works well in shotgun experiments, but deletes suspected error regions instead of correcting the errors.

EDAR • Let the length of a k-mer be k, and observe that a read r of length l contains n = l - k + 1 k-mers. Let the k-count Ck(i), i = 1..n, be the coverage of the k-mer starting at position i of read r. • Consider the set of k-mers of all reads. For instance, for read r = ATCCGAT and k = 4, the k-mers are {ATCC, TCCG, CCGA, CGAT}. • Calculate the frequency of each k-mer in the whole data set.
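
The k-mer definitions above can be sketched in a few lines of Python. This is only an illustration of the definitions, not EDAR's implementation; the function names `kmers`, `kmer_counts` and `kcount_profile` are hypothetical.

```python
from collections import Counter

def kmers(read, k):
    """Return the l - k + 1 k-mers of a read, in order of start position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def kmer_counts(reads, k):
    """Count how often each k-mer occurs in the whole data set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

def kcount_profile(read, k, counts):
    """k-count Ck(i) for every k-mer start position i of one read."""
    return [counts[s] for s in kmers(read, k)]

reads = ["ATCCGAT", "ATCCGTT"]
counts = kmer_counts(reads, 4)
profile = kcount_profile(reads[0], 4, counts)
```

For the read ATCCGAT and k = 4, `kmers` yields the four k-mers listed on the slide, and `profile` holds one k-count per start position.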

Method-EDAR • 2) For each k-mer in the dataset, calculate the k-count (described below). • 3) Normalize k-count values based on GC content (Voelkerding et al., 2009). • 4) Based on the distribution of k-counts across the dataset, determine the threshold between erroneous k-mers and correct ones. • 5) For each read, cluster k-mers using the variable bandwidth mean-shift method.

Method-EDAR..cont. • 6) Categorize each cluster as error or non-error based on the threshold obtained in step 4, above. • 7) Post-process clusters based on the k-mers’ locations on the read and other criteria. • 8) Analyze each error region, and detect putative error bases. Remove putative error bases from each read and generate shorter read fragments.

GC-content Based Adjustments to k-count Values • Some sequencing methods result in a coverage bias in which read coverage increases with GC-content. • For such data, authors normalize k-count values prior to clustering based on their GC-contents as follows: • Since error-containing k-mers will bias this calculation, a reference genome is required to accurately calculate Ck,GC(ki) and covk

Detecting Boundaries between Errors and Non-Error Regions

Detecting and Removing Putative Error Bases in Each Erroneous Region • Let x = [b..e] be an erroneous region with length l(x). There are several observations regarding error bases. • 1st: there is an error in every k-mer starting from each position inside x. • 2nd: it is highly likely that the k-mers starting from positions b-1 and e+1 do not contain errors. • Following these observations, the authors only consider bases that are highly likely to be incorrect.

Detecting and Removing Putative Error Bases in Each Erroneous Region..cont. • For example, the last position of the k-mer starting from b should be incorrect, because the k-mer starting from b-1 is error-free while the k-mer starting from b is wrong. Furthermore, any base between b+k and e could be an error base, but the authors only consider the most likely ones and do not treat those bases as error bases.

EDAR: ERROR DETECTION AND REMOVAL • EDAR works as follows, where l(x) is the length of the error region x = [b, e]: • If l(x) <= k, then • if e is the starting position of the last k-mer of the read, b+k-1 is considered the error base; • otherwise, e is considered the error base. • If k < l(x) < 2k, then both b+k-1 and e are considered error bases. • If l(x) >= 2k, then every base in [b+k-1, e] is considered an error base.
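
The three cases above can be written as one small function. This is a minimal sketch, assuming 1-based positions and assuming the case conditions refer to the region length l(x) = e - b + 1; the name `putative_error_bases` and the parameter `n_kmers` (the start position of the last k-mer of the read) are hypothetical.

```python
def putative_error_bases(b, e, k, n_kmers):
    """Return the 1-based read positions treated as error bases
    for the error region [b, e] (assumed: l(x) = e - b + 1)."""
    length = e - b + 1
    if length <= k:
        # Short region: a single error base.
        if e == n_kmers:          # region reaches the last k-mer of the read
            return [b + k - 1]
        return [e]
    if length < 2 * k:
        # Medium region: the two most likely positions.
        return [b + k - 1, e]
    # Long region: every base from b+k-1 through e.
    return list(range(b + k - 1, e + 1))
```

For example, with k = 4 and a read containing 20 k-mers, the region [5, 10] has length 6, so positions 8 and 10 are reported.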

EDAR: ERROR DETECTION AND REMOVAL • After detecting the error bases, the algorithm splits the reads by removing these putative error bases from each read. • The resulting read fragments can serve as input to assemblers for sequence assembly. • Unlike methods that alter reads to correct errors (mutations), this approach can effectively remove insertions as well.
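
Splitting a read at its putative error bases might look like the following. It is a sketch of the idea only, assuming 1-based error positions; `split_read` is a hypothetical helper name, and any minimum fragment length EDAR may apply is not modeled.

```python
def split_read(read, error_positions):
    """Remove the bases at the given 1-based positions and return
    the remaining contiguous fragments of the read."""
    bad = set(error_positions)
    fragments, current = [], []
    for pos, base in enumerate(read, start=1):
        if pos in bad:
            # An error base ends the current fragment and is dropped.
            if current:
                fragments.append("".join(current))
                current = []
        else:
            current.append(base)
    if current:
        fragments.append("".join(current))
    return fragments
```

The fragments returned for each read would then be passed to the assembler in place of the original read.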

Outline of the Skums et al. paper • Skums et al. present two new highly efficient error correction algorithms: • K-mer error correction (KEC) • Empirical frequency threshold (ET). • Both were compared to the recently published algorithm SHORAH to evaluate their relative performance. • Performance was measured on 24 experimental datasets obtained by 454 sequencing of amplicons with known sequences.

k-mer Error Correction Algorithm (KEC) • The scheme of KEC includes 4 steps: • Calculate k-mers s and their frequencies kc(s) (k-counts). Assume that k-mers with high k-counts (“solid” k-mers) are correct, while k-mers with low k-counts (“weak” k-mers) contain errors. • Determine the threshold k-count (error threshold), which distinguishes solid k-mers from weak k-mers. • Find error regions. The error region of a read is a segment [i,j] such that for every p ∈ [i,j] the k-mer starting at position p is considered weak. • Correct the errors in error regions.

K-mer error correction (KEC) • 1. Consider the set of k-mers of all reads. For instance, for read r = ATCCGAT and k = 4, the k-mers are {ATCC, TCCG, CCGA, CGAT}. • 2. Calculate the frequency of each k-mer in the whole data set.

• 3. Determine the threshold between erroneous k-mers and correct ones. • In shotgun experiments the threshold is easily identified (purple bar in the left figure, separating two distributions). • In amplicon experiments, we found that the threshold definition required a more complex procedure. • 4. For each read, cluster k-mers by their frequencies • using the variable bandwidth mean-shift method. • Each cluster determines the error region. [Figure: k-mer distribution in a shotgun experiment (left) and k-mer distribution in an HCV amplicon.]

Finding the Error Threshold-KEC • It was observed in [Chaisson et al., 2009; Zhao et al., 2010] that it is not necessary to explicitly consider the model for the distribution, because the first minimum of f(v) satisfactorily separates the different distributions and can therefore be used as the error threshold. • However, this approach is not applicable to amplicon data. • The first minimum of f(v) is always equal to 0. • The authors define the end of the first sufficiently long segment of consecutive 0’s of f(v) as the error threshold ter. • In their experiments, a segment length equal to k was adequate.
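
The threshold rule for amplicon data can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: f(v) is taken to be the number of distinct k-mers whose k-count equals v, the "end" of a zero segment is taken to be its last value of v, and the names `error_threshold` and `min_run` are hypothetical.

```python
from collections import Counter

def error_threshold(kcounts, min_run):
    """kcounts: one k-count per distinct k-mer in the data set.
    Returns the last v of the first run of at least `min_run`
    consecutive zeros of f(v), or None if no such run exists."""
    f = Counter(kcounts)          # assumed: f[v] = number of k-mers with k-count v
    if not f:
        return None
    run = 0
    for v in range(1, max(f) + 1):
        if f[v] == 0:
            run += 1
        else:
            if run >= min_run:
                return v - 1      # end of the zero segment
            run = 0
    return None
```

With `min_run` set to k, this follows the segment length that the slide reports as adequate in the authors' experiments.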

Finding the Error Regions-KEC • Sequentially find isolated segments [i,j] such that for every p ∈ [i,j], KCk(p) <= ter. • The k-mers of the read are clustered according to their k-counts by the variable bandwidth mean-shift method, using the FAMS software. • Every segment is extended in both directions by adding consecutive positions q by the following rule: q is added if and only if there exists p ∈ [i,j] such that the k-mers Sk(p) and Sk(q) belong to the same cluster. Overlapping segments are joined, and the obtained segments are the error regions.
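
The first step, finding the isolated segments of weak k-mers along one read, can be sketched as below. The mean-shift clustering and the segment extension are deliberately left out, so this is not the full procedure; `weak_segments` is a hypothetical name and positions are 1-based.

```python
def weak_segments(kcount_profile, t_er):
    """Return maximal segments (i, j), 1-based and inclusive, in which
    every k-mer start position has a k-count <= t_er."""
    segments = []
    start = None
    for p, count in enumerate(kcount_profile, start=1):
        if count <= t_er:
            if start is None:
                start = p                     # a new weak segment begins
        elif start is not None:
            segments.append((start, p - 1))   # the segment ended at p - 1
            start = None
    if start is not None:
        segments.append((start, len(kcount_profile)))
    return segments
```

Here `kcount_profile` is the list of k-counts of the read's k-mers in order of start position, and `t_er` is the error threshold.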

Error Correction-KEC • This stage consists of the following steps: • (4a) Error correction in “short” error regions (with lengths not exceeding k). • (4b) Error correction in “long” error regions (with lengths greater than k).

Error Correction in Short Error Regions-Tail Errors • According to Lemma 1, errors corresponding to non-tail error regions with lengths not exceeding k can be identified and corrected. • If the error region x = [b,e] of a read r = (r1,…,rn) is a tail (that is, e = n-k+1), then the suffix starting at position b+k-1 is deleted from the read. • If b = 1, the prefix ending at position e is deleted.
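
Tail trimming can be illustrated with string slicing. The sketch assumes 1-based positions, so a region at the start of the read is taken to begin at position 1; `trim_tail` is a hypothetical name, and checking the suffix case before the prefix case is an arbitrary choice of this sketch.

```python
def trim_tail(read, b, e, k):
    """Trim a tail error region [b, e] (1-based) from a read.
    Non-tail regions leave the read unchanged."""
    n = len(read)
    if e == n - k + 1:
        # Region reaches the last k-mer: delete the suffix that
        # starts at position b+k-1, keeping positions 1..b+k-2.
        return read[:b + k - 2]
    if b == 1:
        # Region starts at the first position: delete the prefix
        # that ends at position e, keeping positions e+1..n.
        return read[e:]
    return read
```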

Error Correction in Short Error Regions-Algorithm1 • (a) Consider every non-tail error region x = [b,e] with length not exceeding k of every read r = (r1,…,rn). • The authors assume that x was caused by a single isolated error at position e. • Taking into account the length of x and the sequence of nucleotides following it, identify the type of error. • If l(x) = k, then x could be caused either by a nucleotide replacement, with 3 possible corrections, or by a simple nucleotide insertion. • If l(x) = k-1 and r_e = r_{e+1}, then x could be caused either by the insertion of the nucleotide r_e or by the deletion of a nucleotide c ≠ r_e between r_e and r_{e+1}. • If l(x) < k-1, then the type of error and its correction can be determined unambiguously. In the case of an insertion, remove r_e; in the case of a deletion, duplicate r_e if doing so introduces a solid k-mer.

Error Correction in Short Error Regions-Algorithm1 • (b) Cut tails, delete short reads, recalculate k-mers and error regions, and delete reads in which error regions cover more than 40% of the read. • Repeat the previous steps (a and b) until there are no error regions in the data set or a fixed number of iterations is reached.

Error Correction in Long Error Regions • The possible errors in an error region x = [b,e] with l(x) > k are located at positions b+k-1 and e. • However, a “long” error region represents a union of more than one “short” error region.

How to Treat Long Error Regions • There are two ways to treat long error regions. • One is to discard all reads with errors left uncorrected by Algorithm 1. • The other is to correct errors in “long” error regions: all possible errors at positions b+k-1 and e are considered, and the correction that introduces the k-mer with the highest k-count is chosen. • Since these corrections are less reliable than those for “short” error regions, correction of “long” error regions is conducted at the end of the algorithm, after correcting “short” error regions.

Empirical frequency threshold algorithm (ET) • Simple idea: the main purpose of the procedure is to calculate the frequency of erroneous haplotypes in amplicon samples where a single haplotype is expected. • All reads smaller than 90% of the expected amplicon length are deleted and all reads bigger than 110% of the expected amplicon length are clipped. • Alignment to external references: Each haplotype is aligned against a set of external references of all known genotypes. For each haplotype the best match in the external set is chosen. The aligned sequence is clipped to the size of the chosen external reference. • Alignment to internal references: The 20 most frequent haplotypes that do not create insertions or deletions with respect to the external reference are selected as the internal reference set. Each haplotype in the dataset is aligned against each member of the internal reference set. For each haplotype the best match in the internal set is chosen.
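
The length filter in the first ET step could be sketched like this. The slide does not say what over-long reads are clipped to, so clipping to 110% of the expected length is an assumption of this sketch, as are the names `filter_by_length`, `low_pct` and `high_pct`.

```python
def filter_by_length(reads, expected_len, low_pct=90, high_pct=110):
    """Drop reads shorter than low_pct% of the expected amplicon length
    and clip longer reads to high_pct% of it (assumed clipping target)."""
    max_len = expected_len * high_pct // 100
    kept = []
    for read in reads:
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if len(read) * 100 < low_pct * expected_len:
            continue
        kept.append(read[:max_len])
    return kept
```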

Empirical frequency threshold algorithm (ET)…cont. • Homopolymer correction: All homopolymers of 3 or more nucleotides are identified. If the homopolymer region includes an insertion, the inserted nucleotide is removed. If the homopolymer includes a deletion, the gap is replaced by the missing nucleotide. Then all different haplotypes and their frequencies are recalculated.
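
Identifying homopolymers of 3 or more nucleotides is straightforward with `itertools.groupby`. The sketch below covers only the identification step on an ungapped sequence; the indel correction against the aligned reference is not shown, and `homopolymer_runs` is a hypothetical name.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=3):
    """Return (start, end, base) for every run of at least min_len
    identical nucleotides; start/end are 0-based, end exclusive."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, pos + length, base))
        pos += length
    return runs
```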

Empirical frequency threshold algorithm (ET)…cont. • Haplotype error threshold: The frequency of erroneous haplotypes and its standard deviation are calculated over the 14 samples containing a single clone. A haplotype threshold was defined as the weighted average frequency of erroneous haplotypes + 9 standard deviations (0.40%). All haplotypes with a frequency lower than the haplotype threshold are removed. • Removal of reads with Ns: All haplotypes with Ns are removed from the final file. This step was performed at the end rather than at the beginning to take advantage of the information that these reads provided regarding nucleotide frequencies at positions other than those with N.
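
The two final filters can be sketched as one function. The 0.40% default is the threshold quoted above; the function name `filter_haplotypes` and the input format (a dictionary from haplotype sequence to read count) are assumptions, and the derivation of the threshold from the single-clone samples is not reproduced here.

```python
def filter_haplotypes(hap_counts, threshold=0.0040):
    """hap_counts: dict mapping haplotype sequence -> number of reads.
    Removes haplotypes below the frequency threshold, then removes
    haplotypes containing N, in the order described on the slide."""
    total = sum(hap_counts.values())
    # Frequencies are computed over all reads, including those with N.
    frequent = {h: c for h, c in hap_counts.items() if c / total >= threshold}
    return {h: c for h, c in frequent.items() if "N" not in h.upper()}
```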

Algorithm comparison

Missing true sequences • ET and SHORAH are better than KEC in finding true sequences • KEC has problems with low-frequency variants (around 1%)

Frequency of true sequences • ET and KEC are much better than SHORAH in replicating the frequencies of the true sequences

False sequences • ET and KEC are much better than SHORAH in avoiding false sequences (next slide shows a reduced scale).

False sequences (scale clipped to 5) • ET and KEC have similar performance in avoiding false sequences

Hamming distance between the false sequences and true targets • ET and KEC are much better than SHORAH in terms of the average distance from false sequences to their closest true targets.

Conclusions • All algorithms perform well at finding true sequences. • KEC has more problems with low-frequency variants. • KEC and ET substantially outperform SHORAH in: • Estimating the frequency of true sequences. • Removing false sequences. • KEC and ET are highly suitable for rapid recovery of high-quality sequences from amplicon regions of highly heterogeneous viruses such as HCV and HIV. • In contrast to SHORAH and ET, KEC does not require a reference sequence.

References • Chaisson M, Brinza D, Pevzner P: De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Res 2009, 19:336-346. • Zhao X, Palmer L, Bolanos R, Mircean C, Fasulo D, Wittenberg D: EDAR: An efficient error detection and removal algorithm for next generation sequencing data. Journal of Computational Biology 2010, 17(11):1549-1560.
