NGS Data Analysis for Structural Variants & Copy Number Variations Detection

Detection of Structural Variants (SVs) and Copy Number Variations (CNVs) on NGS Data

SVs and CNVs • They are often confused… • SVs: regions contain insertions and deletions (indels) or inversions. • CNVs: regions appearing a different number of times in different individuals. • They are closely related phenomena. • SVs are operationally defined as genomic events involving >50bp. They include CNVs as well as rearrangements such as inversions and translocations.

Why studying SVs and CNVs • It has been acknowledged only very recently that human genomes differ more as a consequence of SVs (including CNVs) than of single-base differences. [first "hypothesis" in 2004-2005 not taken seriously, and evidence only in 2010 with NGS]. • In particular, many studies observed CNVs and did genotyping with them: they are the easiest SVs to detect. • I am not aware of tools for detecting inversions and translocations of DNA durectly on NGS data • There are some tools for detecting CNVs: from now on we discuss them only.

Copy Number Variations • The challenge is to discover effects of CNVs on human diseases, complex traits (combination of genetic and environmental effects), and evolution. • Genotyping of human CNVs is far from being a routine procedure. This is a limit to personalized medicine, for which CNVs detection is a crucial step. • No standard method (many and very recent tools, not yet a fair/sharp comparison) exists.

Detecting CNV • Since late nineties and until very recently (and still) CNVs were/are detected using aCGH: Array Comparative Hybridization. • The array platform are not as rapid and cheap as NGS, and their data cannot be re-used once processed. • The aGCH Array has inherent limits on the size and frequency of detectable CNVs. • NGS opened a new era in CNV-detection!

CNV calling • There are three approaches for CNV calling: • Based on read count (RC), or read alignment coverage (Breakdancer, CNVnator, CNV-seq, and tools of [Campbell et al, 2008], [Alkan et al, 2009], [Sudmant et al, 2010], [Yoon et al, 2009]). • Based on paired end reads (PEMer, CNVer, VariationHunter, MoDIL, Breakdancer, tools of [Sindi et al, 2009], [Quinlan et al, 2010]). • Based on split-read alignments (Pindel, Mosaik, tool of [Mills et al, 2006]).

Read Count Approaches • Measuring the amount of reads mapped to a location in the reference genome. • Identify CNV regions • Estimating Copy Number • Some use a sliding window. • Problems when coverage is not uniform within a CNV…

Paired End Reads Approaches • Mate/Paired End Reads are mapped on the reference genome.

Split Read Approach • A read is not mapped in a single locations because of possible structural variation. • All un-aligned reads are split and then mapping is sought again. • Iterate to find the actual breakpoint.

NGS Data Analysis for Structural Variants & Copy Number Variations Detection