Mapping NGS sequences to a reference genome

Why? • Resequencing studies (DNA) • Structural variation • SNP identification • RNAseq • Mapping transcripts to a genome sequence • Genome annotation • Transcript enumeration • Identification of splice junctions/variants

BLAST is too slow • Different alignment algorithms are necessary • Burrows-Wheeler alignment • Sequence database (genome) is transformed to produce an index • Individual sequence reads are searched against this index • STAR aligner (Dobin et al. 2012, Bioinformatics) • Uncompressed suffix arrays

BWT of “banana”
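As a minimal sketch of what the transform looks like, the snippet below computes the BWT of "banana" by sorting all rotations of the string and reading off the last column. The "$" end-of-string marker and the function name `bwt` are assumptions made for illustration; real aligners build the index from a suffix array rather than sorting full rotations.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Return the Burrows-Wheeler transform of `text` via sorted rotations."""
    s = text + terminator  # terminator sorts before the letters a-z in ASCII
    # Build every rotation of the string and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))  # annb$aa
```

Identical characters cluster together in the output, which is what makes the transformed genome both compressible and searchable.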

TopHat2 • Based on the Bowtie alignment engine • Bowtie: matching with no gaps • TopHat2: gapped matches • Aligns reads to a Burrows-Wheeler-transformed index of the genome • 1st pass: non-gapped matches • 2nd pass: splits unmapped reads and attempts to align the fragments

The STAR Aligner • Start at the first base of the sequence read • Find the Maximal Mappable Prefix (MMP) • Repeat the process using the unmapped portion of the read • Reported to be more than 50x faster than other aligners (Dobin et al.)
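The MMP steps above can be sketched as follows. This is a toy illustration only: it uses plain substring search (`str.find`) where STAR searches a suffix array, it requires exact matches, and the function names and the example genome are hypothetical.

```python
def maximal_mappable_prefix(read: str, genome: str) -> tuple[int, int]:
    """Return (length, genome_position) of the longest read prefix found exactly in genome."""
    best_len, best_pos = 0, -1
    for length in range(1, len(read) + 1):
        pos = genome.find(read[:length])
        if pos == -1:
            break  # longer prefixes cannot match either
        best_len, best_pos = length, pos
    return best_len, best_pos


def split_read(read: str, genome: str) -> list[tuple[int, int, int]]:
    """Repeat the MMP search on the unmapped remainder of the read.

    Returns (read_offset, length, genome_position) segments (0-based).
    A base that matches nowhere is emitted as a length-1 segment at position -1.
    """
    segments = []
    offset = 0
    while offset < len(read):
        length, pos = maximal_mappable_prefix(read[offset:], genome)
        if length == 0:
            segments.append((offset, 1, -1))
            offset += 1
        else:
            segments.append((offset, length, pos))
            offset += length
    return segments


# Hypothetical genome: two "exons" separated by a run of C's.
genome = "AAACGTACGT" + "CCCCCCCC" + "TTGACCAGG"
read = "ACGTACGT" + "TTGACCAGG"  # a read spanning the junction
print(split_read(read, genome))
```

The read splits into two segments that map to either side of the C run, which is how a spliced read reveals a junction.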

OUTPUTS • TopHat (Bowtie) • .bam file (binary alignment/map) • .sam (sequence alignment/map) • Single .sam file entry:
I8MVR:53:837 0 17_dna:chromosome 14090858 255 21M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XR:Z:CT XG:Z:CT

.sam fields
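Each alignment line has eleven mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional TAG:TYPE:VALUE entries. The sketch below splits the example record from these slides into those fields. Real .sam files are tab-delimited; splitting on any whitespace is an assumption made here because the slide text lost its tabs, and the `SamRecord` and `parse_sam_line` names are illustrative.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate ("*" if unavailable)
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities
    tags: dict   # optional fields keyed by two-letter tag


def parse_sam_line(line: str) -> SamRecord:
    fields = line.split()
    tags = {}
    for entry in fields[11:]:
        name, value_type, value = entry.split(":", 2)
        tags[name] = int(value) if value_type == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


line = ("I8MVR:53:837 0 17_dna:chromosome 14090858 255 21M * 0 0 "
        "TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- "
        "NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XR:Z:CT XG:Z:CT")
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, record.tags["NM"])
```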

.sam flags • 1 • 2 • 1+2 • 0+4 • 1+4 • 0+2+4 • 1+2+4 • 0+8 • 1+8 • 0+2+8 • 1+2+8 • 0+4+8 • 1+4+8 • 0+2+4+8 • 1+2+4+8 • …etc.
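The FLAG value is a sum of powers of two, so each combination in the list above corresponds to a set of bits. A small sketch of decoding a flag into its component bits follows; the bit meanings are those of the SAM specification, while the wording of the labels and the `decode_flag` name are illustrative.

```python
FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read on reverse strand",
    32: "mate on reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "secondary alignment",
    512: "fails quality checks",
    1024: "PCR or optical duplicate",
    2048: "supplementary alignment",
}


def decode_flag(flag: int) -> list[str]:
    """Return the description of every bit that is set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


for value in (0, 1 + 2, 1 + 4 + 8, 99):
    print(value, decode_flag(value))
```

A flag of 0, as in the example record on these slides, means that none of the bits are set: an unpaired read mapped to the forward strand.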

CIGAR format
I8MVR:104:144 0 7_dna:chromosome 120102744 255 62M1I14M * 0 0 GGTTTTTTGGAAGAGTAGTTCGCGTTTCATTAATTAGTTATTTTTTAGTTTTTAAATAAAATAAAATTTTAAAAAAA
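The CIGAR string in this record, 62M1I14M, reads as 62 aligned bases, 1 base inserted relative to the reference, then 14 more aligned bases. A minimal sketch of parsing it and working out how many read bases and reference bases it covers (function names are illustrative):

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")       # operations that advance along the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def cigar_lengths(cigar: str) -> tuple[int, int]:
    """Return (read_bases, reference_bases) covered by the alignment."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return read_len, ref_len


print(parse_cigar("62M1I14M"))    # [(62, 'M'), (1, 'I'), (14, 'M')]
print(cigar_lengths("62M1I14M"))  # (77, 76)
```

The read length of 77 matches the 77-base sequence in the record, while the alignment spans only 76 reference bases because the inserted base has no counterpart in the genome.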

Quantifying alignments • How many reads overlap a given interval on a chromosome (scaffold)? • How do these regions correspond to known genes? • .gtf file • How many transcripts from my gene of interest? • How confident can I be about a variant call?
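The first question above reduces to an interval-overlap test. The sketch below counts alignments that overlap a region, using the two example records from these slides reduced to (reference, start, end) tuples with 1-based inclusive coordinates. The end positions are derived from the CIGAR strings, and the query interval is hypothetical.

```python
def count_overlaps(alignments, chrom, start, end):
    """Count alignments that share at least one base with chrom:start-end."""
    return sum(
        1
        for a_chrom, a_start, a_end in alignments
        if a_chrom == chrom and a_start <= end and a_end >= start
    )


alignments = [
    ("17_dna:chromosome", 14090858, 14090878),   # 21M starting at 14090858
    ("7_dna:chromosome", 120102744, 120102819),  # 62M1I14M spans 76 reference bases
]

# Hypothetical region of interest on chromosome 17.
print(count_overlaps(alignments, "17_dna:chromosome", 14090000, 14091000))
```

In practice the intervals would come from a .gtf file and the alignments from a .bam file; this only shows the comparison that a counting tool performs.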

Annotate regions - GTF files • GTF fields • Sequence ID • Source • Feature • Start • End • Score • Strand • Frame • Attribute
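A GTF line carries these nine fields separated by tabs, with the attribute column holding key "value"; pairs. The sketch below parses one line into a dictionary. The example line, its gene and transcript identifiers, and the `parse_gtf_line` name are made up for illustration.

```python
import re

GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute")
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line: str) -> dict:
    """Parse one tab-delimited GTF line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 columns, found {len(values)}")
    record = dict(zip(GTF_COLUMNS, values))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])
    record["attribute"] = dict(ATTRIBUTE.findall(record["attribute"]))
    return record


# Hypothetical annotation line.
example = ('17\texample_source\texon\t14090800\t14090900\t.\t+\t.\t'
           'gene_id "geneA"; transcript_id "geneA.1";')
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"], record["attribute"]["gene_id"])
```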

Variant Calling • The .bam/.sam file contains the read-level information needed to call variants • Variant calls can’t be extracted from the .bam file alone • The reference genome sequence must also be provided
I8MVR:53:837 0 17_dna:chromosome 14090858 255 21M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XR:Z:CT XG:Z:CT
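To see why the genome sequence is needed, consider a toy comparison: the alignment record gives the read bases and where they map, but the reference bases at those positions come from the genome. The sketch below lists mismatches for an ungapped (all-M) alignment. The reference slice, coordinates, and read are invented, and a real variant caller also weighs base quality, mapping quality, and depth across many reads.

```python
def mismatches(reference: str, ref_start: int, read_pos: int, read_seq: str):
    """List (position, ref_base, read_base) differences for an ungapped alignment.

    `reference` is a slice of the genome whose first base is at 1-based
    coordinate `ref_start`; `read_pos` is the 1-based POS of the read.
    """
    differences = []
    for i, read_base in enumerate(read_seq):
        ref_base = reference[read_pos - ref_start + i]
        if ref_base != read_base:
            differences.append((read_pos + i, ref_base, read_base))
    return differences


reference = "ACGTACGTAC"  # hypothetical genome slice starting at position 100
print(mismatches(reference, 100, 102, "GTTCG"))  # [(104, 'A', 'T')]
```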

Today’s exercises

Variant Analysis • Extract variant information from provided .bam file • Examine output file and learn about the information contained in the various fields

Introducing… Dr. Eric Rouchka • Bioinformatics Core Director • Department of Computer Engineering and Computer Science • University of Louisville • Kentucky Biomedical Research Infrastructure Network
