1 / 13

MCB3895-004 Lecture # 14 Oct 21/14

MCB3895-004 Lecture # 14 Oct 21/14. Genome alignment. Genome alignment. Basic problem: align long DNA sequences to each other Difficult to do computationally efficiently Note: different from BLAST, which is designed to find homologs between a query sequence and a database

Download Presentation

MCB3895-004 Lecture # 14 Oct 21/14

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MCB3895-004 Lecture #14Oct 21/14 Genome alignment

  2. Genome alignment • Basic problem: align long DNA sequences to each other • Difficult to do computationally efficiently • Note: different from BLAST, which is designed to find homologs between a query sequence and a database • Poorly adapted to long sequences

  3. Basic approach • Find small regions between two genomes that are (near-)perfectly matched • Extend these seed alignments in both directions until alignment quality drops • Join overlapping alignments • Calculate summary stats

  4. MUMmer • The first widely-used genome alignment program, still current • Delcher et al. 2002 Nuc Acids Res 30:2478-2483 • Kurtz et al. 2004 Genome Biol. 5:R12 • http://mummer.sourceforge.net/

  5. MUMmer • nucmer: nucleotide-nucleotide comparisons • promer: genome comparisons using translated protein sequences • Useful for comparing evolutionarily more divergent genomes • Protein sequences are more conserved than nucleotide sequences

  6. Step #1: perform alignment • Command: $ nucmer <ref genome> <query genome> • e.g., $ nucmer GCF_000006765.1_ASM676v1_genomic.fna GCF_000152525.1_ASM15252v1_genomic.fna • produces out.delta output file by default (can be changed using -p flag)

  7. Step #2: restrict output to non-overlapping hits • Command: $ delta-filter -1 <input .delta file> > <output file> • e.g., $ delta-filter -1 out.delta > filtered.delta

  8. Step #3: create table of aligned regions • Command: $ show-cords <filtered delta file> > <output file> • e.g., $ show-cords filtered.delta > filtered.coords [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS] ===================================================================================== 1 51709 | 4285564 4233865 | 51709 51700 | 99.42 | NC_002516.2 NZ_CH482383.1 54035 54174 | 1574855 1574716 | 140 140 | 92.86 | NC_002516.2 NZ_CH482383.1 54038 54839 | 4232157 4231356 | 802 802 | 99.38 | NC_002516.2 NZ_CH482383.1 54114 54838 | 4229635 4228911 | 725 725 | 99.04 | NC_002516.2 NZ_CH482383.1 56087 68711 | 4228361 4215746 | 12625 12616 | 99.10 | NC_002516.2 NZ_CH482383.1 56086 56366 | 4232157 4231877 | 281 281 | 92.17 | NC_002516.2 NZ_CH482383.1 69537 88265 | 4215638 4196909 | 18729 18730 | 99.70 | NC_002516.2 NZ_CH482383.1 89162 155482 | 4196024 4129674 | 66321 66351 | 99.51 | NC_002516.2 NZ_CH482383.1 155580 245829 | 4129573 4039375 | 90250 90199 | 99.53 | NC_002516.2 NZ_CH482383.1 245830 260263 | 4038886 4024462 | 14434 14425 | 99.52 | NC_002516.2 NZ_CH482383.1 260434 262436 | 4024296 4022294 | 2003 2003 | 98.55 | NC_002516.2 NZ_CH482383.1 262478 264652 | 4022300 4020126 | 2175 2175 | 98.80 | NC_002516.2 NZ_CH482383.1 264639 264753 | 4020242 4020126 | 115 117 | 94.87 | NC_002516.2 NZ_CH482383.1 264841 286245 | 4020242 3998837 | 21405 21406 | 99.59 | NC_002516.2 NZ_CH482383.1 286332 291173 | 3998245 3993402 | 4842 4844 | 96.17 | NC_002516.2 NZ_CH482383.1 291398 370537 | 3993171 3914014 | 79140 79158 | 99.40 | NC_002516.2 NZ_CH482383.1 294592 297831 | 506430 509669 | 3240 3240 | 99.14 | NC_002516.2 NZ_CH482383.1

  9. Step #4: create table of SNPs and indels in aligned regions • SNP: "single nucleotide polymorphism" • indel: "insertion or deletion" • Command: $ show-snps <filtered delta file> > <output file> • e.g., $ show-snpsfiltered.delta > filtered.snps [P1] [SUB] [P2] | [BUFF] [DIST] | [R] [Q] | [FRM] [TAGS] ================================================================================ 1553 C T 4284012 | 72 1553 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 1625 T C 4283940 | 72 1625 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 1798 . C 4283766 | 173 1798 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 2412 A G 4283152 | 597 2412 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3009 A C 4282555 | 399 3009 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3408 C T 4282156 | 30 3408 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3438 C A 4282126 | 18 3438 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3456 C T 4282108 | 18 3456 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3480 G A 4282084 | 15 3480 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3495 G A 4282069 | 12 3495 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3507 C T 4282057 | 12 3507 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3567 G C 4281997 | 60 3567 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3684 C T 4281880 | 108 3684 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3792 A C 4281772 | 81 3792 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3873 T C 4281691 | 18 3873 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1

  10. Applications • MUMmer is great for speed and easy-to-parse output tables • Not so much for graphical outputs • Mauve is another genome aligner that I prefer for graphics • Especially good for 1-to-many alignments • Darling et al. 2004 Genome Res. 14:1394-1403 • Darling et al. 2010 PLoSONE 5:e11147 • http://gel.ahabs.wisc.edu/mauve/

  11. Mauve example • Collinear gene blocks linked between genomes • Darling et al. 2004 Genome Res. 14:1394-1403

  12. Assignment #6 • Align the P. aeruginosa C3719 genome to PAO1 using nucmer • Write a Perl script that calculates the proportion of the C3719 genome that aligns to PAO1 • Write a second Perl script that separately counts the number of SNPs and indels between the C3719 and PAO1 genomes.

  13. Also: To be done right now! • Register for a PacBio SMRT portal account on the Biotech server at this link: http://bbcsrv3.biotech.uconn.edu:8080/smrtportal/

More Related