Low-complexity and Repetitive Regions

OraLee Branch and John Wootton, NCBI (branch@ncbi.nlm.nih.gov).

Presentation Transcript


  1. Low-complexity and Repetitive Regions • OraLee Branch • John Wootton • NCBI • branch@ncbi.nlm.nih.gov

  2. Sequence Composition • DNA Sequences • What would be the expected number of occurrences of a particular sequence in a genome? • Size: human genome 6 × 10^9, considering both strands • Base frequency: equal • Sequence length: 20 nucleotides • Bernoulli model: 6 × 10^9 / 4^20 ≈ 0.005 • But: • (GT)n with n > 10: 10^5 occurrences
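
The Bernoulli estimate on this slide can be reproduced in a few lines. This is a minimal sketch that assumes independent, equally frequent bases; the function name `expected_occurrences` is illustrative and does not come from the slides.

```python
def expected_occurrences(genome_size: float, word_length: int,
                         alphabet_size: int = 4) -> float:
    """Expected count of one specific word under a uniform Bernoulli model."""
    # Probability that a given position starts the word: (1/4)^20 for DNA 20-mers.
    p_word = (1.0 / alphabet_size) ** word_length
    # Edge effects are ignored: the genome is far longer than the word.
    return genome_size * p_word


# Human genome, both strands, as on the slide: about 6e9 positions.
expected = expected_occurrences(6e9, 20)
print(f"expected occurrences of a given 20-mer: {expected:.4f}")
```

A specific 20-mer is therefore expected far less than once, which is why roughly 10^5 occurrences of (GT)n with n > 10 cannot be explained by this model.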

  3. Low-complexity Regions • Simple Sequence Regions (SSR) • Micro- or minisatellites • Regions that have significant biases in AA or nucleotide composition: repeats of simple motifs • (GT)n (AAC)n (P)n (NANP)n • Low-Complexity Regions/Segments • Complexity can be measured by Shannon’s Entropy • Regarding an amino acid sequence • For each composition of a complexity state, there exists a large number of possible sequences
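
To make the entropy measure concrete, here is a small sketch that computes Shannon's entropy (in bits) for short stretches built from the motifs named on the slide. The 12-residue strings and the helper name `shannon_entropy` are illustrative assumptions; the last string is the start of the prion sequence shown on slide 7.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits of the residue composition of seq."""
    n = len(seq)
    # sum of p * log2(1/p) over the residue types that are present
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


examples = {
    "(GT)n": "GTGTGTGTGTGT",
    "(P)n": "PPPPPPPPPPPP",
    "(NANP)n": "NANPNANPNANP",
    "mixed": "MANLGCWMLVVF",
}
for label, seq in examples.items():
    print(f"{label:8s} {seq}  entropy = {shannon_entropy(seq):.2f} bits")
```

The repeats score between 0 and 1.5 bits, while the mixed stretch scores about 3.1 bits, so a simple threshold on this value already separates them.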

  4. Low-Complexity Regions • Locally abundant residues may be • continuous or loosely clustered, irregular or aperiodic • >25% of the amino acids in currently sequenced genomes are in low-complexity regions • non-globular domains  SSR • Examples: myosins, pilins, segments in antigens, short subsequences of 10-50 residues with unknown function • Beta-pleated sheets • Alpha helices • Coiled-coils

  6. Detecting Low-Complexity • SEG and PSEG/NSEG algorithms • Wootton and Federhen • Methods in Enzymology 266:33 (1996) • Computers and Chemistry 17:149 (1993) • SEG • UNIX executable available on NCBI servers • seg FASTAfile Window TriggerComplexity ExtensionComplexity, where the trigger complexity is K2(1) and the extension complexity is K2(2) • Longer window lengths define more sustained regions, but overlook short biased subsequences

  7. clobber> seg hu.piron.fa 12 2.20 2.50
     >gi|730388|sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP)

                                     1-49     MANLGCWMLVVFVATWSDLGLCKKRPKPGG
                                              WNTGGSRYPGQGSPGGNRY
     ppqggggwgqphgggwgqphgggwgqphgg  50-86
                            gwgqggg
                                     87-104   THNQWHKPSKPKTSMKHM
            agaaaagavvgglggymlgsams  105-127
                                     128-179  RPLIHFGNDYEDRYYRENMYRYPNQVYYRP
                                              VDQYSNQNNFVHDCVNITIKQH
                     tvttttkgenftet  180-193
                                     194-228  DVKMMERVVEQMCITQYEKESQAYYQRGSS
                                              MVLFS
                   sppvillisflifliv  229-244
                                     245-245  G

     clobber> seg hu.piron.fa 12 2.20 2.50 -l
     >gi|730388|sp|P40250|PRIO_CERAE(50-86) complexity=1.90 (12/2.20/2.50)
     ppqggggwgqphgggwgqphgggwgqphgggwgqggg

     >gi|730388|sp|P40250|PRIO_CERAE(105-127) complexity=2.47 (12/2.20/2.50)
     agaaaagavvgglggymlgsams

     >gi|730388|sp|P40250|PRIO_CERAE(180-193) complexity=2.26 (12/2.20/2.50)
     tvttttkgenftet

     >gi|730388|sp|P40250|PRIO_CERAE(229-244) complexity=2.50 (12/2.20/2.50)
     sppvillisflifliv

  8. SEG on the prion protein (hu.piron.fa) with different window lengths • question-based – exploratory tool – optimization step

  9. Detecting Low-Complexity • Intuitive explanation • Take a 20-residue long sequence • (20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) • ( 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ) • ( 3 3 3 3 3 2 2 2 2 1 1 0 0 0 0 0 0 0 0 0) • Complexity can be described by Shannon’s Entropy (K2) • Regarding an amino acid sequence • For each composition of a complexity state, there exists a large number of possible sequences (K1)
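
The three composition vectors on this slide can be scored directly. The sketch below computes Shannon's entropy (K2, in bits) for each vector and counts how many distinct sequences share that composition, which is the quantity the slide associates with K1. The function names are illustrative, and any normalization that SEG applies to K1 is not reproduced here. The third vector as printed sums to 25 rather than 20, so the length is taken from the counts themselves.

```python
import math


def k2_entropy(counts):
    """Shannon entropy in bits of a composition vector (residue counts)."""
    length = sum(counts)
    return sum((c / length) * math.log2(length / c) for c in counts if c > 0)


def sequences_per_composition(counts):
    """Number of distinct sequences with this composition: L! / (n1! * n2! * ...)."""
    total = math.factorial(sum(counts))
    for c in counts:
        total //= math.factorial(c)  # exact at every step (multinomial coefficient)
    return total


compositions = [
    [20] + [0] * 19,                          # one residue type only
    [1] * 20,                                 # every residue type exactly once
    [3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1] + [0] * 9,  # intermediate case, as printed
]
for counts in compositions:
    print(f"L={sum(counts):2d}  K2={k2_entropy(counts):.2f} bits  "
          f"sequences={sequences_per_composition(counts):.3e}")
```

The first vector allows exactly one sequence and has zero entropy; the uniform vector allows 20! sequences and reaches the maximum of log2(20) bits.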

  10. How SEG works • seg FASTAfile Window TriggerComplexity ExtensionComplexity, where the trigger complexity is K2(1) and the extension complexity is K2(2) • Looks within each window of the given length: if complexity ≤ K2(1), the window triggers a segment, which is then extended in both directions while complexity stays ≤ K2(2) • Uniform prior probabilities • The protein sequence database is a heterogeneous statistical mixture, so the initially unknown AA frequencies in low-complexity subsets need have no similarity to the frequencies in the total database • Unbiased view of low-complexity regions • Gives equiprobable compositions for any complexity state
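
The trigger-and-extend idea can be sketched as follows. This is a simplified illustration, not the SEG program: it scores every window by Shannon entropy, joins runs of overlapping windows at or below the extension threshold, and keeps a run only if at least one of its windows is at or below the trigger threshold. SEG's later refinement of segment boundaries is omitted, and the function names and the toy sequence are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(window: str) -> float:
    """Shannon entropy in bits of the residue composition of a window."""
    n = len(window)
    return sum((c / n) * math.log2(n / c) for c in Counter(window).values())


def low_complexity_segments(seq, window=12, trigger=2.2, extension=2.5):
    """Return 1-based inclusive (start, end) ranges of candidate segments."""
    if len(seq) < window:
        return []
    scores = [entropy(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    segments = []
    i = 0
    while i < len(scores):
        if scores[i] > extension:
            i += 1
            continue
        # Walk through a run of consecutive windows at or below the extension cut.
        j, triggered = i, False
        while j < len(scores) and scores[j] <= extension:
            triggered = triggered or scores[j] <= trigger
            j += 1
        if triggered:
            # Windows i..j-1 cover residues i+1 .. (j-1)+window in 1-based terms.
            segments.append((i + 1, j - 1 + window))
        i = j
    return segments


# Toy sequence (hypothetical): a poly-Q tract between two mixed flanks.
toy = "MKVLAAGIWRTDEH" + "Q" * 20 + "MKVLAAGIWRTDEH"
print(low_complexity_segments(toy))
```

Because windows that only partly overlap the biased tract can still fall under the extension threshold, the reported range is somewhat wider than the tract itself.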

  11. How SEG works, continued • How do you correct for the background AA/nuc composition bias? • After randomly shuffling all the residues, determine the trigger complexity that results in 4% of the database being within low-complexity regions • Then use this trigger complexity and subtract 4% from the %AA found in low-complexity regions
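
The calibration step might be sketched like this. It is a hedged illustration under several assumptions: the database is treated as one concatenated string, "within low-complexity regions" is approximated as "covered by at least one window at or below the trigger", and the threshold is found by bisection. The names `calibrate_trigger`, `fraction_low`, and the toy data are hypothetical and are not part of SEG or SEALS.

```python
import math
import random
from collections import Counter


def entropy(window: str) -> float:
    n = len(window)
    return sum((c / n) * math.log2(n / c) for c in Counter(window).values())


def window_scores(seq, window):
    return [entropy(seq[i:i + window]) for i in range(len(seq) - window + 1)]


def fraction_low(scores, seq_len, window, trigger):
    """Fraction of residues covered by a window with entropy <= trigger."""
    covered = [False] * seq_len
    for start, score in enumerate(scores):
        if score <= trigger:
            covered[start:start + window] = [True] * window
    return sum(covered) / seq_len


def calibrate_trigger(seq, window=12, target=0.04, seed=0):
    """Largest trigger (by bisection) that keeps the shuffled data at or under target."""
    residues = list(seq)
    random.Random(seed).shuffle(residues)  # destroys order, keeps composition
    shuffled = "".join(residues)
    scores = window_scores(shuffled, window)
    lo, hi = 0.0, math.log2(window)  # entropy of a window cannot exceed log2(window)
    for _ in range(30):
        mid = (lo + hi) / 2
        if fraction_low(scores, len(shuffled), window, mid) > target:
            hi = mid
        else:
            lo = mid
    return lo, fraction_low(scores, len(shuffled), window, lo)


# Hypothetical toy "database" with a skewed residue composition.
toy_db = "".join(random.Random(1).choices(
    "NKIELSDYTVGA", weights=[6, 5, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1], k=1500))
trigger, shuffled_fraction = calibrate_trigger(toy_db)
observed = fraction_low(window_scores(toy_db, 12), len(toy_db), 12, trigger)
print(f"trigger = {trigger:.2f} bits")
print(f"corrected low-complexity fraction = {observed - 0.04:.3f}")
```

Shuffling preserves the overall composition, so whatever is still flagged afterwards reflects background bias alone; subtracting the 4% baseline leaves the excess attributable to residue order.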

  12. Detecting Low-complexity with repetitive motif: SSR • PSEG or NSEG • Repetition of residue types or k-grams • Period 3 (n V E n K N n V D n K D n V N n K S n K) (n m i n m i n m i n m i n m i n m i n m) (n m E n m N n m D n m D n m N n m S n m) • Sliding window along sequence in single residue steps
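
The period-3 example on this slide can be explored by splitting the segment into its three phases and scoring each phase separately. This only illustrates the idea of periodicity; it is not the PSEG or NSEG algorithm, and the helper names are assumptions.

```python
import math
from collections import Counter


def entropy(residues: str) -> float:
    """Shannon entropy in bits of a residue composition."""
    n = len(residues)
    return sum((c / n) * math.log2(n / c) for c in Counter(residues).values())


def split_by_phase(seq: str, period: int):
    """Residues at positions phase, phase + period, phase + 2 * period, ..."""
    return [seq[phase::period] for phase in range(period)]


# The 20-residue example from the slide, written in upper case.
segment = "NVENKNNVDNKDNVNNKSNK"
for phase, residues in enumerate(split_by_phase(segment, 3)):
    print(f"phase {phase}: {residues}  entropy = {entropy(residues):.2f} bits")
```

Phase 0 is entirely N and phase 1 alternates between V and K, while phase 2 is variable, which matches the n / m / i labeling shown on the slide.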

  13. Evolutionary Mechanisms • Evolution of sequences in general • Evolution rate of 10^-5 – 10^-9 • Base pair substitution (10^-9) • Insertions/deletions • Recombination • In SSR, low-complexity regions, mutations are in length, with steps typically +/- one repeat unit • Evolution rate 10^-3 • Biased nucleotide substitution due to increased recombination in repetitive regions • Unequal crossing over (recombination) • Replication slippage • Alignment of repeats does not imply relationships/ancestry

  14. Low-Complexity and BLAST searches • Low-complexity regions, with their biased AA/nuc composition, cause BLAST search results to be dominated by low-complexity matches • BLAST added “mask low-complexity” by default • Seg parameters: 12 2.2 2.5 • BLAST now also uses a compositional bias filter on the whole database • Masks compositionally biased regions using seg 10 1.8 2.1 • YOU MAY WANT TO TURN THESE OPTIONS OFF and use your own organism-specific seg parameters when doing protein homology searching • YOU WILL NEED TO TURN THESE OPTIONS OFF if you are interested in looking at sequence similarities of repetitive/low-complexity regions.

  15. Example: Plasmodium falciparum • Using whole genome sequences is important to limit PCR sequencing bias for antigens: hydrophilic proteins • Considering GC-content / AA bias • P. falciparum is approximately 28% GC • Visualization of individual proteins

  16. A helpful tool here and in general • SEALS: A System for Easy Analysis of Lots of Sequences, R. Walker and E. Koonin, NCBI • www.ncbi.nlm.nih.gov/CBBresearch/Walker/SEALS/index.html • Demonstrate getting an appropriate data set • Taxnode2gi, gi2fasta • Daffy • Purge • Gref • Fanot • Use cleaned data set of P. falciparum proteins

  17. Protein Analysis • Setting the trigger complexity: • Dbcomp • Shuffledb • Seg • Run SEG on P. falciparum MSP1, PfEMP2, Cg2 • Options • -p (tree form output) • -l (only report Low-C segs) • -h (don’t report Low-C segs) • -x (substitute Low-C with x) • Run PSEG on P. falciparum MSP1, PfEMP2, Cg2 with different -z (periodicity)
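
The effect of the -x option can be illustrated by masking reported segments with `x`. The sketch below is not SEG itself: it takes the prion sequence and the four low-complexity ranges from the slide 7 output and substitutes them by hand. The function name `mask_segments` is an assumption.

```python
def mask_segments(seq: str, segments, mask_char: str = "x") -> str:
    """Replace each 1-based inclusive (start, end) range with mask_char."""
    residues = list(seq)
    for start, end in segments:
        for pos in range(start - 1, end):
            residues[pos] = mask_char
    return "".join(residues)


# Prion protein precursor, reassembled from the slide 7 SEG output (245 residues).
prion = (
    "MANLGCWMLVVFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRY"
    "PPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGG"
    "THNQWHKPSKPKTSMKHM"
    "AGAAAAGAVVGGLGGYMLGSAMS"
    "RPLIHFGNDYEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNITIKQH"
    "TVTTTTKGENFTET"
    "DVKMMERVVEQMCITQYEKESQAYYQRGSSMVLFS"
    "SPPVILLISFLIFLIV"
    "G"
)
# Low-complexity ranges reported by seg 12 2.20 2.50 on slide 7.
low_complexity = [(50, 86), (105, 127), (180, 193), (229, 244)]

masked = mask_segments(prion, low_complexity)
for i in range(0, len(masked), 60):
    print(masked[i:i + 60])
```

A masked sequence of this kind is what a homology search sees when low-complexity filtering is switched on.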

  18. Usefulness of studying Low-Complexity • Within a protein: secondary structure, homology searches, protein location, genetic disorders • Within taxa: microsatellite markers, polymorphism comparisons between proteins • Between taxa: synteny, orthologs, different selection pressures upon different organisms • Parasites: immunogenicity, rapid evolution of antigens, recombination
