
Sequence Clustering

Sequence Clustering: Reducing Search Space in Protein and DNA/RNA Sequence Analysis. Denis Kaznadzey, GBP. MGM Workshop, September 26, 2011.


Presentation Transcript


  1. Sequence Clustering: Reducing Search Space in Protein and DNA/RNA Sequence Analysis. Denis Kaznadzey, GBP. MGM Workshop, September 26, 2011.

  2. Sequence clustering. To deal with a huge variety of individual ‘objects’: • Classify into groups of essentially similar objects • When new data arrives, assign objects to existing groups • Classify the ‘leftovers’ • Occasionally review the entire classification • Problem: what is ‘essentially similar’? • Finding the properties that are important (ontological relevance) • Does the classification reflect reality in any way?

  3. Sequence clustering. Taxonomical classification vs. continuity of the Great Chain of Being. Even if reductionist, classification is a tool to study the world – biology in particular. When data is incomplete, any classification is a convention. At the same time, it is an approximation of a “reality”. (Portraits: Carl Linnaeus, Georges Buffon.)

  4. Sequence clustering • In modern biology, the most abundant type of data is sequence: • Genomic DNA • RNA (through RNA-Seq) • Derived proteins • The primary feature is primary structure, but classification criteria depend on the application.

  5. Sequence clustering. Select applications in genomic sciences: • Genome assembly: binning, scaffolding • Transcriptomics: EST (read) clustering • Protein function and evolution studies: protein families • Phylogenetic profiling: OTUs

  6. Sequence clustering • In metagenomics, the primary tasks are: • Assess diversity • Find genes • Predict functions • Predict pathways • Estimate capabilities. All are based on sequence comparison.

  7. Sequence clustering • Any clustering is based on a distance under some metric. • Initial clustering is based on pair-wise distances. • Subsequent classification is based on the distance from an object to a cluster, measured against: • A representative • A set of representatives (all members, at the extreme) • Another measure, which may be unrelated to the initial one.

  8. Sequence clustering • Once a distance measure is chosen and the distances are obtained or computed: • There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology) • k-means, average linkage, complete linkage, single linkage, iterative, SOM, etc. • However, options for large-volume clustering are limited by the performance of the algorithms. • Single linkage can be computed very efficiently • (The method for pledging new sequences to clusters may be more computationally intensive)

  9. Sequence clustering • Most efficient clustering: transitive-closure based. • Requires ‘boolean’ distances (two sequences are either linked or not linked) • Requires the number of nodes to be known • Space ~ NodesNo • Run-time (worst case) ~ EdgesNo * AveClustSize • Run-time (average) ~ EdgesNo * log2(AveClustSize)

  10. Sequence clustering • Practical transitive-closure algorithm:
      • Allocate an array of sequence numbers A[0..N-1], with A[i] = i initially
      • Phase I: connect linked vertices through the vertex of smallest index
        For each edge (m, n):
            While A[n] != n: n = A[n]
            While A[m] != m: m = A[m]
            A[max(m, n)] = min(m, n)
      • Phase II: propagate smallest indices as cluster identifiers
        For each n from 0 to N-1:
            If A[n] != A[A[n]]: A[n] = A[A[n]]
      • Phase III: collect clusters (implementation dependent)
        Count the number of distinct cluster “id”s => M (1 pass)
        Allocate an array of sizes; count the size of each cluster (1 pass)
        Allocate an array of clusters; fill it in (1 pass)
      • Example: adding edges (1,3), (5,6), (6,1) yields clusters (0); (1,3,5,6); (2); (4)
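The three phases of the transitive-closure algorithm can be sketched in Python roughly as follows. This is a minimal illustration, not the author's implementation: the function name `transitive_closure_clusters` and the use of a dictionary for Phase III are assumptions (the slide leaves Phase III implementation dependent).

```python
def transitive_closure_clusters(num_nodes, edges):
    """Single-linkage clusters from boolean links (hypothetical helper name)."""
    # parent plays the role of array A: every node starts as its own root.
    parent = list(range(num_nodes))

    # Phase I: follow both endpoints to their roots, then hang the root
    # with the larger index under the root with the smaller index.
    for m, n in edges:
        while parent[n] != n:
            n = parent[n]
        while parent[m] != m:
            m = parent[m]
        parent[max(m, n)] = min(m, n)

    # Phase II: a parent always has a smaller index than its child, so one
    # ascending pass leaves every node pointing at its cluster's smallest index.
    for n in range(num_nodes):
        parent[n] = parent[parent[n]]

    # Phase III (assumed implementation): group nodes by cluster identifier.
    clusters = {}
    for node, root in enumerate(parent):
        clusters.setdefault(root, []).append(node)
    return list(clusters.values())


# The edges from the slide's example, on seven nodes numbered 0..6.
print(transitive_closure_clusters(7, [(1, 3), (5, 6), (6, 1)]))
```

With these inputs the sketch reproduces the grouping shown on the slide: (0); (1,3,5,6); (2); (4).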

  11. Sequence clustering • Computing ‘boolean’ distances: • Threshold-based • Additional rules (match arrangement) • Example: read/EST clustering • A pair is linked when % identity, match length, and match arrangement are all OK
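A boolean link rule of the kind described for read/EST clustering might look like the sketch below. Everything specific here is an assumption made for illustration: the `Hit` record, its coordinate convention (0-based, end-exclusive, same strand), the threshold values, and the particular arrangement test, which accepts only end overlaps and containments by rejecting matches that leave long unaligned tails on both sequences at the same side.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise match between sequences a and b (hypothetical record)."""
    identity_pct: float  # percent identity over the aligned region
    a_start: int         # 0-based start of the match on a
    a_end: int           # end-exclusive end of the match on a
    a_len: int           # full length of a
    b_start: int
    b_end: int
    b_len: int


def is_linked(hit, min_identity=95.0, min_length=40, max_overhang=10):
    """Return True if the match passes identity, length and arrangement rules.

    The default thresholds are illustrative, not recommended values.
    """
    aligned = min(hit.a_end - hit.a_start, hit.b_end - hit.b_start)
    if hit.identity_pct < min_identity or aligned < min_length:
        return False
    # Arrangement: on each side of the match, at most one sequence may extend
    # past it. Unaligned tails on both sequences at the same side suggest an
    # internal match (for example a shared repeat) rather than a true overlap.
    left = min(hit.a_start, hit.b_start)
    right = min(hit.a_len - hit.a_end, hit.b_len - hit.b_end)
    return left <= max_overhang and right <= max_overhang
```

The resulting True/False values are exactly the linked / not linked edges that a transitive-closure clustering consumes.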

  12. Computing similarity measure: • Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc. • Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee • K-mer statistics: CD-HIT, USEARCH, MUSCLE • Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ • Suffix arrays: Bowtie, BWT • Position-specific scoring matrix: PSI-BLAST, IMPALA • Hidden Markov models: HMMER, HHsearch/HHpred, SAM

  13. Sequence clustering • Computing distances is harder than clustering. (Bundled solutions: BLASTClust, CD-HIT, UCLUST, CLUSEQ) • For large data sets only k-mer and suffix-array measures are practical. • However, incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible. • For boolean distances, iterative similarity detection is possible: fast binning, then slow comparison. (No off-the-shelf implementations(?))
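The greedy, incremental idea can be illustrated with a short sketch in which each sequence is compared only against the current cluster representatives, so the full pairwise distance matrix is never built. This is a simplified illustration and not the algorithm of CD-HIT, UCLUST, or any other tool named above; the Jaccard index over k-mer sets, the function names, and the default `k` and `threshold` are all assumptions chosen for brevity.

```python
def kmers(seq, k=4):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=4):
    """Jaccard index of the two k-mer sets (an assumed, simple measure)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def greedy_cluster(seqs, threshold=0.5, k=4):
    """Assign each sequence to the first representative it matches.

    Returns a list of clusters, each a list of indices into seqs; the first
    index in every cluster is that cluster's representative.
    """
    # Process longest sequences first so they become the representatives.
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    representatives = []
    clusters = []
    for i in order:
        for c, rep in enumerate(representatives):
            if kmer_similarity(seqs[i], seqs[rep], k) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No representative is similar enough: start a new cluster.
            representatives.append(i)
            clusters.append([i])
    return clusters
```

The number of comparisons grows with the number of clusters rather than with the square of the number of sequences, which is what leaves room for swapping in a slower, more sensitive similarity function in place of `kmer_similarity`.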

  14. Sequence clustering • Boolean distance clustering killer: • CLUSTER AGGREGATION. • In large clusters, even a small number of random links leads to huge conglomerates.
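A toy simulation gives a feel for cluster aggregation. It is a sketch under loud assumptions: the true clusters are equal-sized chains, the spurious links are drawn uniformly at random between any two sequences, and the names `largest_cluster` and `simulate` are hypothetical. Real contamination, repeats, and chimeras are not uniformly distributed.

```python
import random
from collections import Counter


def largest_cluster(num_nodes, edges):
    """Size of the largest single-linkage cluster (union-find with path halving)."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return max(Counter(find(x) for x in range(num_nodes)).values())


def simulate(num_clusters=1000, size=100, spurious=200, seed=0):
    """Return (largest cluster without noise, largest cluster with noise)."""
    rng = random.Random(seed)
    n = num_clusters * size
    # True links: each cluster is a chain of consecutive node indices.
    true_edges = [(c * size + i, c * size + i + 1)
                  for c in range(num_clusters) for i in range(size - 1)]
    # Spurious links: uniformly random pairs (an assumption of this toy model).
    noise = [(rng.randrange(n), rng.randrange(n)) for _ in range(spurious)]
    return largest_cluster(n, true_edges), largest_cluster(n, true_edges + noise)


if __name__ == "__main__":
    before, after = simulate()
    print("largest cluster without spurious links:", before)
    print("largest cluster with spurious links:   ", after)
```

Because every spurious link merges two entire clusters, the number of sequences swept into a conglomerate scales with cluster size, not with the number of bad links.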

  15. Common causes: • Contamination with standard constructs • Repeats • Chimeras • Spurious similarities (low-complexity zones, etc.)

  16. Sequence clustering • Fighting aggregation • Vector / adapter trimming: • Lucy, Figaro, etc. Integrated in many assembly suites (Newbler, Velvet, AMOS, CLCbio, etc.) • Low-complexity detection / masking: • SEG, DUST, FastQC, WindowMasker, etc. – often integrated in search tools.
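To make the idea of low-complexity masking concrete, here is a minimal sliding-window sketch based on Shannon entropy. It is not the SEG, DUST, or WindowMasker algorithm; the entropy criterion, the window size, the threshold, and the function names are assumptions chosen only to illustrate the principle of masking compositionally biased regions before a similarity search.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy, in bits, of the character composition of window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=1.0, mask_char="N"):
    """Replace every window whose entropy is below min_entropy with mask_char.

    Sequences shorter than the window are returned unchanged.
    """
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            masked[i:i + window] = mask_char * window
    return "".join(masked)
```

Masked positions no longer seed matches, so a homopolymer run or simple repeat shared by unrelated sequences cannot create a spurious link between their clusters. Note that windows straddling the edge of a low-complexity run can mask a few flanking residues as well.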

  17. Sequence clustering • Repeat detection / masking: • Regular (tandem) repeats: • Pre-search masking: based on structure (IMEx, SRF) or on a database (TRDB) • Post-search detection based on similarity properties (multiple parallel threads) • Irregular (long) repeats: • Database-based: RepeatMasker • De novo: RepeatScout, orrb, PILER, etc. These require a genome as input and construct a database.

  18. Sequence clustering • Detecting chimeric sequences: • Abundance-based: Perseus, UCHIME • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangement are more frequent. • Specific to 16S: ChimeraSlayer, Bellerophon • Chimera ‘arms’ are closer to their originating phyla than the entire chimera is

  19. Sequence clustering • Detecting chimeric sequences • Similarity coverage based: Mira assembler

  20. Sequence clustering • Detecting chimeric sequences • Similarity graph topology based: dchim (figure panels: alignment view, connectivity view)

  21. Protein Clusters: various criteria • Primary structure similarity • Close evolutionary relationship • Similarity in physical properties • 3-D structure similarity • Similar fold arrangement • Domain structure similarity • Common or similar functions • etc.

  22. Sequence clustering • Functional and structural classifications in IMG

  23. Sequence clustering • A direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species • Position-specific scoring matrices and profile HMMs provide better sensitivity, but are MUCH SLOWER. • For individual genomes (10^3 to 5 x 10^4 proteins) they could be used with massively parallel computations (while the number of genomes is within thousands) • For metagenomes they cannot be used with foreseeable computing resources.

  24. Sequence clustering • Functional annotation of metagenome genes through protein clusters (under development): • Build a set of functionally homogeneous clusters of similar proteins for annotated genomes • Build an HMM for each cluster and compose a model database • Pledge metagenome proteins to clusters by matching them to the models • Cluster the unpledged proteins, build models, and update the model database. • Balance the model database by creating a model tree: aggregating small related clusters and dissecting large ones. • Perform hierarchical searches through the profile tree.

  25. Sequence clustering • Clustering reduces the search space, but adds another level of indirection, which is a source of errors, and complexity, which consumes effort. • It improves only searches within the parameter space used for clustering • (structure-based clusters are not useful when searching for a certain codon usage, etc.)

  26. However, for proteins, which form dense relationship networks, clustering is a great tool.

  27. Thank you!
