Sequence Clustering

Reducing Search Space in Protein and DNA/RNA Sequence Analysis

Denis Kaznadzey, Prokaryotic Super Program

MGM Workshop

January 30, 2011

Sequence Clustering Outline

Classification of Sequences

General Problem of Clustering

Distance Measures

Ab Initio Clustering and Pledging

Performance Considerations

Case Study: Transcriptomics

Introduction to Protein Clustering

Classification as Research Tool

To deal with a huge variety of individual objects:

  • Classify into groups of essentially similar objects
  • When new data arrives, assign objects to existing groups
  • Classify ‘leftovers’
  • Occasionally review the entire classification

Problem: What is ‘essentially similar’?

  • Finding properties that are important (ontological relevancy)
  • Does classification reflect reality in any way?
Classification

Ways to classify objects:

  • Spectral methods
  • Parametric decomposition
  • Clustering
Sequence Data Abundance

In modern biology, the most abundant type of data is sequence data:

  • DNA
    • Genomic
    • Meta-Genomic
    • Environmental Samples (16S rDNA)
  • RNA (cDNA libraries; RNA-Seq)
  • Derived Proteins

How to compare sequences?

- Criteria depend on application, e.g. GC content vs. order of bases.

Sequence Clustering

Select Applications in Genomic Sciences:

Genome Assembly: Binning, Scaffolding

Transcriptomics: RNAseq (read) clustering

Protein Function and Evolution studies:

Protein families

Phylogenetic profiling: OTUs

Clustering is Crucial for MetaGenomics

METAGENOMICS

  • Thousands of samples
  • Hundreds of millions of reads per sample
  • Trillions of base pairs
  • Billions of genes

impossible to observe/analyze individually

Clustering becomes a strict requirement:

- Find what classes of sequences are seen

- Analyze classes rather than individual sequences

MetaGenomics Analysis Tasks
  • Primary tasks:
    • Assess diversity
    • Find genes
    • Predict functions
    • Predict pathways
    • Estimate capabilities

Based on sequence comparison.

Clustering in General
  • Any clustering is based on a distance in some metric
  • Initial clustering is based on pair-wise distances
  • Subsequent classification is based on distances from objects to clusters: Pledging

Similarity Metrics
  • What is “similar”?
  • The similarity measure should reflect “reality”
  • This “reality” depends on the application:
    • Assembly: find identical sub-strings
    • Orthology detection: identify homologous proteins across species
    • Functional prediction: identify proteins with similar evolutionarily conserved motifs

The measure can be:

  • Identity percentage
  • Substitution matrix-based score
  • Match to an HMM or PSSM
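
As an illustration of the simplest measure in the list above, identity percentage can be computed from two sequences that are already aligned. This is a minimal sketch, not the formula of any particular tool: the function name `percent_identity`, the `-` gap character, and the choice of denominator (all alignment columns except those with gaps in both sequences) are assumptions, and real tools differ in how they count gaps.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    Assumption: the inputs are rows of one pairwise alignment (equal length).
    Columns with a gap in both rows are ignored; a gap against a residue
    counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # uninformative column
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment: 4 identical columns out of 6.
print(percent_identity("ACGT-A", "ACCTTA"))
```

Whether gaps belong in the denominator is exactly the kind of application-dependent choice the slide refers to.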

Similarity Measure

Computing similarity measure:

  • Edit distance or (ungapped) match statistics (P-value): BLAST, FASTA, needle, water, etc.
  • Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee
  • K-mer statistics: CD-HIT, USEARCH, MUSCLE
  • Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ
  • Suffix Arrays: Bowtie, BWT
  • Position-specific scoring matrix: PSI-BLAST, IMPALA
  • Hidden Markov Models: HMMER, HHsearch/HHpred, SAM
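
To make the k-mer idea from the list above concrete, here is a minimal sketch that compares two sequences by the overlap of their k-mer sets (Jaccard index). It is only an illustration of alignment-free comparison: the names `kmer_set` and `kmer_jaccard` and the default `k=3` are assumptions, and this is not the statistic that CD-HIT, USEARCH, or MUSCLE actually compute.

```python
def kmer_set(seq: str, k: int) -> set:
    """All distinct substrings of length k (empty set if the sequence is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by all k-mers seen in either sequence (0.0 to 1.0)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 0.0  # both sequences too short to contain any k-mer
    return len(ka & kb) / len(ka | kb)


# "ACGT" has k-mers {ACG, CGT}; "CGTA" has {CGT, GTA}; one of three is shared.
print(kmer_jaccard("ACGT", "CGTA", k=3))
```

No alignment is computed, which is why measures of this kind scale to large data sets.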

Assembling Clusters

There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology):

  • Linkage-based
    • Average linkage
    • Complete linkage
    • Single linkage
  • Hierarchy-based
  • Fitting function-based (K-means)
  • Non-linear classifiers (SOM, etc.)
  • Greedy methods (iterative, suboptimal)
Linkage-Based Clustering

Average linkage

Complete linkage

Single linkage
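
The three linkage criteria differ only in how the pairwise distances between members of two clusters are summarized. A minimal sketch, assuming a caller-supplied pairwise distance function `dist`; the function name `cluster_distance` and its `linkage` parameter are illustrative, not taken from the slides.

```python
from itertools import product


def cluster_distance(a, b, dist, linkage="average"):
    """Distance between clusters a and b under the chosen linkage criterion."""
    pairwise = [dist(x, y) for x, y in product(a, b)]
    if linkage == "single":      # closest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


# Toy example with numbers standing in for sequences.
def gap(x, y):
    return abs(x - y)


left, right = [0, 1], [4, 6]
for mode in ("single", "complete", "average"):
    print(mode, cluster_distance(left, right, gap, mode))
```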

Hierarchical Clustering
  • Build a tree representation of relationships
  • Cut the branches using some quantitative criteria
Building the Tree

Criterion: more similar sequences appear on closer branches

This goal is not achievable for practical distance measures

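[Figure: four sequences A, B, C, D with pairwise distances (values 1 to 4) and two alternative trees, with leaf orders A B C D and A B D C.]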

  • Solutions:
    • Approximation methods: neighbor joining, UPGMA
    • Search for the optimal tree by explicit criteria (maximum parsimony, maximum likelihood, etc.)
Suboptimal Tree Building

Neighbor joining (in the simplified form described here, corresponds to single-linkage clustering):

  • Order edges by distance
  • Join in order from short to long, merging branches as needed

Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (corresponds to average-linkage clustering):

  • Start with all singletons; for every pair of clusters (A, B):
    • Compute the average of the distances between every object in A and every object in B
  • Merge the two clusters with the smallest average distance, and repeat
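
The UPGMA steps above can be sketched directly. This is a deliberately naive version that recomputes every average from the original pairwise distances on each round, which is fine for a handful of objects but not for real data. The function name `upgma`, the input layout (a dict keyed by `frozenset` pairs), and the returned merge list are assumptions made for the illustration; labels are assumed to be unique.

```python
from itertools import combinations


def upgma(labels, dist):
    """Return the merge history as (cluster_a, cluster_b, average_distance) tuples.

    dist maps frozenset({x, y}) to the distance between objects x and y.
    """
    clusters = [(label,) for label in labels]  # start with all singletons
    merges = []

    def average(a, b):
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((a, b, average(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical distances between four sequences.
toy = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
    frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 10,
}
for step in upgma("ABCD", toy):
    print(step)
```

Cutting the resulting merge history at a chosen average distance yields the clusters.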
Global Fitting-Function Based

K-means clustering

  • Pre-define the number of clusters
  • Find an assignment of objects to clusters such that the sum of distances to the cluster means is minimal
  • Computationally hard
  • Heuristics used, application specific heuristics may be efficient
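
One widely used heuristic is Lloyd's iteration: alternate between assigning each point to its nearest mean and recomputing the means. The sketch below makes several assumptions that are not in the slides: objects are numeric vectors (sequences would first need an embedding such as k-mer counts), the objective is the sum of squared Euclidean distances, and the first `k` points serve as initial centers. It finds a local optimum only.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Lloyd's heuristic; returns (centers, assignment)."""
    centers = [tuple(p) for p in points[:k]]  # assumption: naive initialization
    assignment = None
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


data = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(kmeans(data, k=2))
```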
Non-Linear Methods
  • Self-Organizing Maps: a “self-learning” method
  • A neural network trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space
Pledging

Based on distance to cluster

  • Representative
  • Set of representatives (all members, in the extreme case)
  • Other measure, may be unrelated to the initial one (profile, model)
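
Pledging against representatives can be sketched as a nearest-representative lookup with a cutoff. The function name `pledge`, the `max_dist` cutoff, and the dict layout are assumptions for this illustration; with one representative per cluster it covers the first case above, with several it covers the second.

```python
def pledge(obj, representatives, dist, max_dist):
    """Return the id of the closest cluster, or None if nothing is close enough.

    representatives maps cluster id -> list of representative objects.
    """
    best_id, best_d = None, None
    for cluster_id, reps in representatives.items():
        d = min(dist(obj, r) for r in reps)  # distance to the nearest representative
        if best_d is None or d < best_d:
            best_id, best_d = cluster_id, d
    if best_d is not None and best_d <= max_dist:
        return best_id
    return None  # a 'leftover' to be clustered later


# Toy example with numbers standing in for sequences.
def gap_between(x, y):
    return abs(x - y)


reps = {"low": [1, 2], "high": [10]}
print(pledge(3, reps, gap_between, max_dist=2))
print(pledge(6, reps, gap_between, max_dist=2))
```

Objects that return `None` are the ones that go into the next round of ab initio clustering.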

Performance Considerations

Distance computing is harder than clustering (bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)

  • For large data sets, only k-mer and suffix array measures are practical
  • However, incremental/greedy approaches can be used to avoid computing the entire distance matrix, which makes sensitive similarity measures usable
  • For boolean distances, iterative similarity detection is possible (no off-the-shelf implementations)
  • Binning: pre-clustering by rough and fast methods
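
The incremental/greedy idea can be sketched as follows: process sequences longest first, compare each one only against the current cluster representatives, and open a new cluster when none matches. This is a simplified illustration in the spirit of greedy incremental tools, not a reimplementation of any of them; `greedy_cluster` and the caller-supplied `is_similar` predicate are assumed names.

```python
def greedy_cluster(seqs, is_similar):
    """Greedy incremental clustering; returns a list of (representative, members)."""
    clusters = []
    for s in sorted(seqs, key=len, reverse=True):  # longest sequences first
        for rep, members in clusters:
            if is_similar(rep, s):
                members.append(s)  # joins the first matching cluster
                break
        else:
            clusters.append((s, [s]))  # becomes a new representative
    return clusters


# Toy similarity rule (an assumption): a sequence matches if it is a substring
# of the representative. A real pipeline would call a sensitive aligner here.
def contained(rep, s):
    return s in rep


reads = ["ACGTACGT", "CGTA", "TTTT", "TT"]
print(greedy_cluster(reads, contained))
```

Each sequence is compared with at most one representative per cluster, so the number of comparisons grows with the number of clusters rather than with the square of the number of sequences.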

[Figure: binning example. 33 objects yield 528 pairs to compare; split into 4 groups, only 127 within-group pairs remain.]
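
The pair counts quoted for the binning example follow from n(n-1)/2. A small sketch of the arithmetic; the slide does not state the group sizes, so the split (11, 9, 7, 6) used here is a hypothetical one chosen only because it adds up to 33 objects and 127 within-group pairs.

```python
def all_pairs(n: int) -> int:
    """Number of unordered pairs among n objects."""
    return n * (n - 1) // 2


def binned_pairs(group_sizes) -> int:
    """Pairs that remain when only objects within the same bin are compared."""
    return sum(all_pairs(size) for size in group_sizes)


sizes = (11, 9, 7, 6)  # hypothetical bin sizes, not given in the source
print(all_pairs(33), binned_pairs(sizes))
```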

Single Linkage is Fast
  • Time- and space-efficient clustering method: transitive closure-based
  • Requires ‘boolean’ distances (two sequences are either linked or not linked)
  • Requires the number of nodes to be known
  • Space ~ NodesNo
  • Run-time (worst) ~ EdgesNo * AveClustSize
  • Run-time (average) ~ EdgesNo * log2(AveClustSize)
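
One common way to compute a transitive closure over boolean links is a union-find (disjoint-set) structure. The sketch below uses it as a stand-in; the slides do not say which data structure the author used, and the run-time estimates above refer to the author's method, not necessarily to this one. Nodes are assumed to be numbered 0 to `n_nodes - 1`, which matches the requirement that the number of nodes be known in advance.

```python
def single_linkage(n_nodes, edges):
    """Group nodes connected by any chain of links; returns sorted member lists."""
    parent = list(range(n_nodes))  # one array of size NodesNo
    size = [1] * n_nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:  # a single pass over the links
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra  # attach the smaller cluster to the larger one
        size[ra] += size[rb]

    clusters = {}
    for node in range(n_nodes):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


print(single_linkage(6, [(0, 1), (1, 2), (3, 4)]))
```

The links can be streamed from disk one at a time; only the per-node arrays stay in memory.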
Single Linkage is Prone to Aggregation

Single-linkage clustering killer:

CLUSTER AGGREGATION

In large clusters, even a small number of random links leads to huge conglomerates.

Case Study: RNA-Seq Pipeline
  • Goals:
    • Compute transcript structures
    • Compute expression profiles (“virtual”)

[Figure: pipeline flowchart. Elements: reads/EST database; reads/EST clusters; reads/clones attributed to a particular source/condition; counting of reads originating from different sources; source/condition-specific expression profiles.]

RNAseq Analysis Solutions

Source: bioinfo.org, Macquarie University, Sydney

RNAseq Clustering

Approach Outline:

  • 1. Detect identities (common segments):
    • Compute similarities
    • Select the “good” ones
  • 2. Merge sequences into groups with shared segments: SINGLE LINKAGE

Outcome:

The single biggest cluster contains more than 60% of all sequences

(selection by better similarity does not help)

What causes aggregation and how to fight it?

Aggregation in RNA-Seq Clustering
  • “Bad” identities:
    • Pieces of vector constructs / adapters
    • Repeats
    • Redundant sequences
    • Spurious matches (short infrequent repeats)
    • Chimeras (if pre-amplification is used)
Similarities Selection
  • Computing ‘boolean’ distances:
    • Threshold-based
    • Additional rules (match arrangement)
    • % identity + length + arrangement

[Figure: example of a match arrangement that is accepted (“OK”).]
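
A boolean distance of the "% identity + length + arrangement" kind can be sketched as a predicate over a match record. Everything specific here is an assumption for illustration: the `Match` fields, the thresholds (95% identity, 40 positions, 5 positions of slack), and the arrangement rule, which accepts a match only if it reaches an end of both sequences, as a true overlap between reads would.

```python
from dataclasses import dataclass


@dataclass
class Match:
    identity: float  # percent identity over the matched segment
    length: int      # length of the matched segment
    a_start: int     # match coordinates on sequence A (0-based, end exclusive)
    a_end: int
    a_len: int       # full length of sequence A
    b_start: int     # same for sequence B
    b_end: int
    b_len: int


def touches_end(start, end, seq_len, slack):
    """True if the matched segment reaches either end of the sequence."""
    return start <= slack or seq_len - end <= slack


def is_linked(m, min_identity=95.0, min_length=40, slack=5):
    """Boolean distance: threshold rules first, then the arrangement rule."""
    if m.identity < min_identity or m.length < min_length:
        return False
    return (touches_end(m.a_start, m.a_end, m.a_len, slack)
            and touches_end(m.b_start, m.b_end, m.b_len, slack))


overlap = Match(98.0, 60, 40, 100, 100, 0, 60, 120)   # end of A meets start of B
internal = Match(99.0, 60, 20, 80, 100, 30, 90, 120)  # buried inside both sequences
print(is_linked(overlap), is_linked(internal))
```

Matches buried in the middle of both sequences are typical of repeats and other “bad” identities, which is why an arrangement rule can help where a stricter identity threshold does not.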

Trimming / Masking
  • Fighting aggregation:
    • Vector / adapter trimming: Lucy, Figaro, etc.; integrated in many assembly suites (Newbler, Velvet, AMOS, CLCbio, etc.)
    • Low-complexity detection / masking: SEG, DUST, FastQC, WindowMasker, etc.; often integrated in search tools
Repeat Elimination

Regular (tandem) repeats:

  • Pre-search masking: Based on structure (IMEx, SRF) or on database (TRDB)
  • Post-search detection based on similarity properties (multiple parallel threads)

Irregular (long) repeats:

  • Database based: RepeatMasker
  • De-novo:
    • RepeatScout,
    • orrb,
    • PILER, etc.

(Require genome as input, construct database)

Detecting Chimeras
  • Detecting chimeric sequences:
  • Abundance-based: Perseus, UCHIME
    • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangements are more frequent
  • Specific to 16S: ChimeraSlayer, Bellerophon
    • Chimera ‘arms’ are closer to their originating clades than the entire chimera is
  • Similarity coverage-based: Mira assembler
  • Similarity graph topology-based: dchim

[Figure: dchim example shown in alignment view and in connectivity view.]

Protein Clustering
  • Direct similarity measures based on edit distance are not sensitive enough for evolutionarily distant species
  • Position-specific scoring matrices and profile HMMs provide better sensitivity, but are SLOW
  • Similar problems as for RNA-Seq/EST clustering, but their causes are harder to fight
  • No ‘one size fits all’ solution: manual tuning and curation are required for comprehensive results, especially at a large scale
  • The results of clustering are precious; they are kept as databases (Pfam, COGs, KOGs, eggNOG)
Protein Clustering at JGI
  • Functional annotation of metagenome genes through protein clusters (IMG):
    • Build a set of functionally homogeneous clusters of similar proteins for annotated genomes
    • Build an HMM for each cluster and compose a model database
    • Pledge metagenome proteins to clusters by matching them to the models
    • Cluster the unpledged proteins, build models, and update the model database
Protein Clustering

Use of Protein Clusters reduces search space, but adds another level of indirection, which is a source of errors, and adds complexity that consumes effort

However, for proteins, which form dense relationship networks, clustering is a great tool

Konstantinos Mavrommatis will elaborate on protein clustering techniques