Bioinformatics 3

BioInformatics (3) - PowerPoint PPT Presentation





PowerPoint Slideshow about ' BioInformatics (3)' - adam-rivera


Presentation Transcript

Computational Issues

  • Data Warehousing:

    • Organising Biological Information into a Structured Entity (World’s Largest Distributed DB)

  • Function Analysis (Numerical Analysis):

    • Gene Expression Analysis: Applying sophisticated data mining/visualisation to understand gene activities within an environment (clustering)

    • Integrated Genomic Study: Relating structural analysis with functional analysis

  • Structure Analysis (Symbolic Analysis):

    • Sequence Alignment: Analysing a sequence using comparative methods against existing databases to develop hypotheses concerning relatives (genetics) and functions (dynamic programming and HMMs)

    • Structure Prediction: predicting the 3D structure of a protein from its sequence (Inductive LP)



Structure Analysis: Alignments & Scores

Local (motif)

ACCACACA

::::

ACACCATA

Score= 4(+1) = 4

Global (e.g. haplotype)

ACCACACA

::xx::x:

ACACCATA

Score= 5(+1) + 3(-1) = 2

Suffix (shotgun assembly)

ACCACACA

:::

ACACCATA

Score= 3(+1) = 3
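
The three scores above can be reproduced with a minimal Python sketch. It assumes ungapped alignments, +1 per match and -1 per mismatch for the global case, and exact matching runs for the local and suffix cases; the function names are hypothetical and the slides do not say how the alignments were actually computed.

```python
def global_score(a, b, match=1, mismatch=-1):
    """Score an ungapped end-to-end alignment of two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("ungapped global alignment needs equal lengths")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def local_score(a, b):
    """Length of the longest exactly matching run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # extend the run that ended one position earlier in both strings
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def suffix_score(a, b):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


s, t = "ACCACACA", "ACACCATA"
print(global_score(s, t), local_score(s, t), suffix_score(s, t))
```

For the two sequences on the slide this prints 2, 4 and 3, matching the global, local and suffix scores. Practical tools allow gaps and use dynamic programming, which this sketch leaves out.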


A comparison of the homology search and the motif search for functional interpretation of sequence information.

[Figure: two flow diagrams.
Homology Search: New sequence, Sequence database (Primary data), Retrieval, Similar sequence, Expert knowledge, Sequence interpretation.
Motif Search: New sequence, Motif library (Empirical rules), Knowledge acquisition, Inference, Expert knowledge, Sequence interpretation.]


Search and learning problems in sequence analysis


(Whole genome) Gene Expression Analysis

  • Quantitative Analysis of Gene Activities (Transcription Profiles)

[Figure: Gene Expression]


[Figure: GeneChip expression analysis probe array. Labels: biotinylated RNA from experiment; each probe cell contains millions of copies of a specific oligonucleotide probe; streptavidin-phycoerythrin conjugate; image of hybridized probe array.]


(Sub)cellular inhomogeneity

Cell-cycle differences in expression.

XIST RNA localized on the inactive X-chromosome (see figure).


Cluster Analysis

Protein/protein complex

Genes

DNA regulatory elements


Functional Analysis via Gene Expression

Pairwise Measures

Clustering

Motif Searching/...


Clustering Algorithms

A clustering algorithm attempts to find natural groups of components (or data) based on some similarity measure. It also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is essentially a statistical description of the cluster centroids, together with the number of components in each cluster.
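
As an illustration of the membership rule and the output described above, here is a minimal Python sketch. The function name `summarize_clusters`, the sample points and the starting centroids are all hypothetical, and Euclidean distance is an assumption.

```python
import math
from collections import defaultdict


def summarize_clusters(points, centroids):
    """Assign each point to its nearest centroid and describe each cluster."""
    members = defaultdict(list)
    for p in points:
        # cluster membership: index of the closest centroid (Euclidean distance)
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        members[nearest].append(p)
    summary = {}
    for i, group in members.items():
        # the centroid of a group is the component-wise mean of its members
        centroid = tuple(sum(c) / len(group) for c in zip(*group))
        summary[i] = {"centroid": centroid, "size": len(group)}
    return summary


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0), (11.0, 13.0)]
print(summarize_clusters(points, centroids=[(1.0, 1.0), (9.0, 9.0)]))
```

The result is one entry per cluster holding its centroid and the number of components, which is the kind of summary the paragraph describes.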


Clusters of Two-Dimensional Data


Key Terms in Cluster Analysis

  • Distance & Similarity measures

  • Hierarchical & non-hierarchical

  • Single/complete/average linkage

  • Dendrograms & ordering


Distance Measures: Minkowski Metric


Most Common Minkowski Metrics
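
The formulas on these slides did not survive in the transcript. For reference, the standard definition of the Minkowski metric and its three most common special cases are given below; the symbols (x and y for two objects, p for the number of features, r for the order) are assumptions about the slide's notation.

```latex
% General Minkowski metric of order r between p-dimensional objects x and y
d_r(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^{r} \right)^{1/r}, \qquad r \ge 1

% r = 1: Manhattan (city-block) distance
d_1(x, y) = \sum_{i=1}^{p} |x_i - y_i|

% r = 2: Euclidean distance
d_2(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}

% r \to \infty: "sup" (maximum) distance
d_\infty(x, y) = \max_{1 \le i \le p} |x_i - y_i|
```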


An Example

[Figure: two points x and y; surviving labels: 3 and 4]
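
The figure for this example is lost. Assuming it showed two points separated by 3 units along one axis and 4 along the other (a guess based on the surviving labels), the three common Minkowski distances can be computed as in this sketch; the function name `minkowski` is hypothetical.

```python
def minkowski(x, y, r):
    """Minkowski distance of order r; r=float('inf') gives the sup distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


# Assumed coordinates: the points differ by 3 in one feature and 4 in the other.
x, y = (0, 0), (3, 4)
print("Manhattan:", minkowski(x, y, 1))
print("Euclidean:", minkowski(x, y, 2))
print("Sup:", minkowski(x, y, float("inf")))
```

Under that assumption the Manhattan distance is 3 + 4 = 7, the Euclidean distance is 5, and the sup distance is 4.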


Manhattan distance is called Hamming distance when all features are binary.

Gene Expression Levels Under 17 Conditions (1 = high, 0 = low)
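
The table of expression levels is not in the transcript. The sketch below uses two made-up binary profiles over 17 conditions to show that, on 0/1 features, the Hamming distance (number of differing positions) equals the Manhattan distance.

```python
def hamming(u, v):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(u, v))


# Hypothetical profiles under 17 conditions (1 = high, 0 = low).
gene_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
gene_b = [1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1]

manhattan = sum(abs(a - b) for a, b in zip(gene_a, gene_b))
print(hamming(gene_a, gene_b), manhattan)
```

Both values are 4 here, because each differing binary feature contributes exactly 1 to the Manhattan sum.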


Similarity Measures: Correlation Coefficient


Similarity Measures: Correlation Coefficient

[Figure: three plots of expression level against time for Gene A and Gene B]
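
The formula and the plotted profiles are missing from the transcript. The sketch below assumes the slide refers to the Pearson correlation coefficient and uses hypothetical time courses, not the ones in the figure, to show why correlation is useful for expression data: it compares the shape of two profiles rather than their absolute levels.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)


# Hypothetical time courses.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
shifted = [v + 5.0 for v in gene_a]   # same shape, higher level
scaled = [3.0 * v for v in gene_a]    # same shape, larger amplitude
inverted = [-v for v in gene_a]       # opposite behaviour

for name, profile in [("shifted", shifted), ("scaled", scaled), ("inverted", inverted)]:
    print(name, round(pearson(gene_a, profile), 6))
```

The shifted and scaled profiles correlate perfectly with the original even though their Euclidean distance from it is large, while the inverted profile is perfectly anti-correlated.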


Distance-based Clustering

  • Assign a distance measure between data

  • Find a partition such that:

    • Distance between objects within a partition (i.e. same cluster) is minimised

    • Distance between objects from different clusters is maximised

  • Issues:

    • Requires defining a distance (similarity) measure in situations where it is unclear how to assign it

    • What relative weighting to give to one attribute vs another?

    • The number of possible partitions is super-exponential
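
To make the last issue concrete, the number of ways to split n labelled objects into k non-empty clusters is the Stirling number of the second kind, and the total over all k is the Bell number. The following sketch, with hypothetical function names, counts them using the standard recurrence.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to split n labelled objects into k non-empty clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # object n either joins one of the k existing clusters or starts a new one
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def bell(n):
    """Total number of partitions of n labelled objects."""
    return sum(stirling2(n, k) for k in range(n + 1))


for n in (5, 10, 20):
    print(n, bell(n))
```

Ten objects already admit 115975 partitions, so exhaustive search over partitions is not an option for thousands of genes, which is why heuristic methods are used.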


Hierarchical & Non-hierarchical

Normalized Expression Data


Hierarchical Clustering Techniques


Hierarchical Clustering

Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one fewer cluster.

3. Compute distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
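
The four steps above can be sketched in Python as follows. This is a naive illustration, not an efficient or canonical implementation: the function name `agglomerate`, the `linkage` parameter and the 4x4 distance matrix are hypothetical, and cluster distances are recomputed from the item distances on every pass instead of being updated incrementally.

```python
def agglomerate(dist, linkage=min):
    """Naive agglomerative clustering on a full N x N distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    linkage=min gives single link, linkage=max gives complete link.
    """
    clusters = [(i,) for i in range(len(dist))]      # step 1: one item per cluster
    merges = []
    while len(clusters) > 1:                         # step 4: repeat until one cluster
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # step 3, done on the fly: cluster distance from item distances
                d = linkage(dist[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best                               # step 2: closest pair
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical symmetric distance matrix for four items (indices 0..3).
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
print(agglomerate(dist))
print(agglomerate(dist, linkage=max))
```

With this matrix both linkages merge items 0 and 1 first and items 2 and 3 second; they differ only in the height of the final merge (5 for single link, 10 for complete link).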


The distance between two clusters is defined as the distance between

  • their closest members (Single-Link Method / Nearest Neighbor)

  • their farthest members (Complete-Link / Furthest Neighbor)

  • their centroids

  • all cross-cluster pairs, averaged


Computing Distances

  • single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.

  • complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.

  • average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average distance over all pairs made up of one member from each cluster.
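
The three definitions above differ only in how they reduce the set of cross-cluster distances. A minimal sketch, assuming Euclidean distance between members and using hypothetical function names and points:

```python
import math
from itertools import product


def cross_distances(a, b):
    """All distances between a member of cluster a and a member of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]


def single_link(a, b):
    return min(cross_distances(a, b))      # shortest cross-cluster distance


def complete_link(a, b):
    return max(cross_distances(a, b))      # longest cross-cluster distance


def average_link(a, b):
    d = cross_distances(a, b)
    return sum(d) / len(d)                 # mean over all cross-cluster pairs


# Two hypothetical clusters on a line.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]
print(single_link(a, b), complete_link(a, b), average_link(a, b))
```

Here the cross-cluster distances are 4, 6, 3 and 5, so the single-link, complete-link and average-link distances are 3, 6 and 4.5 respectively.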


Single-Link Method

[Figure: single-link clustering of four items a, b, c, d using Euclidean distance, shown with the distance matrix. Merge steps (1) to (3) produce the clusters {a,b}, {a,b,c} and {a,b,c,d}.]


Complete-Link Method

[Figure: complete-link clustering of four items a, b, c, d using Euclidean distance, shown with the distance matrix. Merge steps (1) to (3) produce the clusters {a,b}, {c,d} and {a,b,c,d}.]


Compare Dendrograms

[Figure: single-link and complete-link dendrograms side by side, with an axis running from 0 to 6]


Ordered Dendrograms

  • A dendrogram over n elements admits 2^(n-1) linear orderings (n = number of genes or conditions)

  • Maximizing adjacent similarity is impractical, so order by:

    • Average expression level,

    • Time of max induction, or

    • Chromosome positioning

Eisen98




Problems of Hierarchical Clustering

  • It is more concerned with the complete tree structure than with the optimal number of clusters.

  • There is no possibility of correcting for a poor initial partition.

  • Similarity and distance measures rarely have strict numerical significance.


Non-hierarchical Clustering

Normalized Expression Data

Tavazoie et al. 1999 (http://arep.med.harvard.edu)


Clustering by K-means

  • Given a set S of N p-dimensional vectors without any prior knowledge about the set, the K-means clustering algorithm forms K disjoint nonempty subsets such that each subset minimizes some measure of dissimilarity locally. The result is a locally optimal partition; a globally optimal dissimilarity over all subsets is not guaranteed.

  • The K-means algorithm has time complexity O(RKN) for fixed dimension p, where N is the number of vectors, K is the number of desired clusters and R is the number of iterations needed to converge.

  • Using the Euclidean distance between the coordinates of any two genes in the space reflects ignorance of a more biologically relevant measure of distance. K-means is an unsupervised, iterative algorithm that minimizes the within-cluster sum of squared distances from the cluster mean.

  • The first cluster center is chosen as the centroid of the entire data set and subsequent centers are chosen by finding the data point farthest from the centers already chosen. The algorithm is run for 200-400 iterations.
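
The initialization described in the last bullet can be sketched as below. "Farthest from the centers already chosen" is interpreted here as the point whose distance to its nearest chosen center is largest; that reading, the function name `initial_centers` and the sample points are assumptions, not details given in the slides.

```python
import math


def initial_centers(points, k):
    """Pick k starting centers: the global centroid, then farthest points."""
    dim = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    centers = [centroid]
    while len(centers) < k:
        # assumed rule: maximize the distance to the nearest center chosen so far
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
print(initial_centers(points, 3))
```

For these points the first center is the overall centroid, the second is (11, 10), and the third is (0, 0), so the starting centers are spread across the data rather than bunched in one group.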


K-Means Clustering Algorithm

1) Select an initial partition of k clusters

2) Assign each object to the cluster with the closest center

3) Compute the new centers of the clusters

4) Repeat steps 2 and 3 until no object changes cluster
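
A minimal Python sketch of this loop follows. It departs from step 1 in one respect, which is an assumption made for brevity: it starts from given initial centers rather than an initial partition. The function name `kmeans`, the `max_iter` cap and the sample data are hypothetical, and Euclidean distance is assumed.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Iterate assignment and center updates from the given starting centers."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # step 2: assign each object to the cluster with the closest center
        new_labels = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        if new_labels == labels:          # step 4: stop when no object moves
            break
        labels = new_labels
        # step 3: recompute each center as the mean of its members
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:                   # keep the old center if a cluster empties
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (11.0, 10.0)])
print(labels, centers)
```

On this toy data the loop stops on the second pass, with the first three points in one cluster and the last two in the other. The result in general depends on the starting centers, so different initializations can give different partitions.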


Representation of expression data

[Figure: normalized expression data from microarrays. Each gene (Gene 1, Gene 2, ..., Gene N) is a point in the space whose axes are Time-point 1, Time-point 2 and Time-point 3 (T1, T2, T3); dij is marked between genes.]


Identifying prevalent expression patterns (gene clusters)

[Figure: gene clusters in the space of Time-point 1, Time-point 2 and Time-point 3, each shown with a plot of normalized expression against time-point]

Evaluate Cluster Contents

Genes

MIPS functional category

Glycolysis

Nuclear Organization

Ribosome

Translation

Unknown

