
Bi-Clustering

COMP 790-90 Seminar

Spring 2008

Coherent Cluster

We want to accommodate noise but not outliers.

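Coherent Cluster
  • Coherent cluster
    • Subspace clustering
  • Pair-wise disparity
    • For a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b}:

$$D = \left| (d_{xa} - d_{ya}) - (d_{xb} - d_{yb}) \right|$$

where $d_{xa} - d_{ya}$ is the mutual bias of attribute a and $d_{xb} - d_{yb}$ is the mutual bias of attribute b.

[Figure: objects x, y, and z plotted over attributes a and b, with the values d_xa, d_xb, d_ya, d_yb marked.]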

Coherent Cluster
  • A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.
  • An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster.
    • A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.
  • Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
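The definition can be checked directly by brute force. The sketch below assumes that the D value of a 2×2 submatrix is the absolute difference between the mutual biases of its two attributes; the function names and the data matrix are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def disparity(matrix, x, y, a, b):
    """D value of the 2x2 submatrix on rows x, y and columns a, b
    (assumed: absolute difference of the two mutual biases)."""
    return abs((matrix[x][a] - matrix[y][a]) - (matrix[x][b] - matrix[y][b]))

def is_delta_coherent(matrix, rows, cols, delta):
    """True if every 2x2 submatrix of the selected rows/columns has D <= delta."""
    return all(
        disparity(matrix, x, y, a, b) <= delta
        for x, y in combinations(rows, 2)
        for a, b in combinations(cols, 2)
    )

# Hypothetical data: rows 0 and 1 differ by a nearly constant offset,
# row 2 follows a different pattern.
data = [
    [1.0, 4.0, 2.0],
    [3.0, 6.0, 4.5],
    [0.0, 9.0, 1.0],
]

print(is_delta_coherent(data, [0, 1], [0, 1, 2], 0.5))     # True
print(is_delta_coherent(data, [0, 1, 2], [0, 1, 2], 0.5))  # False
```

This check examines every pair of rows against every pair of columns, so it is only practical for verifying a candidate cluster, not for searching the whole matrix.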
Coherent Cluster
  • Challenges:
    • Finding subspace clusters based on distance is already a difficult task due to the curse of dimensionality.
      • The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
    • The actual values of the objects in a coherent cluster may be far apart from each other.
      • Each object or attribute in a coherent cluster may bear some relative bias (unknown in advance), and such bias may be local to the coherent cluster.
Coherent Cluster
  • Step 1: Compute the maximum coherent attribute sets for each pair of objects.
  • Step 2: Two-way pruning.
  • Step 3: Construct the lexicographical tree.
  • Step 4: Post-order traverse the tree to find maximum coherent clusters.

Coherent Cluster
  • Observation: given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, …, ak}, the 2×k submatrix is a δ-coherent cluster iff the mutual biases $d_{o_1 a_i} - d_{o_2 a_i}$ of the attributes ai differ from each other by no more than δ.

[Figure: values of objects o1 and o2 over attributes a1 to a5 (vertical axis from 1 to 7). The mutual biases at a1, a2, a3, a4, a5 are 3, 2, 3.5, 2, 2.5, which span the range [2, 3.5].]

If δ = 1.5, then {a1,a2,a3,a4,a5} is a coherent attribute set (CAS) of (o1,o2).
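The observation reduces the check for a pair of objects to a range test on the mutual biases. A minimal sketch follows; the object values are hypothetical and were chosen only so that the mutual biases equal the five values in the example (3, 2, 3.5, 2, 2.5).

```python
def mutual_biases(row1, row2):
    """Per-attribute difference between two objects."""
    return [v1 - v2 for v1, v2 in zip(row1, row2)]

def is_coherent_attribute_set(row1, row2, delta):
    """True if the mutual biases differ from each other by at most delta,
    i.e. their range (max - min) is at most delta."""
    biases = mutual_biases(row1, row2)
    return max(biases) - min(biases) <= delta

# Hypothetical values for o1 and o2 over a1..a5 (not taken from the slides).
o2 = [1.0, 3.0, 1.5, 3.0, 2.0]
o1 = [4.0, 5.0, 5.0, 5.0, 4.5]

print(mutual_biases(o1, o2))                   # [3.0, 2.0, 3.5, 2.0, 2.5]
print(is_coherent_attribute_set(o1, o2, 1.5))  # True: range is 3.5 - 2 = 1.5
print(is_coherent_attribute_set(o1, o2, 1.0))  # False
```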

Coherent Cluster
  • Observation: given a subset of objects {o1, o2, …, ol} and a subset of attributes {a1, a2, …, ak}, the l×k submatrix is a δ-coherent cluster iff {a1, a2, …, ak} is a coherent attribute set for every pair of objects (oi,oj) where 1 ≤ i, j ≤ l.

[Figure: data matrix of objects o1 to o6 and attributes a1 to a7.]
Coherent Cluster
  • Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ.

[Figure: two copies of a plot of objects r1 and r2 over attributes a1 to a5 (vertical axis from 1 to 7), with differences 3, 2, 3.5, 2, 2.5 at a1, a2, a3, a4, a5 and δ = 1.]

The maximum coherent attribute sets define the search space for maximum coherent clusters.
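One way to compute the maximum coherent attribute sets of an object pair is to sort the attributes by mutual bias and slide a window of width δ over the sorted list. This is a sketch under that assumption, not necessarily the procedure used in the original work; the object values are hypothetical and reproduce the differences 3, 2, 3.5, 2, 2.5 from the example.

```python
def maximum_coherent_attribute_sets(row1, row2, names, delta):
    """Return all maximal attribute sets (size >= 2) whose mutual biases
    span a range of at most delta, for one pair of objects."""
    # Sort attributes by mutual bias so every coherent set is a contiguous run.
    biases = sorted((v1 - v2, name) for v1, v2, name in zip(row1, row2, names))
    result = []
    end = 0        # exclusive right edge of the current window
    prev_end = 0   # right edge of the last window that was reported
    for start in range(len(biases)):
        end = max(end, start + 1)
        while end < len(biases) and biases[end][0] - biases[start][0] <= delta:
            end += 1
        # A window is maximal only if it reaches further right than the
        # previous one; otherwise it is contained in that earlier window.
        if end > prev_end:
            if end - start >= 2:
                result.append(sorted(name for _, name in biases[start:end]))
            prev_end = end
    return result

names = ["a1", "a2", "a3", "a4", "a5"]
r2 = [1.0, 3.0, 1.5, 3.0, 2.0]   # hypothetical values
r1 = [4.0, 5.0, 5.0, 5.0, 4.5]   # r2 plus (3, 2, 3.5, 2, 2.5)

print(maximum_coherent_attribute_sets(r1, r2, names, 1.0))
# [['a1', 'a2', 'a4', 'a5'], ['a1', 'a3', 'a5']]
```

With δ = 1.5 the same call returns a single set containing all five attributes, which agrees with the earlier coherent attribute set example.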

Two Way Pruning

δ = 1, nc = 3, nr = 3

MCAS (maximum coherent attribute sets, per object pair):
  • (o0,o2) → (a0,a1,a2)
  • (o1,o2) → (a0,a1,a2)

MCOS (maximum coherent object sets, per attribute pair):
  • (a0,a1) → (o0,o1,o2)
  • (a0,a2) → (o1,o2,o3)
  • (a1,a2) → (o1,o2,o4)
  • (a1,a2) → (o0,o2,o4)

Coherent Cluster
  • Strategy: group object pairs by their CAS and, for each group, find the maximum clique(s).
  • Implementation: use a lexicographical tree to organize the object pairs and to generate all maximum coherent clusters with a single post-order traversal of the tree.

[Figure: data matrix with its two dimensions labeled attributes and objects.]
Assume δ = 1.

Maximum coherent attribute sets of each object pair:
  • (o0,o1): {a0,a1}, {a2,a3}
  • (o0,o2): {a0,a1,a2,a3}
  • (o0,o4): {a1,a2}
  • (o1,o2): {a0,a1,a2}, {a2,a3}
  • (o1,o3): {a0,a2}
  • (o1,o4): {a1,a2}
  • (o2,o3): {a0,a2}
  • (o2,o4): {a1,a2}

Object pairs grouped by coherent attribute set:
  • {a0,a1}: (o0,o1), (o1,o2), (o0,o2)
  • {a0,a2}: (o1,o3), (o2,o3), (o1,o2), (o0,o2)
  • {a1,a2}: (o0,o4), (o1,o4), (o2,o4), (o1,o2), (o0,o2)
  • {a2,a3}: (o0,o1), (o1,o2), (o0,o2)
  • {a0,a1,a2}: (o1,o2), (o0,o2)
  • {a0,a1,a2,a3}: (o0,o2)

[Figure: lexicographical tree over attributes a0 to a3 with object pairs attached to its nodes.]

[Figure: lexicographical tree over attributes a0 to a3 with object pairs attached to its nodes.]

Maximum coherent clusters:
  • {o0,o2} × {a0,a1,a2,a3}
  • {o1,o2} × {a0,a1,a2}
  • {o0,o1,o2} × {a0,a1}
  • {o1,o2,o3} × {a0,a2}
  • {o0,o2,o4} × {a1,a2}
  • {o1,o2,o4} × {a1,a2}
  • {o0,o1,o2} × {a2,a3}
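The grouping-and-clique strategy can be sketched by brute force on the example with δ = 1. The code below does not build the lexicographical tree; as a simplification it enumerates every candidate attribute set, links the object pairs that are coherent on it, and extracts maximal cliques with a basic Bron-Kerbosch recursion. All function names are hypothetical, and the input is the list of maximum coherent attribute sets from the example.

```python
from itertools import combinations

# Maximum coherent attribute sets per object pair, copied from the example.
MCAS = {
    ("o0", "o1"): [{"a0", "a1"}, {"a2", "a3"}],
    ("o0", "o2"): [{"a0", "a1", "a2", "a3"}],
    ("o0", "o4"): [{"a1", "a2"}],
    ("o1", "o2"): [{"a0", "a1", "a2"}, {"a2", "a3"}],
    ("o1", "o3"): [{"a0", "a2"}],
    ("o1", "o4"): [{"a1", "a2"}],
    ("o2", "o3"): [{"a0", "a2"}],
    ("o2", "o4"): [{"a1", "a2"}],
}

def maximal_cliques(adj):
    """Bron-Kerbosch without pivoting; adj maps a vertex to its neighbours."""
    found = []
    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(frozenset(), frozenset(adj), frozenset())
    return found

def maximum_coherent_clusters(mcas):
    # Candidate attribute sets: every subset (size >= 2) of some MCAS.
    candidates = set()
    for sets in mcas.values():
        for s in sets:
            for k in range(2, len(s) + 1):
                candidates.update(frozenset(c) for c in combinations(sorted(s), k))
    clusters = set()
    for attrs in candidates:
        # An object pair is an edge if the candidate is coherent for that pair.
        adj = {}
        for (u, v), sets in mcas.items():
            if any(attrs <= s for s in sets):
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
        for clique in maximal_cliques(adj):
            clusters.add((clique, attrs))
    # Keep only clusters that are not contained in another cluster.
    return {
        (o, a) for (o, a) in clusters
        if not any((o, a) != (o2, a2) and o <= o2 and a <= a2 for (o2, a2) in clusters)
    }

result = maximum_coherent_clusters(MCAS)
for objects, attrs in sorted(result, key=lambda c: (sorted(c[1]), sorted(c[0]))):
    print(sorted(objects), "x", sorted(attrs))
```

On this input the sketch produces seven clusters, the same object and attribute sets as the maximum coherent clusters listed for the example. Enumerating all attribute subsets is exponential in the size of the largest attribute set, which is the cost the tree-based traversal is meant to avoid.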

Coherent Cluster
  • High expressive power
    • The coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods.
  • Efficient and highly scalable
  • Wide applications
    • Gene expression analysis
    • Collaborative filtering

[Figure: diagram relating subspace clusters and coherent clusters.]

Remark
  • Compared to Bicluster
    • Separates noise from outliers well
    • No random data insertion and replacement
    • Produces an optimal solution
Definition of OP-Cluster
  • Let I be a subset of genes in the database. Let J be a subset of conditions. We say <I, J> forms an Order Preserving Cluster (OP-Cluster) if one of the following relationships exists for any pair of conditions.

[Equation and figure not preserved: the relationships were given as a formula next to a plot of expression levels over conditions A1, A2, A3, A4.]

Problem Statement
  • Given a gene expression matrix, our goal is to find all the statistically significant OP-Clusters. Significance is ensured by the minimum size thresholds nc and nr.
Conversion to Sequence Mining Problem

[Figure: expression levels of a gene over conditions A1, A2, A3, A4, and the sequence derived from them.]
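A minimal sketch of the conversion, assuming each gene's sequence lists the conditions in ascending order of expression level, so that an order-preserving pattern becomes a subsequence shared by several genes. The expression values and function names are hypothetical.

```python
def to_sequence(expression, conditions):
    """Order the condition labels by ascending expression level
    (ties are broken by label)."""
    return [c for _, c in sorted(zip(expression, conditions))]

def is_subsequence(pattern, sequence):
    """True if pattern appears in sequence in the same order,
    not necessarily contiguously."""
    it = iter(sequence)
    return all(item in it for item in pattern)

conditions = ["A1", "A2", "A3", "A4"]
gene = [3.0, 1.0, 4.0, 2.0]  # hypothetical expression levels

seq = to_sequence(gene, conditions)
print(seq)                                # ['A2', 'A4', 'A1', 'A3']
print(is_subsequence(["A2", "A1"], seq))  # True: A2 is expressed below A1
print(is_subsequence(["A1", "A2"], seq))  # False
```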

Mining OP-Clusters: A Naïve Approach
  • A naïve approach
    • Enumerate all possible subsequences in a prefix tree.
    • For each subsequence, collect all genes that contain the subsequence.
  • Challenge:
    • The total number of distinct subsequences is [formula not preserved].

[Figure: a complete prefix tree with 4 items {a,b,c,d}, rooted at root.]
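The naïve approach can be sketched as follows, under the assumption that a candidate subsequence is any ordered selection of distinct conditions. Under that assumption there are n!/(n−k)! candidates of length k, which for 4 items gives 4 + 12 + 24 + 24 = 64 in total. The gene sequences, thresholds, and function names below are hypothetical.

```python
from itertools import permutations
from math import factorial

def count_ordered_subsequences(n):
    """Number of non-empty ordered selections of distinct items out of n."""
    return sum(factorial(n) // factorial(n - k) for k in range(1, n + 1))

def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(item in it for item in pattern)

def naive_op_clusters(sequences, conditions, min_conditions, min_genes):
    """Enumerate every candidate order and collect the genes that follow it."""
    clusters = {}
    examined = 0
    for k in range(min_conditions, len(conditions) + 1):
        for pattern in permutations(conditions, k):
            examined += 1
            genes = [g for g, seq in sequences.items() if is_subsequence(pattern, seq)]
            if len(genes) >= min_genes:
                clusters[pattern] = genes
    return clusters, examined

# Hypothetical gene sequences (conditions in ascending order of expression).
sequences = {
    "g1": ["a", "b", "c", "d"],
    "g2": ["a", "c", "b", "d"],
    "g3": ["b", "a", "c", "d"],
}

print(count_ordered_subsequences(4))  # 64
clusters, examined = naive_op_clusters(sequences, ["a", "b", "c", "d"], 3, 3)
print(examined)   # 48 candidates of length 3 or 4
print(clusters)   # {('a', 'c', 'd'): ['g1', 'g2', 'g3']}
```

Every candidate is tested against every gene even though most candidates never occur in the data, which is the inefficiency a more compact prefix tree is meant to remove.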

Mining OP-Clusters: Prefix Tree
  • Goal:
    • Build a compact prefix tree that includes only the subsequences occurring in the original database.
  • Strategies:
    • Depth-first traversal
    • Suffix concatenation: visit only subsequences that exist in the input sequences.
    • Apriori property: visit only subsequences that are sufficiently supported, in order to derive longer subsequences.

[Figure: prefix tree rooted at Root over items a, b, c, d; each node is labeled with an item and a list of sequence IDs, for example a:1,2,3 and d:1,3.]
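The strategies above can be sketched as a depth-first pattern growth. This is an illustration in the spirit of the slide rather than the OP-Cluster implementation itself: a prefix is extended only with items that follow it in some input sequence, and a branch is abandoned as soon as its support falls below the threshold. The sequences, thresholds, and function names are hypothetical.

```python
def mine_prefix_tree(sequences, min_support, min_length):
    """Depth-first growth of subsequences that occur in the input sequences.
    Returns a dict mapping each supported pattern to the sorted sequence IDs."""
    results = {}

    def grow(prefix, positions):
        # positions: sequence ID -> index just past the last matched item.
        extensions = {}
        for sid, start in positions.items():
            seq = sequences[sid]
            for idx in range(start, len(seq)):
                # Only items that actually follow the prefix are considered;
                # setdefault keeps the leftmost match in each sequence.
                extensions.setdefault(seq[idx], {}).setdefault(sid, idx + 1)
        for item in sorted(extensions):
            projected = extensions[item]
            if len(projected) < min_support:
                continue  # Apriori: no longer pattern on this branch can qualify
            pattern = prefix + (item,)
            if len(pattern) >= min_length:
                results[pattern] = sorted(projected)
            grow(pattern, projected)

    grow((), {sid: 0 for sid in sequences})
    return results

# Hypothetical gene sequences (conditions in ascending order of expression).
sequences = {
    "g1": ["a", "b", "c", "d"],
    "g2": ["a", "c", "b", "d"],
    "g3": ["b", "a", "c", "d"],
}

print(mine_prefix_tree(sequences, min_support=3, min_length=3))
# {('a', 'c', 'd'): ['g1', 'g2', 'g3']}
```

Only prefixes that are present and sufficiently supported are ever expanded, so the explored tree stays far smaller than the complete prefix tree over all items.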
