Fingerprint Clustering with Bounded Number of Missing Values


Presentation Transcript

  1. Fingerprint Clustering with Bounded Number of Missing Values
     Paola Bonizzoni, Gianluca Della Vedova, Giancarlo Mauri (Università di Milano-Bicocca, Italy)
     Riccardo Dondi (Università di Bergamo, Italy)
     Fingerprint Clustering - CPM 2006

  2. Talk Outline
     • Biological problem and combinatorial problem
     • Three versions of the problem:
       • Clustering with Missing Values (CMV)
       • Inside Edge Clustering (IEC)
       • Outside Edge Clustering (OEC)
     • Approximation algorithm for IEC and OEC
     • Polynomial time algorithm for restricted CMV
     • APX-hardness of CMV
     • APX-hardness of IEC and OEC
     • Future work

  3. Biological Motivations
     Classification of microorganisms:
     • A library of rDNA (ribosomal RNA gene) clones is created
     • A short DNA sequence (a probe) is applied to hybridize with all clones of the library
     • After hybridization, unbound probes are removed; the library is analyzed to see how strongly the probe has hybridized to each spot
     • The experiment is repeated for a set of probes

  4. Biological Motivations
     • Fingerprint of a clone: vector consisting of the hybridization intensity values between the clone and each probe
     To classify microorganisms:
     • Fingerprints are transformed into binary vectors
     • Fingerprints are clustered to infer different properties with respect to the probes

  5. Biological Motivations
     • Goal: translate hybridization intensity values into binary values 0, 1
     • Some intensity values are ambiguous, so it is not always possible to get binary vectors
     • For each clone we are given a fingerprint over the alphabet {0,1,N}:
       • 0 → no hybridization
       • 1 → hybridization
       • N → unable to determine whether a hybridization has happened

  6. Clustering of fingerprints – Combinatorial problem
     • Two fingerprints are compatible iff they agree in every position where both are different from N
     • Example:
       Two compatible fingerprints:
         0 1 0 N N 0 1 0
         0 1 N N 1 0 1 0
       Two incompatible fingerprints:
         0 1 0 N N 0 1 0
         0 1 N N 1 0 0 0
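The compatibility test on this slide can be sketched in a few lines. This is only an illustration: the function name `compatible` and the choice to encode fingerprints as strings over `0`, `1`, `N` are assumptions, not notation from the talk.

```python
def compatible(f: str, g: str) -> bool:
    """Return True iff f and g agree at every position where neither is 'N'."""
    if len(f) != len(g):
        raise ValueError("fingerprints must have the same length")
    return all(a == b or "N" in (a, b) for a, b in zip(f, g))


# The two pairs from the slide, written without spaces.
a = "010NN010"
b = "01NN1010"
c = "01NN1000"

print(compatible(a, b))  # True
print(compatible(a, c))  # False: the seventh position is 1 in a and 0 in c
```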

  7. Clustering of fingerprints – Combinatorial problem
     Clustering of fingerprints: general formulation
     • Input: a set F of fingerprints
     • Output: a clustering (partition) C of F such that each cluster of C contains only mutually compatible fingerprints

  8. Clustering of fingerprints – Combinatorial problem
     An example
     • F: f1 = 010N, f2 = 0N01, f3 = N100, f4 = 1NN1
     • Compatible pairs: (f1, f2) and (f1, f3)
     • Some possible solutions:
       • (f1 = 010N, f2 = 0N01), (f3 = N100), (f4 = 1NN1)
       • (f1 = 010N, f3 = N100), (f2 = 0N01), (f4 = 1NN1)
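To make the feasibility condition concrete, here is a minimal sketch of a checker for candidate solutions. It assumes fingerprints are stored in a dictionary from names to strings and a clustering is a list of lists of names. The helper names `compatible` and `is_valid_clustering` are hypothetical.

```python
from itertools import combinations


def compatible(f, g):
    """True iff f and g agree wherever neither has an 'N'."""
    return all(a == b or "N" in (a, b) for a, b in zip(f, g))


def is_valid_clustering(fingerprints, clusters):
    """Check that clusters partition the fingerprints and are pairwise compatible."""
    names = [name for cluster in clusters for name in cluster]
    if sorted(names) != sorted(fingerprints):
        return False  # some fingerprint is missing or appears twice
    return all(
        compatible(fingerprints[x], fingerprints[y])
        for cluster in clusters
        for x, y in combinations(cluster, 2)
    )


F = {"f1": "010N", "f2": "0N01", "f3": "N100", "f4": "1NN1"}

print(is_valid_clustering(F, [["f1", "f2"], ["f3"], ["f4"]]))  # True
print(is_valid_clustering(F, [["f1", "f3"], ["f2"], ["f4"]]))  # True
print(is_valid_clustering(F, [["f2", "f3"], ["f1"], ["f4"]]))  # False
```

Checking pairs is enough here. If the fingerprints of a cluster are pairwise compatible, every position carries at most one non-N value across the cluster, so the cluster has a common binary resolution.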

  9. Clustering of fingerprints – Three versions of the problem
     Three combinatorial versions of the problem, with different objective functions:
     • CMV (Clustering with Missing Values): minimize the number of clusters
     • IEC (Inside Edge Clustering with missing values): maximize the number of co-clustered pairs of fingerprints
     • OEC (Outside Edge Clustering with missing values): minimize the number of pairs of compatible fingerprints assigned to different clusters

  10. CMV – An example
      CMV: minimize the number of clusters
      F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
      Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
      • A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 3
      • Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 2

  11. IEC – An example
      IEC: maximize the number of co-clustered pairs
      F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
      Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
      • A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 1: pair (f1, f2) co-clustered
      • Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 2: pairs (f1, f3) and (f2, f4) co-clustered

  12. OEC – An example
      OEC: minimize the number of compatible pairs that are not co-clustered
      F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
      Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
      • A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 2: pairs (f1, f3) and (f2, f4) not co-clustered
      • Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 1: pair (f1, f2) not co-clustered
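The three objective values quoted in the CMV, IEC and OEC examples can be recomputed with a short sketch. The function `objectives` and the data layout (a dictionary of named fingerprints, clusters as lists of names) are assumptions made for illustration.

```python
from itertools import combinations


def compatible(f, g):
    """True iff f and g agree wherever neither has an 'N'."""
    return all(a == b or "N" in (a, b) for a, b in zip(f, g))


def objectives(fingerprints, clusters):
    """Return the (CMV, IEC, OEC) values of a valid clustering."""
    cluster_of = {name: i for i, cluster in enumerate(clusters) for name in cluster}
    cmv = len(clusters)  # number of clusters
    iec = sum(len(c) * (len(c) - 1) // 2 for c in clusters)  # co-clustered pairs
    oec = sum(  # compatible pairs split across different clusters
        1
        for x, y in combinations(fingerprints, 2)
        if compatible(fingerprints[x], fingerprints[y]) and cluster_of[x] != cluster_of[y]
    )
    return cmv, iec, oec


F = {"f1": "01NN", "f2": "0NN1", "f3": "0N00", "f4": "00N1"}
some_solution = [["f1", "f2"], ["f3"], ["f4"]]
optimum = [["f1", "f3"], ["f2", "f4"]]

print(objectives(F, some_solution))  # (3, 1, 2)
print(objectives(F, optimum))        # (2, 2, 1)
```

In a valid clustering every co-clustered pair is compatible, so the IEC and OEC values always add up to the number of compatible pairs (3 in this instance). The two problems therefore share optimal solutions, although their approximation ratios differ.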

  13. Parameterized versions
      We consider parameterized versions of the problem, where the parameter p bounds the number of N’s in each fingerprint.
      CMV(p), IEC(p), OEC(p): the three problems when fingerprints have at most p positions with value N.

  14. Parameterized versions
      Resolution of a fingerprint f: a vector over {0,1} that is compatible with f
      Example: f = 01NN10
      Possible resolutions:
      • 01 00 10
      • 01 01 10
      • 01 10 10
      • 01 11 10

  15. Parameterized versions
      For each fingerprint with p N’s: 2^p possible resolutions
      Reformulation of the problem: given a set of fingerprints and the corresponding set S of resolved vectors, assign each fingerprint f to exactly one of its resolutions in S in order to optimize the objective function
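The count of resolutions can be checked by enumerating them. The sketch below fills every N position with each combination of 0 and 1. The function name `resolutions` is an assumption.

```python
from itertools import product


def resolutions(f: str) -> list[str]:
    """List every binary vector obtained by replacing each 'N' in f with 0 or 1."""
    slots = [i for i, ch in enumerate(f) if ch == "N"]
    result = []
    for bits in product("01", repeat=len(slots)):
        chars = list(f)
        for i, bit in zip(slots, bits):
            chars[i] = bit
        result.append("".join(chars))
    return result


print(resolutions("01NN10"))
# ['010010', '010110', '011010', '011110']
```

A fingerprint with p occurrences of N yields exactly 2^p strings, so for constant p the set S of resolved vectors has size at most 2^p times the number of fingerprints.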

  16. Previous results
      CMV(p):
      • NP-hard for p ≥ 2 [Figueroa et al., CATS 2005]
      • Poly-time for p = 1 [Figueroa et al., J of Comp. Biology 2004]
      • Approximation algorithm with factor min(1 + log n, 2 + p log l) [Figueroa et al., CATS 2005]
      IEC(p):
      • Approximation algorithm with factor 2^(2p−1) for any p = O(log n) [Figueroa et al., CATS 2005]
      OEC(p):
      • Approximation algorithm with factor 2(1-1/2p) for restricted instances [Figueroa et al., CATS 2005]

  17. Approximation algorithm for OEC(p) and IEC(p)
      Greedy Algorithm:
      WHILE (there exists an unassigned fingerprint)
        • select a resolved vector that resolves the maximum number of unassigned fingerprints, and assign them to it
        • delete the assigned fingerprints
      ENDWHILE
      Approximation ratio 2 for OEC
      Approximation ratio ½ for IEC
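A minimal sketch of the greedy loop follows. The slide does not say how ties are broken or how candidate vectors are generated. This version enumerates all resolutions of the remaining fingerprints and takes the lexicographically smallest vector among the best ones, and both choices are assumptions. The names `resolutions`, `resolves` and `greedy_clustering` are hypothetical.

```python
from itertools import product


def resolutions(f):
    """All binary vectors obtained by filling the 'N' positions of f."""
    slots = [i for i, ch in enumerate(f) if ch == "N"]
    out = []
    for bits in product("01", repeat=len(slots)):
        chars = list(f)
        for i, bit in zip(slots, bits):
            chars[i] = bit
        out.append("".join(chars))
    return out


def resolves(r, f):
    """True iff binary vector r matches f at every non-N position."""
    return all(b == "N" or a == b for a, b in zip(r, f))


def greedy_clustering(fingerprints):
    """Map each chosen resolved vector to the names of the fingerprints assigned to it."""
    unassigned = dict(fingerprints)
    clusters = {}
    while unassigned:
        # Sorting makes the tie-break deterministic (an assumption, see lead-in).
        candidates = sorted({r for f in unassigned.values() for r in resolutions(f)})
        best = max(candidates, key=lambda r: sum(resolves(r, f) for f in unassigned.values()))
        members = [name for name, f in unassigned.items() if resolves(best, f)]
        clusters[best] = members
        for name in members:
            del unassigned[name]
    return clusters


F = {"f1": "01NN", "f2": "0NN1", "f3": "0N00", "f4": "00N1"}
print(greedy_clustering(F))
```

Each round enumerates up to 2^p candidate vectors per remaining fingerprint, so this sketch is polynomial only when p is O(log n).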

  18. A tight example for IEC
      f1 = N001; f2 = 0N01; f3 = 01N1; f4 = 011N
      f1 is compatible with f2, f2 with f3, f3 with f4
      Resolved vectors associated with the compatible pairs: r12 = 0001; r23 = 0101; r34 = 0111
      Each of these resolved vectors resolves two fingerprints

  19. A tight example for IEC
      The algorithm chooses one resolved vector, for example r23:
      • f2 and f3 are assigned to r23 and deleted
      • r12 is chosen; f1 is assigned to it and deleted
      • r34 is chosen; f4 is assigned to it and deleted
      Number of compatible co-clustered pairs: 1
      The optimal solution consists of:
      • r12, with f1 and f2 assigned to it
      • r34, with f3 and f4 assigned to it
      Number of compatible co-clustered pairs in the optimal solution: 2
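The numbers in the tight example can be verified directly. The sketch below takes r23 to be 0101, the only binary vector that resolves both f2 and f3. The helper names are assumptions. It first confirms that the three vectors tie, so a greedy run may legitimately start with r23, and then counts the co-clustered pairs of both runs.

```python
def resolves(r, f):
    """True iff binary vector r matches fingerprint f at every non-N position."""
    return all(b == "N" or a == b for a, b in zip(r, f))


def co_clustered_pairs(clusters):
    """IEC objective: number of pairs of fingerprints sharing a cluster."""
    return sum(len(c) * (len(c) - 1) // 2 for c in clusters)


F = {"f1": "N001", "f2": "0N01", "f3": "01N1", "f4": "011N"}
R = {"r12": "0001", "r23": "0101", "r34": "0111"}

# Every candidate vector resolves exactly two fingerprints, so the first greedy step is a tie.
coverage = {name: [k for k, f in F.items() if resolves(r, f)] for name, r in R.items()}
print(coverage)
# {'r12': ['f1', 'f2'], 'r23': ['f2', 'f3'], 'r34': ['f3', 'f4']}

greedy_run = [["f2", "f3"], ["f1"], ["f4"]]  # r23 picked first
optimal = [["f1", "f2"], ["f3", "f4"]]       # r12 and r34
print(co_clustered_pairs(greedy_run), co_clustered_pairs(optimal))  # 1 2
```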

  20. A Polynomial Time Algorithm for Restricted CMV
      Restricted CMV: for each position j there is at most one fingerprint having value N in the j-th position
      An instance of restricted CMV:
      f1 = NN 01 01 01; f2 = 01 NN 01 01; f3 = 01 11 NN 01; f4 = 01 11 11 NN
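Whether an instance satisfies the restriction can be tested column by column. This is a small sketch, and the function name `is_restricted` is an assumption.

```python
def is_restricted(fingerprints):
    """True iff every position holds an 'N' in at most one fingerprint."""
    return all(column.count("N") <= 1 for column in zip(*fingerprints))


# Instance from the slide, written without spaces.
restricted = ["NN010101", "01NN0101", "0111NN01", "011111NN"]
# Instance from the CMV example: the third position is N in three fingerprints.
unrestricted = ["01NN", "0NN1", "0N00", "00N1"]

print(is_restricted(restricted))    # True
print(is_restricted(unrestricted))  # False
```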

  21. A Polynomial Time Algorithm for Restricted CMV
      Two interesting properties of restricted CMV:
      • there are at most n^2 interesting resolved vectors (a resolved vector is interesting if it resolves more than one fingerprint)
      • there is a fingerprint (a private fingerprint) that is resolved by one interesting resolved vector
      At each step, the algorithm selects the interesting resolved vector that resolves a private fingerprint

  22. APX-hardness of CMV(2)
      L-reduction from MIN Vertex Cover on cubic graphs (APX-hard [Alimonti and Kann, TCS 2000])
      G = (V, E) cubic graph → gadget graph GA = (VA, EA)
      • For each vi in V define the vertex gadget GVi
      [Figure: vertex gadget GVi]
      Two possible vertex covers of the gadget: type 1 (suboptimal) and type 2 (optimal)

  23. APX-hardness of CMV(2)
      G = (V, E) cubic graph → gadget graph GA = (VA, EA)
      • For each edge (vi, vj) in E define the edge gadget EGij
      [Figure: edge gadget EGij with vertex gadgets GVi and GVj]
      • Case 1: four vertices covered in EGij → GVi and GVj both optimal
      • Case 2: two vertices covered in EGij → GVi or GVj suboptimal
      • Case 2 is always better than case 1

  24. APX-hardness of CMV(2)
      An instance of CMV(2) is built as follows:
      • a resolved vector is built for each vertex of the gadgets
      • a fingerprint is built for each edge of the gadgets
      • two fingerprints share a common resolution iff the corresponding edges are incident on a common vertex
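The construction implies a correspondence between clusterings and vertex covers, which the sketch below illustrates at an abstract level. It does not build the actual 0/1/N encoding, which the slide leaves out. It assumes that each edge-fingerprint is assigned to the resolved vector of one of its two endpoints, since a resolution shared with no other fingerprint cannot reduce the number of clusters. K4 stands in for the gadget graph only because it is the smallest cubic graph, and both function names are hypothetical.

```python
from itertools import combinations, product


def min_vertex_cover_size(vertices, edges):
    """Brute-force size of a minimum vertex cover."""
    for k in range(len(vertices) + 1):
        for cover in combinations(vertices, k):
            chosen = set(cover)
            if all(u in chosen or v in chosen for u, v in edges):
                return k


def min_clusters(edges):
    """Brute-force CMV optimum when each edge-fingerprint picks one endpoint vector."""
    # A choice selects one endpoint per edge; the distinct endpoints are the clusters.
    return min(len(set(choice)) for choice in product(*edges))


vertices = [1, 2, 3, 4]
edges = list(combinations(vertices, 2))  # K4: every vertex has degree 3

print(min_vertex_cover_size(vertices, edges), min_clusters(edges))  # 3 3
```

The vectors used by any assignment touch every edge, so they form a vertex cover, and any vertex cover yields an assignment using only its vertices. The two optima therefore coincide under the stated assumption.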

  25. APX-hardness of IEC(2) and OEC(2)
      • L-reduction from MAX Independent Set on cubic graphs (APX-hard [Alimonti and Kann, TCS 2000])
      • Similar to the reduction for CMV(2)
      • G = (V, E) a cubic graph:
        • for each vertex vi in V, a set Fi of 9 fingerprints
        • for each edge (vi, vj), a fingerprint fij

  26. Open Problems
      • Approximation of CMV(p):
        • is there a constant factor not dependent on p?
        • improve the min(1 + log n, 2 + p log l) approximation factor
      • Approximation of IEC(p) and OEC(p):
        • improve the approximation factors ½ and 2
      • Are restricted versions of IEC and OEC in P?

  27. Conclusions
      • Biological problem and combinatorial problem
      • Three versions:
        • Clustering with Missing Values (CMV)
        • Inside Edge Clustering (IEC)
        • Outside Edge Clustering (OEC)
      • Approximation algorithms for IEC(p) and OEC(p)
      • Polynomial time algorithm for restricted CMV
      • APX-hardness of CMV(2)
      • APX-hardness of IEC(2) and OEC(2)
      • Future work
