
GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING

By Istvan Jonyer, Lawrence B. Holder and Diane J. Cook, The University of Texas at Arlington.


Presentation Transcript


  1. GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING by Istvan Jonyer, Lawrence B. Holder and Diane J. Cook The University of Texas at Arlington

  2. Outline • What is hierarchical conceptual clustering? • Overview of Subdue • Conceptual clustering in Subdue • Evaluation of hierarchical clusterings • Experiments and results • Conclusions

  3. What is clustering?

  4. What is hierarchical conceptual clustering? • Unsupervised concept learning • Generating hierarchies to explain data • Applications • Hypothesis generation and testing • Prediction based on groups • Finding taxonomies

  5. Example hierarchical conceptual clustering
     Animals
       • HeartChamber: four, BodyTemp: regulated, Fertilization: internal
           • Name: mammal, BodyCover: hair
           • Name: bird, BodyCover: feathers
       • BodyTemp: unregulated
           • Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
           • Fertilization: external
               • Name: amphibian, BodyCover: moist-skin, HeartChamber: three
               • Name: fish, BodyCover: scales, HeartChamber: two

  6. The Problem • Hierarchical conceptual clustering in discrete-valued structural databases • Existing systems: • Continuous-valued • Discrete but unstructured • We can do better! (The field is underexplored)

  7. Related Work • Cobweb • Labyrinth • AutoClass • Snob • In Euclidean space: Chameleon, Cure • Unsupervised learning algorithms

  8. The Solution • Take Subdue and extend it!

  9. Overview of Subdue • Data mining in graph representations of structural databases • [Figure: example input graph with labels A-F and a-g]

  10. Overview of Subdue • Iteratively searching for best substructure by MDL heuristic • [Figure: candidate substructure with labels A, B, C, D and a, b, c]

  11. Overview of Subdue • Compress using best substructure • [Figure: compressed graph in which instances of the best substructure appear as vertices labeled S]
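The search-and-compress idea on the preceding slides can be illustrated with a small calculation. This is only a sketch under a strong simplifying assumption: graph size is counted as vertices plus edges, standing in for the bit-level description length that an MDL heuristic would use. It also assumes non-overlapping instances, each replaced by a single vertex. The function names and the size proxy are hypothetical and are not taken from Subdue.

```python
def graph_size(num_vertices: int, num_edges: int) -> int:
    """Size proxy (assumption): vertices plus edges, not a bit-level description length."""
    return num_vertices + num_edges


def compression_ratio(graph_v: int, graph_e: int,
                      sub_v: int, sub_e: int, instances: int) -> float:
    """Return (size(S) + size(G|S)) / size(G); lower means better compression.

    G|S is G with every non-overlapping instance of substructure S
    collapsed into a single vertex.
    """
    if graph_size(graph_v, graph_e) == 0:
        raise ValueError("graph must not be empty")
    if instances < 0 or instances * sub_v > graph_v or instances * sub_e > graph_e:
        raise ValueError("instances do not fit into the graph")
    # Each instance loses all its vertices but one, and all its internal edges.
    compressed_v = graph_v - instances * (sub_v - 1)
    compressed_e = graph_e - instances * sub_e
    total = graph_size(sub_v, sub_e) + graph_size(compressed_v, compressed_e)
    return total / graph_size(graph_v, graph_e)


# A 12-vertex, 14-edge graph with two instances of a 4-vertex, 3-edge substructure.
ratio = compression_ratio(12, 14, 4, 3, 2)
```

A substructure that occurs rarely yields a ratio above 1, because storing its definition costs more than the compression saves.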

  12. Overview of Subdue • Fuzzy match • Inexact matching of subgraphs • Applications: • Defining fuzzy concepts • Evaluation of clusterings

  13. Conceptual Clustering with Subdue • Use Subdue to identify clusters • The best subgraph in an iteration defines a cluster • When to stop within an iteration? • Use -limit option • Use -size option • Use first minimum heuristic (new)

  14. The First Minimum Heuristic • Use subgraph at first local minimum • Detect it using -prune2 option

  15. The First Minimum Heuristic • Not a greedy heuristic! • Although first local minimum is usually the global minimum • First local minimum is caused by a smaller, more frequently occurring subgraph • Subsequent minima are caused by bigger, less frequently occurring subgraphs => First subgraph is more general
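As an illustration of the idea only, the first local minimum of a sequence of evaluation values can be located with a single scan. The sketch assumes that lower values are better and that the values arrive in the order the search visits candidate subgraphs. The function name is hypothetical, and nothing here describes how Subdue's pruning option actually detects the minimum.

```python
from typing import Sequence


def first_local_minimum(values: Sequence[float]) -> int:
    """Return the index of the first value that is followed by a larger one.

    On a plateau the last index of the plateau is returned. If the values
    never increase, the last index is returned.
    """
    if not values:
        raise ValueError("values must be non-empty")
    for i in range(len(values) - 1):
        if values[i + 1] > values[i]:
            return i
    return len(values) - 1


# Hypothetical evaluation values with two minima: the first at index 2,
# a later and deeper one at index 4.
values = [0.9, 0.7, 0.6, 0.65, 0.5, 0.8]
chosen = first_local_minimum(values)
```

In this made-up sequence the later minimum is deeper, yet the heuristic still selects index 2, which corresponds to the smaller, more frequently occurring subgraph.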

  16. The First Minimum Heuristic A multi-minimum search space:

  17. Lattice vs. Tree • Previous work defined classification trees • Inadequate in structured domains • Better hierarchical description: classification lattice • A cluster can have more than one parent • A parent can be at any level (not only one level above)

  18. Hierarchical Clustering in Subdue • Subdue can compress by a subgraph after each iteration • Subsequent clusters may be defined in terms of previously defined clusters • This results in a hierarchy
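A minimal sketch of how such a hierarchy can form a lattice rather than a tree: each cluster definition is reduced to the set of vertex labels in its defining subgraph, and a label that equals the name of an earlier cluster marks that cluster as a parent. The names `build_lattice`, `ROOT`, and the cluster labels `S1` to `S4` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

ROOT = "Root"


def build_lattice(definitions: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each cluster name to the list of its parent clusters.

    `definitions` must be ordered by discovery (dicts keep insertion order).
    A cluster that uses no earlier cluster is attached to the root.
    """
    parents: Dict[str, List[str]] = {}
    for name, labels in definitions.items():
        # Keys already in `parents` are exactly the clusters found earlier.
        earlier = [p for p in parents if p in labels]
        parents[name] = earlier if earlier else [ROOT]
    return parents


definitions = {
    "S1": {"A", "B"},
    "S2": {"C", "D"},
    "S3": {"S1", "S2", "E"},  # defined in terms of two earlier clusters
    "S4": {"S1", "F"},
}
lattice = build_lattice(definitions)
```

Here `S3` has two parents, which a classification tree could not express.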

  19. Hierarchical Conceptual Clustering of an Artificial Domain

  20. Root Hierarchical Conceptual Clustering of an Artificial Domain

  21. Evaluation of Clusterings • Traditional evaluation: • Not applicable to hierarchical domains • No known evaluation for hierarchical clusterings • Most hierarchical evaluations are anecdotal

  22. New Evaluation Heuristic for Hierarchical Clusterings Properties of a good clustering: • Small number of clusters • Large coverage → good generality • Big cluster descriptions • More features → more inferential power • Minimal or no overlap between clusters • More distinct clusters → better defined concepts

  23. New Evaluation Heuristic for Hierarchical Clusterings • Big clusters: bigger distance between disjoint clusters • Overlap: less overlap → bigger distance • Few clusters: averaging comparisons

  24. Experiments and Results • Validation in an artificial domain • Validation in unstructured domains • Comparison to existing systems • Real world applications

  25. The Animal Domain

      Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
      mammal     hair            four            regulated    internal
      bird       feathers        four            regulated    internal
      reptile    cornified-skin  imperfect-four  unregulated  internal
      amphibian  moist-skin      three           unregulated  external
      fish       scales          two             unregulated  external

      [Figure: graph representation of the mammal row; an "animal" vertex with edges Name → mammal, BodyCover → hair, HeartChamber → four, BodyTemp → regulated, Fertilization → internal]
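The graph representation shown for the mammal row can be sketched as a conversion from table rows to star-shaped graphs: one "animal" vertex per row, with one labeled edge per attribute pointing to a value vertex. The data structures and function names below are assumptions made for illustration and do not reflect Subdue's input file format.

```python
from typing import Dict, List, Sequence, Tuple

ATTRIBUTES = ["Name", "BodyCover", "HeartChamber", "BodyTemp", "Fertilization"]

ANIMALS = [
    ("mammal", "hair", "four", "regulated", "internal"),
    ("bird", "feathers", "four", "regulated", "internal"),
    ("reptile", "cornified-skin", "imperfect-four", "unregulated", "internal"),
    ("amphibian", "moist-skin", "three", "unregulated", "external"),
    ("fish", "scales", "two", "unregulated", "external"),
]

Vertices = Dict[int, str]            # vertex id -> label
Edges = List[Tuple[int, int, str]]   # (source id, target id, edge label)


def row_to_graph(row: Sequence[str], start_id: int = 0) -> Tuple[Vertices, Edges]:
    """Build a star graph: an 'animal' hub with one edge per attribute."""
    vertices: Vertices = {start_id: "animal"}
    edges: Edges = []
    for offset, (attribute, value) in enumerate(zip(ATTRIBUTES, row), start=1):
        vertex_id = start_id + offset
        vertices[vertex_id] = value
        edges.append((start_id, vertex_id, attribute))
    return vertices, edges


def table_to_graph(rows: Sequence[Sequence[str]]) -> Tuple[Vertices, Edges]:
    """Combine all rows into one graph made of disconnected star components."""
    all_vertices: Vertices = {}
    all_edges: Edges = []
    for row in rows:
        vertices, edges = row_to_graph(row, start_id=len(all_vertices))
        all_vertices.update(vertices)
        all_edges.extend(edges)
    return all_vertices, all_edges


vertices, edges = table_to_graph(ANIMALS)
```

Value vertices are deliberately not shared between rows, so a repeated pattern such as "HeartChamber: four, BodyTemp: regulated" appears as a repeated subgraph.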

  26. Hierarchical Clustering of the Animal Domain
      Animals
        • HeartChamber: four, BodyTemp: regulated, Fertilization: internal
            • Name: mammal, BodyCover: hair
            • Name: bird, BodyCover: feathers
        • BodyTemp: unregulated
            • Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
            • Fertilization: external
                • Name: amphibian, BodyCover: moist-skin, HeartChamber: three
                • Name: fish, BodyCover: scales, HeartChamber: two

  27. Hierarchical Clustering of the Animal Domain by Cobweb
      animals
        • amphibian/fish
            • fish
            • amphibian
        • mammal/bird
            • mammal
            • bird
        • reptile

  28. Comparison of Subdue and Cobweb • Quality of Subdue’s lattice (tree): 2.60 • Quality of Cobweb’s tree: 1.74 • Therefore Subdue is better • Reasons for a higher score: • Better generalization resulting in fewer clusters • Eliminating overlap between (reptile) and (amphibian/fish)

  29. Chemical Application: Clustering of a DNA sequence

  30. Chemical Application: Clustering of a DNA sequence • [Figure: hierarchy of substructures found in the DNA graph, including a phosphate group (O=P-OH) and carbon-nitrogen ring fragments] • Coverage • 61% • 68% • 71%

  31. Conclusions • Goal of hierarchical conceptual clustering of structured databases was achieved • Synthesized classification lattice • Developed new evaluation heuristic for hierarchical clusterings • Good performance in comparison to other systems, even in unstructured domains

  32. Future Work • More experiments on real-world domains • Comparison to other systems • Incorporation of evaluation tool into Subdue
