
On Data Labeling for Clustering Categorical Data




Presentation Transcript


  1. On Data Labeling for Clustering Categorical Data Hung-Leng Chen, Kun-Ta Chuang, Member, and Ming-Syan Chen TKDE, Vol. 19, No. 11, 2008, pp. 1458-1471. Presenter : Wei-Shen Tai 2008/11/4

  2. Outline • Introduction • Related work • Model of MARDL (MAximal Resemblance Data Labeling) • Experimental results • Conclusions • Comments

  3. Motivation • Sampling • Scales down the size of the database and speeds up clustering algorithms. • The problem is how to allocate the unclustered data into appropriate clusters. (Diagram: Large Database → Sampling → Sampled data → Clustering; Unclustered data → Labeling → ?)

  4. Objective • Data Labeling • Gives each unclustered data point the most appropriate cluster label. • MARDL is independent of clustering algorithms, and any categorical clustering algorithm can be utilized in this framework.

  5. Categorical cluster representative • Node • Attribute name + attribute value. E.g., [A1=a] and [A2=m] are nodes. • N-nodeset • A set of n nodes in which every node comes from a distinct attribute. E.g., {[A1=a], [A2=m]} is a 2-nodeset. • Independent nodesets • Two nodesets that do not contain nodes from the same attribute are said to be independent of each other within a represented cluster. • E.g. {[A1=a], [A2=m]} and {[A3=c]} • p({[A1=a], [A2=m], [A3=c]}) = p({[A1=a], [A2=m]}) * p({[A3=c]})
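The definitions on this slide can be sketched in a few lines of code. This is an illustrative sketch, not the paper's implementation: a node is modeled as an (attribute, value) pair, a nodeset as a frozenset of such pairs, and the probability table `p` is assumed to be given for the cluster.

```python
# A node is an (attribute, value) pair; a nodeset is a frozenset of nodes.

def attributes(nodeset):
    """Attribute names appearing in a nodeset."""
    return {attr for attr, _ in nodeset}

def independent(ns1, ns2):
    """Two nodesets are independent if they share no attribute."""
    return attributes(ns1).isdisjoint(attributes(ns2))

def joint_probability(p, ns1, ns2):
    """For independent nodesets, p(ns1 ∪ ns2) = p(ns1) * p(ns2)."""
    assert independent(ns1, ns2)
    return p[ns1] * p[ns2]

# The slide's example: {[A1=a],[A2=m]} and {[A3=c]} are independent,
# so their joint probability factorizes (probability values are made up).
ns_a = frozenset({("A1", "a"), ("A2", "m")})
ns_b = frozenset({("A3", "c")})
p = {ns_a: 0.5, ns_b: 0.4}
print(independent(ns_a, ns_b))           # True
print(joint_probability(p, ns_a, ns_b))  # 0.2
```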

  6. Node and n-nodeset importance • Information theory • Entropy
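The slide invokes entropy from information theory to weigh node importance. As a minimal sketch, here is the standard Shannon entropy of a categorical value distribution; the paper's full importance measure (which combines frequency with this kind of entropy weighting) is not reproduced here.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H = -Σ p_i·log2(p_i) of a categorical column."""
    counts = Counter(values)
    total = len(values)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

print(entropy(["a", "a", "b", "b"]))  # 1.0 (two equally likely values)
print(entropy(["a", "a", "a", "a"]))  # 0.0 (no uncertainty)
```

An attribute whose values are spread evenly has high entropy; one dominated by a single value has entropy near zero, which is what makes its frequent nodes informative cluster descriptors.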

  7. N-nodeset importance representative (NNIR) • NNIR tree construction and pruning • An Apriori-like algorithm. • Initialization • Computing candidate nodeset importance and pruning • Generating candidate nodesets • Pruning • Threshold • The importance of a t-nodeset is less than a predefined threshold θ. • Relative maximum • The importance of a (t+1)-nodeset is larger than the importance of its t-nodesets. • Hybrid
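The level-wise construction above can be sketched as follows. This is a simplified illustration under a loud assumption: importance is reduced to a nodeset's relative frequency within the cluster, whereas the paper combines frequency with entropy-based weighting; only the threshold-θ pruning rule is shown.

```python
from itertools import combinations

def nnir_levels(records, theta, max_n):
    """Apriori-like sketch of NNIR construction for one cluster.
    ASSUMPTION: importance is simplified to relative frequency."""
    total = len(records)

    def importance(nodeset):
        return sum(all(r.get(a) == v for a, v in nodeset) for r in records) / total

    # Initialization: keep every 1-nodeset whose importance reaches theta.
    nodes = {(a, v) for r in records for a, v in r.items()}
    level = {ns for ns in ({frozenset({n}) for n in nodes})
             if importance(ns) >= theta}
    levels = [level]

    for n in range(2, max_n + 1):
        # Candidate generation: join surviving (n-1)-nodesets whose
        # union covers n distinct attributes.
        candidates = set()
        for ns1, ns2 in combinations(levels[-1], 2):
            union = ns1 | ns2
            if len(union) == n and len({a for a, _ in union}) == n:
                candidates.add(union)
        # Threshold pruning: drop candidates with importance below theta.
        level = {ns for ns in candidates if importance(ns) >= theta}
        if not level:
            break
        levels.append(level)
    return levels

# Tiny illustrative cluster (records and theta are made up).
records = [
    {"A1": "a", "A2": "m"},
    {"A1": "a", "A2": "m"},
    {"A1": "b", "A2": "n"},
]
levels = nnir_levels(records, theta=0.5, max_n=2)
```

With θ = 0.5, only [A1=a] and [A2=m] survive level 1 (each appears in 2 of 3 records), and their join {[A1=a],[A2=m]} survives level 2.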

  8. Maximal resemblance data labeling • Goal of MARDL • Decide the most appropriate cluster label ci for the unlabeled data point. • E.g., an unclustered data point {[A1=a], [A2=m], [A3=c]} is decomposed into the combination {[A1=a], [A2=m]} and {[A3=c]} in cluster c1.
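The labeling rule can be sketched as: given a point's decomposition into independent nodesets, sum their importances in each cluster's NNIR table and pick the cluster with the largest sum. All importance values below are illustrative, not taken from the paper.

```python
def resemblance(decomposition, importance_table):
    """Resemblance of a point in one cluster: the sum of the importances
    of the independent nodesets the point was decomposed into."""
    return sum(importance_table.get(ns, 0.0) for ns in decomposition)

def maximal_resemblance_label(decomposition, cluster_tables):
    """MARDL labeling: assign the cluster whose resemblance is maximal."""
    return max(cluster_tables,
               key=lambda c: resemblance(decomposition, cluster_tables[c]))

# The slide's example point {[A1=a],[A2=m],[A3=c]}, decomposed into
# {[A1=a],[A2=m]} and {[A3=c]} (hypothetical importance values).
ns_ab = frozenset({("A1", "a"), ("A2", "m")})
ns_c = frozenset({("A3", "c")})
tables = {
    "c1": {ns_ab: 0.6, ns_c: 0.3},  # resemblance 0.9
    "c2": {ns_ab: 0.1, ns_c: 0.2},  # resemblance 0.3
}
print(maximal_resemblance_label([ns_ab, ns_c], tables))  # c1
```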

  9. Approximate algorithm for MARDL • Only one combination is considered and utilized • Tree nodesets are queued and sorted by importance value. • The nodeset with maximal importance is selected. • Those nodesets which are not independent of the selected nodeset are removed from the queue. • E.g., an unclustered data point {[A1=a], [A2=m], [A3=c]} is matched against the tree-nodeset queue.
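The greedy steps above can be sketched directly: sort the candidate nodesets by importance, repeatedly take the top one, and drop every remaining nodeset that shares an attribute with it. The candidate set and importance values are illustrative assumptions.

```python
def attrs(nodeset):
    """Attribute names covered by a nodeset."""
    return {a for a, _ in nodeset}

def greedy_decompose(candidate_nodesets, importance):
    """Greedy approximation: select the queued nodeset with maximal
    importance, remove every queued nodeset not independent of it
    (i.e., sharing an attribute), and repeat until the queue is empty."""
    queue = sorted(candidate_nodesets,
                   key=lambda ns: importance[ns], reverse=True)
    chosen = []
    while queue:
        best = queue.pop(0)
        chosen.append(best)
        queue = [ns for ns in queue if attrs(ns).isdisjoint(attrs(best))]
    return chosen

# Candidate nodesets covering the point {[A1=a],[A2=m],[A3=c]}
# (importance values are made up for illustration).
imp = {
    frozenset({("A1", "a"), ("A2", "m")}): 0.5,
    frozenset({("A1", "a")}): 0.4,
    frozenset({("A2", "m")}): 0.3,
    frozenset({("A3", "c")}): 0.2,
}
result = greedy_decompose(imp.keys(), imp)
```

Here the greedy pass first selects {[A1=a],[A2=m]}, discards the overlapping singletons [A1=a] and [A2=m], and then selects {[A3=c]}, yielding exactly one combination covering the point.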

  10. Experimental results

  11. Conclusions • MARDL • Allocates unlabeled data points into appropriate clusters when a sampling technique is utilized to cluster a very large categorical database. • NIR • A categorical cluster representative technique. • NNIR • A more powerful representative than NIR when combinations of attribute values are considered.

  12. Comments • Advantage • A good method for assigning unclustered data to appropriate trained clusters in sampling-based categorical clustering. • The concept, derived from existing methods (Apriori and information theory), is easy to understand and accept. • MARDL is independent of clustering methods, and any categorical clustering algorithm can be utilized in this framework. • Drawback • Constructing a tree for each cluster is time-consuming, and the resulting tree is a rather complex cluster representation. • Because the importance of a (t+1)-nodeset may be larger than that of a t-nodeset, hybrid pruning must compute all candidate (t+1)-nodesets, which takes much time. • Application • Classifying unclustered data when a sampling technique is utilized to cluster a very large categorical database.
