Advisor ： Dr.Hsu Graduate ： Keng-Wei Chang Author ： Balaji Rajagopalan

National Yunlin University of Science and Technology (國立雲林科技大學)
Exploiting data preparation to enhance mining and knowledge discovery
• Advisor: Dr. Hsu
• Graduate: Keng-Wei Chang
• Authors: Balaji Rajagopalan, Mark W. Isken
• Source: IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 31, no. 4, November 2001

Outline
• Motivation
• Objective
• Introduction
• Data Preparation
• Research Method
• Results
N.Y.U.S.T. • I.M.

Motivation
• Organizational data is used for mining and knowledge discovery.
• In its natural form, this data is not amenable to mining.

Objective
• Show that data enhancement, through the introduction of new attributes along with judicious aggregation of existing attributes, results in higher-quality knowledge discovery.
• Examine the differential impact of data enhancement on the performance of different mining algorithms.

Introduction
• The exponential growth of information presents knowledge workers with a tremendous volume of data.
• Knowledge management solution:
  • Knowledge repository
  • Knowledge sharing
  • Knowledge discovery

Data Preparation
• Presents a framework based on prior research in knowledge discovery:
  • Data quality
  • Data characteristics
  • Data preparation

Research Method
• A data set from a large tertiary care hospital in the United States was used.
• Topics covered:
  • A. Problem Domain
  • B. Data
  • C. Clustering Algorithms for Knowledge Discovery
  • D. Entropy-Based Metrics for Cluster Quality Assessment
  • E. Rule Extraction Metrics

Problem Domain
• The task is the allocation of inpatient beds.
• The more difficult part is applying quantitative resource allocation with a manageable set of patient types.
• Quantitative resource: the sequence of hospital units visited and the corresponding lengths of stay.
• Patient type: a group of patients consuming a similar level of hospital resources.

Problem Domain
• This is referred to as the patient classification problem.
• The trade-off is between too few and too many patient types.
• The key is to identify the set of patient types.

Data
• Inpatient obstetrical and gynecological (OB/GYN) patient flow.
• There are numerous fields:
  • Demographics
  • Physician information
  • ICD-9-CM diagnostic and procedure codes
  • Diagnosis-related groups (DRGs)

Data
• Almost 500 DRGs are defined.
• DRGs in the range 353-384 are related to OB/GYN.
• These DRGs are grouped into five DRG types.

Clustering Algorithms for Knowledge Discovery
• K-means and Kohonen self-organizing maps
• Similarity: Euclidean distance function
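To make the clustering step concrete, here is a minimal sketch of K-means with a Euclidean distance function. It is an illustration in plain Python, not the implementation used in the paper; the function names `euclidean` and `kmeans`, the random initialization, and the iteration cap are all assumptions. The Kohonen self-organizing map is not shown.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means (Lloyd's algorithm). Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the solution is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

Because Euclidean distance is sensitive to scale, numeric variables with very different ranges would normally be rescaled before clustering; the slides do not say whether this was done.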

Entropy-Based Metrics for Cluster Quality Assessment
• Entropy
• Weighted entropy: uses cluster size to calculate a weighted average entropy measure for a cluster solution.
• Purity, based on the number of cases having a DRG type of i in cluster j.
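The metrics can be sketched as follows, using the usual textbook definitions of entropy and purity. The exact formulas from the paper did not survive on this slide, so the logarithm base (2), the purity definition (share of the most common DRG type in a cluster), and the function names are assumptions. Each cluster is represented simply as a list of the DRG types of its cases.

```python
import math
from collections import Counter


def cluster_entropy(drg_types):
    """Entropy (base 2) of the DRG types within one non-empty cluster."""
    n = len(drg_types)
    counts = Counter(drg_types)  # counts[i] = cases of DRG type i in this cluster
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cluster_purity(drg_types):
    """Fraction of cases that belong to the cluster's most common DRG type."""
    counts = Counter(drg_types)
    return max(counts.values()) / len(drg_types)


def weighted_entropy(clusters):
    """Average of the cluster entropies, weighted by cluster size."""
    clusters = [c for c in clusters if c]  # ignore empty clusters
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)
```

Lower entropy and higher purity both indicate clusters that are dominated by a single DRG type.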

Rule Extraction Metrics
• Most of the extracted rules are expected to show a high degree of resonance with our domain knowledge.

Results
• Details the data enhancements relevant to this study.
• Topics covered:
  • A. Data Preparation: Basics
  • B. Mining and Knowledge Discovery
  • C. Differential Impact Based on Clustering Method
  • D. Usefulness of Knowledge Discovered
  • E. Limitations
  • F. Implications for Research and Practice

Data Preparation: Basics
• The data set included fields that represent the path and the associated lengths of stay along that path.

Data Preparation: Basics
• Three data sets are considered in order to illustrate the impact of data preparation.
• ED1: eight numeric variables.

Data Preparation: Basics
• ED2: in addition to the ED1 variables, ED2 adds five nominal variables.
• Both DRG and CCS were designed to serve as aggregate measures of hospital resource consumption.

Data Preparation: Basics
• ED3: in addition to the ED2 variables, ED3 contains two binary variables:
  • Whether or not the patient gave birth during the visit
  • Whether or not the patient gave birth via C-section
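The way ED1, ED2, and ED3 build on one another can be sketched as nested field lists. The slides do not name the individual variables, so every field name below (`num_1`, `nom_1`, `gave_birth`, `c_section`) and the helper `project` are hypothetical placeholders; only the counts (eight numeric, five nominal, two binary) come from the slides.

```python
# Hypothetical field names: the slides give only the number of variables.
NUMERIC_FIELDS = [f"num_{i}" for i in range(1, 9)]  # eight numeric variables (ED1)
NOMINAL_FIELDS = [f"nom_{i}" for i in range(1, 6)]  # five nominal variables added in ED2
BINARY_FIELDS = ["gave_birth", "c_section"]         # two binary variables added in ED3

FIELDS_BY_DATASET = {
    "ED1": NUMERIC_FIELDS,
    "ED2": NUMERIC_FIELDS + NOMINAL_FIELDS,
    "ED3": NUMERIC_FIELDS + NOMINAL_FIELDS + BINARY_FIELDS,
}


def project(record, dataset):
    """Keep only the fields of `record` that belong to the given data set."""
    return {name: record[name] for name in FIELDS_BY_DATASET[dataset]}
```

Running the same clustering algorithm on each projection of the same patient records is what allows the effect of each enhancement step to be compared.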

Mining and Knowledge Discovery

Differential Impact Based on Clustering Method

Usefulness of Knowledge Discovered

Limitations
• The findings may not be exactly applicable in every case.
• Only two data mining algorithms were examined: K-means and Kohonen self-organizing maps.
• The study is illustrative, not exhaustive.
• Domain knowledge played a critical role in the data preparation process.

Implications for Research and Practice
• Provides empirical evidence demonstrating the impact of data preparation on mining and knowledge discovery.
• Suggests engaging in a comparative investigation of multiple algorithms.

Personal opinion
• …
