In the Name of God. Data Mining and Knowledge Discovery. Mohamad Taghipour

Mohamad Taghipour. Academic degrees: associate, bachelor's, and master's degrees in statistics, and a PhD in industrial engineering. Head of the Industrial Engineering Department at the Aba non-profit institute. Exemplary professor at the Azad and Payame Noor universities. Mohamad_taghipour@yahoo.com www.drmohamadtaghipour.ir www.mtaghipour.tk 09123944126

Evaluation of an integrated Knowledge Discovery and Data Mining process model

Contribution of the paper • This paper presents the results of a rigorous evaluation of the Integrated Knowledge Discovery and Data Mining (IKDDM) process model and compares it to the CRISP-DM process model. • Results of statistical tests indicate that IKDDM leads to a more effective and efficient implementation of the knowledge discovery process.

Knowledge Discovery and Data Mining (KDDM)

Data Mining projects are implemented by following the knowledge discovery process. • The knowledge discovery process is highly complex and iterative in nature and comprises several phases. • Each phase comprises several tasks.

Phases of KDDM process

Limitations of previous KDDM models 1. Checklist-oriented description and lack of tool support 2. Fragmented design 3. Absence of an integrated view 4. Conspicuous lack of support for tasks of the business understanding phase

IKDDM: overcoming the limitations of existing KDDM models • All the identified limitations in previously proposed KDDM process models were used as design requirements in creating this new KDDM process model.

Development of an integrated view • The IKDDM model was designed by explicating the numerous dependencies that exist between the various tasks of the KDDM process. • Some of the dependencies can be regarded as intra-phase dependencies, as they exist between the tasks of the same phase. • For example, there is a dependency between the Data Mining objective and business objective of the business understanding phase as the former utilizes the latter as its input. • Other dependencies can be classified as inter-phase dependencies as they exist between tasks of different phases.
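The distinction between intra-phase and inter-phase dependencies can be sketched as a small lookup over task pairs. This is only an illustration, not the IKDDM model itself: the two business understanding tasks come from the slide, while the `data selection` task, its phase, and the names `TASK_PHASE`, `DEPENDENCIES`, and `classify` are assumptions made for the example.

```python
from collections import defaultdict

# Task -> phase. Only the two business understanding tasks are taken from
# the slide; "data selection" and its phase are hypothetical.
TASK_PHASE = {
    "business objective": "business understanding",
    "data mining objective": "business understanding",
    "data selection": "data understanding",  # assumed for illustration
}

# (upstream, downstream): the downstream task uses the upstream output as input.
DEPENDENCIES = [
    ("business objective", "data mining objective"),  # example from the slide
    ("data mining objective", "data selection"),      # assumed for illustration
]


def classify(dependencies, task_phase):
    """Split dependencies into intra-phase and inter-phase groups."""
    groups = defaultdict(list)
    for upstream, downstream in dependencies:
        same_phase = task_phase[upstream] == task_phase[downstream]
        kind = "intra-phase" if same_phase else "inter-phase"
        groups[kind].append((upstream, downstream))
    return dict(groups)


result = classify(DEPENDENCIES, TASK_PHASE)
for kind, pairs in result.items():
    print(kind, pairs)
```

Keeping the phase of each task in one table and the dependencies in another means a dependency is classified from data alone, so adding a task never requires touching the classification logic.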

Measurement instrument used for assessing the quality of IKDDM and CRISP-DM process models • Observational (through case studies and field studies). • Analytical (through static analysis, architecture analysis, optimization and dynamic analysis). A static analysis helps in evaluation of a design artifact on the basis of static or desired qualities. • Experimental (through controlled experiments and simulation). • Testing (through functional or black box and structural or white box testing). • Descriptive (through informed arguments and scenario construction).

The model incorporates the same dimensions as Seddon’s model: • Perceived ease of use (PEOU) • Perceived usefulness (PU) • User satisfaction (US) • but replaces the Information Quality dimension of the original model with a Perceived semantic quality (PSQ) construct.

For the users of a conceptual model, Information Quality corresponds to the perceived semantic quality of the model, i.e. how valid and complete it is with respect to (their perception of) the problem domain. • Validity means that all information conveyed by the model is correct and relevant to the problem. • Completeness entails that the model contains all information about the domain that is considered correct and relevant. • In this study, the conceptual model is the KDDM process model.

Analytical testing of IKDDM and CRISP-DM

Methodology for performing the analytical testing (using SPSS software v. 15) 1. Identified and recruited 42 study participants and randomly divided them into two groups. 2. Presented one group of users with a test questionnaire, which included Data Mining tasks posed as multiple-choice questions. • Provided them with the documentation of the CRISP-DM process model to assist in answering the questions (i.e. in executing tasks of a Data Mining project). • Presented the second group of users with the same test questionnaire but with the documentation of the IKDDM process model to assist in answering the questions.

3. After the completion of the test questionnaire, recorded each participant’s perception of the static qualities of the artifact used (i.e. the CRISP-DM or the IKDDM process model) through a set of survey questions (refer to Table 3). 4. Recorded each participant’s gender, role/designation, number of years of experience in Data Mining, and time taken to complete the test. A numeric id was used to link the responder’s test to the survey. No identifying detail, such as the name of the participant or the name of the organization with which the individual is affiliated, was recorded.

5. Tested for statistical differences in the quality of the two models, as perceived by the users. The independent means t-test and the Mann–Whitney procedure were used to test the differences between the two groups (IKDDM versus CRISP-DM).

Independent means t-test (for comparing performance of the IKDDM model versus the CRISP-DM model) • The null hypothesis states that the experimental manipulation has no effect on the subjects, and therefore we expect the sample means to be identical or very similar.
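To make the test statistic concrete, the following sketch computes an independent means t statistic by hand. The study itself used SPSS; this code is not the authors' procedure. It assumes the pooled-variance (equal variances) form of the test, and the scores in `ikddm_scores` and `crisp_scores` are made-up numbers, not study data.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(a, b):
    """Pooled-variance (Student) t statistic and its degrees of freedom."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    # Standard error of the difference between the two sample means.
    se = sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / se, df


# Hypothetical test scores (number of correct answers), for illustration only.
ikddm_scores = [8, 9, 7, 10]
crisp_scores = [5, 6, 4, 7]

t_stat, dof = independent_t(ikddm_scores, crisp_scores)
print(f"t = {t_stat:.3f}, df = {dof}")
```

Under the null hypothesis the two sample means are close, so t stays near zero; a large absolute t relative to the t distribution with `dof` degrees of freedom is evidence against it.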

Mann–Whitney test (for comparing the difference in the groups’ perceptions of the static qualities of the KDDM process models) • The static qualities of the KDDM process model employed by the users to execute the Data Mining tasks (in the test questionnaire) were assessed through a set of survey questions with 7-point Likert-scale options, ranging from Strongly Agree to Strongly Disagree.
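Because Likert responses are ordinal and often tied, the Mann–Whitney procedure works on ranks rather than raw scores. The sketch below shows how mean ranks and the U statistic can be computed by hand with average ranks for ties. It is an illustration only: the coding of responses (7 = Strongly Agree down to 1 = Strongly Disagree) is an assumption, and the two response lists are invented.

```python
def midranks(values):
    """Rank values from 1 to N; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney(a, b):
    """Mean rank of each group and the Mann-Whitney U statistic."""
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = rank_sum_b - len(b) * (len(b) + 1) / 2
    return {
        "mean_rank_a": rank_sum_a / len(a),
        "mean_rank_b": rank_sum_b / len(b),
        "U": min(u_a, u_b),
    }


# Hypothetical survey responses, assumed coding 7 = Strongly Agree.
ikddm_responses = [6, 7, 5, 7]
crisp_responses = [3, 4, 5, 2]

summary = mann_whitney(ikddm_responses, crisp_responses)
print(summary)
```

The group with the higher mean rank tends to give higher responses; whether the difference is significant depends on U and the group sizes, which a statistics package converts into a p-value.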

Pilot test of the test questionnaire and survey • Four users with expertise in Data Mining participated in the pilot test. • On the basis of feedback received from these users, the test questionnaire was slightly revised, and a final version was created for use in the actual evaluation.

Assessment of artifact by users with experience in Data Mining • The 42 participants were randomly assigned to one of two groups, CRISP-DM (N = 21) or IKDDM (N = 21), and were asked to use the documentation of their assigned KDDM process model to answer the Data Mining tasks.
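A random split of 42 participants into two groups of 21 can be sketched as follows. The slides do not describe how the randomization was carried out, so this is only one plausible way to do it; the function name `assign_groups`, the use of numeric ids 1 to 42, and the fixed seed are assumptions for the example.

```python
import random


def assign_groups(participant_ids, seed=None):
    """Shuffle the ids and split them into two equally sized groups."""
    rng = random.Random(seed)  # seeded so the split can be reproduced
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "CRISP-DM": sorted(shuffled[:half]),
        "IKDDM": sorted(shuffled[half:]),
    }


# Hypothetical numeric ids for the 42 participants.
groups = assign_groups(range(1, 43), seed=0)
print({name: len(ids) for name, ids in groups.items()})
```

Shuffling once and cutting the list in half guarantees equal group sizes, which independent coin flips per participant would not.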

The following information was recorded for each participant: • Date on which data was collected from the individual. • Participant’s Gender. • Participant’s Role/Title. • Participant’s number of years of Data Mining experience. • Start Time for the test. • End Time for the test.
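The fields listed above map naturally onto one record per participant, with the time taken derived from the start and end times. The sketch below is illustrative: the class name `ParticipantRecord`, the field names, and the sample values are assumptions, and the numeric id stands in for any identifying detail, as described in the methodology.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class ParticipantRecord:
    participant_id: int      # numeric id linking the test to the survey
    collected_on: date
    gender: str
    role: str
    years_experience: float  # years of Data Mining experience
    start_time: datetime
    end_time: datetime

    @property
    def minutes_taken(self) -> float:
        """Time taken to complete the test, in minutes."""
        return (self.end_time - self.start_time).total_seconds() / 60


# Hypothetical record, for illustration only.
record = ParticipantRecord(
    participant_id=1,
    collected_on=date(2010, 1, 15),
    gender="F",
    role="Analyst",
    years_experience=3.0,
    start_time=datetime(2010, 1, 15, 10, 0),
    end_time=datetime(2010, 1, 15, 10, 45),
)
print(record.minutes_taken)
```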

Assessment of validity of the measurement instrument • Recommended threshold for composite reliabilities = 0.70 • Minimum threshold for Cronbach’s alpha = 0.70 • Lower bound threshold value for average variance extracted (AVE) = 0.50
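The three statistics named above can be computed with their standard textbook formulas, as in the sketch below. This is not the study's SPSS output: the item scores and the loadings are invented, the loadings are assumed to be standardized, and the function names are chosen for the example.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def composite_reliability(loadings):
    """Composite reliability from standardized loadings."""
    squared_sum = sum(loadings) ** 2
    error = sum(1 - l ** 2 for l in loadings)
    return squared_sum / (squared_sum + error)


def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)


# Hypothetical 7-point responses (5 respondents x 3 items) and loadings.
scores = [[7, 6, 7], [5, 5, 6], [3, 4, 3], [6, 6, 5], [2, 3, 2]]
loadings = [0.8, 0.7, 0.9]

alpha = cronbach_alpha(scores)
cr = composite_reliability(loadings)
ave = average_variance_extracted(loadings)
print(f"alpha={alpha:.3f} (>= 0.70), CR={cr:.3f} (>= 0.70), AVE={ave:.3f} (>= 0.50)")
```

A construct would pass the screening on this slide when all three values meet or exceed their thresholds.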

Validity assessments of formative construct: perceived semantic quality

Results of independent means t-test: analysis of performance of CRISP-DMeval versus IKDDMeval on the test questionnaire

Discussion of results of independent means t-test

Results of Mann–Whitney test

Results of Mann–Whitney test to assess difference between groups on individual constructs • Results for perceived ease of use • Results for user satisfaction • Results for perceived usefulness • Results for perceived semantic quality

Results for perceived ease of use • The Mann–Whitney test is highly significant (p < 0.001) for the perceived ease of use scores of the two groups (refer to Table 13). • The direction of the difference is seen by noting that, for the survey scores representing perceived ease of use, the mean rank is higher in the IKDDM group (30.98) than in the CRISP-DM group (12.02).

Results for user satisfaction • The Mann–Whitney test is highly significant (p < 0.001) for the user satisfaction scores of the two groups (refer to Table 13). • The direction of the difference is seen by noting that, for the survey scores representing user satisfaction, the mean rank is higher in the IKDDM group (30.67) than in the CRISP-DM group (12.63).

Results for perceived usefulness • The Mann–Whitney test is highly significant (p < 0.001) for the perceived usefulness scores of the two groups (refer to Table 13). • The direction of the difference is seen by noting that, for the survey scores representing perceived usefulness, the mean rank is higher in the IKDDM group (31.48) than in the CRISP-DM group (11.52).

Results for perceived semantic quality • The Mann–Whitney test is highly significant (p < 0.001) for the perceived semantic quality scores of the two groups (refer to Table 13). • The direction of the difference is seen by noting that, for the survey scores representing perceived semantic quality, the mean rank is higher in the IKDDM group (29.60) than in the CRISP-DM group (13.40).
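The mean ranks reported for the four constructs can be sanity-checked with simple arithmetic. When all 42 participants are ranked together, the ranks always add up to 42 × 43 / 2 = 903, so with 21 participants per group the two mean ranks should sum to 43. The sketch below applies that check to the figures quoted on the slides; it assumes every participant contributed a score to every construct, and the names `reported` and `rank_sum_gap` are chosen for the example.

```python
N_PER_GROUP = 21  # group size stated on the slides

# Mean ranks as quoted on the slides: (IKDDM group, CRISP-DM group).
reported = {
    "perceived ease of use": (30.98, 12.02),
    "user satisfaction": (30.67, 12.63),
    "perceived usefulness": (31.48, 11.52),
    "perceived semantic quality": (29.60, 13.40),
}


def rank_sum_gap(ikddm_mean_rank, crisp_mean_rank, n=N_PER_GROUP):
    """Reported total rank sum minus the total implied by the sample size."""
    total_n = 2 * n
    expected_total = total_n * (total_n + 1) / 2  # 1 + 2 + ... + 42 = 903
    reported_total = n * (ikddm_mean_rank + crisp_mean_rank)
    return reported_total - expected_total


gaps = {name: round(rank_sum_gap(*pair), 2) for name, pair in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap}")
```

Three of the four pairs sum to 43. The user satisfaction pair sums to 43.30, which suggests that one of its two figures may have been rounded or mistyped on the slide; the check cannot say which one, and it does not affect the direction of the reported difference.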

The results of the Mann–Whitney test on the overall survey scores representing the quality of the process models indicate that a significant difference existed between the CRISP-DM and IKDDM models. • The test results indicate that the IKDDM model outperformed the CRISP-DM model by a highly significant margin (p < 0.001). • This is an important result and signifies that users rated the effectiveness and efficiency of the IKDDM model much higher than those of the CRISP-DM model.

The results of Mann–Whitney test across the four constructs also indicated that the IKDDM group and CRISP group significantly differed in their perceptions of ease of use, usefulness, semantic quality and levels of user satisfaction of the model employed by them to execute tasks in Data Mining.

The IKDDM group reported significantly higher levels of perceived ease of use, perceived usefulness, semantic quality and user satisfaction than the CRISP-DM group. • The results support the conclusion that IKDDM is more effective and efficient than the CRISP-DM model in executing tasks of the KDDM process. • The limitations of existing KDDM process models identified in this research (such as use of only a checklist approach and lack of explicit support for the execution of tasks) also appear to be perceived as problematic by Data Mining users.

