

  1. Data Mining and Knowledge Acquisition — Chapter 7 — Data Mining Overview and Exam Questions — 2013/2014 Summer

  2. Data Mining • Methodology • Problem definition • Data set selection • Preprocessing transformations • Functionalities • Classification/prediction • Clustering • Association • Sequential analysis • others

  3. Methodology cont. • Algorithms • For classification you can use • Decision trees: ID3, C4.5, CHAID are algorithms • For clustering you can use • Partitioning methods: k-means, k-medoids • Hierarchical: AGNES • Probabilistic: EM is an algorithm • Presenting results • Back transformations • Reports • Taking action

  4. Two basic styles of data mining • Descriptive • Cross tabulations, OLAP, attribute-oriented induction, clustering, association • Predictive • Classification, prediction • Questions answered by these styles • Difference between classification and prediction

  5. Classification • Methods • Decision trees • Neural networks • Bayesian • k-NN or model-based reasoning • Advantages, disadvantages • Given a problem, which data processing techniques are required

  6. Classification (cont.) • Accuracy of the model • Measures for classification/numerical prediction • How to better estimate • Holdout, cross validation, bootstrapping • How to improve • Bagging, boosting • For unbalanced classes • What to do with models • Lift charts
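A minimal sketch of the estimation and improvement ideas above (cross-validation to estimate accuracy, bagging of a decision tree to improve it), assuming scikit-learn is available; the breast-cancer data set here is only a stand-in example.

```python
# Estimate accuracy with 10-fold cross-validation, then improve it by bagging.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
print("single tree :", cross_val_score(tree, X, y, cv=10).mean())

# Bagging: train many trees on bootstrap samples of the data and vote.
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=50, random_state=0)
print("bagged trees:", cross_val_score(bagged, X, y, cv=10).mean())
```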

  7. Clustering • Distance measures • Dissimilarity or similarity • For different types of variables • Ordinal, binary, nominal, ratio, interval • Why data needs to be transformed • Partitioning methods • K-means, k-medoids • Advantages, disadvantages • Hierarchical • Density based • Probabilistic
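A Gower-style dissimilarity sketch for mixed variable types, illustrating why ratio/interval values are range-normalized and ordinal values are rank-coded before being mixed with nominal or binary fields; the records and fields below are hypothetical.

```python
# Mixed-type dissimilarity: numeric and rank-coded ordinal fields are scaled to
# [0, 1] by their range, nominal/binary fields contribute a simple mismatch.
def mixed_dissimilarity(a, b, types, ranges):
    """a, b: records (tuples); types: 'num', 'ord', or 'nom' per field;
    ranges: max - min per field (used for 'num' and 'ord' fields)."""
    d = 0.0
    for x, y, t, r in zip(a, b, types, ranges):
        if t == "nom":                  # nominal/binary: mismatch counts as 1
            d += 0.0 if x == y else 1.0
        else:                           # numeric, or ordinal already rank-coded
            d += abs(x - y) / r         # normalize to [0, 1]
    return d / len(a)

# fields: age (num), satisfaction rank 1..5 (ord), city (nom) -- hypothetical
rec1, rec2 = (35, 4, "Ankara"), (52, 1, "Izmir")
print(mixed_dissimilarity(rec1, rec2, ("num", "ord", "nom"), (60, 4, None)))
```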

  8. Association • Apriori or FP-Growth • How to measure the strength of rules • Support and confidence • Other measures; critique of support-confidence • Multiple levels • Constraints • Sequential patterns
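A toy illustration (hypothetical transactions) of how support, confidence, and lift are computed for a single rule:

```python
# Support, confidence, and lift on a tiny hand-made transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / n

# Rule {diaper} -> {beer}
A, B = {"diaper"}, {"beer"}
sup = support(A | B)                    # support of the rule
conf = support(A | B) / support(A)      # confidence
lift = conf / support(B)                # lift: > 1 means positive correlation
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```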

  9. OLAP • Concept of cube • Fact table • Measures • Dimensions • Schemas • Star, snowflake • Concept hierarchies • Set grouping such as price, age • Parent-child

  10. Preprocessing • Missing values • Inconsistencies • Redundant data • Outliers • Data reduction • Attribute elimination • Attribute combination • Sampling • Histograms

  11. Clustering preferences • Consider a popular song competition. There are N competitors A1, A2, … AN. The number of voters is very large; a substantial fraction of the population of the country. Each voter is able to rank the competitors from best to worst, e.g. for voter 1 (A4>A2>A3>A1), meaning that there are four competitors and A4 is the best for voter 1, A1 being the worst. Suppose preference data is available for a sample of n voters at the beginning of the competition. • Develop a distance measure between the preferences of two voters i and j • Suppose you have the k-means algorithm available in a package. Describe how you can use the k-means algorithm to cluster voters according to their preferences.
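One possible answer, sketched under assumptions (the rankings below are hypothetical): encode each voter as a vector of ranks, use a Spearman-style (Euclidean) distance between rank vectors, and run an off-the-shelf k-means (scikit-learn here) directly on those vectors.

```python
# Each voter is a rank vector: rank[i] = position of competitor A(i+1) in that
# voter's ordering (1 = best). Distances between rank vectors are then ordinary
# Euclidean distances, so k-means can be applied as-is.
import numpy as np
from sklearn.cluster import KMeans

ranks = np.array([        # 6 voters ranking N = 4 competitors A1..A4
    [4, 2, 3, 1],         # voter 1: A4 > A2 > A3 > A1
    [4, 1, 3, 2],
    [3, 2, 4, 1],
    [1, 3, 2, 4],
    [1, 4, 2, 3],
    [2, 3, 1, 4],
])

def pref_distance(r_i, r_j):
    # distance between the preferences of voters i and j
    return np.linalg.norm(r_i - r_j)

print(pref_distance(ranks[0], ranks[3]))

# k-means on the rank vectors: cluster centroids act as "average rankings".
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(ranks)
print(labels)
```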

  12. Information gain • Consider a data set of two attributes A and B. A is continuous, whereas B is categorical, having two values "y" and "n", which can be considered the class of each observation. When attribute A is discretized into two equi-width intervals it provides no information about the class attribute B, but when discretized into three equi-width intervals it provides perfect information about B. Construct a simple dataset obeying these characteristics.
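One data set satisfying the conditions (an assumption, not the only answer), together with a quick entropy check of the two discretizations:

```python
# A on [0, 6) with class pattern y / n / y over thirds of the range.
# Two equi-width bins [0,3) and [3,6) each hold 2 "y" and 1 "n" -> zero gain;
# three equi-width bins [0,2), [2,4), [4,6) are pure -> perfect information.
import math

data = [(0.5, "y"), (1.5, "y"), (2.5, "n"), (3.5, "n"), (4.5, "y"), (5.5, "y")]

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(v) / n * math.log2(labels.count(v) / n)
                for v in set(labels))

def info_gain(data, n_bins, lo=0.0, hi=6.0):
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for a, cls in data:
        bins[min(int((a - lo) / width), n_bins - 1)].append(cls)
    before = entropy([cls for _, cls in data])
    after = sum(len(b) / len(data) * entropy(b) for b in bins if b)
    return before - after

print(info_gain(data, 2))   # ~0.00 : no information about the class
print(info_gain(data, 3))   # ~0.92 : class fully determined by the bin
```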

  13. Associations • In a particular database, A→C and B→C are strong association rules based on the support-confidence measure. A and B are independent items. Does this imply that A→BC is also a strong rule based on the lift measure? A, B, C are items in a transaction database. • If A→B and B→C are strong, is A→C a strong rule? • If A→B and A→C are strong, is B→C a strong rule?
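For reference, the lift measure used above (standard definition):

```latex
\mathrm{lift}(A \Rightarrow B)
  = \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{sup}(B)}
  = \frac{\mathrm{sup}(A \cup B)}{\mathrm{sup}(A)\,\mathrm{sup}(B)}
```

A rule is positively correlated when lift > 1, the antecedent and consequent are independent when lift = 1, and negatively correlated when lift < 1.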

  14. Clustering • Construct simple data sets showing the inadequacies of k-means clustering (20 pnt) • The algorithm is not suitable even for spherical clusters of different sizes • What are the advantages and disadvantages of using k-means
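A synthetic illustration of the different-sized-clusters case (assumed data, scikit-learn): the k-means boundary is the perpendicular bisector between centroids, so the outer points of the large, wide cluster get assigned to the small, tight one.

```python
# Two spherical clusters of very different sizes and spreads.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
big = rng.normal(loc=[0, 0], scale=2.0, size=(500, 2))    # large, wide cluster
small = rng.normal(loc=[6, 0], scale=0.3, size=(20, 2))   # small, tight cluster
X = np.vstack([big, small])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
small_label = labels[500]   # label assigned to the tight cluster's points
print("large-cluster points pulled into the small cluster:",
      int((labels[:500] == small_label).sum()))
```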

  15. Decision tree • 2. a) Construct a data set that generates the tree shown below. In addition, the following conditions are satisfied. [Tree from the slide: the root splits on A; Node 2 (A=a1): Decision Y; Node 3 (A=a2) splits on B; Node 4 (B=b1): Decision N; Node 5 (B=b2): Decision Y]

  16. Define data mining problems • 1. Suppose that a data warehouse for the Big-University Library consists of the following three dimensions: users, books, time, and each dimension has four levels not including the all level. There are three measures. You are asked to perform a data mining study on that warehouse (25 pnt) • Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem. What is the advantage of the data being organized as OLAP cubes compared to relational table organisation?

  17. Define data mining problems • In the data preprocessing stage of the KDD process • What are the reasons for missing values and how do you handle them? • What are possible data inconsistencies? • Do you make any discretization? • Do you make any data transformations? • Do you apply any data reduction strategies?

  18. Define data mining problems • Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer. • Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer. • Describe the association task in detail, specifying the algorithm, interestingness measures, or constraints if any.

  19. Clustering • Consider a delivery center location decision problem in a city where a set of related products are to be delivered to markets located in the city. Design an algorithm for this location selection problem by extending an algorithm we covered in class. State clearly the algorithm and its extensions for this particular problem.

  20. Data warehouse for library • A data warehouse is constructed for the library of a university to be used as a multi-purpose DSS. Suppose this warehouse consists of the following dimensions: user, books, time (time_ID, year, quarter, month, week, academic year, semester, day). "Week" is considered not to be less than "month". Each academic semester starts and ends at the beginning and end of a week respectively. Hence, week < semester. • Describe concept hierarchies for the three dimensions. Construct meaningful attributes for each of the dimension tables above. Describe at least two meaningful measures in the fact table. Each dimension can also be viewed at its ALL level. • What is the total number of cuboids for the library cube? • Describe three meaningful OLAP queries and write SQL expressions for one of them.
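For the cuboid-count questions here and later, the standard counting argument (as in Han et al.) is, assuming each dimension's hierarchy is a total order with L_i levels excluding ALL:

```latex
\text{total number of cuboids} \;=\; \prod_{i=1}^{n} \left( L_i + 1 \right)
```

The "+1" accounts for the ALL (apex) level of each dimension; hierarchies that are not total orders (such as week versus month in the time dimension above) need a more careful count over the lattice of levels.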

  21. Data mining on MIS • A data warehouse for the MIS department consists of the following four dimensions: student, course, instructor, semester, and each dimension has five levels including the all level. There are two measures: count and average grade. At the lowest level, the average grade is the actual grade of a student. You are asked to perform a data mining study on that warehouse (25 pnt)

  22. Data mining on MIS 2 • Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem. What is the advantage of the data being organized as OLAP cubes compared to relational table organisation? • In the data preprocessing stage of the KDD process • What are the reasons for missing values and how do you handle them? • What are possible data inconsistencies? • Do you make any discretization? • Do you make any data transformations? • Do you apply any data reduction strategies?

  23. Data mining on MIS 3 • Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer. • Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer. • Describe the association task in detail, specifying the algorithm, interestingness measures, or constraints if any.

  24. OLAP Big University • 2. (Han page 100, 2.4) Suppose that the data warehouse for Big-University consists of the following dimensions: student, course, instructor, semester and two measures count and average_grade, where at the lowest conceptual level (for a given student, instructor, course, and semester) the average_grade measure stores the actual grade of the student. At higher conceptual levels average_grade stores the average grade for the given combination. (When the student is MIS, the semester is all terms of 2005, the course is MIS 541, and the instructor is Ahmet Ak, average_grade is the average of students' grades in that course given by that instructor in all semesters of 2005.)

  25. a) Draw a snowflake schema diagram for that warehouse • What are the concept hierarchies for the dimensions? • b) What is the total number of cuboids?

  26. MIS 542 midterm S06 association constraint • The price of each item in a store is nonnegative. For the following cases indicate the type of constraint (such as: monotone, antimonotone, tough, strongly convertible, or succinct) • a) Containing at least one Nintendo game. • b) The average price of items is between 100 and 500.

  27. MIS 542 Final S06 1 olap • 1. The MIS department wants to revise academic strategies for the following ten years. Relevant questions are: What portion of the courses are required or elective? What is the full-time/part-time distribution of instructors? What is the course load of instructors? What percentage of technical or managerial courses are taught by part-time instructors? How have all these things

  28. MIS 542 Final S06 1 cont. • changed over the years? You can add similar strategic questions of your own. Do not consider student aspects of the problem for the time being. Design an OLAP schema to be used as a strategic tool. You are free to decide the dimensions and the fact table. Describe the concept hierarchies, virtual dimensions, and calculated members. Finally, show OLAP operations to answer three of these strategic questions.

  29. MIS 542 Final S06 2 • 2. Given the training data set with missing values:

  A(Size)  B(Color)  C(Shape)  Class
  small    yellow    round     A
  big      yellow    round     A
  big      yellow    red       A
  small    red       round     A
  small    black     round     B
  big      black     cube      B
  big      yellow    cube      B
  big      black     round     B
  small    yellow    cube      B

  30. MIS 542 Final S06 2 cont. • a) Apply the C4.5 algorithm to construct a decision tree. • b) Given the new inputs X: size=small, color=missing, shape=round and Y: size=big, color=yellow, shape=missing, what is the prediction of the tree for X and Y? • c) How do you classify the new data points given in part b) using Bayesian classification? • d) Analyse the possibility of pruning the tree. You can make a normal approximation to the binomial distribution even though the number of observations is low. The z value for the upper confidence limit of c = 25% is 0.69.
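A small helper sketch for part (a): rather than asserting the root split, compute the information gain and C4.5's gain ratio of each attribute on the nine training rows above (plain Python, no external data assumed).

```python
# Information gain and gain ratio of each attribute at the root node.
import math
from collections import Counter

rows = [  # (size, color, shape, class)
    ("small", "yellow", "round", "A"), ("big", "yellow", "round", "A"),
    ("big", "yellow", "red", "A"),     ("small", "red", "round", "A"),
    ("small", "black", "round", "B"),  ("big", "black", "cube", "B"),
    ("big", "yellow", "cube", "B"),    ("big", "black", "round", "B"),
    ("small", "yellow", "cube", "B"),
]

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def gain_and_ratio(attr_index):
    before = entropy([r[-1] for r in rows])
    split = {}
    for r in rows:
        split.setdefault(r[attr_index], []).append(r[-1])
    after = sum(len(v) / len(rows) * entropy(v) for v in split.values())
    split_info = entropy([r[attr_index] for r in rows])
    gain = before - after
    return gain, gain / split_info if split_info else 0.0

for i, name in enumerate(["Size", "Color", "Shape"]):
    print(name, gain_and_ratio(i))
```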

  31. MIS 542 Final S06 neural networks • 4. Consider a classification problem with two classes, C1 and C2. There are two numerical input variables X1 and X2, taking values between 0 and infinity. All observations above the X2 = 1/X1 curve (a hyperbola) are of class C1; all other observations are of class C2. Describe how multilayer perceptrons can separate such a boundary using as few hidden nodes as possible.
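A sanity check of the key observation behind one possible answer (the random points below are an assumption): since both inputs are positive, X1·X2 > 1 is equivalent to log X1 + log X2 > 0, so in log space the boundary is a straight line and very few units are needed.

```python
# x1*x2 > 1  <=>  log(x1) + log(x2) > 0 when both inputs are positive,
# so after a log transform a single linear unit separates the two classes.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.01, 5.0, size=(1000, 2))
true_class = x[:, 0] * x[:, 1] > 1.0                 # C1: above the hyperbola
linear_rule = np.log(x[:, 0]) + np.log(x[:, 1]) > 0.0
print(np.all(true_class == linear_rule))             # True: boundaries coincide
```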

  32. MIS 542 Final S06 clustering • 3. a) Describe how to modify the k-means algorithm so as to handle categorical variables (binary, ordinal, nominal). • b) What is a disadvantage of the agglomerative hierarchical clustering method in the case of large data? Suggest a way of eliminating this disadvantage while keeping the advantages of agglomerative methods.

  33. MIS 542 Midterm S08 clustering • Generate a data set of two continuous variables X and Y. Consider clustering based on density • When clustered with one variable (either X or Y) there is one cluster • When clustered with both variables there are two clusters

  34. MIS 542 Midterm S08 2 class,f,cat,pm • Consider a classification problem with two continuous variables X and Y and a categorical output with two distinct values C1 and C2 • Generate data sets such that • A) Decision trees are appropriate for classification • B) Decision trees are not appropriate for classification but a perceptron can classify the data successfully • C) Even a single perceptron is not enough to classify the data • D) How do you incorporate a perceptron into decision trees so that cases in B and C can be classified by a hybrid approach of DTs and perceptrons

  35. Final 2010/2011 Spring • 2 (30 pt.) Consider a prediction problem, e.g. predicting weight using height (a continuous variable) as input, solved by neural networks. Methods such as back propagation try to minimize the prediction error, but it is claimed that the magnitude of error depends on the weight: a prediction error of 0.5 for a baby with a short height should not be treated the same as for an adult with a height of 2.00 meters. • a) Make a scatter plot of such a hypothetical data set for a two-variable problem. • b) Plot the prediction error on another graph. • c) Do you need to modify the back propagation algorithm so as to handle such a situation? If so, explain your modification.
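One possible modification, sketched under assumptions (the weights below are hypothetical): replace the squared error with a relative squared error, so only the output-layer delta of back propagation changes while the rest of the algorithm is untouched.

```python
# A relative (scale-aware) squared error and its gradient w.r.t. the network
# output; the gradient is what feeds the usual back-propagation chain rule.
import numpy as np

def relative_squared_error(y_pred, y_true):
    return 0.5 * ((y_pred - y_true) / y_true) ** 2

def d_error_d_pred(y_pred, y_true):
    # output-layer delta for the modified loss
    return (y_pred - y_true) / y_true ** 2

baby_w, adult_w = 5.0, 80.0   # true weights in kg (hypothetical)
print(relative_squared_error(baby_w + 0.5, baby_w))    # same absolute error...
print(relative_squared_error(adult_w + 0.5, adult_w))  # ...much smaller penalty
```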

  36. Final 2011/2012 Fall data description • 1 (20 points) Give two examples of outliers. • a) Where outliers are useful and essential patterns to be mined. • b) Where outliers are useless, stemming from error or noise.

  37. Final 2011/2012 Fall class,f,cat,pm • 2 (20 points) Considering the classification methods we cover in class, describe two distinct reasons why continuous input variables have to be normalized for classification problems (each reason 10 points).

  38. Final 2011/2012 Fall overf,tt,mg • 4. Illustrate the overfitting of neural networks for the following cases by generating data sets. • a) (10 points) For a binary classification problem with two continuous inputs. • b) (10 points) For a numerical prediction problem (output being continuous) with one continuous input variable.

  39. Final 2011/2012 Fall • 3 a (10 points) Generate data sets for two clustering problems with two continuous variables: two natural clusters under the notion of density-based clustering, but the quality of these clusters is low for a partitioning approach based on dissimilarity such as k-means. • 3.b (10 points) Considering the advantages and disadvantages of partitioning and hierarchical agglomerative clustering approaches, design a method for combining the two approaches to improve clustering quality. (In the end there are hierarchies of clusters.)
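A classic instance of 3.a, sketched with scikit-learn on synthetic "two moons" data: DBSCAN recovers the two density-based clusters, while k-means, which favors compact spherical groups, cuts across both arcs.

```python
# Density-based vs partitioning clustering on two interleaving half-circles.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("DBSCAN labels found:", set(db_labels))   # two clusters (-1 = noise, if any)
print("k-means agreement with the true moons:",
      max((km_labels == y_true).mean(), (km_labels != y_true).mean()))
```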

  40. Midterm 2011/2012 Fall • 6. (25 points) A retail company is asked to segment its customers. The following variables are available for each customer: age, income, gender, number of children, occupation, house owner, has a car or not. There are 6 categories of goods sold by the company and the total purchases from each category are available for each customer; in addition, the average inter-purchase time is also included in the database.

  41. Midterm 2011/2012 Fall • a) What are the types and scales of these variables? • b) If your tool has only the k-means algorithm, which of these variables are more suitable for the segmentation problem? • c) What data transformations are to be applied? • d) How do you reduce the number of variables used in the analysis? • e) If you want to include categorical variables in your clustering, how would you treat them?

  42. Midterm 2011/2012 Fall • In Questions 3-5, artificial data sets are generated for given situations. • 3. (10 points) Consider a data set of two continuous variables X and Y. There are two clusters (k=2). • Considering the advantages and disadvantages of the partitioning clustering methods k-means and k-medoids, generate a two-dimensional data set that • a) (5 pnt) produces almost the same clusters with k-medoids and k-means • b) (5 pnt) produces different clusters with k-medoids and k-means

  43. Midterm 2011/2012 Fall • 4. (10 points) Consider a classification problem solved by a decision tree. Consider a categorical input variable A having two distinct values. The output variable B has two distinct classes as well. At a particular node of the tree there are N data objects. Generate a partitioning of the data by input variable A for the following • a) A does not provide any information: the information gain is zero (the split does not decrease entropy at all). • b) A provides perfect information: the information gain is maximal (the split decreases entropy as much as possible).

  44. Midterm 2011/2012 Fall • 5. (10 points) Consider two continuous variables X and Y. Generate data sets • a) where PCA (principal component analysis) cannot reduce the dimensionality from two to one • b) where, although the two variables are related (a functional relationship exists between them), PCA is not able to reduce the dimensionality from two to one
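An example of case b), sketched with scikit-learn: points on a unit circle are functionally related (x² + y² = 1), yet the relation is non-linear, so linear PCA finds two components of equal variance and cannot drop either one.

```python
# PCA on points of a unit circle: both components explain ~50% of the variance.
import numpy as np
from sklearn.decomposition import PCA

theta = np.random.default_rng(0).uniform(0, 2 * np.pi, 1000)
X = np.column_stack([np.cos(theta), np.sin(theta)])
print(PCA(n_components=2).fit(X).explained_variance_ratio_)  # roughly [0.5, 0.5]
```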

  45. 5. (25 points) Consider a data set representing the interactions among a set of people. The degree of interaction is a positive real number; high values can be interpreted as the two members being closely related (they have close interactions such as heavy telephone call or mail traffic between them). In other words, rather than including the coordinates of variables directly, the similarity/dissimilarity matrix is given. This is a symmetric matrix. Develop an algorithm for clustering similar objects into the same cluster. Assume that the number of clusters (k) is given.
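One possible answer, sketched as a PAM-style k-medoids pass that needs only the dissimilarity matrix D and never the coordinates; for the interaction data, similarities S would first be converted to dissimilarities, e.g. D = S.max() - S (an assumption, other conversions are possible).

```python
# k-medoids driven entirely by a symmetric dissimilarity matrix D (n x n).
import numpy as np

def k_medoids(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)        # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:                             # most central member
                new_medoids[c] = members[np.argmin(
                    D[np.ix_(members, members)].sum(axis=1))]
        if set(new_medoids) == set(medoids):             # no change -> converged
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1), medoids
```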

  46. 3. (25 points) Consider a data set of two continuous variables X and Y. X is right skewed and Y is left skewed. Both represent measures of the same quantity (sales categories, exam grades, …) • a) Draw typical distributions of X and Y separately. • b) Draw box plots of X and Y separately. • c) Draw quantile plots (q-plots) of X and Y separately. • d) Draw a q-q plot of X and Y.

  47. 4. (25 points) A strategy for clustering high-dimensional data of continuous variables is: first apply principal component analysis to reduce the dimensionality of the data set and then apply clustering on the reduced form of the data. Discuss the drawback(s) of this approach.

  48. 1. (25 points) In an organization a data warehouse is to be designed for evaluating the performance of employees. To evaluate the performance of an employee, a survey questionnaire consisting of a set of questions on a 5-point Likert scale is answered by other employees in the same company at specified times. That is, the performance of employees is rated by other employees. • Each employee has a set of characteristics including department, education, … Each survey is conducted at a particular date and applied to some of the employees. Questions are aimed at evaluating broad categories of performance such as motivation, cooperation ability, … • Typically, a question in a survey, aiming to measure a specific attitude about an employee, is evaluated by another employee (rated from 1 to 5). Data is available at the question level.

  49. Cube design: a star schema • Fact table: Design the fact table; it should contain one calculated member. What are the measures and keys? • Dimension tables: Employee and Time are the two essential dimensions; include Survey and Question dimensions as well. For each dimension show a concept hierarchy. • State three questions that can be answered by that OLAP cube. • Show drill-down and roll-up operations related to these questions.

  50. MIS 541 2012/2013 Final • 1. (20 pts) Consider a data set of two continuous variables X and Y. Both have the same mean, both have no skewness (symmetric), and X has a higher variance than Y. Both represent measures of the same quantity (sales categories, exam grades, …) • a) Draw typical distributions of X and Y on the same graph. • b) Draw box plots of X and Y separately.
