
MIS 331 Data Mining Overview and Old Exam Questions

MIS 331 Data Mining Overview and Old Exam Questions. 2016/2017 Fall. Outline. Methodology - Overview Introduction Sampling Variance Data Description Data Preprocessing OLAP Clustering Frequent Pattern Mining Classification Numerical Prediction – Regression Analysis of Variance

earthat

Presentation Transcript


  1. MIS 331 Data Mining Overview and Old Exam Questions 2016/2017 Fall

  2. Outline • Methodology - Overview • Introduction • Sampling Variance • Data Description • Data Preprocessing • OLAP • Clustering • Frequent Pattern Mining • Classification • Numerical Prediction – Regression • Analysis of Variance • Recent BIS Exams • Unclassified Exams

  3. Methodology - Overview • KDD Methodology • Functionalities

  4. KDD Methodology • Methodology • Problem definition • Data set selection • Preprocessing transformations • Functionalities • Classification/numerical prediction • Clustering • Frequent Pattern Mining • Association • Sequential analysis • Outlier Analysis • others

  5. KDD Methodology (cont.) • Algorithms • For classification you can use • Decision trees: ID3, C4.5 and CHAID are algorithms • For clustering you can use • Partitioning methods: k-means, k-medoids • Hierarchical: AGNES • Probabilistic: EM is an algorithm • Presenting results • Back transformations • Reports • Taking action

  6. Preprocessing • Data cleaning • Filling missing values • Smoothing noisy data • Inconsistencies • Identifying outliers • Data integration • Data reduction • Principal components • Attribute elimination • Attribute combination • Sampling • Histograms • Data transformation and discretization

  7. Functionalities • Two Styles of Data Mining • Descriptive • Predictive - OLAP

  8. Two basic styles of data mining • Descriptive • Cross tabulations, OLAP, attribute-oriented induction, clustering, association • Predictive • Classification, numerical prediction • Difference between classification and numerical prediction • Questions answered by these styles • Supervised vs. unsupervised learning

  9. Descriptive - OLAP • Concept of data cube • Fact table • Measures – calculated measures • Keys • Dimensions • Schemas • Star, snowflake • Concept hierarchies • Set grouping, such as price or age • Parent-child • Attributes not suitable for concept hierarchies

  10. Clustering • Distance measures • Dissimilarity or similarity • For different types of variables • Ordinal, binary, nominal, ratio, interval • Why data need to be transformed • Partitioning methods • K-means, k-medoids • Advantages and disadvantages • Hierarchical • Density-based • Probabilistic

  11. Frequent Pattern Mining • Association analysis • Algorithms: Apriori, FP-Growth • How to measure the strength of rules • Support and confidence • Other measures of interestingness - critique of support and confidence • Multiple levels • Constraints • Sequential Pattern Mining

  12. Classification • Methods • Decision trees • Neural networks • Bayesian • K-NN or memory-based reasoning • Advantages and disadvantages • Given a problem, which data processing techniques are required • Given a problem, which classification method or algorithm is more appropriate

  13. Classification (cont.) • Accuracy of the model • Measures for classification/numerical prediction • How to better estimate • Holdout, cross-validation, bootstrapping • How to improve • Bagging, boosting • For unbalanced classes • What to do with models • Lift charts

  14. Numerical Prediction • Learning is supervised • Output variable is continuous • Methods • Regression • Simple • Multiple • Most methods for classification can be used for numerical prediction as well • Accuracy • Root mean square error, mean absolute deviation
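
The two accuracy measures named on this slide can be sketched as follows. This is a minimal illustration, not course material: the function names and the `actual`/`predicted` values are made up for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared residual."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mean_absolute_deviation(actual, predicted):
    """Mean of the absolute residuals."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


# Hypothetical observed values and model predictions (not from the slides).
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]

print("RMSE:", rmse(actual, predicted))
print("MAD: ", mean_absolute_deviation(actual, predicted))
```

Because the residuals are squared before averaging, RMSE penalizes a few large errors more heavily than the mean absolute deviation does.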

  15. Outline • Methodology - Overview • Introduction • Sampling Variance • Data Description • Data Preprocessing • OLAP • Clustering • Frequent Pattern Mining • Classification • Numerical Prediction – Regression • Analysis of Variance • Recent BIS Exams • Unclassified Exams

  16. Introduction • Defining problems • Given a short description of an environment, define data mining problems fitting different functionalities and possible preprocessing problems peculiar to the environment • Basic functionalities • Given a short description of a data mining problem, with which functionality is the problem solved?

  17. Big University Library • 1. Suppose that a data warehouse for Big-University Library consists of the following three dimensions: users, books, time, and each dimension has four levels not including the all level. There are three measures: You are asked to perform a data mining study on that warehouse (25 points) • Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem. What is the advantage of the data being organized as OLAP cubes compared to relational table organisation?

  18. Big University Library (cont.) • In the data preprocessing stage of KDD • What are the reasons for missing values, and how do you handle them? • What are the possible data inconsistencies? • Do you apply any discretization? • Do you apply any data transformations? • Do you apply any data reduction strategies?

  19. Big University Library (cont.) • Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer. • Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer. • Describe the association task in detail, specifying the algorithm, interestingness measures and constraints, if any.

  20. Data mining on MIS • A data warehouse for the MIS department consists of the following four dimensions: student, course, instructor, semester, and each dimension has five levels including the all level. There are two measures: count and average grade. At the lowest level, average grade is the actual grade of a student. You are asked to perform a data mining study on that warehouse (25 points)

  21. Data mining on MIS (cont.) • Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem. What is the advantage of the data being organized as OLAP cubes compared to relational table organisation? • In the data preprocessing stage of KDD • What are the reasons for missing values, and how do you handle them? • What are the possible data inconsistencies? • Do you apply any discretization? • Do you apply any data transformations? • Do you apply any data reduction strategies?

  22. Data mining on MIS (cont.) • Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer. • Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer. • Describe the association task in detail, specifying the algorithm, interestingness measures and constraints, if any.

  23. Outline • Methodology - Overview • Introduction • Sampling Variance • Data Description • Data Preprocessing • OLAP • Clustering • Frequent Pattern Mining • Classification • Numerical Prediction – Regression • Analysis of Variance • Recent BIS Exams • Unclassified Exams

  25. Data Description • How to describe single variables – categorical and continuous • How to describe association between two variables • both continuous • both categorical • one continuous, one categorical

  26. Data Description • Single variables • Categorical scales: ordinal, nominal • central tendency - mode • Frequency plots, tables, pie charts • Continuous scales: interval, ratio • central tendency – mean, median, mode • spread – IQR, variance, standard deviation • 5-number summary • Graphical: • examine the probability distribution, histograms, quantile plots

  27. Data Description • For two variables • Both categorical • Cross tabulation • One categorical, the other continuous • Both continuous • numerical measures: • covariance, correlation coefficient, scatter plots, q-q plots

  28. MIS 494 2001/2002 Spring Midterm – two variables plots • Given the following 5-number summaries for two price distributions A and B • A: minimum=10, Q1=30, median=60, Q3=70, maximum=150 • B: minimum=20, Q1=25, median=50, Q3=90, maximum=95 • show a boxplot of the data • a quantile plot of the two data sets • a quantile-quantile plot of the data.

  29. MIS 494 2001/2002 Spring Midterm – contingency test • 1. The following contingency table summarizes supermarket transaction data, where “hot dogs” refers to the transactions containing hot dogs, “not hot dogs” refers to the transactions that do not contain hot dogs, “hamburgers” refers to the transactions containing hamburgers and “not hamburgers” refers to the transactions that do not contain hamburgers. • (columns) hot dogs | not hot dogs | row total • hamburgers: 2000 | 500 | 2500 • not hamburgers: 1000 | 1500 | 2500 • column total: 3000 | 2000 | 5000 • a) Suppose that the association rule “hot dogs ⇒ hamburgers” is mined. Given a minimum support threshold of 25% and a minimum confidence threshold of 50%, is this association rule strong? • b) Based on the given data, is the purchase of hot dogs independent of the purchase of hamburgers? If not, what kind of correlation relationship exists between the two?
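
One way to check parts a) and b) is to compute support, confidence and lift directly from the counts in the contingency table on this slide. The sketch below assumes the rule runs from hot dogs to hamburgers and uses lift as the correlation measure, which is one possible choice among several; the variable names are illustrative.

```python
# Counts taken from the contingency table on this slide.
n_total = 5000        # all transactions
n_hotdogs = 3000      # transactions containing hot dogs
n_hamburgers = 2500   # transactions containing hamburgers
n_both = 2000         # transactions containing both items

support = n_both / n_total        # P(hot dogs and hamburgers)
confidence = n_both / n_hotdogs   # P(hamburgers | hot dogs)

# Lift compares the joint probability with what independence would give.
lift = support / ((n_hotdogs / n_total) * (n_hamburgers / n_total))

min_support, min_confidence = 0.25, 0.50
is_strong = support >= min_support and confidence >= min_confidence

print(f"support={support:.2f} confidence={confidence:.3f} lift={lift:.3f}")
print("strong rule:", is_strong)
```

A lift above 1 indicates that the two purchases are positively correlated rather than independent; a lift of exactly 1 would correspond to independence.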

  30. MIS 467 2005/2006 Fall Midterm – correlation coefficient • 2. Show that the correlation coefficient between two variables X and Y is not affected by a change in the unit of measurement of X or Y. (Consider linear transformations such as measuring temperature in °C or °F: X’=aX+b, Y’=cY+d.) Consider the regression of Y on X, Y = α + βX; show that the least squares estimates of α and β are affected by changes in the unit of measurement.
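
The question asks for a proof, but a quick numerical check can make the claim concrete before attempting it. The sketch below uses made-up temperature and sales figures (an assumption, not exam data) and converts Celsius to Fahrenheit; the helper names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data: temperature in Celsius and some sales figure.
celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
sales = [12.0, 15.0, 21.0, 24.0, 33.0]
fahrenheit = [1.8 * c + 32.0 for c in celsius]  # X' = aX + b with a = 1.8, b = 32

r_c, r_f = pearson(celsius, sales), pearson(fahrenheit, sales)
intercept_c, slope_c = least_squares(celsius, sales)
intercept_f, slope_f = least_squares(fahrenheit, sales)

print("correlation:", r_c, r_f)              # equal up to rounding
print("slope:      ", slope_c, slope_f)      # slope is divided by a
print("intercept:  ", intercept_c, intercept_f)
```

The correlation is unchanged here because the scale factor is positive; with a negative scale factor on one variable the sign of the correlation would flip while its magnitude stays the same.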

  31. 3. (25 points) Consider a data set of two continuous variables X and Y. X is right skewed and Y is left skewed. Both represent measures of the same quantity (sales categories, exam grades,…) • a) Draw typical distributions of X and Y separately. • b) Draw box plots of X and Y separately. • c) Draw q-plots (quantile) of X and Y separately. • d) Draw q-q plot of X and Y.

  32. MIS 542 2012/2013 Final • 1. (20 pts) Consider a data set of two continuous variables X and Y. X and Y have the same mean and both have no skewness (symmetric); X has a higher variance than Y. Both represent measures of the same quantity (sales categories, exam grades,…) • a) Draw typical distributions of X and Y on the same graph. • b) Draw box plots of X and Y separately.

  33. MIS 542 2014/2015 Midterm - mean • 3. (20 pts) Suppose there are n observations x_1, x_2, …, x_n, with mean mean_old. When a new observation x_{n+1} is added to the dataset, the value of the mean becomes, say, mean_new (the mean of n+1 observations). Show that the change in the mean, when the new observation x_{n+1} is added, is proportional to the difference between x_{n+1} and mean_old, with proportionality constant 1/(n+1). Note: not an illustration with numbers but a formal derivation is needed.
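
One possible derivation, written with \bar{x}_{old} for mean_old and \bar{x}_{new} for mean_new (this notation is an assumption; the slide only names the two means):

```latex
\begin{align*}
\bar{x}_{new}
  &= \frac{1}{n+1}\sum_{i=1}^{n+1} x_i
   = \frac{n\,\bar{x}_{old} + x_{n+1}}{n+1}
  && \text{since } \sum_{i=1}^{n} x_i = n\,\bar{x}_{old} \\
\bar{x}_{new} - \bar{x}_{old}
  &= \frac{n\,\bar{x}_{old} + x_{n+1} - (n+1)\,\bar{x}_{old}}{n+1}
   = \frac{1}{n+1}\left(x_{n+1} - \bar{x}_{old}\right)
\end{align*}
```

The change in the mean is therefore the gap between the new observation and the old mean, scaled by 1/(n+1).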

  34. MIS 542 2014/2015 Final – correlation coefficient • 1. (20 pts) Show that if there is a perfect linear relationship between two continuous variables X and Y, the correlation coefficient between Y and X is either +1 or -1. Note: the correlation coefficient is the covariance of X and Y divided by the product of the standard deviations of X and Y.
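
A sketch of one way to argue this, assuming the perfect linear relationship is written as Y = a + bX with b ≠ 0 and that X has nonzero variance (the symbols a, b, \sigma and r are introduced here for illustration):

```latex
\begin{align*}
\operatorname{Cov}(X, Y) &= \operatorname{Cov}(X,\, a + bX) = b\,\operatorname{Var}(X) = b\,\sigma_X^2 \\
\sigma_Y &= \sqrt{\operatorname{Var}(a + bX)} = \sqrt{b^2 \sigma_X^2} = |b|\,\sigma_X \\
r_{XY} &= \frac{\operatorname{Cov}(X, Y)}{\sigma_X\,\sigma_Y}
        = \frac{b\,\sigma_X^2}{\sigma_X \cdot |b|\,\sigma_X}
        = \frac{b}{|b|}
        = \begin{cases} +1 & b > 0 \\ -1 & b < 0 \end{cases}
\end{align*}
```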

  35. MIS 214 2013/2014 Spring Quiz 3 – contingency test • Following a presidential debate, people were asked how they might vote in the forthcoming election. Is there any association between one’s gender and choice of presidential candidate? • Candidate preference by gender (male | female) • Candidate A: 150 | 130 • Candidate B: 100 | 120
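
A minimal sketch of the Pearson chi-square test of independence applied to the counts on this slide. The function name is illustrative, and the p-value line relies on the fact that, for 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(x/2)); no continuity correction is applied, which is an assumption about how the course computes the statistic.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Rows: Candidate A, Candidate B. Columns: male, female (counts from the slide).
table = [[150, 130],
         [100, 120]]

stat, dof = chi_square_independence(table)

# Valid for dof == 1 only: P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

print(f"chi2={stat:.4f} dof={dof} p={p_value:.4f}")
```

With these counts the statistic falls below the 5% critical value for 1 degree of freedom (about 3.84), so independence would not be rejected at that level.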

  36. MIS 214 2013/2014 Spring Final - contingency test • 4. (15 pt) A random sample of 150 residents was asked to indicate their first preference for one of three television stations (shown below). Test the null hypothesis that for the population the first preferences are evenly distributed among the three stations. • Station: A | B | C • Number of first preferences: 47 | 42 | 61
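
The goodness-of-fit test for this question can be sketched as below, using the counts from the slide. The p-value line uses the closed form exp(-x/2), which holds only for exactly 2 degrees of freedom; the variable names are illustrative.

```python
import math

# Observed first preferences from the slide.
observed = {"A": 47, "B": 42, "C": 61}

n = sum(observed.values())
expected = n / len(observed)   # even distribution under the null hypothesis

stat = sum((o - expected) ** 2 / expected for o in observed.values())
dof = len(observed) - 1

# Valid for dof == 2 only: P(chi2 > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)

print(f"chi2={stat:.2f} dof={dof} p={p_value:.4f}")
```

Since the statistic is below the 5% critical value for 2 degrees of freedom (about 5.99), the null hypothesis of evenly distributed preferences would not be rejected at that level.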

  37. MIS 214 2014/2015 Spring Quiz 3 – contingency test • Voters are asked whether they support a political leader in the elections (either support or not support). • What are the minimum and maximum numbers of sampled voters supporting the candidate when the null hypothesis is that the proportions of supporters and non-supporters are equal, against the alternative that the proportions of supporters and non-supporters are different, when the p-value of the χ² test statistic is 0.05, for a sample size of 100?
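
One reading of this question is to find the two supporter counts at which the chi-square statistic sits exactly on the 5% boundary. The sketch below follows that reading (an assumption) and uses the fact that the chi-square critical value with 1 degree of freedom is the square of the two-sided standard normal critical value.

```python
import math
from statistics import NormalDist

n = 100        # sample size from the question
alpha = 0.05   # p-value at which the test is exactly on the boundary

# Two-sided standard normal critical value; its square is the chi-square
# critical value for 1 degree of freedom.
z = NormalDist().inv_cdf(1 - alpha / 2)
chi2_critical = z ** 2

# Under H0 each cell has expected count n/2, so for x supporters
# chi2 = (x - n/2)^2 / (n/2) + ((n - x) - n/2)^2 / (n/2) = (2x - n)^2 / n.
# Solving (2x - n)^2 / n = z^2 gives x = n/2 +/- z * sqrt(n) / 2.
half_width = z * math.sqrt(n) / 2
lower = n / 2 - half_width
upper = n / 2 + half_width

print(f"critical chi2 = {chi2_critical:.4f}")
print(f"boundary supporter counts: {lower:.2f} and {upper:.2f}")
```

Counts between the two boundary values give a p-value above 0.05; counts outside them give a p-value below 0.05.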

  38. Outline • Methodology - Overview • Introduction • Sampling Variance • Data Description • Data Preprocessing • OLAP • Clustering • Frequent Pattern Mining • Classification • Numerical Prediction – Regression • Analysis of Variance • Recent BIS Exams • Unclassified Exams

  39. Preprocessing • What to do as preprocessing? • Which techniques are applied? • For what reason?

  40. MIS 542 Midterm 2011/2012 Fall PCA • 5. (10 points) Consider two continuous variables X and Y. Generate data sets • a) where PCA (principal component analysis) cannot reduce the dimensionality from two to one • b) where, although the two variables are related (a functional relationship exists between these two variables), PCA is not able to reduce the dimensionality from two to one
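
A small sketch of candidate data sets for a) and b). It assumes PCA is run on standardized variables, in which case the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| and 1 - |r|, so the first component explains (1 + |r|) / 2 of the variance. The data sets and function names are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def first_component_share(xs, ys):
    """Variance share of the first principal component for two standardized variables."""
    return (1 + abs(pearson(xs, ys))) / 2


# a) Uncorrelated variables: no direction captures more than half the variance.
xs_a = [-1.0, 1.0, -1.0, 1.0]
ys_a = [-1.0, -1.0, 1.0, 1.0]

# b) Exact functional relationship y = x^2, but zero linear correlation.
xs_b = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys_b = [x ** 2 for x in xs_b]

# Contrast: an exact linear relationship, where one component is enough.
ys_linear = [2 * x + 1 for x in xs_b]

print("a) uncorrelated:", first_component_share(xs_a, ys_a))
print("b) parabola:    ", first_component_share(xs_b, ys_b))
print("linear contrast:", first_component_share(xs_b, ys_linear))
```

PCA only detects linear structure, so the parabola in b) looks the same to it as the unrelated data in a).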

  41. MIS 542 Final 2011/2012 Fall Outliers • 1 (20 points) Give two examples of outliers. • a) Where outliers are useful and essential patterns to be mined. • b) Where outliers are useless, stemming from error or noise.

  42. MIS 542 Final 2011/2012 Fall transformations • 2 (20 points) Considering the classification methods we cover in class, describe two distinct reasons why continuous input variables have to be normalized for classification problems (each reason 10 points).

  43. Outline • Methodology - Overview • Introduction • Sampling Variance • Data Description • Data Preprocessing • OLAP • Clustering • Frequent Pattern Mining • Classification • Numerical Prediction – Regression • Analysis of Variance • Recent BIS Exams • Unclassified Exams

  44. OLAP • Concept of data cube • Fact table • Measures – calculated measures • Keys • Dimensions • Schemas • Star, snowflake • Concept hierarchies • Set grouping, such as price or age • Parent-child • Attributes not suitable for concept hierarchies

  45. Data warehouse for library • A data warehouse is constructed for the library of a university to be used as a multi-purpose DSS. Suppose this warehouse consists of the following dimensions: user , books , time (time_ID, year, quarter, month, week, academic year, semester, day), and . “Week” is considered not to be less than “month”. Each academic semester starts and ends at the beginning and end of a week respectively. Hence, week<semester. • Describe concept hierarchies for the three dimensions. Construct meaningful attributes for each of the dimension tables above. Describe at least two meaningful measures in the fact table. Each dimension can be looked at its ALL level as well. • What is the total number of cuboids for the library cube? • Describe three meaningful OLAP queries and write SQL expressions for one of them.

  46. Big University • 2. (Han page 100, 2.4) Suppose that the data warehouse for the Big-University consists of the following dimensions: student, course, instructor, semester, and two measures, count and average_grade. At the lowest conceptual level (for a given student, instructor, course, and semester) the average_grade measure stores the actual grade of the student. At higher conceptual levels average_grade stores the average grade for the given combination. (When student is MIS, semester is 2005 all terms, course is MIS 541 and instructor is Ahmet Ak, average_grade is the average of the students' grades in that course by that instructor in all semesters of 2005.)

  47. Big University (cont.) • a) Draw a snowflake schema diagram for that warehouse • What are the concept hierarchies for the dimensions? • b) What is the total number of cuboids?
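
For cuboid-counting questions such as b), the Han textbook gives the total as the product over all dimensions of (L_i + 1), where L_i is the number of levels of dimension i excluding the virtual top level "all". The sketch below applies that formula; the level counts are taken from the earlier exam statements in this deck (four levels per dimension excluding "all", or equivalently five including it) and are an assumption for this particular question, which does not state them. The formula also assumes each hierarchy is a simple chain of levels.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Number of cuboids when dimension i has L_i levels, excluding the level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


# Three dimensions with four levels each, not counting 'all' (library example).
library_cuboids = total_cuboids([4, 4, 4])

# Four dimensions with five levels each including 'all', so four excluding it.
mis_cuboids = total_cuboids([4, 4, 4, 4])

print("library cube:", library_cuboids)
print("MIS cube:    ", mis_cuboids)
```

A dimension whose hierarchy is only partially ordered, such as a time dimension where week does not roll up to month, would need its admissible level combinations counted separately.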

  48. MIS 542 Final 2005/2006 Spring olap • 1. The MIS department wants to revise its academic strategies for the following ten years. Relevant questions are: What portion of the courses are required or elective? What is the full-time/part-time distribution of instructors? What is the course load of instructors? What percent of technical or managerial courses are taught by part-time instructors? How have all these things

  49. MIS 542 Final S06 1 cont. • changed over the years? You can add similar strategic questions of your own. Do not consider the student aspects of the problem for the time being. Design an OLAP schema to be used as a strategic tool. You are free to decide the dimensions and the fact table. Describe the concept hierarchies, virtual dimensions and calculated members. Finally, show OLAP operations to answer three of such strategic questions.

  50. MIS 54 Final 2012/2013 Hospital • 2. (20 pts) Suppose that a data warehouse for a hospital consists of the following dimensions: time, doctor and patient, and the two measures count and charge, where charge is the fee a doctor charges a patient for a visit. • Design a warehouse with star schema: • a) Fact table: Design the fact table. • b) Dimension tables: For each dimension show a reasonable concept hierarchy. • c) State two questions that can be answered by that OLAP cube. • d) Show drill-down and roll-up operations related to one of these questions
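
A possible star schema for this question can be sketched with Python's built-in `sqlite3` module. All table names, column names and rows below are hypothetical choices made for illustration; the question leaves the design open. Roll-up and drill-down are shown as the same aggregate grouped at a coarser (year) and a finer (month) level of the time hierarchy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables; each carries the levels of its concept hierarchy.
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_doctor  (doctor_id INTEGER PRIMARY KEY, name TEXT, specialty TEXT);
CREATE TABLE dim_patient (patient_id INTEGER PRIMARY KEY, name TEXT, age_group TEXT);

-- Fact table: one row per visit, keyed by the three dimensions.
CREATE TABLE fact_visit (
    time_id     INTEGER REFERENCES dim_time(time_id),
    doctor_id   INTEGER REFERENCES dim_doctor(doctor_id),
    patient_id  INTEGER REFERENCES dim_patient(patient_id),
    visit_count INTEGER,
    charge      REAL
);

-- Hypothetical sample rows.
INSERT INTO dim_time VALUES (1, '2013-01-10', '2013-01', 2013),
                            (2, '2013-01-20', '2013-01', 2013),
                            (3, '2013-02-05', '2013-02', 2013);
INSERT INTO dim_doctor VALUES (1, 'Doctor A', 'cardiology'), (2, 'Doctor B', 'neurology');
INSERT INTO dim_patient VALUES (1, 'Patient 1', 'adult'), (2, 'Patient 2', 'child');
INSERT INTO fact_visit VALUES (1, 1, 1, 1, 100.0), (2, 1, 2, 1, 150.0), (3, 2, 1, 1, 200.0);
""")

# Roll-up: total charge per year (coarser level of the time hierarchy).
roll_up = conn.execute("""
    SELECT t.year, SUM(f.charge)
    FROM fact_visit f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.year ORDER BY t.year
""").fetchall()

# Drill-down: the same measure per month (finer level of the time hierarchy).
drill_down = conn.execute("""
    SELECT t.month, SUM(f.charge)
    FROM fact_visit f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.month ORDER BY t.month
""").fetchall()

print("roll-up by year:    ", roll_up)
print("drill-down by month:", drill_down)
```

Keeping every hierarchy level as a column of its dimension table is what makes this a star rather than a snowflake schema; normalizing month and year into their own tables would give the snowflake variant.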
