
Probability & Statistical Inference Lecture 1



Presentation Transcript


  1. Probability & Statistical Inference Lecture 1 MSc in Computing (Data Analytics)

  2. Lecture Outline • Introduction • General Info • Questionnaire • Introduction to Statistics • Statistics at work • The Analytics Process • Descriptive Statistics & Distributions • Graphs and Visualisation

  3. Introduction • Name: Aoife D’Arcy • Email: aoife@theanalyticsstore.com • Bio: Managing Director and Chief Consultant at the Analytics Store, with degrees in statistics, computer science, and financial & industrial mathematics. With over 10 years of experience in analytics consultancy with major national and international companies in banking, finance, insurance, manufacturing and gaming, Aoife has developed particular expertise in risk analytics, fraud analytics, and customer insight analytics. • Lecture Notes: Available online at www.comp.dit.ie/bmacnamee and later on Webcourses

  4. Programme Overview (diagram; legend: Pre-requisite, Core Module, Option Module) • TMP-5 Research Writing & Scientific Literature • TMP-6 Research Methods & Proposal Writing • SPEC 9160 Problem Solving Communication & Innovation • TMP-7 Research Project & Dissertation • TMP-1 Data Mining • TMP-4 Case Studies in Computing • MATH 4814 Decision Theory & Games • TMP-3 Data Management • TMP-2 Data & Database Design for Data Analytics • SPEC9260 Geographic Information Systems • BUS9290 Legal Issues for Knowledge Management • TMP-10 Designing and Building Semantic Web Applications • SPEC9290 Universal Design for Knowledge Management • SPEC 9270 Machine Learning • TMP-9 Language Technology • TMP-0 Probability & Statistical Inference • MATH 4821 Industrial & Commercial Statistics • SENG X01 Software Project Management • MATH 4807 Financial Mathematics - I • MATH 4818 Financial Mathematics - II • INTC9221 Strategic Issues in IT • MATH 4809 Linear Programming • INTC9231 Internet Systems • TECH9290 Ubiquitous Computing • INTC 9141 Enterprise Systems Integration • TECH9280 Security • MATH 4810 Queuing Theory & Markov Processes • TECH9250 Complex and Adaptive Agent Based Computation

  5. Course Outline

  6. Exam & Assignment • Exam: The end-of-term exam accounts for 60% of the overall mark. • Assignment: The assignment is worth 40% of the overall mark. • The assignment will be handed out in week 5. • Week 9’s class will be dedicated to working on the assignment.

  7. Software • SAS Enterprise Guide will be used throughout the course.

  8. Recommended Reading • Applied Statistics and Probability for Engineers, Douglas C. Montgomery, John Wiley & Sons • Modelling Binary Data, David Collett, Chapman & Hall • Probability and Statistics for Engineers and Scientists, R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye, Pearson Education • Probability and Random Processes, G. Grimmett & D. Stirzaker, Oxford University Press • Statistical Inference, George Casella, Brooks/Cole

  9. Questionnaire

  10. Section 1: Statistics at work

  11. Statistics in Everyday Life • With the increase in the amount of data available and advancements in computing power, statistics are being used more and more frequently. We constantly read about surveys in which 3 out of 5 people prefer brand X, or research showing that having tomatoes in your diet can reduce the risk of disease Y. Is it good that statistics are used so much, and what happens when statistics are misused?

  12. Statistics can be misleading • An ad claimed: “9 out of 10 dentists prefer Colgate” • What is wrong with this statement? • During Obama’s presidential election campaign the following was stated: “According to the Advertising Project, one out of three McCain ads has been negative, criticizing Obama. Nine out of 10 Obama ads have been positive, stressing his own background and ideas.” • What is wrong with this statement?

  13. Misinterpreted Statistics can be Devastating • In 1999 Sally Clark was wrongly convicted of the murder of two of her sons. The case was widely criticised because of the way statistical evidence was misrepresented in the original trial, particularly by paediatrician Professor Sir Roy Meadow. • He claimed that, for an affluent non-smoking family like the Clarks, the probability of a single cot death was 1 in 8,543, so the probability of two cot deaths in the same family was around "1 in 73 million" (8,543 × 8,543). • What is wrong with this assumption?
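
The arithmetic behind the quoted figure can be reproduced in a few lines. This is only a sketch: the variable names are illustrative, and the multiplication step is valid only if the two deaths are treated as independent events, which is the assumption the slide asks you to examine.

```python
from fractions import Fraction

# Figure quoted on the slide: probability of a single cot death
# for a family like the Clarks.
p_single = Fraction(1, 8543)

# Multiplying the probabilities is only justified if the two deaths
# are independent events (an assumption, not an established fact).
p_double_if_independent = p_single * p_single

print(f"1 in {p_double_if_independent.denominator:,}")  # 1 in 72,982,849
```

The result, 1 in 72,982,849, is the source of the rounded "1 in 73 million" claim.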

  14. Video

  15. Challenges • As an analytics practitioner you will face a number of challenges: • Create insight from data • Interpret statistics correctly • Communicate statistically driven insight in a way that is clearly understood

  16. The Analytics Process & Statistics

  17. Section Overview • Statistics and Analytics • Introduction to CRISP

  18. Predictive Analytics Is Multidisciplinary (diagram: Predictive Analytics at the centre of related fields) • Statistics • Pattern Recognition • Neurocomputing • Data Warehousing • Machine Learning • AI • Databases • KDD

  19. CRISP-DM Evolution • Complete information on CRISP-DM is available at: http://www.crisp-dm.org/ • Over 200 members of the CRISP-DM SIG worldwide • DM Vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc. • System Suppliers/Consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc. • End Users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc. • CRISP-DM 2.0 is due…

  20. CRISP-DM • Features of CRISP-DM: • Non-proprietary • Application/Industry neutral • Tool neutral • Focus on business issues • As well as technical analysis • Framework for guidance • Experience base • Templates for Analysis

  21. CRISP-DM process diagram (phases arranged around the Data) • Business Understanding • Data Understanding • Data Preparation • Modelling • Evaluation • Deployment

  22. Phases & Generic Tasks: Business Understanding • Phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment • This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. • Generic tasks: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan

  23. Phases & Generic Tasks: Data Understanding • The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information. • Generic tasks: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality

  24. Phases & Generic Tasks: Data Preparation • The data preparation phase covers all activities to construct the data that will be fed into the modelling tools from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools. • Generic tasks: Select Data; Clean Data; Construct Data; Integrate Data; Format Data

  25. Phases & Generic Tasks: Modelling • In this phase, various modelling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary. • Generic tasks: Select Modelling Technique; Generate Test Design; Build Model; Assess Model

  26. Phases & Generic Tasks: Evaluation • Before proceeding to final deployment of a model, it is important to thoroughly evaluate it and review the steps executed to construct it to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached. • Generic tasks: Evaluate Results; Review Process; Determine Next Steps

  27. Phases & Generic Tasks: Deployment • Creation of a model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. • Generic tasks: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project

  28. CRISP-DM • Business Understanding • Data Understanding • Data Preparation • Modelling • Evaluation • Deployment

  29. CRISP-DM: Areas covered in this course • Business Understanding • Data Understanding • Data Preparation • Modelling • Evaluation • Deployment

  30. Section 2: Descriptive Statistics & Distributions

  31. Topics • Introduction to Statistics • The Basics • Measures of location: Mean, Median & Mode. • Measures of location & Skew. • Measures of dispersion: range, standard deviation (variance) & interquartile range.

  32. Introduction to Statistics • According to The Random House College Dictionary, statistics is “the science that deals with the collection, classification, analysis and interpretation of numerical facts or data.” In short, statistics is the science of data. • There are two main branches of Statistics: • The branch of statistics devoted to the organisation, summarization and the description of data sets is called Descriptive Statistics. • The branch of statistics concerned with using sample data to make an inference about a large set of data is called Inferential Statistics.

  33. Process of Data Analysis (diagram labels: Describe, Make Inference) • A statistical population is a data set that is our target of interest. • A sample is a subset of data selected from the target population. • If a sample is not representative of the population, it is said to be biased.

  34. Types of Data • There are a number of data types that we will be considering. • These can be split into a hierarchy of 4 levels of measurement. • Categorical: Nominal, Ordinal • Interval: Discrete, Continuous

  35. Describing Distributions

  36. Describing Distributions

  37. Measures of Location (Central Tendency) • Numbers that attempt to express the location of data on the number line • Variable(s) are said to be distributed over the number line - so we talk of distributions of numbers • Want a measure of the location of this data on the number line. • There is 'symmetry' around this point in this particular data – hence the term central tendency

  38. Arithmetic Mean (average) • The mean of a data set is one of the most commonly used statistics. It is a measure of the central tendency of the data set. • The mean of a sample is denoted by x̄ (pronounced “x bar”) and the mean of a population is denoted by µ (pronounced “mu”). • Both x̄ and µ are computed using the same formula: the sum of the values divided by the number of values.
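
In standard notation, and assuming n denotes the sample size and N the population size (symbols that are not given on the slide), the formula can be written as:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad\text{and}\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
```

The only difference is whether the sum runs over the sample or over the whole population.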

  39. Arithmetic Mean - Example • Example: Ages of students in a first-year History of Art degree course: 18, 18, 18, 18, 19, 19, 20, 20, 58 • The mean of the ages here is 23.11, but this is not a ‘typical’ value or a value around which the observed values cluster. • The same thing tends to happen with values that are strictly positive: average salaries, house prices etc. • We say that the mean is sensitive to extreme values.
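
A quick way to see this sensitivity is to compute the mean with and without the 58-year-old. The sketch below uses the ages from the slide; the variable names and the idea of dropping the extreme value for comparison are illustrative additions.

```python
# Ages of the students, as listed on the slide.
ages = [18, 18, 18, 18, 19, 19, 20, 20, 58]

# Arithmetic mean: sum of the values divided by the number of values.
mean_all = sum(ages) / len(ages)

# Same calculation with the single extreme value removed (for comparison only).
ages_without_extreme = [a for a in ages if a != 58]
mean_without_extreme = sum(ages_without_extreme) / len(ages_without_extreme)

print(round(mean_all, 2))              # 23.11
print(round(mean_without_extreme, 2))  # 18.75
```

One value out of nine moves the mean by more than four years.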

  40. Median • The middle value of the ordered set of values, i.e. 50% higher and 50% lower. • Example: The class age data again 18, 18, 18, 18, 19, 19, 20, 20, 58 • The data is ordered, and n = 9, so the middle number is (n+1)/2 = (9+1)/2 = 5th value = 19 • => median = 19 years
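
The (n+1)/2 rule can be sketched as a small function. The function name is illustrative, and the even-length branch (averaging the two middle values) is a common convention added here as an assumption; the slide only covers the odd case.

```python
def median(values):
    """Return the middle value of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2  # zero-based index of the (n+1)/2-th value when n is odd
    if n % 2 == 1:
        return ordered[mid]
    # Even n: average the two middle values (assumed convention).
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [18, 18, 18, 18, 19, 19, 20, 20, 58]
print(median(ages))  # 19
```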

  41. Median • Robust with regard to extreme values • Often a real value in the distribution or close to 2 real values - in that sense tends to be more typical of actually observed values

  42. Mode • The most commonly occurring value in a distribution • Example: The class age data again 18, 18, 18, 18, 19, 19, 20, 20, 58  The mode is 18 years as it occurs more than any other • Tends to show where the data is concentrated • Mode: 18 Mean: 23.11 Median: 19
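
Finding the mode amounts to counting how often each value occurs. A minimal sketch using the standard library's `collections.Counter` on the ages from the slide (variable names are illustrative):

```python
from collections import Counter

ages = [18, 18, 18, 18, 19, 19, 20, 20, 58]

# Count occurrences of each value, then take the most frequent one.
counts = Counter(ages)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 18 4
```

If several values tie for the highest count, `most_common(1)` reports only one of them, so a distribution with more than one mode needs a closer look at `counts`.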

  43. Skew – The Shape of a Distribution • There are a number of ways of describing the shape of a distribution. • We will consider only one – skew. • Skew is a measure of how asymmetric a distribution is.
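
The slides do not say which formula for skew is intended. As an assumption, the sketch below uses one common definition, the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5); the function name and test data are illustrative.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (one common definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    # Note: raises ZeroDivisionError if all values are identical (m2 == 0).
    return m3 / m2 ** 1.5


symmetric = [4, 5, 6, 7, 8]
ages = [18, 18, 18, 18, 19, 19, 20, 20, 58]

print(skewness(symmetric))  # zero for perfectly symmetric data
print(skewness(ages))       # positive: the 58 creates a tail to the right
```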

  44. Symmetric Distributions • Skew = 0

  45. Positive Skew • There are a few very large data points which create a 'tail' going to the right (i.e. up the number line) • Note: No axis of symmetry here; skew > 0 (i.e. it is positive) • Examples: Lifetime of people, house prices

  46. Negative Skew • There are a few very small data points which create a 'tail' going to the left (i.e. down the number line) • Note: No axis of symmetry here; skew < 0 (i.e. it is negative) • Examples: Examination scores, reaction times for drivers

  47. Skew & Measures of Location - Symmetry • Mean, Median & Mode are the same and are found in the middle • Mean = 102/17 = 6 • Median = 6 • Mode = 6

  48. Positive Skew (diagram order, left to right: Mode, Median, Mean) • Mean = 121/17 = 7.12 • Median = 7 • Mode = 6 • In general: Mode < Median < Mean

  49. Negative Skew (diagram labels: Mode, Median, Mean) • Mean = 83/17 = 4.88 • Median = 5 • Mode = 6 • In general: Mode > Median > Mean

  50. Measures of Spread (Dispersion) • The Mean, Mode and Median are all 250 for both companies • But the distributions are not the same: look at the difference in the ‘spread’ of bills • We need a measure of spread (dispersion) as well as location to describe a distribution
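
The bill data behind this slide is not included in the transcript, so the sketch below uses two small hypothetical data sets, chosen only so that the mean, median and mode are all 250 in both. It then computes the range, sample standard deviation and interquartile range with Python's `statistics` module (`statistics.quantiles` needs Python 3.8 or later, and its default quartile method is just one of several conventions).

```python
import statistics

# Hypothetical bills (not from the lecture): both sets have
# mean = median = mode = 250, but very different spread.
company_a = [240, 245, 250, 250, 255, 260]
company_b = [100, 200, 250, 250, 300, 400]


def summarise(bills):
    """Return location and spread measures for a list of bills."""
    q1, _, q3 = statistics.quantiles(bills, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(bills),
        "median": statistics.median(bills),
        "mode": statistics.mode(bills),
        "range": max(bills) - min(bills),
        "std_dev": statistics.stdev(bills),  # sample standard deviation
        "iqr": q3 - q1,                      # interquartile range
    }


summary_a = summarise(company_a)
summary_b = summarise(company_b)
print(summary_a)
print(summary_b)
```

The three measures of location agree for both companies, while every measure of spread is larger for the second one.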
