Probability & Statistical Inference Lecture 1 MSc in Computing (Data Analytics)
Lecture Outline • Introduction • General Info • Questionnaire • Introduction to Statistics • Statistics at work • The Analytics Process • Descriptive Statistics & Distributions • Graphs and Visualisation
Introduction • Name: Aoife D’Arcy • Email: email@example.com • Bio: Managing Director and Chief Consultant at the Analytics Store, with degrees in statistics, computer science, and financial & industrial mathematics. Over 12 years of experience in analytics consultancy with major national and international companies in banking, finance, insurance, manufacturing and gaming, with particular expertise in risk analytics, fraud analytics, and customer insight analytics. • Lecture Notes: Available online at www.comp.dit.ie/bmacnamee and later on webcourses
Exam & Assignment Exam • The end of term exam accounts for 60% of the overall mark Assignment • The assignment is worth 40% of the overall mark. • The assignment will be handed out in week 5 • Week 9’s class will be dedicated to working on the assignment.
Software • SAS Enterprise Guide will be used throughout the course.
Recommended Reading • Applied Statistics and Probability for Engineers, Douglas C. Montgomery (John Wiley & Sons) • Modelling Binary Data, David Collett (Chapman & Hall) • Probability and Statistics for Engineers and Scientists, R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye (Pearson Education) • Probability and Random Processes, G. Grimmett & D. Stirzaker (Oxford University Press) • Statistical Inference, George Casella (Brooks/Cole)
We are bombarded with Statistics • http://www.irishtimes.com/newspaper/frontpage/2012/0918/1224324122326.html • http://www.irishtimes.com/newspaper/world/2012/0914/1224324008884.html • http://www.independent.ie/business/world/survey-names-oslo-the-worlds-priciest-city-ireland-ranks-27th-3229426.html
The internet is full of interesting statistics http://www.usatoday.com/news/politics/twitter-election-meter
Statistics can be misleading • An ad claimed: “9 Out of 10 Dentists prefer Colgate” • What is wrong with this statement? • Consider these complaints about airlines published in US News and World Report on February 5, 2001 • Can we conclude that United Airlines has the worst customer service?
Statistics in Everyday Life • With the increase in the amount of data available and advancements in the power of computers, statistics are being used more and more frequently • Question: Is it good that statistics are used so much, and what happens when statistics are misused?
Misinterpreted Statistics can be Devastating • In 1999 Sally Clark was wrongly convicted of the murder of two of her sons. The case was widely criticised because of the way statistical evidence was misrepresented in the original trial, particularly by paediatrician Professor Sir Roy Meadow. • He claimed that, for an affluent non-smoking family like the Clarks, the probability of a single cot death was 1 in 8,543, so the probability of two cot deaths in the same family was around "1 in 73 million" (8543 × 8543). • What is wrong with this assumption?
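The arithmetic behind the "1 in 73 million" figure can be reproduced in a few lines. The sketch below is Python rather than the course software, and it makes one point only: multiplying the two probabilities is valid only if the two deaths are independent events. The conditional probability in the second calculation is a purely hypothetical value invented for illustration, not an estimate from the case.

```python
from fractions import Fraction

# Figure quoted at the trial for a single cot death in such a family.
p_single = Fraction(1, 8543)

# Squaring the probability treats the two deaths as independent events.
p_two_if_independent = p_single * p_single
print(p_two_if_independent.denominator)  # 72982849, i.e. roughly 73 million

# HYPOTHETICAL: if shared genetic or environmental factors made a second
# death more likely after a first, P(second | first) would exceed 1/8543.
# The value 1/100 is invented purely to show how much the answer can move.
p_second_given_first = Fraction(1, 100)
p_two_if_dependent = p_single * p_second_given_first
print(p_two_if_dependent.denominator)  # 854300
```

With the invented conditional probability the joint probability changes by almost two orders of magnitude, which is why the independence assumption deserves scrutiny before any multiplication is done.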
Video • http://www.youtube.com/watch?v=4TKbIidbyhk&feature=fvwrel
Challenges • As an analytics practitioner you will face a number of challenges: • Create insight from all available data (and there is lots of it) • Interpret statistics correctly • Communicate statistically driven insight in a way that is clearly understood
Objective of this course • Give you a set of statistical skills that allow you, as an analytics practitioner, to turn data into insight!
Section Overview • Statistics and Analytics • Introduction to CRISP
Data Analytics Is Multidisciplinary Statistics Pattern Recognition Neurocomputing Data Warehousing Machine Learning AI Predictive Analytics Databases KDD
Analytics Is A Lot Of Things • (Chart: competitive advantage plotted against degree of intelligence, listed here from lowest to highest) • Access & reporting • Standard reports: What happened? • Ad hoc reports: How many, how often, where? • Query/drill down: Where exactly is the problem? • Alerts: What actions are needed? • Predictive Analytics • Statistical analysis: Why is this happening? • Forecasting/extrapolation: What if these trends continue? • Predictive modelling: What will happen next? • Optimization: What’s the best that can happen?
For this course we will concentrate on Statistical Analysis • Statistical analysis: Why is this happening? • (Same chart as the previous slide.)
CRISP-DM Evolution • Over 200 members of the CRISP-DM SIG worldwide • DM Vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc • System Suppliers/Consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc • End Users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc • CRISP-DM 2.0 is due… • Complete information on CRISP-DM is available at: http://www.crisp-dm.org/
CRISP-DM • Features of CRISP-DM: • Non-proprietary • Application/Industry neutral • Tool neutral • Focus on business issues • As well as technical analysis • Framework for guidance • Experience base • Templates for Analysis
(CRISP-DM cycle diagram) Business Understanding • Data Understanding • Data Preparation • Modelling • Evaluation • Deployment, with Data at the centre of the cycle
Phases & Generic Tasks • Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment • Business Understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. • Generic tasks: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan
Phases & Generic Tasks • Data Understanding: The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information. • Generic tasks: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality
Phases & Generic Tasks • Data Preparation: The data preparation phase covers all activities to construct the data that will be fed into the modelling tools from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools. • Generic tasks: Select Data, Clean Data, Construct Data, Integrate Data, Format Data
Phases & Generic Tasks • Modelling: In this phase, various modelling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary. • Generic tasks: Select Modeling Technique, Generate Test Design, Build Model, Assess Model
Phases & Generic Tasks • Evaluation: Before proceeding to final deployment of a model, it is important to thoroughly evaluate it and review the steps executed to construct it to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached. • Generic tasks: Evaluate Results, Review Process, Determine Next Steps
Phases & Generic Tasks • Deployment: Creation of a model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. • Generic tasks: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project
CRISP-DM • Business Understanding • Data Understanding • Data Preparation • Modelling • Evaluation • Deployment
CRISP-DM: Areas covered in this course • Business Understanding • Data Understanding • Data Preparation • Modelling • Evaluation • Deployment
Topics • Introduction to Statistics • The Basics • Measures of location: Mean, Median & Mode. • Measures of location & Skew. • Measures of dispersion: range, standard deviation (variance) & interquartile range.
Introduction to Statistics • According to The Random House College Dictionary, statistics is “the science that deals with the collection, classification, analysis and interpretation of numerical facts or data.” In short, statistics is the science of data. • There are two main branches of Statistics: • The branch of statistics devoted to the organisation, summarization and the description of data sets is called Descriptive Statistics. • The branch of statistics concerned with using sample data to make an inference about a large set of data is called Inferential Statistics.
Process of Data Analysis • A statistical population is a data set that is our target of interest. • A sample is a subset of data selected from the target population. • If a sample is not representative of the population, it is said to be biased. • (Diagram labels: Describe, Make Inference)
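The difference between a representative and a biased sample can be illustrated with a small simulation. This is a sketch only: the population is hypothetical (10,000 simulated values), and the "biased" sample is deliberately drawn from the top half of the population to exaggerate the effect.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# HYPOTHETICAL population: 10,000 simulated values centred near 50.
population = [random.gauss(50, 10) for _ in range(10_000)]
pop_mean = statistics.mean(population)

# Representative sample: every member is equally likely to be selected.
random_sample = random.sample(population, 200)

# Biased sample: members are selected only from the top half of the population.
top_half = sorted(population)[5_000:]
biased_sample = random.sample(top_half, 200)

print("population mean   :", round(pop_mean, 2))
print("random sample mean:", round(statistics.mean(random_sample), 2))
print("biased sample mean:", round(statistics.mean(biased_sample), 2))
```

An inference about the population drawn from the biased sample would overstate the mean, however large that sample is made.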
Types of Data: Numeric Data • Numeric data can be of two types: • Continuous Data: Data is continuous if its range is an interval of real numbers • Example: the number of centimetres of rain that fell in March • Discrete Data: Data is discrete if its range is finite or countably infinite • Example: the number of correct answers in a 10 question quiz
Types of Data: Categorical Data • Data that is broken into discrete categories is referred to as categorical data • Categorical data has two main types: • Nominal: A nominal variable has a discrete number of categories or levels with no logical order • Gender: Male, Female • Working Status: Employed, Unemployed, Home-maker, Student, Retired • Ordinal: An ordinal variable has a discrete number of categories or levels with a logical order • Income Level: Low, Medium, High • Places in a race: 1st, 2nd, 3rd, 4th, 5th, 6th
Class Task • Task: Classify the type of data in each of the following examples: • The profit margin made from customers of an online clothing company • The type of interest rate you can be charged on a mortgage, i.e. Fixed rate, Adjustable rate • Number of dependents associated with a loan applicant
Let’s Start at the Very Beginning • When learning to read and write we start with A-B-C, when starting to count we start with 1-2-3 and of course the von Trapp family singers started with Do-Re-Mi! • When learning statistics you start with the arithmetic mean or a simple average
The Arithmetic Mean • The table below shows the total medals won and gold medals won by each country in the last 5 Olympic games • * Germany combines East and West Germany prior to reunification • ** Russia or The Soviet Union • Data source: http://www.databaseolympics.com/index.htm
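Arithmetic Mean – The Formula • The formula for calculating the sample arithmetic mean of n data points $x_1, x_2, \ldots, x_n$ is $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • $\bar{x}$ is referred to as x-bar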
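The sample mean is the sum of the data points divided by the number of data points, which translates directly into code. A minimal Python sketch; the function name `arithmetic_mean` and the medal counts are hypothetical and are not taken from the Olympic table.

```python
def arithmetic_mean(values):
    """Return the sample mean: the sum of the values divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / n


# HYPOTHETICAL medal counts, used only to exercise the function.
medals = [10, 20, 30, 40]
x_bar = arithmetic_mean(medals)
print(x_bar)  # 25.0
```

Python's standard library offers the same calculation as `statistics.mean`.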
Attributes of the Arithmetic Mean • It is straight-forward to calculate • It is easy to interpret the mean • It gives us a good estimate of where a set of numbers is centred • This is referred to as the central tendency of a sample • It is sensitive to outliers
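The sensitivity of the mean to outliers is easy to see with a small example. The scores below are hypothetical; a single extreme value is appended and the mean is recomputed.

```python
import statistics

# HYPOTHETICAL scores clustered in the 90s.
scores = [94, 96, 97, 98, 100]
mean_before = statistics.mean(scores)  # 485 / 5 = 97

# Append one extreme value and recompute the mean.
with_outlier = scores + [500]
mean_after = statistics.mean(with_outlier)  # 985 / 6, about 164.17

print(mean_before, round(mean_after, 2))
```

One unusual observation moves the mean far above every other value in the sample, so it no longer describes where most of the data sits.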
Other Measures of Central Tendency • Median: The middle value of an ordered set of values, i.e. 50% of values are higher and 50% are lower • Mode: The most commonly occurring value in a distribution
Calculating the Median • Sort the data • Median = 97.5
Calculating the Mode • Count frequencies • Mode = 94
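The two procedures, sorting for the median and counting frequencies for the mode, can be sketched in Python. The data values below are hypothetical: they are not the values from the slides, only a set chosen so that the results match the figures quoted (median 97.5, mode 94).

```python
from collections import Counter
import statistics

# HYPOTHETICAL data, chosen to reproduce the median and mode quoted above.
data = [101, 94, 110, 96, 94, 103, 90, 99]

# Median: sort the data, then take the middle value. With an even number
# of values, average the two middle values.
ordered = sorted(data)  # [90, 94, 94, 96, 99, 101, 103, 110]
mid = len(ordered) // 2
if len(ordered) % 2 == 0:
    median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    median = ordered[mid]
print(median)  # 97.5

# Mode: count how often each value occurs and take the most frequent.
frequencies = Counter(data)
mode, count = frequencies.most_common(1)[0]
print(mode, count)  # 94 2
```

The standard library functions `statistics.median` and `statistics.mode` give the same results for this data.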
When to Use Each Central Tendency Value? • Question: When and why would you use the median over the mean?