This lecture covers sampling methods and probability distributions, with a focus on estimating the average student age at a university. It explains the concepts of mean, standard deviation, and Gaussian/normal probability distribution, as well as how to calculate z-scores and probabilities.
Research Methodology Lecture 13 Sampling & Probability distributions Mazhar Hussain Dept of Computer Science ISP,Multan Mazhar.hussain@isp.edu.pk
Sampling • How to find the average student age in the university? • Ask each student and compute the average • Randomly select 3 to 4 students from each discipline and find their average age – Estimation of the average age of students in the university
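The second approach on this slide can be sketched in a few lines of Python. This is only an illustration: the population is synthetic (three made-up disciplines with randomly generated ages), and the helper name `stratified_sample_mean` is hypothetical.

```python
import random

def stratified_sample_mean(ages_by_discipline, per_discipline, rng):
    """Pick `per_discipline` students at random from each discipline
    and return the average age of the picked students plus the sample."""
    sample = []
    for ages in ages_by_discipline.values():
        sample.extend(rng.sample(ages, per_discipline))
    return sum(sample) / len(sample), sample

rng = random.Random(13)  # fixed seed so the run is repeatable

# Synthetic population (an assumption): 3 disciplines, 200 students each, ages 18 to 30.
population = {
    name: [rng.randint(18, 30) for _ in range(200)]
    for name in ("Computer Science", "Physics", "Economics")
}

all_ages = [age for ages in population.values() for age in ages]
true_mean = sum(all_ages) / len(all_ages)          # "ask each student"
estimate, sample = stratified_sample_mean(population, 3, rng)  # "select 3 from each discipline"

print(f"population mean = {true_mean:.2f}")
print(f"estimate from {len(sample)} students = {estimate:.2f}")
```

The estimate uses 9 students instead of 600, so it will generally differ somewhat from the population mean.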
Sampling • Why sampling? • Efforts and resources required to carry out the study on the whole population • Examples • Average income of families living in a city • Results of an election • Opinion about a problem
Sampling • Sampling is the process of selecting a few (a sample) from a bigger group (the sampling population) to become the basis for estimating or predicting the prevalence of an unknown piece of information, situation or outcome regarding the bigger group
Recap – Mean & Standard deviation • Mean/Average • Standard Deviation • On the average, how far the data values are from the mean
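A minimal sketch of the two recap quantities follows. The slide's own formulas did not survive in this transcript, so the standard definitions are assumed; the function names are illustrative, and the four ages are the ones used later in the lecture.

```python
import math

def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def std_dev(values, sample=True):
    """Standard deviation: root of the average squared distance from the mean.
    Divides by n - 1 for a sample and by n for a whole population."""
    m = mean(values)
    squared_deviations = sum((v - m) ** 2 for v in values)
    divisor = len(values) - 1 if sample else len(values)
    return math.sqrt(squared_deviations / divisor)

ages = [18, 20, 23, 25]
print(mean(ages))                            # 21.5
print(round(std_dev(ages), 2))               # 3.11 (sample, n - 1)
print(round(std_dev(ages, sample=False), 2)) # 2.69 (population, n)
```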
Gaussian Distribution Karl Friedrich Gauss 1777-1855
Gaussian/Normal Probability Distribution • Most of the naturally occurring processes can be modeled by a bell-shaped curve
Gaussian/Normal Probability Distribution • The Gaussian probability distribution is perhaps the most used distribution in all of science. • Sometimes it is called the “bell shaped curve” or normal distribution. • $p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ • $\mu$ = mean of distribution • $\sigma$ = standard deviation of distribution • $x$ is a continuous variable ($-\infty < x < \infty$)
Gaussian/Normal Probability Distribution • The area within +/- σ is ≈ 68% • The area within +/- 2σ is ≈ 95% • The area within +/- 3σ is ≈ 99.7%
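Gaussian/Normal Probability Distribution • Probability (P) of x being in the range [a, b] is given by an integral: $P(a \le x \le b) = \int_a^b \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx$ • Figure: Gaussian pdf with μ = 0 and σ = 1; 95% of the area lies within 2σ, and only 5% of the area lies outside 2σ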
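The integral has no closed form, but it equals the difference of two values of the cumulative distribution function (CDF). The sketch below uses Python's standard `statistics.NormalDist` for the CDF; the helper name `range_probability` is illustrative.

```python
from statistics import NormalDist

def range_probability(a, b, mu=0.0, sigma=1.0):
    """P(a <= x <= b) for a normal variable, as a difference of two CDF values."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(b) - dist.cdf(a)

# Areas within 1, 2 and 3 standard deviations of the mean (mu = 0, sigma = 1).
for k in (1, 2, 3):
    print(f"within +/- {k} sigma: {range_probability(-k, k):.4f}")
# within +/- 1 sigma: 0.6827
# within +/- 2 sigma: 0.9545
# within +/- 3 sigma: 0.9973
```

The same areas hold for any mean and standard deviation, for example `range_probability(90, 110, mu=100, sigma=10)` is again about 0.6827.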
Gaussian/Normal Probability Distribution Standard Normal Distribution
Standard Normal Distribution • Normal distribution with mean of zero and standard deviation of one • Since mean and standard deviation define any normal distribution… • Standard normal distribution can be used for any normally distributed variable by converting mean to zero and standard deviation to one—z scores
Z Scores • By itself, a raw score or X value provides very little information about how that particular score compares with other values in the distribution. • A score of X = 53, for example, may be a relatively low score, or an average score, or an extremely high score depending on the mean and standard deviation for the distribution from which the score was obtained. • If the raw score is transformed into a z-score, however, the value of the z-score tells exactly where the score is located relative to all the other scores in the distribution.
Z Scores • The process of changing an X value into a z-score involves creating a signed number, called a z-score, such that • The sign of the z-score (+ or –) identifies whether the X value is located above the mean (positive) or below the mean (negative). • The numerical value of the z-score corresponds to the number of standard deviations between X and the mean of the distribution. • Thus, a score that is located two standard deviations above the mean will have a z-score of +2.00
Z Scores • In addition to knowing the basic definition of a z-score and the formula for a z-score, it is useful to be able to visualize z-scores as locations in a distribution. • Remember, z = 0 is in the center (at the mean), and the extreme tails correspond to z-scores of approximately –2.00 on the left and +2.00 on the right. • Although more extreme z-score values are possible, most of the distribution is contained between z = –2.00 and z = +2.00.
Z Scores • z-score for a sample value in a data set is obtained by subtracting the mean of the data set from the value and dividing the result by the standard deviation of the data set. • NOTE: When computing the value of the z-score, the data values can be population values or sample values. Hence we can compute either a population z-score or a sample z-score.
Z Scores • The sample z-score for a value x is given by the following formula: $z = \frac{x - \bar{x}}{s}$ • Where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.
Z Scores • The population z-score for a value x is given by the following formula: $z = \frac{x - \mu}{\sigma}$ • Where $\mu$ is the population mean and $\sigma$ is the population standard deviation.
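Example • Example: What is the z-score for the value of 14 in the following sample values? 3 8 6 14 4 12 7 10 • The sample mean is $\bar{x} = 8$ and the sample standard deviation is $s \approx 3.82$, so $z = (14 - 8)/3.82 \approx 1.57$ • Thus, the data value of 14 is 1.57 standard deviations above the mean of 8, since the z-score is positive.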
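The example can be checked with a short Python sketch. It assumes the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes; the function name `sample_z_score` is illustrative.

```python
from statistics import mean, stdev

def sample_z_score(x, values):
    """Sample z-score: (x - sample mean) / sample standard deviation."""
    return (x - mean(values)) / stdev(values)

sample = [3, 8, 6, 14, 4, 12, 7, 10]
z = sample_z_score(14, sample)

print(mean(sample))   # 8
print(round(z, 2))    # 1.57
```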
Example • Dot Plot of the data points with the location of the mean and the data value of 14.
Z Score & Probability • What is the probability of finding a value between 100 and 110? How can this area be calculated using z-scores?
Reading the area under the curve for z = 1.55 • Z Score Chart: 0.9394
Z Score & Probability • Probability of z > 1.55 (area in the tail) • P = 1 - 0.9394 = 0.0606
Z Score & Probability • Probability of z > 1.55 or z < -1.55 (area in both tails) • P = 0.0606 + 0.0606 = 0.1212
Z Score & Probability • Probability of z > 0 and z < 1.55 (area between the mean and z = 1.55) • P = 0.5 - 0.0606 = 0.4394
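The chart readings for z = 1.55 can be reproduced with the standard normal CDF. This is a sketch using `statistics.NormalDist` from the Python standard library in place of a printed z chart; the variable names are illustrative.

```python
from statistics import NormalDist

z = 1.55
area_below = NormalDist().cdf(z)  # area to the left of z, the value read from the chart
upper_tail = 1 - area_below       # P(Z > 1.55)
both_tails = 2 * upper_tail       # P(Z > 1.55) + P(Z < -1.55), equal by symmetry
mean_to_z = area_below - 0.5      # P(0 < Z < 1.55)

print(round(area_below, 4))  # 0.9394
print(round(upper_tail, 4))  # 0.0606
print(round(both_tails, 4))  # 0.1211
print(round(mean_to_z, 4))   # 0.4394
```

The two-tail area prints as 0.1211 rather than 0.1212 because the code doubles the unrounded tail area, while the chart-based calculation doubles the already rounded 0.0606.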
Example: 50 measures of pollution • Probability of a value > 45 • z = 0.4372 • P = 0.3300
Example: 50 measures of pollution • Probability of a value from 35 to 45 • z = -0.5749 for 35: P = 0.5 - 0.2843 = 0.2157 • z = 0.4372 for 45: P = 0.5 - 0.3300 = 0.1700 • Total: P = 0.2157 + 0.1700 = 0.3857
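The pollution figures can be checked the same way. The mean and standard deviation of the 50 measurements are not given in this transcript, so the sketch starts from the two numbers on the slide, read here as the z-scores for 35 and 45 (an assumption). They are rounded to two decimals, as they would be when looked up in a printed z chart.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

z_35 = round(-0.5749, 2)   # -0.57, assumed z-score for the value 35
z_45 = round(0.4372, 2)    #  0.44, assumed z-score for the value 45

p_above_45 = 1 - std_normal.cdf(z_45)                     # upper-tail area
p_35_to_45 = std_normal.cdf(z_45) - std_normal.cdf(z_35)  # area between the two z-scores

print(round(p_above_45, 4))  # 0.33
print(round(p_35_to_45, 4))  # 0.3857
```

Because 35 and 45 lie on opposite sides of the mean, the area between them is the sum of the two "mean to z" areas, which is the same as the CDF difference used here.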
Sampling • Pros • Saves time • Resources – financial, human • Cons • Not exact value for the population • An estimate or prediction • Compromise on accuracy of findings
Sampling – Terminology • Examples • Average student age in the university • Average income of families living in a city • Results of an election • Population or study population (N) • The university students, families living in the city, electors • Sample • The small group of students, families or electors you choose to collect the required information from
Sampling – Terminology • Sample size (n) • The number of entities in your sample • Sampling design or strategy • The way you select the students, families or electors • Sampling unit or sampling element • Each student, family or elector in your study • Sample statistics • Your findings based on information obtained from your sample
Sampling – Terminology • Population Parameters • Aim of research – find answers to research questions for the study population, not the sample • Use sample statistics to estimate answers to research questions in the study population • Estimates arrived at from sample statistics – population parameters • Saturation Point • When no new information is coming from your respondents
Sampling – Terminology • Sampling Frame • A list identifying each student, family or elector in the study population
Principles of sampling • Example – Four individuals A, B, C, D • A = 18 years • B = 20 years • C = 23 years • D = 25 years • Average age • (18+20+23+25) / 4 = 21.5 years • Use a sample of two individuals to estimate the average age of your study population (4 individuals)
Principles of sampling • How many possible combinations of two individuals? • A and B • A and C • A and D • B and C • B and D • C and D
Principles of sampling • A+B = (18+20)/2 = 38/2 = 19.0 years • A+C = (18+23)/2 = 41/2 = 20.5 years • A+D = (18+25)/2 = 43/2 = 21.5 years • B+C = (20+23)/2 = 43/2 = 21.5 years • B+D = (20+25)/2 = 45/2 = 22.5 years • C+D = (23+25)/2 = 48/2 = 24.0 years • In two cases – no difference between sample statistics and population parameters • Difference – Sampling error
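The six samples of two and their sampling errors can be enumerated with `itertools.combinations`. A minimal sketch using the four ages from the example; the variable names are illustrative.

```python
from itertools import combinations

ages = {"A": 18, "B": 20, "C": 23, "D": 25}
population_mean = sum(ages.values()) / len(ages)  # 21.5

results = []
for pair in combinations(ages, 2):  # every possible sample of two individuals
    sample_mean = sum(ages[person] for person in pair) / 2
    sampling_error = sample_mean - population_mean
    results.append(("+".join(pair), sample_mean, sampling_error))

for label, sample_mean, sampling_error in results:
    print(f"{label}: mean = {sample_mean:.1f}, sampling error = {sampling_error:+.1f}")
```

Only A+D and B+C have a sampling error of zero; the other four samples miss the population mean by 1.0 to 2.5 years.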
Principles of sampling • Principle I In the majority of cases of sampling, there will be a difference between sample statistics and the true population parameters, which is attributable to the selection of the units in the sample
Principles of sampling • Instead of samples of two – take a sample of three • Four possible combinations • A+B+C = (18+20+23)/3 = 61/3 = 20.33 years • A+B+D = (18+20+25)/3 = 63/3 = 21.00 years • A+C+D = (18+23+25)/3 = 66/3 = 22.00 years • B+C+D = (20+23+25)/3 = 68/3 = 22.67 years
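Principles of sampling • Difference between the sample average and the population average (21.5 years) • Sample size of 2: -2.5 to +2.5 years • Sample size of 3: -1.17 to +1.17 years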
Principles of sampling • The gap between sample statistics and population parameters is reduced • Principle II The greater the sample size, the more accurate will be the estimate of the true population parameters
Principles of sampling • Same Example – Different Data • A = 18 years • B = 26 years • C = 32 years • D = 40 years • Variable (age) – markedly different
Principles of sampling • Estimate average using • Samples of two • Samples of three • Difference in the average age: • Sample size of 2: -7.00 to +7.00 years • Sample size of 3: -3.67 to +3.67 years • Range of difference is greater than previously calculated
Principles of sampling • Principle III The greater the difference in the variable under study in a population for a given sample size, the greater will be the difference between the sample statistics and the true population parameters
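Principles II and III can be checked together by computing the range of sampling error for both data sets and both sample sizes. A minimal sketch; the function name `error_range` and the variable names are illustrative.

```python
from itertools import combinations

def error_range(ages, sample_size):
    """Smallest and largest (sample mean - population mean) over all
    possible samples of the given size, rounded to 2 decimals."""
    population_mean = sum(ages) / len(ages)
    errors = [
        sum(sample) / sample_size - population_mean
        for sample in combinations(ages, sample_size)
    ]
    return round(min(errors), 2), round(max(errors), 2)

similar_ages = [18, 20, 23, 25]  # first example
varied_ages = [18, 26, 32, 40]   # second example, ages markedly different

for label, data in (("similar ages", similar_ages), ("varied ages", varied_ages)):
    for n in (2, 3):
        low, high = error_range(data, n)
        print(f"{label}, sample size {n}: {low:+.2f} to {high:+.2f} years")
# similar ages, sample size 2: -2.50 to +2.50 years
# similar ages, sample size 3: -1.17 to +1.17 years
# varied ages, sample size 2: -7.00 to +7.00 years
# varied ages, sample size 3: -3.67 to +3.67 years
```

Within each data set the range narrows as the sample size grows (Principle II), and for each sample size the range is wider for the more varied data (Principle III).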
Factors affecting the inference • Principles suggest that two factors may influence the degree of certainty about the inferences drawn from a sample • Size of sample • The larger the sample size, the more accurate will be the findings • The extent of variation in the sampling population • The greater the variation in the study population w.r.t. the characteristics under study, the greater will be the uncertainty for a given sample size
Aims in selecting a sample • Achieve maximum precision in your estimate • Avoid bias in selection • Bias can occur if: • Non-random sampling – consciously or unconsciously affected by human choice • Sampling frame does not cover the sampling population accurately or completely • A section of the sampling population is impossible to find or refuses to cooperate